This tutorial is part of a series of R tutorials analyzing covid-19 data.
For parts 1 and 2, see the following posts:
Covid19 analysis, part 3: initial data exploration
So far in our R data analysis series, we’ve spent some time getting and wrangling our data.
In part1, we retrieved a single covid19 file and wrangled it into shape. That was the file for covid19 confirmed cases, and we used it as sort of a starting point for our data wrangling. Since it was only one file, it was relatively easy to do, and it gave us some experience getting and wrangling the covid19 data from the JHU covid19 data archive on Github.
In part 2, we went a step further. The JHU covid19 data archive actually has 3 datasets: one each for confirmed cases, deaths, and recovered cases. So in part 2, we created a repeatable process that we could use for all three files. This process enabled us to get the data from all three files, wrangle it into shape, and merge it together into one file.
So at this point, we have a complete covid19 file (I’ll show you where you can get it in a minute).
Now, we’re going to do some data exploration to examine the file and see how it’s structured.
A quick overview of what we’ll do
In this tutorial, we’ll use some techniques from dplyr to aggregate, subset, and explore our data.
For the most part, we’re just going to be “exploring” the data to check the columns, get a rough idea of what’s in those columns, and see how the dataset is structured.
If you want to jump to a section, you can use these links in the Table of Contents.
Table of Contents:
Having said that, this tutorial is fairly linear in structure and it will make much more sense if you read the whole thing from start to finish.
Tools you’ll need
In this tutorial, we’ll be using the R programming language. (But, just so you’re aware, we actually have a related covid19 tutorial series that uses Python. You can find the related Python data exploration tutorial here.)
Specifically, we’ll use two important R packages:
dplyr package to subset and summarize our data. We’ll also use the
lubridate package a little bit.
So ideally, you’ll have a solid understanding of
dplyr already. You’ll understand this tutorial quite a bit more if you already understand
But, if you don’t understand
dplyr, you can still follow along by copy-pasting the code, running it, and reading the explanation. That will still help you understand the general process of exploring a new dataset.
Ok. With all of that in mind, let’s get started.
Exploring the covid19 dataset
Here, we’re going to start running the actual R code to explore our covid19 dataset.
Get Data and Packages
First things first … we need to get our dataset. To do this, we need to load some R packages, and then we’ll retrieve the data file from a URL.
First let’s load some R packages.
In this tutorial, we’ll mostly be using
dplyr, but we’ll also use the
lubridate package. We need to load them both.
#================ # IMPORT PACKAGES #================ library(tidyverse) library(lubridate)
Get covid19 data
Next, we’ll get the covid19 data from a URL.
Remember: we already created this data file in part 2 of this R analysis series.
If you want, you can actually go back to part 2 and run all of the code there. If you do, you’ll get a dataset with the most up-to-date data.
Having said that, if you run the following code, you’ll still get a working dataset, but with data up to May 3, 2020.
For the most part, it doesn’t really matter. If you just download the pre-wrangled data with the code below, you’ll still be able to explore the data. If you go back and re-create the data yourself, you’ll get more experience actually running R code, and the data will be more up-to-date, but it will take more time.
Ok, you can get the pre-created covid19 dataset by running the following code:
#========= # GET DATA #========= file_path <- "https://www.sharpsightlabs.com/datasets/covid19/covid_data_2020-05-04.csv" covid_data <- read_delim(file_path ,delim = ";" )
Because we used the
readr::read_delim() function, the data was imported as a dataframe. In fact, it's a special type of dataframe called a
tibble, but I'm going to mostly just call it a dataframe.
The way we imported it, the dataframe is called
covid_data. This dataframe contains worldwide covid19 data from January, 2020 up to early May, 2020.
But to get a better understanding of what's in this file, let's explore it.
Basic Data Inspection
Data exploration typically involves printing, subsetting, and aggregating a dataset. It's sometimes a little more complicated than that, but at it's essence, that's what we typically do when we initially explore a dataset.
With that in mind, let's start simple and just print out the data.
In R, we can do that with a simple call to the
covid_data %>% print()
# A tibble: 27,398 x 9 country subregion date lat long confirmed dead recovered new_cases [chr] [chr] [date] [dbl] [dbl] [dbl] [dbl] [dbl] [dbl] 1 Afghanistan NA 2020-01-22 33 65 0 0 0 NA 2 Afghanistan NA 2020-01-23 33 65 0 0 0 0 3 Afghanistan NA 2020-01-24 33 65 0 0 0 0 4 Afghanistan NA 2020-01-25 33 65 0 0 0 0 5 Afghanistan NA 2020-01-26 33 65 0 0 0 0 6 Afghanistan NA 2020-01-27 33 65 0 0 0 0 7 Afghanistan NA 2020-01-28 33 65 0 0 0 0 8 Afghanistan NA 2020-01-29 33 65 0 0 0 0 9 Afghanistan NA 2020-01-30 33 65 0 0 0 0 10 Afghanistan NA 2020-01-31 33 65 0 0 0 0 # … with 27,388 more rows
covid_data is a tibble, it prints out with some extra formatting and extra information.
Immediately, we can see a few things.
The very top line, "
A tibble: 27,398 x 9", tells us that this data has 27,398 rows and 9 columns.
We can see the columns themselves:
Underneath the name of each column in the printout, we can see the data type. For example,
chr data, meaning "character." The
new_cases variables are
dbl, meaning "double" ... i.e., decimal numbers.
Underneath that, we can see a printout of about 10 records. This enables us to see the rough structure of the data.
Printing your data like this can be surprisingly useful, because at a glance, you can get a lot of information about the column names, data types, and records.
You want to take note of that information.
You should also look for anything strange:
- are there any missing values that should actually have data?
- do the variable names look right?
- do the data types look correct?
If you've retrieved and wrangled your data properly (like we did in part 1 and part 2) then things should mostly be in place.
But even then, after you wrangle the data, you might not completely understand what's in the data or how it's structured. So, look carefully.
Consider saving some records
After you print out your data, one thing you might consider is saving a few of those printed records in your script.
I mean, take the records and paste them right into your code.
If you do this, you absolutely want to comment them out.
# # A tibble: 27,398 x 9 # country subregion date lat long confirmed dead recovered new_cases #
# 1 Afghanistan NA 2020-01-22 33 65 0 0 0 NA # 2 Afghanistan NA 2020-01-23 33 65 0 0 0 0 # 3 Afghanistan NA 2020-01-24 33 65 0 0 0 0 # 4 Afghanistan NA 2020-01-25 33 65 0 0 0 0 # 5 Afghanistan NA 2020-01-26 33 65 0 0 0 0 # 6 Afghanistan NA 2020-01-27 33 65 0 0 0 0 # 7 Afghanistan NA 2020-01-28 33 65 0 0 0 0 # 8 Afghanistan NA 2020-01-29 33 65 0 0 0 0 # 9 Afghanistan NA 2020-01-30 33 65 0 0 0 0 # 10 Afghanistan NA 2020-01-31 33 65 0 0 0 0 # # … with 27,388 more rows
Some people will disagree with me on this, on the basis that it clutters up the code.
Fair enough. I don't recommend putting records into production code shared across team members, but if you're just in the initial stages of data exploration, it can be useful to just drop some printed records right into your code. It makes them easy to record ... especially if you plan to make any changes.
Print column names
We already have the columns from the printout above, but if you want to have them listed out separately, we can use the
covid_data %>% names()
 "country" "subregion" "date" "lat" "long" "confirmed" "dead" "recovered"  "new_cases"
To further explore the dataset, we'll need to know these variable name, so make sure to remember them, or paste them somewhere useful, like your code or a text document.
Ok, now we'll start exploring individual variables.
We're going to start looking at the variables individually, mostly to see what's in them.
Get unique values: country
First, let's get the unique values of
To do this, we're going to use the
Notice that we're using the pipe operator to do this. It works like a waterfall, starting with the dataset,
covid_data. We're piping
covid_data into the
distinct() function, to get the distinct values of the
country variable. Then we're actually piping that into the
print() function with
n = 190, in order to print out all of the rows.
covid_data %>% distinct(country) %>% print(n = 190)
I recommend that you run this code as is, so you can see all of the values. But since the output is pretty long, I'll truncate it here.
# A tibble: 187 x 1 country
1 Afghanistan 2 Albania 3 Algeria 4 Andorra 5 Angola 6 Antigua and Barbuda 7 Argentina ... 179 US 180 Uzbekistan 181 Venezuela 182 Vietnam 183 West Bank and Gaza 184 Western Sahara 185 Yemen 186 Zambia 187 Zimbabwe
What's going on here?
Our code has just printed out the unique values of country. Remember: our dataset actually has 27,398 rows, so the country names are repeated in
covid_data many times.
distinct() function ignores any repeats and only gives us the unique values.
So now that we have them, what are we looking for?
When you examine a string variable or categorical variable like this, I want to think about a few things:
- Are they all actually country names?
- Are any of them unusual?
- Does anything need to be changed or cleaned?
- Are there any extra characters that need to be removed? (spaces, periods, etc)
In this case, the different values are in good shape, so we don't need to do any real cleanup here.
But if there was anything amiss, we'd need to do some extra data cleaning to clean things up.
Count unique values of country
We'll also get a count of the unique values of
To do this, we'll use the syntax
distinct(country) function, and we'll pipe that into the
count() function. This will literally count the distinct values of country.
#----------------------------- # COUNT UNIQUE VALUES: country #----------------------------- covid_data %>% distinct(country) %>% count()
When you run this on the data that we're using, you'll get the answer 187.
Get unique values of country/subregion
Next, we'll get the unique combinations of
subregion are two different variables in our dataframe.
To get the unique combinations of
subregion, we can use the
distinct() function with both variables. Then we'll pipe the output into
#------------------------------------- # GET UNIQUE VALUES: country/subregion #------------------------------------- covid_data %>% distinct(country, subregion) %>% print(n = 300)
A tibble: 266 x 2 country subregion
1 Afghanistan NA 2 Albania NA 3 Algeria NA 4 Andorra NA 5 Angola NA 6 Antigua and Barbuda NA 7 Argentina NA 8 Armenia NA 9 Australia Australian Capital Territory 10 Australia New South Wales 11 Australia Northern Territory 12 Australia Queensland 13 Australia South Australia 14 Australia Tasmania 15 Australia Victoria 16 Australia Western Australia ...
There's actually quite a few rows, so in the interest of space, I've only posted the first 16.
But you can immediately see something: some of the data is listed strictly at the country level. For example, for Afghanistan, we only have data for the country, but not any subregions.
But for other countries, we have data at the country/subregion level. So for example, for Australia, we have data for "Australian Capital Territory", "New South Wales", "Queensland", etc.
This is important. This means that whenever we're trying to get data at the country level, we're probably going to need to aggregate the data.
To be clear, if we're 100% sure that the data for a country is strictly reported at the country level (and not the country/subregion level) then we don't need to aggregate. But any time we're unsure, if we want country level data, we'll need to aggregate by country by using the
Count unique: country/subregion
We'll also count the number of unique country/subregion combinations.
#--------------------------------------- # COUNT UNIQUE VALUES: country/subregion #--------------------------------------- covid_data %>% distinct(country, subregion) %>% count()
Get unique values: date
Let's continue with this and get the unique values of
This will be very similar to the previous sections.
Again, we'll use the
distinct() function with print to get the unique values.
covid_data %>% distinct(date) %>% print(n = 105)
# A tibble: 103 x 1 date
1 2020-01-22 2 2020-01-23 3 2020-01-24 4 2020-01-25 5 2020-01-26 ... 101 2020-05-01 102 2020-05-02 103 2020-05-03
I've deleted some of the output here, just for the sake of space.
But when you run the code, I want you to look at the dates.
Do they look right?
Do the dates look appropriate for the data (given what you know)?
Are there any dates missing?
As you do this, just keep in mind what we're really doing at this phase in the analysis: we're just checking out the data to find out what's in it, and to make sure that it looks "okay".
Count unique combinations: country, subregion, date
Before we move on to the other variables, let's count the unique combinations of
To do this, we'll again use the
distinct() function with
#---------------------------------------------- # COUNT UNIQUE VALUES: country, subregion, date #---------------------------------------------- covid_data %>% distinct(country, subregion, date) %>% count()
When you run this you'll find that there are 27398 unique combinations of
date. (At least, that's the case if you downloaded the pre-created file above ... but if you created your own file with updated data, the number will be higher.)
Ok. So we have 27398 unique combinations of
As we saw previously, there are also 27398 total rows in the data.
That means that the rows of data are unique on
date. The combination of
date allows us to identify a record.
I won't go into too much explanation of why this is important, but it is very important for performing joins/merges, as well as other operations.
Check variables: lat and long
Let's move on.
Here, we're going to look at the
If you're sort of new to data science, it might not be immediately obvious what these are, but for most people who have worked with data for a while, it's probably clear: these look like variables for latitude and longitude. (Of course, we can learn that by reading about the original csv data at the JHU covid19 data archive on Github).
Even though it's pretty clear what these are, we'll want to take a look and make sure that the values look valid.
Quickly, I want to compute the maximum and minimum of the values. That will help us determine if the data are valid within some rough parameters. (To really know, we'll probably want to plot some data on a map, but that's a little more complicated, so we'll do that in a separate tutorial.)
There are a few ways to do this. You can do it one at a time with code that looks like this:
covid_data %>% select(long) %>% summarise(min(long), max(long)) covid_data %>% select(lat) %>% summarise(min(lat), max(lat))
That's fine, and it's relatively simple to do.
But there's a slightly more elegant way to do it by pivoting the data using the
gather() function, then grouping and summarising.
#------------------------------ # EXAMINE VARIABALES: lat, long #------------------------------ covid_data %>% select(long, lat) %>% gather() %>% group_by(key) %>% summarise('minimum' = min(value) ,'maximum' = max(value) )
key minimum maximum
1 lat -51.8 71.7 2 long -135 178.
This gives you a little table that shows the minimum and maximum for bot
So what are we looking for?
We're just checking that the range of values looks appropriate.
We can actually check these against some information from the internet. A quick google search reveals that:
"The valid range of latitude in degrees is -90 and +90 for the southern and northern hemisphere respectively. Longitude is in the range -180 and +180 specifying coordinates west and east of the Prime Meridian, respectively."
(Source: Stack Overflow)
Based on what we're seeing,
long look okay. We might find some problems later on if/when we map the data, but based on our quick-and-dirty check, these look okay.
Check variables: confirmed, dead, recovered
Next, we'll check the
Here, we'll just look at the min and max values.
Again, like in the previous section, there are a few ways to do this. You can check these values one at a time, but there is a faster way.
Here, we'll create a concise little table that summarizes the information we want.
To do this, we'll use the same strategy as the last section.
We'll select our variables, transpose the data using
gather(), then group and summarise to calculate the minimum and maximum values.
#----------------------------------------------- # EXAMINE VARIABALES: confirmed, dead, recovered #----------------------------------------------- covid_data %>% select(confirmed, dead, recovered) %>% gather() %>% group_by(key) %>% summarise('min' = min(value, na.rm = T) ,'max' = max(value, na.rm = T) )
# A tibble: 3 x 3 key min max
1 confirmed 0 1158040 2 dead 0 67682 3 recovered 0 180152
The output here is a little table that gives us the minimum and maximum values of
One good thing to do at this point is just try to find those records.
Let's quickly try to identify the row of data where the confirmed cases is 1158040.
covid_data %>% filter(confirmed == 1158040)
country subregion date lat long confirmed dead recovered new_cases
1 US NA 2020-05-03 37.1 -95.7 1158040 67682 180152 25501
As you can see, the row of data with 1,158,040 confirmed cases is for the United States on May 3, 2020.
Why did we do this?
Well for starters, we can just ask, does that sound right? Does the US have the most cases? (Yes, as of the date of publication, it does.)
And we can take the information in that record and try to compare it to some external data source to try to validate it.
Ultimately, what we're trying to do is validate that the data in our dataset are valid and "correct".
Examine a few countries
Finally, let's just examine a few countries.
Again, we just want to take a quick look and try to see if the data looks "correct."
To do this, we'll use the
select() function to retrieve the variables we want to work with.
We'll filter the data down to a few select countries (I chose the USA, Italy, Sweden, and China, but you can change these if you'd like).
We're also filtering down to a single date, just to simplify things.
Then we're grouping by
... and finally summing the records using
This will give us the number of confirmed cases for the USA, Italy, Sweden, and China on May 1, 2020.
#------------------------------- # SPOT CHECK IMPORTANT COUNTRIES #------------------------------- covid_data %>% select(country, date, confirmed, dead, recovered) %>% filter(country %in% c('US', 'Italy', 'Sweden', 'China')) %>% filter(date == as_date('2020-05-01')) %>% group_by(country, date) %>% summarise(total_confirmed = sum(confirmed))
country date total_confirmed
1 China 2020-05-01 83959 2 Italy 2020-05-01 207428 3 Sweden 2020-05-01 21520 4 US 2020-05-01 1103461
So what does this give us?
We have a simple table that gives us the total confirmed cases for 4 countries on a single date.
We could change this to include more or different countries.
And we could also change the date.
But ultimately, we're trying to get a quick "sample" of the data to see if it's correct.
I recommend that you try to find an external data source that we can use to compare these numbers against. (For example, we can compare these numbers with the Johns Hopkins Covid-19 Dashboard.) We can use an external data source to try to validate these numbers.
A quick note about the data analysis process
I've mentioned this in previous tutorials, but it's important, so I'll repeat it.
In a real-world environment, we're often working in teams, not just as independent analysts.
When you work in a team in a business environment, you'll almost always work with a subject matter expert (SME) or some type of data expert. The subject matter experts are commonly people outside of the data science team who are (... wait for it) experts in the subject matter that you're investigating.
As you begin to analyze a dataset, you'll frequently go to those experts with your findings and just check: "hey, do these numbers look right to you?" ... "here's what' I'm looking at. Do these variables look correct?"
Subject matter experts can be invaluable for validating that your new dataset is accurate and free of errors.
If you don't have an SME close by, you might also go to more senior coworkers and ask the same questions: "do these numbers look OK to you?"
In a real world environment, you need to develop good working relationships with these people and maintain good communication.
Data science is not strictly about code and pure analysis. Soft skills matter!
The data look pretty good
Now that we've explored the data, it looks like the data are relatively clean and the values are mostly appropriate. So, we can move on to the next phases of our analysis.
But remember: data analysis is highly iterative.
We'll get more information about our data when we begin to visualize it. We might something new that looks incorrect, and we'll possibly need to go back and do more data cleaning. That's okay and totally normal.
We're going to continue with the process, and we can fix other issues if and when we find them.
Now that we've explored our data a little, we're going to start visualizing it.
In the next tutorial, we'll use
ggplot2 to do some standard visualizations like bar charts, line charts, histograms, etc.
Later, we'll also use ggplot for some more exotic types of visualization. We'll probably do a small multiple chart, and possibly a heatmap and geographic map.
This is a pretty rich dataset and there's a lot more that we can do ...
Sign up to learn more
Do you want to see part 4 and the other tutorials in this series?
Do you want to see how we visualize this data?
Then sign up for our email list now.
When you sign up, you'll get our tutorials delivered directly to your inbox.