R Data Exploration: covid-19 [part 3]

[Image: R code analyzing covid-19 data, with the SARS-CoV-2 virus in the background.]
This tutorial is part of a series of R tutorials analyzing covid-19 data.

For parts 1 and 2, see the following posts:
https://www.sharpsightlabs.com/blog/r-data-analysis-covid-19-part1-data-wrangling/
https://www.sharpsightlabs.com/blog/r-data-analysis-covid-19-part-2-merge-datasets/

Covid19 analysis, part 3: initial data exploration

So far in our R data analysis series, we’ve spent some time getting and wrangling our data.

In part 1, we retrieved a single covid19 file and wrangled it into shape. That was the file for covid19 confirmed cases, and we used it as a starting point for our data wrangling. Since it was only one file, it was relatively easy to work with, and it gave us some experience getting and wrangling the covid19 data from the JHU covid19 data archive on Github.

In part 2, we went a step further. The JHU covid19 data archive actually has 3 datasets: one each for confirmed cases, deaths, and recovered cases. So in part 2, we created a repeatable process that we could use for all three files. This process enabled us to get the data from all three files, wrangle it into shape, and merge it together into one file.

So at this point, we have a complete covid19 file (I’ll show you where you can get it in a minute).

Now, we’re going to do some data exploration to examine the file and see how it’s structured.

A quick overview of what we’ll do

In this tutorial, we’ll use some techniques from dplyr to aggregate, subset, and explore our data.

For the most part, we’re just going to be “exploring” the data to check the columns, get a rough idea of what’s in those columns, and see how the dataset is structured.

This tutorial is fairly linear in structure, and it will make the most sense if you read the whole thing from start to finish.

Tools you’ll need

In this tutorial, we’ll be using the R programming language. (But, just so you’re aware, we actually have a related covid19 tutorial series that uses Python. You can find the related Python data exploration tutorial here.)

Specifically, we’ll use two important R packages:

  • dplyr
  • lubridate

We’ll use the dplyr package to subset and summarize our data. We’ll also use the lubridate package a little bit.

So ideally, you’ll already have a solid understanding of dplyr. If you do, you’ll get quite a bit more out of this tutorial.

But, if you don’t understand dplyr, you can still follow along by copy-pasting the code, running it, and reading the explanation. That will still help you understand the general process of exploring a new dataset.

Ok. With all of that in mind, let’s get started.

Exploring the covid19 dataset

Here, we’re going to start running the actual R code to explore our covid19 dataset.

Get Data and Packages

First things first … we need to get our dataset. To do this, we need to load some R packages, and then we’ll retrieve the data file from a URL.

Load packages

First let’s load some R packages.

In this tutorial, we’ll mostly be using dplyr, but we’ll also use the lubridate package. Loading the tidyverse gives us dplyr (along with readr, which we’ll use to import the data), and we’ll load lubridate separately.

#================
# IMPORT PACKAGES
#================
library(tidyverse)
library(lubridate)

Get covid19 data

Next, we’ll get the covid19 data from a URL.

Remember: we already created this data file in part 2 of this R analysis series.

If you want, you can actually go back to part 2 and run all of the code there. If you do, you’ll get a dataset with the most up-to-date data.

Having said that, if you run the following code, you’ll still get a working dataset, but with data up to May 3, 2020.

For the most part, it doesn’t really matter. If you just download the pre-wrangled data with the code below, you’ll still be able to explore the data. If you go back and re-create the data yourself, you’ll get more experience actually running R code, and the data will be more up-to-date, but it will take more time.

Ok, you can get the pre-created covid19 dataset by running the following code:

#=========
# GET DATA
#=========
file_path <- "https://www.sharpsightlabs.com/datasets/covid19/covid_data_2020-05-04.csv"

covid_data <- read_delim(file_path
                        ,delim = ";"
)

Because we used the readr::read_delim() function, the data was imported as a dataframe. In fact, it's a special type of dataframe called a tibble, but I'm going to mostly just call it a dataframe.

The way we imported it, the dataframe is called covid_data. This dataframe contains worldwide covid19 data from January, 2020 up to early May, 2020.
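
If you want to quickly confirm the object type for yourself, here's a minimal, optional check (not part of the original wrangling code):

# optional check: a tibble inherits from "tbl_df", "tbl", and "data.frame"
class(covid_data)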

But to get a better understanding of what's in this file, let's explore it.

Basic Data Inspection

Data exploration typically involves printing, subsetting, and aggregating a dataset. It's sometimes a little more complicated than that, but at its essence, that's what we typically do when we initially explore a dataset.

Print rows

With that in mind, let's start simple and just print out the data.

In R, we can do that with a simple call to the print() function.

covid_data %>% print()

OUT:

# A tibble: 27,398 x 9
   country     subregion date         lat  long confirmed  dead recovered new_cases
   <chr>       <chr>     <date>      <dbl> <dbl>     <dbl> <dbl>     <dbl>     <dbl>
 1 Afghanistan NA        2020-01-22    33    65         0     0         0        NA
 2 Afghanistan NA        2020-01-23    33    65         0     0         0         0
 3 Afghanistan NA        2020-01-24    33    65         0     0         0         0
 4 Afghanistan NA        2020-01-25    33    65         0     0         0         0
 5 Afghanistan NA        2020-01-26    33    65         0     0         0         0
 6 Afghanistan NA        2020-01-27    33    65         0     0         0         0
 7 Afghanistan NA        2020-01-28    33    65         0     0         0         0
 8 Afghanistan NA        2020-01-29    33    65         0     0         0         0
 9 Afghanistan NA        2020-01-30    33    65         0     0         0         0
10 Afghanistan NA        2020-01-31    33    65         0     0         0         0
# … with 27,388 more rows

Because covid_data is a tibble, it prints out with some extra formatting and extra information.

Immediately, we can see a few things.

The very top line, "A tibble: 27,398 x 9", tells us that this data has 27,398 rows and 9 columns.

We can see the columns themselves:

  • country
  • subregion
  • date
  • lat
  • long
  • confirmed
  • dead
  • recovered
  • new_cases

Underneath the name of each column in the printout, we can see the data type. For example, country is chr data, meaning "character." The confirmed, dead, recovered, and new_cases variables are dbl, meaning "double" ... i.e., decimal numbers.

Underneath that, we can see a printout of about 10 records. This enables us to see the rough structure of the data.

Printing your data like this can be surprisingly useful, because at a glance, you can get a lot of information about the column names, data types, and records.

You want to take note of that information.

You should also look for anything strange:

  • are there any missing values that should actually have data?
  • do the variable names look right?
  • do the data types look correct?

If you've retrieved and wrangled your data properly (like we did in part 1 and part 2) then things should mostly be in place.

But even then, after you wrangle the data, you might not completely understand what's in the data or how it's structured. So, look carefully.
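
For the first question in that checklist (missing values), a quick programmatic check can help. Here's a minimal sketch that counts the NA values in each column; it assumes dplyr 1.0 or later for the across() helper and isn't part of the original wrangling code:

#------------------------------------------
# OPTIONAL: COUNT MISSING VALUES PER COLUMN
# (exploratory sketch, assumes dplyr >= 1.0)
#------------------------------------------
covid_data %>% 
  summarise(across(everything(), ~ sum(is.na(.x))))

Based on the printed rows above, you'd expect at least subregion and new_cases to show some NAs, which is fine; the point is simply to confirm that nothing looks wildly out of place.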

Consider saving some records

After you print out your data, one thing you might consider is saving a few of those printed records in your script.

I mean, take the records and paste them right into your code.

If you do this, you absolutely want to comment them out.

# # A tibble: 27,398 x 9
#    country     subregion date         lat  long confirmed  dead recovered new_cases
#    <chr>       <chr>     <date>      <dbl> <dbl>     <dbl> <dbl>     <dbl>     <dbl>
#  1 Afghanistan NA        2020-01-22    33    65         0     0         0        NA
#  2 Afghanistan NA        2020-01-23    33    65         0     0         0         0
#  3 Afghanistan NA        2020-01-24    33    65         0     0         0         0
#  4 Afghanistan NA        2020-01-25    33    65         0     0         0         0
#  5 Afghanistan NA        2020-01-26    33    65         0     0         0         0
#  6 Afghanistan NA        2020-01-27    33    65         0     0         0         0
#  7 Afghanistan NA        2020-01-28    33    65         0     0         0         0
#  8 Afghanistan NA        2020-01-29    33    65         0     0         0         0
#  9 Afghanistan NA        2020-01-30    33    65         0     0         0         0
# 10 Afghanistan NA        2020-01-31    33    65         0     0         0         0
# # … with 27,388 more rows

Some people will disagree with me on this, on the basis that it clutters up the code.

Fair enough. I don't recommend putting records into production code shared across team members, but if you're just in the initial stages of data exploration, it can be useful to drop some printed records right into your code. It gives you a record of what the data looked like at that point ... which is especially handy if you plan to make any changes.

Print column names

We already have the columns from the printout above, but if you want to have them listed out separately, we can use the names() function.

covid_data %>% names()

OUT:

[1] "country"   "subregion" "date"      "lat"       "long"      "confirmed" "dead"      "recovered"
[9] "new_cases"

To further explore the dataset, we'll need to know these variable names, so make sure to remember them, or paste them somewhere useful, like your code or a text document.

Examine variables

Ok, now we'll start exploring individual variables.

We're going to start looking at the variables individually, mostly to see what's in them.

Get unique values: country

First, let's get the unique values of country.

To do this, we're going to use the distinct() function.

Notice that we're using the pipe operator to do this. It works like a waterfall, starting with the dataset, covid_data. We're piping covid_data into the distinct() function, to get the distinct values of the country variable. Then we're actually piping that into the print() function with n = 190, in order to print out all of the rows.

covid_data %>% 
  distinct(country) %>% 
  print(n = 190)

I recommend that you run this code as is, so you can see all of the values. But since the output is pretty long, I'll truncate it here.

OUT:

# A tibble: 187 x 1
  country                         
  <chr>                           
1 Afghanistan                     
2 Albania                         
3 Algeria                         
4 Andorra                         
5 Angola                          
6 Antigua and Barbuda             
7 Argentina   
...
179 US                              
180 Uzbekistan                      
181 Venezuela                       
182 Vietnam                         
183 West Bank and Gaza              
184 Western Sahara                  
185 Yemen                           
186 Zambia                          
187 Zimbabwe  

What's going on here?

Our code has just printed out the unique values of country. Remember: our dataset actually has 27,398 rows, so the country names are repeated in covid_data many times.

The distinct() function ignores any repeats and only gives us the unique values.
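
As a quick aside, if you also want to see how many rows each country contributes (i.e., how many times each country name repeats), a minimal sketch using dplyr's count() looks like this:

#---------------------------------
# OPTIONAL: ROW COUNT PER COUNTRY
# (exploratory sketch)
#---------------------------------
covid_data %>% 
  count(country, sort = TRUE)

Countries that report data at the subregion level (more on that in a moment) will typically show noticeably higher row counts.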

So now that we have them, what are we looking for?

When you examine a string or categorical variable like this, you want to think about a few things:

  • Are they all actually country names?
  • Are any of them unusual?
  • Does anything need to be changed or cleaned?
  • Are there any extra characters that need to be removed? (spaces, periods, etc)

In this case, the different values are in good shape, so we don't need to do any real cleanup here.

But if there was anything amiss, we'd need to do some extra data cleaning to clean things up.

Count unique values of country

We'll also get a count of the unique values of country.

To do this, we'll use distinct(country) again, and we'll pipe the output into the count() function. This will literally count the distinct values of country.

#-----------------------------
# COUNT UNIQUE VALUES: country
#-----------------------------
covid_data %>% 
  distinct(country) %>% 
  count()

When you run this on the data that we're using, you'll get the answer 187.
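
As a side note, you can get the same number in a single summarise() step with dplyr's n_distinct() helper (country_count is just an arbitrary label for the output column):

#-----------------------------------------
# ALTERNATIVE: n_distinct()
# (country_count is an illustrative name)
#-----------------------------------------
covid_data %>% 
  summarise(country_count = n_distinct(country))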

Get unique values of country/subregion

Next, we'll get the unique combinations of country and subregion.

Remember: country and subregion are two different variables in our dataframe.

To get the unique combinations of country/subregion, we can use the distinct() function with both variables. Then we'll pipe the output into print().

#-------------------------------------
# GET UNIQUE VALUES: country/subregion
#-------------------------------------
covid_data %>%
  distinct(country, subregion) %>%
  print(n = 300)

OUT:

# A tibble: 266 x 2
    country                          subregion                       
    <chr>                            <chr>                           
   1 Afghanistan                      NA                              
   2 Albania                          NA                              
   3 Algeria                          NA                              
   4 Andorra                          NA                              
   5 Angola                           NA                              
   6 Antigua and Barbuda              NA                              
   7 Argentina                        NA                              
   8 Armenia                          NA                              
   9 Australia                        Australian Capital Territory    
  10 Australia                        New South Wales                 
  11 Australia                        Northern Territory              
  12 Australia                        Queensland                      
  13 Australia                        South Australia                 
  14 Australia                        Tasmania                        
  15 Australia                        Victoria                        
  16 Australia                        Western Australia           
  ...

There are actually quite a few rows, so in the interest of space, I've only posted the first 16.

But you can immediately see something: some of the data is listed strictly at the country level. For example, for Afghanistan, we only have data for the country, but not any subregions.

But for other countries, we have data at the country/subregion level. So for example, for Australia, we have data for "Australian Capital Territory", "New South Wales", "Queensland", etc.

This is important. This means that whenever we're trying to get data at the country level, we're probably going to need to aggregate the data.

To be clear, if we're 100% sure that the data for a country is strictly reported at the country level (and not the country/subregion level) then we don't need to aggregate. But any time we're unsure, if we want country level data, we'll need to aggregate by country by using the group_by() function.
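
To make that concrete, here's a sketch of what a country-level aggregation might look like. This isn't code from the original series; it just illustrates the group_by() and summarise() pattern we'll rely on later:

#-----------------------------------------
# SKETCH: AGGREGATE TO THE COUNTRY LEVEL
# (illustrative only; na.rm = TRUE guards
#  against missing values in confirmed)
#-----------------------------------------
covid_data %>% 
  group_by(country, date) %>% 
  summarise(confirmed = sum(confirmed, na.rm = TRUE)) %>% 
  ungroup()

This collapses the subregion rows for countries like Australia into a single row per country per date.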

Count unique: country/subregion

We'll also count the number of unique country/subregion combinations.

#---------------------------------------
# COUNT UNIQUE VALUES: country/subregion
#---------------------------------------
covid_data %>%
  distinct(country, subregion) %>%
  count()

OUT:

266

Get unique values: date

Let's continue with this and get the unique values of date.

This will be very similar to the previous sections.

Again, we'll use the distinct() function with print to get the unique values.

covid_data %>% 
  distinct(date) %>%  
  print(n = 105)

OUT:

# A tibble: 103 x 1
    date      
    <date>    
  1 2020-01-22
  2 2020-01-23
  3 2020-01-24
  4 2020-01-25
  5 2020-01-26
...
101 2020-05-01
102 2020-05-02
103 2020-05-03

I've deleted some of the output here, just for the sake of space.

But when you run the code, I want you to look at the dates.

Do they look right?

Do the dates look appropriate for the data (given what you know)?

Are there any dates missing?

As you do this, just keep in mind what we're really doing at this phase in the analysis: we're just checking out the data to find out what's in it, and to make sure that it looks "okay".
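
If you'd rather answer the "missing dates" question programmatically than by eye, here's a minimal sketch. It builds the full daily sequence between the first and last date in the data and reports any calendar dates that don't appear (date_range is just an illustrative name):

#----------------------------------
# OPTIONAL: CHECK FOR MISSING DATES
# (exploratory sketch)
#----------------------------------
date_range <- seq(min(covid_data$date), max(covid_data$date), by = "day")
date_range[!date_range %in% covid_data$date]

If this returns an empty result, there are no gaps in the daily date sequence.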

Count unique combinations: country, subregion, date

Before we move on to the other variables, let's count the unique combinations of country, subregion, and date.

To do this, we'll again use the distinct() function with count().

#----------------------------------------------
# COUNT UNIQUE VALUES: country, subregion, date
#----------------------------------------------
covid_data %>% 
  distinct(country, subregion, date) %>% 
  count()

When you run this, you'll find that there are 27,398 unique combinations of country, subregion, and date. (At least, that's the case if you downloaded the pre-created file above ... but if you created your own file with updated data, the number will be higher.)

Ok. So we have 27,398 unique combinations of country, subregion, and date.

As we saw previously, there are also 27,398 total rows in the data.

That means that the rows of data are unique on country, subregion, and date. In other words, the combination of country, subregion, and date uniquely identifies a record.

I won't go into too much explanation of why this is important, but it is very important for performing joins/merges, as well as other operations.
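
If you want to verify that uniqueness claim with code rather than by comparing two printouts, a quick sketch is the following (n_rows and n_keys are just illustrative names):

#------------------------------------------------
# OPTIONAL: VERIFY THE UNIQUE KEY
# (TRUE means country/subregion/date is a unique
#  identifier for the rows)
#------------------------------------------------
n_rows <- nrow(covid_data)
n_keys <- covid_data %>% distinct(country, subregion, date) %>% nrow()
n_rows == n_keys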

Check variables: lat and long

Let's move on.

Here, we're going to look at the lat and long variables.

If you're sort of new to data science, it might not be immediately obvious what these are, but for most people who have worked with data for a while, it's probably clear: these look like variables for latitude and longitude. (Of course, we can learn that by reading about the original csv data at the JHU covid19 data archive on Github).

Even though it's pretty clear what these are, we'll want to take a look and make sure that the values look valid.

As a quick check, I want to compute the minimum and maximum of the values. That will help us determine whether the data are valid within some rough parameters. (To really know, we'll probably want to plot some data on a map, but that's a little more complicated, so we'll do that in a separate tutorial.)

There are a few ways to do this. You can do it one at a time with code that looks like this:

covid_data %>% select(long) %>% summarise(min(long), max(long))
covid_data %>% select(lat) %>% summarise(min(lat), max(lat))

That's fine, and it's relatively simple to do.

But there's a slightly more elegant way to do it by pivoting the data using the gather() function, then grouping and summarising.

#------------------------------
# EXAMINE VARIABLES: lat, long
#------------------------------
covid_data %>% 
  select(long, lat) %>% 
  gather() %>% 
  group_by(key) %>% 
  summarise('minimum' = min(value)
            ,'maximum' = max(value)
            )

OUT:

# A tibble: 2 x 3
  key   minimum maximum
  <chr>   <dbl>   <dbl>
1 lat     -51.8    71.7
2 long   -135     178. 

This gives you a little table that shows the minimum and maximum for both lat and long.
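
As a side note, gather() still works, but it has since been superseded in the tidyr package. If you prefer the newer interface, a roughly equivalent sketch using pivot_longer() (assuming tidyr 1.0 or later) looks like this:

#-------------------------------------------------
# ALTERNATIVE: pivot_longer() instead of gather()
# (assumes tidyr >= 1.0)
#-------------------------------------------------
covid_data %>% 
  select(long, lat) %>% 
  pivot_longer(everything(), names_to = "key", values_to = "value") %>% 
  group_by(key) %>% 
  summarise('minimum' = min(value)
            ,'maximum' = max(value)
            )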

So what are we looking for?

We're just checking that the range of values looks appropriate.

We can actually check these against some information from the internet. A quick google search reveals that:

"The valid range of latitude in degrees is -90 and +90 for the southern and northern hemisphere respectively. Longitude is in the range -180 and +180 specifying coordinates west and east of the Prime Meridian, respectively."

(Source: Stack Overflow)

Based on what we're seeing, lat and long look okay. We might find some problems later on if/when we map the data, but based on our quick-and-dirty check, these look okay.

Check variables: confirmed, dead, recovered

Next, we'll check the confirmed, dead, and recovered variables.

Here, we'll just look at the min and max values.

Again, like in the previous section, there are a few ways to do this. You can check these values one at a time, but there is a faster way.

Here, we'll create a concise little table that summarizes the information we want.

To do this, we'll use the same strategy as the last section.

We'll select our variables, reshape the data to long format using gather(), then group and summarise to calculate the minimum and maximum values.

#-----------------------------------------------
# EXAMINE VARIABLES: confirmed, dead, recovered
#-----------------------------------------------

covid_data %>% 
  select(confirmed, dead, recovered) %>% 
  gather() %>% 
  group_by(key) %>% 
  summarise('min' = min(value, na.rm = T)
            ,'max' = max(value, na.rm = T)
            )

OUT:

# A tibble: 3 x 3
  key         min     max
  <chr>     <dbl>   <dbl>
1 confirmed     0 1158040
2 dead          0   67682
3 recovered     0  180152

The output here is a little table that gives us the minimum and maximum values of confirmed, dead, and recovered.

One good thing to do at this point is just try to find those records.

Let's quickly try to identify the row of data where the confirmed cases is 1158040.

covid_data %>% filter(confirmed == 1158040)

OUT:

# A tibble: 1 x 9
  country subregion date         lat  long confirmed  dead recovered new_cases
  <chr>   <chr>     <date>     <dbl> <dbl>     <dbl> <dbl>     <dbl>     <dbl>
1 US      NA        2020-05-03  37.1 -95.7   1158040 67682    180152     25501

As you can see, the row of data with 1,158,040 confirmed cases is for the United States on May 3, 2020.
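
One caveat: if you re-created the dataset yourself with more recent data, the hard-coded value of 1158040 won't match anything. A slightly more general sketch filters on the maximum directly:

#---------------------------------------------
# SKETCH: FIND THE ROW WITH THE MAX CONFIRMED
#---------------------------------------------
covid_data %>% 
  filter(confirmed == max(confirmed, na.rm = TRUE))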

Why did we do this?

Well for starters, we can just ask, does that sound right? Does the US have the most cases? (Yes, as of the date of publication, it does.)

And we can take the information in that record and try to compare it to some external data source to try to validate it.

Ultimately, what we're trying to do is validate that the data in our dataset are valid and "correct".

Examine a few countries

Finally, let's just examine a few countries.

Again, we just want to take a quick look and try to see if the data looks "correct."

To do this, we'll use the select() function to retrieve the variables we want to work with.

We'll filter the data down to a few select countries (I chose the USA, Italy, Sweden, and China, but you can change these if you'd like).

We're also filtering down to a single date, just to simplify things.

Then we're grouping by country and date.

... and finally summing the records using summarise().

This will give us the number of confirmed cases for the USA, Italy, Sweden, and China on May 1, 2020.

#-------------------------------
# SPOT CHECK IMPORTANT COUNTRIES
#-------------------------------

covid_data %>% 
  select(country, date, confirmed, dead, recovered) %>% 
  filter(country %in% c('US', 'Italy', 'Sweden', 'China')) %>% 
  filter(date == as_date('2020-05-01')) %>% 
  group_by(country, date) %>% 
  summarise(total_confirmed = sum(confirmed))

OUT:

# A tibble: 4 x 3
  country date       total_confirmed
  <chr>   <date>               <dbl>
1 China   2020-05-01           83959
2 Italy   2020-05-01          207428
3 Sweden  2020-05-01           21520
4 US      2020-05-01         1103461

So what does this give us?

We have a simple table that gives us the total confirmed cases for 4 countries on a single date.

We could change this to include more or different countries.

And we could also change the date.

But ultimately, we're trying to get a quick "sample" of the data to see if it's correct.
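
If you want a slightly fuller spot check, here's a sketch that also totals the dead and recovered columns, using the same filters as above (total_dead and total_recovered are just illustrative labels):

#-------------------------------------------
# SKETCH: SPOT CHECK WITH ALL THREE METRICS
#-------------------------------------------
covid_data %>% 
  select(country, date, confirmed, dead, recovered) %>% 
  filter(country %in% c('US', 'Italy', 'Sweden', 'China')) %>% 
  filter(date == as_date('2020-05-01')) %>% 
  group_by(country, date) %>% 
  summarise(total_confirmed = sum(confirmed)
            ,total_dead = sum(dead)
            ,total_recovered = sum(recovered)
            )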

I recommend that you try to find an external data source to compare these numbers against (for example, the Johns Hopkins Covid-19 Dashboard). Comparing against an external source is a simple way to validate these numbers.

A quick note about the data analysis process

I've mentioned this in previous tutorials, but it's important, so I'll repeat it.

In a real-world environment, we're often working in teams, not just as independent analysts.

When you work in a team in a business environment, you'll almost always work with a subject matter expert (SME) or some type of data expert. The subject matter experts are commonly people outside of the data science team who are (... wait for it) experts in the subject matter that you're investigating.

As you begin to analyze a dataset, you'll frequently go to those experts with your findings and just check: "hey, do these numbers look right to you?" ... "here's what I'm looking at. Do these variables look correct?"

Subject matter experts can be invaluable for validating that your new dataset is accurate and free of errors.

If you don't have an SME close by, you might also go to more senior coworkers and ask the same questions: "do these numbers look OK to you?"

In a real world environment, you need to develop good working relationships with these people and maintain good communication.

Data science is not strictly about code and pure analysis. Soft skills matter!

The data look pretty good

Now that we've explored the data, it looks like the data are relatively clean and the values are mostly appropriate. So, we can move on to the next phases of our analysis.

But remember: data analysis is highly iterative.

We'll get more information about our data when we begin to visualize it. We might find something new that looks incorrect, and we'll possibly need to go back and do more data cleaning. That's okay and totally normal.

We're going to continue with the process, and we can fix other issues if and when we find them.

Next steps

Now that we've explored our data a little, we're going to start visualizing it.

In the next tutorial, we'll use ggplot2 to do some standard visualizations like bar charts, line charts, histograms, etc.

Later, we'll also use ggplot for some more exotic types of visualization. We'll probably do a small multiple chart, and possibly a heatmap and geographic map.

This is a pretty rich dataset and there's a lot more that we can do ...

Sign up to learn more

Do you want to see part 4 and the other tutorials in this series?

Do you want to see how we visualize this data?

Then sign up for our email list now.

When you sign up, you'll get our tutorials delivered directly to your inbox.

Joshua Ebner

Joshua Ebner is the founder, CEO, and Chief Data Scientist of Sharp Sight. Prior to founding the company, Josh worked as a Data Scientist at Apple. He has a degree in Physics from Cornell University. For more daily data science advice, follow Josh on LinkedIn.
