R Data Analysis: covid-19, part 1 [data wrangling]

In this tutorial, we’re going to analyze covid19 data using R and the Tidyverse.

Recently, we started a data analysis series, where we analyzed covid19 data. The first set of these tutorials was performed with Python.

That tutorial series was well received, and many people commented that it was very helpful for seeing exactly how to apply techniques to a real dataset.

Here, in this tutorial, we’re going to start re-doing that work with the R programming language.

So in this tutorial, I’ll show you how to get and “wrangle” a dataset of covid19 data using R.

A quick review of the process

In this specific tutorial, we’ll focus on getting and “wrangling” a single csv file that contains data for covid19 confirmed cases.

This will help us define the data wrangling process for these covid19 files. We’ll be able to use this process later when we get data from different (but related) files for covid19 deaths and recoveries. The confirmed cases, deaths, and recoveries are in separate data files, and we’ll eventually need to get them all and combine them together. But before we do that, we’ll start with a more simple case of getting just one of the files.

After we get the process working for one file in this tutorial, we’ll extend the process in the next tutorial to get the remaining files and merge them all together.

Later, in subsequent tutorials, we’ll visualize and analyze the data using other tools from R’s tidyverse.

Tools you’ll need

In this tutorial, we’ll be using the R programming lanugage.

More specifically, we’ll be using the Tidyverse, and in particular, we’ll use dplyr and lubridate to retrieve data and get it into the right shape.

It’ll definitely be helpful if you know dplyr, but if you don’t, you should be able to follow along by copy-pasting the code into RStudio (or another editor) and running the code.

A quick Table of Contents

Here’s a quick table of contents to give you an overview of the steps.

If you want to jump to a particular section, you can click on any of the links.

Table of Contents:

As always though, you’ll learn a lot more if you read the whole tutorial start to finish.

Ok. Let’s get to it.

Import Libraries

First, let’s just import the libraries that we’ll need to use.

We’re going to use the tidyverse package as well as lubridate.

#================
# IMPORT PACKAGES
#================
library(tidyverse)
library(lubridate)

The tidyverse package actually contains several other packages, including dplyr and ggplot2. We’ll primarily be using dplyr here, but we’ll use a few other functions as well.

The lubridate will help us manipulate the dates in the dataset.

Get raw data

Next, we’re just going to get the data for confirmed covid19 cases. This data was combiled by the Johns Hopkins CSSE team.

For the most part, getting the data simple enough.

The data exists at a URL on github.

We’ll just use the read_csv() function from dplyr to read in the raw data.


#=============
# GET RAW DATA
#=============
url_confirmed <- 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv'
covid_data_RAW <- read_csv(url_confirmed)

After running this, we'll have a dataframe called covid_data_RAW.

Inspect

Now, let's quickly inspect the data.

Here, we'll use the glimpse() function to get a quick look at the data and how it's structured.

glimpse(covid_data_RAW)

OUT:

Observations: 264
Variables: 102
$ `Province/State`  NA, NA, NA, NA, NA, NA, NA, NA, "Australian Capital Territory", "New Sout…
$ `Country/Region`  "Afghanistan", "Albania", "Algeria", "Andorra", "Angola", "Antigua and Ba…
$ Lat               33.0000, 41.1533, 28.0339, 42.5063, -11.2027, 17.0608, -38.4161, 40.0691,…
$ Long              65.0000, 20.1683, 1.6596, 1.5218, 17.8739, -61.7964, -63.6167, 45.0382, 1…
$ `1/22/20`         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ `1/23/20`         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ `1/24/20`         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ `1/25/20`         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
...
$ `4/27/20`         1703, 736, 3517, 743, 27, 24, 4003, 1808, 106, 3004, 28, 1033, 438, 214, …
$ `4/28/20`         1828, 750, 3649, 743, 27, 24, 4127, 1867, 106, 3016, 28, 1034, 438, 218, …

For the sake of space, I've removed some of the output (the middle items in the output).

As you can see, we have variables called "Province/State", "Country/Region", "Lat", "Long", as well as variables for different dates.

The data underneath each date is the number of confirmed cases for that region.

Things to do

There are a few things we'll need to do.

First, we'll need to rename some of the columns.

Second, we'll need to reshape the data. Ideally, we want this data in so-called "tidy" format. That means that we'll want all of the dates to exist under a single "date" variable, instead of spread across multiple variables.

Let's do these things one at a time.

Rename Columns

Let's just start by renaming some of the columns.

As we just saw, many of the columns are sort of poorly named.

We have names like "Province/State", "Country/Region", "Lat", and "Long".

These are somewhat easy to understand (they're mostly self explanatory), but they're a little problematic as well.

Two of the variables have slashes. They're long and hard to type.

All of these variable have capital letters. Again, this makes them harder to type.

We'll manually rename these with the dplyr rename function.

#---------------
# RENAME COLUMNS
#---------------
covid_data_RAW %>% 
  rename('subregion' = `Province/State` 
         ,'country' = 'Country/Region'
         ,'lat' = 'Lat'
         ,'long' = 'Long'
         ) ->
  covid_data

Notice here that we're using the dplyr "pipe" syntax to pipe covid_data_RAW into the rename() function. Inside of rename(), we're providing a set of pairs that specify the new name and the old name.

And we're saving the output as covid_data.

Reshape Data

Next, let's reshape the data.

Remember: we want our data in so-called "tidy" form.

To get there, we need to transpose the data, so that the different date variables are "pivoted", and all of the dates exist under a single "date" variable.

To do this, we're going to use the pivot_longer() function. (We could also use the gather() function.)

Here, we're going to pivot all of the columns besides country, subregion, lat, and long. To do that, notice that we're using the cols parameter in conjunction with -one_of().

The name of the new column in the output will be 'date'. We're specifying that with the names_to parameter.

The name of the new values will be 'confirmed'. We're specifying that with the values_to parameter.

covid_data %>% 
  pivot_longer(cols = -one_of('country','subregion','lat','long')
               ,names_to = 'date'
               ,values_to = 'confirmed'
               ) ->
  covid_data

The output of this process is saved right back to the covid_data dataframe.

Be careful! I tested this code several times, so I know that it works. But before I finalized this code, I was saving it to a different variable with a different name, because I didn't want to overwrite my data with incorrect output.

So again, when you're manipulating your data like this, test your code to make sure that it works before you overwrite your input data!

Ok. Now that we've pivoted the data, let's take a look at the output:

head(covid_data)

OUT:

# A tibble: 6 x 6
  subregion country       lat  long date    confirmed
                       
1 NA        Afghanistan    33    65 1/22/20         0
2 NA        Afghanistan    33    65 1/23/20         0
3 NA        Afghanistan    33    65 1/24/20         0
4 NA        Afghanistan    33    65 1/25/20         0
5 NA        Afghanistan    33    65 1/26/20         0
6 NA        Afghanistan    33    65 1/27/20         0

Very good.

Take a look at the two new variables: date and confirmed.

The date variable contains all of our dates ... they are now all listed under a single variable.

The confirmed variable is the new "values" variable. These are the values that had previously been under the different date variables, before we transposed the data.

To be clear: transposing data like this is a little confusing to many beginners. I recommend that you learn how to use the tidyverse transpose functions that enable you to pivot your data, like pivot_longer().

Ok. So now our data is in the right shape, but there are still some problems.

Let's move on to the column order.

Reorder columns

Here, we're going to reorder the columns somewhat.

This isn't a big deal, but I personally want to see the higher-level country variable to the left of the lower-level subregion variable.

To do this, we'll use the dplyr::select() function to reorder the columns in the dataframe.

#================
# REORDER COLUMNS
#================

covid_data %>% 
  select(country, subregion, everything()) ->
  covid_data

Here, we're putting country as the first variable, followed by subregion, followed by the remaining columns.

Convert Dates

Next, we're going to convert the dates in the 'date' variable to proper dates.

If you look at the data type, the values in the date column are actually character strings.

covid_data %>% 
  select(date)

OUT:

   date   
     
 1 1/22/20
 2 1/23/20
 3 1/24/20
 4 1/25/20
 5 1/26/20
 6 1/27/20
 7 1/28/20
 8 1/29/20
 9 1/30/20
10 1/31/20
# … with 25,862 more rows

Again, this is character data. That's not exactly what we want. We really want them to be R dates.

To do this, we're going to use the mdy() function from lubridate.

I'll give you one guess what exactly the mdy() function does.

Look at the name of the function. Then look at the format of the character dates.

If you guessed that mdy() transforms character dates of the MONTH-DAY-YEAR form into R dates, then congratulations ... you get a gold star.

Lubridate is extremely useful when working with character dates, and very easy to use. All of the functions are named to correspond to the format of the input character data. The mdy() transforms data of the format MONTH-DAY-YEAR. The dmy() function transforms data of the format DAY-MONTH-YEAR. And so on.

The functions are named in a clever and easy-to-use way. Love it.

Ok. Let's actually run the code.

Here, we're going to use mdy() inside of the dplyr::mutate() function.

Technically, dplyr::mutate() creates new variables. But here, we're going to use it to overwrite the date variable.

The mdy() function will take the original date variable and transform it to an R date. The output of mdy() will be stored in the date (overwriting it).

And the output of the entire process is being stored back to covid_data. So we're sending the output of mutate() and overwriting the covid_data dataframe.

#-------------
# CONVERT DATE
#-------------
covid_data %>% 
  mutate(date = mdy(date)) ->
  covid_data

After doing this, the covid_data dataframe still has the same variables, but the date variable is actually an R date.

You can check that by printing out the data and looking at the datatypes.

head(covid_data)

OUT:

# A tibble: 6 x 6
  country     subregion   lat  long date       confirmed
                         
1 Afghanistan NA           33    65 2020-01-22         0
2 Afghanistan NA           33    65 2020-01-23         0
3 Afghanistan NA           33    65 2020-01-24         0
4 Afghanistan NA           33    65 2020-01-25         0
5 Afghanistan NA           33    65 2020-01-26         0
6 Afghanistan NA           33    65 2020-01-27         0

Rearange columns and sort

Here, we're going to do rearange the rows and columns somewhat.

We'll use select() to manually reorder the columns.

And we'll use dplyr::arrange() to sort the rows by country, subregion, and date. (I think they already were, but this will make 100% sure).

#=====================
# SORT & REARANGE DATA
#=====================
covid_data %>% 
  select(country, subregion, date, lat, long, confirmed) %>% 
  arrange(country, subregion, date) ->
  covid_data

And let's print the data:

print(covid_data)

OUT:

# A tibble: 25,872 x 6
   country     subregion date         lat  long confirmed
                          
 1 Afghanistan NA        2020-01-22    33    65         0
 2 Afghanistan NA        2020-01-23    33    65         0
 3 Afghanistan NA        2020-01-24    33    65         0
 4 Afghanistan NA        2020-01-25    33    65         0
 5 Afghanistan NA        2020-01-26    33    65         0
 6 Afghanistan NA        2020-01-27    33    65         0
 7 Afghanistan NA        2020-01-28    33    65         0
 8 Afghanistan NA        2020-01-29    33    65         0
 9 Afghanistan NA        2020-01-30    33    65         0
10 Afghanistan NA        2020-01-31    33    65         0
# … with 25,862 more rows

.... mmmm, looks goooood.

Check Data

Alright, at this point, we're really almost done. The data look to be in the right shape ... we should just check a few things.

First, I just want to get a list of the country names. We'll be using these names later, so I want to check the values.

# GET COUNTRY NAMES
covid_data %>% 
  select(country) %>% 
  unique() #%>% print(n = 200)

In that code, I commented out the print() function. If you un-comment it, it will print all of the data.

But here, for the sake of brevity, output is truncated to the first 10 rows:

   country            
                 
 1 Afghanistan        
 2 Albania            
 3 Algeria            
 4 Andorra            
 5 Angola             
 6 Antigua and Barbuda
 7 Argentina          
 8 Armenia            
 9 Australia          
10 Austria            
# … with 175 more rows

You can see though that the country variable has the names of world countries.

Pull records for United States

Next, let's pull some records from the USA.

This will be good because it will enable us to look at the confirmed cases for specific days and cross reference them with other "official" sources of data.

Here, we're going to use the dplyr filter function to subset our R dataframe, and we'll use it inconjunction with group_by() and summarise() to get the summed-up confirmed cases for the USA, by date.

# PULL DATA FOR UNITED STATES
#-----------
# SPOT CHECK 
#-----------
covid_data %>% 
  select(country, date, confirmed) %>% 
  filter(country == 'US') %>% 
  group_by(country, date) %>% 
  summarise(confirmed = sum(confirmed))

OUT:

# A tibble: 98 x 3
# Groups:   country [1]
   country date       confirmed
               
 1 US      2020-01-22         1
 2 US      2020-01-23         1
 3 US      2020-01-24         2
 4 US      2020-01-25         2
 5 US      2020-01-26         5
 6 US      2020-01-27         5
 7 US      2020-01-28         5
 8 US      2020-01-29         5
 9 US      2020-01-30         5
10 US      2020-01-31         7
# … with 88 more rows

I'll let you do the dirty work of checking the numbers against the official JHU data for the USA.

Next steps

Ok.

We now have a working dataset of confirmed covid19 data, by date, by country.

There's more that we can do though ....

Next, we'll want to get the data for deaths and recovered cases.

We'll also want to inspect/explore our data more, to check it and make sure that everything looks okay.

And later, we'll want to visualize and analyze the data in any number of ways.

There's a lot more to do!

Sign up to learn more

Do you want to see part 2 of this R data analysis series?

What to see the the other tutorials later on?

Sign up for our email list now.

When you sign up, you'll get our tutorials delivered directly to your inbox.

Joshua Ebner

Joshua Ebner is the founder, CEO, and Chief Data Scientist of Sharp Sight.   Prior to founding the company, Josh worked as a Data Scientist at Apple.   He has a degree in Physics from Cornell University.   For more daily data science advice, follow Josh on LinkedIn.

20 thoughts on “R Data Analysis: covid-19, part 1 [data wrangling]”

    • I’m going to publish several other installments in this R covid-19 series ….

      At some point, we’ll start visualizing the data with ggplot2.

      Reply
    • Which course are you talking about?

      We have several courses at Sharp Sight. We have a few free “Crash Courses”, but we also have some premium paid courses.

      Based on your comment, I’m not entirely sure which one you’re talking about, so can’t fix the problem.

      If you provide more info, I might be able to help ….

      Reply
  1. You mention RStudio. FWIW, I’ve been using RStudio for the Python parts of this (or previous?) series, courtesy of the reticulate package in R.

    Basically you just open a “Python Script” document, type in some Python code, and run it in the usual way, e.g., CTRL-Enter. A Python REPL opens automatically in the Console window:


    > reticulate::repl_python()
    Python 3.7.7 (/home/mike/miniconda3/bin/python3)
    Reticulate 1.15 REPL -- A Python interpreter in R.
    >>> import numpy as np
    >>> print(np.pi)
    3.141592653589793
    >>>

    In my case I’ve specified a non-standard location for the Python executable, but IIRC that isn’t necessary if you’re willing to use your system’s default Python.

    Reply
  2. I think that the str function is more-useful than just printing out the original date column, i.e., in order to see it’s a character variable:


    > str(covid_data$date)
    chr [1:26866] "1/22/20" "1/23/20" "1/24/20" "1/25/20" "1/26/20" "1/27/20" ...
    > str(covid_data["date"])
    tibble [26,866 × 1] (S3: tbl_df/tbl/data.frame)
    $ date: chr [1:26866] "1/22/20" "1/23/20" "1/24/20" "1/25/20" ...

    Reply
  3. I don’t understand the example for “SPOT CHECK” of US data. The lines after:

    filter(country == 'US')

    seem to be gratuitous. After the filter operation all the values of `country` are guaranteed to be `US`, and each date appears exactly once.

    To be definite:


    > covid_data %>%
    + select(country, date, confirmed) %>%
    + filter(country == 'US')
    # A tibble: 101 x 3
    country date confirmed

    1 US 2020-01-22 1
    2 US 2020-01-23 1
    3 US 2020-01-24 2
    4 US 2020-01-25 2
    5 US 2020-01-26 5
    6 US 2020-01-27 5
    7 US 2020-01-28 5
    8 US 2020-01-29 5
    9 US 2020-01-30 5
    10 US 2020-01-31 7
    # … with 91 more rows

    Am I missing something?

    Reply
    • Hi Michael,

      I am sure your comments are well intentioned. However, I personally think that when someone offers their time, talent and code to us, unless we are correcting mistakes, errors or bugs; we should just thank them. Beautification comments that only bring attention to yourself rather than the author are probably best kept as a thought and not a public expression.

      Reply
  4. Hi, I receive the following error when I run the code to read the data!
    Any advice, please.
    covid_data_RAW <- read_csv(url_confirmed)
    Error in open.connection(con, "rb") :
    Could not resolve proxy: proxy.server.com

    Reply
      • Thanks, Ahmad,

        Yes, of course, I have the url in my Renviron.

        > url_confirmed covid_data_RAW <- read_csv(url_confirmed)
        Error in open.connection(con, "rb") :
        Could not resolve proxy: proxy.server.com

        Reply
  5. covid_data <- read.csv(file_path,sep = ";")
    this one works but I think there are some issues with reading the data with this code as I can not get the right outputs and graphs.

    Reply
    • read.csv(file_path,sep = ";") is not the code from this tutorial. That’s the code from a different tutorial with a different data file.

      It won’t work right if you try to use it for the rest of the code for part 1.

      Reply
      • Thanks, Sharp,

        Yes, you are right, but when I am using read_csv, I receive the following error.

        url_confirmed covid_data_RAW <- read_csv(url_confirmed)
        Error in open.connection(con, "rb") :
        Could not resolve proxy: proxy.server.com

        Any suggestion?

        Reply

Leave a Comment