In this tutorial, we’re going to analyze covid19 data using R and the Tidyverse.
Recently, we started a data analysis series, where we analyzed covid19 data. The first set of these tutorials was performed with Python.
Here, in this tutorial, we’re going to start re-doing that work with the R programming language.
So in this tutorial, I’ll show you how to get and “wrangle” a dataset of covid19 data using R.
A quick review of the process
In this specific tutorial, we’ll focus on getting and “wrangling” a single csv file that contains data for covid19 confirmed cases.
This will help us define the data wrangling process for these covid19 files. We’ll be able to use this process later when we get data from different (but related) files for covid19 deaths and recoveries. The confirmed cases, deaths, and recoveries are in separate data files, and we’ll eventually need to get them all and combine them together. But before we do that, we’ll start with a more simple case of getting just one of the files.
After we get the process working for one file in this tutorial, we’ll extend the process in the next tutorial to get the remaining files and merge them all together.
Later, in subsequent tutorials, we’ll visualize and analyze the data using other tools from R’s tidyverse.
Tools you’ll need
In this tutorial, we’ll be using the R programming language.
More specifically, we’ll be using the Tidyverse, and in particular, we’ll use dplyr and lubridate to retrieve data and get it into the right shape.
It’ll definitely be helpful if you know dplyr, but if you don’t, you should be able to follow along by copy-pasting the code into RStudio (or another editor) and running the code.
A quick Table of Contents
Here’s a quick table of contents to give you an overview of the steps.
If you want to jump to a particular section, you can click on any of the links.
Table of Contents:
- Import packages
- Get data
- Rename columns
- Reshape data
- Convert dates
- Rearrange columns and sort data
- Check data
- Next steps
As always though, you’ll learn a lot more if you read the whole tutorial start to finish.
Ok. Let’s get to it.
First, let’s just import the libraries that we’ll need to use.
We’re going to use the tidyverse package as well as lubridate.

#================
# IMPORT PACKAGES
#================
library(tidyverse)
library(lubridate)

The tidyverse package actually contains several other packages, including dplyr and ggplot2. We’ll primarily be using dplyr here, but we’ll use a few other functions as well.

lubridate will help us manipulate the dates in the dataset.
Get raw data
Next, we’re just going to get the data for confirmed covid19 cases. This data was compiled by the Johns Hopkins CSSE team.

For the most part, getting the data is simple enough.
The data exists at a URL on github.
We’ll just use the read_csv() function (from readr, which loads with the tidyverse) to read in the raw data.

#=============
# GET RAW DATA
#=============
url_confirmed <- 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv'

covid_data_RAW <- read_csv(url_confirmed)
After running this, we'll have a dataframe called covid_data_RAW.

Now, let's quickly inspect the data.

Here, we'll use the glimpse() function to get a quick look at the data and how it's structured.
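If you'd like to see what glimpse() does in isolation, here's a minimal sketch on a tiny made-up tibble (toy_raw is a hypothetical stand-in for covid_data_RAW, so this snippet runs without downloading anything):

```r
library(tibble)
library(dplyr)

# Toy stand-in for covid_data_RAW, with the same kinds of columns
toy_raw <- tibble(
  `Province/State` = c(NA, NA),
  `Country/Region` = c("Afghanistan", "Albania"),
  Lat       = c(33.0000, 41.1533),
  Long      = c(65.0000, 20.1683),
  `1/22/20` = c(0, 0),
  `1/23/20` = c(0, 0)
)

# glimpse() prints one line per variable: name, type, and first values
glimpse(toy_raw)

# For the real data, the call has the same shape:
# covid_data_RAW %>% glimpse()
```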
Observations: 264
Variables: 102
$ `Province/State` NA, NA, NA, NA, NA, NA, NA, NA, "Australian Capital Territory", "New Sout…
$ `Country/Region` "Afghanistan", "Albania", "Algeria", "Andorra", "Angola", "Antigua and Ba…
$ Lat              33.0000, 41.1533, 28.0339, 42.5063, -11.2027, 17.0608, -38.4161, 40.0691,…
$ Long             65.0000, 20.1683, 1.6596, 1.5218, 17.8739, -61.7964, -63.6167, 45.0382, 1…
$ `1/22/20`        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ `1/23/20`        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ `1/24/20`        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ `1/25/20`        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
...
$ `4/27/20`        1703, 736, 3517, 743, 27, 24, 4003, 1808, 106, 3004, 28, 1033, 438, 214, …
$ `4/28/20`        1828, 750, 3649, 743, 27, 24, 4127, 1867, 106, 3016, 28, 1034, 438, 218, …
For the sake of space, I've removed some of the output (the middle items in the output).
As you can see, we have variables called "Province/State", "Country/Region", "Lat", "Long", as well as variables for different dates.
The data underneath each date is the number of confirmed cases for that region.
Things to do
There are a few things we'll need to do.
First, we'll need to rename some of the columns.
Second, we'll need to reshape the data. Ideally, we want this data in so-called "tidy" format. That means that we'll want all of the dates to exist under a single "date" variable, instead of spread across multiple variables.
Let's do these things one at a time.
Let's just start by renaming some of the columns.
As we just saw, many of the columns are sort of poorly named.
We have names like "Province/State", "Country/Region", "Lat", and "Long".
These are somewhat easy to understand (they're mostly self explanatory), but they're a little problematic as well.
Two of the variables have slashes. They're long and hard to type.
All of these variables have capital letters. Again, this makes them harder to type.
We'll manually rename these with the dplyr rename function.
#---------------
# RENAME COLUMNS
#---------------
covid_data_RAW %>%
  rename('subregion' = `Province/State`
         ,'country' = 'Country/Region'
         ,'lat' = 'Lat'
         ,'long' = 'Long'
         ) ->
  covid_data
Notice here that we're using the dplyr "pipe" syntax to pipe covid_data_RAW into the rename() function. Inside of rename(), we're providing a set of pairs that specify the new name and the old name.

And we're saving the output as covid_data.
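If the ordering of those pairs is ever unclear, remember it's always new_name = old_name. Here's a minimal sketch on a made-up two-column tibble:

```r
library(tibble)
library(dplyr)

df <- tibble(`Province/State` = c(NA, "Ontario"), Lat = c(33, 51.3))

# rename() pairs are new_name = old_name
df_renamed <- df %>%
  rename(subregion = `Province/State`, lat = Lat)

names(df_renamed)
# "subregion" "lat"
```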
Next, let's reshape the data.
Remember: we want our data in so-called "tidy" form.
To get there, we need to transpose the data, so that the different date variables are "pivoted", and all of the dates exist under a single "date" variable.
To do this, we're going to use the pivot_longer() function. (We could also use the older gather() function, but pivot_longer() is its more recent replacement in tidyr.)

Here, we're going to pivot all of the columns besides country, subregion, lat, and long. To do that, notice that we're using the cols parameter in conjunction with one_of().

The name of the new column in the output will be 'date'. We're specifying that with the names_to parameter.

The name of the new values column will be 'confirmed'. We're specifying that with the values_to parameter.
covid_data %>%
  pivot_longer(cols = -one_of('country','subregion','lat','long')
               ,names_to = 'date'
               ,values_to = 'confirmed'
               ) ->
  covid_data
The output of this process is saved right back to the covid_data variable.
Be careful! I tested this code several times, so I know that it works. But before I finalized this code, I was saving it to a different variable with a different name, because I didn't want to overwrite my data with incorrect output.
So again, when you're manipulating your data like this, test your code to make sure that it works before you overwrite your input data!
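One safe pattern (a sketch of the workflow, not required code; covid_data here is a toy stand-in and covid_data_TEST is a hypothetical name) is to pipe into a temporary variable first, inspect it, and only then overwrite the original:

```r
library(tibble)
library(dplyr)
library(tidyr)

# Toy stand-in for the real dataframe
covid_data <- tibble(country = c("US", "Italy"), `1/22/20` = c(1, 0))

# 1. Save the result under a temporary name...
covid_data_TEST <- covid_data %>%
  pivot_longer(cols = -country, names_to = 'date', values_to = 'confirmed')

# 2. ...inspect it...
print(covid_data_TEST)

# 3. ...and only once it looks right, overwrite the original
covid_data <- covid_data_TEST
```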
Ok. Now that we've pivoted the data, let's take a look at the output:
# A tibble: 6 x 6
  subregion country      lat  long date    confirmed
1 NA        Afghanistan   33    65 1/22/20         0
2 NA        Afghanistan   33    65 1/23/20         0
3 NA        Afghanistan   33    65 1/24/20         0
4 NA        Afghanistan   33    65 1/25/20         0
5 NA        Afghanistan   33    65 1/26/20         0
6 NA        Afghanistan   33    65 1/27/20         0
Take a look at the two new variables:

- The date variable contains all of our dates ... they are now all listed under a single variable.
- The confirmed variable is the new "values" variable. These are the values that had previously been under the different date variables, before we transposed the data.
To be clear: transposing data like this is a little confusing to many beginners. I recommend that you learn how to use the tidyverse functions that enable you to pivot your data, like pivot_longer() and pivot_wider().
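Here's a minimal pivot_longer() sketch on a tiny made-up tibble, so you can see the wide-to-long move in isolation:

```r
library(tibble)
library(dplyr)
library(tidyr)

# A small "wide" table: one column per date
wide <- tibble(
  country   = c("Afghanistan", "Albania"),
  `1/22/20` = c(0, 0),
  `1/23/20` = c(0, 1)
)

# Every column except country gets pivoted into date/confirmed pairs
long <- wide %>%
  pivot_longer(cols = -country, names_to = 'date', values_to = 'confirmed')

print(long)
# 4 rows: one row per country/date combination
```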
Ok. So now our data is in the right shape, but there are still some problems.
Let's move on to the column order.
Here, we're going to reorder the columns somewhat.
This isn't a big deal, but I personally want to see the higher-level country variable to the left of the lower-level subregion variable.
To do this, we'll use the dplyr::select() function to reorder the columns in the dataframe.

#================
# REORDER COLUMNS
#================
covid_data %>%
  select(country, subregion, everything()) ->
  covid_data
Here, we're putting country as the first variable, followed by subregion, followed by the remaining columns.
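As a quick sketch on a made-up tibble, everything() simply means "all the columns I haven't named yet," so you only have to type the columns you want moved to the front:

```r
library(tibble)
library(dplyr)

df <- tibble(lat = 33, long = 65, country = "Afghanistan", subregion = NA)

# Name the lead columns; everything() keeps the rest in their old order
df %>%
  select(country, subregion, everything()) %>%
  names()
# "country" "subregion" "lat" "long"
```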
Next, we're going to convert the dates in the 'date' variable to proper dates.

If you look at the data type, the values in the date column are actually character strings.
covid_data %>% select(date)

# A tibble: 25,872 x 1
   date
 1 1/22/20
 2 1/23/20
 3 1/24/20
 4 1/25/20
 5 1/26/20
 6 1/27/20
 7 1/28/20
 8 1/29/20
 9 1/30/20
10 1/31/20
# … with 25,862 more rows
Again, this is character data. That's not exactly what we want. We really want them to be R dates.
To do this, we're going to use the mdy() function from lubridate.

I'll give you one guess what exactly the mdy() function does.

Look at the name of the function. Then look at the format of the character dates.

If you guessed that mdy() transforms character dates of the MONTH-DAY-YEAR form into R dates, then congratulations ... you get a gold star.
Lubridate is extremely useful when working with character dates, and very easy to use. All of the functions are named to correspond to the format of the input character data. The mdy() function transforms data of the format MONTH-DAY-YEAR. The dmy() function transforms data of the format DAY-MONTH-YEAR. And so on.
The functions are named in a clever and easy-to-use way. Love it.
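A quick self-contained sketch of that naming scheme:

```r
library(lubridate)

# The function name matches the order of the components in the input
mdy("1/22/20")    # month/day/year -> Date: 2020-01-22
dmy("22-01-2020") # day/month/year -> Date: 2020-01-22
ymd("2020/01/22") # year/month/day -> Date: 2020-01-22
```

Note that each function is also tolerant of different separators (slashes, dashes, etc.), which is part of what makes lubridate so convenient.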
Ok. Let's actually run the code.
Here, we're going to use mdy() inside of the mutate() function.

dplyr::mutate() creates new variables. But here, we're going to use it to overwrite the date variable.

The mdy() function will take the original date variable and transform it to an R date. The output of mdy() will be stored in the date variable (overwriting it).

And the output of the entire process is being stored back to covid_data. So we're sending the output of mutate() back to covid_data and overwriting the original dataframe.
#-------------
# CONVERT DATE
#-------------
covid_data %>%
  mutate(date = mdy(date)) ->
  covid_data
After doing this, the covid_data dataframe still has the same variables, but the date variable is actually an R date.
You can check that by printing out the data and looking at the datatypes.
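One direct check is the class() of the column; for the real data that would be class(covid_data$date), sketched here on a toy vector:

```r
library(lubridate)

d <- mdy(c("1/22/20", "1/23/20"))

class(d)        # "Date" -- an actual R date, not a character string
is.character(d) # FALSE
```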
# A tibble: 6 x 6
  country     subregion   lat  long date       confirmed
1 Afghanistan NA           33    65 2020-01-22         0
2 Afghanistan NA           33    65 2020-01-23         0
3 Afghanistan NA           33    65 2020-01-24         0
4 Afghanistan NA           33    65 2020-01-25         0
5 Afghanistan NA           33    65 2020-01-26         0
6 Afghanistan NA           33    65 2020-01-27         0
Rearrange columns and sort

Here, we're going to rearrange the rows and columns somewhat.

We'll use select() to manually reorder the columns.

And we'll use dplyr::arrange() to sort the rows by date. (I think they already were sorted, but this will make 100% sure.)
#======================
# SORT & REARRANGE DATA
#======================
covid_data %>%
  select(country, subregion, date, lat, long, confirmed) %>%
  arrange(country, subregion, date) ->
  covid_data
And let's print the data:
# A tibble: 25,872 x 6
   country     subregion date        lat  long confirmed
 1 Afghanistan NA        2020-01-22   33    65         0
 2 Afghanistan NA        2020-01-23   33    65         0
 3 Afghanistan NA        2020-01-24   33    65         0
 4 Afghanistan NA        2020-01-25   33    65         0
 5 Afghanistan NA        2020-01-26   33    65         0
 6 Afghanistan NA        2020-01-27   33    65         0
 7 Afghanistan NA        2020-01-28   33    65         0
 8 Afghanistan NA        2020-01-29   33    65         0
 9 Afghanistan NA        2020-01-30   33    65         0
10 Afghanistan NA        2020-01-31   33    65         0
# … with 25,862 more rows
.... mmmm, looks goooood.
Alright, at this point, we're really almost done. The data look to be in the right shape ... we should just check a few things.
First, I just want to get a list of the country names. We'll be using these names later, so I want to check the values.
# GET COUNTRY NAMES
covid_data %>%
  select(country) %>%
  unique() #%>%
  #print(n = 200)
In that code, I commented out the print() function. If you un-comment it, it will print all of the data.
But here, for the sake of brevity, output is truncated to the first 10 rows:
   country
 1 Afghanistan
 2 Albania
 3 Algeria
 4 Andorra
 5 Angola
 6 Antigua and Barbuda
 7 Argentina
 8 Armenia
 9 Australia
10 Austria
# … with 175 more rows
You can see though that the country variable has the names of world countries.
Pull records for United States
Next, let's pull some records from the USA.
This will be good because it will enable us to look at the confirmed cases for specific days and cross reference them with other "official" sources of data.
Here, we're going to use the dplyr filter function to subset our R dataframe, and we'll use it in conjunction with summarise() to get the summed-up confirmed cases for the USA, by date.
# PULL DATA FOR UNITED STATES

#-----------
# SPOT CHECK
#-----------
covid_data %>%
  select(country, date, confirmed) %>%
  filter(country == 'US') %>%
  group_by(country, date) %>%
  summarise(confirmed = sum(confirmed))
# A tibble: 98 x 3
# Groups:   country [1]
   country date       confirmed
 1 US      2020-01-22         1
 2 US      2020-01-23         1
 3 US      2020-01-24         2
 4 US      2020-01-25         2
 5 US      2020-01-26         5
 6 US      2020-01-27         5
 7 US      2020-01-28         5
 8 US      2020-01-29         5
 9 US      2020-01-30         5
10 US      2020-01-31         7
# … with 88 more rows
I'll let you do the dirty work of checking the numbers against the official JHU data for the USA.
We now have a working dataset of confirmed covid19 data, by date, by country.
There's more that we can do though ....
Next, we'll want to get the data for deaths and recovered cases.
We'll also want to inspect/explore our data more, to check it and make sure that everything looks okay.
And later, we'll want to visualize and analyze the data in any number of ways.
There's a lot more to do!
Sign up to learn more
Do you want to see part 2 of this R data analysis series?
Want to see the other tutorials later on?
Sign up for our email list now.
When you sign up, you'll get our tutorials delivered directly to your inbox.