Python Data Analysis: covid-19 data, part 1 [data wrangling]

If you’re like most people around the world, you’re probably concerned with the coronavirus, AKA SARS-CoV-2, which for the sake of simplicity, I’ll refer to as covid-19.

As you’re reading this, there’s a reasonable chance that you’re stuck working from home or otherwise quarantined in some way.

Welcome to 2020 … it’s crazy.

Assuming that you’re quarantined right now, you probably have some extra time on your hands.

Great.

This is a perfect time to learn a new skill or do something productive.

In fact, it’s a great time to master data science in Python.

With that in mind, I’m putting together a series of tutorials to show you how to analyze covid-19 data in Python.

Want to learn Pandas? Want to learn data science in Python?

If you’re stuck at home, you’ve got no more excuses.

Let’s get after it.

A quick review of the process

In this series of tutorials, we’ll analyze some covid-19 data with Python.

Specifically, we’re going to use a few toolkits.

Ultimately, over the course of this series of tutorials, we’ll use Pandas, matplotlib, seaborn, and possibly a little Numpy.

You’ll probably want to be familiar with all of them, but if not, that’s okay.

You’ll still be able to run the code, and I’ll link to some other tutorials that explain individual techniques in more depth.

That being said, you’ll be able to play with the code regardless. But if you’re really serious about this, you’ll want to eventually master data science in Python.

What we’ll cover in this tutorial

As I mentioned, we’re going to work with this data over a series of tutorials.

This tutorial is the first.

And in this tutorial, we’re just going to start with some simple data wrangling.

We’re going to get our data and “wrangle” it into shape. Here’s a quick table of contents, so you can navigate to different sections.

Table of Contents:

Import Packages
Get raw data
Inspect
Rename Columns
Reshape Data
Convert Dates
Create date variable
Rearrange columns and sort
Set Index
Check Data
Pull records for United States
Next steps

Import Packages

First, we’re just going to import some packages.

For the time being, we’ll just import Pandas, although we’ll need to import other packages in future tutorials.

#================
# IMPORT PACKAGES
#================
import pandas as pd

Get raw data

Next, we’re going to get the raw csv data.

This data comes from a GitHub repository for covid-19 data, created by Johns Hopkins University.

We’re going to use the “raw” data, which is easier to retrieve.

To get the raw data, we’re going to use the Pandas read_csv function. The read_csv function does exactly what it sounds like … it reads in csv data.

(Note: the following URL was updated on March 30, after the Johns Hopkins CSSE team changed the data file location.)

#=============
# GET RAW DATA
#=============

# OLD FILE: 
#url = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-Confirmed.csv'

# NEW FILE:
url = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv'

covid_data_RAW = pd.read_csv(url)

Inspect

Now, let’s just inspect a few rows.

First, we’ll take a look at the columns:

covid_data_RAW.columns

OUT:

Index(['Province/State', 'Country/Region', 'Lat', 'Long', '1/22/20', '1/23/20',
       '1/24/20', '1/25/20', '1/26/20', '1/27/20', '1/28/20', '1/29/20',
       '1/30/20', '1/31/20', '2/1/20', '2/2/20', '2/3/20', '2/4/20', '2/5/20',
       '2/6/20', '2/7/20', '2/8/20', '2/9/20', '2/10/20', '2/11/20', '2/12/20',
       '2/13/20', '2/14/20', '2/15/20', '2/16/20', '2/17/20', '2/18/20',
       '2/19/20', '2/20/20', '2/21/20', '2/22/20', '2/23/20', '2/24/20',
       '2/25/20', '2/26/20', '2/27/20', '2/28/20', '2/29/20', '3/1/20',
       '3/2/20', '3/3/20', '3/4/20', '3/5/20', '3/6/20', '3/7/20', '3/8/20',
       '3/9/20', '3/10/20', '3/11/20', '3/12/20', '3/13/20', '3/14/20',
       '3/15/20', '3/16/20', '3/17/20', '3/18/20'],
      dtype='object')

Ok. This is frankly hard to look at. We’ll have to clean these up.

We can also print out a few rows with the head() method.

covid_data_RAW.head()

OUT:

  Province/State Country/Region      Lat  ...  3/16/20  3/17/20  3/18/20
0            NaN       Thailand  15.0000  ...      147      177      212
1            NaN          Japan  36.0000  ...      825      878      889
2            NaN      Singapore   1.2833  ...      243      266      313
3            NaN          Nepal  28.1667  ...        1        1        1
4            NaN       Malaysia   2.5000  ...      566      673      790

[5 rows x 61 columns]

Keep in mind that your data might look a little different. I’m retrieving this data on March 19, 2020, so the dataset has records up to 3/18/20. If you run this code on a different date, your data will be more up-to-date.
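If you want to quickly check how fresh your copy of the data is, the last column name is the most recent date. (This assumes the file keeps its current layout, with the date columns at the end.)

# CHECK MOST RECENT DATE IN YOUR PULL
# (assumes the date columns are the trailing columns, as in the current layout)
print(covid_data_RAW.columns[-1])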

Rename Columns

Now, we’re going to rename the columns.

To do this, we’re going to use the Pandas rename method. Notice that in order to do this, we need to specify new “key/value pairs” with a Python dictionary. Each “key” in the dictionary is the old column name, and the “value” in the dictionary is the new name.

So, you can read the items in the dictionary as {"old_name":"new_name"}.

#---------------
# RENAME COLUMNS
#---------------
covid_data = covid_data_RAW.rename(
        columns = {'Province/State':'subregion'
                   ,'Country/Region':'country'
                   ,'Lat':'lat'
                   ,'Long':'long'
                   }
        )

Notice that we’re saving the output of this method with the name covid_data.

That means that after running this code, our original dataset, covid_data_RAW, will remain in its original state. I’m doing that to keep the raw data intact, in case we make any mistakes and need to start over. (Although that might not be a huge concern, since we can always re-read the data with read_csv.)

The data that we’ll be working with going forward is covid_data.

Reshape Data

Next, we’re going to reshape the data.

Why?

One thing that you’ll notice about the data is that the dates all exist as separate columns. This is not what we need.

When we use many data analysis or visualization techniques, like groupby, agg, and Seaborn functions, we need the dates to exist as values underneath a single column.

So right now, we need to reshape the data so that those dates exist under one single “date” column.

To do this, we need to use the Pandas melt function. Pandas has two complementary functions for reshaping data, melt and pivot: pivot transforms data from long form to wide form, and melt transforms data from wide form to long form, which is what we need here.

So here, we’re going to use the melt function to reshape our data:

#==================================
# RESHAPE THE DATA
# - melt the data into 'tidy' shape
#==================================
covid_data = (covid_data.melt(id_vars = ['country','subregion','lat','long']
                 ,var_name = 'date_RAW'
                 ,value_name = 'confirmed'
                 )
)

Notice that when we did this, we created two new variables: date_RAW and confirmed.

The first is a date field (which we’ll need to wrangle further), and the second is the number of confirmed cases.

Now that we’re finished with that step, let’s print some data to take a look:

print(covid_data)

OUT:

              country   subregion      lat      long date_RAW  confirmed
0            Thailand         NaN  15.0000  101.0000  1/22/20          2
1               Japan         NaN  36.0000  138.0000  1/22/20          2
2           Singapore         NaN   1.2833  103.8333  1/22/20          0
3               Nepal         NaN  28.1667   84.2500  1/22/20          0
4            Malaysia         NaN   2.5000  112.5000  1/22/20          0
              ...         ...      ...       ...      ...        ...
26329     Netherlands       Aruba  12.5186  -70.0358  3/18/20          4
26330          Zambia         NaN -15.4167   28.2833  3/18/20          2
26331        Djibouti         NaN  11.8251   42.5903  3/18/20          1
26332     Gambia, The         NaN  13.4432  -15.3101  3/18/20          1
26333  United Kingdom  Montserrat  16.7425  -62.1874  3/18/20          1

Take a look at the output. We now have two new variables named date_RAW and confirmed.

The confirmed variable is the number of confirmed covid-19 cases, for a particular place, on a particular date.

The date_RAW variable is a string-based date variable, which means that it’s not in the form that we need it to be in.

So next, we’ll convert the dates into proper date/time data.

Convert Dates

First, let’s just print out some dates.

Here, we’re going to do this with the Pandas filter method to select the date_RAW column.

#--------------
# CONVERT DATES
#--------------

# INSPECT DATES
(covid_data
    .filter(['date_RAW'])
)

OUT:

      date_RAW
0      1/22/20
1      1/22/20
2      1/22/20
3      1/22/20
4      1/22/20
       ...
26329  3/18/20
26330  3/18/20
26331  3/18/20
26332  3/18/20
26333  3/18/20

The dates are strings in month/day/year form, with no zero padding and a two-digit year (e.g., 1/22/20).

Let’s try to convert these dates to proper datetime data.
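Before we touch the DataFrame, here’s a quick one-off sanity check of the format string on a single date string. In the format codes, %m is the month, %d is the day, and %y is the two-digit year:

# QUICK TEST OF THE FORMAT STRING ON ONE DATE
pd.to_datetime('1/22/20', format='%m/%d/%y')
# OUT: Timestamp('2020-01-22 00:00:00')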

Test date conversion

Instead of directly converting the data, I actually want to test this out first.

Here, we’re going to use several Pandas methods in a “chain” to test our date conversion and give us a preview of the results. (Note that this is similar to dplyr pipes in R.)

Keep in mind, because we’re not assigning the output back to a variable, this will not modify the original DataFrame. That’s good … it gives us the ability to test the operation first.

#---------------------
# TEST DATE CONVERSION
#---------------------
(covid_data
    .assign(date = pd.to_datetime(covid_data.date_RAW, format='%m/%d/%y'))
    .filter(['date','date_RAW','confirmed'])
    .groupby(['date','date_RAW'])
    .agg('sum')
    .sort_values('date')
)

And here are a few rows of output:

                     confirmed
date       date_RAW           
2020-01-22 1/22/20         555
2020-01-23 1/23/20         653
2020-01-24 1/24/20         941
2020-01-25 1/25/20        1434
2020-01-26 1/26/20        2118
2020-01-27 1/27/20        2927
2020-01-28 1/28/20        5578
2020-01-29 1/29/20        6166
2020-01-30 1/30/20        8234
2020-01-31 1/31/20        9927
2020-02-01 2/1/20        12038
2020-02-02 2/2/20        16787
2020-02-03 2/3/20        19881
2020-02-04 2/4/20        23892
2020-02-05 2/5/20        27635

What happened here?

The code created a new variable in the output called date.

Then we retrieved only a few columns (date, date_RAW, confirmed) using the filter method.

Then we grouped and aggregated the data. The grouping is actually the critical step. What I want to see here is all of the unique combinations of date and date_RAW, side by side. The way I chose to do that is by using groupby, then agg.

So again, why did I do this?

I did it because I want to see the old date (date_RAW) and the new date (date) side by side. I want to compare and “spot check” to make sure that the assign operation using pd.to_datetime worked properly.

When you compare the old date and new date side by side, it looks like they match. That is, it looks like our new datetime field, date, was created properly.

But remember: the code we just ran did NOT change the original dataset yet. It was just a “test”.

Now that we tested this date conversion, we’ll run it properly and save the output so that we have our new date.
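As an aside, another way to run the same spot check (not part of the original workflow, just an alternative) is to keep only the two date columns and drop duplicate rows, which also yields one row per unique date pair:

# ALTERNATIVE SPOT CHECK: one row per unique (date_RAW, date) pair
(covid_data
    .assign(date = pd.to_datetime(covid_data.date_RAW, format='%m/%d/%y'))
    .filter(['date_RAW','date'])
    .drop_duplicates()
)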

Create date variable

Now, we’ll actually convert our date.

Before we do that though, we’ll create a copy of our data, just to back it up.

This might be unnecessary and will take up extra storage. But I sometimes do this at intermediate points in my scripts, just to create a backup before I make changes.

You can choose to skip the backup if you want.

# BACKUP (optional)
covid_data_backup_BEFOREDATE = covid_data.copy()

Ok. Now that we backed up our data, we’ll modify covid_data and create our new variable, date.

#-------------
# CONVERT DATE
#-------------
covid_data = covid_data.assign(
         date = pd.to_datetime(covid_data.date_RAW, format='%m/%d/%y')
         )

At this point, covid_data contains a date variable that’s formatted as a proper datetime.
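If you want to verify that, you can check the dtype of the new column directly; it should report datetime64[ns]:

# CHECK THE DTYPE OF THE NEW date VARIABLE
print(covid_data.date.dtype)
# OUT: datetime64[ns]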

Rearrange columns and sort

Now, we’re just going to clean up our data a little.

I’m going to rearrange the columns with filter, and sort the data with sort_values.

#======================
# SORT & REARRANGE DATA
#======================
covid_data = (covid_data
               .filter(['country', 'subregion', 'date', 'lat', 'long', 'confirmed'])
               .sort_values(['country','subregion','lat','long','date'])
               )

And let’s print the data:

print(covid_data)

OUT:

           country subregion       date      lat     long  confirmed
26     Afghanistan       NaN 2020-01-22  33.0000  65.0000          0
488    Afghanistan       NaN 2020-01-23  33.0000  65.0000          0
950    Afghanistan       NaN 2020-01-24  33.0000  65.0000          0
1412   Afghanistan       NaN 2020-01-25  33.0000  65.0000          0
1874   Afghanistan       NaN 2020-01-26  33.0000  65.0000          0
           ...       ...        ...      ...      ...        ...
24482       Zambia       NaN 2020-03-14 -15.4167  28.2833          0
24944       Zambia       NaN 2020-03-15 -15.4167  28.2833          0
25406       Zambia       NaN 2020-03-16 -15.4167  28.2833          0
25868       Zambia       NaN 2020-03-17 -15.4167  28.2833          0
26330       Zambia       NaN 2020-03-18 -15.4167  28.2833          2

Alright, alright, alright.

This is starting to look goooood.

Set Index

Let’s do one last thing.

Here, we’ll set the index to ‘country’. This will probably be temporary (we’ll probably change the index in future tutorials), but for the time being, this will give us the ability to retrieve data based on country name.

To set the index for the DataFrame, we’ll use the Pandas set_index method.

# SET INDEX
covid_data.set_index('country', inplace = True)

Notice that when we do this, we’re setting inplace = True. This will directly modify our dataset, covid_data.
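If you prefer to avoid inplace operations (they can make method chains harder to compose), an equivalent alternative is to reassign the result. Run one version or the other, not both:

# EQUIVALENT ALTERNATIVE, WITHOUT inplace:
#covid_data = covid_data.set_index('country')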

Check Data

Finally, I’m just going to look at a few things.

I want to get a list of the country names, because we’re currently using country as our index.

To do this, I’m going to chain together several Pandas methods, including reset_index, filter, and drop_duplicates.

Essentially, we’re creating a list of the unique values of country.

# GET COUNTRY NAMES
pd.set_option('display.max_rows', 155)
(covid_data
    .reset_index()
    .filter(['country'])
    .drop_duplicates()
    .head(n = 200)
)
pd.reset_option('display.max_rows')

Here is a truncated list of the countries.

                                country
0                           Afghanistan
57                              Albania
114                             Algeria
171                             Andorra
228                 Antigua and Barbuda
285                           Argentina
...                                 ...
11571                                US
25650                           Ukraine
25707              United Arab Emirates
25764                    United Kingdom
26049                           Uruguay
26106                        Uzbekistan
26163                         Venezuela
26220                           Vietnam
26277                            Zambia

Note that the official name for the United States in the data is “US”.
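Incidentally, since country is now the index, a more direct way to get the same list of unique names (assuming you’ve already run the set_index step) is:

# ALTERNATIVE: unique values of the index
covid_data.index.unique()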

Pull records for United States

Ok. Now, one last thing.

We’ll pull the records for the United States using the Pandas loc method.

# PULL DATA FOR UNITED STATES
covid_data.loc['US']

OUT:

               subregion       date      lat      long  confirmed
country                                                          
US             Adams, IN 2020-01-22  39.8522  -77.2865          0
US             Adams, IN 2020-01-23  39.8522  -77.2865          0
US             Adams, IN 2020-01-24  39.8522  -77.2865          0
US             Adams, IN 2020-01-25  39.8522  -77.2865          0
US             Adams, IN 2020-01-26  39.8522  -77.2865          0
                 ...        ...      ...       ...        ...
US       Yolo County, CA 2020-03-14  38.7646 -121.9018          0
US       Yolo County, CA 2020-03-15  38.7646 -121.9018          0
US       Yolo County, CA 2020-03-16  38.7646 -121.9018          0
US       Yolo County, CA 2020-03-17  38.7646 -121.9018          0
US       Yolo County, CA 2020-03-18  38.7646 -121.9018          0

Here, you can see that we have data for the United States at the country level and the subregion level (e.g., individual US counties). We have multiple dates, and the latitude/longitude for the data. And ultimately, we have the number of confirmed cases.

This gives us a lot that we can potentially do.
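Just as a small preview of the kind of analysis we can do (this isn’t part of the wrangling workflow itself), here’s a quick aggregation that sums the US confirmed cases by date, collapsing all of the subregions into a single national total:

# EXAMPLE: total US confirmed cases by date (sums across subregions)
(covid_data
    .loc['US']
    .groupby('date')
    .agg({'confirmed':'sum'})
)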

Next steps

There’s probably more data wrangling that we need to do. And after that, we’ll want to analyze the data with Pandas, and also with data visualization tools like matplotlib or Seaborn.

In terms of data manipulation, we’ll probably want to add the data for covid-19 deaths and recovered cases. (Remember, the data that we prepared in this tutorial only has confirmed cases.)

I probably also want to redo the index. Right now, the index is the country name, but the names are hard to remember. And some of the names are long, or contain commas. That all makes for a bad index. Instead, I’ll probably want to change the index to the three letter ISO code (i.e., ‘GBR’ instead of ‘United Kingdom’).
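As a rough sketch of how that re-indexing might eventually work, here’s one possible approach using the third-party pycountry package. To be clear, this is an assumption on my part, not necessarily what we’ll use: pycountry needs to be installed separately, and some names in this dataset (like ‘Korea, South’) won’t match its records and would need manual fixes.

# SKETCH: map country names to ISO alpha-3 codes
# (assumes the pycountry package is installed; non-standard names need manual handling)
import pycountry

def to_iso3(name):
    try:
        return pycountry.countries.lookup(name).alpha_3
    except LookupError:
        return None   # e.g., 'Korea, South' won't match automatically

print(to_iso3('United Kingdom'))   # GBR
print(to_iso3('US'))               # USA ('US' matches the ISO alpha-2 code)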

And as I said, there’s a lot more data analysis and data visualization to do after we finalize the dataset.

Sign up to learn more

Do you want to see part 2 and the other tutorials in this series?

Sign up for our email list now.

When you sign up, you’ll get our tutorials delivered directly to your inbox.
