Analyzing covid-19 with Python, part 3 [data exploration]

An image of Python code analyzing covid19 data, with an image of the sars-cov-2 virus in the background.

This tutorial is part of a series of tutorials analyzing covid-19 data.

For parts 1 and 2, see the following posts:
https://www.sharpsightlabs.com/blog/python-data-analysis-covid19-part1/
https://www.sharpsightlabs.com/blog/python-data-analysis-covid19-part2/

Covid19 analysis, part 3: initial data exploration

If you’ve been following along with this tutorial series, you’ll know that in the most recent tutorial, we defined a process for getting and cleaning the covid19 data that we’re going to use.

Now that we have a merged dataset, we’re going to do some initial data exploration.


As always, it’s best if you read through everything, step by step.

Get Data and Packages

First, I want to make sure that you have the data.

Ideally, it’s best if you go back to part 2 and run the code there. That will enable you to get the most up-to-date data.

But if you’re pressed for time, you can download a dataset with the following code.

First, you’ll need to import Pandas and datetime:

import pandas as pd
import datetime

And then you can import the data:

# IMPORT DATA (a static snapshot, semicolon-separated)
covid_data = pd.read_csv('https://learn.sharpsightlabs.com/datasets/covid19/covid_data_2020-03-22.csv'
                         ,sep = ";"
                         )

# CONVERT THE date COLUMN FROM STRING TO DATETIME
covid_data = covid_data.assign(date = pd.to_datetime(covid_data.date, format='%Y-%m-%d'))

Keep in mind, this dataset only has data up to March 22, 2020, so it will not be completely up to date.

If you want completely up-to-date data, you can go back to part 2 and run the code there.

Basic Data Inspection

Now that we have the data, we need to do some simple inspection.

Remember: whenever you start working on a data science project, your first task is to simply understand what is in the data.

And you need to do a quick “check” to make sure the data look clean.

So the simplest thing that we can do is simply print a few rows and retrieve the column names.
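Before we get to that, it can also help to quickly glance at the overall size of the dataset and the data types of the columns. Here’s a minimal, optional check; the exact row count will depend on when you pulled the data.

# CHECK SHAPE AND DATA TYPES
print(covid_data.shape)     # (number of rows, number of columns)
print(covid_data.dtypes)    # the date column should show up as datetime64[ns]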

Print rows

We can print a few rows with the head method.

# PRINT ROWS
covid_data.head()

OUT:

       country subregion       date   lat  long  confirmed  dead  recovered
0  Afghanistan       NaN 2020-01-22  33.0  65.0          0     0          0
1  Afghanistan       NaN 2020-01-23  33.0  65.0          0     0          0
2  Afghanistan       NaN 2020-01-24  33.0  65.0          0     0          0
3  Afghanistan       NaN 2020-01-25  33.0  65.0          0     0          0
4  Afghanistan       NaN 2020-01-26  33.0  65.0          0     0          0

Now when I do this, I’m just looking and trying to spot anything unusual.

Do the values look “typical”? Do they look like what I expect?

In this case, they do. The country column has the name of a country (Afghanistan), the date variable has something that looks like a date, and the lat and long variables look like latitude and longitude values. And so on.

The one unusual thing is the NaN values in the subregion column. But if you print out a larger number of rows, you’ll find that subregion is only filled in for rows that report a specific city or sub-region; for country-level rows, it’s simply left missing.

So, the data largely matches my expectations.

That’s good … if there was anything unusual, we might need to investigate further or do more data cleaning.

Print column names

Now, let’s just print out the column names.

covid_data.columns

OUT:

Index(['country', 'subregion', 'date', 'lat', 'long'
      ,'confirmed', 'dead','recovered'],
      dtype='object')

Commonly, I’ll examine the column names to make sure that the columns are well named. In this case, since we did some data cleaning in part 2, they are.

If they weren’t, you could use the Pandas rename method to rename them.
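For reference, a rename would look something like the sketch below. The new names here are purely hypothetical, since our columns are already fine as they are.

# RENAME COLUMNS (sketch only; the new names are hypothetical examples)
covid_data_renamed = covid_data.rename(columns = {'lat':'latitude'
                                                  ,'long':'longitude'
                                                  })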

Keep in mind, it’s sometimes useful to paste the names of the columns into a text file or paste them into your script (and comment them out). This is useful so you have them close at hand when you’re working with the data. If you forget the exact column names, you can look at your text file to remember them.

Examine variables

Now, we’ll examine some of the variables.

In particular, we want to look at the values of the variables to see if they look “appropriate.”

For string or “categorical” variables, we want to look at the unique values.

For numerics, we want to look at summary statistics like the mean, median, min, max. Again, we’re trying to check if things look “OK”.

Get unique values: country

First, let’s examine the country variable.

From our print of records earlier, it looks like the country variable contains country names.

Here, we want to actually check that.

In particular, when we examine the variable, I want to think about a few things:

  • Are they all actually country names?
  • Are any of them unusual?
  • Does anything need to be changed or cleaned?

Again, as I said earlier, in this first pass through the data, we’re just exploring. We want to check the data and make sure that things are acceptable.

To do that in this case, we’re going to use a couple of Pandas techniques “chained” together.

We’ll start with the dataframe, covid_data, and then call the filter method to select the country variable.

After that, we call the drop_duplicates method to drop duplicate rows and ultimately get the unique values of country.

pd.set_option('display.max_rows', 180)
(covid_data
    .filter(['country'])
    .drop_duplicates()
)

OUT:

                                country
0                           Afghanistan
61                              Albania
122                             Algeria
183                             Andorra
244                              Angola
305                 Antigua and Barbuda
366                           Argentina
...                                 ...
13664                                US
28731                            Uganda
28792                           Ukraine
28853              United Arab Emirates
28914                    United Kingdom
29341                           Uruguay
29402                        Uzbekistan
29463                         Venezuela
29524                           Vietnam
29585                            Zambia
29646                          Zimbabwe

For the sake of space, I didn’t actually show all of the values.

But you can print them out yourself and check them.

When I looked, they looked okay, for the most part. Some of the names were a little long, and some were formatted with commas in the country name. But for the most part, there was nothing terribly out of place. This variable looks clean.
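As a side note, if you wanted to pull out just the country names that contain a comma (to take a closer look at the oddly formatted ones), a quick string filter like the following sketch would do it. This isn’t something the analysis requires.

# SHOW COUNTRY NAMES THAT CONTAIN A COMMA (optional sketch)
(covid_data
    .filter(['country'])
    .drop_duplicates()
    .loc[lambda df: df['country'].str.contains(',')]
)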

Get unique values: date

Next, we’ll do something similar with the date variable.

Again, we want to check and make sure that the values are appropriate.

Based on what we know about the data (that the covid-19 outbreak began in China sometime in January) we would expect the dates to start sometime in January of 2020.

I pulled this dataset on March 22, 2020, so we expect the data to end around that date.

Let’s check and try to confirm that.

To do this, we’ll use the Pandas chaining methodology again. Just like with the country variable, we’ll chain the filter method, and then drop_duplicates.

(covid_data
  .filter(['date'])
  .drop_duplicates()
)

OUT:

         date
0  2020-01-22
1  2020-01-23
2  2020-01-24
3  2020-01-25
4  2020-01-26
5  2020-01-27
...
55 2020-03-17
56 2020-03-18
57 2020-03-19
58 2020-03-20
59 2020-03-21
60 2020-03-22

Again, here, I’ve only shown a few of the first and last rows. But if you print all of them out and look, you’ll see that the dates look clean.

That’s exactly what we’re trying to verify.
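By the way, if you’d rather confirm the date range programmatically instead of scanning the full list, you can aggregate the minimum and maximum of the date column (the same pattern we’ll use for lat and long in the next section). A quick sketch:

# CHECK THE FIRST AND LAST DATES
(covid_data
  .filter(['date'])
  .agg(['min','max'])
)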

Check variables: lat and long

Next, we’ll look at lat and long.

Again, we want to check to make sure that the values are appropriate.

I recommend that you print out a few dozen rows, just to take a look. In the interest of brevity, I won’t do that here.

But I will show you how to aggregate the data, so we can take a look at the range of values.

Here, we’ll again use the Pandas “chaining” methodology to retrieve the minimum and maximum values for both lat and long.

(covid_data
  .filter(['long','lat'])
  .agg(['min','max'])
)

OUT:

         long      lat
min -157.8584 -41.4545
max  178.0650  71.7069

If you look at these, they seem appropriate. Based on this quick check, it looks like lat and long fall within an appropriate range of values for latitude and longitude.
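For reference, valid latitudes run from -90 to 90 and valid longitudes from -180 to 180, so these values are in range. If you want to make that check explicit in code, a small sketch like this works (each line should print True if the data are clean):

# CHECK THAT lat AND long FALL IN VALID RANGES (optional sketch)
print(covid_data['lat'].between(-90, 90).all())
print(covid_data['long'].between(-180, 180).all())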

Check variables: confirmed, dead, recovered

Now, let’s take a look at the confirmed, dead, and recovered columns.

These are numeric variables, so we can use the describe method on them.

In order to do these all at once, we can again use the Pandas chaining methodology. We’ll use filter to retrieve the confirmed, dead, and recovered columns, and then we’ll use describe to calculate summary statistics.

#-----------------------------------------------
# EXAMINE VARIABLES: confirmed, dead, recovered
#-----------------------------------------------
(covid_data
 .filter(['confirmed','dead','recovered'])
 .describe()
)

OUT:

          confirmed          dead     recovered
count  29707.000000  29707.000000  29707.000000
mean     174.919615      6.134345     63.373919
std     2587.520759    118.943550   1400.678631
min        0.000000      0.000000      0.000000
25%        0.000000      0.000000      0.000000
50%        0.000000      0.000000      0.000000
75%        1.000000      0.000000      0.000000
max    67800.000000   5476.000000  59433.000000

When looking at these summary statistics, remember that the data are at the country and city level.

From our print of the data earlier, we saw that some rows had zero confirmed, dead, or recovered … meaning that there weren’t any cases for those locations. So when we look at the min values in this table and see all of the zeros, that looks appropriate.

In the row that shows the max statistic, we see a maximum of 67800 confirmed cases for a single location. That looks about right as of March 22. It’s most likely one of China’s provinces rather than a whole country, since China’s data is broken out by sub-region, but we can confirm that right now.
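To confirm it, a quick lookup with the nlargest method will show exactly which row that maximum belongs to. Here’s a minimal sketch:

# FIND THE ROW WITH THE LARGEST confirmed VALUE (sketch)
(covid_data
   .filter(['country','subregion','date','confirmed'])
   .nlargest(1, 'confirmed')
)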

Ultimately, we’ve just looked at confirmed, dead, and recovered columns and they appear to be okay.

Count missing values

Now, let’s count missing values.

We can do this by using isnull and sum on our DataFrame. This will give us a count of the null values by column.

(covid_data
 .isnull()
 .sum()
)

OUT:

country         0
subregion    9882
date            0
lat             0
long            0
confirmed       0
dead            0
recovered       0
dtype: int64

This looks pretty good, actually.

There are no missing values in any of the columns except subregion. When we printed out our data, it looked like most of the data was at country level, but when there was specific data for a city, the city name was recorded in subregion.
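As an aside, if you ever need to separate the rows reported at the country level from the rows that report a specific sub-region, you can filter on whether subregion is missing. A quick sketch:

# ROWS REPORTED AT THE COUNTRY LEVEL (subregion is missing)
covid_data_country_level = covid_data[covid_data['subregion'].isnull()]

# ROWS THAT REPORT A SPECIFIC SUB-REGION
covid_data_subregion_level = covid_data[covid_data['subregion'].notnull()]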

A quick note about the data analysis process

Now here, I want to briefly mention something about the process of data analysis.

When you work with a new dataset as a data scientist, you’ll often work hand-in-hand with a subject matter expert (SME). An SME will be able to help you and answer questions about the data that you have. They will also help frame an analysis by helping generate hypotheses and questions.

In this case, I don’t have an SME available, so it’s a little harder to answer questions about the dataset. I’m not terribly concerned about that, because I’m really just “playing” with this dataset for fun and to show you how to do an analysis.

Just remember, when you start exploring your data, it is okay to have some unanswered questions about the data. In a real-world business setting, you’ll work with your partners to answer those questions and fix any data problems that you might find.

The good news though is that this dataset is really very clean and I don’t have too many concerns right now.

Spot check records

Finally, let’s get the records for a few countries.

Once again, to do this, we’ll use the Pandas chaining methodology.

We’ll use filter to retrieve a few specific columns.

Then we’re using groupby to group the data on country and date. This will enable us to aggregate the data and summarise by country/date.

After groupby, we execute the aggregation with the agg method. Here, we’re using sum to sum up the confirmed, dead, and recovered.

Then we’re using query to retrieve data for a few countries: the United States, Italy, France, and China. (I chose these because, as of March 23, they’re the countries being discussed most in the media.)

We’re also using query to subset down to a specific date: March 21, 2020.

#-------------------------------
# SPOT CHECK IMPORTANT COUNTRIES
#-------------------------------
pd.set_option('display.max_rows', 300)
(covid_data
   .filter(['country','date','confirmed','dead','recovered'])
   .groupby(['country','date'])
   .agg('sum')
   .query('date == datetime.date(2020, 3, 21)')
   .query('country in ["US","Italy","France","China"]')
)

OUT:

                    confirmed  dead  recovered
country date                                  
China   2020-03-21      81305  3259      71857
France  2020-03-21      14431   562         12
Italy   2020-03-21      53578  4825       6072
US      2020-03-21      25489   307          0

Ultimately, our code has given us a quick summary of the confirmed, dead, and recovered for China, France, Italy, and the US, for the date of March 21, 2020.

We can compare these numbers with the Johns Hopkins Covid-19 Dashboard.

An image that shows the Johns Hopkins covid-19 dashboard on the date of March 23, 2020

At this point, I’ll note that our numbers are pretty close to the numbers in the dashboard, but not exactly the same. That’s because I pulled this dataset on March 22, and on March 22, the data were only fully updated up to the date of March 21.

The image above from the Johns Hopkins dashboard is from the morning of March 23. So there’s a bit of a time difference between the data that I’m working with currently and the data in the dashboard.

Another note about the process of data exploration

This is a good time to make another comment about the process of data exploration.

When you get a new dataset, you’ll often need to validate the data in some way. You want to check the numbers in your dataset against another source.

When you do this, it’s pretty common to find discrepancies. In fact, very common.

Sometimes, the discrepancies are explainable. For example, in our dataset, this is just a difference in timing. We could re-pull the data and check again. Our numbers would probably be closer to the dashboard. (I’m not doing that, because if I did, I’d have to go back and change several things in this tutorial.)

Other times, the discrepancies are not explainable, and you end up with “two versions of the truth.” Many times in my career, at Apple and at large banks and consulting firms, we ended up with two different sets of numbers.

Get used to this. It’s common, and it’s one of the issues that you just have to work through as a data scientist.

If the discrepancy is small, sometimes you just push forward. In our case above, the discrepancies are relatively small and explainable, so I’m not terribly concerned. Furthermore, this is just a tutorial. I’m playing with the data. It’s not mission critical.

Other times though, the discrepancies can be large or unacceptable in some other way. In that case, you’ll just have to go down the rabbit hole and try to check your numbers again, or otherwise try to reconcile them.

It’s a tradeoff between accuracy and productivity, and it changes from project to project.

The data look pretty good

Having said all of that, after exploring a little bit, I think that the data look good, so we’re going to move forward.

Keep in mind though that data analysis is iterative.

As we move forward, we might find out new things about the data.

This is particularly true once we start visualizing the data.

Sometimes, when you visualize your data, you start finding anomalies or unexplained items.

In these cases, sometimes that means that you need to go back and clean your data a little more.

In other cases it means that there’s something that you need to “drill into” with more analysis.

Ultimately, I want you to understand that the process of data analysis is not a perfect straight line. The process of data cleaning and data analysis in the real world is almost always iterative, and sometimes leads to obstacles, dead-ends, or revisions of your earlier work.

With that in mind, we’ll find out more about our data when we start to visualize it.

Next steps

In the next installment of this tutorial series, we’ll probably start to explore the data with data visualization.

In some sense, that will be an extension of this tutorial. In this tutorial, we did some data exploration strictly with Pandas.

But in the next tutorial, we’ll start to use data visualization techniques to visualize and further explore the data. When we do this, we’ll probably need to aggregate or otherwise subset our data, so we’ll be using our data visualization techniques hand-in-hand with Pandas.

All that said, you need to remember: to really master data analysis and data science, you need a full toolkit. You need to be able to do data wrangling and data visualization … often in combination with one another.

Sign up to learn more

Do you want to see part 4 and the other tutorials in this series?

Sign up for our email list now.

When you sign up, you’ll get our tutorials delivered directly to your inbox.

Joshua Ebner

Joshua Ebner is the founder, CEO, and Chief Data Scientist of Sharp Sight. Prior to founding the company, Josh worked as a Data Scientist at Apple. He has a degree in Physics from Cornell University. For more daily data science advice, follow Josh on LinkedIn.

13 thoughts on “Analyzing covid-19 with Python, part 3 [data exploration]”

  1. This tutorial may not have been as exciting as the last two, but I recognize the importance of analyzing the data now. I have overlooked this step many times after being satisfied when I was able to wrangle the data and merge dataframes in earlier projects. It looked pretty and I thought everything was okay, but many times NAs were introduced and data were in the wrong columns.

    If I had not overlooked this step in the process I would have saved so much time nipping it in the bud rather than at the end of the process cycle. I look forward to the next tutorials Josh!

  2. Nice series of tutorials. I look forward to the one about visualization. Question: the last example produces a table of confirmed, dead, and recovered for four countries. I’d like to add a step in the chain that adds, say, “pct_survived” as the quotient of “recovered” / “confirmed”. I’ve found ways to do something like this, but they all seem to be working at the top level, where the name of the dataframe is accessible, rather than in a chain. Do you know a way to do this? Thanks

    • I think I’ve figured this out:

      (covid_data
      .filter(['country', 'date', 'confirmed', 'dead', 'recovered'])
      .groupby(['country', 'date'])
      .agg('sum')
      .query('date == datetime.date(2020, 3, 23)')
      .query('country in ["US", "Italy", "France", "China"]')
      .assign(pct_surviving = lambda x: x['recovered'] / x['confirmed'])
      )

      • The thing I was looking for was the `assign` method with the `lambda` function, as shown in my previous note, but I realized that the math is wrong in that note. Here’s a better version in that respect:
        (covid_data
        .filter(['country', 'date', 'confirmed', 'dead', 'recovered'])
        .groupby(['country', 'date'])
        .agg('sum')
        .query('date == datetime.date(2020, 3, 23)')
        .query('country in ["US", "Italy", "France", "China"]')
        .assign(pct_surviving = lambda x: (1.0 - (x['dead'] / x['confirmed'])) * 100.0)
        )

    • You should be able to add a new variable with the assign() method.

      But be careful … calculating the percent survived this way will not be quite accurate. While the pandemic is still in progress, a certain percentage of the “confirmed” cases are still active, and so have not yet officially died or recovered from a statistical point of view.

      While the pandemic is in progress, calculating an actual death rate will be more difficult, and will require different data.

      • Thanks. Point well-taken. To be honest, I was just trying to figure out how to add columns to a dataframe using values from existing columns, à la R’s dplyr::mutate

        • Yeah … I suspect that many people understand the caveats, but just wanted to point it out.

          To reiterate and answer your question:

          The Pandas analog to dplyr::mutate is the Pandas assign() method.

