How to do Simple EDA for Machine Learning

In this tutorial, I’ll show you how to do some simple exploratory data analysis (EDA) for a machine learning project.

Specifically, we’ll look at the Titanic dataset, which is commonly used in machine learning tutorials and has previously been used as a Kaggle dataset.


This tutorial will really only scratch the surface. There’s a lot of analysis that we could do, but in the interest of brevity, I’ll show you a few things.

Table of Contents:

  • Project Setup (prior to EDA)
  • Basic Data Inspection
  • High-Level Visualizations

Project Setup (prior to EDA)

First, we need to set a few things up before we do our EDA.

Specifically, we need to import some packages and get our data.

Import Packages

First, we need to import some packages.

import numpy as np
import pandas as pd
import seaborn as sns
import seaborn.objects as so

We’re importing Seaborn, which provides most of the tools we’ll use to visualize and analyze our data. We’ll also use the relatively new Seaborn Objects sub-package (which you might need to install).

We’ve also imported Pandas, which gives us tools to wrangle and subset our data, and NumPy, which we’ll need later when we select the numeric columns of our data.
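
If you’re not sure whether your installation supports Seaborn Objects (the interface was added in Seaborn 0.12), a quick version check is a one-liner:

# CHECK YOUR SEABORN VERSION (seaborn.objects requires 0.12+)
print(sns.__version__)
# If this prints a version below 0.12, upgrade with: pip install --upgrade seaborn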

Load Dataset

We also need to load the Titanic dataset, which we’ll be analyzing.

titanic = sns.load_dataset('titanic')

We’ll eventually need to do a little data manipulation on this dataset, but before we do that, we need to get a sense of what’s in it. In turn, that will help us decide how to modify the data going forward.
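
Before we dig in, it’s worth a quick check of the overall size of the data. For reference, Seaborn’s version of the Titanic dataset has 891 rows and 15 columns:

# CHECK THE SIZE OF THE DATAFRAME
print(titanic.shape)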

Basic Data Inspection

Now that we have a dataframe, we’ll do some basic data inspection.

Specifically, we will:

  • Print some records
  • List the data types
  • Count the missing records by column

Print Records

Here, we’ll print out a few of the records using the Pandas head method.

# PRINT RECORDS
titanic.head()

OUT:

   survived  pclass     sex   age  ...  deck  embark_town  alive  alone
0         0       3    male  22.0  ...   NaN  Southampton     no  False
1         1       1  female  38.0  ...     C    Cherbourg    yes  False
2         1       3  female  26.0  ...   NaN  Southampton    yes   True
3         1       1  female  35.0  ...     C  Southampton    yes  False
4         0       3    male  35.0  ...   NaN  Southampton     no   True

[5 rows x 15 columns]

List Data Types

Next, we’ll list the data types in the dataframe.

To do this, we’ll call the dtypes property of the dataframe.

titanic.dtypes

OUT:

survived          int64
pclass            int64
sex              object
age             float64
sibsp             int64
parch             int64
fare            float64
embarked         object
class          category
who              object
adult_male         bool
deck           category
embark_town      object
alive            object
alone              bool
dtype: object

Here, you can see that we have a mix of integers, floats, booleans, categories, and “objects” (which are commonly strings).

We’ll need to recode a few of these, but it will become clearer exactly which ones (and how) as we move forward.
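
To get a quick sense of which of these might need attention, one option is to count the unique values in the non-numeric columns. Here’s a minimal sketch:

# COUNT UNIQUE VALUES IN THE NON-NUMERIC COLUMNS
(titanic
 .select_dtypes(include = ['object', 'category', 'bool'])
 .nunique()
 )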

Get Count of Missing Values by Column

Now, before we move on to some visualizations, we’ll get a count of missing values.

# GET COUNT OF MISSING
(titanic
 .isnull()
 .sum()
 )

OUT:

survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64

Most of these columns have zero or very few missing values. But two of the columns (age and deck) have a substantial number: age is missing for 177 of the 891 rows, and deck is missing for 688 of them.

We may want to avoid these variables in a model, particularly deck.

But we may still take a look at them and see what they contain.
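
To put those counts in perspective, it can help to look at the percent of missing values per column. A minimal sketch, in the same chained style as above:

# GET PERCENT OF MISSING VALUES BY COLUMN
(titanic
 .isnull()
 .mean()
 .mul(100)
 .round(1)
 )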

High-Level Visualizations

And now, we’ll do some high-level visualizations.

Specifically, we’ll:

  • Create histograms of the numeric variables
  • Visualize the target variable
  • Visualize the target variable, broken out by other variables

We’ll also do a little data manipulation along the way.

Create Histograms of Numeric Variables

Next, we’re going to plot histograms of the numeric variables.

To do this, we’re going to use the Seaborn FacetGrid technique, which creates a small multiple chart (AKA, trellis chart).

But we also need to use select_dtypes to retrieve the numeric variables, and then the Pandas melt function to restructure that data from wide format to long format. Said differently, Pandas melt will put the dataset into the shape that FacetGrid needs in order to create the small multiple chart.
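
If you want to see exactly what melt produces before it goes into FacetGrid, here’s a quick sketch (melted_numeric is just an illustrative name):

# INSPECT THE MELTED DATA
# pd.melt stacks the numeric columns into two columns: 'variable' and 'value'
melted_numeric = pd.melt(titanic.select_dtypes(include = np.number))
print(melted_numeric.head())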

Notice that we’re calling sns.histplot to actually create the histograms.

# CREATE HISTOGRAMS OF NUMERIC VARIABLES
hist_grid = sns.FacetGrid(data = pd.melt(titanic.select_dtypes(include = np.number))
                          ,col = 'variable'
                          ,col_wrap = 3
                          ,sharex = False
                          )
hist_grid.map(sns.histplot, 'value', bins=10)

OUT:

An image of a small multiple plot with histograms of numeric variables from the Titanic dataset.

I’m not going to analyze these plots in detail, but a few things stand out.

First, the age variable isn’t exactly normal, but it does have a roughly bell-shaped distribution. This is closer to what we’d typically want for a numeric variable.

But several of the other “numeric” variables are definitely not normal.

In particular, both survived and pclass look like categorical variables. Their values aren’t spread across a continuous range; instead, they’re concentrated at a few distinct values.

Given that, we’ll quickly recode those variables to behave more like categoricals.

Recode Variables to Categoricals

Here, we’re going to recode survived and pclass to categorical variables.

To do this, we’ll need to use multiple Pandas tools in combination.

Most importantly, we need to use the Pandas pd.Categorical function to create categorical variables. Notice that we’re specifying the category values with the categories parameter, and using the ordered parameter to specify that these categories have a specific order.

We’re using the Pandas astype method to convert the numeric values to strings before we pass them to pd.Categorical.

And we’re using the Pandas assign method to add the newly constructed variables to our dataframe.

titanic_new = (titanic
                  .assign(pclass = pd.Categorical(titanic.pclass.astype(str)
                                                    ,categories = ['1', '2', '3']
                                                    ,ordered = True))
                  .assign(survived = pd.Categorical(titanic.survived.astype(str)
                                                    ,categories = ['0','1']
                                                    ,ordered = True))
                  )

Notice that the whole expression is enclosed in parentheses. We’re calling multiple functions and methods, and using multiple assign operations in series. This is an example of Pandas method chaining, which you really need to know if you want to do complex data manipulations.

I’ll leave you to inspect the new data with a few data inspection methods.
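
For example, here’s a quick sanity check that the recode worked as intended:

# VERIFY THE RECODED VARIABLES
print(titanic_new.dtypes[['survived', 'pclass']])
print(titanic_new.pclass.cat.categories)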

Visualize the Target Variable (Survived)

Now that we’ve done a little data cleaning, we’ll visualize the survived variable.

Here, we’re going to create a countplot with Seaborn of the survived variable.

sns.countplot(data = titanic_new
              ,x = 'survived'
              )

OUT:

An image of a countplot (i.e., bar chart) of the 'survived' variable from the Titanic dataset.

You’ll notice that more people died (0) than survived (1). In fact, only about 38% survived.
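
If you want the exact proportion behind that chart, one quick way to compute it (using the original dataframe, where survived is still a 0/1 integer):

# PROPORTION SURVIVED
titanic.survived.value_counts(normalize = True)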

We’re going to analyze the survived variable a little further with some additional visualizations.

Visualize Survived by Sex

Here, we’re going to create a new bar chart of the survived variable, but this time we’re going to break out the data by sex.

To do this, we’re going to use the relatively new Seaborn Objects interface to build a count-based bar chart that plots sex on the x-axis, breaks out the data with additional bars for survived, and “dodges” those bars to the side.

In other words, we’re creating a dodged bar chart.

(so.Plot(data = titanic_new
        ,x = 'sex'
        ,color = 'survived'
        )
  .add(so.Bar(), so.Count(), so.Dodge())
 )

OUT:

An image of a "dodged" bar chart, that visualizes the count records by sex and survived.

Sorry guys … if you were a male on the Titanic, you were pretty likely to die.

Ladies. A bit more “lucky.” You would have been more likely to live than die.

Notably, it looks like the sex variable is fairly predictive of survival. This would be useful in a classification model.

Visualize Survived by Pclass

Next, let’s visualize survived by pclass.

This variable encodes the “passenger class,” and has values 1, 2, and 3. You can think of this like first class vs. coach on a modern airline flight.
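
A quick way to confirm those values and see how the passengers are distributed across the classes:

# COUNT PASSENGERS BY CLASS
titanic_new.pclass.value_counts()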

Here, we’re going to visualize this as a bar chart, but I’m actually going to calculate the percent of people who survived.

I’m also going to facet this chart out by sex, in order to create a small multiple chart.

To do this, I’m using the Seaborn Objects interface with so.Bar to create a bar chart and so.Agg to compute the mean. Because we’re also using the astype method to treat the survived variable as a binary 0/1 integer, computing the mean of this variable gives us the proportion who survived (which we can read as a percent).

Note that the faceting itself is done with the Seaborn Objects facet method.

Ok, here’s the code:

(so.Plot(data = titanic_new.assign(survived = titanic_new.survived.astype(int))
        ,x = 'pclass'
        ,y = 'survived'
        )
  .add(so.Bar(), so.Agg('mean'))
  .facet(col = 'sex')
 )

And here’s the output:

An image of a small multiple bar chart of pclass by survived, faceted by sex.

Fascinating.

If you were female, you were more likely to survive than a male in any passenger class. But females in 1st and 2nd class were more likely to survive than females in 3rd class.

If you were male, you were still pretty likely to die no matter what, but much more likely to die if you were in 2nd or 3rd class. The men in 1st class actually had almost a 40% chance of survival, versus under 20% in 2nd and 3rd.
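
If you want the exact survival rates behind the chart, you can compute them with a groupby. A minimal sketch, again using the original dataframe (where survived is still numeric):

# SURVIVAL RATE BY SEX AND PASSENGER CLASS
(titanic
 .groupby(['sex', 'pclass'])
 ['survived']
 .mean()
 .round(3)
 )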

Clearly, passenger class, like sex, had a strong relationship with survival.

Again, this is useful to us when building a machine learning model.

There’s probably more that we could do

I’m going to stop there for now.

We’ve identified a few variables that appear to be related to the target (both sex and pclass).

We also identified a few variables that we needed to recode.

Having said that, there’s probably more that we could do.

Tell me what you want to know

What else do you want to see for machine learning EDA?

Do you have any suggestions?

What did I miss?

I want to hear from you …

Give me your feedback by leaving a comment in the comments section below.

Joshua Ebner

Joshua Ebner is the founder, CEO, and Chief Data Scientist of Sharp Sight. Prior to founding the company, Josh worked as a Data Scientist at Apple. He has a degree in Physics from Cornell University. For more daily data science advice, follow Josh on LinkedIn.

14 thoughts on “How to do Simple EDA for Machine Learning”

  1. How do you get the bracket to indent in line with the following bracket in your chains? Mine is usually not in line with it, so it doesn’t look as clean.

    • I typically write my code in Spyder, and it automatically indents parentheses and brackets when you put them on a new line.

      Having said that, I often clean up my code indentation a little bit simply because it enhances readability, and readability is extremely important to me in my code.

      So, use an editor that does some of the indentation automatically, but then clean it up manually if you need to.

  2. Thank you for the tutorial. You are very informative and quite understandable.
    I know it is limited. That said, I would like to see more EDA on this dataset. How about going further: aside from the cleaning, also include statistical analysis and hypothesis testing?
    I would appreciate a tutorial that takes this dataset through preprocessing and modeling, all the way to the end.
    Thank you for your time and efforts.
    It is so wonderful that you are sharing your knowledge with those who are trying to enter the field. I would love to learn more.
    Thanks again.

    • This is roughly the difficulty level you should expect for many datasets.

      You’ll notice here that to analyze this data, we need to use not only visualization techniques, but also some data wrangling in conjunction with visualization (to get the data in the right form for a particular visualization technique). To be honest, this is actually some intermediate level material (so maybe I shouldn’t have called it “simple”), but it’s what you should expect.

      It can get more complicated, but it depends on the dataset.

      I think that, more specifically for this data, this is about as difficult as it gets … there’s simply more that we could do.

  3. Is this typically a long process, or is the data cleaning and structuring phase longer and tougher? I understand Kaggle has clean datasets, so they’re good for visualisation and EDA practice.

    • Honestly, it really depends.

      It’s true that many Kaggle datasets are already pretty clean. I’d also say that this Titanic dataset from Seaborn is also pretty clean (I think that Kaggle also has a version of this dataset, but it might be slightly different).

      Having said that, how much data cleaning you need to do really depends on the project and the source of the data.

      If you scrape some data from the internet, it will probably require a lot of cleaning, which is a pain.

      But a lot of data in corporate environments is already mostly clean. Both at Apple and Bank of America, we had data teams that built and managed our databases. We could pull the data from a database with SQL, and it was often very clean data. Still, we often needed to do some data wrangling even on that clean data. For example, we often needed to recode variables (like recoding categoricals to dummy variables, combining categories, creating calculated numerics, etc).

      In sum, how much data cleaning you do really depends on the data source … but even with “clean” data, it’s common to need to do some recoding or data wrangling.

      (You need to master data wrangling!)
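
      A minimal sketch of that kind of recoding, using pd.get_dummies on one of the Titanic columns from the tutorial above (embarked_dummies is just an illustrative name):

      # RECODE A CATEGORICAL TO DUMMY VARIABLES
      import pandas as pd
      import seaborn as sns

      titanic = sns.load_dataset('titanic')
      embarked_dummies = pd.get_dummies(titanic['embarked'], prefix = 'embarked')
      print(embarked_dummies.head())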

      • Wow super helpful. Now I know what to expect. I feel as if I have a decent grasp on some of the techniques in this tutorial.

        Given I will be working for a bank, which most likely has databases of relatively clean data, I will focus my approach on data wrangling and visualisation from roughly the same point as in the tutorial.

        Yes, I think Kaggle datasets may be a decent option to play around with in this case. I will also need to perfect my SQL skills, which I hear isn’t too difficult at all.

        Can I ask, what motivates you to give so much value out for free? A lot of this information is gold for us amateurs, and I understand you could hide this value behind a paywall.

  4. Hi!
    Thanks for the tutorial.
    I tried to reproduce it in RStudio, using the reticulate package’s repl_python() function to create the Python environment.
    But the plots did not appear for me. Could you help me with this?
    Regards
    János

    • I don’t use repl_python, so I can’t help. Additionally, saying that the plots “didn’t appear” doesn’t give me much information. Did the other code work? Did you get some error messages? I need more info.

      You might consider writing the R code yourself, as a small project.

      This should be somewhat easy to do with dplyr and ggplot.

  5. 1. Do we have to use sns.FacetGrid for the operation you carried out? Or is it possible to do this using so.Facet? I feel like the objects interface is more intuitive, so if possible I’d like to plot the distributions of the numerics that way!

    2. How do we know when to use “.add(so.Bar(), so.Agg(‘mean’))” versus “so.Count()” or “so.Hist()”, as they’re quite similar? I have a decent grasp of statistics, but this is where I start to get a little confused. I probably have some gaps in my knowledge.

