In this tutorial, I’ll show you how to do some simple exploratory data analysis (EDA) for a machine learning project.
We’ll look at the Titanic dataset, which is commonly used in machine learning tutorials and has previously been used as a Kaggle dataset.
This tutorial will really only scratch the surface. There’s a lot of analysis that we could do, but in the interest of brevity, I’ll show you a few things.
Project Setup (prior to EDA)
First, we need to set a few things up before we do our EDA.
Specifically, we need to import some packages and get our data.
Import Packages
First, we need to import some packages.
import numpy as np
import pandas as pd
import seaborn as sns
import seaborn.objects as so
We’re importing Seaborn, which enables us to create many of the visualizations we’ll use to analyze our data. We’ll also use the relatively new Seaborn Objects sub-package (which you might need to install).
We’ve also imported Pandas, which will give us some tools to wrangle or subset our data, and NumPy, which we’ll need later to select the numeric columns.
Load Dataset
We also need to load the titanic dataset, which we’ll be analyzing.
titanic = sns.load_dataset('titanic')
We’ll eventually need to do a little data manipulation on this dataset, but before we do that, we’ll need to get a sense of what’s in here. In turn, that will help us decide how to modify the data going forward.
Basic Data Inspection
Now that we have a dataframe, we’ll do some basic data inspection.
Specifically, we will:
- Print some records
- List the data types
- Count the missing records by column
Print records
Here, we’ll print out a few of the records using the Pandas head method.
# PRINT RECORDS
titanic.head()
OUT:
   survived  pclass     sex   age  ...  deck  embark_town alive  alone
0         0       3    male  22.0  ...   NaN  Southampton    no  False
1         1       1  female  38.0  ...     C    Cherbourg   yes  False
2         1       3  female  26.0  ...   NaN  Southampton   yes   True
3         1       1  female  35.0  ...     C  Southampton   yes  False
4         0       3    male  35.0  ...   NaN  Southampton    no   True

[5 rows x 15 columns]
List Data Types
Next, we’ll list the data types in the dataframe.
To do this, we’ll call the dtypes property of the dataframe.
titanic.dtypes
OUT:
survived          int64
pclass            int64
sex              object
age             float64
sibsp             int64
parch             int64
fare            float64
embarked         object
class          category
who              object
adult_male         bool
deck           category
embark_town      object
alive            object
alone              bool
dtype: object
Here, you can see that we have a mix of integers, floats, categories, and “objects” (which are commonly strings).
We will need to modify a few of these, but it will become more obvious how we need to change them as we move forward.
Get Count of Missing Values by Column
Now, before we move on to some visualizations, we’ll get a count of missing values.
# GET COUNT OF MISSING
(titanic
 .isnull()
 .sum()
)
OUT:
survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64
Most of these columns have zero or very few missing values. But two of the columns (age and deck) have a substantial number.
We may want to avoid these variables in a model, particularly deck.
But we may still take a look at them and see what they contain.
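For example, here’s one quick way to peek at what’s in these two columns (a minimal sketch using standard Pandas methods; it’s not strictly necessary for the rest of the tutorial):
# INSPECT COLUMNS WITH MISSING VALUES
titanic['age'].describe()                      # summary statistics for age
titanic['deck'].value_counts(dropna = False)   # deck levels, including NaN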
High-Level Visualizations
And now, we’ll do some high-level visualizations.
Specifically, we’ll create:
- Histograms of the numeric variables
- A visualization of the target variable
- Visualizations of the target variable, broken out by other variables
We’ll also do a little data manipulation along the way.
Create Histograms of Numeric Variables
Next, we’re going to plot histograms of the numeric variables.
To do this, we’re going to use the Seaborn FacetGrid technique, which creates a small multiple chart (AKA, trellis chart).
But we also need to use select_dtypes to retrieve the numeric variables, and the Pandas melt function to restructure them. The melt function will reshape the data from wide format to long format. Said differently, Pandas melt will put the dataset into a format that allows us to create the small multiple chart.
Notice that we’re calling sns.histplot to actually create the histograms.
# CREATE HISTOGRAMS OF NUMERIC VARIABLES
hist_grid = sns.FacetGrid(data = pd.melt(titanic.select_dtypes(include = np.number))
                          ,col = 'variable'
                          ,col_wrap = 3
                          ,sharex = False
                          )
hist_grid.map(sns.histplot, 'value', bins = 10)
OUT:
[histograms of the numeric variables, as a small multiple chart]
I’m not going to analyze these plots in detail, but a few things stand out.
First, the age variable seems not-quite-exactly normal, but it does roughly have a bell shape. This is closer to what we typically want for a numeric variable.
But several of the other “numeric” variables are definitely not normal.
In particular, both survived and pclass seem to be categorical variables. They aren’t distributed across many values, but instead have distinct peaks.
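If you want to confirm this numerically, a quick check (not shown in the original analysis, but standard Pandas) is to count the distinct values:
# COUNT DISTINCT VALUES
titanic['survived'].value_counts()   # only two values: 0 and 1
titanic['pclass'].value_counts()     # only three values: 1, 2, and 3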
So, we’ll quickly recode those variables to behave more like categoricals.
Recode Variables to Categoricals
Here, we’re going to recode survived and pclass to categorical variables.
To do this, we’ll need to use multiple Pandas tools in combination.
Most importantly, we need to use the Pandas pd.Categorical function to create categorical variables. Notice that we’re specifying the category values, and using the ordered parameter to specify that these categories have a specific order.
We’re using the Pandas astype method to convert the numeric values to strings, since that’s what we want our category values to be.
And we’re using the Pandas assign method to add the newly constructed variables to our dataframe.
titanic_new = (titanic
               .assign(pclass = pd.Categorical(titanic.pclass.astype(str)
                                               ,categories = ['1', '2', '3']
                                               ,ordered = True))
               .assign(survived = pd.Categorical(titanic.survived.astype(str)
                                                 ,categories = ['0', '1']
                                                 ,ordered = True))
               )
Notice that the whole expression is enclosed in parentheses. We’re calling multiple functions and methods, and using multiple assign operations in series. This is ultimately an example of Pandas method chaining, which you really need to know if you want to do complex data manipulations.
I’ll leave you to inspect the new data with a few data inspection methods.
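If you want a starting point, something like this should work (a minimal sketch; the exact methods you use are up to you):
# INSPECT THE RECODED DATA
titanic_new.dtypes   # pclass and survived should now be category dtypes
titanic_new.head()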
Visualize the Target Variable (Survived)
Now that we’ve done a little data cleaning, we’ll visualize the survived variable.
Here, we’re going to create a Seaborn countplot of the survived variable.
sns.countplot(data = titanic_new
              ,x = 'survived'
              )
OUT:
[countplot showing counts of survived = 0 vs survived = 1]
You’ll notice that more people died (0) than survived (1). In fact, only about 40% survived.
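If you want the exact proportion, you can compute it directly (a quick sketch using the value_counts method):
# COMPUTE THE SURVIVAL RATE
# normalize = True returns proportions instead of raw counts
titanic_new['survived'].value_counts(normalize = True)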
We’re going to analyze the survived variable a little further with some additional visualizations.
Visualize Survived by Sex
Here, we’re going to create a new bar chart of the survived variable, but this time we’re going to break out the data by sex
.
To do this, we’re going to use the relatively new Seaborn Objects interface (a new Seaborn visualization package) and create a countplot that plots sex
on the x-axis, and creates a breakout of that data by creating additional bars for survived
, and “dodging” them to the side.
In other words, we’re creating a dodged bar chart.
(so.Plot(data = titanic_new
         ,x = 'sex'
         ,color = 'survived'
         )
 .add(so.Bar(), so.Count(), so.Dodge())
)
OUT:
[dodged bar chart: counts by sex, with bars broken out by survived]
Sorry guys … if you were a male on the Titanic, you were pretty likely to die.
Ladies, you were a bit more “lucky”: you would have been more likely to live than die.
Importantly, it looks like the sex variable is fairly predictive of survival. This would be important in a classification model.
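If you want to quantify this relationship, one option (a sketch using the Pandas crosstab function, not shown above) is to cross-tabulate survival rates by sex:
# CROSS-TABULATE SURVIVAL RATE BY SEX
# normalize = 'index' gives proportions within each sex
pd.crosstab(titanic_new['sex'], titanic_new['survived'], normalize = 'index')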
Visualize Survived by Pclass
Next, let’s visualize survived by pclass.
This variable encodes the “passenger class,” and has values 1, 2, and 3. You can think of this like first class vs coach for modern plane flights.
Here, we’re going to visualize this as a bar chart, but I’m actually going to calculate the percent of people who survived.
I’m also going to facet this chart out by sex, in order to create a small multiple chart.
To do this, I’m using the Seaborn Objects interface with so.Bar to create a bar chart and so.Agg to compute the mean. Because we’re using the astype method to treat the survived variable as a binary 0/1 integer, computing the mean of this variable will give us the percent who survived.
Note that I’m also faceting this plot with the Seaborn Objects facet method to create a small multiple chart.
Ok, here’s the code:
(so.Plot(data = titanic_new.assign(survived = titanic_new.survived.astype(int))
         ,x = 'pclass'
         ,y = 'survived'
         )
 .add(so.Bar(), so.Agg('mean'))
 .facet(col = 'sex')
)
And here’s the output:
[faceted bar chart: survival rate by pclass, with one panel per sex]
Fascinating.
If you were female, you were more likely to survive than a male in any passenger class. But females in 1st and 2nd class were more likely to survive than females in 3rd class.
If you were male, you were still pretty likely to die no matter what, but much more likely to die if you were in 2nd or 3rd class. The men in 1st class actually had almost a 40% chance of survival, vs under 20% in 2nd and 3rd.
Clearly, passenger class, like sex, had a strong relationship with survival.
Again, this is useful to us when building a machine learning model.
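If you’d like to check the percentages behind this chart numerically, a quick sketch (using the same astype trick as the plotting code) might look like this:
# CHECK SURVIVAL RATES BY SEX AND PCLASS
# convert survived back to 0/1 integers so the mean is a proportion
(titanic_new
 .assign(survived = titanic_new.survived.astype(int))
 .groupby(['sex', 'pclass'])['survived']
 .mean()
)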
There’s probably more that we could do
I’m going to stop there for now.
We’ve identified a few variables that appear to be related to the target (both sex and pclass).
We also identified a few variables that we needed to recode.
Having said that, there’s probably more that we could do.
Tell me what you want to know
What else do you want to see for machine learning EDA?
Do you have any suggestions?
What did I miss?
I want to hear from you …
Give me your feedback by leaving a comment in the comments section below.
How do you get the bracket to indent in line with the following bracket in your chains? Mine is usually not in line with it, which doesn’t look as clean.
I typically write my code in Spyder, and it automatically indents parenthesis and brackets when you put them on a new line.
Having said that, I often clean up my code indentation a little bit simply because it enhances readability, and readability is extremely important to me in my code.
So, use an editor that does some of the indentation automatically, but then clean it up manually if you need to.
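For example, a chain indented so that the brackets line up might look like this (this is the same missing-values code from earlier in the tutorial):
# EXAMPLE: A CLEANLY INDENTED METHOD CHAIN
(titanic
 .isnull()
 .sum()
)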
Thank you for the tutorial. You are very informative and quite understandable.
I know it is limited. That said, I would like to see more EDA on this dataset example. How about going further: aside from the cleaning, also include statistical analysis and hypothesis testing?
I would also appreciate a tutorial that takes this dataset all the way through preprocessing and modeling to the end.
Thank you for your time and efforts.
It is so wonderful that you are sharing your knowledge with those who are trying to enter the field. I would love to learn more.
Thanks again.
Ok … thanks for the suggestions and feedback, Charrie.
Does EDA get far more complex than this?
This is roughly the difficulty level you should expect for many datasets.
You’ll notice here that to analyze this data, we need to use not only visualization techniques, but also some data wrangling in conjunction with visualization (to get the data in the right form for a particular visualization technique). To be honest, this is actually some intermediate level material (so maybe I shouldn’t have called it “simple”), but it’s what you should expect.
It can get more complicated, but it depends on the dataset.
I think more specific to this data, this should be about as difficult as it gets … there’s simply more that we could do.
Is this typically a long process, or is the data cleaning and structuring phase longer and tougher? I understand Kaggle has clean datasets, so they’re good for visualisation and EDA practice.
Honestly, it really depends.
It’s true that many Kaggle datasets are already pretty clean. I’d also say that this Titanic dataset from Seaborn is also pretty clean (I think that Kaggle also has a version of this dataset, but it might be slightly different).
Having said that, how much data cleaning you need to do really depends on the project and the source of the data.
If you scrape some data from the internet, it will probably require a lot of cleaning, which is a pain.
But a lot of data in corporate environments is already mostly clean. Both at Apple and Bank of America, we had data teams that built and managed our databases. We could pull the data from a database with SQL, and it was often very clean data. Still, we often needed to do some data wrangling even on that clean data. For example, we often needed to recode variables (like recoding categoricals to dummy variables, combining categories, creating calculated numerics, etc).
In sum, how much data cleaning you do really depends on the data source … but even with “clean” data, it’s common to need to do some recoding or data wrangling.
(You need to master data wrangling!)
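For example, recoding a categorical variable like sex to dummy variables is quick with the Pandas get_dummies function (a minimal sketch using this tutorial’s dataset):
# RECODE A CATEGORICAL TO DUMMY VARIABLES
pd.get_dummies(titanic_new['sex'], prefix = 'sex')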
Wow super helpful. Now I know what to expect. I feel as if I have a decent grasp on some of the techniques in this tutorial.
Given I will be working for a bank, which most likely has databases of relatively clean data, I will focus my approach on data wrangling and visualisation from roughly the same point as in the tutorial.
Yes, I think Kaggle datasets may be a decent option to play around with in this case. I will also need to perfect my SQL skills, which I hear isn’t too difficult at all.
Can I ask, what motivates you to give so much value out for free? A lot of this information is gold for us amateurs, and I understand you could hide this value behind a paywall.
Excellent tutorial as usual. Thanks!
Many thanks.
Hi!
Thanks for the tutorial.
I tried to reproduce it in RStudio using the reticulate package repl_python() function to create the python environment.
But the plots did not appear for me. Could you help me with this?
Regards
János
I don’t use repl_python, so I can’t help. Additionally, saying that the plots “didn’t appear” doesn’t give me much information. Did the other code work? Did you get some error messages? I need more info.
You might consider writing the R code yourself, as a small project.
This should be somewhat easy to do with dplyr and ggplot.
1. Do we have to use sns.FacetGrid for the operation you carried out, or is it possible to do this using so.Facet? I feel like the objects interface is more intuitive, so I would like to plot the distributions of the numeric variables that way if possible!
2. How do we know when to use “.add(so.Bar(), so.Agg(‘mean’))” vs “so.Count()” or “so.Hist()”, as they’re quite similar? I have a decent grasp of statistics, but this is where I start to get a little confused. I probably have some gaps in my knowledge.