Machine learning is a very powerful skill.
It’s also a valuable one.
According to Glassdoor, the average salary for a machine learning engineer is about $140,000 in 2023.
Mastering machine learning could increase your earning power a lot.
The question is, how? What’s the best way to get started with machine learning? What are the prerequisites and essential foundations?
I’ll admit, machine learning is hard to learn.
Part of the problem is that there’s so much terrible advice about how to learn machine learning. In particular, there’s a lot of bad advice about machine learning prerequisites.
So to set the record straight, I want to give you a clear learning path to get started with machine learning in Python.
You’re probably aware that scikit-learn is the primary machine learning toolkit for Python.
But in this post, I’m going to tell you a little secret that many people will fail to tell you: you need to master more foundational data science and analytics skills first (before you get into scikit-learn).
That’s the real foundation for Python machine learning.
So in this post, I’m going to tell you the Python toolkits and skills that you need to know before you get started with machine learning.
At a high level, these are the Python skills you will need to learn before you jump into Python ML:
- Base Python
- Pandas
- Numpy
- Seaborn
- Data Analysis in Python
Let’s talk about these one at a time.
Base Python Essentials
First of all, you need to have a strong foundation in “base Python.”
You don’t need to know everything in the Python standard library, but you do need to have a firm grasp of most essential Python programming skills.
You should learn:
- variable basics
- lists, tuples, and dictionaries
- comparison operations
- boolean operators and boolean expressions
- for loops
- functions (i.e., how to define functions, etc.)
Essentially, you should have a solid understanding of all of the basics of Python programming syntax.
Some of these things you might not need immediately, but almost all of them will emerge at some point as you start doing Python machine learning.
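To make that list a little more concrete, here's a minimal sketch that touches each item once. All of the names and values (price, scores, average(), and so on) are made up for illustration.

```python
# Variables
price = 19.99
quantity = 3

# Lists, tuples, and dictionaries
scores = [88, 92, 79]              # list: ordered and mutable
point = (2, 5)                     # tuple: ordered and immutable
ages = {"ann": 31, "bob": 27}      # dictionary: maps keys to values

# Comparison operations and boolean expressions
is_bulk_order = quantity >= 3
is_cheap_bulk = is_bulk_order and price < 20


# A function that uses a for loop
def average(values):
    """Return the arithmetic mean of a non-empty list of numbers."""
    total = 0
    for value in values:
        total += value
    return total / len(values)


mean_score = average(scores)
print(mean_score)
```

If you can read this without looking anything up, you're probably in good shape for the next step.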
Next, I recommend that you learn Pandas.
You've probably heard of Pandas before, so you probably know that it's a Python toolkit for doing "data wrangling" and data cleaning.
Specifically, Pandas is a toolkit for doing data wrangling with a special data structure called a dataframe. Dataframes store data (both strings and numerics) in a row-and-column structure. You can think of dataframes like Excel spreadsheets for Python.
As a data wrangling toolkit, Pandas gives you tools to:
- import data from external sources
- create dataframes
- format dataframes (i.e., creating or changing indexes)
- filter data
- aggregate data
- reshape data
- and a lot more ...
Essentially, if your data has rows and columns, and if the data has a mix of numbers and categories, you probably need Pandas.
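Here's a small sketch of a few of those operations. The sales dataframe and its column names are invented for this example, so treat it as an illustration rather than a recipe.

```python
import pandas as pd

# Create a dataframe from a dictionary of columns (made-up data)
sales = pd.DataFrame({
    "region": ["east", "east", "west", "west"],
    "quarter": ["q1", "q2", "q1", "q2"],
    "revenue": [100, 120, 90, 150],
})

# Filter: keep only the rows where revenue is above 95
big_sales = sales[sales["revenue"] > 95]

# Aggregate: total revenue for each region
revenue_by_region = sales.groupby("region")["revenue"].sum()

# Reshape: one row per region, one column per quarter
wide = sales.pivot(index="region", columns="quarter", values="revenue")

print(big_sales)
print(revenue_by_region)
print(wide)
```

In a real project you would usually start by importing the data from a file (for example with pd.read_csv()) instead of typing it in by hand.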
Pandas is critical for many parts of the data science workflow generally. You need it for data cleaning and analytics.
But once you start doing machine learning, you'll also need to use Pandas to reshape, filter, clean, aggregate, and analyze your data. (The data workflow for machine learning is somewhat similar to the data workflow in more general data science.)
Keep in mind that data analysis is very important for machine learning too. You'll need to use Pandas a lot for data analysis, but it involves some other tools as well. I'll discuss that in a moment.
For data manipulation, you'll also need to learn Numpy (in addition to Pandas).
Numpy is a Python toolkit for working with numeric data. Specifically, in Numpy, there's a data structure called the Numpy array.
A Numpy array is similar to a dataframe, in that a 2-dimensional array has a row-and-column structure.
But, unlike dataframes, which can mix strings and numbers in different columns, a Numpy array holds a single data type, and in practice that type is almost always numeric.
So a Numpy array is a row-and-column structure for working with numeric data.
To summarize, Numpy arrays are different from dataframes in a few ways:
- Numpy arrays store a single data type (typically numeric data)
- Numpy arrays can have more than two dimensions
If you've ever taken a linear algebra class in college, Numpy arrays are a lot like vectors, matrices, and tensors (depending on the number of dimensions).
In addition to providing tools for creating these arrays, Numpy also has tools for reshaping, aggregating, and cleaning them.
For example, Numpy has functions for:
- summing the values of a Numpy array
- calculating the mean of a Numpy array (or median, or standard deviation, etc)
- reshaping Numpy arrays
- computing values like logarithms, exponentials, etc
Numpy is effectively a toolkit for wrangling and calculating with numeric data.
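As a quick illustration, here's a sketch of the operations listed above on a tiny array. The values are arbitrary.

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

# Summarizing
total = values.sum()
mean = values.mean()
std = values.std()                 # population standard deviation by default

# Reshaping: turn 6 values into 2 rows and 3 columns
matrix = values.reshape(2, 3)
col_sums = matrix.sum(axis=0)      # sum down each column

# Element-wise math
logs = np.log(values)
roundtrip = np.exp(logs)           # exp undoes log, up to rounding error

# Arrays can have more than two dimensions
cube = np.zeros((2, 3, 4))
print(matrix.shape, cube.ndim)
```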
Scikit-Learn Uses a Lot of Numpy
It's important to point out here that scikit-learn uses a lot of Numpy.
Many scikit-learn functions use Numpy arrays as inputs (e.g., fit_transform(), etc.).
Many scikit-learn functions also produce Numpy arrays as outputs.
Therefore, if you're doing machine learning in Python with scikit-learn, you need to understand what Numpy arrays are. You also need to be able to work with arrays in order to properly use the tools of scikit-learn. Numpy is extremely important for Python ML.
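To see the "Numpy in, Numpy out" pattern, here's a minimal sketch using StandardScaler, which is just one convenient scikit-learn tool that has a fit_transform() method. The input values are made up.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Input: a 2-dimensional Numpy array (3 rows, 2 feature columns)
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)   # output is also a Numpy array

print(type(X_scaled))
print(X_scaled.mean(axis=0))         # each column is centered near 0
```

If you're comfortable with array shapes and the axis argument, output like this is a lot easier to reason about.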
In addition to learning data wrangling with Pandas and Numpy, you need to learn data visualization.
When you're doing data visualization in Python, I recommend using Seaborn.
Let me explain why.
Data Visualization in Python Can Be a Pain in the A**
I'm going to be honest.
Data visualization in Python is traditionally a pain in the a**.
The reason is that, for a long time, data visualization in Python was done almost entirely with matplotlib.
Matplotlib is a very powerful visualization package, but it's also difficult to use.
Matplotlib syntax is cumbersome. The syntax is hard to remember, and often difficult to understand. Ultimately, the matplotlib plotting system is very confusing for most people.
Moreover, if you need to do anything beyond simple bar charts, line charts, and histograms, matplotlib is very difficult to use. Making "simple" modifications to a matplotlib chart often takes arcane bits of syntax.
If you try to use matplotlib, you'll spend hours on Google and Stack Overflow, trying to figure out how to modify your plots.
Why I recommend Seaborn
If you're doing data visualization in Python, I strongly recommend that you use Seaborn.
Seaborn is a newer plotting library for Python.
It allows you to make a lot more than simple charts. With Seaborn, you can easily create density charts, hex plots, pairplots, small multiple charts, and more.
The syntax for all of these tools is relatively simple compared to matplotlib. And, most of the plots are fairly easy to modify.
Moreover, Seaborn is built on top of matplotlib itself. So if you really need to make some serious modifications, you can still rely on matplotlib techniques.
It's sort of the best of both worlds: easy, high level plotting, with the power of matplotlib underneath.
2023 Update: Seaborn Objects
Today, in 2023, I strongly recommend using Seaborn Objects (the newer interface that ships with Seaborn as of version 0.12), rather than the original Seaborn functions.
Seaborn Objects is a powerful toolkit with a simple, consistent, and flexible syntax.
If you've ever used ggplot2 in R, it's basically like a ggplot for Python.
To put it bluntly: I love Seaborn Objects.
I use it for everything that I can when I visualize my data in Python and for Python ML.
Why Data Visualization is Important
If you haven't done much machine learning, you might be asking yourself "why do I need to know data visualization?"
The truth is, data visualization is important for almost every part of the machine learning workflow.
(In fact, data visualization is important for almost every part of the data science workflow.)
At least, you'll need data visualization for:
- data cleaning
- data exploration & analysis
- model diagnostics
Let's quickly discuss these.
It might seem unintuitive, but data visualization is often an important part of the data cleaning process.
For example, you often need to use visualization techniques to identify potentially anomalous values, identify categories that might need to be re-coded, etc.
Essentially, you often need to use data visualization to explore your data as you clean it.
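As a sketch of what that can look like, the example below uses a tiny made-up dataframe with two deliberate problems: inconsistent city labels and one impossible age. The 1.5 x IQR cutoff is a common rule of thumb that I'm assuming here, not the only way to flag anomalies.

```python
import pandas as pd
import seaborn as sns

# Made-up data with two deliberate problems
df = pd.DataFrame({
    "city": ["NYC", "nyc", "Boston", "NYC", "Boston", "New York"],
    "age": [34, 29, 41, 38, 250, 33],   # 250 is a data-entry error
})

# Counting categories reveals labels that probably need to be recoded
city_counts = df["city"].value_counts()
print(city_counts)

# A boxplot makes the anomalous age stand out visually
ax = sns.boxplot(data=df, x="age")

# The same check in numbers, using the 1.5 x IQR rule of thumb
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["age"] < q1 - 1.5 * iqr) | (df["age"] > q3 + 1.5 * iqr)]
print(outliers)
```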
Data Exploration and Analysis
You also frequently need to explore and analyze your data after you clean it up, but before you build your machine learning model, like I showed in this machine learning EDA blog post.
For example, some machine learning techniques assume (or simply work better when) the numeric features (i.e., the input variables) are roughly normally distributed.
One of the best ways to check whether your numeric features are roughly normally distributed is to plot your numeric variables as histograms.
There are also instances where you want to remove variables that are highly correlated. One way to identify correlated variables is with a pair plot.
I could go on and give more examples, but suffice it to say, data visualization is important for exploring and analyzing your data before you build a machine learning model.
After you train your model, you typically need to use data visualizations to evaluate your model, or compare several models against each other.
For example, you'll use data visualization for things like ROC curves, which are a diagnostic tool for machine learning classification systems.
You can also use data visualization to plot learning curves, which diagnose problems with high bias or high variance.
Beyond learning curves and ROC curves, there are dozens of ways that you can use data visualization in model evaluation, for things like hyperparameter tuning, model comparison, and more.
The point is, data visualization is essential for model evaluation.
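As one illustration, here's a sketch of an ROC curve for a simple classifier. The dataset is synthetic (generated with make_classification), and logistic regression is just a stand-in model that I picked for the example.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic binary classification data
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Train a model and get predicted probabilities for the positive class
model = LogisticRegression().fit(X_train, y_train)
probs = model.predict_proba(X_test)[:, 1]

# Compute the points on the ROC curve and the area under it
fpr, tpr, thresholds = roc_curve(y_test, probs)
auc = roc_auc_score(y_test, probs)

# Plot the curve against the diagonal "random guessing" line
fig, ax = plt.subplots()
ax.plot(fpr, tpr, label=f"model (AUC = {auc:.2f})")
ax.plot([0, 1], [0, 1], linestyle="--", label="chance")
ax.set_xlabel("False positive rate")
ax.set_ylabel("True positive rate")
ax.legend()

# plt.show()  # uncomment to display the chart in a script
```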
Data Analysis in Python
Finally, before you learn scikit-learn, you need to learn how to do data analysis in Python.
As I already noted, you often need to explore and analyze your data prior to building a machine learning model.
Additionally, you often need to use data analysis techniques to evaluate your models.
But what exactly is data analysis? There's not really a Python data analysis package, is there?
How to Learn Data Analysis in Python
To do data analysis in Python, you really need to apply other Python skills I've already talked about in a specific way.
Here's a simple formula that explains what data analysis is:
data analysis = data manipulation + data visualization
Of course, it's a little more complicated than that, but this captures the essence of it.
When we do data analysis, we often need to wrangle our data by subsetting, aggregating, and summarizing it.
But then after we subset, aggregate, and summarize, we typically plot.
Data analysis is largely a process that uses data wrangling and data visualization in an applied way to find insights in data.
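Here's a small sketch of that wrangle-then-plot pattern. The orders data is made up, and the question ("which category earned the most in 2023?") is just an example.

```python
import pandas as pd
import seaborn as sns

# Made-up data
orders = pd.DataFrame({
    "category": ["books", "books", "toys", "toys", "games", "games"],
    "year": [2022, 2023, 2022, 2023, 2022, 2023],
    "revenue": [200, 260, 150, 120, 300, 390],
})

# Data manipulation: subset to one year, aggregate, and sort
recent = orders[orders["year"] == 2023]
summary = (
    recent.groupby("category", as_index=False)["revenue"]
    .sum()
    .sort_values("revenue", ascending=False)
)

# Data visualization: plot the summarized result
ax = sns.barplot(data=summary, x="category", y="revenue")

print(summary)
```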
So to do data analysis, you actually need to learn Pandas, Seaborn, and Numpy.
But you also need to learn how to combine these skills in an applied way to analyze data.
So before you study machine learning in Python, you need to learn how to combine Pandas, Seaborn, and Numpy to do data analysis.
Don't worry about math (in the beginning)
I bet you're thinking: "He hasn't said anything about math. Doesn't machine learning require a lot of math?"
There's a good chance that someone told you that you need to master calculus, linear algebra, optimization theory, and several other branches of mathematics before you can study machine learning.
You'll be happy to know that math is very overrated for most machine learning beginners.
Having said that, before the academically trained among you start sending me hate mail, let me be clear: math is important for machine learning in an academic setting. If you want to get a PhD in machine learning and you want to do machine learning research, then sorry. You need math.
For Applied Machine Learning, Math is Overrated
But, in a business or industrial setting (i.e., a practical setting), you'll need a lot less math.
This is particularly true for machine learning beginners.
I've written about this several times in the past, but math isn't really necessary for most applied machine learning.
The main reason is that most of the complicated math is done for you. The calculus, the linear algebra operations, the mathematical optimization ... scikit-learn does almost all of that for you, "under the hood."
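To show what "under the hood" means in practice, here's a minimal sketch that fits a linear regression. The data is generated from y = 3x + 2, a relationship I made up so that the right answer is known in advance.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data that follows y = 3x + 2 exactly
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 3.0 * X[:, 0] + 2.0

# One line fits the model; the least-squares math happens inside fit()
model = LinearRegression().fit(X, y)

slope = model.coef_[0]
intercept = model.intercept_
prediction = model.predict(np.array([[10.0]]))

print(slope, intercept, prediction)
```

Notice that you never write a derivative or a matrix operation yourself. You do need to get the data into the right shape, which is where the Numpy and Pandas skills come in.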
To be fair, if you want to be a world class machine learning expert at Google, Facebook, or Tesla, you'll eventually need to know the math.
If you want to do highly customized machine learning models, you might need to know the math.
But if you're just getting started with scikit-learn, you can probably skip the math for right now.
Do yourself a favor, and focus on data wrangling, data visualization, and data analysis.
Translating that into Python toolkits, that means that you should learn Pandas, Numpy, Seaborn, as well as how to combine them to do data analysis.
A Quick Recap: Master Data Science Foundations
Ultimately, if you want to master machine learning in Python (and get a highly paid machine learning job), then you need to master foundational Python data science tools first.
That means that you need to know:
- Base Python
- Pandas
- Numpy
- Seaborn
- Data Analysis in Python
Leave your questions in the comments section
Do you still have questions about the prerequisites to do machine learning in Python?
Do you still have questions about Pandas, Numpy, Seaborn, and data analysis in Python?
If so, leave your questions in the comments section below.
Enroll in our Course to Master Data Science Essentials in Python
This article should help you understand the skills you need to learn before you study machine learning in Python: base Python, Pandas, Numpy, Seaborn, data analysis.
If you're ready to learn these skills, and you want to master them as fast as possible ...
... then enroll in our course, Python Data Mastery.
Python Data Mastery will teach you all of the essential skills for base Python, Numpy, Pandas, and Seaborn.
This course explains the syntax for these toolkits in simple terms and shows clear examples of everything.
But it will also show you a unique training system that will help you memorize all of the syntax you learn. This course is designed to help you become "fluent" in Python data science.
If you're serious about mastering data science in Python, and you want to learn the essential foundations for machine learning in Python, then you should enroll.
We're reopening Python Data Mastery for enrollment soon, so to be notified as soon as the doors open click here and sign up for the wait list: