Machine learning is a powerful and valuable skill.
According to Glassdoor, the average salary for a machine learning engineer was about $127,000 as of early 2021.
If you already know some Python, then upskilling to add machine learning to your skill set could increase your earning power a lot.
The question is, how? What’s the best way to get started with machine learning? What are the prerequisites and essential foundations?
I’ll admit, machine learning is hard to learn.
But part of the problem is that there’s so much terrible advice about how to learn machine learning. In particular, there’s a lot of bad advice about machine learning prerequisites.
So to set the record straight, I want to give you a clear learning path to get started with machine learning in Python.
You’re probably aware that scikit-learn is the primary machine learning toolkit for Python.
But in this article, I’ll tell you the Python skills that you need to learn before you study scikit-learn.
To summarize, here are the Python toolkits and skills that you need to know before you get started with machine learning with scikit-learn:
- Base Python
- Pandas
- Numpy
- Data Visualization with Seaborn
- Data Analysis in Python
Let’s talk about each of these, one at a time.
Base Python Essentials
First of all, you really need to have a strong foundation in “base Python.”
You don’t need to know everything in the Python standard library, but you do need to have a firm grasp of most of the essentials.
I recommend that you learn:
- variable basics
- lists, tuples, and dictionaries
- comparison operations
- boolean operators and boolean expressions
- for loops
- functions (i.e., how to define them, call them, etc.)
That is to say, you should have a solid understanding of all of the essentials of Python syntax.
Some of these things you might not need immediately, but almost all of them will emerge at some point as you start using scikit-learn.
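To make that list concrete, here is a minimal sketch that touches each item once. The variable names and numbers are made up for illustration.

```python
# Variables and the core data structures
model_name = "ridge"
scores = [0.71, 0.84, 0.79]                  # list (mutable)
shape = (3, 2)                               # tuple (immutable)
params = {"alpha": 1.0, "max_iter": 100}     # dictionary

# Comparison operations combined with a boolean operator
is_good = scores[1] > 0.8 and params["alpha"] <= 1.0

# A function that uses a for loop
def count_above(values, threshold):
    """Return how many values are strictly greater than threshold."""
    count = 0
    for value in values:
        if value > threshold:
            count += 1
    return count

n_above = count_above(scores, 0.75)
print(is_good, n_above)  # True 2
```

If every line here reads naturally to you, your base Python is probably in good shape for the next steps.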
Next, I recommend that you learn Pandas.
You've probably heard of Pandas before, so you likely know that it's a Python toolkit for doing "data wrangling" and data cleaning.
Specifically, it's a toolkit for doing data wrangling with a special data structure called a dataframe. Dataframes are row-and-column data structures that store data of different data types. You can think of them like Excel spreadsheets for Python.
Pandas gives you a toolkit for:
- importing data from external sources
- creating dataframes
- formatting dataframes (i.e., creating or changing indexes)
- filtering data
- aggregating data
- reshaping data
- and a lot more ...
Essentially, if you have data that has rows and columns, and if the data has a mix of numbers and categories, you almost certainly need Pandas.
Prior to using scikit-learn to build a model, you'll need to use Pandas to reshape, filter, clean, aggregate, and analyze your data.
The data analysis part is very important as well. Pandas is a big part of that, but it involves some other tools as well. I'll discuss that in a moment.
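Here is a small sketch of what a few of those Pandas tasks can look like. The dataframe and its column names are hypothetical and exist only for this example.

```python
import pandas as pd

# Hypothetical data: a mix of categories and numbers
sales = pd.DataFrame({
    "region": ["east", "west", "east", "west", "north"],
    "units": [10, 3, 7, 12, 5],
    "price": [2.5, 4.0, 2.5, 4.0, 3.0],
})

# Create a new column from existing ones
sales["revenue"] = sales["units"] * sales["price"]

# Filter: keep only the rows with at least 7 units
big_orders = sales[sales["units"] >= 7]

# Aggregate: total revenue per region
revenue_by_region = sales.groupby("region")["revenue"].sum()

# Format: use the region column as the index
indexed = sales.set_index("region")

print(big_orders)
print(revenue_by_region)
```

Each step is one short line, which is a big part of why Pandas is worth learning before you touch a model.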
For data manipulation, you'll also need to learn Numpy in addition to Pandas.
Numpy is a Python toolkit for working with numeric data. Specifically, in Numpy, there's a data structure called the Numpy array.
A Numpy array is a row-and-column data structure that contains all numbers.
So Numpy arrays are somewhat similar to Pandas dataframes, in the sense that they store data in a row-and-column format.
However, Numpy arrays are different from dataframes in a few ways:
- Numpy arrays only store numeric data
- Numpy arrays can have more than two dimensions
So if you've ever taken a linear algebra class in college, Numpy arrays are a lot like vectors, matrices, and tensors (depending on the number of dimensions).
In addition to providing tools for creating these arrays, Numpy also has tools for reshaping, aggregating, and cleaning them.
For example, Numpy has functions for:
- summing the values of a Numpy array
- calculating the mean of a Numpy array (or median, or standard deviation, etc)
- reshaping Numpy arrays
- computing values like logarithms, exponentials, etc
Numpy is effectively a toolkit for wrangling and calculating with numeric data.
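As a quick illustration, the sketch below runs the operations listed above on a small made-up array.

```python
import numpy as np

# A 2-dimensional array: 2 rows, 3 columns
matrix = np.array([[1.0, 2.0, 3.0],
                   [4.0, 5.0, 6.0]])

total = matrix.sum()               # sum of all values: 21.0
col_means = matrix.mean(axis=0)    # mean of each column: [2.5 3.5 4.5]
spread = matrix.std()              # standard deviation of all values
reshaped = matrix.reshape(3, 2)    # same values, now 3 rows and 2 columns
logs = np.log(matrix)              # natural logarithm of every element

# Arrays can have more than two dimensions
cube = np.zeros((2, 3, 4))

print(total, col_means, reshaped.shape, cube.ndim)
```

Notice that functions like `np.log()` operate on every element at once, so you rarely need to write loops over array values.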
Scikit-Learn Uses a Lot of Numpy
It's important to point out here that scikit-learn uses Numpy quite a bit.
Many scikit-learn functions take Numpy arrays as inputs (e.g., fit_transform(), etc.).
Many scikit-learn functions also produce Numpy arrays as outputs.
If you're doing machine learning in Python with scikit-learn, you really need to understand what Numpy arrays are. And you need to be able to work with them in order to properly use the tools of scikit-learn.
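To show what that looks like in practice, here is a minimal sketch with made-up data. I'm using `StandardScaler` and `LinearRegression` purely as examples of scikit-learn tools that accept and return Numpy arrays.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Made-up data that follows y = 2 * x + 1 exactly
X = np.array([[1.0], [2.0], [3.0], [4.0]])   # 2D array: (n_samples, n_features)
y = np.array([3.0, 5.0, 7.0, 9.0])           # 1D array: (n_samples,)

# Numpy array in, Numpy array out
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
print(type(X_scaled), X_scaled.shape)

# fit() and predict() also work with Numpy arrays
model = LinearRegression()
model.fit(X, y)
predictions = model.predict(np.array([[5.0]]))
print(predictions)
```

A common beginner stumbling block is the shape of `X`: scikit-learn expects a 2-dimensional array of features even when there is only one feature, which is why understanding array shapes matters.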
In addition to learning data wrangling with Pandas and Numpy, you need to learn data visualization.
When you're doing data visualization in Python, I recommend using Seaborn.
Let me explain why.
Data Visualization in Python Can Be a Pain in the A**
I'll be honest.
Data visualization in Python is typically a pain in the a**.
The reason is that traditionally, data visualization in Python has been done with matplotlib.
Matplotlib is very powerful, but difficult to use. Matplotlib syntax is cumbersome. The syntax is hard to remember, and often difficult to understand. Ultimately, the matplotlib plotting system is just confusing for most people.
Moreover, if you need to do anything beyond simple bar charts, line charts, and histograms, matplotlib is very difficult to use. Making modifications that should be simple often takes arcane bits of syntax.
If you try to use matplotlib, you'll spend hours on Google and Stack Overflow, trying to figure out how to modify your plots.
Why I recommend Seaborn
If you're doing data visualization in Python, I strongly recommend Seaborn.
Seaborn is a newer plotting library for Python that makes it easy to create simple charts.
Beyond that, it also offers a wide variety of other visualization tools. With Seaborn, you can easily create density charts, hex plots, pairplots, small multiple charts, and more.
The syntax for all of these tools is relatively simple compared to matplotlib. And, most of the plots are fairly easy to modify.
Moreover, Seaborn is built on top of matplotlib itself. So if you really need to make some serious modifications, you can still rely on matplotlib techniques.
It's sort of the best of both worlds: easy, high level plotting, with the power of matplotlib underneath.
Why Data Visualization is Important
If you haven't done much machine learning, you might be asking yourself "why do I need to know data visualization?"
The truth is, data visualization is important for almost every part of the machine learning workflow. (In fact, data visualization is important for almost every part of the data science workflow.)
At a minimum, you'll need data visualization for:
- data cleaning
- data exploration & analysis
- model diagnostics
You might not expect data visualization to matter for data cleaning, but it's often an important part of the process.
For example, you often need to use visualization techniques to identify potentially anomalous values, identify categories that might need to be re-coded, etc.
Essentially, you often need to use data visualization to explore your data while you clean it.
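As a rough sketch of that workflow, the example below uses a tiny made-up dataset with one implausible value and some inconsistent category labels. The cutoff of 120 for age is my own assumption for this example.

```python
import pandas as pd
import seaborn as sns

# Hypothetical raw data: one implausible age, inconsistent city labels
raw = pd.DataFrame({
    "age": [34, 29, 41, 38, 420, 31],
    "city": ["NYC", "nyc", "Boston", "NYC", "boston", "Boston"],
})

# A box plot makes the extreme age stand out immediately
ax = sns.boxplot(x=raw["age"])

# Counting labels reveals categories that need to be re-coded
print(raw["city"].value_counts())

# Clean: drop the implausible row (assumed cutoff), standardize the labels
clean = raw[raw["age"] < 120].copy()
clean["city"] = clean["city"].str.lower()
print(clean)
```

The plot and the counts tell you what to fix, and Pandas does the fixing.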
Data Exploration and Analysis
You also frequently need to explore and analyze your data after you clean it up, but before you build your machine learning model.
For example, some types of machine learning algorithms require the numeric features (i.e., the input variables) to be normally distributed.
One of the best ways to check this is to simply plot your numeric variables as histograms.
There are also instances where you want to remove variables that are highly correlated. One way to identify correlated variables is with a pair plot.
I could go on and give more examples, but suffice it to say, data visualization is important for exploring and analyzing your data before you build your model.
After you train your model, you typically need to use data visualizations to evaluate your model, or compare several models against each other.
For example, you can use data visualization to plot learning curves, which help diagnose problems with high bias or high variance.
Beyond learning curves, there are dozens of ways that you can use data visualization in model evaluation, for things like hyperparameter tuning, model comparison, and more.
The point is, data visualization is essential for model evaluation.
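Here is a minimal sketch of a learning curve plot, using scikit-learn's `learning_curve()` function on randomly generated data. The model, the data, and the plot styling are all choices I made for the example.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve

# Randomly generated data: a noisy straight line
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = 1.5 * X[:, 0] + rng.normal(0, 0.5, size=200)

# Score the model at 5 training set sizes, using 5-fold cross-validation
train_sizes, train_scores, val_scores = learning_curve(
    LinearRegression(), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=5,
)

# Average the scores across the folds
train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)

# Plot both curves against training set size
fig, ax = plt.subplots()
ax.plot(train_sizes, train_mean, marker="o", label="training score")
ax.plot(train_sizes, val_mean, marker="o", label="validation score")
ax.set_xlabel("Training set size")
ax.set_ylabel("Score (R squared)")
ax.legend()
```

Roughly speaking, a large persistent gap between the two curves suggests high variance, while two curves that converge at a low score suggest high bias.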
Data Analysis in Python
Finally, before you learn scikit-learn, you need to learn how to do data analysis in Python.
As I already noted, you often need to explore and analyze your data prior to building a machine learning model.
Additionally, you often need to use data analysis techniques to evaluate your models.
But what exactly is data analysis? There's not really a Python data analysis package, is there?
How to Learn Data Analysis in Python
To do data analysis in Python, you really need to apply other Python skills I've already talked about in a specific way.
Here's a simple formula that explains what data analysis is:
data analysis = data manipulation + data visualization
Of course, it's a little more complicated than that, but this captures the essence of it.
When we do data analysis, we often need to wrangle our data by subsetting, aggregating, and summarizing it.
But then after we subset, aggregate, and summarize, we typically plot.
Data analysis is largely a process that uses data wrangling and data visualization in an applied way to find insights in data.
So to do data analysis, you actually need to learn Pandas, Seaborn, and Numpy.
But you also need to learn how to combine these skills in an applied way to analyze data.
So before you study machine learning in Python, you need to learn how to combine Pandas, Seaborn, and Numpy to do data analysis.
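The formula above can be sketched in a few lines: first manipulate the data with Pandas, then plot the result with Seaborn. The dataset here is made up for illustration.

```python
import pandas as pd
import seaborn as sns

# Hypothetical order data
orders = pd.DataFrame({
    "segment": ["retail", "retail", "wholesale", "wholesale", "online", "online"],
    "amount": [20.0, 30.0, 200.0, 100.0, 45.0, 55.0],
})

# Step 1: data manipulation (aggregate, then sort)
summary = (
    orders.groupby("segment", as_index=False)["amount"]
    .mean()
    .sort_values("amount", ascending=False)
)

# Step 2: data visualization (plot the summarized data)
ax = sns.barplot(data=summary, x="segment", y="amount")
ax.set_ylabel("Mean order amount")

print(summary)
```

Most real analyses repeat this loop many times: wrangle, plot, notice something, then wrangle again.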
Don't worry about math (in the beginning)
I bet you're thinking: "He hasn't said anything about math. Doesn't machine learning require a lot of math?"
There's a good chance that someone told you that you need to master calculus, linear algebra, optimization theory, and several other branches of mathematics before you can study machine learning.
You'll probably be happy to learn that math is way overrated for most machine learning beginners.
Now before the academically trained among you start sending me hate mail, let me be clear: math is important for machine learning in an academic setting. If you want to get a PhD in machine learning and you want to do machine learning research, sorry. You need math.
For Applied Machine Learning, Math is Overrated
But, in a business or industrial setting (i.e., a practical setting), you'll need a lot less math.
This is particularly true for machine learning beginners.
I've written about this several times in the past, but math isn't really necessary for most applied machine learning.
The main reason is that most of the complicated math is done for you. The calculus, the linear algebra operations, the mathematical optimization ... scikit-learn does almost all of that for you, "under the hood."
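To give you a feel for this, the sketch below fits a linear regression two ways on randomly generated data: once with scikit-learn, and once by setting up and solving the least-squares problem by hand with Numpy. The by-hand version is my own illustration of the underlying math, not a description of scikit-learn's internal code.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Randomly generated data: y = 3*x1 - 2*x2 + 4, plus a little noise
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 4.0 + rng.normal(0, 0.1, size=100)

# The scikit-learn way: one line, no visible math
model = LinearRegression().fit(X, y)

# The by-hand way: add a column of ones for the intercept,
# then solve the least-squares problem with linear algebra
X_design = np.column_stack([np.ones(len(X)), X])
beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)

print(model.intercept_, model.coef_)
print(beta)
```

Both approaches arrive at essentially the same intercept and coefficients, but with scikit-learn you never have to think about design matrices or solvers.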
To be fair, if you want to be a world class machine learning expert at Google, Facebook, or Tesla, you'll eventually need to know the math.
If you want to do highly customized machine learning models, you might need to know the math.
But if you're just getting started with scikit-learn, you can probably skip the math for now.
Do yourself a favor, and focus on data wrangling, data visualization, and data analysis.
Translating that into Python toolkits, that means that you should learn Pandas, Numpy, Seaborn, as well as how to combine them to do data analysis.
Leave your questions in the comments section
Do you still have questions about the prerequisites to do machine learning in Python?
Do you still have questions about Pandas, Numpy, Seaborn, and data analysis in Python?
If so, leave your questions in the comments section below.
Enroll in our Course to Master Data Science Essentials in Python
This article should help you understand the skills you need to learn before you study machine learning in Python:
- base Python
- Pandas
- Numpy
- data visualization with Seaborn
- data analysis
If you're ready to learn these skills, and you want to master them as fast as possible ...
... then enroll in our course, Python Data Mastery.
Python Data Mastery will teach you all of the essential skills for base Python, Numpy, Pandas, and Seaborn.
This course explains the syntax for these toolkits in simple terms and shows clear examples of everything.
But it will also show you a unique training system that will help you memorize all of the syntax you learn. This course is designed to help you become "fluent" in Python data science.
If you're serious about mastering data science in Python, and you want to learn the essential foundations for machine learning in Python, then you should enroll.
We're reopening Python Data Mastery for enrollment soon, so to be notified as soon as the doors open click here and sign up for the wait list: