Recently, a Sharp Sight blog reader emailed me and asked for advice about data science prerequisites.
He was nervous about math.
Someone had told him that in order to study data science, he needed to learn a long list of math topics first:
- Multivariable calculus
- Linear algebra
- Differential equations
Although there are a few items on the list that I hope you’ve learned by the time you’re out of college, you might be surprised about which ones are truly necessary to get started with data science.
As a beginner, you don’t need that much math for data science
The truth is, practical data science doesn’t require very much math at all. It requires some (which we’ll get to in a moment) but a great deal of practical data science only requires skill in using the right tools. Data science does not necessarily require you to understand the mathematical details of those tools.
That being said, it’s important to understand that there’s a difference between the theory that underpins data science, and data science as it is practiced. This makes all the difference.
There’s a difference between theory and practice
When talking about how much math you need for data science, it’s important to distinguish between data science “theory” and data science “practice.” When I say theory, I’m referring to data science as it’s studied in an academic environment for research purposes.
This theoretical data science is often very different from the practical data science performed in business or industry. They differ because the priorities and the deliverables differ. Academics mostly produce papers and novel research, whereas data scientists in business or industry produce reports and analyses (commonly in the form of PowerPoint presentations), models, and software systems. Because the focus and the deliverables are different, the required tools are different too.
This is not to disparage academics. Far from it. People researching data science in universities are needed to push the field forward. But you need to understand that there are differences between theoretical data science and data science as it’s practiced in business or industry.
There’s a difference between junior data scientists and senior data scientists
There’s also a second distinction we need to make that’s relevant to how much math is needed for data science. There is a difference between junior data scientists and senior data scientists.
Junior data scientists often don’t need the same depth of knowledge as senior data scientists.
I hate to break this to you, but when you get hired as a junior data scientist, you probably won’t be working on the coolest, sexiest projects first. On the contrary, you’ll probably have to do “grunt work” for your first 6-18 months.
At best, many new data scientists will be tasked with very simple projects. Such projects may be simple analyses and simple reports.
At worst, you’ll be pulling data and cleaning data for the more senior members of the team.
But either way, the fact is that as a junior data scientist, it’s pretty likely that you’ll be doing grunt work, and not working on advanced projects. It’s not that bad, but you need to set your expectations and also prepare accordingly.
Foundational data science doesn’t require much math
As a junior data scientist working in business or industry, you will primarily need to work with what I call the “foundational” skills of data science.
By “foundational” skills, I mean the fundamental skills you will need as a practitioner. Here at Sharp Sight, we sometimes call these the “core” data science skills, or “the foundations.”
What are they?
- Data manipulation
- Data visualization
- Data analysis (AKA, exploratory data analysis)
Why are these the core skills? Because in a practical setting, almost everything else relies on these. These are the “core” practical skills for “getting things done.”
You’ve probably heard the rule of thumb that 80% of your work will be data manipulation or data cleaning. Although we might argue with the exact proportion, I can definitely say that 80% sounds close. A very, very large amount of your work will be spent collecting data from a variety of sources like text files, spreadsheets and databases; cleaning that data; and performing basic exploratory data analysis.
Almost all deliverables will require these skills, especially if you’re working in a junior role. Reporting requires the core data science skills. Data analysis requires core data science skills. Building machine learning models requires core data science skills. For almost all deliverables, you’ll need to use data manipulation, visualization, and/or data analysis.
But how much math do you need for these core skills? Not much.
This fact runs against the common narrative that data science requires a lot of math knowledge. The truth is, most of these basic skills can be learned without learning math beforehand.
What math you need for “core” data science skills
So how much math do you need for the “foundational skills” of data science?
I think that a smart first-year college student has enough math knowledge to perform the “core” data science skills. You heard that right. You basically only need the sort of lower-level algebra and simple statistics that you would have learned in grades 8 to 12.
Let me give you an example.
Example: getting data and cleaning data
Take creating new variables, one of the most common data manipulation tasks. This requires almost no math skill.
Let’s say we have a dataframe with two simple variables, x and y. We want to create a new variable that equals x divided by y. We’ll call this new variable new_var.
    # R CODE TO CALCULATE A NEW VARIABLE IN A DATA FRAME
    library(tidyverse)
    df <- tibble(
      x = 1:3,
      y = c(10, 5, 10)
    )
    df %>% mutate(new_var = x / y) -> df
This is a very simple operation, but it’s representative of the sort of data manipulation you will need to do.
Does this require calculus? Linear algebra? No.
When you’re creating new variables, you typically don’t need to do anything complicated. It’s commonly no more complicated than dividing one variable by another, or maybe performing a basic statistical manipulation like calculating a mean.
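To make the “calculating a mean” case concrete, here’s a rough sketch. The data frame and its column names (group, value) are made up for illustration; they don’t come from any real dataset.

```r
# Minimal sketch: a per-group mean on a made-up data frame.
# The data and the column names (group, value) are hypothetical.
library(dplyr)

df <- tibble(
  group = c("a", "a", "b", "b"),
  value = c(2, 4, 10, 20)
)

# Average `value` within each group; this is just addition and division.
group_means <- df %>%
  group_by(group) %>%
  summarise(mean_value = mean(value))

print(group_means)
```

The only math involved is adding the values in each group and dividing by how many there are.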
Are there exceptions? Yes. Are there cases where you need to do a complex computation to create a new variable? Yes. But they are much more rare, especially for beginners.
To put it simply, I would estimate that 95% of all data manipulation requires only simple math.
Example: a scatterplot doesn’t require advanced math
Here’s another example.
A very common data task is creating basic charts and graphs for exploratory data analysis. This essentially amounts to using simple data visualization techniques for the purpose of data analysis.
One of the most common is creating scatterplots:
    # R CODE TO CREATE A SCATTERPLOT
    library(tidyverse)
    mtcars %>%
      ggplot(aes(x = disp, y = hp)) +
      geom_point()
Ask yourself, do you really need calculus for this? How about linear algebra?
(No. The answer is no.)
To create this scatterplot, you don’t need college-level math. You only need basic math. If you’ve taken 6th grade math and you know what the Cartesian coordinate system is, you’re halfway there. This is not hard … not mathematically.
The hard part about this scatterplot is the syntax. To do data science – to really get things done – you need to master syntax, not math. Of course, you need to be able to apply the syntax and use visual tools the correct way, but that still doesn’t require calculus.
Example: a histogram doesn’t require advanced math
Another example. Let’s create a histogram:
    # R CODE TO CREATE A HISTOGRAM
    library(tidyverse)
    diamonds %>%
      ggplot(aes(x = x)) +
      geom_histogram()
Once again, this does not require advanced math. Of course, you need to know what a histogram is, but a smart person can learn and understand histograms within about 30 minutes. They are not complicated.
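If you want to see why histograms are not complicated, here’s a small sketch of what one computes: a count of values per bin. The values and bin edges are made up for illustration, and this uses base R rather than ggplot2.

```r
# Minimal sketch: a histogram is a count of values per bin.
# The values and the bin edges below are made up for illustration.
values <- c(1, 2, 2, 3, 7, 8, 8, 9, 12, 14)
bin_edges <- c(0, 5, 10, 15)

# cut() assigns each value to a bin; table() counts values per bin.
bins <- cut(values, breaks = bin_edges)
bin_counts <- table(bins)
print(bin_counts)

# hist() does the same counting (plot = FALSE skips the drawing).
h <- hist(values, breaks = bin_edges, plot = FALSE)
print(h$counts)
```

Both approaches just count how many values land in each interval. The chart is a picture of those counts.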
For the most part, if you’re getting started, then core data science skills like data manipulation and data visualization won’t require advanced math. Algebra and basic problem solving skills are probably enough to get started.
You don’t need much math to start learning machine learning
Ok, so the “foundational” skills don’t require advanced math like calculus and linear algebra, but surely machine learning does, right?
I’m going to make a bold claim.
You need almost zero advanced mathematics to get started with machine learning.
Basic machine learning requires limited math
I will admit that machine learning requires more math than data manipulation and data visualization.
Having said that, machine learning does not necessarily require advanced math like calculus and linear algebra.
This will confuse many people. Once again, most people hear that they need to know advanced math before they can start studying machine learning.
What’s going on here?
There’s a difference between theory and practice
Let’s go back to one of the key distinctions I made at the beginning of this blog post:
There’s a difference between theory and practice. That applies generally to data science, but also specifically to machine learning.
On the theory side (meaning, in an academic environment) machine learning requires a lot of math. Seriously, just read a few machine learning papers and it’s really obvious: there’s a lot of math. Calculus, linear algebra, statistics, the occasional reference to information theory …. Machine learning papers use a lot of math.
But how about practitioners?
On the whole, practitioners use a lot less math when doing machine learning (although there are exceptions).
The difference between theory and practice becomes even more stark when we re-consider the other distinction I made at the beginning of the article: the distinction between junior and senior data scientists. Although you can find senior data scientists who sometimes need advanced math to solve a machine learning problem, junior data scientists almost never need this.
Keep in mind, it’s somewhat common that junior data scientists won’t work on ML projects anyway. Remember when I said that junior data scientists often get stuck with the “grunt work”? If you’re the junior member of a data science team, you might not be working on a machine learning project at all.
But if you do work on a machine learning project, how much advanced math do you need?
Not much at all.
You can learn many machine learning topics without advanced math
For almost every machine learning algorithm, you can learn how the algorithm works without any knowledge of calculus or linear algebra whatsoever.
As a case in point, I recommend that you find a copy of the well known machine learning textbook, An Introduction to Statistical Learning. Many people, myself included, consider this to be the best introduction to machine learning that’s available (although the authors use the term “statistical learning”).
ISL (as the book is often called), provides a broad overview of machine learning techniques. In this book, you’ll find explanations of almost every major tool and technique:
- Linear regression
- Logistic regression
- Support vector machines
- Decision trees
- Neural networks
- Regularization (lasso and ridge regression)
- Feature extraction techniques, like principal component analysis
Again, this book provides a broad overview of the most important machine learning techniques.
Do you know how much calculus and linear algebra it uses? Almost none.
And it uses very few concepts from statistics or computer science.
Here’s a quote by Larry Wasserman, a professor of Machine Learning at Carnegie Mellon (one of the best universities for machine learning):
This book provides clear and intuitive guidance on how to implement cutting edge statistical and machine learning methods. ISL makes modern methods accessible to a wide audience without requiring a background in Statistics or Computer Science.
– Larry Wasserman, Professor, Department of Statistics and Department of Machine Learning, Carnegie Mellon University.
Read that again.
An Introduction to Statistical Learning – which is regarded as one of the best introductory books about machine learning – does not require a background in statistics or computer science.
And I can tell you from my own experience that calculus is not required either. The only reference to calculus that I’ve found was in the section concerning smoothing splines. (There’s an integral used in the smoothing spline equation.) This book is several hundred pages long, and there’s only one minor reference to calculus. That’s it.
I’ll say it again: you don’t need advanced math to get started with machine learning. You don’t need advanced math to get started with data science. You don’t need calculus or linear algebra. You can learn the essentials of machine learning with rather limited math background.
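As a rough illustration of what applying a technique looks like in practice, here’s a sketch of linear regression (the first technique on the list above) using base R’s lm() on the built-in mtcars data. The choice of variables and the 200 cubic inch car are assumptions made purely for the example.

```r
# Minimal sketch: fitting and using a linear regression in base R.
# The choice of variables (hp explained by disp) is only for illustration.
model <- lm(hp ~ disp, data = mtcars)

# Fitted intercept and slope.
print(coef(model))

# Predict horsepower for a hypothetical car with 200 cubic inches of displacement.
new_car <- data.frame(disp = 200)
predicted_hp <- predict(model, newdata = new_car)
print(predicted_hp)
```

Using the fitted model is arithmetic: the prediction is the intercept plus the slope times 200. The hard part is knowing the syntax and when the technique is appropriate.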
You can be an excellent machine learning practitioner without much math knowledge
What this means is that you can learn a lot of practical machine learning without advanced math.
In fact, I’m going to make a bold claim: you can become a very strong machine learning practitioner without knowing much advanced math.
Personally, I know quite a few excellent machine learning practitioners who do not have advanced math training. One works at Apple. Another works at Bank of America. Granted, they are both pretty smart, but neither is a math genius. They just know how to apply the techniques (and they get paid well into 6 figures to do so).
This should be very encouraging to you. There are many very successful machine learning practitioners who know very little advanced math. If you work hard, you can be one of them.
The math skills you actually need to start learning data science
If you want to learn data science, stop worrying about math. You need a lot less math than you probably expect.
If you’re just getting started with data science, here’s what you need to know:
Basic charts and graphs
You need to understand basic Cartesian plotting. If you’ve drawn plots on graph paper in an elementary math class, you probably know enough to get started.
You need to know what a function is and how to plot one. In this case, when I say “how to plot” I just mean that you need to know generally how to plot a function … what the process is. However, you don’t necessarily need to be able to plot a function in R or Python. If you’ve ever plotted a function by hand on graph paper, you probably know enough.
You need to know what a variable is. You need to know what an exponent is. You’ll need to be comfortable at a basic level reading math equations.
To be honest, you don’t need to know that much statistics. You definitely need to be familiar with basic statistical calculations like mean, median, standard deviation, and variance.
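Here’s a quick sketch of those calculations on a made-up vector, done once by hand and once with R’s built-in functions. Note that R’s var() and sd() compute the sample versions, which divide by n - 1.

```r
# Minimal sketch: the basic summary statistics on a made-up vector.
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)

# Mean: add the values up and divide by how many there are.
mean_by_hand <- sum(x) / n

# Sample variance: sum of squared distances from the mean,
# divided by n - 1 (this is what R's var() does).
var_by_hand <- sum((x - mean_by_hand)^2) / (n - 1)

# Standard deviation: the square root of the variance.
sd_by_hand <- sqrt(var_by_hand)

# The built-in functions give the same answers.
print(c(mean = mean(x), median = median(x), var = var(x), sd = sd(x)))
```

Nothing here goes beyond addition, division, squaring, and a square root.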
Basic math notation
You also need to be familiar with standard math notation. You need to know about variables, exponents, subscripts, and other minor items like parentheses. You should also be familiar with summation notation. Summation notation (AKA, sigma notation) appears quite a bit in machine learning.
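If sigma notation looks intimidating, it may help to read it as a loop. Here’s a small sketch with made-up values: “the sum of x_i for i from 1 to n” just means adding up every element of x.

```r
# Minimal sketch: reading summation (sigma) notation as code.
# "The sum of x_i for i = 1 to n" means: add up every element of x.
x <- c(3, 5, 8)  # made-up values

# Written as an explicit loop over the index i ...
total <- 0
for (i in seq_along(x)) {
  total <- total + x[i]
}

# ... which is what the built-in sum() does in one call.
print(total)
print(sum(x))
```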
This is most of what you need to get started learning data science.
If you’re smart and motivated, you can learn almost everything else if you know those simple mathematical foundations.
Do you have more concerns? Leave a comment below …