If you’ve been reading the Sharp Sight blog for a while, you’ll know that I’m a big fan of elite performers: Navy SEALs, grandmaster chess players, elite athletes, etcetera.
I want to understand how elite performers operate across multiple disciplines, so that I can find what makes them special. The techniques. The training methods. The mindsets.
In particular, I’m interested in how they become so great. What’s the process? How do they get from a normal person, to someone who has mastered their field?
If you ask this question long enough, and you look at elite performers across many fields, you start to see patterns.
After many years of studying top performers, I’ve discovered many tips, strategies, and “secrets” that I’ve been able to apply to data science to help people learn fast and become exceptional performers.
In this tutorial, I’m going to distill what I’ve learned into 7 steps that will enable you to master data science fast:
- Focus on Foundations
- Identify the Most Important Techniques
- Identify the Most Commonly Used Parameters
- Break Everything Down into Small Units
- Practice the Small Units
- Repeat Your Practice
- Reintegrate Everything into a Coherent Whole
Let’s talk about each of those steps.
1: Focus on Foundations
If I had a dollar for every time I had the following conversation, I could stop being a data scientist and retire to a quiet beach in the Caribbean:
Data science hopeful: “HoW dO i LEaRN mACHiNE LeARnINg?”
Me: Have you mastered data wrangling and data analysis yet?
Data science hopeful: No.
Me: 🤦
Look. I realize that I’m being a bit of an asshole here.
But, I get it: machine learning is really cool. As I’ve said many times in the past, machine learning will probably impact every major part of our economy. It’s one of the most important general purpose technologies since the invention of electricity. Moreover, people who master machine learning are commonly making over $200,000 per year (or way more).
BUT …
Machine learning is a relatively advanced skill that’s dependent on other skills.
For example, at Tesla, the machine learning team spends 75% of their time on “data.”
Specifically, they spend most of their time visualizing their data and cleaning it up.
For reference, that claim comes from a talk by Andrej Karpathy, the head of AI at Tesla.
As he puts it: “At Tesla, I spend most of my time just massaging the datasets.”
Meanwhile, he also has a bullet point in his presentation noting that Tesla AI programmers spend a lot of time “visualizing datasets.”
My point here is that if you want to do the cool, sexy stuff in data science (like machine learning, deep learning, etc.) …
Then you must master. the. foundations.
What are the foundations? The foundations of data science are:
- data wrangling
- data visualization
- data analysis
In Python, that means that you need to learn:
- Pandas
- Numpy
- and at least one data visualization package, like Seaborn
Additionally, you need to know how to combine these tools to analyze data and get things done.
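To make that concrete, here’s a minimal sketch of what “combining the foundations” looks like: Pandas for the wrangling, Numpy for the numeric work. The dataset and column names here are made up purely for illustration.

```python
import numpy as np
import pandas as pd

# A toy dataset (hypothetical values, just for illustration).
sales = pd.DataFrame({
    "region": ["east", "east", "west", "west"],
    "units": [10, 14, 7, 9],
    "price": [2.5, 2.5, 3.0, 3.0],
})

# Pandas for wrangling: derive a revenue column, then aggregate by region.
summary = (
    sales
    .assign(revenue=lambda df: df["units"] * df["price"])
    .groupby("region", as_index=False)
    .agg(total_revenue=("revenue", "sum"))
)

# Numpy for numeric work: total revenue across all regions.
grand_total = np.sum(summary["total_revenue"].to_numpy())
print(summary)
print(grand_total)
```

Nothing fancy, and that’s the point: real analysis is mostly small wrangling steps like these, chained together.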
At the risk of berating you too much with this point, I offer a quote by the basketball GOAT himself, Michael Jordan.
“Get the fundamentals down and the level of everything you do will rise.”
As I said earlier: you can learn good principles from the best of the best. Follow MJ’s advice. Get the fundamentals down.
2: Identify the Most Important Techniques
Once you commit to mastering the fundamentals, you must identify the principal techniques.
This actually comes from Tim Ferriss (although, you can also find it in many other places).
In The 4 Hour Chef, Ferriss wrote about rapid skill acquisition. Nominally, the book is about cooking, but in reality, it’s about learning.
In the book, Ferriss wrote about the importance of selecting the most important things.
So for example, in the book, Ferriss notes that if you want to learn English, you should focus on the most common words. Why? The English language has almost 200,000 words, but the 100 most common words account for 50% of all printed material. And the top 1000 words account for about 80% of printed material. (Ultimately, this is an example of the 80/20 rule.)
Polyglots and “language hackers” know this trick, and use it to learn foreign languages. If they focus on the top 1000 words first, they can make fast progress.
The same thing applies to data science.
If you want to learn data science quickly, you need to focus on the most commonly used tools and techniques.
For example, I wrote previously about the Top 19 Pandas Methods That You Should Memorize:
- read_csv
- set_index
- reset_index
- loc
- iloc
- drop
- dropna
- fillna
- assign
- filter
- query
- rename
- sort_values
- agg
- groupby
- concat
- merge
- pivot
- melt
I chose those 19 functions, because they account for the bulk of the data wrangling code that I write and that I see other people write.
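For a sense of how these methods show up in practice, here’s a small sketch that chains several of them (query, assign, sort_values, rename, reset_index) on a toy DataFrame. The data is hypothetical:

```python
import pandas as pd

# A toy DataFrame of student scores (made-up values).
scores = pd.DataFrame({
    "name": ["Ana", "Ben", "Cam"],
    "score": [88, 95, 72],
})

result = (
    scores
    .query("score >= 80")                   # keep only passing rows
    .assign(passed=True)                    # add a new column
    .sort_values("score", ascending=False)  # order by score
    .rename(columns={"name": "student"})    # rename a column
    .reset_index(drop=True)                 # clean up the index
)
print(result)
```

Memorize methods like these first, and the bulk of everyday data wrangling becomes fast and automatic.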
You can do something similar for data visualization, by focusing on the most common visualizations, like bar charts, line charts, histograms, and other commonly used tools.
Ultimately, you must identify the top tools and techniques that are most frequently used in data science, so you can focus on them.
3: Identify the Most Commonly Used Parameters
Once you identify the most commonly used techniques, you need to identify the most commonly used parameters.
Take for example, the Seaborn histogram.
By my count, there are around 33 parameters for the sns.histplot() function.
How many of those parameters will you actually use?
About 5, Chad.
So if you’re likely to use only a few parameters most of the time, then it makes sense to focus on those, right?
Yes. Yes it does.
This is why, at Sharp Sight, you’ll notice that we break things down and select the most important parameters.
For example, I wrote a tutorial about the Seaborn histogram a while ago, and it focused on 5 or 6 parameters. That’s it. So in the syntax explanation, I identified 2 parameters. And then later in the tutorial, I discussed (and used) only a few more.
If you’ve been paying attention, and if you understand our learning philosophy, you’ll understand that this is really just another application of the 80/20 rule.
We want to select the most important things, so we chose to study the most important parameters.
4: Break Everything Down into Small Units
After you’ve identified the most important techniques and the most important parameters, you need to find the “minimal learnable units.”
This concept comes, once again, from Tim Ferriss. In an episode of The Tim Ferriss Podcast, Ferriss talks about identifying the small “building blocks” of a skill.
The good news is that if you’ve already identified the most important functions and the most important parameters, you’re more than halfway there.
In the case of data science, the minimal learnable units (MLUs) are mostly the function names and parameter names. These are small units of syntax that you need to be able to recall in order to write code “fluently.”
There may also be a few other things that qualify, like parts of import statements, and important arguments for parameters. For example, in Numpy, we often need to execute a technique along a particular “axis.” The arguments 0 and 1 for the axis parameter are things that you’re going to want to know.
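To illustrate that particular unit, here’s a minimal Numpy example showing what axis=0 and axis=1 actually do:

```python
import numpy as np

m = np.array([[1, 2, 3],
              [4, 5, 6]])

# axis=0 collapses the rows: one result per column.
col_sums = m.sum(axis=0)   # array([5, 7, 9])

# axis=1 collapses the columns: one result per row.
row_sums = m.sum(axis=1)   # array([ 6, 15])
```

A tiny unit like “axis=0 means collapse rows” is exactly the kind of thing you want to be able to recall instantly.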
5: Practice the Small Units
Once you’ve identified the important units of syntax, you need to practice.
Practice is one of the critical elements of mastery.
Navy SEAL marksman Chris Sajnog emphasized this in his book Navy SEAL Shooting:
“Ever wish you could shoot like a Navy SEAL?
… there is no secret.
It all boils down to practice, and lots of it.”
He notes later that “the best way to learn and reinforce [skills], is through slow, perfect practice.”
Now to be fair, practice systems are a complicated topic.
How exactly you practice really matters. Some practice methods work better than others. I’ll have to write a separate blog post about good practice methods.
Having said that, I’ll mention that Cal Newport’s quiz-and-recall comes close to what I have in mind. You should find a way to quiz yourself on the “minimal learnable units” of syntax, and try to recall those MLUs.
Recall is important, because when it eventually comes time to write code, you need to be able to recall function names and parameter names. If you want to write code “fluently,” you need to be able to recall syntax from memory. (Googling syntax is a huge drag on your productivity.)
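One simple way to implement quiz-and-recall is a flashcard-style check: see a prompt, try to recall the unit of syntax from memory, then verify. The cards and the helper function below are hypothetical, just to sketch the idea:

```python
# Hypothetical "minimal learnable unit" flashcards: prompt -> answer.
cards = [
    ("Pandas: drop rows with missing values", "dropna"),
    ("Pandas: add a new column in a chain", "assign"),
    ("Numpy: collapse rows (one result per column)", "axis=0"),
]

def check_answer(prompt: str, attempt: str) -> bool:
    """Return True if the attempted recall matches the card's answer."""
    answer = dict(cards)[prompt]
    return attempt.strip() == answer

# In a real session you'd hide the answer and type it from memory.
print(check_answer("Pandas: drop rows with missing values", "dropna"))
```

The mechanism matters less than the act: force yourself to produce the syntax from memory, rather than just re-reading it.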
6: Repeat Your Practice Over Time
If you thought that you could just study something once and then know it forever, I have some bad news for you.
The brain naturally forgets.
Have you ever “learned” a new piece of syntax, and then forgotten it a few days later?
Have you ever been introduced to a new person by name, and then forgot their name a few minutes later?
Of course.
We’ve all had that experience.
The brain naturally forgets.
But there’s some good news.
There is a way to strengthen your memory over time: repeat practice.
In particular, if you review your practice materials consistently over time, and at somewhat regular intervals, then your memory of those things will get stronger and stronger.
Reviewing and repeating practice begins to move information from short-term memory to long-term memory.
The details about the best way to structure this are complex. There are actually ways to optimize the time intervals between review sessions to maximize your learning efficiency (we cover this in our paid courses).
But suffice it to say: you must repeat your practice activities over a period of time to solidify your memory.
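As a rough sketch of the idea (not the optimized schedule from our courses), an expanding-interval review plan can be computed like this. The specific gaps below are illustrative assumptions, not a prescription:

```python
from datetime import date, timedelta

# Hypothetical expanding gaps (in days) between review sessions.
GAPS = [1, 3, 7, 14, 30]

def review_dates(start: date) -> list:
    """Return the dates on which to review material first studied on `start`."""
    out, day = [], start
    for gap in GAPS:
        day = day + timedelta(days=gap)
        out.append(day)
    return out

schedule = review_dates(date(2022, 1, 1))
print(schedule)
```

The key property is that the gaps grow: each successful review lets you wait longer before the next one, which is what makes the memory durable without endless re-study.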
7: Reintegrate Everything into a Coherent Whole
Up until this point, most of the tips have been about deconstructing data science into small units that can be practiced.
As you practice, and repeat your practice, you will eventually master those small functions and units of syntax.
But, real data science work requires you to combine multiple techniques together. That’s why the next step is to reintegrate all of those MLUs into a coherent whole.
For example, earlier this week, I published a thread on Twitter where I showed how to get, clean, and visualize some Wikipedia data using Python.
In that thread, I demonstrated some fairly detailed data cleaning using Pandas.
This is what real data wrangling looks like. Multiple techniques, used in combination to get sh*t done.
And notice that most of these functions are things that I mentioned earlier: Pandas query, Pandas assign, Pandas melt. But now, they’re used together. If you practice them individually, then you’ll know exactly how to type the code for each individual tool. But by learning how to combine them, you begin to unlock the real power of data wrangling with Pandas.
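Here’s a small, self-contained sketch of that kind of reintegration, combining melt, assign, query, and sort_values in one chain. The data is made up for illustration; it is not the Wikipedia example from the thread:

```python
import pandas as pd

# Toy wide-format data (hypothetical page views by year).
views = pd.DataFrame({
    "page": ["Python", "R"],
    "2020": [100, 80],
    "2021": [140, 70],
})

tidy = (
    views
    .melt(id_vars="page", var_name="year", value_name="views")  # wide -> long
    .assign(year=lambda df: df["year"].astype(int))             # fix the dtype
    .query("views > 75")                                        # subset rows
    .sort_values(["page", "year"])
    .reset_index(drop=True)
)
print(tidy)
```

Each step is a unit you practiced individually; chained together, they reshape, clean, and filter a dataset in a handful of readable lines.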
So again: after you deconstruct data science into small learnable units, you must learn how to put those things back together to accomplish tasks.
Recap: Take it apart, practice, put it back together
Let’s quickly recap:
To learn data science as quickly and efficiently as possible, you need to:
- Focus on Foundations
- Identify the Most Important Techniques
- Identify the Most Commonly Used Parameters
- Break Everything Down into Small Units
- Practice the Small Units
- Repeat Your Practice
- Reintegrate Everything into a Coherent Whole
A different way of saying it is that you need to take the subject apart, identify the most important things, practice, and then put the pieces back together (all while keeping the 80/20 rule in mind).
This general path will be the fastest, most efficient way to learn data science.
Join Python Data Mastery
Having said that, the devil really is in the details.
Executing on this process could be challenging if you aren’t sure how to implement it.
If you want to save yourself the time of trying to figure it out on your own, you can just join our course: Python Data Mastery.
Python Data Mastery is our premium data science training course that will teach you:
- data wrangling
- data visualization
- data analysis
… using Numpy, Pandas, Seaborn, and other important Python packages.
And this course uses the training principles I’ve laid out in this blog post.
We’ve designed this to be the fastest and most efficient way to learn data science in Python.
The course will reopen for enrollment on Monday, but you must be on our wait list to be notified when the doors open.
You can join the wait list here:
Great insight, but with a fundamental flaw: how do you know which parts are the essential core? By experience.
So if your recommendation is to grasp the most important fundamentals, you have only one option: find a mentor, because this is something that the internet or a book is not going to give you.
I agree and disagree.
Yes: it takes experience to know which are the essential parts.
But I also literally told you the top 19 most important Pandas methods. I also mentioned several of the data visualization techniques you should learn.
Moreover, I’ve been writing similar articles for several years. With some clever google searches, you could probably find my recommendations for how to distill Numpy, data visualization, and a few other topics.
Still, you are correct: having a mentor or coach will dramatically accelerate your progress. But isn’t that also true for sports, or chess, or music?
With enough time and effort, you can figure it out on your own (using the path outlined above).
But you’ll make much faster progress with a coach.
What’s your time worth to you?