You’ve probably read numerous articles telling you how to start learning data science. Collectively, they tell you dozens of things you need to learn. Learn Python. Learn R. Learn Hadoop. They tell you all the skills you need: learn machine learning, visualization, data wrangling. Little technical skills like manipulating vectors, matrices, loops. More tools like Pig, Hive, d3, and Tableau.
For most students it’s overwhelming.
It’s like standing at the base of Everest, saying to yourself, “how the hell will I climb that?”
There’s a mountain of material out there, and it will be difficult to climb without knowing where to start.
You need a path
Look, there are lots of people out there that aren’t actually skilled in data science, telling you how to start learning data science (I’m looking at you, HR professionals).
Then, there are people who are actual data scientists, but are terrible teachers and communicators. I’m sure you know what I’m talking about. The guy with the super-elite PhD who says “oh, it’s easy,” and then proceeds to talk for 45 minutes about arcane math that nobody can understand.
Let me break this down for you: you can’t learn everything at once. Your time is limited.
And, you shouldn’t learn everything with equal enthusiasm.
Some skills are more useful than others, and some skills are easier to learn than others.
Some skills are used every day by (almost) every data scientist, and other skills are “specialty” skills, used either by a select few specialists or used only occasionally by the average analyst.
Early in your learning, you need to be able to distinguish between the essential and the “nice to have.”
You need to be selective, and you need to learn things in the proper order. You need a path for getting started.
How to start learning data science
You need to focus on learning the skills with the highest return on investment (ROI). Focus on the skills that are easy to learn, easy to implement, and that yield the greatest results. (ahem. Do you know the results clients actually want? Maybe you need to learn what clients want first.)
I believe that it’s best to focus your efforts. Learn one tool.
Right now, the tool I recommend is R.
There are a few reasons I recommend R:
- R is the most common programming language among data scientists. O’Reilly Media just released their 2014 Data Science Salary Survey. In that report, they note that R is the most commonly used programming language (if you exclude SQL).
- R has 2 packages that dramatically streamline the data science workflow:
  - dplyr for data manipulation
  - ggplot2 for data visualization
To be fair, I think you could also make a strong case for learning Python (it came in just behind R in O’Reilly’s list of data science tools). In fact, I considered learning Python myself.
However, dplyr and ggplot2 “tip the scales” in favor of R.
As I’ve noted in several tutorials, ggplot2 has a deep structure that underlies its syntax. When you learn that structure, you begin to think about data visualization in a deep way. In some sense, learning ggplot2 teaches you how to think about visualization.
And as I mentioned in the introductory dplyr tutorial, dplyr’s syntax is easy to learn, easy to use, and operates in a way that streamlines your workflow. In fact, dplyr is one of the best data management tools I’ve ever seen.
Moreover, when you start to combine dplyr with ggplot2 through “chaining” (much like Unix pipes) you can rapidly explore your data with very little effort. (For an example, see this section on dplyr + ggplot2.)
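To make "chaining" concrete, here is a minimal sketch of a dplyr chain that flows straight into a ggplot2 chart. It assumes you have both packages installed, and it uses R's built-in mtcars dataset purely as stand-in data (my choice for illustration, not something from the tutorials mentioned above).

```r
library(dplyr)
library(ggplot2)

# mtcars ships with R; treat it as a placeholder for your own data.
p <- mtcars %>%
  group_by(cyl) %>%                      # one group per cylinder count
  summarise(mean_mpg = mean(mpg)) %>%    # average mpg within each group
  ggplot(aes(x = factor(cyl), y = mean_mpg)) +  # hand the summary to ggplot2
  geom_col()                             # draw one bar per group

p  # printing the object renders the chart
```

Notice that the dplyr steps are joined with %>%, while ggplot2 layers are added with +. That switch trips up a lot of beginners.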
Learn data visualization first
Data visualization is the fastest, easiest means of finding insight when you’re first starting out. In your first few months, it’s the highest-leverage, highest-ROI skill. It’s also one of the most versatile tools: you can use visualization for data exploration (finding insight) and communication (communicating what you find to business partners and executives).
For learning visualization, I recommend R’s ggplot2. Again, I recommend ggplot2 because it has a deep underlying structure to it. When you learn the structure of the syntax, you are at the same time learning deep principles about visualization.
For starters, I suggest learning the “core” visualization techniques. These core techniques are:
These are important for a few reasons:
- These charts are the basis for more advanced visualizations. For example, the bubble chart is just a variant of the scatterplot. In fact, the dot distribution map is nothing but a modified scatterplot.
- These charts have a structure to them. When you learn that structure, you will start to learn how to think about visualizing your data.
- The charts are the essential “tools of insight.” They are foundational tools for finding insights (during data exploration) and communicating insight. And insight is what clients actually want and need.
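To see what I mean by the bubble chart being a variant of the scatterplot, here is a rough sketch in ggplot2. Again, the built-in mtcars dataset and the particular variables (wt, mpg, hp) are just assumptions for the sake of illustration.

```r
library(ggplot2)

# A plain scatterplot: two variables mapped to the x and y positions.
scatter <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# A bubble chart: the same plot, with a third variable mapped to point size.
bubble <- ggplot(mtcars, aes(x = wt, y = mpg, size = hp)) +
  geom_point(alpha = 0.5)  # some transparency so overlapping bubbles stay visible

scatter
bubble
```

The only structural difference between the two is one extra aesthetic mapping. Once that clicks, a lot of "advanced" charts start to look like small modifications of the basics.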
Learn data manipulation second
Once you learn the foundational tools of data visualization, you can “back into” data manipulation in R.
When you just start learning data science, you can use “dummy data” or very simple data sets that don’t require much data reshaping.
As you advance, though, the “shape” of your data will be a problem: you’ll have multiple data files that you need to join together; you’ll need to subset and change variables; you’ll need to do lots of aggregations. When you reach this point – where the shape of your data is a bottleneck – then put more time into learning data manipulation. An example of this is the recent tutorial analyzing ‘supercar’ data, where the data were found in five separate files.
When you’re starting to learn data manipulation, I recommend mastering the 5 basic dplyr verbs, as well as “chaining” using the %>% operator.
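Here is a small sketch of what that looks like in practice. I'm taking the five verbs to be filter(), select(), mutate(), summarise(), and arrange() (with group_by() as a helper), and I'm using the built-in mtcars dataset as placeholder data. The column name hp_per_wt is made up for this example.

```r
library(dplyr)

result <- mtcars %>%
  filter(hp > 100) %>%                  # filter(): keep rows that match a condition
  select(mpg, cyl, hp, wt) %>%          # select(): keep only the columns you need
  mutate(hp_per_wt = hp / wt) %>%       # mutate(): add a new variable
  group_by(cyl) %>%                     # group_by(): set up groups for aggregation
  summarise(mean_ratio = mean(hp_per_wt),
            n = n()) %>%                # summarise(): collapse each group to one row
  arrange(desc(mean_ratio))             # arrange(): sort the rows

result
```

Read it top to bottom, saying "then" every time you see %>%: take mtcars, then filter, then select, and so on.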
Learn machine learning last
I know: this is the opposite of what most other people are telling you.
The fact is, the vast majority of data jobs – particularly the entry level jobs – are not machine learning jobs.
Think of a baseball team. There are core baseball skills (hitting, throwing, and running). The vast majority of people on the team have a mix of skills. Teams are built from individuals with mixes of the core baseball skills. And then there’s one guy who is highly specialized in the most arcane of skills: pitching (specialization in throwing).
Machine learning is similar to pitching. It’s valuable, technical, a little arcane, and difficult to do well. There are also fewer of those jobs. Strategically, you’re much better served learning visualization and data manipulation: they are easier to learn, easier to implement, and the jobs are more plentiful.
Not to mention: you need data visualization and data manipulation for machine learning anyway! If you’re implementing an ML algorithm, you still need to put your data into the right shape first. And when you’re done, in most cases you’ll need to explore the results. Typically you’ll perform this data exploration with data visualization.
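As a rough illustration of that workflow, here is a sketch that shapes the data with dplyr, fits a model, and then explores the result with ggplot2. I'm using a simple linear regression (base R's lm()) as a stand-in for "an ML algorithm," and mtcars as stand-in data. Both are my assumptions for the example, not a recommendation of a particular method.

```r
library(dplyr)
library(ggplot2)

# 1. Shape the data: keep only the variables the model needs.
cars <- mtcars %>%
  select(mpg, wt) %>%
  filter(!is.na(mpg), !is.na(wt))

# 2. Fit the model (linear regression standing in for an ML algorithm).
fit <- lm(mpg ~ wt, data = cars)

# 3. Explore the results: attach the fitted values and plot them against the data.
cars <- cars %>%
  mutate(predicted = predict(fit))

ggplot(cars, aes(x = wt)) +
  geom_point(aes(y = mpg)) +        # the observed values
  geom_line(aes(y = predicted))     # the model's fitted line
```

Two of the three steps here are data manipulation and visualization. The modeling call itself is a single line.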
So to summarize my view on ML: in the beginning it’s the lowest ROI skill! It’s the hardest to learn, the hardest to implement (both because it’s difficult and because it requires a foundation in data manipulation and visualization), and there are fewer jobs requiring machine learning.
Keep in mind: I’m not saying that you should never learn machine learning. It’s extremely valuable. I’m just saying that you should learn it after you’ve built a solid foundation of data visualization and data manipulation.
TL;DR (How to start learning data science)
Here’s the recap of how to start learning data science:
- Choose one tool: the R programming language
- Learn data visualization first (with R’s ggplot2), using simple data or dummy data. Then find a more complicated dataset.
- Learn data manipulation second (with R’s dplyr), and practice data manipulation on your more complex data
- Learn machine learning last