There are multiple data science languages to choose from.
The most popular are R and Python.
Many people will tell you to learn both. Learn R AND Python.
I think this is probably terrible advice.
Let’s discuss why.
“Learn Them Both” And Other Bad Ideas
There are multiple data science languages that are available to learn.
The 2 most common choices are R and Python.
But there are also less common options like Julia, and C++ and Java even come up occasionally.
Some people tell you to learn a little bit of several of these languages.
At minimum, you’ll often hear people say to learn both R AND Python. To learn them both.
In fact, I’ve suggested this to many people myself.
In hindsight, I sort of regret it.
Let me explain.
Your Time is Limited
First off, let’s address the biggest concern: your time is limited.
Whether you’re in school or have a full-time job, your time is limited.
You probably have hobbies, responsibilities, friendships, relationships, and other things you want to do besides study data science.
Almost all of us have limited time.
Personally, I’ve set up my life to have a lot of free time, and I still feel like my time is limited. And the older I get, the more I value my time.
This limited time naturally puts constraints on you, and it means you should be careful about what you choose to learn and what you choose to skip.
There’s a Lot to Learn
Beyond time constraints, there’s also the issue of volume.
There’s a lot to learn.
Let’s step outside of discussing programming languages, and talk about data science subject areas more broadly.
At minimum, you need to know:
- data wrangling
- data visualization
- data analysis
As I’ve stated many times in the past, these are the basics.
If you learn multiple programming languages, you’ll need to learn the syntax for how to do these things multiple times.
But setting aside the syntax, you also need to learn these topics conceptually.
Learning how to approach data wrangling and data cleaning is in some way separate from the syntax itself. You need to learn the typical steps to take. The process. How to think about data wrangling.
The same can be said for visualization and analysis. You need to learn how to approach these tasks in terms of process and “rules of thumb” for how to do these things, and do them well.
Again, it’s not just about syntax. Even at foundational levels, there’s a lot to learn about concepts and process.
The Mountain of Things Beyond “the Basics”
Let’s move beyond the basics.
To put it bluntly: there’s a mountain of material to learn.
Let’s run through a quick list:
- machine learning basics
- deep learning basics
- unsupervised learning
- feature engineering
- big data
- query languages and databases
- data processing for ML
- machine learning systems
- natural language processing (maybe)
- computer vision (maybe)
- linear algebra (maybe)
- probability (maybe)
And you’ll probably want some business/subject matter knowledge, if you’ll be working in industry.
Assuming that you’re already a busy person and you have only a few hours per week to devote to these things, each of these topics could take at least 3-6 months to learn. If you’re disciplined. And you’re good at learning new things. And you have clear, easy-to-understand resources. (And frankly, 3-6 months is optimistic for some of these topics.)
Now, do you need to learn all of them? Probably not. You might be able to skip a few, depending on your goals.
But many of you will want to learn most of these topics if you want to earn big $$$ in the Tech industry.
If that’s true – if you do want to be a high-earning data scientist in the Tech industry – then congratulations, you just signed up for 4-10 years of hard work and study.
And that’s without considering programming language choice.
There’s a lot to learn.
Every hour you spend learning a redundant second (or third) data science language is an hour that you can’t spend learning the other things that will help you increase your earning power.
Languages Can “Interfere” with Each Other
Another issue is that languages can interfere with each other.
For example, in R’s dplyr package, there’s a function called filter(). In R, the filter() function enables you to subset rows based on logical conditions. So for example, you can “filter” the data to return only the rows where “variable_A” is greater than 0 (or any other logical condition). The important part is that it subsets the rows.
But Python’s Pandas package also has a filter technique. The Pandas .filter() method “filters” the data, but it works differently. First, it can technically operate on the rows or the columns (although I almost exclusively use it for columns). Second, it doesn’t operate by applying logical conditions to the values in the data. Instead, it selects rows or columns by their labels, meaning the index labels or the column names. So Pandas has something called “filter(),” and it operates on dataframes, but it works in a fundamentally different way.
(And by the way, if you want to subset the rows based on logical conditions, there’s a different method called Pandas .query(). It’s really the Python analog to the R filter technique.)
So Python and R both have a “filter” technique. They have the same name, but do different things, and work in different ways.
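To make the difference concrete, here’s a minimal sketch in Pandas. The dataframe and its column names are made up purely for illustration (they’re my assumption, not data from a real project). It shows .filter() selecting columns by name, and .query() subsetting rows by a logical condition, which is the behavior you’d get from filter() in dplyr.

```python
import pandas as pd

# Hypothetical toy data; the column names are invented for this example.
df = pd.DataFrame({
    "variable_A": [-1, 0, 2, 5],
    "variable_B": [10, 20, 30, 40],
    "other": ["w", "x", "y", "z"],
})

# .filter() selects by LABEL, not by value.
# Here it keeps the columns whose name contains "variable".
cols = df.filter(like="variable", axis=1)

# .query() keeps the ROWS where a logical condition is true,
# which is what dplyr's filter() does in R.
rows = df.query("variable_A > 0")

# Boolean indexing is another common way to get the same rows.
rows_alt = df[df["variable_A"] > 0]

print(cols.columns.tolist())
print(rows)
```

Notice that the Pandas .filter() call never looks at the values in the data. If you come from R and reach for it to subset rows by a condition, you’ll get an error or the wrong result.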
And “filter” is just one example. There are other techniques that are similar between R and Python that will cause problems.
The fact that they have similar names but different functionality will cause “interference” in your learning.
Cognitive psychologists have known about this type of learning interference for decades.
So have polyglots, who learn multiple spoken languages. Just ask someone who’s tried to learn two similar languages – like Spanish and Italian – at the same time.
So I can almost guarantee that if you try to learn Python for data science and R for data science at the same time, you’ll get confused, and it will be harder to learn both of them.
It’s a real problem, and I still sometimes deal with it when using R or Python.
There Are Other Skills That “Synergize” Better
And finally, let’s talk about choosing complementary skills, rather than somewhat redundant ones.
For the most part, if you learn two data science languages, there’s going to be a lot of redundancy. Overlap.
Learn data wrangling in Python? Great. Now you need to learn the same procedures in a different language.
And the same thing with data visualization.
Learn how to make a scatterplot in Python … and now you need to learn how to make a scatterplot in R.
If you learn two or more data science languages, you end up learning a lot of redundant techniques, which is a bit of a waste.
What’s the alternative?
You could learn complementary skills.
Instead, Learn Skills that “Synergize”
Ideally, you want things that “synergize” together.
I know that when I use the word “synergize,” I sound like a douchey 30-year-old McKinsey consultant on his 3rd no-whip, oat-milk latte, but hear me out.
This concept is actually good.
Things are synergistic when two or more parts interact to “produce a combined effect greater than the sum of their separate effects [Source: Google].”
Ideally, this is what you want.
You want skills that will work well together.
As noted previously, precisely the opposite happens when you learn two or more data science languages. When you learn multiple data science languages, there ends up being a lot of redundant learning.
Instead, you want to learn things that will complement each other.
You want skills that make the new whole more valuable than the sum of the parts.
So after you learn data wrangling, data visualization, and data analysis in one language, it’s probably better to move on to machine learning and other valuable topics than to learn data wrangling, data visualization, and data analysis all over again in a new language.
Again, your time is limited, and there’s a lot to learn.
You’ll get the best ROI by learning the foundations in one language, and then learning synergistic skills in that same language that will make you even more valuable overall.
But, There Are Exceptions
Before I have people sniping at me from the comments section, let me add that there might be exceptions.
There may be environments where you need to know both (I think this is rare, but it’s possible).
Maybe your focus is on something other than maximizing the value of your skills and your personal ROI.
Or, you may be a masochist who likes learning programming languages just for kicks.
Knock yourself out.
But at this point, I think for most people, it’s better to choose one data science language and focus on learning complementary skills in that language.
(And as I mentioned last week, for most people, I recommend that you learn Python.)
What do you think?
What are your thoughts?
Do you agree that a person should choose one data science language?
Or do you think it’s better to learn multiple languages?
Do you have a personal story where you needed both R and Python?
Let me know your thoughts in the comments.