Why You Should Learn ONE Data Science Language

There are multiple data science languages to choose from.

The most popular are R and Python.

Many people will tell you to learn both. Learn R AND Python.

I think this is probably terrible advice.

Let’s discuss why.

“Learn Them Both” And Other Bad Ideas

There are multiple data science languages that are available to learn.

The 2 most common choices are R and Python.

But there are also other less common languages like Julia. C++ and Java even come up, occasionally.

Some people tell you to learn a little bit of several of these languages.

At minimum, you’ll often hear people say to learn both R AND Python. To learn them both.

In fact, I’ve suggested that many people do this.

And to go a step further, I’ve personally learned R and Python. I’ve spent a lot of time learning both of these languages (or at least, the data science packages for these languages).

In hindsight, I sort of regret it.

Let me explain.

Your Time is Limited

First off, let’s address the biggest concern: your time is limited.

Whether you’re in school or have a full-time job, your time is limited.

You probably have hobbies, responsibilities, friendships, relationships, and other things you want to do besides study data science.

Almost all of us have limited time.

Personally, I’ve set up my life to have a lot of free time, and I still feel like my time is limited. And the older I get, the more I value my time.

This limited time naturally puts constraints on you, and suggests that you should be careful about what you should and shouldn’t learn.

There’s a Lot to Learn

Beyond time constraints, there’s also the issue of volume.

There’s a lot to learn.

Let’s step outside of discussing programming languages, and talk about data science subject areas more broadly.

At minimum, you need to know:

  • data wrangling
  • data visualization
  • data analysis

As I’ve stated many times in the past, these are the basics.

If you learn multiple programming languages, you’ll need to learn the syntax for how to do these things multiple times.

But setting aside the syntax, you also need to learn these topics conceptually.

Learning how to approach data wrangling and data cleaning is in some way separate from the syntax itself. You need to learn the typical steps to take. The process. How to think about data wrangling.

The same can be said for visualization and analysis. You need to learn how to approach these tasks in terms of process and “rules of thumb” for how to do these things, and do them well.

Again, it’s not just about syntax. Even at foundational levels, there’s a lot to learn about concepts and process.

The Mountain of Things Beyond “the Basics”

Let’s move beyond the basics.

To put it bluntly: there’s a mountain of material to learn.

Let’s run through a quick list:

  • machine learning basics
  • deep learning basics
  • unsupervised learning
  • feature engineering
  • big data
  • query languages and databases
  • data processing for ML
  • machine learning systems
  • natural language processing (maybe)
  • computer vision (maybe)
  • linear algebra (maybe)
  • probability (maybe)

And you’ll probably want some business/subject matter knowledge, if you’ll be working in industry.

Assuming that you’re already a busy person and you have only a few hours per week to devote to these things, each of these topics could take at least 3-6 months to learn. If you’re disciplined. And you’re good at learning new things. And you have clear, easy-to-understand resources. (And frankly, 3-6 months is optimistic for some of these topics.)

Now, do you need to learn all of them? Probably not. You might be able to skip a few, depending on your goals.

But many of you will want to learn most of these topics if you want to earn big $$$ in the Tech industry.

If that’s true – if you do want to be a high-earning data scientist in the Tech industry – then congratulations, you just signed up for 4-10 years of hard work and study.

And that’s without considering programming language choice.

My point?

There’s a lot to learn.

Every hour you spend learning a redundant second (or third) data science language is an hour that you can’t spend learning the other things that will help you increase your earning power.

Languages can “Interfere” with Each Other

Another issue is that languages can interfere with each other.

For example, in R’s dplyr package, there’s a function called filter(). In R, the filter() function enables you to subset rows based on logical conditions. So for example, you can “filter” the data to return only the rows where “variable_A” is greater than 0 (or any other logical condition). The important part is that it subsets the rows.

But Python’s Pandas package also has a filter technique. The Pandas .filter() method “filters” the data, but it works differently. First, it can technically operate on the rows or the columns (although I almost exclusively use it for columns). More importantly, it doesn’t apply logical conditions to the data at all: it selects rows or columns by their labels (the index values or column names). So Pandas has something called “filter(),” and it operates on dataframes, but it works in a fundamentally different way.

(And by the way, if you want to subset the rows based on logical conditions, there’s a different method called Pandas .query(). It’s really the Python analog to the R filter technique.)
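To make the difference concrete, here’s a minimal sketch in Python. The dataframe and its column names are made up for illustration (“variable_A” just echoes the example above), so treat this as a quick demonstration rather than a recipe.

```python
import pandas as pd

# Hypothetical toy data, invented for this illustration.
df = pd.DataFrame({
    "variable_A": [-1, 0, 2, 5],
    "variable_B": [10, 20, 30, 40],
    "other": ["w", "x", "y", "z"],
})

# Pandas .filter() selects by LABEL (column names by default on a dataframe).
# It never looks at the values in the data.
cols = df.filter(like="variable")

# Pandas .query() subsets ROWS by a logical condition.
# This is the closer match to dplyr's filter(df, variable_A > 0) in R.
rows = df.query("variable_A > 0")

# Boolean indexing is another common way to write the same row subset.
rows_alt = df[df["variable_A"] > 0]

print(list(cols.columns))
print(rows["variable_A"].tolist())
```

Here, .filter() hands back the two columns whose names contain “variable,” while .query() hands back only the rows where variable_A is greater than 0.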

So Python and R both have a “filter” technique. They have the same name, but do different things, and work in different ways.

And “filter” is just one example. There are other techniques that are similar between R and Python that will cause problems.

The fact that they have similar names but different functionality will cause “interference” in your learning.

Cognitive psychologists have known about this type of learning interference for decades.

So have polyglots, who learn multiple spoken languages. Just ask someone who’s tried to learn two similar languages – like Spanish and Italian – at the same time.

So I can almost guarantee that if you try to learn Python for data science at the same time you learn R for data science, you’ll get confused, and it will make both of them harder to learn.

It’s a real problem, and I still sometimes deal with it when using R or Python.

There are other skills that “Synergize” better

And finally, let’s talk about choosing complementary skills, rather than somewhat redundant ones.

For the most part, if you learn two data science languages, there’s going to be a lot of redundancy. Overlap.

Learn data wrangling in Python? Great. Now you need to learn the same procedures in a different language.

And the same thing with data visualization.

Learn how to make a scatterplot in Python … and now you need to learn how to make a scatterplot in R.

If you learn two or more data science languages, there are simply a lot of redundant techniques, which is a bit of a waste.

What’s the alternative?

You could learn complementary skills.

Instead, Learn Skills that “Synergize”

Ideally, you want things that “synergize” together.

Ok, ok.

I know that when I use the word “synergize,” I sound like a douchy 30-year-old McKinsey consultant on his 3rd no-whip, oat-milk latte, but hear me out.

This concept is actually good.

Two or more things are synergistic when they interact to “produce a combined effect greater than the sum of their separate effects” [Source: Google].

Ideally, this is what you want.

You want skills that will work well together.

As noted previously, precisely the opposite happens when you learn two or more data science languages. When you learn multiple data science languages, there ends up being a lot of redundant learning.

Instead, you want to learn things that will complement each other.

You want skills that make the new whole more valuable than the sum of the parts.

So after you learn data wrangling, data visualization, and data analytics in one language, it’s probably better to learn machine learning and other valuable topics instead of learning data wrangling, data visualization, and data analytics in a new language.

Again, your time is limited, and there’s a lot to learn.

You’ll get the best ROI by learning the foundations in one language, and then learning synergistic skills in that same language that will make you even more valuable overall.

But, There Are Exceptions

Before I have people sniping at me from the comments section, let me add that there might be exceptions.

There may be environments where you need to know both (I think this is rare, but it’s possible).

Maybe your focus is on something other than maximizing the value of your skills and your personal ROI.

Or, you may be a masochist who likes learning programming languages just for kicks.

Ok.

Knock yourself out.

But at this point, I think for most people, it’s better to choose one data science language and focus on learning complementary skills in that language.

(And as I mentioned last week, for most people, I recommend that you learn Python.)

What do you think?

What are your thoughts?

Do you agree that a person should choose one data science language?

Or do you think it’s better to learn multiple languages?

Do you have a personal story where you needed both R and Python?

Let me know your thoughts in the comments.

Joshua Ebner

Joshua Ebner is the founder, CEO, and Chief Data Scientist of Sharp Sight. Prior to founding the company, Josh worked as a Data Scientist at Apple. He has a degree in Physics from Cornell University. For more daily data science advice, follow Josh on LinkedIn.

4 thoughts on “Why You Should Learn ONE Data Science Language”

  1. I don’t disagree with the point, and I’m leaning towards the same conclusion. But I feel that Python (and to some extent R) is not a single language to begin with, so you are kinda grappling with the problem of learning more than one language at a time. You can know a lot of the standard library in Python but that’s not going to help with numpy, Pandas, etc. There’s a lot of potential confusion within Python: Matplotlib has two different APIs, there’s random and numpy.random, it’s easy to forget what’s in scipy versus what’s in numpy, etc.

    I still find R much easier than Python for rectangular data and visualization, and Python easier than R for working with text and machine learning. Given the choice, I’d move between languages based on what I’m trying to do. So it’s complementary across languages, based upon the strengths of the language. Of course, if you have to work in a single language, then it makes sense to learn everything in that language, which increasingly means Python. I’m just hoping seaborn matures and waiting on whether Polars gains traction.

    • When I say R, it’s a shorthand for “R data science stack” and when I say Python, it’s shorthand for “Python data science stack.”

      Obviously, both languages are huge (Python in particular).

      Your point is taken though … even within these languages, there needs to be some clarity and focus about what exactly you should learn.

      I’ve written about that plenty over the last 9 years.

  2. I get what you’re saying. I agree that your concentration should probably be in one language, but throughout my career I was enthusiastic about learning new languages. Different languages have different approaches to problem solving that you can often adapt to your main language. I definitely think learning multiple languages made me a better problem solver and programmer. Also, you have to keep up with times. I started out in the 80s and if I hadn’t learned new languages as I went, I would have been left behind a long time ago. Finally, I know you’re talking about data science, but if you’re a web developer these days, you need to employ multiple languages in a data driven app anyway (C# or java, JavaScript, HTML, CSS, and so on).

    • That’s fair. Over time, you may need to learn more languages.

      But particularly for beginners, I’m beginning to advocate for “focus on one.”

      Obviously, some people have different needs and circumstances, so YMMV.
