Why Python is the Best Data Science Language (in 2023)

The first question that aspiring data scientists ask me is “what is the best data science language … which should I learn?”

And almost always, this is framed as a decision between R and Python.

If you’re wondering which language you should learn first …

And especially if you’re interested in artificial intelligence.

Then keep reading.

The Eternal Debate: R vs Python

The debate between R and Python in the data science community has been raging for quite some time.

And I myself have weighed in more than a few times.

In the past, I’ve been somewhat ambivalent, and my recommendation has been “it depends.” I even wrote a comparison of R vs Python a few years ago.

As I wrote in that piece, both languages have strengths and weaknesses.

[Image: A table that compares R vs Python as data science programming languages.]

For example, Python is typically better for software development generally. So if you learn Python for data science, then it will probably be easier to add on some software development skills as well, which are complementary with data science skills.

On the other hand, R has historically been better at data visualization, data manipulation, and data analytics. I mean this in the sense that the code to do these things (specifically: ggplot2 and dplyr) is much easier to learn, read, and debug.

Those things are still sort of true, but there have also been some developments in the last year or two that have made me favor one over the other.

That language is Python.

Why I’m Moving to Python

Long-time readers of the Sharp Sight blog might be surprised.

The truth is, I really love R.

I’ve loved R for years.

It’s a great language, and a joy to use.

BUT …

There are a few major reasons that I’m now favoring Python for data science:

  1. Python is the Lingua Franca of Modern AI
  2. Python now has a great toolkit for data visualization (Seaborn)
  3. Python is more popular, which means more opportunities

Let’s take each of these in turn.

Python is the Lingua Franca of Modern AI

Let’s start with the big one.

Broadly speaking, Python is now the lingua franca of modern AI.

Now, there’s a lot of nuance around this.

To be a little more accurate, Python is the lingua franca of modern machine learning.

This is especially true at more basic to intermediate levels of machine learning.

Obviously, there are many instances where other languages are preferred.

In production environments, other languages might be used to implement ML systems … particularly in “big data” systems that need to work with very large volumes of data under tight time constraints.

Some research environments might also favor languages other than Python.

But, broadly speaking, most machine learning and deep learning code that you’ll work with as a student or a standard data scientist in business will be Python code.

AI is really exciting right now

I’ll add here why this is important.

For years, I’ve been saying that AI will change everything.

I’ve been saying that as a civilization, we’ll build little bits of intelligence into almost everything.

That AI is likely to be as transformative to civilization as the invention of the internal combustion engine.

Well, I was right.

And although there were quite a few years recently where young people seemed more interested in trading sh*tcoins than learning ML, it seems that we’re finally at an inflection point.

I spent quite a bit of time recently playing with the newest generative AI tools like Stable Diffusion.

Even as a person who’s been interested in AI and machine learning for at least a decade, I am absolutely fascinated. It’s the closest thing that I’ve ever seen to magic.

And let’s also remember ChatGPT, which has captured the imaginations of millions of people in only a few months since it was released in late 2022.

It’s all stunning to see, even for a data enthusiast like me.

And it’s early days.

We’re probably going to see many more developments in the next few years and many, many opportunities to create valuable AI-based products.

It’s a fascinating moment in the data science and machine learning industry.

Want to get involved with AI and machine learning?

You almost certainly need to learn Python.

Python now has a great toolkit for data visualization (Seaborn)

Let’s switch gears.

Although machine learning is really exciting, there are a lot of other parts of data science that are equally important.

Maybe more important at lower or intermediate levels.

I’ve written about this in the past.

All the way back in 2016, I wrote a blog post titled “Stop Trying to Jump to the Sexy Stuff First.”

In that post, I cautioned students against jumping immediately to “sexy” topics like AI and machine learning.

Instead, I suggested that students should focus on foundations, like data visualization, data wrangling, and data analysis.

As I’ve written many times, data visualization is critical for almost every step of the data science process.

… from data cleaning, to exploration, to reporting.

Data visualization is even necessary for the sexy topics like ML.

You need data visualization.

The big issue historically has been that Python data visualization tools sucked.

I hate to be an a$$hole about this, but my opinion has been that Python data visualization tools have been hard to learn, hard to use, and the code has been hard to debug or share. A mess.

All that changed in 2022, though.

Seaborn is now “ggplot2 for Python”

In September of 2022, a team of developers and data scientists released a new version (version 0.12.0) of the Seaborn package.

For those of you who are unfamiliar, Seaborn is a data visualization package for Python.

In the past, going back several years, I’ve recommended Seaborn as an alternative to Matplotlib.

I’ll spare you my full thoughts, but I think Matplotlib is difficult to use, difficult to learn, and difficult to debug. And many Matplotlib plots are fugly.

So, I’ve recommended Seaborn as an alternative.

For years, Seaborn has been easier to use and easier to learn. It also makes plots that look pretty good.

But, this new version is much, much better.

Specifically, the new Seaborn contains a new toolkit, the Seaborn Objects API.

This new Seaborn toolkit is like ggplot2 for Python.

Based on the Grammar of Graphics, it enables a Python data scientist to create beautiful, insightful data visualizations with a syntax that’s simple, modular, and very powerful.

Put simply, I love it.

For so many years, ggplot2 was one of the main reasons that I preferred R over Python.

Now that the Seaborn team has created something like a “ggplot2 for Python,” I’m actually more inclined to switch to Python, and recommend that you switch too.

As I said: data visualization is very important for data science.

And the new Seaborn package gives Python users a powerful yet simple toolkit for doing data visualization.

In my opinion, it really tips the scales.

Python is More Popular

Finally, let’s discuss popularity.

To put it simply, Python is more popular than R.

I think that this has been true for a while, and I’ve known it to be true anecdotally.

But recently, I saw more proof.

In 2022, Kaggle released the annual version of their Machine Learning and Data Science Survey.

In that report, they noted that Python is the most popular data science language:

[Image: A chart from a Kaggle survey report that shows that Python is the most popular data science language in 2022.]

Anaconda published a similar report in 2022, based on survey data they collected from data scientists and data science enthusiasts.

In that report, they also showed that Python is the most popular data science language:

[Image: A horizontal 100% stacked bar chart of the popularity of different data science programming languages, showing that Python is the most popular data science language in 2022.]

Now to be fair, the data in these reports may have some bias. (Leave a comment to explain how you think they could be biased.)

However, these findings are consistent with other surveys that I’ve seen in the past.

They are also consistent with my personal experience, looking at data science job postings.

The truth is, right now, Python is more popular than R.

It’s debatable whether or not Python is the better data science language. To be honest, I love R’s Tidyverse. It’s a joy to use.

But Python is more popular.

That means more job opportunities and more resources to help you learn.

But, It Depends

In this post, I’ve laid out a high-level case that Python is the best data science language to learn right now, and in particular, it’s probably better than R.

Python is the better language for AI and machine learning.

Python now has a data visualization toolkit – Seaborn Objects – that rivals the best toolkit in R. This fixes one of Python’s biggest historical drawbacks (historically, Python data visualization tools were bad).

And, Python is simply more popular, which leads to more opportunities.

But, as always, it depends.

As I frequently ask aspiring data scientists: “Who are you and what are your goals?”

R is still an excellent language, and it might be a better choice for some people. For example, R is still preferred in some academic environments, or for some specific business tasks.

Additionally, some companies simply let you choose. I talked to a recruiter at Meta a couple of years ago, and they said that an applicant could choose which language they wanted to use, both in an interview and on the job (you could choose between R and Python).

So, there’s a strong case for Python right now, but it’s not always clear-cut for every person.

I recommend Python for most people in 2023, but R (or even a different language) may be a better choice depending on your own personal circumstances.

What do You Think?

What do you think?

Is Python the better choice?

Do you prefer R?

Why?

Let me know your thoughts in the comments …

Joshua Ebner

Joshua Ebner is the founder, CEO, and Chief Data Scientist of Sharp Sight. Prior to founding the company, Josh worked as a Data Scientist at Apple. He has a degree in Physics from Cornell University. For more daily data science advice, follow Josh on LinkedIn.

6 thoughts on “Why Python is the Best Data Science Language (in 2023)”

  1. A big part of R is that most functions operate on dataframes. How does that work in Python? And you mention the tidy functions: are they coming to Python too? Another consideration is the sometimes irritating heterogeneity of the syntax of R functions. That may be better in Python?

    • Python also has dataframes … a lot of data science work in Python revolves around them (although Python also has Numpy arrays and some other structures).

      The “Tidyverse” functions from R have good counterparts in Pandas, if you know what you’re doing. Most people don’t know about them or don’t use them properly.

      I totally disagree about the heterogeneity of syntax of R functions … as long as you’re using the Tidyverse functions (and you *should* be) then most of the syntax is highly homogeneous and consistent.

  2. The comparison between Python and R is an apples to oranges comparison.
    R is designed by statisticians for statisticians. It is meant to do data/statistical analysis and it does this well.
    Python was designed for general purpose programming. It works well for this. Like you, I’ve used both. They are complementary tools for “data science” work.

    The reason for Python’s popularity is that there are more software developers than statisticians. Developers want to create software and Python is a better tool for that.

    Is Python better for data science work? I would say no, unless a piece of software is the goal.

    • Nope.

      I’m talking about “R data science stack” vs “Python data science stack”, while also taking into consideration additional things they are good at that might be complementary for a data scientist.

      These toolkits are highly comparable (e.g., dplyr vs Pandas, Matplotlib/Seaborn vs ggplot2, etc).

      It’s pretty explicit in my “Quick and Dirty Comparison” chart.

  3. R is preferred by the statistics community over Python. It was designed by statisticians for statisticians. It was designed for data analysis and does it really well. People going into DS with a statistics degree will know R.
    Python is a great intro language for programming, and it can serve well for scientific computing. Yes, it does data analysis, but it wasn’t designed for that. It was designed for general purpose programming. If you want to develop software, that is the language to use. Python has become popular not because it can analyze data better than R, but because it is more familiar to software developers who can use it for production purposes. Developers (CS grads) also outnumber statisticians by a lot.

    • Hmm, I mostly disagree with your framing.

      R was *originally* designed for statistics, and Python was *originally* designed for software engineering.

      But both R and Python have suites of packages that are specifically designed for analytics and data science.

      That’s what I’m talking about here.

      Remember: this is a data science blog, so when I say R, it’s shorthand for “The R data science stack” and when I say Python, it’s shorthand for “The Python data science stack.” Obviously, there are exceptions, and I also point out in this and other blog posts the additional, complementary things that these languages do beyond the data science libraries that enhance their usefulness to data scientists.

