The first question that aspiring data scientists ask me is “What is the best data science language, and which should I learn?”
And almost always, this is framed as a decision between R and Python.
If you’re wondering which language you should learn first …
And especially if you’re interested in artificial intelligence.
Then keep reading.
The Eternal Debate: R vs Python
The debate between R and Python in the data science community has been raging for quite some time.
And I myself have weighed in more than a few times.
In the past, I’ve been somewhat ambivalent, and my recommendation has been “it depends.” I even wrote a comparison of R vs Python a few years ago.
As I wrote in that piece, both languages have strengths and weaknesses.
For example, Python is typically better for software development generally. So if you learn Python for data science, then it will probably be easier to add on some software development skills as well, which are complementary with data science skills.
On the other hand, R has historically been better at data visualization, data manipulation, and data analytics. I mean this in the sense that the code to do these things (specifically: ggplot2 and dplyr) is much easier to learn, read, and debug.
Those things are still sort of true, but there have also been some developments in the last year or two that have made me favor one over the other.
That language is Python.
Why I’m Moving to Python
Long-time readers of the Sharp Sight blog might be surprised.
The truth is, I really love R.
I’ve loved R for years.
It’s a great language, and a joy to use.
There are a few major reasons that I’m now favoring Python for data science:
- Python is the Lingua Franca of Modern AI
- Python now has a great toolkit for data visualization (Seaborn)
- Python is more popular, which means more opportunities
Let’s take each of these in turn.
Python is the Lingua Franca of Modern AI
Let’s start with the big one.
Broadly speaking, Python is now the lingua franca of modern AI.
Now, there’s a lot of nuance around this.
To be a little more accurate, Python is the lingua franca of modern machine learning.
This is especially true at more basic to intermediate levels of machine learning.
Obviously, there are many instances where other languages are preferred.
In production environments, other languages might be used to implement ML systems, particularly in “big data” systems that need to work with very large volumes of data under tight time constraints.
Some research environments might also favor languages other than Python.
But, broadly speaking, most machine learning and deep learning code that you’ll work with as a student or a standard data scientist in business will be Python code.
AI is really exciting right now
I’ll add here why this is important.
For years, I’ve been saying that AI will change everything.
I’ve been saying that as a civilization, we’ll build little bits of intelligence into almost everything.
That AI is likely to be as transformative to civilization as the invention of the internal combustion engine.
Well, I was right.
And although there were quite a few years recently where young people seemed more interested in trading sh*tcoins than learning ML, it seems that we’re finally at an inflection point.
I spent quite a bit of time recently playing with the newest generative AI tools like Stable Diffusion.
Even as a person who’s been interested in AI and machine learning for at least a decade, I am absolutely fascinated. It’s the closest thing that I’ve ever seen to magic.
And let’s also remember ChatGPT, which has captured the imaginations of millions of people in only a few months since it was released in late 2022.
It’s all stunning to see, even for a data enthusiast like me.
And it’s early days.
We’re probably going to see many more developments in the next few years and many, many opportunities to create valuable AI-based products.
It’s a fascinating moment in the data science and machine learning industry.
Want to get involved with AI and machine learning?
You almost certainly need to learn Python.
Python now has a great toolkit for data visualization (Seaborn)
Let’s switch gears.
Although machine learning is really exciting, there are a lot of other parts of data science that are equally important.
Maybe more important at lower or intermediate levels.
I’ve written about this in the past.
All the way back in 2016, I wrote a blog post titled “Stop Trying to Jump to the Sexy Stuff First.”
In that post, I cautioned students against jumping immediately to “sexy” topics like AI and machine learning.
Instead, I suggested that students should focus on foundations, like data visualization, data wrangling, and data analysis.
As I’ve written many times, data visualization is critical for almost every step of the data science process.
… from data cleaning, to exploration, to reporting.
Data visualization is even necessary for the sexy topics like ML.
You need data visualization.
The big issue historically has been that Python data visualization tools sucked.
I hate to be an a$$hole about this, but my opinion has been that Python data visualization tools have been hard to learn, hard to use, and the code has been hard to debug or share. A mess.
All that changed though in 2022.
Seaborn is now “ggplot2 for Python”
In September of 2022, a team of developers and data scientists released a new version (version 0.12.0) of the Seaborn package.
For those of you who are unfamiliar, Seaborn is a data visualization package for Python.
In the past, going back several years, I’ve recommended Seaborn as an alternative to Matplotlib.
I’ll spare you my full thoughts, but I think Matplotlib is difficult to use, difficult to learn, and difficult to debug. And many Matplotlib plots are fugly.
So, I’ve recommended Seaborn as an alternative.
For years, Seaborn has been easier to use and easier to learn. It also makes plots that look pretty good.
But, this new version is much, much better.
Specifically, the new Seaborn contains a new toolkit, the Seaborn Objects API.
This new Seaborn toolkit is like ggplot2 for Python.
Based on the Grammar of Graphics, it enables a Python data scientist to create beautiful, insightful data visualizations with a syntax that’s simple, modular, and very powerful.
Put simply, I love it.
Ggplot2 was one of the main reasons that I preferred R over Python for so many years.
Now that the Seaborn team has created something like a “ggplot2 for Python,” I’m actually more inclined to switch to Python, and recommend that you switch too.
As I said: data visualization is very important for data science.
And the new Seaborn package gives Python users a powerful yet simple toolkit for doing data visualization.
In my opinion, it really tips the scales.
Python is More Popular
Finally, let’s discuss popularity.
To put it simply, Python is more popular than R.
I think that this has been true for a while, and I’ve known it to be true anecdotally.
But recently, I saw more proof.
In 2022, Kaggle released the annual version of their Machine Learning and Data Science Survey.
In that report, they noted that Python is the most popular data science language.
Anaconda published a similar report in 2022 based on survey data they collected from data scientists and data science enthusiasts.
In that report, they also showed that Python is the most popular data science language.
Now to be fair, the data in these reports may have some bias. (Leave a comment to explain how you think they could be biased.)
However, these findings are consistent with other surveys that I’ve seen in the past.
They are also consistent with my personal experience, looking at data science job postings.
The truth is, right now, Python is more popular than R.
It’s debatable whether or not Python is the better data science language. To be honest, I love R’s Tidyverse. It’s a joy to use.
But Python is more popular.
That means more job opportunities and more resources to help you learn.
But, It Depends
In this post, I’ve laid out a high-level case that Python is the best data science language to learn right now, and in particular, it’s probably better than R.
Python is the better language for AI and machine learning.
Python now has a data visualization toolkit, Seaborn Objects, that rivals the best toolkit in R. This fixes what was historically one of Python’s biggest drawbacks: weak data visualization tools.
And, Python is simply more popular, which leads to more opportunities.
But, as always, it depends.
As I frequently ask aspiring data scientists: “Who are you and what are your goals?”
R is still an excellent language, and it might be a better choice for some people. For example, R is still preferred in some academic environments, or for some specific business tasks.
Additionally, some companies simply let you choose. I talked to a recruiter at Meta a couple of years ago, and they said that an applicant could choose which language they wanted to use, both in an interview and on the job (you could choose between R or Python).
So, there’s a strong case for Python right now, but it’s not always clear-cut for every person.
I recommend Python for most people in 2023, but R (or even a different language) may be a better choice depending on your own personal circumstances.
What do You Think?
What do you think?
Is Python the better choice?
Do you prefer R?
Let me know your thoughts in the comments.