R vs Python … which to learn for data science

One of the most common questions I get from data science hopefuls is “which programming language should I learn?”

My general advice is “it depends.”

Or to clarify my response, I like to ask the question “who are you, and what are your goals?” The programming language you use depends on your background and your long-term goals.

Having said that, there are typically only two major options that I think most new data science students should consider: R and Python.

The question is, which is better?




In this blog post, I’ll walk you through the pros and cons of R vs Python for data science. We’ll start with an analysis of the pros and cons of R, and then later, I’ll discuss the pros and cons of Python.

At the bottom of the post, I’ll summarize my recommendations.

First, let’s start with R.

Why choose R for data science

If you’re interested in becoming a data scientist, R has some distinct advantages.

Let’s talk about the cases where R is the best choice (versus Python).

Choose R if you have limited programming experience

If you have limited programming experience, I would probably recommend learning R first.

This might seem counter-intuitive if you’ve read about the benefits of Python. Python is commonly lauded as a very easy-to-learn programming language. In particular, experts mention Python as a great programming language for people without any prior programming experience.

Fair enough. That’s probably true, if you want to become a software developer. For programming and software development, I think that Python is a great choice for your first programming language.

But data science and software development are not the same thing.

Python might be great for beginning software developers, but I think R is much better for beginning data scientists.

Let me explain why.

It comes down to a subtle difference between how data scientists use programming languages and how software developers use them.

For beginning data scientists, “programs” should look like scripts, not software. This is a subtle difference, but it’s important.

Here’s what I mean. Let’s say you’re working with a dataframe. Here, I’ll show you a dataframe in R, the Auto dataframe:

library(ISLR)
data("Auto")

If you’re not familiar, a dataframe is data in a row-and-column format, sort of like an Excel spreadsheet.

In this particular dataframe, there is a variable called weight, which is the weight of each car in pounds.

Let’s say that you want to create a new variable called weight_kg, which is the weight in kilograms.

There’s more than one way to do this. One way is to create a for-loop that cycles through all of the values and computes the value of the new variable. That’s sort of the hard way to do it.

A different way is to just use a pre-made function that will automatically compute the values of the new variable:

library(dplyr)
mutate(Auto, weight_kg = weight * 0.45)  # roughly 0.45 kg per pound

This is much easier to do. You just need to know the right tool to use (in this case, the mutate() function from dplyr, one of the packages in R’s Tidyverse).
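For comparison, here is a minimal sketch of the same loop-versus-function contrast in Python with pandas. The small `cars` table is made up for illustration (it is not the ISLR Auto data), and the 0.45 factor is the same rough pounds-to-kilograms conversion used above.

```python
import pandas as pd

# Hypothetical stand-in for the Auto data: weights in pounds.
cars = pd.DataFrame({
    "name": ["car_a", "car_b", "car_c"],
    "weight": [3504, 3693, 3436],
})

LB_TO_KG = 0.45  # rough conversion factor

# The hard way: loop over every value and build the new column by hand.
weight_kg_loop = []
for w in cars["weight"]:
    weight_kg_loop.append(w * LB_TO_KG)

# The easy way: one call to a pre-built tool, similar in spirit to mutate().
cars_kg = cars.assign(weight_kg=cars["weight"] * LB_TO_KG)
print(cars_kg)
```

Both approaches compute the same numbers. The second just hands the bookkeeping to a tool that already exists, and it returns a new dataframe instead of modifying `cars`.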

When you know the right functions and toolkits to use, data science “programs” become more like data processing “scripts.” You end up calling pre-built tools in sequence to process the data: input the data, clean the data, analyze the data.

What I’m getting at is that you should use pre-built functions and tools to perform these tasks. You should not try to build your own tools to accomplish these tasks.

That means that you shouldn’t need many traditional programming and software concepts. Ideally, you should avoid things like for-loops, classes, object oriented programming, and other software development concepts.

Having said all of that, I think that R is better than Python because R’s data toolkit is better developed and easier to use.

Specifically, I think that R’s toolkit requires less understanding of software development concepts. To be clear, Python does have pre-built data toolkits, just like R does. However, Python’s tools and syntax still have a “software dev” feel to them; they feel more reliant on software development concepts (like for-loops, classes, object orientation, etc). For example, if you skim through a few Python data science books, you’ll still see for-loops, class declarations, and other items that would be challenging to a beginner without a background in software development or computer science.

In contrast, R has a very clean set of tools for performing data processing tasks:

  • ggplot2 for data visualization
  • dplyr for data manipulation
  • lubridate for working with dates and times
  • stringr for working with strings
  • Etc …

In many cases, you can use these R tools without any knowledge of software development or computer science concepts.

Again, I’ll reiterate that Python has data processing libraries too. I’m not denying that. The difference is that the syntax for data science in R feels less like software development, and more like a toolkit for writing data processing scripts. R’s Tidyverse is much easier to learn for data science beginners with limited programming experience.

R’s syntax is better for data science tasks

A related point is that R’s syntax is just a little simpler when performing many data science tasks.

Syntactically, the tools of R’s Tidyverse are very well designed. The functions and tools are very well named. Importantly, this makes them easy to use and easy to remember.

They are also designed in such a way that you can use them without running into small and subtle errors.

Let me show you an example.

Let’s say that you want to remove a variable from a dataframe. (Here, we’ll use the Iris dataframe as an example because you can find it in both R and Python.)

Let’s take a look at the syntax for removing a variable for both R and Python:

Python

import seaborn as sns
import pandas as pd

iris = sns.load_dataset('iris')
iris.drop(['sepal_length','species'], axis = 1)

R

library(dplyr)
# R's built-in iris names these columns Sepal.Length and Species
select(iris, -Sepal.Length, -Species)

Specifically, we’re comparing the final lines for each programming language; the “drop” syntax for Python and the “select” syntax for R.

Personally, I think that the R syntax is better and “cleaner” in some subtle but important ways.

Let’s first talk about the Python syntax. To remove a variable from a Python dataframe, we use the drop() method. That part is simple. drop() is well named and easy to remember. But after that, it gets a little more complicated.

The first argument of the drop() method is a list of the variables that we want to remove. A small problem here is that the variable names must be inside of brackets. Syntactically, there’s a reason for this (this is a ‘list’ data structure). But no matter the reason, the fact that you need to use brackets around the variable names introduces a subtle bit of complexity that can be confusing for beginners. Often, a beginner will forget to use those brackets and will be very confused when the code doesn’t work.

Additionally, the variable names need to be enclosed in quotation marks. This is a very subtle bit of syntax, but if you don’t enclose the variable names in quotation marks, you’ll get an error that says “name 'sepal_length' is not defined.” This is extremely subtle and it’s the sort of thing that a beginner will miss.

Finally, to drop the variables with Python’s drop(), we need to specify an “axis.” For a beginner, this raises the question: what the hell is an axis and why is it important?

To be honest, dataframe axes aren’t that hard to understand. However, my point is that these little syntactic quirks are the things that can confuse a beginner. And this is just one example. I can give you dozens of other examples where the Python data science syntax is confusing like this.
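To make the axis point concrete, here is a small sketch using a made-up three-column table rather than the real iris data. In pandas, axis=1 means “operate on columns” and axis=0 (the default) means “operate on rows.” The drop() method also accepts a columns= keyword, which says the same thing without mentioning an axis.

```python
import pandas as pd

# Tiny hypothetical table standing in for iris.
flowers = pd.DataFrame({
    "sepal_length": [5.1, 4.9],
    "petal_length": [1.4, 1.4],
    "species": ["setosa", "setosa"],
})

# axis=1 tells drop() that the labels are column names.
by_axis = flowers.drop(["sepal_length", "species"], axis=1)

# Equivalent call that names the columns directly, no axis needed.
by_keyword = flowers.drop(columns=["sepal_length", "species"])

# With the default axis=0, the same labels are looked up among the row
# labels (0 and 1 here), so pandas raises a KeyError.
try:
    flowers.drop(["sepal_length", "species"])
except KeyError as err:
    print("KeyError:", err)
```

The first two calls return the same one-column dataframe. The third shows the kind of error a beginner is likely to hit when the axis is left out.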

Let’s contrast this with the R version.

To drop a variable in R, we’ll use the select() function from dplyr.

I think the worst thing about this syntax is that it’s called “select.” We need to use the select() function to drop a variable (it would probably be easier to remember if we could use a function called drop().)

However, once you remember that you need to use the select() function to remove a variable, the syntax is pretty straightforward. Inside the select() function, the first argument is the name of the dataframe. The next set of arguments is the set of variables that you want to drop with minus signs in front of them, separated by commas.

This syntax feels much more intuitive to me. Yes, you need to remember to use the minus sign, but that feels intuitive to me. In a sense, we are “subtracting” the variables from the data. You also need to remember to separate the variables by commas. But again, that feels intuitive. Using the commas inside of select() feels like listing things in a sentence. Finally, in the R syntax, we don’t have to specify anything about an “axis,” like we did in the Python syntax.

Overall, R data science syntax feels intuitive. It almost feels like writing pseudocode. It’s easy to remember, easy to write, and easy to read. I strongly prefer the syntax for R’s Tidyverse over the data tools of Python.

Choose R if you want to focus on “analytics”

Readers here at the Sharp Sight blog will know that we think that data analysis is a valuable and highly under-appreciated skill.

More often than not, a lot of junior-level data science simply amounts to hard-core data analysis. Lower level data science is like data analysis with power tools (such as R or Python).

At more advanced levels, data science can be more complicated than mere data analysis, but at lower levels, data analysis will be a large amount of your work.

That being the case, it pays to be able to do data analysis. You need to be able to explore datasets. You need to be able to clean datasets. You need to be able to find insights in data.

Traditionally, over the last few decades, the tool of choice for this was Microsoft Excel. More recently, the “data analysis” field evolved and became more advanced. Somewhere in the mid-to-late 1990s, “analysts” began using power tools like SQL, SAS, and SPSS.

As the field evolved, data analysis departments started calling themselves “analytics” departments. People in these departments used the data “power tools” of the time (SQL, SAS, SPSS) to create business value from larger amounts of data than was previously possible with Excel alone. In some sense, that’s all “analytics” was … it was just “data analysis” with power tools.

In many ways, “data science” is just the next evolution of analytics, which was just an evolution of data analysis. That is to say, “data science” is often just a really advanced version of data analysis.

R is excellent for analytics and data analysis

If you find yourself in an environment where much of the data work is just “hard-core data analysis” (instead of machine learning and advanced topics), I strongly recommend that you use R. Specifically, I recommend using the tools of the Tidyverse.

To put it simply, R’s Tidyverse packages (ggplot2, dplyr, tidyr, stringr, lubridate) are arguably the best set of tools for manipulating, visualizing, and analyzing data on the market. If your work consists mostly of creating large reports and ad-hoc data analyses, R and the Tidyverse are exceptional.

The reason why (as noted elsewhere in this post) is that the syntax for wrangling and analyzing data in the Tidyverse is superior. R’s Tidyverse syntax is easy to learn, easy to remember, and easy to use. Specifically, the various functions of the Tidyverse are very well named. You don’t have to remember a complicated function name with the Tidyverse. You don’t have to remember arcane syntax to get things done.

For example, if you want to “filter” your dataset and create a sub-set, there’s a simple function: filter(). If you want to select a specific column from a dataset, you can use the well-named dplyr function select().

Moreover, the functions of the Tidyverse are highly modular. Every function does one thing, and it does that thing well. This modularity makes the functions easier to learn and remember. It also makes them work like building blocks. You can take many simple Tidyverse functions and “connect” them together to create a more complicated process. Using these modular functions is almost like snapping together little Lego building blocks. If you know how to put together simple pieces, you can perform analyses that are very complicated.
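The building-block idea is easier to see in code. As a rough illustration in Python (this is pandas, not the Tidyverse code itself), single-purpose methods can be chained in a similar way. The `sales` table and its column names are invented for the example, and the comments note the dplyr verb each step loosely corresponds to.

```python
import pandas as pd

# Hypothetical data for illustration only.
sales = pd.DataFrame({
    "region": ["east", "east", "west", "west", "west"],
    "units": [10, 0, 7, 3, 5],
    "price": [2.0, 2.0, 3.0, 3.0, 3.0],
})

summary = (
    sales
    .query("units > 0")                                 # keep rows with sales (like filter())
    .assign(revenue=lambda d: d["units"] * d["price"])  # add a column (like mutate())
    .groupby("region", as_index=False)                  # group the rows (like group_by())
    .agg(total_revenue=("revenue", "sum"))              # one row per group (like summarise())
)
print(summary)
```

Each step does one thing, and the output of one step becomes the input of the next. That is the same snap-together pattern described above.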

To put it simply, R is excellent for analyzing data and getting things done in an analytics environment. For analytics, R is superior to Python, in my opinion.

Choose R if you have a background in statistics

If you come from a statistics background and you’ve used R in the past, I think R might be a better fit than Python.

I’ve encountered many former statistics students who have used R in the past, but haven’t really done any programming.

In that case, I highly recommend R.

First, I specifically recommend the Tidyverse dialect of R, because it’s so easy to learn and use.

Moreover, there’s a strong ecosystem of statisticians and statistics experts in the R world. Anecdotally, more statisticians seem to use R than Python. For example, when looking at statistics textbooks, you’ll find that if they use code, they often use R.

You’ll also find that almost every statistical tool or algorithm has been implemented in R. Therefore, if you want to use some rare statistical techniques in your data science work, you’ll probably have an easier time finding those tools in R than Python.

Why choose Python for data science

Although I strongly recommend R for many beginning data science students, it’s not always the best choice.

For some people, Python is the best language to learn for data science. Python may be a better choice than R for people with specific backgrounds, goals, and interests.

Let’s talk about some cases where Python might be a better choice than R.

Choose Python if you have a background in software development or computer science

If you have experience in software development or computer science, I think that Python may be a better choice.

Here, I have in mind people who’ve learned basic programming and programming principles. For example, if you took a computer science class in college, or you were a CS major, Python may be a better choice than R. Similarly, if you come from a web development background, Python may be a better choice.

Now, I want to make it clear again that data science is not the same thing as software development. Frequently in data science, you’ll see fewer programming structures like for-loops, while-loops, and control structures. You’ll see more data manipulation or data visualization tools. More data wrangling. More charts and graphs. As I mentioned earlier, data science code often looks less like “software,” and more like a data analysis “script.” Of course, it’s not perfectly clear cut, but at entry levels, a lot of data science looks like data analysis scripting. It’s typically at more advanced levels where data scientists start creating proper software.

That being said, if you have programming experience, you might still feel more comfortable with Python.

Part of the reason for this is that in my opinion, Python is better for software.

Choose Python if you want to build software

I’ve already said that I think R is superior when you’re creating “data analysis scripts.” If you want to slice and dice some data, wrangle data, or visualize data, I think R’s Tidyverse packages are the best.

But if you want to build software systems, I think that Python is actually the better choice.

Writing software is where Python shines. For software, writing Python code just feels more effortless. As many experts have noted, writing Python code almost feels like you’re writing pseudocode.

Moreover, it’s commonly noted that Python is a better “all purpose” programming language. When discussing this, people frequently point out that Python is used more often by companies in production systems. People frequently comment that Python is more “production ready” and “all purpose” compared to R.

To be clear, I’m not saying that you can’t write software in R. I’m not saying that you can’t build production systems in R. I’m just saying that when a production system is necessary, many people prefer to build it in Python. Therefore, if you plan to create software systems as a data scientist, Python may be a better choice than R.

Choose Python if you want to focus on machine learning

If you want to focus on machine learning in the long run, Python may be the best choice.

Now, I want to be clear: R does have a machine learning ecosystem. In particular, the caret package is well developed. caret has the ability to execute a wide variety of machine learning tasks. For example, with R’s caret package you can create regression models; you can create support vector machines; you can create decision trees (both regression and classification); you can perform cross validation. R’s machine learning ecosystem is fairly well developed.

Having said that, Python comes out ahead here. Python’s scikit-learn provides a clean and easy-to-read syntax for implementing a variety of different machine learning techniques.

A big benefit here is just the simplicity of the scikit-learn syntax when comparing it to caret. R’s caret syntax feels a little clumsy sometimes. In particular, caret doesn’t integrate well with the Tidyverse ecosystem of R packages. Related to this, R’s tools for machine learning often produce outputs that are difficult to work with in the context of R’s data science ecosystem. In contrast, Python’s scikit-learn syntax feels a little better integrated into the broader Python ecosystem.
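As a minimal sketch of what that scikit-learn syntax looks like, the example below fits a decision tree to the copy of the iris data bundled with scikit-learn and scores it with 5-fold cross validation. The model choice and its parameters are arbitrary picks for illustration, not a recommendation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Load the copy of iris that ships with scikit-learn (no download needed).
X, y = load_iris(return_X_y=True)

# Estimators follow the same pattern: construct the model, then fit it.
model = DecisionTreeClassifier(max_depth=3, random_state=0)

# 5-fold cross validation returns one accuracy score per fold.
scores = cross_val_score(model, X, y, cv=5)
print("mean accuracy:", scores.mean())

# Fit on all of the data and predict the class of the first flower.
model.fit(X, y)
print(model.predict(X[:1]))
```

Swapping DecisionTreeClassifier for a different estimator would leave the rest of the script unchanged, which is a large part of why the syntax feels consistent.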

I think that Python also has better resources for studying machine learning. Although two of my favorite machine learning books use R code, I think that there is a broader set of books for machine learning that use Python.

All of that is to say, if you want to focus on machine learning, I think Python may be the better programming language.

Other factors that might influence your decision

Now that we’ve covered the strengths and weaknesses of R and Python, let’s talk about some other factors that might influence your decision.

Pick the language your friends and associates use

This one is a big one.

If you already have friends or associates that use either R or Python, this might be a good reason to choose that particular language.

The reason for this should be obvious: you can learn a lot from people when you have direct contact with them.

So for example, if a good friend or associate is a highly skilled Python programmer, it might be a good reason to choose Python.

To put it simply, having a close community of people that you can learn from might trump the strength & weakness calculus that I discussed above.

Choose the language that your “dream company” uses

Similar to the case where your friends use one particular language, you might want to choose a particular language if you have a specific career goal.

Specifically, if you want to work for a particular company, and you find out that they use a particular language, that could be a major influence on your decision.

For example, if you know you want to work at Google, and you find out that your ideal “team” at Google uses Python, that may be a reason to start learning Python.

Having said that, here’s a word of caution: don’t get your heart set on one particular company. In the short run, it can be difficult to get a dream job at the exact company of your choice. Landing a “dream job” takes hard work. You need the right skill set, and you often need the right network of friends, which will be tough to build.

So, targeting a specific company might influence your decision. On the other hand, it might be smart to keep your options open, just in case. Don’t let this be the only reason you choose one language over another.

Which is best: R vs Python

So, which should you choose, R or Python?

I think there are pros and cons for both, so the ultimate answer is “it depends.”

R and Python are both great for data science, but they excel at different things.

Where R excels

I think that R and the Tidyverse are far superior for data visualization and analytics (i.e., finding insights in data). R also comes out ahead for most true beginners. If you’ve never done any programming or data science in the past, R is probably the better option.

Where Python excels

On the other hand, Python – while being inferior for data visualization and analytics – is superior for machine learning. In my opinion, Python is also better for building software.

Where does that leave us? It depends. Who are you and what are your goals? If you want to be really good at data visualization, I think that R’s ggplot2 is the best tool around. If you want to specialize in analytics and “finding insights,” I think that R is superior.

If you want to be a machine learning specialist, Python and scikit-learn are probably preferable.

You should probably learn them both, BUT …

Very quickly, I want to address a point raised by several other smart people.

This is not strictly an either/or decision. There is a third option: learn both.

I think that in the long run, a top-performing data scientist should know both R and Python. They are good at different things, so if you want to have a full toolkit, you should consider learning them both.

But … learn only one at a time

Having said that, you should focus on only one language at a time.

If you try to learn both at the same time, it will probably take longer. Dividing your attention reduces your focus. You’ll make much faster progress if you focus intensely on one language at a time.

That being said, you still should choose one right now, which leads me to my final point.

Pick something, and get to f*cking work

We’ve talked about the strengths and weaknesses of R vs Python, but now I want to bring up something that’s more important than making the “right” choice.

Making a choice is the most important thing.

Don’t spend months trying to figure out the “best” language for you.

Take a week or two to think about the pros and cons. Ask a few friends or mentors what they think. Think about your short, medium and long-term data science goals.

But then pick something.

Pick something and get started.

Take action. Start mastering the data science skill set. Whether you choose R or Python, you’ll still be able to learn data visualization and data analysis. Whether you choose R or Python, you’ll still be able to learn machine learning.

It’s important that you don’t get paralyzed trying to decide on the best language. Some people waste months trying to decide between languages, and they end up wasting time that they could spend mastering data skills.

Ultimately, whether you choose R or Python, it’s more important that you pick something.

There are pros and cons to each language. Both have strengths and weaknesses.

But either way, both R and Python are pretty damn good. It’s hard to make a mistake with either one unless you have very specific goals in mind that would require one over the other.

Here’s what I recommend. Think about this blog post. Re-read it if you need to. If you have questions, send them to me at josh@sharpsightlabs.com.

Then give it a week to research and decide. After that, choose something. Get started.

The sooner you start, the sooner you’ll be prepared to actually work as a data scientist.

Sign up for R and Python tutorials

Whether you want to master R or Python, we can help.

Here at Sharp Sight, we teach data science.

And every week we publish data science tutorials to help you learn.

So if you’re interested in data science, sign up for our email list.

When you sign up for our email list, you’ll get free tutorials delivered to your inbox.

You’ll learn about data science in R, including ggplot2, dplyr, tidyr, readr, and the other packages of the Tidyverse.

You’ll also get tutorials about data science in Python, including tutorials about numpy, pandas, matplotlib, and scikit-learn.

Joshua Ebner

Joshua Ebner is the founder, CEO, and Chief Data Scientist of Sharp Sight. Prior to founding the company, Josh worked as a Data Scientist at Apple. He has a degree in Physics from Cornell University. For more daily data science advice, follow Josh on LinkedIn.

11 thoughts on “R vs Python … which to learn for data science”

  1. Great post! I generally use both in my role, but I stick to R for data analysis and Python for any sort of automation I need to accomplish. I tried using Python for analysis, but I wasn’t a fan.

    • Yeah, the Tidyverse syntax in R is exceptional for data analysis. It’s a bit of a personal preference, but I think the Tidyverse tools really outshine the Python tools for data visualization and data manipulation.

      There are some new Tidyverse tools coming online as well (i.e., the sf package and the forthcoming tidymodels package), so I think that R may get even better.

      But as you pointed out in your comment, Python has strengths too.

    • Since 2014, we’ve only published R tutorials …

      But starting soon in 2018, we will begin publishing Python data science tutorials as well.

  2. Thanks, installed the things you mentioned after I read this. I did notice the feather package is not available for 3.7 yet (feather loaded fine and works great in R) … that would be a fun mini tutorial when they finish it! Thanks!
    pip3.7 install scikit-learn
    pip3.7 install matplotlib
    pip3.7 install pandas
    pip3.7 install numpy

