The scatter plot …

It’s so common that almost everyone knows how to make one in one way or another.

You see them in business, academia, media, news. Students use them.

Scatter plots are also extremely common in data science and analytics.

The scatter plot is everywhere, partially due to its simplicity and partially because of its incredible usefulness for finding and communicating insights.

As simple as it might be, if you want to master data science, one of your first steps should be mastering the scatter plot. It’s a fundamental technique that you absolutely need to know backwards and forwards.

In this blog post, I’ll show you how to make a scatter plot in R.

There’s actually more than one way to make a scatter plot in R, so I’ll show you two:

  1. How to make a scatter plot with base R
  2. How to make a scatter plot with ggplot2

I definitely have a preference for the ggplot2 version, but the base R version is still common. Because you’re likely to see the base R version, I’ll show you that version as well (just in case you need it).

Let’s get started. First, I’ll show you how to make a scatter plot in R using base R.

How to make a scatter plot in R with base R

Let’s talk about how to make a scatter plot with base R.

I have to admit: I don’t like the base R method. I think that many of the visualization tools from base R are awkward to use and hard to remember. I also think that the resulting visualizations are a little ugly.

Having said that, you’ll still see visualizations made with base R, so I want to show you how it’s done.

Let’s take a step-by-step look at how to make a scatter plot using base R:

Create the dataset

Here, we’ll quickly create a sample dataset.

library(tibble)  # tibble() comes from the tibble package

set.seed(55)     # make the random numbers reproducible
df <- tibble(x_var = runif(100, min = 0, max = 25)
             ,y_var = log2(x_var) + rnorm(100)
             )

And let’s print out the dataframe so we can take a look:

print(df)

Sample dataframe for creating a scatter plot in R.

As you can see, the dataframe df contains two numeric variables, x_var and y_var.

Plot the scatter plot with the plot() function

Next, we’ll plot the scatter plot using the plot() function.

plot(x = df$x_var, y = df$y_var)

A scatter plot in R created using base R.

Ok, let me explain how that code works.

We’re initiating plotting using the plot() function. Inside of the plot() function, the x = parameter and y = parameter allow us to specify which variables should be plotted on the x-axis and y-axis respectively.

Keep in mind though, when you call plot() this way, the x = and y = parameters expect vectors. You can’t simply give them the names of columns in a dataframe.

The variables we want to plot are inside of the dataframe df. Because of this, we need to access those vectors; we need to “pull them out” of the dataframe and tell the plot() function where to get them. To do this, we need to use the $ operator. The $ operator enables us to extract specific columns from a dataframe. So notice the syntax: df$x_var is basically getting the x_var variable from df, and df$y_var is basically getting the y_var variable from df.
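To make the extraction step concrete, here is a minimal sketch using a small, made-up dataframe. The name toy_df and its values are purely illustrative and are not part of the dataset above. It shows that the $ operator hands back an ordinary vector, which is what plot() needs.

```r
# Hypothetical toy dataframe, for illustration only
toy_df <- data.frame(x_var = c(1, 2, 3), y_var = c(2, 4, 6))

# The $ operator pulls a single column out as a plain vector
x_vec <- toy_df$x_var
is.vector(x_vec)   # TRUE

# Double-bracket indexing by name extracts the same column
identical(x_vec, toy_df[["x_var"]])   # TRUE
```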

Essentially, we’re extracting our variables from the dataframe using the $ operator, and then plotting them with the plot() function.
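Because plot() takes plain vectors, a few extra arguments are enough to dress up the chart. The sketch below rebuilds the sample data with base R only (a plain data.frame instead of a tibble, which is my substitution) and sets a title, axis labels, a point color, and a point shape. The label text and the color are arbitrary choices for illustration.

```r
# Rebuild the sample data with base R only (assumption: a plain
# data.frame stands in for the tibble used above)
set.seed(55)
x_var <- runif(100, min = 0, max = 25)
y_var <- log2(x_var) + rnorm(100)
df <- data.frame(x_var = x_var, y_var = y_var)

plot(x = df$x_var, y = df$y_var,
     main = "y_var vs. x_var",  # plot title (placeholder text)
     xlab = "x_var",            # x-axis label
     ylab = "y_var",            # y-axis label
     col = "red",               # point color
     pch = 19)                  # point shape: solid circle
```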

That’s basically it. You can do more with a scatter plot in base R, but as I said earlier, I really don’t like them. I strongly prefer to use ggplot2 to create almost all of my visualizations in R. That being the case, let me show you the ggplot2 version of a scatter plot.

How to make a scatter plot in R with ggplot2

As I just mentioned, when using R, I strongly prefer making scatter plots with ggplot2. By default, a ggplot2 scatter plot is more refined. It just looks “better right out of the box.”

Having said that, ggplot2 can be a little intimidating for beginners, so let’s quickly review what ggplot2 is and how it works.

What is ggplot2

ggplot2 is an add-on package for the R programming language. The focus of ggplot2 is data visualization. It enables R users to create a wide range of data visualizations using a relatively compact syntax.

Although the syntax seems confusing to new users, it is extremely systematic. The systematic nature of ggplot2 syntax is one of its core advantages. Once you know how to use the syntax, creating simple visualizations like the scatter plot becomes easy. Moreover, more advanced visualizations become relatively easy as well.

How ggplot2 works

The secret to using ggplot2 properly is understanding how the syntax works.

There are a few critical pieces you need to know:

  • The ggplot() function
  • The data = parameter
  • The aes() function
  • Geometric objects (AKA, “geoms”)

The ggplot() function is simply the function that we use to initiate a ggplot2 plot.

The data parameter tells ggplot2 the name of the dataframe that you want to visualize. When you use ggplot2, you need to use variables that are contained within a dataframe. The data parameter tells ggplot where to find those variables. Remember: ggplot2 operates on dataframes.

The aes() function tells ggplot() the “variable mappings.” This might sound complex, but it’s really straightforward once you understand. When we visualize data, we are essentially connecting variables in a dataframe to parts of the plot. For example, when we make a scatter plot, we “connect” one numeric variable to the x axis, and another numeric variable to the y axis. We “map” these variables to different axes within the visualization. The aes() function allows us to specify those mappings; it enables us to specify which variables in a dataframe should connect to which parts of the visualization. If this doesn’t make sense, just sit tight. I’ll show you an example in a minute.

Finally, a geometric object is the thing that we draw. When you create a bar chart, you are drawing “bar geoms.” When you create a line chart, you are drawing “line geoms.” And when you create a scatter plot, you are drawing “point geoms.” The geom is the thing that you draw. In ggplot2, we need to explicitly state the type of geom that we want to use (bars, lines, points, etc). When drawing a scatter plot, we’ll do this by using geom_point().

Ok. Now that I’ve quickly reviewed how ggplot2 works, let’s take a look at an example of how to create a scatter plot in R with ggplot2.

Example: how to make a scatter plot with ggplot2

Load the ggplot2 package

First, you need to make sure that you’ve loaded the ggplot2 package. This also assumes that you’ve installed the ggplot2 package. (If you haven’t installed the ggplot2 package, do that before running this code.)

library(ggplot2)
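If you’re not sure whether ggplot2 is installed, one common pattern is to check first and install only when needed. This is just a sketch. It assumes you have an internet connection and a CRAN mirror configured for install.packages().

```r
# Install ggplot2 only if it is not already available
# (assumption: a CRAN mirror is configured for this R session)
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2")
}

# Load the package so that ggplot(), aes(), and the geoms are available
library(ggplot2)
```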

Create a sample dataset

Next, we’ll need some data to plot.

We already created the dataframe, df, earlier in this post. If you’ve already run that code, you can skip this step. If not, here it is again, along with the library(tibble) call that makes the tibble() function available.

library(tibble)  # tibble() comes from the tibble package

set.seed(55)     # make the random numbers reproducible
df <- tibble(x_var = runif(100, min = 0, max = 25)
             ,y_var = log2(x_var) + rnorm(100)
             )

df

This code creates a simple dataframe with two variables, x_var and y_var.

Plot a scatter plot with ggplot

Now that we have our dataframe, df, we will plot it with ggplot2.

ggplot(data = df, aes(x = x_var, y = y_var)) +
  geom_point()

Scatter plot in R made with ggplot2.

Ok, we have our scatter plot. It’s pretty straightforward, but let me explain it.

We’re initiating the ggplot2 plotting system by calling the ggplot() function.

Inside of the ggplot() function, we’re telling ggplot that we’ll be plotting data in the df dataframe. We do this with the syntax data = df.

Next, inside the ggplot() function, we’re calling the aes() function. Remember, the aes() function enables us to specify the “variable mappings.” Here, we’re telling ggplot2 to put our variable x_var on the x-axis, and put y_var on the y-axis. Syntactically, we’re doing that with the code x = x_var, which maps x_var to the x-axis, and y = y_var, which maps y_var to the y-axis.

Finally, on the second line, we’re using geom_point() to tell ggplot that we want to draw point geoms (i.e., points).
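One way to see how these pieces fit together is to build the plot in two steps. In the sketch below, the object names p and p_points are my own, and the data is rebuilt as a plain data.frame so the block runs by itself. Treat it as an illustration of the layering idea rather than the only way to write it.

```r
library(ggplot2)

# Rebuild the sample data (assumption: a plain data.frame stands in
# for the tibble used in the post)
set.seed(55)
x_var <- runif(100, min = 0, max = 25)
y_var <- log2(x_var) + rnorm(100)
df <- data.frame(x_var = x_var, y_var = y_var)

# Step 1: data and variable mappings only. Printing this would draw
# the axes but no points, because no geom has been added yet.
p <- ggplot(data = df, aes(x = x_var, y = y_var))
length(p$layers)          # 0

# Step 2: adding geom_point() creates the layer that draws the points
p_points <- p + geom_point()
length(p_points$layers)   # 1

p_points
```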

That’s it. That’s all there is to it. The syntax might look a little arcane to beginners, but once you understand how it works, it’s pretty easy.

Having said that, there are still a few enhancements we could make to improve the chart. Let’s talk about a few of those.

Change the color of the points

To change the color of the points in our ggplot scatterplot to a solid color, we need to use the color parameter.

ggplot(data = df, aes(x = x_var, y = y_var)) +
  geom_point(color = 'red')

Scatterplot in R made with ggplot2, with red points.

Again, this is very straightforward. To do this, we just set color = 'red' inside of geom_point(). We do this inside of geom_point() because we’re changing the color of the points. (There are more complex examples where we have multiple geoms, and we need to be able to specify how to modify one geom layer at a time.)
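It may help to contrast a solid color with a mapped color. In the sketch below, setting color outside of aes() paints every point the same color, while putting color inside aes() ties the color to a variable. The object names p_solid and p_mapped are hypothetical, and mapping color to y_var is just an example I picked.

```r
library(ggplot2)

# Rebuild the sample data (assumption: a plain data.frame stands in
# for the tibble used in the post)
set.seed(55)
x_var <- runif(100, min = 0, max = 25)
y_var <- log2(x_var) + rnorm(100)
df <- data.frame(x_var = x_var, y_var = y_var)

# Solid color: set outside aes(), so every point is red
p_solid <- ggplot(data = df, aes(x = x_var, y = y_var)) +
  geom_point(color = 'red')

# Mapped color: set inside aes(), so color varies with y_var
p_mapped <- ggplot(data = df, aes(x = x_var, y = y_var)) +
  geom_point(aes(color = y_var))

p_solid
p_mapped
```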

How to add a trend line

To add a trend line, we can use the statistical operation stat_smooth().
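The code for this step isn’t shown here, so the following is a minimal sketch of what it presumably looks like: the red-point plot from the previous section with stat_smooth() added and no arguments. The object name p_smooth and the data.frame rebuild are my own additions.

```r
library(ggplot2)

# Rebuild the sample data (assumption: a plain data.frame stands in
# for the tibble used in the post)
set.seed(55)
x_var <- runif(100, min = 0, max = 25)
y_var <- log2(x_var) + rnorm(100)
df <- data.frame(x_var = x_var, y_var = y_var)

# Points plus a smooth trend line, using stat_smooth() defaults
p_smooth <- ggplot(data = df, aes(x = x_var, y = y_var)) +
  geom_point(color = 'red') +
  stat_smooth()

p_smooth
```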

A ggplot scatterplot in R with a smooth line.

Keep in mind that for a dataset of this size (fewer than 1,000 observations), the default trend line is a LOESS smooth line, which means that it will capture non-linear relationships.

But, you can also add a linear trend line.

Add a linear trend line

To add a linear trend line, you can use stat_smooth() and specify the exact method for creating a trend line using the method parameter.

Specifically, you’ll use the code method = 'lm' as follows:

ggplot(data = df, aes(x = x_var, y = y_var)) +
  geom_point(color = 'red') +
  stat_smooth(method = 'lm')

ggplot scatterplot in R with a straight line.

This is essentially using the lm() function to build a linear model and fit a straight line to the data.
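To see the connection to lm(), you can fit the model yourself and draw the resulting line by hand. This is only a sketch: the names fit and p_manual are hypothetical, the data is rebuilt as a plain data.frame, and geom_abline() draws the fitted line without the shaded confidence band that stat_smooth() adds by default.

```r
library(ggplot2)

# Rebuild the sample data (assumption: a plain data.frame stands in
# for the tibble used in the post)
set.seed(55)
x_var <- runif(100, min = 0, max = 25)
y_var <- log2(x_var) + rnorm(100)
df <- data.frame(x_var = x_var, y_var = y_var)

# Fit a straight line: y_var as a linear function of x_var
fit <- lm(y_var ~ x_var, data = df)
coef(fit)   # intercept and slope of the fitted line

# Draw the same straight line manually from the fitted coefficients
p_manual <- ggplot(data = df, aes(x = x_var, y = y_var)) +
  geom_point(color = 'red') +
  geom_abline(intercept = coef(fit)[["(Intercept)"]],
              slope = coef(fit)[["x_var"]])

p_manual
```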

Add a title

Finally, let’s add a quick title to the plot.

There are a few ways to add a title to a plot in ggplot2, but here we’ll just use the labs() function with the title parameter.

ggplot(data = df, aes(x = x_var, y = y_var)) +
  geom_point(color = 'red') +
  stat_smooth(method = 'lm') +
  labs(title = 'This is a scatter plot of x_var vs y_var')

A ggplot scatterplot in R with a linear trend line and a title.

Ok, I want to be clear: this is not a very good title. I’m only using this as an example (the whole chart is sort of a dummy example). Writing good chart titles is a bit of an art, and I’m not going to discuss it here.

I really just want you to understand that you can add a title to a ggplot scatterplot by using the labs() function with the title parameter.
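As a small extension, labs() can set the axis labels in the same call, and ggtitle() is one of the other ways to set a title. In the sketch below, the label text is placeholder wording of my own, and the object names p_labs and p_ggtitle are hypothetical.

```r
library(ggplot2)

# Rebuild the sample data (assumption: a plain data.frame stands in
# for the tibble used in the post)
set.seed(55)
x_var <- runif(100, min = 0, max = 25)
y_var <- log2(x_var) + rnorm(100)
df <- data.frame(x_var = x_var, y_var = y_var)

# Title and axis labels in a single labs() call
p_labs <- ggplot(data = df, aes(x = x_var, y = y_var)) +
  geom_point(color = 'red') +
  labs(title = 'This is a scatter plot of x_var vs y_var',
       x = 'x_var (placeholder label)',
       y = 'y_var (placeholder label)')

# ggtitle() is another way to set the title
p_ggtitle <- ggplot(data = df, aes(x = x_var, y = y_var)) +
  geom_point(color = 'red') +
  ggtitle('This is a scatter plot of x_var vs y_var')

p_labs
p_ggtitle
```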

There’s definitely more I could show you, but the examples above should get you started with making a scatter plot in R.