The scatter plot …
It’s so common that almost everyone knows how to make one in one way or another.
You see them in business, academia, media, news. Students use them.
Scatter plots are also extremely common in data science and analytics.
The scatter plot is everywhere, partially due to its simplicity and partially because of its incredible usefulness for finding and communicating insights.
As simple as it might be, if you want to master data science, one of your first steps should be mastering the scatter plot. It’s a fundamental technique that you absolutely need to know backwards and forwards.
In this blog post, I’ll show you how to make a scatter plot in R.
There’s actually more than one way to make a scatter plot in R, so I’ll show you two:
- How to make a scatter plot with base R
- How to make a scatter plot with ggplot2
I definitely have a preference for the ggplot2 version, but the base R version is still common. Because you’re likely to see the base R version, I’ll show you that version as well (just in case you need it).
Let’s get started. First, I’ll show you how to make a scatter plot in R using base R.
How to make a scatter plot in R with base R
Let’s talk about how to make a scatter plot with base R.
I have to admit: I don’t like the base R method. I think that many of the visualization tools from base R are awkward to use and hard to remember. I also think that the resulting visualizations are a little ugly.
Having said that, you’ll still see visualizations made with base R, so I want to show you how it’s done.
Let’s take a step-by-step look at how to make a scatter plot using base R:
Create the dataset
Here, we’ll quickly create a sample dataset.
library(tibble)  # provides the tibble() function

set.seed(55)
df <- tibble(x_var = runif(100, min = 0, max = 25),
             y_var = log2(x_var) + rnorm(100))
And let's print out the dataframe so we can take a look:
df
As you can see, the dataframe df contains two numeric variables, x_var and y_var.
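One detail worth noting: tibble() lets a later column (y_var) refer to an earlier one (x_var) inside the same call. Base R's data.frame() does not, so a base-only version has to create the x values first. The sketch below is one way to do that; x_vals and df_base are names introduced here purely for illustration.

```r
# Base R sketch: data.frame() cannot see x_var while building y_var,
# so the x values are created as a standalone vector first.
set.seed(55)
x_vals <- runif(100, min = 0, max = 25)

df_base <- data.frame(
  x_var = x_vals,
  y_var = log2(x_vals) + rnorm(100)
)

# Quick structural checks on the result
str(df_base)
head(df_base)
```

This version needs no add-on packages, which can be handy if you only want the base R plot.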
Plot the scatter plot with the plot() function
Next, we'll plot the scatter plot using the plot() function:
plot(x = df$x_var, y = df$y_var)
Ok, let me explain how that code works.
We're initiating plotting using the plot() function. Inside of the plot() function, the x = parameter and y = parameter allow us to specify which variables should be plotted on the x-axis and y-axis, respectively.
Keep in mind, though, that when it is called this way, the plot() function does not look for variables inside a dataframe. Instead, the x = and y = parameters expect vectors.
The variables we want to plot are inside of the dataframe df. Because of this, we need to access those vectors; we need to "pull them out" of the dataframe and tell the plot() function where to get them. To do this, we need to use the $ operator. The $ operator enables us to extract specific columns from a dataframe. So notice the syntax:
- df$x_var gets the x_var variable from df
- df$y_var gets the y_var variable from df
Essentially, we're extracting our variables from the dataframe using the $ operator, and then plotting them with the plot() function.
That's basically it. You can do more with a scatter plot in base R, but as I said earlier, I really don't like them. I strongly prefer to use ggplot2 to create almost all of my visualizations in R. That being the case, let me show you the ggplot2 version of a scatter plot.
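For reference before moving on, base R also offers ways to avoid repeating df$ in the call. The sketch below shows two standard alternatives: the formula interface of plot() and the with() function. The dataframe is rebuilt here with data.frame() only so that the block runs on its own.

```r
# Rebuild the sample data with base R so this block is self-contained
set.seed(55)
x_vals <- runif(100, min = 0, max = 25)
df <- data.frame(x_var = x_vals, y_var = log2(x_vals) + rnorm(100))

# df$x_var is an ordinary numeric vector
is.numeric(df$x_var)

# Formula interface: y ~ x, with the columns looked up in `data`
plot(y_var ~ x_var, data = df)

# with() evaluates the call using the dataframe's columns as variables
with(df, plot(x = x_var, y = y_var))
```

Both calls should draw the same points as plot(x = df$x_var, y = df$y_var); which one to use is mostly a matter of taste.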
How to make a scatter plot in R with ggplot2
As I just mentioned, when using R, I strongly prefer making scatter plots with ggplot2. By default, a ggplot2 scatter plot is more refined. It just looks "better right out of the box."
Having said that, ggplot2 can be a little intimidating for beginners, so let's quickly review what ggplot2 is and how it works.
What is ggplot2
ggplot2 is an add-on package for the R programming language. The focus of ggplot2 is data visualization. It enables R users to create a wide range of data visualizations using a relatively compact syntax.
Although the syntax seems confusing to new users, it is extremely systematic. The systematic nature of ggplot2 syntax is one of its core advantages. Once you know how to use the syntax, creating simple visualizations like the scatter plot becomes easy. Moreover, more advanced visualizations become relatively easy as well.
How ggplot2 works
The secret to using ggplot2 properly is understanding how the syntax works.
There are a few critical pieces you need to know:
- The ggplot() function
- The data parameter
- The aes() function
- Geometric objects (AKA, "geoms")
The ggplot() function is simply the function that we use to initiate a ggplot2 plot.
The data parameter tells ggplot2 the name of the dataframe that you want to visualize. When you use ggplot2, you need to use variables that are contained within a dataframe. The data parameter tells ggplot where to find those variables. Remember: ggplot2 operates on dataframes.
The aes() function tells ggplot() the "variable mappings." This might sound complex, but it's really straightforward once you understand. When we visualize data, we are essentially connecting variables in a dataframe to parts of the plot. For example, when we make a scatter plot, we "connect" one numeric variable to the x-axis, and another numeric variable to the y-axis. We "map" these variables to different axes within the visualization. The aes() function allows us to specify those mappings; it enables us to specify which variables in a dataframe should connect to which parts of the visualization. If this doesn't make sense, just sit tight. I'll show you an example in a minute.
Finally, a geometric object is the thing that we draw. When you create a bar chart, you are drawing "bar geoms." When you create a line chart, you are drawing "line geoms." And when you create a scatter plot, you are drawing "point geoms." In ggplot2, we need to explicitly state the type of geom that we want to use (bars, lines, points, etc). When drawing a scatter plot, we'll do this by using geom_point().
Ok. Now that I've quickly reviewed how ggplot2 works, let's take a look at an example of how to create a scatter plot in R with ggplot2.
Example: how to make a scatter plot with ggplot2
Load the ggplot2 package
First, you need to make sure that you've loaded the ggplot2 package. This also assumes that you've installed the ggplot2 package. (If you haven't installed the ggplot2 package, do that before running this code.)
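A minimal sketch of this step might look like the following. The install check, the CRAN mirror URL, and the second library() call are assumptions added here: the post does not show how tibble() (used for the sample data) is made available, so tibble is loaded explicitly.

```r
# Install the packages only if they are missing.
# Assumption: the machine can reach this CRAN mirror.
for (pkg in c("ggplot2", "tibble")) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg, repos = "https://cloud.r-project.org")
  }
}

library(ggplot2)  # plotting functions: ggplot(), aes(), geom_point(), ...
library(tibble)   # tibble(), used to build the sample dataframe
```

If you already load the tidyverse in your session, both packages come along with it and this step is not needed.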
Create a sample dataset
Next, we'll need some data to plot.
We already created the dataframe, df, earlier in this post. But just in case you didn't run that code yet, here it is again. (If you've already run it, you won't need to run this again.)
library(tibble)  # provides the tibble() function

set.seed(55)
df <- tibble(x_var = runif(100, min = 0, max = 25),
             y_var = log2(x_var) + rnorm(100))
df
This code creates a simple dataframe with two variables, x_var and y_var.
Plot a scatter plot with ggplot
Now that we have our dataframe, df, we will plot it with ggplot2.
ggplot(data = df, aes(x = x_var, y = y_var)) +
  geom_point()
Ok, we have our scatter plot. It's pretty straightforward, but let me explain it.
We're initiating the ggplot2 plotting system by calling the ggplot() function.
Inside of the ggplot() function, we're telling ggplot that we'll be plotting data in the df dataframe. We do this with the syntax data = df.
Next, inside the ggplot() function, we're calling the aes() function. Remember, the aes() function enables us to specify the "variable mappings." Here, we're telling ggplot2 to put our variable x_var on the x-axis, and put y_var on the y-axis. Syntactically, we're doing that with the code x = x_var, which maps x_var to the x-axis, and y = y_var, which maps y_var to the y-axis.
Finally, on the second line, we're using geom_point() to tell ggplot that we want to draw point geoms (i.e., points).
That's it. That's all there is to it. The syntax might look a little arcane to beginners, but once you understand how it works, it's pretty easy.
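To make the separate pieces easier to see, the same plot can be built in two steps and stored in objects. This is only a sketch: the names p and p_points are introduced here for illustration, and the dataframe is rebuilt with data.frame() so the block runs on its own.

```r
library(ggplot2)

# Rebuild the sample data so this block is self-contained
set.seed(55)
x_vals <- runif(100, min = 0, max = 25)
df <- data.frame(x_var = x_vals, y_var = log2(x_vals) + rnorm(100))

# Step 1: initiate the plot with the data and the variable mappings.
# On its own, this would render an empty panel with axes but no points.
p <- ggplot(data = df, mapping = aes(x = x_var, y = y_var))

# Step 2: add a layer of point geoms to get the scatter plot.
p_points <- p + geom_point()

# Printing the object renders the plot
print(p_points)
```

Storing the plot in an object like this is optional, but it makes it easy to add further layers later without retyping the first line.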
Having said that, there are still a few enhancements we could make to improve the chart. Let's talk about a few of those.
Change the color of the points
To change the color of the points in our ggplot scatterplot to a solid color, we need to use the color parameter.
ggplot(data = df, aes(x = x_var, y = y_var)) +
  geom_point(color = 'red')
Again, this is very straightforward. To do this, we just set color = 'red' inside of geom_point(). We do this inside of geom_point() because we're changing the color of the points. (There are more complex examples where we have multiple geoms, and we need to be able to specify how to modify one geom layer at a time.)
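As a small illustration of the multiple-geom case, here is a sketch with two layers that each get their own color. The second layer, geom_rug(), is chosen here only as an example of another geom; it is not part of the plot built in this post.

```r
library(ggplot2)

# Rebuild the sample data so this block is self-contained
set.seed(55)
x_vals <- runif(100, min = 0, max = 25)
df <- data.frame(x_var = x_vals, y_var = log2(x_vals) + rnorm(100))

# Each geom layer takes its own color setting
p_two_layers <- ggplot(data = df, aes(x = x_var, y = y_var)) +
  geom_point(color = "red") +   # points drawn in red
  geom_rug(color = "grey40")    # rug marks along the axes drawn in grey

print(p_two_layers)
```

Because color is set inside each geom call, changing one layer leaves the other untouched.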
How to add a trend line
To add a trend line, we can use the statistical operation stat_smooth().
Keep in mind that for a small dataset like this one (fewer than 1,000 observations), the default trend line is a LOESS smooth line, which means that it will capture non-linear relationships.
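A minimal sketch of the LOESS version might look like this. The method and formula are written out explicitly here, as an assumption about what the default resolves to for this dataset; spelling them out also avoids the message ggplot2 prints when it picks them automatically.

```r
library(ggplot2)

# Rebuild the sample data so this block is self-contained
set.seed(55)
x_vals <- runif(100, min = 0, max = 25)
df <- data.frame(x_var = x_vals, y_var = log2(x_vals) + rnorm(100))

# Scatter plot with a LOESS smooth line layered on top
p_loess <- ggplot(data = df, aes(x = x_var, y = y_var)) +
  geom_point(color = "red") +
  stat_smooth(method = "loess", formula = y ~ x)

print(p_loess)
```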
But, you can also add a linear trend line.
Add a linear trend line
To add a linear trend line, you can use stat_smooth() and specify the exact method for creating a trend line using the method parameter.
Specifically, you'll use the code method = 'lm' as follows:
ggplot(data = df, aes(x = x_var, y = y_var)) +
  geom_point(color = 'red') +
  stat_smooth(method = 'lm')
This is essentially using the lm() function to build a linear model and fit a straight line to the data.
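To see that connection more directly, here is a sketch that fits the model with lm() by hand and draws the resulting line with geom_abline(). The names fit and p_lm are introduced here for illustration. Unlike stat_smooth(), this version does not draw a confidence band.

```r
library(ggplot2)

# Rebuild the sample data so this block is self-contained
set.seed(55)
x_vals <- runif(100, min = 0, max = 25)
df <- data.frame(x_var = x_vals, y_var = log2(x_vals) + rnorm(100))

# Fit the linear model y_var = intercept + slope * x_var
fit <- lm(y_var ~ x_var, data = df)
coef(fit)

# Draw the fitted straight line on top of the points
p_lm <- ggplot(data = df, aes(x = x_var, y = y_var)) +
  geom_point(color = "red") +
  geom_abline(intercept = coef(fit)[["(Intercept)"]],
              slope = coef(fit)[["x_var"]])

print(p_lm)
```

In practice stat_smooth(method = 'lm') is shorter; fitting the model yourself is mainly useful when you also want the coefficients.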
Add a title
Finally, let's add a quick title to the plot.
There are a few ways to add a title to a plot in ggplot2, but here we'll just use the labs() function with the title parameter.
ggplot(data = df, aes(x = x_var, y = y_var)) +
  geom_point(color = 'red') +
  stat_smooth(method = 'lm') +
  labs(title = 'This is a scatter plot of x_var vs y_var')
Ok, I want to be clear: this is not a very good title. I'm only using this as an example (the whole chart is sort of a dummy example). Writing good chart titles is a bit of an art, and I'm not going to discuss it here.
I really just want you to understand that you can add a title to a ggplot scatterplot by using the labs() function with the title parameter.
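As a brief sketch of one other option, ggtitle() sets the title as well, and labs() can also set the axis labels in the same call. The label text and object names below are placeholders chosen for illustration.

```r
library(ggplot2)

# Rebuild the sample data so this block is self-contained
set.seed(55)
x_vals <- runif(100, min = 0, max = 25)
df <- data.frame(x_var = x_vals, y_var = log2(x_vals) + rnorm(100))

p_base <- ggplot(data = df, aes(x = x_var, y = y_var)) +
  geom_point(color = "red")

# Option 1: labs() sets the title and, optionally, the axis labels
p_labs <- p_base +
  labs(title = "x_var vs y_var", x = "x_var (placeholder label)", y = "y_var (placeholder label)")

# Option 2: ggtitle() sets only the title
p_ggtitle <- p_base + ggtitle("x_var vs y_var")

print(p_labs)
```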
There's definitely more I could show you, but the examples above should get you started with making a scatter plot in R.