The scatter plot …

It’s so common that almost everyone knows how to make one in one way or another.

Scatter plots are also extremely common in data science and analytics.

The scatter plot is everywhere, partially due to its simplicity and partially because of its incredible usefulness for finding and communicating insights.

As simple as it might be, if you want to master data science, one of your first steps should be mastering the scatter plot. It’s a fundamental technique that you absolutely need to know backwards and forwards.

In this blog post, I’ll show you how to make a scatter plot in R.

There’s actually more than one way to make a scatter plot in R, so I’ll show you two:

1. How to make a scatter plot with base R
2. How to make a scatter plot with ggplot2

I definitely have a preference for the ggplot2 version, but the base R version is still common. Because you’re likely to see the base R version, I’ll show you that version as well (just in case you need it).

Let’s get started. First, I’ll show you how to make a scatter plot in R using base R.

## How to make a scatter plot in R with base R

Let’s talk about how to make a scatter plot with base R.

I have to admit: I don’t like the base R method. I think that many of the visualization tools from base R are awkward to use and hard to remember. I also think that the resulting visualizations are a little ugly.

Having said that, you’ll still see visualizations made with base R, so I want to show you how it’s done.

Let’s take a step-by-step look at how to make a scatter plot using base R:

#### Create the dataset

Here, we’ll quickly create a sample dataset.

```r
library(tibble)  # provides the tibble() function

set.seed(55)     # make the random numbers reproducible
df <- tibble(x_var = runif(100, min = 0, max = 25)
             ,y_var = log2(x_var) + rnorm(100)
             )
```

And let's print out the dataframe so we can take a look:

```r
print(df)
```

As you can see, the dataframe `df` contains two numeric variables, `x_var` and `y_var`.
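If you'd like to confirm that structure without reading the printed output, a quick check along these lines should work. This is only a sketch; it rebuilds the same `df` so that it runs on its own, and the helper calls are all base R.

```r
library(tibble)

# Recreate the sample data from above
set.seed(55)
df <- tibble(x_var = runif(100, min = 0, max = 25),
             y_var = log2(x_var) + rnorm(100))

dim(df)                  # number of rows and columns
names(df)                # the column names
sapply(df, is.numeric)   # is each column numeric?
```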

#### Plot the scatter plot with the plot() function

Next, we'll plot the scatter plot using the `plot()` function.

```r
plot(x = df$x_var, y = df$y_var)
```

Ok, let me explain how that code works.

We're initiating plotting using the `plot()` function. Inside of the `plot()` function, the `x =` parameter and `y =` parameter allow us to specify which variables should be plotted on the x-axis and y-axis, respectively.

Keep in mind, though, that when we call `plot()` with `x =` and `y =` like this, it expects vectors, not a dataframe.

The variables we want to plot are inside of the dataframe `df`. Because of this, we need to access those vectors; we need to "pull them out" of the dataframe and tell the `plot()` function where to get them. To do this, we need to use the `$` operator. The `$` operator enables us to extract specific columns from a dataframe. So notice the syntax: `df$x_var` is basically getting the `x_var` variable from `df`, and `df$y_var` is basically getting the `y_var` variable from `df`.

Essentially, we're extracting our variables from the dataframe using the `$` operator, and then plotting them with the `plot()` function.
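To make the extraction step concrete, here is a small sketch of what `$` hands back. It rebuilds the same `df` so it runs on its own; the name `x_values` is just an illustrative variable, not something used elsewhere in this post.

```r
library(tibble)

# Recreate the sample data from above
set.seed(55)
df <- tibble(x_var = runif(100, min = 0, max = 25),
             y_var = log2(x_var) + rnorm(100))

# `$` pulls a single column out of the dataframe as a plain vector
x_values <- df$x_var

is.numeric(x_values)      # TRUE: it's a numeric vector
is.data.frame(x_values)   # FALSE: it's no longer a dataframe
length(x_values)          # 100: one value per row

# Double brackets are an equivalent way to extract the same column
identical(df$x_var, df[["x_var"]])   # TRUE
```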

That's basically it. You can do more with a scatter plot in base R, but as I said earlier, I really don't like the base R plotting tools. I strongly prefer to use ggplot2 to create almost all of my visualizations in R. That being the case, let me show you the ggplot2 version of a scatter plot.
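For reference, here is a rough sketch of what "doing more" can look like in base R. It uses a few standard `plot()` arguments (`main`, `xlab`, `ylab`, `col`, `pch`); the title and label text are placeholders I made up for illustration.

```r
library(tibble)

# Recreate the sample data from above
set.seed(55)
df <- tibble(x_var = runif(100, min = 0, max = 25),
             y_var = log2(x_var) + rnorm(100))

plot(x = df$x_var, y = df$y_var,
     main = "y_var vs. x_var",   # plot title
     xlab = "x_var",             # x-axis label
     ylab = "y_var",             # y-axis label
     col = "red",                # point color
     pch = 19)                   # solid circles instead of the default hollow ones
```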

## How to make a scatter plot in R with ggplot2

As I just mentioned, when using R, I strongly prefer making scatter plots with ggplot2. By default, a ggplot2 scatter plot is more refined. It just looks "better right out of the box."

Having said that, ggplot2 can be a little intimidating for beginners, so let's quickly review what ggplot2 is and how it works.

#### What is ggplot2

ggplot2 is an add-on package for the R programming language. The focus of ggplot2 is data visualization. It enables R users to create a wide range of data visualizations using a relatively compact syntax.

Although the syntax seems confusing to new users, it is extremely systematic. The systematic nature of ggplot2 syntax is one of its core advantages. Once you know how to use the syntax, creating simple visualizations like the scatter plot becomes easy. Moreover, more advanced visualizations become relatively easy as well.

#### How ggplot2 works

The secret to using ggplot2 properly is understanding how the syntax works.

There are a few critical pieces you need to know:

• The `ggplot()` function
• The `data =` parameter
• The `aes()` function
• Geometric objects (AKA, "geoms")

The `ggplot()` function is simply the function that we use to initiate a ggplot2 plot.

The `data` parameter tells ggplot2 the name of the dataframe that you want to visualize. When you use ggplot2, you need to use variables that are contained within a dataframe. The data parameter tells ggplot where to find those variables. Remember: ggplot2 operates on dataframes.

The `aes()` function tells `ggplot()` the "variable mappings." This might sound complex, but it's really straightforward once you understand. When we visualize data, we are essentially connecting variables in a dataframe to parts of the plot. For example, when we make a scatter plot, we "connect" one numeric variable to the x axis, and another numeric variable to the y axis. We "map" these variables to different axes within the visualization. The `aes()` function allows us to specify those mappings; it enables us to specify which variables in a dataframe should connect to which parts of the visualization. If this doesn't make sense, just sit tight. I'll show you an example in a minute.

Finally, a geometric object is the thing that we draw. When you create a bar chart, you are drawing "bar geoms." When you create a line chart, you are drawing "line geoms." And when you create a scatter plot, you are drawing "point geoms." The geom is the thing that you draw. In ggplot2, we need to explicitly state the type of geom that we want to use (bars, lines, points, etc). When drawing a scatter plot, we'll do this by using `geom_point()`.

Ok. Now that I've quickly reviewed how ggplot2 works, let's take a look at an example of how to create a scatter plot in R with ggplot2.

## Example: how to make a scatter plot with ggplot2

First, you need to make sure that you've loaded the ggplot2 package. This also assumes that you've installed the ggplot2 package. (If you haven't installed the ggplot2 package, do that before running this code.)
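If you're not sure whether the packages are installed, something like the following should handle it. Treat it as a sketch: it installs ggplot2 (and tibble, which the sample data uses) only when they're missing, and the CRAN mirror URL is simply one common choice.

```r
# Install each package only if it isn't already available
for (pkg in c("ggplot2", "tibble")) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg, repos = "https://cloud.r-project.org")
  }
}
```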

```r
library(ggplot2)
```

#### Create a sample dataset

Next, we'll need some data to plot.

We already created the dataframe, `df`, earlier in this post. If you've already run that code, you won't need to run it again. But just in case you haven't, here it is one more time.

```r
library(tibble)  # provides the tibble() function

set.seed(55)     # make the random numbers reproducible
df <- tibble(x_var = runif(100, min = 0, max = 25)
             ,y_var = log2(x_var) + rnorm(100)
             )

df
```

This code creates a simple dataframe with two variables, `x_var` and `y_var`.

#### Plot a scatter plot with ggplot

Now that we have our dataframe, `df`, we will plot it with ggplot2.

```r
ggplot(data = df, aes(x = x_var, y = y_var)) +
  geom_point()
```

Ok, we have our scatter plot. It's pretty straightforward, but let me explain it.

We're initiating the ggplot2 plotting system by calling the `ggplot()` function.

Inside of the `ggplot()` function, we're telling ggplot2 that we'll be plotting data in the `df` dataframe. We do this with the syntax `data = df`.

Next, inside the `ggplot()` function, we're calling the `aes()` function. Remember, the `aes()` function enables us to specify the "variable mappings." Here, we're telling ggplot2 to put our variable `x_var` on the x-axis, and put `y_var` on the y-axis. Syntactically, we're doing that with the code `x = x_var`, which maps `x_var` to the x-axis, and `y = y_var`, which maps `y_var` to the y-axis.

Finally, on the second line, we're using `geom_point()` to tell ggplot that we want to draw point geoms (i.e., points).

That's it. That's all there is to it. The syntax might look a little arcane to beginners, but once you understand how it works, it's pretty easy.
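To see how systematic the syntax is, here's a small sketch that keeps the same `data` and `aes()` and only swaps the geom. The object names (`base_plot`, `point_plot`, `line_plot`) are just illustrative, and the data is rebuilt so the block runs on its own.

```r
library(ggplot2)
library(tibble)

# Recreate the sample data from above
set.seed(55)
df <- tibble(x_var = runif(100, min = 0, max = 25),
             y_var = log2(x_var) + rnorm(100))

# ggplot() + data + aes() set up the plot, but no geom has been added yet
base_plot <- ggplot(data = df, aes(x = x_var, y = y_var))

# The geom decides what gets drawn for the same data and mappings
point_plot <- base_plot + geom_point()   # scatter plot
line_plot  <- base_plot + geom_line()    # line chart of the same variables

print(point_plot)
print(line_plot)
```

Because a ggplot2 plot is an ordinary R object, you can store it in a variable like this and add layers to it later.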

Having said that, there are still a few enhancements we could make to improve the chart. Let's talk about a few of those.

## Change the color of the points

To change the color of the points in our ggplot scatterplot to a solid color, we need to use the `color` parameter.

```r
ggplot(data = df, aes(x = x_var, y = y_var)) +
  geom_point(color = 'red')
```

Again, this is very straightforward. To do this, we just set `color = 'red'` inside of `geom_point()`. We do this inside of `geom_point()` because we're changing the color of the points. (There are more complex examples where we have multiple geoms, and we need to be able to specify how to modify one geom layer at a time.)
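As a rough illustration of the multiple-geom case, the sketch below draws a line layer and a point layer on the same plot and gives each its own color. The line layer and the `'grey70'` color are my own additions for the example, and the object name `layered_plot` is arbitrary.

```r
library(ggplot2)
library(tibble)

# Recreate the sample data from above
set.seed(55)
df <- tibble(x_var = runif(100, min = 0, max = 25),
             y_var = log2(x_var) + rnorm(100))

layered_plot <- ggplot(data = df, aes(x = x_var, y = y_var)) +
  geom_line(color = 'grey70') +   # the line layer gets its own color
  geom_point(color = 'red')       # the point layer is colored separately

print(layered_plot)
```

Each `color =` setting only affects the geom it sits inside, which is why the color goes inside `geom_point()` rather than inside `ggplot()`.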

## How to add a trend line

To add a trend line, we can use the statistical operation `stat_smooth()`. Keep in mind that for a small dataset like ours (fewer than 1,000 observations), the default trend line is a LOESS smooth line, which means that it will capture non-linear relationships.
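Here's a minimal sketch of the default behavior: `stat_smooth()` is called with no arguments, so ggplot2 picks the smoothing method itself. The object name `smooth_plot` is just for illustration.

```r
library(ggplot2)
library(tibble)

# Recreate the sample data from above
set.seed(55)
df <- tibble(x_var = runif(100, min = 0, max = 25),
             y_var = log2(x_var) + rnorm(100))

# With only 100 rows, stat_smooth() uses LOESS when no method is given
smooth_plot <- ggplot(data = df, aes(x = x_var, y = y_var)) +
  geom_point(color = 'red') +
  stat_smooth()

print(smooth_plot)
```

When you print the plot, ggplot2 also prints a short message telling you which method and formula it chose.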

But, you can also add a linear trend line.

#### Add a linear trend line

To add a linear trend line, you can use `stat_smooth()` and specify the exact method for creating a trend line using the `method` parameter.

Specifically, you'll use the code `method = 'lm'` as follows:

```r
ggplot(data = df, aes(x = x_var, y = y_var)) +
  geom_point(color = 'red') +
  stat_smooth(method = 'lm')
```

This is essentially using the `lm()` function to build a linear model and fit a straight line to the data.
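If you want to see that connection for yourself, here's a hedged sketch that fits the model directly with `lm()` and then draws a dashed line from its coefficients with `geom_abline()`. The names `fit` and `lm_plot` are mine, and I've turned off the confidence band with `se = FALSE` so the two lines are easier to compare.

```r
library(ggplot2)
library(tibble)

# Recreate the sample data from above
set.seed(55)
df <- tibble(x_var = runif(100, min = 0, max = 25),
             y_var = log2(x_var) + rnorm(100))

# Fit the straight line directly: y_var as a linear function of x_var
fit <- lm(y_var ~ x_var, data = df)
coef(fit)   # the intercept and the slope for x_var

# Overlay a line built from those coefficients on top of the 'lm' trend line
lm_plot <- ggplot(data = df, aes(x = x_var, y = y_var)) +
  geom_point(color = 'red') +
  stat_smooth(method = 'lm', se = FALSE) +
  geom_abline(intercept = coef(fit)[["(Intercept)"]],
              slope = coef(fit)[["x_var"]],
              linetype = 'dashed')

print(lm_plot)
```

The dashed line should sit directly on top of the trend line that `stat_smooth()` draws.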

## How to add a title

Finally, let's add a quick title to the plot.

There are a few ways to add a title to a plot in ggplot2, but here we'll just use the `labs()` function with the `title` parameter.

```r
ggplot(data = df, aes(x = x_var, y = y_var)) +
  geom_point(color = 'red') +
  stat_smooth(method = 'lm') +
  labs(title = 'This is a scatter plot of x_var vs y_var')
```

Ok, I want to be clear: this is not a very good title. I'm only using this as an example (the whole chart is sort of a dummy example). Writing good chart titles is a bit of an art, and I'm not going to discuss it here.

I really just want you to understand that you can add a title to a ggplot scatterplot by using the `labs()` function with the `title` parameter.
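As a sketch of one of the other ways to add a title, the `ggtitle()` function sets it too, and `labs()` can set the axis labels at the same time. The label text and the object names below are placeholders I chose for the example.

```r
library(ggplot2)
library(tibble)

# Recreate the sample data from above
set.seed(55)
df <- tibble(x_var = runif(100, min = 0, max = 25),
             y_var = log2(x_var) + rnorm(100))

scatter <- ggplot(data = df, aes(x = x_var, y = y_var)) +
  geom_point(color = 'red')

# Option 1: labs() can set the title and the axis labels together
with_labs <- scatter +
  labs(title = 'x_var vs y_var', x = 'The x variable', y = 'The y variable')

# Option 2: ggtitle() sets only the title
with_ggtitle <- scatter + ggtitle('x_var vs y_var')

print(with_labs)
print(with_ggtitle)
```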