This tutorial will explain how to create a ggplot boxplot.
It explains the syntax, and shows clear, step-by-step examples of how to create a boxplot in R using ggplot2.
If you have the time, you should probably read the whole tutorial. It will make more sense if you do.
A Quick Review of Boxplots
Before we look at the syntax for the ggplot boxplot, let’s quickly review what boxplots are and how they’re structured.
Boxplots visualize summary statistics for your data
Boxplots are a type of data visualization that shows summary statistics for your data.
More specifically, boxplots visualize what we call the “five number summary.” The five number summary is a set of values that includes:
- the minimum
- the first quartile (25th percentile)
- the median
- the third quartile (75th percentile)
- the maximum
When we plot these statistics in the form of a boxplot, it looks something like this:
Take a look specifically at the structure. The different parts of the box and the two ends of the “whiskers” visualize our 5 number summary.
The box itself forms the core of the boxplot.
One side of the box represents the 25th percentile of our data (this is also called “the 1st quartile”, or Q1). The other end of the box represents the 75th percentile of our data (this is also called “the 3rd quartile”, or Q3).
Notice as well that there’s a line drawn in the interior of the box (the dotted line, in the above example). That line represents the median of the data (AKA, the “second quartile” or Q2).
So the box itself shows us the 25th percentile, the median, and the 75th percentile. All by itself, this gives us a lot of information about how the data are distributed.
Additionally, the width of the box gives us some information. The box spans from the 25th percentile to the 75th percentile. This distance is commonly known as the “interquartile range,” or IQR for short.
Notice that on either side of the box, there are some lines that extend beyond the box. We typically call these the “whiskers.”
These whisker lines show the location of the minimum value on one side, and the maximum value on the other.
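Typically, these minimum and maximum values are not the raw extremes of the data, but are calculated according to a formula. Commonly, the whiskers are allowed to reach no further than Q1 – 1.5*IQR on the low side and Q3 + 1.5*IQR on the high side, and each whisker ends at the most extreme data point that falls within that limit.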
So in addition to showing the interquartile range, the boxplot also shows us minima and maxima. This can help us understand the high and low ranges for the data.
Finally, in the simple example above, you might notice some dots that exist beyond one of the whiskers.
These points represent outliers.
Remember, as noted in the section above, the “minimum” and “maximum” values in the boxplot are commonly calculated values.
Any outliers that we plot are simply values that are more extreme than those calculated minima and maxima (i.e., beyond 1.5*IQR from either end of the box).
These outliers show us the extreme values that might exist in the data.
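To make the whisker and outlier rules concrete, here is a minimal base R sketch. The vector x is made-up illustration data (it is not from the dataset used later in this tutorial), and the 1.5*IQR rule is the common convention described above. Keep in mind that quantile() supports several interpolation methods, so other tools may report slightly different quartiles.

```r
# Made-up illustration data (hypothetical values)
x <- c(2, 4, 5, 6, 7, 8, 9, 30)

# Quartiles and interquartile range
q1  <- quantile(x, 0.25, names = FALSE)
med <- median(x)
q3  <- quantile(x, 0.75, names = FALSE)
iqr <- q3 - q1   # same value as IQR(x)

# "Fences": the furthest a whisker is allowed to reach
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Whiskers stop at the most extreme observations inside the fences
whisker_low  <- min(x[x >= lower_fence])
whisker_high <- max(x[x <= upper_fence])

# Anything outside the fences would be drawn as an outlier point
outliers <- x[x < lower_fence | x > upper_fence]

print(c(q1 = q1, median = med, q3 = q3, iqr = iqr))
print(c(whisker_low = whisker_low, whisker_high = whisker_high))
print(outliers)
```

In this made-up vector, the value 30 is the only point beyond the upper fence, so it is the only one that would be plotted as an outlier dot.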
So that’s the basic structure of a boxplot.
Now that we’ve reviewed the parts of a boxplot, let’s look at how to create one with ggplot2.
An Introduction to the ggplot Boxplot
Now, let’s talk about how to create a boxplot in R with ggplot2.
In the next few sections, I’ll explain the syntax, and then I’ll show you clear examples of how to create a simple boxplot, as well as several variations of the boxplot.
Syntax of the ggplot Boxplot
Let’s take a look at the syntax.
The syntax is relatively straightforward, as long as you already know how ggplot2 works. (To learn more about the ggplot2 visualization system check out our guide to ggplot2 for beginners.)
To plot a boxplot, you’ll call the ggplot() function. Inside the function, you’ll have the data parameter, and the x and y parameters (which are typically set inside the aes() function). And finally you have the geom_boxplot() function.
Let’s talk about each of these.
The data parameter
The data parameter enables us to specify the dataframe that we want to plot.
Remember that ggplot2 is primarily set up to work with R dataframes, so we specify the dataframe with this parameter. For example, if your dataframe is named mydataframe, then you’ll set the syntax to data = mydataframe.
You’ll see examples of how this works in the examples section.
The x and y parameters
The x and y parameters enable you to specify the variables that you want to map to the x-axis and y-axis, respectively.
Note that these parameters are set inside of the aes() function. Remember that in the ggplot2 system, the aes() function specifies how we map variables to aesthetic attributes of the plot.
(Again, to learn more about the aes() function, check out our guide to ggplot2 for beginners.)
The boxplot “geom”
Finally, we have the syntax geom_boxplot(). This syntax tells ggplot that we want to create a boxplot from our data, and from the variable mappings that we’ve set with the aes() function.
If you’re confused about this, you need to understand what geoms are. Once again, to understand “geoms” and how they fit into the ggplot2 system, please see our guide to ggplot2 for beginners.
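Putting the three pieces together, the overall shape of the syntax can be sketched as follows. The dataframe mydataframe and its columns group and value are hypothetical, invented here only to illustrate where each piece goes.

```r
library(ggplot2)

# Hypothetical dataframe, invented only to illustrate the syntax
mydataframe <- data.frame(
  group = rep(c("a", "b"), each = 5),
  value = c(1, 2, 3, 4, 5, 3, 5, 6, 8, 12)
)

# data         = the dataframe to plot
# aes()        = which variables map to the x-axis and y-axis
# geom_boxplot = the type of plot to draw
p <- ggplot(data = mydataframe, aes(x = group, y = value)) +
  geom_boxplot()

print(p)
```

Because group has two distinct values, this sketch should draw two boxes, one per group.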
The ggplot system also has other parameters that you can manipulate.
I’ll show you some simple modifications that you can make in the upcoming examples.
Examples: How to make boxplots with ggplot2
The boxplot is very easy to make using ggplot2. We’ll take a look at a few variations.
- Simple ggplot boxplot
- Ggplot boxplot by category
- Horizontal boxplot
- Change the box color
- Add a title
But before we actually make our boxplots, we’ll need to run some code.
In order to run our examples, we need to load the tidyverse package. We should also look at the data we’re going to plot.
First, we’ll load the tidyverse package. The tidyverse package actually contains the ggplot2 package, as well as several other important R packages like tidyr and others.
# LOAD TIDYVERSE PACKAGE
library(tidyverse)
In these examples, we’ll be working with the msleep dataset.
This dataset contains data on the sleep patterns of different animals.
We can take a look with the glimpse() function.
# INSPECT DATA
msleep %>% glimpse()
Rows: 83
Columns: 11
$ name         "Cheetah", "Owl monkey", "Mountain beaver", "Greater short-tai…
$ genus        "Acinonyx", "Aotus", "Aplodontia", "Blarina", "Bos", "Bradypus…
$ vore         "carni", "omni", "herbi", "omni", "herbi", "herbi", "carni", N…
$ order        "Carnivora", "Primates", "Rodentia", "Soricomorpha", "Artiodac…
$ conservation "lc", NA, "nt", "lc", "domesticated", NA, "vu", NA, "domestica…
$ sleep_total  12.1, 17.0, 14.4, 14.9, 4.0, 14.4, 8.7, 7.0, 10.1, 3.0, 5.3, 9…
$ sleep_rem    NA, 1.8, 2.4, 2.3, 0.7, 2.2, 1.4, NA, 2.9, NA, 0.6, 0.8, 0.7, …
$ sleep_cycle  NA, NA, NA, 0.1333333, 0.6666667, 0.7666667, 0.3833333, NA, 0.…
$ awake        11.9, 7.0, 9.6, 9.1, 20.0, 9.6, 15.3, 17.0, 13.9, 21.0, 18.7, …
$ brainwt      NA, 0.01550, NA, 0.00029, 0.42300, NA, NA, NA, 0.07000, 0.0982…
$ bodywt       50.000, 0.480, 1.350, 0.019, 600.000, 3.850, 20.490, 0.045, 14…
Notice that there are several categorical variables, as well as numeric variables.
This makes it very well suited for visualization with a boxplot.
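Before plotting, it can help to check the two variables we’ll use most: the categorical variable vore and the numeric variable sleep_total. This is a small optional sketch that uses only base R functions on the msleep data. Loading ggplot2 directly (rather than the whole tidyverse) is just a choice made here to keep the snippet self-contained.

```r
library(ggplot2)  # msleep ships with ggplot2

# How many animals fall into each diet category (including missing values)?
vore_counts <- table(msleep$vore, useNA = "ifany")
print(vore_counts)

# Quick numeric summary of the variable we will plot
print(summary(msleep$sleep_total))
```

Notice that vore contains some missing values (NA), which will matter when we plot by category.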
Let’s look at some examples.
EXAMPLE 1: simple ggplot boxplot
First, we’ll create a very simple boxplot.
# PLOT BOXPLOT
ggplot(data = msleep, aes(x = sleep_total)) +
  geom_boxplot()
And here’s what it looks like:
Here, we’ve mapped a single numeric variable, sleep_total, to the x parameter.
When we create a boxplot with this mapping, ggplot outputs a horizontal boxplot of that numeric variable.
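If you’d rather have a vertical boxplot of the same variable, one option is to map it to y instead of x. The sketch below also uses layer_data(), a ggplot2 helper that returns the statistics computed for a layer, so you can see the numbers behind the box. The object names p_vertical and box_stats are just names chosen for this illustration.

```r
library(ggplot2)

# Same variable, mapped to y instead of x, gives a vertical boxplot
p_vertical <- ggplot(data = msleep, aes(y = sleep_total)) +
  geom_boxplot()
print(p_vertical)

# layer_data() returns the statistics ggplot2 computed for the layer:
# whisker ends (ymin, ymax), box edges (lower, upper), and median (middle)
box_stats <- layer_data(p_vertical)
print(box_stats[, c("ymin", "lower", "middle", "upper", "ymax")])
```

The middle value should match median(msleep$sleep_total), which is a handy sanity check on how to read the plot.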
EXAMPLE 2: ggplot boxplot by category
Next, we’ll create a boxplot that’s broken out by a categorical variable.
Let’s run the code, and then I’ll explain.
# PLOT BOXPLOT
ggplot(data = msleep, aes(x = vore, y = sleep_total)) +
  geom_boxplot()
And here’s what it looks like:
Here, we mapped the categorical variable vore to the x parameter and the numeric variable sleep_total to the y parameter.
Notice that the orientation of the boxplot depends on what variable you map to which axis!
As you can see, since vore is a categorical variable, ggplot creates a separate boxplot for each category.
This is very useful for comparing data distributions across categories in your data.
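One detail to be aware of: the vore variable contains missing values (you can see NA in the glimpse() output earlier), and ggplot2 draws those rows as their own NA box. If you don’t want that box, one possible approach is to drop those rows before plotting. This is a sketch using base R’s subset(); the name msleep_known is just a name chosen for this illustration.

```r
library(ggplot2)

# Keep only animals whose diet category is known
msleep_known <- subset(msleep, !is.na(vore))

# Same plot as above, but without the NA category
p_known <- ggplot(data = msleep_known, aes(x = vore, y = sleep_total)) +
  geom_boxplot()
print(p_known)
```

Whether to drop the missing category is a judgment call. Sometimes it’s worth keeping the NA box so you can see how much data is uncategorized.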
EXAMPLE 3: make a horizontal boxplot by category
Next, we’ll create a horizontal boxplot.
This will be the same as the boxplot in example 2, except the orientation will be different.
# PLOT BOXPLOT
ggplot(data = msleep, aes(x = sleep_total, y = vore)) +
  geom_boxplot()
Again, this is the same boxplot that we had in example 2, except it’s flipped on its side.
Notice again that the orientation of the boxplot depends on which variables are mapped to the x and y parameters.
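Swapping the x and y mappings like this relies on ggplot2 version 3.3.0 or later, where geom_boxplot() detects the orientation automatically. In older code you’ll often see a different approach: keep the vertical mapping from example 2 and add coord_flip(). Here’s a sketch of that alternative (p_flipped is just a name chosen for this illustration):

```r
library(ggplot2)

# Vertical mapping from example 2, flipped on its side with coord_flip()
p_flipped <- ggplot(data = msleep, aes(x = vore, y = sleep_total)) +
  geom_boxplot() +
  coord_flip()
print(p_flipped)
```

Both approaches should produce a horizontal boxplot by category. With coord_flip(), remember that the aesthetics still refer to the original, unflipped axes.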
EXAMPLE 4: Change the box color
Next we’ll change the color of the boxes.
To do this, we actually need to use the fill parameter. The fill parameter controls the color of the interior of the boxes, but the color parameter actually controls the border color.
ggplot(data = msleep, aes(x = sleep_total, y = vore)) +
  geom_boxplot(fill = 'red')
Here, we changed the box color to red by setting fill = 'red'.
Notice that we did this inside the geom_boxplot() function. This tells ggplot2 that we’re specifically changing the fill color of the boxes. (This comes in handy if we have a layered plot with more than one geom type.)
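To see the difference between fill and color side by side, here’s a short sketch that sets both, followed by a variation that maps fill to a variable inside aes() so each category gets its own color. The specific color 'navy' and the object names are arbitrary choices made for this illustration.

```r
library(ggplot2)

# fill sets the interior color, color sets the border (outline) color
p_colors <- ggplot(data = msleep, aes(x = sleep_total, y = vore)) +
  geom_boxplot(fill = 'red', color = 'navy')
print(p_colors)

# Mapping fill to a variable inside aes() colors each box by category
p_by_vore <- ggplot(data = msleep, aes(x = sleep_total, y = vore, fill = vore)) +
  geom_boxplot()
print(p_by_vore)
```

The key distinction is that a fixed color goes outside of aes(), as an argument to geom_boxplot(), while a color that depends on a variable goes inside aes().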
EXAMPLE 5: Add a title
Let’s do one more thing.
Here, we’ll add a title:
ggplot(data = msleep, aes(x = sleep_total, y = vore)) +
  geom_boxplot(fill = 'red') +
  labs(title = 'On average, insectivores sleep more than other diet groups')
Here, we added a title using the labs() function.
Titles and axis labels are relatively easy, but there are some important details that you might need to know.
For more information on titles and axis labels, check out our tutorial on ggplot titles.
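The labs() function can also set the axis labels in the same call. Here’s a minimal sketch. The label text is my own wording, chosen for illustration: sleep_total is recorded in hours in the msleep data, and 'Diet type' is just one reasonable description of the vore variable.

```r
library(ggplot2)

p_labeled <- ggplot(data = msleep, aes(x = sleep_total, y = vore)) +
  geom_boxplot(fill = 'red') +
  labs(
    title = 'Total sleep by diet type',
    x = 'Total sleep (hours)',  # illustrative label; sleep_total is in hours
    y = 'Diet type'             # illustrative label for the vore variable
  )
print(p_labeled)
```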
To learn more, you need to understand the ggplot system
There’s actually more that we could do, but not without a much broader understanding of the ggplot syntax system.
If you’re a beginner, you can use this blog post as a starting point.
After you learn the basics or use this to create a simple boxplot, I recommend that you study the complete ggplot system and master it. This is particularly true if you want to get a solid data science job. Put simply, you’ll need to be able to create simple plots like the boxplot in your sleep. And you’ll need to do a lot more. You’ll need to be “fluent” in the basics.
If you’re serious about mastering data science, I strongly suggest you sign up for our email list. Here at Sharp Sight, we publish tutorials that explain how to master data science fast.