This tutorial will explain how to create a ggplot boxplot.
It explains the syntax, and shows clear, step-by-step examples of how to create a boxplot in R using ggplot2.
If you have the time though, you should probably read the whole tutorial. It will make more sense if you do.
A Quick Review of Boxplots
Before we look at the syntax for the ggplot boxplot, let’s quickly review what boxplots are and how they’re structured.
Boxplots visualize summary statistics for your data
Boxplots are a type of data visualization that shows summary statistics for your data.
More specifically, boxplots visualize what we call the “five number summary.” The five number summary is a set of values that includes:
- the minimum
- the first quartile (25th percentile)
- the median
- the third quartile (75th percentile)
- the maximum
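As a rough illustration, the five number summary can be computed in base R with quantile(). The vector x below is made-up data used only for this sketch; it is not part of the dataset used later in the tutorial.

```r
# Hypothetical numeric data, invented purely for illustration
x <- c(2, 4, 4, 5, 7, 8, 9, 11, 12, 20)

# Probabilities 0, 0.25, 0.5, 0.75, 1 give the minimum, Q1, median, Q3, and maximum
five_num <- quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1))

print(five_num)
```

Note that R offers several quantile definitions, so other tools may report slightly different quartiles for the same data.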
When we plot these statistics in the form of a boxplot, it looks something like this:
Take a look specifically at the structure. The different parts of the box and the two ends of the “whiskers” visualize our 5 number summary.
The Box
The box itself forms the core of the boxplot.
One side of the box represents the 25th percentile of our data (this is also called “the 1st quartile”, or Q1). The other end of the box represents the 75th percentile of our data (this is also called “the 3rd quartile”, or Q3).
Notice as well that there’s a line drawn in the interior of the box (the dotted line, in the above example). That line represents the median of the data (AKA, the “second quartile” or Q2).
So the box itself shows us the 25th percentile, the median, and the 75th percentile. All by itself, this gives us a lot of information about how the data are distributed.
Additionally, the width of the box gives us some information. The box spans from the 25th percentile to the 75th percentile. This distance is commonly known as the “interquartile range,” or IQR for short.
The Whiskers
Notice that on either side of the box, there are some lines that extend beyond the box. We typically call these the “whiskers.”
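These whisker lines show the location of the “minimum” value on one side, and the “maximum” value on the other.
Typically, these minimum and maximum values are calculated according to a formula rather than taken as the raw extremes of the data. Commonly, the lower limit is calculated as Q1 - 1.5*IQR and the upper limit is calculated as Q3 + 1.5*IQR. In ggplot2, each whisker extends to the most extreme data point that still falls within these limits.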
So in addition to showing the interquartile range, the boxplot also shows us minima and maxima. This can help us understand the high and low ranges for the data.
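The 1.5*IQR rule can be sketched in base R as follows. The vector x is made-up data, and the variable names (lower_limit, whisker_low, and so on) are just labels chosen for this example.

```r
# Hypothetical numeric data, invented purely for illustration
x <- c(2, 4, 4, 5, 7, 8, 9, 11, 12, 20)

# First and third quartiles, and the interquartile range
q1  <- quantile(x, 0.25, names = FALSE)
q3  <- quantile(x, 0.75, names = FALSE)
iqr <- q3 - q1

# Calculated limits according to the 1.5 * IQR rule
lower_limit <- q1 - 1.5 * iqr
upper_limit <- q3 + 1.5 * iqr

# The whisker ends are the most extreme observed values inside those limits
whisker_low  <- min(x[x >= lower_limit])
whisker_high <- max(x[x <= upper_limit])

print(c(lower_limit = lower_limit, upper_limit = upper_limit))
print(c(whisker_low = whisker_low, whisker_high = whisker_high))
```

In this sketch the calculated limits are not themselves data points, so the whiskers stop at the nearest observed values inside them.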
The Outliers
Finally, in the simple example above, you might notice some dots that exist beyond one of the whiskers.
These points represent outliers.
Remember, as noted in the section above, the “minimum” and “maximum” values in the boxplot are commonly calculated values.
Any outliers that we plot are simply values that are more extreme than those calculated minima and maxima (i.e., beyond 1.5*IQR from either end of the box).
These outliers show us the extreme values that might exist in the data.
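If you want to pull out the outlying values themselves, one possible approach is to apply the same 1.5*IQR rule by hand. This is only a sketch on made-up data; the vector x and the helper names are assumptions for illustration.

```r
# Hypothetical numeric data, invented purely for illustration
x <- c(2, 4, 4, 5, 7, 8, 9, 11, 12, 20)

q1  <- quantile(x, 0.25, names = FALSE)
q3  <- quantile(x, 0.75, names = FALSE)
iqr <- q3 - q1

# Flag values that fall beyond 1.5 * IQR from either end of the box
is_outlier <- x < (q1 - 1.5 * iqr) | x > (q3 + 1.5 * iqr)

# Keep only the flagged values
outliers <- x[is_outlier]
print(outliers)
```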
So that’s the basic structure of a boxplot.
Now that we’ve reviewed the parts of a boxplot, let’s look at how to create one with ggplot2.
An Introduction to the ggplot Boxplot
Now, let’s talk about how to create a boxplot in R with ggplot2.
In the next few sections, I’ll explain the syntax, and then I’ll show you clear examples of how to create both a simple boxplot, and also how to create variations of the boxplot.
Syntax of the ggplot Boxplot
Let’s take a look at the syntax.
The syntax is relatively straightforward, as long as you already know how ggplot2 works. (To learn more about the ggplot2 visualization system check out our guide to ggplot2 for beginners.)
To plot a boxplot, you’ll call the ggplot function. Inside the function, you’ll have the data parameter, and the x and y parameters (which are typically called inside the aes function). And finally you have the geom_boxplot function.
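Put together, the general pattern looks roughly like the sketch below. The dataframe mydataframe and its columns group and value are hypothetical, created here only so the code runs on its own.

```r
library(ggplot2)

# Hypothetical dataframe, made up purely to illustrate the syntax
mydataframe <- data.frame(
  group = rep(c("a", "b"), each = 5),
  value = c(1, 2, 3, 4, 5, 2, 4, 6, 8, 10)
)

# data = the dataframe, aes() = the variable mappings, geom_boxplot() = the boxplot layer
p <- ggplot(data = mydataframe, aes(x = group, y = value)) +
  geom_boxplot()

print(p)
```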
Let’s talk about each of these.
The data parameter
The data parameter enables us to specify the dataframe that we want to plot.
Remember that ggplot2 is primarily set up to work with R dataframes, so we specify the dataframe with this parameter. For example, if your dataframe is named mydataframe, then you’ll set the syntax to data = mydataframe.
You’ll see examples of how this works in the examples section.
The x and y parameters
The x and y parameters enable you to specify the variables that you want to map to the x-axis and y-axis, respectively.
Note that these parameters are called inside of the aes() function. Remember that in the ggplot2 system, the aes() function specifies how we map variables to aesthetic attributes of the plot.
(Again, to learn more about the aes() function, check out our guide to ggplot2 for beginners.)
The boxplot “geom”
Finally, we have the syntax geom_boxplot(). This syntax tells ggplot that we want to create a boxplot from our data, and from the variable mappings that we’ve set with the aes function.
If you’re confused about this, you need to understand what geoms are. Once again, to understand “geoms” and how they fit into the ggplot2 system, please see our guide to ggplot2 for beginners.
Additional parameters
The ggplot system also has other parameters that you can manipulate, like:
- color
- fill
- alpha (i.e., opacity)
And others.
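As a quick sketch of how these parameters can be set, the code below passes all three to geom_boxplot(). It uses the msleep dataset that ships with ggplot2 (the same data used in the examples later on); the specific color names and the alpha value are arbitrary choices for illustration.

```r
library(ggplot2)

# color = border color, fill = interior color, alpha = opacity (0 to 1)
p <- ggplot(data = msleep, aes(x = vore, y = sleep_total)) +
  geom_boxplot(color = "navy", fill = "skyblue", alpha = 0.5)

print(p)
```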
I’ll show you some simple modifications that you can make in the upcoming examples.
Examples: How to make boxplots with ggplot2
The boxplot is very easy to make using ggplot2. We’ll take a look at a few variations.
Examples:
- Simple ggplot boxplot
- Ggplot boxplot by category
- Horizontal boxplot
- Change the box color
- Add a title
But before we actually make our boxplots, we’ll need to run some code.
Preliminary code
In order to run our examples, we need to load the tidyverse package. We should also look at the data we’re going to plot.
Load tidyverse
First, we’ll load the tidyverse package. The tidyverse package actually contains the ggplot2 package, as well as several other important R packages like dplyr, tidyr, and others.
# LOAD TIDYVERSE PACKAGE
library(tidyverse)
Inspect data
In these examples, we’ll be working with the msleep dataframe.
This dataset contains data on the sleep patterns of different animals.
We can take a look with the glimpse() function.
# INSPECT DATA
msleep %>% glimpse()
OUT:
Rows: 83
Columns: 11
$ name         <chr> "Cheetah", "Owl monkey", "Mountain beaver", "Greater short-tai…
$ genus        <chr> "Acinonyx", "Aotus", "Aplodontia", "Blarina", "Bos", "Bradypus…
$ vore         <chr> "carni", "omni", "herbi", "omni", "herbi", "herbi", "carni", N…
$ order        <chr> "Carnivora", "Primates", "Rodentia", "Soricomorpha", "Artiodac…
$ conservation <chr> "lc", NA, "nt", "lc", "domesticated", NA, "vu", NA, "domestica…
$ sleep_total  <dbl> 12.1, 17.0, 14.4, 14.9, 4.0, 14.4, 8.7, 7.0, 10.1, 3.0, 5.3, 9…
$ sleep_rem    <dbl> NA, 1.8, 2.4, 2.3, 0.7, 2.2, 1.4, NA, 2.9, NA, 0.6, 0.8, 0.7, …
$ sleep_cycle  <dbl> NA, NA, NA, 0.1333333, 0.6666667, 0.7666667, 0.3833333, NA, 0.…
$ awake        <dbl> 11.9, 7.0, 9.6, 9.1, 20.0, 9.6, 15.3, 17.0, 13.9, 21.0, 18.7, …
$ brainwt      <dbl> NA, 0.01550, NA, 0.00029, 0.42300, NA, NA, NA, 0.07000, 0.0982…
$ bodywt       <dbl> 50.000, 0.480, 1.350, 0.019, 600.000, 3.850, 20.490, 0.045, 14…
Notice that there are several categorical variables, as well as numeric variables.
This makes it very well suited for visualization with a boxplot.
Let’s look at some examples.
EXAMPLE 1: simple ggplot boxplot
First, we’ll create a very simple boxplot.
Here, we’ll plot a single numeric variable, sleep_total.
# PLOT BOXPLOT
ggplot(data = msleep, aes(x = sleep_total)) +
  geom_boxplot()
And here’s what it looks like:
Explanation
Here, we’ve mapped a single numeric variable, sleep_total, to the x parameter.
When we create a boxplot with this mapping, ggplot outputs a horizontal boxplot of that numeric variable.
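For comparison, here is a minimal sketch of the vertical version, which maps the same variable to y instead of x. It assumes only that ggplot2 is installed, since msleep ships with that package.

```r
library(ggplot2)

# Mapping the numeric variable to y (instead of x) draws a vertical boxplot
p <- ggplot(data = msleep, aes(y = sleep_total)) +
  geom_boxplot()

print(p)
```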
EXAMPLE 2: ggplot boxplot by category
Next, we’ll create a boxplot that’s broken out by a categorical variable.
Let’s run the code, and then I’ll explain.
# PLOT BOXPLOT
ggplot(data = msleep, aes(x = vore, y = sleep_total)) +
  geom_boxplot()
And here’s what it looks like:
Explanation
Here, we mapped the categorical variable vore to the x parameter and the numeric variable sleep_total to the y parameter.
Notice that the orientation of the boxplot depends on which variable you map to which axis!
As you can see, since vore is a categorical variable, ggplot creates a separate boxplot for each category.
This is very useful for comparing data distributions across categories in your data.
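If you want to see the numbers behind each box, one option is to compute the quartiles per category with dplyr. This is a sketch, not part of the plotting code; the column names q1, med, q3, and n are labels chosen for this example.

```r
library(dplyr)
library(ggplot2)  # loaded here only because it provides the msleep dataset

# Quartiles of sleep_total for each vore category (rows with a missing vore form their own group)
sleep_by_vore <- msleep %>%
  group_by(vore) %>%
  summarise(
    q1  = quantile(sleep_total, 0.25, names = FALSE),
    med = median(sleep_total),
    q3  = quantile(sleep_total, 0.75, names = FALSE),
    n   = n()
  )

print(sleep_by_vore)
```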
EXAMPLE 3: make a horizontal boxplot by category
Next, we’ll create a horizontal boxplot.
This will be the same as the boxplot in example 2, except the orientation will be different.
# PLOT BOXPLOT
ggplot(data = msleep, aes(x = sleep_total, y = vore)) +
  geom_boxplot()
OUT:
Explanation
Again, this is the same boxplot that we had in example 2, except it’s flipped on its side.
Notice again that the orientation of the boxplot depends on which variables are mapped to the x and y parameters.
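Another way to get a horizontal boxplot, shown here only as an alternative sketch, is to keep the mappings from example 2 and add coord_flip(), which swaps the x and y axes of the finished plot.

```r
library(ggplot2)

# Same mappings as the vertical by-category boxplot, then flip the coordinate system
p <- ggplot(data = msleep, aes(x = vore, y = sleep_total)) +
  geom_boxplot() +
  coord_flip()

print(p)
```

Swapping the x and y mappings directly, as in the example above, is usually simpler; coord_flip() can be handy when you want to flip an existing plot without rewriting its mappings.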
EXAMPLE 4: Change the box color
Next we’ll change the color of the boxes.
To do this, we actually need to use the fill parameter.
The fill parameter controls the color of the interior of the boxes, but the color parameter actually controls the border color.
ggplot(data = msleep, aes(x = sleep_total, y = vore)) +
  geom_boxplot(fill = 'red')
OUT:
Explanation
Here, we changed the box color to red by setting fill = 'red'.
Notice that we did this inside the geom_boxplot() function. This tells ggplot2 that we’re specifically changing the fill color of the boxes. (This comes in handy if we have a layered plot with more than one geom type.)
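By contrast, if you want a different fill color for each category rather than one constant color, a possible approach is to map fill to a variable inside aes(). The sketch below maps fill to vore; the colors themselves come from ggplot2’s default palette.

```r
library(ggplot2)

# Mapping fill to vore inside aes() gives each category its own color and adds a legend
p <- ggplot(data = msleep, aes(x = sleep_total, y = vore, fill = vore)) +
  geom_boxplot()

print(p)
```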
EXAMPLE 5: Add a title
Let’s do one more thing.
Here, we’ll add a title:
ggplot(data = msleep, aes(x = sleep_total, y = vore)) +
  geom_boxplot(fill = 'red') +
  labs(title = 'On average, insectivores sleep more than other organism types')
OUT:
Explanation
Here, we added a title using the labs() function.
Titles and axis labels are relatively easy, but there are some important details that you might need to know.
For more information on titles and axis labels, check out our tutorial on ggplot titles.
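As a small sketch of the axis labels mentioned above, labs() also accepts x and y arguments. The label text here is just an example wording, not something prescribed by the dataset.

```r
library(ggplot2)

# labs() can set the title and the axis labels in one call
p <- ggplot(data = msleep, aes(x = sleep_total, y = vore)) +
  geom_boxplot(fill = "red") +
  labs(
    title = "Total sleep by diet type",
    x = "Total sleep (hours)",
    y = "Diet type (vore)"
  )

print(p)
```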
Leave your questions in the comments below
Do you have questions about the ggplot boxplot?
Is there something that I missed, or something else you’d like to know?
If so, leave your question in the comments section near the bottom of the page.
To learn more, you need to understand the ggplot system
There’s actually more that we could do, but not without a much broader understanding of the ggplot syntax system.
If you’re a beginner, you can use this blog post as a starting point.
After you learn the basics or use this to create a simple boxplot, I recommend that you study the complete ggplot system and master it. This is particularly true if you want to get a solid data science job. Put simply, you’ll need to be able to create simple plots like the boxplot in your sleep. And you’ll need to do a lot more. You’ll need to be “fluent” in the basics.
If you’re serious about mastering data science, I strongly suggest you sign up for our email list. Here at Sharp Sight, we publish tutorials that explain how to master data science fast.
Ultimate, my ass.
How to change f*ing quantiles without default example from help?
Well, if you had asked nicely, I might have offered some insight into how to do it.
But since you’re being an a$$hole ….
How do you extract the outliers? (Using builtin R graphing, you would say plot <- boxplot …. and then plot$out).
Thank you
I have almost no idea what you’re asking here.
You want to remove the outliers?
You should be using dplyr filter() to filter out observations that you don’t want.
Idiotic tutorial
If you have a serious critique about how I can improve it, great, tell me how to improve it.
But your current comment shows you to be nothing but an ill-mannered pest and a low-value waste of time.