The ultimate guide to the ggplot boxplot

This tutorial will explain how to create a ggplot boxplot.

It explains the syntax, and shows clear, step-by-step examples of how to create a boxplot in R using ggplot2.

If you need something specific, you can skip ahead to the relevant section of the tutorial.

If you have the time though, you should probably read the whole tutorial. It will make more sense if you do.

A Quick Review of Boxplots

Before we look at the syntax for the ggplot boxplot, let’s quickly review what boxplots are and how they’re structured.

Boxplots visualize summary statistics for your data

Boxplots are a type of data visualization that shows summary statistics for your data.

More specifically, boxplots visualize what we call the “five number summary.” The five number summary is a set of values that includes:

  • the minimum
  • the first quartile (25th percentile)
  • the median
  • the third quartile (75th percentile)
  • the maximum
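To make this concrete, here is a minimal sketch that computes the five number summary in R. It assumes the ggplot2 package is installed and uses the sleep_total variable from the msleep dataset (the same data used in the examples later in this tutorial). The names sleep and five_num are just illustrative.

```r
library(ggplot2)  # provides the msleep dataset

# Total daily sleep for each animal in msleep
sleep <- msleep$sleep_total

# Five number summary: minimum, Q1, median, Q3, maximum
five_num <- quantile(sleep, probs = c(0, 0.25, 0.5, 0.75, 1), na.rm = TRUE)
five_num
```

Base R also has fivenum() and summary(), which report similar values. They can differ slightly from quantile() because they compute the quartiles in a slightly different way.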

When we plot these statistics in the form of a boxplot, it looks something like this:

A simple visual explanation of a boxplot, with the "five number summary".

Take a look specifically at the structure. The different parts of the box and the two ends of the “whiskers” visualize our 5 number summary.

The Box

The box itself forms the core of the boxplot.

One side of the box represents the 25th percentile of our data (this is also called “the 1st quartile”, or Q1). The other end of the box represents the 75th percentile of our data (this is also called “the 3rd quartile”, or Q3).

Notice as well that there’s a line drawn inside the box (the dotted line, in the above example). That line represents the median of the data (AKA, the “second quartile” or Q2).

So the box itself shows us the 25th percentile, the median, and the 75th percentile. All by itself, this gives us a lot of information about how the data are distributed.

Additionally, the width of the box gives us some information. The box spans from the 25th percentile to the 75th percentile. This distance is commonly known as the “interquartile range,” or IQR for short.
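As a quick sketch of the IQR, we can compute it by hand as Q3 minus Q1 and compare it with R's built-in IQR() function. This assumes ggplot2 is installed so that the msleep dataset is available. The variable names are illustrative.

```r
library(ggplot2)  # provides the msleep dataset

sleep <- msleep$sleep_total

# First and third quartiles (the two ends of the box)
q1 <- quantile(sleep, 0.25, na.rm = TRUE, names = FALSE)
q3 <- quantile(sleep, 0.75, na.rm = TRUE, names = FALSE)

# The IQR is the distance between them
iqr_manual  <- q3 - q1
iqr_builtin <- IQR(sleep, na.rm = TRUE)

c(manual = iqr_manual, builtin = iqr_builtin)
```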

The Whiskers

Notice that on either side of the box, there are some lines that extend beyond the box. We typically call these the “whiskers.”

These whisker lines show the location of the minimum value on one side, and the maximum value on the other.

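Typically, these minimum and maximum values are calculated according to a formula. Commonly, the lower limit is calculated as Q1 – 1.5*IQR and the upper limit is calculated as Q3 + 1.5*IQR. Each whisker then extends to the most extreme data point that still falls within its limit, which is how ggplot2 draws whiskers by default.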

So in addition to showing the interquartile range, the boxplot also shows us minima and maxima. This can help us understand the high and low ranges for the data.
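Here is a rough sketch of the whisker calculation, assuming the common 1.5*IQR rule and the msleep data from ggplot2. The names lower_fence and upper_fence are my own labels for the calculated limits. The whiskers stop at the most extreme observations that fall inside those limits.

```r
library(ggplot2)  # provides the msleep dataset

# Drop missing values so min() and max() behave as expected
sleep <- msleep$sleep_total
sleep <- sleep[!is.na(sleep)]

q1  <- quantile(sleep, 0.25, names = FALSE)
q3  <- quantile(sleep, 0.75, names = FALSE)
iqr <- q3 - q1

# Calculated limits (sometimes called "fences")
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Whisker ends: the most extreme data points inside the fences
lower_whisker <- min(sleep[sleep >= lower_fence])
upper_whisker <- max(sleep[sleep <= upper_fence])

c(lower = lower_whisker, upper = upper_whisker)
```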

The Outliers

Finally, in the simple example above, you might notice some dots that exist beyond one of the whiskers.

These points represent outliers.

Remember, as noted in the section above, the “minimum” and “maximum” values in the boxplot are commonly calculated values.

Any outliers that we plot are simply values that are more extreme than those calculated minima and maxima (i.e., beyond 1.5*IQR from either end of the box).

These outliers show us the extreme values that might exist in the data.
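If you want to pull out the outlier values themselves, one possible approach is sketched below. It uses a small made-up vector (hours) with one obviously extreme value, applies the 1.5*IQR rule by hand, and then reads the outliers that ggplot2 computed for the plot using layer_data(). Treat this as an illustration under those assumptions rather than the only way to do it.

```r
library(ggplot2)

# Hypothetical data: seven typical values and one extreme value
hours <- c(8, 9, 10, 10, 11, 12, 13, 30)

q1  <- quantile(hours, 0.25, names = FALSE)
q3  <- quantile(hours, 0.75, names = FALSE)
iqr <- q3 - q1

# Manual rule: anything beyond 1.5*IQR from either end of the box
outliers <- hours[hours < q1 - 1.5 * iqr | hours > q3 + 1.5 * iqr]
outliers

# The same information, read from the data ggplot2 computed for the layer
p <- ggplot(data = data.frame(hours = hours), aes(y = hours)) +
  geom_boxplot()
gg_outliers <- unlist(layer_data(p)$outliers)
gg_outliers
```

Base R's boxplot.stats(hours)$out is another option, although it computes the box edges slightly differently, so its results can differ from ggplot2's on some datasets.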

So that’s the basic structure of a boxplot.

Now that we’ve reviewed the parts of a boxplot, let’s look at how to create one with ggplot2.

An Introduction to the ggplot Boxplot

Now, let’s talk about how to create a boxplot in R with ggplot2.

In the next few sections, I’ll explain the syntax, and then I’ll show you clear examples of how to create both a simple boxplot, and also how to create variations of the boxplot.

Syntax of the ggplot Boxplot

Let’s take a look at the syntax.

The syntax is relatively straightforward, as long as you already know how ggplot2 works. (To learn more about the ggplot2 visualization system check out our guide to ggplot2 for beginners.)

A simple explanation of the syntax for a ggplot boxplot.

To plot a boxplot, you’ll call the ggplot function. Inside the function, you’ll have the data parameter and the x and y parameters (which are typically set inside the aes function). Finally, you’ll add the geom_boxplot function.
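Put together, the general pattern looks something like the sketch below. The dataframe mydataframe and its columns group and value are hypothetical placeholders that I made up for illustration, so swap in your own data and variable names.

```r
library(ggplot2)

# Hypothetical dataframe with one categorical and one numeric column
mydataframe <- data.frame(
  group = rep(c("a", "b"), each = 5),
  value = c(1, 2, 3, 4, 5, 3, 4, 5, 6, 12)
)

# data = the dataframe, aes() = the variable mappings, geom_boxplot() = draw boxes
p <- ggplot(data = mydataframe, aes(x = group, y = value)) +
  geom_boxplot()
p
```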

Let’s talk about each of these.

The data parameter

The data parameter enables us to specify the dataframe that we want to plot.

Remember that ggplot2 is primarily set up to work with R dataframes, so we specify the dataframe with this parameter. For example, if your dataframe is named mydataframe, then you’ll set the syntax to data = mydataframe.

You’ll see examples of how this works in the examples section.

The x and y parameters

The x and y parameters enable you to specify the variables that you want to map to the x-axis and y-axis, respectively.

Note that these parameters are called inside of the aes() function. Remember that in the ggplot2 system, the aes() function specifies how we map variables to aesthetic attributes of the plot.

(Again, to learn more about the aes() function, check out our guide to ggplot2 for beginners.)
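As a small illustration, the aes() call can sit inside ggplot(), where it applies to every layer, or inside geom_boxplot(), where it applies to that layer only. The sketch below assumes the msleep dataset from ggplot2. For a single-layer plot like this one, both versions should produce the same boxes.

```r
library(ggplot2)

# Mapping set globally, inside ggplot()
p_global <- ggplot(data = msleep, aes(x = vore, y = sleep_total)) +
  geom_boxplot()

# Same mapping set locally, inside the geom
p_layer <- ggplot(data = msleep) +
  geom_boxplot(aes(x = vore, y = sleep_total))

p_global
p_layer
```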

The boxplot “geom”

Finally, we have the syntax geom_boxplot(). This syntax tells ggplot that we want to create a boxplot from our data, and from the variable mappings that we’ve set with the aes function.

If you’re confused about this, you need to understand what geoms are. Once again, to understand “geoms” and how they fit into the ggplot2 system, please see our guide to ggplot2 for beginners.

Additional parameters

The ggplot system also has other parameters that you can manipulate, like:

  • color
  • fill
  • alpha (i.e., opacity)

And others.

I’ll show you some examples of some simple modifications that you can make in the upcoming examples.
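As a quick preview, here is a sketch that sets all three of these parameters at once, assuming the msleep dataset from ggplot2. The specific color names and the alpha value are arbitrary choices for illustration.

```r
library(ggplot2)

# color = border, fill = interior, alpha = opacity (0 is transparent, 1 is opaque)
p <- ggplot(data = msleep, aes(x = vore, y = sleep_total)) +
  geom_boxplot(color = "navy", fill = "skyblue", alpha = 0.5)
p
```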

Examples: How to make boxplots with ggplot2

The boxplot is very easy to make using ggplot2. We’ll take a look at a few variations.

But before we actually make our boxplots, we’ll need to run some code.

Preliminary code

In order to run our examples, we need to load the tidyverse package. We should also look at the data we’re going to plot.

Load tidyverse

First, we’ll load the tidyverse package. The tidyverse package actually contains the ggplot2 package, as well as several other important R packages like dplyr, tidyr, and others.

# LOAD TIDYVERSE PACKAGE
library(tidyverse)

Inspect data

In these examples, we’ll be working with the msleep dataframe.

This dataset contains data on the sleep patterns of different animals.

We can take a look with the glimpse() function.

# INSPECT DATA
msleep %>% glimpse()

OUT:

Rows: 83
Columns: 11
$ name         <chr> "Cheetah", "Owl monkey", "Mountain beaver", "Greater short-tai…
$ genus        <chr> "Acinonyx", "Aotus", "Aplodontia", "Blarina", "Bos", "Bradypus…
$ vore         <chr> "carni", "omni", "herbi", "omni", "herbi", "herbi", "carni", N…
$ order        <chr> "Carnivora", "Primates", "Rodentia", "Soricomorpha", "Artiodac…
$ conservation <chr> "lc", NA, "nt", "lc", "domesticated", NA, "vu", NA, "domestica…
$ sleep_total  <dbl> 12.1, 17.0, 14.4, 14.9, 4.0, 14.4, 8.7, 7.0, 10.1, 3.0, 5.3, 9…
$ sleep_rem    <dbl> NA, 1.8, 2.4, 2.3, 0.7, 2.2, 1.4, NA, 2.9, NA, 0.6, 0.8, 0.7, …
$ sleep_cycle  <dbl> NA, NA, NA, 0.1333333, 0.6666667, 0.7666667, 0.3833333, NA, 0.…
$ awake        <dbl> 11.9, 7.0, 9.6, 9.1, 20.0, 9.6, 15.3, 17.0, 13.9, 21.0, 18.7, …
$ brainwt      <dbl> NA, 0.01550, NA, 0.00029, 0.42300, NA, NA, NA, 0.07000, 0.0982…
$ bodywt       <dbl> 50.000, 0.480, 1.350, 0.019, 600.000, 3.850, 20.490, 0.045, 14…

Notice that there are several categorical variables, as well as numeric variables.

This makes it very well suited for visualization with a boxplot.
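Before plotting by category, it can help to check which categories exist and whether any values are missing. The sketch below counts the values of the vore variable, assuming msleep from ggplot2. The useNA argument makes table() include a count for missing values, which matters here because the glimpse() output above shows an NA in that column.

```r
library(ggplot2)  # provides the msleep dataset

# Count animals in each vore category, including missing values
vore_counts <- table(msleep$vore, useNA = "ifany")
vore_counts
```

Any missing category shows up as its own NA box when vore is mapped to an axis.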

Let’s look at some examples.

EXAMPLE 1: simple ggplot boxplot

First, we’ll create a very simple boxplot.

Here, we’ll map a single numeric variable, sleep_total, to the x parameter.

# PLOT BOXPLOT
ggplot(data = msleep, aes(x = sleep_total)) +
  geom_boxplot()

And here’s what it looks like:

A simple ggplot2 boxplot with a numeric variable mapped to the x axis.

Explanation

Here, we’ve mapped a single numeric variable to the x parameter, sleep_total.

When we create a boxplot with this mapping, ggplot outputs a horizontal boxplot of that numeric variable.
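If you would rather have a vertical boxplot of the same variable, one option is to map it to y instead of x. This is a minimal sketch using the same msleep data.

```r
library(ggplot2)

# Mapping the numeric variable to y gives a vertical boxplot
p_vertical <- ggplot(data = msleep, aes(y = sleep_total)) +
  geom_boxplot()
p_vertical
```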

EXAMPLE 2: ggplot boxplot by category

Next, we’ll create a boxplot that’s broken out by a categorical variable.

Let’s run the code, and then I’ll explain.

# PLOT BOXPLOT
ggplot(data = msleep, aes(x = vore, y = sleep_total)) +
  geom_boxplot()

And here’s what it looks like:

A simple ggplot boxplot.

Explanation

Here, we mapped the categorical variable vore to the x parameter and the numeric variable sleep_total to the y parameter.

Notice that the orientation of the boxplot depends on what variable you map to which axis!

As you can see, since vore is a categorical variable, ggplot creates a separate boxplot for each category.

This is very useful for comparing data distributions across categories in your data.
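Because vore contains missing values, the plot includes a box labeled NA. If you don't want that box, one possible approach is to filter out those rows before plotting. The sketch below does this with base R's subset(). The name msleep_known is just an illustrative choice.

```r
library(ggplot2)

# Keep only the rows where the diet category is known
msleep_known <- subset(msleep, !is.na(vore))

p <- ggplot(data = msleep_known, aes(x = vore, y = sleep_total)) +
  geom_boxplot()
p
```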

EXAMPLE 3: make a horizontal boxplot by category

Next, we’ll create a horizontal boxplot.

This will be the same as the boxplot in example 2, except the orientation will be different.

# PLOT BOXPLOT
ggplot(data = msleep, aes(x = sleep_total, y = vore)) +
  geom_boxplot()

OUT:

A horizontal boxplot, broken out by category.

Explanation

Again, this is the same boxplot that we had in example 2, except it’s flipped on its side.

Notice again that the orientation of the boxplot depends on which variables are mapped to the x and y parameters.
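Swapping the x and y mappings like this relies on ggplot2 detecting the orientation automatically, which needs ggplot2 version 3.3.0 or later. As an alternative sketch, you can keep the vertical mapping from example 2 and add coord_flip(), which also works in older versions.

```r
library(ggplot2)

# Direct approach: numeric variable on x, categorical variable on y
p_direct <- ggplot(data = msleep, aes(x = sleep_total, y = vore)) +
  geom_boxplot()

# Alternative: build the vertical plot, then flip the coordinates
p_flipped <- ggplot(data = msleep, aes(x = vore, y = sleep_total)) +
  geom_boxplot() +
  coord_flip()

p_direct
p_flipped
```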

EXAMPLE 4: Change the box color

Next we’ll change the color of the boxes.

To do this, we actually need to use the fill parameter.

The fill parameter controls the color of the interior of the boxes, but the color parameter actually controls the border color.

ggplot(data = msleep, aes(x = sleep_total, y = vore)) +
  geom_boxplot(fill = 'red')

OUT:

An R boxplot made with ggplot2, with the box colors changed to red.

Explanation

Here, we changed the box color to red by setting fill = 'red'.

Notice that we did this inside the geom_boxplot() function. This tells ggplot2 that we’re specifically changing the fill color of the boxes. (This comes in handy if we have a layered plot with more than one geom type.)
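Setting fill = 'red' gives every box the same color. If you want a different color for each category instead, one common approach is to map fill to a variable inside aes(). The sketch below maps it to vore and leaves the actual colors to ggplot2's default palette.

```r
library(ggplot2)

# Mapping fill inside aes() colors each box by its category
p <- ggplot(data = msleep, aes(x = sleep_total, y = vore, fill = vore)) +
  geom_boxplot()
p
```

This also adds a legend. Since the categories are already labeled on the axis, you could hide it by passing show.legend = FALSE to geom_boxplot().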

EXAMPLE 5: Add a title

Let’s do one more thing.

Here, we’ll add a title:

ggplot(data = msleep, aes(x = sleep_total, y = vore)) +
  geom_boxplot(fill = 'red') +
  labs(title = 'On average, insectivores sleep more than other animal types')

OUT:

An R boxplot with a title added.

Explanation

Here, we added a title using the labs() function.

Titles and axis labels are relatively easy to add, but there are some important details that you might need to know.

For more information on titles and axis labels, check out our tutorial on ggplot titles.
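As a brief sketch, the same labs() function can also set the axis labels through its x and y arguments. The label text below is my own wording, chosen on the assumption that sleep_total is measured in hours and that vore describes each animal's diet type.

```r
library(ggplot2)

p <- ggplot(data = msleep, aes(x = sleep_total, y = vore)) +
  geom_boxplot(fill = 'red') +
  labs(
    title = 'Total sleep by diet type',
    x = 'Total sleep (hours)',  # assumed unit: hours
    y = 'Diet type'
  )
p
```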

Leave your questions in the comments below

Do you have questions about the ggplot boxplot?

Is there something that I missed, or something else you’d like to know?

If so, leave your question in the comments section near the bottom of the page.

To learn more, you need to understand the ggplot system

There’s actually more that we could do, but not without a much broader understanding of the ggplot syntax system.

If you’re a beginner, you can use this blog post as a starting point.

After you learn the basics or use this to create a simple boxplot, I recommend that you study the complete ggplot system and master it. This is particularly true if you want to get a solid data science job. Put simply, you’ll need to be able to create simple plots like the boxplot in your sleep. And you’ll need to do a lot more. You’ll need to be “fluent” in the basics.

If you’re serious about mastering data science, I strongly suggest you sign up for our email list. Here at Sharp Sight, we publish tutorials that explain how to master data science fast.

Joshua Ebner

Joshua Ebner is the founder, CEO, and Chief Data Scientist of Sharp Sight. Prior to founding the company, Josh worked as a Data Scientist at Apple. He has a degree in Physics from Cornell University. For more daily data science advice, follow Josh on LinkedIn.

8 thoughts on “The ultimate guide to the ggplot boxplot”

  1. How do you extract the outliers? (Using builtin R graphing, you would say plot <- boxplot …. and then plot$out).

    Thank you

    • If you have a serious critique about how I can improve it, great, tell me how to improve it.

      But your current comment shows you to be nothing but an ill-mannered pest and a low-value waste of time.

  2. When I increase the Y-axis, the box plot disappears to the bottom, but I have been asked to use 18 minimum and 42 maximum numbers. How can I bring it to display while my numbers are in there?

    • I’m not at all sure what you’re talking about.

      Please give me a clear example … ideally make some example code and an output image. You can’t post an image in the comments, so you could post that content somewhere else and then add a link here.
