Readers here at the Sharp Sight blog will know how much we stress data visualization and data analysis as the entry point to data science.

Contrary to what most people will tell you, at entry levels, data science is often not about complex math. With a few exceptions, you probably won’t need calculus, linear algebra, regression, or even machine learning to be a valuable junior member of a data team.

In many cases, junior members can create the most value by simply being masterful at more “basic” skills like analysis and data wrangling.

But that means that if you want to create value as a junior data scientist, you need to know the basic “toolkit” of analysis. You need to essentially master the basics. You need to be “fluent” in writing code to perform basic tasks.

One of the basic tools of analysis is the boxplot.


What is a boxplot?

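The boxplot visualizes numerical data by drawing the quartiles of the data: the first quartile, the second quartile (the median), and the third quartile. Boxplots usually also show “whiskers” that extend toward the minimum and maximum values. In ggplot2, the whiskers by default reach at most 1.5 times the interquartile range beyond the box, and any observations farther out are drawn individually as outlier points.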

Another way of saying this is that the boxplot is a visualization of the five number summary.

What’s a five number summary? Let me show you. Let’s use the following code:

library(tidyverse)

msleep %>%
  select(sleep_total) %>%
  summary()

# sleep_total   
# Min.   : 1.90  
# 1st Qu.: 7.85  
# Median :10.10  
# Mean   :10.43  
# 3rd Qu.:13.75  
# Max.   :19.90  

The five number summary is just a description of the minimum, the first quartile, the median, the third quartile, and the maximum (note that the code we just ran shows the “mean” as well).
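If you want just the five numbers without the mean, here is a minimal sketch that uses base R's quantile() function. It assumes ggplot2 is installed (it supplies the msleep dataset); the variable name five_num is only for illustration.

```r
library(ggplot2)  # provides the msleep dataset

# Probabilities 0, 0.25, 0.5, 0.75 and 1 correspond to the
# minimum, first quartile, median, third quartile and maximum
five_num <- quantile(msleep$sleep_total, probs = c(0, 0.25, 0.5, 0.75, 1))

five_num
```

The values should line up with the Min., 1st Qu., Median, 3rd Qu., and Max. rows of the summary() output.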

These five summary numbers are useful, so you should probably know how to calculate them as well. As the code above shows, you just use dplyr::select() to select the variable you want to analyze, and then use the summary() function.

Essentially, the boxplot helps us see the “spread” or the “dispersion” of the data by visualizing the interquartile range (i.e. the middle 50% of observations), median, maxima, and minima.
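To make “spread” concrete, here is a rough sketch that computes the interquartile range by hand, along with the 1.5 × IQR limits that geom_boxplot() uses by default to decide how far the whiskers may extend. The variable names (q1, q3, lower_fence, and so on) are my own, not part of ggplot2.

```r
library(ggplot2)  # provides the msleep dataset

sleep <- msleep$sleep_total

q1  <- quantile(sleep, 0.25, names = FALSE)  # first quartile (bottom of the box)
q3  <- quantile(sleep, 0.75, names = FALSE)  # third quartile (top of the box)
iqr <- q3 - q1                               # same value as IQR(sleep)

# Whiskers may reach at most 1.5 * IQR beyond the box
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Observations outside the fences would be drawn as individual points
outliers <- sleep[sleep < lower_fence | sleep > upper_fence]

iqr
outliers
```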

Basic syntax for a ggplot boxplot

The boxplot is very easy to make using ggplot2. I’ll explain how to create a ggplot boxplot, but first let’s take a quick look at the code:

# LOAD TIDYVERSE PACKAGE
library(tidyverse)

# INSPECT DATA
msleep %>% glimpse()

# PLOT BOXPLOT
ggplot(data = msleep, aes(x = vore, y = sleep_total)) +
  geom_boxplot()

And here’s what it looks like:


A simple ggplot boxplot.

Like I said, this is very easy to do, but if you don’t know how ggplot2 works, it can be easy to get confused.

That being the case, let’s do a quick review of how ggplot2 works in general.

Quick overview of ggplot2

ggplot2 is my favorite tool for data visualization and data analysis, but it takes a little getting used to. If you understand how it works, you know that it makes visualization very easy. But if you don’t understand it, it can seem a little enigmatic.

Let’s quickly talk about the basics of ggplot.

How the ggplot() function works

The ggplot() function just initiates plotting for the ggplot2 visualization system. It’s basically saying “we’re going to plot something.”

The data= parameter

Inside of the ggplot() function, the first thing you’ll see is the data parameter. Specifically, in the ggplot boxplot code above, you’ll see the code data = msleep. This simply identifies the data that we’ll plot.

So the ggplot() function indicates that we will plot some data, and the data parameter (inside of the ggplot() function) indicates exactly which dataset we’ll be using in the plot. Note also that the data parameter does not specify which variables we’ll be plotting. That’s handled by the aes() function.
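A quick way to convince yourself of this is to call ggplot() with only the data parameter. The sketch below assumes ggplot2 is installed; the object names p and p_box are just for illustration.

```r
library(ggplot2)

# Only the dataset is supplied: no variables are mapped and no geom is added
p <- ggplot(data = msleep)

# Printing p draws an empty panel, because ggplot2 knows the data
# but not which variables to plot or what to draw
p

# The variables only enter once we supply a mapping and a geom
p_box <- p + geom_boxplot(aes(x = vore, y = sleep_total))
p_box
```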

ggplot geoms, and geom_boxplot()

Notice that on the line below ggplot(), there’s a piece of syntax that says something about a boxplot: geom_boxplot(). What is this doing?

This just indicates that we’re going to plot a boxplot. A little more technically, it says that we will plot a boxplot “geom”.

So what the hell is a geom? To put it simply, a “geom” is just a “geometric object” that we can draw. Basic geoms are things like points, lines, bars, and polygons. In ggplot2, a “boxplot” is also considered a type of geom, and we can specify it using its own syntax … geom_boxplot().

If you’re a little confused about “geoms,” I suggest that you don’t overthink them. “Geoms” are just the things in a visualization that we draw: points, bars, lines, etc.
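One way to see that a geom is just “the thing we draw” is to keep the data and the variable mappings fixed and swap only the geom. This is a small sketch, assuming ggplot2 is installed; base_plot and the other object names are hypothetical.

```r
library(ggplot2)

# Same data and same variable mappings for every plot
base_plot <- ggplot(data = msleep, aes(x = vore, y = sleep_total))

box_plot    <- base_plot + geom_boxplot()  # draw boxes
point_plot  <- base_plot + geom_point()    # draw one point per animal
violin_plot <- base_plot + geom_violin()   # draw density "violins"

# Print whichever one you want to look at
box_plot
```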

The aes() function

Importantly, geoms have “aesthetic attributes.”

Again, this is simpler than it sounds, so don’t overthink it.

An “aesthetic attribute” is just a graphical attribute of the things that we draw. Aesthetic attributes are the attributes of geoms. So, we’re drawing things (geoms) and those geoms have attributes (aesthetic attributes).

What sorts of aesthetic attributes do geoms have? Simple things like their position along the x-axis, position along the y-axis, color, shape, etc. So for example, if you draw points (geom_point()), those points will have x-axis positions, y-axis positions, colors, shapes, etc.

In very simple visualizations (like the ggplot boxplot), we’ll just be plotting variables on the x-axis and y-axis. How do we indicate which variable to “connect” to the x-axis and which variable to “connect” to the y-axis?

We do this with the aes() function.

In slightly more technical terms, we use the aes() function to create a “mapping” from the dataset to the “aesthetic attributes” of the things that we plot.
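Here is a short sketch of what such a mapping looks like on its own. It assumes ggplot2 is installed; the object name mapping and the extra fill aesthetic are my additions for illustration.

```r
library(ggplot2)

# aes() only records which variable goes with which aesthetic
mapping <- aes(x = vore, y = sleep_total)
names(mapping)  # the aesthetics that have been mapped

# The mapping does nothing until it is combined with data and a geom
ggplot(data = msleep, mapping = mapping) +
  geom_boxplot()

# Other aesthetics work the same way, for example fill color
ggplot(data = msleep, aes(x = vore, y = sleep_total, fill = vore)) +
  geom_boxplot()
```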

Recap: how to make a simple ggplot boxplot

Now that we’ve reviewed how ggplot2 works, let’s go back and take a second look at our boxplot code.

# PLOT BOXPLOT
ggplot(data = msleep, aes(x = vore, y = sleep_total)) +
  geom_boxplot()

What have we done here?

We called the ggplot() function. Inside the ggplot() function, we specified that we will plot data from the msleep dataframe with the code data = msleep.

Also inside of the ggplot() function, we called the aes() function. Here, the aes() function indicates that we are going to “map” the vore variable to the x-axis and we will map the sleep_total variable to the y-axis.

Finally, on the second line, we indicated that we will plot a boxplot by using the syntax geom_boxplot().

Like I said … it’s really straightforward to make a boxplot in ggplot2 once you know how ggplot2 works.

Modifying the ggplot boxplot

Now that you know how to make a simple ggplot2 boxplot, let’s modify the basic plot to create a few variations or enhanced versions.

How to make a “sideways” boxplot

By default, geom_boxplot() assumes that we have a categorical variable mapped to the x-axis and a quantitative variable mapped to the y-axis. So in the simple boxplot example above, the boxes of the boxplot are positioned vertically; they are drawn top to bottom.

What if we want to draw the boxes sideways? In versions of ggplot2 before 3.3.0, it’s not as simple as changing the variable mappings: we cannot just reverse them and map vore to the y-axis and sleep_total to the x-axis. (Newer versions can infer the orientation from reversed mappings.)

An approach that works in any version is to use a special piece of code to “flip” the axes of the chart. We will use ggplot2::coord_flip().


# FLIPPED COORDINATES
ggplot(data = msleep, aes(x = vore, y = sleep_total)) +
  geom_boxplot() +
  coord_flip()

A ggplot boxplot that has been "flipped" so the x and y axes are exchanged.

Notice that when we do this, we just use the ‘+’ sign after geom_boxplot() and then add coord_flip(). It’s very easy to do.
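If you are on ggplot2 3.3.0 or later, there is also a second route: geom_boxplot() can work out the orientation when the categorical variable is mapped to the y-axis. The sketch below assumes such a version is installed; the object names are hypothetical.

```r
library(ggplot2)

# Assumes ggplot2 >= 3.3.0: the orientation is inferred from the mappings,
# so the categorical variable can go directly on the y-axis
p_swapped <- ggplot(data = msleep, aes(x = sleep_total, y = vore)) +
  geom_boxplot()

# The coord_flip() version, which also works in older versions
p_flipped <- ggplot(data = msleep, aes(x = vore, y = sleep_total)) +
  geom_boxplot() +
  coord_flip()

p_swapped
```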

How to make a boxplot with one variable

Next, let’s make a boxplot with one variable.

Typically, a ggplot2 boxplot requires you to have two variables: one categorical variable and one numeric variable.

In some instances though, you might just want to visualize the distribution of a single numeric variable without breaking it out by category.

This is one instance where the ggplot2 syntax is a little strange. To make a ggplot boxplot with only one variable, we need to use a special piece of syntax. We will set the x-axis to an empty string inside of the aes() function:

# BOX PLOT WITH 1 VARIABLE
ggplot(data = msleep, aes(x = "", y = sleep_total)) +
  geom_boxplot()

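Basically, ggplot2 expects something on the x-axis, and the simplest way to supply it without introducing a real variable is to put x = "" here. (Recent versions of ggplot2 will also draw the plot if you leave x out of aes() altogether, but x = "" works across versions.) It’s a rare instance of an unintuitive piece of syntax in ggplot2, but it works.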
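Because nothing meaningful is mapped to the x-axis here, you may also want to drop the x-axis title. This is a small optional sketch, assuming ggplot2 is installed; p_single is a hypothetical name.

```r
library(ggplot2)

p_single <- ggplot(data = msleep, aes(x = "", y = sleep_total)) +
  geom_boxplot() +
  labs(x = NULL)  # remove the x-axis title, since there is no x variable

p_single
```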

Notice that when we make a boxplot with one variable, it basically just shows the 5 number summary for that variable.

The 5 number summary is useful, so you should probably know how to calculate it. To do that, just use dplyr::select() to select the variable you want to analyze, and then use the summary() function:

msleep %>% 
  select(sleep_total) %>% 
  summary()

By the way, if you want to be a data scientist, this is the sort of code snippet you should have memorized.
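The same idea extends to the grouped boxplot. As a rough sketch, you can compute the five numbers separately for each vore group with dplyr::group_by() and dplyr::summarise(). This assumes dplyr and ggplot2 are installed; the column names in the output (q1, q3, and so on) are my own choices.

```r
library(dplyr)
library(ggplot2)  # provides the msleep dataset

# One row per vore group, with the numbers behind each box
by_vore <- msleep %>%
  group_by(vore) %>%
  summarise(
    min    = min(sleep_total),
    q1     = quantile(sleep_total, 0.25, names = FALSE),
    median = median(sleep_total),
    q3     = quantile(sleep_total, 0.75, names = FALSE),
    max    = max(sleep_total),
    n      = n()
  )

by_vore
```

Note that animals with a missing vore value end up in their own NA group, just as they get their own box in the plot.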

Add a title

Now let’s polish the boxplot a little.

Here, we’ll just add a title to the boxplot. To do this, we’ll just use the labs() function.

ggplot(data = msleep, aes(x = vore, y = sleep_total)) +
  geom_boxplot() +
  labs(title = 'On average, insectivores sleep more than other diet groups')

Note here that I’ve used the title as a tool to “tell a story” about the data. This is a best practice. Ideally, you shouldn’t use the title to just say something like “Plot of vore vs. sleep_total”. You want to use your titles to point something out.

Having said that, we could probably copy-edit this title more, but this is good enough for a working draft. Really, I just want to show you how it’s done.

To add a title to your box plot, just use the title parameter inside of the ggplot2::labs() function.

Add axis titles

We can also add axis titles using the labs() function.

To do this, we will just use the x and y parameters inside of the labs() function.

ggplot(data = msleep, aes(x = vore, y = sleep_total)) +
  geom_boxplot() +
  labs(title = 'On average, insectivores sleep more than other diet groups'
       ,x = 'Diet type (vore)'
       ,y = 'Total amount of sleep\n(hours)'
  )

ggplot boxplot with a plot title and axis titles.

Now we have a boxplot with a plot title as well as x-axis and y-axis titles.
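The labs() function also accepts subtitle and caption parameters, which are handy for adding context or a data source. Here is a brief sketch, assuming ggplot2 is installed; the wording of the labels and the name p_labeled are only examples.

```r
library(ggplot2)

p_labeled <- ggplot(data = msleep, aes(x = vore, y = sleep_total)) +
  geom_boxplot() +
  labs(title = 'On average, insectivores sleep more than other diet groups'
       ,subtitle = 'Total sleep by diet type'
       ,caption = 'Data: msleep dataset from ggplot2'
       ,x = 'Diet type (vore)'
       ,y = 'Total amount of sleep\n(hours)'
  )

p_labeled
```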

Format the titles of the boxplot

Once you have a basic ggplot boxplot, you’ll probably want to do a little formatting.

A full discussion of the ggplot2 formatting system is outside the scope of this post, but I’ll give you a quick view of how to format the title.

We’re going to take the code that we just used, and we’ll add a new line of code that calls the ggplot theme() function.

ggplot(data = msleep, aes(x = vore, y = sleep_total)) +
  geom_boxplot() +
  labs(title = 'On average, insectivores sleep more than other diet groups') +
  # 'Avenir' must be installed on your system; drop family to use the default font
  theme(text = element_text(color = "#333333", family = 'Avenir')
        ,plot.title = element_text(size = 18, face = 'bold'))

A formatted ggplot boxplot
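If you plan to format several plots the same way, one option is to store the theme() call in an object and add it wherever you need it. This is a sketch rather than a recommended house style: the name my_theme is hypothetical, and I have left out the font family so the code does not depend on Avenir being installed.

```r
library(ggplot2)

# A reusable bundle of formatting choices
my_theme <- theme(text = element_text(color = "#333333")
                  ,plot.title = element_text(size = 18, face = 'bold'))

p_formatted <- ggplot(data = msleep, aes(x = vore, y = sleep_total)) +
  geom_boxplot() +
  labs(title = 'On average, insectivores sleep more than other diet groups') +
  my_theme

p_formatted
```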

To learn more, you need to understand the ggplot system

There’s actually more that we could do, but not without a much broader understanding of the ggplot syntax system.

If you’re a beginner, you can use this blog post as a starting point.

After you learn the basics or use this to create a simple boxplot, I recommend that you study the complete ggplot system and master it. This is particularly true if you want to get a solid data science job. Put simply, you’ll need to be able to create simple plots like the boxplot in your sleep. And you’ll need to do a lot more. You’ll need to be “fluent” in the basics.

If you’re serious about mastering data science, I strongly suggest you sign up for our email list. Here at Sharp Sight, we publish tutorials that explain how to master data science fast.