Readers here at the Sharp Sight blog will know how much we stress data visualization and data analysis as the entry point to data science.
Contrary to what most people will tell you, at entry levels, data science is often not about complex math. With a few exceptions, you probably won’t need calculus, linear algebra, regression, or even machine learning to be a valuable junior member of a data team.
In many cases, junior members can create the most value by simply being masterful at more “basic” skills like analysis and data wrangling.
But that means that if you want to create value as a junior data scientist, you need to know the basic “toolkit” of analysis. You need to essentially master the basics. You need to be “fluent” in writing code to perform basic tasks.
One of the basic tools of analysis is the boxplot.
What is a boxplot?
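The boxplot visualizes numerical data by drawing the quartiles of the data: the first quartile, second quartile (the median), and the third quartile. Boxplots often also show “whiskers” that extend toward the minimum and maximum values. (In ggplot2, the whiskers by default reach no further than 1.5 times the interquartile range from the box, and any points beyond that are drawn individually as outliers.)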
Another way of saying this is that the boxplot is a visualization of the five number summary.
What’s a five number summary? Let me show you. Let’s use the following code:
    msleep %>% select(sleep_total) %>% summary()
    #   sleep_total
    #  Min.   : 1.90
    #  1st Qu.: 7.85
    #  Median :10.10
    #  Mean   :10.43
    #  3rd Qu.:13.75
    #  Max.   :19.90
The five number summary is just a description of the minimum, the first quartile, the median, the third quartile, and the maximum (note that the code we just ran shows the “mean” as well).
These five summary numbers are useful, so you should probably know how to calculate them as well. To do that, just use dplyr::select() to select the variable you want to analyze, and then use the summary() function, as in the code above.
Essentially, the boxplot helps us see the “spread” or the “dispersion” of the data by visualizing the interquartile range (i.e. the middle 50% of observations), median, maxima, and minima.
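If you want the five numbers on their own, without the mean, here is a minimal sketch using base R's quantile() and IQR(). It assumes the ggplot2 package is installed (that is where the msleep data lives), and the names five_numbers and spread are just illustrative.

```r
library(ggplot2)  # provides the msleep dataset

# Five number summary: min, 1st quartile, median, 3rd quartile, max
five_numbers <- quantile(msleep$sleep_total, probs = c(0, 0.25, 0.5, 0.75, 1))
print(five_numbers)

# Interquartile range: the width of the middle 50% of observations
spread <- IQR(msleep$sleep_total)
print(spread)
```

The interquartile range is simply the third quartile minus the first quartile, which is the height of the box in a boxplot.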
Basic syntax for a ggplot boxplot
The boxplot is very easy to make using ggplot2. I’ll explain how to create a ggplot boxplot, but first let’s take a quick look at the code:
    # LOAD TIDYVERSE PACKAGE
    library(tidyverse)
    # INSPECT DATA
    msleep %>% glimpse()
    # PLOT BOXPLOT
    ggplot(data = msleep, aes(x = vore, y = sleep_total)) +
      geom_boxplot()
And here’s what it looks like:
Like I said, this is very easy to do, but if you don’t know how ggplot2 works, it can be easy to get confused.
That being the case, let’s do a quick review of how ggplot2 works in general.
Quick overview of ggplot2
ggplot2 is my favorite tool for data visualization and data analysis, but it takes a little getting used to. If you understand how it works, you know that it makes visualization very easy. But if you don’t understand it, it can seem a little enigmatic.
Let’s quickly talk about the basics of ggplot.
How the ggplot() function works
The ggplot() function just initiates plotting for the ggplot2 visualization system. It’s basically saying “we’re going to plot something.”
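To see what “initiating” means in practice, here is a small sketch that calls ggplot() without adding any geoms. The name empty_plot is just illustrative, and the example assumes the ggplot2 package is installed.

```r
library(ggplot2)

# ggplot() alone sets up the plot but adds no geoms
empty_plot <- ggplot(data = msleep)

# No layers have been added yet, so printing the plot gives a blank panel
length(empty_plot$layers)
print(empty_plot)
```

Nothing is drawn until we add at least one geom to the plot.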
The data= parameter
Inside of the ggplot() function, the first thing you’ll see is the data parameter. Specifically, in the ggplot boxplot code above, you’ll see the code data = msleep. This is simply identifying the data that we’ll plot.
So the ggplot() function indicates that we will plot some data, and the data parameter (inside of the ggplot() function) indicates exactly which dataset we’ll be using in the plot. Note also that the data parameter does not specify exactly which variables we’ll be plotting. That’s essentially handled by the aes() function.
ggplot geoms, and geom_boxplot()
Notice that on the line below ggplot(), there’s a piece of syntax that says something about a boxplot: geom_boxplot(). What is this doing?
This just indicates that we’re going to plot a boxplot. A little more technically, it says that we will plot a boxplot “geom”.
So what the hell is a geom? To put it simply, a “geom” is just a “geometric object” that we can draw. Basic geoms are things like points, lines, bars, and polygons. In ggplot2, a “boxplot” is also considered a type of geom, and we can specify it using its own syntax: geom_boxplot().
If you’re a little confused about “geoms,” I suggest that you don’t overthink them. “Geoms” are just the things in a visualization that we draw: points, bars, lines, etc.
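One way to get a feel for geoms is to keep the data and the variable mappings fixed and swap only the geom. The sketch below does that. The object names are illustrative, and geom_point() is used purely as a contrast to geom_boxplot().

```r
library(ggplot2)

# Same data and same variable mappings for both plots
base_plot <- ggplot(data = msleep, aes(x = vore, y = sleep_total))

box_version   <- base_plot + geom_boxplot()  # one box per vore category
point_version <- base_plot + geom_point()    # one point per animal

print(box_version)
print(point_version)
```

The only thing that changes between the two plots is the kind of object that gets drawn.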
The aes() function
Importantly, geoms have “aesthetic attributes.”
Again, this is simpler than it sounds, so don’t overthink it.
An “aesthetic attribute” is just a graphical attribute of the things that we draw. Aesthetic attributes are the attributes of geoms. So, we’re drawing things (geoms) and those geoms have attributes (aesthetic attributes).
What sorts of aesthetic attributes do geoms have? Simple things like their position along the x-axis, position along the y-axis, color, shape, etc. So for example, if you draw points (geom_point()), those points will have x-axis positions, y-axis positions, colors, shapes, etc.
In very simple visualizations (like the ggplot boxplot), we’ll just be plotting variables on the x-axis and y-axis. How do we indicate which variable to “connect” to the x-axis and which variable to “connect” to the y-axis?
We do this with the aes() function.
In slightly more technical terms, we use the aes() function to create a “mapping” from the dataset to the “aesthetic attributes” of the things that we plot. The term “aesthetic mapping” just refers to this connection between a variable and an aesthetic attribute.
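Here is a short sketch of aes() on its own and inside a plot. The fill = vore mapping is my own addition for illustration (it colors each box by category), and the names mapping and filled_boxes are hypothetical.

```r
library(ggplot2)

# aes() only describes the mapping; it does not draw anything by itself
mapping <- aes(x = vore, y = sleep_total)
print(mapping)

# The same idea extends to other aesthetics, e.g. the fill color of each box
filled_boxes <- ggplot(data = msleep, aes(x = vore, y = sleep_total, fill = vore)) +
  geom_boxplot()
print(filled_boxes)
```

Each argument to aes() pairs one aesthetic attribute (x, y, fill) with one variable in the dataset.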
Recap: how to make a simple ggplot boxplot
Now that we’ve reviewed how ggplot2 works, let’s go back and take a second look at our boxplot code.
    # PLOT BOXPLOT
    ggplot(data = msleep, aes(x = vore, y = sleep_total)) +
      geom_boxplot()
What have we done here?
We called the ggplot() function. Inside the ggplot() function, we specified that we will plot data from the msleep dataframe with the code data = msleep.
Also inside of the ggplot() function, we called the aes() function. Here, the aes() function indicates that we are going to “map” the vore variable to the x-axis and we will map the sleep_total variable to the y-axis.
Finally, on the second line, we indicated that we will plot a boxplot by using the syntax geom_boxplot().
Like I said … it’s really straightforward to make a boxplot in ggplot2 once you know how ggplot2 works.
Modifying the ggplot boxplot
Now that you know how to make a simple ggplot2 boxplot, let’s modify the basic plot to create a few variations or enhanced versions.
How to make a “sideways” boxplot
geom_boxplot() assumes that we have a categorical variable mapped to the x-axis and a quantitative variable mapped to the y-axis. So in the simple boxplot example above, the boxes of the boxplot are positioned vertically; they are drawn top to bottom.
What if we want to draw the boxes sideways? In older versions of ggplot2, it’s not as simple as changing the variable mappings: we cannot just reverse the mappings and map vore to the y-axis and sleep_total to the x-axis.
Instead, we use a special piece of code to “flip” the axes of the chart, which works in any version. We will use coord_flip().
    # FLIPPED COORDINATES
    ggplot(data = msleep, aes(x = vore, y = sleep_total)) +
      geom_boxplot() +
      coord_flip()
Notice that when we do this, we just use the ‘+’ sign after geom_boxplot() and then add coord_flip(). It’s very easy to do.
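For comparison, here is a sketch of two ways to get horizontal boxes. The second option is not from the original code: it assumes ggplot2 version 3.3.0 or later, where geom_boxplot() can work out the orientation from the mappings. The names flipped and swapped are illustrative.

```r
library(ggplot2)

# Option 1: flip the coordinate system with coord_flip()
flipped <- ggplot(data = msleep, aes(x = vore, y = sleep_total)) +
  geom_boxplot() +
  coord_flip()

# Option 2 (assumes ggplot2 >= 3.3.0): map the categorical variable to y
swapped <- ggplot(data = msleep, aes(x = sleep_total, y = vore)) +
  geom_boxplot()

print(flipped)
print(swapped)
```

If you are not sure which version of ggplot2 you have, coord_flip() is the safer choice.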
How to make a boxplot with one variable
Next, let’s make a boxplot with one variable.
Typically, a ggplot2 boxplot requires you to have two variables: one categorical variable and one numeric variable.
In some instances though, you might just want to visualize the distribution of a single numeric variable without breaking it out by category.
This is one instance where the ggplot2 syntax is a little strange. To make a ggplot boxplot with only one variable, we need to use a special piece of syntax. We will set the x-axis to an empty string inside of the aes() function.
    # BOX PLOT WITH 1 VARIABLE
    ggplot(data = msleep, aes(x = "", y = sleep_total)) +
      geom_boxplot()
Basically, ggplot2 expects something to be mapped to the x-axis, so we can’t just remove the x= parameter. Instead, we need to put x = "" here. It’s a rare instance of an unintuitive piece of syntax in ggplot2, but it works.
Notice that when we make a boxplot with one variable, it basically just shows the 5 number summary for that variable.
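You can check this yourself. The sketch below uses ggplot2's layer_data() function to pull out the values that were computed for the box and whiskers, and compares them with base R's quantile(). The names one_var and box_stats are illustrative.

```r
library(ggplot2)

one_var <- ggplot(data = msleep, aes(x = "", y = sleep_total)) +
  geom_boxplot()

# layer_data() returns the values ggplot2 computed for the first layer
box_stats <- layer_data(one_var)
print(box_stats[, c("ymin", "lower", "middle", "upper", "ymax")])

# Compare with the five number summary from base R
print(quantile(msleep$sleep_total))
```

For sleep_total the two agree, because no value lies more than 1.5 times the interquartile range away from the box. For a variable with more extreme values, the whiskers would stop short of the minimum or maximum and the remaining points would be drawn as outliers.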
The 5 number summary is useful, so you should probably know how to calculate it. To do that, just use dplyr::select() to select the variable you want to analyze, and then use the summary() function:
msleep %>% select(sleep_total) %>% summary()
By the way, if you want to be a data scientist, this is the sort of code snippet you should have memorized.
Add a title
Now let’s polish the boxplot a little.
Here, we’ll just add a title to the boxplot. To do this, we’ll just use the labs() function.
    ggplot(data = msleep, aes(x = vore, y = sleep_total)) +
      geom_boxplot() +
      labs(title = 'On average, insects sleep more than other organism types')
Note here that I’ve used the title as a tool to “tell a story” about the data. This is a best practice. Ideally, you shouldn’t use the title to just say something like “Plot of vore vs. sleep_total”. You want to use your titles to point something out.
Having said that, we could probably copy-edit this title more, but this is good enough for a working draft. Really, I just want to show you how it’s done.
To add a title to your box plot, just use the title parameter inside of the labs() function.
Add axis titles
We can also add axis titles using the labs() function.
To do this, we will just use the x and y parameters inside of the labs() function:
    ggplot(data = msleep, aes(x = vore, y = sleep_total)) +
      geom_boxplot() +
      labs(title = 'On average, insects sleep more than other organism types'
           ,x = 'Organism type'
           ,y = 'Total amount of sleep\n (hours)'
           )
Now we have a boxplot with a plot title, but also the x and y-axis titles.
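The labs() function accepts a few other labels too. As a sketch, the example below adds a subtitle and a caption. Both texts are my own placeholders, not from the original plot, and the name labelled is illustrative.

```r
library(ggplot2)

labelled <- ggplot(data = msleep, aes(x = vore, y = sleep_total)) +
  geom_boxplot() +
  labs(title = 'On average, insects sleep more than other organism types'
       ,subtitle = 'Total sleep by feeding type'  # placeholder text
       ,caption = 'Data: msleep (ggplot2)'        # placeholder text
       ,x = 'Organism type'
       ,y = 'Total amount of sleep\n (hours)'
       )
print(labelled)
```

All of the text labels live in one call to labs(), which keeps them easy to find and edit later.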
Format the titles of the boxplot
Once you have a basic ggplot boxplot, you’ll probably want to do a little formatting.
A full discussion of the ggplot2 formatting system is outside the scope of this post, but I’ll give you a quick view of how to format the title.
We’re going to take the code that we just used, and we’ll add a new line of code that calls the ggplot theme() function.
    ggplot(data = msleep, aes(x = vore, y = sleep_total)) +
      geom_boxplot() +
      labs(title = 'On average, insects sleep more than other organism types') +
      # the 'Avenir' font needs to be installed on your system; size is a number (points)
      theme(text = element_text(color = "#333333", family = 'Avenir')
            ,plot.title = element_text(size = 18, face = 'bold'))
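If you format several plots the same way, one option is to store the theme() call in an object and add it to each plot. This is a sketch rather than the only way to do it: the name boxplot_theme is hypothetical, and I’ve swapped in the generic "sans" font family on the assumption that Avenir may not be installed on your machine.

```r
library(ggplot2)

# A reusable theme object; "sans" is a generic family used here as a stand-in
boxplot_theme <- theme(
  text = element_text(color = "#333333", family = "sans"),
  plot.title = element_text(size = 18, face = "bold")
)

themed <- ggplot(data = msleep, aes(x = vore, y = sleep_total)) +
  geom_boxplot() +
  labs(title = 'On average, insects sleep more than other organism types') +
  boxplot_theme
print(themed)
```

Adding boxplot_theme to another plot applies exactly the same text formatting.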
To learn more, you need to understand the ggplot system
There’s actually more that we could do, but not without a much broader understanding of the ggplot syntax system.
If you’re a beginner, you can use this blog post as a starting point.
After you learn the basics or use this to create a simple boxplot, I recommend that you study the complete ggplot system and master it. This is particularly true if you want to get a solid data science job. Put simply, you’ll need to be able to create simple plots like the boxplot in your sleep. And you’ll need to do a lot more. You’ll need to be “fluent” in the basics.
If you’re serious about mastering data science, I strongly suggest you sign up for our email list. Here at Sharp Sight, we publish tutorials that explain how to master data science fast.