
Readers here at the Sharp Sight blog will know how much we stress data visualization and data analysis as the entry point to data science.

Contrary to what most people will tell you, at entry levels, data science is often not about complex math. With a few exceptions, you probably won’t need calculus, linear algebra, regression, or even machine learning to be a valuable junior member of a data team.

In many cases, junior members can create the most value by simply being masterful at more “basic” skills like analysis and data wrangling.

But that means that if you want to create value as a junior data scientist, you need to know the basic “toolkit” of analysis. You need to essentially master the basics. You need to be “fluent” in writing code to perform basic tasks.

One of the basic tools of analysis is the boxplot.

What is a boxplot?

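The boxplot visualizes numerical data by drawing the quartiles of the data: the first quartile, second quartile (the median), and the third quartile. Often they also show “whiskers” that extend toward the maximum and minimum values. (By default, ggplot2 stops each whisker at the most extreme observation within 1.5 times the interquartile range of the box and draws anything beyond that as an individual outlier point.)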

Another way of saying this is that the boxplot is a visualization of the five number summary.

What’s a five number summary? Let me show you. Let’s use the following code:

```
# LOAD TIDYVERSE PACKAGE (provides %>%, select(), and the msleep data)
library(tidyverse)

msleep %>%
  select(sleep_total) %>%
  summary()

#  sleep_total
#  Min.   : 1.90
#  1st Qu.: 7.85
#  Median :10.10
#  Mean   :10.43
#  3rd Qu.:13.75
#  Max.   :19.90
```

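The five number summary is just a description of the minimum, the first quartile, the median, the third quartile, and the maximum (note that the code we just ran shows the “mean” as well). The distance between the first and third quartiles is the interquartile range.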

These five summary numbers are useful, so you should probably know how to calculate them as well. As in the code above, just use `dplyr::select()` to select the variable you want to analyze, and then use the `summary()` function.

Essentially, the boxplot helps us see the “spread” or the “dispersion” of the data by visualizing the interquartile range (i.e. the middle 50% of observations), median, maxima, and minima.
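As a rough cross-check, the same five numbers can also be pulled out with base R's `quantile()` function, which by default returns the 0%, 25%, 50%, 75%, and 100% quantiles. This is only a sketch: the names `five_num` and `iqr` are mine, and it assumes `ggplot2` is installed so that the `msleep` dataset is available. The result should agree with the `summary()` output above, apart from rounding.

```r
library(ggplot2)  # provides the msleep dataset

# quantile() defaults to probs = c(0, 0.25, 0.5, 0.75, 1),
# i.e. min, first quartile, median, third quartile, max
five_num <- quantile(msleep$sleep_total)
five_num

# The interquartile range is the distance between Q1 and Q3
iqr <- unname(five_num["75%"] - five_num["25%"])
iqr
```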

Basic syntax for a ggplot boxplot

The boxplot is very easy to make using ggplot2. I’ll explain how to create a ggplot boxplot, but first let’s take a quick look at the code:

```
# LOAD TIDYVERSE PACKAGE
library(tidyverse)

# INSPECT DATA
msleep %>% glimpse()

# PLOT BOXPLOT
ggplot(data = msleep, aes(x = vore, y = sleep_total)) +
  geom_boxplot()
```

And here’s what it looks like:

Like I said, this is very easy to do, but if you don’t know how `ggplot2` works, it can be easy to get confused.

That being the case, let’s do a quick review of how `ggplot2` works in general.

Quick overview of ggplot2

`ggplot2` is my favorite tool for data visualization and data analysis, but it takes a little getting used to. If you understand how it works, you know that it makes visualization very easy. But if you don’t understand it, it can seem a little enigmatic.

Let’s quickly talk about the basics of ggplot.

How the ggplot() function works

The `ggplot()` function just initiates plotting for the `ggplot2` visualization system. It’s basically saying “we’re going to plot something.”

The data= parameter

Inside of the `ggplot()` function, the first thing you’ll see is the `data` parameter. Specifically, in the ggplot boxplot code above, you’ll see the code `data = msleep`. This simply identifies the data that we’ll plot.

So the `ggplot()` function indicates that we will plot some data, and the `data` parameter (inside of the `ggplot()` function) indicates exactly which dataset we’ll be using in the plot. Note also that the `data` parameter does not specify which variables we’ll be plotting. That’s essentially performed by the `aes()` function.
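To make that division of labor concrete, here is a small sketch of what you get from `ggplot()` with a dataset and nothing else. The object name `p` is just an illustration; the point is that the plot object exists but has no layers yet, so printing it draws only an empty panel.

```r
library(ggplot2)

# ggplot() with only a dataset: a plot object with no layers yet
p <- ggplot(data = msleep)

# Number of geoms added so far
length(p$layers)

# Printing p draws an empty panel; there is nothing to draw until
# we add variable mappings and a geom
p
```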

ggplot geoms, and geom_boxplot()

Notice that on the line below `ggplot()`, there’s a piece of syntax that says something about a boxplot: `geom_boxplot()`. What is this doing?

This just indicates that we’re going to plot a boxplot. A little more technically, it says that we will plot a boxplot “geom”.

So what the hell is a geom? To put it simply, a “geom” is just a “geometric object” that we can draw. Basic geoms are things like points, lines, bars, and polygons. In `ggplot2`, a “boxplot” is also considered a type of geom, and we can specify it using its own syntax … `geom_boxplot()`.

If you’re a little confused about “geoms,” I suggest that you don’t overthink them. “Geoms” are just the things in a visualization that we draw: points, bars, lines, etc.
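One way to see that geoms are just interchangeable “things we draw” is to keep the data and variables fixed and swap only the geom. The sketch below does that; the object names (`base_plot`, `p_box`, and so on) are made up for illustration, and the `aes()` call is explained in the next section.

```r
library(ggplot2)

# Shared starting point: same data, same variables
base_plot <- ggplot(data = msleep, aes(x = vore, y = sleep_total))

# Only the geom changes from plot to plot
p_box    <- base_plot + geom_boxplot()  # boxes
p_points <- base_plot + geom_point()    # one point per animal
p_violin <- base_plot + geom_violin()   # density "violins"

# Print any of them to draw it
p_box
```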

The aes() function

Importantly, geoms have “aesthetic attributes.”

Again, this is simpler than it sounds, so don’t overthink it.

An “aesthetic attribute” is just a graphical attribute of the things that we draw. Aesthetic attributes are the attributes of `geoms`. So, we’re drawing things (geoms) and those geoms have attributes (aesthetic attributes).

What sorts of aesthetic attributes do geoms have? Simple things like their position along the x-axis, position along the y-axis, color, shape, etc. So for example, if you draw points (`geom_point()`), those points will have x-axis positions, y-axis positions, colors, shapes, etc.

In very simple visualizations (like the ggplot boxplot), we’ll just be plotting variables on the x-axis and y-axis. How do we indicate which variable to “connect” to the x-axis and which variable to “connect” to the y-axis?

We do this with the `aes()` function.
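If it helps, you can look at what `aes()` produces on its own. In this sketch (the object names `mapping` and `p` are mine), `aes()` simply records which variable goes with which aesthetic, and that record is then handed to `ggplot()` through its `mapping` argument.

```r
library(ggplot2)

# aes() on its own just records which variable goes to which aesthetic
mapping <- aes(x = vore, y = sleep_total)
names(mapping)

# The same mapping can be passed to ggplot() explicitly
p <- ggplot(data = msleep, mapping = mapping) +
  geom_boxplot()
p
```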

In slightly more technical terms, we use the `aes()` function to create a “mapping” from the dataset to the “aesthetic attributes” of the things that we plot. The term “aesthetic

Recap: how to make a simple ggplot boxplot

Now that we’ve reviewed how `ggplot2` works, let’s go back and take a second look at our boxplot code.

```
# PLOT BOXPLOT
ggplot(data = msleep, aes(x = vore, y = sleep_total)) +
  geom_boxplot()
```

What have we done here?

We called the `ggplot()` function. Inside the `ggplot()` function, we specified that we will plot data from the `msleep` dataframe with the code `data = msleep`.

Also inside of the `ggplot()` function, we called the `aes()` function. Here, the `aes()` function indicates that we are going to “map” the `vore` variable to the x-axis and we will map the `sleep_total` variable to the y-axis.

Finally, on the second line, we indicated that we will plot a boxplot by using the syntax `geom_boxplot()`.

Like I said … it’s really straightforward to make a boxplot in ggplot2 once you know how `ggplot2` works.

Modifying the ggplot boxplot

Now that you know how to make a simple ggplot2 boxplot, let’s modify the basic plot to create a few variations or enhanced versions.

How to make a “sideways” boxplot

By default, `geom_boxplot()` assumes that we have a categorical variable mapped to the x-axis and a quantitative variable mapped to the y-axis. So in the simple boxplot example above, the boxes of the boxplot are positioned vertically; they are drawn top to bottom.

What if we want to draw the boxes sideways? In versions of ggplot2 before 3.3.0, it’s not as simple as changing the variable mappings: we cannot just reverse them and map `vore` to the y-axis and `sleep_total` to the x-axis.

Instead, we can use a special piece of code to “flip” the axes of the chart, which works in every version. We will use `ggplot2::coord_flip()`.

```
# FLIPPED COORDINATES
ggplot(data = msleep, aes(x = vore, y = sleep_total)) +
  geom_boxplot() +
  coord_flip()
```

Notice that when we do this, we just use the `+` sign after `geom_boxplot()` and then add `coord_flip()`. It’s very easy to do.
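For what it’s worth, ggplot2 3.3.0 and later can infer the orientation of a boxplot from the mappings, so on a recent installation swapping the variables should also give a sideways plot. The sketch below assumes such a version (it stops early otherwise); the name `p_sideways` is just for illustration.

```r
library(ggplot2)

# Assumption: ggplot2 >= 3.3.0, where geom_boxplot() infers orientation
stopifnot(packageVersion("ggplot2") >= "3.3.0")

# Categorical variable on y, numeric variable on x: boxes run left to right
p_sideways <- ggplot(data = msleep, aes(x = sleep_total, y = vore)) +
  geom_boxplot()
p_sideways
```

`coord_flip()` remains the safer choice if your code has to run on older installations.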

How to make a boxplot with one variable

Next, let’s make a boxplot with one variable.

Typically, a ggplot2 boxplot requires you to have two variables: one categorical variable and one numeric variable.

In some instances though, you might just want to visualize the distribution of a single numeric variable without breaking it out by category.

This is one instance where the ggplot2 syntax is a little strange. To make a ggplot boxplot with only one variable, we need to use a special piece of syntax. We will set the x-axis to an empty string inside of the `aes()` function:

```
# BOX PLOT WITH 1 VARIABLE
ggplot(data = msleep, aes(x = "", y = sleep_total)) +
  geom_boxplot()
```

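Basically, older versions of ggplot2 expect something to be mapped to the x-axis, so we can’t just remove the `x=` parameter. Instead, we need to put `x = ""` here. It’s a rare instance of an unintuitive piece of syntax in ggplot2, but it works. (Recent versions are more forgiving and will also draw a single box from `aes(y = sleep_total)` alone.)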

Notice that when we make a boxplot with one variable, it basically just shows the 5 number summary for that variable.
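If you want to confirm this, you can ask ggplot2 which values it actually drew. The sketch below uses `ggplot2::layer_data()` to pull out the computed statistics for the single box; the names `p_single` and `drawn` are mine. Keep in mind that the whisker ends (`ymin`, `ymax`) match the true minimum and maximum only when no observation lies more than 1.5 times the interquartile range beyond the box, which is the case for `sleep_total`.

```r
library(ggplot2)

p_single <- ggplot(data = msleep, aes(x = "", y = sleep_total)) +
  geom_boxplot()

# layer_data() returns the values ggplot2 computed for the first layer
drawn <- layer_data(p_single)

# Whisker ends, box edges (quartiles), and the median line
drawn[, c("ymin", "lower", "middle", "upper", "ymax")]
```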

The 5 number summary is useful, so you should probably know how to calculate it. To do that, just use `dplyr::select()` to select the variable you want to analyze, and then use the `summary()` function:

```
msleep %>%
  select(sleep_total) %>%
  summary()
```

By the way, if you want to be a data scientist, this is the sort of code snippet you should have memorized.

Now let’s polish the boxplot a little.

Here, we’ll just add a title to the boxplot. To do this, we’ll just use the `labs()` function.

```
# vore holds diet categories; "insecti" means insectivores, not insects
ggplot(data = msleep, aes(x = vore, y = sleep_total)) +
  geom_boxplot() +
  labs(title = 'On average, insectivores sleep more than other organism types')
```

Note here that I’ve used the title as a tool to “tell a story” about the data. This is a best practice. Ideally, you shouldn’t use the title to just say something like “Plot of vore vs. sleep_total”. You want to use your titles to point something out.

Having said that, we could probably copy-edit this title more, but this is good enough for a working draft. Really, I just want to show you how it’s done.

To add a title to your box plot, just use the `title` parameter inside of the `ggplot2::labs()` function.

We can also add axis titles using the `labs()` function.

To do this, we will just use the `x` and `y` parameters inside of the `labs()` function.

```
ggplot(data = msleep, aes(x = vore, y = sleep_total)) +
  geom_boxplot() +
  labs(title = 'On average, insectivores sleep more than other organism types'
       ,x = 'Organism type'
       ,y = 'Total amount of sleep\n (hours)'
       )
```

Now we have a boxplot with a plot title as well as x-axis and y-axis titles.

Format the titles of the boxplot

Once you have a basic ggplot boxplot, you’ll probably want to do a little formatting.

A full discussion of the ggplot2 formatting system is outside the scope of this post, but I’ll give you a quick view of how to format the title.

We’re going to take the code that we just used, and we’ll add a new line of code that calls the ggplot `theme()` function.

```
# NOTE: family = 'Avenir' only takes effect if that font is installed
ggplot(data = msleep, aes(x = vore, y = sleep_total)) +
  geom_boxplot() +
  labs(title = 'On average, insectivores sleep more than other organism types') +
  theme(text = element_text(color = "#333333", family = 'Avenir')
        ,plot.title = element_text(size = 18, face = 'bold'))  # size is numeric
```

After you learn the basics or use this to create a simple boxplot, I recommend that you study the complete ggplot system and master it. This is particularly true if you want to get a solid data science job. Put simply, you’ll need to be able to create simple plots like the boxplot in your sleep. And you’ll need to do a lot more. You’ll need to be “fluent” in the basics.