
I’ve often said that the foundation of data science is data analysis.

That’s not to say that data analysis and data science are 100% synonymous (they’re not), but rather that you need to master the basic tools of data analysis before you can get into the more advanced skills that make up data science.

Said more simply: if you want to be a great data scientist, you need to be a great data analyst first.

Data analysis is often about making comparisons

In many cases, data analysis is a game of comparison.

As the great Edward Tufte noted, when we analyze data, we’re often trying to answer the question “Compared to what?”

At the heart of quantitative reasoning is a single question: Compared to what?

– Edward Tufte

When we analyze data, we’re frequently comparing things. Comparing one bar in a bar chart to another bar. Comparing one category to another. Comparing one variable to another. Comparing a variable to itself, with different constraints or filters. Comparing one model to another, or comparing a model with one set of parameters/assumptions/etc to another model with a different set of parameters/assumptions/etc.

The closer you look at data analysis, the more it looks like comparisons all the way down the rabbit hole.

The fact is, if you want to be a great data scientist, you need to be a great data analyst. And if you want to be a great data analyst, you need to be a master at making comparisons with data.

Arguably, the best tool for doing this is the small multiple chart.

Quick introduction to small multiples

So what is the small multiple?

It’s OK if you don’t know. The small multiple is not used that often (which I will discuss in a moment).

The small multiple is a chart that essentially creates several similar, small panels of the same chart type, where each panel is a little different.

Frequently, each panel represents a different subset of the data. The different panels will represent a subset for a specific category (of a categorical variable). There are also other, more complicated small multiple designs where we subset by two different categorical variables, and each panel represents a unique combination of categories for those categoricals.

Let’s take a look at a simple example.

Before showing you a small multiple, let’s just take a look at a simple scatterplot:

```r
library(ISLR)     # provides the Auto dataset
library(ggplot2)

# Scatterplot of horsepower vs. acceleration
ggplot(data = Auto, aes(x = horsepower, y = acceleration)) +
  geom_point()
```

This is a fairly straightforward chart. Horsepower vs acceleration plotted as a scatterplot.

We can see a basic relationship between these two variables. However, there might be more to see. There are a variety of ways to slice and dice this dataset, and if we examine different “slices” of the data, we might find something interesting or insightful.

A simple way to “slice” the data would be to repeatedly subset it. You could subset the data several times, and make a separate chart for every subset.
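To make the manual approach concrete, here is a rough sketch of what it might look like. It assumes the `Auto` data from `ISLR`; the names `auto_by_cyl` and `manual_plots` are just ones I made up for illustration, and pinning the axis limits is my own choice so that the separate charts stay comparable.

```r
library(ISLR)     # provides the Auto dataset
library(ggplot2)

# Split Auto into one data frame per cylinder count
auto_by_cyl <- split(Auto, Auto$cylinders)

# Build one scatterplot per subset, with shared axis limits
manual_plots <- lapply(names(auto_by_cyl), function(cyl) {
  ggplot(data = auto_by_cyl[[cyl]], aes(x = horsepower, y = acceleration)) +
    geom_point() +
    xlim(range(Auto$horsepower)) +
    ylim(range(Auto$acceleration)) +
    ggtitle(paste("cylinders =", cyl))
})

# Each chart then has to be printed (and arranged) separately
for (p in manual_plots) print(p)
```

It works, but you end up with a pile of separate charts and a fair amount of bookkeeping.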

However, a much faster way is to simply make a small multiple version of this chart, broken out by a third categorical variable.

Let’s do that. Here, we’ll break out the above scatterplot by the “cylinders” variable.

```r
ggplot(data = Auto, aes(x = horsepower, y = acceleration)) +
  geom_point() +
  facet_wrap(~cylinders)
```

What have we done here?

We broke our original chart out into five separate but similar charts, or “panels.” We broke it out on a categorical variable, such that each panel of the chart is a separate version of the initial chart for a particular value of the categorical variable. So we have one scatterplot for 3-cylinder cars, one scatterplot for 4-cylinder cars, and so on. Each panel is effectively a subsetted version of our initial scatterplot for a particular value of “cylinders”.
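If you want to see where the five panels come from, it can help to count the rows behind each one. This is just a quick check, assuming the `Auto` data from `ISLR`; `panel_counts` is a name I picked for illustration.

```r
library(ISLR)   # provides the Auto dataset

# Number of cars behind each panel of the small multiple
panel_counts <- table(Auto$cylinders)
panel_counts

# One panel per distinct value of the faceting variable
length(panel_counts)
```

If a panel turns out to contain only a handful of cars, be cautious about reading a pattern into it.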

Essentially, this technique created multiple small panels with different versions of the same chart.

Small panels. Multiple panels.

Small multiple.

Get it?

Take another look at the first simple scatterplot that we just made, and then look at the small multiple.

It’s immediately obvious that there are differences between the panels of the small multiple chart. We couldn’t see these differences in the initial simple scatterplot, because all of the data were lumped together into a single chart. But by breaking the chart out into several small panels, we can suddenly see that there are differences in the data for different values of the cylinder variable.

… differences that we couldn’t see initially, in the original chart.

… differences that might be important (depending on our analytical goals).

This is why you need to master the small multiple design: The small multiple enables you to see differences. You are literally seeing the original chart in a new way. Seeing the data more clearly. Seeing with **ahem** sharper sight.

Small multiples, when used properly, enable you to make very quick comparisons across categorical classes in a dataset and see important differences.

When you need to analyze, visualize, or otherwise explore a dataset – and drill down into a dataset to find insights – the small multiple design is a powerful tool that you need to use.

The small multiple is wildly underused

Because it is so useful for making visual comparisons, I think the small multiple design is quite frequently the best tool for finding insights in data.

And I’m not the only person.

Let’s go back to that quote from the data visualization guru, Tufte. This time, I’ll give you the full quote:

At the heart of quantitative reasoning is a single question: Compared to what?

Small multiple designs, multivariate and data bountiful, answer directly by visually enforcing comparisons of changes, of the differences among objects, of the scope of alternatives. For a wide range of problems in data presentation, small multiples are the best design solution.

– Edward Tufte

Read that again: “for a wide range of problems … small multiples are the best design solution.”

While we can’t take the opinion of one man as gospel, I must say that I agree with Tufte. Small multiples are a tool for directly comparing across groups. This means that in many, many cases, they are the best tool for making comparisons, making distinctions, drilling into your data, and ultimately finding insights.

So when you use a simple tool like the bar chart, the histogram, the scatterplot, or the line chart, there’s frequently an opportunity to further “drill into” the data by breaking the chart out into small multiples.

But even though the small multiple is often one of the best tools for data analysis and exploration, it is wildly underused. I would estimate that only about one out of ten data analysts or data scientists actually uses the technique.

Part of the reason that they’re so rarely used is that they’re typically just hard to create with most toolsets. In base R, they’re a little hard to create (and a little ugly). And don’t get me started on trying to do them in Excel. You can create small multiples in Excel, but the process for creating them is tedious at best. Because they are hard to create with most tools, they are just not used very often.
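To make the contrast concrete, here is roughly what the cylinder breakdown might look like with base R graphics. Treat it as a sketch, assuming the `Auto` data from `ISLR`; the 2-by-3 layout, the shared axis limits, and the variable names are my own choices for illustration.

```r
library(ISLR)   # provides the Auto dataset

cyl_values <- sort(unique(Auto$cylinders))

# Set up a 2 x 3 grid of panels by hand, remembering the old settings
old_par <- par(mfrow = c(2, 3))

for (cyl in cyl_values) {
  panel_data <- Auto[Auto$cylinders == cyl, ]
  plot(
    panel_data$horsepower, panel_data$acceleration,
    xlim = range(Auto$horsepower),      # shared axes keep panels comparable
    ylim = range(Auto$acceleration),
    xlab = "horsepower", ylab = "acceleration",
    main = paste("cylinders:", cyl)
  )
}

par(old_par)  # restore the previous graphics settings
```

You have to manage the grid, the loop, the subsetting, and the axis limits yourself, which is a big part of why people skip the technique.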

This is excellent news for you, being the disciplined data science student that you are.

Because they are so powerful and so underused, you can distinguish yourself by learning and using the small multiple. Your analyses will be more insightful, and you will personally be more effective in getting things done compared to people who don’t use them.

If you want to be a great data analyst and a great data scientist, you should learn and master the small multiple.

How to use the small multiple

Ok, you should get it by now. The small multiple is a powerful tool.

Let’s quickly take a look at a few examples of how this is used.

Data Exploration

One of the most frequent uses for the small multiple chart is for simple data exploration.

When you initially approach a new dataset, you’ll want to get an idea of what’s in it.

You may be looking for something in particular, but more often than not, you’re just trying to get a sense of how the data are structured, whether there are any issues with the data, and whether there is anything “interesting” that needs to be investigated further.

One of the first things you’ll do when you’re exploring a dataset is create histograms or density plots of your variables.

You’ll also sometimes want to create subsetted density plots for different categories or subsets of your data.

This is a perfect use case for the small multiple design.

Let’s take a look.

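```r
library(ISLR)      # provides the Credit dataset
library(tidyverse) # provides %>% and ggplot2

# Density of Income, one panel per Student x Married combination
Credit %>%
  ggplot(aes(x = Income)) +
  geom_density() +
  facet_grid(Student ~ Married)
```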

We’ve used the small multiple design to break out the density plot into several new plots based on two categorical variables. This gives us a new view of the data and helps us “look closer” for new insights.
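One thing that can trip you up with a two-variable grid is that the panel strips only show the category values (“Yes” and “No”), not which variable they belong to. Here is a small variation, assuming the same `Credit` data from `ISLR`, that uses `ggplot2`’s `label_both` labeller so each strip shows the variable name as well; `income_by_group` is just an illustrative name.

```r
library(ISLR)     # provides the Credit dataset
library(ggplot2)

# Rows are Student, columns are Married; label_both prints "variable: value"
income_by_group <- ggplot(Credit, aes(x = Income)) +
  geom_density() +
  facet_grid(Student ~ Married, labeller = label_both)

income_by_group
```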

Now, when you do this, sometimes you find something, and sometimes you don’t.

The important thing, though, is that you commonly need to create a lot of these charts. You need to “slice and dice” your data in a variety of ways.

The small multiple gives you a way to do this. The small multiple gives you a way to look at different cuts of your dataset to see if there’s anything interesting or in need of further exploration.

Data Visualization

Beyond simple, preliminary data exploration, you can also use the small multiple design to visualize and present results. In particular, the small multiple design gives you a way to convey a lot of information in a single chart.

A great example of this is a data visualization by Andrew Gelman:

This is a great visualization.

It conveys vast amounts of information in a small, well contained visualization. If you look at it, you can clearly see information about the intersection of various categories. You can quickly view a different “small” version of the map for different combinations of income and ethnic background.

The small multiple technique will increase your data visualization skill, because it will give you the ability to display vast amounts of information quickly and concisely.

Data preparation (prior to machine learning)

Small multiples are also excellent for a special type of data exploration: exploring data for data preparation, prior to building a machine learning model.

When you build a machine learning model, it’s very common that you need to clean, reshape, or prepare your data. For example, you’ll need to identify and possibly deal with outliers. You will need to identify skewness in your data (and possibly deal with it). You’ll sometimes need to verify that your input variables are normally distributed.

One way to identify and diagnose these potential issues is by using data exploration techniques.

Here’s a quick example.

Let’s say you approach a new dataset with the goal of building a regression model. One of the first things you might want to do is examine the numeric variables and look for skewness, outliers, etc.

To do this quickly, you could simply subset down to the numeric variables, and plot those numeric variables as histograms or density plots.

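```r
library(caret)     # provides the Sacramento dataset
library(tidyverse) # provides %>%, select_if(), gather(), and ggplot2
data(Sacramento)

# Stack the numeric columns into key/value pairs, then plot one density per variable
Sacramento %>%
  select_if(is.numeric) %>%
  gather() %>%
  ggplot(aes(x = value)) +
  geom_density() +
  facet_wrap(~key, scales = 'free')
```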

It’s true that you could do these one at a time as individual density plots, but it’s much faster if you just use the small multiple design.

Moreover, if you have dozens or hundreds of numeric variables, it will be highly impractical to plot a density chart for each variable, one by one. So to quickly solve this problem, you can just use the small multiple.

If you want to eventually learn machine learning, this is something you need to be able to do. You need to be able to plot the distributions of your predictor variables. The small multiple design makes this straightforward.
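If you are on a recent version of the tidyverse, you may prefer `pivot_longer()` for the reshaping step, since the `tidyr` documentation describes `gather()` as superseded. Here is a sketch of the same chart written that way, assuming `dplyr` 1.0 or later for `where()`; `sacramento_long` is a name I chose for illustration.

```r
library(caret)    # provides the Sacramento dataset
library(dplyr)
library(tidyr)
library(ggplot2)

data(Sacramento)

# Reshape the numeric columns to long form: one row per (variable, value) pair
sacramento_long <- Sacramento %>%
  select(where(is.numeric)) %>%
  pivot_longer(everything(), names_to = "key", values_to = "value")

# One density panel per numeric variable, each with its own axis scales
ggplot(sacramento_long, aes(x = value)) +
  geom_density() +
  facet_wrap(~key, scales = "free")
```

The `scales = "free"` argument matters here, because the variables are measured in very different units and a shared axis would flatten most of the panels.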

Small multiples are easy with `ggplot2`

Long time readers will know that I love `ggplot2` and the tidyverse.

I want to be clear about this: I don’t strongly recommend the tidyverse toolkit just because everyone else is using it or out of some unthinking loyalty to its creators.

Quite the opposite: if the creators of the tidyverse did something that I firmly disagreed with, I would be the first to say it.

No. I recommend the tidyverse because it is extremely well designed. It is so well designed that making a small multiple chart, which is really hard with almost any other toolkit, is as simple as adding one more line of code.

That’s it. One extra line.

Let me show you by reviewing our initial example:

In the first example in this blog post, we made a scatterplot.

```r
library(ISLR)     # provides the Auto dataset
library(ggplot2)

ggplot(data = Auto, aes(x = horsepower, y = acceleration)) +
  geom_point()
```

This is a very simple chart in `ggplot2`. We’re mapping horsepower to the x axis and acceleration to the y axis.

But with the addition of only a single extra line of code, we can break this out into small panels; one extra line of code transforms a simple scatterplot into a small multiple chart:

```r
ggplot(data = Auto, aes(x = horsepower, y = acceleration)) +
  geom_point() +
  facet_wrap(~cylinders)
```

We’ve taken the original chart and broken it out on the cylinder variable using the line `facet_wrap(~cylinders)`.
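That one line also takes a few arguments that control the layout. As a small sketch, assuming the same `Auto` data, here is the chart with all panels forced into a single row using `nrow = 1`, which can make it easier to compare acceleration across panels along the shared y axis. The name `one_row` is just for illustration.

```r
library(ISLR)     # provides the Auto dataset
library(ggplot2)

# Same small multiple, but with every panel laid out in one row
one_row <- ggplot(data = Auto, aes(x = horsepower, y = acceleration)) +
  geom_point() +
  facet_wrap(~cylinders, nrow = 1)

one_row
```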

Essentially, ggplot2 enables you to create small multiple charts with ease.

If you aren’t already using `ggplot2` and the tidyverse, I highly recommend them.

Do you want to be a great data scientist? Learn data analysis.

… And if you want to quickly and effortlessly use the small multiple technique, you should learn some `ggplot2`.
