The ggplot histogram is very easy to make.

But like many things in ggplot2, it can seem a little complicated at first. In this article, we’ll show you exactly how to make a simple ggplot histogram, show you how to modify it, explain how it can be used, and more.

Let’s jump in.

Building a simple ggplot histogram

In order to build a histogram using ggplot2, you need to know how the ggplot system works. It’s not terribly hard once you get the hang of it, but it can be a little confusing to beginners.

Before we get into it, let’s load the tidyverse package, which includes ggplot2. We’ll also inspect txhousing, a dataset that ships with ggplot2 and that we’ll be using throughout.

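#--------------
# LOAD PACKAGES
#--------------
# (if the packages are not installed yet, run install.packages("tidyverse") once first)
library(tidyverse)  # attaches ggplot2, dplyr, and the other core tidyverse packages
library(ggplot2)    # already attached by tidyverse; repeated here to be explicit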

#--------
# INSPECT
#--------
txhousing %>% glimpse()

# Observations: 8,602
# Variables: 9
# $ city      <chr> "Abilene", "Abilene", "Abilene", "Abilene", "Abilene", "Abilene", "Abil...
# $ year      <int> 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000,...
# $ month     <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1...
# $ sales     <dbl> 72, 98, 130, 98, 141, 156, 152, 131, 104, 101, 100, 92, 75, 112, 118, 1...
# $ volume    <dbl> 5380000, 6505000, 9285000, 9730000, 10590000, 13910000, 12635000, 10710...
# $ median    <dbl> 71400, 58700, 58100, 68600, 67300, 66900, 73500, 75000, 64500, 59300, 7...
# $ listings  <dbl> 701, 746, 784, 785, 794, 780, 742, 765, 771, 764, 721, 658, 779, 700, 7...
# $ inventory <dbl> 6.3, 6.6, 6.8, 6.9, 6.8, 6.6, 6.2, 6.4, 6.5, 6.6, 6.2, 5.7, 6.8, 6.0, 6...
# $ date      <dbl> 2000.000, 2000.083, 2000.167, 2000.250, 2000.333, 2000.417, 2000.500, 2...

Now, let’s make a simple ggplot histogram:

#-----
# PLOT
#-----

# BASIC HISTOGRAM
ggplot(data = txhousing, aes(x = median)) +
  geom_histogram()
  

A simple ggplot histogram.

This histogram is pretty simple to create if you know how ggplot works.

But on the assumption that you’re a little unfamiliar with ggplot, let’s quickly review how the ggplot2 system works.

Quick review: how the ggplot2 system works

I love ggplot2.

Part of the reason is that it’s extremely systematic. Once you know how the ggplot2 system works, you can create almost any visualization with relative ease. Histograms are just a very simple example.

The ggplot() function

The ggplot() function initiates plotting. It creates a plot object from the dataset we pass to the data = parameter, and we then build up the visualization by adding layers to that object with the + operator.
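To see what "initiating" a plot means in practice, here is a minimal sketch that builds the histogram in two steps. The object name p is just an illustrative choice, not something ggplot2 requires.

```r
library(ggplot2)

# ggplot() alone records the data and the mappings, but draws no layers
p <- ggplot(data = txhousing, aes(x = median))

# Printing p at this point produces an empty plot: there is nothing to draw yet
p

# Adding a geom with + is what actually draws the histogram
p + geom_histogram()
```

Splitting the code this way is optional. It just makes it easier to see that ggplot() sets things up and the geom does the drawing.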

The aes() function

The aes() function specifies our variable mappings.

If you haven’t done this before, then “variable mapping” might not immediately make sense. It’s relatively straightforward though.

Let’s take a look at our histogram code again to try to make this more clear.

# BASIC HISTOGRAM
ggplot(data = txhousing, aes(x = median)) +
  geom_histogram()

Notice that inside of the aes() we have the expression x = median.

What are we doing there?

We are “mapping” the median variable to the x axis. Notice again that this expression appears inside of the aes() function. Why? Because it is a variable mapping. All mappings from datasets to “aesthetic attributes” like the x-axis occur inside of the aes() function.

This can get a lot more complicated. For example, with a scatterplot, you’ll map a variable to the x axis and another variable to the y axis. It can get even more complicated with advanced visualization techniques, but the basics are straightforward. A dataset has variables. A visualization has aesthetic attributes like the x axis, y axis, color, shape, etc. We need to “connect” the variables to the aesthetic attributes. This can be accomplished with the aes() function.
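As a quick illustration of mapping two variables at once, here is a sketch of a scatterplot built from the same dataset. The choice of listings and sales is arbitrary (any two numeric columns of txhousing would do), and p_scatter is just an illustrative name.

```r
library(ggplot2)

# Map one variable to the x axis and another to the y axis
p_scatter <- ggplot(data = txhousing, aes(x = listings, y = sales)) +
  geom_point()

# Print the plot
p_scatter
```

The structure is the same as the histogram: only the mappings inside aes() and the geom have changed.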

A step-by-step breakdown of a ggplot histogram

Ok.

The ggplot() function initiates plotting. The aes() function specifies how we want to “map” or “connect” variables in our dataset to the aesthetic attributes of the shapes we plot.

With that knowledge in mind, let’s revisit our ggplot histogram and break it down.

#-----
# PLOT
#-----

# BASIC HISTOGRAM
ggplot(data = txhousing, aes(x = median)) +
  geom_histogram()
  

What are we doing here?

ggplot() indicates that we’re going to plot something. The data = parameter indicates that we’ll plot data from the txhousing dataset. Inside of the aes() function, we’re specifying that we want to put the “median” variable on the x axis. Finally, geom_histogram() indicates that we are going to plot a histogram.

(I won’t go over “geoms” in detail here. Suffice it to say, there are many different geoms in ggplot2 that plot different types of things.)
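One practical note: when you run the basic histogram, ggplot2 may print a warning about removed rows. As far as I can tell, that happens because the median column contains missing values, which the histogram simply drops. Here is a small sketch, assuming you would rather remove those rows yourself first. The name tx_complete is made up for this example.

```r
library(ggplot2)

# Count the missing values in the median column
sum(is.na(txhousing$median))

# Keep only the rows where median is present (base R subsetting)
tx_complete <- txhousing[!is.na(txhousing$median), ]

# Same histogram as before, drawn from the filtered data
p_complete <- ggplot(data = tx_complete, aes(x = median)) +
  geom_histogram()
p_complete
```

The plot itself should look the same either way. Filtering just makes the handling of missing values explicit.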

How to modify the ggplot histogram

Now that we’ve created a simple histogram with ggplot2, let’s make some simple modifications.

Change the bar colors

The first modification we’ll make is to change the color of the bars.

This is very simple to do. Changing the bar colors for a ggplot histogram is essentially the same as changing the color of the bars in a ggplot bar chart.

We will take the simple ggplot histogram that we just made, and we’re going to add a little piece of code inside of the call to geom_histogram(). Inside of geom_histogram(), we will add the code fill = 'red'. This will effectively change the interior fill color of all of the histogram bars.

# ADD COLOR
# - fill in bars
ggplot(data = txhousing, aes(x = median)) +
  geom_histogram(fill = 'red')

As an aside, I recommend that you learn ggplot and R like this. Start with a simple technique. Learn it. Master it. Then systematically make small changes (and master how to make those changes). Start simple and expand your skill outward.

Change the border colors

Next, we’ll change the color of the borders of the histogram bars.

This is very similar to changing the fill color, but instead of using the fill = parameter we will use the color = parameter.

# ADD COLOR
# - color the edges of the bars
ggplot(data = txhousing, aes(x = median)) +
  geom_histogram(color = 'red')

Simple ggplot histogram with colored borders.

You’ll notice that this histogram is basically the same as the original except the borders are colored red.
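The two parameters can also be combined in a single call. Here is a short sketch that fills the bars in red and outlines them in black. The specific colors and the name p_colors are just illustrative choices.

```r
library(ggplot2)

# fill = sets the interior of the bars, color = sets their borders
p_colors <- ggplot(data = txhousing, aes(x = median)) +
  geom_histogram(fill = 'red', color = 'black')

p_colors
```

A contrasting border like this often makes the individual bins easier to tell apart.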

Change the number of histogram bins

Now, let’s change the number of histogram bins.

By default, ggplot2 will use 30 bins for the histogram.

However, we can manually change the number of bins. This can be useful depending on how the data are distributed. If there is a lot of variability in the data we can use a larger number of bins to see some of that variation. Or, we can use a smaller number of bins to “smooth out” the variability.

Either way, changing the number of bins is extremely easy to do.

We will simply use the bins = parameter to change the number of bins.

Use fewer histogram bins

First, here’s a look at using fewer bins. Here, we’ll use 10 bins.

# USE FEWER BINS
ggplot(data = txhousing, aes(x = median)) +
  geom_histogram(bins = 10)

ggplot histogram with 10 bins

Use more histogram bins

Next, we’ll use more bins. We’ll increase the number of bins to 100:

# USE MORE BINS
ggplot(data = txhousing, aes(x = median)) +
  geom_histogram(bins = 100)

ggplot histogram with 100 bins

Again, which one you use depends on your objectives. Personally, I think the default of 30 bins works well in this case.
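Instead of setting the number of bins, you can also set how wide each bin is with the binwidth = parameter of geom_histogram(). The sketch below uses a width of 10000, in the same units as the median variable. That value is an arbitrary choice for illustration, so try a few and see what suits your data.

```r
library(ggplot2)

# Set the width of each bin directly instead of the number of bins
p_width <- ggplot(data = txhousing, aes(x = median)) +
  geom_histogram(binwidth = 10000)

p_width
```

If you supply both bins = and binwidth =, the binwidth setting takes precedence.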

Bonus: how to make a “small multiple” histogram

As I already said, I love ggplot2. It makes things easy.

A great example of this is the small multiple chart. Personally, I think the small multiple chart (AKA, the trellis chart) is wildly under-used. It’s extremely useful for a variety of data science and data analysis tasks. But you rarely see them because they are difficult to create in other software.

ggplot2 makes the small multiple easy to create. To create a small multiple in ggplot, we’ll just add a piece of code that will “break out” the chart based on a categorical variable.

Here, we will use the code facet_wrap(~city) to make a small version of the chart for each value of the city variable.

# SMALL MULTIPLE HISTOGRAM
ggplot(data = txhousing, aes(x = median)) +
  geom_histogram() +
  facet_wrap(~city)

Small multiple version of a ggplot histogram

There’s a lot of data here and a lot of detail. It will be easier to see if you run the code on your own computer and increase the size of the chart. (Try it …)

What’s great about the small multiple is that it lets you see a lot of information in a very small space.

In this chart, we can see individual histograms for each city. This might be very useful if you were doing an analysis on cities and how they are different.
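If the full chart is too crowded, one option is to facet only a handful of cities. Here is a minimal sketch, assuming you are interested in the four cities named below. The selection and the names cities and tx_subset are illustrative only.

```r
library(ggplot2)

# Pick a few cities to compare (an arbitrary selection for this example)
cities <- c("Austin", "Dallas", "Houston", "San Antonio")

# Keep only the rows for those cities (base R subsetting)
tx_subset <- txhousing[txhousing$city %in% cities, ]

# One small histogram per selected city
p_facets <- ggplot(data = tx_subset, aes(x = median)) +
  geom_histogram() +
  facet_wrap(~city)

p_facets
```

With fewer panels, each histogram gets more room, which makes differences between cities easier to spot.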

Bonus: how to make a density plot

The density plot is a smoothed variation of the histogram. Instead of bars showing the number of observations in each bin, it draws a continuous curve showing the estimated “density” of the data.

In ggplot2, the density plot is actually very easy to create. Just take the code for the basic ggplot histogram that we used above and swap out geom_histogram() with geom_density().

# DENSITY PLOT
ggplot(data = txhousing, aes(x = median)) +
  geom_density()

ggplot2 density chart

Again, ggplot2 makes things like this easy to do. Once you know the basics, changing a histogram to a density plot is as easy as changing one line of code.
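You can also draw both on the same chart. The sketch below assumes a version of ggplot2 that provides after_stat() (3.3.0 or later). It rescales the histogram to the density scale so that the bars and the curve share a y axis. The colors and the name p_overlay are illustrative choices.

```r
library(ggplot2)

p_overlay <- ggplot(data = txhousing, aes(x = median)) +
  # Put the histogram on the density scale instead of raw counts
  geom_histogram(aes(y = after_stat(density)), fill = 'grey80') +
  # Draw the smoothed density curve on top of the bars
  geom_density(color = 'red')

p_overlay
```

Without the after_stat(density) mapping, the histogram would show counts and the density curve would be squashed flat along the bottom of the chart.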

Applications and uses of a ggplot histogram

We typically use histograms to examine the density of a variable or how a variable is distributed.

There are several reasons that we might want this information.

As a data scientist, you may often need your data to be distributed in a particular way. For example, many of the standard inference procedures for linear regression assume that the model’s errors are approximately normally distributed. A data scientist might therefore examine the distributions of the variables before modeling, and of the residuals after fitting, to check for strong skew or other departures from normality. A histogram is a common tool for this.

Histograms can also be used for outlier detection, detection of skewness, and detection of other features that may be important for particular data science tasks.
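When a histogram shows a long right tail, one common follow-up is to redraw it with a log-scaled x axis and see whether the shape becomes more symmetric. The sketch below does this for the median variable. Whether that variable is actually skewed is something to judge from the plot, not something assumed here, and p_log is just an illustrative name.

```r
library(ggplot2)

# Same histogram, but with the x axis on a log10 scale
p_log <- ggplot(data = txhousing, aes(x = median)) +
  geom_histogram() +
  scale_x_log10()

p_log
```

Comparing this chart with the original histogram is a quick visual check for skewness and for unusually large or small values.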

Moreover, histograms are often useful simply for high level exploratory data analysis. A full explanation of EDA and how to use histograms for EDA is beyond the scope of this post. However, to put it simply, we can use histograms to examine variables and look for “insights” or interesting features in the data.

That’s just about everything that you need to know about the ggplot histogram.
