The ggplot histogram is very easy to make.
But like many things in ggplot2, it can seem a little complicated at first. In this article, we’ll show you exactly how to make a simple ggplot histogram, show you how to modify it, explain how it can be used, and more.
Let’s jump in.
Building a simple ggplot histogram
In order to build a histogram using ggplot2, you need to know how the ggplot system works. It’s not terribly hard once you get the hang of it, but it can be a little confusing to beginners.
Before we get into it, let’s load ggplot2 and the tidyverse package. We’ll also inspect txhousing, which is the dataset that we’ll be using.
#--------------
# LOAD PACKAGES
#--------------
library(tidyverse)
library(ggplot2)

#--------
# INSPECT
#--------
txhousing %>% glimpse()

# Observations: 8,602
# Variables: 9
# $ city <chr> "Abilene", "Abilene", "Abilene", "Abilene", "Abilene", "Abilene", "Abil...
# $ year <int> 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000,...
# $ month <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1...
# $ sales <dbl> 72, 98, 130, 98, 141, 156, 152, 131, 104, 101, 100, 92, 75, 112, 118, 1...
# $ volume <dbl> 5380000, 6505000, 9285000, 9730000, 10590000, 13910000, 12635000, 10710...
# $ median <dbl> 71400, 58700, 58100, 68600, 67300, 66900, 73500, 75000, 64500, 59300, 7...
# $ listings <dbl> 701, 746, 784, 785, 794, 780, 742, 765, 771, 764, 721, 658, 779, 700, 7...
# $ inventory <dbl> 6.3, 6.6, 6.8, 6.9, 6.8, 6.6, 6.2, 6.4, 6.5, 6.6, 6.2, 5.7, 6.8, 6.0, 6...
# $ date <dbl> 2000.000, 2000.083, 2000.167, 2000.250, 2000.333, 2000.417, 2000.500, 2...
Now, let’s make a simple ggplot histogram:
#-----
# PLOT
#-----

# BASIC HISTOGRAM
ggplot(data = txhousing, aes(x = median)) +
  geom_histogram()
This histogram is pretty simple to create if you know how ggplot works.
But on the assumption that you’re a little unfamiliar with ggplot, let’s quickly review how the ggplot2 system works.
Quick review: how the ggplot2 system works
I love ggplot2. Part of the reason is that it’s extremely systematic. Once you know how the ggplot2 system works, you can create almost any visualization with relative ease. Histograms are just a very simple example.
The ggplot() function essentially initiates ggplot plotting. It tells R that we’ll be using the ggplot2 library to build a plot or data visualization.
The aes() function indicates our variable mappings.
If you haven’t done this before, then “variable mapping” might not immediately make sense. It’s relatively straightforward though.
Let’s take a look at our histogram code again to try to make this more clear.
# BASIC HISTOGRAM
ggplot(data = txhousing, aes(x = median)) +
  geom_histogram()
Notice that inside of aes() we have the expression x = median.
What are we doing there?
We are “mapping” the median variable to the x axis. Notice again that this expression appears inside of the aes() function. Why? Because it is a variable mapping. All mappings from datasets to “aesthetic attributes” like the x axis occur inside of the aes() function.
This can get a lot more complicated. For example, with a scatterplot, you’ll map a variable to the x axis and another variable to the y axis. It can get even more complicated with advanced visualization techniques, but the basics are straightforward. A dataset has variables. A visualization has aesthetic attributes like the x axis, y axis, color, shape, etc. We need to “connect” the variables to the aesthetic attributes. This can be accomplished with the aes() function.
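To make the scatterplot case concrete, here is a minimal sketch that maps two txhousing variables at once. The choice of listings and sales is ours, purely for illustration; any two numeric columns would work the same way.

```r
library(ggplot2)

# Map one variable to the x axis and another to the y axis.
# (listings and sales are arbitrary picks from txhousing.)
scatter_plot <- ggplot(data = txhousing, aes(x = listings, y = sales)) +
  geom_point()

scatter_plot
```

Both mappings live inside aes(), exactly as the single x = median mapping does in the histogram.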
A step-by-step breakdown of a ggplot histogram
To recap, the ggplot() function initiates plotting. The aes() function specifies how we want to “map” or “connect” variables in our dataset to the aesthetic attributes of the shapes we plot.
With that knowledge in mind, let’s revisit our ggplot histogram and break it down.
#-----
# PLOT
#-----

# BASIC HISTOGRAM
ggplot(data = txhousing, aes(x = median)) +
  geom_histogram()
What are we doing here?
ggplot() indicates that we’re going to plot something. The data = parameter indicates that we’ll plot data from the txhousing dataset. Inside of the aes() function, we’re specifying that we want to put the “median” variable on the x axis. Finally, geom_histogram() indicates that we are going to plot a histogram.
(I won’t go over “geom” entirely here. Suffice it to say, there are many different geoms in ggplot2 that plot different types of things.)
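One way to see the separate roles of these pieces is to store the data and mapping once, then add different geoms to it. This is only a sketch: base_plot is a name we made up, and geom_freqpoly() is just one example of another geom that works with a single x variable.

```r
library(ggplot2)

# Store the data and the variable mapping once...
base_plot <- ggplot(data = txhousing, aes(x = median))

# ...then add whichever geom you want to draw.
base_plot + geom_histogram()   # binned counts drawn as bars
base_plot + geom_freqpoly()    # the same binned counts drawn as a line
```

The ggplot() and aes() part never changes; only the geom does.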
How to modify the ggplot histogram
Now that we’ve created a simple histogram with ggplot2, let’s make some simple modifications.
Change the bar colors
The first modification we’ll make is we will change the color of the bars.
This is very simple to do. Changing the bar colors for a ggplot histogram is essentially the same as changing the color of the bars in a ggplot bar chart.
We will take the simple ggplot histogram that we just made, and we’re going to add a little piece of code inside of the call to geom_histogram(). Inside of geom_histogram(), we will add the code fill = 'red'. This will effectively change the interior fill color of all of the histogram bars.
# ADD COLOR
# - fill in bars
ggplot(data = txhousing, aes(x = median)) +
  geom_histogram(fill = 'red')
As an aside, I recommend that you learn ggplot and R like this. Start with a simple technique. Learn it. Master it. Then systematically make small changes (and master how to make those changes). Start simple and expand your skill outward.
Change the border colors
Next, we’ll change the color of the borders of the histogram bars.
This is very similar to changing the fill color, but instead of using the fill = parameter we will use the color = parameter.
# ADD COLOR
# - color the edges of the bars
ggplot(data = txhousing, aes(x = median)) +
  geom_histogram(color = 'red')
You’ll notice that this histogram is basically the same as the original except the borders are colored red.
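The two parameters can also be used together. Here is a small sketch that sets both at once; the particular colors are arbitrary choices on our part, and styled_hist is just an illustrative name.

```r
library(ggplot2)

# fill sets the interior of the bars; color sets their outline.
# The specific colors here are arbitrary.
styled_hist <- ggplot(data = txhousing, aes(x = median)) +
  geom_histogram(fill = 'steelblue', color = 'white')

styled_hist
```

A light border like this tends to make individual bars easier to tell apart.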
Change the number of histogram bins
Now, let’s change the number of histogram bins.
By default, ggplot2 will use 30 bins for the histogram.
However, we can manually change the number of bins. This can be useful depending on how the data are distributed. If there is a lot of variability in the data we can use a larger number of bins to see some of that variation. Or, we can use a smaller number of bins to “smooth out” the variability.
Either way, changing the number of bins is extremely easy to do.
We will simply use the bins = parameter to change the number of bins.
Use fewer histogram bins
First, here’s a look at using fewer bins. Here, we’ll use 10 bins.
# USE FEWER BINS
ggplot(data = txhousing, aes(x = median)) +
  geom_histogram(bins = 10)
Use more histogram bins
Next, we’ll use more bins. We’ll increase the number of bins to 100:
# USE MORE BINS
ggplot(data = txhousing, aes(x = median)) +
  geom_histogram(bins = 100)
Again, which one you use depends on your objectives. Personally, I think the default of 30 bins works well in this case.
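If you would rather think in terms of how wide each bar is than how many bars there are, geom_histogram() also accepts a binwidth = parameter, which takes precedence over bins = when both are given. The sketch below uses a width of 10000 (in the units of the median variable), which is only an illustrative value.

```r
library(ggplot2)

# Each bar covers a range of 10,000 in the median variable.
# (10000 is an illustrative value, not a recommendation.)
binwidth_hist <- ggplot(data = txhousing, aes(x = median)) +
  geom_histogram(binwidth = 10000)

binwidth_hist
```

Setting the width directly can make the x axis easier to read, because the bar edges fall on round numbers.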
Bonus: how to make a “small multiple” histogram
As I already said, I love ggplot2. It makes things easy.
A great example of this is the small multiple chart. Personally, I think the small multiple chart (AKA, the trellis chart) is wildly under-used. It’s extremely useful for a variety of data science and data analysis tasks. But you rarely see them because they are difficult to create in other software.
ggplot2 makes the small multiple easy to create. To create a small multiple in ggplot, we’ll just add a piece of code that will “break out” the chart based on a categorical variable.
Here, we will use the code facet_wrap(~city) to make a small version of the chart for each value of the city variable.
# SMALL MULTIPLE HISTOGRAM
ggplot(data = txhousing, aes(x = median)) +
  geom_histogram() +
  facet_wrap(~city)
There’s a lot of data here and a lot of detail. It will be easier to see if you run the code on your own computer and increase the size of the chart. (Try it …)
What’s great about the small multiple is that it lets you see a lot of information in a very small space.
In this chart, we can see individual histograms for each city. This might be very useful if you were doing an analysis on cities and how they are different.
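If the full set of panels is too dense to read, one option is to facet only a few cities. The sketch below is one way to do that. Apart from "Abilene", the city names are our assumption about what appears in the city column, and few_cities and facet_hist are names we made up.

```r
library(ggplot2)

# Keep a handful of cities so each panel stays legible.
# These four names are assumed to match values in txhousing$city.
cities <- c("Abilene", "Austin", "Dallas", "Houston")
few_cities <- subset(txhousing, city %in% cities)

# ncol controls how many panels sit side by side.
facet_hist <- ggplot(data = few_cities, aes(x = median)) +
  geom_histogram() +
  facet_wrap(~city, ncol = 2)

facet_hist
```

The plotting code is unchanged from the full version; only the data passed to ggplot() is smaller.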
Bonus: how to make a density plot
The density plot is just a variation of the histogram, but instead of the y axis showing the number of observations, it shows the “density” of the data: a smoothed estimate of the distribution, scaled so that the area under the curve is 1.
In ggplot2, the density plot is actually very easy to create. Just take the code for the basic ggplot histogram that we used above and swap out geom_histogram() for geom_density().
# DENSITY PLOT
ggplot(data = txhousing, aes(x = median)) +
  geom_density()
ggplot2 makes things like this easy to do. Once you know the basics, changing a histogram to a density plot is as easy as changing one line of code.
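To compare the two views directly, you can draw them on the same chart. This is a sketch rather than a required technique: it assumes a version of ggplot2 that provides after_stat() (3.3.0 or later), and the colors are arbitrary.

```r
library(ggplot2)

# Rescale the histogram's y axis from counts to density so the
# bars and the smoothed curve share the same scale.
overlay_plot <- ggplot(data = txhousing, aes(x = median)) +
  geom_histogram(aes(y = after_stat(density)), fill = 'grey80') +
  geom_density(color = 'red')

overlay_plot
```

Without the y = after_stat(density) mapping, the histogram would be drawn in counts and the density curve would look flat against it.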
Applications and uses of a ggplot histogram
We typically use histograms to examine the density of a variable or how a variable is distributed.
Moreover, there are several reasons that we might want this information.
As a data scientist, many times you may need your data to be distributed in a particular way. For example, standard inference for linear regression assumes that the model’s errors are roughly normally distributed, and heavily skewed variables can make that assumption harder to meet. Therefore, prior to building a linear regression model, a data scientist might examine the variable distributions to check for strong skew or other departures from normality. To do this, a data scientist will commonly use a histogram.
Histograms can also be used for outlier detection, detection of skewness, and detection of other features that may be important for particular data science tasks.
Moreover, histograms are often useful simply for high level exploratory data analysis. A full explanation of EDA and how to use histograms for EDA is beyond the scope of this post. However, to put it simply, we can use histograms to examine variables and look for “insights” or interesting features in the data.
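As a quick illustration of checking for skewness, here is a sketch that uses the sales variable. Picking sales and the log-scaled axis are our choices for the example, not the only way to do this.

```r
library(ggplot2)

# In a right-skewed variable, the mean sits above the median.
sales_mean   <- mean(txhousing$sales, na.rm = TRUE)
sales_median <- median(txhousing$sales, na.rm = TRUE)
c(mean = sales_mean, median = sales_median)

# A log-scaled x axis spreads out the long right tail.
skew_hist <- ggplot(data = txhousing, aes(x = sales)) +
  geom_histogram() +
  scale_x_log10()

skew_hist
```

Comparing this chart to the same histogram without scale_x_log10() shows how much of the data is bunched at the low end.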
That’s just about everything that you need to know about the ggplot histogram.