Histogram

The histogram groups data into bins and plots the binned data as bars.

Uses

Histograms are used to visualize the distribution of a numeric variable: where the values concentrate and how widely they spread.
(read more at Wikipedia)

Code: Histogram in R

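library(ggplot2)

# Create 2000 normally distributed dummy values (seeded for reproducibility)
set.seed(11)
df.histogram_dummy <- data.frame( x_var = rnorm(2000) )

# DEFAULT BINNING (range divided into 30 bins)
ggplot(data=df.histogram_dummy, aes(x=x_var)) +
  geom_histogram()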


Results

histogram-in-r-basic

Explanation

The code to create a histogram in R is straightforward.

The fundamentals of building a histogram in R (using ggplot2) are basically the same as those for building any data visualization. We:

1. Specify our dataset
2. Create a “mapping” (i.e., a relationship) between variables in our dataset and aesthetic attributes in our plot
3. Specify the exact geometric objects we want to plot

(This deep underlying structure is essentially the same for all visualizations. We’ve seen a similar process for building a bar chart, scatterplot, and line chart. This deep structure will also remain as we create more sophisticated visualizations.)

Let’s take a look at the code:

ggplot(data=df.histogram_dummy, aes(x=x_var)) 



Here we start by calling the ggplot() function, and use the data= parameter to specify the dataset we want to plot.

Next, with the aes(x=x_var) call, we create our “mapping” from our dataset to “visual space”: specifically, we map the variable x_var to the x-axis. Here, x_var is a quantitative (i.e., numeric) variable.

The typical process of creating a histogram requires that we “bin” the data; that is, we need to divide the numeric variable into intervals (analysts frequently call these “bins”, or sometimes “buckets”). After creating the numeric intervals, we count the number of records that fall within each interval.
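To make the binning step concrete, here is a minimal sketch of doing it by hand in base R. The bin width of 0.5 and the names bin_width, breaks, bins, and bin_counts are assumptions chosen for illustration; this is not how ggplot2 does it internally, only the same idea written out.

```r
# Manual binning sketch in base R (no ggplot2 needed)
set.seed(11)
x_var <- rnorm(2000)

# Assumed for illustration: intervals of width 0.5 that cover the data range
bin_width <- 0.5
breaks <- seq(floor(min(x_var) / bin_width) * bin_width,
              ceiling(max(x_var) / bin_width) * bin_width,
              by = bin_width)

# Step 1: assign each value to an interval
bins <- cut(x_var, breaks = breaks, include.lowest = TRUE)

# Step 2: count the records that fall within each interval
bin_counts <- table(bins)
print(bin_counts)
```

The counts in bin_counts are the numbers a histogram would draw as bar heights for these intervals.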

ggplot bins our data automatically when we use the geom_histogram() geom.

Let’s take a look at geom_histogram(). It’s the next line of code.

  geom_histogram()



This line adds a histogram layer to the plot; it tells ggplot2 that we want to draw a histogram.

Again, this automatically bins our data for us: by default, geom_histogram() takes the range of our numeric variable (in this case, x_var) and divides it into 30 bins. (Note: in many programming languages, and in software like Excel, you need to do this manually.)
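If we want to see the bins ggplot2 created, one way (a sketch, assuming a reasonably recent ggplot2 version) is to pull the computed data out of the plot with layer_data(). The names p and computed_bins are ours, chosen for illustration.

```r
library(ggplot2)

set.seed(11)
df.histogram_dummy <- data.frame(x_var = rnorm(2000))

p <- ggplot(data = df.histogram_dummy, aes(x = x_var)) +
  geom_histogram()

# layer_data() returns the data ggplot2 computed for a layer;
# for a histogram layer, that is one row per bin
computed_bins <- layer_data(p, 1)

nrow(computed_bins)                                # number of bins (30 by default)
head(computed_bins[, c("xmin", "xmax", "count")])  # bin edges and counts
sum(computed_bins$count)                           # total records across all bins
```

Each row holds the lower edge, upper edge, and record count for one bin, so the counts should add up to the number of rows in our dataset.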

That said, we can specify the bin widths manually.

Here, we set binwidth to .5 (using the binwidth= parameter), which is wider than the default.

# Wider binwidth
ggplot(data=df.histogram_dummy, aes(x=x_var)) +
  geom_histogram(binwidth=.5)

histogram-in-r_wider-bin

We can also narrow the binwidth, which gives us a slightly different view of the data distribution.

# Narrower binwidth
ggplot(data=df.histogram_dummy, aes(x=x_var)) +
  geom_histogram(binwidth=.1)

histogram-in-r_narrower-bin
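To see how much the binwidth changes the binning, we can count the bins each setting produces. This is a rough sketch: count_bins() is a hypothetical helper written for this example, not a ggplot2 function, and it relies on layer_data() returning one row per bin.

```r
library(ggplot2)

set.seed(11)
df.histogram_dummy <- data.frame(x_var = rnorm(2000))

# count_bins() is a hypothetical helper: it builds the histogram
# for a given binwidth and returns how many bins ggplot2 computed
count_bins <- function(binwidth) {
  p <- ggplot(data = df.histogram_dummy, aes(x = x_var)) +
    geom_histogram(binwidth = binwidth)
  nrow(layer_data(p, 1))
}

bins_wide   <- count_bins(0.5)  # the wider binwidth from above
bins_narrow <- count_bins(0.1)  # the narrower binwidth from above
c(wide = bins_wide, narrow = bins_narrow)
```

The narrower binwidth produces many more bins over the same range, which is why that plot shows finer detail but also more noise.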

Related Visualizations

Bar chart