In the world of data visualization, the heatmap is underrated and underutilized.
It has limitations, but overall, it’s an excellent tool in your data science and data visualization toolkit.
Code: how to create a heatmap in ggplot2
library(ggplot2)

#------------------
# CREATE DATA FRAME
#------------------
df.team_data <- expand.grid(teams = c("Team A", "Team B", "Team C", "Team D")
                            ,metrics = c("Metric 1", "Metric 2", "Metric 3", "Metric 4", "Metric 5")
                            )

# add variable: performance
set.seed(41)
df.team_data$performance <- rnorm(nrow(df.team_data))

# inspect
head(df.team_data)

#---------------------------
# PLOT: heatmap
# - here, we use geom_tile()
#---------------------------
ggplot(data = df.team_data, aes(x = metrics, y = teams)) +
  geom_tile(aes(fill = performance))
Here's the resulting chart:
Make no mistake. This is really a basic heatmap.
There are many enhancements that we could make here. At the very least, we could put the metric names along the top of the chart, and we could change the color scale. The default light-blue/dark-blue scale is acceptable, but we could probably change the color scale to make the performance differences more obvious, and also to make the chart more visually interesting.
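As a rough sketch of those two enhancements, the code below moves the metric names to the top of the chart and swaps in a diverging color scale. The specific colors, the midpoint of 0, and the name p.enhanced are my own illustrative choices, not part of the original example; they assume that performance is centered near zero, which holds for this simulated data.

```r
library(ggplot2)

# Same simulated data as in the basic example
df.team_data <- expand.grid(
  teams = c("Team A", "Team B", "Team C", "Team D"),
  metrics = c("Metric 1", "Metric 2", "Metric 3", "Metric 4", "Metric 5")
)
set.seed(41)
df.team_data$performance <- rnorm(nrow(df.team_data))

# Illustrative enhancements (colors and midpoint are assumptions):
# - metric names along the top of the chart
# - diverging color scale centered on 0
# - thin white borders to separate the tiles
p.enhanced <- ggplot(data = df.team_data, aes(x = metrics, y = teams)) +
  geom_tile(aes(fill = performance), color = "white") +
  scale_x_discrete(position = "top") +
  scale_fill_gradient2(low = "#b2182b", mid = "white", high = "#2166ac", midpoint = 0) +
  labs(x = NULL, y = NULL, fill = "Performance") +
  theme_minimal()

p.enhanced
```

A diverging scale only makes sense when there is a meaningful midpoint. If your metric runs from 0 upward, a sequential scale is probably the better choice.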
Having said that, this is a "no frills" piece of code that should be easy to understand if you've already learned some of the basics of ggplot2 syntax.
The heatmap has several uses in data science. It can be used as a simple tool for analytics and reporting, but it also has more advanced uses in machine learning and other data science sub-fields.
Here are a few:
Multi-metric reporting and analysis
Any time you’re evaluating multiple people/groups/teams on multiple metrics, you can use a heatmap.
For example, Nathan Yau of FlowingData created a heatmap to display 2008 basketball player statistics for a select group of NBA players.
- Source: flowingdata.com
Here, Yau used the heatmap to evaluate 50 top players on 19 different metrics.
As a tool for generating insight (a tool for revealing what would otherwise remain hidden) this is powerful. Immediately, you can see overperformers and underperformers for specific metrics. For example, you can immediately identify one player who excels at three-point shots, a player who excels at steals, and a couple of players who score more overall points.
The precision of this chart is limited (which I’ll get to in a moment) but it serves as an excellent at-a-glance tool for identifying hot-spots and problem areas.
A special case of multi-metric reporting is KPI reporting.
Most modern, data-driven corporations have key performance indicators (AKA: key performance metrics, KPIs). They use these metrics as (surprise, surprise) the key indicators of the organization’s performance.
For example, in a marketing organization, KPIs might be a few things like new leads, new customers, conversion rate, return on investment, and churn.
So let’s say you work in a large multinational and you’re doing marketing data science. Let’s say that the organization has 4 regions and each region has 5 teams. That’s a total of 20 teams across those 4 regions. Each team needs to be scored on the KPIs, and once a month, the executive team (the leaders of each region, as well as the worldwide leader) wants to see the performance of each team and each overall region for each of the KPIs.
In this case, you could use a heatmap to create a "KPI scorecard." You’d essentially use the heatmap technique to display the performance of each region and team for each of the KPIs. The executive team could then use this as a quick "snapshot" of performance; they could use the heatmap to rapidly identify "problem areas," like metrics that have a consistently low score, teams that are performing poorly, or perhaps a team that’s performing well on most metrics, but has a very high or very low score on one particular metric.
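Here's a minimal sketch of what that scorecard could look like in ggplot2, using the 4-region, 5-team setup described above. Everything in it is hypothetical: the team and region labels, the scores (simulated with rnorm()), and the names df.kpi and p.scorecard. It also assumes each KPI has been standardized, so that one color scale can be shared across the columns.

```r
library(ggplot2)

# Hypothetical layout: 4 regions x 5 teams, scored on 5 KPIs
kpis <- c("New leads", "New customers", "Conversion rate", "ROI", "Churn")
df.kpi <- expand.grid(
  team = paste("Team", 1:5),
  region = paste("Region", 1:4),
  kpi = kpis
)

# Simulated placeholder scores (NOT real data). In practice, each KPI
# would be standardized first so that colors are comparable across columns.
set.seed(41)
df.kpi$score <- rnorm(nrow(df.kpi))

# One panel per region; teams on the rows, KPIs on the columns
p.scorecard <- ggplot(df.kpi, aes(x = kpi, y = team)) +
  geom_tile(aes(fill = score), color = "white") +
  facet_grid(region ~ .) +
  scale_x_discrete(position = "top") +
  labs(x = NULL, y = NULL, fill = "Score")

p.scorecard
```

Faceting by region keeps each region's teams together, which makes it easier to spot a region where several teams are low on the same KPI.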
Machine learning: heatmaps for model tuning
You can also use heatmaps in machine learning. One area where you can apply the heatmap for machine learning is model tuning.
For example, if you build many versions of a model to test different values for tuning parameters, you can create a heatmap to help identify the best model. That is, you can map a metric like RMSE or area-under-ROC to the "fill" aesthetic of your ggplot2 heatmap, and then use the heatmap to identify optimal combinations of tuning parameters.
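To make that concrete, here's a small sketch of a tuning heatmap. The two tuning parameters (max_depth and learning_rate) and the RMSE values are made up: the RMSE column comes from an arbitrary formula plus noise, standing in for the cross-validation results you'd get from fitting real models. The names df.tuning, best, and p.tuning are also just for illustration.

```r
library(ggplot2)

# Hypothetical tuning grid: two made-up parameters
df.tuning <- expand.grid(
  max_depth = 2:8,
  learning_rate = c(0.01, 0.03, 0.1, 0.3)
)

# Placeholder RMSE values from an arbitrary formula plus noise.
# In a real workflow these would come from cross-validation.
set.seed(41)
df.tuning$rmse <- 1 +
  0.05 * (df.tuning$max_depth - 5)^2 +
  0.5 * (log10(df.tuning$learning_rate) + 1)^2 +
  rnorm(nrow(df.tuning), sd = 0.02)

# Row with the lowest RMSE (the "best" parameter combination)
best <- df.tuning[which.min(df.tuning$rmse), ]

# Map RMSE to fill; treat the parameter values as discrete categories
# and mark the best combination with an "x"
p.tuning <- ggplot(df.tuning, aes(x = factor(learning_rate), y = factor(max_depth))) +
  geom_tile(aes(fill = rmse)) +
  geom_point(data = best, shape = 4, size = 4, color = "white") +
  labs(x = "learning_rate", y = "max_depth", fill = "RMSE")

p.tuning
```

The heatmap gives you the overall shape of the error surface, and which.min() gives you the exact winner. That pairing matters, because color alone won't reliably separate two combinations with similar RMSE.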
Having detailed some of the uses, I want to point out that the heatmap does have some limitations.
Comparisons are imprecise
The major limitation of the heatmap is that the comparisons you make will be imprecise.
To be clear, this isn’t so much a problem of the heatmap itself, but rather, this is an artifact of the human brain and how it encodes and decodes visual data.
To state this simply, the human mind decodes visual data with different levels of precision, depending on how the data were encoded.
Here’s an example to explain what I mean:
In the heatmap example at the top of this post, you can encode the "performance" level of each team and metric in a variety of different ways. You can encode that data as colors (which is what we do in the heatmap). But you could also use a bar chart to visualize some of the data. If we used a bar chart, we’d be encoding the metric performance as "length." (Keep in mind that if we used a bar chart, we'd have to rework the presentation. There are 4 teams and 5 metrics; this might make for an over-complicated bar chart.)
As it turns out, humans are much better at decoding information that’s been encoded as a length. We’re much more precise when we evaluate and judge differences in length. So, if we encode the data in a bar chart, we’ll be able to make more precise and detailed comparisons than if we use a heatmap (which makes for imprecise comparisons).
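To see the difference for yourself, here's a rough sketch that plots the same simulated team data as bars, with one small panel per metric. The layout (facet_wrap() plus coord_flip()) and the name p.bars are my own choices for illustration. There are other reasonable ways to arrange 20 bars.

```r
library(ggplot2)

# Same simulated data as in the heatmap example
df.team_data <- expand.grid(
  teams = c("Team A", "Team B", "Team C", "Team D"),
  metrics = c("Metric 1", "Metric 2", "Metric 3", "Metric 4", "Metric 5")
)
set.seed(41)
df.team_data$performance <- rnorm(nrow(df.team_data))

# Encode performance as bar length instead of color.
# One panel per metric; coord_flip() makes the bars horizontal
# so the team names stay readable.
p.bars <- ggplot(df.team_data, aes(x = teams, y = performance)) +
  geom_col() +
  facet_wrap(~ metrics, nrow = 1) +
  coord_flip() +
  labs(x = NULL, y = "Performance")

p.bars
```

Within a panel, it's much easier to rank the teams than it is with color. The tradeoff is space: the bar version needs 5 panels to show what the heatmap shows in one compact grid.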
So, the heatmap is limited in this way. If you need to make very precise comparisons, don’t use the heatmap.
You can also create heatmaps in other tools (like Excel)
To be honest, you don’t have to do this with ggplot2. You can actually create heatmaps in Excel without too much difficulty. In fact, if I need to make a quick-and-dirty heatmap, Excel has long been a go-to tool. (Excel gets a fair amount of hate, but it’s quite useful for small tasks.)
The advantage of creating heatmaps in R/ggplot2 is that they’re much easier to reproduce and much easier to create "at scale." That is, if you have to create these over and over, it’s much easier to do that when you use code. Moreover, if you have lots of categories (for example, 10 to 20 metrics that you’re tracking) it’s much easier to create a heatmap programmatically. It’s also much easier to edit these programmatically when you’re working with lots of categories.
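As a sketch of what "at scale" might look like, the code below wraps the heatmap in a small function and applies it to several data sets at once. The function make_heatmap(), the month labels, and the 15-metric simulated data are all hypothetical. The only assumption about your real data is that it has teams, metrics, and performance columns like the example at the top of this post.

```r
library(ggplot2)

# Hypothetical helper: build one heatmap from a data frame that has
# the columns teams, metrics, and performance
make_heatmap <- function(df, title) {
  ggplot(df, aes(x = metrics, y = teams)) +
    geom_tile(aes(fill = performance)) +
    ggtitle(title)
}

# Simulated stand-in for three monthly extracts, each with 15 metrics
set.seed(41)
months <- c("Month 1", "Month 2", "Month 3")
monthly_data <- lapply(months, function(m) {
  df <- expand.grid(
    teams = paste("Team", LETTERS[1:4]),
    metrics = sprintf("Metric %02d", 1:15)
  )
  df$performance <- rnorm(nrow(df))
  df
})
names(monthly_data) <- months

# One heatmap per month, from a single function call each
plots <- Map(make_heatmap, monthly_data, names(monthly_data))

plots[["Month 1"]]
```

From here, you could loop over plots and write each one to disk with ggsave(). If the design changes (say, a new color scale), you edit make_heatmap() once and regenerate everything.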
How to get started with ggplot2 (if you're still confused)
If the above code didn’t make sense, I’ll suggest that you review some older posts. This blog post assumes that you understand some of the basic workings of ggplot2 syntax. So, if this was still a bit confusing, check out these older posts on more fundamental data visualizations: