As you’re learning data science, you ultimately need to learn several different toolkits.
You need to learn the tools of data visualization. You need to learn the tools of data manipulation. You also need a variety of other tools for specialized tasks, like geospatial visualization, machine learning, and others.
Here at Sharp Sight, we have a particular philosophy about how to learn these things.
I recommend that you first learn these skills in isolation.
For example, in the past, I’ve recommended that you learn a topic like simple geospatial visualization in a highly isolated way. In particular, I’ve recommended that you drill the syntax for very simple operations, like retrieving a map using get_map().
More recently, I recommended learning (and mastering) the 2-d density plot.
Why master these tools in isolation?
Because they are easier to practice (i.e., drill) when you work on them in small, isolated pieces.
Learning data tools in isolation seems simplistic to people, but it has a dramatic payoff.
Once you master small pieces of syntax in isolation, they become like Lego building blocks. You can start to combine the small, simple tools into more complicated structures.
One example of this is creating maps.
Creating a crime heatmap in R
We’ve done quite a bit of geospatial mapping here at Sharp Sight, and part of the reason is that maps are good intermediate projects that allow you to combine simpler tools.
In the case of a geospatial heatmap, you’re basically combining a 2-dimensional density plot with an underlying geospatial map of some kind.
Let me show you an example. Here, we’re going to create a heatmap of San Francisco crime.
Creating a crime heatmap in R like this is easy, once you know the right “building blocks.” Critically, this will be a combination of two skills: the 2-dimensional density plot and a simple Google map.
Let’s walk through it, and I’ll explain as we go.
Load the packages
First, we’ll load the packages that we will use.
# LOAD PACKAGES
library(tidyverse)
library(ggmap)
library(stringr)
library(viridis)
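The library() calls above assume that the packages are already installed. A minimal sketch of a one-time install check follows. The helper name missing_packages() is made up for this example and is not part of any package.

```r
# Hypothetical helper: report which of the required packages are not installed yet.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

required <- c("tidyverse", "ggmap", "stringr", "viridis")
to_install <- missing_packages(required)

# Install only what is missing (needs an internet connection and a CRAN mirror).
# The interactive() guard keeps scripts from installing things silently.
if (length(to_install) > 0 && interactive()) {
  install.packages(to_install)
}
```

After the install has run once, the library() calls are all you need in later sessions.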
Read in the data
Next, we will import our csv dataset using read_csv() from the readr package.
# IMPORT CSV DATA
sf_crime <- read_csv('https://sharpsightlabs.com/datasets/sf_crime-data-2017.csv'
                     ,col_names = c('incident_number'
                                    ,'crime_category'
                                    ,'crime_description'
                                    ,'day_of_week'
                                    ,'date'
                                    ,'time'
                                    ,'police_district'
                                    ,'resolution'
                                    ,'address'
                                    ,'lon'
                                    ,'lat'
                                    ,'location'
                                    ,'PdId'
                                    )
                     ,skip = 1
                     )

# INSPECT
sf_crime %>% glimpse()

# Observations: 154,724
# Variables: 13
# $ incident_number <int> 170533616, 170527017, 170514133, 170465285, 170451814, 17045110...
# $ crime_category <chr> "FRAUD", "SEX OFFENSES, FORCIBLE", "SUSPICIOUS OCC", "NON-CRIMI...
# $ crime_description <chr> "CREDIT CARD, THEFT BY USE OF", "SEXUAL BATTERY", "SUSPICIOUS O...
# $ day_of_week <chr> "Sunday", "Sunday", "Sunday", "Sunday", "Sunday", "Sunday", "Su...
# $ date <chr> "01/01/2017", "01/01/2017", "01/01/2017", "01/01/2017", "01/01/...
# $ time <time> 08:00:00, 00:01:00, 13:00:00, 00:01:00, 00:01:00, 15:00:00, 12...
# $ police_district <chr> "INGLESIDE", "MISSION", "INGLESIDE", "SOUTHERN", "RICHMOND", "T...
# $ resolution <chr> "NONE", "NONE", "NONE", "NONE", "NONE", "NONE", "NONE", "NONE",...
# $ address <chr> "3800 Block of SAN BRUNO AV", "100 Block of GUERRERO ST", "0 Bl...
# $ lon <dbl> -122.4018, -122.4246, -122.4451, -122.4154, -122.4745, -122.504...
# $ lat <dbl> 37.71452, 37.76854, 37.71295, 37.77829, 37.78043, 37.75952, 37....
# $ location <chr> "(37.71452361297482, -122.40176175711419)", "(37.76854434073097...
# $ PdId <dbl> 1.705336e+13, 1.705270e+13, 1.705141e+13, 1.704653e+13, 1.70451...
By the way, loading in a csv file using readr::read_csv() is another one of those “building block” skills that you should memorize. You should be able to write the code to import a csv file quickly. Reading in data is one of those critical “get things done” skills that you need to know.
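In the glimpse() output above, the date column comes in as character. If you want real dates, one option is to pass a column specification to read_csv(). The sketch below uses a tiny made-up file written to a temporary path, not the real dataset, and only three columns. The header names in the fake file are invented for the example.

```r
library(readr)

# Write a tiny made-up CSV (three columns only) to a temporary file.
tmp <- tempfile(fileext = ".csv")
writeLines(c("Date,X,Y",
             "01/01/2017,-122.4018,37.71452",
             "01/02/2017,-122.4246,37.76854"),
           tmp)

# skip = 1 drops the file's own header row so that col_names can supply our names.
# col_types parses `date` as a Date instead of leaving it as character.
crime_sample <- read_csv(tmp
                         ,col_names = c('date', 'lon', 'lat')
                         ,col_types = cols(date = col_date(format = '%m/%d/%Y')
                                           ,lon = col_double()
                                           ,lat = col_double()
                                           )
                         ,skip = 1
                         )

str(crime_sample$date)
```

The same col_types idea should carry over to the full import, as long as the format string matches how the dates are written in the file.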
Check the data with a scatterplot
Ok, now that we have our dataset, let’s quickly visualize it using a ggplot scatterplot.
# PLOT SCATTERPLOT
# - we'll do this as a quick data-check
ggplot() +
  geom_point(data = sf_crime, aes(x = lon, y = lat), alpha = .05)
This is an extremely simple use of the scatterplot, but it’s important. Here, we’re actually using the scatterplot to check the data.
Once again, I’ll point out that I’ve told you many, many times that the scatterplot is a critical skill that you need to master. It is one of those critical “building block” skills.
This is one of the reasons why.
Not only can you use the scatterplot as a tool to analyze data and storytell with data, but you can also use it as a tool for checking your data. It is a multipurpose tool that you can use for a variety of tasks at a variety of stages of the data workflow. The scatterplot is something that you should master. If you don’t understand the code that I just showed you above, or you can’t write it fluently (and from memory), go back and practice it.
Based on the scatterplot we just created, the data appear to be okay.
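A visual check like this can be backed up with a couple of numeric checks. The sketch below is one way to do that. It uses a small made-up data frame in place of sf_crime, and the bounding box values are rough assumptions for San Francisco, not official limits.

```r
# Made-up stand-in for sf_crime, with one obviously bad point and one missing value.
crime_sample <- data.frame(lon = c(-122.4018, -122.4246, -120.5000, NA),
                           lat = c(37.71452, 37.76854, 90.00000, 37.78043))

# Assumed rough bounding box for San Francisco (illustrative values only).
lon_range <- c(-122.55, -122.33)
lat_range <- c(37.68, 37.84)

# Count rows with a missing coordinate.
n_missing <- sum(is.na(crime_sample$lon) | is.na(crime_sample$lat))

# Flag complete rows whose coordinates fall outside the assumed box.
outside <- with(crime_sample,
                !is.na(lon) & !is.na(lat) &
                  (lon < lon_range[1] | lon > lon_range[2] |
                   lat < lat_range[1] | lat > lat_range[2]))

n_missing
sum(outside)
```

If either count is larger than you expect, that is a cue to look at those rows before plotting a density.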
Create a quick “first draft” heatmap
In the next step, let’s very quickly create a “heatmap” of the data, which is essentially a 2-dimensional density plot.
# SIMPLE HEATMAP
ggplot() +
  stat_density2d(data = sf_crime, aes(x = lon, y = lat, fill = ..density..), geom = 'tile', contour = FALSE)
From the looks of it, there are some modifications that we could make here. The color scale is not particularly sensitive to differences in the data, so we may need to make some minor adjustments to the color scale later.
At a quick glance, however, this basic density plot looks okay.
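It can help to see what the 2-d density plot is computing. stat_density2d() estimates the density with MASS::kde2d(), so a rough sketch of the underlying calculation looks like this. The points are simulated, not the crime data, and the grid size of 50 is an arbitrary choice.

```r
set.seed(1)
# Simulated coordinates clustered around one location (not the real crime data).
lon <- rnorm(500, mean = -122.42, sd = 0.02)
lat <- rnorm(500, mean = 37.77, sd = 0.02)

# Estimate the density on a 50 x 50 grid covering the range of the points.
# MASS:: is used instead of library(MASS) so that MASS::select does not mask dplyr::select.
dens <- MASS::kde2d(lon, lat, n = 50)

# One density value per grid cell.
dim(dens$z)

# Location of the grid cell with the highest estimated density.
peak <- which(dens$z == max(dens$z), arr.ind = TRUE)
c(lon = dens$x[peak[1, "row"]], lat = dens$y[peak[1, "col"]])
```

The heatmap is just this grid of values drawn as colored tiles.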
Get a google map
Next, let’s switch gears. We’re going to retrieve a Google map.
Up until this point, we’ve been working exclusively with the crime data. But ultimately, we want to overlay the crime data over a map of some type.
To do this, we need to get a map that we can work with. Let’s retrieve and plot a simple map from Google.
To get this Google map, we will use the get_map() function from ggmap. We’ll use get_map() to retrieve a simple map of San Francisco.
# GET MAP
map_sf <- get_map('San Francisco', zoom = 12, maptype = 'satellite')
This is a relatively simple use of get_map(). Keep in mind, though, that you need to know not only the function itself, but also a little bit about how to use it. For example, you need to know the syntax for get_map(), but you also need to know the parameters it takes, such as zoom and maptype.
When learning a small function like this, I recommend breaking it down into small pieces. For example, you can learn and drill get_map() all by itself. Once you’ve memorized get_map(), you can drill more complicated examples where you adjust the zoom using the zoom parameter. Start simple and then increase the complexity.
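One practical note: Google’s map service now requires an API key, and recent versions of ggmap expect you to register it with register_google() before get_map() can fetch Google maps. A minimal sketch, assuming the key is stored in an environment variable (the name GOOGLE_MAPS_API_KEY is my own choice, not something ggmap requires):

```r
# Read the key from an environment variable (the variable name is an assumption).
api_key <- Sys.getenv("GOOGLE_MAPS_API_KEY")
has_key <- nzchar(api_key)

if (has_key && requireNamespace("ggmap", quietly = TRUE)) {
  # Register the key for this session, then request the map as before.
  ggmap::register_google(key = api_key)
  map_sf <- ggmap::get_map('San Francisco', zoom = 12, maptype = 'satellite')
} else {
  message("No API key found (or ggmap is not installed); skipping the map download.")
}
```

Keeping the key in an environment variable, not in the script itself, means you can share the code without sharing the key.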
Plot the Google map
Ok, moving on.
We have the map, so let’s plot it using the ggmap() function from the ggmap package.
# PLOT BASIC SF MAP
ggmap(map_sf)
This is a very simple example of how to use ggmap(). Again though, this is one of our “building blocks.” When we combine it with the crime-data, it will be more complex.
Create a simple crime heatmap
Now that we have both “building blocks” in place, let’s combine them together.
To do this we will use ggmap(map_sf) to plot the underlying map, and then we will use stat_density2d() to plot a heatmap of the crime data on top of the underlying geographic map.
# MAP WITH HEATMAP OVERLAY
ggmap(map_sf) +
  stat_density2d(data = sf_crime, aes(x = lon, y = lat, fill = ..density..), geom = 'tile', contour = FALSE, alpha = .5)
Change color palette of heatmap
Ok … we’re almost there.
There’s more that we will need to do to “polish” this visualization, but understand what we’ve done here. We’ve taken two different building blocks – the geographic map from ggmap() and the 2-d density plot from stat_density2d() – to create a more complicated visualization.
Again, this is why I recommend learning and mastering simple tools: because you can combine them together into more complicated structures.
Also, keep in mind that we are able to do this because the ggplot2 syntax enables you to build plots in layers. ggmap() itself is an extension of ggplot2 and it follows the ggplot2 convention of building plots in layers using the + sign. In some sense, the ggplot2 system is structured for this layer-by-layer “building block” strategy.
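To make the layering idea concrete, here is a small sketch in which each layer is stored in a variable and then added to a base plot with +. It uses simulated points and a plain ggplot() base so that it runs without a map. With a map in hand, the same heatmap layer could be added to ggmap(map_sf) instead. Note that it uses after_stat(density), which is the newer spelling of ..density.. in ggplot2 3.3.0 and later.

```r
library(ggplot2)

set.seed(1)
# Simulated stand-in for the crime coordinates.
pts <- data.frame(lon = rnorm(500, -122.42, 0.02),
                  lat = rnorm(500, 37.77, 0.02))

# Building block 1: a layer of points.
point_layer <- geom_point(data = pts, aes(x = lon, y = lat), alpha = .2)

# Building block 2: a 2-d density layer drawn as tiles.
heat_layer <- stat_density_2d(data = pts,
                              aes(x = lon, y = lat, fill = after_stat(density)),
                              geom = 'tile', contour = FALSE, alpha = .5)

# Layers are combined with `+`; each one is drawn on top of the previous one.
p <- ggplot() + point_layer + heat_layer
p
```

Because each layer is an ordinary R object, you can practice building it on its own and only later decide what to stack it on.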
Ok, finally, let’s start to polish the chart.
The first thing that we can do is use a different color scale. Personally, I think that the color scales from the viridis package are excellent. Let’s change the color scale by using scale_fill_viridis().
# SIMPLE HEATMAP WITH VIRIDIS COLORING
ggmap(map_sf) +
  stat_density2d(data = sf_crime, aes(x = lon, y = lat, fill = ..density..), geom = 'tile', contour = FALSE, alpha = .5) +
  scale_fill_viridis()
Let’s also try the “inferno” color palette that comes with the viridis package.
# VIRIDIS (inferno), alpha = .5
ggmap(map_sf) +
  stat_density2d(data = sf_crime, aes(x = lon, y = lat, fill = ..density..), geom = 'tile', contour = FALSE, alpha = .5) +
  scale_fill_viridis(option = 'inferno')
Yeah … that’s it. I like this one better.
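If you want to compare palettes before committing to one, the viridis package can also return the colors themselves. A quick sketch, pulling five colors from a few of the named options:

```r
library(viridis)

# Five evenly spaced colors from several of the palette options.
palettes <- lapply(c(viridis = 'viridis', magma = 'magma',
                     plasma = 'plasma', inferno = 'inferno'),
                   function(opt) viridis(5, option = opt))
palettes

# Each palette is a character vector of hex color codes.
palettes$inferno
```

As a side note, ggplot2 3.0.0 and later also ships scale_fill_viridis_c(option = 'inferno'), which gives the same palette without loading viridis.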
Create finalized “polished” crime heatmap
Ok, let’s clean it up a little.
We’ll add a title, remove the extra text on the x and y axes, and do some formatting on the legend and other parts of the plot.
# CREATE "POLISHED" CRIME HEATMAP IN R
ggmap(map_sf) +
  stat_density2d(data = sf_crime, aes(x = lon, y = lat, fill = ..density..), geom = 'tile', contour = FALSE, alpha = .5) +
  scale_fill_viridis(option = 'inferno') +
  labs(title = str_c('SF has largest concentration of crime\n'
                     ,'near Downtown & Tenderloin'
                     )
       ,subtitle = 'There are also moderate pockets of crime in SOMA & the Mission'
       ,fill = str_c('Number of', '\ncrime incidents')
       ) +
  theme(text = element_text(color = "#444444")
        ,plot.title = element_text(size = 22, face = 'bold')
        ,plot.subtitle = element_text(size = 12)
        ,axis.text = element_blank()
        ,axis.title = element_blank()
        ,axis.ticks = element_blank()
        ) +
  guides(fill = guide_legend(override.aes= list(alpha = 1)))
I’ll admit: there’s more we could do here. In particular, we might be able to play with the color scale a little bit to make the low-crime areas more apparent; they sort of wash out right now.
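One option to try for the washed-out low-crime areas is to color by the square root of the density, which stretches the low end of the scale. This is only a sketch of that idea on simulated points, not a tuned version of the map above. It relies on after_stat() from ggplot2 3.3.0 or later, and the legend then shows square-root density instead of raw density.

```r
library(ggplot2)

set.seed(1)
# Simulated points: one dense cluster plus a sparse background.
pts <- data.frame(lon = c(rnorm(900, -122.41, 0.01), runif(100, -122.50, -122.36)),
                  lat = c(rnorm(900, 37.78, 0.01), runif(100, 37.71, 37.81)))

# Color by sqrt(density) so that low-density cells take up more of the color range.
# scale_fill_viridis_c() is ggplot2's built-in viridis scale.
p_sqrt <- ggplot() +
  stat_density_2d(data = pts,
                  aes(x = lon, y = lat, fill = after_stat(sqrt(density))),
                  geom = 'tile', contour = FALSE, alpha = .5) +
  scale_fill_viridis_c(option = 'inferno') +
  labs(fill = 'sqrt(density)')
p_sqrt
```

Whether this reads better than the plain density is a judgment call, so it is worth plotting both versions side by side.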
Having said that, I think this looks pretty damn good, and it’s enough to show you how you can combine tools together to create more advanced visualizations.
Learn the basics, then put them together
As I’ve said many times, you need to master the basics before you move on to advanced topics.
Before attempting something like this map on your own, learn the individual pieces.
Learn how to retrieve a simple Google map using ggmap. Practice it! Practice the syntax until you remember it.
Learn how to create a 2-d density plot.
Learn how to do some basic ggplot formatting.
Study, learn, and practice these skills independently. Once you know them backwards and forwards, you can use them like building blocks to create more advanced charts and visualizations.