A few days ago, I was inspired by a set of photographs of Earth from space, at night.
The images are amazing, so I decided to try to replicate them using R.
To be a little more specific, I found a single dataset – a data set with French population data – and I’m using it to replicate the images for a single country: France.
A few notes and warnings before you get started
As usual, I’ve included the code for how to create the final data visualization.
Having said that, in this case I decided to use some new tools. Therefore, before trying to run the code yourself, there are a few notes and caveats to keep in mind.
This was built with the new
I decided to build this with the new
Warning: setting up
sf and dependencies may be problematic
Before getting started, you should know that installing
If you decide to install
Code: mapping French population density
Ok, here’s the code.
First, we’ll load the packages that we’re going to need. (Of course, you need to have them installed as well.)
Notice also that we’re using the development version of
#======================================= # INSTALL DEVELOPMENT VERSION OF ggplot2 #======================================= library(devtools) dev_mode(on = T) install_github("hadley/ggplot2") #============== # LOAD PACKAGES #============== library(sf) library(tidyverse) library(stringr) library(scales)
Get the data
Next, we’ll get the data from a URL at Sharp Sight and load it into R:
#========= # GET DATA #========= url.france_pop <- url("https://vrzkj25a871bpq7t1ugcgmn9-wpengine.netdna-ssl.com/wp-content/datasets/france_population_data_2016.RData") load(url.france_pop) # INSPECT glimpse(df.france)
Change column names to lower case
We won’t have to make too many adjustments to the data, but we will transform the variable names to lower case.
#====================================== # CHANGE COLUMN NAMES: lowercase # - in the original dataset the column # names were capitalized # - we'll transform the names to # lower case #====================================== colnames(df.france) <- colnames(df.france) %>% str_to_lower() # INSPECT colnames(df.france)
In the visualization, we’re going to plot population density.
The dataset does not have a density variable, so we need to calculate it based on
#========================= # CREATE DENSITY VARIABLE #========================= df.france$population %>% summary() #------------------------- # CREATE VARIABLE: density #------------------------- df.france <- df.france %>% mutate(density = population/superficie*100) # INSPECT colnames(df.france) head(df.france)
Plot a test chart
Next, we’ll plot a very simple test chart to see if things are roughly OK.
As we do this, note how simple it is to use
#========================================== # PLOT TEST CHART # - here, we'll just plot a test chart # to see if the data are roughly correct # and to make sure that the data are # roughly in working order #========================================== #ggplot(df.france) + geom_sf() ggplot(df.france) + geom_sf(aes(fill = density))
This appears to be OK, but it’s tough to tell because the borders are too thick.
Let’s do one more test chart to try to remove the lines and get a better look:
#========================================== # PLOT SECOND TEST CHART # - here, we'll plot another test with # smaller borders # - we'll set size = .1 #========================================== ggplot(df.france) + geom_sf(aes(fill = density), color = NA)
Ok, we can see the fill color now (although the borders are still there).
Looking at this, we’ll have to solve at least 2 problems:
- Adjust the color scale
- Remove the borders
Most of the remaining work will be modifying the
Calculate quantiles to identify color break points
Ultimately, we need to adjust the fill-colors associated with the
To do this, we’ll calculate quantiles.
#======================================================= # IDENTIFY QUANTILES # - the hardest part of creating the finalized # data visualization is getting the color scale # just right # - on final analysis, the best way to adjust the color # scale was by capping the population density # - we'll use these quantiles to find a good cap # via trial and error #======================================================= #--------- # VENTILES #--------- quantile(df.france$density, seq(from = 0, to = 1, by = .05)) # 0% 5% 10% 15% 20% 25% 30% # 0.000000 6.815208 10.032525 12.824891 15.614076 18.485419 21.719516 # 35% 40% 45% 50% 55% 60% 65% # 25.510974 29.538791 34.568568 40.088124 46.762867 54.489134 64.356473 # 70% 75% 80% 85% 90% 95% 100% # 77.376987 94.117647 118.935653 160.578630 242.086849 510.070301 41929.234973 #---------------------- # PERCENTILES: 90 - 100 #---------------------- quantile(df.france$density, seq(from = .9, to = 1, by = .01)) # 90% 91% 92% 93% 94% 95% 96% 97% 98% # 242.0868 268.1167 300.2817 347.6989 412.2042 510.0703 637.9211 876.0090 1300.6435 # 99% 100% # 2649.3653 41929.2350 # MEDIAN median(df.france$density) #40.08812
Create finalized version of chart
Let’s plot the finalized version. I’ll give a little explanation afterwards.
#================== # DETAILED VERSION # - add theme # - add colors #================== df.france %>% mutate(fill_var = density) %>% ggplot() + geom_sf(aes(fill = fill_var, color = fill_var)) + scale_fill_gradientn(colours = c("#000233","#f9cf86","#fceccf","white") ,values = rescale(c(0,500,3000,40000)) ,breaks = c(0,500,1500) ,guide = FALSE) + scale_color_gradientn(colours = c("#000233","#f9cf86","#fceccf","white") ,values = rescale(c(0,500,3000,40000)) ,guide = FALSE) + labs(title = "Population density in France\nvisualized to imitate satellite images of France at night" ,subtitle = "(people/sq km)" ) + theme(text = element_text(family = "Gill Sans", color = "#E1E1E1") ,plot.title = element_text(size = 18, color = "#E1E1E1") ,plot.subtitle = element_text(size = 10) ,plot.background = element_rect(fill = "#000223") ,panel.background = element_rect(fill = "#000223") ,panel.grid.major = element_line(color = "#000223") ,axis.text = element_blank() ,axis.ticks = element_blank() ,legend.background = element_blank() ) #================== # TURN OFF DEV MODE #================== dev_mode(on = F)
OK, the truth is that to create this finalized version, you actually need to do a lot of trial and error.
Essentially, you need to find the right cutoffs in the data and associate those values to specific fill values.
The point where we do that is in the following code:
scale_fill_gradientn(colours = c("#000233","#f9cf86","#fceccf","white") ,values = rescale(c(0,500,3000,40000))
Here, we’re taking values of our fill variable,
Trial and error.
You need to take the quantile values that we calculated earlier, and test them out. Try different cutoff values for
To create visualizations like this, master
I’ll admit that this data visualization is a little challenging to create, simple because of the extensive trial and error that’s required.
It’s also a little out of the ordinary, because we’re using the some packages that are under development.
Having said that, this visualization is still essentially just a fancy application of
So, if you want to learn to create compelling data visualizations, master
Sign up now to master ggplot2
To get a job as a highly-paid data scientist, you need to know how to create compelling data visualizations.
To create compelling visualizations in R, you need to master
Sign up now we’ll show you how.
By signing up, you’ll get our free Data Science Crash Course, where you’ll learn:
- the foundations of ggplot2
- 3 essential data visualizations
- a step-by-step data science learning plan
- the 1 programming language you need to learn
- how to do data manipulation in R
- how to get started with machine learning
- and more …
In addition, you’ll also get weekly data science tutorials, delivered right to your inbox.
SIGN UP NOW