What that means is that you need to identify the most important tools and functions of the Tidyverse, and then practice them until you are fluent.
But once you have mastered the essential functions as isolated units, you need to put them together. By putting the individual pieces together, you solidify your knowledge of how they work individually, and you also begin to learn how to combine small tools to create novel effects.
With that in mind, I want to show you another small project. Here, we’re going to use a fairly small set of functions to create a map of the largest cities in Europe.
As we do this, pay attention to a few things:
- How many packages and functions do you really need?
- Evaluate: how long would it really take to memorize each individual function? (Hint: it’s much, much less time than you think.)
- Which functions have you seen before? Are some of the functions and techniques used more often than others (if you look across many different analyses)?
Ok, with those questions in mind, let’s get after it.
First we’ll just load a few packages.
    #==============
    # LOAD PACKAGES
    #==============
    library(rvest)
    library(tidyverse)
    library(ggmap)
    library(stringr)
Next, we’re going to use the rvest package to scrape the city population table from Wikipedia.
    #===========================
    # SCRAPE DATA FROM WIKIPEDIA
    #===========================
    html.population <- read_html('https://en.wikipedia.org/wiki/List_of_European_cities_by_population_within_city_limits')

    # NOTE: the table index below is an assumption (the original index was lost).
    # Check which <table> on the current page holds the city list and adjust it.
    df.euro_cities <- html.population %>%
      html_nodes("table") %>%
      .[[2]] %>%
      html_table()

    # inspect
    df.euro_cities %>% head()
    df.euro_cities %>% names()
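If the scraping pipeline feels opaque, it can help to run the same three rvest functions on a tiny piece of HTML that you control. The sketch below is only an illustration: the inline HTML, the city names, and the object names (`html.demo`, `tables.demo`, `df.demo`) are all made up and are not taken from the Wikipedia page.

```r
library(rvest)

# Hypothetical inline HTML standing in for a web page with two tables
html.demo <- read_html('
  <html><body>
    <table>
      <tr><th>Note</th></tr>
      <tr><td>ignore me</td></tr>
    </table>
    <table>
      <tr><th>City</th><th>Country</th></tr>
      <tr><td>Alpha</td><td>Aland</td></tr>
      <tr><td>Beta</td><td>Bland</td></tr>
    </table>
  </body></html>')

# html_nodes() returns every <table> node found in the document
tables.demo <- html.demo %>% html_nodes("table")
length(tables.demo)

# Pick one table by position, then convert it to a data frame
df.demo <- tables.demo %>%
  .[[2]] %>%
  html_table()

df.demo
```

The position you pass to `.[[ ]]` depends entirely on how the page is laid out, so on a real page it is worth checking `length()` and inspecting a few candidates before settling on one.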
Here at Sharp Sight, we haven’t worked with the rvest package much, so this code might look a little unfamiliar.
Having said that, just take a close look. How many functions did we use from rvest? Only three: read_html(), html_nodes(), and html_table().
Ok. Now we’ll do a little data cleaning.
First, we’re going to remove some of the variables using dplyr::select().
    #============================
    # REMOVE EXTRANEOUS VARIABLES
    #============================
    df.euro_cities <- select(df.euro_cities, -Date, -Image, -Location, -`Ref.`, -`2011 Eurostat\npopulation`)

    # inspect
    df.euro_cities %>% names()
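One thing to keep in mind with scraped tables is that the column names can change when the source page is edited, and `select()` will throw an error if you try to drop a column that no longer exists. As a hedged alternative (this is not what the tutorial code does), here is a small sketch using `any_of()`, which quietly skips names that are missing. The toy data frame and the names `df.demo`, `drop.cols`, and `df.small` are hypothetical.

```r
library(dplyr)

# Toy data frame; values are placeholders, not real data
df.demo <- tibble(
  Rank   = 1:2,
  City   = c("Alpha", "Beta"),
  `Ref.` = c("[1]", "[2]")
)

# "Image" is deliberately absent from df.demo
drop.cols <- c("Ref.", "Image")

# any_of() drops the columns that exist and ignores the ones that don't
df.small <- select(df.demo, -any_of(drop.cols))

names(df.small)
```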
After removing the variables that we don’t want, we only have four variables. These remaining raw variable names could be cleaned up a little.
Ideally, we want names that are lower case (because they are easier to type). We also want variable names that are brief and descriptive.
In this case, renaming these variables to be brief, descriptive, and lower-case is fairly straightforward. Here, we will use very simple variable names: like rank, city, country, and population.
To add these new variable names, we can simply assign them by using the colnames() function.
    #===============
    # RENAME COLUMNS
    #===============
    colnames(df.euro_cities) <- c("rank", "city", "country", "population")

    # inspect
    df.euro_cities %>% names()
    df.euro_cities %>% head()
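Assigning to `colnames()` is positional: the first new name goes to the first column, whatever that column happens to be. If you would rather rename by name, `dplyr::rename()` is one option. The sketch below assumes a toy data frame with made-up column names (`Official population` in particular is hypothetical), so treat it as an illustration rather than a drop-in replacement.

```r
library(dplyr)

# Toy data frame with hypothetical raw column names
df.demo <- tibble(
  Rank    = 1:2,
  City    = c("Alpha", "Beta"),
  Country = c("Aland", "Bland"),
  `Official population` = c(10, 20)
)

# rename() uses new_name = old_name, so column order does not matter
df.renamed <- df.demo %>%
  rename(rank       = Rank,
         city       = City,
         country    = Country,
         population = `Official population`)

names(df.renamed)
```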
Now that we have clean variable names, we will do a little modification of the data itself.
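When we scraped the data from Wikipedia, some extraneous characters appeared in the population variable. We want to remove them so that only the population number remains.
To do this, we will use a few functions from the stringr package.
First, we use str_extract() to pull out the ‘♠’ character and everything that follows it.
This is a quick way to get the numbers at the end of the string, but we actually don’t want to keep the ‘♠’ character. So, after we extract the population numbers (along with the ‘♠’), we then strip off the ‘♠’ character by using str_replace(). Finally, parse_number() converts the cleaned-up string into a number.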
    #========================================================================
    # CLEAN UP VARIABLE: population
    # - when the data are scraped, there are some extraneous characters
    #   in the "population" variable.
    #   ... you can see leading numbers and some other items
    # - We will use stringr functions to extract the actual population data
    #   (and remove the stuff we don't want)
    # - We are executing this transformation inside dplyr::mutate() to
    #   modify the variable inside the dataframe
    #========================================================================
    df.euro_cities <- df.euro_cities %>%
      mutate(population = str_extract(population, "♠.*$") %>%
               str_replace("♠", "") %>%
               parse_number())

    df.euro_cities %>% head()
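To see what each step of that pipeline does, you can run it on a couple of strings outside of the data frame. The input strings below are hypothetical (they only mimic the "leading digits, then ♠, then the number" pattern), and the object names are made up for this sketch. The spade is written as `\u2660` so the example does not depend on your file encoding.

```r
library(stringr)
library(readr)

# Hypothetical raw values: sort-key digits, a spade, then the number we want
raw.population <- c("0012\u2660" %>% paste0("1,234,567"),
                    "0034\u2660" %>% paste0("2,000,000"))

# Step 1: keep the spade and everything after it
extracted <- str_extract(raw.population, "\u2660.*$")

# Step 2: remove the spade itself
cleaned <- str_replace(extracted, "\u2660", "")

# Step 3: turn "1,234,567" into the number 1234567
population <- parse_number(cleaned)

cleaned
population
```

Running the steps one at a time like this is also a handy way to debug: if the final numbers look wrong, you can check which intermediate result first goes off track.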
We will also do some quick data wrangling on the city names. Two of the city names on the Wikipedia page (Istanbul and Moscow) had footnotes. Because of this, those two city names had extra bracket characters when we read them in (e.g. “Istanbul[a]”).
We want to strip off those footnotes. To do this, we will once again use str_replace() inside mutate().
    #==========================================================================
    # REMOVE "notes" FROM CITY NAMES
    # - two cities had extra characters for footnotes
    #   ... we will remove these using stringr::str_replace and dplyr::mutate()
    #==========================================================================
    df.euro_cities <- df.euro_cities %>%
      mutate(city = str_replace(city, "\\[.\\]", ""))

    df.euro_cities %>% head()
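The pattern `"\\[.\\]"` matches an opening bracket, exactly one character, and a closing bracket. That is enough for footnotes like "[a]", but it would miss a longer marker. Here is a small sketch that shows the difference; "City[10]" is a hypothetical value added only to illustrate the edge case, and the broader pattern is a suggestion, not something the tutorial code uses.

```r
library(stringr)

# "Istanbul[a]" comes from the text above; "City[10]" is hypothetical
city.raw <- c("Istanbul[a]", "Moscow[b]", "London", "City[10]")

# Pattern from the tutorial: one character between the brackets
cleaned.single <- str_replace(city.raw, "\\[.\\]", "")

# Broader pattern: any run of characters between the brackets (lazy match)
cleaned.any <- str_replace(city.raw, "\\[.*?\\]", "")

cleaned.single
cleaned.any
```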
For the sake of making the data a little easier to explain, we’re going to filter the data to records where the population is at least 1,000,000.
Keep in mind: this is a straightforward use of dplyr::filter().
    #=========================
    # REMOVE CITIES UNDER 1 MM
    #=========================
    df.euro_cities <- filter(df.euro_cities, population >= 1000000)

    #=================
    # COERCE TO TIBBLE
    #=================
    df.euro_cities <- df.euro_cities %>% as_tibble()
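If you want to convince yourself how the `>=` cutoff behaves at the boundary, a three-row toy example is enough. The cities and populations below are placeholders invented for this sketch.

```r
library(dplyr)

# Placeholder data: one city just under, one exactly at, one over the cutoff
df.demo <- tibble(
  city       = c("Alpha", "Beta", "Gamma"),
  population = c(999999, 1000000, 2500000)
)

# >= keeps the city that sits exactly on 1,000,000
df.large <- filter(df.demo, population >= 1000000)

df.large
```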
Before we plot the cities on a map, we need to get geospatial information. That is, we need to geocode these records.
To do this, we will use the geocode() function from ggmap.
After obtaining the geo data, we will join it back to the original data using cbind().
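    #========================================================
    # GEOCODE
    # - here, we're just getting longitude and latitude data
    #   using ggmap::geocode()
    # - NOTE: recent versions of ggmap require a Google API key
    #   (registered with register_google()) before geocode() will run
    #========================================================
    data.geo <- geocode(df.euro_cities$city)

    # cbind() attaches the coordinates by row position
    df.euro_cities <- cbind(df.euro_cities, data.geo)

    # inspect
    df.euro_cities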
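Because cbind() matches rows purely by position, it only works when the geocoded rows come back in the same order as the cities. If you ever need something more defensive, one option is to keep the city name alongside the coordinates and join on it. The sketch below assumes exactly that; it does not call geocode(), the `rank` values are placeholders, and the coordinates are rounded approximations for London and Paris.

```r
library(dplyr)

# Placeholder city data (rank values are illustrative only)
df.cities <- tibble(
  city = c("London", "Paris"),
  rank = c(1, 2)
)

# Stand-in for geocoded output, with the city name kept as a key.
# Rows are deliberately in a different order than df.cities.
data.geo.demo <- tibble(
  city = c("Paris", "London"),
  lon  = c(2.35, -0.13),
  lat  = c(48.86, 51.51)
)

# Join on the key, so row order no longer matters
df.joined <- left_join(df.cities, data.geo.demo, by = "city")

df.joined
```

With a positional cbind(), the shuffled rows above would have attached Paris's coordinates to London. The keyed join avoids that, at the cost of one extra column to maintain.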
To map the data points, we also need a map that will sit in the background, underneath the points.
We will use the function map_data() to get a world map.
    #==============
    # GET WORLD MAP
    #==============
    map.europe <- map_data("world")
Now that the data are clean, and we have a world map, we will plot the data.
    #=================================
    # PLOT BASIC MAP
    # - this map is "just the basics"
    #=================================
    ggplot() +
      geom_polygon(data = map.europe, aes(x = long, y = lat, group = group)) +
      geom_point(data = df.euro_cities, aes(x = lon, y = lat, size = population), color = "red", alpha = .3) +
      coord_cartesian(xlim = c(-9,45), ylim = c(32,70))
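Notice that the plot zooms in on Europe with coord_cartesian() rather than with xlim() and ylim(). The difference matters for maps: coord_cartesian() changes the visible window without touching the data, while scale limits treat anything outside the range as missing, which can distort polygons that extend past the edge. Here is a minimal sketch of the two approaches using a made-up rectangle in place of the world map; the object names are hypothetical.

```r
library(ggplot2)

# A made-up rectangle that extends beyond the window we want to show
square <- data.frame(
  long  = c(-20, 60, 60, -20),
  lat   = c(30, 30, 75, 75),
  group = 1
)

# Zoom: every vertex is kept, only the viewing window changes
p.zoom <- ggplot() +
  geom_polygon(data = square, aes(x = long, y = lat, group = group)) +
  coord_cartesian(xlim = c(-9, 45), ylim = c(32, 70))

# Scale limits: vertices outside the range are treated as missing,
# so the polygon is no longer drawn as intended
p.clip <- ggplot() +
  geom_polygon(data = square, aes(x = long, y = lat, group = group)) +
  xlim(-9, 45) +
  ylim(32, 70)
```

Print `p.zoom` and `p.clip` side by side to see the effect. For country outlines, the zoomed version is almost always the one you want.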
This first plot is a "first iteration." In this version, we haven't done any serious formatting. It's just a "first pass" to make sure that the data are in the right format. If we had found anything "out of line," we would go back to an earlier part of the analysis and modify our code to correct any problems in the data.
Based on this plot, it looks like the data are essentially correct.
Now, we just want to "polish" the visualization by changing colors, fonts, sizes, etc.
    #====================================================
    # PLOT 'POLISHED' MAP
    # - this version is formatted and cleaned up a little
    #   just to make it look more aesthetically pleasing
    #====================================================

    #-------------
    # CREATE THEME
    #-------------
    theme.maptheeme <-
      theme(text = element_text(family = "Gill Sans", color = "#444444")) +
      theme(plot.title = element_text(size = 32)) +
      theme(plot.subtitle = element_text(size = 16)) +
      theme(panel.grid = element_blank()) +
      theme(axis.text = element_blank()) +
      theme(axis.ticks = element_blank()) +
      theme(axis.title = element_blank()) +
      theme(legend.background = element_blank()) +
      theme(legend.key = element_blank()) +
      theme(legend.title = element_text(size = 18)) +
      theme(legend.text = element_text(size = 10)) +
      theme(panel.background = element_rect(fill = "#596673")) +
      theme(panel.grid = element_blank())

    #------
    # PLOT
    #------
    #fill = "#AAAAAA",colour = "#818181", size = .15)
    ggplot() +
      geom_polygon(data = map.europe, aes(x = long, y = lat, group = group), fill = "#DEDEDE",colour = "#818181", size = .15) +
      geom_point(data = df.euro_cities, aes(x = lon, y = lat, size = population), color = "red", alpha = .3) +
      geom_point(data = df.euro_cities, aes(x = lon, y = lat, size = population), color = "red", shape = 1) +
      coord_cartesian(xlim = c(-9,45), ylim = c(32,70)) +
      labs(title = "European Cities with Large Populations", subtitle = "Cities with over 1MM population, within city limits") +
      scale_size_continuous(range = c(.7,15), breaks = c(1100000, 4000000, 8000000, 12000000), name = "Population", labels = scales::comma_format()) +
      theme.maptheeme
Not too bad.
Keep in mind that as a reader, you get to see the finished product: the finalized visualization and the finalized code.
But as always, the process for creating a visualization like this is highly iterative. If you work on a similar project, expect to change your code dozens of times. You'll change your data-wrangling code as you work with the data and identify new items you need to change or fix. You'll also change your plotting code as you refine the visualization.
If you master the basics, the hard things never seem hard
Creating this visualization is actually not terribly hard to do, but if you're somewhat new to R, it might seem rather challenging.
If you look at this and it seems difficult, then you need to understand: once you master the basics, the hard things never seem hard.
What I mean is that this visualization is nothing more than a careful application of a few dozen simple tools, arranged in a way that creates something new.
Once you master individual tools from ggplot2, dplyr, and the rest of the Tidyverse, projects like this become very easy to execute.