Last week, Amazon announced that it has started to search for a new city in which to build a second headquarters.
Among several selection criteria, they indicated that they’re looking for a city with more than 1 million people, and a city with a good pool of tech talent.
While reading about Amazon’s new HQ search on a news website, I encountered a dataset of cities that might qualify: cities with over 1 million people in the metro area, and the corresponding percent of people with college degrees in each city.
The news website already visualized it, but I want to show you how to do this in R.
With that in mind, we’re going to scrape the data, wrangle it into shape (using
I’ll preface this by saying that this is an imperfect analysis. We don’t know the full and final selection criteria, and even if we did, a full analysis would be far beyond the scope of a simple blog post.
Having said that, this is a good “first pass” at such an analysis: the quick-and-dirty version.
Furthermore, if you’re getting involved in data science, this will give you some hints about how to use
Ok. Let’s jump in.
First, we’ll just load the packages that we will use.
#==============
# LOAD PACKAGES
#==============

library(rvest)
library(tidyverse)
library(stringr)
library(ggmap)
Next, we’ll use several functions from the rvest package to scrape the table from the news article.
#=======
# SCRAPE
#=======

html.amz_cities <- read_html("https://www.cbsnews.com/news/amazons-hq2-cities-second-headquarters-these-cities-are-contenders/")

# html_nodes() returns every table on the page;
# [[1]] selects the first one (assumed here to be the cities table)
df.amz_cities <- html.amz_cities %>%
  html_nodes("table") %>%
  .[[1]] %>%
  html_table()

# inspect
df.amz_cities %>% head()
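To see what html_table() returns without depending on the live page, here is a minimal sketch that parses a small inline HTML table. The table contents and the object names (html.demo, df.demo) are made up for illustration; only read_html(), html_nodes(), and html_table() come from the code above. Because the header row in this toy table uses plain td cells rather than th cells, it should come through as the first row of data, which is the same situation the next few steps deal with.

```r
library(rvest)

# a tiny, made-up HTML table whose header row uses <td> cells (no <th>)
html.demo <- read_html(
  "<table>
     <tr><td>Metro area</td><td>State</td></tr>
     <tr><td>Alpha-Beta</td><td>AA</td></tr>
     <tr><td>Gamma</td><td>BB</td></tr>
   </table>"
)

# html_nodes() returns every <table>; [[1]] picks the first one
df.demo <- html.demo %>%
  html_nodes("table") %>%
  .[[1]] %>%
  html_table()

# the header text appears as a data row, not as column names
df.demo
```

If a page holds several tables, it is worth checking how many html_nodes("table") returns before picking an index.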
Next, we’ll change the column names.
When we scraped the data, the column names were not read in properly from the HTML table, so we need to add them manually.
#====================
# CHANGE COLUMN NAMES
#====================

# inspect initial column names
colnames(df.amz_cities)

# assign new column names
colnames(df.amz_cities) <- c("metro_area", "state", "population_tot", "bachelors_degree_pct")

# inspect
df.amz_cities %>% head()
As it turns out, when we scraped the data, the original column names (the column names that appeared on the website) ended up in the first row of our newly created dataframe.
That row is header text, not data, so we need to remove it.
#==============================================
# REMOVE FIRST ROW
# - when we scraped the data, the column names
#   on the table were read in as the first row
#   of data.
# - Therefore, we need to remove the first row
#==============================================

df.amz_cities <- df.amz_cities %>% filter(row_number() != 1)
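The rename-then-drop pattern from the last two steps can be tried on a small stand-in data frame. This is only a sketch: df.demo and its contents are invented, and slice(-1) is shown as an alternative to the filter(row_number() != 1) call used above, not as the method this post relies on.

```r
library(dplyr)

# made-up stand-in for a freshly scraped table:
# default column names, with the header text sitting in row 1
df.demo <- data.frame(
  X1 = c("Metro area", "Alpha-Beta", "Gamma"),
  X2 = c("State", "AA", "BB"),
  stringsAsFactors = FALSE
)

# assign readable column names
colnames(df.demo) <- c("metro_area", "state")

# drop the stray header row; slice(-1) removes the first row by position,
# which has the same effect as filter(row_number() != 1)
df.demo <- df.demo %>% slice(-1)

df.demo
```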
Now we’re going to modify the data type of two variables.
#===================================================================================
# MODIFY VARIABLES
# - both bachelors_degree_pct and population_tot were scraped as character variables
#   but we need them in numeric format
# - we will use techniques to parse/coerce these variables from char to numeric
#===================================================================================

#--------------------------------
# PARSE AS NUMBER: population_tot
#--------------------------------

df.amz_cities <- mutate(df.amz_cities, population_tot = parse_number(population_tot))

# check
typeof(df.amz_cities$population_tot)

# inspect
df.amz_cities %>% head()

#-----------------------------
# COERCE: bachelors_degree_pct
#-----------------------------

df.amz_cities <- mutate(df.amz_cities, bachelors_degree_pct = as.numeric(bachelors_degree_pct))
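Why parse_number() for one column and as.numeric() for the other? A likely reason, assuming the scraped population strings contain thousands separators, is that as.numeric() cannot read a string like "4,724,000". The sketch below uses made-up strings (pop_chr and pct_chr are hypothetical names and values, not rows from the scraped table) to show the difference.

```r
library(readr)

pop_chr <- c("4,724,000", "1,234,567")   # made-up strings with thousands separators
pct_chr <- c("36.1", "41.5")             # made-up plain decimal strings

# parse_number() drops grouping marks and other non-numeric characters
pop_num <- parse_number(pop_chr)

# as.numeric() is enough when the string is already a clean number
pct_num <- as.numeric(pct_chr)

# as.numeric() cannot handle the commas: it returns NA (with a warning)
pop_bad <- suppressWarnings(as.numeric(pop_chr))

pop_num
pop_bad
```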
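Next, we're going to create a variable that contains the city name.
When we read in the data, there was a variable for 'metro_area', which can list several cities joined by hyphens.
This being the case, we will create a new variable, city, that holds only the primary city name.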
#=============================================================
# CREATE VARIABLE: city
# - here, we're using the stringr function str_extract() to
#   extract the primary city name from the metro_area variable
# - to do this, we're using a regex to pull out the city name
#   prior to the first '-' character
#=============================================================

df.amz_cities <- df.amz_cities %>% mutate(city = str_extract(metro_area, "^[^-]*"))
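To make the regex concrete, here is a small sketch that applies the same pattern to a few metro-area style strings. The strings are illustrative examples typed in by hand, not rows taken from the scraped table.

```r
library(stringr)

# illustrative metro-area style names (not from the scraped data)
metro <- c("Dallas-Fort Worth-Arlington", "Austin-Round Rock", "Pittsburgh")

# ^      anchor at the start of the string
# [^-]*  zero or more characters that are not a hyphen
city <- str_extract(metro, "^[^-]*")

city
```

One caveat with this pattern: a city whose own name contains a hyphen would be cut at that hyphen, so it is worth scanning the resulting city column by eye.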
Now that we have proper city names, we will geocode our data. We will use the geocode() function from ggmap to get the latitude and longitude of each city.
#=========================================
# GEOCODE
# - here, we're getting the lat/long data
# - note: recent versions of ggmap require a
#   Google API key, registered with
#   register_google(), before geocode() will work
#=========================================

data.geo <- geocode(df.amz_cities$city)

# inspect
data.geo %>% head()
data.geo

#========================================
# RECOMBINE: merge geo data to data frame
#========================================

df.amz_cities <- cbind(df.amz_cities, data.geo)
df.amz_cities
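The recombine step relies on cbind(), which pairs rows purely by position. The sketch below shows that idea offline with stand-in data: df.demo and geo.demo are hypothetical names, and the coordinates are approximate values typed in by hand rather than real geocode() output. It assumes, as the code above does, that the geocoding result has one row per input city, in the same order.

```r
# made-up stand-ins: two cities and approximate coordinates typed in by hand
df.demo  <- data.frame(city = c("Atlanta", "Boston"), stringsAsFactors = FALSE)
geo.demo <- data.frame(lon = c(-84.39, -71.06), lat = c(33.75, 42.36))

# cbind() matches rows by position only, so check the row counts first
stopifnot(nrow(df.demo) == nrow(geo.demo))

df.demo <- cbind(df.demo, geo.demo)

# any city that failed to geocode would show up here with NA coordinates
df.demo[is.na(df.demo$lon) | is.na(df.demo$lat), ]
```

Checking for missing coordinates before plotting is cheap, and it catches cities that would otherwise silently vanish from the map.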
Quickly, we'll use the rename() function to rename the lon variable to long.
#==============================================================
# RENAME VARIABLE: lon -> long
# - we'll rename lon to long, just because 'long' is consistent
#   with the name for longitude in other data sources
#   that we will use
#==============================================================

df.amz_cities <- rename(df.amz_cities, long = lon)

# get column names
df.amz_cities %>% names()
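Next, to make our data a little easier to read, we will re-order the variables. We'll organize it so that the identifying variables (city, state, metro_area) come first, followed by the coordinates and then the two metrics.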
#==========================================
# REORDER COLUMN NAMES
# - here, we're just doing it manually ...
#==========================================

df.amz_cities <- select(df.amz_cities, city, state, metro_area, long, lat, population_tot, bachelors_degree_pct)

# inspect
df.amz_cities %>% head()
Now, we're going to get a map. In order to visualize the data as a map, we need a map of the United States.
To get this, we will use the map_data() function from ggplot2.
#================================================
# GET USA MAP
# - this is the map of the USA states, upon which
#   we will plot our city data points
#================================================

map.states <- map_data("state")
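It helps to look at what map_data() returns before plotting on top of it. The sketch below (map.demo and p.demo are hypothetical names) inspects the columns and draws a bare outline map. Note that map_data() depends on the maps package being installed.

```r
library(ggplot2)   # map_data() also needs the 'maps' package to be installed

map.demo <- map_data("state")

# one row per polygon vertex; 'group' says which polygon a vertex belongs to
names(map.demo)

# minimal outline map: group = group keeps each polygon separate,
# so vertices from different states are not joined into one shape
p.demo <- ggplot(map.demo, aes(x = long, y = lat, group = group)) +
  geom_polygon(fill = "grey80", color = "white")

p.demo
```

The longitude column here is named long, which is why the lon column from the geocoding step was renamed earlier.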
Finally, we're ready to plot.
We'll initially do a "first iteration" to check that everything looks good.
#====================================
# PLOT
# - here, we're actually creating the
#   data visualizations with ggplot()
#====================================

#------------------------------------------------
# FIRST ITERATION
# - this is just a 'first pass' to check that
#   everything looks good before we take the time
#   to format it
#------------------------------------------------

ggplot() +
  geom_polygon(data = map.states, aes(x = long, y = lat, group = group)) +
  geom_point(data = df.amz_cities, aes(x = long, y = lat, size = population_tot, color = bachelors_degree_pct))
At a high level, everything looks OK: the points are in the right locations, and nothing is obviously off.
Keep in mind that, compared to the finalized version below, the 'first iteration' is much simpler to build. This is a great example of the 80/20 rule in data analysis: in this visualization, you can get 80% of the way with only 20% of the total
Now that we have an initial version, we'll polish it by adding titles, formatting theme elements, and by adjusting the legends.
#--------------------------------------------------
# FINALIZED VERSION (FORMATTED)
# - this is the 'finalized' version with all of the
#   detailed formatting
#--------------------------------------------------

ggplot() +
  geom_polygon(data = map.states, aes(x = long, y = lat, group = group)) +
  geom_point(data = df.amz_cities, aes(x = long, y = lat, size = population_tot, color = bachelors_degree_pct*.01), alpha = .5) +
  geom_point(data = df.amz_cities, aes(x = long, y = lat, size = population_tot, color = bachelors_degree_pct*.01), shape = 1) +
  coord_map(projection = "albers", lat0 = 30, lat1 = 40, xlim = c(-121,-73), ylim = c(25,51)) +
  scale_color_gradient2(low = "red", mid = "yellow", high = "green", midpoint = .41, labels = scales::percent_format()) +
  scale_size_continuous(range = c(.9, 11), breaks = c(2000000, 10000000, 20000000), labels = scales::comma_format()) +
  guides(color = guide_legend(reverse = TRUE, override.aes = list(alpha = 1, size = 4))) +
  labs(color = "Bachelor's Degree\nPercent"
       ,size = "Total Population\n(metro area)"
       ,title = "Possible cities for new Amazon Headquarters"
       ,subtitle = "Based on population & percent of people with college degrees") +
  theme(text = element_text(colour = "#444444", family = "Gill Sans")
        ,panel.background = element_blank()
        ,axis.title = element_blank()
        ,axis.ticks = element_blank()
        ,axis.text = element_blank()
        ,plot.title = element_text(size = 28)
        ,plot.subtitle = element_text(size = 12)
        ,legend.key = element_rect(fill = "white")
        )
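One detail in the finalized plot is easy to miss: the points are drawn twice, once as translucent filled circles (alpha = .5) and once as hollow circles (shape = 1), which gives each bubble a crisp outline. Here is a stripped-down sketch of that layering on made-up data. The data frame df.demo and its values are invented for illustration; the scale settings are copied from the finalized plot.

```r
library(ggplot2)

# made-up points, not the scraped city data
df.demo <- data.frame(
  x    = c(1, 2, 3),
  y    = c(1, 3, 2),
  size = c(2e6, 1e7, 2e7),
  pct  = c(0.30, 0.41, 0.55)
)

p.demo <- ggplot(df.demo, aes(x = x, y = y, size = size, color = pct)) +
  geom_point(alpha = .5) +   # layer 1: translucent filled circle
  geom_point(shape = 1) +    # layer 2: hollow circle on top, acting as an outline
  scale_color_gradient2(low = "red", mid = "yellow", high = "green",
                        midpoint = .41,
                        labels = scales::percent_format()) +
  scale_size_continuous(range = c(.9, 11), labels = scales::comma_format())

p.demo
```

With midpoint = .41, values near 41% map to yellow, lower values shade toward red, and higher values shade toward green. This is also why the finalized plot multiplies bachelors_degree_pct by .01: percent_format() expects proportions, not whole-number percentages.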
A quick note: this is not supposed to be a comprehensive analysis
I want to point out that this is not intended to be comprehensive or conclusive in any way. Without detailed selection criteria, it will be difficult to come to any solid conclusions.
Rather, this is intended to give you a hint of what's possible using R tools. If you were so inclined, you could certainly extend this into a much more comprehensive analysis by gathering more data and producing more charts.
Creating great visualizations gets easier once you master your toolkit
As you progress as a data scientist, you will get better at creating visualizations like this.