It’s clear that some economic shifts are happening in the world, and perhaps within the US itself.

In light of this, I decided to do some simple investigation into the economic performance of US cities.

This is, by the way, one of the main reasons to master data science. Once you know a few critical skills, you will be able to very rapidly get some basic information about (almost) any topic.

In a case such as this (when you’re just personally interested), you can just scrape some data and plot it.

But if you’re working in a business, you will need to be able to generate these types of insights quickly. A large part of your job will be gathering data and plotting it in ways that generate insight.

Plotting GDP data for top US cities

In the following code, we’ll scrape some data about US cities and plot a line chart using ggplot2.

There’s actually quite a bit more that we could do with this data, so feel free to create your own plots and leave the code in the comments below.

#==============
# LOAD PACKAGES
#==============
library(tidyverse)
library(stringr)
library(forcats)
library(rvest)
library(ggthemes)


#============
# SCRAPE DATA
#============
df.metro_gdp <- read_html('https://en.wikipedia.org/wiki/List_of_U.S._metropolitan_areas_by_GDP') %>% 
  html_nodes('table') %>% 
  .[[1]] %>% 
  html_table() %>% 
  as_tibble()


#=======================
# REMOVE 'Rank' VARIABLE
#=======================
df.metro_gdp <- df.metro_gdp %>% 
  select(-Rank)


#================
# RENAME VARIABLE
#================
df.metro_gdp <- df.metro_gdp %>% rename(metro_area = `Metropolitan area`)


# inspect
df.metro_gdp


# REMOVE 'MSA' FROM metro_area
df.metro_gdp <- df.metro_gdp %>% mutate(metro_area = str_replace(metro_area, ' MSA', ''))


# COERCE 'metro_area' TO FACTOR
df.metro_gdp <- df.metro_gdp %>% mutate(metro_area = metro_area %>% as_factor())


#========================================================
# CREATE NEW VARIABLE: 
# - the original 'metro_area' variable is rather long
#   because it's a full 'metropolitan statistical area'
# - we can abbreviate these as the plain city name
# - we'll call the new variable 'metro_area_brief'
#========================================================

# get unique values
df.metro_gdp %>% 
  select(metro_area) %>% 
  unique()


#--------------------------------------------------------
# RECODE VALUES
# here we will create the new variable 'metro_area_brief'
#--------------------------------------------------------
df.metro_gdp <- df.metro_gdp %>% 
  mutate(metro_area_brief = recode(metro_area,'New York–Northern New Jersey–Long Island, NY–NJ–PA' = 'New York'
         ,'Los Angeles–Long Beach–Santa Ana, CA' = 'Los Angeles'
         ,'Chicago–Joliet–Naperville, IL–IN–WI' = 'Chicago'
         ,'Dallas–Fort Worth–Arlington, TX' = 'Dallas'
         ,'Washington–Arlington–Alexandria, DC–VA–MD–WV' = 'Washington DC'
         ,'Houston–Sugar Land–Baytown, TX' = 'Houston'
         ,'San Francisco–Oakland–Fremont, CA' = 'San Francisco'
         ,'Philadelphia–Camden–Wilmington, PA–NJ–DE–MD' = 'Philadelphia'
         ,'Boston–Cambridge–Quincy, MA–NH' = 'Boston'
         ,'Atlanta–Sandy Springs–Marietta, GA' = 'Atlanta'
         ))



# INSPECT VALUES
df.metro_gdp %>% glimpse()
df.metro_gdp %>% select(metro_area_brief)


# CHECK TABLE OF CROSS-VALUES
df.metro_gdp %>% 
  #select(metro_area, metro_area_brief) %>% 
  group_by(metro_area, metro_area_brief) %>% 
  summarise()


#======================
# RESHAPE: WIDE TO LONG
#======================
df.metro_gdp <- df.metro_gdp %>% gather(key = year, value = gdp_nominal, -metro_area, -metro_area_brief)


#========================
# COERCE 'year' TO FACTOR
#========================
df.metro_gdp <- df.metro_gdp %>% mutate(year = year %>% as.factor())


#===========================================
# WRANGLE AND COERCE 'gdp_nominal' TO DOUBLE
#===========================================
df.metro_gdp <- mutate(df.metro_gdp, gdp_nominal = str_remove_all(gdp_nominal, ",") %>% as.double())


#================
# PLOT BASIC PLOT
#================
ggplot(df.metro_gdp, aes(x = year, y = gdp_nominal, group = metro_area_brief)) +
  geom_line(aes(color = metro_area_brief))



#==========
# FORMATTED
#==========


df.metro_gdp %>% 
  mutate(highlight_flag = (metro_area_brief == 'New York')) %>%
  ggplot(aes(x = year, y = gdp_nominal, group = metro_area_brief)) +
    geom_line(aes(color = highlight_flag, alpha = highlight_flag), linewidth = 1.5) +  # 'linewidth' needs ggplot2 >= 3.4.0; use 'size' on older versions
    scale_color_manual(values = c('grey', 'red')) +
    scale_alpha_manual(values = c(.7, 1)) +
    labs(title = 'New York is the best performing US city by metro GDP'
         ,subtitle = str_c("Consistently, New York has a much higher GDP than other metro areas."
                           ,"\n77% higher than next highest metro in 2017.")
         ,y = "Nominal GDP\n(metro area, millions of dollars)"
         ,x = 'Year') +
    theme(legend.position = 'none'
          ,text = element_text(color = '#3A3A3A'
                               ,family = 'sans')
          ,plot.title = element_text(margin = margin(b = 10)
                                     ,face = 'bold'
                                     ,size = 20)
          ,axis.title = element_text()
          ,plot.subtitle = element_text(size = 12)
          ) +
     scale_y_continuous(labels = scales::comma_format())
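
If you want to sanity-check a claim like the "77% higher" figure in the subtitle, you can compute the gap between the top metro and the runner-up for each year. Here's a minimal sketch. Note that `df.example` and its GDP values are made-up placeholders (not the scraped Wikipedia numbers), and `df.gap`, `top_metro`, `runner_up`, and `pct_higher` are just illustrative names. With the real data, you would swap in `df.metro_gdp`.

```r
library(dplyr)

# Hypothetical long-format data with the same column names as df.metro_gdp.
# The GDP values are made-up placeholders, NOT the scraped Wikipedia figures.
df.example <- tibble(
  metro_area_brief = c('New York', 'Los Angeles', 'Chicago',
                       'New York', 'Los Angeles', 'Chicago'),
  year = factor(c('2016', '2016', '2016', '2017', '2017', '2017')),
  gdp_nominal = c(1500000, 900000, 600000, 1600000, 1000000, 650000)
)

# For each year, sort metros from highest to lowest GDP,
# then compare the top metro with the runner-up
df.gap <- df.example %>%
  group_by(year) %>%
  arrange(desc(gdp_nominal), .by_group = TRUE) %>%
  summarise(
    top_metro = first(metro_area_brief),
    runner_up = nth(metro_area_brief, 2),
    pct_higher = 100 * (first(gdp_nominal) / nth(gdp_nominal, 2) - 1)
  )

# inspect
df.gap
```

The `pct_higher` column gives the percentage by which the top metro's GDP exceeds the runner-up's in each year, which is the kind of number you can paste into a subtitle with some confidence.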


And here is the finalized chart:


