Python Data Wrangling Example: Wikipedia City Crime Data

In this tutorial, I’ll show you how to scrape and wrangle a dataset using Python.

Specifically, we’ll use Pandas to:

  • scrape the data
  • clean the data by renaming variables, recoding values, and more

This sort of data cleaning is a routine part of the work performed by data scientists.

Although many data scientists work with data from pre-cleaned databases, there are many cases when a data scientist may need to work with relatively “raw” data from an unprocessed source.

Table of Contents:

  • Walkthrough Video
  • Code
  • Questions

This tutorial video will give you some insight into how to use Pandas tools to get and clean a dataset.

There is code at the bottom of the page, and the video will walk you through the code, step by step.

Walkthrough Video

Code

This is the code used in the video.

You can copy and paste it into your IDE or notebook and follow along with the video.

##################################################
# WIKIPEDIA CITY CRIME DATA
# – scrape data and clean it up
#
# TOOLS:
# – Pandas
# – Seaborn
# – Numpy
#
# © Copyright Sharp Sight, Inc.
# sharpsightlabs.com
#
##################################################


#----------------
# IMPORT PACKAGES
#----------------
import pandas as pd
import numpy as np
import seaborn as sns
import seaborn.objects as so



#------------------------
# GET RAW DATA
# – scrape from Wikipedia
#------------------------
url = 'https://en.wikipedia.org/wiki/List_of_United_States_cities_by_crime_rate'

city_crime_usa_HTML = pd.read_html(url)
city_crime_usa_RAW = city_crime_usa_HTML[0]


# INSPECT
type(city_crime_usa_RAW)
city_crime_usa_RAW.columns

print(city_crime_usa_RAW)


#-------------------
# RESET COLUMN NAMES
#-------------------

# GET COLUMN NAMES FROM MULTI-INDEX
# – the scraped table has a 3-level column header;
#   the bottom level (level 2) holds the specific column names
original_column_multindex = city_crime_usa_RAW.columns
column_names = city_crime_usa_RAW.columns.get_level_values(2)

# COPY & RENAME DATAFRAME
# – use .copy(), so later changes don't modify the raw scraped data
city_crime_usa = city_crime_usa_RAW.copy()

# RESET COLUMN NAMES
city_crime_usa.columns = column_names

city_crime_usa.columns
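One caveat, not covered in the video: get_level_values(2) assumes the scraped header is exactly three levels deep, which depends on the current layout of the Wikipedia page. A more defensive sketch, on a toy MultiIndex (the column names here are made up to mimic the scraped table), takes the bottom level by position or joins the levels:

```python
import pandas as pd

# toy DataFrame with a 3-level column header, mimicking the scraped table
columns = pd.MultiIndex.from_tuples([
    ('Info', 'Info', 'City'),
    ('Violent crime', 'Violent crime', 'Robbery'),
])
df = pd.DataFrame([['Austin', 150.2]], columns=columns)

# option 1: take the bottom level, however deep the header is
flat_last = list(df.columns.get_level_values(df.columns.nlevels - 1))

# option 2: join the de-duplicated level values with underscores
flat_joined = ['_'.join(dict.fromkeys(levels)) for levels in df.columns]

print(flat_last)     # ['City', 'Robbery']
print(flat_joined)   # ['Info_City', 'Violent crime_Robbery']
```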


# REMOVE 'TOTAL' COLUMNS
# – after flattening, the violent-crime and property-crime
#   totals both end up named 'Total'; drop removes both
# city_crime_usa.drop(columns = ['Total']).columns
city_crime_usa = city_crime_usa.drop(columns = ['Total'])

# CHECK
city_crime_usa.columns




#-----------------------
# RENAME VARIABLES
# – rename to lower case
#-----------------------

city_crime_usa.columns
# ['State', 'City', 'Population'
# ,'Murder andNonnegligentmanslaughter'
# ,'Rape1'
# ,'Robbery'
# ,'Aggravatedassault'
# ,'Burglary'
# ,'Larceny-theft'
# ,'Motorvehicletheft'
# ,'Arson2']

new_names = ['state', 'city', 'population'
            ,'murder_manslaughter'
            ,'rape'
            ,'robbery'
            ,'aggravated_assault'
            ,'burglary'
            ,'larceny_theft'
            ,'motor_vehicle_theft'
            ,'arson']

city_crime_usa.columns = new_names 

print(city_crime_usa.columns)

# check
city_crime_usa.head()
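A side note on the technique above: assigning a full list to .columns replaces every name at once, and fails if the list length doesn't match. If you only want to rename a few columns, DataFrame.rename with a mapping is a sketch of the alternative (toy frame below, using two of the awkward scraped names):

```python
import pandas as pd

# toy frame with two of the awkward scraped column names
df = pd.DataFrame({'Rape1': [40.6], 'Arson2': [10.9]})

# rename only the columns you name; others are left untouched
df = df.rename(columns = {'Rape1': 'rape', 'Arson2': 'arson'})

print(df.columns.tolist())   # ['rape', 'arson']
```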


#------------------------------------------
# CLEAN UP CITY NAME
# - city names have numbers and commas from
#   footnotes in the original table data
# -
#------------------------------------------

# GET CITY NAME
city_crime_usa.city.unique()

# array(['Mobile3', 'Anchorage', 'Chandler', 'Gilbert', 'Glendale', 'Mesa',
#        'Phoenix', 'Scottsdale', 'Tucson', 'Anaheim', 'Bakersfield',
#        'Chula Vista', 'Fremont', 'Fresno', 'Irvine', 'Long Beach',
#        'Los Angeles', 'Oakland', 'Riverside', 'Sacramento',
#        'San Bernardino', 'San Diego', 'San Francisco', 'San Jose',
#        'Santa Ana', 'Santa Clara', 'Stockton4', 'Aurora',
#        'Colorado Springs', 'Denver', 'Washington, D.C.', 'Hialeah',
#        'Jacksonville', 'Miami', 'Orlando', 'St. Petersburg', 'Tampa',
#        'Atlanta', 'Savannah', 'Honolulu', 'Boise', 'Chicago',
#        'Fort Wayne', 'Indianapolis', 'Des Moines', 'Wichita', 'Lexington',
#        'Louisville Metro6', 'Baton Rouge', 'New Orleans', 'Baltimore',
#        'Boston', 'Detroit', 'Minneapolis', 'St. Paul', 'Kansas City',
#        'St. Louis', 'Omaha', 'Henderson', 'Las Vegas', 'North Las Vegas',
#        'Reno', 'Jersey City', 'Newark', 'Albuquerque', 'Buffalo',
#        'New York', 'Charlotte-Mecklenburg', 'Durham', 'Greensboro',
#        'Cincinnati', 'Cleveland', 'Columbus', 'Toledo4,5,7',
#        'Oklahoma City', 'Tulsa', 'Portland', 'Philadelphia', 'Pittsburgh',
#        'Memphis', 'Nashville', 'Arlington4', 'Austin', 'Corpus Christi',
#        'Dallas', 'El Paso', 'Fort Worth', 'Garland', 'Houston', 'Irving',
#        'Laredo', 'Plano', 'San Antonio', 'Chesapeake', 'Norfolk',
#        'Virginia Beach', 'Seattle', 'Spokane', 'Madison', 'Milwaukee'],
#       dtype=object)


# TRY TO GET MATCH FOR CITIES
# – look for cities with a number at the end
# – this is how they look if there was a footnote
#   for a city in the original dataset

(city_crime_usa
 .filter(['city'])
 .drop_duplicates()
 .query('city.str.match(".*[1-9]$")', engine = 'python')
)

#                  city
# 0             Mobile3
# 26          Stockton4
# 47  Louisville Metro6
# 73        Toledo4,5,7
# 81         Arlington4



# SET DATAFRAME PRINT OPTION
# – print 100 rows
pd.set_option('display.max_rows', 100)


# TEST CITY RECODE/CLEANUP
# – here, we'll use Series str.replace to strip out 
#   numbers as well as commas
# – filter and drop_duplicates are just to get the
#   unique values of city, so we can see if the 
#   recode worked properly

(city_crime_usa
 .assign(city = (city_crime_usa.city.str.replace('[1-9]','',regex = True)
                 .str.replace(',*$','',regex = True)))
 .filter(['city'])
 .drop_duplicates()
 .head(n = 100)
)


# TEST CITY RECODE/CLEANUP
# – again, here we're using Series str.replace to strip out 
#   numbers and commas
# – we're actually assigning the output of the process 
#   this time using the '='
city_crime_usa = (city_crime_usa
                 .assign(city = (city_crime_usa.city.str.replace('[1-9]','',regex = True)
                                 .str.replace(',*$','',regex = True)))
)
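A variation not shown in the video: since the footnote markers only ever appear at the end of a name, the two replace calls can be collapsed into one regex that strips a trailing run of digits and commas. 'Toledo4,5,7' is the trickiest case, and 'Washington, D.C.' shows an interior comma surviving:

```python
import pandas as pd

# small sample of footnote-tagged names from the scrape
cities = pd.Series(['Mobile3', 'Toledo4,5,7', 'Washington, D.C.', 'Austin'])

# one pass: remove any trailing run of digits and/or commas
cities_clean = cities.str.replace(r'[\d,]+$', '', regex = True)

print(cities_clean.tolist())
# ['Mobile', 'Toledo', 'Washington, D.C.', 'Austin']
```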



# CHECK
(city_crime_usa
 .filter(['city'])
 .drop_duplicates()
 .sort_values(by = 'city')
 )


#-------------------------------------------------------------
# CREATE 'TECH CITY' VARIABLE
# – here, we'll add an indicator variable 
#   that flags "tech" cities
# – this list of cities is somewhat arbitrary
# – Some are recognized tech cities like San Francisco
#   as well as San Jose and Santa Clara (i.e., Silicon Valley)
# – Others, like Austin and Miami, are up-and-coming 
#   tech cities
# – I want to subset down to this small list, so I can compare
#   and contrast the oldschool tech cities, with the
#   challengers
#-------------------------------------------------------------
tech_cities = ['Austin'
                ,'Denver'
                ,'Los Angeles'
                ,'Miami'
                ,'New York'
                ,'San Jose'
                ,'San Francisco'
                ,'Santa Clara'
                ,'Seattle']


# np.isin(city_crime_usa.city,tech_cities)


# TEST NEW VARIABLE CREATION
(city_crime_usa
  .assign(tech_city_flg = city_crime_usa.city.isin(tech_cities))
  .filter(['city','tech_city_flg'])
  .drop_duplicates()
  .sort_values(by = 'tech_city_flg')
)


# CREATE/ASSIGN NEW VARIABLE
city_crime_usa = (city_crime_usa
    .assign(tech_city_flg = city_crime_usa.city.isin(tech_cities))
    )
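With tech_city_flg in place, a natural next step is a grouped comparison. The sketch below uses made-up numbers (the real scraped values are rates per 100,000 as on the Wikipedia page, and may arrive as strings, hence the pd.to_numeric coercion):

```python
import pandas as pd

# made-up illustration data in the same shape as city_crime_usa
df = pd.DataFrame({
    'city': ['Austin', 'Seattle', 'Omaha', 'Toledo'],
    'robbery': ['84.2', '93.9', '60.1', '150.0'],   # scraped numbers often arrive as strings
    'tech_city_flg': [True, True, False, False],
})

# coerce to numeric (unparseable values become NaN), then compare group means
df['robbery'] = pd.to_numeric(df['robbery'], errors = 'coerce')
summary = df.groupby('tech_city_flg')['robbery'].mean()

print(summary)
```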


Questions

Do you still have questions about this Python data wrangling example?

Leave your questions in the comments section below.

To learn more about Pandas, sign up for our email list

This tutorial should have shown you a good example of Python data wrangling, but if you really want to master data manipulation and data science in Python, there’s a lot more to learn.

So if you’re ready to learn more about Pandas and more about data science, then sign up for our email newsletter.

We publish FREE tutorials almost every week on:

  • Base Python
  • NumPy
  • Pandas
  • Scikit-learn
  • Machine learning
  • Deep learning
  • … and more.

When you sign up for our email list, we’ll deliver these free tutorials directly to your inbox.

Joshua Ebner

Joshua Ebner is the founder, CEO, and Chief Data Scientist of Sharp Sight. Prior to founding the company, Josh worked as a Data Scientist at Apple. He has a degree in Physics from Cornell University. For more daily data science advice, follow Josh on LinkedIn.
