In this tutorial, I’ll show you how to scrape and wrangle a dataset using Python.
Specifically, we’ll use Pandas to:
- scrape the data
- clean the data by fixing variable names, recoding values, and more
This sort of work closely resembles the data preparation that data scientists do day to day.
Although many data scientists work with data from pre-cleaned databases, there are many cases when a data scientist may need to work with relatively “raw” data from an unprocessed source.
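One of the trickier cleanup steps when scraping Wikipedia tables is flattening the multi-level column headers that pd.read_html produces. Here's a minimal, self-contained sketch of the technique, using a made-up toy table rather than the real crime data:

```python
import pandas as pd

# Wikipedia tables scraped with pd.read_html often arrive with
# multi-level (MultiIndex) column headers. A toy example:
cols = pd.MultiIndex.from_tuples([
    ('Crime', 'Violent', 'Robbery'),
    ('Crime', 'Property', 'Burglary'),
])
df = pd.DataFrame([[10, 20]], columns=cols)

# Keep only the innermost level of the header
df.columns = df.columns.get_level_values(2)
print(list(df.columns))  # ['Robbery', 'Burglary']
```

The same `get_level_values` call appears in the full code below, where it pulls the bottom row of the scraped table's three-level header.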
Table of Contents:
- Walkthrough Video
- Code

This tutorial video will give you some insight into how to use Pandas tools to get and clean a dataset. There is code at the bottom of the page, and the video will walk you through the code, step by step.

Walkthrough Video

Code
This is the code used in the video.
You can copy and paste it into your IDE or notebook and follow along with the video.
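One note before you run the code: one step uses a string accessor inside query() (the `city.str.match(...)` filter). Depending on your pandas installation, this can fail under the default numexpr engine, so you may need to pass engine='python'. A small sketch of the workaround, on toy data rather than the scraped table:

```python
import pandas as pd

df = pd.DataFrame({'city': ['Mobile3', 'Austin', 'Toledo4,5,7']})

# String accessors inside query() may require the Python engine
# (the numexpr engine does not support .str methods).
footnoted = df.query('city.str.match(".*[0-9]$")', engine='python')
print(footnoted.city.tolist())  # ['Mobile3', 'Toledo4,5,7']
```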
##################################################
# WIKIPEDIA CITY CRIME DATA
# – scrape data and clean it up
#
# TOOLS:
# – Pandas
# – Seaborn
# – Numpy
#
# © Copyright Sharp Sight, Inc.
# sharpsightlabs.com
#
##################################################


#----------------
# IMPORT PACKAGES
#----------------
import pandas as pd
import numpy as np
import seaborn as sns
import seaborn.objects as so

#import altair as alt
#alt.renderers.enable('altair_viewer')


#------------------------
# GET RAW DATA
# – scrape from Wikipedia
#------------------------
url = 'https://en.wikipedia.org/wiki/List_of_United_States_cities_by_crime_rate'
city_crime_usa_HTML = pd.read_html(url)

city_crime_usa_RAW = city_crime_usa_HTML[0]

# INSPECT
type(city_crime_usa_RAW)
city_crime_usa_RAW.columns
print(city_crime_usa_RAW)


#-------------------
# RESET COLUMN NAMES
#-------------------

# GET COLUMN NAMES FROM MULTI-INDEX
original_column_multindex = city_crime_usa_RAW.columns
column_names = city_crime_usa_RAW.columns.get_level_values(2)

# COPY & RENAME DATAFRAME
# – use .copy() so that later changes
#   don't also modify city_crime_usa_RAW
city_crime_usa = city_crime_usa_RAW.copy()

# RESET COLUMN NAMES
city_crime_usa.columns = column_names
city_crime_usa.columns

# REMOVE TOTAL COLUMNS
#city_crime_usa.drop(columns = ['Total']).columns
city_crime_usa = city_crime_usa.drop(columns = ['Total'])

# CHECK
city_crime_usa.columns


#-----------------------
# RENAME VARIABLES
# – rename to lower case
#-----------------------
city_crime_usa.columns
# ['State', 'City', 'Population'
# ,'Murder andNonnegligentmanslaughter'
# ,'Rape1'
# ,'Robbery'
# ,'Aggravatedassault'
# ,'Burglary'
# ,'Larceny-theft'
# ,'Motorvehicletheft'
# ,'Arson2']

new_names = ['state', 'city', 'population'
             ,'murder_manslaughter'
             ,'rape'
             ,'robbery'
             ,'aggravated_assault'
             ,'burglary'
             ,'larceny_theft'
             ,'motor_vehicle_theft'
             ,'arson']

city_crime_usa.columns = new_names

print(city_crime_usa.columns)

# check
city_crime_usa.head()


#------------------------------------------
# CLEAN UP CITY NAME
# – city names have numbers and commas from
#   footnotes in the original table data
#------------------------------------------

# GET CITY NAME
city_crime_usa.city.unique()
# array(['Mobile3', 'Anchorage', 'Chandler', 'Gilbert', 'Glendale', 'Mesa',
#        'Phoenix', 'Scottsdale', 'Tucson', 'Anaheim', 'Bakersfield',
#        'Chula Vista', 'Fremont', 'Fresno', 'Irvine', 'Long Beach',
#        'Los Angeles', 'Oakland', 'Riverside', 'Sacramento',
#        'San Bernardino', 'San Diego', 'San Francisco', 'San Jose',
#        'Santa Ana', 'Santa Clara', 'Stockton4', 'Aurora',
#        'Colorado Springs', 'Denver', 'Washington, D.C.', 'Hialeah',
#        'Jacksonville', 'Miami', 'Orlando', 'St. Petersburg', 'Tampa',
#        'Atlanta', 'Savannah', 'Honolulu', 'Boise', 'Chicago',
#        'Fort Wayne', 'Indianapolis', 'Des Moines', 'Wichita', 'Lexington',
#        'Louisville Metro6', 'Baton Rouge', 'New Orleans', 'Baltimore',
#        'Boston', 'Detroit', 'Minneapolis', 'St. Paul', 'Kansas City',
#        'St. Louis', 'Omaha', 'Henderson', 'Las Vegas', 'North Las Vegas',
#        'Reno', 'Jersey City', 'Newark', 'Albuquerque', 'Buffalo',
#        'New York', 'Charlotte-Mecklenburg', 'Durham', 'Greensboro',
#        'Cincinnati', 'Cleveland', 'Columbus', 'Toledo4,5,7',
#        'Oklahoma City', 'Tulsa', 'Portland', 'Philadelphia', 'Pittsburgh',
#        'Memphis', 'Nashville', 'Arlington4', 'Austin', 'Corpus Christi',
#        'Dallas', 'El Paso', 'Fort Worth', 'Garland', 'Houston', 'Irving',
#        'Laredo', 'Plano', 'San Antonio', 'Chesapeake', 'Norfolk',
#        'Virginia Beach', 'Seattle', 'Spokane', 'Madison', 'Milwaukee'],
#       dtype=object)

# TRY TO GET MATCH FOR CITIES
# – look for cities with a number at the end
# – this is how they look if there was a footnote
#   for a city in the original dataset
# – note: the str accessor inside query() requires
#   the Python engine
(city_crime_usa
 .filter(['city'])
 .drop_duplicates()
 .query('city.str.match(".*[1-9]$")', engine = 'python')
 )
#                  city
# 0             Mobile3
# 26          Stockton4
# 47  Louisville Metro6
# 73        Toledo4,5,7
# 81         Arlington4

# SET DATAFRAME PRINT OPTION
# – print 100 rows
pd.set_option('display.max_rows', 100)

# TEST CITY RECODE/CLEANUP
# – here, we'll use Series str.replace to strip out
#   numbers as well as commas
# – regex = True is required, because str.replace
#   treats patterns literally by default in pandas 2.x
# – filter and drop_duplicates are just to get the
#   unique values of city, so we can see if the
#   recode worked properly
(city_crime_usa
 .assign(city = (city_crime_usa.city.str.replace('[1-9]', '', regex = True)
                 .str.replace(',*$', '', regex = True)))
 .filter(['city'])
 .drop_duplicates()
 .head(n = 100)
 )

# ASSIGN CITY RECODE/CLEANUP
# – again, we're using Series str.replace to strip out
#   numbers and commas
# – this time, we're actually assigning the output
#   back to city_crime_usa with '='
city_crime_usa = (city_crime_usa
                  .assign(city = (city_crime_usa.city.str.replace('[1-9]', '', regex = True)
                                  .str.replace(',*$', '', regex = True)))
                  )

# CHECK
(city_crime_usa
 .filter(['city'])
 .drop_duplicates()
 .sort_values(by = 'city')
 )


#-------------------------------------------------------------
# CREATE 'TECH CITY' VARIABLE
# – here, we'll add an indicator variable
#   that flags "tech" cities
# – this list of cities is somewhat arbitrary
# – some are recognized tech cities, like San Francisco
#   as well as San Jose and Santa Clara (i.e., Silicon Valley)
# – others, like Austin and Miami, are up-and-coming
#   tech cities
# – I want to subset down to this small list, so I can compare
#   and contrast the oldschool tech cities with the
#   challengers
#-------------------------------------------------------------
tech_cities = ['Austin'
               ,'Denver'
               ,'Los Angeles'
               ,'Miami'
               ,'New York'
               ,'San Jose'
               ,'San Francisco'
               ,'Santa Clara'
               ,'Seattle']

#np.isin(city_crime_usa.city, tech_cities)

# TEST NEW VARIABLE CREATION
(city_crime_usa
 .assign(tech_city_flg = city_crime_usa.city.isin(tech_cities))
 .filter(['city', 'tech_city_flg'])
 .drop_duplicates()
 .sort_values(by = 'tech_city_flg')
 )

# CREATE/ASSIGN NEW VARIABLE
city_crime_usa = (city_crime_usa
                  .assign(tech_city_flg = city_crime_usa.city.isin(tech_cities))
                  )
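To see the footnote-stripping technique in isolation, here's a tiny, self-contained example on toy city names (not the scraped data). Note that in pandas 2.x, Series.str.replace treats its pattern as a literal string unless you pass regex=True:

```python
import pandas as pd

# Toy Series mimicking city names with footnote digits and commas
cities = pd.Series(['Mobile3', 'Toledo4,5,7', 'Austin'])

# Strip the digits, then any commas left dangling at the end
clean = (cities
         .str.replace('[0-9]', '', regex=True)
         .str.replace(',*$', '', regex=True))
print(clean.tolist())  # ['Mobile', 'Toledo', 'Austin']
```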
Questions
Do you still have questions about this Python data wrangling example?
Leave your questions in the comments section below.
To learn more about Pandas, sign up for our email list
This tutorial has shown you a solid example of Python data wrangling, but if you really want to master data manipulation and data science in Python, there's a lot more to learn.
So if you’re ready to learn more about Pandas and more about data science, then sign up for our email newsletter.
We publish FREE tutorials almost every week on:
- Base Python
- NumPy
- Pandas
- Scikit-learn
- Machine learning
- Deep learning
- … and more.
When you sign up for our email list, we’ll deliver these free tutorials directly to your inbox.