This tutorial will show you how to recode a categorical variable in a Python dataframe.
Specifically, it will show how to recode a column in a Pandas dataframe.
It will probably be helpful to read the whole tutorial from start to finish.
Syntax: How to Recode a Categorical Variable with Pandas
Recoding a categorical variable in Python using Pandas can be performed with a single line of code, but really requires two steps:
- retrieve the variable and remap the old values to new values, using Pandas map
- assign the output of the map step to a new variable, using Pandas assign
Syntactically, it looks like this:
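In general form, the pattern can be sketched as follows. This is only a minimal illustration: the names `df`, `column`, `mapping`, and `df_recoded` are placeholders assumed for the example, not names used elsewhere in this tutorial.

```python
import pandas as pd

# Placeholder data: df, column, and mapping are hypothetical names for illustration.
df = pd.DataFrame({"column": ["a", "b", "a"]})
mapping = {"a": "alpha", "b": "beta"}

# map() recodes the values of one column (a Series);
# assign() attaches the recoded Series to a copy of the dataframe.
df_recoded = df.assign(column=df["column"].map(mapping))
print(df_recoded)
```

Because assign returns a new dataframe, `df` itself is left untouched until you store the result back under the same name.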
It’s somewhat easy to do, but as shown above, you really need to know how to use two different Pandas tools: Pandas assign and Pandas map.
That said, it’s best to learn how to do this by working through an example.
Example: How to Recode a Categorical Variable with Pandas
In this example, we’re going to recode a categorical variable from a single-letter abbreviation to the full word.
Specifically, we’re going to take a “region” variable that contains abbreviated regions (e.g., “N”) and recode them to the full region name (e.g., “North”).
As noted above, we’ll use a couple of Pandas tools to do this.
Import Pandas
First of all, you need to import Pandas.
You can do that with this code.
import pandas as pd
Create Dataframe
Next, we’ll create a dataframe.
Specifically, we’ll create some dummy sales data that contains 4 variables:
- name: the name of the sales person
- region: the region that the sales person operates in
- sales: their amount of total sales
- expenses: their expenses
sales_data = pd.DataFrame({
    "name": ["William", "Emma", "Sofia", "Markus", "Edward", "Thomas", "Ethan", "Olivia", "Arun", "Anika", "Paulo"],
    "sales": [50000, 52000, 90000, 34000, 42000, 72000, 49000, 55000, 67000, 65000, 67000],
    "expenses": [42000, 43000, 50000, 44000, 38000, 39000, 42000, 60000, 39000, 44000, 45000],
    "region": ["E", "N", "E", "S", "W", "W", "S", "W", "W", "E", "S"]
})
And let’s print out the data:
print(sales_data)
OUT:
       name  sales  expenses region
0   William  50000     42000      E
1      Emma  52000     43000      N
2     Sofia  90000     50000      E
3    Markus  34000     44000      S
4    Edward  42000     38000      W
5    Thomas  72000     39000      W
6     Ethan  49000     42000      S
7    Olivia  55000     60000      W
8      Arun  67000     39000      W
9     Anika  65000     44000      E
10    Paulo  67000     45000      S
Notice that the region variable has the following values:
N
S
E
W
It’s probably obvious that these stand for North, South, East, and West.
In this example, we’re going to recode the abbreviated region values to the full word for the region.
Create Mapping with Old Values and New Values
Here, we’ll use a dictionary to create a mapping that connects the old values (that are already in the dataframe) to the new values that we want to output.
region_mapping = {'N': 'North', 'S': 'South', 'E': 'East', 'W': 'West'}
Notice that the left-hand side of each item is the old value and the right-hand side is the new value.
Test the Recode
Next, we’re going to test our variable recode.
Why?
Because sometimes, code doesn’t do exactly what you think it will. When you’re changing data, it’s almost always a good idea to test your process first, so you make sure that it’s working properly before you overwrite your data.
(Trust me, if you overwrite your data with something that’s wrong, it can be a pain in the @$$ … you sometimes need to start all over with your data processing.)
As mentioned above, we’re going to perform the recode by using the Pandas map method, in concert with the Pandas assign method.
Remember that Pandas map operates on a Pandas Series and returns a new Series, and Pandas assign operates on a dataframe and returns a new dataframe.
So we’re going to:
- retrieve the column that we want to operate on using dot syntax
- use the map method to recode the values in that column
- use the assign method to assign the output of Pandas map to a variable in our dataframe
Again: to recode a variable in a dataframe, we need to use a couple different tools.
Ok, let’s do it.
sales_data.assign(TEST = sales_data.region.map(region_mapping))
OUT:
       name  sales  expenses region   TEST
0   William  50000     42000      E   East
1      Emma  52000     43000      N  North
2     Sofia  90000     50000      E   East
3    Markus  34000     44000      S  South
4    Edward  42000     38000      W   West
5    Thomas  72000     39000      W   West
6     Ethan  49000     42000      S  South
7    Olivia  55000     60000      W   West
8      Arun  67000     39000      W   West
9     Anika  65000     44000      E   East
10    Paulo  67000     45000      S  South
Notice here that I’m assigning the output to a new variable called TEST.
Notice also that here, the original region variable is still in the dataframe. This is useful because we can compare them side by side, and make sure that our recoded values are correct and appropriate.
If you see anything out of the ordinary, you may need to go back and modify your variable mapping.
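With a larger dataframe, eyeballing every row gets tedious, so the side-by-side comparison can also be done programmatically. The sketch below is one possible way to do it, using a small stand-in for the sales data; the names `test_df`, `pairs`, and `n_unmapped` are hypothetical helpers introduced just for this check.

```python
import pandas as pd

# Small stand-in for the tutorial's sales_data (only the region column matters here).
sales_data = pd.DataFrame({"region": ["E", "N", "E", "S", "W", "W"]})
region_mapping = {"N": "North", "S": "South", "E": "East", "W": "West"}

# Same test recode as above, stored under a hypothetical name so we can inspect it.
test_df = sales_data.assign(TEST=sales_data.region.map(region_mapping))

# One row per distinct (old value, new value) pair, for a quick visual check.
pairs = test_df[["region", "TEST"]].drop_duplicates().sort_values("region")
print(pairs)

# Values that are missing from the dictionary come back from map() as NaN,
# so a nonzero count here means the mapping is incomplete.
n_unmapped = test_df["TEST"].isna().sum()
print(n_unmapped)
```

If the count of unmapped values is above zero, that is a sign the mapping dictionary needs another entry before you finalize the recode.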
Recode the Variable
Assuming that everything looks good when you test out the variable recode, then you can finalize the variable recode.
Again: remember our strategy here.
We’re using Pandas map to recode the individual column values, and we’re using Pandas assign to assign that new column to our dataframe.
So specifically, we’re going to assign the newly recoded region data back to the variable name, region. Note that this will overwrite the original region variable. That’s why we needed to test the recode first.
sales_data = sales_data.assign(region = sales_data.region.map(region_mapping))
Notice as well that we’re storing the output of this whole process to the sales_data dataframe. Remember, this overwrites your original data. That’s why you need to test the operation first.
Check the Recode
Now, let’s just print out the dataframe so we can check the output.
print(sales_data)
OUT:
       name  sales  expenses region
0   William  50000     42000   East
1      Emma  52000     43000  North
2     Sofia  90000     50000   East
3    Markus  34000     44000  South
4    Edward  42000     38000   West
5    Thomas  72000     39000   West
6     Ethan  49000     42000  South
7    Olivia  55000     60000   West
8      Arun  67000     39000   West
9     Anika  65000     44000   East
10    Paulo  67000     45000  South
This looks good.
All of the one-letter, abbreviated regions have been replaced with the full region name.
Final Notes
This is a somewhat simple example with only 4 levels in the categorical variable.
In the past, on Twitter, I’ve shown some other examples of variable recoding with a larger number of categories.
If you have a lot of categories, you typically need to get a list of the unique category values, add all of those category mappings to your dictionary mapping, and then carefully review/test the output.
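As a rough sketch of that workflow, you could list the unique values in the column and compare them against the keys of your mapping. The example below assumes a deliberately incomplete dictionary, called `partial_mapping` here, to show what a gap looks like; `unique_values` and `missing_keys` are likewise hypothetical names.

```python
import pandas as pd

# Small stand-in for a column with several categories.
sales_data = pd.DataFrame({"region": ["E", "N", "E", "S", "W", "W"]})

# Deliberately incomplete mapping ("W" is left out) for illustration only.
partial_mapping = {"N": "North", "S": "South", "E": "East"}

# List the distinct category values that actually appear in the column.
unique_values = sales_data["region"].unique()

# Any value listed here has no entry in the dictionary and would become NaN after map().
missing_keys = sorted(set(unique_values) - set(partial_mapping))
print(missing_keys)
```

An empty list means every category in the column has a mapping; anything else tells you which entries still need to be added to the dictionary.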
Frequently Asked Questions About Recoding Variables with Pandas
Now that I’ve shown you an example of how to recode a variable in Python using Pandas, let’s look at some frequently asked questions.
Question 1: Why is my dataframe still the same after recoding my variable?
If you used assign() and map() to recode a categorical variable, but your dataframe is unchanged, then you probably forgot to store the output.
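Remember: most Pandas methods, including assign(), return a new object instead of changing the original. If you don’t store that output, it is simply displayed in the console (in an interactive session) and then discarded.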
So if you run this code:
sales_data.assign(region = sales_data.region.map(region_mapping))
… then the original sales_data dataframe will not be changed.
To change the dataframe (with the new recoded variable), you need to use the equals sign to save the output, like this:
sales_data = sales_data.assign(region = sales_data.region.map(region_mapping))
But be careful.
This will overwrite your data.
Make sure that your code works properly before you finalize this data overwrite.
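To see the difference between the two versions, here is a small sketch that runs both on a cut-down stand-in for the sales data. The names `unchanged` and `changed` are hypothetical and exist only to capture the column before and after.

```python
import pandas as pd

# Cut-down stand-in for the tutorial's sales_data.
sales_data = pd.DataFrame({"region": ["E", "N", "S", "W"]})
region_mapping = {"N": "North", "S": "South", "E": "East", "W": "West"}

# Not stored: assign() returns a new dataframe and sales_data is left as it was.
sales_data.assign(region=sales_data.region.map(region_mapping))
unchanged = sales_data["region"].tolist()
print(unchanged)

# Stored: sales_data now refers to the recoded dataframe.
sales_data = sales_data.assign(region=sales_data.region.map(region_mapping))
changed = sales_data["region"].tolist()
print(changed)
```

The first print still shows the one-letter codes, while the second shows the full region names.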
Leave your other questions in the comments below
Do you have other questions about recoding variables in Python?
If so, leave your questions in the comments section below.
To learn more about Pandas, sign up for our email list
This tutorial should have shown you how to recode variables in a Pandas dataframe, but if you really want to master data manipulation and data science in Python, there’s a lot more to learn.
So if you’re ready to learn more about Pandas and more about data science, then sign up for our email newsletter.
We publish FREE tutorials almost every week on:
- Base Python
- NumPy
- Pandas
- scikit-learn
- Machine learning
- Deep learning
- … and more.
When you sign up for our email list, we’ll deliver these free tutorials directly to your inbox.