How to Recode a Categorical Variable in a Python Dataframe

This tutorial will show you how to recode a categorical variable in a Python dataframe.

Specifically, it will show how to recode a column in a Pandas dataframe.

It will probably be helpful if you read the whole tutorial from start to finish.

Syntax: How to Recode a Categorical Variable with Pandas

Recoding a categorical variable in Python using Pandas can be performed with a single line of code, but it really involves two steps:

  1. retrieve the variable and remap the old values to new values,
    using Pandas map
  2. assign the output of the map step to a new variable,
    using Pandas assign

Syntactically, it looks like this:

An image that shows the syntax for recoding a column in a Pandas dataframe.
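In code form, the pattern can be sketched roughly as follows. The dataframe `df`, the column names `status` and `status_full`, and the dictionary `status_mapping` are hypothetical names chosen only for illustration.

```python
import pandas as pd

# Hypothetical dataframe and mapping, used only to illustrate the pattern
df = pd.DataFrame({"status": ["Y", "N", "Y"]})
status_mapping = {"Y": "Yes", "N": "No"}

# Step 1: map() recodes the values of the retrieved column
# Step 2: assign() attaches the recoded values as a column of a new dataframe
df_recoded = df.assign(status_full=df.status.map(status_mapping))

print(df_recoded)
```

Here the result is stored under a new name, so the original `df` is left as it was.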

It’s somewhat easy to do, but as shown above, you really need to know how to use two different Pandas tools: Pandas assign and Pandas map.

That said, it’s best to learn how to do this by working through an example.

Example: How to Recode a Categorical Variable with Pandas

In this example, we’re going to recode a categorical variable from single-letter codes to full words.

Specifically, we’re going to take a “region” variable that contains abbreviated regions (e.g., “N”) and recode them to the full region names (e.g., “North”).

As noted above, we’ll use a couple of Pandas tools to do this.

Import Pandas

First of all, you need to import Pandas.

You can do that with this code.

import pandas as pd

Create Dataframe

Next, we’ll create a dataframe.

Specifically, we’ll create some dummy sales data that contains 4 variables:

  • name: the name of the sales person
  • region: the region that the sales person operates in
  • sales: their amount of total sales
  • expenses: their expenses

sales_data = pd.DataFrame({
    "name": ["William", "Emma", "Sofia", "Markus", "Edward", "Thomas", "Ethan", "Olivia", "Arun", "Anika", "Paulo"],
    "sales": [50000, 52000, 90000, 34000, 42000, 72000, 49000, 55000, 67000, 65000, 67000],
    "expenses": [42000, 43000, 50000, 44000, 38000, 39000, 42000, 60000, 39000, 44000, 45000],
    "region": ["E", "N", "E", "S", "W", "W", "S", "W", "W", "E", "S"],
})

And let’s print out the data:

print(sales_data)

OUT:

       name  sales  expenses region
0   William  50000     42000      E
1      Emma  52000     43000      N
2     Sofia  90000     50000      E
3    Markus  34000     44000      S
4    Edward  42000     38000      W
5    Thomas  72000     39000      W
6     Ethan  49000     42000      S
7    Olivia  55000     60000      W
8      Arun  67000     39000      W
9     Anika  65000     44000      E
10    Paulo  67000     45000      S

Notice that the region variable has the following values:

  • N
  • S
  • E
  • W

It’s probably obvious that these stand for North, South, East, and West.

In this example, we’re going to recode the abbreviated region values to the full word for the region.

Create Mapping with Old Values and New Values

Here, we’ll use a dictionary to create a mapping that connects the old values (that are already in the dataframe) to the new values that we want to output.

region_mapping = {
    'N': 'North',
    'S': 'South',
    'E': 'East',
    'W': 'West',
}

Notice that the left-hand side of each item is the old value and the right-hand side is the new value.

Test the Recode

Next, we’re going to test our variable recode.

Why?

Because sometimes, code doesn’t do exactly what you think it will. When you’re changing data, it’s almost always a good idea to test your process first, so you make sure that it’s working properly before you overwrite your data.

(Trust me, if you overwrite your data with something that’s wrong, it can be a pain in the @$$ … you sometimes need to start all over with your data processing.)

As mentioned above, we’re going to perform the recode by using the Pandas map method, in concert with the Pandas assign method.

Remember that Pandas map operates on a Pandas Series and returns a new Series, and Pandas assign operates on a dataframe and returns a new dataframe. Neither one changes the original object.

So we’re going to:

  • retrieve the column that we want to operate on using dot syntax
  • use the map method to recode the values in that column
  • use the assign method to assign the output of Pandas map to a variable in our dataframe

Again: to recode a variable in a dataframe, we need to use a couple different tools.

Ok, let’s do it.

sales_data.assign(TEST = sales_data.region.map(region_mapping))

OUT:

       name  sales  expenses region   TEST
0   William  50000     42000      E   East
1      Emma  52000     43000      N  North
2     Sofia  90000     50000      E   East
3    Markus  34000     44000      S  South
4    Edward  42000     38000      W   West
5    Thomas  72000     39000      W   West
6     Ethan  49000     42000      S  South
7    Olivia  55000     60000      W   West
8      Arun  67000     39000      W   West
9     Anika  65000     44000      E   East
10    Paulo  67000     45000      S  South

Notice here that I’m assigning the output to a new variable called TEST.

Notice also that here, the original region variable is still in the dataframe. This is useful because we can compare them side by side, and make sure that our recoded values are correct and appropriate.

If you see anything out of the ordinary, you may need to go back and modify your variable mapping.
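One thing worth looking for in this test step: map returns a missing value (NaN) for any value that has no key in the dictionary. The sketch below uses a small hypothetical sample (not the sales data above) with a deliberately unmapped code, "X", to show one way of spotting those rows and of comparing old and new values.

```python
import pandas as pd

# Small hypothetical sample; "X" is deliberately missing from the mapping
sample = pd.DataFrame({"region": ["E", "N", "X", "W"]})
region_mapping = {"N": "North", "S": "South", "E": "East", "W": "West"}

tested = sample.assign(TEST=sample.region.map(region_mapping))

# Rows where the recode produced a missing value
unmapped = tested[tested.TEST.isna()]
print(unmapped)

# Cross-tabulate old values against new values to compare them side by side
print(pd.crosstab(tested.region, tested.TEST))
```

If `unmapped` has any rows, the mapping dictionary is probably missing a key.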

Recode the Variable

Assuming that everything looks good when you test out the variable recode, then you can finalize the variable recode.

Again: remember our strategy here.

We’re using Pandas map to recode the individual column values, and we’re using Pandas assign to assign that new column to our dataframe.

So specifically, we’re going to assign the newly recoded region data back to the variable name, region. Note that this will overwrite the original region variable. That’s why we needed to test the recode first.

sales_data = sales_data.assign(region = sales_data.region.map(region_mapping))

Notice as well that we’re storing the output of this whole process back to the sales_data name, which replaces the original dataframe with the recoded one.

Check the Recode

Now, let’s just print out the dataframe so we can check the output.

print(sales_data)

OUT:

       name  sales  expenses region
0   William  50000     42000   East
1      Emma  52000     43000  North
2     Sofia  90000     50000   East
3    Markus  34000     44000  South
4    Edward  42000     38000   West
5    Thomas  72000     39000   West
6     Ethan  49000     42000  South
7    Olivia  55000     60000   West
8      Arun  67000     39000   West
9     Anika  65000     44000   East
10    Paulo  67000     45000  South

This looks good.

All of the one-letter, abbreviated regions have been replaced with the full region name.
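For a check that doesn’t rely on reading every row, the whole example can be put together and the recoded values counted. This is a sketch that repeats the steps above in one script; the value_counts call is an addition of mine, not part of the original walkthrough.

```python
import pandas as pd

sales_data = pd.DataFrame({
    "name": ["William", "Emma", "Sofia", "Markus", "Edward", "Thomas",
             "Ethan", "Olivia", "Arun", "Anika", "Paulo"],
    "sales": [50000, 52000, 90000, 34000, 42000, 72000,
              49000, 55000, 67000, 65000, 67000],
    "expenses": [42000, 43000, 50000, 44000, 38000, 39000,
                 42000, 60000, 39000, 44000, 45000],
    "region": ["E", "N", "E", "S", "W", "W", "S", "W", "W", "E", "S"],
})

region_mapping = {"N": "North", "S": "South", "E": "East", "W": "West"}

# Recode the region column and store the result
sales_data = sales_data.assign(region=sales_data.region.map(region_mapping))

# Count how many rows fall into each recoded region
region_counts = sales_data.region.value_counts()
print(region_counts)
```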

Final Notes

This is a somewhat simple example with only 4 levels in the categorical variable.

In the past, on Twitter, I’ve shown some other examples of variable recoding with a larger number of categories.

If you have a lot of categories, you typically need to get a list of the unique category values, add all of those category mappings to your dictionary mapping, and then carefully review/test the output.
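As a rough sketch of that workflow, you can list the unique values and compare them against the dictionary keys. The series below is a hypothetical example, with an extra code "NE" that the mapping doesn’t cover.

```python
import pandas as pd

# Hypothetical column of region codes; "NE" has no entry in the mapping
regions = pd.Series(["E", "N", "E", "S", "W", "NE"])
region_mapping = {"N": "North", "S": "South", "E": "East", "W": "West"}

# Get the distinct category values that appear in the data
unique_values = regions.unique()
print(unique_values)

# Find the values that still need an entry in the mapping
missing_keys = sorted(set(unique_values) - set(region_mapping))
print(missing_keys)
```

Any value listed in `missing_keys` would come out as a missing value after the recode, so it should be added to the dictionary first.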

Frequently Asked Questions About Recoding Variables with Pandas

Now that I’ve shown you an example of how to recode a variable in Python using Pandas, let’s look at some frequently asked questions.

Frequently asked questions:

Question 1: Why is my dataframe still the same after recoding my variable?

If you used assign() and map() to recode a categorical variable, but your dataframe is unchanged, then you probably forgot to store the output.

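Remember: by default, most Pandas methods return a new object instead of changing the original. In an interactive session, that new object is simply printed to the console unless you store it.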

So if you run this code:

sales_data.assign(region = sales_data.region.map(region_mapping))

… then the original sales_data dataframe will not be changed.

To change the dataframe (with the new recoded variable), you need to use the equals sign to save the output, like this:

sales_data = sales_data.assign(region = sales_data.region.map(region_mapping))

But be careful.

This will overwrite your data.

Make sure that your code works properly before you finalize this data overwrite.
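One cautious option is to store the output under a different name first, so the original stays available. The sketch below uses a two-row sample and a hypothetical name, `sales_data_recoded`, to show the difference between discarding and storing the output.

```python
import pandas as pd

# Two-row sample for illustration
sales_data = pd.DataFrame({"name": ["William", "Emma"], "region": ["E", "N"]})
region_mapping = {"N": "North", "S": "South", "E": "East", "W": "West"}

# Output not stored: assign() returns a new dataframe, sales_data is untouched
sales_data.assign(region=sales_data.region.map(region_mapping))
print(sales_data.region.tolist())

# Output stored under a new (hypothetical) name: the original is still available
sales_data_recoded = sales_data.assign(region=sales_data.region.map(region_mapping))
print(sales_data_recoded.region.tolist())
```

Once you’re satisfied with `sales_data_recoded`, you can assign it back to `sales_data`.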

Leave your other questions in the comments below

Do you have other questions about recoding variables in Python?

If so, leave your questions in the comments section below.

To learn more about Pandas, sign up for our email list

This tutorial should have shown you how to recode variables in a Pandas dataframe, but if you really want to master data manipulation and data science in Python, there’s a lot more to learn.

So if you’re ready to learn more about Pandas and more about data science, then sign up for our email newsletter.

We publish FREE tutorials almost every week on:

  • Base Python
  • NumPy
  • Pandas
  • Scikit learn
  • Machine learning
  • Deep learning
  • … and more.

When you sign up for our email list, we’ll deliver these free tutorials directly to your inbox.

Joshua Ebner

Joshua Ebner is the founder, CEO, and Chief Data Scientist of Sharp Sight. Prior to founding the company, Josh worked as a Data Scientist at Apple. He has a degree in Physics from Cornell University. For more daily data science advice, follow Josh on LinkedIn.
