Pandas Drop Duplicates, Explained

This tutorial will show you how to use Pandas drop duplicates to remove duplicate rows from a dataframe.

The tutorial will explain what the technique does, break down the syntax, and show you clear examples.


Having said that, if you really want to know how this technique works, you should probably read the whole tutorial.

An Introduction to Pandas Drop Duplicates

Stated simply, the Pandas drop duplicates method removes duplicate rows from a Pandas dataframe.

I’ll show you some examples of this in the examples section, but first, I want to quickly review some fundamentals about Pandas and Pandas dataframes.

This will give you some context, and help you understand exactly what this technique does, and why we might use it.

Dataframes Store Python Data

First, let’s quickly review what a dataframe is. (Remember: the drop duplicates method operates on Pandas dataframes.)

A dataframe is a data structure in Python that’s available in the Pandas package.

We use dataframes to store certain types of data, and we use Pandas techniques to manipulate dataframe data.

Dataframes have Rows and Columns

Let’s look more carefully at dataframe structure. Pandas dataframes store data in a row-and-column format.

[Image: an example Pandas dataframe, showing its row-and-column structure.]

Typically, the columns of a dataframe represent variables, and the rows record individual observations. For example, if you had a dataframe with sales data, the individual rows might record the sales information for individual people.

If you’ve ever used Excel, a Pandas dataframe is really a lot like an Excel spreadsheet, in the sense that they both have this row-and-column structure.

It’s Possible to Have Rows with the Exact Same Data

Importantly, it’s possible for a dataframe to have duplicate rows of data.

So for example, if two rows had the same value for every column, we’d consider those to be duplicate rows.

[Image: a dataframe with duplicate rows.]

There’s also a different type of duplicate, where two rows have the same value for one or more important columns (but not necessarily every column).

[Image: a dataframe with duplicate data for two variables, but different data for the remaining variables.]

Sometimes, Duplicate Records are Bad

Sometimes, these duplicates are okay.

When you’re working with some types of data, it might be normal – even expected – to have duplicate rows of data.

But there are some instances where having duplicate rows of data is bad. Sometimes, duplicate rows can cause problems with an analysis or a specific data science technique.

For example, imagine you’re working with sales data. If a data system accidentally recorded a duplicate record for a particular salesperson, it might over-report that person’s actual sales performance.

Frankly, there are many possible examples in fields like marketing, finance, accounting, and other areas, where having duplicate data could be an issue.

Additionally, there are certain types of data wrangling operations where duplicates can cause big problems.

In particular, duplicated data often causes problems when data scientists aggregate data.

Duplicate rows can also be a really big problem when you merge or join multiple datasets together. In fact, before you do a join, you almost always need to check for duplicate records!
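As a quick sketch of why this matters, here's a hypothetical pair of small dataframes (the names and values here are made up for illustration):

```python
import pandas as pd

# Hypothetical lookup table with an accidental duplicate key
regions = pd.DataFrame({"name": ["Ann", "Ann", "Ben"],
                        "region": ["East", "East", "North"]})
sales = pd.DataFrame({"name": ["Ann", "Ben"],
                      "sales": [100, 200]})

# Check for duplicate join keys before merging
print(regions["name"].duplicated().any())  # True -- there's a duplicate key

# Merging without deduplicating would double Ann's sales row;
# dropping the duplicates first keeps the merge one-row-per-person
merged = sales.merge(regions.drop_duplicates(), on="name")
print(merged)
```

If you skipped the drop_duplicates step here, the merge would produce two rows for Ann, inflating her totals in any later aggregation.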

With that said, as an analyst or data scientist, you need ways to identify and remove duplicate rows of data.

Drop Duplicates Removes Duplicate Rows

There are several ways to identify duplicate rows. In Python, you can use the Pandas duplicated method, and you can also find duplicates by aggregating and counting rows.

We’re not going to cover those tools in this tutorial.
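Still, it's worth a one-line illustration: the duplicated method returns a boolean Series that flags duplicate rows (a minimal sketch with a made-up dataframe):

```python
import pandas as pd

df = pd.DataFrame({"name": ["Ann", "Ann", "Ben"],
                   "sales": [100, 100, 200]})

# duplicated() returns a boolean Series: True for each row
# that repeats an earlier row (the first occurrence is False)
print(df.duplicated())

# Boolean indexing with that Series shows just the duplicate rows
print(df[df.duplicated()])
```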

Here, we’re going to discuss the drop duplicates technique, which enables you to remove duplicate rows from a dataframe, once you’ve found them.

[Image: the Pandas drop duplicates method removing a duplicate row from a Pandas dataframe.]

There are actually a few different ways to remove duplicate rows, and the exact behavior depends on several parameters in the syntax.

Having said that, let’s take a look at the syntax of Pandas drop duplicates, so we can better understand how it works.

The syntax of drop_duplicates

Here, I’ll explain the syntax of the Pandas drop_duplicates() method.

I’ll show some examples later in the example section, but here, I just want to break down the syntax piece by piece.

The basic syntax looks like this:

dataframe.drop_duplicates(subset=None, keep='first', inplace=False, ignore_index=False)

So how does the syntax work?

The first thing to know about the drop_duplicates syntax is that this technique is a method, not a function.

Specifically, it is a dataframe method.

That being the case, the first step to calling the method is to type the name of your Pandas dataframe.

(Obviously, this assumes that you’ve created a dataframe first. For more information about creating dataframes, read our tutorial about Pandas dataframes.)

Ok. So first, you type the name of your dataframe.

Then, you use so-called “dot syntax” to call the drop_duplicates method.

[Image: the syntax of the Pandas drop_duplicates method.]

Then, inside the parenthesis, you can use a few parameters to control exactly how drop_duplicates works.

Let’s take a look at those.

The parameters of drop_duplicates

There are a few parameters for the drop_duplicates() method:

  • subset
  • keep
  • inplace
  • ignore_index

Let’s discuss what each of those parameters does.

subset (optional)

The subset parameter enables us to specify a subset of columns in which to look for duplicate data.

By default, drop_duplicates() will look at all variables … meaning that it will look for rows of data where all of the data is the same.

However, when we use subset, we can specify a list or sequence of column names in which to search for duplicate data. (I’ll show you examples of this in example 2 and example 3.)

Also note that this parameter is optional. If you don’t use it, it will look at all columns by default.

keep (optional)

The keep parameter enables us to specify which duplicate row of data to keep.

This parameter is optional.

If you don’t use it, then by default, this will be set to keep = 'first'. This causes drop_duplicates to keep the first occurrence of each set of duplicate rows and delete the rest.

Alternatively, you can set keep = 'last', in which case it will keep the last duplicate row instead. You can also set keep = False, which causes drop_duplicates to delete all of the duplicated rows, including the first occurrence.
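As a quick sketch of the three settings (using a tiny made-up dataframe, where rows 0 and 1 are duplicates):

```python
import pandas as pd

df = pd.DataFrame({"x": ["a", "a", "b"]})

print(df.drop_duplicates(keep='first'))  # keeps rows 0 and 2
print(df.drop_duplicates(keep='last'))   # keeps rows 1 and 2
print(df.drop_duplicates(keep=False))    # keeps only row 2
```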

inplace (optional)

The inplace parameter enables us to specify whether or not drop_duplicates will operate directly on the input dataframe.

Like many other Pandas functions and methods, by default, drop_duplicates creates a new dataframe as the output. That means that by default, drop_duplicates leaves the dataframe you’re operating on unchanged. That’s because by default, the inplace parameter is set to inplace = False.

You can change this behavior by setting inplace = True. If you do this, drop_duplicates will directly modify your input dataframe.

But be careful! Make sure that the method is working exactly as you intend before you directly change your data!

ignore_index (optional)

The ignore_index parameter controls the index of the output, after the duplicates have been removed.

By default, this is set to ignore_index = False. This causes drop_duplicates to keep the original index values for the rows that remain in the output.

If, however, you set ignore_index = True, drop_duplicates will create a new index for the output, starting at 0 and ending at n – 1.
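Here's a minimal sketch of the difference, using a tiny made-up dataframe:

```python
import pandas as pd

df = pd.DataFrame({"x": ["a", "a", "b"]})

# Default: surviving rows keep their original index labels (0 and 2)
print(df.drop_duplicates())

# With ignore_index=True, the output gets a fresh 0..n-1 index (0 and 1)
print(df.drop_duplicates(ignore_index=True))
```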

The output of drop_duplicates

As I’ve mentioned, the default output of Pandas drop_duplicates is a new dataframe.

That’s because by default, the inplace parameter is set to inplace = False.

Alternatively, you can set inplace = True, which will cause drop_duplicates to directly modify the dataframe you’re operating on.

But as I’ve mentioned, you need to be careful with this! Make sure that the method is doing what you expect it to!

Examples: How to Drop Duplicate Rows from a Pandas Dataframe

Now that we’ve looked at the syntax, let’s take a look at some examples of how the Pandas drop duplicates method works.

In these examples, I’ll show you several different ways to drop duplicate rows from a Pandas dataframe.


Before you jump into the examples though, we need to import Pandas and create the dataframe we’ll work with.

Import Pandas

First we need to import Pandas.

import pandas as pd

Here, we’re importing Pandas with the alias pd.

This is the common convention, so we’ll use it going forward.

Create Dataframe

Next, we’ll create a simple dataframe.

sales_data = pd.DataFrame({"name": ["William", "William", "Emma", "Emma", "Anika", "Anika"],
                           "region": ["East", "East", "North", "West", "East", "East"],
                           "sales": [50000, 50000, 52000, 52000, 65000, 72000],
                           "expenses": [42000, 42000, 43000, 43000, 44000, 53000]})

And let’s print the data out.

print(sales_data)

OUT:

      name region  sales  expenses
0  William   East  50000     42000
1  William   East  50000     42000
2     Emma  North  52000     43000
3     Emma   West  52000     43000
4    Anika   East  65000     44000
5    Anika   East  72000     53000

Notice that the data has several types of duplicate records.

The first two rows (for William) are exactly the same.

The next two rows (for Emma) have the same name, but the region variable is different.

And the last two rows (for Anika) have the same name and region, but the data for sales and expenses are different.

I’ll show you how to deal with these different cases in the following examples.

EXAMPLE 1: Search for rows where all data is the same and keep first row (default)

First, we’ll just use drop_duplicates() with the default behavior.

Here, Pandas drop duplicates will find rows where all of the data is the same (i.e., the values are the same for every column). It will keep the first row and delete all of the other duplicates.

Let’s take a look.

sales_data.drop_duplicates()

OUT:

      name region  sales  expenses
0  William   East  50000     42000
2     Emma  North  52000     43000
3     Emma   West  52000     43000
4    Anika   East  65000     44000
5    Anika   East  72000     53000
Explanation

Notice that here, drop duplicates deleted row 1.

In the original dataset, row 0 and row 1 (the rows for William) were identical.

When we use drop_duplicates with the default setting, it only operates on rows where all of the values are the same. And by default, it keeps the first row (in this case, row 0).

EXAMPLE 2: Keep the last duplicate row

Next, we’re going to change our code so that drop_duplicates keeps the last duplicate row.

(Remember in the previous example, by default, drop duplicates kept the first duplicate row.)

Let’s take a look.

Here, we’ll set keep = 'last' to cause drop_duplicates to keep the last row:

sales_data.drop_duplicates(keep = 'last')

OUT:

      name region  sales  expenses
1  William   East  50000     42000
2     Emma  North  52000     43000
3     Emma   West  52000     43000
4    Anika   East  65000     44000
5    Anika   East  72000     53000
Explanation

In this example, drop duplicates operated on row 0 and row 1 (the rows for William).

Remember: by default, Pandas drop duplicates looks for rows of data where all of the values are the same. In this dataframe, that applied to row 0 and row 1.

But here, instead of keeping the first duplicate row, it kept the last duplicate row.

It should be pretty obvious that this was because we set keep = 'last'.

This literally causes the method to keep the last duplicate, out of all the duplicates that it finds.

EXAMPLE 3: Drop all duplicate rows

Now, we’ll modify the code so that it deletes all of the duplicates that it finds.

Here, we’ll call Pandas drop_duplicates with keep = False.

Remember that by default, drop_duplicates will look for rows where all of the values are the same.

Let’s take a look:

sales_data.drop_duplicates(keep = False)

OUT:

    name region  sales  expenses
2   Emma  North  52000     43000
3   Emma   West  52000     43000
4  Anika   East  65000     44000
5  Anika   East  72000     53000
Explanation

Once again, this code is operating only on the rows where all of the data are the same … rows 0 and 1 (the data for William).

In these rows, every value is the same.

But instead of keeping one row and deleting the other duplicates, this code is deleting all of the duplicates.

That’s because we set keep = False.

When we set keep = False, Pandas drop_duplicates will remove all rows that are duplicates of another row.

EXAMPLE 4: Look for duplicates on one variable

Next, let’s look for duplicates on a specific variable.

In the previous examples, drop_duplicates looked for rows where all of the data was the same. The data needed to be the same for every variable.

Here, we’re going to use the subset parameter to focus on one specific variable.

We’ll set subset = ['name'] to look for records where only the value for name is the same.

Here’s the code:

sales_data.drop_duplicates(subset = ['name'])

OUT:

      name region  sales  expenses
0  William   East  50000     42000
2     Emma  North  52000     43000
4    Anika   East  65000     44000
Explanation

Here, the drop duplicates method looked for duplicate data only in the name variable.

When looking at the name variable, there were duplicates in rows 0 and 1 (William), rows 2 and 3 (Emma), and rows 4 and 5 (Anika).

After finding the rows with duplicate names, it kept the first occurrence in each group and deleted the rest (remember, keep = 'first' is the default behavior).

Also notice syntactically how we executed this. When we use the subset parameter, we provide a list of variable names.

In this case, we only wanted to look for duplicates on the name variable, so we set subset = ['name'] (i.e., a list with only one variable name).

EXAMPLE 5: Look for duplicates on multiple variables

Now let’s look for duplicates on a subset of multiple variables.

This is actually very similar to the previous example, but here, we’ll look at both name and region.

sales_data.drop_duplicates(subset = ['name','region'])

OUT:

      name region  sales  expenses
0  William   East  50000     42000
2     Emma  North  52000     43000
3     Emma   West  52000     43000
4    Anika   East  65000     44000
Explanation

In this example, the drop_duplicates method operated on the rows for William (rows 0 and 1) as well as the rows for Anika (rows 4 and 5).

Why?

Here, we set subset = ['name','region'].

This caused drop_duplicates to search for records where name and region were the same.

That applied to rows 0 and 1, which had the same name and region.

It also applied to rows 4 and 5, which also had the same name and region.

But it left rows 2 and 3 alone. That’s because even though rows 2 and 3 have the same name (Emma), they have different regions.

EXAMPLE 6: Operate on your dataframe “in place”

Finally, let’s operate on our dataframe “in place”.

Remember: by default, when we use drop_duplicates, Pandas will create a new dataframe as the output, and will leave the input dataframe unchanged.

By using the inplace parameter, we can change this behavior.

Create copy dataframe

Instead of operating directly on sales_data, we’ll operate on a copy of the dataframe.

This is often a smart thing to do before you perform a data wrangling technique.

Frequently, you want to test your code to make sure that it works properly before you operate on your data directly.

sales_data_copy = pd.DataFrame({"name": ["William", "William", "Emma", "Emma", "Anika", "Anika"],
                                "region": ["East", "East", "North", "West", "East", "East"],
                                "sales": [50000, 50000, 52000, 52000, 65000, 72000],
                                "expenses": [42000, 42000, 43000, 43000, 44000, 53000]})

There are a few ways of copying your data, but here, we’re just creating a different dataframe called sales_data_copy.

If your dataset is particularly large, you might want to create a subset or a sample of rows, since copying a very large dataset can use a lot of memory.
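For reference, the most common way to copy a dataframe is the Pandas copy method, which returns an independent copy. Here's a minimal sketch with a made-up dataframe:

```python
import pandas as pd

# A small made-up dataframe with one exact duplicate row
df = pd.DataFrame({"name": ["Ann", "Ann", "Ben"],
                   "sales": [100, 100, 200]})

# .copy() returns an independent copy; modifying the copy
# does not change the original dataframe
df_copy = df.copy()
df_copy.drop_duplicates(inplace=True)

print(len(df))       # 3 rows -- original unchanged
print(len(df_copy))  # 2 rows -- the duplicate row was removed
```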

Drop duplicates with inplace = True

Ok. Now that we have sales_data_copy created, let’s operate directly on this dataframe.

Here, we’re going to set the inplace parameter to inplace = True.

This will cause drop_duplicates to directly change the input dataframe, sales_data_copy.

Let’s take a look.

sales_data_copy.drop_duplicates(inplace = True)

And now, let’s print the dataframe.

print(sales_data_copy)

OUT:

      name region  sales  expenses
0  William   East  50000     42000
2     Emma  North  52000     43000
3     Emma   West  52000     43000
4    Anika   East  65000     44000
5    Anika   East  72000     53000
Explanation

You might have noticed that drop_duplicates did not send the output directly to the console.

Instead, we needed to print out sales_data_copy to look at the original dataframe.

In this case, sales_data_copy was changed directly. Here, drop_duplicates deleted the duplicate row (row 1, the second row for William). Furthermore, it operated directly on sales_data_copy, without creating a new dataframe as output.

That’s what happens when we set inplace = True.

But be careful!

When you set inplace = True, drop duplicates will directly overwrite your input data, so you need to be sure that your code is operating exactly as you intend.

Frequently Asked Questions about Pandas Drop Duplicates

Now that you’ve learned about the Pandas drop duplicates method, and now that you’ve seen some examples, let’s review a common question.


Question 1: Why isn’t drop_duplicates changing my dataframe?

There’s a chance that you’ve run drop_duplicates, only to find that your original input dataframe is unchanged.

This is because by default, Pandas drop duplicates creates a new dataframe as an output, and keeps the original input dataframe unchanged.

To change this behavior and directly modify the input dataframe, you need to set inplace = True.

I covered this step by step in example 6.
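Alternatively, instead of using inplace = True, you can simply assign the output back to the original variable name, which many Pandas users consider the safer pattern (a sketch with a made-up dataframe):

```python
import pandas as pd

df = pd.DataFrame({"x": ["a", "a", "b"]})

# Reassigning the output achieves the same end result as inplace=True
df = df.drop_duplicates()
print(df)  # rows 0 and 2 remain
```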

Leave your other questions in the comments below

Do you have other questions about the Pandas drop duplicates method?

If you do, just leave your questions in the comments section below.

If you want to master Pandas, join our course

In this tutorial, I’ve explained how to use the drop_duplicates method to remove duplicate rows from a Pandas dataframe.

This tutorial should help you understand this one technique, but if you really want to master data manipulation in Pandas, there’s a lot more to learn.

That said, if you want to master data wrangling with Pandas, you should join our premium online course, Pandas Mastery.

Pandas Mastery will teach you everything you need to know about Pandas, including:

  • How to create dataframes
  • How to subset your Python data
  • Data aggregation with Pandas
  • How to wrangle rows and columns
  • How to reshape your data
  • and much more …

Moreover, it will help you completely master the syntax within a few weeks. You’ll discover how to become “fluent” in writing Pandas code to manipulate your data.

Find out more here:

Learn More About Pandas Mastery

Joshua Ebner

Joshua Ebner is the founder, CEO, and Chief Data Scientist of Sharp Sight. Prior to founding the company, Josh worked as a Data Scientist at Apple. He has a degree in Physics from Cornell University. For more daily data science advice, follow Josh on LinkedIn.
