It explains what value_counts does, how the syntax works, and it provides step-by-step examples.


Ok. Let’s get into the details.

First, let’s just start with an explanation of what the value_counts technique does.

Essentially, value_counts *counts the unique values* of a Pandas object. We often use this technique to do data wrangling and data exploration in Python.

The value_counts method will actually work on several different types of Pandas objects:

- Pandas Series
- Pandas dataframes
- dataframe columns (which are actually Pandas Series objects)

Having said that, how you use the value_counts method will vary slightly depending on which type of object you’re operating on.

Additionally, there are some optional parameters that you can use that will change what value_counts does.

That being the case, let’s look at the syntax.

Ok. Let’s look at the syntax of the Pandas value_counts technique.

Here, I’ll divide this up into different sections, so we can look at the syntax for how to use value_counts on Series objects and how to use value counts on dataframes.

The following syntax explanations assume that you’ve imported Pandas, and that you’ve already created a Pandas dataframe or a Pandas series.

You can import Pandas with this code:

import pandas as pd

And for more information about dataframes, you can read our introduction to Pandas dataframes.

First, let’s look at the syntax for how to use value_counts on a dataframe.

This is really simple. You just type the name of the dataframe, then `.value_counts()`.

When you use value_counts on a dataframe, it will count the number of records for every combination of unique values for *every column*.

This may be more information than you want, and it may be better to subset the dataframe down to only a few columns. I’ll show you some examples of this in the examples section.

Additionally, there are some optional parameters that you can use, which will modify the behavior of the method. I’ll show you those in the parameters section.

Next, let’s look at the syntax to use value_counts on a Series object.

The syntax for a Series is almost the same as the syntax for a dataframe:

You simply type the name of the Series object, and then `.value_counts()`.

Additionally, there are some optional parameters that you can use, which we’ll discuss in the parameters section.

Finally, let’s look at how to use value_counts on a *column* inside of a dataframe.

Remember: individual dataframe columns *are* Series objects.

So to call value_counts on a column, we first use “dot syntax” to retrieve an individual column. For example, if your dataframe is named `your_dataframe` and the column you want to retrieve is called `column`, you would start by typing `your_dataframe.column`.

After that, you simply type `.value_counts()`, and the method will retrieve the count of the unique values for that individual column.
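To see that a column really is a Series, here’s a minimal sketch using a small hypothetical dataframe (the `df` and `col` names are just for illustration):

```python
import pandas as pd

# A small hypothetical dataframe for illustration
df = pd.DataFrame({'col': [1, 1, 2]})

# A single column is a Pandas Series, so it has its own value_counts() method
print(type(df.col))           # <class 'pandas.core.series.Series'>
print(df.col.value_counts())
```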

And once again, there are some additional parameters that you can use to change how value_counts works.

Let’s look at those parameters.

The Pandas value_counts technique has several parameters that you can use which will change how the technique works and what exactly it does.

- `ascending`
- `sort`
- `normalize`
- `subset`
- `dropna`

In addition, there is the `bins` parameter, which I rarely use and won’t discuss here.

It’s important to note that all of these parameters are *optional*.

It’s also important to note that most of these parameters – `ascending`, `sort`, and `normalize` – are used for both the Series syntax and the dataframe syntax.

On the other hand, `subset` is only available when you use value_counts on dataframes, and `dropna` is only available when you use value_counts on Series.

Having said all of that, let’s look at each of these parameters individually.

`ascending`

By default, value_counts will sort the data by numeric count in *descending order*.

The ascending parameter enables you to change this.

When you set `ascending = True`, value_counts will sort the data by count from low to high (i.e., in ascending order).

I’ll show you an example of this in example 4.

`sort`

The sort parameter controls how the output is sorted.

By default, value_counts sorts the data by the *numeric count*.

You can change this and sort the data by categories instead by setting `sort = False`.

I’ll show you an example of this in example 5.

`normalize`

The `normalize` parameter changes the form of the output.

By default, value_counts shows the count of the unique values.

But if you set `normalize = True`, value_counts will display the *proportion of total records* instead of the raw count.

I’ll show you an example of this in example 6.

`subset`

The `subset` parameter enables you to specify a subset of columns on which to apply value_counts, when you use value_counts on a dataframe.

The argument to this parameter should be a list (or list-like object) of column names.

So for example, if you want to use value_counts on `var_1` and `var_2` in a dataframe, you would use the code `your_dataframe.value_counts(subset = ['var_1','var_2'])`.

NOTE: again, this parameter only works when you use value_counts on a whole dataframe.

I’ll show you an example of this in example 7.

`dropna`

The `dropna` parameter enables you to show ‘NA’ values (i.e., `NaN` values) in the output.

You can do this by setting `dropna = False`.

NOTE: this parameter is only available for Pandas Series objects and individual dataframe columns. This parameter will not work if you use value_counts on a whole dataframe.

I’ll show you an example of this in example 2.

Now that we’ve looked at the syntax, let’s look at some examples of how to use the value_counts technique.

**Examples:**

- Use value_counts on a dataframe column
- Include ‘NA’ values in the counts
- Use value_counts on an entire Pandas dataframe
- Sort the output in ascending order
- Sort by category (instead of count)
- Compute proportions (i.e., normalize the value counts)
- Operate on a subset of dataframe columns

Before you run the examples, you’ll need to run some preliminary code in order to:

- import necessary packages
- get a dataframe
- create a dataframe subset that we can work with

Let’s do those one at a time.

First, let’s import two packages that we’ll need.

Specifically, we’ll need to import Pandas and Seaborn.

You can do that with the following code:

```python
import pandas as pd
import seaborn as sns
```

Obviously, we’ll need Pandas to use the `value_counts()` technique. But we’ll also need Seaborn, because we’ll be using the `titanic` dataframe, which we can load from Seaborn’s pre-installed datasets.

Next, let’s get the dataframe we’ll be working with.

In the following examples, we’ll be using the `titanic` dataset, or some subset of it.

So here, let’s load the dataset from Seaborn:

```python
# GET DATASET
titanic = sns.load_dataset('titanic')
```

Additionally, let’s print it out, so we can see the contents:

print(titanic)

OUT:

```
     survived  pclass     sex   age  sibsp  parch     fare embarked   class    who  adult_male deck  embark_town alive  alone
0           0       3    male  22.0      1      0   7.2500        S   Third    man        True  NaN  Southampton    no  False
1           1       1  female  38.0      1      0  71.2833        C   First  woman       False    C    Cherbourg   yes  False
2           1       3  female  26.0      0      0   7.9250        S   Third  woman       False  NaN  Southampton   yes   True
3           1       1  female  35.0      1      0  53.1000        S   First  woman       False    C  Southampton   yes  False
4           0       3    male  35.0      0      0   8.0500        S   Third    man        True  NaN  Southampton    no   True
..        ...     ...     ...   ...    ...    ...      ...      ...     ...    ...         ...  ...          ...   ...    ...
886         0       2    male  27.0      0      0  13.0000        S  Second    man        True  NaN  Southampton    no   True
887         1       1  female  19.0      0      0  30.0000        S   First  woman       False    B  Southampton   yes   True
888         0       3  female   NaN      1      2  23.4500        S   Third  woman       False  NaN  Southampton    no  False
889         1       1    male  26.0      0      0  30.0000        C   First    man        True    C    Cherbourg   yes   True
890         0       3    male  32.0      0      0   7.7500        Q   Third    man        True  NaN   Queenstown    no   True

[891 rows x 15 columns]
```

There are 15 columns in this dataframe, which will be a little difficult to work with if we’re using the value_counts() technique.

That said, let’s quickly create a subset that we can use with some of our examples.

Now, let’s create a subset of the `titanic` dataframe.

Here, we’ll create a subset that contains two variables: `sex` and `embarked`.

To subset down to these two variables, we’ll use the Pandas filter method:

```python
# CREATE SUBSET
titanic_subset = titanic.filter(['sex','embarked'])
```

For some of our examples, this subset will simply be easier to work with, since it has only 2 variables.

First, let’s use the value_counts technique on a single column.

Here, we’ll use value_counts on the `embarked` variable in the `titanic` dataframe.

Let’s run the code, and then I’ll explain:

titanic.embarked.value_counts()

OUT:

```
S    644
C    168
Q     77
Name: embarked, dtype: int64
```

The code to perform this operation is a single line of code, but in some sense, it’s a two step process.

In this code, we’re:

- retrieving the `embarked` variable with “dot syntax”
- calling the `value_counts()` method

So, we’re retrieving the `embarked` variable with the code `titanic.embarked`.

But after that, we’re calling the value_counts method with `.value_counts()`.

In the output, you see the unique values of the `embarked` variable – `S`, `C`, and `Q` – and the counts associated with each of those values.

Next, let’s include the ‘NA’ values (i.e., `NaN`) in the output. This will enable us to see the number of ‘missing’ values for the variable, if there are any.

Keep in mind that here, we’re still going to operate on a single dataframe variable.

titanic.embarked.value_counts(dropna = False)

OUT:

```
S      644
C      168
Q       77
NaN      2
Name: embarked, dtype: int64
```

Here, we’ve called `value_counts()` just like we did in example 1.

The only difference is that we included the code `dropna = False` inside the parentheses.

As you can see in the output, there is now a count of the number of `NaN` values (i.e., “missing” values).

This can be useful if you need to identify missing values to clean them up, etc.

**NOTE**: This will only work if you use value_counts() on a Pandas Series or a dataframe column. It will *not* work if you try to use value_counts on an entire Pandas dataframe (like in example 3).

In the last two examples, we used value_counts on a single *column* of a dataframe (i.e., a Pandas series object).

Now, let’s use value_counts on a whole dataframe.

Here, we’re going to use value_counts on the `titanic_subset` dataframe. (Remember, we created this subset earlier. It has only two variables to make it easier to work with.)

Ok. Let’s run the code:

titanic_subset.value_counts()

OUT:

```
sex     embarked
male    S           441
female  S           203
male    C            95
female  C            73
male    Q            41
female  Q            36
dtype: int64
```

This is really straightforward.

To do this, we simply typed the name of the dataframe, and then `.value_counts()`.

You can see that the output is a count of the unique combinations of the variables in the dataframe.

Notice as well that the output is sorted in descending order. That’s the default, but we can change it as well, which we’ll do in the next example.

In this example, we’ll sort the output in ascending order.

Remember that by default, value_counts sorts the output in *descending* order.

We can change that behavior, though, with the `ascending` parameter.

Let’s take a look:

titanic_subset.value_counts(ascending = True)

OUT:

```
sex     embarked
female  Q            36
male    Q            41
female  C            73
male    C            95
female  S           203
male    S           441
dtype: int64
```

Here, we see the counts of the unique combinations of values in the dataframe.

But now, because we set `ascending = True`, the output is sorted in ascending order … it’s sorted from low to high.

Now, let’s remove the sorting altogether.

To do this, we’ll call the method with `sort = False`.

titanic_subset.value_counts(sort = False)

OUT:

```
sex     embarked
female  C            73
        Q            36
        S           203
male    C            95
        Q            41
        S           441
dtype: int64
```

Notice in the output, the data are not sorted by the value counts (i.e., the numbers).

Instead, the data are sorted by the categories. The unique categorical values in both variables are sorted in alphabetical order.

Personally, I think this is easier to read, but it does depend on what you’re doing.

There may be some applications where this is better, and there may be some instances where it’s better to sort the data by the numeric counts (like the default behavior).

In any case, you have a choice.

In this example, let’s compute the proportions of each unique combination of values.

In the previous examples, `value_counts` provided a count of the number of values.

Here, we’ll tell `value_counts` to compute the proportion of total records, using the `normalize` parameter:

titanic_subset.value_counts(normalize = True)

OUT:

```
sex     embarked
male    S           0.496063
female  S           0.228346
male    C           0.106862
female  C           0.082115
male    Q           0.046119
female  Q           0.040495
dtype: float64
```

The output here is somewhat similar to the output for example 3, in the sense that it’s sorted in descending order of frequency.

But instead of showing the raw counts of each unique combination of categories, it’s showing the proportion. Notice that if you add all the numbers up, they add up to 1.

So again, the numbers represent the proportion of total records accounted for by each unique combination.
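One property worth noting: because the normalized output always sums to 1, converting to percentages is a single multiplication. A minimal sketch, using a small hypothetical Series:

```python
import pandas as pd

# Hypothetical Series for illustration
s = pd.Series(['S', 'S', 'S', 'C', 'Q'])

# normalize=True returns proportions rather than raw counts
props = s.value_counts(normalize=True)
print(props)        # S 0.6, C 0.2, Q 0.2

# The proportions sum to 1, so multiplying by 100 gives percentages
percentages = props * 100
print(percentages)  # S 60.0, C 20.0, Q 20.0
```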

In the previous examples, I’ve shown you how to use value_counts on a pandas Series, a small Pandas dataframe (with only 2 columns), or a single dataframe column.

Here, I’ll show you how to operate on a large dataframe with many columns.

But we’ll use the `subset` parameter to reduce the size and complexity of the output.

So here, we’ll be working with the full `titanic` dataframe, which has 15 columns. We’ll use the `subset` parameter to operate only on two of those variables: `sex` and `embarked`.

Let’s take a look.

titanic.value_counts(subset = ['sex','embarked'])

OUT:

```
sex     embarked
male    S           441
female  S           203
male    C            95
female  C            73
male    Q            41
female  Q            36
dtype: int64
```

Here, we’re working with the full `titanic` dataset. Remember: this is the full dataset with 15 variables (instead of the smaller `titanic_subset` dataframe, which only has 2 variables).

So here, we’re taking the full `titanic` dataframe with 15 variables and using value_counts on only 2 variables. To do this, we’re setting `subset = ['sex','embarked']`.

Notice that syntactically, each variable we want to include is presented as a string (inside of quotation marks). And the collection of variable names is organized into a Python list.

Now that you’ve learned about value_counts and seen some examples, let’s review some frequently asked questions.

**Frequently asked questions:**

**Can you use the dropna parameter on an entire dataframe?**

Unfortunately, no.

The dropna parameter is very useful for identifying missing values, but unfortunately, you can only use this parameter when you operate on a single dataframe column or Pandas series.
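One possible workaround, sketched below with a small hypothetical dataframe: fill the missing values with a placeholder label before calling value_counts, so the “missing” rows show up as their own category. (The `'MISSING'` label here is an arbitrary choice for illustration.)

```python
import pandas as pd
import numpy as np

# Hypothetical dataframe with a missing value
df = pd.DataFrame({
    'sex': ['male', 'female', 'male', 'male'],
    'embarked': ['S', 'C', np.nan, 'S'],
})

# value_counts() on a dataframe drops rows that contain NaN,
# so replace NaN with a placeholder label first
counts = df.fillna('MISSING').value_counts()
print(counts)
```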

Do you have any other questions about the Pandas value_counts technique?

Is there something that you’re struggling with that I haven’t covered here?

If so, leave your question in the comments section below.

This tutorial should have helped you understand the value_counts technique, and how it works.

But if you want to master data cleaning and data wrangling with Pandas, there’s a lot more to learn.

And there’s even more to learn if you want to learn data science in Python, more broadly.

That said, if you’re ready to learn more about Pandas and data science in Python, then sign up for our email list.

When you sign up, you’ll get free tutorials on:

- NumPy
- Pandas
- Base Python
- Scikit learn
- Machine learning
- Deep learning
- … and more.

We publish free data science tutorials every week. When you sign up for our email list, we’ll deliver these free tutorials directly to your inbox.

I’ll explain what the technique does, how the syntax works, and I’ll show you clear examples of how to use it.


Ok. Let’s start with a quick introduction to the rename method.

The Pandas rename method is fairly straight-forward: it enables you to rename the columns or rename the row labels of a Python dataframe.

This technique is most often used to rename the columns of a dataframe (i.e., the variable names).

But again, it can also rename the row labels (i.e., the labels in the dataframe index).

I’ll show you examples of both of these in the examples section.

But first, let’s take a look at the syntax.

Ok, now that I’ve explained what the Pandas rename method does, let’s look at the syntax.

Here, I’ll show you the syntax for how to rename Pandas columns, and also how to rename Pandas row labels.

Everything that I’m about to describe assumes that you’ve imported Pandas and that you already have a Pandas dataframe created.

You can import pandas with the following code:

import pandas as pd

And if you need a refresher on Pandas dataframes and how to create them, you can read our tutorial on Pandas dataframes.

Ok, let’s start with the syntax to rename columns. (The syntax for renaming columns and renaming row labels is almost identical, but let’s just take it one step at a time.)

When we use the rename method, we actually start with our *dataframe*. You type the name of the dataframe, and then `.rename()` to call the method.

Inside the parentheses, you’ll use the `columns` parameter, which enables you to specify the columns that you want to change.

Let’s look carefully at how to use the columns parameter.

When you change column names using the rename method, you need to present the old column name and new column name inside of a Python dictionary.

So if the old variable name is `old_var` and the new variable name is `new_var`, you would present them to the `columns` parameter as key/value pairs, inside of a dictionary: `columns = {'old_var':'new_var'}`.

If you have multiple columns that you want to change, you simply provide multiple old name/new name dictionary pairs, separated by commas.

To do this properly, you really need to understand Python dictionaries, so if you need a refresher, then read about how dictionaries are structured.

Now, let’s look at the syntax for renaming row labels (i.e., the labels in the dataframe index).

You’ll notice that the syntax is almost exactly the same as the syntax for changing the column names.

The major difference is that instead of using the `columns` parameter, when you want to change the row labels, you use the `index` parameter.

Beyond that, you’ll still use a dictionary with old name/new name pairs as the argument to the `index` parameter.

Now, let’s take a look at the parameters of the rename method.

The most important parameters that you should know are:

- `columns`
- `index`
- `inplace`

The rename method also has additional parameters – `axis`, `copy`, `level`, and `errors` – but these are more advanced and less commonly used, so I won’t cover them here.

`columns`

The `columns` parameter enables you to specify the column names you want to change, and what to change them to.

As described above, the argument to this parameter can be a dictionary, but it can also be a function.

When you provide a dictionary, the values should be structured as old name/new name pairs, like this: `{'old_var':'new_var'}`.
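The function form isn’t covered in the examples below, so here’s a minimal sketch with a small hypothetical dataframe: a function passed to the `columns` parameter is applied to every column name.

```python
import pandas as pd

# Hypothetical dataframe for illustration
df = pd.DataFrame({'gdp': [1, 2], 'pop': [3, 4]})

# A function passed to columns= is applied to each column name in turn
df_upper = df.rename(columns=str.upper)
print(df_upper.columns.tolist())  # ['GDP', 'POP']
```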

`index`

The `index` parameter is very similar to the `columns` parameter, except it operates on row index labels instead of column names.

So the `index` parameter enables you to specify the row labels you want to change, and what to change them to.

As described above, the argument to this parameter can be a dictionary, but it can also be a function.

When you provide a dictionary, the values should be structured as old name/new name pairs, like this: `{'old_var':'new_var'}`.

`inplace`

The inplace parameter enables you to force the rename method to directly modify the dataframe that’s being operated on.

By default, `inplace` is set to `inplace = False`. This causes the rename method to produce a *new* dataframe. In this case, the original dataframe is left unchanged.

If you set `inplace = True`, the rename method will directly alter and overwrite the original dataframe. Be careful with this, and make sure that your code is doing exactly what you want it to.

By default, the rename method will output a *new* Python dataframe, with new column names or row labels. As noted above, this means that by default, rename will leave the original dataframe unchanged.

If you set `inplace = True`, rename won’t produce any new output. In this case, rename will instead directly modify and overwrite the dataframe that’s being operated on.
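You can verify the “no new output” behavior directly; a minimal sketch with a hypothetical dataframe:

```python
import pandas as pd

# Hypothetical dataframe for illustration
df = pd.DataFrame({'old_var': [1, 2, 3]})

# With inplace=True, rename() modifies df itself and returns None
result = df.rename(columns={'old_var': 'new_var'}, inplace=True)
print(result)               # None
print(df.columns.tolist())  # ['new_var']
```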

Now that we’ve looked at the syntax, let’s look at some examples.

**Examples:**

- Rename one dataframe column
- Rename multiple dataframe columns
- Change row labels
- Change the column names and row labels ‘in place’

However, before you run the examples, you’ll need to run some preliminary code to import the modules we need, and to create the dataset we’ll be operating on.

First, let’s import Pandas.

You can do that with the following code:

```python
#===============
# IMPORT MODULES
#===============
import pandas as pd
```

Notice that we’re importing Pandas with the alias `pd`. This makes it possible to refer to Pandas as `pd` in our code, which is the common convention among Python data scientists.

Next, we’ll create a dataframe that we can operate on.

We’ll create our dataframe in three steps:

- create a Python dictionary
- create a Pandas dataframe from the dictionary
- specify the columns and row labels (i.e., the index)

Let’s start by creating a Python dictionary. As you can see, this dictionary contains economic data for several countries.

```python
#==========================
# CREATE DICTIONARY OF DATA
#==========================
country_data_dict = {
    'country_code':['USA', 'CHN', 'JPN', 'GER', 'UK', 'IND']
    ,'country':['USA', 'China', 'Japan', 'Germany', 'UK', 'India']
    ,'continent':['North America','Asia','Asia','Europe','Europe','Asia']
    ,'gross_domestic_product':[19390604, 12237700, 4872137, 3677439, 2622434, 2597491]
    ,'pop':[322179605, 1403500365, 127748513, 81914672, 65788574, 1324171354]
}
```

Next, we’ll create a DataFrame from the dictionary:

```python
#=================================
# CREATE DATAFRAME FROM DICTIONARY
#=================================
country_data = pd.DataFrame(country_data_dict,
                            columns = ['country_code','country', 'continent',
                                       'gross_domestic_product', 'pop'])
```

Note that in this step, we’re setting the column names using the `columns` parameter inside of `pd.DataFrame()`.

Finally, we’ll set the row labels (i.e., the index). By default, Pandas uses a numeric integer index that starts at 0.

We’re going to change that default index to character data.

Specifically, we’re going to use the values in the `country_code` variable as our new row labels.

To do this, we’ll use the Pandas set_index method:

country_data = country_data.set_index('country_code')

Notice that we’re assigning the output of `set_index()` to the original dataframe name, `country_data`, using the equal sign. This is because, by default, the output of `set_index()` is a new dataframe object; `set_index()` does *not* modify the DataFrame in place.

Ok. Now that we have our dataframe, let’s print it out to look at the contents:

print(country_data)

OUT:

```
              country      continent  gross_domestic_product         pop
country_code
USA               USA  North America                19390604   322179605
CHN             China           Asia                12237700  1403500365
JPN             Japan           Asia                 4872137   127748513
GER           Germany         Europe                 3677439    81914672
UK                 UK         Europe                 2622434    65788574
IND             India           Asia                 2597491  1324171354
```

This dataframe has economic data for six countries, organized in a row-and-column structure.

The dataframe has four columns: `country`, `continent`, `gross_domestic_product`, and `pop`.

Additionally, notice that the `country_code` variable is set off to the left. That’s because we’ve set the `country_code` column as the index. The values of `country_code` will now function as the row labels of the dataframe.

Now that we have this dataframe, we’ll be able to use the `rename()` method to rename the columns and the row labels.

First, let’s start simple.

Here, we’re going to rename a single column name.

Specifically, we’re going to rename the `gross_domestic_product` variable to `GDP`.

Let’s run the code, and then I’ll explain.

country_data.rename(columns = {'gross_domestic_product':'GDP'})

OUT:

```
              country      continent       GDP         pop
country_code
USA               USA  North America  19390604   322179605
CHN             China           Asia  12237700  1403500365
JPN             Japan           Asia   4872137   127748513
GER           Germany         Europe   3677439    81914672
UK                 UK         Europe   2622434    65788574
IND             India           Asia   2597491  1324171354
```

Notice in the output that `gross_domestic_product` has been renamed to `GDP`.

How did we do it?

We typed the name of the dataframe, `country_data`, and then used so-called “dot syntax” to call the `rename()` method.

Inside the parentheses, we’re using the `columns` parameter to specify the columns we want to rename, and the new name that we want to use. This ‘old name’/’new name’ pair is presented as a *dictionary* in the form `{'old name':'new name'}`.

.

So we have the syntax `columns = {'gross_domestic_product':'GDP'}`, which is basically saying: change the column name `'gross_domestic_product'` to `'GDP'`.

One more thing to point out here: when we run this code, the *original dataframe will remain unchanged*.

That’s because by default, the Pandas rename method produces a new dataframe as an output and leaves the original unchanged. And by default, this output will be sent directly to the console. So when we run our code like this, we’ll see the new dataframe with the new name in the console, but the original dataframe will be left the same.

If you want to save the output, you can use the assignment operator like this:

country_data_new = country_data.rename(columns = {'gross_domestic_product':'GDP'})

Here, I’ve given the output dataframe the new name `country_data_new`.

We can call it anything we like. We could even call it `country_data`. Just be careful: if you do `country_data = country_data.rename(...)`, it will overwrite your original dataset. Make sure your code works perfectly before you do this!

Next, let’s make things a little more complicated.

Here, we’ll rename multiple columns at the same time.

The way to do this is very similar to the code in example 1, except here, we’ll provide more old name/new name pairs in our dictionary.

Specifically, we’ll rename `gross_domestic_product` to `GDP`, and we’ll rename `pop` to `population`.

Let’s take a look.

country_data.rename(columns = {'gross_domestic_product':'GDP', 'pop': 'population'})

OUT:

```
              country      continent       GDP  population
country_code
USA               USA  North America  19390604   322179605
CHN             China           Asia  12237700  1403500365
JPN             Japan           Asia   4872137   127748513
GER           Germany         Europe   3677439    81914672
UK                 UK         Europe   2622434    65788574
IND             India           Asia   2597491  1324171354
```

This should make sense if you understood example 1.

Here, we’re calling the `rename()` method using dot syntax.

Inside the parentheses, we have the code `columns = {'gross_domestic_product':'GDP', 'pop': 'population'}`.

Look inside the dictionary (i.e., inside the curly brackets). Here, we have *two* old name/new name pairs. These are organized as key/value pairs, just as we normally have inside of a dictionary.

This should be simple to understand, as long as you understand how dictionaries are structured. If you don’t, make sure to review Python dictionaries.

Now, let’s rename some of the row labels.

Specifically, we’re going to rename the labels `GER` and `UK`.

To do this, we’re going to use the `index` parameter.

Let’s take a look.

country_data.rename(index = {'GER':'DEU','UK':'GBR'})

OUT:

```
              country      continent  gross_domestic_product         pop
country_code
USA               USA  North America                19390604   322179605
CHN             China           Asia                12237700  1403500365
JPN             Japan           Asia                 4872137   127748513
DEU           Germany         Europe                 3677439    81914672
GBR                UK         Europe                 2622434    65788574
IND             India           Asia                 2597491  1324171354
```

Here, we’re renaming `GER` to `DEU`, and we’re renaming `UK` to `GBR`.

To do this, we called the rename method and used the code `index = {'GER':'DEU','UK':'GBR'}` inside the parentheses.

The `index` parameter enables us to specify the row labels that we want to change. And we’re using the dictionary as the argument, which contains the old value/new value pairs.

Finally, let’s change some of the columns and row labels ‘in place’.

As I mentioned in example 1 and in the syntax section, by default, the rename method leaves the original dataframe unchanged. That’s because by default, the `inplace` parameter is set to `inplace = False`. This causes the rename method to produce a new dataframe as the output, while leaving the original dataframe unchanged.

But sometimes, we actually want to modify the original dataframe directly.

To do this, we can set `inplace = True`.

Before we run our code, we’re actually going to make a copy of our data.

The reason is that we’re going to directly overwrite a dataframe. This can be dangerous if you get it wrong, so we’re actually going to work with a copy of the original.

country_data_copy = country_data.copy()

Now, we have a dataframe, `country_data_copy`, which contains the same data as the original.

Next, we’re going to directly rename the columns and row labels of `country_data_copy`.

To do this, we’ll use rename with the `inplace` parameter as follows:

```python
country_data_copy.rename(index = {'GER':'DEU','UK':'GBR'}
                         ,columns = {'gross_domestic_product':'GDP', 'pop': 'population'}
                         ,inplace = True
                         )
```

And now, let’s print out the data, so we can see it.

print(country_data_copy)

OUT:

```
              country      continent       GDP  population
country_code
USA               USA  North America  19390604   322179605
CHN             China           Asia  12237700  1403500365
JPN             Japan           Asia   4872137   127748513
DEU           Germany         Europe   3677439    81914672
GBR                UK         Europe   2622434    65788574
IND             India           Asia   2597491  1324171354
```

As you can see in the output, the row labels and column names have been changed directly in `country_data_copy`.

We did this simply by setting `inplace = True`.

Again: be careful with this. When you do this, you’ll directly modify and overwrite your data. Check your code and double-check it again to make sure that it works correctly before using `inplace = True`.

Let’s quickly cover a common question about the Pandas rename technique.

**Frequently asked questions:**

**Why didn’t rename change my original dataframe?**

Remember: by default, the Pandas rename method creates a new dataframe as an output, but leaves the original dataframe unchanged. (I mentioned this in the syntax section.)

The reason is that by default, the `inplace` parameter is set to `inplace = False`. Again, this leaves the original dataframe unchanged, and simply produces a *new* dataframe as the output.

If you want to directly modify your original dataframe, you need to set `inplace = True`. I show how to do this in example 4.

Do you have any other questions about the Pandas rename method?

Is there something that you’re struggling with that I haven’t covered here?

If so, leave your questions in the comments section near the bottom of the page.

This tutorial should have given you a good idea of how to rename columns in Python using the Pandas rename method.

But to really understand data manipulation in Python, you’ll need to know quite a few more techniques.

Moreover, to understand data science more broadly, there’s really a lot more to learn.

Having said that, if you want to learn data manipulation with Pandas, and data science in Python, then sign up for our email newsletter.

When you sign up, you’ll get free tutorials on:

- NumPy
- Pandas
- Base Python
- Scikit learn
- Machine learning
- Deep learning
- … and more.

We publish data science tutorials every week, and when you sign up for our email list, we’ll deliver those tutorials directly to your inbox.

The tutorial will explain what the `describe()` method does, how the syntax works, and it will show you step-by-step examples.

If you need something specific, you can click on any of the following links, and it will take you to the appropriate section in the tutorial.

**Table of Contents:**

Ok. Let’s start with a quick description of what the Pandas describe method does.

The `describe()` method computes and displays summary statistics for a Python dataframe. (It also operates on dataframe columns and Pandas Series objects.)

So if you have a Pandas dataframe or a Series object, you can use the describe method and it will output statistics like:

- mean
- median
- standard deviation
- minimum
- maximum
- percentiles
- etc

Having said that, the exact statistics that are computed depends on how you use the syntax.

With that in mind, let’s take a look at the syntax.

Here, we’ll take a look at the syntax of the Pandas describe method.

I’ll show you how to use the describe method on:

- dataframes
- Pandas Series objects
- dataframe columns (which are actually Series objects)

Additionally, I’ll explain some of the optional parameters that we can use to modify how the technique works.

In the syntax explanation ahead, we’ll be assuming that we already have a Pandas dataframe or a Pandas series object.

If you need a refresher on Pandas dataframes, how they work, and how to create them, you can read our tutorial on Pandas dataframes.

First, let’s look at how to use the describe method on a Pandas dataframe.

This is extremely simple.

You simply type the name of the dataframe, and then `.describe()`.

By default, if you only type `your_dataframe.describe()`, the describe method will compute summary statistics on all of the numeric variables in your dataframe.

There are also some optional parameters that we can use to modify the method, which we’ll get to in a moment.

You can also use the Pandas describe method on pandas Series objects instead of dataframes.

The most common use of this, though, is to use `describe()` on individual columns of a Pandas dataframe (remember, each column of a dataframe is technically a Pandas Series).

You can use the describe method on a dataframe column like this:

So you type the name of your dataframe, then a ‘dot’, then the name of the column, then `.describe()`.

And once again, there are also some additional parameters that you can use inside the parenthesis. These will change the behavior of the method.

That being the case, let’s look at the additional parameters.

A few of the important parameters that you can use to modify the Pandas describe method are:

- `include`
- `exclude`
- `percentiles`
- `datetime_is_numeric`

Let’s look at a few of these.

`include` (optional)

The `include` parameter enables you to specify what data types to operate on and include in the output descriptive statistics.

Possible arguments to this parameter are:

- `'all'` (this will include all variables)
- `numpy.number` (this will include numeric variables)
- `object` (this will include string variables)
- `'category'` (this will include Pandas category variables)

Note that as shown above, some of these arguments need to be enclosed inside of quotation marks! (I’ll show you examples of these in the examples section.)

Additionally, you can provide multiple of these arguments in a Python list.

Note that this parameter is *ignored* when you use describe on a Series object.

`exclude` (optional)

The `exclude` parameter enables you to specify what data types to exclude from the descriptive statistics. (Note: this is very similar to the `include` parameter explained above.)

Possible arguments to this parameter are:

- `numpy.number` (this will exclude numeric variables)
- `object` (this will exclude string variables)
- `'category'` (this will exclude Pandas category variables)

Note that as shown above, some of these arguments need to be enclosed inside of quotation marks! (I’ll show you examples of these in the examples section.)

Additionally, you can provide multiple of these arguments in a Python list.

Note that this parameter is *ignored* when you use describe on a Series object.
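The examples section below focuses on `include`, so here’s a minimal sketch of `exclude` (using a small hypothetical dataframe):

```python
import pandas as pd
import numpy as np

# A small hypothetical dataframe with one numeric and one string column
df = pd.DataFrame({'score': [1.0, 2.0, 3.0], 'label': ['a', 'b', 'b']})

# Exclude the numeric variables; only the string column is summarized
summary = df.describe(exclude = [np.number])
print(list(summary.columns))  # ['label']
```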

`percentiles` (optional)

The `percentiles` parameter enables you to specify what percentiles to include in the descriptive statistics, when the `describe()` method operates on numeric variables.

By default, `describe()` will include the 25th and 75th percentiles.

You can provide a list or a list-like sequence of numbers between 0 and 1 as an argument to this parameter.

For example, if you set `percentiles = [.1, .9]`, the describe method will return the 10th and 90th percentiles (but will exclude the 25th and 75th percentiles).

Note that no matter what arguments you provide to the `percentiles` parameter, the describe method will always return the median (50th percentile).
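For example, here’s a quick sketch with a hypothetical series of the numbers 1 through 10:

```python
import pandas as pd

s = pd.Series(range(1, 11))

# The 50th percentile (median) appears even though we didn't request it
summary = s.describe(percentiles = [.1, .9])
print(summary.index.tolist())
# ['count', 'mean', 'std', 'min', '10%', '50%', '90%', 'max']
```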

Ok. Now that we’ve looked at the syntax, let’s take a look at some examples of how to compute summary statistics with the `describe()` method.

**Examples:**

- Describe a dataframe
- Describe a single column
- Compute summary statistics for numeric variables
- Compute summary statistics for string variables
- Get summary statistics for ‘category’ variables
- Specify what percentiles to include in the output

Before you run the examples though, you need to run some preliminary code.

First, make sure that you import Pandas, Numpy, and Seaborn.

import pandas as pd
import numpy as np
import seaborn as sns

We’ll obviously need the Pandas package for the `describe` method, but we’ll use Numpy when we use the `include` parameter. We’ll also need Seaborn to load our dataset.

Now, let’s load the dataset that we’re going to use.

In these examples, we’re going to use the `titanic` dataset, which is included along with the Seaborn package.

To load the `titanic` dataset, you can run the following code:

titanic = sns.load_dataset('titanic')

Now that we have our packages loaded and we have our dataset, we can move on to our examples.

Let’s start with a simple example.

Here, we’ll use Pandas describe on an entire dataframe. By default, this will return summary statistics for all of the numeric variables.

Let’s run the code:

titanic.describe()

OUT:

         survived      pclass         age       sibsp       parch        fare
count  891.000000  891.000000  714.000000  891.000000  891.000000  891.000000
mean     0.383838    2.308642   29.699118    0.523008    0.381594   32.204208
std      0.486592    0.836071   14.526497    1.102743    0.806057   49.693429
min      0.000000    1.000000    0.420000    0.000000    0.000000    0.000000
25%      0.000000    2.000000   20.125000    0.000000    0.000000    7.910400
50%      0.000000    3.000000   28.000000    0.000000    0.000000   14.454200
75%      1.000000    3.000000   38.000000    1.000000    0.000000   31.000000
max      1.000000    3.000000   80.000000    8.000000    6.000000  512.329200

Here, we called the `describe()` method using so-called “dot syntax.”

We typed the name of the dataframe, then `.describe()`.

By default, describe computed the:

- count
- mean
- standard deviation
- the minimum and maximum
- the 25th, 50th, and 75th percentiles

Note once again that by default, the method only shows statistics for the numeric variables. We’ll change that in example 3.

Next, let’s operate on a single column of our dataframe.

Here, we’ll use “dot syntax” to retrieve a single variable first, the `age` variable, and then we’ll use the describe method on that column.

Let’s take a look:

titanic.age.describe()

OUT:

count    714.000000
mean      29.699118
std       14.526497
min        0.420000
25%       20.125000
50%       28.000000
75%       38.000000
max       80.000000
Name: age, dtype: float64

When we type the code `titanic.age`, Python will retrieve the `age` variable from the dataframe.

From there, we can use dot syntax again to call the `describe()` method.

So here, the code `titanic.age.describe()` computes summary statistics only for the `age` variable.

Now, let’s move back to our dataframe.

Here, we’re going to explicitly specify the variables that we’re going to include.

Specifically, we’ll indicate that we want to include only the numeric variables.

Let’s run the code, and then I’ll explain.

titanic.describe(include = [np.number])

OUT:

         survived      pclass         age       sibsp       parch        fare
count  891.000000  891.000000  714.000000  891.000000  891.000000  891.000000
mean     0.383838    2.308642   29.699118    0.523008    0.381594   32.204208
std      0.486592    0.836071   14.526497    1.102743    0.806057   49.693429
min      0.000000    1.000000    0.420000    0.000000    0.000000    0.000000
25%      0.000000    2.000000   20.125000    0.000000    0.000000    7.910400
50%      0.000000    3.000000   28.000000    0.000000    0.000000   14.454200
75%      1.000000    3.000000   38.000000    1.000000    0.000000   31.000000
max      1.000000    3.000000   80.000000    8.000000    6.000000  512.329200

Here, notice that we called the describe method in a similar way to example 1.

But notice that inside the parenthesis, we have the syntax `include = [np.number]`. Remember that the `include` parameter enables us to specify what types of variables we want to include. Here, the syntax `np.number` indicates that we want to include numeric variables (i.e., Numpy numerics).

Notice as well that we’re presenting the arguments to this parameter as a *list*. This is very common in Pandas when you can provide multiple arguments. For example, try a list of several different data types: `titanic.describe(include = [np.number, object])`.

Now, we’ll do an example that’s similar to example 3, but slightly different.

In example 3, we computed the summary stats for the numeric variables.

Here, we’ll compute the summary statistics for the string variables.

titanic.describe(include = [object])

OUT:

         sex embarked  who  embark_town alive
count    891      889  891          889   891
unique     2        3    3            3     2
top     male        S  man  Southampton    no
freq     577      644  537          644   549

So what happened here?

We called the `describe()` method, and inside the parenthesis, we used the syntax `include = [object]`. Here, `object` refers to string variables, so the Pandas describe method computes summary stats for the string columns.

Notice that the statistics that are computed are actually different than the stats for the numeric variables.

For the numeric variables, `describe()` computes things like the minimum, maximum, mean, percentiles, etc.

But for these string variables, `describe()` has computed the count, the number of unique values, the most frequent value, and the frequency of the most frequent value.

Now, let’s operate on the ‘category’ variables.

This is very similar to the previous examples.

Let’s run the code, and then I’ll explain:

titanic.describe(include = ['category'])

OUT:

        class deck
count     891  203
unique      3    7
top     Third    C
freq      491   59

This is very similar to example 3 and example 4.

Here, we’re using the describe method to compute the summary stats for the ‘`category`’ variables. We’re telling Pandas describe to do this with the code `include = ['category']`.

Notice that the output is similar to the output for string variables (which we saw in example 4). The output includes the count, the number of unique values, the most frequent value (i.e., the ‘top’ value), and the frequency of the most frequent value.

At this point, you might be wondering what the difference is between a string and a category.

Strings and category variables are similar, but we typically use categories when there are only a few unique values. If there are many unique values (e.g., names, sentences, unstructured text data), a string (i.e., `object` data) is usually better.
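As a quick sketch (using a hypothetical series of repeated labels), you can convert a string column into a category with `astype('category')`:

```python
import pandas as pd

# A few repeated labels: a good candidate for the category dtype
s_str = pd.Series(['red', 'blue', 'red', 'red'])  # object (string) dtype
s_cat = s_str.astype('category')                  # category dtype

print(s_str.dtype)  # object
print(s_cat.dtype)  # category
```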

That being said, you need to be mindful of what data types are in your dataframe.

Finally, let’s use the `percentiles` parameter. This enables us to specify what percentiles to include in the output, when we operate on numeric variables.

Let’s run the code, and then I’ll explain:

titanic.describe(percentiles = [.1, .9])

OUT:

         survived      pclass         age       sibsp       parch        fare
count  891.000000  891.000000  714.000000  891.000000  891.000000  891.000000
mean     0.383838    2.308642   29.699118    0.523008    0.381594   32.204208
std      0.486592    0.836071   14.526497    1.102743    0.806057   49.693429
min      0.000000    1.000000    0.420000    0.000000    0.000000    0.000000
10%      0.000000    1.000000   14.000000    0.000000    0.000000    7.550000
50%      0.000000    3.000000   28.000000    0.000000    0.000000   14.454200
90%      1.000000    3.000000   50.000000    1.000000    2.000000   77.958300
max      1.000000    3.000000   80.000000    8.000000    6.000000  512.329200

So the output is very similar to the output for example 1. Remember that in example 1, we operated on the whole dataframe, but by default, describe included stats only for the numeric variables.

Here, the describe method has only included statistics for the numeric variables.

However, look at the percentiles. Instead of including the statistics for the 25th and 75th percentiles (which is the default), the method has included the stats for the 10th percentile and 90th percentile.

Why?

We explicitly forced this behavior with the `percentiles` parameter. Specifically, we set `percentiles = [.1, .9]`. This caused Pandas describe to include the stats for the 10th and 90th percentiles instead of the 25th and 75th percentiles.

A few additional notes:

Notice that the median (50th percentile) is still included.

Also, notice that when we use this parameter, we need to present the percentiles as decimal numbers inside of a Python list: `[.1, .9]`.

Do you have other questions about the Pandas describe method?

Is there something that you think I’ve missed?

If so, just leave your questions in the comments section below.

This tutorial should have helped you understand the Pandas describe method, but if you really want to master data manipulation with Pandas, there’s a lot more to learn.

And if you want to learn data science more broadly, there’s definitely more to learn, like data visualization and machine learning.

That being said, if you want to learn data science in Python, then sign up for our email list.

When you sign up, you’ll get free tutorials on:

- NumPy
- Pandas
- Base Python
- Scikit learn
- Machine learning
- Deep learning
- … and more.

We publish free tutorials every week, and when you sign up, we’ll deliver them directly to your inbox.

If you need something specific, you can click on any of the links below. These links will take you to the appropriate section of the tutorial.

**Table of Contents:**

If you really want to understand how Numpy argmin works though, you should probably read the whole tutorial; everything will make more sense with the full context.

Let’s start with a quick introduction to the argmin function.

The Numpy argmin function (much like its companion function, Numpy argmax) frequently confuses new Numpy users. However, it starts to make sense once you see a clear explanation and clear examples.

In its simplest use, the argmin function returns the *index* of the minimum value of a Numpy array.

So it’s similar to the Numpy minimum function, but instead of returning the minimum value, it returns the *index* of the minimum value.

To make this a little more clear, I’m going to quickly review some basics about Numpy and Python. Knowing these basics should make argmin easier to understand.

For starters, let’s review what a Numpy array is.

Numpy arrays store numeric data in a grid format.

So for example, in the Numpy array above, we have the values `91`, `92`, `93`, `7`, and `95`, arranged in a one-dimensional array.

So Numpy arrays are a special data structure that store numeric data in a grid-like format.

Next, let’s review Numpy indexes. Indexes are important for the argmin function.

All Numpy arrays have indexes. The values of the index are like numeric addresses for each position in the array.

1-dimensional arrays are simpler to understand than multi-dimensional arrays, so let’s look at a 1D example.

If we have a 1D Numpy array, every location in that array has an index. Again, the index is sort of like a numeric address for each position in the array.

If you’re familiar with lists, tuples, and other Python data structures, this probably sounds familiar. Most collections and sequences in Python have indexes, just like these Numpy array indexes.

And just like the indexes for lists and tuples, indexes for Numpy arrays start at 0.
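For instance, using the example array from above:

```python
import numpy as np

arr = np.array([91, 92, 93, 7, 95])

# Index positions start at 0
print(arr[0])  # 91
print(arr[3])  # 7
```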

Now, let’s bring this back to the argmin function.

When we call `np.argmin()`, the argmin function identifies the minimum value in the array.

But instead of retrieving the minimum value itself, argmin retrieves the *index* that’s associated with the minimum value.

That’s really all it does! The np.argmin function simply returns the index of the minimum value in the array.
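A quick sketch of the contrast between the minimum value and its index:

```python
import numpy as np

arr = np.array([91, 92, 93, 7, 95])

print(np.min(arr))     # 7  (the minimum value itself)
print(np.argmin(arr))  # 3  (the index of that minimum value)
```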

Having said that, there are more complicated ways of using np.argmin. For example, you can use the function along specific axes. I’ll show you examples of how to do that in the examples section.

But before we look at some examples, let’s look at the syntax.

Ok. Let’s take a look at the syntax.

One reminder before we look at the syntax:

The syntax explanation and the examples below assume that you’ve imported Numpy with the alias ‘`np`’.

You can do that with this code:

import numpy as np

Among Python data scientists, this is the common convention for importing Numpy. So, we’ll be using this convention in the following syntax explanation and examples.

With that said, let’s look at the syntax.

The syntax of np.argmin is fairly simple.

You call the function as `np.argmin()`.

Inside the parenthesis, you have a few parameters that you can use to control how the function works.

Let’s look at those parameters.

The np.argmin function really only has 3 parameters:

- `a`
- `axis`
- `out`

The `out` parameter is somewhat rarely used, so we’re not going to discuss it here.

But let’s look at the `a` parameter and the `axis` parameter.

`a` (required)

The `a` parameter enables you to specify the input array that you want to operate on.

So if you want to operate on an array called `myarray`, you can call the function as `np.argmin(a = myarray)`.

Keep in mind that you need to provide an argument to this parameter.

Having said that, you don’t need to explicitly use this parameter. Instead, you can pass in an argument by position like this: `np.argmin(myarray)`. The argmin function will assume that the first argument to the function is the input array to be passed to the `a=` parameter.

Also note that this parameter will accept many data structures as arguments. Typically, we’ll pass in a Numpy array as the argument, but the np.argmin function will also accept “array like” objects, such as Python lists.
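For example, a plain Python list works just as well as a Numpy array:

```python
import numpy as np

# np.argmin accepts "array like" inputs, such as a plain Python list
result = np.argmin([91, 92, 93, 7, 95])
print(result)  # 3
```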

`axis` (optional)

The `axis` parameter enables you to control the axis along which to use argmin.

Remember: Numpy arrays have *axes*. Axes are like directions along the numpy array.

Keep in mind that the `axis` parameter is optional. If you use it, `np.argmin` will retrieve the index values for the minima along particular axes. But if you don’t use it, then argmin will flatten out the array and retrieve the index of the minimum of the flattened array.

To be honest, how axes work is a little difficult to understand without examples. So I’ll show you some examples in the examples section below. I also strongly recommend that you read our tutorial that explains Numpy axes.

Ok. Let’s take a look at a few examples.

Things almost always make more sense when you can look at some examples, but that’s particularly true with np.argmin.

**Examples:**

- Use argmin on a 1-dimensional array
- Apply argmin to a 2-dimensional array
- Use argmin along axis 0
- Use argmin along axis 1

Before you run any of the examples, you need to import Numpy.

You can do that with the following code:

import numpy as np

When we do this, we’ll be able to call our Numpy functions starting with the alias ‘`np`’.

Ok, let’s start simple.

Here, we’re going to identify the index of the minimum value of a 1-dimensional Numpy array.

First, let’s just create the array:

my_1d_array = np.array([91,92,93,7,95])

Now, let’s use np.argmin.

np.argmin(a = my_1d_array)

OUT:

3

This is a simple example.

This is a 1D array, with several elements. The minimum value of the array is 7.

The minimum value, 7, is at index position 3, so argmin returns the value ‘3’.

Next, let’s look at how argmin works on a 2-dimensional array.

First, we need to create our array with the Numpy array() function.

my_2d_array = np.array([[8,92,93],[94,9,7]])

Notice that there are some high values and some low values.

Now that we have our 2D array, let’s apply np.argmin.

np.argmin(my_2d_array)

OUT:

5

The output for this example is 5. Why?

In this example, we’re operating on a 2-dimensional array.

But, by default, if we use np.argmin on a 2-dimensional array and we do *not* specify an axis, the Numpy argmin function applies a 2-step process.

First, np.argmin flattens the 2-dimensional array to a 1-dimensional array. (Keep in mind, np.argmin will also flatten out higher dimensional arrays).

Second, np.argmin operates on the new flattened array.

When we flatten out the array in this example, the minimum value, 7, is at index position 5 of the flattened array.

Therefore, numpy.argmin returns 5.
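You can verify this 2-step process yourself by flattening the array manually:

```python
import numpy as np

my_2d_array = np.array([[8, 92, 93], [94, 9, 7]])

# Step 1: flatten the 2D array into a 1D array
flattened = my_2d_array.flatten()
print(flattened)  # [ 8 92 93 94  9  7]

# Step 2: argmin operates on the flattened array
print(np.argmin(flattened))  # 5
```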

Next, let’s apply argmin to a 2-dimensional array, and also use the `axis` parameter.

In this example, we’ll set `axis = 0`.

We’ll re-use the array that we created in example 2, but if you didn’t run that code already, here’s the code to create our 2D array:

my_2d_array = np.array([[8,92,93],[94,9,7]])

Now that we have our array, let’s use Numpy argmin with `axis = 0`:

np.argmin(my_2d_array, axis = 0)

OUT:

array([0, 1, 1])

This example is harder to understand, so let’s break it down.

Remember what I said earlier: Numpy axes are like directions along a Numpy array. For a 2-dimensional array, the axis-0 direction points downward.

When we use argmin in the axis-0 direction, the function identifies the minimum along that axis and returns the index.

Since this example is a little more complicated, let’s look carefully.

When we set `axis = 0`, we’re using argmin in the axis-0 direction. Remember, for a 2D array, axis-0 points downward.

Also remember that for a 2D array, every row has an index. So you can see “row 0” and “row 1”. (Again, all Python indexes start at 0, so the “first” row is actually the 0th row.)

From there, argmin simply looks for the minimum value along the axis-0 direction. When it finds the minimum value, argmin returns the row index.

So 8 is the minimum value in the first column, and the *row index* of that value is 0.

The minimum value in the second column is 9, which is in row 1.

Similarly, the minimum value in the third column is 7, which is also in row 1.

np.argmin outputs the indexes of the minimum values in the axis-0 direction. So the output is `[0, 1, 1]`.

Effectively, when we set `axis = 0`, it’s like applying argmin along the columns.

Ok. Let’s do one more example.

Let’s apply argmin in the axis-1 direction.

First, let’s create our array (the same array as the previous two examples):

my_2d_array = np.array([[8,92,93],[94,9,7]])

And now, let’s use argmin.

np.argmin(my_2d_array, axis = 1)

OUT:

array([0, 2])

This example is also hard to understand. To understand it, you really need to know how axes work for Numpy arrays.

In this example, we’re applying np.argmin along axis-1. Remember: for 2D Numpy arrays, axis-1 points *horizontally*.

When we set `axis = 1`, argmin identifies the minimum value for every row. Then it returns the column index of each minimum value.

So for the first row, the minimum value is 8. That value has a column index of 0.

For the second row, the minimum value is 7. That value has a column index of 2.

So np.argmin outputs the column indexes of the minimum values: `[0, 2]`.

I’ve tried to make these examples as clear as possible, but I realize that Numpy argmin is difficult to understand.

Ultimately, to understand Numpy argmin, you need to understand Numpy indexes. You also need to understand how Numpy axes work. Those things being said, if you haven’t already, you should read our tutorial on Numpy axes.

Do you have other questions about Numpy argmin? Is there something I’ve missed?

If so, just leave your questions in the comments section near the bottom of the page.

In this tutorial, I’ve shown you how to use Numpy argmin.

Numpy argmin is useful for some very specific things, but if you’re working with numeric data in Python, there’s a lot more to learn. To use Numpy properly, you’ll need to know many other Numpy functions.

So if you’re serious about learning Numpy, you should consider joining our premium course called *Numpy Mastery*.

Numpy Mastery will teach you everything you need to know about Numpy, including:

- How to create Numpy arrays
- How Numpy axes work
- What the “Numpy random seed” function does
- How to use the Numpy random functions
- How to reshape, split, and combine your Numpy arrays
- Applying mathematical operations on Numpy arrays
- and more …

The course will also show you our unique practice system. This practice system will enable you to *memorize* all of the Numpy syntax that you learn.

If you’ve struggled to remember Numpy syntax, this is the course you’ve been looking for.

If you practice like we show you, you’ll memorize all of the critical Numpy syntax in only a few weeks.

Find out more here:

It will explain what the np.unique function does, how the syntax works, and it will show you clear examples.

If you need something specific, you can click on any of the following links.

**Table of Contents:**

The Numpy unique function is pretty straightforward: it identifies the unique values in a Numpy array.

So let’s say we have a Numpy array with repeated values. If we apply the np.unique function to this array, it will output the unique values.

Additionally, the Numpy unique function can:

- identify the unique *rows* of a Numpy array
- identify the unique *columns* of a Numpy array
- compute the *number* of occurrences of the unique values
- identify the *index* of the first occurrence of the unique values

So the Numpy unique function identifies unique values, rows, and columns, but can also identify some other information about those unique values.

Now that I’ve briefly explained what the Numpy unique function does, let’s take a look at the syntax.

On the syntax explanation here, and in the examples section below, I’m going to assume that you’ve imported Numpy with the following code:

import numpy as np

This is the common convention for importing Numpy. It’s important though, because the exact form of the syntax will depend on how we import Numpy.

The syntax is mostly straightforward.

We typically call the function as `np.unique()`, assuming that we’ve imported Numpy with the alias `np`.

Inside the parenthesis, the first argument to the function will be the name of the array that you want to operate on.

In the above syntax, this is called `arr`, but here, you’ll actually use the name of your array. So if your array is called `my_array`, you’ll use the code `np.unique(my_array)`.

This input array is required.

Additionally though, there are a set of optional parameters that you can use to modify the behavior of the function.

The np.unique function has four optional parameters:

- `return_index`
- `return_counts`
- `axis`
- `return_inverse`

Let’s look at each of those.

`return_index` (optional)

When `return_index = True`, np.unique will return the index of the first occurrence of each unique value.

This parameter is optional.

By default, this is set to `return_index = False`.

`return_counts` (optional)

When `return_counts = True`, np.unique will return the number of times each unique value occurs in the input array.

This parameter is optional.

By default, this is set to `return_counts = False`.

`axis` (optional)

The `axis` parameter enables you to specify a direction along which to use the np.unique function.

If set to `axis = None`, the input array will be flattened before applying np.unique.

To learn more about the different axes (i.e., the “directions” along a Numpy array), you can read our tutorial about Numpy axes.

This parameter is optional.

By default, this is set to `axis = None`.

`return_inverse` (optional)

If `return_inverse = True`, np.unique will return the indices of the unique array. These index values can be used to reconstruct the original array.

This parameter is optional.

By default, this is set to `return_inverse = False`.
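The examples section below doesn’t cover `return_inverse`, so here’s a minimal sketch:

```python
import numpy as np

arr = np.array([5, 5, 1, 5, 4])
unique_values, inverse = np.unique(arr, return_inverse = True)
print(unique_values)  # [1 4 5]
print(inverse)        # [2 2 0 2 1]

# Indexing the unique values with the inverse rebuilds the original array
reconstructed = unique_values[inverse]
print(reconstructed)  # [5 5 1 5 4]
```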

Now that we’ve looked at the syntax of the np.unique function, let’s look at some examples.

**Examples:**

- Get unique values from a 1D Numpy array
- Identify index of first occurrence of unique values
- Get the counts of each unique value
- Get the unique rows and columns

Before you run any of these examples, you need to run some code to import Numpy and to create a dataset.

To import Numpy, run this code:

import numpy as np

This will enable us to call Numpy functions with the prefix `np`

.

Now we’ll create a Numpy array.

Here, we’ll use the np.array function to create a 1-dimensional array.

array_with_duplicates = np.array([5,5,1,5,4,5,1,5,3,5,1,3])

As you can see, the array has several duplicated values.

First, let’s get the unique values from our 1D array, `array_with_duplicates`.

# GET UNIQUE VALUES
np.unique(array_with_duplicates)

OUT:

array([1, 3, 4, 5])

This is pretty simple.

The input array, `array_with_duplicates`, has the values `1`, `3`, `4`, and `5`, but they are duplicated and organized in random order.

When we apply the `np.unique()` function, the output is a Numpy array of the unique values. These unique values are sorted in ascending order.

Next, we’re going to get the unique values *and also* get the index of the first occurrence of each unique value.

To do this, we’ll use the `return_index` parameter.

# GET UNIQUE VALUES, WITH INDEX OF FIRST OCCURRENCE
unique_values, first_occurrence_index = np.unique(array_with_duplicates, return_index = True)

Next, let’s print each of these output arrays.

print('These are the unique values:')
print(unique_values)
print('These are the indexes of the first occurrence:')
print(first_occurrence_index)

OUT:

These are the unique values:
[1 3 4 5]
These are the indexes of the first occurrence:
[2 8 4 0]

Here, we used `np.unique()` on our input array, and we set the parameter `return_index = True`.

This caused `np.unique()` to output two Numpy arrays:

- one array with the unique values (`unique_values`)
- another array with the index of the first occurrence of every unique value (`first_occurrence_index`)

Just remember: when you set `return_index = True`, `np.unique()` will output two arrays!

Now, we’ll get the unique values *and* get the count of the number of occurrences of each unique value.

To do this, we’ll use the `return_counts` parameter.

# GET UNIQUE VALUES, WITH COUNTS
unique_values, value_count = np.unique(array_with_duplicates, return_counts = True)

Next, let’s print each of these output arrays.

print('These are the unique values:')
print(unique_values)
print('These are the counts of the unique values:')
print(value_count)

OUT:

These are the unique values:
[1 3 4 5]
These are the counts of the unique values:
[3 2 1 6]

Here, we used `np.unique()` on our input array, and we set the parameter `return_counts = True`.

This caused `np.unique()` to output two Numpy arrays:

- one array with the unique values (`unique_values`)
- another array with the count of the number of occurrences of every unique value (`value_count`)

Again, when you set `return_counts = True`, `np.unique()` will output two arrays!

Finally, let’s identify the unique rows and the unique columns of an array.

To do this, we’ll use the `axis` parameter.

To run this example, we first need to create a 2-dimensional array. So here, we’ll create a 2D array using the Numpy array function.

dupe_array_2d = np.array([[1,2,1],[2,2,2],[1,2,1]])

And now, let’s look at it with a print statement:

print(dupe_array_2d)

OUT:

[[1 2 1]
 [2 2 2]
 [1 2 1]]

So the array, `dupe_array_2d`, is a two-dimensional array with 3 rows and 3 columns.

If you look carefully, you’ll notice that the 1st and 3rd rows are the same. The 1st and 3rd columns are also the same.

Now that we have our array, let’s get the unique rows and unique columns.

To get the unique rows, we set `axis = 0`, and to get the unique columns, we set `axis = 1`.

# GET UNIQUE ROWS
print('Unique rows:')
np.unique(dupe_array_2d, axis = 0)

# GET UNIQUE COLUMNS
print('Unique columns:')
np.unique(dupe_array_2d, axis = 1)

OUT:

Unique rows:
array([[1, 2, 1],
       [2, 2, 2]])
Unique columns:
array([[1, 2],
       [2, 2],
       [1, 2]])

This is fairly straightforward if you understand how axes work.

For a 2D array, axis-0 points downward and axis-1 points horizontally.

So when we set `axis = 0`, np.unique operates downward in the axis-0 direction. This causes it to identify the unique rows.

Similarly, when we set `axis = 1`, np.unique operates horizontally in the axis-1 direction. This causes it to identify the unique columns.

This is fairly simple once you understand how Numpy axes work. Having said that, many people are confused by Numpy axes. If you need help understanding how axes work, read our explanation of Numpy array axes.
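The `axis` parameter also combines with `return_counts`, which lets you count how many times each unique row (or column) occurs. A short sketch, using the same `dupe_array_2d` from above:

```python
import numpy as np

dupe_array_2d = np.array([[1, 2, 1],
                          [2, 2, 2],
                          [1, 2, 1]])

# axis=0 treats each row as a unit; return_counts counts duplicate rows
unique_rows, row_counts = np.unique(dupe_array_2d, axis=0, return_counts=True)

print(unique_rows)
print(row_counts)  # [2 1], because the row [1 2 1] appears twice
```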

Do you have other questions about the Numpy unique function?

If so, leave your questions in the comments section at the bottom of the page.

This tutorial should have given you a good understanding of the Numpy unique function.

But to learn data science in Python, you’ll need to learn a lot more about Numpy. In fact, you’ll need to learn about Pandas, and several other data science topics.

So if you want to learn Python data science, you should sign up for our FREE email list.

When you sign up, you’ll get free tutorials on:

- NumPy
- Pandas
- Base Python
- Scikit learn
- Machine learning
- Deep learning
- … and more.

We publish new tutorials every week, and when you sign up for our free email list, these tutorials will be delivered directly to your inbox.
