The post Numpy standard deviation explained appeared first on Sharp Sight.

At a high level, the Numpy standard deviation function is simple. It calculates the standard deviation of the values in a Numpy array.

But the details of exactly how the function works are a little complex and require some explanation.

That being said, this tutorial will explain how to use the Numpy standard deviation function.

It will explain the syntax of np.std(), and show you clear, step-by-step examples of how the function works.

The tutorial is organized into sections. You can click on any of the following links, which will take you to the appropriate section.

**Table of Contents:**

- A very quick review of Numpy
- Introduction to Numpy standard deviation
- The syntax of np.std
- Numpy standard deviation examples
- Numpy standard deviation FAQ

Having said that, if you’re relatively new to Numpy, you might want to read the whole tutorial.

Let’s just start off with a veeeery quick review of Numpy.

What is Numpy?

To put it simply, Numpy is a toolkit for working with numeric data.

First, Numpy has a set of tools for creating a data structure called a Numpy array.

You can think of a Numpy array as a row-and-column grid of numbers. Numpy arrays can be 1-dimensional, 2-dimensional, or even n-dimensional.

A 2D array looks something like this:

For simplicity's sake, in this tutorial, we'll stick to 1 or 2-dimensional arrays.

There are a variety of ways to create different types of arrays with different kinds of numbers. A few of the tools for creating Numpy arrays include numpy arange, numpy zeros, numpy ones, numpy tile, and other methods.
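For instance, a few of these creation routines look like this (a minimal sketch; the variable names are just for illustration):

```python
import numpy as np

a = np.array([1, 2, 3])   # from a Python list
b = np.arange(0, 10, 2)   # evenly spaced values: [0 2 4 6 8]
c = np.zeros((2, 3))      # a 2x3 array of zeros
d = np.ones(4)            # a 1D array of ones
e = np.tile([1, 2], 3)    # a repeated pattern: [1 2 1 2 1 2]
```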

Regardless of how you create them, at a high level, Numpy arrays are simply arrays of numbers.

Numpy not only provides tools for *creating* Numpy arrays, Numpy also provides tools for *working with* Numpy arrays.

Some of the most important of these Numpy tools are Numpy functions for performing calculations.

There’s a whole set of Numpy functions for doing things like:

- computing the sum of a Numpy array
- calculating the maximum
- calculating the exponential of the numbers in an array
- computing the value x to some power, for every value in a Numpy array

… and a variety of other computations.

The Numpy standard deviation function is essentially a lot like these other Numpy tools. It simply performs a computation (the standard deviation) on a group of numbers in a Numpy array.

At a very high level, standard deviation is a measure of the spread of a dataset. In particular, it is a measure of how far the datapoints are from the mean of the data.

Let’s briefly review the basic calculation.

Standard deviation is calculated as the square root of the variance.

So if we have a dataset with $n$ numbers, the variance will be:

(1)   $\sigma^2 = \dfrac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$

And the standard deviation will just be the square root of the variance:

(2)   $\sigma = \sqrt{\dfrac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2}$

Where:

$x_i$ = the individual values in the dataset

$n$ = the number of values in the dataset

$\bar{x}$ = the mean of the values

Most of the time, calculating standard deviation by hand is a little challenging, because you need to compute the mean, the deviations of each datapoint from the mean, then the square of the deviations, etc. Frankly, it’s a little tedious.

However, if you’re working in Python, you can use the Numpy standard deviation function to perform the calculation for you.
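To make that concrete, here's a small sketch comparing the by-hand steps with np.std (the dataset is made up for illustration):

```python
import numpy as np

data = np.array([2, 4, 4, 4, 5, 5, 7, 9])  # a small example dataset

# The manual calculation, step by step:
mean = data.sum() / data.size                    # the mean of the data
deviations = data - mean                         # distance of each point from the mean
variance = (deviations ** 2).sum() / data.size   # average squared deviation
std_manual = variance ** 0.5                     # square root of the variance

print(std_manual)    # 2.0
print(np.std(data))  # 2.0 ... np.std performs the same steps for you
```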

Because this blog post is about using the numpy.std() function, I don’t want to get too deep into the weeds about how the calculation is performed by hand. This tutorial is really about how we use the function. So, if you need a quick review of what standard deviation is, you can watch this video.

Ok. Having quickly reviewed what standard deviation is, let’s look at the syntax for np.std.

The syntax of the Numpy standard deviation function is fairly simple.

I’ll explain it in just a second, but first, I want to tell you one quick note about Numpy syntax.

Typically, when we write Numpy syntax, we use the alias “np”. That’s the common convention among most data scientists.

To set that alias, you need to import Numpy like this:

import numpy as np

If we import Numpy with this alias, we can call the Numpy standard deviation function as `np.std()`.

Ok, that being said, let’s take a closer look at the syntax.

At a high level, the syntax for np.std looks something like this:
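A minimal sketch of the general form (parameter names as listed in the Numpy documentation):

```python
import numpy as np

# General form (a sketch):
#   np.std(a, axis=None, dtype=None, out=None, ddof=0, keepdims=<no value>)

arr = np.array([1.0, 2.0, 3.0, 4.0])

# Calling it with the common defaults spelled out explicitly:
result = np.std(arr, axis=None, ddof=0)
print(result)
```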

As I mentioned earlier, assuming that we’ve imported Numpy with the alias “`np`”, we call the function with the syntax `np.std()`.

Then inside of the parentheses, there are several parameters that allow you to control exactly how the function works.

Let’s take a look at those parameters.

There are a few important parameters you should know:

`a`

`axis`

`dtype`

`ddof`

`keepdims`

`out`

Let’s take a look at each of them.

`a` *(required)*

The `a` parameter specifies the *array* of values over which you want to calculate the standard deviation.

Said differently, this enables you to specify the *input array* to the function.

Appropriate inputs include Numpy arrays, but also “array like” objects such as Python lists.

Importantly, **you must provide an input to this parameter**. An input is required.

Having said that, the parameter itself can be implicit or explicit. What I mean by that is that you can directly type the parameter `a=`, OR you can leave the parameter out of your syntax and just type the name of your input array.

I’ll show you examples of this in example 1.

`axis` *(optional)*

The axis parameter enables you to specify an axis along which the standard deviation will be computed.

To understand this, you really need to understand axes.

Numpy arrays have axes.

You can think of an “axis” like a direction along the array.

In a 2-dimensional array, there will be 2 axes: axis-0 and axis-1.

In a 2D array, axis-0 points downward along the rows, and axis-1 points horizontally along the columns.

You can visualize the axes of a 2D array like this:

Using the `axis` parameter, you can compute the standard deviation in a particular direction along the array.

This is best illustrated with examples, so I’ll show you an example in example 2.

(For a full explanation of Numpy array axes, see our tutorial called Numpy axes explained.)

`dtype`

*(optional)*

The `dtype` parameter enables you to specify the data type that you want to use when np.std computes the standard deviation.

If the data in the input array are integers, then this will default to `float64`.

Otherwise, if the data in the input array are floats, then this will default to the same float type as the input array.
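For instance, a quick sketch of the `dtype` parameter in action (the variable names are just for illustration):

```python
import numpy as np

int_array = np.array([1, 2, 3, 4])

# With integer input, np.std computes in float64 by default ...
default_result = np.std(int_array)

# ... but you can request a different precision with dtype:
float32_result = np.std(int_array, dtype=np.float32)

print(default_result.dtype)  # float64
print(float32_result.dtype)  # float32
```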

`ddof`

*(optional)*

This enables you to specify the “degrees of freedom” for the calculation.

To understand this, you need to look at equation 2 again.

In this equation, the first term is $\frac{1}{n}$.

Remember: $n$ is the number of values in the array or dataset.

But if we’re thinking in statistical terms, there’s actually a difference between computing a population standard deviation vs a sample standard deviation.

If we compute a population standard deviation, we use the term $\frac{1}{n}$ in our equation.

However, when we compute the standard deviation on a *sample* of data (a sample of datapoints), then we need to modify the equation so that the leading term is $\frac{1}{n-1}$. In that case, the equation for a *sample* standard deviation becomes:

(3)   $s = \sqrt{\dfrac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$

How do we implement this with np.std?

We can do this with the `ddof` parameter, by setting `ddof = 1`.

And in fact, we can set the `ddof` term more generally. When we use `ddof`, it will modify the standard deviation calculation to become:

(4)   $\sqrt{\dfrac{1}{n - \text{ddof}}\sum_{i=1}^{n}(x_i - \bar{x})^2}$

To be honest, this is a little technical. If you need to learn more about this, you should watch this video at Khan academy about degrees of freedom, and population vs sample standard deviation.
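As a quick sketch of the difference (using a made-up sample):

```python
import numpy as np

sample = np.array([4.0, 8.0, 6.0, 10.0])

# Population standard deviation (ddof = 0, the default): divides by n
pop_std = np.std(sample)

# Sample standard deviation (ddof = 1): divides by n - 1
samp_std = np.std(sample, ddof=1)

print(pop_std, samp_std)  # the ddof = 1 result is slightly larger
```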

`out`

*(optional)*

The `out` parameter enables you to specify an alternative array in which to put the output.

It should have the same shape as the expected output.
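For instance, a minimal sketch of using `out` (the array names here are just for illustration):

```python
import numpy as np

arr = np.array([[1.0, 2.0], [3.0, 4.0]])

# Pre-allocate an output array with the shape of the expected result
# (here: one standard deviation per column, so shape (2,)):
result = np.empty(2)
np.std(arr, axis=0, out=result)

print(result)  # the column standard deviations, stored in `result`
```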

`keepdims`

*(optional)*

The `keepdims` parameter can be used to “keep” the original number of dimensions. When you set `keepdims = True`, the output will have the same number of dimensions as the input.

Remember: when we compute the standard deviation, the computation will “collapse” the number of dimensions.

For example, if we pass in a 2-dimensional array, then by default, np.std will output a single number: a scalar value.

But if we want the output to be a number *within a 2D array* (i.e., an output array with the same number of dimensions as the input), then we can set `keepdims = True`.

To be honest, some of these parameters are a little abstract, and I think they will make a lot more sense with examples.

Let’s take a look at some examples.

Here, we’ll work through a few examples. We’ll start simple and then increase the complexity.

**Examples:**

- Calculate standard deviation of a 1-dimensional array
- Calculate the standard deviation of a 2-dimensional array
- Use np.std to compute the standard deviations of the columns
- Use np.std to compute the standard deviations of the rows
- Change the degrees of freedom
- Use the keepdims parameter in np.std

Before you run any of the example code, you need to import Numpy.

To do this, you can run the following code:

import numpy as np

This will import Numpy with the alias “`np`”.

Here, we’ll start simple.

We’re going to calculate the standard deviation of a 1-dimensional Numpy array.

First, we’ll just create our 1D array:

array_1d = np.array([12, 14, 99, 72, 42, 55])

Now, we’ll calculate the standard deviation of those numbers.

np.std(array_1d)

OUT:

30.84369195367723

So what happened here?

The np.std function just computed the standard deviation of the numbers `[12, 14, 99, 72, 42, 55]` using equation 2 that we saw earlier. Each number is one of the $x_i$ values in that equation.

In the above example, we did not explicitly use the `a=` parameter. That is because np.std understands that when we provide an argument to the function, like in the code `np.std(array_1d)`, the input should be passed to the `a` parameter.

Alternatively, you can also explicitly use the `a=` parameter:

np.std(a = array_1d)

OUT:

30.84369195367723

Ok. Now, let’s look at an example with a 2-dimensional array.

Here, we’re going to create a 2D array, using the np.random.randint function.

np.random.seed(22)
array_2d = np.random.randint(20, size = (3, 4))

This array has 3 rows and 4 columns.

Let’s print it out, so we can see it.

print(array_2d)

OUT:

[[ 4 12  0  4]
 [ 6 11  8  4]
 [18 14 13  7]]

This is just a 2D array that contains 12 random integers between 0 and 20.

Okay, let’s compute the standard deviation.

np.std(array_2d)

OUT:

5.007633062524539

Here, numpy.std() is just computing the standard deviation of all 12 integers.

The standard deviation is `5.007633062524539`.

Now, we’re going to compute the standard deviation of the columns.

To do this, we need to use the `axis` parameter. (You learned about the `axis` parameter in the section about the parameters of numpy.std.)

Specifically, we need to set `axis = 0`.

Why?

As I mentioned in the explanation of the `axis` parameter earlier, Numpy arrays have axes.

In a two dimensional array, axis-0 is the axis that points downwards.

When we use numpy.std with `axis = 0`, that will compute the standard deviations downward in the axis-0 direction.

Let’s take a look at an example so you can see what I mean.

First, we’ll create a 2D array, using the np.random.randint function.

(This is the same array that we created in example 2, so if you already created it, you shouldn’t need to create it again.)

np.random.seed(22)
array_2d = np.random.randint(20, size = (3, 4))

Let’s print it out, so we can see it.

print(array_2d)

OUT:

[[ 4 12  0  4]
 [ 6 11  8  4]
 [18 14 13  7]]

This is just a 2D array that contains integers between 0 and 20.

Now, we’ll set `axis = 0` inside of np.std to compute the standard deviations of the columns.

np.std(array_2d, axis = 0)

OUT:

array([6.18241233, 1.24721913, 5.35412613, 1.41421356])

What’s going on here?

When we use np.std with `axis = 0`, Numpy will compute the standard deviation downward in the axis-0 direction. Remember, as I mentioned above, axis-0 points downward.

This has the effect of computing the standard deviation of each column of the Numpy array.

Now, let’s do a similar example with the row standard deviations.

Now, we’re going to use np.std to compute the standard deviations horizontally along a 2D numpy array.

Remember what I said earlier: numpy arrays have axes. The axes are like directions along the Numpy array. In a 2D array, axis-1 points horizontally, like this:

So, if we want to compute the standard deviations horizontally, we can set `axis = 1`. This has the effect of computing the row standard deviations.

Let’s take a look.

To run this example, we’ll again need a 2D Numpy array, so we’ll create a 2D array using the np.random.randint function.

(This is the same array that we created in example 2, so if you already created it, you shouldn’t need to create it again.)

np.random.seed(22)
array_2d = np.random.randint(20, size = (3, 4))

Let’s print it out, so we can see it.

print(array_2d)

OUT:

[[ 4 12  0  4]
 [ 6 11  8  4]
 [18 14 13  7]]

This is just a 2D array that contains integers between 0 and 20.

Now, we’ll use np.std with `axis = 1` to compute the standard deviations of the rows.

np.std(array_2d, axis = 1)

OUT:

array([4.35889894, 2.58602011, 3.93700394])

If you understood example 3, this new example should make sense.

When we use np.std and set `axis = 1`, Numpy will compute the standard deviations horizontally along axis-1.

Effectively, when we use Numpy standard deviation with `axis = 1`, the function computes the standard deviation of the rows.

Now, let’s change the degrees of freedom.

Here in this example, we’re going to create a large array of numbers, take a sample from that array, and compute the standard deviation on that sample.

First, let’s create our arrays.

Specifically, we’ll create a normally distributed Numpy array with a mean of 0 and a standard deviation of 10.

To do this, we’ll use the Numpy random normal function. Note that we’re using the Numpy random seed function to set the seed for the random number generator. For more information on this, read our tutorial about np.random.seed.

np.random.seed(22)
population_array = np.random.normal(size = 100, loc = 0, scale = 10)

Ok. Now we have a Numpy array, `population_array`, with 100 elements drawn from a distribution with a mean of 0 and a standard deviation of 10.

Now, we’ll use Numpy random choice to take a random sample from the Numpy array, `population_array`.

np.random.seed(22)
sample_array = np.random.choice(population_array, size = 10)

This new array, `sample_array`, is a random sample of 10 elements from `population_array`.

We’ll use `sample_array` when we calculate our standard deviation using the `ddof` parameter.

Now, we’ll calculate the standard deviation of the sample.

Specifically, we’re going to use the Numpy standard deviation function with the `ddof` parameter set to `ddof = 1`.

np.std(sample_array, ddof = 1)

OUT:

10.703405562234051

Here, we’ve calculated:

$s = \sqrt{\dfrac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$

And when we set `ddof = 1` with our sample of 10 values, the leading term evaluates to $\frac{1}{10-1} = \frac{1}{9}$.

To be clear, when you calculate the standard deviation of a *sample*, you will set `ddof = 1`.

To be honest, the details about *why* are a little technical (and beyond the scope of this post), so for more information about calculating a sample standard deviation, I recommend that you watch this video.

Keep in mind that for some other instances, you can set `ddof` to values other than 1 or 0. If you don’t use the `ddof` parameter at all, it will default to 0.

No matter what value you select, the Numpy standard deviation function will compute the standard deviation with the equation:

$\sqrt{\dfrac{1}{n - \text{ddof}}\sum_{i=1}^{n}(x_i - \bar{x})^2}$

Ok. Finally, we’ll do one last example.

Here, we’re going to set the `keepdims` parameter to `keepdims = True`.

First, we’ll create a 2D array, using the np.random.randint function.

(This is the same array that we created in example 2, so if you already created it, you shouldn’t need to create it again.)

np.random.seed(22)
array_2d = np.random.randint(20, size = (3, 4))

Let’s print it out:

print(array_2d)

OUT:

[[ 4 12  0  4]
 [ 6 11  8  4]
 [18 14 13  7]]

Now, let’s take a look at the dimensions of this array.

array_2d.ndim

OUT:

2

This is a 2D array, just like we intended.

Ok. Now, we’re going to compute the standard deviation, and check the dimensions of the output.

output = np.std(array_2d)

Let’s quickly print the output:

print(output)

OUT:

5.007633062524539

So the standard deviation is 5.007633062524539.

Now, what are the dimensions of the output?

output.ndim

OUT:

0

The output has 0 dimensions (it’s a scalar value).

Why?

When np.std computes the standard deviation, it’s computing a summary statistic. In this case, the function is taking a large number of values and collapsing them down to a single metric.

So the input was 2-dimensional, but the output is 0-dimensional.

What if we want to change that?

What if we want the output to technically have 2-dimensions?

We can do that with the `keepdims` parameter.

Here, we’ll set `keepdims = True` to make the output have the same number of dimensions as the input.

output_2d = np.std(array_2d, keepdims = True)

Now, let’s look at the output:

print(output_2d)

OUT:

[[5.00763306]]

Notice that the output, the standard deviation, is still 5.00763306. But the result is enclosed inside of double brackets.

Let’s inspect `output_2d` and take a closer look.

type(output_2d)

OUT:

numpy.ndarray

So, `output_2d` is a Numpy array, not a scalar value.

Let’s check the dimensions:

output_2d.ndim

OUT:

2

This Numpy array, `output_2d`, has 2 dimensions.

This is the *same* number of dimensions as the input.

What happened?

When we set `keepdims = True`, that caused the np.std function to produce an output with the same number of dimensions as the input. Even though the output contains only a single value, `output_2d` has 2 dimensions.

So, in case you ever need your output to have the same number of dimensions as your input, you can set `keepdims = True`.

(This also works when you use the `axis` parameter … try it!)
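For instance, a quick sketch combining `axis` and `keepdims`, recreating the array from the earlier examples:

```python
import numpy as np

np.random.seed(22)
array_2d = np.random.randint(20, size=(3, 4))

# Column standard deviations, without and with keepdims:
flat = np.std(array_2d, axis=0)                 # shape (4,)
kept = np.std(array_2d, axis=0, keepdims=True)  # shape (1, 4)

print(flat.shape)
print(kept.shape)
```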

Now that you’ve learned about Numpy standard deviation and seen some examples, let’s review some frequently asked questions about np.std.

**Frequently asked questions:**

**Why is the result of np.std different from the standard deviation computed by Matlab?**

The simple reason is that Matlab calculates the standard deviation according to the following:

$\sqrt{\dfrac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$

(Many other tools use the same equation.)

However, Numpy calculates it with the following:

$\sqrt{\dfrac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2}$

Notice the subtle difference between the $\frac{1}{n-1}$ and the $\frac{1}{n}$.

To fix this, you can use the `ddof` parameter in Numpy.

If you use np.std with the `ddof` parameter set to `ddof = 1`, you should get the same answer as Matlab.
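A quick sketch of the difference (the dataset is made up for illustration):

```python
import numpy as np

data = np.array([10.0, 20.0, 30.0])

# Numpy's default: population formula, divides by n
print(np.std(data))

# Matlab-style: sample formula, divides by n - 1
print(np.std(data, ddof=1))  # 10.0
```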

Do you have other questions about the Numpy standard deviation function?

Leave your question in the comments section below.

The examples you’ve seen in this tutorial should be enough to get you started, but if you’re serious about learning Numpy, you should enroll in our premium course called *Numpy Mastery*.

There’s a lot more to learn about Numpy, and *Numpy Mastery* will teach you everything, including:

- How to create Numpy arrays
- How to use the Numpy random functions
- What the “Numpy random seed” function does
- How to reshape, split, and combine your Numpy arrays
- and more …

Moreover, it will help you completely *master* the syntax within a few weeks. You’ll discover how to become “fluent” in writing Numpy code.

Find out more here:

Learn More About Numpy Mastery


The post How to Use Pandas Reset Index appeared first on Sharp Sight.

This tutorial will explain the syntax of reset_index, and it will also show you clear, step-by-step examples of how to use reset_index to reset the index of a Pandas DataFrame.

The tutorial has several sections. You can click on one of the following links, and the link will take you to the appropriate section in the tutorial.

**Table of Contents:**

- A quick review of Pandas indexes
- Introduction to Pandas reset index
- The syntax of reset_index
- Reset index examples
- Reset index FAQ

Having said that, if you’re new to Pandas, or new to using Pandas DataFrame indexes, you should probably read the whole thing.

Ok … let’s get started.

To understand the Pandas reset index method, you really need to understand Pandas DataFrame indexes. You really need to understand what an index is, why we need them, and how we set indexes in Pandas.

Once you know that, we’ll be ready to talk about the reset_index method.

With that in mind, let’s review Pandas DataFrames and DataFrame indexes.

Briefly, let’s review DataFrames.

A Pandas DataFrame is a data structure in Python.

DataFrames have a row-and-column structure. Variables are along the columns, and observations (i.e., records) are down the rows.

At a high level, a Pandas DataFrame is a lot like an Excel spreadsheet. It’s just a row-and-column structure that holds data and enables us to perform analyses on that data.

One important feature of the DataFrame is what we call the “index.”

Every Pandas DataFrame has a special column-like structure called the index. To be clear, an index is only sort of like a column; properly speaking, it’s not actually one of the `columns` of a DataFrame.

If you print out a DataFrame, you’ll see the index on the far left hand side.

By default, if you don’t set any other index for the DataFrame, the index values will just be the integers starting from 0.

It looks something like this:

By default, every row will have an integer associated with it, starting with the number 0. We can use this integer index to retrieve rows by number using the Pandas iloc method.

The index is important.

A DataFrame index enables us to retrieve individual rows.

When we have a default numeric index, we can retrieve a row or a slice of rows by integer index. We typically do this with the Pandas iloc method.

The important thing to understand is that the index values act as sort of an “address” for the rows. So you can use techniques like Pandas iloc to retrieve or access specific rows.
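For instance, a minimal sketch of retrieving rows by integer position with iloc (the DataFrame here is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"name": ["Anna", "Ben", "Cara"],
                   "score": [88, 92, 79]})

# Retrieve the second row by its integer "address" (position 1):
row = df.iloc[1]
print(row["name"])  # Ben

# Retrieve a slice of rows by position:
subset = df.iloc[0:2]
print(subset)
```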

Although Pandas DataFrames have a numeric index by default, you can also set a new index for a DataFrame.

There are a few ways to do this (including a way to set an index with pandas read_csv). But, the most common way to set a new index for a Pandas DataFrame is with the Pandas set index method.

When you use set_index, the function typically transforms a column into the DataFrame index.

So for example, if your DataFrame has a column called `name`, you can use the set_index method to set `name` as the index. This would allow you to select individual rows by the name of the person associated with the row.

But let’s say that you’ve set an index. For example, in the image above, the DataFrame has the index “`name`”.

What do you do if you want to remove the index and “reset” the DataFrame index back to the default numeric index?

To do that, you use the Pandas reset index method.

The Pandas reset index method “resets” the current index.

Effectively, it takes the index of a DataFrame and turns it back into a proper column.

At the same time, it resets the index of the DataFrame back to the default integer index.

Having said that, there’s a little more to it. There are a few details of the method that are dictated by some details of the syntax.

With that being said, let’s look at the syntax of reset_index.

In the most basic case, the syntax of reset_index is fairly simple.

We simply type the name of the DataFrame, and then we use “dot syntax” to call the method.

Essentially, we type the name of the DataFrame, then a “dot”, and then `reset_index()`.

If we do this, the reset_index method will take the existing index (whatever it is) and will turn the index into a column of the DataFrame. At the same time, it will reset the index back to the default numeric index starting at 0.

Having said that, there are some parameters for reset_index that enable you to modify the behavior of the function.

Let’s take a look at those parameters.

The reset_index method has several parameters that enable you to modify the behavior of the method.

The specific parameters that we’ll focus on are:

`level`

`drop`

`inplace`

The reset_index method also has the parameters `col_level` and `col_fill`. These are used less frequently, so we’re not going to cover them in this tutorial.

Having said that, let’s take a look at `level`, `drop`, and `inplace`.

`level`

The `level` parameter enables you to specify which level you want to “reset” and remove from the index.

This is applicable only if you have multiple levels in your index, which is sort of a special case.

You don’t need to provide any argument to this parameter. By default, it will simply remove all of the levels (and return all parts of the index back to the DataFrame).

I’ll show you an example of this in the examples section, so you understand how it works.

`drop`

The `drop` parameter enables you to specify whether or not you want to delete the index entirely from the DataFrame.

Recall what I mentioned above: the Pandas reset_index method takes the index and returns the index back to the columns.

That’s the *default* behavior. By default, the `drop` parameter is set to `drop = False` (even if you don’t explicitly use the `drop` parameter).

You can change this though. If you set `drop = True`, reset_index will delete the index entirely instead of inserting it back into the columns of the DataFrame, and the default numeric index will replace it.

`inplace`

By default, the `inplace` parameter is set to `inplace = False`.

When `inplace` is set to `False`, the reset_index method will create an entirely new DataFrame as output. That means that when `inplace` is set to `inplace = False` (the default!), reset_index DOES NOT CHANGE THE ORIGINAL DATAFRAME.

This is important. Many people think that reset_index will operate directly on the original DataFrame that you’re referencing when you call the method. By default, it does not.

Instead, it simply creates a new DataFrame. Keep in mind, this new DataFrame will be sent to the console unless you assign it to a variable.

However, it is possible to have reset_index operate directly on the DataFrame.

To do this, you need to set `inplace = True`.

When you set `inplace = True`, the reset_index method will not create a new DataFrame. Instead, it will directly modify and overwrite your original DataFrame.

Sometimes that’s exactly what you want, but be careful! When you set `inplace = True`, reset_index will overwrite your data, so make sure that it’s working properly.

By default, the Pandas reset_index method creates a new DataFrame as output and leaves the original DataFrame unchanged. As noted in the section above, you can change this by setting `inplace = True`: in that case, reset_index will not create a new DataFrame, but will directly modify and overwrite your original DataFrame.

Ok. Now that you’ve learned about the syntax of reset_index, let’s look at some examples of reset_index.

**Examples:**

- Reset the index of a DataFrame
- Delete the index completely
- Reset a specific level
- Reset the index in place

Before you run any of the examples, you need to import Pandas and create a DataFrame.

Let’s do both of those.

Here, we’re just going to import the Pandas package.

You should know this, but Pandas is a data manipulation toolkit for Python. The reset_index method is one of the tools of Pandas.

To import Pandas into your working environment, you can run the following import statement:

import pandas as pd

This will import the Pandas package with the alias “`pd`”.

Next, we’re going to create a Pandas DataFrame with some “dummy” data.

To do this, we’ll use `pd.DataFrame()` to create a new DataFrame from a dictionary.

sales_data = pd.DataFrame({
    "name":["William","Emma","Sofia","Markus","Edward","Thomas","Ethan","Olivia","Arun","Anika","Paulo"]
    ,"region":["East","North","East","South","West","West","South","West","West","East","South"]
    ,"sales":[50000,52000,90000,34000,42000,72000,49000,55000,67000,65000,67000]
    ,"expenses":[42000,43000,50000,44000,38000,39000,42000,60000,39000,44000,45000]
})

We’ve called this DataFrame `sales_data`. It contains dummy sales data for 11 people.

Let’s print the data and take a look.

print(sales_data)

OUT:

       name region  sales  expenses
0   William   East  50000     42000
1      Emma  North  52000     43000
2     Sofia   East  90000     50000
3    Markus  South  34000     44000
4    Edward   West  42000     38000
5    Thomas   West  72000     39000
6     Ethan  South  49000     42000
7    Olivia   West  55000     60000
8      Arun   West  67000     39000
9     Anika   East  65000     44000
10    Paulo  South  67000     45000

As you can see, there are 11 rows and 4 columns (`name`, `region`, `sales`, and `expenses`).

Notice one thing.

On the far left hand side of the DataFrame is a column of integers starting at 0. This is the default index.

If you want, you can actually examine the index with this code:

print(sales_data.index)

OUT:

RangeIndex(start=0, stop=11, step=1)

The default index is something called a `RangeIndex`. Don’t let this confuse you … that just means that the index is the “range” of integers starting at 0 and ending at 11 (excluding 11).

Ok. Now we’re ready for some examples.

In this example, we’re going to reset the index of our Pandas DataFrame.

But before we do that, we’re going to set the index first.

Here, we’re going to set the index to the `name` variable.

We’ll do this with the Pandas set index method.

sales_data.set_index('name', inplace = True)

And let’s print out the data:

print(sales_data)

OUT:

         region  sales  expenses
name
William    East  50000     42000
Emma      North  52000     43000
Sofia      East  90000     50000
Markus    South  34000     44000
Edward     West  42000     38000
Thomas     West  72000     39000
Ethan     South  49000     42000
Olivia     West  55000     60000
Arun       West  67000     39000
Anika      East  65000     44000
Paulo     South  67000     45000

Notice that in the printout above, the “`name`” column is actually set off to the side, separate from the regular columns. That’s because `name` is now the “index” of the DataFrame.

We can also manually look at the index by accessing the `index` attribute:

print(sales_data.index)

OUT:

Index(['William', 'Emma', 'Sofia', 'Markus', 'Edward', 'Thomas', 'Ethan', 'Olivia', 'Arun', 'Anika', 'Paulo'], dtype='object', name='name')

As you can see, the index values are the “names” now.

Ok. Next, we’ll use reset_index to undo that operation.

Now, we’ll use reset_index to reset the index.

sales_data.reset_index()

OUT:

       name region  sales  expenses
0   William   East  50000     42000
1      Emma  North  52000     43000
2     Sofia   East  90000     50000
3    Markus  South  34000     44000
4    Edward   West  42000     38000
5    Thomas   West  72000     39000
6     Ethan  South  49000     42000
7    Olivia   West  55000     60000
8      Arun   West  67000     39000
9     Anika   East  65000     44000
10    Paulo  South  67000     45000

Notice that in the output, `name` has been returned to the columns.

`name` is no longer the index in this output.

Instead, the output shows the range of integers starting at 0 as the new index.

Before we move on, I want to make one other point. If you print out the `sales_data` DataFrame, you’ll notice that it still has `name` as the index.

Why? Didn’t we just use reset_index to undo the index?

Remember that by default, the `inplace` parameter is set to `inplace = False`. As I explained earlier in the syntax section, this means that by default, reset_index creates a *new* DataFrame. It does not change the original.

However, we can modify that behavior. I’ll show you how in another example.

Next, we’ll use the `drop` parameter to delete the index completely.

Before we do this, we’re going to recreate the DataFrame.

If you already have the DataFrame with `name` as the index, you can skip this part.

sales_data = pd.DataFrame({
    "name":["William","Emma","Sofia","Markus","Edward","Thomas","Ethan","Olivia","Arun","Anika","Paulo"]
    ,"region":["East","North","East","South","West","West","South","West","West","East","South"]
    ,"sales":[50000,52000,90000,34000,42000,72000,49000,55000,67000,65000,67000]
    ,"expenses":[42000,43000,50000,44000,38000,39000,42000,60000,39000,44000,45000]})

sales_data.set_index('name', inplace = True)

Ok. Now we’re going to reset the index and *delete* it altogether.

sales_data.reset_index(drop = True)

OUT:

   region  sales  expenses
0    East  50000     42000
1   North  52000     43000
2    East  90000     50000
3   South  34000     44000
4    West  42000     38000
5    West  72000     39000
6   South  49000     42000
7    West  55000     60000
8    West  67000     39000
9    East  65000     44000
10  South  67000     45000

Notice that in the output, the index has been reset to the integer index.

Moreover, the `name` variable is completely gone.

By setting `drop = True`, we caused the reset_index method to "drop" (i.e., delete) the variable.

Next, we’ll reset a specific level of the index.

To do this, we’ll need a DataFrame with a multi-level index.

That being said, let’s first create our data.

sales_data = pd.DataFrame({
    "name":["William","Emma","Sofia","Markus","Edward","Thomas","Ethan","Olivia","Arun","Anika","Paulo"]
    ,"region":["East","North","East","South","West","West","South","West","West","East","South"]
    ,"sales":[50000,52000,90000,34000,42000,72000,49000,55000,67000,65000,67000]
    ,"expenses":[42000,43000,50000,44000,38000,39000,42000,60000,39000,44000,45000]})

And we'll set the index with multiple variables, `name` and `region`:

sales_data.set_index(['name', 'region'], inplace = True)

And let’s print it:

print(sales_data)

OUT:

                sales  expenses
name    region                 
William East    50000     42000
Emma    North   52000     43000
Sofia   East    90000     50000
Markus  South   34000     44000
Edward  West    42000     38000
Thomas  West    72000     39000
Ethan   South   49000     42000
Olivia  West    55000     60000
Arun    West    67000     39000
Anika   East    65000     44000
Paulo   South   67000     45000

Notice that this DataFrame has two index levels: `name` and `region`. You can also examine the index with `print(sales_data.index)`.
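The exact printed repr of a MultiIndex varies across Pandas versions, so rather than rely on it, here's a minimal sketch (with a small, hypothetical two-row DataFrame) that inspects the multi-level index programmatically:

```python
import pandas as pd

# A small, hypothetical DataFrame with a two-level index
df = pd.DataFrame({"name": ["Ann", "Ben"],
                   "region": ["East", "West"],
                   "sales": [10, 20]})
df.set_index(["name", "region"], inplace=True)

# The index is now a MultiIndex with two named levels
print(type(df.index).__name__)   # MultiIndex
print(list(df.index.names))      # ['name', 'region']
print(df.index.nlevels)          # 2
```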

Ok. Now that we have our DataFrame, let's reset the `region` portion of the index. We'll do this by setting `level = 'region'`.

sales_data.reset_index(level = 'region')

OUT:

         region  sales  expenses
name                            
William    East  50000     42000
Emma      North  52000     43000
Sofia      East  90000     50000
Markus    South  34000     44000
Edward     West  42000     38000
Thomas     West  72000     39000
Ethan     South  49000     42000
Olivia     West  55000     60000
Arun       West  67000     39000
Anika      East  65000     44000
Paulo     South  67000     45000

If you inspect the output (or print out the actual index), you'll see that region has been "reset" to one of the columns. But, *only* the `region` variable has been reset. `name` is still in the index.
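Note that, per the Pandas documentation, the `level` parameter also accepts integer positions (and lists of levels), not just a single name. A quick sketch with a small, hypothetical two-level DataFrame:

```python
import pandas as pd

# A small, hypothetical DataFrame with a two-level index
df = pd.DataFrame({"name": ["Ann", "Ben"],
                   "region": ["East", "West"],
                   "sales": [10, 20]}).set_index(["name", "region"])

# level can be a name ...
out_by_name = df.reset_index(level="region")

# ... or an integer position (0 is 'name', the outermost level)
out_by_position = df.reset_index(level=0)

print(list(out_by_name.columns))   # ['region', 'sales']
print(out_by_position.index.name)  # region
```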

Finally, let’s reset the index “in place.”

Remember from earlier in the tutorial, when I explained the `inplace` parameter: by default, reset_index does *not* modify the original DataFrame. It simply creates a new DataFrame.

But, we can change that behavior and cause reset_index to directly modify the original DataFrame by setting `inplace = True`.

Before we do that, let's quickly re-create our data, so that it's structured properly. We'll re-create the DataFrame and set the index with multiple variables, `name` and `region`:

sales_data = pd.DataFrame({
    "name":["William","Emma","Sofia","Markus","Edward","Thomas","Ethan","Olivia","Arun","Anika","Paulo"]
    ,"region":["East","North","East","South","West","West","South","West","West","East","South"]
    ,"sales":[50000,52000,90000,34000,42000,72000,49000,55000,67000,65000,67000]
    ,"expenses":[42000,43000,50000,44000,38000,39000,42000,60000,39000,44000,45000]})

sales_data.set_index(['name', 'region'], inplace = True)

Now we have `sales_data` with `name` and `region` as the index.

Ok. Let’s reset the index “in place.”

sales_data.reset_index(inplace = True)

And let’s print it out:

print(sales_data)

OUT:

       name region  sales  expenses
0   William   East  50000     42000
1      Emma  North  52000     43000
2     Sofia   East  90000     50000
3    Markus  South  34000     44000
4    Edward   West  42000     38000
5    Thomas   West  72000     39000
6     Ethan  South  49000     42000
7    Olivia   West  55000     60000
8      Arun   West  67000     39000
9     Anika   East  65000     44000
10    Paulo  South  67000     45000

When we print out `sales_data`, you can see that `name` and `region` are regular columns again, and the index is the range of numbers from 0 to 10.

In this example, by setting `inplace = True`, we caused the Pandas reset_index method to directly modify the DataFrame in question.

Now that you’ve learned about reset index and seen some examples, let’s review some frequently asked questions about the reset index method.

**Frequently asked questions:**

If you tried to use reset index and it didn't change your DataFrame, you probably didn't use `inplace = True`.

I explain this in the syntax section. By default, the `inplace` parameter is set to `inplace = False`. That causes reset_index to create a *new* DataFrame as an output. When `inplace = False` – which is the default behavior – the reset_index method will leave the original DataFrame unchanged.

To fix this behavior, and to directly modify the original DataFrame, you probably need to set `inplace = True`.

You can see an example of this in example 4.

Do you have other questions about the reset index method in Python?

Leave your questions in the comments section below.

If you’re serious about learning Pandas, you should enroll in our premium Pandas course called *Pandas Mastery*.

Pandas Mastery will teach you everything you need to know about Pandas, including:

- How to subset your Python data
- Data aggregation with Pandas
- How to reshape your data
- and more …

Moreover, it will help you completely *master* the syntax within a few weeks. You’ll discover how to become “fluent” in writing Pandas code to manipulate your data.

Find out more here:

Learn More About Pandas Mastery

The post How to Use Pandas Reset Index appeared first on Sharp Sight.

The post How to make a Seaborn histogram with the distplot function appeared first on Sharp Sight.

It will explain the syntax and also show you clear, step-by-step examples of how to use sns.distplot.

The tutorial is divided up into several different sections. You can click on one of the following links to go to the appropriate section.

**Table of Contents:**

- A quick introduction to histograms and distplots
- A review of histograms and density plots in Seaborn
- The syntax of `sns.distplot()`
- Examples of how to use sns.distplot
- Frequently asked questions about Seaborn histograms and Seaborn distplots

That said, if you’re new to data visualization in Python or new to using Seaborn, I recommend that you read the entire tutorial.

When we’re doing data science, one of the most common tasks is visualizing data distributions.

Frequently, we want to understand how our data are distributed as part of exploratory data analysis.

Sometimes we explore data to find out how it’s structured (i.e., when we first get a dataset).

Other times, we need to explore data distributions to answer a question or validate some hypothesis about the data.

Examining data distributions is also *very* common in machine learning, since many machine learning techniques assume that the data are distributed in particular ways.

There are two primary ways to examine data distributions: the histogram and the density plot.

Histograms are arguably the most common tool for examining data distributions.

In a typical histogram, we map a numeric variable to the x axis.

The x axis is then divided up into a number of “bins” … for example, there might be a bin from 10 to 20, the next bin from 20 to 30, the next from 30 to 40, and so on.

When we create a histogram (or use software to create a histogram) we count the number of observations in each bin.

Then we plot a bar for each bin. The length of the bar corresponds to the number of records that are within that bin on the x-axis.

Ultimately, a histogram contains a group of bars that show the “height” of the data (i.e., the count of the data) for different values our numeric variable.

The histogram shows us how a variable is distributed.
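The bin-and-count procedure described above can be sketched directly with Numpy's histogram function (the data values and bin edges here are just illustrative):

```python
import numpy as np

# A hypothetical numeric variable
data = np.array([12, 15, 18, 22, 27, 31, 35, 38])

# Divide the x axis into bins (10-20, 20-30, 30-40) and count
# the number of observations that fall into each bin
counts, edges = np.histogram(data, bins=[10, 20, 30, 40])

print(counts)  # [3 2 3]
```

Each count corresponds to the height of one bar in the histogram.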

The other primary tool for evaluating data distributions is the density plot.

There are a variety of methods for creating density plots, but one of the most common is called "kernel density estimation." The plot that we generate when we use kernel density estimation is called a "kernel density estimation plot." These are also known as "KDE plots" for short.

KDE plots (i.e., density plots) are very similar to histograms in terms of how we use them. We use density plots to evaluate how a numeric variable is distributed.

The main differences are that KDE plots use a smooth line to show distribution, whereas histograms use bars. So KDE plots show *density*, whereas histograms show count.

Now that I’ve explained histograms and KDE plots generally, let’s talk about them in the context of Seaborn.

Seaborn has two different functions for visualizing univariate data distributions – `seaborn.kdeplot()` and `seaborn.distplot()`.

In this tutorial, we’re really going to talk about the distplot function.

Technically, Seaborn does not have its own function to create histograms.

Instead, it has the `seaborn.distplot()` function.

The distplot function creates a combined plot that contains both a KDE plot *and* a histogram.

At least, that’s the default behavior.

You can use the distplot function to create a chart with *only* a histogram or only a KDE plot.

I’ll show you how to do both in the examples section, but to understand *how* you need to understand the syntax.

That being the case, let’s take a look at the syntax of the seaborn.distplot function.

The technical name of the function is seaborn.distplot, but it's a very common convention to call the function with the code sns.distplot. That's the convention we'll be using going forward … we're going to call the function as `sns.distplot()`.

(Remember, to use the `sns.` prefix, you need to import Seaborn with the code `import seaborn as sns`.)

In the simplest version of the syntax, you just call the function `sns.distplot()`, and provide the name of a DataFrame variable or list inside of the parentheses.

This will create a simple combined histogram/KDE plot.

However, the function can be used in more complex ways, if you use some extra parameters.

Let’s take a look at a few important parameters of the sns.distplot function.

The sns.distplot function has about a dozen parameters that you can use. However, you won’t need most of them.

That being the case, we’re going to focus on a few of the most common parameters for sns.distplot:

- `color`
- `kde`
- `hist`
- `bins`

Let’s take a closer look at each of them.

`color`

The `color` parameter does what it sounds like: it changes the color of the KDE plot and the histogram.

You can use a “named” color from Python, like red, green, blue, darkred, etc.

You can also use hexadecimal colors. Hex colors are beyond the scope of this post. They're fairly easy once you get the hang of them, but in the interest of simplicity I'm not going to explain them here.

`kde`

The `kde` parameter enables you to turn the KDE plot on and off in the output.

This parameter accepts a boolean value as an argument (i.e., `True` or `False`).

By default, the `kde` parameter is set to `kde = True`. That means that by default, the `sns.distplot` function will include a kernel density estimate of your input variable.

If you manually set `kde = False`, then the function will *remove* the KDE plot.

`hist`

The `hist` parameter controls whether or not a histogram will appear in the output.

This parameter accepts a boolean input.

By default, it is set to `hist = True`, which means that by default, the output plot will include a histogram of the input variable.

If you set `hist = False`, the function will *remove* the histogram from the output.

`bins`

The `bins` parameter enables you to control the number of bins in the output histogram.

If you do not set a value for the `bins` parameter, the function will automatically compute an appropriate number of bins.
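Seaborn's exact rule for choosing the default number of bins is an internal implementation detail, but Numpy exposes a comparable automatic bin-selection mechanism you can experiment with. This is just an illustration of the idea, not seaborn's actual algorithm:

```python
import numpy as np

np.random.seed(1)
data = np.random.normal(size=200)

# Let numpy pick bin edges automatically ('auto' chooses a heuristic rule)
auto_edges = np.histogram_bin_edges(data, bins="auto")
print(len(auto_edges) - 1)  # number of automatically chosen bins

# Or fix the bin count explicitly, as with the bins parameter of sns.distplot
counts, _ = np.histogram(data, bins=25)
print(len(counts))  # 25
```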

Now that you’ve learned about the syntax and parameters of sns.distplot, let’s take a look at some concrete examples.

Here, we’re going to take a look at several examples of the distplot function.

That will include creating a combination histogram/KDE, as well as individual histograms or KDE plots (without the other).

**Examples:**

- How to create a Seaborn distplot
- Change the color
- Create a Seaborn histogram
- Change the number of bins in the Seaborn histogram
- Create a density plot with Seaborn

Before you run any of the code for these examples, you’ll need to run some preliminary code.

Specifically, you’ll need to import a few packages, set the plot background formatting, and create a DataFrame.

First, you need to import two packages, Numpy and Seaborn.

We’ll use Numpy to create a normally distributed dataset that we can plot, and we’ll obviously need Seaborn in order to use the distplot function.

import numpy as np
import seaborn as sns

We'll also set the chart formatting using the `sns.set_style()` function.

Depending on your Python settings, Seaborn charts can have the same format as matplotlib charts. Frankly, the matplotlib formatting is a little ugly. Seaborn gives us some better options.

The two options I like best are `darkgrid` and `dark`. I frequently use `darkgrid` for other Seaborn charts, but I prefer `dark` when I use distplot. That's because the lines and histogram bars from distplot are a little transparent, and the gridlines from `darkgrid` tend to distract from the plot.

#sns.set_style('darkgrid')
sns.set_style('dark')

Here, we’re going to create a simple, normally distributed Numpy array.

We’ll create this array by using the np.random.normal function.

np.random.seed(42)
normal_data = np.random.normal(size = 300, loc = 85, scale = 3)

Using the `loc` parameter and `scale` parameter, we've created this data to have a mean of 85, and a standard deviation of 3.

We'll be able to see some of these details when we plot it with the `sns.distplot()` function.
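Before plotting, you can sanity-check the simulated data with Numpy's own summary functions. The sample statistics should land close to, but not exactly at, the parameters we specified:

```python
import numpy as np

np.random.seed(42)
normal_data = np.random.normal(size = 300, loc = 85, scale = 3)

# The sample mean and standard deviation should be near 85 and 3
print(normal_data.mean())
print(normal_data.std())
```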

First, we’re going to create a distplot with Seaborn.

Remember that by default, the sns.distplot function includes both a histogram and a KDE plot.

Let’s just run the code and take a look at the output.

Here’s the code:

sns.distplot(normal_data)

And here’s the output.

Overall, the distplot shows us how the data are distributed. Remember that when we created the data, we created it to have a mean of 85 and a standard deviation of 3.

Although the standard deviation is a little difficult to see precisely from the plot, the plot certainly shows that the mean of the data is roughly around 85.

The histogram part of the plot gives us a slightly granular view of how the data are distributed. We can roughly see the relative counts within each “bin” of the x axis.

The KDE line (the smooth line) smooths over some of the rough details and provides a smooth distribution line that we can examine.

I don’t want to get too deep into the weeds concerning how we can use this plot for data analysis …. that’s beyond the scope of the post.

The ultimate point is that this is fairly easy to create. We simply call the function and provide the name of the variable that we want to plot inside of the parentheses.

Next, we’re going to change the color of the plot.

By default, the color is a sort of medium blue color.

Here, we're going to change the color to "navy." To do this, we'll set the `color` parameter to `color = 'navy'`.

sns.distplot(normal_data, color = 'navy')

OUT:

Notice in this chart that the color has been changed to a darker shade of blue.

Also notice, however, that although the KDE line is a dark navy color, the histogram is still a little light.

That’s because the histogram is set to be slightly transparent. Technically, the histogram *is* colored navy, but it’s just a little transparent.

Now, let’s create a Seaborn histogram.

To do this, we're going to call the distplot function and we're going to *remove* the KDE line by setting the `kde` parameter to `kde = False`.

sns.distplot(normal_data, kde = False)

Here’s the output:

This is pretty straightforward. By setting `kde = False`, we're telling the sns.distplot function to remove the KDE line. This leaves only the histogram in its place.

At this point, a quick comment: I think it's debatable whether or not you should create a pure Seaborn histogram without the KDE line.

When I first started using the distplot function, I wanted to create histograms in Seaborn (without the KDE line).

After using it for a while, I actually prefer the distplot that contains both the histogram *and* the KDE line.

Try them out and see which you prefer.

Let’s quickly change the number of bins in the histogram.

Here, we’re still going to remove the KDE line in the plot, and we’ll create the underlying histogram with 50 bins.

sns.distplot(normal_data, kde = False, bins = 50)

OUT:

Here, we’ve simply created a Seaborn histogram with 50 bins.

The increased number of bins shows more granularity in the data distribution.

Seeing an increased number of bins can actually help when there’s a lot of variation at small scales or when we’re looking for unusual features in the data distribution (like a spike in a particular location).

Having said that, as an analyst or data scientist, you need to learn when to use a large number of bins, and when to use a small number.

There’s a bit of an art to choosing the right number of bins, and it takes practice.

Finally, let’s just plot a KDE line without the underlying histogram.

We can do this by calling the distplot function and setting the `hist` parameter to `hist = False`.

sns.distplot(normal_data, hist = False)

OUT:

Although I think it can be useful to have the combined KDE/histogram plot, I also like the lone KDE line, as seen here.

I think that this would be particularly useful if you had a large number of variables that you needed to plot (perhaps inside of a small multiple chart).

If you needed to plot a dozen or more distributions, for example, it might be better just to see the KDE line. If you’re plotting a large number of variables, a pure KDE line might be less distracting and easier to read at a glance.

That said, I think there’s an element of preference here as well. Play around with these and see which options you like best.

Now that you’ve learned about Seaborn histograms and distplots and seen some examples, let’s review some frequently asked questions.

**Frequently asked questions:**

You’ve probably noticed that by default, the histogram in the distplot is a little transparent.

That’s the default setting.

How do you make it more opaque?

You actually need to use a parameter from matplotlib (the `alpha` parameter). Moreover, you need to call this in a special way. You need to use the `hist_kws` parameter from sns.distplot to access the underlying matplotlib parameter.

Here’s some code that shows how:

sns.distplot(normal_data
             ,kde = False
             ,hist_kws = {"alpha": 1}
             )

OUT:

Here, the code `hist_kws = {"alpha": 1}` is accessing the `alpha` parameter from matplotlib, and setting `alpha` equal to 1.

Notice that the output histogram is fully opaque.

Seaborn actually has two functions to plot the distribution of a variable: sns.distplot and sns.kdeplot.

What’s the difference?

They are almost the same.

The KDE line in a distplot plot is exactly the same as the KDE line from sns.kdeplot. The only difference is that sns.distplot includes a histogram.

If you call `sns.distplot(my_var, hist = False)`, then the output will be identical to `sns.kdeplot(my_var)`.

Do you have other questions about using the sns.distplot function to create a Seaborn histogram, or a visualization of a distribution?

Leave your question in the comments section at the bottom of the page.

The examples you’ve seen in this tutorial should be enough to get you started, but if you’re serious about learning Seaborn, you should enroll in our premium course called *Seaborn Mastery*.

There’s a lot more to learn about Seaborn, and *Seaborn Mastery* will teach you everything, including:

- How to create essential data visualizations in Python
- How to add titles and axis labels
- Techniques for formatting your charts
- How to create multi-variate visualizations
- How to think about data visualization in Python
- and more …

Moreover, it will help you completely *master* the syntax within a few weeks. You’ll discover how to become “fluent” in writing Seaborn code.

Find out more here:

Learn More About Seaborn Mastery

The post How to make a Seaborn histogram with the distplot function appeared first on Sharp Sight.

The post Numpy where explained appeared first on Sharp Sight.

I'll explain what np.where is and also how the syntax of np.where works.

Later in the tutorial, I’ll show you clear, step-by-step examples of how the function works, so you can see it in action.

If you need to find something specific, the following links will take you to the appropriate section in the tutorial.

**Table of Contents:**

On the other hand, if you really want to understand how Numpy where works, I recommend that you read the whole tutorial.

Let’s start off by quickly reviewing what Numpy where does.

According to the official documentation, the “Numpy where” function returns elements based on some logical condition.

Does that make sense to you?

Me neither.

Unfortunately, the Numpy where function is a little confusing, and many of the online tutorials and explanations do very little to clear things up. (In fact, a lot of online documentation about Numpy is very confusing.)

Let’s fix that.

I’m going to clarify what Numpy where actually does.

To really understand how Numpy where works, you need to understand the syntax first.

Once you understand the syntax, you’ll be able to look at simple examples and the examples will begin to make sense.

The syntax of the `np.where()` function has a few parts.

First is just the name of the function. Typically, when we call the function, we'll call it as `np.where()`.

Keep in mind that exactly how we call the function depends on how we've imported Numpy. The common convention for importing Numpy is to run the code `import numpy as np`. If we import Numpy like that, then we can use the nickname `np` as an alias for Numpy when we call the Numpy functions. Thus, if we import Numpy that way, we'll call the function as `np.where()`.

Inside of the parentheses, there are three inputs:

- condition
- output-if-true
- output-if-false

Let’s break down those inputs. Understanding those inputs is critical for understanding what the function does.

The parameters of np.where (i.e., the inputs to the function), are fairly easy to understand.

Let’s talk about them one at a time.

`condition` (required)

The `condition` is some statement or object that evaluates as `True` or `False`.

For example, `condition` could simply be a Numpy array with boolean values.

More often though, `condition` is some comparison operation or logical test that operates on a Numpy array.

For example, if we have an array `b` with several elements, our `condition` could be the comparison operation `b > 0`. In this case, the condition `b > 0` would evaluate as `True` or `False` for every element of the array. These True/False values from `condition` then influence the output of np.where.
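To see what a condition produces on its own, before np.where gets involved, you can evaluate the comparison directly. (The array `b` here is just a hypothetical example.)

```python
import numpy as np

b = np.array([-2, -1, 0, 1, 2])

# The comparison operates element-wise and yields a boolean array
mask = b > 0
print(mask)  # [False False False  True  True]
```

This boolean array is exactly what np.where receives as its `condition` input.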

`output-if-true`

This is the output of np.where if the `condition` is `True`.

This could be a single value, in which case, that value will be the output whenever `condition` is `True`.

But this can also be an array or array-like object, such as a list. If it's an array-like object, the output of np.where will be the item in the `output-if-true` array that corresponds to the positions in `condition` that are `True`.

If that sounds confusing, then just sit tight. I’ll show concrete examples in the examples section.

`output-if-false`

This is the output of np.where if the `condition` is `False`.

Again, this could be a single value, in which case, that value will be the output whenever `condition` is `False`.

But this can also be an array or array-like object, such as a list. If it's an array-like object, the output of np.where will be the item in the `output-if-false` array that corresponds to the positions in `condition` that are `False`.

I realize that this syntax explanation might still be a little confusing.

In my opinion, the best way to really understand the syntax of np.where and how it works, is to look carefully at some examples.

Here, we’re going to look at several examples of the Numpy where function.

To help you understand, we’re going to start very, very simple, and then increase the complexity.

If you really want to understand how numpy.where works, you should start with the first example and work through them all. (You can obviously read the explanation, but it’s probably good to run the code too).

**Examples:**

- A super simple example of np.where
- Conditionally output ‘yes’ or ‘no’
- Take output from a list if true, 0 if false
- Take output from one list if true, take output from a different list if false

Before you run any of the following examples, you’ll need to import Numpy. So run this code first!

import numpy as np

This code will enable us to call Numpy functions with the prefix `np`.

Ok. Let’s get started with the examples.

Ok. In this example, we're going to start *very* simple.

We’re going to create a simple 1D Numpy array, and use a simple comparison as our condition.

Let’s first create a simple 1-dimensional Numpy array.

range_1d = np.arange(start = 1, stop = 5)

And let’s print it out, so you can see it:

print(range_1d)

OUT:

[1 2 3 4]

This is really simple. The `range_1d` array is just a Numpy array with the values 1 to 4.

Now, we’re going to use np.where to find the values greater than 2.

To do this, we'll call `np.where()`.

Inside of the function, we'll have a condition that will test if the elements are greater than 2. Then we'll output `True` if the value is greater than 2, and `False` if the value is not greater than 2.

Here’s the code:

np.where(range_1d > 2, True, False)

And here is the output:

array([False, False, True, True])

So what happened here?

Let's go back to the structure of the input array, `range_1d`.

The array `range_1d` contains the values `[1,2,3,4]`.

Inside of the np.where function, we have a condition that tests every element of `range_1d` to evaluate if the element is greater than 2.

Evaluating that condition for every element of `range_1d` will produce a boolean array with values `True` or `False`.

Those true or false values dictate which output np.where will produce.

Here, we’ve kept it simple.

In this case, the np.where function outputs `True` if the condition evaluates as `True`, and it outputs `False` if the condition evaluates as `False`.

But, we still could have more control over the exact outputs.

Let’s take a look at how to output something different in the next example.

Next, we’re going to create a minor modification to example 1.

Remember that in example 1, we tested a simple condition and then outputted `True` if the condition evaluated as true and outputted `False` if the condition evaluated as false.

Let's change that very slightly.

Here, we're going to output `'Yes'` if the condition evaluates as true and output `'No'` if the condition evaluates as false.

(Note that we're going to use the dataset we created in example 1, so if you didn't run that example, go back and create the `range_1d` dataset.)

Ok. Here’s the code for example 2:

np.where(range_1d > 2, 'Yes', 'No')

And here’s the output:

array(['No', 'No', 'Yes', 'Yes'])

(Note that the output is a special type of Numpy array with `dtype='<U3'`.)

Ok. What happened?

This example is almost exactly the same as example 1.

Just like in example 1, we're testing the condition `range_1d > 2`.

Remember: the dataset `range_1d` has the values `[1,2,3,4]`.

The major difference in this example is the output.

If the condition `range_1d > 2` is `True`, then np.where outputs `'Yes'`.

If the condition `range_1d > 2` is `False`, then np.where outputs `'No'`.

The way that numpy.where is working in this example looks something like this.

Do you see what’s going on here?

Numpy where simply tests a condition … in this case, a comparison operation on the elements of a Numpy array.

If the condition is `True`, we output one thing, and if the condition is `False`, we output another thing.

In this example, we’re going to build on examples 1 and 2.

(That means that we'll still be using the `range_1d` dataset that we made in example 1. If you haven't created that dataset, go back and do that now.)

The array `range_1d` contains the values `[1,2,3,4]`.

Just like in examples 1 and 2, our condition will test if `range_1d > 2`. That test will operate on every element of `range_1d`.

But in this example, the output will be a little different.

If the condition `range_1d > 2` is `False`, np.where will output the value 0.

But if the condition `range_1d > 2` is `True`, numpy.where will pull the output value from the values in `range_1d`.

Let’s run the code and take a look.

Here’s the code:

np.where(range_1d > 2, range_1d, 0)

And here’s the output:

array([0, 0, 3, 4])

This is really simple, once you get it (although I strongly recommend that you read example 1 and example 2 first).

Once again, as in all cases of np.where, the behavior of this code hinges on the condition.

Here, the Numpy where function tests `range_1d > 2`.

If that condition is true for a particular element, np.where outputs the corresponding value from `range_1d`. It does this element-wise … so if the condition is true for the element at index 3, it outputs element 3 from `range_1d`.

But if the condition is false, it outputs 0.

So for any value in `range_1d` that's less than or equal to 2, np.where outputs 0; otherwise, it outputs the value in `range_1d`.

This is almost the same as examples 1 and 2.

There’s a test condition, and then one output if true, and a different output if false. That’s what Numpy where does!

Ok.

One last example to drive this home.

Here, we're going to use the exact same condition. We'll test if `range_1d > 2`.

But if the output is true, we’ll take the output (element-wise) from one list of numbers. If the condition is false, we’ll take the output from a different list of numbers.

Let’s run the code and look at the output.

np.where(range_1d > 2, [10,20,30,40], [-10,-20,-30,-40])

OUT:

array([-10, -20, 30, 40])

So what happened in this example?

It’s almost exactly the same as the previous examples!

We’re testing a condition, and then taking the output from one group of numbers if true, and taking the output from a different set of numbers if false.

Note that this happens *element-wise*, meaning that if the condition is false at position 0 of `range_1d > 2`, it will output the 0th element from the second list. If the condition is true for the test at position 3, it will output the value at position 3 from the first list.

Here in example 4, we're just testing a condition, and then outputting values element-wise from different groups of numbers depending on whether the condition is true or false.

All of the examples shown so far use 1-dimensional Numpy arrays. That’s intentional. I wanted to use a simple array as an input to make the examples extremely easy to understand.

However, everything that I’ve shown here extends to 2D and 3D Numpy arrays (and beyond).

Moreover, the conditions in this example were very simple. Having said that, you can use very complicated test conditions in Numpy where.

As always, I recommend that you learn how this works by using simple examples, and then increase the complexity to improve your understanding.
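As a quick sketch of both points, here's np.where applied to a 2D array with a compound condition, built with `&` (Numpy's element-wise "and" operator):

```python
import numpy as np

array_2d = np.array([[1, 5, 9],
                     [4, 7, 2]])

# Keep values strictly between 2 and 8; replace everything else with 0
result = np.where((array_2d > 2) & (array_2d < 8), array_2d, 0)

print(result)
# [[0 5 0]
#  [4 7 0]]
```

The output keeps the shape of the input array, with the condition evaluated element-wise across all dimensions.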

Now that you’ve learned about Numpy where and seen some examples, let’s review some frequently asked questions about the function.

**Frequently asked questions:**

- Can you use np.where with only the condition?

The short answer is “yes.”

In all of the examples in the examples section, we use all three parameters: `condition`, `output-if-true`, and `output-if-false`.

But it’s possible to run np.where with only `condition`, and omit `output-if-true` and `output-if-false`.

If you do this, Numpy where will simply output the index positions of the elements for which `condition` is `True`.

**Example:**

np.where(range_1d > 2)

OUT:

(array([2, 3]),)

Note that the output in this case is a tuple.
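Here’s a runnable sketch of the condition-only form. Note that `range_1d` isn’t defined in this excerpt; I’m assuming it’s `np.arange(1, 5)`, which matches the outputs shown above.

```python
import numpy as np

# assumed definition of range_1d (matches the outputs shown above)
range_1d = np.arange(1, 5)    # array([1, 2, 3, 4])

result = np.where(range_1d > 2)
print(result)       # (array([2, 3]),)

# index into the tuple to get the bare array of index positions
print(result[0])    # [2 3]
```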

Do you have other questions about Numpy where?

I know that it’s a little confusing for beginners …

So if you have a question, leave your question in the comments section at the bottom of the page.

The examples you’ve seen in this tutorial should be enough to get you started, but if you’re serious about learning Numpy, you should enroll in our premium course called *Numpy Mastery*.

There’s a lot more to learn about Numpy, and *Numpy Mastery* will teach you everything, including:

- How to create Numpy arrays
- How to use the Numpy random functions
- What the “Numpy random seed” function does
- How to reshape, split, and combine your Numpy arrays
- and more …

Moreover, it will help you completely *master* the syntax within a few weeks. You’ll discover how to become “fluent” in writing Numpy code.

Find out more here:

Learn More About Numpy Mastery

The post Numpy where explained appeared first on Sharp Sight.

The post How to use the Pandas sort_values method appeared first on Sharp Sight.

The sort_values method is fairly straightforward to use, but this tutorial will explain everything step by step.

I’ll explain what the sort values method does. I’ll explain the syntax. And then I’ll provide clear, step-by-step examples of how the sort_values method works.

If you need something specific, you can click on any of the following links and it will take you to the appropriate section.

**Table of Contents:**

- A quick introduction to Pandas sort_values
- The syntax of Pandas sort_values
- Pandas sort_values examples
- Pandas sort_values FAQ

Again, if you’re looking for something specific, you can just click on one of the links.

But if you’re new to Pandas and not really sure how to do data manipulation in Python, you should really read the whole tutorial.

Ok. Let’s take a high level look at sort_values.

The sort_values method is a data manipulation tool from the Pandas package.

If you’re not completely familiar with it, the Pandas package is a data manipulation toolkit for the Python programming language.

Pandas has a few dozen tools for manipulating data. There are tools for renaming variables, subsetting rows of data, selecting DataFrame columns, and a variety of other data manipulation tasks. Pandas is really an integrated toolkit for cleaning and shaping data.

One of the most common data manipulation tasks is sorting data, and Pandas has a tool for that as well.

The sort_values method is a Pandas method for sorting the rows of a DataFrame by the values in one or more columns.

That’s really all it does!

But there are a few details about how the function works that you should know about.

To really understand the details of the sort_values method, you need to understand the syntax.

Let’s take a look.

The syntax of the Pandas sort_values method is fairly straightforward.

When you call the method, you first need to type the name of your DataFrame. That means that before you call the method, you first need an actual DataFrame. For more information about DataFrames, I recommend our tutorial about Pandas DataFrames.

After you type the name of the DataFrame, you need to type a dot (“`.`”) and then the name of the method: `sort_values`.

Then, inside of the parentheses, there will be a group of arguments and parameters that enable you to specify exactly how to sort the data.

Let’s look at the parameters of sort_values more carefully.

The sort_values method actually has several parameters, but we’re going to focus on three:

- `by`
- `ascending`
- `inplace`

These three parameters are the most common, and they are the ones you will use most often.

The method also has three other parameters – `axis`, `kind`, and `na_position`. These are less common, so we’re not going to look at them in detail (although I will mention `axis` in the FAQ section towards the end).

Ok … I’m going to quickly explain the `by`, `ascending`, and `inplace` parameters.

`by`

The `by` parameter specifies the column or columns to sort by.

You can provide a name of an individual column or a list of names.

When you provide a single column name, you need to provide it as a string (i.e., the column name needs to be enclosed inside of quotation marks).

When you provide several column names, you need to organize them inside of a list.

I’ll show you examples of both of these in the examples section of the tutorial.

Note that this is required. You must provide an argument to this parameter.

`ascending`

The `ascending` parameter enables you to specify whether you want to sort the DataFrame in ascending order or descending order.

If you set `ascending = True`, then sort_values will sort the data in ascending order (i.e., from lowest to highest).

If you set `ascending = False`, then sort_values will sort the data in descending order (i.e., from highest to lowest).

This parameter is *not* required. If you don’t use this parameter, it will default to `ascending = True` and sort your DataFrame in ascending order.

`inplace`

This parameter specifies whether or not you want to sort the DataFrame “in place.”

By default, this parameter is set to `inplace = False`.

When `inplace = False`, the method will create a new DataFrame as the output, and leave the original DataFrame unchanged.

As a side note, that’s actually a good thing, because that way you won’t accidentally overwrite your data!

However, if you set `inplace = True`, the Pandas sort values method will directly sort the original DataFrame. Be careful! This means that when you set `inplace = True`, sort_values will overwrite your original dataset!

As I just mentioned, the sort_values method outputs a new Pandas DataFrame by default. However, if you set `inplace = True`, the sort_values method will directly sort the original DataFrame.
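As a side note, a common alternative to `inplace = True` is to simply reassign the output of sort_values back to the original variable name. Here’s a minimal sketch (the tiny `df` DataFrame is my own illustration, not from this tutorial):

```python
import pandas as pd

# a tiny throwaway DataFrame for illustration
df = pd.DataFrame({'sales': [300, 100, 200]})

# instead of inplace = True, reassign the sorted output to the same name
df = df.sort_values('sales')

print(df['sales'].tolist())    # [100, 200, 300]
```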

Now that you’ve learned the syntax of Pandas sort_values, let’s take a look at some examples.

**Examples:**

- Sort DataFrame by a single column
- Sort a DataFrame by multiple variables
- Arrange a Pandas DataFrame in descending order
- Sort a Pandas DataFrame “in place”

Before you run the example code, you need to make sure that you do two things.

You need to import Pandas and you need to create the DataFrame that we’re going to use.

You can run this code to import pandas:

import pandas as pd

This imports the Pandas package with the prefix `pd`.

Next, let’s create a simple DataFrame.

Here, we’re going to create a small DataFrame that contains some dummy sales data.

To do this, we’re just going to call the `pd.DataFrame()` function and provide the data that we want to turn into a DataFrame.

sales_data = pd.DataFrame({
    "name": ["William","Emma","Sofia","Markus","Edward","Thomas","Ethan","Olivia","Arun","Anika","Paulo"],
    "region": ["East","North","East","South","West","West","South","West","West","East","South"],
    "sales": [50000,52000,90000,34000,42000,72000,49000,55000,67000,65000,67000],
    "expenses": [42000,43000,50000,44000,38000,39000,42000,60000,39000,44000,45000]
})

And let’s print it out so you can see the contents:

print(sales_data)

OUT:

       name region  sales  expenses
0   William   East  50000     42000
1      Emma  North  52000     43000
2     Sofia   East  90000     50000
3    Markus  South  34000     44000
4    Edward   West  42000     38000
5    Thomas   West  72000     39000
6     Ethan  South  49000     42000
7    Olivia   West  55000     60000
8      Arun   West  67000     39000
9     Anika   East  65000     44000
10    Paulo  South  67000     45000

The DataFrame contains 11 rows and four columns. The columns are `name`, `region`, `sales`, and `expenses`.

We’ll be able to use several of these variables as our sorting variables.

For more information about how to create DataFrames like this, you can read our Pandas DataFrame tutorial.

For our first example, we’re going to start simple.

Here, we’re going to sort by a single variable: the `sales` variable.

To do this, we’ll simply call the `sort_values()` method by typing the name of the DataFrame, and then calling the method using “dot” notation. Inside of the method, we specify our sort variable by simply typing the variable name as a string (i.e., enclosed in quotation marks).

Here’s the code:

sales_data.sort_values('sales')

And here is the output when you run the code:

       name region  sales  expenses
3    Markus  South  34000     44000
4    Edward   West  42000     38000
6     Ethan  South  49000     42000
0   William   East  50000     42000
1      Emma  North  52000     43000
7    Olivia   West  55000     60000
9     Anika   East  65000     44000
8      Arun   West  67000     39000
10    Paulo  South  67000     45000
5    Thomas   West  72000     39000
2     Sofia   East  90000     50000

Notice that the output DataFrame is sorted by the `sales` variable, from the lowest value to the highest value.

Remember: by default, the `ascending` parameter is set to `ascending = True`. Because we did not manually set the `ascending` parameter, the method defaulted to `ascending = True`, so the output is sorted by `sales` in ascending order.

Next, let’s make things a little more complicated.

Here, we’re going to sort our DataFrame by multiple variables.

The syntax is almost exactly the same as the code for the previous example.

However, instead of providing a single variable to the method, we’re going to provide a *list* of variables.

Let’s take a look at the syntax and run it, and then I’ll explain more.

sales_data.sort_values(['region','sales'])

And here is the output DataFrame:

       name region  sales  expenses
0   William   East  50000     42000
9     Anika   East  65000     44000
2     Sofia   East  90000     50000
1      Emma  North  52000     43000
3    Markus  South  34000     44000
6     Ethan  South  49000     42000
10    Paulo  South  67000     45000
4    Edward   West  42000     38000
7    Olivia   West  55000     60000
8      Arun   West  67000     39000
5    Thomas   West  72000     39000

Notice that the output is sorted by two variables. It’s sorted alphabetically by `region`, and then within each region, the rows are sorted by `sales`.

This output corresponds to the syntax. When we called the method, we provided a list of two variables as the input. The method call was `.sort_values(['region','sales'])`. Notice that inside of the method call, we have a Python list that contains the two variable names.

The output is simply sorted by those two variables.

Keep in mind that here, we sorted by two variables. We could sort by three (try it yourself). Or if you have a very large DataFrame with many variables, you can sort by a very large number of variables, if you need to. Just provide the sorting variables to the method in the form of a Python list.
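If you want to try the three-variable version, here’s one possible sketch. (Choosing `expenses` as the third sort variable is my own choice; it only matters for rows that tie on both `region` and `sales`.)

```python
import pandas as pd

# the same sales_data DataFrame created in the setup section above
sales_data = pd.DataFrame({
    "name": ["William","Emma","Sofia","Markus","Edward","Thomas","Ethan","Olivia","Arun","Anika","Paulo"],
    "region": ["East","North","East","South","West","West","South","West","West","East","South"],
    "sales": [50000,52000,90000,34000,42000,72000,49000,55000,67000,65000,67000],
    "expenses": [42000,43000,50000,44000,38000,39000,42000,60000,39000,44000,45000]
})

# sort by region first, then sales, then expenses
sorted_data = sales_data.sort_values(['region', 'sales', 'expenses'])
print(sorted_data.head(3))    # the three "East" rows come first
```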

Now, let’s sort our DataFrame in descending order.

Remember that by default, the `ascending` parameter is set to `ascending = True`. That automatically sorts the data in ascending order.

To change that and sort in *descending* order, we’re going to set `ascending = False`.

Here’s the code:

sales_data.sort_values('sales', ascending = False)

And here’s the output:

       name region  sales  expenses
2     Sofia   East  90000     50000
5    Thomas   West  72000     39000
8      Arun   West  67000     39000
10    Paulo  South  67000     45000
9     Anika   East  65000     44000
7    Olivia   West  55000     60000
1      Emma  North  52000     43000
0   William   East  50000     42000
6     Ethan  South  49000     42000
4    Edward   West  42000     38000
3    Markus  South  34000     44000

Notice that in this case, the output is sorted by `sales` in descending order … i.e., from highest to lowest.

Finally, we’re going to sort our data “in place”.

That might be a little confusing, so let me quickly explain.

Notice that whenever we’ve run our sort code in the previous examples, the code creates a new DataFrame as an output. If you’re working in Spyder or a similar Python IDE, the output is just sent right to the console window.

However, if, after running the above examples, you print out the original DataFrame with the code `print(sales_data)`, you’ll find that the original DataFrame (`sales_data`) is exactly the same. Unsorted.

Syntactically, this is because the `inplace` parameter is set to `inplace = False` by default. If we do not reference the `inplace` parameter in our code, it just defaults to `False`. That means that the method does not modify the original DataFrame (e.g., `sales_data`) directly. It does not modify the original DataFrame “in place”. Instead, if `inplace` is set to `False`, it just creates a new DataFrame.

What if we want to modify the original DataFrame though?

Easy.

We just set `inplace = True`.

Let’s take a look.

Here, before we modify our DataFrame “in place”, we’re going to create a copy.

This is just so we don’t overwrite the original DataFrame. (I want to leave the original DataFrame intact for you, so we’ll just create a copy that we can modify in place.)

sales_data_copy = sales_data.copy()

And now, here is the code to modify the data in place:

sales_data_copy.sort_values('sales', inplace = True)

After you run the code, there should not be any direct output. This is because the sort_values method has modified `sales_data_copy` directly.

That being the case, to see the sorted DataFrame, we need to print it out.

We can do that with the following code:

print(sales_data_copy)

OUT:

       name region  sales  expenses
3    Markus  South  34000     44000
4    Edward   West  42000     38000
6     Ethan  South  49000     42000
0   William   East  50000     42000
1      Emma  North  52000     43000
7    Olivia   West  55000     60000
9     Anika   East  65000     44000
8      Arun   West  67000     39000
10    Paulo  South  67000     45000
5    Thomas   West  72000     39000
2     Sofia   East  90000     50000

Notice that the `sales_data_copy` DataFrame is sorted on `sales` from low to high.

Because we sorted the data in place, that change to `sales_data_copy` is permanent.

At the same time though, the original `sales_data` DataFrame is still unchanged.

Ok … let’s take a look at a couple of questions about the sort_values method.

**Frequently asked questions:**

- Is it possible to sort the data along axis 1?
- How can you sort by ascending order and descending order for different variables?

So far in this tutorial, we’ve been using the sort_values method to sort the rows of data.

Technically, sorting the rows is equivalent to saying that we’re sorting axis 0 of the DataFrame. (This is a bit technical … you need to understand Pandas axes to understand this.)

I haven’t mentioned though, that it’s actually possible to sort the columns as well. You can use sort_values to sort the columns of a DataFrame just like you can sort the rows.

To do this, you can set the `axis` parameter to `axis = 1` in your code.

Having said that, it’s extremely rare to have to do this.

If you have a typical DataFrame with rows and columns, you should almost never need to set `axis = 1`.
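For completeness, here’s a hedged sketch of what `axis = 1` does. (This toy DataFrame is my own; note that sorting columns requires the values within the `by` row to be comparable, so it’s all numeric.)

```python
import pandas as pd

# a small, all-numeric DataFrame (my own toy example)
df = pd.DataFrame({'a': [3, 1], 'b': [1, 2], 'c': [2, 0]},
                  index=['x', 'y'])

# sort the *columns*, ordered by the values in row 'x'
sorted_cols = df.sort_values(by='x', axis=1)
print(list(sorted_cols.columns))    # ['b', 'c', 'a']
```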

What if you want to sort by one variable in ascending order and by a second variable in descending order? Is that possible?

Yes.

You can just pass a list of boolean values to the `ascending` parameter.

Try it!

You can run this code on the `sales_data` DataFrame we created above:

sales_data.sort_values(['region','sales'], ascending = [False, True])
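Here’s a self-contained sketch of what that mixed-order sort does, using a cut-down version of the sales data (fewer rows; the trimming is mine):

```python
import pandas as pd

# a cut-down version of the sales_data DataFrame from above
sales_data = pd.DataFrame({
    "name": ["William", "Emma", "Edward", "Thomas"],
    "region": ["East", "North", "West", "West"],
    "sales": [50000, 52000, 42000, 72000]
})

# region in descending (reverse alphabetical) order,
# sales in ascending order within each region
result = sales_data.sort_values(['region', 'sales'],
                                ascending=[False, True])
print(result['name'].tolist())    # ['Edward', 'Thomas', 'Emma', 'William']
```

The regions come out in reverse alphabetical order (West, then North, then East), while sales still ascend within each region.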

Do you have more questions about the Pandas sort_values method?

Leave your questions in the comments section below.

If you’re serious about learning Pandas, you should enroll in our premium Pandas course called *Pandas Mastery*.

Pandas Mastery will teach you everything you need to know about Pandas, including:

- How to subset your Python data
- Data aggregation with Pandas
- How to reshape your data
- and more …

Moreover, it will help you completely *master* the syntax within a few weeks. You’ll discover how to become “fluent” in writing Pandas code to manipulate your data.

Find out more here:

Learn More About Pandas Mastery

The post How to use the Pandas sort_values method appeared first on Sharp Sight.
