The post How to make a matplotlib scatter plot appeared first on Sharp Sight.

The scatter plot is a relatively simple tool, but it’s also essential for doing data analysis and data science.

Having said that, if you want to do data science in Python, you really need to know how to create a scatter plot in matplotlib. You should know how to do this with your eyes closed.

This tutorial will show you how to make a matplotlib scatter plot, and it will show you how to modify your scatter plots too.

Overall, the tutorial is designed to be read top to bottom, particularly if you’re new to Python and want the details of how to make a scatter plot in Python. Ideally, it’s best if you read the whole tutorial.

Having said that, if you just need quick help with something, you can click on one of the following links. These links will bring you to the appropriate section in the tutorial.

- A quick introduction to matplotlib
- The syntax for the matplotlib scatter plot
- Examples of how to make a scatter plot with matplotlib

Again though, if you’re a relative beginner and you have the time, I recommend that you read the full tutorial. Everything will make more sense that way.

Ok. Before I show you how to make a scatter plot with matplotlib, let me quickly explain what matplotlib is.

Matplotlib is a data visualization module for the Python programming language. It provides Python users with a toolkit for creating data visualizations.

Some of those data visualizations can be extremely complex. You can use matplotlib to create complex visualizations, because the syntax is very detailed. This makes the syntax very adaptable for different visualization problems.

On the other hand, the complex syntax of matplotlib can make it more complicated to quickly create simple data visualizations.

This is where pyplot comes in.

When you start working with matplotlib, you might read about pyplot.

What is pyplot?

To put it simply, pyplot is part of matplotlib. Pyplot is a sub-module of the larger matplotlib module.

Specifically, pyplot provides a set of functions for creating simple visualizations. For example, pyplot has simple functions for creating simple plots like histograms, bar charts, and scatter plots.

Ultimately, the tools from pyplot give you a simpler interface into matplotlib. It makes visualization easier for some relatively standard plot types.

As I mentioned, one of those plots that you can create with pyplot is the scatter plot.

Let’s take a look at the syntax.

Creating a scatter plot with matplotlib is relatively easy.

To do this, we’re going to use the pyplot function `plt.scatter()`.

For the most part, the syntax is relatively easy to understand. Let’s take a look.

First of all, notice the name of the function. Here, we’re calling the function as `plt.scatter()`. Keep in mind that we’re using the syntax `plt` to refer to pyplot. Essentially, this code assumes that you’ve imported pyplot with the code `import matplotlib.pyplot as plt`. For more information on that, see the examples below.

To create a scatter plot with matplotlib though, you obviously can’t just call the function. You need to use the parameters of the function to tell it exactly what to plot, and how to plot it.

With that in mind, let’s take a look at the parameters of the plt.scatter function.

Pyplot’s plt.scatter function has a variety of parameters that you can manipulate … nearly a dozen.

The large number of parameters can make using the function a little complicated though.

So in the interest of simplicity, we’re only going to discuss five of them: `x` and `y`, `c`, `s`, and `alpha`.

Let’s talk about each of them.

The `x` and `y` parameters of plt.scatter are very similar, so we’ll talk about them together.

Essentially, they are the x and y axis positions of the points you want to plot.

The data that you pass to each of these should be in an “array like” format. In Python, structures with “array like” formats include things like lists, tuples, and NumPy arrays.

Commonly, you’ll find that people pass data to these parameters in the form of a Python list. For example, you might set `x = [1,2,3,4,5]`.

In this tutorial though, we’ll work with NumPy arrays. You’ll see this later in the examples section, but essentially, we’ll pass values to the `x` and `y` parameters in the form of two NumPy arrays.
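To make the “array like” idea concrete, here is a minimal sketch. The values are made up for illustration, and the `Agg` backend line is my own assumption so that the code runs without a display. It plots the same points once from a list and a tuple, and once from NumPy arrays, then checks that the plotted positions match.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, so no window is needed
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical data in two different "array like" formats
x_list = [1, 2, 3, 4, 5]        # a Python list
y_tuple = (2, 4, 6, 8, 10)      # a Python tuple
points_builtin = plt.scatter(x_list, y_tuple)

# The same data as NumPy arrays
points_numpy = plt.scatter(np.array(x_list), np.array(y_tuple))

# get_offsets() returns the (x, y) positions of the plotted points
same_points = np.array_equal(
    np.asarray(points_builtin.get_offsets()),
    np.asarray(points_numpy.get_offsets()),
)
print(same_points)
```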

The `c` parameter controls the color of the points.

There are several ways to manipulate this parameter.

First, you can set the `c` parameter to a “named color.” Named colors are colors like “red,” “green,” “blue,” and so on. Matplotlib has a large number of named colors, so if you want something specific, take a look at the options and use one in your plot.

You can also set the `c` parameter using a hexadecimal color. For example, you can set `c = "#CC0000"` to set the color of the points to a sort of “fire engine red” color. Using hex colors is great, because they can give you very fine-grained control over the colors in your visualization. On the other hand, hexadecimal colors can be a little bit complicated for beginners. That being the case, we’re not going to really cover hex colors in this tutorial.
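If you’re curious how named colors and hex colors relate, here is a small sketch using the `matplotlib.colors` helpers `to_hex` and `to_rgb`. It is only meant to show that both formats describe the same kind of thing (an amount of red, green, and blue); it is not something you need for the rest of the tutorial.

```python
from matplotlib import colors

# A named color converted to its hex string
red_hex = colors.to_hex("red")

# A hex string converted to (red, green, blue) values between 0 and 1
fire_engine_rgb = colors.to_rgb("#CC0000")

print(red_hex)
print(fire_engine_rgb)
```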

It’s also possible to create a color mapping for your points, such that the color of the points varies according to some variable. Unfortunately, this is somewhat complicated for a beginner. So in the interest of simplicity, I won’t explain it here. If you’re really interested in complex visualization with more visually appealing colors, I strongly recommend using R’s ggplot2 system instead.

The `s` parameter controls the size of the points.

The default value is controlled by the `lines.markersize` value in `rcParams`: if you don’t set `s`, matplotlib uses `rcParams['lines.markersize'] ** 2`.

We’re not going to work extensively with the s parameter, but I’ll show you a simple example of how it works in the examples below.
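If you want to see where the default size comes from, the following sketch plots a few made-up points without setting `s` and compares the resulting size to the `rcParams` value. The `Agg` backend line is an assumption on my part so that this runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, so no window is needed
import matplotlib.pyplot as plt

# Plot three hypothetical points without specifying s
default_points = plt.scatter([1, 2, 3], [1, 2, 3])

# The size matplotlib used for the points
sizes = default_points.get_sizes()

# The value we expect from the current settings
expected_default = plt.rcParams["lines.markersize"] ** 2

print(sizes[0], expected_default)
```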

Finally, the `alpha` parameter controls the opacity of the points.

This must be a value between 0 and 1 (inclusive), where 1 is fully opaque and 0 is fully transparent.
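Here is a minimal sketch of `alpha` in use. The data is invented so that many points overlap, which is the situation where partial transparency tends to help. The `Agg` backend line is my own assumption so that the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, so no window is needed
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical data: 500 points clustered around the origin, so many overlap
np.random.seed(0)
x_overlap = np.random.normal(size = 500)
y_overlap = np.random.normal(size = 500)

# alpha = 0.3 makes each point 30% opaque, so dense regions look darker
faded_points = plt.scatter(x_overlap, y_overlap, alpha = 0.3)

print(faded_points.get_alpha())
```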

Now that you understand the syntax and the parameters of the plt.scatter function, let’s work through some examples.

One last thing though before you try to run the examples.

… you’ll need to run some code to get these examples to work properly.

First, you’ll need to import a few modules into your working environment. The following code will import matplotlib, numpy and pyplot.

import matplotlib
import numpy as np
import matplotlib.pyplot as plt

Also, you need to create some data.

We’re essentially going to create two vectors of data.

We’ll create the first, `x_var`, by using the np.arange function. This data, `x_var`, essentially contains the integer values from 0 to 49.

The second variable, `y_var`, is the same value of x_var with a little random noise added in with the np.random.normal function.

# CREATE DATA
np.random.seed(42)
x_var = np.arange(0, 50)
y_var = x_var + np.random.normal(size = 50, loc = 0, scale = 10)

You’ll see what the data looks like in a minute. The whole point of this tutorial is that we’re going to plot it! But essentially, when we plot them together, they will look like highly correlated linear data.
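Before plotting, you can do a quick sanity check on the data. This sketch re-creates the two variables and then looks at their shapes and their correlation. The use of `np.corrcoef` is my own addition for illustration.

```python
import numpy as np

# Re-create the data from above
np.random.seed(42)
x_var = np.arange(0, 50)
y_var = x_var + np.random.normal(size = 50, loc = 0, scale = 10)

# Both variables should contain 50 values
print(x_var.shape, y_var.shape)

# The first five x values
print(x_var[:5])

# Correlation between x_var and y_var (close to 1 means strongly linear)
correlation = np.corrcoef(x_var, y_var)[0, 1]
print(correlation)
```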

Ok, now that we have our data, let’s plot it.

We’ll start off by making a very simple scatter plot.

To do this, we’re going to call `plt.scatter()` and set `x = x_var` and `y = y_var`.

# PLOT A SIMPLE SCATTERPLOT
plt.scatter(x = x_var, y = y_var)

And here is the output:

Let me explain a few things about the code and the output.

First, notice the code. We mapped `x_var` to the x axis and we mapped `y_var` to the y axis.

You can see that this directly translates into how the points are plotted. For any given point in the scatter plot, the x axis value comes from the `x_var` variable, and the y axis value comes from the `y_var` variable. Said differently, the locations of the points are contained in the variables `x_var` and `y_var`.

I also want to note that you don’t need to explicitly type the parameters `x` and `y`. For example, you could remove `x =` and `y =` from the code, and it would still work. Like this:

plt.scatter(x_var, y_var)

This code works the same as `plt.scatter(x = x_var, y = y_var)`. They are operationally identical. If you remove `x =` and `y =` from the code, Python still knows that you are passing `x_var` and `y_var` to the `x` and `y` parameter. It essentially knows that the first variable should be mapped to `x` and the second should be mapped to `y`. This is known as defining argument values by *position*. It’s very common to see that in code, so I want you to understand it.
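If you want to convince yourself that the two forms are equivalent, here is a small sketch that plots the data both ways and compares the point positions. The `Agg` backend line is an assumption on my part so that the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, so no window is needed
import matplotlib.pyplot as plt
import numpy as np

# Re-create the tutorial data
np.random.seed(42)
x_var = np.arange(0, 50)
y_var = x_var + np.random.normal(size = 50, loc = 0, scale = 10)

# Arguments passed by keyword
by_keyword = plt.scatter(x = x_var, y = y_var)

# Arguments passed by position
by_position = plt.scatter(x_var, y_var)

# Compare the (x, y) positions of the plotted points
identical = np.array_equal(
    np.asarray(by_keyword.get_offsets()),
    np.asarray(by_position.get_offsets()),
)
print(identical)
```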

At this point, I need to point out that a default matplotlib scatter plot is a little plain looking. It’s a little unrefined.

An unrefined chart is fine if you’re doing exploratory data analysis for personal consumption. But if you need to create a chart and show it to anyone important – like a management team in a business – this chart is unrefined. It lacks polish.

Like it or not, that lack of polish will reflect a little poorly on you. You can deny it all you want, but it can be very useful to learn how to polish your charts and make them look more professional. I’ll show you how in an example further down in this tutorial.

Although there is a lot we would need to do to make the basic scatter plot look better, changing the color of the points is a simple way to improve the aesthetics of the chart.

Let me show you how.

As noted earlier in this tutorial, you can modify the color of the points by manipulating the `c` parameter.

There are actually several different ways to modify the `c` parameter to change the color of the points.

The two primary ways to do this are to set the parameter to a “named color” or to set the parameter to a “hex color.”

Here in this tutorial, I’ll show you how to set the color of the points to a “named color.” Hex colors are a little more complicated, so I’m not going to explain them here.

We can change the color of the points in our scatter plot by setting the `c` parameter to a “named color.”

What are named colors? This is very simple. Named colors are colors like “red,” “green,” and “blue.” Matplotlib has a pretty long list of named colors. I recommend that you become familiar with a few of them, so you have a few that you can use regularly in your plots.

When you know what color you want to use for your points, provide that color as the argument to the `c` parameter.

For example, if you want to set the color of the points to “red” you can use the code `c = 'red'` inside of plt.scatter.

Here’s the code to do that:

plt.scatter(x_var, y_var, c = 'red')

The code produces the following output:

As you can see, this code has changed the color of the points to red.

This chart still lacks polish, but by using the `c` parameter, we now have a little more control over the aesthetics of our scatter plot.

You can also change the size of the points.

You can do that by using the `s` parameter.

Changing the size is very similar to changing the color. Just provide a value.

plt.scatter(x_var, y_var, s = 120)

And here’s the output:

As you can see, the size of the points is larger than the size of the points in our simple scatter plot.

The value that you give to the `s` parameter will be the point size in points**2.
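Because `s` is measured in points**2, it behaves like an area rather than a width. The following sketch (my own illustration, with made-up `s` values of 30 and 120) shows the consequence: multiplying `s` by 4 only doubles the width of each point. The `Agg` backend line is an assumption so that the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, so no window is needed
import matplotlib.pyplot as plt
import numpy as np

# Re-create the tutorial data
np.random.seed(42)
x_var = np.arange(0, 50)
y_var = x_var + np.random.normal(size = 50, loc = 0, scale = 10)

small_points = plt.scatter(x_var, y_var, s = 30)
large_points = plt.scatter(x_var, y_var, s = 120)

# s is an area, so the ratio of the areas is 120 / 30
area_ratio = large_points.get_sizes()[0] / small_points.get_sizes()[0]

# The width of a point grows with the square root of its area
width_ratio = np.sqrt(area_ratio)

print(area_ratio, width_ratio)
```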

As I mentioned earlier, the default formatting for pyplot plots is a little unrefined.

Again, that’s not a big deal if you’re just exploring data on your laptop and don’t intend to show it to anyone important. The default matplotlib formatting is OK for rough drafts.

But I definitely think you should “polish” your charts if you need to show them to anyone important. For example, if you work in a business environment and you need to present an analysis to a high-level management team, you’ll want your charts to be polished and aesthetically pleasing. The appearance of your visualizations matters. Don’t ignore it.

That being the case, let me show you a quick way to improve the look of your pyplot scatter plots.

We’re going to use a function from the seaborn module to change some of our plot formatting.

To use seaborn, we’ll need to import the seaborn module. You can do that with the following code.

# import seaborn module
import seaborn as sns

Now that seaborn is imported, we’re going to use the `seaborn.set()` function to re-set the plot defaults:

#set plot defaults using seaborn formatting
sns.set()

After running `sns.set()`, you can re-plot your data, and you’ll notice that it looks quite a bit better.

#plot scatter plot with matplotlib.pyplot
plt.scatter(x = x_var, y = y_var)

Here’s the plot:

As you can see, the chart looks different. More professional, in my opinion.

The background color has been changed. There are gridlines now. The default color for the points is actually slightly different. The changes here are actually pretty minor, but I think they make a big difference in making the chart look better.

One quick note about using seaborn formatting.

If you run the `seaborn.set()` function above, you may find that all of your pyplot charts have that formatting.

How do you turn it off?

You can remove the seaborn formatting by using the `seaborn.reset_orig()` function.

# REMOVE SEABORN FORMATTING
sns.reset_orig()

Let’s do one more example.

Here, we’re going to use several of the parameters and techniques from prior examples *together* in a single example. The output will be a little more polished, and it will give you a sense of how to create a scatter plot with pyplot while controlling multiple parameters at the same time.

Here’s the code:

# FINALIZED EXAMPLE
import seaborn as sns
sns.set()
plt.scatter(x_var, y_var, s = 120, c = 'red')

And here is the output:

Not bad.

It’s not perfect, and we could probably do a few things to improve it, but a plot like this will be “good enough” in many circumstances.

Having said that, if you really want to get the most out of your data visualizations in Python, you need to learn a lot more about matplotlib and pyplot. We’ve really just covered the basics here.

Moreover, if you’re serious about learning data science in Python, you really *need* to know matplotlib. Data visualization is an important part of data science, and if you’re doing data visualization in Python, matplotlib is often the tool of choice.



The post How to use the NumPy max function appeared first on Sharp Sight.

At a high level, I want to explain the function and show you how it works. That being the case, there are two primary sections in this tutorial: the syntax of NumPy max, and examples of how to use NumPy max.

If you’re still getting started with NumPy, I recommend that you read the whole tutorial, start to finish. Having said that, if you just want to get a quick answer to a question, you can skip ahead to the appropriate section with one of the following links:

Ok. With all of that in mind, let’s get started.

First, let’s talk about NumPy and the NumPy max function.

It’s probably clear to you that the NumPy max function is a function in the NumPy module.

But if you’re a true beginner, you might not really know what NumPy is. So before we talk about the np.max function specifically, let’s quickly talk about NumPy.

What exactly is NumPy?

To put it very simply, NumPy is a data manipulation package for the Python programming language.

If you’re interested in data science in Python, NumPy is very important. This is because a *lot* of data science work is simply data manipulation. Whether you’re doing deep learning or data analysis, a huge amount of work in data science is just cleaning data, prepping data, and exploring it to make sure that it’s okay to use.

Again, because of the importance of data manipulation, NumPy is very important for data science in Python.

Specifically though, NumPy provides a set of tools for working with *numeric* data.

Python has other toolkits for working with non-numeric data and data of mixed type (like the Pandas module). But if you have any sort of numeric data that you need to clean, modify, reshape, or analyze, NumPy is probably the toolkit that you need.

Although NumPy functions can operate on a variety of data structures, they are built to operate on a structure called a NumPy array. NumPy arrays are just a special kind of Python object that contains numerical data. There are a variety of ways to create NumPy arrays, including the np.array function, the np.ones function, the np.zeros function and the np.arange function, along with many other functions covered in past tutorials here at Sharp Sight.

Importantly, NumPy arrays are optimized for numerical computations. So there are a set of tools in NumPy for performing numerical computations on NumPy arrays, like calculating the mean of a NumPy array, calculating the median of a NumPy array, and so on.

Essentially, NumPy gives you a toolkit for creating arrays of numeric data, and performing calculations on that numeric data.

One of the computations you can perform is calculating the maximum value of a NumPy array. That’s where the np.max function comes in.

The `numpy.max()` function computes the maximum value of the numeric values contained in a NumPy array. It can also compute the maximum value of the rows, columns, or other axes. We’ll talk about that in the examples section.

Syntactically, you’ll often see the NumPy max function in code as np.max. You’ll see it written like this when the programmer has imported the NumPy module with the alias `np`.

Additionally, just to clarify, you should know that the np.max function is the same thing as the NumPy amax function, AKA np.amax. Essentially np.max is an alias of np.amax. Aside from the name, they are the same.

Later in this tutorial, I’ll show you concrete examples of how to use the np.max function, but right here I want to give you a rough idea of what it does.

For example, assume that you have a 1-dimensional NumPy array with five values:

We can use the NumPy max function to compute the maximum value:
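As a rough sketch of the idea, with five values that I have made up purely for illustration:

```python
import numpy as np

# A hypothetical 1-dimensional array with five values
five_values = np.array([3, 9, 1, 7, 5])

# np.max returns the largest value in the array
largest_value = np.max(five_values)

print(largest_value)  # 9
```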

Although this example shows you how the `np.max()` function operates on a 1-dimensional NumPy array, it operates in a similar way on 2-dimensional arrays and multi-dimensional arrays. Again, I’ll show you full examples of these in the examples section of this tutorial.

Before we look at the code examples though, let’s take a quick look at the syntax and parameters of np.max.

The syntax of the np.max function is fairly straightforward, although a few of the parameters of the function can be a little confusing.

Here, we’ll talk about the syntactical structure of the function, and I’ll also explain the important parameters.

One quick note before we start reviewing the syntax.

Syntactically, the proper name of the function is `numpy.max()`.

Having said that, you’ll often see the function in code as `np.max()`.

Why?

Commonly, at the start of a program that uses the NumPy module, programmers will import the NumPy module as `np`. You will literally see a line of code in the program that reads `import numpy as np`. Effectively, this imports the NumPy module with the alias `np`. This enables the programmer to refer to NumPy as `np` in the code, which enables them to refer to the numpy.max function as np.max.

Having said that, let’s take a closer look at the syntax.

At a high level, the syntax of np.max is pretty straightforward.

There’s the name of the function – `np.max()` – and inside of the function, there are several parameters that enable us to control the exact behavior of the function.

Let’s take a closer look at the parameters of np.max, because the parameters are what really give you fine-grained control of the function.

The numpy.max function has four primary parameters:

- `a`
- `axis`
- `out`
- `keepdims`

Let’s talk about each of these parameters individually.

The `a` parameter enables you to specify the data that the np.max function will operate on. Essentially, it specifies the input array to the function.

In many cases, this input array will be a proper NumPy array. Having said that, numpy.max (and most of the other NumPy functions) will operate on any “array like sequence” of data. That means that the argument to the `a` parameter can be a Python list, a Python tuple, or one of several other Python sequences.
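Here is a quick sketch of that behavior, using three made-up values stored in three different kinds of sequence:

```python
import numpy as np

# The same hypothetical values as a list, a tuple, and a NumPy array
max_from_list = np.max([4, 44, 8])
max_from_tuple = np.max((4, 44, 8))
max_from_array = np.max(np.array([4, 44, 8]))

print(max_from_list, max_from_tuple, max_from_array)  # 44 44 44
```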

Keep in mind that you need to provide something to this parameter. It is required.

The `axis` parameter enables you to specify the axis on which you will calculate the maximum values.

Said more simply, the axis parameter enables you to calculate the row maxima and column maxima.

I’ll explain how to do that with more detail in the examples section below, but let me quickly explain how the `axis` parameter works.

At a high level, you need to understand that NumPy arrays have axes.

Axes are like directions along the NumPy array. In a 2-dimensional array, axis 0 is the axis that points down the rows and axis 1 is the axis that points horizontally across the columns.

So how does this relate to the `axis` parameter?

When we use the `axis` parameter in the numpy.max function, we’re specifying the axis along which to find the maxima.

This effectively lets us compute the column maxima and row maxima.

Let me show you what I mean.

Remember that axis 0 is the axis that points downwards, down the rows.

When we call np.max on an array with `axis = 0`, we’re effectively telling NumPy to compute the maximum values in that direction … the axis 0 direction.

Effectively, when we set `axis = 0`, we’re specifying that we want to compute the *column* maxima.

Similarly, remember that in a 2-dimensional array, axis 1 points horizontally. Therefore, when we use NumPy max with `axis = 1`, we’re telling NumPy to compute the maxima horizontally, in the axis 1 direction.

This effectively computes the row maxima.

I’ll show you concrete code examples of how to do this, later in the examples section.

Keep in mind that the *axis* parameter is optional. If you don’t specify an axis, NumPy max will find the maximum value in the whole NumPy array.
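To preview how the three cases differ, here is a small sketch on a made-up array with 2 rows and 3 columns. I chose a non-square array on purpose, so you can tell the column maxima (3 values) apart from the row maxima (2 values).

```python
import numpy as np

# A hypothetical 2-d array with 2 rows and 3 columns
small_2d = np.array([[1, 9, 3],
                     [7, 2, 8]])

# axis = 0: one maximum per column
column_maxima = np.max(small_2d, axis = 0)

# axis = 1: one maximum per row
row_maxima = np.max(small_2d, axis = 1)

# no axis: a single maximum for the whole array
overall_max = np.max(small_2d)

print(column_maxima)  # [7 9 8]
print(row_maxima)     # [9 8]
print(overall_max)    # 9
```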

The `out` parameter allows you to specify a special output array where you can store the output of np.max.

It’s not common to use this parameter (especially if you’re a beginner) so we aren’t going to discuss this in the tutorial.

`out` is an optional parameter.
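For the curious, here is a minimal sketch of what using `out` might look like. The array and the name `result_buffer` are my own inventions for illustration. The key assumption is that the output array has to match the shape of the result, which here is one value per column.

```python
import numpy as np

# A hypothetical 2-d array with 2 rows and 3 columns
small_2d = np.array([[1, 9, 3],
                     [7, 2, 8]])

# Pre-allocate an array with one slot per column, using the same data type
result_buffer = np.empty(3, dtype = small_2d.dtype)

# np.max writes the column maxima into result_buffer
np.max(small_2d, axis = 0, out = result_buffer)

print(result_buffer)  # [7 9 8]
```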

The `keepdims` parameter is a little confusing, so it will take a little effort to understand.

Ultimately, the `keepdims` parameter keeps the number of dimensions of the output the same as the number of dimensions of the input.

To understand why this might be necessary, let’s take a look at how the numpy.max function typically works.

When you use np.max on a typical NumPy array, the function *reduces* the number of dimensions. It summarizes the data.

For example, let’s say that you have a 1-dimensional NumPy array. You use NumPy max on the array.

When you use np.max on a 1-d array, the output will be a single number. A scalar value … *not* a 1-d array.

Essentially, the functions like NumPy max (as well as numpy.median, numpy.mean, etc) *summarize* the data, and in summarizing the data, these functions produce outputs that have a reduced number of dimensions.

Sometimes though, you want the output to have the *same* number of dimensions. There are times when if the input is a 1-d array, you want the output to be a 1-d array (even if the output array has a single value in it).

You can do this with the `keepdims` parameter.

By default, `keepdims` is set to `False`. So by default (as discussed above), the output will not have the same number of dimensions as the input. By default, the output will have fewer dimensions (because np.max summarizes the data).

But if you set `keepdims = True`, the output will have the *same* number of dimensions as the input.

This is a little abstract without a concrete example, so I’ll show you an example of this behavior later in the examples section.
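In the meantime, here is a very small sketch of the 1-d case described above, using five made-up values:

```python
import numpy as np

# A hypothetical 1-dimensional array
five_values = np.array([3, 9, 1, 7, 5])

# Default behavior: the output is a scalar with zero dimensions
reduced_max = np.max(five_values)

# keepdims = True: the output is still a 1-d array, with a single value
kept_max = np.max(five_values, keepdims = True)

print(np.ndim(reduced_max))  # 0
print(kept_max)              # [9]
print(kept_max.ndim)         # 1
```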

And actually, now that we’ve reviewed the parameters, this is a good spot to start looking at the examples of NumPy max.

In this section, I’m going to show you concrete examples of how to use the NumPy max function.

I’ll show you several variations of how to find the maximum value of an array. I’ll show you how to find the maximum value of a 1-d array, how to find the max value of a 2-d array, and how to work with several of the important parameters of numpy.max.

Before we get started, there are some preliminary things you need to do to get set up properly.

First, you need to have NumPy installed properly on your computer.

Second, you need to have NumPy imported into your working environment.

You can import NumPy with the following code:

import numpy as np

Notice that we’ve imported NumPy as `np`. That means that we will refer to NumPy in our code with the alias `np`.

Ok, now that that’s finished, let’s look at some examples.

We’ll start simple.

Here, we’re going to compute the maximum value of a 1-d NumPy array.

To do this, we’ll first just create a 1-dimensional array that contains some random integers. To create this array, we’ll use the `numpy.random.randint()` function. Keep in mind that you need to use the `np.random.seed()` function so your NumPy array contains the same integers as the integers in this example.

np.random.seed(22)
np_array_1d = np.random.randint(size = 5, low = 0, high = 99)

This syntax will create a 1-d array called `np_array_1d`.

We can print out `np_array_1d` using the `print()` function.

print(np_array_1d)

And here’s the output:

[ 4 44 64 84  8]

Visually, we can identify the maximum value, which is `84`.

But let’s do that with some code.

Here, we’ll calculate the maximum value of our NumPy array by using the `np.max()` function.

np.max(np_array_1d)

Which produces the following output:

84

This is an extremely simple example, but it illustrates the technique. Obviously, when the array is only 5 items long, you can visually inspect the array and find the max value. But this technique will work if you have an array with thousands of values (or more!).
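To illustrate that last point, here is a sketch with a larger array of 10,000 random integers. The array size and the comparison against Python’s built-in `max()` are my own additions, just to show that the result agrees with a slower, more familiar approach.

```python
import numpy as np

# A hypothetical larger array: 10,000 random integers from 0 up to (not including) 99
np.random.seed(22)
big_array = np.random.randint(size = 10_000, low = 0, high = 99)

# np.max works the same way, no matter how long the array is
largest_value = np.max(big_array)

# Cross-check against Python's built-in max on a plain list
builtin_largest = max(big_array.tolist())

print(largest_value == builtin_largest)  # True
```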

Next, let’s compute the maximum of a 2-d array.

To do this, obviously we need a 2-d array to work with, so we’ll first create a 2-dimensional NumPy array.

To create our 2-d array, we’re going to use the np.random.choice() function. Essentially, this function is going to draw a random sample from the integers between 0 and 8, without replacement. After np.random.choice() is executed, we’re using the reshape() method to reshape the integers into a 2-dimensional array with 3 rows and 3 columns.

np.random.seed(1)
np_array_2d = np.random.choice(9, 9, replace = False).reshape((3,3))

Let’s take a look by printing out the array, `np_array_2d`.

print(np_array_2d)

[[8 2 6]
 [7 1 0]
 [4 3 5]]

As you can see, this is a 2-d array with 3 rows and 3 columns. It contains the integers from 0 to 8, arranged randomly in the array.

Now, let’s compute the max value of the array:

np.max(np_array_2d)

Which produces the following output:

8

Again, this is a very simple example, but you can use this with a much larger 2-d array and it will operate in the same way. Once you learn how to use this technique, try it with larger arrays!

Next, let’s do something more complicated.

… in the next examples, we’ll compute the column maxima and the row maxima.

First up: we’ll compute the maximum values of the *columns* of an array.

To do this, we need to use the `axis` parameter. Specifically, we need to set `axis = 0` inside of the numpy.max function.

Let’s quickly review why.

Remember that NumPy arrays have axes, and that the axes are like directions along the array. In a 2-d array, axis 0 is the axis that points downwards, and axis 1 is the axis that points horizontally.

We can use these axes to define the direction along which to use np.max.

So let’s say that we want to compute the maximum values of the columns. This is equivalent to computing the maxima *downward*.

Essentially, to compute the column maxima, we need to compute the maxima in the axis-0 direction.

Let me show you how.

Here, we’re going to re-create our 2-d NumPy array. This is the same as the 2-d NumPy array that we created in a previous example, so if you already ran that code, you don’t need to run it again.

np.random.seed(1)
np_array_2d = np.random.choice(9, 9, replace = False).reshape((3,3))

And we can print it out:

print(np_array_2d)

[[8 2 6]
 [7 1 0]
 [4 3 5]]

Once again, this is a 2-d array with 3 rows and 3 columns. It contains the integers from 0 to 8, arranged randomly in the array.

Now, let’s compute the column maxima by using numpy.max with `axis = 0`.

# CALCULATE COLUMN MAXIMA
np.max(np_array_2d, axis = 0)

Which produces the following output array:

array([8, 3, 6])

Let’s evaluate what happened here.

By setting `axis = 0`, we specified that we want the NumPy max function to calculate the maximum values *downward* along axis 0.

It’s pretty straightforward as long as you understand NumPy axes and how they work in the NumPy functions.
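If the axis logic still feels abstract, it can help to compute the column maxima by hand and compare. In this sketch, I typed in the array values from the printout above directly (rather than generating them randomly), and the loop over columns is my own illustration of what `axis = 0` is doing conceptually.

```python
import numpy as np

# The 3-by-3 array from the printout above, typed in directly
array_3x3 = np.array([[8, 2, 6],
                      [7, 1, 0],
                      [4, 3, 5]])

# Column maxima using the axis parameter
column_maxima = np.max(array_3x3, axis = 0)

# The same thing by hand: take each column in turn and find its maximum
manual_column_maxima = [int(np.max(array_3x3[:, j])) for j in range(array_3x3.shape[1])]

print(column_maxima)         # [8 3 6]
print(manual_column_maxima)  # [8, 3, 6]
```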

Similarly, we can compute the row maxima by setting the `axis` parameter to `axis = 1`.

Here’s the code to create the 2-d dataset again:

np.random.seed(1)
np_array_2d = np.random.choice(9, 9, replace = False).reshape((3,3))

print(np_array_2d)

[[8 2 6]
 [7 1 0]
 [4 3 5]]

And now let’s calculate the row maxima:

np.max(np_array_2d, axis = 1)

With the following output:

array([8, 7, 5])

This should make sense if you’ve already read and understood the previous examples.

When we set `axis = 1`, we’re telling numpy.max to calculate the maximum values in the axis-1 direction. Since axis 1 is the axis that runs horizontally along the array, this effectively calculates the maximum values along the rows of a 2-d array.

Again, this is pretty straightforward, as long as you really understand NumPy array axes. If you’re still having trouble understanding axes, I recommend that you review our tutorial about NumPy array axes.
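As a cross-check, here is a sketch that computes the row maxima three ways: with `np.max`, with the array’s own `.max()` method, and with a plain loop over the rows. The array values are typed in from the printout above, and the method form and the loop are my own additions for comparison.

```python
import numpy as np

# The 3-by-3 array from the printout above, typed in directly
array_3x3 = np.array([[8, 2, 6],
                      [7, 1, 0],
                      [4, 3, 5]])

# Row maxima with the np.max function
row_maxima = np.max(array_3x3, axis = 1)

# Row maxima with the equivalent array method
row_maxima_method = array_3x3.max(axis = 1)

# Row maxima by hand: iterating over a 2-d array yields one row at a time
manual_row_maxima = [int(np.max(row)) for row in array_3x3]

print(row_maxima)         # [8 7 5]
print(row_maxima_method)  # [8 7 5]
print(manual_row_maxima)  # [8, 7, 5]
```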

Finally, let’s take a look at the `keepdims` parameter.

Before we do this, let me explain why we need it.

As I noted earlier, the NumPy max function *summarizes* data when you use it. In fact, many of the NumPy functions that calculate summary statistics (like np.mean, np.median, np.min, etc) summarize data by their very nature. When you calculate a summary statistic, you are by definition summarizing the data.

This has important consequences related to the dimensions of the data.

When you summarize your data with a function like numpy.max, the output of the function will have a reduced number of dimensions.

For example, let’s say you’re calculating the maximum value of a 2-dimensional array. If you use numpy.max on this 2-d array (without the `axis` parameter), then the output will be a single number, a scalar. Scalars have *zero* dimensions. Two dimensions in, zero dimensions out.

The NumPy max function effectively reduces the dimensions between the input and the output.

Sometimes though, you *don’t want a reduced number of dimensions*. There may be situations where you need the output to technically have the same dimensions as the input (even if the output is a single number).

You can force that behavior by using the `keepdims` parameter.

By default, the `keepdims` parameter is set to `False`. As I just explained, this means that the output does *not* need to have the same dimensions as the input, by default.

But if you set `keepdims = True`, this will force the output to have the same number of dimensions as the input.

This might confuse you, so let’s take a look at a solid example.

First, let’s just create a 2-d array.

This is the same array that we created earlier, so if you already ran this code, you don’t need to re-run it. Essentially, this code creates a 2-d array with the numbers from 0 to 8, arranged randomly in a 3-by-3 array.

np.random.seed(1)
np_array_2d = np.random.choice(9, 9, replace = False).reshape((3,3))

Just for the sake of clarity, let’s take a look by printing it out:

print(np_array_2d)

And here’s the array:

[[8 2 6]
 [7 1 0]
 [4 3 5]]

Again, this array just contains the integers from 0 to 8, arranged randomly in a 3-by-3 NumPy array.

And how many dimensions does it have?

It’s probably obvious to you, but we can directly retrieve the number of dimensions by extracting the `ndim` attribute from the array.

np_array_2d.ndim

And this tells us the number of dimensions:

2

So, `np_array_2d` is a 2-dimensional array.

Now, let’s use np.max to compute the maximum value of the array.

np.max(np_array_2d)

The max value is `8`.

And how many dimensions does the output have? We can check by referencing the `ndim` attribute at the end of the `np.max()` function call:

np.max(np_array_2d).ndim

The answer is `0`.

The output of np.max is the maximum value (8), which is a scalar. This scalar has zero dimensions.

Now, let’s re-run the code with `keepdims = True`.

np.max(np_array_2d, keepdims = True)

Which produces the following output:

array([[8]])

And let’s check the dimensions:

np.max(np_array_2d, keepdims = True).ndim

2

Here, when we run np.max on `np_array_2d` with `keepdims = True`, the output has 2 dimensions.

Keep in mind that the maximum value is the same: `8`. It’s just that the *dimensions* of the output are different. By setting `keepdims = True`, we change the structure of the output … instead of being a scalar, the output is actually a 2-d NumPy array with a single value (`8`).
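One situation where keeping the dimensions is handy is when you combine `keepdims` with the `axis` parameter and then use the result in arithmetic with the original array. The following is a minimal sketch of that idea, not something from the examples above: the array values are typed in by hand, and the names `row_max_flat`, `row_max_kept`, and `scaled` are just illustrative.

```python
import numpy as np

np_array_2d = np.array([[8, 2, 6],
                        [7, 1, 0],
                        [4, 3, 5]])

row_max_flat = np.max(np_array_2d, axis=1)                 # shape (3,)
row_max_kept = np.max(np_array_2d, axis=1, keepdims=True)  # shape (3, 1)

print(row_max_flat.shape)  # (3,)
print(row_max_kept.shape)  # (3, 1)

# Because the summarized axis is kept with length 1, the result lines up
# with the rows of the original array, so each row is divided by its own max.
scaled = np_array_2d / row_max_kept
print(scaled.max(axis=1))  # [1. 1. 1.]
```

With `keepdims = True`, each row maximum stays in its own row, which is what lets the division work row by row.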

If you’ve read other tutorials here at the Sharp Sight data science blog, you know just how important data manipulation is.

If you’re serious about learning data science, you really need to master the basics of data manipulation. A huge part of the data science workflow is just cleaning and manipulating input data.

If you’re working in Python, one of the essential tools you need to know is NumPy. NumPy is critical for cleaning, manipulating, and exploring your data.

If you want to learn data science in Python, learn NumPy and learn it well.

Having said that, if you want to learn NumPy and data science in Python, then sign up for our email list.

Here at the Sharp Sight blog, we regularly publish data science tutorials.

When you sign up, you’ll get free tutorials on:

- NumPy
- Pandas
- Base Python
- Scikit learn
- Machine learning
- Deep learning
- … and more.

When we publish tutorials, we’ll send them directly to your inbox.

Want to learn data science in Python? Sign up now.

The post How to use the NumPy max function appeared first on Sharp Sight.

The post How to make a matplotlib bar chart appeared first on Sharp Sight.

Specifically, you’ll learn how to use the plt.bar function from pyplot to create bar charts in Python.

I’ll be honest … creating bar charts in Python is harder than it should be.

People who are just getting started with data visualization in Python sometimes get frustrated. I suspect that this is particularly true if you’ve used other modern data visualization toolkits like ggplot2 in R.

But if you’re doing data science or statistics in Python, you’ll need to create bar charts.

To try to make bar charts easier to understand, this tutorial will explain bar charts in matplotlib, step by step.

The tutorial has several different sections. Note that you can click on these links and they will take you to the appropriate section.

- A quick introduction to matplotlib
- The syntax for the matplotlib bar chart
- Examples of how to make a bar chart with matplotlib

If you need help with something specific, you can click on one of the links.

However, if you’re just getting started with matplotlib, I recommend that you read the entire tutorial. Things will make more sense that way.

Ok. First, let’s briefly talk about matplotlib.

If you’re new to data visualization in Python, you might not be familiar with matplotlib.

Matplotlib is a module in the Python programming language for data visualization and plotting.

For the most part, it is the most common data visualization tool in Python. If you’re doing data science or scientific computing in Python, you are very likely to see it.

However, even though matplotlib is extremely common, it has a few problems.

The big problem is the syntax. Matplotlib’s syntax is fairly low-level. The low-level nature of matplotlib can make it harder to accomplish simple tasks. If you’re only using matplotlib, you might need to use a lot of code to create simple charts.

There’s a solution to this though.

To simplify matplotlib, you can use pyplot.

Pyplot is a sub-module within matplotlib.

Essentially, pyplot provides a group of relatively simple functions for performing common data visualization tasks.

For example, there are simple functions for creating common charts like the scatter plot, the bar chart, the histogram, and others.

If you’re new to matplotlib and pyplot, I recommend that you check out some of our related tutorials:

- How to make a scatterplot with matplotlib
- A quick introduction to the matplotlib histogram
- How to make a line chart with matplotlib

In this tutorial though, we’re going to focus on creating bar charts with pyplot and matplotlib.

With that in mind, let’s examine the syntax.

The syntax to create a bar chart with pyplot isn’t that bad, but it has a few “gotchas” that can confuse beginners.

Let’s take a high-level look at the syntax (we’ll look at the details later).

To create a bar chart with pyplot, we use the `plt.bar()` function.

Inside of the plt.bar function are several parameters.

In the picture above, I’ve shown four: `x`, `height`, `width`, and `color`. The plt.bar function has more parameters than these four, but these four are the most important for creating basic bar charts, so we will focus on them.

Let’s talk a little more specifically about these parameters.

Here, I’ll explain four important parameters of the plt.bar function: `x`, `height`, `width`, and `color`.
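As a rough sketch of how these four parameters fit together in a single call, here’s a small example. The values are made up purely for illustration (they’re not the data we’ll use in the examples section).

```python
import matplotlib.pyplot as plt

# Illustrative values only
bars = plt.bar(
    [0, 1, 2],      # x: position of each bar along the x axis
    [3, 5, 2],      # height: one value per bar
    width=0.8,      # width: a single value, or one value per bar
    color='blue',   # color: interior color of the bars
)
```

The first two arguments are passed by position, and `width` and `color` are passed by name.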

The `x` parameter specifies the position of the bars along the x axis.

So if your bars are at positions 0, 1, 2, and 3 along the x axis, those are the values that you would need to pass to the `x` parameter.

You need to provide these values in the form of a “sequence” of scalar values. That means that your values (e.g., 0, 1, 2, 3) will need to be contained inside of a Python sequence, like a list or a tuple.

In this tutorial, I’m assuming that you understand what a Python sequence is. If you don’t, do some preliminary reading on Python sequences first, and then come back when you understand them.

The `height` parameter controls the height of the bars.

Similar to the `x` parameter, you need to provide a sequence of values to the `height` parameter … one value for each bar.

So if there are four bars, you’ll need to pass a sequence of four values. If there are five bars, you need to provide a sequence of five values. Etc.

The examples section will show you how this works.

The `width` parameter controls the width of the bars.

You can provide a single value, in which case all of the bars will have the same width.

Or, you can provide a sequence of values to manually set the width of different bars.

By default, the `width` parameter is set to 0.8.
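Here’s a minimal sketch of the two ways to use `width`: a single value for every bar, or a sequence with one value per bar. The data and the variable names (`same_width`, `varied_width`) are just for illustration.

```python
import matplotlib.pyplot as plt

x_positions = [0, 1, 2, 3]
heights = [1, 4, 9, 16]

plt.figure()
same_width = plt.bar(x_positions, heights, width=0.5)  # one width for every bar

plt.figure()
varied_width = plt.bar(x_positions, heights,
                       width=[0.2, 0.4, 0.6, 0.8])     # one width per bar

print([round(float(bar.get_width()), 2) for bar in same_width])    # [0.5, 0.5, 0.5, 0.5]
print([round(float(bar.get_width()), 2) for bar in varied_width])  # [0.2, 0.4, 0.6, 0.8]
```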

The `color` parameter controls the interior color of the bars.

You can set the value to a named color (like “red”, “blue”, “green”, etc.) or you can set the color to a hexadecimal color.

Although I strongly prefer hex colors (because they give you a lot of control over the aesthetics of your visualizations), hex colors are a little more complicated for beginners. For that reason, this tutorial will only explain how to use named colors (see the examples below).

Ok … now that you know more about the parameters of the plt.bar function, let’s work through some examples of how to make a bar chart with matplotlib.

I’m going to show you individual examples of how to manipulate each of the important parameters discussed above.

Before you work with the examples, you’ll need to run some code.

You need to run code to import some Python modules. You’ll also need to run code to create some simple data that we will plot.

Here is the code to import the proper modules.

We’ll be working with matplotlib, numpy, and pyplot, so this code will import them.

import matplotlib
import numpy as np
import matplotlib.pyplot as plt

Note that we’ve imported numpy with the nickname `np`, and we’ve imported pyplot with the nickname `plt`. These are fairly standard in most Python code. We can use these nicknames as abbreviations of the modules … this just makes it easier to type the code.

Next, you need to create some data that we can plot in the bar chart.

We’re going to create three sequences of data: `bar_heights`, `bar_labels`, and `bar_x_positions`.

# CREATE DATA
bar_heights = [1, 4, 9, 16]
bar_labels = ['alpha', 'beta', 'gamma', 'delta']
bar_x_positions = [0,1,2,3]

As noted above, most of the parameters that we’re going to work with require you to provide a *sequence* of values. Here, all of these sequences have been constructed as Python lists. We could also use tuples or another type of Python sequence. For example, we could use the NumPy arange function to create a NumPy array for `bar_heights` or `bar_x_positions`. As long as the structure is a “sequence”, it will work.
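For instance, here is one way the same positions and heights could be built with np.arange instead of Python lists. This is just a sketch of the alternative; the rest of the tutorial keeps using the lists.

```python
import numpy as np

# np.arange(4) counts from 0 up to (but not including) 4
bar_x_positions = np.arange(4)        # array([0, 1, 2, 3])

# np.arange(1, 5) gives 1, 2, 3, 4; squaring gives the same heights as the list
bar_heights = np.arange(1, 5) ** 2    # array([ 1,  4,  9, 16])
```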

Ok, now that we have our data, let’s start working with some bar chart examples.

Let’s start with a simple example.

Here, we’re just going to make a simple bar chart with pyplot using the plt.bar function. We won’t do any formatting … this will just produce a bar chart with default formatting.

To do this, we’re going to call the `plt.bar()` function and we will pass `bar_x_positions` to the `x` parameter and `bar_heights` to the `height` parameter.

# PLOT A SIMPLE BAR CHART
plt.bar(bar_x_positions, bar_heights)

And here is the output:

This is fairly simple, but there are a few details that I need to explain.

First, notice the position of each of the bars. The bars are at locations 0, 1, 2, and 3 along the x axis. This corresponds to the values stored in `bar_x_positions` and passed to the `x` parameter.

Second, notice the height of the bars. The heights are 1, 4, 9, and 16. As should be obvious by now, these bar heights correspond to the values contained in the variable `bar_heights`, which has been passed to the `height` parameter.

Finally, notice that we’re passing the values `bar_x_positions` and `bar_heights` by *position*. When we do it this way, Python knows that the first argument (`bar_x_positions`) corresponds to the `x` parameter and the second argument (`bar_heights`) corresponds to the `height` parameter. There’s a bit of a quirk here: in some older versions of matplotlib, making the parameter names explicit by typing `plt.bar(x = bar_x_positions, height = bar_heights)` produced an error. Current versions accept both forms, but if you pass the two variables by position, in the correct order, the code will work regardless of which version you have installed.
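If you’re on a reasonably recent matplotlib release (that’s an assumption on my part, so check your installed version), the first two parameters of plt.bar are named `x` and `height`, and naming them explicitly should work. Here’s a quick sketch that draws the same chart both ways; `by_position` and `by_keyword` are just illustrative names.

```python
import matplotlib.pyplot as plt

bar_heights = [1, 4, 9, 16]
bar_x_positions = [0, 1, 2, 3]

# Arguments passed by position
by_position = plt.bar(bar_x_positions, bar_heights)

# Same chart on a new figure, with the parameter names written out
# (assumes a recent matplotlib release)
plt.figure()
by_keyword = plt.bar(x=bar_x_positions, height=bar_heights)
```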

Next, we’ll change the color of the bars.

This is a very simple modification, but it’s the sort of thing that can make your plot look better, if you do it right.

There are a couple of different ways to change the color of the bars. You can change the bars to a “named” color, like ‘red,’ ‘green,’ or ‘blue’. Or, you can change the color to a hexadecimal color. Hex colors are a little more complicated, so I’m not going to show you how to use them here. Having said that, hex colors give you more control, so eventually you should become familiar with them.

Ok. Here, we’re going to make a simple change. We’re going to change the color of the bars to ‘red.’

To do this, we can just provide a color value to the `color` parameter:

plt.bar(bar_x_positions, bar_heights, color = 'red')

The code produces the following output:

Admittedly, this chart doesn’t look that much better than the default, but it gives you a simple example of how to change the bar colors. This code is easy to learn and easy to practice (you should always start with relatively simple examples).

As you become more skilled with data visualization, you will be able to select other colors that look better for a particular data visualization task.

The point here is that you can change the color of the bars with the `color` parameter, and it’s relatively easy.
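The `color` parameter will also accept a sequence of named colors, one per bar, if you want each bar to have its own color. Here’s a minimal sketch; the particular colors are arbitrary choices for illustration.

```python
import matplotlib.pyplot as plt
from matplotlib.colors import to_rgba

bar_heights = [1, 4, 9, 16]
bar_x_positions = [0, 1, 2, 3]

# One named color per bar, in the same order as the bars
bars = plt.bar(bar_x_positions, bar_heights,
               color=['red', 'green', 'blue', 'orange'])

# Each bar is a rectangle whose fill color can be read back as an RGBA tuple
first_bar_color = tuple(bars[0].get_facecolor())
```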

Now, I’ll show you how to change the width of the bars.

To do this, you can use the `width` parameter.

plt.bar(bar_x_positions, bar_heights, width = .5)

And here’s the output:

Here, we’ve set the bar widths to 0.5. In this case, I think that the default (0.8) is better. However, there may be situations where the bars are spaced out at larger intervals. In those cases, you’ll need to make your bars wider. My recommendation is that you make the space between the bars about 20% of the width of the bars.
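To show how that 20% rule of thumb translates into a number, here’s a small worked sketch. It assumes hypothetical bars spaced 5 units apart (that spacing isn’t from the examples above). If the gap is 20% of the bar width, then the spacing equals 1.2 times the width, so the width is the spacing divided by 1.2.

```python
import matplotlib.pyplot as plt

bar_heights = [1, 4, 9, 16]

spacing = 5  # hypothetical distance between bar centers
wide_x_positions = [i * spacing for i in range(len(bar_heights))]  # [0, 5, 10, 15]

gap_fraction = 0.2                        # gap between bars = 20% of the bar width
bar_width = spacing / (1 + gap_fraction)  # about 4.17

plt.bar(wide_x_positions, bar_heights, width=bar_width)
```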

You might have noticed in the prior examples that there is a bit of a problem with the x axis of our bar charts: the bars don’t have category labels.

Let’s take a look by re-creating the simple bar chart from earlier in the tutorial:

# RE-CREATE THE SIMPLE BAR CHART
plt.bar(bar_x_positions, bar_heights)

It produces the following bar chart:

Again, just take a look at the bar labels on the x axis. By default, they are just the x-axis positions of the bars. They are *not* the categories.

In most cases, this will not be okay.

In almost all cases, when you create a bar chart, the bars need to have labels. Typically, each bar is labeled with an appropriate category.

How do we do that?

When you use the plt.bar function from pyplot, you need to set those bar labels *manually*. As you’ve probably noticed, they are not included when you build a basic bar chart like the one we created earlier with the code `plt.bar(bar_x_positions, bar_heights)`.

Here, I’ll show you how.

To add labels to your bars, you need to use the plt.xticks function.

Specifically, you need to call `plt.xticks()`, and provide two arguments: you need to provide the x axis positions of your bars as well as the labels that correspond to those bars.

So in this example, we will call the function as follows: `plt.xticks(bar_x_positions, bar_labels)`. The `bar_x_positions` variable contains the position of each bar, and the `bar_labels` variable contains the labels of each bar. (Remember that we defined both variables earlier in this tutorial.)

# ADD X AXIS LABELS
plt.bar(bar_x_positions, bar_heights)
plt.xticks(bar_x_positions, bar_labels)

And here is the result:

Notice that each bar now has a categorical label.
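As a side note, plt.bar also has a `tick_label` parameter that sets the bar labels in the same call. This is an alternative to the plt.xticks approach shown here, not a replacement for it; the sketch below should produce the same labels.

```python
import matplotlib.pyplot as plt

bar_heights = [1, 4, 9, 16]
bar_labels = ['alpha', 'beta', 'gamma', 'delta']
bar_x_positions = [0, 1, 2, 3]

# tick_label puts one label under each bar, in the same order as the bars
plt.bar(bar_x_positions, bar_heights, tick_label=bar_labels)
```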

Ok, now I’ll show you a quick trick that will improve the appearance of your Python bar charts.

One of the major issues with standard matplotlib bar charts is that they don’t look all that great. The standard formatting from matplotlib is – to put it bluntly – ugly.

To be clear, the basic formatting is fine if you’re just doing some data exploration at your workstation. The basic formatting is okay if you’re creating charts for personal consumption.

But if you need to show your charts to anyone important, then the default formatting probably isn’t good enough. The default formatted charts look basic. They lack polish. They are a little unprofessional. You might not understand this, but you need to realize that the appearance of your charts matters when you present them to anyone important.

That being the case, you need to learn to format your charts properly.

The full details of how to format your charts is beyond the scope of this post, but here I’ll show you a quick way to dramatically improve the appearance of your pyplot charts.

We’re going to use a special function from the seaborn package to improve our charts.

To use this function, you’ll need to have seaborn installed, and you’ll need to import it. You can import it with the following code:

# import seaborn module
import seaborn as sns

Once you have seaborn imported, you can use the seaborn.set() function to set new plot defaults for your matplotlib charts. Because we imported seaborn as `sns`, we can call the function with `sns.set()`.

#set plot defaults using seaborn formatting
sns.set()

This essentially changes many of the plot defaults like the background color, gridlines, and a few other things.

Let’s replot our bar chart so you can see what I mean.

#plot bar chart
plt.bar(bar_x_positions, bar_heights)

Here’s the plot:

I’ll be honest … I think this is dramatically better. Just using this one simple modification makes your matplotlib bar chart look much more professional.

One issue that you might run into though is that when you use the seaborn.set() function, all of your subsequent charts will have that formatting. That might not be what you want!

So how do you revert to the original matplotlib formatting?

You can do that by running the following code:

# REMOVE SEABORN FORMATTING
sns.reset_orig()

If you run this, it will reset the matplotlib formatting back to the original default values.
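If you only want the seaborn look for one chart, another option is to apply a seaborn style temporarily instead of setting it and resetting it. The sketch below uses seaborn’s `axes_style()` function in a `with` block. Keep in mind that this is a different technique from `sns.set()`: it only applies the style settings (background, gridlines), not the other defaults that `sns.set()` changes. The `facecolor_*` variables are only there to show that the settings revert afterwards.

```python
import matplotlib.pyplot as plt
import seaborn as sns

bar_heights = [1, 4, 9, 16]
bar_x_positions = [0, 1, 2, 3]

facecolor_before = plt.rcParams['axes.facecolor']

# The 'darkgrid' style applies only to charts created inside the with-block
with sns.axes_style('darkgrid'):
    facecolor_inside = plt.rcParams['axes.facecolor']
    plt.figure()
    plt.bar(bar_x_positions, bar_heights)

# After the block, the previous setting is back in place
facecolor_after = plt.rcParams['axes.facecolor']
```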

Let’s do one more example.

Here, we’ll use several techniques together to create a more complete and refined bar chart in Python.

We’ll set the bar positions and heights using the plt.bar function. Then we’ll add the bar labels using the plt.xticks function. We’ll change the color using the `color` parameter. And we’ll improve the background formatting by using the `sns.set()` function from seaborn.

Let’s take a look:

# COMBINED EXAMPLE
import seaborn as sns
sns.set()
plt.bar(bar_x_positions, bar_heights, color = 'darkred')
plt.xticks(bar_x_positions, bar_labels)

And here is the output:

Let’s quickly break this down.

We used the `plt.bar()` function to create a simple bar chart. The bar locations have been defined with the `bar_x_positions` variable and the bar heights have been defined with the `bar_heights` variable. We set the color of the bars to ‘darkred’ by using the `color` parameter. We set the bar category labels by using the `plt.xticks` function. And we improved the overall plot formatting by using the `sns.set()` function.

There is certainly more that we could do to improve this chart. We could add a plot title, axis titles, and maybe change the fonts.

Having said that, this looks pretty damn good for a simple bar chart, and it’s only a few lines of code. In my opinion, it’s dramatically better than a simple default bar chart made with matplotlib.
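If you do want to add a plot title and axis titles, pyplot has the `plt.title()`, `plt.xlabel()`, and `plt.ylabel()` functions for that. Here’s a minimal sketch without the seaborn formatting. The title and axis text are placeholders that I made up, since our example data doesn’t represent anything in particular.

```python
import matplotlib.pyplot as plt

bar_heights = [1, 4, 9, 16]
bar_labels = ['alpha', 'beta', 'gamma', 'delta']
bar_x_positions = [0, 1, 2, 3]

plt.bar(bar_x_positions, bar_heights, color='darkred')
plt.xticks(bar_x_positions, bar_labels)

# Placeholder text, purely for illustration
plt.title('Example bar chart')
plt.xlabel('Category')
plt.ylabel('Value')
```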

And one last thing …

As I noted earlier, if you use the `sns.set()` function to use seaborn formatting for your plots, you may want to reset the defaults afterwards. To do that, run the following code:

# reset defaults
sns.reset_defaults()

This will return your matplotlib formatting back to the matplotlib defaults.

This tutorial should have given you a solid foundation for creating bar charts with matplotlib.

Having said that, there’s a lot more to learn. If you want to get the most out of matplotlib, you’ll need to learn more tools and more functions. You’ll need to learn more about matplotlib, but you’ll also need to learn more about NumPy and NumPy arrays. For example, you’ll often need to use techniques like NumPy linspace to set axis tick locations.
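To give you a quick idea of what that looks like, here’s a rough sketch that uses np.linspace to create evenly spaced tick locations for the y axis of our bar chart. The choice of five ticks from 0 to 16 is just an assumption that happens to fit this data.

```python
import numpy as np
import matplotlib.pyplot as plt

bar_heights = [1, 4, 9, 16]
bar_x_positions = [0, 1, 2, 3]

plt.bar(bar_x_positions, bar_heights)

# Five evenly spaced tick locations from 0 to 16, inclusive
y_tick_locations = np.linspace(0, 16, 5)   # array([ 0.,  4.,  8., 12., 16.])
plt.yticks(y_tick_locations)
```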

Overall, my point is that there’s more to learn. If you want to be great at data science in Python, you really need to know matplotlib.

So, this tutorial should be great for helping you learn some of the basics of the matplotlib bar chart, but if you’re really interested in data science, you’ll need to learn quite a bit more.

If you want to learn more about matplotlib and data science in Python, sign up for our email list.

When you sign up, you’ll get our tutorials delivered directly to your inbox. Every week, we publish data science tutorials … members of our email list hear about them whenever they are published.

If you sign up, you’ll get free tutorials about:

- Matplotlib
- NumPy
- Pandas
- Base Python
- Scikit learn
- Machine learning
- Deep learning
- … and more.

Want to learn data science in Python? Sign up now.

The post How to make a matplotlib bar chart appeared first on Sharp Sight.

The post How to use Pandas iloc to subset Python data appeared first on Sharp Sight.

Working with data in Pandas is not terribly hard, but it can be a little confusing to beginners. The syntax is a little foreign, and ultimately you need to practice a lot to really make it stick.

To make it easier, this tutorial will explain the syntax of the iloc method to help make it crystal clear.

Additionally, this tutorial will show you some simple examples that you can run on your own.

This is critical. When you’re learning new syntax, it’s best to learn and master the tool with simple examples first. Learning is much easier when the examples are simple and clear.

Having said that, I recommend that you read the whole tutorial. It will provide a refresher on some of the preliminary things you need to know (like the basics of Pandas DataFrames). Everything will be more cohesive if you read the entire tutorial.

But, if you found this from a Google search, and/or you’re in a hurry, you can click on one of the following links and it will take you directly to the appropriate section:

- A quick refresher on Pandas
- Pandas DataFrame basics
- The syntax of Pandas iloc
- Examples: how to use iloc

Before I explain the Pandas iloc method, it will probably help to give you a quick refresher on Pandas and the larger Python data science ecosystem.

There are a few core toolkits for doing data science in Python: NumPy, Pandas, matplotlib, and scikit learn. Those are the big ones right now.

Each of those toolkits focuses on a different part of data science or a different part of the data workflow.

For example, NumPy focuses on numeric data organized into array-like structures. It’s a data manipulation toolkit specifically for numeric data.

Matplotlib focuses on data visualization. Commonly, when you’re doing data science or analytics, you need to *visualize* your data. This is true even if you’re working on an advanced project. You need to perform data visualization to *explore* your data and understand your data. Matplotlib provides a data visualization toolkit so you can visualize your data. You can use matplotlib for simple tasks like creating scatterplots in Python, histograms of single variables, line charts that plot two variables, etc.

And then there’s Pandas.

Pandas also focuses on a specific part of the data science workflow in Python.

… it focuses on **data manipulation with DataFrames**.

Again, in this tutorial, I’ll show you how to use a specific tool, the iloc method, to retrieve data from a Pandas DataFrame.

Before I show you that though, let’s quickly review the basics of Pandas dataframes.

To understand the iloc method in Pandas, you need to understand Pandas DataFrames.

DataFrames are a type of data structure. Specifically, they are 2-dimensional structures with a row and column form.

So Pandas DataFrames are strictly 2-dimensional.

Also, the columns can contain different data types (although all of the data *within* a column must have the same data type).

Essentially, these features make Pandas DataFrames sort of like Excel spreadsheets.

Importantly, each row and each column in a Pandas DataFrame has a number. An *index*.

This structure, a row-and-column structure with numeric indexes, means that you can work with data by the row number and the column number.
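Here’s a tiny sketch that shows these properties on a made-up DataFrame (the data here is hypothetical and only has two columns): it has two dimensions, and each column has its own data type.

```python
import pandas as pd

# A small hypothetical DataFrame: one text column and one integer column
df = pd.DataFrame({
    'country': ['USA', 'China', 'Japan'],
    'GDP': [19390604, 12237700, 4872137],
})

print(df.dtypes)   # one data type per column
print(df.shape)    # (3, 2): three rows, two columns
print(df.ndim)     # 2
```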

That’s exactly what we can do with the Pandas iloc method.

The `iloc` method enables you to “locate” a row or column by its “integer index.”

We use the numeric, integer index values to locate rows, columns, and observations.

**i**nteger **loc**ate.

`iloc`.

Get it?

The syntax of the Pandas iloc isn’t that hard to understand, especially once you use it a few times. Let’s take a look at the syntax.

The syntax of iloc is straightforward.

You call the method by using “dot notation.” You should be familiar with this if you’re using Python, but I’ll quickly explain.

To use iloc in Pandas, you need to have a Pandas DataFrame. To access iloc, you’ll type in the name of the dataframe and then a “dot.” Then type in “`iloc`”.

Immediately after the `iloc` method, you’ll type a set of brackets.

Inside of the brackets, you’ll use integer index values to specify the rows and columns that you want to retrieve. The order of the indexes inside the brackets obviously matters. The first index number will be the row or rows that you want to retrieve. Then the second index is the column or columns that you want to retrieve. Importantly, the column index is *optional*.

If you don’t provide a column index, iloc will retrieve all columns by default.
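In code form, the syntax described above looks roughly like this. The DataFrame here is a hypothetical one, just to show the shape of the syntax; we’ll build the real example data in a moment.

```python
import pandas as pd

# Hypothetical data, just to show the shape of the syntax
df = pd.DataFrame({
    'name': ['a', 'b', 'c'],
    'value': [10, 20, 30],
})

row_and_column = df.iloc[1, 1]   # row index 1, column index 1 -> 20
row_only = df.iloc[1]            # row index 1, all columns (column index omitted)
```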

As I mentioned, the syntax of iloc isn’t that complicated.

It’s fairly simple, but it *still takes practice*.

Even though it’s simple, it’s actually easy to forget some of the details or confuse some of the details.

For example, it’s actually easy to forget which index value comes first inside of the brackets. Does the row index come first, or the column index? It’s easy to forget this.

It’s also easy to confuse the `iloc[]` method with the `loc[]` method. This other data retrieval method, `loc[]`, is extremely similar to `iloc[]`, and the similarity can confuse people. The `loc[]` method works differently though (we explain the loc method in a separate tutorial).
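To give you a quick sense of the difference, here’s a minimal sketch with a hypothetical DataFrame that uses country names as row labels. With `iloc[]` you identify the cell by integer position, and with `loc[]` you identify it by label.

```python
import pandas as pd

# Hypothetical DataFrame with text labels as its row index
df = pd.DataFrame(
    {'GDP': [19390604, 12237700, 4872137]},
    index=['USA', 'China', 'Japan'],
)

by_position = df.iloc[1, 0]          # second row, first column, by integer position
by_label = df.loc['China', 'GDP']    # the same cell, by row label and column label
```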

Although the iloc method can be a little challenging to learn in the beginning, it’s possible to learn and master this technique *fast*. Here at Sharp Sight, our premium data science courses will teach you to memorize syntax, so you can permanently remember all of those important little details.

This tutorial won’t give you all of the specifics about how to memorize the syntax of iloc. But, I can tell you that it just takes practice and repetition to remember the little details. You need to work with simple examples, and practice those examples over time until you can remember how everything works.

Speaking of examples, let’s start working with some real data.

Like I said, you need to learn these techniques and practice with simple examples.

Here, in the following examples, we’ll cover the following topics:

- rows selection with iloc
- column selection with iloc
- retrieve specific cells with iloc
- retrieve ranges of rows and columns (i.e., slicing)
- get specific subsets of cells

Before we work on those examples though, you’ll need to create some data.

First, we’ll import the Pandas module. Obviously, we’ll need this to call Pandas functions.

#===============
# IMPORT MODULES
#===============
import pandas as pd

Next, you’ll need to create a Pandas DataFrame that will hold the data we’re going to work with.

There are two steps to this. First, we need to create a dictionary of lists that contain the data. Essentially, in this structure, the “key” will be the name of the column, and the associated list will contain the values of that column. You’ll see how this works in a minute.

#==========================
# CREATE DICTIONARY OF DATA
#==========================
country_data_dict = {
    'country':['USA', 'China', 'Japan', 'Germany', 'UK', 'India']
    ,'continent':['Americas','Asia','Asia','Europe','Europe','Asia']
    ,'GDP':[19390604, 12237700, 4872137, 3677439, 2622434, 2597491]
    ,'population':[322179605, 1403500365, 127748513, 81914672, 65788574, 1324171354]
}

Now that we have our dictionary, `country_data_dict`, we’re going to create a DataFrame from this data. To do this, we’ll apply the `pd.DataFrame()` function to the `country_data_dict` dictionary. Notice that we’re also using the `columns` parameter to specify the order of the columns.

#=================================
# CREATE DATAFRAME FROM DICTIONARY
#=================================
country_data_df = pd.DataFrame(country_data_dict, columns = ['country', 'continent', 'GDP', 'population'])

Now we have a DataFrame of data, `country_data_df`, which contains country level economic and population data.

First, I’ll show you how to select single rows with iloc.

For example, let’s just select the first row of data. To do this, we’ll call the iloc method using dot notation, and then we’ll use the integer index value inside of the brackets.

country_data_df.iloc[0]

Which produces the following output:

country             USA
continent      Americas
GDP            19390604
population    322179605
Name: 0, dtype: object

Essentially, the code pulls back the first row of data, and *all* of the columns.

Notice that the “first” row has the numeric index of `0`. If you’ve used Python for a little while, this should make sense. When we use indexes with Python objects – including lists, arrays, NumPy arrays, and other sequences – the numeric indexes start with `0`. The first value of the index is 0. This is very consistent in Python.
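Here’s a short sketch of what zero-based indexing means for a DataFrame with six rows. I’ve used a cut-down, one-column version of the country data so the example stands on its own. It also shows that iloc accepts negative positions, which count back from the end just like they do with Python lists.

```python
import pandas as pd

# One-column version of the country data, for illustration
df = pd.DataFrame({'country': ['USA', 'China', 'Japan', 'Germany', 'UK', 'India']})

first_row = df.iloc[0]     # index 0 is the first row
last_row = df.iloc[5]      # six rows, so the last one is at index 5
also_last = df.iloc[-1]    # negative positions count back from the end
```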

Here’s another example.

We can pull back the sixth row of data by using index value `5`. Remember, because the index values start at `0`, the numeric index value will be one less than the row of data you want to retrieve.

Let’s pull back the row of data at index value `5`:

country_data_df.iloc[5]

Which produces the following output:

country            India
continent           Asia
GDP              2597491
population    1324171354
Name: 5, dtype: object

Again, this is essentially the data for row index 5, which contains the data for India. Here, you can see the data for all of the columns.

There’s actually a different way to select a single row using iloc.

This is important, actually, because the syntax is more consistent with the syntax that we’re going to use to select columns, and to retrieve “slices” of data.

Here, we’re still going to select a single row. But, we’re going to use some syntax that explicitly tells Pandas that we want to retrieve *all columns*.

country_data_df.iloc[0, :]

Which produces the following:

country             USA
continent      Americas
GDP            19390604
population    322179605
Name: 0, dtype: object

Notice that this is the same output that’s produced by the code `country_data_df.iloc[0]`.

What’s going on here?

Notice that in this new syntax, we still have an integer index for the rows. That’s in the first position just inside of the brackets.

But now we also have a ‘`:`’ symbol in the second position inside of the brackets.

The colon character (‘`:`’) essentially tells Pandas that we want to retrieve all columns.

Remember from the syntax explanation above that we can use two integer index values inside of `iloc[]`. The first is the row index and the second is the column index.

When we want to retrieve *all* columns, we can use the ‘`:`’ character.

You’ll understand this more later. It’s relevant for when we retrieve ‘slices’ of data.
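If you want to confirm for yourself that the two forms return the same thing, here’s a quick sketch using a small hypothetical DataFrame and the `equals()` method of a Pandas Series.

```python
import pandas as pd

# Small hypothetical DataFrame
df = pd.DataFrame({
    'country': ['USA', 'China', 'Japan'],
    'continent': ['Americas', 'Asia', 'Asia'],
})

implicit_all_columns = df.iloc[0]      # column index omitted
explicit_all_columns = df.iloc[0, :]   # ':' asks for all columns explicitly

print(implicit_all_columns.equals(explicit_all_columns))  # True
```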

Similarly, you can select a single column of data using a special syntax that uses the ‘`:`’ character.

Let’s say that we want to retrieve the first column of data, which is the column at index position `0`.

To do this, we will use an integer index value in the *second* position inside of the brackets when we use `iloc[]`. Remember that the integer index in the second position specifies the *column* that we want to retrieve.

What about the rows?

When we want to retrieve a single column and *all rows* we need to use a special syntax using the ‘`:`’ character.

You’ll use the ‘`:`’ character in the first position inside of the brackets when we use `iloc[]`. This indicates that we want to retrieve all of the rows. Remember, the first index position inside of `iloc[]` specifies the rows, and when we use the ‘`:`’ character, we’re telling Pandas to retrieve *all* of the rows.

Let me show you an example of this in action.

In this example, we’re going to retrieve a single column.

The code is simple. We have our DataFrame that we created above: `country_data_df`.

We’re going to use dot notation after the DataFrame to call the `iloc[]` method.

Inside of the brackets, we’ll have the ‘`:`’ character, which indicates that we want to get all rows. We also have `0` in the second position inside the brackets, which indicates that we want to retrieve the column with index `0` (the first column in the DataFrame).

Let me show you the code:

country_data_df.iloc[:,0]

And here is the output.

0        USA
1      China
2      Japan
3    Germany
4         UK
5      India
Name: country, dtype: object

Notice that the code retrieved a single column of data, the ‘`country`’ column, which is the first column in our DataFrame, `country_data_df`.

It’s pretty straightforward. Using the syntax explained above, iloc retrieved a single column of data from the DataFrame.
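As a quick check on what comes back, the sketch below assumes a two-column stand-in for `country_data_df` (only the `country` and `continent` values that appear in this tutorial's outputs). It shows that `iloc[:, 0]` returns a Pandas Series named after the column, and that it matches selecting the column by its label.

```python
import pandas as pd

# Stand-in for country_data_df: only the two columns shown for all six rows
country_data_df = pd.DataFrame({
    "country": ["USA", "China", "Japan", "Germany", "UK", "India"],
    "continent": ["Americas", "Asia", "Asia", "Europe", "Europe", "Asia"],
})

# All rows (':'), first column (0)
first_column = country_data_df.iloc[:, 0]

print(type(first_column).__name__)                      # Series
print(first_column.name)                                # country
print(first_column.equals(country_data_df["country"]))  # True
```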

Now, let’s move on to something a little more complicated.

Here, we’re going to select the data in a specific cell in the DataFrame.

You’ll just use `iloc[]` and specify an integer index value for the data in the row and column you want to retrieve.

So if we want to select the data in row `2` and column `0` (i.e., row index `2` and column index `0`) we’ll use the following code:

country_data_df.iloc[2,0]

Which produces the following output:

'Japan'

Again. This is pretty straightforward.

Using the first index position, we specified that we want the data from row `2`, and we used the second index position to specify that we want to retrieve the information in column `0`.

The data that fits *both* criteria is `Japan`, in cell `(2, 0)`.

Notice that the Pandas DataFrame essentially works like an Excel spreadsheet. You can just specify the row and column of the data that you want to pull back.
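Here is a small, self-contained sketch of the single-cell lookup. It assumes a three-row stand-in for `country_data_df`. It also uses `iat[]`, a Pandas accessor for reading one scalar by integer position. That accessor is not covered in this tutorial and is included only for comparison.

```python
import pandas as pd

# Stand-in for country_data_df: the first three rows shown in the tutorial
country_data_df = pd.DataFrame({
    "country": ["USA", "China", "Japan"],
    "continent": ["Americas", "Asia", "Asia"],
})

# Row index 2, column index 0
cell = country_data_df.iloc[2, 0]

# iat[] reads a single value by integer position as well
same_cell = country_data_df.iat[2, 0]

print(cell)               # Japan
print(cell == same_cell)  # True
```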

Now that I’ve explained how to select specific rows and columns using `iloc[]`, let’s talk about slices.

When we “slice” our data, we take multiple rows or multiple columns.

There’s a special syntax to do this, which is related to some of the examples above.

Essentially, we can use the colon (‘`:`’) character inside of `iloc[]` to specify a start row and a stop row.

Keep in mind that the row number specified by the stop index value is *not* included.
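This "stop is excluded" rule is the same convention that ordinary Python list slicing uses. The sketch below assumes a one-column stand-in DataFrame named `df` (a hypothetical name) to show the two side by side.

```python
import pandas as pd

countries = ["USA", "China", "Japan", "Germany", "UK", "India"]

# Hypothetical one-column DataFrame built from the country names above
df = pd.DataFrame({"country": countries})

# Start at row 0, stop before row 3
subset = df.iloc[0:3]

print(list(subset.index))  # [0, 1, 2]
print(countries[0:3])      # ['USA', 'China', 'Japan']
```

In both cases the slice `0:3` yields three items, at positions 0, 1, and 2.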

It’s always best to illustrate an abstract concept with a concrete example, so let’s take a look at an example of how to use iloc to retrieve a slice of rows.

Here, we’re going to retrieve a subset of rows.

This is pretty straightforward.

We’re going to specify our DataFrame, `country_data_df`, and then call the `iloc[]` method using dot notation.

Then, inside of the iloc method, we’ll specify the start row and stop row indexes, separated by a colon.

Here’s the exact code:

country_data_df.iloc[0:3]

And here are the rows that it retrieves:

   country  continent       GDP  population
0      USA   Americas  19390604   322179605
1    China       Asia  12237700  1403500365
2    Japan       Asia   4872137   127748513

Notice what data we have here.

The code has retrieved rows `0`, `1`, and `2`.

It also retrieved *all* of the columns.

This is pretty straightforward … we’re retrieving a subset of rows by using the colon (‘`:`’) character inside of `iloc[]`.

Now, we’re going to retrieve a subset of *columns* using iloc.

This is very similar to the previous example where we retrieved a subset of rows. The only difference is how exactly we use the row and column indexes inside of `iloc[]`.

Here, we’re going to specify that we’re going to use data from `country_data_df`. Then we’ll use dot notation to call the `iloc[]` method following the name of the DataFrame.

Inside of the `iloc[]` method, we’re using the “`:`” character for the row index. This means that we want to retrieve all rows.

For the column index, we’re using the range `0:2`. This means that we want to retrieve the columns starting from column `0` up to and excluding column `2`.

Here’s the exact code:

country_data_df.iloc[:,0:2]

Which produces the following result:

   country  continent
0      USA   Americas
1    China       Asia
2    Japan       Asia
3  Germany     Europe
4       UK     Europe
5    India       Asia

If you understand column indexes and how to get slices of data with iloc, this is pretty easy to understand.

The code `country_data_df.iloc[:,0:2]` gets columns `0` and `1`, and gets all rows.

Visually, this is what is being retrieved:

To be clear, Pandas slices can get more complicated than this.

I recommend that you first learn, practice, and master these simple examples before you move on to anything more complicated.

Finally, let’s retrieve a subset of *cells* from our data.

Doing this is really just a combination of getting a slice of columns and a slice of rows with `iloc`, at the same time.

Let me show you.

country_data_df.iloc[1:5,0:3]

Which produces the following output:

   country  continent       GDP
1    China       Asia  12237700
2    Japan       Asia   4872137
3  Germany     Europe   3677439
4       UK     Europe   2622434

So what did we do here?

We called `iloc[]` using dot notation after the name of the Pandas DataFrame.

Inside of the `iloc[]` method, you see that we’re retrieving rows ‘`1:5`’ and columns ‘`0:3`’.

This means that we want to retrieve rows `1` to `4` (remember, the “stop” index is *excluded*, so it will exclude `5`). It is also saying that we want to retrieve the contents of columns from `0` through `2`.

This has the effect of selecting the data in rows `1` through `4` and columns `0` through `2`. The cells that get retrieved must meet both criteria.

Visually, we can represent the results like this:

Again, this is relatively easy to understand if you understand the basics of iloc and the basics of slices.
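One way to convince yourself that the row slice and the column slice combine this way is to do the selection in two steps and compare. The sketch below assumes a stand-in for `country_data_df` with only the first five rows and first three columns shown in this tutorial's outputs. The names `cells` and `two_step` are illustrative.

```python
import pandas as pd

# Stand-in for country_data_df: the first five rows and first three columns
# as they appear in the outputs shown in this tutorial
country_data_df = pd.DataFrame({
    "country": ["USA", "China", "Japan", "Germany", "UK"],
    "continent": ["Americas", "Asia", "Asia", "Europe", "Europe"],
    "GDP": [19390604, 12237700, 4872137, 3677439, 2622434],
})

# Rows 1 through 4 and columns 0 through 2 in a single call
cells = country_data_df.iloc[1:5, 0:3]

# The same selection in two steps: slice the rows, then slice the columns
two_step = country_data_df.iloc[1:5].iloc[:, 0:3]

print(cells.shape)             # (4, 3)
print(cells.equals(two_step))  # True
```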

That being said, if you have questions, leave them in the comment section below.

I’m sure that you’ve heard it before: data manipulation is really important for data science.

I’ve said it before, and so have many other professional data scientists.

In fact, you’ll often hear the quote that “80 percent of your work as a data scientist will be data manipulation.”

That’s probably pretty close to true. Data manipulation is *really important*.

If you want to learn data science in Python, that means that you should really know the Pandas module and how to retrieve data using methods like iloc.

Having said that, if you’re interested in learning more about Pandas and more about data science in Python, then sign up for our email list.

Here at Sharp Sight, we teach data science.

Every week, we post new tutorials about Python data science topics like:

- Pandas
- Matplotlib
- Scikit-learn
- NumPy
- Seaborn
- Keras

We also publish data science tutorials for the R programming language.

When you sign up for our email list, you’ll get these tutorials delivered *directly to your inbox* every week.

If you want *FREE* data science tutorials every week, then sign up now.

The post How to use Pandas iloc to subset Python data appeared first on Sharp Sight.

]]>The post How to use the NumPy median function appeared first on Sharp Sight.

]]>This tutorial will teach you a few things.

First, it will show you how the NumPy median function works syntactically. We’ll cover the basic syntax and parameters of the np.median function.

I’ll also show you some examples of how to use it. As always, one of the best ways to learn new syntax is studying and practicing simple examples.

If you’re a relative beginner with NumPy, I recommend that you read the full tutorial.

But if you only need help with a specific aspect of the NumPy median function, then you can skip ahead to the appropriate section of the tutorial.

If you’re a real beginner, you may not be 100% familiar with NumPy. So before I explain the np.median function, let me explain what NumPy is.

What exactly is NumPy?

NumPy is a data manipulation module for the Python programming language.

At a high level, NumPy enables you to work with numeric data in Python. A little more specifically, it enables you to work with large arrays of numeric data.

You can create and store numeric data in a data structure called a NumPy array.

NumPy also has a set of tools for performing computations on arrays of numeric data. You can do things like combine arrays of numeric data, split arrays into multiple arrays, or reshape arrays into arrays with a new number of rows and columns.

NumPy also has a set of functions for performing calculations on numeric data. The NumPy median function is one of these functions.

Now that you have a broad understanding of what NumPy is, let’s take a look at what the NumPy median function is.

The NumPy median function computes the median of the values in a NumPy array. Note that the NumPy median function will also operate on “array-like objects” like Python lists.

Let’s take a look at a simple visual illustration of the function.

Imagine we have a 1-dimensional NumPy array with five values:

We can use the NumPy median function to compute the median value:
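In code, that idea looks roughly like the sketch below. The five values are made up for illustration (they are not taken from the tutorial's figure), and the name `five_values` is hypothetical.

```python
import numpy as np

# Hypothetical 1-dimensional array with five values
five_values = np.array([10, 30, 20, 50, 40])

# With an odd number of values, the median is the middle value after sorting
print(np.sort(five_values))    # [10 20 30 40 50]
print(np.median(five_values))  # 30.0
```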

It’s pretty straightforward, although the np.median function can get a little more complicated. It can operate on 2-dimensional or multi-dimensional array objects. It can also calculate the median value of each row or column. You’ll see some examples of these operations in the examples section.

Ok. Now let’s take a closer look at the syntax of the NumPy median function.

One quick note. This explanation of the syntax and all of the examples in this tutorial assume that you’ve imported the NumPy module with the code `import numpy as np`.

This is a common convention among NumPy users. When you write and run a NumPy/Python program, it’s common to import NumPy as `np`. This enables you to refer to NumPy with the “nickname” `np`, which makes the code a little simpler to write and read.

I just wanted to point this out to you to make sure you understand.

Ok. Let’s take a look at the syntax.

Assuming that you’ve imported NumPy as `np`, you call the function by the name `np.median()`. In some programs, you might also see the function called as `numpy.median()`, if the coder imported NumPy as `numpy`. Both are relatively common, and it really depends on how the NumPy module has been imported.

Inside of the `median()` function, there are several parameters that you can use to control the behavior of the function more precisely. Let’s talk about those.

The np.median function has four parameters that we will discuss:

- `a`
- `axis`
- `out`
- `keepdims`

There’s actually a fifth parameter called `overwrite_input`. The `overwrite_input` parameter is not going to be very useful for you if you’re a beginner, so for the sake of simplicity, we’re not going to discuss it in this tutorial.

Ok, let’s quickly review what each parameter does:

`a`

The `a` parameter specifies the data that you want to operate on. It’s the data on which you will compute the median.

Typically, this will be a NumPy array. However, the np.median function can also operate on “array-like objects” such as Python lists. For the sake of simplicity, this tutorial will work with NumPy arrays, but remember that many (if not all) of the examples would work the same way if you used an array-like object instead.

Note that this parameter is required. You need to provide something to the `a` parameter, otherwise the np.median function won’t work.
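As a brief illustration of the “array-like” point, the sketch below passes a plain Python list and a nested list directly to np.median. The variable names are illustrative, and the values are the same six numbers used in the examples section of this tutorial.

```python
import numpy as np

# A plain Python list works as the `a` argument
values_list = [0, 20, 40, 60, 80, 100]
median_from_list = np.median(values_list)

# So does a nested list, which NumPy treats like a 2-dimensional array
nested_list = [[0, 20, 40], [60, 80, 100]]
median_from_nested = np.median(nested_list)

print(median_from_list)    # 50.0
print(median_from_nested)  # 50.0
```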

The `axis` parameter controls the axis along which the function will compute the median.

More simply, the axis parameter enables you to compute median values along the rows of an array, or the median values along the columns of an array (instead of computing the median of all of the values).

Using the `axis` parameter confuses many people.

Later in this tutorial, I’ll show you an example of how to use the `axis` parameter; hopefully that will make it more clear.

But quickly, let me explain how this works.

NumPy arrays have *axes*. It’s best to think of axes as directions along the array.

So if you have a 2-dimensional array, there are two axes: axis 0 is the direction down the rows and axis 1 is the direction across the columns. (Keep in mind that higher-dimensional arrays have additional axes.)

When we use NumPy functions like np.median, we can often specify an axis along which to perform the computation.

So when we set axis = 0, the NumPy median function computes the median values downward along axis 0. This effectively computes the column medians.

Similarly, when we set axis = 1, the NumPy median function computes the median values horizontally across axis 1. This effectively computes the row medians.

Hopefully these images illustrate the concept and help you understand.

But if you’re still confused, I’ll show you examples of how to use the axis parameter later in the examples section.

The `out` parameter enables you to specify a different output array where you can put the result.

So if you want to store the result of np.median in a different array, you can use the `out` parameter to do that.

This is an optional parameter.
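The tutorial does not include an example for `out`, so here is a minimal sketch of how it can be used. It assumes you have already allocated an array with the shape of the expected result. The name `result` is hypothetical, and the input is the same 2 by 3 array used in the examples section.

```python
import numpy as np

np_array_2d = np.array([0, 20, 40, 60, 80, 100]).reshape((2, 3))

# Pre-allocated array with one slot per column median
result = np.zeros(3)

# The column medians are written into `result`
np.median(np_array_2d, axis=0, out=result)

print(result)  # [30. 50. 70.]
```

Note that the array passed to `out` needs to match the shape of the output that np.median would otherwise produce.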

The `keepdims` parameter enables you to keep the number of dimensions of the output the same as the input.

This is a little confusing to many people, so let me explain.

Remember that the np.median function (and other similar functions like np.sum and np.mean) summarize your data in some way. They are computing summary statistics.

When you summarize the data in this way, you are effectively collapsing the number of dimensions of the data. For example, if you have a 1-dimensional NumPy array, and you compute the median, you are collapsing the data from a 1-dimensional structure down to a 0 dimensional structure (a single scalar number).

Or similarly, if you compute the column medians of a 2-d array, you’re collapsing the data from 2 dimensions down to 1 dimension.

Essentially, the output of the NumPy median function has a *reduced number of dimensions*.

What if you don’t want that? What if you want the output to have the same number of dimensions as the input?

You can force NumPy median to keep the number of dimensions the same by using the `keepdims` parameter. We can set `keepdims = True` to give the output the same number of dimensions as the input.

I understand that this might be a little abstract, so I’ll show you an example in the examples section.

Note: the `keepdims` parameter is optional. By default it is set to `keepdims = False`, meaning that the output of np.median will not necessarily have the same number of dimensions as the input.

Ok. Let’s work through some examples. In the last section I explained the syntax, which is probably helpful. But to really understand the code, you need to play with some examples.

Before you get started with the examples though, you’ll need to run some code.

You need to import NumPy. Run this code to properly import NumPy.

import numpy as np

By running this code, you’ll be able to refer to NumPy as `np` when you call the NumPy functions.

Ok.

This first example is very simple. We’re going to compute the median value of a 1-dimensional array of values.

First, you’ll need to create the data.

To do this, you can call the np.array function with a list of numeric values.

np_array_1d = np.array([0,20,40,60,80,100])

And now we’ll print out the data:

print(np_array_1d)

And here’s the output:

[  0  20  40  60  80 100]

This is pretty straightforward. Using the np.array function, we’ve created an array with six values from 0 to 100, in increments of 20.

Now, we’ll calculate the median of these values.

np.median(np_array_1d)

Which gives us the following output:

50.0

This is fairly straightforward, but I’ll quickly explain.

Here, the NumPy median function takes the NumPy array and computes the median.

The median of these six values is 50, so the function outputs `50.0` as the result.
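It may not be obvious why the answer is `50.0` when 50 is not one of the values in the array. With an even number of values, the median is the average of the two middle values. The sketch below reproduces that by hand. The names `middle_pair`, `manual_median`, and `odd_values` are illustrative only.

```python
import numpy as np

np_array_1d = np.array([0, 20, 40, 60, 80, 100])

# Six values: the two middle values (after sorting) sit at positions 2 and 3
sorted_values = np.sort(np_array_1d)
middle_pair = sorted_values[2:4]      # 40 and 60
manual_median = middle_pair.mean()    # (40 + 60) / 2

print(manual_median)           # 50.0
print(np.median(np_array_1d))  # 50.0

# With an odd number of values, the median is simply the middle value
odd_values = np.array([0, 20, 40, 60, 80])
print(np.median(odd_values))   # 40.0
```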

Next, let’s work through a slightly more complicated example.

Here, we’re going to calculate the median of a 2-dimensional NumPy array.

First, we’ll need to create the array. To do this, we’re going to use the NumPy array function to create a NumPy array from a list of numbers. After that, we’re going to use the reshape method to reshape the data from a 1-dimensional array to a 2-dimensional array that has 2 rows and 3 columns.

np_array_2d = np.array([0,20,40,60,80,100]).reshape((2,3))

And we can examine the array by using the `print()` function.

print(np_array_2d)

[[  0  20  40]
 [ 60  80 100]]

As you can see, this dataset has six values arranged in a 2 by 3 NumPy array.

Now, we’ll compute the median of these values.

np.median(np_array_2d)

Which produces the following output:

50.0

This example is very similar to the previous example. The only difference is that in this example, the values are arranged into a 2-dimensional array instead of a 1-dimensional array.

Ultimately though, the result is the same.

If we use the np.median function on a 2-dimensional NumPy array, by default, it will just compute the median of all of the values in the array. Here in this example, we only have six values in the array, but we could also have a larger number of values … the function would work the same.

Moreover, the NumPy median function would also work this way for higher dimensional arrays. For example, if we had a 3-dimensional NumPy array, we could use the `median()` function to compute the median of all of the values.
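Here is a quick sketch of the 3-dimensional case. The array is a made-up example built with `np.arange`, and the name `np_array_3d` is hypothetical.

```python
import numpy as np

# Hypothetical 3-dimensional array holding the integers 0 through 23
np_array_3d = np.arange(24).reshape((2, 3, 4))

# With no axis argument, the median is taken over all 24 values
overall_median = np.median(np_array_3d)

print(np_array_3d.ndim)  # 3
print(overall_median)    # 11.5
```

The result is the same as flattening the array first and taking the median of the flattened values.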

However, with 2-d arrays (and multi-dimensional arrays) we can use the axis parameter to compute the median along rows, columns, or other axes.

Let’s take a look.

First, I’m going to show you how to compute the median of the columns of a 2-dimensional NumPy array.

To do this, we need to use the `axis` parameter. Remember from earlier in the tutorial that NumPy axes are like directions along the rows and columns of a NumPy array.

Remember: axis 0 is the direction that points down along the rows, and axis 1 is the direction that points horizontally across the columns (in a 2-d array).

So how exactly does the `axis` parameter control the behavior of np.median?

This is important: when you use the `axis` parameter, the `axis` parameter controls which axis gets summarized.

Said differently, it controls which axis gets collapsed.

So if you set `axis = 0` inside of np.median, you’re effectively telling NumPy to compute the medians *downward*. The medians will be computed down along axis 0. Essentially, it will *collapse* axis 0 and compute the medians down that axis.

In other words, it will compute the column medians.

This confuses many people, because they think that by setting `axis = 0`, it will compute the row medians. That’s not how it works.

Again, it helps to think of NumPy axes as directions. The axis parameter specifies the direction along which the medians will be computed.

Let me show you.

Here, we’re going to compute the column medians by setting `axis = 0`.

Again, we’ll start by creating a dataset.

np_array_2d = np.array([0,20,40,60,80,100]).reshape((2,3))

And we can examine this array by using the `print()` function.

print(np_array_2d)

[[  0  20  40]
 [ 60  80 100]]

Next, let’s compute the median while setting `axis = 0`.

# CALCULATE COLUMN MEDIANS
np.median(np_array_2d, axis = 0)

And here’s the output:

array([ 30., 50., 70.])

What happened here?

NumPy calculated the medians along axis 0. This effectively computes the column medians:

Again, this might seem counterintuitive, so remember what I said previously. The `axis` parameter controls which axis gets summarized. By setting `axis = 0`, we told NumPy median to summarize axis 0.

Now, let’s compute the row medians.

This example is almost identical to the previous example, except here we will set `axis = 1`.

Once again, we’ll create a dataset. (This is the same as the previous example, so if you’ve already run it, you don’t need to re-run it.)

np_array_2d = np.array([0,20,40,60,80,100]).reshape((2,3))

And here are the contents of `np_array_2d`:

[[  0  20  40]
 [ 60  80 100]]

It’s just a simple 2-d NumPy array.

Now, we’re going to calculate the median and set `axis = 1`. This will effectively calculate the row medians.

np.median(np_array_2d, axis = 1)

Here’s the output:

array([ 20., 80.])

If you’ve read this tutorial carefully so far, you should understand this. Still, I’ll explain.

The input array, `np_array_2d`, is a 2-d NumPy array. There are 2 rows and 3 columns.

When we use the `np.median` function on this array with `axis = 1`, we are telling the function to compute the medians along the direction of axis 1. Remember, in a 2-d array, axis 1 is the direction that runs horizontally across the columns.

When we use NumPy median with `axis = 1`, we’re basically telling NumPy to *summarize* axis 1.

This amounts to computing the row medians.

This is fairly easy to understand, but you really need to understand how NumPy axes work. So if you’re still confused, make sure to read our NumPy axis tutorial, and then come back and read this example and the prior example.

Finally, let’s talk about how to use the `keepdims` parameter.

Remember, using the np.median function has the effect of *summarizing* or collapsing your data. As I showed you earlier, if you have an array of 6 values, and you use np.median on that array, it will summarize those values by computing a single value (the median).

Similarly, if you compute the median and use the `axis` parameter, the median function will also reduce the number of dimensions. Like we saw in the previous examples, if we use np.median on a 2-dimensional array with `axis = 0` or `axis = 1`, the np.median function will compute the column medians or row medians respectively. In either case, the input had 2 dimensions, but the output (e.g., the row medians) had only 1 dimension.

This reduction in dimensions is okay in many instances, but sometimes you want the output to have the same number of dimensions as the input.

To force that behavior, we can use the `keepdims` parameter.

By default, the `keepdims` parameter is set to `keepdims = False`. As explained above, this means that the output does not need to have the same number of dimensions as the input.

To change this, we must set `keepdims = True`.

Here’s an example. We’re going to create a 2-d NumPy array and then calculate the column medians:

np_array_2d = np.array([0,20,40,60,80,100]).reshape((2,3))

Quickly, let’s examine the number of dimensions of this array by examining the `ndim` attribute.

np_array_2d.ndim

Which shows us the number of dimensions:

2

The array has 2 dimensions.

Now, let’s compute the median with `axis = 0`, and examine the number of dimensions by also using the `ndim` attribute.

np.median(np_array_2d, axis = 0).ndim

When we run this code, the result is `1`. The output of `np.median(np_array_2d, axis = 0)` has 1 dimension. The explanation for this is just as I explained above: np.median summarizes data, which reduces the number of dimensions.

However, we can *keep* the same number of dimensions by setting `keepdims = True`.

Let’s run the operation and look at the number of dimensions of the output:

np.median(np_array_2d, axis = 0, keepdims = True).ndim

Which produces the following output:

2

What happened here?

The `keepdims` parameter forces the median function to keep the number of dimensions of the output the same as that of the input. The input array (`np_array_2d`) has 2 dimensions, so if we set `keepdims = True`, the output of np.median will also have 2 dimensions.
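Looking at the `shape` attribute as well as `ndim` makes the effect easier to see: the summarized axis is kept, but with length 1. The sketch below also shows one situation where this is handy, subtracting each row’s median from that row. That use case is my own illustration rather than something covered in this tutorial, and the names `row_medians_kept` and `centered` are hypothetical.

```python
import numpy as np

np_array_2d = np.array([0, 20, 40, 60, 80, 100]).reshape((2, 3))

# Without keepdims, the summarized axis disappears
print(np.median(np_array_2d, axis=0).shape)                 # (3,)

# With keepdims, the summarized axis stays, with length 1
print(np.median(np_array_2d, axis=0, keepdims=True).shape)  # (1, 3)

# Row medians with shape (2, 1) line up with the rows of the (2, 3) input,
# so each row's median can be subtracted from that row directly
row_medians_kept = np.median(np_array_2d, axis=1, keepdims=True)
centered = np_array_2d - row_medians_kept

print(centered)
```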

NumPy’s median function is one of several important functions in the NumPy module. Basically, if you’re new to NumPy, there’s a lot more to learn than what we covered here.

And NumPy is really important if you want to learn data science in Python. NumPy is critical for data manipulation in Python. If you want to learn data science in Python, you really need to study NumPy.

With that in mind, I suggest that you sign up for our email list.

Here at Sharp Sight, we regularly publish free tutorials about data science topics.

For example, we regularly publish tutorials about NumPy. If you want to learn NumPy, sign up now.

When you sign up, all of our tutorials will be sent to you. It’s like having a Python data science tutor right in your inbox.

When you sign up, you’ll get free tutorials on:

- NumPy
- Pandas
- Base Python
- Scikit-learn
- Machine learning
- Deep learning
- … and more.

Want to learn data science in Python? Sign up now.

The post How to use the NumPy median function appeared first on Sharp Sight.

]]>