The post How to make a matplotlib histogram appeared first on Sharp Sight.

If you’re interested in data science and data visualization in Python, then read on. This post will explain how to make a histogram in Python using matplotlib.

Here’s exactly what the tutorial will cover:

- A quick introduction to matplotlib
- The syntax for the matplotlib histogram
- Examples of how to make a histogram with matplotlib

Clicking on any of the above links will take you to the relevant section in the tutorial.

Having said that, if you’re a relative beginner, I recommend that you read the full tutorial.

Ok, let’s get started with a brief introduction to matplotlib.

If you’re new to Python – and specifically data science in Python – you might be a little confused about matplotlib.

Here’s a very brief introduction to matplotlib. If you want to skip to the section that’s specifically about matplotlib histograms, click here.

Matplotlib is a module for data visualization in the Python programming language.

If you’re interested in data science or data visualization in Python, matplotlib is very important. It will enable you to create very simple data visualizations like histograms and scatterplots in Python, but it will also enable you to create much more complicated data visualizations. For example, using matplotlib, you can create 3-dimensional plots of your data.

Data visualization is extremely important for data analysis and the broader data science workflow. So even if you’re not interested in data visualization per se, you really do need to master it if you want to be a good data scientist.

That means, if you’re doing data science in Python, you should learn matplotlib.

Related to matplotlib is *pyplot*.

You’ll often see pyplot mentioned and used in the context of matplotlib. Beginners often get confused about the difference between matplotlib and pyplot, because it’s often unclear how they are related.

Essentially, pyplot is a sub-module of matplotlib. It provides a set of convenient functions that enable you to create simple plots like histograms. For example, you can use `plt.plot()` to create a line chart, or you can use the `plt.bar()` function to create a bar chart. Both `plt.plot()` and `plt.bar()` are functions from the pyplot module.

In this tutorial, we’ll be using the `plt.hist()` function from pyplot. Just remember though that a pyplot histogram is effectively a matplotlib histogram, because pyplot is a sub-module of matplotlib.

Now that I’ve explained what matplotlib and pyplot are, let’s take a look at the syntax of the `plt.hist()` function.

From this point forward, we’re going to be dealing with the pyplot `hist()` function, which makes a histogram.

The syntax is fairly straightforward in the simplest case. On the other hand, the `hist()` function has a variety of parameters that you can use to modify its behavior. Really. There are a lot of parameters.

In the interest of simplicity, we’re only going to work with a few of those parameters.

If you really need to control how the function works, and need to use the other parameters, I suggest you consult the documentation for the function.

There are 3 primary parameters that we’re going to cover in this tutorial: `x`, `bins`, and `color`.

The `x` parameter is essentially the input values that you’re going to plot. Said differently, it is the data that you want to plot on the x-axis of your histogram.

(If that doesn’t make sense, take a look at the examples later in the tutorial.)

This parameter will accept an “array or sequence of arrays.”

Essentially, this means that the numeric data that you want to plot in your histogram should be contained in an array-like object, such as a Python list.

For our purposes later in the tutorial, we’re actually going to provide our data in the form of a NumPy array; NumPy arrays are also acceptable.

The `bins` parameter controls the number of bins in your histogram. In other words, it controls the number of bars in the histogram; remember that a histogram is a collection of bars, each of which represents the tally of the data for that part of the x-axis range.

More often than not, you’ll provide an *integer* value to the `bins` parameter. If you provide an integer, it will set the number of bins. For example, if you set `bins = 30`, the histogram will have 30 bars.

You can also provide a string or a Python sequence to the `bins` parameter to get some additional control over the histogram bins. Having said that, using the `bins` parameter that way can be a little more complicated, and I don’t recommend it for beginners.

Also, keep in mind that the `bins` parameter is optional, which means that you don’t need to provide a value.

If you don’t provide a value, matplotlib will use a default value, defined in `matplotlib.rcParams`, which contains matplotlib’s settings. Assuming that you haven’t changed those settings in `matplotlib.rcParams`, the `bins` parameter will default to 10 bins.

For examples of how to work with the bins parameter, consult the example below about histogram bins.
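If you want to check that default for yourself, it lives in matplotlib’s `rcParams` under the key `hist.bins` (this assumes a reasonably recent matplotlib version; in older versions the default was hard-coded rather than configurable):

```python
import matplotlib as mpl

# plt.hist() reads its default number of bins from this setting.
# In a stock matplotlib install, it is 10.
print(mpl.rcParams["hist.bins"])
```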

Finally, let’s talk about the `color` parameter.

As you might guess, the `color` parameter controls the color of the histogram. In other words, it controls the color of the histogram bars.

This parameter is optional; if you don’t explicitly provide a color value, matplotlib will fall back on a default (typically a sort of inoffensive blue).

If you decide to manually set the color, you can set it to a “named” color, like “red,” or “green,” or “blue.” Python and matplotlib have a variety of named colors that you can specify, so take a look at the color options if you manipulate the `color` parameter this way.

You can also provide hexadecimal colors to the `color` parameter. This is actually my favorite way to specify colors in data visualizations, because it gives you tight control over the aesthetics of the chart. On the other hand, using hex colors is more complicated, because you need to understand how hex colors work. Hex colors are beyond the scope of this blog post, so I won’t explain them here.

Ok, now that I’ve explained the syntax and the parameters at a high level, let’s take a look at some examples of how to make a histogram with matplotlib.

Most of the examples that follow are simple. If you’re just getting started with matplotlib or Python, first just try running the examples exactly as they are. Once you understand them, try modifying the code little by little just to play around and build your intuition. For example, change the `color` parameter from “red” to something else. Basically, run the code and then play around a little.

One more thing before we get started with the examples.

Before you run the examples, make sure to run the following code:

```python
import matplotlib
import numpy as np
import matplotlib.pyplot as plt
```

This code will import matplotlib, pyplot, and NumPy.

We’re going to be using matplotlib and pyplot in our examples, so you’ll need them.

Also, run this code to create the dataset that we’re going to visualize.

```python
# CREATE NORMALLY DISTRIBUTED DATA
norm_data = np.random.normal(size = 1000, loc = 0, scale = 1)
```

This will create a dataset called `norm_data`, using the NumPy random normal function. This data is essentially normally distributed data that has a mean of 0 and a standard deviation of 1. How to use NumPy random normal is beyond the scope of this post, so if you want to understand how the code works, consult our tutorial about np.random.normal.
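As a quick sanity check (my addition, not part of the original tutorial), you can confirm that `norm_data` has roughly the mean and standard deviation we asked for:

```python
import numpy as np

np.random.seed(42)  # seed the generator so this check is reproducible
norm_data = np.random.normal(size = 1000, loc = 0, scale = 1)

# With 1000 draws, the sample mean should be near 0 and the
# sample standard deviation near 1 (not exact -- the data is random).
print(norm_data.mean())
print(norm_data.std())
```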

Ok, on to the actual examples.

Let’s start simple.

Here, we’ll use matplotlib to make a simple histogram.

```python
# MAKE A HISTOGRAM OF THE DATA WITH MATPLOTLIB
plt.hist(norm_data)
```

And here is the output:

This is about as simple as it gets, but let me quickly explain it.

We’re calling `plt.hist()` and using it to plot `norm_data`.

`norm_data` contains normally distributed data, and you can see that in the visualization.

Aesthetically, the histogram is very simple. Because we didn’t use the `color` parameter or the `bins` parameter, the visualization has fallen back on the defaults. There are 10 bins (my current default) and the color has defaulted to blue. The plot is also relatively unformatted.

I will be honest. I think the default histogram is a little on the ugly side. At least, it’s rather plain. That’s OK if you’re just doing data exploration for yourself, but if you need to present your work to other people, you might need to format your chart to make it look more pleasing.

Let’s talk about how to change the color of the bars, which is one way to make your chart more visually appealing.

As noted above, we can change the color of the histogram bars using the `color` parameter.

As you saw earlier in the previous example, the bar colors will default to a sort of generic “blue” color.

Here, we’re going to manually set it to “red.”

```python
plt.hist(norm_data, color = 'red')
```

The code produces the following output:

As you can see, the bars are now red.

The chart is still a little visually boring, but this at least shows you how you can change the color. As you become more skilled in data visualization, you can use the `color` parameter to make your histograms more visually appealing.

Now, let’s modify the number of bins.

Changing the number of bars can be important if your data are a little uneven. You can increase the number of bins to get a more fine-grained view of the data. Or, you can decrease the number of bins to smooth out abnormalities in your data.

Because this tutorial is really about how to create Python histograms, I’m not going to talk a lot about histogram applications. However, I do want you to see *how* you can modify the `bins` parameter. That will give you more control over the visualization when you begin to apply the technique.

Here’s the code:

```python
plt.hist(norm_data, bins = 50)
```

And here’s the output:

So what have we done here?

We increased the number of bins by setting `bins = 50`. As I noted above, the `bins` parameter generally defaults to 10 bins. Here, by increasing the number of bins to 50, we’ve generated a more fine-grained view of the data. This can help us see minor fluctuations in the data that are invisible when we use a smaller number of bins.

Now that we’ve covered some of the essential parameters of the plt.hist function, I want to show you a quick way to improve the appearance of your plot.

We’re going to use the seaborn module to change the default formatting of the plot.

To do this, we will first import seaborn.

```python
# import seaborn module
import seaborn as sns
```

Next, we’ll use the `sns.set()` function to modify the default settings of the chart. As you’ll see in a moment, this will change the default values for the background color, gridlines, and a few other things. Ultimately, it will just make your histogram look better.

```python
# set plot defaults using seaborn formatting
sns.set()
```

Finally, let’s replot the data using plt.hist.

```python
# plot histogram with matplotlib.pyplot
plt.hist(norm_data)
```

As you can see, the chart looks different. More professional, in my opinion.

The bar colors are slightly different, and the background has been changed. The changes are actually fairly minor, but I think they make a big difference in making the chart look better.

One quick note.

If you run the above code and use the `sns.set()` function to set the plot defaults with seaborn, you might run into an issue: you might find that all of your subsequent matplotlib charts have the new seaborn formatting.

How do you make that go away?

You can remove the seaborn formatting defaults by running the following code.

```python
# REMOVE SEABORN FORMATTING
sns.reset_orig()
```

When you run this code, it will return the plot formatting to the matplotlib defaults.

Ok, let’s do one more example.

Here, I want to show you how to put the pieces together.

We’re going to modify several parameters at once to create a histogram:

```python
# FINALIZED EXAMPLE
import seaborn as sns
sns.set()

plt.hist(norm_data, bins = 50, color = '#CC0000')
```

And here is the output:

What have we done here?

We used `plt.hist()` to plot a histogram of `norm_data`.

Using the `bins` parameter, we increased the number of bins to 50.

We used the `color` parameter to change the color of the bars to the hex color ‘`#CC0000`’, which is a shade of red.

Finally, we used the `sns.set()` function to change the plot defaults. This modified the background color and the gridlines.

Overall, I think this is a fairly professional looking chart, created with a small amount of code.

There’s definitely more that we could do to improve this chart (with titles, etc), but for a rough draft, it’s pretty good.
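As a sketch of what those finishing touches might look like (the title and axis labels here are my own additions, not part of the original example):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this runs in a plain script
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)
norm_data = np.random.normal(size = 1000, loc = 0, scale = 1)

# plt.hist() returns the bin counts, the bin edges, and the bar patches
counts, bin_edges, patches = plt.hist(norm_data, bins = 50, color = '#CC0000')
plt.title("Distribution of norm_data")
plt.xlabel("Value")
plt.ylabel("Count")
plt.savefig("histogram.png")

# Every one of the 1000 observations falls into one of the 50 bins.
print(int(counts.sum()))
```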

In this tutorial, we’re really just scratching the surface.

There’s a lot more that you can do with matplotlib, beyond just making a histogram.

To really get the most out of it, and to gain a solid understanding of data visualization in Python, you need to study matplotlib.

With that in mind, if you’re interested in learning (and mastering) data visualization and data science in Python, you should sign up for our email list right now.

Here at the Sharp Sight blog, we regularly post tutorials about a variety of data science topics … in particular, about matplotlib.

If you sign up for our email list, our Python data science tutorials will be delivered to your inbox.

You’ll get free tutorials on:

- Matplotlib
- NumPy
- Pandas
- Base Python
- Scikit learn
- Machine learning
- Deep learning
- … and more.

Want to learn data science in Python? Sign up now.


The post How to use numpy random normal in Python appeared first on Sharp Sight.

If you’re doing any sort of statistics or data science in Python, you’ll often need to work with random numbers. And in particular, you’ll often need to work with normally distributed numbers.

The NumPy random normal function generates a sample of numbers drawn from the normal distribution, otherwise called the Gaussian distribution.

This tutorial will show you how the function works, and will show you how to use the function.

If you’re a little unfamiliar with NumPy, I suggest that you read the whole tutorial. However, if you just need some help with something specific, you can skip ahead to the appropriate section of this tutorial.

If you’re a real beginner with NumPy, you might not entirely be familiar with it.

With that in mind, let’s briefly review what NumPy is.

NumPy is a module for the Python programming language that’s used for data science and scientific computing.

Specifically, NumPy performs data manipulation on numerical data. It enables you to collect numeric data into a data structure, called the NumPy array. It also enables you to perform various computations and manipulations on NumPy arrays.

Essentially, NumPy is a package for working with numeric data in Python.

For more details about NumPy, check out our high level tutorial on NumPy, as well as our tutorial about the NumPy array.

So NumPy is a package for working with numerical data. Where does np.random.normal fit in?

As I mentioned previously, NumPy has a variety of tools for working with numerical data. In most cases, NumPy’s tools enable you to do one of two things: *create* numerical data (structured as a NumPy array), or perform some calculation on a NumPy array.

The NumPy random normal function enables you to create a NumPy array that contains normally distributed data.

Hopefully you’re familiar with normally distributed data, but just as a refresher, here’s what it looks like when we plot it in a histogram:

Normally distributed data is shaped sort of like a bell, so it’s often called the “bell curve.”

Now that I’ve explained what the np.random.normal function does at a high level, let’s take a look at the syntax.

The syntax of the NumPy random normal function is fairly straightforward.

Note that in the following illustration and throughout this blog post, we will assume that you’ve imported NumPy with the following code: `import numpy as np`. That code will enable you to refer to NumPy as `np`.

Let’s take a quick look at the syntax.

Let me explain this. Typically, we will call the function with the name `np.random.normal()`. As I mentioned earlier, this assumes that we’ve imported NumPy with the code `import numpy as np`.

Inside of the function, you’ll notice 3 parameters: `loc`, `scale`, and `size`.

Let’s talk about each of those parameters.

The np.random.normal function has three primary parameters that control the output: `loc`, `scale`, and `size`.

I’ll explain each of those parameters separately.

The `loc` parameter controls the mean of the distribution.

This parameter defaults to `0`, so if you don’t use it to specify the mean of the distribution, the mean will be at 0.

The `scale` parameter controls the standard deviation of the normal distribution.

By default, the `scale` parameter is set to 1.

The `size` parameter controls the size and shape of the output.

Remember that the output will be a *NumPy array*. NumPy arrays can be 1-dimensional, 2-dimensional, or multi-dimensional (i.e., 2 or more).

This might be confusing if you’re not really familiar with NumPy arrays. To learn more about NumPy array structure, I recommend that you read our tutorial on NumPy arrays.

Having said that, here’s a quick explanation.

The argument that you provide to the `size` parameter will dictate the size and shape of the output array.

If you provide a single integer, `x`, np.random.normal will produce `x` random normal values in a 1-dimensional NumPy array.

You can also specify a more complex output.

For example, if you specify `size = (2, 3)`, np.random.normal will produce a NumPy array with 2 rows and 3 columns. It will be filled with numbers drawn from a random normal distribution.

Keep in mind that you can create output arrays with more than 2 dimensions, but in the interest of simplicity, I will leave that to another tutorial.
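Although higher-dimensional outputs are beyond the scope of this post, here’s a minimal sketch of how a 3-dimensional `size` behaves:

```python
import numpy as np

np.random.seed(42)

# A 3-tuple produces a 3-dimensional array:
# 2 "layers", each with 3 rows and 4 columns.
out = np.random.normal(size = (2, 3, 4))
print(out.shape)  # (2, 3, 4)
print(out.ndim)   # 3
```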

There’s another function that’s similar to np.random.normal. It’s called np.random.randn.

Just like np.random.normal, the np.random.randn function produces numbers that are drawn from a normal distribution.

The major difference is that np.random.randn is like a special case of np.random.normal: np.random.randn operates like np.random.normal with `loc = 0` and `scale = 1`.

So this code:

```python
np.random.seed(1)
np.random.normal(loc = 0, scale = 1, size = (3, 3))
```

Operates effectively the same as this code:

```python
np.random.seed(1)
np.random.randn(3, 3)
```
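You can verify this equivalence yourself: with the same seed, both calls draw from the same underlying stream of standard normal values, so the outputs match exactly:

```python
import numpy as np

np.random.seed(1)
a = np.random.normal(loc = 0, scale = 1, size = (3, 3))

np.random.seed(1)
b = np.random.randn(3, 3)

# Same seed, same draws: the two arrays are identical.
print(np.array_equal(a, b))  # True
```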

Now that I’ve shown you the syntax of the NumPy random normal function, let’s take a look at some examples of how it works.

Before you work with any of the following examples, make sure that you run the following code:

```python
import numpy as np
```

I briefly explained this code at the beginning of the tutorial, but it’s important for the following examples, so I’ll explain it again.

The code `import numpy as np` essentially imports the NumPy module into your working environment and enables you to call the functions from NumPy. If you don’t use the `import` statement to import NumPy, NumPy’s functions will be unavailable.

Moreover, by importing NumPy as `np`, we’re giving the NumPy module a “nickname” of sorts. So we’ll be able to refer to NumPy as `np` when we call the NumPy functions.

You probably understand this if you’ve worked with Python modules before, but if you’re really a beginner, it might be a little confusing. So, I wanted to quickly explain it.

Ok, now let’s work with some examples.

First, let’s take a look at a very simple example.

Here, we’re going to use np.random.normal to generate a single observation from the normal distribution.

```python
np.random.normal()
```

This code will generate a single number drawn from the normal distribution with a mean of 0 and a standard deviation of 1.

Essentially, this code works much like `np.random.normal(size = 1, loc = 0, scale = 1)` (although with `size = 1`, the output is a 1-element NumPy array rather than a single number). Remember, if we don’t specify values for the `loc` and `scale` parameters, they will default to `loc = 0` and `scale = 1`.

Now, let’s draw 5 numbers from the normal distribution.

This code will look almost exactly the same as the code in the previous example.

```python
np.random.normal(size = 5)
```

Here, the value `5` is being passed to the `size` parameter. It essentially indicates that we want to produce a NumPy array of 5 values, drawn from the normal distribution. Note that we pass it explicitly as `size = 5`; if we passed `5` as the first positional argument, NumPy would interpret it as the `loc` parameter (the mean) instead.

Note as well that because we have not explicitly specified values for `loc` and `scale`, they will default to `loc = 0` and `scale = 1`.

Now, we’ll create a 2-dimensional array of normally distributed values.

To do this, we need to provide a tuple of values to the `size` parameter.

```python
np.random.seed(1)
np.random.normal(size = (2, 3))
```

Which produces the output:

```
array([[ 1.62434536, -0.61175641, -0.52817175],
       [-1.07296862,  0.86540763, -2.3015387 ]])
```

So we’ve used the `size` parameter with `size = (2, 3)`. This has generated a 2-dimensional NumPy array with 6 values. This output array has *2 rows and 3 columns*.

To be clear, you can use the `size` parameter to create arrays with even higher dimensional shapes.

Now, let’s generate normally distributed values with a specific mean.

To do this, we’ll use the `loc` parameter. Recall from earlier in the tutorial that the `loc` parameter controls the mean of the distribution from which we draw the numbers with np.random.normal.

Here, we’re going to set the mean of the data to 50 with the syntax `loc = 50`.

```python
np.random.seed(42)
np.random.normal(size = 1000, loc = 50)
```

The full array of values is too large to show here, but here are the first several values of the output:

```
array([ 50.49671415,  49.8617357 ,  50.64768854,  51.52302986,  49.76584663,
        49.76586304,  51.57921282,  50.76743473,  49.53052561,  50.54256004,
        49.53658231,  49.53427025 ...
```

You can see at a glance that these values are roughly centered around 50. If you were to calculate the average using the NumPy mean function, you would see that the mean of the observations is in fact very close to 50.
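You can check that claim directly with the NumPy mean function (the sample mean will be close to, though not exactly, 50):

```python
import numpy as np

np.random.seed(42)
sample = np.random.normal(size = 1000, loc = 50)

# The sample mean hovers near the loc value of 50.
print(sample.mean())
```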

Next, we’ll generate an array of values with a specific standard deviation.

As noted earlier in the blog post, we can modify the standard deviation by using the `scale` parameter.

In this example, we’ll generate 1000 values with a standard deviation of 100.

```python
np.random.seed(42)
np.random.normal(size = 1000, scale = 100)
```

And here is a truncated output that shows the first few values:

```
array([  4.96714153e+01,  -1.38264301e+01,   6.47688538e+01,   1.52302986e+02,
        -2.34153375e+01,  -2.34136957e+01,   1.57921282e+02,   7.67434729e+01,
        -4.69474386e+01 ...
```

Notice that we set `size = 1000`, so the code will generate 1000 values. I’ve only shown the first few values for the sake of brevity.

It’s a little difficult to see how the data are distributed here, but we can use the `std()` method to calculate the standard deviation:

```python
np.random.seed(42)
np.random.normal(size = 1000, scale = 100).std()
```

Which produces the following:

```
99.695552529463015
```

If we round this up, it’s essentially 100.

Notice that in this example, we have not used the `loc` parameter. Remember that by default, the `loc` parameter is set to `loc = 0`, so by default, this data is centered around 0. We could modify the `loc` parameter here as well, but for the sake of simplicity, I’ve left it at the default.

Let’s do one more example to put all of the pieces together.

Here, we’ll create an array of values with a mean of 50 and a standard deviation of 100.

```python
np.random.seed(42)
np.random.normal(size = 1000, loc = 50, scale = 100)
```

I won’t show the output of this operation … I’ll leave it for you to run it yourself.

Let’s quickly discuss the code. If you’ve read the previous examples in this tutorial, you should understand this.

We’re defining the mean of the data with the `loc` parameter. The mean of the data is set to 50 with `loc = 50`.

We’re defining the standard deviation of the data with the `scale` parameter. We’ve done that with the code `scale = 100`.

The code `size = 1000` indicates that we’re creating a NumPy array with 1000 values.

That’s it. You can use the NumPy random normal function to create normally distributed data in Python.

If you really want to master data science and analytics in Python though, you really need to learn more about NumPy. Here, we’ve covered the np.random.normal function, but NumPy has a large range of other functions. The np.random.normal function is just one piece of a much larger toolkit for data manipulation in Python.

Having said that, if you want to be great at data science in Python, you’ll need to learn more about NumPy.

Check out our other NumPy tutorials on things like how to create a numpy array, how to reshape a numpy array, how to create an array with all zeros, and many more.

More broadly though, if you want to learn data science in Python, you should sign up for our email list.

Here at Sharp Sight, we regularly post tutorials about a variety of data science topics. In particular, we regularly publish tutorials about NumPy.

If you sign up for our email list, we will send our Python data science tutorials directly to your inbox.

You’ll get free tutorials on:

- NumPy
- Matplotlib
- Pandas
- Base Python
- Scikit learn
- Machine learning
- Deep learning
- … and more.

Want to learn data science in Python? Sign up now.


The post How to use the NumPy mean function appeared first on Sharp Sight.

This tutorial will teach you how the NumPy mean function works at a high level, and it will also show you some of the details.

So, you’ll learn about the syntax of np.mean, including how the parameters work.

This post will also show you clear and simple examples of how to use the NumPy mean function. Those examples will explain everything and walk you through the code.

Let’s get started by first talking about what the NumPy mean function does.

NumPy mean calculates the mean of the values within a NumPy array (or an array-like object).

Let’s take a look at a visual representation of this.

Imagine we have a NumPy array with six values:

We can use the NumPy mean function to compute the mean value:

It’s actually somewhat similar to some other NumPy functions like NumPy sum (which computes the sum on a NumPy array), NumPy median, and a few others. These are similar in that they compute summary statistics on NumPy arrays.

Further down in this tutorial, I’ll show you exactly how the numpy.mean function works by walking you through concrete examples with real code.

But before I do that, let’s take a look at the syntax of the NumPy mean function so you know how it works in general.

Syntactically, the numpy.mean function is fairly simple.

There’s the name of the function – np.mean() – and then several parameters inside of the function that enable you to control it.

In the image above, I’ve only shown 3 parameters – `a`, `axis`, and `dtype`.

There are actually a few other parameters that you can use to control the np.mean function.

Let’s look at all of the parameters now to better understand how they work and what they do.

The np.mean function has five parameters:

- `a`
- `axis`
- `dtype`
- `out`
- `keepdims`

Let’s quickly discuss each parameter and what it does.

**a** (required)

The `a` parameter enables you to specify the exact NumPy array that you want numpy.mean to operate on. This parameter is *required*. You need to give NumPy mean something to operate on.

Having said that, it’s actually a bit flexible. You can give it any *array like object*. That means that you can pass the np.mean() function a proper NumPy array. But you can also give it things that are structurally similar to arrays like Python lists, tuples, and other objects.

**axis** (optional)

Technically, the `axis` parameter specifies the dimension along which you perform the calculation. On the other hand, saying it that way confuses many beginners. So another way to think of this is that the `axis` parameter enables you to calculate the mean of the rows or columns.

The reason for this is that NumPy arrays have *axes*. What is an axis? An “axis” is like a dimension along a NumPy array.

Think of axes like the directions in a Cartesian coordinate system. In Cartesian coordinates, you can move in different directions. We typically call those directions “x” and “y.”

Similarly, you can move along a NumPy array in different directions. You can move down the rows and across the columns. In NumPy, we call these “directions” *axes*.

Specifically, in a 2-dimensional array, “axis 0” is the direction that points vertically down the rows and “axis 1” is the direction that points horizontally across the columns.

So how does this relate to NumPy mean?

When you have a multi-dimensional NumPy array object, it’s possible to compute the mean of a set of values down along the rows or across the columns. In these cases, NumPy produces a new array object that holds the computed means for the rows or the columns respectively.

This probably sounds a little abstract and confusing, so I’ll show you solid examples of how to do this later in this blog post.

Additionally, if you’re still a little confused about them, you should read our tutorial that explains how to think about NumPy axes.
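Here’s a minimal sketch of the `axis` parameter in action (the array here is just an illustration):

```python
import numpy as np

arr_2d = np.array([[1, 2, 3],
                   [4, 5, 6]])

# axis 0 points down the rows: we get one mean per column.
print(np.mean(arr_2d, axis = 0))  # [2.5 3.5 4.5]

# axis 1 points across the columns: we get one mean per row.
print(np.mean(arr_2d, axis = 1))  # [2. 5.]
```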

**dtype** (optional)

The `dtype` parameter enables you to specify the exact data type that will be used when computing the mean.

By default, if the values in the input array are integers, NumPy will actually treat them as floating point numbers (`float64` to be exact). And if the numbers in the input are floats, it will keep them as the same kind of float; so if the inputs are `float32`, the output of np.mean will be `float32`. If the inputs are `float64`, the output will be `float64`.

Keep in mind that the data type can really matter when you’re calculating the mean; for floating point numbers, the output will have the same precision as the input. If the input is a data type with relatively lower precision (like `float16` or `float32`), the output may be inaccurate due to the lower precision. To fix this, you can use the `dtype` parameter to specify that the output should be a higher precision float. (See the examples below.)
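Here’s a small sketch of how `dtype` affects the output type (demonstrating an actual precision failure takes a large array, so this just shows the type behavior):

```python
import numpy as np

ints = np.array([1, 2, 3, 4])
floats_32 = np.array([1, 2, 3, 4], dtype = np.float32)

# Integer input: the mean is computed and returned as float64.
print(np.mean(ints).dtype)       # float64

# float32 input: the output stays float32 ...
print(np.mean(floats_32).dtype)  # float32

# ... unless we explicitly request a higher-precision dtype.
print(np.mean(floats_32, dtype = np.float64).dtype)  # float64
```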

**out** (optional)

The `out` parameter enables you to specify a NumPy array that will accept the output of the function. If you use this parameter, the output array that you specify needs to have the same shape as the output that the mean function computes.
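Here’s a minimal sketch of the `out` parameter (the variable names are my own):

```python
import numpy as np

a = np.array([[1.0, 2.0],
              [3.0, 4.0]])

# Pre-allocate an array with the same shape as the result:
# one mean per column, so shape (2,).
result = np.empty(2)

np.mean(a, axis = 0, out = result)
print(result)  # [2. 3.]
```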

**keepdims** (optional)

The `keepdims` parameter enables you to keep the dimensions of the output the same as the dimensions of the input.

This confuses many people, so let me explain.

The NumPy mean function *summarizes* data. It takes a large number of values and summarizes them. So if you want to compute the mean of 5 numbers, the NumPy mean function will summarize those 5 values into a single value, the mean.

When it does this, it is effectively *reducing the dimensions*. If we summarize a 1-dimensional array down to a single scalar value, the dimensions of the output (a scalar) are lower than the dimensions of the input (a 1-dimensional array). The same thing happens if we use the np.mean function on a 2-d array to calculate the mean of the rows or the mean of the columns. When we compute those means, the output will have a reduced number of dimensions.

Sometimes, we don’t want that. There will be times where we want the output to have the exact same number of dimensions as the input. For example, a 2-d array goes in, and a 2-d array comes out.

To make this happen, we need to use the `keepdims` parameter.

By setting `keepdims = True`, we will cause the NumPy mean function to produce an output that keeps the dimensions of the output the same as the dimensions of the input.

This confuses many people, so there will be a concrete example below that will show you how this works.

Note that by default, `keepdims`

is set to `keepdims = False`

. So the natural behavior of the function is to reduce the number of dimensions when computing means on a NumPy array.

Now that we’ve taken a look at the syntax and the parameters of the NumPy mean function, let’s look at some examples of how to use the NumPy mean function to calculate averages.

Before I show you these examples, I want to make note of an important learning principle. When you’re trying to learn and master data science code, you should study and practice simple examples. Simple examples are examples that can help you intuitively understand how the syntax works. Simple examples are also things that you can *practice and memorize*. Mastering syntax (like mastering any skill) requires study, practice, and repetition.

And by the way, before you run these examples, you need to make sure that you’ve imported NumPy properly into your Python environment. To do that, you’ll need to run the following code:

import numpy as np

Ok, now let’s move on to the code.

Here, we’ll start with something very simple. We’re going to calculate the mean of the values in a single 1-dimensional array.

To do this, we’ll first create an array of six values by using the np.array function.

np_array_1d = np.array([0,20,40,60,80,100])

Let’s quickly examine the contents of the array by using the `print()`

function.

print(np_array_1d)

Which produces the following output:

[0 20 40 60 80 100]

As you can see, the new array, `np_array_1d`

, contains six values between 0 and 100.

Now, let’s calculate the mean of the data. Here, we’re just going to call the np.mean function. The only argument to the function will be the name of the array, `np_array_1d`

.

np.mean(np_array_1d)

This code will produce the mean of the values:

50.0

Visually though, we can think of this as follows.

The NumPy mean function is taking the values in the NumPy array and computing the average.

Keep in mind that the array itself is a 1-dimensional structure, but the result is a single scalar value. In a sense, the mean() function has *reduced* the number of dimensions. The output has a lower number of dimensions than the input. This will be important to understand when we start using the `keepdims`

parameter later in this tutorial.

Next, let’s compute the mean of the values in a 2-dimensional NumPy array.

To do this, we first need to create a 2-d array. We can do that by using the np.arange function. We’ll also use the reshape method to reshape the array into a 2-dimensional array object.

np_array_2x3 = np.arange(start = 0, stop = 21, step = 4).reshape((2,3))

Let’s quickly look at the contents of the array by using the code `print(np_array_2x3)`

:

[[ 0  4  8]
 [12 16 20]]

As you can see, this is a 2-dimensional object with six values: 0, 4, 8, 12, 16, 20. By using the reshape() function, these values have been re-arranged into an array with 2 rows and 3 columns.

Now, let’s compute the mean of these values.

To do this, we’ll use the NumPy mean function just like we did in the prior example. We’ll call the function and the argument to the function will simply be the name of this 2-d array.

np.mean(np_array_2x3)

Which produces the following result:

10.0

Here, we’re working with a 2-dimensional array, but the mean() function has still produced a single value.

When you use the NumPy mean function on a 2-d array (or an array of higher dimensions) the default behavior is to compute the mean of all of the values.

Having said that, you can also use the NumPy mean function to compute the mean value in every row or the mean value in every column of a NumPy array.

Let’s take a look at how to do that.

Here, we’ll look at how to calculate the column mean.

To understand how to do this, you need to know how axes work in NumPy.

Recall earlier in this tutorial, I explained that NumPy arrays have what we call *axes*. Again, axes are like directions along the array.

Axis 0 refers to the row direction. Axis 1 refers to the column direction.

You really need to know this in order to use the `axis`

parameter of NumPy mean. There’s not really a great way to learn this, so I recommend that you just memorize it … the row-direction is axis 0 and the column direction is axis 1.

Having explained axes again, let’s take a look at how we can use this information in conjunction with the `axis`

parameter.

Using the `axis`

parameter is confusing to many people, because the way that it is used is a little counter intuitive. With that in mind, let me explain this in a way that might improve your intuition.

When we use the axis parameter, we are specifying which axis we want to summarize. Said differently, we are specifying which axis we want to collapse.

So when we specify `axis = 0`

, that means that we want to *collapse* axis 0. Remember, axis 0 is the row axis, so this means that we want to *collapse* or summarize the rows, but keep the columns intact.

Let me show you an example to help this make sense.

Let’s first create a 2-dimensional NumPy array. (Note: we used this code earlier in the tutorial, so if you’ve already run it, you don’t need to run it again.)

np_array_2x3 = np.arange(start = 0, stop = 21, step = 4).reshape((2,3))

Ok. Let’s quickly examine the contents by using the code `print(np_array_2x3)`

:

[[ 0  4  8]
 [12 16 20]]

As you can see, this is a 2-dimensional array with 2 rows and 3 columns.

Now that we have our NumPy array, let’s calculate the mean and set `axis = 0`

.

np.mean(np_array_2x3, axis = 0)

Which produces the following output:

array([ 6., 10., 14.])

What happened here?

Essentially, the np.mean function has produced *a new array*. But notice what happened here. Instead of calculating the mean of all of the values, it created a summary (the mean) along the “axis-0 direction.” Said differently, it collapsed the data along the axis-0 direction, computing the mean of the values along that direction.

Why?

Remember, axis 0 is the row axis. So when we set `axis = 0`

inside of the np.mean function, we’re basically indicating that we want NumPy to calculate the mean *down the row-direction* … collapsing the rows and producing one mean per column.

This is a little confusing to beginners, so I think it’s important to think of this in terms of directions. Along which direction should the mean function operate? When we set `axis = 0`

, we’re indicating that the mean function should move along the 0th axis … the direction of axis 0.

If that doesn’t make sense, look again at the picture immediately above and pay attention to the direction along which the mean is being calculated.

Similarly, we can compute row means of a NumPy array.

In this example, we’re going to use the NumPy array that we created earlier with the following code:

np_array_2x3 = np.arange(start = 0, stop = 21, step = 4).reshape((2,3))

This code creates the following array:

[[ 0  4  8]
 [12 16 20]]

It is a 2-dimensional array. As you can see, it has 3 columns and 2 rows.

Now, we’re going to calculate the mean while setting `axis = 1`

.

np.mean(np_array_2x3, axis = 1)

Which gives us the output:

array([ 4., 16.])

So let’s talk about what happened here.

First remember that axis 1 is the column direction; the direction that sweeps across the columns.

When we set `axis = 1`

inside of the NumPy mean function, we’re telling np.mean that we want to calculate the mean such that we summarize the data in that direction.

Again, said differently, we are collapsing the axis-1 direction and computing our summary statistic in that direction (i.e., the mean).

Do you see now?

Axis 1 is the column direction; the direction that sweeps across the columns.

When we set `axis = 1`

, we are indicating that we want NumPy to operate along this direction. It will therefore compute the mean of the values along that direction (axis 1), and produce an array that contains those mean values: `[4., 16.]`

.

Ok. Now that you’ve learned about how to use the `axis`

parameter, let’s talk about how to use the ** keepdims** parameter.

The `keepdims`

parameter of NumPy mean enables you to control the dimensions of the output. Specifically, it enables you to make the dimensions of the output exactly the same as the dimensions of the input array.

To understand this, let’s first take a look at a few of our prior examples.

Earlier in this blog post, we calculated the mean of a 1-dimensional array with the code `np.mean(np_array_1d)`

, which produced the mean value, `50`

.

There’s something subtle here though that you might have missed. The dimensions of the output are *not* the same as the input.

To see this, let’s take a look first at the dimensions of the input array. We can do this by examining the `ndim`

attribute, which tells us the number of dimensions:

np_array_1d.ndim

When you run this code, it will produce the following output: `1`

. The array `np_array_1d`

is a 1-dimensional array.

Now let’s take a look at the number of dimensions of the output of np.mean() when we use it on `np_array_1d`

.

Again, we can do this by using the `ndim` attribute:

np.mean(np_array_1d).ndim

Which produces the following output: `0`

.

So the input (`np_array_1d`

) has 1 dimension, but the output of np.mean has 0 dimensions … the output is a scalar. In some sense, the output of np.mean has a reduced number of dimensions compared to the input.

This is relevant to the `keepdims`

parameter, so bear with me as we take a look at another example.

Let’s look at the dimensions of the 2-d array that we used earlier in this blog post:

np_array_2x3.ndim

When you run this code, the output will tell you that `np_array_2x3`

is a 2-dimensional array.

What about the output of np.mean?

If we don’t specify an axis, the output of np.mean() on this array will have 0 dimensions. You can check it with this code:

np.mean(np_array_2x3).ndim

Which produces the following output: `0`

. When we use np.mean on a 2-d array, it calculates the mean. The mean value is a scalar, which has 0 dimensions. In this case, the output of np.mean has a different number of dimensions than the input.

What if we set an axis? Remember, if we use np.mean and set `axis = 0`

, it will produce an array of means. Run this code:

np.mean(np_array_2x3, axis = 0)

Which produces the output `array([ 6., 10., 14.])`

.

And how many dimensions does this output have? We can check by using the `ndim`

attribute:

np.mean(np_array_2x3, axis = 0).ndim

Which tells us that the output of np.mean in this case, when we set axis to 0, is a 1-dimensional object.

The input had 2 dimensions and the output has 1 dimension.

Again, *the output has a different number of dimensions than the input*.

Ok, now that we’ve looked at some examples showing number of dimensions of inputs vs. outputs, we’re ready to talk about the `keepdims`

parameter.

The `keepdims`

parameter enables you to set the dimensions of the output to be the same as the dimensions of the input.

`keepdims`

takes a logical argument … meaning that you can set it to `True`

or `False`

.

By default, the parameter is set as `keepdims = False`

. This means that the mean() function will *not* keep the dimensions the same. By default, the dimensions of the output will *not* be the same as the dimensions of the input. And that’s exactly what we just saw in the last few examples in this section!

On the other hand, if we set `keepdims = True`

, this will cause the number of dimensions of the output to be * exactly the same* as the dimensions of the input.

Let’s take a look.

Once again, we’re going to operate on our NumPy array `np_array_2x3`

. Remember, this is a 2-dimensional object, which we saw by examining the `ndim`

attribute.

Now, let’s once again examine the dimensions of the np.mean function when we calculate with `axis = 0`

.

np.mean(np_array_2x3, axis = 0).ndim

This code indicates that the output of np.mean in this case has 1 dimension. Why? Because we didn’t specify anything for `keepdims`

so it defaulted to `keepdims = False`

. This code does *not* keep the dimensions of the output the same as the dimensions of the input.

Now, let’s explicitly use the `keepdims`

parameter and set `keepdims = True`

.

np.mean(np_array_2x3, axis = 0, keepdims = True).ndim

Which produces the following output:

2

When we use np.mean on a 2-d array and set `keepdims = True`

, *the output will also be a 2-d array*.

When we set `keepdims = True`

, the dimensions of the output will be __the same__ as the dimensions of the input.

I’m not going to explain when and why you might need to do this ….

Just understand that when you need the dimensions of the output to be the same as the dimensions of the input, you can force this behavior by setting `keepdims = True`

.

Ok, one last example.

Let’s look at how to specify the output datatype by using the `dtype`

parameter.

As I mentioned earlier, if the values in your input array are *integers* the output will be of the `float64`

data type. If the values in the input array are floats, then the output will be the same type of float. So if the inputs are `float32`

, the outputs will be `float32`

, etc.

But what if you want to specify another data type for the output?

You can do this with the `dtype`

parameter.

Let’s take a look at a simple example.

Here, we’ll create a simple 1-dimensional NumPy array of integers by using the NumPy array function.

np_array_1d_int = np.array([1,3,4,7,11])

And we can check the data type of the values in this array by using the dtype attribute:

np_array_1d_int.dtype

When you run that code, you’ll find that the values are being stored as integers; `int64`

to be precise.

Now let’s use numpy mean to calculate the mean of the numbers:

mean_output = np.mean(np_array_1d_int)

Now, we can check the data type of the output, `mean_output`

.

mean_output.dtype

Which tells us that the datatype is `float64`

.

This is exactly the behavior we should expect. As I mentioned earlier, by default, NumPy produces output with the `float64`

data type.

So now that we’ve looked at the default behavior, let’s change it by explicitly setting the `dtype`

parameter.

mean_output_alternate = np.mean(np_array_1d_int, dtype = 'float32')

The object `mean_output_alternate`

contains the calculated mean, which is `5.1999998`

.

Now, let’s check the datatype of `mean_output_alternate`

.

mean_output_alternate.dtype

When you run this, you can see that `mean_output_alternate`

contains values of the `float32`

data type. This is exactly what we’d expect, because we set `dtype = 'float32'`

.

As I mentioned earlier, you need to be careful when you use the `dtype`

parameter.

If you need the output of np.mean to have high precision, you need to be sure to select a data type with high precision. For example, if you need the result to have high precision, you might select `float64`

.

If you select a data type with low precision (like `int`

), the result may be inaccurate or imprecise.
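For example (a quick sketch), computing a mean with an integer dtype throws away the fractional part of the result:

```python
import numpy as np

arr = np.array([1, 3, 4, 7, 11])

# The true mean is 5.2, but an integer dtype truncates the result.
int_mean = np.mean(arr, dtype=int)

print(int_mean)   # 5
```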

You’ve probably heard that 80% of data science work is just data manipulation. That’s mostly true.

If you want to be great at data science in Python, you need to know how to manipulate data in Python.

And one of the primary toolkits for manipulating data in Python is the NumPy module.

In this post, I’ve shown you how to use the NumPy mean function, but we also have several other tutorials about other NumPy topics, like how to create a numpy array, how to reshape a numpy array, how to create an array with all zeros, and many more.

If you’re interested in learning NumPy, definitely check those out.

More broadly though, if you’re interested in learning (and mastering) data science in Python, or data science generally, you should sign up for our email list right now.

Here at the Sharp Sight blog, we regularly post tutorials about a variety of data science topics … in particular, about NumPy.

If you want to learn NumPy and data science in Python, sign up for our email list.

If you sign up for our email list, you’ll receive Python data science tutorials delivered to your inbox.

You’ll get free tutorials on:

- NumPy
- Pandas
- Base Python
- Scikit learn
- Machine learning
- Deep learning
- … and more.

Want to learn data science in Python? Sign up now.

The post How to use the NumPy mean function appeared first on Sharp Sight.

The post How to use the NumPy concatenate function appeared first on Sharp Sight.

This post will cover several topics. If you don’t want to read the full tutorial, click on the appropriate link and it will send you to the relevant section of this tutorial.

This post will cover:

- What the NumPy concatenate function does
- The syntax of NumPy concatenate
- Examples of how to use NumPy concatenate

First, I’ll start by explaining what the concatenate function does.

So what is the concatenate function?

The NumPy concatenate function is a function from the NumPy package. NumPy (if you’re not familiar) is a data manipulation package in the Python programming language. We use NumPy to “wrangle” numeric data in Python.

NumPy concatenate essentially combines together multiple NumPy arrays.

There are a couple of things to keep in mind.

First, NumPy concatenate isn’t exactly like a traditional database join. It’s more like stacking NumPy arrays.

Second, the concatenate function can operate both vertically and horizontally. You can concatenate arrays together vertically (like in the image above), or you can concatenate arrays together horizontally.

Later in the examples section, I’ll show you how to use concatenate both ways.

Before we discuss concrete examples though, let’s quickly look at the syntax of the np.concatenate function.

The syntax of NumPy concatenate is fairly straightforward, particularly if you’re familiar with other NumPy functions.

Syntactically, there are a few main parts of the function: the name of the function, and several parameters inside of the function that we can manipulate.

In Python code, the concatenate function is typically written as `np.concatenate()`

, although you might also see it written as `numpy.concatenate()`

. Either case assumes that you’ve imported the NumPy package with the code `import numpy as np`

or `import numpy`

, respectively.

Moving forward, this tutorial will assume that you’ve imported NumPy by executing the code `import numpy as np`

.

There are a few parameters and arguments of the np.concatenate function:

- a sequence of input arrays (the arrays that you will concatenate together)
- the
`axis`

parameter

Let’s take a look at each of these separately.

When you use the np.concatenate function, you need to provide at least two input arrays.

There are a few important points that you should know about the input arrays for np.concatenate.

Notice that the arrays – `arr1`

and `arr2`

in the above example – are enclosed inside of parentheses. Because they are enclosed in parentheses, they are essentially being passed to the concatenate function as a Python *tuple*. Alternatively, you could enclose them inside of brackets (i.e., `[arr1, arr2]`

), which would pass them to concatenate as a Python *list*.

Either method is acceptable: you can provide the input arrays in a list or a tuple. What’s important to understand is that you need to provide the input arrays to the concatenate function within some type of Python *sequence*. Tuples and lists are both types of Python sequences.

If you’re a little confused about this, I suggest that you review Python sequences.
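As a quick illustration (with hypothetical arrays), both a tuple and a list work as the input sequence:

```python
import numpy as np

arr1 = np.array([[1, 1], [1, 1]])
arr2 = np.array([[9, 9], [9, 9]])

# Passing the input arrays as a tuple ...
result_tuple = np.concatenate((arr1, arr2))

# ... or as a list produces the same result.
result_list = np.concatenate([arr1, arr2])

print(np.array_equal(result_tuple, result_list))   # True
```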

Another point that I’ll make is that the input arrays should probably contain data of the same data type.

Keep in mind that the data types *should* generally be the same, but they don’t have to be.

The issue here is that, if the input arrays that you give to NumPy concatenate have *different* datatypes, then the function will try to re-cast the data of one array to the data type of the other.

For example, let’s say that you create two NumPy arrays and pass them to np.concatenate. One NumPy array contains integers, and one array contains floats.

integer_data = np.array([[1,1,1],[1,1,1]], dtype = 'int')
float_data = np.array([[9,9,9],[9,9,9]], dtype = 'float')

np.concatenate([integer_data, float_data])

When you run this, you can see that all of the numbers in the output array are *floats*.

array([[ 1.,  1.,  1.],
       [ 1.,  1.,  1.],
       [ 9.,  9.,  9.],
       [ 9.,  9.,  9.]])

Why? Some of the inputs were integers, right?

A NumPy array must contain numbers that all have the same data type. If the inputs to np.concatenate have *different* data types, it will re-cast some of the numbers so that all of the data in the output have the *same* type. (It appears that NumPy re-casts the lower precision inputs to the data type of the higher precision inputs. So it is re-casting the integers into floats.)

Ultimately, you need to be careful when working with NumPy arrays that have different data types. The behavior of NumPy concatenate in those cases may have unintended consequences.

In the examples I’ll show later in this tutorial, we’ll mostly work with *two* arrays. We’ll concatenate together only two.

Keep in mind, however, that it’s possible to concatenate together a large sequence of NumPy arrays. More than two. You can do three, or four, or more.

Having said that, if you’re just getting started with NumPy, I recommend that you learn and practice the syntax with very simple examples. Stick with two arrays in the beginning.
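For reference, concatenating more than two arrays just means putting more arrays in the input sequence (a sketch with made-up 1-d arrays):

```python
import numpy as np

a = np.array([1, 2])
b = np.array([3, 4])
c = np.array([5, 6])

# The input sequence can contain any number of arrays.
combined = np.concatenate([a, b, c])

print(combined)   # [1 2 3 4 5 6]
```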

Now that we’ve talked about the input arrays, let’s talk about how the `np.concatenate()`

function puts them together.

As I mentioned earlier in this tutorial, the concatenate function can join together arrays vertically *or* horizontally.

The behavior of np.concatenate – whether it concatenates the numpy arrays vertically or horizontally – depends on the axis parameter.

I have to be honest. One of the hardest things for beginners to understand in NumPy are array axes.

For a variety of reasons, array axes are just hard to understand. The naming conventions (axis 0, axis 1, etc) are a little abstract. And the documentation about axes is not always 100% clear. Ultimately, these factors make array axes a little un-intuitive.

Be that as it may, to understand how to use NumPy concatenate with the axis parameter, you need to understand how NumPy array axes work.

With that in mind, let’s try to shed a little light on array axes.

First, let’s start with the basics. NumPy arrays have what we call *axes*.

The term “axis” seems to confuse people in the context of NumPy arrays, so let’s take a look at a more familiar example. Take a look at a Cartesian coordinate system.

A Cartesian coordinate system has *axes*. Specifically, we typically refer to the horizontal axis as the `x axis`

, and the vertical axis as the `y axis`

. Almost everyone should be familiar with this.

In Cartesian space, these axes are just directions. Moreover, an observation at a point in a Cartesian space can be defined by its value along each axis. So for example, we can identify a point in a Cartesian space by specifying how many units to travel along the x axis, and how many units to travel along the y axis.

Axes in a NumPy array are very similar. Axes in a NumPy array are just directions: axis 0 is the direction running vertically down the rows and axis 1 is the direction running horizontally across the columns.

Remember also that in Python, things are indexed starting with “0” (e.g., the first element in a list is actually at index 0). Similarly, the “first” axis in a NumPy array is “axis 0.”

Ultimately though, when we say “axis 0” we’re talking about the direction that points down the rows, and when we say “axis 1” we’re talking about the direction that points across the columns.

And just like in a Cartesian coordinate system, we can use this system of axes to identify particular cells in the dataset. We can identify a particular location in a NumPy array by specifying how many units on the 0-axis and how many units on the 1-axis. It’s very similar to how we identify particular points at locations in an x/y coordinate space.
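For instance (a small sketch), indexing a 2-d array uses one position along axis 0 and one along axis 1:

```python
import numpy as np

arr = np.array([[ 0,  4,  8],
                [12, 16, 20]])

# One index along axis 0 (the row direction) and one along axis 1
# (the column direction) identify a single cell.
print(arr[1, 2])   # 20
```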

Now that we’ve talked about axes in general, let’s talk about how they operate with respect to the concatenate function.

Remember what I mentioned earlier in this tutorial: we can concatenate NumPy arrays *horizontally* or we can concatenate NumPy arrays *vertically*.

Which one we do is specified by the `axis` parameter.

If we set `axis = 0`

, the concatenate function will concatenate the NumPy arrays vertically.

(By the way, this is the default behavior. If you don’t specify the axis, the default behavior will be `axis = 0`

.)

On the other hand, if we manually set `axis = 1`

, the concatenate function will concatenate the NumPy arrays horizontally.

A lot of people still find this to be un-intuitive, so I’ll quickly explain it another way.

The best way to think of NumPy concatenate is to think of it like stacking arrays, either vertically or horizontally.

The axis that we specify with the `axis`

parameter is the axis along which we stack the arrays.

So when we set `axis = 0`

, we are stacking along axis 0. Axis 0 is the axis that runs *vertically* down the rows, so this amounts to stacking the arrays vertically.

Similarly, when we set `axis = 1`

, we’re stacking along axis 1. Axis 1 is the axis that runs *horizontally* across the columns, so this amounts to stacking the arrays horizontally.

If this still seems a little confusing, that’s OK.

To help clear things up, we’re going to move on to some concrete examples that you can run yourself. Understanding how np.concatenate works will be easier when you have some real examples that you can play with.

Ok, let’s work with some real examples.

Before you get started with these examples, you’ll need to import the NumPy package into your development environment.

You can do that with the import statement as follows:

import numpy as np

This will enable you to refer to NumPy as `np`

when you call the concatenate function.

First, let’s just concatenate together two simple NumPy arrays.

To do this, we’ll first create two NumPy arrays with the np.array function.

np_array_1s = np.array([[1,1,1],[1,1,1]])
np_array_9s = np.array([[9,9,9],[9,9,9]])

Now, let’s print them out:

print(np_array_1s)

Which yields:

[[1 1 1]
 [1 1 1]]

… and

print(np_array_9s)

Which yields:

[[9 9 9]
 [9 9 9]]

Basically, we have two simple NumPy arrays, each with 2 rows and 3 columns.

Now, let’s combine them together using NumPy concatenate.

np.concatenate([np_array_1s, np_array_9s])

When you run this, it produces the following output:

array([[1, 1, 1],
       [1, 1, 1],
       [9, 9, 9],
       [9, 9, 9]])

Notice what’s happened here. The concatenate function has combined the two arrays together *vertically*. Essentially, the concatenate function has combined them together and has *defaulted* to `axis = 0`

.

Next, we’re going to concatenate the arrays together vertically again, but this time we’re going to do it explicitly with the `axis`

parameter.

In this example, we’re going to reuse the two arrays that we created earlier: `np_array_1s`

and `np_array_9s`

.

To explicitly concatenate them together vertically, we need to set `axis = 0`

.

np.concatenate([np_array_1s, np_array_9s], axis = 0)

Which produces the following output:

array([[1, 1, 1],
       [1, 1, 1],
       [9, 9, 9],
       [9, 9, 9]])

Notice that this is the same as if we had used concatenate *without* specifying the `axis`

. By default, the np.concatenate function sets `axis = 0`

.

Finally, let’s concatenate the two arrays horizontally.

To do this, we need to set `axis = 1`

.

np.concatenate([np_array_1s, np_array_9s], axis = 1)

Which produces the following output:

array([[1, 1, 1, 9, 9, 9],
       [1, 1, 1, 9, 9, 9]])

Remember that axis 1 is the axis that runs horizontally across the columns. So when we set `axis = 1`

, the concatenate function is essentially combining the two arrays in that direction … horizontally.

Before ending this NumPy concatenate tutorial, I want to give you a quick warning about working with 1 dimensional NumPy arrays.

If you want to concatenate together two 1-dimensional NumPy arrays, things won’t work exactly the way you expect.

Let’s say we have two 1-dimensional arrays:

np_array_1s_1dim = np.array([1,1,1])
np_array_9s_1dim = np.array([9,9,9])

And let’s concatenate them together using `axis = 0`

:

np.concatenate([np_array_1s_1dim, np_array_9s_1dim], axis = 0)

Here’s the output:

array([1, 1, 1, 9, 9, 9])

Why are they being concatenated together horizontally? If we set `axis = 0`

, shouldn’t this concatenate them together vertically?

No, not in this case.

This is a little subtle, and it all comes down to axes.

Think about what we have here. Both of the input arrays are *one dimensional*.

Because they are one dimensional, *there is only one axis*. Axis 0 is the *only* axis they have!

Moreover, in the case of a 1-d array, axis 0 actually points along the observations. It points in the direction of the index.

So when we use np.concatenate in this case, it is still concatenating them along axis 0. The issue is that because they are 1-d arrays, axis 0 points horizontally along the observations.

In any event, concatenate function works “fine” in this case, but you need to really understand NumPy axes to understand its behavior.

A related issue is when you try to concatenate together two 1-dimensional NumPy arrays with `axis = 1`

.

If you try to concatenate together two 1-d NumPy arrays using `axis = 1`

, you will get an error.

For example, take a look at the following code:

np.concatenate([np_array_1s_1dim, np_array_9s_1dim], axis = 1)

When you run this, you’ll get an error:

IndexError: axis 1 out of bounds [0, 1)

What’s going on here?

Again, this is a bit subtle, but it makes sense if you think about it.

The input arrays that we’ve used here are *one dimensional*.

When we use the syntax `axis = 1`

, we’re asking the concatenate function to concatenate the arrays along the *second* axis. Remember that in NumPy, the first axis is “axis 0” and the second axis is “axis 1.” The axes are numbered starting from 0 (just like Python indexes).

Here’s the problem though: in a 1-dimensional NumPy array, *there is no second axis*. In a 1-d array, the only axis is axis 0. There is no second axis (“axis 1”) along which we can concatenate the arrays.

Once again, this is subtle, but it makes sense when you understand how NumPy axes work.

Just be careful, and make sure you think through the structure of your arrays before you use NumPy concatenate.
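If you actually want two 1-d arrays stacked vertically, one workaround (a sketch, not the only approach) is to give each array a second axis first, or use `np.vstack`:

```python
import numpy as np

a = np.array([1, 1, 1])
b = np.array([9, 9, 9])

# Option 1: np.vstack treats each 1-d input as a row and stacks them.
stacked = np.vstack([a, b])

# Option 2: reshape each 1-d array into a 1x3 row, then concatenate on axis 0.
stacked_2 = np.concatenate([a.reshape(1, 3), b.reshape(1, 3)], axis=0)

print(stacked)
# [[1 1 1]
#  [9 9 9]]
```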

NumPy concatenate is only one data manipulation tool in Python’s NumPy package.

If you want to be great at data science in Python, you’ll need to learn more about NumPy.

Having said that, check out our other NumPy tutorials on things like how to create a numpy array, how to reshape a numpy array, how to create an array with all zeros, and many more.

More broadly though, if you’re interested in learning and mastering data science in Python, you should sign up for our email list right now.


The post How to use the NumPy concatenate function appeared first on Sharp Sight.

The post How to use the Numpy ones function appeared first on Sharp Sight.

The NumPy ones function creates NumPy arrays filled with `1`

‘s. That’s it. It’s pretty straightforward.

The np.ones function is also fairly easy to use. But, there are a few details of the function that you might not know about, such as parameters that help you precisely control how it works.

Having said that, this tutorial will give you a full explanation of how the np.ones function works.

I’ll explain how the syntax works at a very high level. I’ll also explain the parameters of the ones() function.

Moreover, I’ll show you several step-by-step examples of how to use the np.ones function. These examples will show you how to use the function parameters to modify the function’s behavior.

Ok … first things first. Let’s look at the syntax of the np.ones function.

The syntax of the NumPy ones function is very straightforward. Syntactically, it’s very similar to a few other NumPy functions like the NumPy zeros function.

Typically, we’ll call the function with the name `np.ones()`. Keep in mind that this assumes that you’ve imported NumPy into your environment using the code `import numpy as np`.

Inside of the function, there are several parameters. The two most important parameters are the `shape` parameter and the `dtype` parameter.

Let’s quickly talk about the parameters of the np.ones function.

As you saw in the previous section, there are several parameters that enable you to control the behavior of the NumPy ones function. The most important is the `shape` parameter, but there are also the `dtype` parameter and the `order` parameter.

Let’s look at those one at a time.

The `shape` parameter enables you to specify the exact shape of the output of the function.

Remember … the np.ones function creates arrays that contain all `1`’s. Therefore, you don’t need to tell it the values to put in the array; they are always `1`’s. The critical thing that you need to specify, then, is the exact *shape* of the output array.

Therefore, the `shape` parameter is required.

This parameter will accept an integer argument, and it will also accept a sequence of integers. Often, for multi-dimensional array outputs, you’ll see a *tuple* of ints. But because the `shape` parameter will accept any sequence of integers, it’s also possible to use a list or other numeric sequence to define the shape.
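As a quick preview of that (a minimal sketch, assuming you’ve run `import numpy as np`), an integer, a tuple, and a list are all accepted:

```python
import numpy as np

# An integer produces a 1-dimensional array ...
print(np.ones(shape=3).shape)          # (3,)

# ... while a tuple or list of integers produces a multi-dimensional array.
print(np.ones(shape=(2, 4)).shape)     # (2, 4)
print(np.ones(shape=[2, 2, 2]).shape)  # (2, 2, 2)
```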

I’ll show you concrete examples of how to use the shape parameter later in this tutorial.

The `dtype` parameter enables you to specify the datatype of the `1`’s in the output array.

By default, the `1`’s will be floating point numbers. Specifically, the outputs will have the type `numpy.float64`.

Having said that, you can choose a data type from among the many Python and NumPy data types.

Later in this tutorial, I’ll show you an example of how to set the output datatype using the `dtype` parameter.

The `order` parameter controls whether a multi-dimensional array is stored in row-major (C-style) or column-major (Fortran-style) order in memory.

To be honest, this is something you’re unlikely to use, so we’re not going to cover it in this tutorial. If you need to use this parameter, I suggest that you review the official NumPy documentation.

Ok. Now that we’ve reviewed the parameters of the np.ones function, let’s look at some concrete examples of how the function works.

Before we get started, there’s a piece of code that you need to run:

```python
import numpy as np
```

You need to run that code first, otherwise the following examples won’t work properly.

We’re going to start simple.

Here, we’ll make a simple, 1-dimensional array of four `1`’s.

```python
np.ones(shape = 4)
```

Which produces the following output:

```python
array([ 1., 1., 1., 1.])
```

Keep in mind that you can also write the code without explicitly using the `shape` parameter:

```python
np.ones(4)
```

Both snippets of code – `np.ones(shape = 4)` and `np.ones(4)` – produce the same output. This is because the `shape` parameter is what we call a “positional” parameter. This means that you can pass an argument to that parameter strictly by its position inside the parentheses. NumPy knows that the first argument inside the function is supposed to correspond to the `shape` parameter.
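You can verify that equivalence yourself; this quick check (a sketch, assuming `import numpy as np`) compares the two calls with `np.array_equal`:

```python
import numpy as np

# Passing shape by keyword and by position produces identical arrays.
by_keyword = np.ones(shape=4)
by_position = np.ones(4)
print(np.array_equal(by_keyword, by_position))  # True
```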

Ok, back to the example. This is pretty straightforward, so it should be pretty clear what’s going on.

We have specified that we want the “shape” to be four values long. Essentially, this causes NumPy to create a new 1-dimensional NumPy array filled with `1`’s that is four values long.

This is really as simple as it gets. We specified that the “shape” is 4, which produces a NumPy array that is four values long.

Next, let’s examine how to make arrays of 1’s that have a more complex shape.

In the last example, we made a simple, 1-dimensional array that was four items long.

To create arrays with more complex shapes, we need to manipulate the `shape` parameter. Specifically, instead of providing an integer value to the `shape` parameter, like `shape = 4`, we need to provide a sequence of integers that specifies a multi-dimensional shape.

Let me show you an example so you understand.

Here, we’re going to create a 2-dimensional NumPy array with 2 rows and 3 columns.

To do this, we’re going to use the `shape` parameter. Specifically, we’re going to set `shape = (2, 3)`.

```python
np.ones(shape = (2,3))
```

This code creates the following array:

```python
array([[ 1.,  1.,  1.],
       [ 1.,  1.,  1.]])
```


So what do we have?

This is a 2-dimensional array. The two dimensions are the rows and columns.

The shape is 2 by 3. There are 2 rows and 3 columns.

We created this by specifying that exact shape by setting `shape = (2, 3)`. The first integer in the sequence (2) specifies the number of rows. The second integer in the sequence (3) specifies the number of columns.

Notice that when we use the `shape` parameter, the argument that we provided was a *tuple* of integers: `(2, 3)`. Having said that, the argument to the `shape` parameter can be any sequence. For example, we could instead have provided a list: `[2, 3]`. For the purposes of manipulating the `shape` parameter of np.ones, a Python list would work essentially the same as a Python tuple.
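Here’s a quick check of that claim (a minimal sketch, assuming `import numpy as np`):

```python
import numpy as np

# A tuple and a list define the same shape and produce equal arrays.
print(np.ones(shape=(2, 3)).shape)  # (2, 3)
print(np.ones(shape=[2, 3]).shape)  # (2, 3)
print(np.array_equal(np.ones(shape=(2, 3)), np.ones(shape=[2, 3])))  # True
```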

Next, let’s create a NumPy array that contains elements of a specific *data type*.

By default, the np.ones function creates an array of floating point numbers. Specifically, it creates an array of ones that have the data type `float64`.

We can change this behavior by using the `dtype` parameter.

This is very straightforward, but let’s take a look at an example so you can see how it’s done.

Here, we’re going to create an array of *integers*. To do this, we will set `dtype = int` inside of the np.ones() function:

```python
np.ones(3, dtype = int)
```

This code creates the following output:

```python
array([1, 1, 1])
```

This is really straightforward. The output is a NumPy array with three ones, all integers.

If you want to check the datatype of the output, you can examine it by using the `dtype` attribute:

```python
np.ones(3, dtype = int).dtype
```

When you run this code, you’ll see that the data type is integer … `int64` to be specific.

Remember that when using the `dtype` parameter, you can specify essentially any Python data type or NumPy data type.
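For instance, here are a few `dtype` arguments that work (a sketch, assuming `import numpy as np`); NumPy accepts Python types, NumPy types, and type-name strings:

```python
import numpy as np

print(np.ones(2, dtype=float).dtype)      # float64
print(np.ones(2, dtype=np.int32).dtype)   # int32
print(np.ones(2, dtype='float32').dtype)  # float32
print(np.ones(2, dtype=bool))             # [ True  True]
```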

Let’s try one more example.

Here, we’re going to use several of the parameters together to precisely control the output of the NumPy ones function.

```python
np.ones(shape = (3, 5), dtype = int)
```

This code creates the following output:

```python
array([[1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1]])
```

If you’ve understood the prior examples in this tutorial, you should understand what’s going on here.

We’ve called the function with the code `np.ones()`.

Inside the function, we’ve used two parameters to control the function: `shape` and `dtype`.

We set `shape = (3, 5)` to create an array with 3 rows and 5 columns.

We set `dtype = int` to specify that we want the values of the output array to be *integers*. Specifically, the values in the array are of the type `int64`.
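As a side note (this goes beyond what the tutorial covers, but it’s a common pattern), an array of ones makes a convenient starting point for an array filled with any constant value:

```python
import numpy as np

# Multiply an array of ones by a scalar to fill it with a constant.
sevens = 7 * np.ones(shape=(3, 5), dtype=int)
print(sevens.shape)  # (3, 5)
print(sevens[0])     # [7 7 7 7 7]
```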

If you’re interested in learning more about NumPy, we have several other tutorials.

There are NumPy tutorials that explain:

- The basics of NumPy arrays
- How to use the NumPy zeros function
- How to use the NumPy arange function
- A quick introduction to the NumPy reshape function
- How to use the NumPy linspace function

… and more.

If you’re interested in NumPy (and data science in Python) then check out those tutorials.

Moreover, if you want to learn more about data science in Python (and data science in general) then sign up for our email list.

Here at Sharp Sight, we teach data science.

And every week, we publish articles and tutorials about data science.

When you sign up for our email list, you’ll get immediate access to our tutorials … they’ll be delivered right to your inbox.

When you sign up, you’ll learn about:

- NumPy
- Pandas
- Base Python
- Scikit learn
- Machine learning
- Deep learning
- … and more.

Want to learn data science in Python? Sign up now.
