The post How to use NumPy random choice appeared first on Sharp Sight.

I recommend that you read the whole blog post, but if you want, you can skip ahead. Here are the contents of the tutorial …

**Contents:**

- a quick review of NumPy
- why we use np.random.choice
- the syntax of NumPy random choice
- examples of np.random.choice

Again, if you have the time, I strongly recommend that you read the whole tutorial. Everything will make more sense if you read everything carefully and follow the examples.

Ok … let’s get into it.

First of all, what is np.random.choice?

NumPy random choice is a function from the NumPy package in Python.

You might know a little bit about NumPy already, but I want to quickly explain what it is, just to make sure that we’re all on the same page.

NumPy is a data manipulation module for Python.

Specifically, the tools from NumPy operate on arrays of numbers … i.e., numeric data.

Because NumPy functions operate on numbers, they are especially useful for data science, statistics, and machine learning.

For example, if you want to do some data analysis, you’ll often be working with tables of numbers. Frequently, when you work with data, you’ll need to organize it, reshape it, clean it and transform it. We call these data cleaning and reshaping tasks “data manipulation.”

In recent years, NumPy has become particularly important for “machine learning” and “deep learning,” since these often involve large datasets of numeric data. When you’re doing machine learning and deep learning, numeric data manipulation is a very big part of the workflow.

In any case, whether you’re doing statistics or analysis or deep learning, NumPy provides an excellent toolkit to help you clean up your data.

One common task in data analysis, statistics, and related fields is taking random samples of data.

You’ll see random samples in probability, Bayesian statistics, machine learning, and other subjects. Random samples are very common in data-related fields.

NumPy random choice provides a way of *creating* random samples with the NumPy system.

If you’re working in Python and doing any sort of data work, chances are (heh, heh), you’ll have to create a random sample at some point.

NumPy random choice can help you do just that.

To explain it though, let’s take a look at an example.

Think of a die … the kind of die that you would see in a game:

A typical die has six sides. Each side has some dots on it, corresponding to a number 1 through 6. Essentially, a die has the numbers 1 to 6 on its six different faces.

If you roll the die, when the die lands, one face will emerge pointing upwards, so rolling the die is exactly like selecting a number between 1 and 6. The numbers 1 to 6 on the die are the possible outcomes that can appear, and rolling a die is like randomly *choosing* a number between 1 and 6.

So essentially, in the example of rolling a die, we have possible outcomes (i.e., the faces), and a random process that chooses one of them.

The NumPy random choice function is a lot like this. Given an input array of numbers, numpy.random.choice will *choose* one of those numbers randomly.

So let’s say that we have a NumPy array of 6 integers … the numbers 1 to 6.

If we apply np.random.choice to this array, it will select one. It will *choose* one randomly…. it’s essentially the same as rolling a die.

That’s how np.random.choice works. You input some items, and the function will randomly choose one or more of them as the output.
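Here's a quick sketch of the die example in code, just to make the idea concrete. The variable names `die_faces` and `roll` are my own for this illustration, and the `import numpy as np` line is explained a little later in this tutorial.

```python
import numpy as np

# The six faces of a die, stored as a NumPy array.
die_faces = np.arange(1, 7)

# Choose one face at random, which is like rolling the die once.
roll = np.random.choice(die_faces)
print(roll)
```

Every time you run this without setting a seed, you may get a different face, just like a real die.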

Conceptually, this function is easy to understand, but using it properly can be a little tricky.

Ultimately, to use NumPy random choice properly, you need to know the syntax and how the syntax works.

That being the case, let’s look at the syntax of np.random.choice.

One quick note …

In this tutorial, you’ll see me refer to the function as np.random.choice.

The term “`np`” refers to NumPy. But, to get the syntax to work properly, you need to tell your Python system that you’re referring to NumPy as “np”. You need to run the code `import numpy as np`. This code essentially tells Python that we’re giving the NumPy package the nickname “`np`”.

I’ll show you exactly how to do that again in the examples section of this tutorial, but I want to briefly explain it before we look at the syntax.

Ok, let’s take a look at the syntax.

The `np.random.choice()` function is fairly simple. When you use it, there is the name of the function, and then some parameters that will be enclosed inside of parentheses.

Because the parameters of the function are important to how it works, let’s take a closer look at the parameters of NumPy random choice.

There are four parameters for the NumPy random choice function:

- `a`
- `size`
- `replace`
- `p`

Let’s discuss each of these individually.

`a` (required)

The `a` parameter enables us to specify the array of input values … typically a NumPy array.

This is essentially the set of input elements from which we will generate the random sample.

Note that the `a` parameter is *required* … you need to provide some array-like structure that contains the inputs to the random selection process.

Also note that the `a` parameter is flexible in terms of the inputs that it will accept. Typically, we’ll supply a NumPy array of numbers to the `a` parameter. However, because it is flexible, it will also accept things like Python lists, tuples, and other Python sequences.

Moreover, instead of supplying a sequence like a NumPy array, you can also just provide a *number* (i.e., an integer). If you provide an integer `n`, it will create a NumPy array of integers up to but excluding n by using the NumPy arange function. In this case, it’s as if you supplied a NumPy array with the code `np.arange(n)`. I’ll show you an example of this in the examples section of this tutorial.
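As a quick sketch of how flexible the `a` parameter is, the code below passes an integer, a NumPy array, a list, and a tuple. The variable names are hypothetical, and the seed value is arbitrary; I set the same seed twice only so that the integer and array versions can be compared.

```python
import numpy as np

# Passing an integer n is shorthand for sampling from np.arange(n).
np.random.seed(0)
from_int = np.random.choice(5, size=3)

np.random.seed(0)
from_array = np.random.choice(np.arange(5), size=3)

print(np.array_equal(from_int, from_array))  # True

# The a parameter also accepts plain Python sequences.
from_list = np.random.choice([10, 20, 30])
from_tuple = np.random.choice((10, 20, 30))
print(from_list, from_tuple)
```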

`size`

The `size` parameter describes (…. wait for it ….)

… the *size* of the output.

Remember that the NumPy random choice function accepts an input of elements, chooses randomly from those elements, and outputs the random selections as a NumPy array.

Because the output of numpy.random.choice is a NumPy array, the array will have a *size*. You can set `size` to an integer (the number of values to select) or to a tuple of integers (the shape of the output array). If you leave `size` out, the function returns a single value instead of an array. If you know about NumPy arrays, this will make sense, but if you’re new to NumPy this may be confusing.

Therefore, if you don’t know what the `size` attribute is, I suggest that you read our tutorial about NumPy arrays. Specifically, you should read the section about the attributes of NumPy arrays.
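To make `size` a little more concrete, here's a small sketch that calls the function three ways. The names `single`, `flat`, and `grid` are just labels I made up for this illustration.

```python
import numpy as np

np.random.seed(0)

single = np.random.choice(10)               # no size: one value, not an array
flat = np.random.choice(10, size=5)         # 1-D array with 5 values
grid = np.random.choice(10, size=(2, 3))    # 2-D array with 2 rows and 3 columns

print(np.shape(single))  # ()
print(flat.shape)        # (5,)
print(grid.shape)        # (2, 3)
```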

`replace`

The `replace` parameter specifies whether or not you want to sample with replacement. By default, it is set to `replace = True`.

If you’ve taken a statistics class, you’ll probably be familiar with this.

… but if you *haven’t* taken a stats class, the idea of sampling with and without replacement might be foreign.

That being the case, let me quickly explain.

Let’s say that you have 4 simple cards on a table: a diamond, a spade, a heart, and a club. (This is an extremely simple example, so we’re working with simplified playing cards.)

I turn them over and mix them up on the table. Then I ask you to close your eyes and pick one.

You make your selection … it’s the heart card.

Next, I ask you to select another card.

… now, this is the critical point.

Do you put your first card back or not? Do you “replace” your initial selection?

If you *do* put your card back, then it will be possible to re-select the heart card, or any of the other three cards. But if you *do not* replace your initial card, then it will only be possible to select a spade, diamond, or club.

Essentially, *replacement* makes a difference when you choose multiple times.

And this is what the `replace` parameter controls. It will control whether or not an element that is chosen by numpy.random.choice gets *replaced* back into the pool of possible choices.

I’ll explain this again in the examples section, so you can see it in action.
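In the meantime, here's a minimal sketch of the four-card scenario. The list name `cards` and the sample sizes are assumptions I chose for illustration; the point is only to show how the two settings of `replace` behave.

```python
import numpy as np

cards = ['Diamond', 'Spade', 'Heart', 'Club']

# Without replacement, a card can appear at most once in the sample.
hand = np.random.choice(cards, size=4, replace=False)
print(sorted(hand) == sorted(cards))  # True

# Asking for more items than exist fails when replace=False.
try:
    np.random.choice(cards, size=5, replace=False)
except ValueError as err:
    print("ValueError:", err)

# With replacement, a larger sample is fine because cards can repeat.
big_hand = np.random.choice(cards, size=5, replace=True)
print(len(big_hand))  # 5
```

Drawing all 4 cards without replacement is effectively a shuffle: every card shows up exactly once, just in a random order.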

`p`

Finally, the `p` parameter controls the probability of selecting a given item.

By default, each item in the input array has an equal probability of being selected.

It’s like rolling a fair die.

A fair die has 6 sides, and each side is equally likely to come up. So the probability of rolling a 1 is about .1667 (i.e., 1/6). The probability of rolling a 2 is also about .1667, etc.

Similarly, if we set up NumPy random choice with the input values 1 through 6, then each of those values will have an equal probability of being selected, by default.

But we can change that. We can manually specify the probabilities of the different outcomes. For example, we could give ‘`1`’ a probability of .5, and give each of the other five outcomes a probability of .1. (This is akin to rolling an unfair, weighted die.)

Essentially, this is what the `p` parameter controls: the probabilities of selecting the different input elements.

Note that the `p` parameter is optional, and if we don’t provide anything, NumPy just treats each outcome as equally likely.

If we *do* provide something to the `p` parameter, then we need to provide it in the form of an “array like” object, such as a NumPy array, list, or tuple. It needs one probability for each input element, and the probabilities need to add up to 1.
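Here's a short sketch of how `p` is supplied, and of what happens when the probabilities are not valid. The names `faces` and `weights` are hypothetical, and the seed is arbitrary.

```python
import numpy as np

faces = np.arange(1, 7)
weights = [.5, .1, .1, .1, .1, .1]   # one probability per face, summing to 1

np.random.seed(0)
sample = np.random.choice(faces, size=10, p=weights)
print(sample)

# Probabilities that do not sum to 1 are rejected with a ValueError.
try:
    np.random.choice(faces, p=[.5, .5, .5, .5, .5, .5])
except ValueError as err:
    print("ValueError:", err)
```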

Now that we’ve looked at the syntax of numpy.random.choice, and we’ve taken a closer look at the parameters, let’s look at some examples.

**Examples:**

- select a random number from a numpy array
- generate a random sample from a numpy array
- perform random sampling with replacement
- change the probabilities of different outcomes
- select a sample from a list of items

Before you run any of these examples, you’ll need to run some code as a preliminary setup step.

Specifically, you’ll need to properly import the NumPy module.

Keep in mind that to import the NumPy module into your code environment, you’ll need to have NumPy installed on your computer first. Installing NumPy is complicated, and beyond the scope of this blog post. Having said that, I recommend that you just use Anaconda to get the modules properly installed.

But assuming that you have NumPy installed on your computer, you can import it into your working environment with the following code:

import numpy as np

This will import NumPy with the nickname `np`. Going forward, we will syntactically refer to NumPy as `np` in our code.

In this first example, we’re going to select a single integer from a range of possible integers.

More specifically, we’re going to select a single integer between 0 and 9.

First, before we use np random choice to randomly select an integer from an array, we actually need to *create* the NumPy array.

Let’s do that now.

Here, we’re going to create a simple NumPy array with the numpy.arange function.

array_0_to_9 = np.arange(start = 0, stop = 10)

This is fairly straightforward, as long as you understand how to use np.arange. If you don’t, make sure to read our numpy.arange tutorial.

Using NumPy arange this way has created a new array, called array_0_to_9. This array contains the integers from 0 to 9.

You can print it out with the print function:

print(array_0_to_9)

OUTPUT:

[0 1 2 3 4 5 6 7 8 9]

Visually, we can represent the array as follows:

This is really straightforward … this array contains the integers from 0 to 9.

Next, we’re going to randomly select one of those integers from the array.

To select a random number from `array_0_to_9` we’re now going to use numpy.random.choice.

np.random.seed(0)
np.random.choice(a = array_0_to_9)

OUTPUT:

5

If you read and understood the syntax section of this tutorial, this is somewhat easy to understand. But there are a few potentially confusing points, so let me explain it.

Essentially, we’re using np.random.choice with the ‘`a`’ parameter. You’ll remember from the syntax section earlier in this tutorial that the `a` parameter enables us to set the input array (i.e., the NumPy array that contains our input values). In other words, the code `a = array_0_to_9` indicates that the input values are contained in the array `array_0_to_9`.

Remember, the input array `array_0_to_9` simply contains the numbers from 0 to 9.

When we use np.random.choice to operate on that array, it simply randomly selects one of those numbers.

In this case, it randomly selects the number 5.

Visually, we can represent the operation like this:

The input array has 10 values, and NumPy random choice randomly chooses one of them.

There’s one part of this code that confuses many beginners, so I want to address it.

Before we ran the line of code `np.random.choice(a = array_0_to_9)`, we ran the code `np.random.seed(0)`.

We need np.random.seed because it “seeds” the random number generator for numpy.random.choice.

But WTF is a “seed” anyway?

This is a little complicated, but I’ll briefly explain here.

The NumPy random choice function operates on the principle of pseudorandom number generation.

When we use a pseudorandom number generator, the numbers in the output *approximate* random numbers, but are not exactly “random.” In fact, when we use pseudorandom numbers, the output is actually *deterministic*; the output is actually determined by an initializing value called a “seed.”

Let me say that again: when we set a seed for a pseudorandom number generator, the output is completely determined by the seed.

What that means is that if we use the same seed, a pseudorandom number generator will produce the same output.

Let me show you:

np.random.seed(0)
np.random.choice(a = np.arange(10))

This produces the output 5.

Now run it again with the same seed.

np.random.seed(0)
np.random.choice(a = np.arange(10))

It produces the output 5 again.

You can run this code as many times as you like. If you use the same seed, it will produce the exact same output.

What this means is that np.random.choice is random-ish. It’s sort of random, in the sense that there will be no discernible relationship between the seed and the output. But you have to remember that using the same seed will produce the same output.

This is actually good, because it makes the results of a pseudorandom function reproducible. If I share my code with you, and you run it with the same seed, you will get the exact same result. This is good for code testing, among other things.

If this is still confusing, you should read our tutorial about numpy.random.seed, which explains random number generation with NumPy.
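If you'd like to check the reproducibility claim yourself, here's a minimal sketch. The first half uses `np.random.seed`, exactly as in this tutorial. The second half uses `np.random.default_rng`, which is not covered in this tutorial; it is a newer interface (assuming NumPy 1.17 or later) and I include it only as an aside. Note that the two interfaces will generally produce different numbers from each other, even with the same seed.

```python
import numpy as np

# Same seed, same call: the two samples are identical.
np.random.seed(0)
first = np.random.choice(10, size=5)

np.random.seed(0)
second = np.random.choice(10, size=5)

print(np.array_equal(first, second))  # True

# Aside: a Generator object carries its own seed (assumes NumPy >= 1.17).
rng_a = np.random.default_rng(0)
rng_b = np.random.default_rng(0)
same = np.array_equal(rng_a.choice(10, size=5), rng_b.choice(10, size=5))
print(same)  # True
```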

Ok.

Now that I’ve shown you how to select a single random number from a specific NumPy array, let’s take a look at another way to select a number from a sequence of values.

Here, we’re going to select a number from the numbers 0 to 9. It’s essentially just like the prior example.

The one major difference is that we’re not going to supply a specific input array. Instead, we’re just going to provide a number inside of the parentheses when we call np.random.choice. Here, we’re going to run the code `np.random.choice(10)`.

np.random.seed(0)
np.random.choice(10)

Which produces the exact same output as in the previous example.

OUTPUT:

5

What’s going on here?

In this example, we ran the code `np.random.choice(10)`. We did not provide a specific NumPy array as an input. Instead, we just provided the number `10`.

When we provide a number to np random choice this way, it will automatically *create* a NumPy array using NumPy arange. Effectively, the code `np.random.choice(10)` is identical to the code `np.random.choice(a = np.arange(10))`. So by running np.random.choice this way, it will create a new numpy array of values from 0 to 9 and pass that as the input to numpy.random.choice.

This is essentially a shorthand way to both create an array of input values and then select from those values using the NumPy random choice function.

Now that you’ve learned how to select a *single* number from a NumPy array, let’s take a look at how to create a random sample with NumPy random choice. That is, we’re going to select *multiple* elements from an input range.

First, let’s just create a NumPy array.

Here, we’ll create a NumPy array of values from 0 to 99.

array_0_to_99 = np.arange(100)

Now that we have our input array, let’s select a sample of 5 numbers from it:

To do this, we’ll use the `size` parameter.

np.random.seed(1)
np.random.choice(array_0_to_99, size = 5)

OUTPUT:

array([37, 12, 72, 9, 75])

What happened here?

The NumPy random choice function randomly selected 5 numbers from the input array, which contains the numbers from 0 to 99.

The output is basically a random sample of the numbers from 0 to 99.

Next, let’s create a random sample with replacement using NumPy random choice.

Here, we’re going to create a random sample with replacement from the numbers 1 to 6.

First, we’ll just create a NumPy array of the values from 1 to 6.

array_1_to_6 = np.arange(start = 1, stop = 7)

If we print it out, we can see the contents.

print(array_1_to_6)

OUT:

[1 2 3 4 5 6]

This is really straightforward. It’s just the numbers from 1 to 6.

Now, we’ll generate a random sample from those inputs.

Specifically, we’re going to create a sample of 3 values.

Additionally, we will set the `replace` parameter to `replace = True`. This will cause np.random.choice to perform random sampling with replacement. That is, even if a value is selected once, it will be “replaced” back into the possible input values, so it can be selected again.

Let’s run the code.

np.random.seed(77)
np.random.choice(a = array_1_to_6, size = 3, replace = True)

OUTPUT:

array([5, 5, 4])

Notice what’s in the output. We have an output of 3 values. This is because we set the `size` parameter to `size = 3`. That means that the output must have 3 values.

Also, notice the values that are in the output. The value `5` appears *twice*.

Why?

This is possible because we set the `replace` parameter to `replace = True`.

When we do this, it means that an item in the input can be selected (i.e., included in the sample) and will then be “replaced” back into the pool of possible input values. Setting `replace = True` essentially means that a given input value can be selected multiple times!

Remember earlier in this tutorial that I explained NumPy random choice in terms of rolling a die?

That’s essentially what we’ve done in this example. The code `np.random.choice(a = array_1_to_6, size = 3, replace = True)` is essentially like rolling a die multiple times!

That’s what’s great about Python and NumPy … if you know how to use the tools right, you can begin to create little models of real-world processes.
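As one small sketch of such a model, the code below simulates rolling two dice many times and estimates how often they add up to 7. The number of rolls, the seed, and the variable names are all assumptions I picked for illustration.

```python
import numpy as np

np.random.seed(0)
faces = np.arange(1, 7)
n_rolls = 100_000

# Each roll is an independent draw, so sampling with replacement is appropriate.
first_die = np.random.choice(faces, size=n_rolls, replace=True)
second_die = np.random.choice(faces, size=n_rolls, replace=True)

# Fraction of rolls where the two dice add up to 7 (the exact value is 1/6).
share_of_sevens = np.mean(first_die + second_die == 7)
print(share_of_sevens)
```

With this many rolls, the estimate should land close to 1/6, which is about .1667.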

Next, we’re going to work with the `p` parameter to change the probabilities associated with the different possible outcomes.

So for example, let’s reuse our array `array_1_to_6`.

Here’s the code to create the array again:

array_1_to_6 = np.arange(start = 1, stop = 7)

Essentially, the array `array_1_to_6` has the values from 1 to 6.

Now, we’re going to randomly select from those values (1 to 6) but the probability of each value will not be the same.

Remember that by default, np.random.choice gives each input value an equal probability of being selected.

… but if we use the `p` parameter, we can change this.

np.random.choice(a = array_1_to_6, p = [.5,.1,.1,.1,.1,.1])

What are we doing here?

We’re using the `p` parameter to give the input values (1 to 6) different probabilities.

We can visualize the new setup like this:

So essentially, the value “`1`” will have a probability of being selected of .5 (a 50% chance). And the other values from `2` to `6` will each have a probability of .1.

Now let’s run the code:

np.random.seed(42)
np.random.choice(a = array_1_to_6, p = [.5,.1,.1,.1,.1,.1])

OUT:

1

Now let’s run the code again, but instead of generating a single value, we’ll generate a random sample of 20 values.

np.random.seed(42)
np.random.choice(a = array_1_to_6, p = [.5,.1,.1,.1,.1,.1], size = 20)

OUT:

array([1, 6, 4, 2, 1, 1, 1, 5, 3, 4, 1, 6, 5, 1, 1, 1, 1, 2, 1, 1])

Look closely at the numbers in the output array. LOOK AT ALL THOSE `1`’s.

Just by glancing at the output, you can see that `1` is coming up a lot more than the other values. That’s exactly how we designed it! There’s a 50% chance of generating a `1`.
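If you want something more precise than a glance, here's a sketch that counts how often each value comes up in a much larger sample. The sample size of 10,000 and the use of `np.unique` with `return_counts=True` are my own choices for this illustration, not part of the example above.

```python
import numpy as np

np.random.seed(42)
faces = np.arange(1, 7)
sample = np.random.choice(faces, size=10_000, p=[.5, .1, .1, .1, .1, .1])

# Count how often each face appears and convert the counts to proportions.
values, counts = np.unique(sample, return_counts=True)
for value, share in zip(values, counts / sample.size):
    print(value, round(share, 3))
```

The proportion for `1` should come out near .5, and the proportions for the other values should each come out near .1.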

Next, let’s move on from using *numbers* as possible outcomes.

…. let’s start using non-numeric inputs in the input array.

Here, we’re going to use a simple example.

For our input array, we’re going to create a Python list of 4 simplified playing cards: a ‘Diamond’ card, a ‘Spade’ card, a ‘Heart’, and a ‘Club’.

simple_cards = ['Diamond','Spade','Heart','Club']

You can think of the list `simple_cards` like this:

`simple_cards` represents a simplified set of 4 cards.

This is obviously not like a real set of 52 playing cards. As always, I really want to simplify this as much as possible just so you can see how this works.

Technically though, what is `simple_cards`? It’s a Python list that contains 4 strings.

Now that we have our Python list, we’re first just going to select a single item randomly from that list.

This is really easy. It’s almost exactly the same as some of the previous examples above where we were selecting a single item from a NumPy array of numbers. The only difference is that we’re supplying a *list of strings* to numpy.random.choice instead of a NumPy array.

Let’s take a look.

np.random.seed(0)
np.random.choice(simple_cards)

OUTPUT:

'Diamond'

You can think of this code like selecting a single card from our simplified deck of four cards. There are four possible cards, and we selected the diamond.

From a technical perspective, if you read the earlier examples in this blog post, this should make sense.

All we did is randomly select a single item from our Python list.

Keep in mind though that the code is a little simplified syntactically, because I did not explicitly reference the parameters. If we were a little more explicit in how we wrote this, we could write the code as `np.random.choice(a = simple_cards, replace = True)`. That’s effectively the same thing.

Now, let’s move on to a slightly more complicated example. We’re going to generate a random sample from our Python list.

Random sampling from a Python list is easy with NumPy random choice.

Once again, it’s almost exactly the same as some of the previous examples in this blog post.

Here, we’re going to select *two* cards from the list.

Essentially, we’re just going to pass the Python list to NumPy random choice and set the `size` parameter to 2. We’ll also set `replace = False` to make it so we can’t select the same card twice. I really want this to be like selecting two different cards from a deck of cards.

Let’s take a look at the code.

np.random.seed(55)
np.random.choice(a = simple_cards, size = 2, replace = False)

OUT:

array(['Diamond', 'Club'], dtype='<U7')

So in this example, we randomly selected two cards from the ‘deck’ (i.e., we randomly selected 2 items from the list).

We selected the ‘Diamond’ and the ‘Club.’

Again, this example is pretty straightforward if you’ve read and understood the previous examples.

If this does *not* make sense, I recommend that you start at the top and review a few of the more simple examples more carefully.

Random sampling is really important for data science, speaking broadly.

The reason is that random sampling is a key concept and technique in probability. It’s also very important in statistics. Moreover, sampling is also applicable to machine learning and deep learning.

Essentially, random sampling is really important for a variety of sub-disciplines of data science.

You really need to know how to do this!

I’ve written this tutorial to help you get started with random sampling in Python and NumPy.

Having said that, I realize that random sampling can be confusing to beginners.

With that in mind, if you have specific questions about random sampling with NumPy or about the NumPy random choice function, please post your question in the comments section at the bottom of this page.

Not only is the numpy.random.choice function important for data science and probability, the broader NumPy toolkit is important for data science in Python.

NumPy gives you a set of tools for working with numeric data in Python. To really get the most out of the NumPy package, you’ll need to learn *many* functions and tools … not just numpy.random.choice. For example, you’ll need to learn

I recommend that you read our free tutorials …. they will teach you a lot about NumPy.

I also recommend that you sign up for our email list.

We regularly post tutorials about NumPy and data science in Python.

If you sign up for our email list, you’ll get our tutorials delivered directly to your inbox …

You’ll get free tutorials on:

- NumPy
- Pandas
- Base Python
- Scikit learn
- Machine learning
- Deep learning
- … and more.

If you want to learn more about NumPy and data science, sign up now.


The post A quick guide to NumPy sort appeared first on Sharp Sight.

As the name implies, the NumPy sort technique enables you to *sort* NumPy arrays.

So, this blog post will show you exactly how to use the technique to sort different kinds of arrays in Python.

The blog post has two primary sections, a syntax explanation section and an examples section.

**Contents:**

You can click on either of those links and it will take you to the appropriate section in the tutorial.

But if you’re new to Python and NumPy, I suggest that you read the whole blog post.

Ok. Let’s just start out by talking about the sort function and where it fits into the NumPy data manipulation system.

If you’re reading this blog post, you probably know what NumPy is.

But, just in case you don’t, I want to quickly review NumPy.

NumPy is a toolkit for doing data manipulation in Python.

More specifically, NumPy provides a set of tools and functions for working with arrays of numbers. That’s actually where the name comes from:

“**Num**erical **Py**thon” ….

NumPy.

Although the tools from NumPy can work on a variety of data structures, they are primarily designed to operate on NumPy arrays.

NumPy arrays are essentially arrays of numbers. We’ll create some NumPy arrays later in this tutorial, but you can think of them as row-and-column grids of numbers.

And again, the tools of NumPy can perform manipulations on these arrays. For example, you can do things like calculate the mean of an array, calculate the median of an array, calculate the maximum, etc.

Essentially, NumPy is a broad toolkit for working with arrays of numbers.

And one of the things you can do with NumPy is *sort* an array.

That’s basically what NumPy sort does … it sorts NumPy arrays.

Let me give you a quick example.

Imagine that you have a 1-dimensional NumPy array with five values that are in random order:

You can use NumPy sort to *sort* those values in ascending order. Essentially, numpy.sort will take an input array, and output a new array in sorted order.

Take a look at that image and notice what np.sort did.

It sorted the array in ascending order, from low to high. That’s it.

To be clear, the NumPy sort function can actually sort arrays in more complex ways, but at a basic level, that’s all the function does. It sorts data.

Ok … so now that I’ve explained the NumPy sort technique at a high level, let’s take a look at the details of the syntax.

In this section, I’ll break down the syntax of np.sort.

Before I do that though, you need to be aware of some syntax conventions.

When we write NumPy code, it’s very common to refer to NumPy as `np`.

Syntactically, `np` frequently operates as a “nickname” or alias of the NumPy package. So if you see the term `np.sort()`, that’s sort of a shorthand for `numpy.sort()`.

Having said that, this sort of aliasing only works if you set it up properly.

To set up that alias, you’ll need to “import” NumPy with the appropriate nickname by using the code `import numpy as np`.

We’ll talk more about this in the examples section, but I want you to understand this before I start explaining the syntax.

Ok. Let’s take a close look at the syntax.

To initiate the function (assuming you’ve imported NumPy as I explained above), you can call the function as `np.sort()`. Again though, you can also refer to the function as `numpy.sort()` and it will work in a similar way.

Then inside of the function, there are a set of parameters that enable you to control exactly how the function works.

The function is fairly simple, but to really understand it, you need to understand the parameters.

With that in mind, let’s talk about the parameters of numpy.sort.

The np.sort function has 3 primary parameters:

`a`

`axis`

`kind`

There’s also a 4th parameter called `order`. Since `order` is not used very often and it’s a little more complicated to understand, I am leaving it out of this tutorial.

However, the parameters `a`, `axis`, and `kind` are much more common. That being the case, I’ll explain only those three in a little more detail.

`a` (required)

The `a` parameter simply refers to the NumPy array that you want to operate on.

Typically, this will be a NumPy array object. However, np.sort (like almost all of the NumPy functions) will also operate on “array-like” objects. So for example, numpy.sort will sort Python lists, tuples, and many other iterable types.

Keep in mind that this parameter is *required*. So you need to provide a NumPy array here, or an array-like object.
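Here's a quick sketch of np.sort operating on an array-like object instead of a NumPy array. The list name `values` is hypothetical. The thing to notice is that the function hands back a new, sorted NumPy array and leaves the original list alone.

```python
import numpy as np

values = [5, 3, 1, 2, 4]          # a plain Python list, not a NumPy array

sorted_values = np.sort(values)

print(isinstance(sorted_values, np.ndarray))  # True
print(sorted_values)                          # [1 2 3 4 5]
print(values)                                 # [5, 3, 1, 2, 4]  (input is unchanged)
```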

`axis`

The `axis` parameter describes the axis along which you will sort the data.

This parameter is *optional*.

By default, `axis` is set to `axis = -1`. This means that if you don’t use the axis parameter, then by default, the np.sort function will sort the data on the last axis.

If you’re not sure what an “axis” is, I recommend that you read our tutorial about NumPy axes. You’ll also learn more about how this parameter works in the examples section of this tutorial.
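To give you a rough feel for what “sorting on an axis” means, here's a small sketch with a made-up 3 by 3 array called `grid`. For a 2D array, the last axis runs across each row, and axis 0 runs down each column.

```python
import numpy as np

grid = np.array([[3, 1, 2],
                 [9, 7, 8],
                 [6, 4, 5]])

# Default (axis=-1): sort along the last axis, i.e. within each row.
print(np.sort(grid))
# [[1 2 3]
#  [7 8 9]
#  [4 5 6]]

# axis=0: sort down the first axis, i.e. within each column.
print(np.sort(grid, axis=0))
# [[3 1 2]
#  [6 4 5]
#  [9 7 8]]
```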

`kind`

The `kind` parameter specifies the sorting algorithm you want to use to sort the data.

If you’re not well-trained with computer science and algorithms, you might not realize this ….

… but there are many different algorithms that can be used to sort data. Moreover, these different sorting techniques have different pros and cons. For example, some algorithms are faster than others.

So, there are several different options for this parameter: `quicksort`, `heapsort`, and `mergesort`.

By default, the `kind` parameter is set to `kind = 'quicksort'`.

The `quicksort` algorithm is typically sufficient for most applications, so we’re not really going to change this parameter in any of our examples. (If you have a question about sorting algorithms, just leave your question in the comments section below.)
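Even though we won't change `kind` in the examples, here's a minimal sketch showing how you would set it. The array `values` is made up for illustration. For a simple array of numbers like this one, all three options give the same sorted result.

```python
import numpy as np

values = np.array([5, 3, 1, 2, 4])

# Each algorithm returns the same sorted values here; they differ in
# things like speed and memory use rather than in the final ordering.
results = {}
for kind in ['quicksort', 'heapsort', 'mergesort']:
    results[kind] = np.sort(values, kind=kind)
    print(kind, results[kind])
```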

Ok … now that you’ve learned more about the parameters of numpy.sort, let’s take a look at some working examples.

To learn and master a new technique, it’s almost always best to start with very, very simple examples.

This, by the way, is one of the mistakes that beginners make when learning new syntax; they work on examples that are simply too complicated.

Because simple examples are so important, I want to show you simple examples of how the np.sort function works.

I’ll show you how it works with NumPy arrays of different sizes …

And I’ll also show you how to use the parameters.

Here’s a list of the examples we’ll cover:

- Sort a 1D numpy array
- How to sort the *columns* of a 2D array
- How to sort the *rows* of a 2D array
- Sort a NumPy array in reverse order

But before you run the code in the following examples, you’ll need to make sure that everything is set up properly.

Before you run the code below, you’ll need to have NumPy installed and you’ll need to “import” the NumPy module into your environment.

Installing NumPy can be very complex, and it’s beyond the scope of this tutorial. If you don’t have it installed, you can search online for how to install it. My recommendation is to simply start using Anaconda.

Assuming that you have NumPy *installed* though, you’ll still need to run some code to import it.

To import NumPy, you can run this:

import numpy as np

This will make the NumPy functions available in your code.

Also, after running this code, you’ll be able to refer to NumPy in your code with the nickname ‘`np`’.

Ok … now we’re ready to go.

First, we’ll start very simple.

We’re going to sort a simple, 1-dimensional numpy array.

Before we sort the array, we’ll first need to create the array. To do this, we’re going to use the np.array function. The np.array function will enable us to create a NumPy array object from a Python list of 5 numbers:

simple_array_1d = np.array([5,3,1,2,4])

And we can print out the array with a simple print statement:

print(simple_array_1d)

Which shows the following output:

[5 3 1 2 4]

This is really simple. We just have a NumPy array of 5 numbers. As you can see, the numbers are arranged in a random order.

Next, we can sort the array with np.sort:

np.sort(simple_array_1d)

When we run this, np.sort will produce the following output array:

array([1, 2, 3, 4, 5])

As you can see, the output of np.sort is the same group of numbers, but now they are sorted in ascending order.
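One detail worth seeing in code: np.sort returns a sorted *copy* and leaves the original array alone. The sketch below repeats the example and also shows the array method `.sort()`, which sorts in place instead. The names `sorted_copy` and `in_place` are just illustrative.

```python
import numpy as np

simple_array_1d = np.array([5, 3, 1, 2, 4])

sorted_copy = np.sort(simple_array_1d)   # returns a new, sorted array
print(sorted_copy)        # [1 2 3 4 5]
print(simple_array_1d)    # [5 3 1 2 4]  (the original is unchanged)

# The array method .sort() sorts in place and returns None.
in_place = simple_array_1d.copy()
in_place.sort()
print(in_place)           # [1 2 3 4 5]
```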

Next, we’re going to sort the columns of a 2-dimensional NumPy array.

To do this, we’ll first need to *create* a 2D NumPy array.

Ultimately here, we’re going to create a 3 by 3 array of 9 integers, randomly arranged.

To do this, we’re going to use the numpy.arange function to create an array of integers from 1 to 9, then randomly arrange them with numpy random choice, and finally reshape the array into a 3 by 3 array with numpy.reshape.

np.random.seed(77)
array_2d = np.random.choice(a = np.arange(start = 1, stop = 10), size = 9, replace = False).reshape([3,3])

And now let’s print out `array_2d` to see what’s in it.

print(array_2d)

Which produces the following output:

[[3 6 1]
 [2 4 7]
 [5 9 8]]

As you can see, we have a 2D array of the integers 1 to 9, arranged in a random order.

To be honest, the process for creating this array is a little complicated, so if you don’t understand it, you should review our tutorial on NumPy arange and our tutorial on NumPy reshape.
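If the one-line version is hard to read, it may help to see the same steps broken apart. This is just a sketch of the same process; the intermediate names `integers` and `shuffled` are ones I made up for illustration.

```python
import numpy as np

np.random.seed(77)

# Step 1: the integers 1 through 9 (the stop value, 10, is excluded).
integers = np.arange(start=1, stop=10)

# Step 2: draw all 9 values without replacement, i.e. shuffle them.
shuffled = np.random.choice(a=integers, size=9, replace=False)

# Step 3: reshape the 9 values into 3 rows and 3 columns.
array_2d = shuffled.reshape([3, 3])

print(array_2d.shape)   # (3, 3)
```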

Ok. Now let’s sort the columns of the array.

To do this, we’re going to use numpy.sort with the `axis` parameter.

np.sort(array_2d, axis = 0)

Which produces the following NumPy array:

array([[2, 4, 1],
       [3, 6, 7],
       [5, 9, 8]])

Take a close look at the output. The columns are sorted from low to high.

Why though? Why does the `axis` parameter do this?

To understand this example, you really need to understand NumPy axes. If you don’t understand axes, you really should read our NumPy axes tutorial.

However, I will explain axes here, briefly.

You can think of axes like *directions*.

In a 2D NumPy array, axis-0 is the direction that runs downwards down the rows and axis-1 is the direction that runs horizontally across the columns.

Once you understand this, you can understand the code `np.sort(array_2d, axis = 0)`.

What we’re really saying here is that we want to sort the array `array_2d` along axis 0. Remember, axis 0 is the axis that points downwards.

When we run this code, we’re basically saying that we want to sort the data in the axis-0 direction.

… effectively, this sorts the columns!
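To make the "downward" direction concrete, here is a small sketch that prints each column before and after sorting. I've typed the array in by hand so the example doesn't depend on the random seed.

```python
import numpy as np

# The 3x3 array from this example, typed in by hand.
array_2d = np.array([[3, 6, 1],
                     [2, 4, 7],
                     [5, 9, 8]])

sorted_columns = np.sort(array_2d, axis=0)

# Each column is sorted independently, top to bottom.
for j in range(array_2d.shape[1]):
    print("column", j, ":", array_2d[:, j], "->", sorted_columns[:, j])
```

Notice that the values never move between columns. Sorting along axis 0 only rearranges values *within* each column.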

Next, let’s sort the *rows*.

Sorting the rows is very similar to sorting the columns.

To do this, we’ll need to use the `axis` parameter again.

Quickly though, we’ll need a NumPy array to sort.

The following code is exactly the same as the previous example (sorting the columns), so if you already ran that code, you don’t need to run it again.

np.random.seed(77)
array_2d = np.random.choice(a = np.arange(start = 1, stop = 10), size = 9, replace = False).reshape([3,3])

Just so we’re clear on the contents of the array, let’s print it out again:

print(array_2d)

OUT:

[[3 6 1]
 [2 4 7]
 [5 9 8]]

Now let’s sort the rows.

To do this, we’ll use NumPy sort with `axis = 1`.

np.sort(array_2d, axis = 1)

Which produces the following output array, with sorted rows:

array([[1, 3, 6],
       [2, 4, 7],
       [5, 8, 9]])

Take a close look. The rows are sorted from low to high.

Once again, to understand this, you really need to understand what NumPy axes are.

As I mentioned previously in this tutorial, in a 2D array, axis 1 is the direction that runs horizontally.

So when we use the code `np.sort(array_2d, axis = 1)`, we’re telling NumPy that we want to sort the data along that axis-1 direction.

This basically means, sort the rows!
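One related detail: if you leave out `axis` entirely, np.sort uses the last axis, which for a 2D array is axis 1. The sketch below checks that, and also shows `axis = None`, which flattens the array before sorting. The array is typed in by hand.

```python
import numpy as np

array_2d = np.array([[3, 6, 1],
                     [2, 4, 7],
                     [5, 9, 8]])

# np.sort defaults to axis=-1 (the last axis), which for a 2D array is axis 1.
default_sort = np.sort(array_2d)
row_sort = np.sort(array_2d, axis=1)
print(np.array_equal(default_sort, row_sort))   # True

# axis=None flattens the array before sorting.
flat_sort = np.sort(array_2d, axis=None)
print(flat_sort)                                # [1 2 3 4 5 6 7 8 9]
```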

A common question that people ask when they dive further into NumPy is “how can I sort the data in reverse order?”

Unfortunately, this is not so easy to do.

I think that there should be a way to do this directly with NumPy, but at the moment, there isn’t.

That being the case, I’ll show you a quick-and-dirty workaround.

(But note: this is not necessarily an *efficient* workaround.)

We’re going to sort our 1D array `simple_array_1d` that we created above.

Let’s print out `simple_array_1d` to see what’s in it.

print(simple_array_1d)

OUT:

[5 3 1 2 4]

You can see that this is a NumPy array with 5 elements that are arranged in random order.

Now, we’re going to sort these values in *reverse* order.

To do this, we’re going to use np.sort on the negative of the values in `simple_array_1d` (i.e., `-simple_array_1d`), and we’ll take the negative of that output:

-np.sort(-simple_array_1d)

Which gives us the following result:

array([5, 4, 3, 2, 1])

You can see that the code `-np.sort(-simple_array_1d)` sorted the numbers in reverse (i.e., descending) order.
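For comparison, here is a sketch of a second workaround: sort in ascending order and then reverse the result with a slice. The variable names are just for illustration. The slicing version may be preferable for data that can't be negated, such as strings.

```python
import numpy as np

simple_array_1d = np.array([5, 3, 1, 2, 4])

# Approach from the text: negate, sort ascending, negate again.
descending_negate = -np.sort(-simple_array_1d)

# Alternative: sort ascending, then reverse the result with a slice.
descending_slice = np.sort(simple_array_1d)[::-1]

print(descending_negate)   # [5 4 3 2 1]
print(descending_slice)    # [5 4 3 2 1]
```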

You can use this technique in a similar way to sort the columns and rows in descending order.

To do this, we need to use the axis parameter in conjunction with the technique we used in the previous section.

To sort the columns, we’ll need to set `axis = 0`. And we’ll use the negative sign to sort our 2D array in reverse order.

-np.sort(-array_2d, axis = 0)

Which produces the following output:

array([[5, 9, 8],
       [3, 6, 7],
       [2, 4, 1]])

As you can see, the code `-np.sort(-array_2d, axis = 0)` produces an output array where the columns have been sorted in descending order, from the top of the column to the bottom.

You can do the same thing to sort the rows by using `axis = 1`.

Again, we’ll be working with `array_2d`.

-np.sort(-array_2d, axis = 1)

The code `axis = 1` indicates that we’ll be sorting the data in the axis-1 direction, and by using the negative sign in front of the array name and the function name, the code will sort the rows in descending order.
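Here is a sketch of what that looks like when you run it, using the same 3 by 3 array typed in by hand. I've also included an alternative that sorts ascending and then flips each row with np.flip, which gives the same result.

```python
import numpy as np

array_2d = np.array([[3, 6, 1],
                     [2, 4, 7],
                     [5, 9, 8]])

descending_rows = -np.sort(-array_2d, axis=1)
print(descending_rows)
# [[6 3 1]
#  [7 4 2]
#  [9 8 5]]

# An equivalent approach: sort ascending, then flip along the same axis.
flipped = np.flip(np.sort(array_2d, axis=1), axis=1)
print(np.array_equal(descending_rows, flipped))   # True
```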

Here in this tutorial, I’ve explained how to sort numpy arrays by using the np.sort function.

But the NumPy toolkit is much bigger than one function.

If you’re serious about data science and scientific computing in Python, you’ll have to learn quite a bit more about NumPy.

In fact, if you want to master data science in Python, you’ll need to learn quite a few Python packages. You’ll need to learn NumPy, Pandas, matplotlib, scikit learn, and more.

There’s a lot to learn!

If you’re ready to learn data science though, we can help.

Here at Sharp Sight, we teach data science.

We offer premium data science courses to help you master data science *fast* …

And we also offer FREE tutorials.

If you sign up for our email list, you’ll get our free tutorials, and you’ll find out when our courses open for registration.

When you sign up, you’ll get free tutorials on:

- NumPy
- Pandas
- Base Python
- Scikit learn
- Machine learning
- Deep learning
- Data science in Python
- Data science in R
- … and more.

If you want access to our free tutorials every week, enter your email address and sign up now.

The post A quick guide to NumPy sort appeared first on Sharp Sight.

The post NumPy random seed explained appeared first on Sharp Sight.

The numpy.random.seed function itself is extremely easy to use.

However, the *reason* that we need to use it is a little complicated. To understand *why* we need to use NumPy random seed, you actually need to know a little bit about pseudo-random numbers.

That being the case, this tutorial will first explain the basics of pseudo-random numbers, and will then move on to the syntax of numpy.random.seed itself.

The tutorial is divided up into several different sections.

- A quick introduction to pseudo-random numbers
- How and why we use NumPy random seed
- The syntax of NumPy random seed
- Examples of how to use numpy random seed
- Frequently asked questions about numpy.random.seed
- Applications of pseudo-random numbers

You can click on any of the above links, and it will take you directly to that section.

However, I strongly recommend that you read the whole tutorial.

As I said earlier, numpy.random.seed is very easy to use, but it’s not that easy to understand. Understanding *why* we use it requires some background. That being the case, it’s much better if you actually read the tutorial.

Ok … let’s get to it.

So what exactly is NumPy random seed?

NumPy random seed is simply a function that sets the random seed of the NumPy pseudo-random number generator. It provides an essential input that enables NumPy to generate pseudo-random numbers for random processes.

Does that make sense? Probably not.

Unless you have a background in computing and probability, what I just wrote is probably a little confusing.

Honestly, in order to understand “seeding a random number generator” you need to know a little bit about pseudo-random numbers.

That being the case, let me give you a quick introduction to them …

Here, I want to give you a very quick overview of pseudo-random numbers and why we need them.

Once you understand pseudo-random numbers, numpy.random.seed will make more sense.

At the risk of being a bit of a smart-ass, I think the name “pseudo-random number” is fairly self explanatory, and it gives us some insight into what pseudo-random numbers actually are.

Let’s just break down the name a little.

A pseudo-random number is a *number*. A number that’s sort-of random. *Pseudo*-random.

So essentially, a pseudo-random number is a number that’s almost random, __but not really random__.

It might sound like I’m being a bit sarcastic here, but that’s essentially what they are. Pseudo-random numbers are numbers that appear to be random, but are not actually random.

In the interest of clarity though, let’s see if we can get a definition that’s a little more precise.

According to the encyclopedia at Wolfram Mathworld, a pseudo-random number is:

… a computer-generated random number.

The definition goes on to explain that ….

The prefix pseudo- is used to distinguish this type of number from a “truly” random number generated by a random physical process such as radioactive decay.

A separate article at random.org notes that pseudo-random numbers “appear random, but they are really predetermined”.

Got that? Pseudo-random numbers are computer generated numbers that appear random, but are actually predetermined.

I think that these definitions help quite a bit, and they are a great starting point for understanding why we need them.

I swear to god, I’m going to bring this back to NumPy soon.

But, we still need to understand why pseudo-random numbers are required.

Really. Just bear with me. This will make sense soon.

There’s a fundamental problem when using computers to simulate or work with random processes.

Setting aside some rare exceptions, computers are deterministic by their very design. To quote an article at MIT’s School of Engineering “if you ask the same question you’ll get the same answer every time.”

Another way of saying this is that if you give a computer a certain input, it will precisely follow instructions to produce an output.

… And if you later give a computer the *same* input, it will produce the *same* output.

If the input is the same, then the output will be the same.

THAT’S HOW COMPUTERS WORK.

The behavior of computers is *deterministic* …

Essentially, the behavior of computers is NOT random.

This introduces a problem: how can you use a non-random machine to produce random numbers?

Computers solve the problem of generating “random” numbers the same way that they solve essentially everything: with an algorithm.

Computer scientists have created a set of algorithms for creating pseudo-random numbers, called “pseudo-random number generators.”

These algorithms can be executed on a computer.

As such, they are completely deterministic. However, the numbers that they produce have properties that *approximate* the properties of random numbers.

That is to say, the numbers generated by pseudo-random number generators *appear* to be random.

Even though the numbers are completely determined by the algorithm, when you examine them, there is typically no discernible pattern.

For example, here we’ll create some pseudo-random numbers with the NumPy randint function:

np.random.seed(1)
np.random.randint(low = 1, high = 10, size = 50)

OUT:

[6, 9, 6, 1, 1, 2, 8, 7, 3, 5, 6, 3, 5, 3, 5, 8, 8, 2, 8, 1, 7, 8, 7, 2, 1, 2, 9, 9, 4, 9, 8, 4, 7, 6, 2, 4, 5, 9, 2, 5, 1, 4, 3, 1, 5, 3, 8, 8, 9, 7]

See any pattern here? Me neither.

I can assure you though, that these numbers are not random, and are in fact completely determined by the algorithm. If you run the same code again, you’ll get the exact same numbers.
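To see how an algorithm can produce random-looking numbers, here is a toy sketch of one of the simplest designs, a linear congruential generator. To be clear, this is *not* what NumPy uses (NumPy's legacy generator is based on the Mersenne Twister); the function name `toy_generator` and its structure are my own illustration, with constants taken from a commonly published set of parameters.

```python
def toy_generator(seed, count):
    """Toy linear congruential generator (for illustration only)."""
    # Commonly published LCG parameters; an assumption for this sketch.
    multiplier, increment, modulus = 1664525, 1013904223, 2**32
    state = seed
    digits = []
    for _ in range(count):
        # The next state depends only on the previous state.
        state = (multiplier * state + increment) % modulus
        # Use the higher-order bits to produce a digit from 0 to 9.
        digits.append((state >> 16) % 10)
    return digits

print(toy_generator(seed=1, count=10))
print(toy_generator(seed=1, count=10))   # identical to the line above
print(toy_generator(seed=2, count=10))   # a different seed gives a different sequence
```

Every number is computed from the one before it, and the very first one is computed from the seed. That is all "deterministic" means here.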

Importantly, because pseudo-random number generators are deterministic, they are also repeatable.

What I mean is that if you run the algorithm with the same input, it will produce the same output.

So you can use pseudo-random number generators to create and then re-create the exact same set of pseudo-random numbers.

Let me show you.

Here, we’ll create a list of 5 pseudo-random integers between 0 and 9 using numpy.random.randint.

(And notice that we’re using np.random.seed here)

np.random.seed(0)
np.random.randint(10, size = 5)

This produces the following output:

array([5, 0, 3, 3, 7])

Simple. The algorithm produced an array with the values `[5, 0, 3, 3, 7]`.

Ok.

Now, let’s run the same code again.

… and notice that we’re using np.random.seed in exactly the same way …

np.random.seed(0)
np.random.randint(10, size = 5)

OUTPUT:

array([5, 0, 3, 3, 7])

Well take a look at that …

The. numbers. are. the. same.

We ran the exact same code, and it produced the exact same output.

I will repeat what I said earlier: pseudo random number generators produce numbers that look random, but are 100% determined.

Determined how though?

Remember what I wrote earlier: computers and algorithms process inputs into outputs. The outputs of computers depend on the inputs.

So just like any output produced by a computer, pseudo-random numbers are dependent on the *input*.

*THIS* is where numpy.random.seed comes in …

The numpy.random.seed function provides the input (i.e., the seed) to the algorithm that generates pseudo-random numbers in NumPy.
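One way to picture this: the seed picks a starting point in a long, fixed sequence of numbers. The sketch below (variable names are mine) draws twice after seeding once, then re-seeds. The second draw continues the sequence, while re-seeding rewinds it.

```python
import numpy as np

np.random.seed(0)
first = np.random.randint(10, size=5)
second = np.random.randint(10, size=5)   # continues the same sequence

np.random.seed(0)                        # rewind to the start
third = np.random.randint(10, size=5)

print(first)                             # [5 0 3 3 7]
print(np.array_equal(first, third))      # True
print(np.array_equal(first, second))     # False
```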

Ok, you got this far.

You’re ready now.

Now you can learn about NumPy random seed.

numpy.random.seed provides an input to the pseudo-random number generator

What I wrote in the previous section is critical.

The “random” numbers generated by NumPy are not exactly random. They are pseudo-random … they approximate random numbers, but are 100% determined by the input and the pseudo-random number algorithm.

The np.random.seed function provides an input for the pseudo-random number generator in NumPy.

That’s all the function does!

It allows you to provide a “seed” value to NumPy’s random number generator.

Importantly, numpy.random.seed doesn’t exactly work all on its own.

The numpy.random.seed function works in *conjunction* with other functions from NumPy.

Specifically, numpy.random.seed works with other functions from the `numpy.random` namespace.

So for example, you might use numpy.random.seed along with numpy.random.randint. This will enable you to create random integers with NumPy.

You can also use numpy.random.seed with numpy.random.normal to create normally distributed numbers.

… or you can use it with numpy.random.choice to generate a random sample from an input.

In fact, there are several dozen NumPy random functions that enable you to generate random numbers, random samples, and samples from specific probability distributions.

I’ll show you a few examples of some of these functions in the examples section of this tutorial.
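As a quick preview, here is a sketch that seeds the generator once and then calls three different numpy.random functions. The helper name `draw_samples` is something I made up for this illustration.

```python
import numpy as np

def draw_samples(seed):
    """Seed the global generator, then call three numpy.random functions."""
    np.random.seed(seed)
    integers = np.random.randint(low=0, high=10, size=3)
    normals = np.random.normal(loc=0, scale=1, size=3)
    sample = np.random.choice(a=[1, 2, 3, 4, 5, 6], size=3)
    return integers, normals, sample

run_1 = draw_samples(seed=0)
run_2 = draw_samples(seed=0)

# Every function produced the same values on both runs.
print(all(np.array_equal(a, b) for a, b in zip(run_1, run_2)))   # True
```

A single seed controls the whole chain of calls that follows it, as long as the calls happen in the same order.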

Remember what I said earlier in this tutorial …. pseudo-random number generators are completely deterministic. They operate by algorithm.

What this means is that if you provide the same seed, you will get the same output.

And if you change the seed, you will get a different output.

The output that you get depends on the input that you give it.

I’ll show you examples of this behavior in the examples section.

The important thing about using a seed for a pseudo-random number generator is that it makes the code *repeatable*.

Remember what I said earlier?

… pseudo-random number generators operate by a deterministic process.

If you give a pseudo-random number generator the same input, you’ll get the same output.

This can actually be a good thing!

There are times when you really want your “random” processes to be repeatable.

Code that has well defined, repeatable outputs is good for testing.

Essentially, we use NumPy random seed when we need to generate pseudo-random numbers in a repeatable way.
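Here is a small sketch of why that helps with testing. The function `simulate_coin_flips` is a hypothetical example, not something from NumPy; because it sets a seed, two runs can be compared exactly.

```python
import numpy as np

def simulate_coin_flips(n_flips, seed):
    """Return the fraction of heads in n_flips simulated fair coin flips."""
    np.random.seed(seed)
    flips = np.random.randint(low=0, high=2, size=n_flips)  # 0 = tails, 1 = heads
    return flips.mean()

# Because the seed is fixed, a test can compare two runs exactly.
result_a = simulate_coin_flips(1000, seed=42)
result_b = simulate_coin_flips(1000, seed=42)
assert result_a == result_b
print("repeatable:", result_a)
```

Without the seed, the two results would almost never match, and a test like this would fail at random.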

The fact that np.random.seed makes your code repeatable also makes it easier to *share*.

Take for example the tutorials that I post here at Sharp Sight.

I post detailed tutorials about how to perform various data science tasks, and I show how code works, step by step.

When I do this, it’s important that people who read the tutorials and run the code get the same result. If a student reads the tutorial, and copy-and-pastes the code exactly, I want them to get the exact same result. This just helps them check their work! If they type in the code exactly as I show it in a tutorial, getting the exact same result gives them confidence that they ran the code properly.

Again, in order to get repeatable results when we are using “random” functions in NumPy, we need to use numpy.random.seed.

Ok … now that you understand what NumPy random seed is (and why we use it), let’s take a look at the actual syntax.

The syntax of NumPy random seed is extremely simple.

There’s essentially only one parameter, and that is the seed value.

So essentially, to use the function, you just call the function by name and then pass in a “seed” value inside the parentheses.

Note that in this syntax explanation, I’m using the abbreviation “`np`” to refer to NumPy. This is a common convention, but it requires you to import NumPy with the code “`import numpy as np`.” I’ll explain more about this soon in the examples section.
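In code, the syntax looks like the sketch below. The seed values are arbitrary choices of mine. One constraint to be aware of is that an integer seed has to be between 0 and 2**32 - 1, so a negative number is rejected.

```python
import numpy as np

np.random.seed(0)        # a small integer is the typical choice
np.random.seed(12345)    # any integer from 0 to 2**32 - 1 is accepted

# A value outside that range is rejected.
seed_error = None
try:
    np.random.seed(-1)
except ValueError as error:
    seed_error = error
    print("invalid seed:", error)
```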

Let’s take a look at some examples of how and when we use numpy.random.seed.

Before we look at the examples though, you’ll have to run some code.

To get the following examples to run properly, you’ll need to import NumPy with the appropriate “nickname.”

You can do that by executing the following code:

import numpy as np

Running this code will enable us to use the alias `np` in our syntax to refer to `numpy`.

This is a common convention in NumPy. When you read NumPy code, it is extremely common to see NumPy referred to as `np`. If you’re a beginner you might not realize that you need to import NumPy with the code `import numpy as np`, otherwise the examples won’t work properly!

Now that we’ve imported NumPy properly, let’s start with a simple example. We’ll generate a single random number between 0 and 1 using NumPy random random.

Here, we’re going to use NumPy to generate a random number between zero and one. To do this, we’re going to use the NumPy random random function (AKA, np.random.random).

Ok, here’s the code:

np.random.seed(0)
np.random.random()

OUTPUT:

0.5488135039273248

Note that the output is a float. It’s a decimal number between 0 and 1.

For the record, we can essentially treat this number as a probability. We can think of the np.random.random function as a tool for generating probabilities.

Now that I’ve shown you how to use np.random.random, let’s just run it again with the same seed.

Here, I just want to show you what happens when you use np.random.seed before running np.random.random.

np.random.seed(0)
np.random.random()

OUTPUT:

0.5488135039273248

Notice that the number is exactly the same as the first time we ran the code.

Essentially, if you execute a NumPy function with the same seed, you’ll get the same result.
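The same thing holds if you ask for several numbers at once. In this sketch I use the `size` parameter to draw three values; the first one matches the single value from the example above, because the seed is the same.

```python
import numpy as np

np.random.seed(0)
values = np.random.random(size=3)   # three floats instead of one

print(values)
print(values[0] == 0.5488135039273248)            # True
print(values.min() >= 0.0, values.max() < 1.0)    # True True
```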

For more information on the np.random.random function, check out our tutorial on NumPy random random.

Next, we’re going to use np.random.seed to set the number generator before using NumPy random randint.

Essentially, we’re going to use NumPy to generate 5 random integers between 0 and 99.

np.random.seed(74)
np.random.randint(low = 0, high = 100, size = 5)

OUTPUT:

array([30, 91, 9, 73, 62])

This is pretty simple.

NumPy random seed sets the seed for the pseudo-random number generator, and then NumPy random randint selects 5 numbers between 0 and 99.

Let’s just run the code so you can see that it reproduces the same output if you have the same seed.

np.random.seed(74)
np.random.randint(low = 0, high = 100, size = 5)

OUTPUT:

array([30, 91, 9, 73, 62])

Once again, as you can see, the code produces the same integers if we use the same seed. As noted previously in the tutorial, NumPy random randint doesn’t exactly produce “random” integers. It produces pseudo-random integers that are completely determined by numpy.random.seed.
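A detail that trips people up: the `high` value is *exclusive*. That's why `high = 100` gives integers from 0 to 99. Here is a quick sketch that draws a larger sample (the size of 10,000 is an arbitrary choice) and checks the range.

```python
import numpy as np

np.random.seed(74)
draws = np.random.randint(low=0, high=100, size=10_000)

# `high` is exclusive, so 100 never appears.
print(draws.min(), draws.max())
print((draws == 100).any())   # False
```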

It’s also common to use the NumPy random seed function when you’re doing random sampling.

Specifically, if you need to generate a reproducible random sample from an input array, you’ll need to use numpy.random.seed.

Let’s take a look.

Here, we’re going to use numpy.random.seed before we use numpy.random.choice. The NumPy random choice function will then create a random sample from a list of elements.

np.random.seed(0)
np.random.choice(a = [1,2,3,4,5,6], size = 5)

OUTPUT:

array([5, 6, 1, 4, 4])

As you can see, we’ve basically generated a random sample from the list of input elements … the numbers 1 to 6.

In the output, you can see that some of the numbers are repeated. This is because np.random.choice is using random sampling with replacement. For more information about how to create random samples, you should read our tutorial about np.random.choice.
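If you don't want repeats, np.random.choice has a `replace` parameter. Here is a sketch of the same sample drawn *without* replacement; the variable name is mine.

```python
import numpy as np

np.random.seed(0)
without_replacement = np.random.choice(a=[1, 2, 3, 4, 5, 6], size=5, replace=False)

# With replace=False, no element can be drawn more than once.
print(without_replacement)
print(len(set(without_replacement.tolist())))   # 5
```

The seed still makes the result repeatable; it just controls a different sampling procedure.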

Let’s quickly re-run the code.

I want to re-run the code just so you can see, once again, that the primary reason we use NumPy random seed is to create results that are completely repeatable.

Ok, here is the exact same code that we just ran (with the same seed).

np.random.seed(0)
np.random.choice(a = [1,2,3,4,5,6], size = 5)

OUTPUT:

array([5, 6, 1, 4, 4])

Once again, we used the same seed, and this produced the same output.

Now that we’ve taken a look at some examples of using NumPy random seed to set a random seed in Python, I want to address some frequently asked questions.

Dude. I just wrote 2000 words explaining what the np.random.seed function does … which basically explains what np.random.seed(0) does.

Ok, ok … I get it. You’re probably in a hurry and just want a quick answer.

I’ll summarize.

We use np.random.seed when we need to generate random numbers or mimic random processes in NumPy.

Computers are generally deterministic, so it’s very difficult to create truly “random” numbers on a computer. Computers get around this by using pseudo-random number generators.

These pseudo-random number generators are algorithms that produce numbers that appear random, but are not really random.

In order to work properly, pseudo-random number generators require a starting input. We call this starting input a “seed.”

The code `np.random.seed(0)` enables you to provide a seed (i.e., the starting input) for NumPy’s pseudo-random number generator.

NumPy then uses the seed and the pseudo-random number generator in conjunction with other functions from the numpy.random namespace to produce certain types of random outputs.

Ultimately, creating pseudo-random numbers this way leads to repeatable output, which is good for testing and code sharing.

Having said all of that, to really understand numpy.random.seed, you need to have some understanding of pseudo-random number generators.

… so if what I just wrote doesn’t make sense, please return to the top of the page and read the f*#^ing tutorial.

What number should you use for the seed? Basically, it doesn’t matter.

You can use `numpy.random.seed(0)`, or `numpy.random.seed(42)`, or any other number.

For the most part, the number that you use inside of the function doesn’t really make a difference.

You just need to understand that using different seeds will cause NumPy to produce different pseudo-random numbers. The output of a `numpy.random` function will depend on the seed that you use.

Here’s a quick example. We’re going to use NumPy random seed in conjunction with NumPy random randint to create a set of integers between 0 and 98.

In the first example, we’ll set the seed value to 0.

np.random.seed(0)
np.random.randint(99, size = 5)

Which produces the following output:

array([44, 47, 64, 67, 67])

Basically, np.random.randint generated an array of 5 integers between 0 and 98 (the upper limit, 99, is excluded). Note that if you run this code again with the exact same seed (i.e. 0), you’ll get the same integers from np.random.randint.

Next, let’s run the code with a *different* seed.

np.random.seed(1)
np.random.randint(99, size = 5)

OUTPUT:

array([37, 12, 72, 9, 75])

Here, the code for np.random.randint is exactly the same … we only changed the seed value. Here, the seed is `1`.

With a *different* seed, NumPy random randint created a *different* set of integers. Everything else is the same. The code for np.random.randint is the same. But with a different seed, it produces a different output.

Ultimately, I want you to understand that the output of a numpy.random function ultimately depends on the value of np.random.seed, but the choice of seed value is sort of arbitrary.
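To put the two cases side by side, here is a sketch that wraps the code in a small helper (`five_integers` is a name I made up) and tries a few seeds.

```python
import numpy as np

def five_integers(seed):
    """Seed the generator, then draw 5 integers from 0 to 98."""
    np.random.seed(seed)
    return np.random.randint(99, size=5)

for seed in (0, 1, 42):
    print(seed, five_integers(seed))

# Same seed, same output; different seed, different output.
print(np.array_equal(five_integers(0), five_integers(0)))   # True
print(np.array_equal(five_integers(0), five_integers(1)))   # False
```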

Do you always need to use np.random.seed? The short answer is, no.

If you use a function from the `numpy.random` namespace (like np.random.randint, np.random.normal, etc) *without* using NumPy random seed first, NumPy will still seed the generator in the background. NumPy will generate a seed value from a part of your computer system (like `/dev/urandom` on a Unix or Linux machine).

So essentially, if you don’t set a seed with numpy.random.seed, NumPy will set one for you.

However, this has a disadvantage!

If you don’t explicitly set a seed, your code will not have repeatable outputs. NumPy will generate a seed on its own, but that seed might change moment to moment. This will make your outputs different every time you run it.

So to summarize: you don’t absolutely have to use numpy.random.seed, but you *should* use it if you want your code to have repeatable outputs.
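You can see this behavior in a sketch like the one below. Calling np.random.seed with `None` asks NumPy to pick a fresh seed from the system, which is roughly what happens when you never set a seed at all.

```python
import numpy as np

np.random.seed(None)          # reseed from the operating system
first_run = np.random.random(size=5)

np.random.seed(None)          # reseed again: a fresh, unpredictable seed
second_run = np.random.random(size=5)

# Almost certainly False: the two runs started from different seeds.
print(np.array_equal(first_run, second_run))
```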

Ok.

We’re really getting into the weeds here.

Essentially, numpy.random.seed sets the seed for the global RandomState instance that the functions in the numpy.random namespace share.

On the other hand, np.random.RandomState creates a separate RandomState instance and does not affect the global RandomState.

Confused?

That’s okay …. this answer is a little technical and it requires you to know a little about how NumPy is structured on the back end. It also requires you to know a little bit about programming concepts like “global variables.” If you’re a relative data science beginner, the details that you need to know might be over your head.

The important thing is that NumPy random seed is probably sufficient if you’re just using NumPy for some data science or scientific computing.

However, if you’re building software systems that need to be secure, NumPy random seed is probably not the right tool.

To summarize, np.random.seed is probably fine if you’re just doing simple analytics, data science, and scientific computing, but you need to learn more about RandomState if you want to use the NumPy pseudo-random number generator in systems where security is a consideration.
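If the global-versus-instance distinction is hard to picture, this sketch may help. The variable `rng` holds a private RandomState with its own seed; re-seeding the global generator doesn't touch it.

```python
import numpy as np

# A private generator with its own seed.
rng = np.random.RandomState(0)
private_draw = rng.randint(10, size=5)

# The global generator, seeded with the same value, yields the same numbers.
np.random.seed(0)
global_draw = np.random.randint(10, size=5)
print(np.array_equal(private_draw, global_draw))    # True

# Reseeding the global generator does not disturb the private one.
np.random.seed(123)
next_private_draw = rng.randint(10, size=5)         # continues rng's own sequence
print(next_private_draw)
```

Newer versions of NumPy also provide np.random.default_rng, which the NumPy documentation recommends for new code, but that's a topic for another tutorial.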

Now that I’ve explained the basics of NumPy random seed, I want to tell you a few applications …

Here’s where you might see the np.random.seed function.

It’s possible to do probability and statistics using NumPy.

Almost by definition, probability involves uncertainty and randomness. As such, if you use Python and NumPy to model probabilistic processes, you’ll need to use np.random.seed to generate pseudo-random numbers (or a similar tool in Python).

More specifically, if you’re doing random sampling with NumPy, you’ll need to use numpy.random.seed.

NumPy has a variety of functions for performing random sampling, including numpy random random, numpy random normal, and numpy random choice.

In almost every case, when you use one of these functions, you’ll need to use it in conjunction with numpy random seed if you want to create reproducible outputs.

Monte Carlo methods are a class of computational methods that rely on repeatedly drawing random samples.

I won’t go into the details here, since Monte Carlo methods are a little complicated, and beyond the scope of this post.

Essentially though, Monte Carlo methods are a powerful computational tool used in science and engineering. In fact, Monte Carlo methods were initially used at the Manhattan Project!

Monte Carlo methods require random numbers. In most cases, when these methods are used, they actually use *pseudo-random* numbers instead of true random numbers.
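To give you a feel for it, here is a minimal sketch of a classic Monte Carlo exercise: estimating pi by scattering pseudo-random points in a square and counting how many land inside a quarter circle. The sample size is an arbitrary choice, and the seed makes the estimate repeatable.

```python
import numpy as np

np.random.seed(0)
n_points = 100_000

# Random points in the unit square.
x = np.random.random(size=n_points)
y = np.random.random(size=n_points)

# The fraction inside the quarter circle approximates pi / 4.
inside = (x**2 + y**2) <= 1.0
pi_estimate = 4 * inside.mean()
print(pi_estimate)
```

With more points the estimate gets closer to pi, and with the same seed you get the same estimate every time.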

Interested in machine learning?

Great … it’s a powerful toolset, and it will be extremely important in the 21st century.

Broadly speaking, pseudo-random numbers are important in machine learning.

Performing simple tasks like splitting datasets into training and test sets requires random sampling. In turn, random sampling almost always requires pseudo-random numbers.

So if you’re doing machine learning in Python, you’ll almost certainly need to use NumPy random seed …
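Here is a rough sketch of a seeded train/test split done with plain NumPy. The dataset `data` is a made-up stand-in for 100 observations, and the 80/20 proportion is just a common choice; in practice you might use a library helper instead.

```python
import numpy as np

np.random.seed(0)

data = np.arange(100)                       # stand-in for 100 observations
shuffled_index = np.random.permutation(len(data))

n_train = 80
train = data[shuffled_index[:n_train]]
test = data[shuffled_index[n_train:]]

print(len(train), len(test))   # 80 20
```

Because of the seed, the same observations end up in the training set every time you run the code.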

More specifically, you’ll also probably use pseudo-random numbers if you want to do deep learning.

For example, if you want to do deep learning in Python, you’ll often need to split datasets into training and test sets (just like with other machine learning techniques). Again, this requires pseudo-random numbers.

… so when people do deep learning in Python, you’ll frequently see at least a few uses of numpy.random.seed.

I’ve really only touched on a few applications of numpy.random.seed in Python. There are many more.

Speaking generally, if you want to use NumPy, you really need to know this little function.

But even though we focused on NumPy random seed in this tutorial, there are many other NumPy functions that you probably need to learn …

If you want to learn how to do data science in Python, NumPy is very important …

If you want to learn NumPy and data science in Python, then sign up for our email list.

Here at Sharp Sight, we teach data science.

… and we regularly post FREE data science tutorials just like this one.

If you want to get our free tutorials delivered directly to your email inbox, then sign up now.

If you sign up for our email list, you’ll get tutorials about:

- NumPy
- Pandas
- Matplotlib
- Seaborn
- Sci-kit learn
- Machine learning
- Deep learning
- … and more

We also teach data science in R, so if you sign up, you’ll get tutorials for both languages.

So if you want to learn more data science for FREE, sign up now.

The post How to make a matplotlib line chart appeared first on Sharp Sight.

I’ll be honest. Creating a line chart in Python is a little confusing to beginners.

If you’ve been trying to create a decent line chart in Python and just found yourself confused, don’t worry. Many beginners feel a little confused.

Part of the problem is that the tools for creating data visualizations in Python are not as well designed as some modern tools like ggplot in R. If you’ve come from R, you might find that creating a line chart is actually more challenging in Python.

Another issue is that many of the examples online for how to make a line chart with matplotlib are bad. Many of the examples are either out of date, or more complex than they need to be.

Those things being the case, this blog post will try to clear up some of the confusion and introduce you to some basic syntax to get you started.

Although this blog post won’t show you everything about data visualization with matplotlib, it will show you some of the essential tools so you can make a basic line chart. It will give you a foundation that you can build on as you continue to learn.

The tutorial has several different sections that will help you understand creating line charts with pyplot.

You’ll learn:

- What is matplotlib?
- The syntax for the matplotlib line chart
- Examples of how to make a line chart with matplotlib

If you need help with something specific, you can click on one of the links. The links will take you directly to the relevant section within this blog post.

On the other hand, if you’re just getting started with data visualization in Python, it’s probably a good idea to read the entire blog post. Instead of just trying to copy and paste some code, it’s good to read through everything so you know how it all works.

Before we get started actually creating line charts, let’s talk about matplotlib first.

If you’re just getting started with data science in Python, you’ve probably heard about matplotlib, but you might not know what it is.

What is matplotlib?

Matplotlib is a module for Python that focuses on plotting and data visualization. It’s very flexible and it provides you with tools for creating almost any data visualization you can think of.

On the other hand, it was initially released in 2003, and some of the techniques for creating visualizations feel out of date.

Specifically, the syntax for matplotlib is a little “low level” in some cases, and this can make it difficult to use for many beginners.

However, one thing that can make matplotlib easier to use is the pyplot sub-module.

Pyplot is part of matplotlib … it is a sub-module within the overall matplotlib module.

The pyplot sub-module provides a set of “convenience functions” for creating common data visualizations and performing common data visualization tasks. Essentially, pyplot provides a set of relatively simple tools for creating common charts like the bar chart, scatter plot, and line chart.

Pyplot still isn’t perfect (it can still be a little confusing to beginners), but it simplifies the process of creating some data visualizations in Python.

Now that you know a little more about matplotlib and pyplot, let’s examine the syntax to create a line chart.

To create a line chart with pyplot, you typically will use the plt.plot function.

The name of the function itself often confuses beginners, because many of the other functions in pyplot have names that directly relate to the chart that they create. For example, you create a bar chart in pyplot by using the plt.bar function. You create histograms by using the plt.hist function. And you create scatter plots in matplotlib by using the plt.scatter function.

You’d think that to create a line chart, there would be a function called “`plt.line()`”, right?

No. That’s not how you create a line chart with pyplot.

To create a matplotlib line chart, you need to use the vaguely named `plt.plot()` function.

That being said, let’s take a look at the syntax.

The plt.plot function has a lot of parameters … a couple dozen in fact.

But here in this tutorial we’re going to simplify things and just focus on a few: `x`, `y`, `color`, and `linewidth`.

I want to focus on these parameters because they are the ones you will probably use most often. Also, by focusing on just a few, you can make it easier to learn the syntax. If you’re just getting started, you really need to simplify things as much as possible until you learn and memorize the basics. Once you learn the basics, then make things more complex.

Ok. Let me explain the parameters I mentioned, one at a time.

Here, I’ll explain four important parameters of the plt.plot function: `x`, `y`, `color`, and `linewidth`.

The `y` parameter allows you to specify the y axis coordinates of the points along the line you want to draw.

Here’s a very simple example. The following line has been created by connecting four points. The y axis coordinates of these points are at `2`, `5`, `4`, and `8`.

The plt.plot function basically takes those points and connects them with line segments. That’s what the function does.

We tell plt.plot the position of those points by passing data to the `y` parameter.

Typically, we will pass data to this parameter in the form of an array or an array-like object. You can use a Python list or similar objects like NumPy arrays.

Keep in mind, the `y` parameter is required.

I’ll show you exactly how to use this parameter in the examples section of this tutorial.
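In the meantime, here is a minimal sketch of the four-point line described above. The non-interactive `Agg` backend is my own assumption, chosen so that the snippet runs on a machine without a display; in a notebook you would normally leave that line out.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, no window needed
import matplotlib.pyplot as plt

# The y axis coordinates of the four points from the example above.
y_values = [2, 5, 4, 8]

# plt.plot returns a list of Line2D objects, one per line drawn.
lines = plt.plot(y_values)
line = lines[0]

# The line stores the y coordinates that it connects.
plotted_y = list(line.get_ydata())
print(plotted_y)
```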

The `x` parameter is similar to the `y` parameter.

Essentially, the `x` parameter enables you to supply the x axis positions of the points on the line.

So let’s take another look at the example we saw in the last section:

Here, the line is made up of segments that connect four points.

The points are at locations `1`, `2`, `3`, and `4` on the x axis.

We tell the plt.plot function these x axis locations by using the `x` parameter.

Typically, we’ll supply these x axis positions in the form of a Python list. More broadly though, we can supply the x axis positions in the form of any array-like object … a list, a NumPy array, etc.

Keep in mind that the `x` parameter is *optional*. That means that although you need to supply values for the `y` parameter, you do *not* need to supply values for the `x` parameter. If you don’t provide any data to the `x` parameter, matplotlib will assume that the x axis positions are `[0, 1, 2, ... n - 1]`, if you have n points. Basically, the x axis positions will just be 0 to n – 1.

Here in this tutorial, we are mostly going to omit the arguments to the x parameter.
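If you want to see the difference for yourself, here is a small sketch that draws the same four points twice: once with explicit x axis positions and once without. The variable names are made up for illustration, and the `Agg` backend is again an assumption so that the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, no window needed
import matplotlib.pyplot as plt

y_values = [2, 5, 4, 8]

# With two positional arguments, the first is x and the second is y.
(explicit_line,) = plt.plot([1, 2, 3, 4], y_values)

# With one positional argument, matplotlib fills in x as 0 to n - 1.
(default_line,) = plt.plot(y_values)

explicit_x = list(explicit_line.get_xdata())
default_x = list(default_line.get_xdata())
print(explicit_x)
print(default_x)
```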

The `color` parameter does what you probably expect that it does … it changes the color of the line.

There are a few ways to define the color that you want to use and the easiest way is to use a “named” color. Named colors are colors like “red”, “blue”, “yellow”, and so on. Matplotlib recognizes well over a hundred “named” colors. They aren’t limited to the simple colors that we commonly talk about; there are also colors like “crimson”, “wheat”, “lavender”, and more. It’s a good idea to become familiar with a few of the named colors.

Having said that, I strongly prefer to use hexadecimal colors in my data visualizations. Hex colors allow for a lot more flexibility and they allow you to customize your plots to a much larger degree. Essentially, with hex colors, you can “mix your own” colors.

On the other hand, although hex colors allow for more flexibility, they are harder to use. You’ll also need to learn about how hexadecimal numbers work in order to really understand hex colors.

Given that hex colors are a little more complicated we’re not really going to cover them here. I’ll explain hex colors in a future blog tutorial.
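Still, just so you can see how the two styles relate, here is a quick sketch that looks up the hex code behind a named color and then uses that hex code directly. It relies on `matplotlib.colors.to_hex` and the `CSS4_COLORS` dictionary; the `Agg` backend is an assumption so that the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, no window needed
import matplotlib.pyplot as plt
from matplotlib import colors as mcolors

# Look up the hex code that matplotlib associates with a named color.
crimson_hex = mcolors.to_hex("crimson")
print(crimson_hex)

# Named colors like 'wheat' live in a dictionary you can browse.
print("wheat" in mcolors.CSS4_COLORS)

# A hex string works anywhere a named color does.
(line,) = plt.plot([2, 5, 4, 8], color=crimson_hex)
```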

The `linewidth` parameter is also fairly self explanatory. It controls the width of the line that’s plotted.

I’ll show you an example in the examples section below to show you how to use this to increase or decrease the width of the plotted line.

Now that we’ve gone over a few of the important parameters of the plt.plot function, let’s look at some concrete examples of how to use the plt.plot function.

Here, I’ll show you a simple example of how to use the function, and I’ll also show you individual examples of how to use the parameters that I explained earlier in this tutorial.

Before you start working with the examples themselves, you need to run some code.

First, you need to run some code to import a few Python modules. You need to import the pyplot submodule of matplotlib. You also need to import the seaborn module, which we’ll be using later to do some formatting, and the pandas module, which we’ll use to read in the data.

    # IMPORT MODULES
    import matplotlib.pyplot as plt
    import seaborn as sns
    import pandas as pd  # needed for pd.read_csv() below

Notice that we’re importing these modules with different names. For example, we’re importing the pyplot module as `plt`. We’re importing the seaborn module as `sns`. We’re essentially giving these modules “nicknames” … these are aliases that we can use to simplify and shorten our code. You’ll see these later as we call the functions from pyplot and seaborn.

After you import the modules, you’ll need to get the data that we’re going to use.

For these examples, we’re going to use stock price data from the company Tesla, Inc. The data is from the IPO in June of 2010 to the fall of 2018.

    # GET DATA FROM TXT FILE
    tsla_stock_data = pd.read_csv("https://www.sharpsightlabs.com/datasets/TSLA_start-to-2018-10-26_CLEAN.txt")

    #--------------------
    # EXTRACT CLOSE PRICE
    #--------------------
    tsla_close_price = tsla_stock_data.close_price

As noted above, most of the parameters that we’re going to work with require you to provide a *sequence* of values. Here, we’ve imported the data using the `read_csv()` function from pandas, and then extracted one variable, `tsla_close_price`. The way that we’ve extracted this data, `tsla_close_price` is actually a Pandas Series.

Having said that, the `plt.plot()` function can also operate on Python lists, tuples, and array-like objects.
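As a quick check of that claim, here is a sketch that passes the same values to `plt.plot()` as a list, a tuple, a NumPy array, and a Pandas Series. The prices are made-up numbers, not the Tesla data, and the `Agg` backend is an assumption so that the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, no window needed
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up prices, for illustration only.
prices = [17.0, 19.5, 18.25, 21.5]

candidates = {
    "list": prices,
    "tuple": tuple(prices),
    "ndarray": np.array(prices),
    "Series": pd.Series(prices),
}

plotted = {}
for name, values in candidates.items():
    plt.figure()                      # start a fresh figure for each type
    (line,) = plt.plot(values)
    plotted[name] = list(line.get_ydata())

plt.close("all")
print(plotted)
```

All four calls should draw the same line, so you can use whichever container your data already lives in.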

In the following examples, we’re going to keep things very simple.

This is a general principle that you should remember when you’re learning a new programming language or skill. Start simple. Break everything down and isolate individual techniques.

Once you’ve broken down the individual techniques, study them and practice them.

Then, after you’ve mastered the basic techniques, you can start to combine those techniques into more complicated structures.

Start simple and then increase the complexity.

With that in mind, let’s start to look at a few very simple examples of how to make a line chart with matplotlib.

For our first example, we’re going to start very simple. This will be as simple as it gets.

We’re basically going to plot our Tesla stock data with plt.plot.

To do this, we’ll call the `plt.plot()` function with the `tsla_close_price` data as the only argument.

    #-----------------
    # SIMPLE LINE PLOT
    #-----------------
    plt.plot(tsla_close_price)

And here is the output:

There’s nothing fancy about this, but it’s a decent rough draft, and it’s easy to understand.

Let’s break it down.

We’ve called the `plt.plot()` function. Inside of the function, we see the data set name `tsla_close_price`, which is the daily closing price of Tesla stock from June of 2010 to the fall of 2018.

Notice that we didn’t explicitly refer to any of the parameters. You’ll often see this in Python code. It’s very common for Python programmers to leave the names of the parameters out of the syntax.

So which parameter is being used here?

The code is implicitly using the `y` parameter. When you supply a single argument to the plt.plot function, the function treats that argument as the y axis values. Conceptually, this is like setting `y = tsla_close_price`, although in practice you pass the data by position rather than by keyword.

With that in mind, you can understand what this plot shows. The y axis essentially shows the value of the closing price on any given day. Each observation in `tsla_close_price` is effectively a point on the line, and the plt.plot function just creates a line that connects them.

What about the x axis? We actually didn’t supply any data to the `x` parameter, so the plt.plot function just generated x axis values from 0 to n – 1 (where n is the total number of observations in the `tsla_close_price` data).

We can interpret the x axis as the number of days since the IPO. That’s not typically what we’d show … in many cases we’d probably show the date on the x axis. However, I wanted to make this example as simple as possible. Remember my recommendation a few sections ago: when you’re learning syntax, start by studying very simple examples. This example is as simple as it gets.
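If you do want dates on the x axis, here is a rough sketch of one way to do it. The dates and prices below are made-up stand-ins (I’m not assuming anything about the column names in the Tesla file), and the `Agg` backend is an assumption so that the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, no window needed
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in data: five business days with made-up prices.
dates = pd.date_range("2018-10-22", periods=5, freq="B")
close_price = pd.Series([260.9, 294.1, 288.5, 314.9, 330.9], index=dates)

# With two positional arguments, the first is x (dates) and the second is y.
(line,) = plt.plot(close_price.index, close_price.values)
plt.xlabel("date")
plt.ylabel("closing price")

n_points = len(line.get_xdata())
print(n_points)
```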

Next, let’s increase the complexity of the chart just a little bit.

Here, we’re going to change the color of the line.

To do this, we’ll use the `color` parameter.

    #------------------
    # CHANGE LINE COLOR
    #------------------
    plt.plot(tsla_close_price, color = 'red')

Which produces the following chart:

This is very simple. We essentially created this with the same code as the previous example, but we added an extra piece of syntax. Essentially, we added the syntax `color = 'red'`, which (surprise) turns the line to a red color.

As you’re playing with this syntax, try out different colors. You can change the color to ‘green’, ‘yellow’, or another of the matplotlib colors. Part of learning data visualization is learning which colors to use. To learn this, you need to try out different aesthetic values, and see what looks good.

Now, I’ll show you how to change the width of the line.

To do this, you need to use the `linewidth` parameter.

This is very straightforward. All we need to do is provide a numeric argument to the `linewidth` parameter (an integer or decimal number).

By default, the `linewidth` parameter is typically set to 1.5.

In the charts so far, this has made the line just slightly too thick, so I’m going to reduce it to 1.

    #------------------
    # CHANGE LINE WIDTH
    #------------------
    plt.plot(tsla_close_price, linewidth = 1)

And here’s the output:

The difference is subtle, but I think this linewidth looks better for this particular chart.

When you create your own line charts, I recommend playing around with the width of the line. The “right” line width will depend on the chart that you’re making. For some charts you’ll want a thicker line and for others you’ll want a thinner line. As you learn and master data visualization, you’ll simply need to develop your judgement about when to use a thick or thin line.

Having said that, actually setting the width is easy enough. When you’re using pyplot, just use the `linewidth` parameter.

One problem I have with the charts that we’ve made so far is that the formatting is a little ugly.

Unfortunately, this is one of the downsides of standard matplotlib … the default settings create charts that are a little unrefined. The default charts are okay if you’re just doing basic data analysis for personal consumption; they are okay if you aren’t going to show them to anyone important. But if you plan to present your work to anyone important – say important colleagues or a management team – the basic charts aren’t great. You should present charts that have a little more polish.

That being said, in this section, I’ll show you a quick trick for improving the formatting of your Python line chart.

To do this, we’re going to use a simple function from the seaborn module.

The seaborn module is a data visualization module for Python. I won’t explain seaborn too much here, but at a high level, seaborn works alongside and on top of matplotlib.

We’re going to use a special function from the seaborn package to improve our charts: the `seaborn.set()` function.

To use the `sns.set()` function, you’ll need to import seaborn into your working environment.

The following code will import seaborn with the alias `sns`.

    # import seaborn module
    import seaborn as sns

Once you have seaborn imported, you can use the `seaborn.set()` function.

To use it, you simply need to call the function by itself.

Because we’ve imported seaborn as `sns`, we can call the function as `sns.set()`.

    # set plot defaults using seaborn formatting
    sns.set()

Calling the function this way will change the formatting for your matplotlib charts.

Let’s take a look.

Here, we’re simply going to replot our line chart.

    #----------------------------------------
    # PLOT LINE CHART WITH SEABORN FORMATTING
    #----------------------------------------
    plt.plot(tsla_close_price)

Here’s the output:

Notice what the sns.set function did. It changed the background color and added some white gridlines in the background. There are also a few other changes that aren’t immediately visible in this example.

The formatting changes are relatively minor, but I think this looks dramatically better.

One quick note …

When you run the `seaborn.set()` function, it may end up changing the formatting on all of your matplotlib charts going forward.

Many people do *not* want this, and want to turn off the seaborn formatting.

How do you turn it off?

You can turn off the seaborn formatting by running the following code:

    #--------------------------
    # REMOVE SEABORN FORMATTING
    #--------------------------
    sns.reset_orig()

This will reset the plot formatting for your matplotlib charts to the original values (although it will respect any custom settings that you’ve established for your `rcParams` file).
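As an aside, if the goal is simply to keep a formatting change from leaking into later charts, matplotlib itself has a tool for that. The following is a sketch of a matplotlib-only alternative, not the seaborn technique above: it uses `plt.rc_context` to apply a couple of settings temporarily. The two settings I chose (a background grid and a light grey background) are just illustrative assumptions, and so is the `Agg` backend.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, no window needed
import matplotlib.pyplot as plt

# Remember the current setting so we can compare afterwards.
grid_before = plt.rcParams["axes.grid"]

# Settings passed to rc_context only apply inside the "with" block.
with plt.rc_context({"axes.grid": True, "axes.facecolor": "#EAEAF2"}):
    grid_inside = plt.rcParams["axes.grid"]
    plt.plot([2, 5, 4, 8])

# Once the block ends, the earlier settings are restored automatically.
grid_after = plt.rcParams["axes.grid"]
print(grid_before, grid_inside, grid_after)
```

The tradeoff is that you have to spell out the settings yourself, whereas `sns.set()` applies a whole set of styling choices in one call.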

Let’s do one more example that combines all of the parameters and techniques that we’ve learned so far.

Here, we’re going to modify the linewidth and the line color, and we’re going to modify the background formatting by using the sns.set function.

    #------------------------
    # FINAL COMBINED EXAMPLE
    #------------------------
    import seaborn as sns
    sns.set()
    plt.plot(tsla_close_price, color = 'crimson', linewidth = 1)

And here is the output:

You should understand this if you’ve carefully read the previous examples in this tutorial. However, let me quickly explain it.

Here, we’ve used the `plt.plot()` function to plot the data contained in `tsla_close_price`. We used the `linewidth` parameter to make the line a little thinner, and we used the `color` parameter to change the color of the line to ‘crimson’, which is very close to the color of Tesla’s logo.

We also used the `seaborn.set()` function to enhance the background formatting and make your matplotlib line chart look more “polished.”

Overall, there’s still more that we could do to improve this, but it’s pretty good.

Let me give you one quick reminder …

When you use the `sns.set()` function to change the formatting of your charts, it may change the formatting of all of your charts in the future.

To remove that formatting and revert the formatting to the matplotlib defaults, you can use the following code:

    # reset defaults
    sns.reset_defaults()

This will reset the formatting of your charts to the default matplotlib format.

This tutorial should be enough to get you started making line charts with matplotlib.

But, this is really only the beginning.

If you’re serious about data visualization and data science with Python, you’ll need to learn more. You’ll need to learn how to add titles to your plots, format the text, add annotations, and a lot more.

Moreover, to really learn data science in Python, you can’t strictly learn the data visualization tools. You’ll need to learn at least a little about data manipulation … for example, you should learn about NumPy arrays and learn about Pandas dataframes. You should probably also study some machine learning as well.

What I’m getting at is that if you’re serious about data visualization and data science in Python, you will need to learn more.

And if you’re ready to learn more, we can help.

Here at Sharp Sight, we teach data science.

We regularly publish free data science tutorials, right here at the Sharp Sight blog.

To get these tutorials delivered right to your inbox, sign up for our email list.

When you sign up, you’ll get free tutorials about:

- Matplotlib
- NumPy
- Pandas
- Base Python
- scikit-learn
- Machine learning
- Deep learning
- … and more.

Want to learn data science in Python? Sign up now.

The post How to make a matplotlib line chart appeared first on Sharp Sight.

]]>The post How to use Pandas loc to subset Python dataframes appeared first on Sharp Sight.

]]>If you’re new to Pandas and new to data science in Python, I recommend that you read the whole tutorial. There are some little details that can be easy to miss, so you’ll learn more if you read the whole damn thing.

But, I get it. You might be in a hurry.

Fair enough.

Here are a few links to the important sections:

- A quick refresher on Pandas
- Pandas DataFrame basics
- The syntax of Pandas loc
- Examples: how to use the Pandas loc method

Again though, I recommend that you slow down and learn step by step. That’s the best way to rapidly master data science.

Ok. Quickly, I’m going to give you an overview of the Pandas module. The specifics about `loc[]` will follow just afterwards.

To understand the Pandas loc method, you need to know a little bit about Pandas and a little bit about DataFrames.

What is Pandas?

Pandas is a module for data manipulation in the Python programming language.

At a high level, Pandas exclusively deals with data manipulation (AKA, data wrangling). That means that Pandas focuses on creating, organizing, and cleaning datasets in Python.

However, Pandas is a little more specific.

Pandas focuses on *DataFrames*. This is important to know, because the loc technique requires you to understand DataFrames and how they operate.

That being the case, let’s quickly review Pandas DataFrames.

A Pandas DataFrame is essentially a 2-dimensional row-and-column data structure for Python.

This row-and-column format makes a Pandas DataFrame similar to an Excel spreadsheet.

Notice in the example image above, there are multiple rows and multiple columns. Also notice that different columns can contain different data types. A column like ‘`continent`’ contains string data (i.e., character data) but a different column like ‘`population`’ contains numeric data. Again, different columns can contain different data types.

But, within a column, all of the data must have the *same* data type. So for example, all of the data in the ‘`population`’ column is integer data.

Pandas DataFrames have another important feature: the rows and columns have associated index values.

Take a look. Every row has an associated number, starting with `0`. Every column also has an associated number.

These numbers that identify specific rows or columns are called indexes.

Keep in mind that all Pandas DataFrames have these integer indexes *by default*.

Integer indexes are useful because you can use these row numbers and column numbers to select data and generate subsets. In fact, that’s what you can do with the Pandas `iloc[]` method. Pandas iloc enables you to select data from a DataFrame by numeric index.

But you can also select data in a Pandas DataFrame by *label*. That’s really important for understanding `loc[]`, so let’s discuss row and column labels in Pandas DataFrames.

In addition to having integer index values, DataFrame rows and columns can also have *labels*.

Unlike the integer indexes, these labels do *not* exist on the DataFrame by default. You need to define them. (I’ll show you how in a moment.)

When you set them up, the row and column labels look something like this:

Importantly, if you set the labels up right, you can use these labels to subset your data.

And that’s exactly what you can do with the Pandas loc method.

So now that we’ve discussed some of the preliminary details of DataFrames in Python, let’s really talk about the Pandas loc method.

The Pandas loc method enables you to select data from a Pandas DataFrame by label.

It allows you to “**loc**ate” data in a DataFrame.

That’s where we get the name `loc[]`. We use it to locate data.

It’s slightly different from the `iloc[]` method, so let me quickly explain that.

This is very straightforward.

The loc method locates data by *label*.

The iloc method locates data by *integer index*.

I’m really not going to explain iloc here, so if you want to know more about it, I suggest that you read our Pandas iloc tutorial.
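Just to make the distinction concrete, here is a tiny sketch that retrieves the same cell both ways. The three-row DataFrame is a cut-down stand-in that I made up for illustration; the full example DataFrame is built later in this tutorial.

```python
import pandas as pd

# A small stand-in DataFrame with country names as row labels.
df = pd.DataFrame(
    {"GDP": [19390604, 12237700, 4872137]},
    index=["USA", "China", "Japan"],
)

# loc locates by label: the row labeled 'China', the column labeled 'GDP'.
by_label = df.loc["China", "GDP"]

# iloc locates by integer position: second row, first column.
by_position = df.iloc[1, 0]

print(by_label == by_position)
```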

Now that you have a good understanding of DataFrame structure, DataFrame indexes, and DataFrame labels, let’s get into the details of the loc method.

Here, I want to explain the syntax of Pandas loc.

How does it work?

If you’re familiar with calling methods in Python, this should be very familiar.

Essentially, you’re going to use “dot notation” to call `loc[]` after specifying a Pandas DataFrame.

So first, you’ll specify a Pandas DataFrame object.

Then, you’ll type a dot (“`.`”) …

… followed by the method name, `loc[]`.

Inside of the `loc[]` method, you need to specify the labels of the rows or columns that you want to retrieve.

It’s important to understand that you can specify a *single* row or column. Or you can also specify a *range* of rows or columns. Specifying ranges is called “slicing,” and it’s an important tool for subsetting data in Python. I’ll explain more about slicing later in the examples section of this tutorial.

There’s one important note about the ‘column’ label.

If you don’t provide a column label, loc will retrieve *all* columns by default.

Essentially, it’s optional to provide the column label. If you leave it out, loc[] will get all of the columns.
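Here is a short sketch of those syntax variations side by side, using a made-up two-row stand-in DataFrame (the real example data comes later in the tutorial).

```python
import pandas as pd

# A small stand-in DataFrame with row labels and column labels.
df = pd.DataFrame(
    {"GDP": [19390604, 12237700], "population": [322179605, 1403500365]},
    index=["USA", "China"],
)

# Row label only: the column label is optional, so we get all columns.
one_row = df.loc["USA"]

# Row label and column label: we get a single cell.
one_cell = df.loc["USA", "GDP"]

# A range of row labels (a slice). With loc, both endpoints are included.
row_slice = df.loc["USA":"China"]

print(one_row)
print(one_cell)
print(row_slice)
```

One detail worth noticing is that label slices with `loc[]` include the end label, which is different from how ordinary Python slices treat their stop position.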

Ok. Now that I’ve explained the syntax at a high level, let’s take a look at some concrete examples.

Here’s what I will show you:

- row selection with loc
- column selection with loc
- retrieve specific cells with loc
- retrieve ranges of rows and columns (i.e., slicing)
- get specific subsets of cells

In this examples section, we’re going to focus on *simple* examples. This is important. When you’re learning, it’s very helpful to work with simple, clear examples. Don’t try to get fancy too early on. Learn the technique with simple examples and then move on to more complex examples later.

Before we actually get into the examples though, we have two things we need to do. We need to import Pandas and we need to create a simple Pandas DataFrame that we can work with.

First, we’ll just import Pandas.

We can do this with the following code.

    #===============
    # IMPORT MODULES
    #===============
    import pandas as pd

Note that we’re importing Pandas with the alias `pd`. This makes it possible to refer to Pandas as `pd` in our code, which simplifies things a little.

Next, we’re going to use the pd.DataFrame function to create a Pandas DataFrame.

There are actually three steps to this. We need to first create a Python dictionary of data. Then we need to apply the pd.DataFrame function to the dictionary in order to create a dataframe. Finally, we’ll specify the row and column labels.

Here’s the step where we create the Python dictionary:

    #==========================
    # CREATE DICTIONARY OF DATA
    #==========================
    country_data_dict = {
        'country':['USA', 'China', 'Japan', 'Germany', 'UK', 'India']
        ,'continent':['North America','Asia','Asia','Europe','Europe','Asia']
        ,'GDP':[19390604, 12237700, 4872137, 3677439, 2622434, 2597491]
        ,'population':[322179605, 1403500365, 127748513, 81914672, 65788574, 1324171354]
    }

Next, we’ll create our DataFrame from the dictionary:

    #=================================
    # CREATE DATAFRAME FROM DICTIONARY
    #=================================
    country_data_df = pd.DataFrame(country_data_dict, columns = ['country', 'continent', 'GDP', 'population'])

Notice that in this step, we set the column labels by using the `columns` parameter inside of pd.DataFrame().

Finally, we need to set the row labels. By default, the row labels will just be the integer index value starting from 0.

Here though, we’re going to manually change the row labels.

Specifically, we’re going to use the values of one of our existing columns, `country`, as the row labels.

To do this, we’ll use the `set_index()` method from Pandas:

    country_data_df = country_data_df.set_index('country')

Notice that we need to store the output of `set_index()` back in the DataFrame, `country_data_df`, by using the equal sign. This is because `set_index()` creates a new object by default; it doesn’t modify the DataFrame in place.
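If that last point seems abstract, here is a small sketch that shows what happens when you forget to store the result. It uses a made-up two-row stand-in DataFrame rather than the full country data.

```python
import pandas as pd

# A small stand-in DataFrame, before any row labels are set.
df = pd.DataFrame({"country": ["USA", "China"], "GDP": [19390604, 12237700]})

# Calling set_index() without storing the result leaves df untouched.
df.set_index("country")
columns_after_bare_call = list(df.columns)

# Storing the result gives us the new, re-indexed DataFrame.
indexed = df.set_index("country")

print(columns_after_bare_call)
print(list(indexed.index))
```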

Quickly, let’s examine the data with a print statement:

    print(country_data_df)

                 continent       GDP  population
    country
    USA      North America  19390604   322179605
    China             Asia  12237700  1403500365
    Japan             Asia   4872137   127748513
    Germany         Europe   3677439    81914672
    UK              Europe   2622434    65788574
    India             Asia   2597491  1324171354

You can see the row-and-column structure of the data. There are 3 columns: `continent`, `GDP`, and `population`. Notice that the “`country`” column is set aside off to the left. That’s because the `country` column has actually become the row index (the labels) of the rows.

Visually, we can represent the data like this:

Essentially, we have a Pandas DataFrame that has row labels and column labels. We’ll be able to use these row and column labels to create subsets.

With that in mind, let’s move on to the examples.

First, I’m going to show you how to select a single row using loc.

Here, we’re going to select all of the data for the row `USA`.

To do this, we’ll simply call the `loc[]` method after the dataframe:

    country_data_df.loc['USA']

Which produces the following output:

    continent     North America
    GDP                19390604
    population        322179605
    Name: USA, dtype: object

This is fairly straightforward, but let me explain.

We’re using the `loc[]` method to select a single row of data by the *row label*. The row label for the first row is ‘`USA`,’ so we’re using the code `country_data_df.loc['USA']` to pull back everything associated with that row.

Notice that using `loc[]` in this way returns the values for all of the columns for that row. It tells us the continent of `USA` (‘`North America`’), the GDP of `USA` (`19390604`), and the population of `USA` (`322179605`).

The loc method returns all of the data for the row with the label that we specify.

Here’s another example.

Here, we’re going to select all of the data for India. In other words, we’re going to select the data for the row with the label `India`.

Once again, we’ll simply use the name of the row label inside of the `loc[]` method:

    country_data_df.loc['India']

Which produces the following output:

    continent           Asia
    GDP              2597491
    population    1324171354
    Name: India, dtype: object

As you can see, the code `country_data_df.loc['India']` returns all of the data for the ‘`India`’ row.

Now that I’ve shown you one way to select data for a single row, I’m going to show you an alternate syntax.

There’s actually another way to select a single row with the loc method.

It’s a little more complicated, but it’s relevant for retrieving “slices” of data, which I’ll show you later in this tutorial.

Here, we’re going to call the `loc[]` method using dot notation, just like we did before.

Inside of the `loc[]` method, the first argument will be the label associated with the row we want to return. Here, we’re going to retrieve the data for `USA`, so the first argument inside of the brackets will be ‘`USA`.’

After that though, the code will be a little different. After the row label that we want to return, we have a comma, followed by a colon (‘`:`’).

The full line of code looks like this:

    country_data_df.loc['USA',:]

Which produces the following:

    continent     North America
    GDP                19390604
    population        322179605
    Name: USA, dtype: object

Once again, this code has pulled back the row of data associated with the label ‘`USA`.’

The output of this code is effectively the same as the code `country_data_df.loc['USA']`. The difference is that we’re using a colon inside of the brackets now (i.e., `country_data_df.loc['USA',:]`).

Why?

Remember from earlier in this tutorial when I explained the syntax: when we use the Pandas loc method to retrieve data, we can refer to a row label and a column label inside of the brackets.

In the code `country_data_df.loc['USA',:]`, `'USA'` is the row label and the colon is functioning as the column label.

But instead of referring to a specific column, the colon basically tells Pandas to retrieve *all* columns.

The output, though, is simply the row associated with the row label `USA`.

Keep this syntax in mind … it will be relevant when we start working with slices of data.
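
Here is a quick way to convince yourself that the two forms return the same thing. This is only a sketch: it rebuilds `country_data_df` from the values shown in this tutorial (the UK's continent is my assumption), and `short_form` and `long_form` are illustrative names.

```python
import pandas as pd

# Rebuild the tutorial's DataFrame from the values shown in its outputs.
# Assumption: the UK's continent is 'Europe' (it is not shown in the outputs).
country_data_df = pd.DataFrame({
    "country": ["USA", "China", "Japan", "Germany", "UK", "India"],
    "continent": ["North America", "Asia", "Asia", "Europe", "Europe", "Asia"],
    "GDP": [19390604, 12237700, 4872137, 3677439, 2622434, 2597491],
    "population": [322179605, 1403500365, 127748513, 81914672, 65788574, 1324171354],
}).set_index("country")

# Row label only.
short_form = country_data_df.loc["USA"]

# Row label, then a colon in the column position to ask for all columns.
long_form = country_data_df.loc["USA", :]

# Series.equals() checks that the labels and the values match.
print(short_form.equals(long_form))  # True
```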

Ok. Now that you’ve learned how to select a single row of data from a Python dataframe, let’s look at how to select a single column of data.

Selecting a column from a Python DataFrame is fairly simple syntactically. It’s very similar to the syntax for selecting a row.

The major difference is how we specify the row and column labels inside of the `loc[]` method.

When we select a single column, the first argument inside of `loc[]` will be the colon. Remember, the item in this position refers to the rows that we want to select. By using the colon (`:`) here, we indicate that we want to retrieve *all* rows.

The next item inside of `loc[]` is the name of the column that we want to select.

This might still be a little abstract, so let’s take a look at a concrete example.

In this example, we’re going to select the `population` column from the `country_data_df` DataFrame.

Here’s the code:

country_data_df.loc[:,'population']

And here is the output:

country
USA         322179605
China      1403500365
Japan       127748513
Germany      81914672
UK           65788574
India      1324171354
Name: population, dtype: int64

This is pretty straightforward.

We called the `loc[]` method by using dot notation after the name of the DataFrame, `country_data_df`.

Inside of the `loc[]` method, we have two arguments.

The first is the colon operator, which indicates that we want to retrieve all rows.

The second is the name of the column that we want to retrieve, `population`.

And what does it return? This code returns all of the row labels (which we set up as the country names earlier by using `set_index('country')`). It also returns the population that corresponds to each country.

Essentially, it returns the `population` column, along with the row labels.

You can retrieve data in a similar way for the other columns … just use a different column name in place of `'population'`. Change the code and try it out yourself!
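
For instance, swapping in `'GDP'` might look like the sketch below. It rebuilds `country_data_df` from the values shown in this tutorial (the UK's continent is my assumption), and the variable name `gdp` is just for illustration.

```python
import pandas as pd

# Rebuild the tutorial's DataFrame from the values shown in its outputs.
# Assumption: the UK's continent is 'Europe' (it is not shown in the outputs).
country_data_df = pd.DataFrame({
    "country": ["USA", "China", "Japan", "Germany", "UK", "India"],
    "continent": ["North America", "Asia", "Asia", "Europe", "Europe", "Asia"],
    "GDP": [19390604, 12237700, 4872137, 3677439, 2622434, 2597491],
    "population": [322179605, 1403500365, 127748513, 81914672, 65788574, 1324171354],
}).set_index("country")

# Colon in the row position means "all rows"; 'GDP' picks the column.
gdp = country_data_df.loc[:, "GDP"]

print(gdp.name)          # GDP
print(list(gdp.index))   # the country names, in their original order
print(gdp.loc["Japan"])  # 4872137
```

The result is a Series again, and because it keeps the country names as its labels, you can call `loc[]` on it with a single label.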

Next, let’s select the data in a single cell.

Selecting a single cell of data using loc is pretty simple, if you already know how to select rows and columns.

Essentially, we’re going to supply both a row label and a column label inside of `loc[]`.

Let’s take a look:

country_data_df.loc['China', 'GDP']

Which produces the following output:

12237700

This is pretty straightforward.

We called the `loc[]` method by using dot notation after the name of the DataFrame.

Inside of the method, we specified `'China'` as the row label and `'GDP'` as the column label.

This tells the loc method to return the data that meet both criteria. It tells loc to pull back the data that is in the `China` row and the `GDP` column.

Again … this is pretty simple once you understand the basic mechanics of loc.
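
One way to see the “both criteria” idea is to compare the one-step lookup with a two-step lookup that selects the row first and then the column label from that row. This is a sketch under the same assumptions as before: `country_data_df` is rebuilt from the values shown in this tutorial (the UK's continent is assumed), and the variable names are illustrative.

```python
import pandas as pd

# Rebuild the tutorial's DataFrame from the values shown in its outputs.
# Assumption: the UK's continent is 'Europe' (it is not shown in the outputs).
country_data_df = pd.DataFrame({
    "country": ["USA", "China", "Japan", "Germany", "UK", "India"],
    "continent": ["North America", "Asia", "Asia", "Europe", "Europe", "Asia"],
    "GDP": [19390604, 12237700, 4872137, 3677439, 2622434, 2597491],
    "population": [322179605, 1403500365, 127748513, 81914672, 65788574, 1324171354],
}).set_index("country")

# One step: row label and column label together.
china_gdp = country_data_df.loc["China", "GDP"]

# Two steps: pull the 'China' row (a Series), then pick 'GDP' from it.
china_gdp_two_steps = country_data_df.loc["China"].loc["GDP"]

print(china_gdp)                         # 12237700
print(china_gdp == china_gdp_two_steps)  # True
```

The one-step form is the one to prefer; the two-step version is only here to show what the row and column labels are each doing.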

Now though, let’s move on to something a little more complicated. Let’s talk about “slicing” DataFrames with the loc method.

Instead of just retrieving single rows or single columns using loc, we can actually retrieve “slices” of data.

“Slices” of data are basically “ranges” of data.

The syntax for doing this is pretty easy to understand, if you’ve understood how to retrieve a single row.

Essentially, to retrieve a range of rows, we need to define a “start” row and a “stop” row.

Syntactically, you’ll call the `loc[]` method just like you normally would.

Then inside of the `loc[]` method, you’ll specify the label of the “start” row and the label of the “stop” row, separated by a colon.

Keep in mind that the stop row *will* be included. The range of data that’s returned will be up to and *including* the stop row. This is different than how `iloc[]` works and how numeric indexes work generally in Python. Typically, the stop index is *excluded*, but that’s not the case with `loc[]`.

Let me show you an example so you can see this in action.

Here, we’re going to retrieve a range of rows.

Specifically, we’ll retrieve the rows from `China` to `Germany`.

Here’s the code:

country_data_df.loc['China':'Germany', :]

And here are the rows that it retrieves:

        continent       GDP  population
country
China        Asia  12237700  1403500365
Japan        Asia   4872137   127748513
Germany    Europe   3677439    81914672

As you can see, the code `country_data_df.loc['China':'Germany', :]` retrieved the rows from `China` up to and including `Germany`.
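
To make the inclusive stop row concrete, here is a small sketch that puts a label slice with `loc[]` next to a position slice with `iloc[]`. It rebuilds `country_data_df` from the values shown in this tutorial (the UK's continent is my assumption). In that row order, `China` sits at position 1 and `Germany` at position 3.

```python
import pandas as pd

# Rebuild the tutorial's DataFrame from the values shown in its outputs.
# Assumption: the UK's continent is 'Europe' (it is not shown in the outputs).
country_data_df = pd.DataFrame({
    "country": ["USA", "China", "Japan", "Germany", "UK", "India"],
    "continent": ["North America", "Asia", "Asia", "Europe", "Europe", "Asia"],
    "GDP": [19390604, 12237700, 4872137, 3677439, 2622434, 2597491],
    "population": [322179605, 1403500365, 127748513, 81914672, 65788574, 1324171354],
}).set_index("country")

# Label-based slice: the stop label 'Germany' IS included.
by_label = country_data_df.loc["China":"Germany", :]

# Position-based slice: the stop position 3 is NOT included.
by_position = country_data_df.iloc[1:3, :]

print(list(by_label.index))     # ['China', 'Japan', 'Germany']
print(list(by_position.index))  # ['China', 'Japan']
```

So to get the same three rows with `iloc[]`, you would need to slice `1:4` instead.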

Next, let’s retrieve a slice of columns using loc.

Getting a subset of columns using the loc method is very similar to getting a subset of rows.

We’re going to call the `loc[]` method and then inside of the brackets, we’ll specify the row and column labels.

Because we want to retrieve all rows, we’ll use the colon (`:`) for the row label specifier.

After that, we’ll use the code `'GDP':'population'` to specify that we want to select the columns from `'GDP'` up to and including `'population'`.

Here’s the exact code:

country_data_df.loc[:, 'GDP':'population']

Which produces the following result:

              GDP  population
country
USA      19390604   322179605
China    12237700  1403500365
Japan     4872137   127748513
Germany   3677439    81914672
UK        2622434    65788574
India     2597491  1324171354

Essentially, the code `country_data_df.loc[:, 'GDP':'population']` retrieved all rows but only two columns, `GDP` and `population`. It basically retrieved a “slice” of columns.
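
A column slice follows the order of the columns in the DataFrame, and just like a row slice, it includes the stop label. The sketch below shows that with two different slices. It rebuilds `country_data_df` from the values shown in this tutorial (the UK's continent is my assumption), and the variable names are illustrative.

```python
import pandas as pd

# Rebuild the tutorial's DataFrame from the values shown in its outputs.
# Assumption: the UK's continent is 'Europe' (it is not shown in the outputs).
country_data_df = pd.DataFrame({
    "country": ["USA", "China", "Japan", "Germany", "UK", "India"],
    "continent": ["North America", "Asia", "Asia", "Europe", "Europe", "Asia"],
    "GDP": [19390604, 12237700, 4872137, 3677439, 2622434, 2597491],
    "population": [322179605, 1403500365, 127748513, 81914672, 65788574, 1324171354],
}).set_index("country")

# Columns are ordered: continent, GDP, population.
gdp_to_pop = country_data_df.loc[:, "GDP":"population"]
cont_to_gdp = country_data_df.loc[:, "continent":"GDP"]

print(list(gdp_to_pop.columns))   # ['GDP', 'population']
print(list(cont_to_gdp.columns))  # ['continent', 'GDP']
print(len(gdp_to_pop))            # 6, because all rows are kept
```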

Finally, let’s put all of the pieces together and select a subset of cells using loc.

Selecting a subset of cells using the `loc[]` method is very similar to selecting slices.

Essentially, you’ll use code that returns a slice of rows *and* a slice of columns at the same time.

Let me show you an example.

Here is some code that will select the cells for `GDP` and `population` for the rows between `China` and `Germany` (including `Germany`).

country_data_df.loc['China':'Germany', 'GDP':'population']

Which produces the following output:

              GDP  population
country
China    12237700  1403500365
Japan     4872137   127748513
Germany   3677439    81914672

This is pretty simple to understand, if you already understand row slices and column slices. (If you don’t, go back and review those sections of this tutorial!)

We’ve called the loc method as normal.

Inside of `loc[]` we specified that we want to retrieve the range of rows starting from `China` up to and including the row for `Germany`.

After that, we then specified that we want to retrieve the columns for `GDP` up to and including the column for `population`.

Again, this is pretty easy to understand, as long as you understand the basics of the loc method.
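
If you want to double check the result, a quick sketch like this one confirms the size of the subset and spot-checks one cell. As before, `country_data_df` is rebuilt from the values shown in this tutorial (the UK's continent is my assumption), and `subset` is an illustrative name.

```python
import pandas as pd

# Rebuild the tutorial's DataFrame from the values shown in its outputs.
# Assumption: the UK's continent is 'Europe' (it is not shown in the outputs).
country_data_df = pd.DataFrame({
    "country": ["USA", "China", "Japan", "Germany", "UK", "India"],
    "continent": ["North America", "Asia", "Asia", "Europe", "Europe", "Asia"],
    "GDP": [19390604, 12237700, 4872137, 3677439, 2622434, 2597491],
    "population": [322179605, 1403500365, 127748513, 81914672, 65788574, 1324171354],
}).set_index("country")

# A slice of rows and a slice of columns at the same time.
subset = country_data_df.loc["China":"Germany", "GDP":"population"]

print(subset.shape)                       # (3, 2): three rows, two columns
print(subset.loc["Japan", "population"])  # 127748513
```

Because the result is itself a DataFrame with the same row and column labels, you can keep using `loc[]` on it.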

Having said that, if you’re confused about anything in particular, leave your question in the comments at the bottom of this page.

Look.

You’ve probably heard it before …

Data manipulation is *really* important for data science.

If you want to be good at data science in Python, you really need to learn how to do data manipulation in Python.

That means that you need to learn and master Pandas. You should also learn more about NumPy.

I can’t stress this enough: if you want to learn data science in Python, make sure to study Pandas!

If you want to learn more about Pandas, and discover strategies to master Pandas, then sign up for our email list.

Every week here at Sharp Sight, we publish *FREE* data science tutorials.

We write about data science in Python … things like Pandas, matplotlib, NumPy and scikit-learn.

(We also write about data science in R.)

If you want to learn more about data science, then sign up!

When you sign up for our email list, we send you our *free* data science tutorials every week.

You have nothing to lose … the tutorials are free, so sign up now.

The post How to use Pandas loc to subset Python dataframes appeared first on Sharp Sight.

]]>