How to Use Numpy Random Choice

This tutorial will explain the NumPy random choice function which is sometimes called np.random.choice or numpy.random.choice.

I recommend that you read the whole blog post, but if you want, you can skip ahead. Here are the contents of the tutorial …

Contents:

Again, if you have the time, I strongly recommend that you read the whole tutorial. Everything will make more sense if you read everything carefully and follow the examples.

Ok … let’s get into it.

 

A quick introduction to the NumPy random choice function

First of all, what is np.random.choice?

NumPy random choice is a function from the NumPy package in Python.

You might know a little bit about NumPy already, but I want to quickly explain what it is, just to make sure that we’re all on the same page.

Numpy is a data manipulation module for Python

NumPy is a data manipulation module for Python.

Specifically, the tools from NumPy operate on arrays of numbers … i.e., numeric data.

Numpy is important for data science, statistics, and machine learning

Because NumPy functions operate on numbers, they are especially useful for data science, statistics, and machine learning.

For example, if you want to do some data analysis, you’ll often be working with tables of numbers. Frequently, when you work with data, you’ll need to organize it, reshape it, clean it and transform it. We call these data cleaning and reshaping tasks “data manipulation.”

In recent years, NumPy has become particularly important for “machine learning” and “deep learning,” since these often involve large datasets of numeric data. When you’re doing machine learning and deep learning, numeric data manipulation is a very big part of the workflow.

In any case, whether you’re doing statistics or analysis or deep learning, NumPy provides an excellent toolkit to help you clean up your data.

 

NumPy random choice helps you create random samples

One common task in data analysis, statistics, and related fields is taking random samples of data.

You’ll see random samples in probability, Bayesian statistics, machine learning, and other subjects. Random samples are very common in data-related fields.

NumPy random choice provides a way of creating random samples with the NumPy system.

NumPy random choice generates random samples

If you’re working in Python and doing any sort of data work, chances are (heh, heh), you’ll have to create a random sample at some point.

NumPy random choice can help you do just that.

To explain it though, let’s take a look at an example.

Think of a die … the kind of die that you would see in a game:

A die (cube) with the numbers 1 to 6 printed on the sides.

A typical die has six sides. Each side has some dots on it, corresponding to a number 1 through 6. Essentially, a die has the numbers 1 to 6 on its six different faces.

If you roll the die, when the die lands, one face will emerge pointing upwards, so rolling the die is exactly like selecting a number between 1 and 6. The numbers 1 to 6 on the die are the possible outcomes that can appear, and rolling a die is like randomly choosing a number between 1 and 6.

So essentially, in the example of rolling a die, we have possible outcomes (i.e., the faces), and a random process that chooses one of them.

The NumPy random choice function is a lot like this. Given an input array of numbers, numpy.random.choice will choose one of those numbers randomly.

So let’s say that we have a NumPy array of 6 integers … the numbers 1 to 6.

An array containing the numbers 1 to 6.

If we apply np.random.choice to this array, it will select one. It will choose one randomly…. it’s essentially the same as rolling a die.

That’s how np.random.choice works. You input some items, and the function will randomly choose one or more of them as the output.

 

The syntax of numpy random choice

Conceptually, this function is easy to understand, but using it properly can be a little tricky.

Ultimately, to use NumPy random choice properly, you need to know the syntax and how the syntax works.

That being the case, let’s look at the syntax of np.random.choice.

A quick note

One quick note …

In this tutorial, you’ll see me refer to the function as np.random.choice.

The term “np” refers to NumPy. But, to get the syntax to work properly, you need to tell your Python system that you’re referring to NumPy as “np”. You need to run the code import numpy as np. This code essentially tells Python that we’re giving the NumPy package the nickname “np“.

I’ll show you exactly how to do that again in the examples section of this tutorial, but I want to briefly explain it before we look at the syntax.

An explanation of the syntax

Ok, let’s take a look at the syntax.

The np.random.choice() function is fairly simple. When you use it, there is the name of the function, and then some parameters that will be enclosed inside of parenthesis.

An explanation of the syntax and parameters of NumPy random choice.

Because the parameters of the function are important to how it works, let’s take a closer look at the parameters of NumPy random choice.

The parameters of numpy random choice

There are four parameters for the NumPy random choice function:

  • a
  • size
  • replace
  • p

Let’s discuss each of these individually.

a (required)

The a parameter enables us to specify the array of input values … typically a NumPy array.

This is essentially the set of input elements from which we will generate the random sample.

Note that the a parameter is required … you need to provide some array-like structure that contains the inputs to the random selection process.

Also note that the a parameter is flexible in terms of the inputs that it will accept. Typically, we’ll supply a NumPy array of numbers to the a parameter. However, because it is flexible, it will also accept things like Python lists, tuples, and other Python sequences.

Moreover, instead of supplying a sequence like a NumPy array, you can also just provide a number (i.e., an integer). If you provide an integer n, it will create a NumPy array of integers up to but excluding n by using the NumPy arange function. In this case, it’s as if you supplied a NumPy array with the code np.arange(n). I’ll show you an example of this in the examples section of this tutorial.

size

The size parameter describes (…. wait for it ….)

… the size of the output.

Remember that the NumPy random choice function accepts an input of elements, chooses randomly from those elements, and outputs the random selections as a NumPy array.

Because the output of numpy.random.choice is a NumPy array, the array will have a size. If you know about NumPy arrays, this will make sense, but if you’re new to NumPy this may be confusing.

Therefore, if you don’t know what the size attribute is, I suggest that you read our tutorial about NumPy arrays. Specifically, you should read the section about the attributes of NumPy arrays.

replace

The replace parameter specifies whether or not you want to sample with replacement.

If you’ve taken a statistics class, you’ll probably be familiar with this.

… but if you haven’t taken a stats class, the idea of sampling with and without replacement might be foreign.

That being the case, let me quickly explain.

Let’s say that you have 4 simple cards on a table: a diamond, a spade, a heart, and a club. (This is an extremely simple example, so we’re working with simplified playing cards.)

An image of four playing cards: a diamond, a spade, a heart, and a club.

I turn them over and mix them up on the table. Then I ask you to close your eyes.

You make your selection … it’s the heart card.

An image showing that you selected the 'heart' card.

Next, I ask you to select another card.

… now, this is the critical point.

Do you put your first card back or not? Do you “replace” your initial selection?

If you do put your card back, then it will be possible to re-select the heart card, or any of the other three cards. But if you do not replace your initial card, then it will only be possible to select a spade, diamond, or club.

Essentially, replacement makes a difference when you choose multiple times.

And this is what the replace parameter controls. It will control whether or not an element that is chosen by numpy.random.choice gets replaced back into the pool of possible choices.

I’ll explain this again in the examples section, so you can see it in action.

p

Finally, the p parameter controls the probability of selecting a given item.

By default, each item in the input array has an equal probability of being selected.

It’s like rolling a fair die.

A die (cube) with the numbers 1 to 6 printed on the sides.

A fair die has 6 sides, and each side is equally likely to come up. So the probability of rolling a 1 is .1667 (i.e., 1/6th). The probability of rolling a 2 is also .1667, etc.

Similarly, if we set up NumPy random choice with the input values 1 through 6, then each of those values will have an equal probability of being selected, by default.

But we can change that. We can manually specify the probabilities of the different outcomes. For example, we could make selecting ‘1‘ a probability of .5, and give the other outcomes a probability of .1. (This is akin to rolling an unfair, weighted die.)

Essentially, this is what the p parameter controls: the probabilities of selecting the different input elements.

Note that the p parameter is optional, and if we don’t provide anything, NumPy just treats each outcome as equally likely.

If we do provide something to the p parameter, then we need to provide it in the form of an “array like” object, such as a NumPy array, list, or tuple.

 

Examples: how to use the numpy random choice function

Now that we’ve looked at the syntax of numpy.random.choice, and we’ve taken a closer look at the parameters, let’s look at some examples.

Examples:

Run this code first

Before you run any of these examples, you’ll need to run some code as a preliminary setup step.

Import numpy

Specifically, you’ll need to properly import the NumPy module.

Keep in mind, that to import the NumPy module into your code environment, you’ll need to have NumPy installed on your computer first. Installing NumPy is complicated, and beyond the scope of this blog post. Having said that, I recommend that you just use Anaconda to get the modules properly installed.

But assuming that you have NumPy installed on your computer, you can import it into your working environment with the following code:

import numpy as np

This will import NumPy with the nickname np. Going forward, we will syntactically refer to NumPy as np in our code.

 

Example 1: select a random number with np.random.choice

In this first example, we’re going to select a single integer from a range of possible integers.

More specifically, we’re going to select a single integer between 0 and 9.

First, before we use np random choice to randomly select an integer from an array, we actually need to create the NumPy array.

Let’s do that now.

Create a NumPy array with np.arange

Here, we’re going to create a simple NumPy array with the numpy.arange function.

array_0_to_9 = np.arange(start = 0, stop = 10)

This is fairly straightforward, as long as you understand how to use np.arange. If you don’t, make sure to read our numpy.arange tutorial.

Using NumPy arange this way has created a new array, called array_0_to_9. This array contains the integers from 0 to 9.

You can print it out with the print function:

print(array_0_to_9)

OUTPUT:

[0 1 2 3 4 5 6 7 8 9]

Visually, we can represent the array as follows:

A visual representation of a NumPy array with integers from 0 to 9.

This is really straight forward … this array contains the integers from 0 to 9.

Next, we’re going to randomly select one of those integers from the array.

Select a random number from the NumPy array

To select a random number from array_0_to_9 we’re now going to use numpy.random.choice.

np.random.seed(0)
np.random.choice(a = array_0_to_9)

OUTPUT:

5

If you read and understood the syntax section of this tutorial, this is somewhat easy to understand. But there are a few potentially confusing points, so let me explain it.

Essentially, we’re using np.random.choice with the ‘a‘ parameter. You’ll remember from the syntax section earlier in this tutorial that the a parameter enables us to set the input array (i.e., the NumPy array that contains our input values). In other words, the code a = array_0_to_9 indicates that the input values are contained in the array array_0_to_9.

Remember, the input array array_0_to_9 simply contains the numbers from 0 to 9.

When we use np.random.choice to operate on that array, it simply randomly selects one of those numbers.

In this case, it randomly selects the number 5.

Visually, we can represent the operation like this:

A simple example of using numpy.random.choice on a small NumPy array, which selects a single number.

The input array has 10 values, and NumPy random choice randomly chooses one of them.

Why do we need to use np.random.seed?

There’s one part of this code that confuses many beginners, so I want to address it.

Before we ran the line of code np.random.choice(a = array_0_to_9), we ran the code np.random.seed(0).

We need np.random.seed because it “seeds” the random number generator for numpy.random.choice.

But WTF is a “seed” anyway?

This is a little complicated, but I’ll briefly explain here.

The NumPy random choice function operates on the principle of pseudorandom number generation.

When we use a pseudorandom number generator, the numbers in the output approximate random numbers, but are not exactly “random.” In fact, when we use pseudorandom numbers, the output is actually deterministic; the output is actually determined by an initializing value called a “seed.”

Let me say that again: when we set a seed for a pseudorandom number generator, the output is completely determined by the seed.

What that means is that if we use the same seed, a pseudorandom number generator will produce the same output.

Let me show you:

np.random.seed(0)
np.random.choice(a = np.arange(10))

This produces the output 5.

Now run it again with the same seed.

np.random.seed(0)
np.random.choice(a = np.arange(10))

It produces the output 5 again.

You can run this code as many times as you like. If you use the same seed, it will produce the exact same output.

np.random.choice is only pseudo random

What this means is that np.random.choice is random-ish. It’s sort of random, in the sense that there will be no discernible relationship between the seed and the output. But you have to remember that using the same seed will produce the same output.

This is actually good, because it makes the results of a pseudorandom function reproducible. If I share my code with you, and you run it with the same seed, you will get the exact same result. This is good for code testing, among other things.

If this is still confusing, you should read our tutorial about numpy.random.seed, which explains random number generation with NumPy.

Select a single integer (shorthand syntax)

Ok.

Now that I’ve shown you how to select a single random number from a specific NumPy array, let’s take a look at another way to select a number from a sequence of values.

Here, we’re going to select a number from the numbers 0 to 9. It’s essentially just like the prior example.

The one major difference is that we’re not going to supply a specific input array. Instead, we’re just going to provide a number inside of the parenthesis when we call np.random.choice. Here, we’re going to run the code np.random.choice(10).

np.random.seed(0)
np.random.choice(10)

Which produces the exact same output as in the previous example.

OUTPUT:

5

What’s going on here?

In this example, we ran the code np.random.choice(10). We did not provide a specific NumPy array as an input. Instead, we just provided the number 10.

When we provide a number to np random choice this way, it will automatically create a NumPy array using NumPy arange. Effectively, the code np.random.choice(10) is identical to the code np.random.choice(a = np.arange(10)). So by running np.random.choice this way, it will create a new numpy array of values from 0 to 9 and pass that as the input to numpy.random.choice.

This is essentially a shorthand way to both create an array of input values and then select from those values using the NumPy random choice function.

 

Example 2: select a random sample from a numpy array

Now that you’ve learned how to select a single number from a NumPy array, let’s take a look at how to create a random sample with NumPy random choice. That is, we’re going to select multiple elements from an input range.

Create a numpy array

First, let’s just create a NumPy array.

Here, we’ll create a NumPy array of values from 0 to 99.

array_0_to_99 = np.arange(100)
Select a random sample from the numpy array

Now that we have our input array, let’s select a sample of 5 numbers from it:

To do this, we’ll use the size parameter.

np.random.seed(1)
np.random.choice(array_0_to_99, size = 5)

OUTPUT:

array([37, 12, 72,  9, 75])

What happened here?

The NumPy random choice function randomly selected 5 numbers from the input array, which contains the numbers from 0 to 99.

The output is basically a random sample of the numbers from 0 to 99.

 

Example 3: perform random sampling with replacement

Next, let’s create a random sample with replacement using NumPy random choice.

Here, we’re going to create a random sample with replacement from the numbers 1 to 6.

Create a numpy array

First, we’ll just create a NumPy array of the values from 1 to 6.

array_1_to_6 = np.arange(start = 1, stop =7)

If we print it out, we can see the contents.

print(array_1_to_6)

OUT:

[1 2 3 4 5 6]

This is really straight forward. It’s just the numbers from 1 to 6.

Create a random sample from the input

Now, we’ll generate a random sample from those inputs.

Specifically, we’re going to create a sample of 3 values.

Additionally, we will set the replace parameter to replace = True. This will cause np.random.choice to perform random sampling with replacement. That is, even if a value is selected once, it will be “replaced” back into the possible input values, and it will be possible that the input could be selected again.

Let’s run the code.

np.random.seed(77)
np.random.choice(a = array_1_to_6, size = 3, replace = True)

OUTPUT:

array([5, 5, 4])

Notice what’s in the output. We have an output of 3 values. This is because we set the size parameter to size = 3. That means that the output must have 3 values.

Also, notice the values that are in the output. The value 5 is repeated twice.

Why?

This is possible because we set the replace parameter to replace = True.

When we do this, it means that an item in the input can be selected (i.e., included in the sample) and will then be “replaced” back into the pool of possible input values. Setting replace = True essentially means that a given input value can be selected multiple times!

Note: this is like rolling a die multiple times

Remember earlier in this tutorial that I explained NumPy random choice in terms of rolling a die?

That’s essentially what we’ve done in this example. The code np.random.choice(a = array_1_to_6, size = 3, replace = True) is essentially like rolling a die multiple times!

A die (cube) with the numbers 1 to 6 printed on the sides.

That’s what’s great about Python and NumPy … if you know how to use the tools right, you can begin to create little models of real-world processes.

 

Example 4: change the probabilities associated with outcomes

Next, we’re going to work with the p parameter to change the probabilities associated with the different possible outcomes.

So for example, let’s reuse our array array_1_to_6.

Here’s the code to create the array again:

array_1_to_6 = np.arange(start = 1, stop =7)

Essentially, the array array_1_to_6 has the values from 1 to 6.

Now, we’re going to randomly select from those values (1 to 6) but the probability of each value will not be the same.

Remember that by default, np.random.choice gives each input value an equal probability of being selected.

… but if we use the p parameter, we can change this.

np.random.choice(a = array_1_to_6, p = [.5,.1,.1,.1,.1,.1])

What are we doing here?

We’re using the p parameter to give the input values (1 to 6) different probabilities.

We can visualize the new setup like this:

A visual representation of using numpy.random.choice with different probabilities for the outcomes.

So essentially, the value “1” will have a probability of being selected of .5 (a 50% chance). And the other values from 2 to 6 will each have a probability of .1.

Now let’s run the code:

np.random.seed(42)
np.random.choice(a = array_1_to_6, p = [.5,.1,.1,.1,.1,.1])

OUT:

1
Generate a random sample from weighted inputs

Now let’s run the code again, but instead of generating a single value, we’ll generate a random sample of 20 values.

np.random.seed(42)
np.random.choice(a = array_1_to_6, p = [.5,.1,.1,.1,.1,.1], size = 20)

OUT:

array([1, 6, 4, 2, 1, 1, 1, 5, 3, 4, 1, 6, 5, 1, 1, 1, 1, 2, 1, 1])

Look closely at the numbers in the output array. LOOK AT ALL THOSE 1‘s.

Just by glancing at the output, you can see that 1 is coming up a lot more than the other values. That’s exactly how we designed it! There’s a 50% chance of generating a 1.

 

Example 5: select a random sample from list of items

Next, let’s move on from using numbers as possible outcomes.

…. let’s start using non-numeric inputs in the input array.

Here, we’re going to use a simple example.

For our input array, we’re going to create a Python array of 4 simplified playing cards: a ‘Diamond’ card, a ‘Spade’ card, a ‘Heart’, and a ‘Club’.

simple_cards = ['Diamond','Spade','Heart','Club']

You can think of the list simple_cards like this:

simple_cards represents a simplified set of 4 cards.

This is obviously not like a real set of 52 playing cards. As always, I really want to simplify this as much as possible just so you can see how this works.

Technically though, what is simple_cards? It’s a Python list that contains 4 strings.

Randomly select one item from the Python list

Now that we have our Python list, we’re first just going select a single item randomly from that list.

This is really easy. It’s almost exactly the same as some of the previous examples above where we were selecting a single item from a NumPy array of numbers. The only difference is that we’re supplying a list of strings to the numpy.random.choice instead of a NumPy array.

Let’s take a look.

np.random.seed(0)
np.random.choice(simple_cards)

OUTPUT:

'Diamond'

You can think of this code like selecting a single card from our simplified deck of four cards. There are four possible cards, and we selected the diamond.

A visual representation of this example, where we select a single card (the Diamond) from a simple deck of 4 cards.

From a technical perspective, if you read the earlier examples in this blog post, this should make sense.

All we did is randomly select a single item from our Python list.

Keep in mind though that the code is a little simplified syntactically, because I did not explicitly reference the parameters. If we were a little more explicit in how we wrote this, we could write the code as np.random.choice(a = simple_cards, replace = True). That’s effectively the same thing.

Now, let’s move on to a slightly more complicated example. We’re going to generate a random sample from our Python list.

Select a random sample from the Python list

Random sampling from a Python list is easy with NumPy random choice.

Once again, it’s almost exactly the same as some of the previous examples in this blog post.

Here, we’re going to select two cards from the list.

Essentially, we’re just going to pass the Python list to NumPy random choice and set the size parameter to 2. We’ll also set replace = False to make it so we can’t select the same card twice. I really want this to be like selecting two different cards from a deck of cards.

Let’s take a look at the code.

np.random.seed(55)
np.random.choice(a = simple_cards, size = 2, replace = False)

OUT:

array(['Diamond', 'Club'], dtype='U7')

So in this example, we randomly selected two cards from the ‘deck’ (i.e., we randomly selected 2 items from the list).

We selected the ‘Diamond’ and the ‘Club.’

A visual representation of selecting the Diamond and the Club cards from a simple 4 card deck.

Again, this example is pretty straight forward if you’ve read and understood the previous examples.

If this does not make sense, I recommend that you start at the top and review a few of the more simple examples more carefully.

Random sampling is really important for data science

Random sampling is really important for data science, speaking broadly.

The reason is that random sampling is a key concept and technique in probability. It’s also very important in statistics. Moreover, sampling is also applicable to machine learning and deep learning.

Essentially, random sampling is really important for a variety of sub-disciplines of data science.

You really need to know how to do this!

I’ve written this tutorial to help you get started with random sampling in Python and NumPy.

Having said that, I realize that random sampling can be confusing to beginners.

With that in mind, if you have specific questions about random sampling with NumPy or about the NumPy random choice function, please post your question in the comments section at the bottom of this page.

You should learn more about NumPy

Not only is the numpy.random.choice function important for data science and probability, the broader NumPy toolkit is important for data science in Python.

NumPy gives you a set of tools for working with numeric data in Python. To really get the most out of the NumPy package, you’ll need to learn many functions and tools … not just numpy.random.choice. For example, you’ll need to learn <href=”https://www.sharpsightlabs.com/blog/numpy-array-python/”>how to create NumPy arrays, how to calculate average values and other statistics, how to reshape NumPy arrays, and more.

I recommend that you read our free tutorials …. they will teach you a lot about NumPy.

For more tutorials, sign up for our email list

I also recommend that you sign up for our email list.

We regularly post tutorials about NumPy and data science in Python.

If you sign up for our email list, you’ll get our tutorials delivered directly to your inbox …

You’ll get free tutorials on:

  • NumPy
  • Pandas
  • Base Python
  • Scikit learn
  • Machine learning
  • Deep learning
  • … and more.

If you want to learn more about NumPy and data science, sign up now.

Joshua Ebner

Joshua Ebner is the founder, CEO, and Chief Data Scientist of Sharp Sight.   Prior to founding the company, Josh worked as a Data Scientist at Apple.   He has a degree in Physics from Cornell University.   For more daily data science advice, follow Josh on LinkedIn.

9 thoughts on “How to Use Numpy Random Choice”

  1. Your explanations are good to conceive. You look at all involved aspects and your examples are well choosen. Very good.

    Reply
  2. This is a great tutorial. It explained a lot. However I did not find a solution for my problem. I need to sample n samples from m containers, each one having different number of elements. But the resulting count for a given container cannot exceed initial count for that container. Is it possible?

    Reply

Leave a Comment