The post How to Use the Pandas Assign Method to Add New Variables appeared first on Sharp Sight.

In this tutorial, I’ll explain what the assign method does and how it works. I’ll explain the syntax, and I’ll show you step-by-step examples of how to use it.


Having said that, if you really want to understand Pandas assign, I recommend that you read the whole article.

So what does the assign method do?

Put simply, the assign method adds new variables to Pandas dataframes.

Quickly, I’ll explain that in a little more depth.

You’re probably aware of this, but just to clarify: Pandas is a toolkit for working with data in the Python programming language.

In Pandas, we typically work with a data structure called a dataframe.

A dataframe is a collection of data stored in a row-and-column format.

Pandas gives us a toolkit for creating these dataframes, and it also provides tools for modifying them.

Pandas has tools for sorting dataframes, aggregating dataframes, reshaping dataframes, and a lot more.

And one of the most important things we need to be able to do is *add new columns to a dataframe*.

The Pandas assign method enables us to add new columns to a dataframe.

We provide the input dataframe, tell assign how to calculate the new column, and it creates a new dataframe with the additional new column.

It’s fairly straightforward, but as the saying goes, the devil is in the details.

So with that said, let’s take a look at the syntax so we can see how the assign method works.

The syntax for the assign method is fairly simple.

You type the name of your dataframe, then a “dot”, and then type `assign()`.

Remember, the assign method is a Python method that’s associated with dataframe objects, so we can use so-called “dot syntax” to call the method.

Next, inside the parentheses, we need to provide a “name-value pair.”

What does that mean?

We simply provide the name of the new variable and the value that we want to assign to that variable. The value that we assign can be simple (like an integer constant), but it can also be a complicated value that we calculate.

I’ll show you examples of exactly how we use it in the examples section of this tutorial.

One quick note on the syntax:

If you want to add multiple variables, you can do this with a single call to the assign method.

Just type the name of your dataframe, call the method, and then provide the name-value pairs for each new variable, separated by commas.

Honestly, adding multiple variables to a Pandas dataframe is really easy. I’ll show you how in the examples section.

Before we look at the examples, let’s quickly talk about the output of the assign method.

This is really important, so you need to pay attention …

The output of the assign method is a *new dataframe*.

Read that again. It’s really important.

**The output of the assign method is a new dataframe**.

So if you use the assign method, you need to save the output in some way, or else the output will go to the console (if you’re working in an IDE).

The implication is that if you just run the method, your original dataframe will be left *unchanged* unless you store the output back to the original name.

(You can obviously also store the output to a new name. This is safer, unless you’re positive that you want to overwrite your original data.)
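To make this concrete, here’s a minimal sketch (using a small, hypothetical dataframe, not the one from this tutorial) showing that the original data stays unchanged until you store the output:

```python
import pandas as pd

# A tiny dataframe, just for illustration (hypothetical data)
df = pd.DataFrame({"x": [1, 2, 3]})

# Calling assign WITHOUT storing the result ...
df.assign(y = df.x * 10)

# ... leaves the original dataframe unchanged: still only the 'x' column
print(list(df.columns))          # ['x']

# Storing the output (here, to a new name) keeps the new column
df_new = df.assign(y = df.x * 10)
print(list(df_new.columns))      # ['x', 'y']
```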

Ok. Now that I’ve explained how the syntax works, let’s take a look at some examples of how to use assign to add new variables to a dataframe.

**Examples:**

- Create a new variable and assign a constant
- Add a variable that’s a computed value
- Add multiple variables to your dataframe
- Store the output of assign to a new name


Before you run any of these examples, you need to do two things:

- import pandas
- create the dataframe we’ll use

You can run this code to import Pandas:

```python
import pandas as pd
```

Next, let’s create our dataframe.

```python
sales_data = pd.DataFrame({
    "name": ["William", "Emma", "Sofia", "Markus", "Edward", "Thomas",
             "Ethan", "Olivia", "Arun", "Anika", "Paulo"],
    "region": ["East", "North", "East", "South", "West", "West",
               "South", "West", "West", "East", "South"],
    "sales": [50000, 52000, 90000, 34000, 42000, 72000,
              49000, 55000, 67000, 65000, 67000],
    "expenses": [42000, 43000, 50000, 44000, 38000, 39000,
                 42000, 60000, 39000, 44000, 45000]
})
```

We’ve called this dataframe `sales_data`.

This dataframe contains mock sales data for 11 people, and it has variables for both `sales` and `expenses`.

From here, we can use the assign() method to add some new variables.

In this first example, we’re going to add a new variable to the dataframe and assign a constant value for every row.

Let’s think about something specific.

Say that you’re working with this dataset, and all of these people work for the same company. You might have some other dataframes that have records for salespeople who work for *different* companies, but everyone in `sales_data` works for the same company.

What if we want to create a variable that contains the company name for the people in this dataframe?

We can do that with assign as follows:

```python
sales_data.assign(company = "Vandelay Industries")
```

OUT:

```
       name region  sales  expenses              company
0   William   East  50000     42000  Vandelay Industries
1      Emma  North  52000     43000  Vandelay Industries
2     Sofia   East  90000     50000  Vandelay Industries
3    Markus  South  34000     44000  Vandelay Industries
4    Edward   West  42000     38000  Vandelay Industries
5    Thomas   West  72000     39000  Vandelay Industries
6     Ethan  South  49000     42000  Vandelay Industries
7    Olivia   West  55000     60000  Vandelay Industries
8      Arun   West  67000     39000  Vandelay Industries
9     Anika   East  65000     44000  Vandelay Industries
10    Paulo  South  67000     45000  Vandelay Industries
```

So what did we do in this example?

Here, we created a new variable called `company`.

For every row in the data, the value for the `company` variable is the same. The value is “Vandelay Industries.”

In technical terms, the value is a constant for every row. More specifically, it’s a string value.

Having said that, when we create variables with constant values, we can add string values like this example, but we can also assign a new variable with a constant numeric value. For example, try the code `sales_data.assign(newvar = 1)`.

Here, we’re going to assign a new variable that’s a computed value.

Specifically, we’re going to create a new variable called `profit` that equals sales minus expenses. (Finance and accounting geeks will know that this is not a precise way to compute profit, but we’ll use this simplified calculation for purposes of example.)

Let’s run the code, and I’ll explain below.

```python
sales_data.assign(profit = sales_data.sales - sales_data.expenses)
```

OUT:

```
       name region  sales  expenses  profit
0   William   East  50000     42000    8000
1      Emma  North  52000     43000    9000
2     Sofia   East  90000     50000   40000
3    Markus  South  34000     44000  -10000
4    Edward   West  42000     38000    4000
5    Thomas   West  72000     39000   33000
6     Ethan  South  49000     42000    7000
7    Olivia   West  55000     60000   -5000
8      Arun   West  67000     39000   28000
9     Anika   East  65000     44000   21000
10    Paulo  South  67000     45000   22000
```

Here, we created a new computed column called `profit`.

As you can see, `profit` is simply `sales` minus `expenses`.

Notice though, that when we reference the `sales` and `expenses` variables inside of `assign()`, we need to call them as `sales_data.sales` and `sales_data.expenses`.

Alternatively, we could call them as `sales_data['sales']` and `sales_data['expenses']`.

I prefer the former because it’s easier to read, but you can use either.
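There’s also a third option worth knowing about: `assign()` accepts a callable (such as a lambda), and Pandas will call it with the dataframe itself. Here’s a quick sketch using the same column names as `sales_data` (trimmed to two rows for brevity):

```python
import pandas as pd

sales_data = pd.DataFrame({
    "sales":    [50000, 52000],
    "expenses": [42000, 43000],
})

# With a lambda, Pandas passes the dataframe to the function,
# so the calculation doesn't repeat the dataframe's name
result = sales_data.assign(profit = lambda df: df.sales - df.expenses)
print(result.profit.tolist())    # [8000, 9000]
```

The lambda form is especially handy when you chain several methods together, because it refers to the *intermediate* dataframe rather than the original name.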

In the previous two examples, we were adding only one new variable at a time.

Here in this example, we’ll add two variables at the same time.

We’re going to add the `profit` variable and the `company` variable.

Let’s take a look.

```python
sales_data.assign(profit = sales_data.sales - sales_data.expenses
                  ,company = "Vandelay Industries"
                  )
```

OUT:

```
       name region  sales  expenses  profit              company
0   William   East  50000     42000    8000  Vandelay Industries
1      Emma  North  52000     43000    9000  Vandelay Industries
2     Sofia   East  90000     50000   40000  Vandelay Industries
3    Markus  South  34000     44000  -10000  Vandelay Industries
4    Edward   West  42000     38000    4000  Vandelay Industries
5    Thomas   West  72000     39000   33000  Vandelay Industries
6     Ethan  South  49000     42000    7000  Vandelay Industries
7    Olivia   West  55000     60000   -5000  Vandelay Industries
8      Arun   West  67000     39000   28000  Vandelay Industries
9     Anika   East  65000     44000   21000  Vandelay Industries
10    Paulo  South  67000     45000   22000  Vandelay Industries
```

Here in this example, we added two variables at the same time: `profit` and `company`.

Notice that syntactically, I actually put the second variable on a new line of code. This is mostly for readability. If you want, you can keep all of your code on the same line, but I don’t necessarily recommend it. I personally think that your code is much easier to read and debug if each different variable assignment is on a separate line.

That said, the two new variable assignments *must* be separated by a comma. Here, the comma that separates the two variable assignments comes before the assignment of the `company` variable. This is important, so don’t forget the comma.

Finally, let’s do one more example.

Here, we’re going to store the output to a new name.

Notice that in the previous examples, the code did *not* modify the original dataframe.

When we use assign, it produces a *new* dataframe as an output and leaves your original dataframe unchanged. This is very important to remember! Many beginner data science students get frustrated when they first use this technique, because they can’t figure out why their dataframe stays the same, even after they run `assign()`. Always remember: assign produces a *new* dataframe.

Having said that, we can *store* the new output dataframe to a new name.

If we want, we can store it to a new name, like `sales_data_revised`.

Or, we can store it to the original dataframe name, `sales_data`, and overwrite the original!

So it is possible to directly modify your original dataframe, but you need to do it with an equal sign to store the output of the assign method.

Ok, with all that said, let’s look at an example.

Here, we’ll take the output of assign and store it to a new name called `sales_data_revised`.

```python
sales_data_revised = sales_data.assign(profit = sales_data.sales - sales_data.expenses
                                       ,company = "Vandelay Industries"
                                       )
```

Now, the new dataframe is stored in `sales_data_revised`.

Let’s print it out.

```python
print(sales_data_revised)
```

OUT:

```
       name region  sales  expenses  profit              company
0   William   East  50000     42000    8000  Vandelay Industries
1      Emma  North  52000     43000    9000  Vandelay Industries
2     Sofia   East  90000     50000   40000  Vandelay Industries
3    Markus  South  34000     44000  -10000  Vandelay Industries
4    Edward   West  42000     38000    4000  Vandelay Industries
5    Thomas   West  72000     39000   33000  Vandelay Industries
6     Ethan  South  49000     42000    7000  Vandelay Industries
7    Olivia   West  55000     60000   -5000  Vandelay Industries
8      Arun   West  67000     39000   28000  Vandelay Industries
9     Anika   East  65000     44000   21000  Vandelay Industries
10    Paulo  South  67000     45000   22000  Vandelay Industries
```

When we run the code in this example, `assign()` creates a new dataframe with the newly assigned variables, `profit` and `company`.

But instead of letting that new output be passed to the console, we’re storing it with a new name so we can access it later.

Remember: assign produces a *new* dataframe as an output and leaves the original unchanged. If you want to store the output, you need to use the equal sign to pass the output to a new name.

One last comment on this.

You can actually overwrite your original data directly. To do this, just run the assign method and pass the output to the original dataframe name, `sales_data`.

```python
sales_data = sales_data.assign(profit = sales_data.sales - sales_data.expenses
                               ,company = "Vandelay Industries"
                               )
```

This is totally appropriate to do in some circumstances. Sometimes, you really do want to overwrite your data.

But be careful!

Test your code before you do this, otherwise you might overwrite your data with incorrect values!

Let’s very quickly address one common question about the Pandas assign method.

This is a very common question, and the answer is very straightforward.

As I mentioned several times in this tutorial, the assign method returns a *new* dataframe that contains the newly assigned variables, and it leaves your input dataframe unchanged.

If you want to overwrite your dataframe, and add the new variables, you need to take the output and use the equal sign to re-store the output into the original name.

So you need to set `sales_data = sales_data.assign(...)`, like this:

```python
sales_data = sales_data.assign(profit = sales_data.sales - sales_data.expenses
                               ,company = "Vandelay Industries"
                               )
```

Keep in mind that this will overwrite your data! So you need to be very careful when you do this. Test your code and make sure that it’s working *exactly* as expected before you run it. If you don’t, you might overwrite your original data with an incorrect dataset, and you’ll have to re-start your data retrieval and data wrangling from scratch. That’s sometimes a huge pain in the a**, so be careful.

Alternatively, you can store the output of assign with a new name, like this:

```python
sales_data_revised = sales_data.assign(profit = sales_data.sales - sales_data.expenses
                                       ,company = "Vandelay Industries"
                                       )
```

Storing the output with a new name, like `sales_data_revised`, is safer because it doesn’t overwrite the original.

You may actually want to overwrite the original; just make sure that your code works before you do.

Do you have other questions about the assign method?

Leave your questions in the comments section near the bottom of the page.

This tutorial should give you a taste of how to use Pandas to manipulate your data, but there’s a *lot* more to learn.

If you really want to master data wrangling with Pandas, you should join our premium online course, Pandas Mastery.

Pandas Mastery is our online course that will teach you these critical data manipulation tools.

Inside the course, you’ll learn all of the essentials of data manipulation in pandas, like:

- adding new variables
- filtering data by logical conditions
- subsetting data
- working with Pandas indexes
- reshaping data
- and much more …

Additionally, you’ll discover our unique practice system that will enable you to memorize all of the syntax you learn.

And, it will only take a few weeks.

We’re re-opening Pandas Mastery for enrollment next week on November 3.

If you have questions about it, just leave your question in the comments section below.


The post How to Use Numpy Round appeared first on Sharp Sight.

I’ll explain how the function works, and I’ll also show you some step-by-step examples of how to use it.


That said, if you’re relatively new to Numpy, you might want to read the whole tutorial.

Let’s start off with a quick explanation of what the Numpy round function is and what it does.

Numpy round is a function that’s included in the Numpy module for the Python programming language.

Numpy is a module for working with numeric data. Specifically, Numpy works with data organized into a structure called a Numpy array.

A Numpy array has a row-and-column structure, and is filled with numeric data. Here’s an example of a 2-dimensional Numpy array.
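A minimal sketch of such an array (the variable name is just for illustration):

```python
import numpy as np

# A 2-dimensional Numpy array: 2 rows, 3 columns
example_2d = np.array([[1, 2, 3],
                       [4, 5, 6]])
print(example_2d.shape)    # (2, 3)
```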

So Numpy has a variety of functions for *creating* these arrays of numeric data (like Numpy arange, Numpy ones, Numpy randint, etc.), but it also has a variety of functions for *manipulating* these numeric arrays.

Numpy Round is one of the Numpy functions that we use to manipulate Numpy arrays.

The Numpy module has a variety of data manipulation functions.

Some of these functions (like Numpy reshape or Numpy concatenate) deal with reshaping arrays or combining Numpy arrays.

But many Numpy functions perform mathematical operations on Numpy arrays. So we have functions for summing Numpy arrays, calculating exponents on array values, and more.

The Numpy round function is one of these mathematical functions.

A little more specifically: Numpy round rounds numbers.

It can do this with a single input number.

When it operates on a single input value, Numpy round rounds the number to the nearest integer value.

Although you can use np.round on single values, you can also use Numpy round on arrays of numbers.

For example, you could use Numpy round on a 1-dimensional array of numbers. When you do this, Numpy will apply the np.round() function to every element of the array.

In other words, it will round all of the numbers, and the output will be a new Numpy array of the same shape that contains the rounded numbers.
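As a quick sketch of that behavior (hypothetical values, chosen for illustration):

```python
import numpy as np

values = np.array([1.1, 2.7, 4.6])

# np.round is applied to every element of the array;
# the output is a new array with the same shape as the input
rounded = np.round(values)
print(rounded)    # [1. 3. 5.]
```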

Now that we’ve looked at what Numpy round does at a high level, let’s take a look at the syntax.

One quick note before we get started.

Before we use Numpy, we need to import the Numpy package.

How exactly we import Numpy will change the exact form of the syntax.

Among data scientists, the common convention is to import Numpy as `np`, like this:

```python
import numpy as np
```

When we import Numpy like this, we’ll be able to use `np` as a prefix when we call our Numpy functions.

As I said, this is the common convention among most Python users, and we’ll use it in this tutorial.

Ok. The syntax for the round function is fairly simple.

We call the function as `np.round()`.

Then, inside the parentheses, we provide an input.

This input can be a single number (i.e., a Python float) or a Numpy array.

Having said that, let’s take a closer look at the input parameters as well as the output.

The np.round function has two major input parameters, `a` and `decimals`.

Additionally, there is one other parameter called `out`, which is somewhat less commonly used.

Let’s quickly take a look at those.

The `a` parameter enables you to specify the input value or values to the function.

As you’ll see in the examples section, the input value that you provide can be a single number (i.e., a `float`) or a Numpy array. It can also be an array-like object like a Python list.

This is required, in the sense that you need to provide an input to the function.

Having said that, when you use this, you do not need to explicitly type the parameter as `a=`. Python knows that you’re passing an argument to this parameter by position.

The `decimals` parameter enables you to specify the number of decimal places that will be used when the input numbers are rounded.

By default, this parameter is set to `decimals = 0` (i.e., the number is rounded to the nearest integer value).

Technically, you can also specify a negative value for `decimals`. If you do this, it will control the number of positions to the left of the decimal point to which to round the numbers.
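For example, here’s a quick sketch of the negative-decimals behavior (the input values are arbitrary, chosen for illustration):

```python
import numpy as np

# decimals = -2 rounds to the nearest hundred
print(np.round(1234.567, decimals = -2))    # 1200.0
print(np.round(6789.0, decimals = -2))      # 6800.0
```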

The `out` parameter enables you to specify an output array in which to store the output of the function.

This is somewhat rarely used, so we’re not going to cover this in the examples.

The output of np.round is a Numpy array with the same shape as the input. The output will contain the rounded values of the input.

Keep in mind that np.round does *not* change the original array by default. The output is a new array that will be sent to the console (if you’re working in an IDE) or that can be stored with a specific name using the `=` sign.

Ok. Now that we’ve looked at the syntax, let’s look at some examples of Numpy round.


**Examples:**

- Round a value downward [example 1]
- Round 1.5 to nearest integer [example 2]
- Use np.round to round 2.5 to nearest integer [example 3]
- Use np.round on a negative number [example 4]
- Round a number to a specific decimal place [example 5]
- Round the values of a Numpy array [example 6]

One thing before you run any of the examples.

Before you run the example code, you need to import Numpy properly. You can import Numpy by running the following:

```python
import numpy as np
```

This will allow us to call Numpy round with the prefix `np`.

Ok, let’s start simple.

Here, we’re going to round a single floating point number (a decimal value).

```python
np.round(1.1)
```

OUT:

```
1.0
```

This is really straightforward. Here, we’re calling np.round on the value 1.1.

It’s simply rounding the number to the nearest integer value, which is 1.0.

Note that any input with a decimal value less than .5 will round down to the nearest integer value. So 1.2, 1.3, and 1.4 will all round down to 1.0, just like this example here.

But keep in mind, even though it’s rounding to the nearest integer value, the data type of the output is actually a `float`.

This is a subtle point, but it might be important!

Just to make things clear, let’s round another single value.

In the previous example, we rounded the value 1.1, which rounded down to 1.0.

Now, let’s input a value that will round upward.

```python
np.round(1.5)
```

OUT:

```
2.0
```

Here, we’re rounding the value 1.5, which rounds upward to 2.0.

Keep in mind that when you round a value that’s exactly halfway between two values, np.round will round to the nearest *even* value.

In the last example, I noted that values that are exactly halfway between two values will be rounded to the nearest even value. So in the last example, 1.5 rounded upward to 2.0.

Here, let’s round 2.5.

```python
np.round(2.5)
```

OUT:

```
2.0
```

Here, we’re rounding the value 2.5, which rounds *downward* to 2.0.

Again, when you round a value that’s exactly halfway between two values, np.round will round to the nearest *even* value.
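You can see this “round half to even” behavior clearly by rounding an array of halfway values (a quick sketch, with values chosen for illustration):

```python
import numpy as np

# Each value is exactly halfway between two integers;
# np.round sends each one to the nearest EVEN integer
halves = np.array([0.5, 1.5, 2.5, 3.5, 4.5])
print(np.round(halves))    # [0. 2. 2. 4. 4.]
```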

Things get slightly more complicated when we work with negative numbers, but it’s intuitive once you understand the principle at work.

Let’s take a look at the code, and then I’ll explain.

```python
np.round(-1.333)
```

OUT:

```
-1.0
```

Here, np.round rounded to -1.0. Why?

Whenever we use np.round, the function will round the input to *the nearest integer value*.

In this case, the nearest integer is -1.0.

This is really simple if you just remember that Numpy round rounds to the nearest integer.

Next, we’ll round a number to a specific decimal place.

Specifically, we’ll round the number `e` (Euler’s number) to 3 decimal places.

```python
np.round(np.e, decimals = 3)
```

OUT:

```
2.718
```

In Numpy, the value of the constant `np.e` is 2.718281828459045.

When we use np.round with `decimals = 3`, the function rounds the input value to 3 decimal places, which in this case is 2.718.

Let’s do one more example.

Here, we’ll round the values of a Numpy array.

To do this, we’ll create a Numpy array with floating point numbers using the Numpy random uniform function. Numpy random uniform generates floating point numbers randomly from a uniform distribution in a specific range. Here, we’ll draw 6 numbers from the range -10 to 10, and we’ll reshape that array into a 2×3 array using the Numpy reshape method. (Note that we’re also using Numpy random seed to set the seed for the random number generator.)

```python
# CREATE 2D ARRAY
np.random.seed(2)
floats_2d = np.random.uniform(size = 6, low = -10, high = 10).reshape((2,3))
```

And let’s take a look at the numbers in the array:

```python
print(floats_2d)
```

OUT:

```
[[-1.28010196 -9.48147536  0.99324956]
 [-1.29355215 -1.59264396 -3.39330358]]
```

So `floats_2d` contains 6 decimal numbers between -10 and 10, arranged into a Numpy array with 2 rows and 3 columns.

Now, let’s round the numbers.

```python
np.round(floats_2d)
```

OUT:

```
array([[-1., -9.,  1.],
       [-1., -2., -3.]])
```

Here, np.round simply rounded all of the numbers of the input array to the nearest integer.

Notice a few things.

First, the output array has the exact same shape as the input array.

Second, all of the rounded values in the output have the `float` data type. So even though the inputs have been rounded to the nearest integer value, they’re actually being reported as floating point numbers.

Do you have other questions about Numpy round?

Leave your questions in the comments section below.

In this tutorial, I’ve shown you how to use one Numpy function: Numpy round.

The np.round function is pretty easy to understand, once you work with it a little bit.

But other parts of Numpy can be a lot more complicated.

If you’re serious about learning Numpy, and if you’re serious about data science in Python, you should consider joining our premium course called *Numpy Mastery*.

Numpy Mastery will teach you everything you need to know about Numpy, including:

- How to create Numpy arrays
- How to use the Numpy random functions
- What the “Numpy random seed” function does
- How to reshape, split, and combine your Numpy arrays
- How to perform mathematical operations on Numpy arrays
- and more …

Moreover, it will help you completely *master* the syntax within a few weeks. We’ll show you a practice system that will enable you to memorize all of the Numpy syntax you learn. If you have trouble remembering Numpy syntax, this is the course you’ve been looking for.

Find out more here:

Learn More About Numpy Mastery


The post The 3 Reasons You Should Learn R for Data Science appeared first on Sharp Sight.

The short answer is “it depends.”

Both R and Python have strengths and weaknesses as data science languages, or as broader programming languages.

So here in this blog post, I want to explain why you should learn R for data science.

I’ll start off by comparing and contrasting R and Python.

… and then I’ll continue the blog post by explaining why R is great for data science, and the types of people I think should learn R for data science.

Before I explain why I think R is an excellent data science language to learn, let’s first do a quick comparison of R and Python.

R and Python are currently the most common programming languages for data science.

Although both languages are excellent in their own way, each language has strengths and weaknesses.

That being said, I put together a quick-and-dirty table that expresses my personal opinions on the main differences between R and Python. (In the table, 1 is low/bad and 5 is high/good.)

To be clear, this is not really scientific. It’s based on my own opinion and many years of experience as a data professional and as someone who teaches data science. That being the case, there’s probably some room for argument about the exact numbers.

But at a glance, it should help you understand the strengths and weaknesses of both languages.

If you look at the table, you’ll see that R and Python are both good, but they are really excellent in different areas.

Python, in my opinion, is OK at data manipulation (i.e., Pandas) and data visualization (i.e., matplotlib and Seaborn). Python’s Scikit Learn is generally stronger than R for machine learning. And Python – in my opinion – is much better for general programming, software development, and automation. Essentially, if I need to build a system, Python is much better than R.

But even with all of its strengths, I still have not moved entirely to Python.

There are still many instances where I prefer to use R, and they really center around three things.

In the above table, you’ll notice that R scores a ‘5’ in three areas:

- data manipulation
- data visualization
- data analysis

These areas are where R really shines in comparison to Python.

And these 3 strengths translate into 3 reasons why I think R is a great data science language:

- dplyr is better than Pandas for data manipulation
- ggplot2 is better than Seaborn or Matplotlib for data visualization
- data analysis with dplyr + ggplot2 is simple and powerful

These should inform your decision about which language to choose (R or Python).

Later in the blog post, I’ll discuss the types of people I think should learn R instead of Python.

But first, let’s look at each of the 3 reasons I like R in a little more detail.

In my opinion, dplyr is slightly better than Pandas for data manipulation.

Why?

The biggest reason is ease of use.

Both dplyr and Pandas are relatively easy to use. You’ll notice that I gave dplyr a ‘5’ and Pandas a ‘4’ for “ease of use” in the above table.

Both are fairly easy to use, but I give the edge to R’s dplyr.

Concerning syntax, all of the major functions in dplyr are simple and well named. For example, you ‘filter’ rows using the `filter()` function. You ‘select’ columns using the `select()` function. And you ‘rename’ columns with the `rename()` function. In dplyr, the function names are simple and they closely describe what they actually do. Reading and writing dplyr code is almost like using English.

Moreover, you can use a special technique in dplyr that I sometimes call “dplyr chaining” to combine dplyr functions together.

This enables you to create data manipulation pipelines that accomplish complex data manipulations in a simple, linear, step-by-step way. If you’ve struggled with data manipulation in the past, you need to know this technique. It makes data manipulation so much easier.

Although Pandas is also similar, in the sense that all of the functions are well named, the functions are slightly more difficult to remember and the syntax is a little more complex. Not much, but a little. (But to be clear, many people still use “bracket notation” to add variables and manipulate dataframes in Python. This is a terrible practice, because it’s hard to read and hard to use.)

The truth is, if I had to choose, I’d probably choose dplyr over Pandas. I really love using dplyr for data manipulation.

Where R really shines in comparison to Python is in data visualization.

Today, the primary data visualization tool for R is ggplot2.

ggplot2 is simple, easy to use, and extremely powerful.

You can use ggplot2 to make simple data visualizations like scatter plots, bar charts, or line charts.

But you can also use ggplot2 to create intricate, beautiful data visualizations, like choropleth maps.

So you can use ggplot to create simple data visualizations, but you can also use it to create very complex visualizations. It’s very flexible and very powerful.

Now to be fair, ggplot2 has a bit of a learning curve. Some beginners are confused when the first look at the syntax.

But the syntax for ggplot is extremely systematic. Once you understand how the ggplot system works, everything makes so much damn sense.

(For a quick introduction to ggplot2, check out our ggplot2 tutorial for beginners.)

Perhaps the killer feature of R as a data science language is the combination of dplyr plus ggplot2.

As I mentioned previously in the section on dplyr, you can use a special operator called the “pipe operator” to combine together different dplyr functions. That enables you to perform complex data manipulations by combining simple dplyr functions. It’s like combining little building blocks together.

But you can also combine dplyr functions with ggplot2 functions in pipelines.

For example, in a previous blog post, the R data analysis of covid-19 data, we combined together several dplyr functions along with ggplot2 to create a small multiple chart. Specifically, we combined dplyr’s filter, group_by, and summarise with ggplot2’s geom_line and facet_wrap.

```r
covid_data %>%
  filter(country %in% covid_top_12$country) %>%
  group_by(country, date) %>%
  summarise(new_cases = sum(new_cases)) %>%
  ggplot(aes(x = date, y = new_cases)) +
    geom_line() +
    facet_wrap(~country, ncol = 4)
```

OUT: (a small multiple chart: one line-chart panel of new cases over time per country, arranged in 4 columns)

Now, to be fair, we needed to do quite a bit of data wrangling to create our dataset in that analysis, the `covid_data` dataset.

But once we had that data, we were able to use dplyr + ggplot2 to quickly analyze our data. We did that by filtering, aggregating, and summarizing the data with dplyr, and then sending that output to ggplot to visualize it.

This combination of ggplot2 + dplyr is extremely powerful for data analysis.

If you’re struggling with data analysis, dplyr + ggplot2 is arguably the best toolkit, once you learn how to use them correctly.

So if your job primarily involves gathering, wrangling, visualizing, and analyzing data with more than a few thousand rows and you want sophisticated, modern tools, R is arguably the best choice.

One final point.

Because R is excellent at data manipulation, data visualization, and data analysis, I think that R is the best language for “data analytics”.

What’s the difference between data analytics and data science?

There’s not a clear definition here, but this is how I think of it:

If you’re doing low-scale data wrangling and analysis with a small number of rows, and you’re using old-school tools (like Excel), then that’s data analysis.

If you’re doing data wrangling, visualization, and analysis at a moderate-to-large scale, but you’re not doing really advanced work like machine learning or AI, that’s data analytics.

… Data analytics is like data analysis on steroids. Or, data analytics is like data analysis, with modern “power tools” like R. Data analytics is like a subset of data science.

Finally, if you’re doing larger-scale data wrangling, visualization, and analysis … AND you’re doing machine learning and AI, then that’s data science.

Again, there aren’t clear definitions here, but the way I think of it, “data analytics” is a type of very-sophisticated, larger-scale data analysis, with modern power tools.

In my opinion, R is the best programming language for data analytics.

Any time I need to wrangle, visualize, and analyze my data, but I don’t need to do machine learning or software engineering, I strongly prefer to do it in R. This is especially true if I have more than a few thousand rows of data (if there’s less than this, I might do it in Excel). And it’s also especially true if I need more sophisticated tools than what Excel offers.

Ultimately, although Python has its strengths, I think that R is better for data manipulation, data visualization, and data analysis.

So I think that R is the best choice for a few groups of people:

- data analysts who want to improve their skills
- people who want to focus on data visualization
- data science beginners who want to be productive fast

If you fall into one of these categories, and you’re trying to decide which language to learn, you might want to learn R.

Once you know the right packages (like ggplot2, dplyr, and the rest of the Tidyverse) you’ll have a powerful, easy to use toolkit for doing data manipulation, data visualization, and data analysis.

Do you have questions about this?

Are you still uncertain about which language to choose? R or Python?

Leave your questions in the comments section below.

If you’re serious about learning dplyr, ggplot2, and data science in R, you should consider joining our premium course called *Starting Data Science with R*.

Starting Data Science will teach you all of the essentials you need to do data science in R, including:

- How to manipulate your data with dplyr
- How to visualize your data with ggplot2
- Tidyverse helper tools, like tidyr and forcats
- How to analyze your data with ggplot2 + dplyr
- and more …

Moreover, it will help you completely *master* the syntax within a few weeks. We’ll show you a practice system that will enable you to *memorize* all of the R syntax you learn. If you have trouble remembering R syntax, this is the course you’ve been looking for.

Find out more here:

Learn More About Starting Data Science with R

The post The 3 Reasons You Should Learn R for Data Science appeared first on Sharp Sight.

The post A quick introduction to dplyr appeared first on Sharp Sight.

As a data professional, you’ll spend a huge amount of time doing data preparation.

Cleaning, joining, reshaping, aggregating …

These tasks make up a huge amount of your data work. Many data professionals say as much as 80%.

Because data manipulation is so important, it’s something you need to focus on relentlessly.

This is particularly true in the beginning. When you’re first starting to learn data science, you should really focus on two core skills: data manipulation and data visualization.

Because data manipulation is so important, I want to give you a crash course in how to do data manipulation in R.

If you’re doing data science in the R programming language, that means that you should be using dplyr.

If you’re not really familiar with it, dplyr is a data manipulation package for R.

Moreover, dplyr is one of the modules of the so-called “Tidyverse.” The Tidyverse is a collection of R packages for doing data science, which includes `dplyr`, `ggplot2`, `tidyr`, `forcats`, and several others.

Although the packages of the Tidyverse all deal with data science in one way or another, dplyr focuses on data manipulation.

One of the brilliant things about dplyr though is the simplicity.

At its core, dplyr really only has 5 major functions, which we sometimes call “verbs.”

Each of these dplyr verbs does one thing.

Each verb is named in a way that is extremely easy to remember.

And all of the verbs can be combined together to perform more complex data manipulations (which I’ll explain in the section about dplyr “chains”).

Let’s talk about these data manipulation “verbs.”

Dplyr has 5 primary verbs: `filter()`, `select()`, `mutate()`, `arrange()`, and `summarize()`.

These verbs are essentially commands, and each one does one thing.

I’ll show you examples of these in the examples section, but first, let’s quickly look at the syntax.

All of the primary dplyr functions (i.e., verbs) share a similar syntax.

You can use them like this:
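Schematically, the pattern looks like this (with `verb` standing in for any of the five function names):

```
verb(dataframe, ...)
```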

Essentially, you call the function. Inside the parenthesis, the first argument (i.e., the first input) is the name of the dataframe you want to operate on.

Then, after that, there is some syntax that specifies exactly what to do with the dplyr function. This will be different for every dplyr function, so look at the upcoming examples to see exactly how each one works.

Keep in mind though, that this is the *normal* syntax for using the dplyr verbs. There’s also another way to use the dplyr verbs using “pipes”. I’ll explain that in the section on dplyr pipes.

Ok, now that you’ve learned about the general syntax of the dplyr functions, let’s look at some examples.

In this section, we’ll take a look at some simple, yet concrete examples of each of the 5 dplyr verbs.

To be clear: these will not be comprehensive examples. They won’t show you everything that dplyr can do.

But they will get you started by showing you most of the basic functionality.

Before you look at the following examples, you’ll need to run some code first.

Open up R Studio (or whatever IDE you’re using) and run the following:

library(tidyverse)
library(dplyr)
data(starwars)

This will load the tidyverse collection of packages, and will also retrieve our dataset.

Here, we’ll be working with the `starwars` dataset.

Ok. Now that you’ve done that, let’s look at the verbs.

First, we have filter.

The `filter()` verb selects rows of data based on value.

Another way of saying this is that filter selects rows based on logical criteria that involve values.

That might sound complicated, but it’s actually really simple once you understand.

Let’s take a look so I can explain.

Let’s say that we want to identify Star Wars characters who are droids.

We can do that with the filter technique.

filter(starwars, species == 'Droid')

OUT:

name height mass hair_color skin_color eye_color birth_year gender homeworld species
1 C-3PO 167 75 NA gold yellow 112 NA Tatooine Droid
2 R2-D2 96 32 NA white, bl… red 33 NA Naboo Droid
3 R5-D4 97 32 NA white, red red NA NA Tatooine Droid
4 IG-88 200 140 none metal red 15 none NA Droid
5 BB8 NA NA none none black NA none NA Droid

Notice that the output is the rows for droid characters.

Here, we’ve used the dplyr filter function on the `starwars` dataset.

After calling the function, the first argument is the name of the dataframe.

The second argument is a logical condition that specifies which rows we want to retrieve. Look at the code `species == 'Droid'`. Remember that `species` is one of the variables. `'Droid'` is one of the categories in that variable.

So essentially, we’re subsetting our data based on a logical condition.

If the condition `species == 'Droid'` is true for a particular row, then it is returned in the subset. If that condition is false, it is *not* returned.

Keep in mind that this is just one example. You can also subset on multiple variables, numeric variables, and more. There are a variety of types of logical conditions that we can use to subset our data with `filter()`.

For more insight into how to use `filter()`, check out our tutorial on the dplyr filter function.

Next, let’s take a look at the `select()` verb.

Select retrieves columns.

So whereas filter retrieves rows, select retrieves columns. (Remember: in dplyr, every function essentially does only one thing).

When we use select, we retrieve the columns *based on name*.

Here’s an example.

Let’s say that we want to retrieve only 3 columns: `name`, `species`, and `homeworld`.

This is extremely easy to do with `select()`.

We simply call the name of the function, and inside the parenthesis, we provide the name of the dataframe, and the name of each column we want to return.

select(starwars, name, species, homeworld)

OUT:

name species homeworld
1 Luke Skywalker Human Tatooine
2 C-3PO Droid Tatooine
3 R2-D2 Droid Naboo
4 Darth Vader Human Tatooine
5 Leia Organa Human Alderaan
6 Owen Lars Human Tatooine
7 Beru Whitesun lars Human Tatooine
8 R5-D4 Droid Tatooine
9 Biggs Darklighter Human Tatooine
10 Obi-Wan Kenobi Human Stewjon
# … with 77 more rows

Here, the output consists of all of the rows of data, but only 3 columns: name, species, and homeworld.

Notice, syntactically, how simple it is. We simply call the function, provide the name of the dataframe, and then provide the names of the columns we want to return. We don’t even need to enclose the names of the columns in quotation marks. Just provide the name of each column you want to return, separated by commas. Everything is clean and simple.

Also, notice something about the output. The columns are returned in exactly the order that we list them as the arguments to the function.

The mutate function adds new variables to a dataframe.

Again, like the other dplyr functions, mutate is extremely easy to use.

First, we call the name of the function.

Then inside the parenthesis, we first provide the name of the dataframe.

And after that, we provide an expression that defines a new variable name, and how to compute it. (We call this a “name/value pair.”)

Let’s take a look.

The *mass* variable in the dataframe is the mass of the character, in kilograms.

But let’s say that we want to compute the weight, in pounds. We can do that with mutate.

Here, we’re going to use mutate to create a new variable called `weight_lbs`.

mutate(starwars, weight_lbs = mass * 2.2)

OUT:

name height mass hair_color skin_color eye_color birth_year gender homeworld species weight_lbs
1 Luke Skywa… 172 77 blond fair blue 19 male Tatooine Human 169.
2 C-3PO 167 75 NA gold yellow 112 NA Tatooine Droid 165
3 R2-D2 96 32 NA white, bl… red 33 NA Naboo Droid 70.4
4 Darth Vader 202 136 none white yellow 41.9 male Tatooine Human 299.
5 Leia Organa 150 49 brown light brown 19 female Alderaan Human 108.
6 Owen Lars 178 120 brown, grey light blue 52 male Tatooine Human 264
7 Beru White… 165 75 brown light blue 47 female Tatooine Human 165
8 R5-D4 97 32 NA white, red red NA NA Tatooine Droid 70.4
9 Biggs Dark… 183 84 black light brown 24 male Tatooine Human 185.
10 Obi-Wan Ke… 182 77 auburn, wh… fair blue-gray 57 male Stewjon Human 169.
# … with 77 more rows

In the output, the new variable is all the way at the right hand side, so you might need to scroll to see it.

(Note: I removed a few columns so we could view the new column easier.)

Here, we used the `mutate()` function to create a new variable called `weight_lbs`. You can see this new variable at the far right hand side of the output.

How did we do it?

We simply called the mutate function like this:

`mutate(starwars, weight_lbs = mass * 2.2)`

Inside of the parenthesis, we’re specifying that we want to operate on the starwars dataframe.

And specifically, we’re specifying that we want to create a new variable called `weight_lbs` that will be equal to the value in the `mass` variable, times `2.2`.

That’s it. Adding a variable with mutate is just that simple. You can even add multiple new variables by specifying additional name-value pairs, separated by commas.
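For instance, here's a sketch of adding two variables in one call (the second variable, `height_in`, is a hypothetical example, not from the original post):

```r
library(dplyr)
data(starwars)

# Two name-value pairs, separated by a comma:
# weight in pounds, and height in inches (height is stored in cm)
mutate(starwars,
       weight_lbs = mass * 2.2,
       height_in = height / 2.54)
```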

Having said that, in spite of the simplicity here, using mutate can be more complex. This is particularly true when we’re creating a variable that’s based on some complex computation.

That being said, this example should get you started, but for more information, check out our tutorial on the dplyr mutate verb.

Now, let’s look at the arrange function.

The arrange function *sorts* a dataframe.

Like the other dplyr functions, arrange is very easy to use.

We simply call the name of the function. Inside the parenthesis, we specify the dataframe we want to operate on, and then the variable or variables that we want to sort by.

Let’s take a look. Here, we’ll sort by height.

arrange(starwars, height)

OUT:

name height mass hair_color skin_color eye_color birth_year gender homeworld species
1 Yoda 66 17 white green brown 896 male NA Yoda …
2 Ratt… 79 15 none grey, blue unknown NA male Aleen Mi… Aleena
3 Wick… 88 20 brown brown brown 8 male Endor Ewok
4 Dud … 94 45 none blue, grey yellow NA male Vulpter Vulpte…
5 R2-D2 96 32 NA white, bl… red 33 NA Naboo Droid
6 R4-P… 96 NA none silver, r… red, blue NA female NA NA
7 R5-D4 97 32 NA white, red red NA NA Tatooine Droid
8 Sebu… 112 40 none grey, red orange NA male Malastare Dug
9 Gasg… 122 NA none white, bl… black NA male Troiken Xexto
10 Watto 137 NA black blue, grey yellow NA male Toydaria Toydar…
# … with 77 more rows

Here, we’re sorting the data by `height`.

By default, `arrange()` sorts the data in ascending order (notice Yoda at the top).

We can actually sort in descending order by using the `desc()` helper function:

arrange(starwars, desc(height))

Keep in mind that this is a fairly simple example. It's possible to sort in more complex ways, such as by multiple variables.
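For instance, here's a sketch of sorting by multiple variables — first by `species`, then by descending `height` within each species (an illustrative example, not from the original post):

```r
library(dplyr)
data(starwars)

# Sort by species (ascending), then by height (descending) within species
arrange(starwars, species, desc(height))
```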

Finally, let’s look at the `summarize()` function.

Summarize summarizes your data.

For example, if you need to calculate things like a mean, median, count, sum, etc … you can do this with `summarize()`.

Let’s look at an example.

Here, we’ll calculate the average height.

summarise(starwars, mean(height, na.rm = TRUE))

OUT:

`mean(height, na.rm = TRUE)`
1 174.

Ok, this one is a little more complicated, but still pretty easy.

The summarize function *summarizes* your data.

Here, we’re calculating the average height. To do this, we need to use the `mean()` function.

So we’re calling `summarize()`.

Inside the parenthesis, the first argument is the name of the dataframe.

The second argument is where we call the mean function. Here, we’re using `height` as the input to `mean()`, and we’re setting `na.rm = TRUE` to deal with missing values.

Compared to the other dplyr verbs, summarize can be a little more complicated. We typically need to use summarization functions (like mean, median, and others) to get it to work properly. And there are some other options that enable us to change the output, like providing a name for the summarized variable.
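For example, here's a sketch of providing a name for the summarized variable (the name `mean_height` is just a choice for illustration):

```r
library(dplyr)
data(starwars)

# Name the summary column, instead of letting it default
# to the expression `mean(height, na.rm = TRUE)`
summarize(starwars, mean_height = mean(height, na.rm = TRUE))
```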

Still, the summarize function is fairly easy to use.

One quick note about all of these dplyr functions.

All of these functions create *new* dataframes.

What that means is that these functions do *not* directly change the original dataframe.

Typically, the output is sent to the console in RStudio, and is not saved.

If you want to save the output, you need to pass the output to a variable name using the assignment operator.

Here is an example:

starwars_droids <- filter(starwars, species == 'Droid')

Here, we're retrieving the droids from the data using the filter verb, with `species == 'Droid'`.

But notice that we're using the assignment operator (`<-`) to assign the output to the variable name `starwars_droids`. This keeps the original dataframe intact, and saves the new subsetted data to a new variable name.

Alternatively, we could overwrite and update the original dataset, also using the assignment operator:

starwars <- filter(starwars, species == 'Droid')

Be careful with this!

This will overwrite the original `starwars` dataframe with the smaller subset of only droids!

Again, be careful when you're assigning the output to the original variable name. Make sure that your code is working properly, and that you're sure you want to overwrite the original.

Now that you've learned about the 5 primary dplyr verbs, let's quickly talk about dplyr pipes.

In dplyr and the larger Tidyverse (i.e., ggplot2, tidyr, etc) we can use a special operator to chain together multiple commands.

This operator, `%>%`, is typically called the pipe operator (although I think that the term "chaining" makes more sense, and you might see me refer to dplyr "chains").

We can use this operator to combine together multiple dplyr functions in a chain. This enables us to perform much more complicated data manipulations.

Let's quickly look at the syntax for how we use the dplyr pipe.

Notice that when we use a dplyr pipe, the syntax is sort of turned inside-out.

Typically, when we use this technique, the syntax *starts* with the name of the dataframe.

Then we use the pipe operator to "pipe" the dataframe as an input into the dplyr function. (And inside of the dplyr function, the rest of the syntax will work as normal.)

This might sound complicated, but it's really simple once you see it, so let's look at an example.

Let's start by looking at a simple example.

Here, we're going to redo the example from the section on filter.

Previously, we created a subset of data where `species == 'Droid'`.

The code looked like this:

filter(starwars, species == 'Droid')

Now, we're going to rebuild that code and use a dplyr pipe:

starwars %>% filter(species == 'Droid')

Both examples will produce the exact same output.

The only difference between these is that the second uses a dplyr pipe.

So why would we do this?

What's the advantage to using the pipe operator like this?

The advantage is that we can use multiple pipes in a row, such that the output of one dplyr function becomes the input of another dplyr function.

Keep in mind that you can also put the dplyr verbs on different lines ... they don't all need to be on the same line.
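For example, the droid filter from earlier can be written with the verb on its own line, with the `%>%` at the end of the preceding line:

```r
library(dplyr)
data(starwars)

# Same as starwars %>% filter(species == 'Droid'),
# just split across lines
starwars %>%
  filter(species == 'Droid')
```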

This technique is extremely powerful.

Dplyr chaining syntax enables you to create multi-step data manipulations that modify your data in complex ways.

This is really what makes dplyr so much better than almost any other data wrangling toolkit available.

Let's take a look at an example of a multi-step dplyr chain.

We're going to get some basic stats for the characters born on Tatooine, and sort them by age.

To do this, we'll start with the `starwars` dataframe, then we'll use filter to subset down to the characters that were from Tatooine. Then we'll use select to retrieve only a few specific columns, and we'll use arrange to sort the data.

starwars %>%
  filter(homeworld == 'Tatooine') %>%
  select(name, species, height, mass, birth_year) %>%
  arrange(desc(birth_year))

OUT:

name species height mass birth_year
1 C-3PO Droid 167 75 112
2 Cliegg Lars Human 183 NA 82
3 Shmi Skywalker Human 163 NA 72
4 Owen Lars Human 178 120 52
5 Beru Whitesun lars Human 165 75 47
6 Darth Vader Human 202 136 41.9
7 Anakin Skywalker Human 188 84 41.9
8 Biggs Darklighter Human 183 84 24
9 Luke Skywalker Human 172 77 19
10 R5-D4 Droid 97 32 NA

The output gives us a sorted subset of a few important columns.

But notice the syntax.

We combined multiple dplyr functions together using the `%>%` operator.

When we do this, we start with the dataframe and "pipe" it as an input to the first dplyr function (filter, in this case).

Then we can take the output of any given dplyr function and use the `%>%` operator to pipe it into the next one.

Using `%>%`, the output dataframe of one function becomes the input dataframe to the next.

One additional comment about the syntax.

One good practice: when you're reading code that uses the pipe operator, you should read it as "then."

Let's take a look at that chaining code again, but here, I'll add some comments to show how to read it.

starwars %>%                                          # Start with the starwars dataset
  filter(homeworld == 'Tatooine') %>%                 # THEN retrieve the rows where homeworld == 'Tatooine'
  select(name, species, height, mass, birth_year) %>% # THEN retrieve the name, species, height, mass, and birth_year variables
  arrange(desc(birth_year))                           # THEN sort the data in descending order, by birth year

At every step, we can read the code like a series of procedures.

Do something .... then do something else .... then do another thing ... etc.

One of the reasons that this technique is so powerful is that it makes your code so f*cking easy to read.

Part of the power is the combinatorial nature of the technique (which I'll mention in a moment), but a huge part of the benefit is that you can read your code in this top-to-bottom, serial fashion.

What's great about dplyr is that you have a set of simple, easy-to-use functions that can be combined in complex ways using pipes.

This makes your code easy to read, easy to write, and easy to debug.

But moreover, the process of performing data manipulation just becomes a problem of combining little building blocks ... almost like snapping together little LEGO blocks to create a larger structure.

It's brilliant, powerful, and frankly, a joy to use.

Ok. One last thing.

You can use dplyr in combination with ggplot2 (and other functions of the Tidyverse) to do rapid data analysis and exploration.

Again, this is why the Tidyverse system is so powerful. You can combine together multiple simple tools to perform complex operations. Everything snaps together.

Here, we're going to subset our data and then create a bar chart.

Here, we're going to filter the data down to the human characters (and we'll remove the records where `height` equals `NA`).

After filtering the data, we'll use ggplot2 to plot the data.

Notice as you read the code, that we're using dplyr pipes to combine together a couple of dplyr functions. Then we're piping the output of those dplyr subsetting functions into ggplot2!

They work together seamlessly.

starwars %>%
  select(name, gender, height, species) %>%
  filter(species == 'Human') %>%
  filter(!is.na(height)) %>%
  ggplot(aes(y = fct_reorder(name, height), x = height, fill = gender)) +
  geom_bar(stat = 'identity') +
  scale_fill_manual(values = c('maroon', 'navy'))

OUT:

This is not terribly complicated, but it shows some of the power of the dplyr/ggplot2/Tidyverse system.

Here, we're combining together a couple of dplyr functions to subset the data (we could use even more dplyr functions, if we wanted to perform other data manipulations).

Then we're taking the output of the final `filter()` call and piping it into ggplot2 to visualize the data and create a bar chart.

To be clear: this might seem a little complex if you're unfamiliar with these functions. But if you break it down, there are only 6 or 7 techniques (i.e., Tidyverse functions) that we're using here. We're just taking relatively simple tools and combining them together.

Moreover, once you master these tools, you can do much, much more.

Data manipulation is a foundational skill.

If you want to master data science, you *must* become highly proficient in data manipulation.

And if you choose to use R, I recommend that you use dplyr.

There are other data manipulation tools for R, but dplyr is easy to learn, easy to use, and extremely powerful.

Once you master the essential techniques of dplyr, you'll wish you had learned it a lot sooner.

The dplyr examples here are pretty simple and easy to understand.

But other parts of dplyr and the Tidyverse can be a lot more complicated.

If you're serious about learning dplyr and data science in R, you should consider joining our premium course called *Starting Data Science with R*.

Starting Data Science will teach you all of the essentials you need to do data science in R, including:

- How to manipulate your data with dplyr
- How to visualize your data with ggplot2
- Tidyverse helper tools, like tidyr and forcats
- How to analyze your data with ggplot2 + dplyr
- and more ...

Moreover, it will help you completely *master* the syntax within a few weeks. We'll show you a practice system that will enable you to memorize all of the R syntax you learn. If you have trouble remembering R syntax, this is the course you've been looking for.

Find out more here:

Learn More About Starting Data Science with R

The post A quick introduction to dplyr appeared first on Sharp Sight.

The post np.random.randn Explained appeared first on Sharp Sight.

The tutorial is divided up into several different sections, including a quick overview of what the function does, an explanation of the syntax, and a section that shows step-by-step examples.

You can click on any of the following links and it will take you to the appropriate section in the tutorial.

**Table of Contents:**

- Introduction to Numpy Random randn
- The syntax of np.random.randn
- Examples of np.random.randn
- np.random.randn FAQ

Let’s start off with a quick introduction to the Numpy random randn function.

As you probably know, the Numpy random randn function is a function from the Numpy package.

Numpy is a library for the Python programming language for working with numerical data.

As such, the functions from Numpy all deal with either creating Numpy arrays or manipulating Numpy arrays.

Numpy random randn does the former; it creates Numpy arrays (with one simple exception, which we will discuss in example 1).

Numpy random randn creates new Numpy arrays, but the numbers returned have a very specific structure: Numpy random randn returns numbers that are generated randomly from the normal distribution.

Remember that the normal distribution is a continuous probability distribution that has the following probability density function:

(1)   \( f(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{(x - \mu)^2}{2\sigma^2}} \)

Where \( \mu \) is the mean and \( \sigma \) is the standard deviation.

Specifically, np.random.randn generates numbers from the *standard* normal distribution.

The standard normal distribution is a normal distribution that has a mean of 0 and a standard deviation of 1.

So when we set \( \mu = 0 \) and \( \sigma = 1 \) for the standard normal distribution, equation 1 simplifies to the following:

(2)   \( f(x) = \frac{1}{\sqrt{2\pi}} e^{-\frac{x^2}{2}} \)

Essentially, Numpy random randn generates normally distributed numbers from a normal distribution that has a mean of 0 and a standard deviation of 1.
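You can check this empirically by drawing a large sample and looking at its mean and standard deviation (a quick sanity check, not code from the original post):

```python
import numpy as np

np.random.seed(0)

# Draw 100,000 values from the standard normal distribution
samples = np.random.randn(100_000)

# The sample mean should be close to 0, and the
# sample standard deviation should be close to 1
print(samples.mean())
print(samples.std())
```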

So now that you know a little about what np.random.randn does, let’s discuss the syntax.

One quick note before we look at the syntax.

Whenever we import a Python package in our code, we have the option to import with a particular alias.

How exactly we import a module will slightly change how we call the function.

Here in our code, we’ll import Numpy with the alias ‘`np`’ using the following code:

import numpy as np

This is the common convention among Python users.

Because of this import style, we’ll use the prefix ‘`np`’ when we call the function.

Ok, now let’s take a look at the syntax.

When we call the function, assuming that we’ve imported Numpy as discussed above, we can call the function as `np.random.randn()`.

Then, inside the parenthesis, there are a few parameters that we can use.

Let’s take a look at those.

Numpy random randn is actually fairly simple in terms of parameters.

First of all, we can call the function *without* any parameters.

However, if we do choose to use parameters, we simply provide integer arguments to the parameters, which we can call `d0`, `d1`, …, `dn`.

Let’s take a closer look.

`d0` (optional)

If we decide to use the `d0` parameter, we simply provide an integer as the input.

When we do this, that becomes the number of normally distributed values that np.random.randn will generate along axis 0.

(Remember: axes are like directions along a Numpy array. If you’re confused about Numpy array axes, you should read our tutorial about Numpy axes.)

So if we use the code `np.random.randn(3)`, Numpy will generate a new Numpy array with three normally distributed values. You’ll see an example of this in the examples section.

`d1` (optional)

The `d1` parameter does something very similar to `d0`.

Remember, `d0` specifies the number of values in the axis 0 direction.

Similarly, `d1` specifies the number of values in the axis 1 direction.

Keep in mind that we can only use `d1` if we’re already using `d0`.

So when we use `d0` and `d1` (and no additional parameters), we’re essentially telling Numpy to create a 2-dimensional Numpy array, where the number of rows is specified by `d0` (axis 0), and the number of columns is specified by `d1` (axis 1).

If you’re really a Numpy beginner, this might seem confusing. For a 2D array, `d0` controls the rows, but for a 1D array, `d0` seems to control the columns, right?

No.

`d0` always controls the number of elements *in the axis 0 direction*. However, the axis 0 direction appears horizontal for 1D arrays, but appears vertical for 2D arrays.

If you’re confused about this, you really, really need to learn more about Numpy axes, so please read our Numpy axis tutorial.

`dn` (optional)

Beyond `d0` and `d1`, there are actually more parameters for `np.random.randn()`.

All of these additional parameters control the number of elements along a particular axis, for the output array.

So `d2` controls the number of elements along axis 2, `d3` controls the number of elements along axis 3, and so on.

These parameters are completely optional. You’re only going to use them if you need to create Numpy arrays with a larger number of dimensions (i.e., beyond 1D or 2D arrays).
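For example, passing three integers produces a 3-dimensional array (a quick illustration):

```python
import numpy as np

np.random.seed(0)

# d0 = 2, d1 = 3, d2 = 4: a 3D array with 2 elements
# along axis 0, 3 along axis 1, and 4 along axis 2
out = np.random.randn(2, 3, 4)
print(out.shape)  # (2, 3, 4)
```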

The output of Numpy random randn depends on how you call the function.

If you use the parameters (i.e., `d0`, `d1`, …, `dn`), the output will be a Numpy array with dimensions `(d0, d1, ..., dn)`. All of the numbers in the output array will be drawn from the standard normal distribution, as described by equation 2.

However, if you call the function without any parameters – i.e., `np.random.randn()` with nothing inside the parenthesis – then the function will return a single floating point number drawn from the standard normal distribution.

Ok. Let’s take a look at some examples. The syntax and parameters will make a lot more sense once you can play with some code and see how it works.

**Examples:**

- Generate a single number with np.random.randn
- Create a 1D Numpy array with Numpy Random Randn
- Create a 2D Numpy array with Numpy Random Randn

You can click on any of the above links, and they will take you to the appropriate example.

One quick note …

As explained in the section about syntax, how we write the syntax depends partially on how we’ve imported Numpy.

We’re going to import Numpy with the alias ‘`np`’, which you can do with the following code:

import numpy as np

This is the common convention among Python data scientists, and we’ll be using it going forward.

First, let’s just generate a single random normal number with np.random.randn.

Here, we’re going to call the function *without any arguments to the parameters*.

np.random.seed(0)
np.random.randn()

OUT:

1.764052345967664

When we use `np.random.randn()` like this, without any inputs, it simply returns a number that’s drawn randomly from the standard normal distribution.

Keep in mind that in this example, we’ve used the Numpy random seed function as well. By setting np.random.seed(0), we’ll get the same number every single time we run `np.random.randn()`. If we use a different seed (besides 0), we’ll get a different number. And if we don’t use np.random.seed at all, we’ll get a different normally distributed number every time. Essentially, we use Numpy random seed when we want the output of our code to be reproducible.
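You can see this behavior directly by resetting the seed between calls (the variable names `first`, `second`, and `third` are just for illustration):

```python
import numpy as np

# Same seed -> same draw, every time
np.random.seed(0)
first = np.random.randn()

np.random.seed(0)
second = np.random.randn()

# A different seed -> a different draw
np.random.seed(1)
third = np.random.randn()

print(first == second)  # True
print(first == third)   # False
```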

(If you’re confused about this, you need to read our guide to Numpy random seed.)

Next, we’ll create a 1-dimensional array with Numpy random randn.

To do this, we’re going to call `np.random.randn()` with a single argument (i.e., an input to the function).

np.random.seed(0)
np.random.randn(3)

OUT:

array([1.76405235, 0.40015721, 0.97873798])

Here, we used the input value `3` as the argument to the function.

This value is being passed to the `d0` parameter, which controls the number of elements along axis 0.

Here, since we’re only passing a value to `d0` (and not any other parameters), this creates a 1-dimensional array with 3 values.

(Note: we’re using Numpy random seed function for reproducibility. See example 1 for an explanation.)

Finally, let’s create a 2-dimensional numpy array.

To do this, we’ll pass two integer input values to the function.

np.random.seed(0)
np.random.randn(2,3)

OUT:

array([[ 1.76405235,  0.40015721,  0.97873798],
       [ 2.2408932 ,  1.86755799, -0.97727788]])

Notice the shape of the output.

The output array has 2 rows and 3 columns.

That’s because we called the function as `np.random.randn(2,3)`.

The first number, 2, controls the number of elements along axis 0. Remember, for a 2D array, axis 0 is the rows.

The second number, 3, controls the number of elements along axis 1. Remember, for a 2D array, axis 1 is the columns.

If you’re confused about this, go back and re-read the syntax section, which explains the function parameters.

(Note: again, we’re using Numpy random seed function for reproducibility. See example 1 for an explanation.)
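One quick way to confirm which argument controls which axis is to inspect the output array’s `shape` attribute (the variable name `out` is just for illustration):

```python
import numpy as np

np.random.seed(0)
out = np.random.randn(2, 3)

# shape lists the axes in order: axis 0 first, then axis 1
print(out.shape)     # (2, 3)
print(out.shape[0])  # 2 ... elements along axis 0 (the rows)
print(out.shape[1])  # 3 ... elements along axis 1 (the columns)
```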

Now that you’ve seen some examples, let’s quickly discuss one common question about numpy random randn.

These two functions, np.random.randn and np.random.normal, are very similar.

Both functions produce data that’s drawn from a normal distribution.

The major difference is that np.random.randn only draws numbers from the *standard* normal distribution, which has a mean of 0 and a standard deviation of 1.

However, np.random.normal can essentially draw numbers from a normal distribution with any mean and any standard deviation.

Another way of saying this is that np.random.normal allows us to manually set the mean and standard deviation, but for np.random.randn, the mean and standard deviation are strictly set.

So for example, if we manually set `loc = 0` and `scale = 1` for np.random.normal, it will create the same output as np.random.randn (assuming that we set the same shape).

Here’s an example.

Run the code for both of these.

np.random.seed(0)
np.random.randn(2,3)

np.random.seed(0)
np.random.normal(size = (2,3), loc = 0, scale = 1)

You’ll find that they produce the same output.

OUT:

array([[ 1.76405235,  0.40015721,  0.97873798],
       [ 2.2408932 ,  1.86755799, -0.97727788]])

Again, numpy.random.randn and numpy.random.normal both produce numbers drawn from the normal distribution.

The difference is that numpy.random.normal gives you more control over the mean and standard deviation.

Ultimately, numpy.random.randn is like a special case of numpy.random.normal with `loc = 0` and `scale = 1`.
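This relationship also works in the other direction: you can shift and rescale the output of np.random.randn to draw from a normal distribution with any mean and standard deviation. Here’s a sketch (the variable names `shifted` and `direct` are just for illustration; with the same seed, the legacy np.random generator uses the same underlying stream for both calls, so the results should match):

```python
import numpy as np

np.random.seed(0)
# Rescale standard normal draws to get mean 10, standard deviation 2
shifted = 10 + 2 * np.random.randn(2, 3)

np.random.seed(0)
# The equivalent draw, made directly with np.random.normal
direct = np.random.normal(loc=10, scale=2, size=(2, 3))

print(np.allclose(shifted, direct))  # True
```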

Do you have other questions about the Numpy random randn function?

If so, just leave your question in the comments section at the bottom of the page.

The examples here about Numpy random randn are pretty simple and easy to understand.

But other parts of Numpy can be a lot more complicated.

If you’re serious about learning Numpy (and serious about data science in Python), you should consider joining our premium course called *Numpy Mastery*.

Numpy Mastery will teach you everything you need to know about Numpy, including:

- How to create Numpy arrays
- How to use the Numpy random functions
- What the “Numpy random seed” function does
- How to reshape, split, and combine your Numpy arrays
- How to perform mathematical operations on Numpy arrays
- and more …

Moreover, it will help you completely *master* the syntax within a few weeks. We’ll show you a practice system that will enable you to memorize all of the Numpy syntax you learn. If you have trouble remembering Numpy syntax, this is the course you’ve been looking for.

Find out more here:

Learn More About Numpy Mastery

The post np.random.randn Explained appeared first on Sharp Sight.
