The post How to use facet_wrap appeared first on Sharp Sight.

The small multiple design is an incredibly powerful (and underused) data visualization technique.

facet_wrap is great because it enables you to create small multiple charts quickly and effectively.

Having said that, this tutorial will explain exactly how to create small multiple charts with facet_wrap.

First, the tutorial will quickly explain small multiple charts. After that, it will show you the syntax to create small multiple charts with facet_wrap. And finally, the tutorial will show you a few examples, so you can see how the technique works.

The small multiple chart is a chart where a data visualization is repeated in several small panels.

For example, here’s a small multiple chart from the New York Times:

In this example, the map of the United States has been re-created for every year. Each small map (one for every year) is broken out into a separate panel.

Each panel is a “small” version of the overall data visualization technique. So there are multiple small versions of the same type of chart. Multiple versions. Small versions. “Small multiple.” That’s where the name comes from.

Because this design breaks the visualization into separate panels, it is sometimes called the “panel chart.” You might also hear it called a trellis chart.

Small multiple charts are often hard to create. Creating them in Excel is a bit of a pain in the a$$. Many other data visualization tools can’t create them at all.

But creating a small multiple chart is relatively easy in R’s ggplot2.

ggplot2 has two primary techniques for creating small multiple charts: facet_wrap and facet_grid.

The primary difference between facet_wrap and facet_grid is in how they lay out the panels of the small multiple chart.

Essentially, facet_wrap places the first panel in the upper left-hand corner of the small multiple chart. Each successive panel is placed to the right until it reaches the final column of the panel layout. When it reaches the final column of the layout, facet_wrap “wraps” the panels downward to the next row.

So ultimately, facet_wrap lays out the panels like a “ribbon” that wraps around (and downward) from one row to the next.

Creating this sort of small multiple chart is hard in most software. However, it’s rather easy to do in ggplot2 with facet_wrap.

With that in mind, let’s look at how to create this sort of small multiple plot in ggplot2.

Creating small multiple charts is surprisingly easy in ggplot2, once you understand the syntax.

Here, I’m going to quickly review the syntax of ggplot2, and then I’ll explain how to use facet_wrap.

To use facet_wrap and create small multiple charts, you first need to be able to create basic data visualizations with ggplot. That means that you should first have a good understanding of the ggplot2 syntax.

ggplot2 is extremely systematic. Let’s quickly break down the ggplot2 syntax to see how it works.

There are 4 basic parts of a simple data visualization in ggplot2: the `ggplot()` function, the `data` parameter, the `aes()` function, and the geom specification.

Let’s quickly talk about each part.

**The ggplot() function**

The `ggplot()` function is the core function of the ggplot2 data visualization system. When you use this function, you’re basically telling ggplot2 that you’re going to plot something; the `ggplot()` function initiates plotting. But what *exactly* you’re going to create is determined by the other parts of the syntax.

**The data = parameter**

The data that you plot is specified by the `data =` parameter. Remember that ggplot2 is essentially a tool for visualizing *data* in the R programming language. More specifically, ggplot visualizes data that is contained inside of *dataframes*. ggplot2 almost exclusively operates on dataframes.

Having said that, the `data` parameter enables you to specify the dataframe that contains the variables that you want to visualize.

**Geometric objects (e.g., geom_line)**

In the example above, the second line of code has a geom, specifically `geom_line`. You might be asking … “what the hell is a geom?” A geom is something you draw. It’s short for “geometric object.” Once you understand that “geoms” are actually “geometric objects,” they become easier to understand.

“Geoms” (aka, geometric objects) are the things that get drawn in the data visualization: lines, bars, points, tiles, and so on.

Keep in mind that there are dozens of geoms in the ggplot2 system, but all of them are essentially just types of shapes that we can draw in a data visualization.

**The aes() function**

The hardest thing to understand in ggplot2 is the `aes()` function. The `aes()` function enables you to create a set of mappings from data (in your dataframe) to the aesthetic attributes of the plot.

That doesn’t make sense to many people, so let me quickly explain.

Your dataframe has data. It has variables.

The plot that you’re trying to draw has “geoms” … geometric objects.

Those geometric objects have *aesthetic attributes*; things like color and size. Think about it. If you draw a point (a point geom), that point will have *attributes* like the color and size.

Importantly, when we create a data visualization, what we’re doing is connecting the *data* in a dataset to elements in the visualization.

More specifically, we create a “mapping” that connects the variables in a dataset to the aesthetic attributes of the geometric objects that we draw.

Here’s where the `aes()` function comes in.

The `aes()` function is the function that creates those mappings. It creates the mappings between variables in your dataframe (the dataframe that you specify with the `data` parameter) and the aesthetic attributes of the geoms that you draw.

Essentially, the `aes()` function enables you to connect the data to the visuals that your audience can see.

If you do it right, you map the data to the geoms in a way that creates something that’s insightful.
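Put together, those four parts look like the following sketch. (This uses `mpg`, a small example dataset that ships with ggplot2; any dataframe with two numeric variables would work the same way.)

```r
library(ggplot2)

# ggplot() initiates plotting, data = specifies the dataframe,
# aes() maps variables to aesthetic attributes, and the geom
# specifies the shape to draw (here, points)
ggplot(data = mpg, aes(x = displ, y = hwy)) +
  geom_point()
```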

The basic syntax that we just reviewed enables you to make individual charts.

That’s often enough, but sometimes we need more.

In some cases, we need to create many similar charts that are almost exactly the same, but with slight variations.

For example, maybe you want to re-create a bar chart for every year and compare them. Or maybe you want to create two versions of a scatter plot, but for different values of a categorical variable like male/female, so you can compare them side by side.

If you do this manually, creating multiple similar versions of the same chart can be tedious. And sometimes it’s very hard to do. What if you want to create the same chart for every year in your data, and there are 30 years!?

There’s a solution to this.

You can use facet_wrap to create a small multiple chart.

As I mentioned earlier in this tutorial, you can use facet_wrap to create a small multiple chart. A visualization with many small versions of the same chart, arranged in a grid format.

The syntax for this is easy. It starts with the syntax for a basic visualization in ggplot, and then adds the function `facet_wrap()`.

Let’s take a look.
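As a rough sketch of that structure (the dataframe and variable names here are placeholders, not a real dataset):

```r
# "Solo chart" syntax: ggplot() + aes() mapping + a geom ...
ggplot(data = your_dataframe, aes(x = your_variable)) +
  geom_density() +
  # ... then facet_wrap() breaks the chart out into panels
  facet_wrap(~your_faceting_variable)
```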

In a simple example like the syntax above, there are two parts.

First there is the “solo chart.” This is the syntax for creating a data visualization in ggplot2. At minimum, you’ll need to use the `ggplot()` function to initiate plotting. You’ll also need to specify your geom (or geom*s*, if you have a more complicated plot). And you’ll need the `aes()` function to specify your variable mappings. Essentially, you need to have at least all of the pieces of a data visualization.

After that, you use the `facet_wrap()` function to “break out” the solo chart into several small versions of that chart. facet_wrap basically enables you to specify the facets, or panels, of the small multiple design.

Inside of facet_wrap is your faceting variable. This is the specific variable upon which your visualization will be faceted.

Notice the syntax. The variable name is preceded by the tilde symbol, `~`. Typically, the faceting variable itself is a categorical variable (i.e., a factor variable). When we use facet_wrap, it will create one small version of the “solo chart” for every value of your faceting variable.

So if you facet on a variable called `Student`, and that variable has two values, `Yes` and `No`, then the code `facet_wrap(~Student)` will create *two* small versions of your chart: one version for each value of the categorical faceting variable.

Ok. Now that I’ve explained how the syntax works, let’s work through a couple of concrete examples.

Here, I’ll walk you through these examples step by step.

Before you get started, you’ll need to have a few things in place. First, you will need to have installed a few packages: `tidyverse`, `ISLR`, and `nycflights13`. If you’re working in RStudio, you can do that from Tools > Install Packages.

Next, you will need to *load* those packages into your working environment in RStudio. To do that, you’ll need to run the following code:

```r
library(ISLR)
library(tidyverse)
library(nycflights13)
```

Initially, we’re going to be working with the `Credit` dataframe from the `ISLR` package.

Very quickly take a look at the data to see what’s in it. You can do that by printing out the data:

```r
Credit %>%
  as_tibble() %>%
  print()
```

Keep in mind that the `Credit` dataset is a traditional dataframe object (not a “tibble”). Tibbles print out better, so in the above code, I’ve coerced it to a tibble before printing.

In any case, take a look at the data. This is data about bank customers, and you can see not only bank product data (like the customers’ “Balance”), but also information about the customer like their income, education level, gender, and marital status. We’re not going to use everything in this dataset, but it’s a good habit to examine your data so you know what’s in it.

In our first example, we’re going to make a simple small multiple chart using facet_wrap.

This example will be similar to the code that we looked at earlier when I explained the syntax.

Before we actually make the small multiple, let’s first start by creating a “solo” chart with ggplot2. The small multiple that we create later will build on this simple chart.

As I noted above, we’ll be working with the `Credit` dataset from the `ISLR` package. We’re going to plot a density plot of the `Balance` variable.

To do this, we’ll use the following code:

```r
ggplot(data = Credit, aes(x = Balance)) +
  geom_density()
```

And here is the chart that it creates:

Let’s quickly unpack what we did here.

We initiated plotting using the `ggplot()` function.

The data that we are using is the `Credit` dataset from the `ISLR` package. We specified that we would be plotting this data by using the syntax `data = Credit`.

We indicated that we wanted to plot the `Balance` variable by using the code `x = Balance`. This appears inside of the `aes()` function. So essentially, we are mapping the `Balance` variable to the x axis (i.e., the `x` aesthetic).

And finally, we specified the geom that we want to use with the code `geom_density()`. We could also have used a different type of geom. For example, we could have used `geom_histogram()`, which would have made a histogram instead of a density plot.

Combined together, this code creates a single, simple density plot.
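For reference, the histogram alternative mentioned above would look like this (same data and mapping, different geom):

```r
# Same mapping as the density plot, but drawn with a histogram geom
ggplot(data = Credit, aes(x = Balance)) +
  geom_histogram()
```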

Now, let’s break this out into a small multiple plot.

To do this, we’re going to facet on the `Student` variable. The `Student` variable is a categorical variable – a factor variable – that indicates whether or not the customer is a student.

When working with factor variables like this, it can be helpful to inspect them and identify the unique values. Again, you want to inspect your data so you know what’s in it; this will help you know what to expect when you create your charts.

```r
Credit %>%
  as_tibble() %>%
  print()
```

Again, you can see that the `Student` variable is a factor variable.

At a quick glance, it looks like the allowed values are `Yes` and `No`, but let’s confirm. Here, we’ll quickly identify the unique values of the `Student` variable:

```r
Credit %>%
  group_by(Student) %>%
  summarise()
```

The output of this code shows us the two unique values of `Student`, `Yes` and `No`:

Now that we’ve looked at the `Student` variable, we’re a little better prepared to create our small multiple chart with facet_wrap. We’re going to “break out” the simple density chart that we made above into two small panels. There will be two panels, because there are two levels of the `Student` variable.

Let’s do it:

```r
ggplot(data = Credit, aes(x = Balance)) +
  geom_density() +
  facet_wrap(~Student)
```

And here’s the output:

So what do we have here?

If you look at the individual panels, you can see that each panel is a density plot. Why? Because that’s the “solo” chart that we created with ggplot in the first two lines of code. Those first two lines specify that we’ll create a density plot of the `Balance` variable (if you don’t understand this, go back to the earlier section where I explain how to make the “solo” chart).

Additionally, the overall chart is broken out into *two* panels: one panel for “`Yes`” and one panel for “`No`”. These values are the two values of the `Student` variable. Essentially, the third line of code, `facet_wrap(~Student)`, has taken the base density plot and broken it out into two panels; one panel for each value of the `Student` variable.

Now that we’ve reviewed how to make a simple small multiple chart, let’s do something a little more complicated.

Here, we’re going to manipulate the *number of columns* in the grid layout of a small multiple chart.

To show you an example of this, we’ll work with a new dataset: the `weather` dataframe from the `nycflights13` package.

Quickly, let’s take a look at the contents:

```r
print(weather)
```

There are a few good variables that we could work with here, but right now, we’re going to focus on `temp` and `month`.

`temp` is a numeric variable (a double) and `month` is an integer. Having said that, because `month` is an integer variable with only 12 values, it will operate somewhat like a categorical variable. We can therefore use `month` as our faceting variable.

First, let’s just create a density plot of the `temp` variable:

```r
ggplot(data = weather, aes(x = temp)) +
  geom_density()
```

This is a simple density plot. This will serve as the basic “solo” chart that we will break out into multiple panels by using facet_wrap.

Let’s do that.

Here, we’re going to use facet_wrap to create a small version of this density plot for every value of the `month` variable.

```r
ggplot(data = weather, aes(x = temp)) +
  geom_density() +
  facet_wrap(~month)
```

Notice a few things:

First, there are 12 values for the `month` variable. By using the code `facet_wrap(~month)`, we’ve broken out the base density plot into 12 separate panels, one for each month.

Notice also that facet_wrap has laid out the panels like a ribbon. The first panel is in the top left hand corner (month 1), and they are then laid out left to right, top to bottom. Moreover, the panels “wrap” around to a new row in the grid layout when they reach a certain number of panels. Panels `1`, `2`, `3`, and `4` are in the first row of the grid layout, but then panel `5` is in the next row. Panel `5` was “wrapped” downward into the next row of the grid layout.

But how many columns does the grid layout have? By default, `ggplot2` will calculate the number of columns of the layout based on the total number of categories for your faceting variable. There are 12 values for the `month` variable, so ggplot2 calculated that there should be 4 columns in the layout.

That’s the default behavior though. By default, ggplot2 will calculate the number of rows and columns in the layout for you.

That said, you can *change* the default behavior and specify the exact number of rows or columns yourself.

Here, we’re going to manually specify the number of columns in the layout.

To do this, we’re going to use the `ncol` parameter of facet_wrap.

```r
ggplot(data = weather, aes(x = temp)) +
  geom_density() +
  facet_wrap(~month, ncol = 3)
```

If you’ve understood the other examples earlier in this tutorial, this should make sense. The code is almost exactly the same as the code we just used to create a small multiple chart a few paragraphs ago. But now, we’re specifying that we want exactly 3 columns. To do this, we’ve used the code `ncol = 3` inside of facet_wrap.

Keep in mind that you can specify fewer columns or more columns depending on the design that you want to produce. Play around with it and see what you like.

Just like you can specify the number of columns, you can also specify the number of rows.

To specify the number of rows of the grid layout, you can use the `nrow` parameter.

It works almost exactly the same way as the `ncol` parameter, so if you understood the example in the previous section, this should make a lot of sense.

Let’s take a look.

Here, we’re going to make a small multiple chart with 2 rows in the panel layout.

```r
ggplot(data = weather, aes(x = temp)) +
  geom_density() +
  facet_wrap(~month, nrow = 2)
```

This is pretty straightforward. The code `nrow = 2` has forced the grid layout to have 2 rows.

The small multiple chart is one of my favorite data visualization designs.

This visualization layout enables you to make direct comparisons between categories. A great deal of data analysis is just about making comparisons. Faceting enables you to make those comparisons.

Data analysis also requires you to “zoom in” on your data to look at things with more detail. Again, faceting enables you to do this. The small multiple design is perfect for “zooming in” on your data to see new details and find new insights.

This is why I really love facet_wrap.

In most software, creating a small multiple chart is a pain in the a$$. Try to make a small multiple chart in Excel and you’ll see what I mean. It’s possible, but time consuming and error prone.

But in ggplot2, making small multiple charts is *easy*. Just add a line of code that invokes facet_wrap (or facet_grid), and you can turn almost any data visualization into a small multiple chart.

Because of this, I think that facet_wrap is one of the best tools for you to have in your R data visualization toolkit. If you’re serious about doing great work as a data scientist or data analyst in R, I recommend that you master it.

If you’re interested in mastering tools like facet_wrap, and other tools in R, sign up for our email list.

Here at Sharp Sight, we teach data science.

Every week, we publish articles and tutorials about data science …

… specifically, we publish free tutorials about data science in R.

If you sign up for our email list, you’ll get these tutorials delivered right to your inbox.

You’ll learn about:

- ggplot2
- dplyr
- tidyr
- machine learning in R
- … and more.

Want to learn data science in R? Sign up now.


The post How to use geom_line in ggplot2 appeared first on Sharp Sight.

Using geom_line is fairly straightforward if you know ggplot2. But if you’re a relative beginner to ggplot, it can be a little intimidating.

That being said, I’m going to walk you through the syntax step by step.

We’ll first talk about the ggplot syntax at a high level, and then talk about how to make a line chart with ggplot using geom_line.

After I explain how the syntax works, I’ll show you a concrete example of how to use that syntax to create a line chart.

Ok … let’s jump in.

One of the great things about creating data visualizations with ggplot2 is that the syntax is extremely formulaic.

It looks complex to beginners, but once you understand how it works, it is concise, powerful, and flexible. The ggplot2 system is very well designed, and once you get the hang of it, it makes it easy to create beautiful, high-quality charts … especially line charts.

Having said that, let’s talk about the syntax of ggplot2 first, so you understand how it works at a high level.

As I mentioned earlier, ggplot2 is highly systematic. Let’s take a look at the high-level syntactical features of ggplot2, so you understand how the system works.

Let’s quickly discuss the main parts of the ggplot2 syntax.

The `ggplot()` function is the foundation of the ggplot2 system. It essentially initiates the ggplot2 system and tells R that we’re going to plot something.

So when you see the `ggplot()` function, understand that the function will create a chart of some type.

Having said that, the exact *type* of chart is determined by the other parameters.

The `data =` parameter specifies the data that we’re going to plot.

Specifically, the `data =` parameter indicates the *dataframe* that we will be plotting; the dataframe that contains the data we will visualize. To be clear, `ggplot2` works almost exclusively with dataframes. Your data and variables will need to be in a dataframe in order for ggplot2 to operate on them.

The `aes()` function specifies how we want to connect the visual aspects of our chart to the data that’s in our dataframe.

A little more technically, the `aes()` function specifies the *aesthetic mappings* from the data to the chart.

That might sound confusing, so let me explain. When we create a data visualization, we are creating a visual representation of data that exists in a dataset. We are effectively translating from “data space” to “visual space.”

In order to translate from a dataset to visual objects that we can draw and see, we need to *connect* variables in the data to objects in a visualization.

More technically, we need to *map* variables in the data to elements of the plot.

For example, if we are creating a line chart in R, typically we will “map” one variable to the x axis and “map” another variable to the y axis.

The `aes()` function enables us to specify how we want to perform those mappings. It enables us to specify which variables in the data should connect to which parts of the plot. Keep in mind that those “parts” of the plot are technically called the “aesthetic attributes” of the plot. That’s where the name of the function comes from; `aes()` is an abbreviation of “aesthetic attribute.”

The last part of the basic ggplot2 syntax is the geometric object. We often call the geometric objects of a plot “geoms.”

You might be asking, “What the f*$# is a geom?”

It’s not that hard to understand ….

Geometric objects are things that we can draw. Things like bars, points, lines, etc.

So you want to make a scatter plot? You need to plot *point* geoms. Want to make a bar chart? You need to plot *bar* geoms. Want to make a line chart? You need to plot *line* geoms.

The type of geom you select dictates the type of chart you make.

Now that we’ve quickly reviewed ggplot2 syntax, let’s take a look at how geom_line fits in.

Remember what I just wrote: the type of geom you select dictates the type of chart you make.

If you want to make a line chart, typically, you need to use geom_line to do it. (There are a few rare exceptions to this, but this is almost always how you do it.)

So essentially, you need to use geom_line to tell ggplot2 to make a line chart.

This might still seem a little abstract. We’ve talked about the syntax at a high level, but to really understand syntax, it’s almost always best to work with some concrete examples.

Having said that, let’s work through an example so you can see how we structure the syntax. I’ll also explain the syntax, so you know how it works.

In this example, we’re going to plot the stock price of Tesla stock using ggplot2.

I’ve already pulled the stock data (from finance.yahoo.com).

First, we’re going to import the `tidyverse` library. If you’re not familiar with it, the `tidyverse` package is a bundle of other R packages. Specifically, it is a collection of packages related to data manipulation, data visualization, and data science in R. `ggplot2` is one of the packages in the `tidyverse`, so when we load the `tidyverse`, it will automatically load `ggplot2`.

```r
library(tidyverse)
```

Next, let’s load the data. I’ve already downloaded the data and cleaned it up using dplyr, so now we just need to import it using the `read_csv` function. `read_csv` will import the data into an R dataframe.

```r
# IMPORT DATA INTO R
tsla_stock_metrics <- read_csv("https://www.sharpsightlabs.com/datasets/TSLA_start-to-2018-10-26_CLEAN.csv")
```

Quickly, let’s print out the data to take a look:

```r
print(tsla_stock_metrics)
```

Remember … as you’re performing an analysis (big or small), you should consistently inspect your data by doing things like printing out observations.

Ok. Now, let’s create a rough draft of the chart.

```r
ggplot(data = tsla_stock_metrics, aes(x = date, y = close_price)) +
  geom_line()
```

And here’s what it looks like:

Let’s quickly review what we’ve done in this code.

The `ggplot()` function indicates that we’re going to plot something; that we’re going to make a data visualization of some type using the ggplot2 system.

The code `data = tsla_stock_metrics` indicates that we’ll be plotting data that’s contained within the `tsla_stock_metrics` dataframe.

After the `data` parameter, the `aes()` function is specifying our variable mappings. Specifically, we are mapping the `date` variable to the x axis (the `x` aesthetic) and we’re mapping the `close_price` variable to the y axis (the `y` aesthetic).

Finally, we’re using `geom_line()` to indicate that we want to draw *line* geoms. Remember … the type of geom that you use determines the type of chart that you make. Keep in mind that you could actually try different geoms. For fun, consider changing the geom to something else … maybe geom_bar. As an exercise, experiment with changing the geom to see what happens.

Ultimately, this code produces a pretty decent “first draft” line chart. It’s not perfect (we’ll work on this more later in the tutorial), but it’s not bad for a first draft.

This is actually one of the reasons that I love ggplot2. As I’ve said many times in the past, ggplot2 makes excellent “first draft” charts. The charts look pretty good even *without* formatting.

And as another side note, I want to point out that as you’re performing an analysis, this would be your first step. You generally want to make a simple chart that doesn’t have a lot of formatting as your first draft. That’s because a quick-and-dirty chart like this won’t take nearly as much time as a finalized version, but it still conveys quite a bit of information. You can use these rough draft charts early in an analysis, and share them with close team members.

Having said that, in some cases, you will want to ultimately have a chart that is more refined. For example, if you’re creating an analysis that needs to be published or presented to someone important (a client, upper management, etc) you will want to have a chart that looks a little better.

With that in mind, we’ll create one more version of this line chart. We’re going to create a chart that is more formatted.

```r
ggplot(data = tsla_stock_metrics, aes(x = date, y = close_price)) +
  geom_line(color = '#E51837', size = .6) +
  labs(title = 'Tesla stock price from IPO to Oct 2018'
       ,y = 'Close\nPrice'
       ,x = 'Date'
       ,subtitle = str_c("TSLA stock price increased over 10x from Jun 2010 to Oct 2018,\n"
                         ,"but with substantial volatility")
       ) +
  theme(text = element_text(color = "#444444", family = 'Helvetica Neue')
        ,plot.title = element_text(size = 26, color = '#333333')
        ,plot.subtitle = element_text(size = 13)
        ,axis.title = element_text(size = 16, color = '#333333')
        ,axis.title.y = element_text(angle = 0, vjust = .5)
        )
```

And here is the output:

Let me quickly explain this.

The basis for this chart is almost identical to the first rough draft chart. Take a look at the first two lines:

```r
ggplot(data = tsla_stock_metrics, aes(x = date, y = close_price)) +
  geom_line(color = '#E51837', size = .6)
```

This code is almost identical to the initial first draft chart that we made earlier in this tutorial. The major difference in these first two lines is that we modified the color and the size of the line inside of `geom_line()`.

The rest of the code after those first two lines is all formatting code. We used the `labs()` function to add a title and text labels. After the `labs()` function, we used the `theme()` function to format the “non data elements” of the chart. Specifically, we modified the text color, the text size (of the plot title and axis titles), and a few other things.

In this tutorial, I showed you how to use geom_line to make a line chart in ggplot2. I showed you how to make a very simple line chart, but also how to make a more “polished” line chart.

If you want to learn data science in R, you really need to know this technique. In fact, there is a whole set of foundational techniques you should master if you want to be a good data scientist. You should know how to create a bar chart, create a scatter plot, and create histograms. You should know how to filter data, add new variables, and perform a variety of other data visualization and data manipulation tasks.

Learning (and mastering) these skills is not hard … we can show you how.

Here at Sharp Sight, we teach data science.

If you’re interested in data science, sign up for our email list.

Every week, we publish articles and tutorials about data science …

… specifically, we publish free tutorials about data science in R.

If you sign up for our email list, you’ll get these tutorials delivered right to your inbox.

You’ll learn about:

- ggplot2
- dplyr
- tidyr
- machine learning in R
- … and more.

Want to learn data science in R? Sign up now.


The post How to use the NumPy sum function appeared first on Sharp Sight.

In this tutorial, I’ll explain what the function does. I’ll also explain the syntax of the function step by step. Finally, I’ll show you some concrete examples so you can see exactly how np.sum works.

Let’s jump in.

Let’s very quickly talk about what the NumPy sum function does.

Essentially, the NumPy sum function sums up the elements of an array. It just takes the elements within a NumPy array (an `ndarray` object) and adds them together.

Having said that, it can get a little more complicated. It’s possible to also add up the rows or add up the columns of an array. This will produce a new array object (instead of producing a scalar sum of the elements).

Further down in this tutorial, I’ll show you examples of all of these cases, but first, let’s take a look at the syntax of the np.sum function. You need to understand the syntax before you’ll be able to understand specific examples.

Like many of the functions of NumPy, the np.sum function is pretty straightforward syntactically.

We typically call the function using the syntax `np.sum()`. Note that this assumes that you’ve imported NumPy using the code `import numpy as np`.
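For instance, the simplest call just passes in an array and gets back the total of its elements (the array here is a made-up example):

```python
import numpy as np

# Sum all of the elements of a small 1-D array
simple_array = np.array([1, 2, 3, 4])
total = np.sum(simple_array)
print(total)  # 10
```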

Then, inside of the `np.sum()` function, there is a set of parameters that enable you to precisely control the behavior of the function.

Let’s take a look.

The NumPy sum function has several parameters that enable you to control the behavior of the function.

Although technically there are 6 parameters, the ones that you’ll use most often are `a`, `axis`, and `dtype`. I’ve shown those in the image above.

There are also a few others that I’ll briefly describe.

Let’s quickly discuss each parameter and what it does.

**a** (required)

The `a =` parameter specifies the input array that the `sum()` function will operate on. It is essentially the array of elements that you want to sum up. Typically, the argument to this parameter will be a NumPy array (i.e., an `ndarray` object).

Having said that, technically the np.sum function will operate on any *array like object*. That means that in addition to operating on proper NumPy arrays, np.sum will also operate on Python tuples, Python lists, and other structures that are “array like.”
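For instance, here’s a minimal sketch of that behavior (the values are just illustrative):

```python
import numpy as np

# np.sum accepts "array like" objects, not just proper ndarrays
list_sum = np.sum([1, 2, 3])             # a Python list
tuple_sum = np.sum((10, 20, 30))         # a Python tuple
array_sum = np.sum(np.array([1, 2, 3]))  # an ndarray

print(list_sum, tuple_sum, array_sum)  # 6 60 6
```

In every case, the sequence is converted to an array internally before the elements are added together.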

**axis** (optional)

The `axis` parameter specifies the axis or axes upon which the sum will be performed. Does that sound a little confusing? Don’t feel bad. Many people think that array axes are confusing … particularly Python beginners.

I’ll show you some concrete examples below. The examples will clarify what an axis is, but let me very quickly explain.

The simplest case to consider is a 2-dimensional array.

When you’re working with an array, each “dimension” can be thought of as an axis. This is sort of like the Cartesian coordinate system, which has an x-axis and a y-axis. The different “directions” – the dimensions – can be called *axes*.

Array objects have dimensions. For example, in a 2-dimensional NumPy array, the dimensions are the rows and columns. Again, we can call these dimensions, or we can call them *axes*.

Every axis in a numpy array has a number, starting with 0. In this way, they are similar to Python indexes in that they start at 0, not 1.

So the first axis is axis 0. The second axis (in a 2-d array) is axis 1. For multi-dimensional arrays, the third axis is axis 2. And so on.

Critically, you need to remember that axis 0 refers to the rows, and axis 1 refers to the columns.
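To make that concrete, here’s a small sketch (the array is just an example) showing that axis 0 indexes the rows and axis 1 indexes the columns:

```python
import numpy as np

arr_2d = np.array([[0, 2, 4],
                   [1, 3, 5]])

# shape is (length of axis 0, length of axis 1) = (rows, columns)
print(arr_2d.shape)     # (2, 3)
print(arr_2d.shape[0])  # 2 ... the rows, along axis 0
print(arr_2d.shape[1])  # 3 ... the columns, along axis 1
```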

Why is this relevant to the NumPy sum function? It matters because when we use the `axis` parameter, we are specifying an axis along which to sum up the values.

So for example, if we set `axis = 0`, we are indicating that we want to sum up the rows. Remember, axis 0 refers to the row axis.

Likewise, if we set `axis = 1`, we are indicating that we want to sum up the columns. Remember, axis 1 refers to the column axis.

If you’re still confused about this, don’t worry. There is an example further down in this tutorial that will show you how axes work.

**dtype** (optional)

The `dtype` parameter enables you to specify the data type of the output of np.sum. So for example, if you set `dtype = 'int'`, the np.sum function will produce a NumPy array of integers. If you set `dtype = 'float'`, the function will produce a NumPy array of floats as the output.

Python and NumPy have a variety of data types available, so review the documentation to see what the possible arguments are for the `dtype` parameter.

Note as well that the `dtype` parameter is optional.
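Here’s a small sketch of the `dtype` parameter in action:

```python
import numpy as np

int_array = np.array([1, 2, 3])

# request a floating point result instead of the default integer result
float_sum = np.sum(int_array, dtype='float')
print(float_sum)        # 6.0
print(float_sum.dtype)  # float64
```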

**out** (optional)

The `out` parameter enables you to specify an alternative array in which to put the result computed by the np.sum function. Note that the `out` parameter is optional.
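Here’s a quick sketch of how `out` works; note that the output array you supply must already have the correct shape:

```python
import numpy as np

arr = np.array([[1, 2],
                [3, 4]])

# pre-allocate an array with the right shape, then sum into it
result = np.zeros(2, dtype=int)
np.sum(arr, axis=0, out=result)
print(result)  # [4 6]
```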

**keepdims** (optional)

The `keepdims` parameter enables you to keep the dimensions of the output the same as the dimensions of the input. This might sound a little confusing, so think about what np.sum is doing. When NumPy sum operates on an `ndarray`, it’s taking a multi-dimensional object and summarizing the values. It either sums up all of the values, in which case it collapses the array down into a single scalar value, or (if we use the `axis` parameter) it reduces the number of dimensions by summing over one of them. In some sense, we’re summarizing and collapsing the object down.

More technically, we’re reducing the number of dimensions. So by default, when we use the NumPy sum function, the output will have a reduced number of dimensions.

But it’s possible to change that behavior. If we set `keepdims = True`, the axes that are reduced will be kept in the output. So if you use np.sum on a 2-dimensional array and set `keepdims = True`, the output will be in the form of a 2-d array.

Still confused by this? Don’t worry. I’ll show you an example of how `keepdims` works below.

Note that the `keepdims` parameter is optional.

**initial** (optional)

The `initial` parameter enables you to set an initial value for the sum. Note that the `initial` parameter is optional.
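For example (assuming a reasonably recent version of NumPy, since `initial` was added in NumPy 1.15):

```python
import numpy as np

# start the running total at 10 instead of 0
total = np.sum(np.array([1, 2, 3]), initial=10)
print(total)  # 16
```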

Ok, now that we’ve examined the syntax, let’s look at some concrete examples. I think that the best way to learn how a function works is to look at and play with very simple examples.

In these examples, we’re going to be referring to the NumPy module as `np`, so make sure that you run this code:

import numpy as np

Let’s start with the simplest possible example.

We’re going to create a simple 1-dimensional NumPy array using the np.array function.

np_array_1d = np.array([0,2,4,6,8,10])

If we print this out with `print(np_array_1d)`, you can see the contents of this `ndarray`:

[0 2 4 6 8 10]

Now that we have our 1-dimensional array, let’s sum up the values.

Doing this is very simple. We’re going to call the NumPy sum function with the code `np.sum()`. Inside of the function, we’ll specify that we want it to operate on the array that we just created, `np_array_1d`:

np.sum(np_array_1d)

Which will produce the following output:

30

Because np.sum is operating on a 1-dimensional NumPy array, it will just sum up the values. Visually, we can think of it like this:

Notice that we’re not using any of the function parameters here. This is as simple as it gets.

When operating on a 1-d array, np.sum will basically sum up all of the values and produce a single scalar quantity … the sum of the values in the input array.

Next, let’s sum all of the elements in a 2-dimensional NumPy array.

Syntactically, this is almost exactly the same as summing the elements of a 1-d array.

Basically, we’re going to create a 2-dimensional array, and then use the NumPy sum function on that array.

Let’s first create the 2-d array using the np.array function:

np_array_2x3 = np.array([[0,2,4],[1,3,5]])

The resulting array, `np_array_2x3`, is a 2 by 3 array; there are 2 rows and 3 columns.

If we print this out using `print(np_array_2x3)`, you can see the contents:

[[0 2 4]
 [1 3 5]]

Next, we’re going to use the np.sum function to add up all of the elements of the NumPy array.

This is very straightforward. We’re just going to call np.sum, and the only argument will be the name of the array that we’re going to operate on, `np_array_2x3`:

np.sum(np_array_2x3)

When we run the code, it produces the following output:

15

Essentially, the NumPy sum function is adding up all of the values contained within `np_array_2x3`. When you add up all of the values (0, 2, 4, 1, 3, 5), the resulting sum is 15.

This is very straightforward. When you use the NumPy sum function without specifying an axis, it will simply add together all of the values and produce a single scalar value.

Having said that, it’s possible to also use the np.sum function to add up the rows or add the columns.

Let’s take a look at some examples of how to do that.

Here, we’re going to sum the rows of a 2-dimensional NumPy array.

First, let’s just create the array:

np_array_2x3 = np.array([[0,2,4],[1,3,5]])

This is a simple 2-d array with 2 rows and 3 columns.

And if we print this out using `print(np_array_2x3)`, it will produce the following output:

[[0 2 4]
 [1 3 5]]

Next, let’s use the np.sum function to sum the rows.

np.sum(np_array_2x3, axis = 0)

Which produces the following array:

array([1, 5, 9])

So what happened here?

When we use np.sum with the `axis` parameter, the function will sum the values along a particular axis.

In particular, when we use np.sum with `axis = 0`, the function will sum over the 0th axis (the rows). It’s basically summing up the values row-wise, and producing a new array (with lower dimensions).

To understand this, refer back to the explanation of axes earlier in this tutorial. Remember: axes are like directions along a NumPy array. They are the dimensions of the array.

Specifically, axis 0 refers to the rows and axis 1 refers to the columns.

So when we use np.sum and set `axis = 0`, we’re basically saying, “sum the rows.” This is often called a row-wise operation.

Also note that by default, if we use np.sum like this on an n-dimensional NumPy array, the output will have n – 1 dimensions. So in this example, we used np.sum on a 2-d array, and the output is a 1-d array. (For more control over the dimensions of the output array, see the example that explains the `keepdims` parameter.)

Similar to adding the rows, we can also use np.sum to *sum across the columns*.

It works in a very similar way to our prior example, but here we will modify the `axis` parameter and set `axis = 1`.

First, let’s create the array (this is the same array from the prior example, so if you’ve already run that code, you don’t need to run this again):

np_array_2x3 = np.array([[0,2,4],[1,3,5]])

This code produces a simple 2-d array with 2 rows and 3 columns.

And if we print this out using `print(np_array_2x3)`, it will produce the following output:

[[0 2 4]
 [1 3 5]]

Next, we’re going to use the np.sum function to *sum the columns*.

np.sum(np_array_2x3, axis = 1)

Which produces the following array:

array([6, 9])

Essentially, the np.sum function has summed across the columns of the input array.

Visually, you can think of it like this:

Once again, remember: the “axes” refer to the different dimensions of a NumPy array. Axis 0 is the rows and axis 1 is the columns. So when we set the parameter `axis = 1`, we’re telling the np.sum function to operate on the columns only. Specifically, we’re telling the function to *sum up* the values across the columns.

In the last two examples, we used the `axis` parameter to indicate that we want to sum down the rows or sum across the columns.

Notice that when you do this it actually *reduces* the number of dimensions.

You can see that by checking the dimensions of the initial array, and the dimensions of the output of np.sum.

So if we check the `ndim` attribute of `np_array_2x3` (which we created in our prior examples), you’ll see that it is a 2-dimensional array:

np_array_2x3.ndim

Which produces the result `2`. The array `np_array_2x3` is a 2-dimensional array.

Now, let’s use the np.sum function to sum across the columns:

np_array_colsum = np.sum(np_array_2x3, axis = 1)

How many dimensions does the output have? Let’s check the `ndim` attribute:

np_array_colsum.ndim

This produces the following output:

1

What that means is that the output array (`np_array_colsum`) has only 1 dimension. But the original array that we operated on (`np_array_2x3`) has 2 dimensions.

Why?

When we used np.sum with `axis = 1`, the function summed across the columns. Effectively, it collapsed the columns down to a single column!

This is an important point. By default, when we use the `axis` parameter, the np.sum function collapses the n-dimensional input and produces an output with fewer dimensions.

The problem is, there may be situations where you want to keep the number of dimensions the same. If your input is n dimensions, you may want the output to also be n dimensions.

You can get this behavior by using the `keepdims` parameter.

Here’s an example. We’re going to use np.sum to add up the columns by setting `axis = 1`. But we’re also going to use the `keepdims` parameter to keep the dimensions of the output the same as the dimensions of the input:

np_array_colsum_keepdim = np.sum(np_array_2x3, axis = 1, keepdims = True)

If you take a look at the `ndim` attribute of the output array, you can see that it has 2 dimensions:

np_array_colsum_keepdim.ndim

This will produce the following:

2

`np_array_colsum_keepdim` has 2 dimensions. It has the same number of dimensions as the input array, `np_array_2x3`.

To understand this better, you can also print the output array with the code `print(np_array_colsum_keepdim)`, which produces the following output:

[[6]
 [9]]

Essentially, `np_array_colsum_keepdim` is a 2-d NumPy array organized into a single column.

This is a little subtle if you’re not well versed in array shapes, so to develop your intuition, print out the array `np_array_colsum`. Remember, when we created `np_array_colsum`, we did __not__ use `keepdims`:

print(np_array_colsum)

Here’s the output of the `print` statement:

[6 9]

Do you see that the structure is different?

When we use np.sum on an axis *without* the `keepdims` parameter, it collapses at least one of the axes. But when we set `keepdims = True`, np.sum will produce a result with the same number of dimensions as the original input array.

Again, this is a little subtle. To understand it, you really need to understand the basics of NumPy arrays, NumPy shapes, and NumPy axes. So if you’re a little confused, make sure that you study the basics of NumPy arrays … it will make it much easier to understand the `keepdims` parameter.

If you want to learn data science in Python, it’s important that you learn and master NumPy.

NumPy is critical for many data science projects.

In particular, it has many applications in machine learning projects and deep learning projects.

So if you’re interested in data science, machine learning, and deep learning in Python, make sure you master NumPy.

Here at Sharp Sight, we teach data science.

Here at the Sharp Sight blog, we regularly post tutorials about a variety of data science topics … in particular, about NumPy.

If you want to learn NumPy and data science in Python, sign up for our email list.

If you sign up for our email list, you’ll receive Python data science tutorials delivered to your inbox.

You’ll get free tutorials on:

- NumPy
- Pandas
- Base Python
- Scikit learn
- Machine learning
- Deep learning
- … and more.

Want to learn data science in Python? Sign up now.


The post How to use the Numpy append function appeared first on Sharp Sight.

Here, I’ll explain what the function does. I’ll explain the syntax (piece by piece), and I’ll show you some step-by-step examples so you can see exactly how np.append works.

Let’s get to it.

The NumPy append function enables you to append new values to an existing NumPy array.

Other tutorials here at Sharp Sight have shown you ways to create a NumPy array. You can create one from a list using the np.array function. You can use the zeros function to create a NumPy array with all zeros. You can use the NumPy arange function to create NumPy arrays as sequences of regularly spaced values. All of those methodologies enable you to create a *new* NumPy array.

But often times, you’ll have an existing array and you need to add new elements. To do that, none of those functions will do. You need a new tool.

Enter the np.append function.

Let’s take a look at the syntax of the np.append function.

Much like the other functions from NumPy, the syntax is fairly straightforward and easy to understand. Let’s break it down.

Typically, we call the function using the syntax `np.append()`. Keep in mind that this assumes that you’ve imported the NumPy module with the code `import numpy as np`.

Once you call the function itself – like all NumPy functions – there are a set of parameters that enable you to precisely control the behavior of the append function.

Let’s take a look at the parameters of NumPy append.

**arr** (required)

The `arr` parameter specifies the base array to which you will append the new values. Said differently, it’s the array that the new values will be attached to. (Note that np.append does not modify `arr` in place; it returns a new array.)

**values** (required)

The `values` parameter specifies the values that you want to append to the base array (i.e., the values you will append to the array specified in the `arr` parameter).

The values that you specify here can be presented as a list of literal values (i.e., `[1, 2, 3]`), or you can specify an `ndarray` object by providing the name of the NumPy array.
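A quick sketch showing both forms (the values are just illustrative):

```python
import numpy as np

base = np.array([1, 2, 3])

# values passed as a plain Python list ...
appended_from_list = np.append(base, [4, 5])

# ... or as another ndarray; the result is the same
appended_from_array = np.append(base, np.array([4, 5]))

print(appended_from_list)   # [1 2 3 4 5]
print(appended_from_array)  # [1 2 3 4 5]
```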

**axis** (optional)

The `axis` parameter specifies the axis along which you will append the new values to the original array. By default, `axis = None`. If you specify a value, you will set axis to `0` or `1`.

Now at this point, you might be asking … “what the hell is an axis?”

I’ll be honest. Axes in the NumPy system are one of the hardest things for most beginners to understand. It’s not that hard once they are explained, but array axes are not intuitive (at least, they aren’t intuitive the way they’ve been implemented in NumPy).

Axes are best explained with examples, so further down in this tutorial, I’ll show you exactly what the array axes are and how to think of them with respect to this syntax.

Now that we’ve examined the syntax at a high level, let’s take a look at some simple examples.

(By the way, when you’re learning *any* new syntax, the best way to master it is by studying and practicing simple examples.)

In the following examples, we’re going to be referring to the NumPy module as `np`, so make sure that you run this code:

import numpy as np

First, we’ll work with a very simple example.

Here, we’re going to append values to the end of a 1-dimensional array. This is much simpler than some other examples, because in this case, we don’t have to specify the `axis`.

First, let’s just create a simple, 1-dimensional array filled with ones.

base_array_1d = np.ones(3, dtype = int)

Essentially, this creates a 1-d NumPy array that contains three ones. If we printed this out with the code `print(base_array_1d)`, we would see the following contents:

[1 1 1]

Now, let’s append 3 new values to the end of this array. To do this, we’ll use the NumPy append function.

np.append(base_array_1d, [6,7,8])

Which produces the following output:

array([1, 1, 1, 6, 7, 8])

Visually though, we can think of this operation as follows:

The np.append function is basically taking the new values (`[6, 7, 8]`) and attaching them to the end of the original array.

It’s pretty straightforward.

Now, let’s append values to a 2-dimensional array.

There are a couple ways to do this. Importantly, you can append new values as a new row, or a new column, so to speak. Additionally, you can append new values __without__ specifying whether it should be a row or column. That’s actually the simplest to do, so we’ll look at that first.

First, let’s create a simple 2 dimensional array:

base_array_2x2 = np.ones(shape = (2, 2), dtype = int)

This code creates a simple 2 by 2 array filled with ones that looks like this:

[[1 1]
 [1 1]]

Now that we have a base array to work with, let’s append two values. We’re going to use np.append to append two new values to this array. However, we are not going to specify *where* to add them. That is, we are not going to use the `axis` parameter to specify whether we will add the values as a new row or a new column.

np.append(base_array_2x2, [6,7])

Remember that the array `base_array_2x2` is 2-dimensional. Also, notice that we did not use the `axis` parameter here to specify exactly where to add these new values.

How will NumPy append handle this?

If you have a multi-dimensional array and you do not specify an axis with the `axis` parameter, np.append will *flatten* the original array first. That is, it will transform the array from a multi-dimensional array to a 1-dimensional array.

Once the array is flattened out, it will simply append the new values to the end.
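You can verify this flattening behavior with a short sketch:

```python
import numpy as np

base_array_2x2 = np.ones(shape=(2, 2), dtype=int)

# no axis specified: np.append flattens the 2x2 array to 1-d first,
# then attaches the new values to the end
flattened = np.append(base_array_2x2, [6, 7])
print(flattened)       # [1 1 1 1 6 7]
print(flattened.ndim)  # 1
```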

This is often *not* what people want when they try to append new values to a multi-dimensional NumPy array, so you need to be careful.

If you want the base array to maintain its original shape, you need to use the `axis` parameter of np.append to specify exactly where and how to attach the new values.

Let’s take a look at some examples of how to do that.

As we just saw, appending new values to a NumPy array gets a little more complicated when you’re working with multi-dimensional arrays. In particular, things get more complicated when you want to add new values specifically as new rows or columns.

If you want to append new values as a row (or a column), then you have to use the `axis` parameter of the NumPy append function. There are also a couple other details that you need to be mindful of, otherwise you’ll get an error.

Let’s take a look at an example, so you can see what I mean.

Here, we’re going to append two new values as a new row of data. That is, we’re going to append the values to the bottom of a 2-d NumPy array.

First, let’s create our 2-d NumPy array:

base_array_2x2 = np.ones(shape = (2, 2), dtype = int)

This is a very simple 2-dimensional array that contains all 1’s:

[[1 1]
 [1 1]]

Next, we’re going to add two new values to the bottom of the array:

np.append(base_array_2x2, [[8, 8]], axis = 0)

Notice that in order to do this, we needed to use the `axis` parameter. Specifically, we set `axis = 0`.

I’m going to be honest. Array axes are one of the more challenging and un-intuitive things in NumPy. I’ll probably write a blog post to explain them at some point in the future.

Having said that, you need to remember that to add the values to the bottom of an array (i.e., as a new row of data), you need to set `axis = 0`.

There’s also something else that you need to pay attention to.

Critically, when you use the `axis` parameter to append new values to an existing NumPy array, the new values *must have the right dimensions*. So if your original array is a 2-dimensional array, the new values that you’re appending must also be structured as a 2-d array.

If the new values are not structured properly, you’ll get an error. For example:

np.append(base_array_2x2, [8, 8], axis = 0)

This code produces the following error:

ValueError: all the input arrays must have same number of dimensions

Why? WTF is going on here?

Look very carefully at the code. The new values that we’re trying to append are structured as a 1-d structure. You can tell because they are only enclosed by single brackets: `[8, 8]`. NumPy append is basically treating this as a *1-d* array of values, and it’s trying to append it to a pre-existing *2-d* NumPy array. The dimensions *do not match*.

To get this to work properly, the new values must be structured as a 2-d array. In other words, the new values need to be passed to the append() function as a list-of-lists: `[[8, 8]]`. This is a little subtle if you’re a beginner, so pay careful attention. The values here are enclosed by two sets of brackets: `[[8, 8]]`. The np.append function will treat this as a 2-d array (instead of a 1-d array). That’s why the code `np.append(base_array_2x2, [[8, 8]], axis = 0)` works, but `np.append(base_array_2x2, [8, 8], axis = 0)` doesn’t.

Essentially, when you’re appending values like this, you need to watch the number of brackets. The number of brackets dictates the number of dimensions of your new values … and np.append expects the new values to have the same number of dimensions as the original array.

Now let’s append new values as a new *column*.

This works in a way that’s very similar to our prior example of adding values as a new row. Having said that, make sure you’ve read the prior example before trying this … the principles are the same, so you need to understand what was written in the prior example.

Ok. Here, we’re going to create a simple 2-d array containing all 1’s.

base_array_2x2 = np.ones(shape = (2, 2), dtype = int)

Next, we’re going to create another NumPy array that has 2 rows and 1 column. To do this, we’re going to create the array with the np.array function, and then reshape the array using np.reshape.

new_array_2x1 = np.array([9, 9]).reshape(2, 1)

We need this new array to be shaped in a particular way, because when we append new values to a multi-dimensional array, the array *dimensions must match*. (If you don’t understand this, please review the previous section.) In this case, it’s easier to create an array with the right dimensions by using np.array along with the reshape method.

Ok, now that we have two arrays with the right dimensions, we will append the new array to the base array using the np.append function:

np.append(base_array_2x2, new_array_2x1, axis = 1)

Notice here that we’re using the `axis` parameter again. Specifically, we’re setting `axis = 1`. Essentially, this indicates that we want to append the new values to the base array as a *new column*.

Once again, I’ll point out that the `axis` parameter can be a little confusing, especially for beginners. I recommend that you just memorize which is which.

When you use `axis = 1`, NumPy append will add the new values as a *column*. When you use `axis = 0`, NumPy append will add the new values as a *row*.
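Putting the two side by side (this just re-runs the examples from above and checks the resulting shapes):

```python
import numpy as np

base_array_2x2 = np.ones(shape=(2, 2), dtype=int)

# axis = 0: append [[8, 8]] as a new row
as_row = np.append(base_array_2x2, [[8, 8]], axis=0)
print(as_row.shape)  # (3, 2): one extra row

# axis = 1: append a 2x1 array as a new column
as_column = np.append(base_array_2x2, np.array([9, 9]).reshape(2, 1), axis=1)
print(as_column.shape)  # (2, 3): one extra column
```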

In general, NumPy is important for data science in Python.

In particular, many of the tools and libraries for data science in Python either use or are built on top of NumPy.

For example, the Pandas library is built on top of NumPy.

Moreover, there are important uses of NumPy in both machine learning and deep learning.

That being said, if you want to learn and master data science in Python, sign up for our email list.

Here at Sharp Sight, we teach data science. We want to help you master data science as fast as possible.

If you sign up for our email list, you’ll receive Python data science tutorials delivered to your inbox.

You’ll get free tutorials on:

- NumPy
- Pandas
- Base Python
- Scikit learn
- Machine learning
- Deep learning
- … and more.

Want to learn data science in Python? Sign up now.


The post How to use the NumPy linspace function appeared first on Sharp Sight.

The NumPy linspace function is somewhat similar to the NumPy arange function, in that it creates sequences of evenly spaced numbers structured as a NumPy array.

There are some differences though. Moreover, some people find the linspace function to be a little tricky to use.

It’s not that hard to understand, but you really need to learn how it works.

That being said, this tutorial will explain how the NumPy linspace function works. It will explain the syntax, and it will also show you concrete examples of the function so you can see it in action.

Near the bottom of the post, I’ll also explain a little more about how np.linspace differs from np.arange.

Ok, first things first. Let’s look a little more closely at what the np.linspace function does and how it works.

The NumPy linspace function creates sequences of evenly spaced values within a defined interval.

Essentially, you specify a starting point and an ending point of an interval, and then specify the total number of breakpoints you want within that interval (*including* the start and end points). The np.linspace function will return a sequence of evenly spaced values on that interval.

To illustrate this, here’s a quick example. (We’ll look at more examples later, but this is a quick one just to show you what np.linspace does.)

np.linspace(start = 0, stop = 100, num = 5)

This code produces a NumPy array (an `ndarray` object) that looks like the following:

array([  0.,  25.,  50.,  75., 100.])

That’s the `ndarray` that the code produces, but we can also visualize the output like this:

So what’s going on here?

Remember: the NumPy linspace function produces a sequence of *evenly spaced observations* within a defined interval.

We specified that interval with the `start` and `stop` parameters. In particular, this interval starts at 0 and ends at 100.

We also specified that we wanted 5 observations within that range. So, the linspace function returned an `ndarray` with 5 evenly spaced elements. The first element is 0. The last element is 100. The remaining 3 elements are evenly spaced between 0 and 100.

As should be expected, the output array is consistent with the arguments we’ve used in the syntax.
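In fact, with the endpoint included, the spacing between consecutive values works out to (stop − start) / (num − 1); here, that’s (100 − 0) / (5 − 1) = 25. A quick check:

```python
import numpy as np

result = np.linspace(start=0, stop=100, num=5)
print(result)  # 0., 25., 50., 75., 100.

# step between consecutive values: (stop - start) / (num - 1)
step = (100 - 0) / (5 - 1)
print(step)             # 25.0
print(np.diff(result))  # every gap is 25.
```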

Having said that, let’s look a little more closely at the syntax of the np.linspace function so you can understand how it works a little more clearly.

The syntax of the NumPy linspace is very straightforward.

Obviously, when using the function, the first thing you need to do is call the function name itself:

To do this, you use the code `np.linspace` (assuming that you’ve imported NumPy as `np`).

Inside of the `np.linspace` code above, you’ll notice 3 parameters: `start`, `stop`, and `num`. These are the 3 parameters that you’ll use most frequently with the linspace function. There are also a few other optional parameters that you can use.

Let’s talk about the parameters of np.linspace:

There are several parameters that help you control the `linspace` function: `start`, `stop`, `num`, `endpoint`, and `dtype`.

To understand these parameters, let’s take a look again at the following visual:

**start** (required)

The `start` parameter is the beginning of the range of numbers.

So if you set `start = 0`, the first number in the new `ndarray` will be 0.

Keep in mind that this parameter is required.

**stop** (required)

The `stop` parameter is the stopping point of the range of numbers.

In most cases, this will be the last value in the range of numbers. Having said that, if you set the `endpoint` parameter to `False`, this value will *not* be included in the output array. (See the examples below to understand how this works.)

**num** (optional)

The `num` parameter controls how many total items will appear in the output array. For example, if `num = 5`, then there will be 5 total items in the output array. If `num = 10`, then there will be 10 total items in the output array, and so on.

This parameter is optional. If you don’t provide a value for `num`, then np.linspace will use `num = 50` as a default.
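You can confirm the default with a one-liner:

```python
import numpy as np

# no num argument: np.linspace falls back to 50 evenly spaced values
default_output = np.linspace(start=0, stop=1)
print(len(default_output))  # 50
```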

**endpoint** (optional)

The `endpoint` parameter controls whether or not the `stop` value is included in the output array. If `endpoint = True`, then the value of the `stop` parameter will be included as the last item in the `ndarray`.

If `endpoint = False`, then the value of the `stop` parameter will **not** be included.

By default, `endpoint` evaluates as `True`.

**dtype** (optional)

Just like in many other NumPy functions, the `dtype` parameter of `np.linspace` controls the data type of the items in the output array. If you don’t specify a data type, Python will *infer* the data type based on the values of the other parameters.

If you do explicitly use this parameter, however, you can use any of the available data types from NumPy and base Python.
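For example, here’s a small sketch comparing the inferred data type to an explicitly requested one:

```python
import numpy as np

# by default, np.linspace infers a floating point data type here
floats = np.linspace(start=0, stop=100, num=5)
print(floats.dtype)  # float64

# explicitly request integers instead
ints = np.linspace(start=0, stop=100, num=5, dtype=int)
print(ints)  # 0, 25, 50, 75, 100 as integers
```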

Keep in mind that you won’t use all of these parameters every time that you use the np.linspace function. Several of these parameters are optional.

Moreover, `start`, `stop`, and `num` are much more commonly used than `endpoint` and `dtype`.

Also keep in mind that you don’t need to explicitly use the parameter names. You can write code *without* the parameter names themselves; you can add the arguments as “positional arguments” to the function.

Here’s an example:

np.linspace(0, 100, 5)

This code is functionally identical to the code we used in our previous examples: `np.linspace(start = 0, stop = 100, num = 5)`.

The main difference is that we did not explicitly use the `start`, `stop`, and `num` parameters. Instead, we provided arguments to those parameters by *position*. When you don’t use the parameter names explicitly, Python knows that the first number (0) is supposed to be the `start` of the interval. It knows that 100 is supposed to be the `stop`. And it knows that the third number (5) corresponds to the `num` parameter. Again, when you don’t explicitly use the parameter names, Python assigns the argument values to parameters *strictly by position*: whichever value appears first, second, third, etc.

You’ll see people do this frequently in their code. People will commonly exclude the parameter names in their code and use positional arguments instead. Although I realize that it’s a little faster to write code with positional arguments, I think that it’s clearer to actually use the parameter names. As a best practice, you should probably use them.

Now that you’ve learned how the syntax works, and you’ve learned about each of the parameters, let’s work through a few concrete examples.

A quick example

np.linspace(start = 0, stop = 1, num = 11)

Which produces the output array:

array([ 0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ])

An example like this would be useful if you’re working with percents in some way. For example, if you were plotting percentages or plotting “accuracy” metrics for a machine learning classifier, you might use this code to construct part of your plot. Explaining how to do that is beyond the scope of this post, so I’ll leave a deeper explanation of that for a future blog post.

A very similar example is creating a range of values from 0 to 100, in breaks of 10.

np.linspace(start = 0, stop = 100, num = 11)

The code for this is almost identical to the prior example, except we’re creating values from 0 to 100.

Since it’s somewhat common to work with data with a range from 0 to 100, a code snippet like this might be useful.
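For reference, running that call produces 11 values in increments of 10:

```python
import numpy as np

# 11 evenly spaced values from 0 to 100, i.e., steps of 10
values = np.linspace(start=0, stop=100, num=11)
print(values)  # 0., 10., 20., ..., 100.
```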

As mentioned earlier in this blog post, the `endpoint` parameter controls whether or not the `stop` value is included in the output array.

If you don’t set any value for `endpoint`, it defaults to `True`. That means that the value of the `stop` parameter *will* be included in the output array (as the final value).

However, if you set `endpoint = False`, then the value of the `stop` parameter will *not* be included.

Here’s an example.

In the following code, `stop` is set to 5.

np.linspace(start = 1, stop = 5, num = 4, endpoint = False)

But because we’re also setting `endpoint = False`, 5 will *not* be included as the final value.

Instead, the output `ndarray` contains 4 evenly spaced values (i.e., `num = 4`), starting at 1, going up to but *excluding* 5:

array([ 1., 2., 3., 4.])

Personally, I find `endpoint = False` a little unintuitive, so I don’t use it often. But if you have a reason to use it, this is how to do it.

As mentioned earlier, the NumPy linspace function will “infer” the data type from the other input arguments. You’ll notice that in many cases, the output is an array of floats.

If you want to manually specify the data type, you can use the `dtype` parameter.

This is very straightforward. Using the `dtype` parameter with np.linspace is identical to how you specify the data type with np.array, np.arange, and other NumPy functions.

Essentially, you use the `dtype` parameter and indicate the exact Python or NumPy data type that you want for the output array:

np.linspace(start = 0, stop = 100, num = 5, dtype = int)

In this case, when we set `dtype = int`, the linspace function produces an `ndarray` object with *integers* instead of floats.

Again, Python and NumPy have a variety of available data types, and you can specify any of these with the `dtype` parameter.
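As a quick check, here’s the `dtype = int` call from above alongside a NumPy-specific type (`np.float32`), which you pass in exactly the same way:

```python
import numpy as np

# dtype=int produces integer values instead of floats
ints = np.linspace(start=0, stop=100, num=5, dtype=int)
print(ints)        # [  0  25  50  75 100]
print(ints.dtype)  # int64 on most systems (platform-dependent)

# NumPy-specific types work the same way
floats32 = np.linspace(start=0, stop=100, num=5, dtype=np.float32)
print(floats32.dtype)  # float32
```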

If you’re familiar with NumPy, you might have noticed that np.linspace is rather similar to the np.arange function.

The essential difference between NumPy linspace and NumPy arange is that linspace enables you to control the precise end value, whereas arange gives you more direct control over the increments between values in the sequence.

To be clear, if you use them carefully, both linspace and arange can be used to create evenly spaced sequences. To a large extent, these are two similar tools for creating sequences, and which you use will be a matter of preference. I personally find np.arange to be more intuitive, so I tend to prefer arange over linspace. Again though, this will mostly be a matter of preference, so try them both and see which you prefer.
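To see the difference concretely, the two calls below produce the same sequence of values, but linspace is parameterized by the *number of values* you want, while arange is parameterized by the *step size* between values:

```python
import numpy as np

# linspace: you specify how many values you want (6); the endpoint is included
by_count = np.linspace(start=0, stop=10, num=6)

# arange: you specify the step (2); the stop value itself is excluded,
# so we pass 11 to make sure 10 is included
by_step = np.arange(start=0, stop=11, step=2)

print(by_count)  # [ 0.  2.  4.  6.  8. 10.]
print(by_step)   # [ 0  2  4  6  8 10]
```

One small difference to note: here linspace returns floats while arange returns integers, since arange infers its dtype from the integer arguments.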

Here at Sharp Sight, we teach data science. We want to help you master data science as fast as possible.

If you sign up for our email list, you’ll receive Python data science tutorials delivered to your inbox.

You’ll get free tutorials on:

- NumPy
- Pandas
- Base Python
- Scikit learn
- Machine learning
- Deep learning
- … and more

… delivered to your inbox every week.

Want to learn data science in Python? Sign up now.

The post How to use the NumPy linspace function appeared first on Sharp Sight.

]]>