The post How to make a matplotlib bar chart appeared first on Sharp Sight.

Specifically, you’ll learn how to use the plt.bar function from pyplot to create bar charts in Python.

I’ll be honest … creating bar charts in Python is harder than it should be.

People who are just getting started with data visualization in Python sometimes get frustrated. I suspect that this is particularly true if you’ve used other modern data visualization toolkits like ggplot2 in R.

But if you’re doing data science or statistics in Python, you’ll need to create bar charts.

To try to make bar charts easier to understand, this tutorial will explain bar charts in matplotlib, step by step.

The tutorial has several different sections. Note that you can click on these links and they will take you to the appropriate section.

- A quick introduction to matplotlib
- The syntax for the matplotlib bar chart
- Examples of how to make a bar chart with matplotlib

If you need help with something specific, you can click on one of the links.

However, if you’re just getting started with matplotlib, I recommend that you read the entire tutorial. Things will make more sense that way.

Ok. First, let’s briefly talk about matplotlib.

If you’re new to data visualization in Python, you might not be familiar with matplotlib.

Matplotlib is a module in the Python programming language for data visualization and plotting.

It is probably the most common data visualization tool in Python. If you’re doing data science or scientific computing in Python, you are very likely to encounter it.

However, even though matplotlib is extremely common, it has a few problems.

The big problem is the syntax. Matplotlib’s syntax is fairly low-level. The low-level nature of matplotlib can make it harder to accomplish simple tasks. If you’re only using matplotlib, you might need to use a lot of code to create simple charts.

There’s a solution to this though.

To simplify matplotlib, you can use pyplot.

Pyplot is a sub-module within matplotlib.

Essentially, pyplot provides a group of relatively simple functions for performing common data visualization tasks.

For example, there are simple functions for creating common charts like the scatter plot, the bar chart, the histogram, and others.

If you’re new to matplotlib and pyplot, I recommend that you check out some of our related tutorials:

- How to make a scatterplot with matplotlib
- A quick introduction to the matplotlib histogram
- How to make a line chart with matplotlib

In this tutorial though, we’re going to focus on creating bar charts with pyplot and matplotlib.

With that in mind, let’s examine the syntax.

The syntax to create a bar chart with pyplot isn’t that bad, but it has a few “gotchas” that can confuse beginners.

Let’s take a high-level look at the syntax (we’ll look at the details later).

To create a bar chart with pyplot, we use the `plt.bar()` function.

Inside of the plt.bar function are several parameters.

Four of those parameters are particularly important: `x`, `height`, `width`, and `color`. The plt.bar function has more parameters than these four, but these four are the most important for creating basic bar charts, so we will focus on them.

Let’s talk a little more specifically about these parameters.

Here, I’ll explain four important parameters of the plt.bar function: `x`, `height`, `width`, and `color`.

The `x` parameter specifies the position of the bars along the x axis.

So if your bars are at positions 0, 1, 2, and 3 along the x axis, those are the values that you would need to pass to the `x` parameter.

You need to provide these values in the form of a “sequence” of scalar values. That means that your values (e.g., 0, 1, 2, 3) will need to be contained inside of a Python sequence, like a list or a tuple.

In this tutorial, I’m assuming that you understand what a Python sequence is. If you don’t, do some preliminary reading on Python sequences first, and then come back when you understand them.

The `height` parameter controls the height of the bars.

Similar to the `x` parameter, you need to provide a sequence of values to the `height` parameter … one value for each bar.

So if there are four bars, you’ll need to pass a sequence of four values. If there are five bars, you need to provide a sequence of five values. Etc.

The examples section will show you how this works.

The `width` parameter controls the width of the bars.

You can provide a single value, in which case all of the bars will have the same width.

Or, you can provide a sequence of values to manually set the width of different bars.

By default, the `width` parameter is set to .8.
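For instance, here’s a quick sketch (with placeholder data defined inline, not the tutorial’s data) of passing a sequence of widths, one per bar:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; assumed here so the sketch runs without a display
import matplotlib.pyplot as plt

# placeholder data for this sketch
bar_x_positions = [0, 1, 2, 3]
bar_heights = [1, 4, 9, 16]

# one width value per bar, instead of a single scalar
bars = plt.bar(bar_x_positions, bar_heights, width=[0.2, 0.4, 0.6, 0.8])
```

Each bar gets the width at the matching position in the sequence, so the bars grow progressively wider from left to right.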

The `color` parameter controls the interior color of the bars.

You can set the value to a named color (like “red”, “blue”, “green”, etc.) or you can set the color to a hexadecimal color.

Although I strongly prefer hex colors (because they give you a lot of control over the aesthetics of your visualizations), hex colors are a little more complicated for beginners. For that reason, this tutorial will only explain how to use named colors (see the examples below).

Ok … now that you know more about the parameters of the plt.bar function, let’s work through some examples of how to make a bar chart with matplotlib.

I’m going to show you individual examples of how to manipulate each of the important parameters discussed above.

Before you work with the examples, you’ll need to run some code.

You need to run code to import some Python modules. You’ll also need to run code to create some simple data that we will plot.

Here is the code to import the proper modules.

We’ll be working with matplotlib, numpy, and pyplot, so this code will import them.

```python
import matplotlib
import numpy as np
import matplotlib.pyplot as plt
```

Note that we’ve imported numpy with the nickname `np`, and we’ve imported pyplot with the nickname `plt`. These are fairly standard in most Python code. We can use these nicknames as abbreviations of the modules … this just makes it easier to type the code.

Next, you need to create some data that we can plot in the bar chart.

We’re going to create three sequences of data: `bar_heights`, `bar_labels`, and `bar_x_positions`.

```python
# CREATE DATA
bar_heights = [1, 4, 9, 16]
bar_labels = ['alpha', 'beta', 'gamma', 'delta']
bar_x_positions = [0, 1, 2, 3]
```

As noted above, most of the parameters that we’re going to work with require you to provide a *sequence* of values. Here, all of these sequences have been constructed as Python lists. We could also use tuples or another type of Python sequence. For example, we could use the NumPy arange function to create a NumPy array for `bar_heights` or `bar_x_positions`. As long as the structure is a “sequence” it will work.
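For example, here’s a brief sketch of building equivalent data with NumPy instead of plain lists:

```python
import numpy as np

# NumPy arrays count as sequences, so they work anywhere a list would
bar_heights = np.array([1, 4, 9, 16])
bar_x_positions = np.arange(4)

print(bar_x_positions)  # [0 1 2 3]
```

Either form can be passed straight to `plt.bar()`.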

Ok, now that we have our data, let’s start working with some bar chart examples.

Let’s start with a simple example.

Here, we’re just going to make a simple bar chart with pyplot using the plt.bar function. We won’t do any formatting … this will just produce a bar chart with default formatting.

To do this, we’re going to call the `plt.bar()` function, and we will pass `bar_x_positions` to the `x` parameter and `bar_heights` to the `height` parameter.

```python
# PLOT A SIMPLE BAR CHART
plt.bar(bar_x_positions, bar_heights)
```

And here is the output:

This is fairly simple, but there are a few details that I need to explain.

First, notice the position of each of the bars. The bars are at locations 0, 1, 2, and 3 along the x axis. This corresponds to the values stored in `bar_x_positions` and passed to the `x` parameter.

Second, notice the height of the bars. The heights are 1, 4, 9, and 16. As should be obvious by now, these bar heights correspond to the values contained in the variable `bar_heights`, which has been passed to the `height` parameter.

Finally, notice that we’re passing the values `bar_x_positions` and `bar_heights` by *position*. When we do it this way, Python knows that the first argument (`bar_x_positions`) corresponds to the `x` parameter and the second argument (`bar_heights`) corresponds to the `height` parameter. One caveat: in old versions of matplotlib, the first parameter of plt.bar was named `left` rather than `x`, so writing the parameter names explicitly as `plt.bar(x = bar_x_positions, height = bar_heights)` would produce an error. In current versions of matplotlib, the keyword form works fine, but passing the arguments by position works everywhere.

Next, we’ll change the color of the bars.

This is a very simple modification, but it’s the sort of thing that can make your plot look better, if you do it right.

There are a couple different ways to change the color of the bars. You can change the bars to a “named” color, like ‘red,’ ‘green,’ or ‘blue’. Or, you can change the color to a hexadecimal color. Hex colors are a little more complicated, so I’m not going to cover them in depth here. Having said that, hex colors give you more control, so eventually you should become familiar with them.
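As a brief aside for the curious, a hex color is just a string of the form `'#RRGGBB'`; here’s a minimal sketch (with placeholder data defined inline):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; assumed here so the sketch runs without a display
import matplotlib.pyplot as plt

# placeholder data for this sketch
bar_x_positions = [0, 1, 2, 3]
bar_heights = [1, 4, 9, 16]

# '#CC0000' is one possible hex code for a dark red
bars = plt.bar(bar_x_positions, bar_heights, color='#CC0000')
```

The rest of this tutorial sticks to named colors.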

Ok. Here, we’re going to make a simple change. We’re going to change the color of the bars to ‘red.’

To do this, we can just provide a color value to the `color` parameter:

```python
plt.bar(bar_x_positions, bar_heights, color = 'red')
```

The code produces the following output:

Admittedly, this chart doesn’t look that much better than the default, but it gives you a simple example of how to change the bar colors. This code is easy to learn and easy to practice (you should always start with relatively simple examples).

As you become more skilled with data visualization, you will be able to select other colors that look better for a particular data visualization task.

The point here is that you can change the color of the bars with the `color` parameter, and it’s relatively easy.

Now, I’ll show you how to change the width of the bars.

To do this, you can use the `width` parameter.

```python
plt.bar(bar_x_positions, bar_heights, width = .5)
```

And here’s the output:

Here, we’ve set the bar widths to .5. In this case, I think that the default (.8) is better. However, there may be situations where the bars are spaced out at larger intervals. In those cases, you’ll need to make your bars wider. My recommendation is that you make the space between the bars about 20% of the width of the bars.
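To illustrate, here’s a sketch (with placeholder data defined inline) of bars spaced at wider intervals, with the width increased to keep the gaps proportionally similar:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; assumed here so the sketch runs without a display
import matplotlib.pyplot as plt

# bars spaced at intervals of 2 along the x axis ...
bar_x_positions = [0, 2, 4, 6]
bar_heights = [1, 4, 9, 16]

# ... so we widen the bars to 1.6, leaving a gap of 0.4 between bars
bars = plt.bar(bar_x_positions, bar_heights, width=1.6)
```

The general idea: the gap between bars is the spacing interval minus the bar width, so you tune `width` relative to your positions.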

You might have noticed in the prior examples that there is a bit of a problem with the x-axis of our bar charts: they don’t have labels.

Let’s take a look by re-creating the simple bar chart from earlier in the tutorial:

```python
# RE-CREATE THE SIMPLE BAR CHART
plt.bar(bar_x_positions, bar_heights)
```

It produces the following bar chart:

Again, just take a look at the bar labels on the x axis. By default, they are just the x-axis positions of the bars. They are *not* the categories.

In most cases, this will not be okay.

In almost all cases, when you create a bar chart, the bars need to have labels. Typically, each bar is labeled with an appropriate category.

How do we do that?

When you use the plt.bar function from pyplot, you need to set those bar labels *manually*. As you’ve probably noticed, they are not included when you build a basic bar chart like the one we created earlier with the code `plt.bar(bar_x_positions, bar_heights)`.

Here, I’ll show you how.

To add labels to your bars, you need to use the plt.xticks function.

Specifically, you need to call `plt.xticks()` and provide two arguments: the x axis positions of your bars, and the labels that correspond to those bars.

So in this example, we will call the function as follows: `plt.xticks(bar_x_positions, bar_labels)`. The `bar_x_positions` variable contains the position of each bar, and the `bar_labels` variable contains the labels of each bar. (Remember that we defined both variables earlier in this tutorial.)

```python
# ADD X AXIS LABELS
plt.bar(bar_x_positions, bar_heights)
plt.xticks(bar_x_positions, bar_labels)
```

And here is the result:

Notice that each bar now has a categorical label.

Ok, now I’ll show you a quick trick that will improve the appearance of your Python bar charts.

One of the major issues with standard matplotlib bar charts is that they don’t look all that great. The standard formatting from matplotlib is – to put it bluntly – ugly.

To be clear, the basic formatting is fine if you’re just doing some data exploration at your workstation. The basic formatting is okay if you’re creating charts for personal consumption.

But if you need to show your charts to anyone important, then the default formatting probably isn’t good enough. The default charts look basic. They lack polish. They look a little unprofessional. It might not seem like a big deal, but the appearance of your charts matters when you present them to other people.

That being the case, you need to learn to format your charts properly.

The full details of how to format your charts is beyond the scope of this post, but here I’ll show you a quick way to dramatically improve the appearance of your pyplot charts.

We’re going to use a special function from the seaborn package to improve our charts.

To use this function, you’ll need seaborn installed (if it isn’t, `pip install seaborn` will install it). Then you can import it with the following code:

```python
# import seaborn module
import seaborn as sns
```

Once you have seaborn imported, you can use the seaborn.set() function to set new plot defaults for your matplotlib charts. Because we imported seaborn as `sns`, we can call the function with `sns.set()`.

```python
# set plot defaults using seaborn formatting
sns.set()
```

This essentially changes many of the plot defaults like the background color, gridlines, and a few other things.

Let’s replot our bar chart so you can see what I mean.

```python
# plot bar chart
plt.bar(bar_x_positions, bar_heights)
```

Here’s the plot:

I’ll be honest … I think this is dramatically better. Just using this one simple modification makes your matplotlib bar chart look much more professional.

One issue that you might run into, though, is that once you use the seaborn.set() function, *all* of your subsequent charts will have that formatting. That might not be what you want!

So how do you revert to the original matplotlib formatting?

You can do that by running the following code:

```python
# REMOVE SEABORN FORMATTING
sns.reset_orig()
```

If you run this, it will reset the matplotlib formatting back to the original default values.

Let’s do one more example.

Here, we’ll use several techniques together to create a more complete and refined bar chart in Python.

We’ll set the bar positions and heights using the plt.bar function. Then we’ll add the bar labels using the plt.xticks function. We’ll change the color using the `color` parameter. And we’ll improve the background formatting by using the `sns.set()` function from seaborn.

Let’s take a look:

```python
# COMBINED EXAMPLE
import seaborn as sns
sns.set()

plt.bar(bar_x_positions, bar_heights, color = 'darkred')
plt.xticks(bar_x_positions, bar_labels)
```

And here is the output:

Let’s quickly break this down.

We used the `plt.bar()` function to create a simple bar chart. The bar locations have been defined with the `bar_x_positions` variable and the bar heights have been defined with the `bar_heights` variable. We set the color of the bars to ‘darkred’ by using the `color` parameter. We set the bar category labels by using the `plt.xticks` function. And we improved the overall plot formatting by using the `sns.set()` function.

There is certainly more that we could do to improve this chart. We could add a plot title, axis titles, and maybe change the fonts.

Having said that, this looks pretty damn good for a simple bar chart, and it’s only a few lines of code. In my opinion, it’s dramatically better than a simple default bar chart made with matplotlib.

And one last thing …

As I noted earlier, if you use the `sns.set()` function to use seaborn formatting for your plots, you may want to reset the defaults afterwards. To do that, run the following code:

```python
# reset defaults
sns.reset_defaults()
```

This will return your matplotlib formatting back to the matplotlib defaults.

This tutorial should have given you a solid foundation for creating bar charts with matplotlib.

Having said that, there’s a lot more to learn. If you want to get the most out of matplotlib, you’ll need to learn more tools and more functions. You’ll need to learn more about matplotlib, but you’ll also need to learn more about NumPy and NumPy arrays. For example, you’ll often need to use techniques like NumPy linspace to set axis tick locations.

Overall, my point is that there’s more to learn. If you want to be great at data science in Python, you really need to know matplotlib.

So, this tutorial should be great for helping you learn some of the basics of the matplotlib bar chart, but if you’re really interested in data science, you’ll need to learn quite a bit more.

If you want to learn more about matplotlib and data science in Python, sign up for our email list.

When you sign up, you’ll get our tutorials delivered directly to your inbox. Every week, we publish data science tutorials … members of our email list hear about them whenever they are published.

If you sign up, you’ll get free tutorials about:

- Matplotlib
- NumPy
- Pandas
- Base Python
- Scikit learn
- Machine learning
- Deep learning
- … and more.

Want to learn data science in Python? Sign up now.


The post How to use Pandas iloc to subset Python data appeared first on Sharp Sight.

Working with data in Pandas is not terribly hard, but it can be a little confusing to beginners. The syntax is a little foreign, and ultimately you need to practice a lot to really make it stick.

To make it easier, this tutorial will explain the syntax of the iloc method to help make it crystal clear.

Additionally, this tutorial will show you some simple examples that you can run on your own.

This is critical. When you’re learning new syntax, it’s best to learn and master the tool with simple examples first. Learning is much easier when the examples are simple and clear.

Having said that, I recommend that you read the whole tutorial. It will provide a refresher on some of the preliminary things you need to know (like the basics of Pandas DataFrames). Everything will be more cohesive if you read the entire tutorial.

But, if you found this from a Google search, and/or you’re in a hurry, you can click on one of the following links and it will take you directly to the appropriate section:

- A quick refresher on Pandas
- Pandas DataFrame basics
- The syntax of Pandas iloc
- Examples: how to use iloc

Before I explain the Pandas iloc method, it will probably help to give you a quick refresher on Pandas and the larger Python data science ecosystem.

There are a few core toolkits for doing data science in Python: NumPy, Pandas, matplotlib, and scikit learn. Those are the big ones right now.

Each of those toolkits focuses on a different part of data science or a different part of the data workflow.

For example, NumPy focuses on numeric data organized into array-like structures. It’s a data manipulation toolkit specifically for numeric data.

Matplotlib focuses on data visualization. Commonly, when you’re doing data science or analytics, you need to *visualize* your data. This is true even if you’re working on an advanced project. You need to perform data visualization to *explore* your data and understand your data. Matplotlib provides a data visualization toolkit so you can visualize your data. You can use matplotlib for simple tasks like creating scatterplots in Python, histograms of single variables, line charts that plot two variables, etc.

And then there’s Pandas.

Pandas also focuses on a specific part of the data science workflow in Python.

… it focuses on **data manipulation with DataFrames**.

Again, in this tutorial, I’ll show you how to use a specific tool, the iloc method, to retrieve data from a Pandas DataFrame.

Before I show you that though, let’s quickly review the basics of Pandas dataframes.

To understand the iloc method in Pandas, you need to understand Pandas DataFrames.

DataFrames are a type of data structure. Specifically, they are 2-dimensional structures with a row and column form.

So Pandas DataFrames are strictly 2-dimensional.

Also, the columns can contain different data types (although all of the data *within* a column must have the same data type).

Essentially, these features make Pandas DataFrames sort of like Excel spreadsheets.

Importantly, each row and each column in a Pandas DataFrame has a number. An *index*.

This structure, a row-and-column structure with numeric indexes, means that you can work with data by the row number and the column number.

That’s exactly what we can do with the Pandas iloc method.

The `iloc` method enables you to “locate” a row or column by its “integer index.”

We use the numeric, integer index values to locate rows, columns, and observations.

**i**nteger **loc**ate.

`iloc`.

Get it?

The syntax of the Pandas iloc isn’t that hard to understand, especially once you use it a few times. Let’s take a look at the syntax.

The syntax of iloc is straightforward.

You call the method by using “dot notation.” You should be familiar with this if you’re using Python, but I’ll quickly explain.

To use iloc in Pandas, you need to have a Pandas DataFrame. To access iloc, you’ll type the name of the DataFrame, then a “dot,” and then “`iloc`”.

Immediately after `iloc`, you’ll type a set of brackets.

Inside of the brackets, you’ll use integer index values to specify the rows and columns that you want to retrieve. The order of the indexes inside the brackets obviously matters. The first index number will be the row or rows that you want to retrieve. Then the second index is the column or columns that you want to retrieve. Importantly, the column index is *optional*.

If you don’t provide a column index, iloc will retrieve all columns by default.
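Here’s a quick sketch of that bracket order, using a tiny made-up DataFrame (not the tutorial’s data, which we build below):

```python
import pandas as pd

# a tiny illustrative DataFrame, purely hypothetical data
df = pd.DataFrame({'a': [10, 20, 30], 'b': [40, 50, 60]})

print(df.iloc[1, 0])  # row index first, column index second -> 20
print(df.iloc[1])     # column index omitted -> the whole second row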

As I mentioned, the syntax of iloc isn’t that complicated.

It’s fairly simple, but it *still takes practice*.

Even though it’s simple, it’s actually easy to forget some of the details or confuse some of the details.

For example, it’s actually easy to forget which index value comes first inside of the brackets. Does the row index come first, or the column index? It’s easy to forget this.

It’s also easy to confuse the `iloc[]` method with the `loc[]` method. This other data retrieval method, `loc[]`, is extremely similar to `iloc[]`, and the similarity can confuse people. The `loc[]` method works differently though: it retrieves rows and columns by *label* rather than by integer position (we explain the loc method in a separate tutorial).
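To see the difference concretely, here’s a small sketch (with made-up data) where the row labels differ from the integer positions:

```python
import pandas as pd

# a DataFrame whose row labels differ from the integer positions
df = pd.DataFrame({'value': [10, 20, 30]}, index=['x', 'y', 'z'])

print(df.iloc[0]['value'])   # iloc: integer position 0 -> 10
print(df.loc['x']['value'])  # loc: the label 'x' -> the same row, 10
```

Here the two happen to point at the same row; once rows are sorted, filtered, or relabeled, position and label diverge, which is exactly when mixing them up causes bugs.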

Although the iloc method can be a little challenging to learn in the beginning, it’s possible to learn and master this technique *fast*. Here at Sharp Sight, our premium data science courses will teach you to memorize syntax, so you can permanently remember all of those important little details.

This tutorial won’t give you all of the specifics about how to memorize the syntax of iloc. But, I can tell you that it just takes practice and repetition to remember the little details. You need to work with simple examples, and practice those examples over time until you can remember how everything works.

Speaking of examples, let’s start working with some real data.

Like I said, you need to learn these techniques and practice with simple examples.

Here, in the following examples, we’ll cover the following topics:

- rows selection with iloc
- column selection with iloc
- retrieve specific cells with iloc
- retrieve ranges of rows and columns (i.e., slicing)
- get specific subsets of cells

Before we work on those examples though, you’ll need to create some data.

First, we’ll import the Pandas module. Obviously, we’ll need this to call Pandas functions.

```python
#===============
# IMPORT MODULES
#===============
import pandas as pd
```

Next, you’ll need to create a Pandas DataFrame that will hold the data we’re going to work with.

There are two steps to this. First, we need to create a dictionary of lists that contain the data. Essentially, in this structure, the “key” will be the name of the column, and the associated list will contain the values of that column. You’ll see how this works in a minute.

```python
#==========================
# CREATE DICTIONARY OF DATA
#==========================
country_data_dict = {
    'country':['USA', 'China', 'Japan', 'Germany', 'UK', 'India']
    ,'continent':['Americas','Asia','Asia','Europe','Europe','Asia']
    ,'GDP':[19390604, 12237700, 4872137, 3677439, 2622434, 2597491]
    ,'population':[322179605, 1403500365, 127748513, 81914672, 65788574, 1324171354]
}
```

Now that we have our dictionary, `country_data_dict`, we’re going to create a DataFrame from this data. To do this, we’ll apply the `pd.DataFrame()` function to the `country_data_dict` dictionary. Notice that we’re also using the `columns` parameter to specify the order of the columns.

```python
#=================================
# CREATE DATAFRAME FROM DICTIONARY
#=================================
country_data_df = pd.DataFrame(country_data_dict, columns = ['country', 'continent', 'GDP', 'population'])
```

Now we have a DataFrame, `country_data_df`, which contains country-level economic and population data.

First, I’ll show you how to select single rows with iloc.

For example, let’s just select the first row of data. To do this, we’ll call the iloc method using dot notation, and then we’ll use the integer index value inside of the brackets.

```python
country_data_df.iloc[0]
```

Which produces the following output:

```
country            USA
continent     Americas
GDP           19390604
population   322179605
Name: 0, dtype: object
```

Essentially, the code pulls back the first row of data, and *all* of the columns.

Notice that the “first” row has the numeric index of `0`. If you’ve used Python for a little while, this should make sense. When we use indexes with Python objects – including lists, arrays, NumPy arrays, and other sequences – the numeric indexes start with `0`. The first value of the index is 0. This is very consistent in Python.

Here’s another example.

We can pull back the sixth row of data by using index value `5`. Remember, because the index values start at `0`, the numeric index value will be one less than the row of data you want to retrieve.

Let’s pull back the row of data at index value `5`:

```python
country_data_df.iloc[5]
```

Which produces the following output:

```
country            India
continent           Asia
GDP              2597491
population    1324171354
Name: 5, dtype: object
```

Again, this is essentially the data for row index 5, which contains the data for India. Here, you can see the data for all of the columns.

There’s actually a different way to select a single row using iloc.

This is important, actually, because the syntax is more consistent with the syntax that we’re going to use to select columns, and to retrieve “slices” of data.

Here, we’re still going to select a single row. But, we’re going to use some syntax that explicitly tells Pandas that we want to retrieve *all columns*.

```python
country_data_df.iloc[0, :]
```

Which produces the following:

```
country            USA
continent     Americas
GDP           19390604
population   322179605
Name: 0, dtype: object
```

Notice that this is the same output that’s produced by the code `country_data_df.iloc[0]`.

What’s going on here?

Notice that in this new syntax, we still have an integer index for the rows. That’s in the first position just inside of the brackets.

But now we also have a ‘`:`’ symbol in the second position inside of the brackets.

The colon character (‘`:`’) essentially tells Pandas that we want to retrieve all columns.

Remember from the syntax explanation above that we can use two integer index values inside of `iloc[]`. The first is the row index and the second is the column index.

When we want to retrieve *all* columns, we can use the ‘`:`’ character.

You’ll understand this more later. It’s relevant for when we retrieve ‘slices’ of data.

Similarly, you can select a single column of data using a special syntax that uses the ‘`:`’ character.

Let’s say that we want to retrieve the first column of data, which is the column at index position `0`.

To do this, we will use an integer index value in the *second* position inside of the brackets when we use `iloc[]`. Remember that the integer index in the second position specifies the *column* that we want to retrieve.

What about the rows?

When we want to retrieve a single column and *all rows*, we need to use a special syntax using the ‘`:`’ character.

You’ll use the ‘`:`’ character in the first position inside of the brackets when you use `iloc[]`. This indicates that we want to retrieve all of the rows. Remember, the first index position inside of `iloc[]` specifies the rows, and when we use the ‘`:`’ character, we’re telling Pandas to retrieve *all* of the rows.

Let me show you an example of this in action.

In this example, we’re going to retrieve a single column.

The code is simple. We have our DataFrame that we created above: `country_data_df`.

We’re going to use dot notation after the DataFrame to call the `iloc[]` method.

Inside of the brackets, we’ll have the ‘`:`’ character, which indicates that we want to get all rows. We also have `0` in the second position inside the brackets, which indicates that we want to retrieve the column with index `0` (the first column in the DataFrame).

Let me show you the code:

```python
country_data_df.iloc[:,0]
```

And here is the output.

```
0        USA
1      China
2      Japan
3    Germany
4         UK
5      India
Name: country, dtype: object
```

Notice that the code retrieved a single column of data – the ‘`country`’ column – which is the first column in our DataFrame, `country_data_df`.

It’s pretty straightforward. Using the syntax explained above, iloc retrieved a single column of data from the DataFrame.

Now, let’s move on to something a little more complicated.

Here, we’re going to select the data in a specific cell in the DataFrame.

You’ll just use `iloc[]` and specify an integer index value for the data in the row and column you want to retrieve.

So if we want to select the data in row `2` and column `0` (i.e., row index `2` and column index `0`), we’ll use the following code:

```python
country_data_df.iloc[2,0]
```

Which produces the following output:

```
'Japan'
```

Again. This is pretty straightforward.

Using the first index position, we specified that we want the data from row `2`, and we used the second index position to specify that we want to retrieve the information in column `0`.

The data that fits *both* criteria is `Japan`, in cell `(2, 0)`.

Notice that the Pandas DataFrame essentially works like an Excel spreadsheet. You can just specify the row and column of the data that you want to pull back.

Now that I’ve explained how to select specific rows and columns using `iloc[]`, let’s talk about slices.

When we “slice” our data, we take multiple rows or multiple columns.

There’s a special syntax to do this, which is related to some of the examples above.

Essentially, we can use the colon (‘`:`’) character inside of `iloc[]` to specify a start row and a stop row.

Keep in mind that the row number specified by the stop index value is *not* included.

It’s always best to illustrate an abstract concept with a concrete example, so let’s take a look at an example of how to use iloc to retrieve a slice of rows.

Here, we’re going to retrieve a subset of rows.

This is pretty straightforward.

We’re going to specify our DataFrame, `country_data_df`, and then call the `iloc[]` method using dot notation.

Then, inside of the iloc method, we’ll specify the start row and stop row indexes, separated by a colon.

Here’s the exact code:

```python
country_data_df.iloc[0:3]
```

And here are the rows that it retrieves:

```
   country continent       GDP  population
0      USA  Americas  19390604   322179605
1    China      Asia  12237700  1403500365
2    Japan      Asia   4872137   127748513
```

Notice what data we have here.

The code has retrieved rows `0`

, `1`

, and `2`

.

It also retrieved *all* of the columns.

This is pretty straightforward … we’re retrieving a subset of rows by using the colon (‘`:`

‘) character inside of `iloc[]`

.

Now, we’re going to retrieve a subset of *columns* using iloc.

This is very similar to the previous example where we retrieved a subset of rows. The only difference is how exactly we use the row and column indexes inside of `iloc[]`.

Here, we’re going to specify that we’re using data from `country_data_df`. Then we’ll reference `iloc[]` using dot notation, following the name of the DataFrame.

Inside of `iloc[]`, we’re using the `:` character for the row index. This means that we want to retrieve all rows. For the column index, we’re using the range `0:2`. This means that we want to retrieve the columns starting from column `0` up to but excluding column `2`.

Here’s the exact code:

country_data_df.iloc[:,0:2]

Which produces the following result:

   country continent
0      USA  Americas
1    China      Asia
2    Japan      Asia
3  Germany    Europe
4       UK    Europe
5    India      Asia

If you understand column indexes and how to get slices of data with iloc, this is pretty easy to understand.

The code `country_data_df.iloc[:,0:2]` gets columns `0` and `1`, and gets all rows.


To be clear, Pandas slices can get more complicated than this.

I recommend that you first learn, practice, and master these simple examples before you move on to anything more complicated.

Finally, let’s retrieve a subset of *cells* from our data.

Doing this is really just a combination of getting a slice of rows and a slice of columns with `iloc`, at the same time.

Let me show you.

country_data_df.iloc[1:5,0:3]

Which produces the following output:

   country continent       GDP
1    China      Asia  12237700
2    Japan      Asia   4872137
3  Germany    Europe   3677439
4       UK    Europe   2622434

So what did we do here?

We referenced `iloc[]` using dot notation after the name of the Pandas DataFrame.

Inside of `iloc[]`, you see that we’re retrieving rows `1:5` and columns `0:3`.

This means that we want to retrieve rows `1` to `4` (remember, the “stop” index is *excluded*, so row `5` is left out). It also says that we want to retrieve the contents of columns `0` through `2`.

This has the effect of selecting the data in rows `1` through `4` and columns `0` through `2`. The cells that get retrieved must meet both criteria.


Again, this is relatively easy to understand if you understand the basics of iloc and the basics of slices.
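As a self-contained sketch tying the three slicing patterns together (again reconstructing a simplified `country_data_df`, without the population column, as an assumption based on the outputs above):

```python
import pandas as pd

# Hypothetical reconstruction of the tutorial's DataFrame
country_data_df = pd.DataFrame({
    'country':   ['USA', 'China', 'Japan', 'Germany', 'UK', 'India'],
    'continent': ['Americas', 'Asia', 'Asia', 'Europe', 'Europe', 'Asia'],
    'GDP':       [19390604, 12237700, 4872137, 3677439, 2622434, 2597491]
})

row_slice  = country_data_df.iloc[0:3]       # rows 0-2, all columns
col_slice  = country_data_df.iloc[:, 0:2]    # all rows, columns 0-1
cell_slice = country_data_df.iloc[1:5, 0:3]  # rows 1-4, columns 0-2

print(cell_slice.shape)  # (4, 3)
```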

That being said, if you have questions, leave them in the comment section below.

I’m sure that you’ve heard it before: data manipulation is really important for data science.

I’ve said it before, and so have many other professional data scientists.

In fact, you’ll often hear the quote that “80 percent of your work as a data scientist will be data manipulation.”

That’s probably pretty close to true. Data manipulation is *really important*.

If you want to learn data science in Python, that means that you should really know the Pandas module and how to retrieve data using methods like iloc.

Having said that, if you’re interested in learning more about Pandas and more about data science in Python, then sign up for our email list.

Here at Sharp Sight, we teach data science.

Every week, we post new tutorials about Python data science topics like:

- Pandas
- Matplotlib
- Sci-kit learn
- NumPy
- Seaborn
- Keras

We also publish data science tutorials for the R programming language.

When you sign up for our email list, you’ll get these tutorials delivered *directly to your inbox* every week.

If you want *FREE* data science tutorials every week, then sign up now.

The post How to use Pandas iloc to subset Python data appeared first on Sharp Sight.

The post How to use the NumPy median function appeared first on Sharp Sight.

This tutorial will teach you a few things.

First, it will show you how the NumPy median function works syntactically. We’ll cover the basic syntax and parameters of the np.median function.

I’ll also show you some examples of how to use it. As always, one of the best ways to learn new syntax is studying and practicing simple examples.

If you’re a relative beginner with NumPy, I recommend that you read the full tutorial.

But if you only need help with a specific aspect of the NumPy median function, then you can click on one of the links below. The following links will take you to the appropriate section of the tutorial:

If you’re a real beginner, you may not be 100% familiar with NumPy. So before I explain the np.median function, let me explain what NumPy is.

What exactly is NumPy?

NumPy is a data manipulation module for the Python programming language.

At a high level, NumPy enables you to work with numeric data in Python. A little more specifically, it enables you to work with large arrays of numeric data.

You can create and store numeric data in a data structure called a NumPy array.

NumPy also has a set of tools for performing computations on arrays of numeric data. You can do things like combine arrays of numeric data, split arrays into multiple arrays, or reshape arrays into arrays with a new number of rows and columns.

NumPy also has a set of functions for performing calculations on numeric data. The NumPy median function is one of these functions.

Now that you have a broad understanding of what NumPy is, let’s take a look at what the NumPy median function is.

The NumPy median function computes the median of the values in a NumPy array. Note that the NumPy median function will also operate on “array-like objects” like Python lists.

Let’s take a look at a simple illustration of the function.

Imagine we have a 1-dimensional NumPy array with five values. We can use the NumPy median function to compute the median value: the middle value of the sorted data.

It’s pretty straightforward, although the np.median function can get a little more complicated. It can operate on 2-dimensional or multi-dimensional array objects. It can also calculate the median value of each row or column. You’ll see some examples of these operations in the examples section.

Ok. Now let’s take a closer look at the syntax of the NumPy median function.

One quick note. This explanation of the syntax and all of the examples in this tutorial assume that you’ve imported the NumPy module with the code `import numpy as np`.

This is a common convention among NumPy users. When you write and run a NumPy/Python program, it’s common to import NumPy as `np`. This enables you to refer to NumPy with the “nickname” `np`, which makes the code a little simpler to write and read.

I just wanted to point this out to you to make sure you understand.

Ok. Let’s take a look at the syntax.

Assuming that you’ve imported NumPy as `np`, you call the function by the name `np.median()`. In some programs, you might also see the function called as `numpy.median()`, if the coder imported NumPy as `numpy`. Both are relatively common, and it really depends on how the NumPy module has been imported.

Inside of the `median()` function, there are several parameters that you can use to control the behavior of the function more precisely. Let’s talk about those.

The np.median function has four parameters that we will discuss:

- `a`
- `axis`
- `out`
- `keepdims`

There’s actually a fifth parameter called `overwrite_input`. The `overwrite_input` parameter is not going to be very useful for you if you’re a beginner, so for the sake of simplicity, we’re not going to discuss it in this tutorial.

Ok, let’s quickly review what each parameter does:

`a`

The `a` parameter specifies the data that you want to operate on. It’s the data on which you will compute the median.

Typically, this will be a NumPy array. However, the np.median function can also operate on “array-like objects” such as Python lists. For the sake of simplicity, this tutorial will work with NumPy arrays, but remember that many (if not all) of the examples would work the same way if you used an array-like object instead.

Note that this parameter is required. You need to provide something to the `a` parameter, otherwise the np.median function won’t work.

The `axis` parameter controls the axis along which the function will compute the median.

More simply, the axis parameter enables you to compute the median values along the rows of an array, or the median values along the columns of an array (instead of computing the median of all of the values).

Using the `axis` parameter confuses many people. Later in this tutorial, I’ll show you an example of how to use the `axis` parameter; hopefully that will make it more clear.

But quickly, let me explain how this works.

NumPy arrays have *axes*. It’s best to think of axes as directions along the array.

So if you have a 2-dimensional array, there are two axes: axis 0 is the direction down the rows and axis 1 is the direction across the columns. (Keep in mind that higher-dimensional arrays have additional axes.)

When we use NumPy functions like np.median, we can often specify an axis along which to perform the computation.

So when we set axis = 0, the NumPy median function computes the median values downward along axis 0. This effectively computes the column medians.

Similarly, when we set axis = 1, the NumPy median function computes the median values horizontally across axis 1. This effectively computes the row medians.

Hopefully this description illustrates the concept and helps you understand.

But if you’re still confused, I’ll show you examples of how to use the axis parameter later in the examples section.

The `out` parameter enables you to specify a different output array where you can put the result.

So if you want to store the result of np.median in a different array, you can use the `out` parameter to do that.

This is an optional parameter.
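A quick sketch of how `out` might be used (a hypothetical example that pre-allocates an array for the column medians of the 2-d array used later in this tutorial):

```python
import numpy as np

np_array_2d = np.array([0, 20, 40, 60, 80, 100]).reshape((2, 3))

# Pre-allocate an output array with the right shape (one slot per column),
# then have np.median write its result into it via the out parameter
result = np.empty(3)
np.median(np_array_2d, axis=0, out=result)
print(result)  # [30. 50. 70.]
```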

The `keepdims` parameter enables you to make the dimensions of the output the same as the dimensions of the input.

This is a little confusing to many people, so let me explain.

Remember that the np.median function (and other similar functions like np.sum and np.mean) summarize your data in some way. They are computing summary statistics.

When you summarize the data in this way, you are effectively collapsing the number of dimensions of the data. For example, if you have a 1-dimensional NumPy array, and you compute the median, you are collapsing the data from a 1-dimensional structure down to a 0 dimensional structure (a single scalar number).

Or similarly, if you compute the column means of a 2-d array, you’re collapsing the data from 2 dimensions down to 1 dimension.

Essentially, the output of the NumPy median function has a *reduced number of dimensions*.
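We can check this dimension collapse directly with the `ndim` attribute:

```python
import numpy as np

# Summarizing collapses dimensions: 1-d input, 0-d (scalar) output
a1 = np.array([0, 20, 40, 60, 80, 100])
print(np.median(a1).ndim)          # 0

# 2-d input, 1-d output when computing medians along an axis
a2 = a1.reshape((2, 3))
print(np.median(a2, axis=0).ndim)  # 1
```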

What if you don’t want that? What if you want the output to have the same number of dimensions as the input?

You can force NumPy median to keep the dimensions the same by using the `keepdims` parameter. We can set `keepdims = True` to make the dimensions of the output the same as the dimensions of the input.

I understand that this might be a little abstract, so I’ll show you an example in the examples section.

Note: the `keepdims` parameter is optional. By default it is set to `keepdims = False`, meaning that the output of np.median will not necessarily have the same dimensions as the input.

Ok. Let’s work through some examples. In the last section I explained the syntax, which is probably helpful. But to really understand the code, you need to play with some examples.

Before you get started with the examples though, you’ll need to run some code.

You need to import NumPy. Run this code to properly import NumPy.

import numpy as np

By running this code, you’ll be able to refer to NumPy as `np` when you call the NumPy functions.

Ok.

This first example is very simple. We’re going to compute the median value of a 1-dimensional array of values.

First, you’ll need to create the data.

To do this, you can call the np.array function with a list of numeric values.

np_array_1d = np.array([0,20,40,60,80,100])

And now we’ll print out the data:

print(np_array_1d)

And here’s the output:

[0 20 40 60 80 100]

This is pretty straightforward. Using the np.array function, we’ve created an array with six values from 0 to 100, in increments of 20.

Now, we’ll calculate the median of these values.

np.median(np_array_1d)

Which gives us the following output:

50.0

This is fairly straightforward, but I’ll quickly explain.

Here, the NumPy median function takes the NumPy array and computes the median.

The median of these six values is 50, so the function outputs `50.0` as the result.
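Two details worth knowing here: np.median also accepts plain Python lists (an “array-like object”), and with an even number of values it averages the two middle values. A quick sketch:

```python
import numpy as np

# Even number of values: the median is the average of the two middle values
print(np.median([0, 20, 40, 60, 80, 100]))  # 50.0

# Odd number of values: the median is the middle value itself
print(np.median([0, 20, 40, 60, 80]))       # 40.0
```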

Next, let’s work through a slightly more complicated example.

Here, we’re going to calculate the median of a 2-dimensional NumPy array.

First, we’ll need to create the array. To do this, we’re going to use the NumPy array function to create a NumPy array from a list of numbers. After that, we’re going to use the reshape method to reshape the data from a 1-dimensional array to a 2-dimensional array that has 2 rows and 3 columns.

np_array_2d = np.array([0,20,40,60,80,100]).reshape((2,3))

And we can examine the array by using the `print()` function.

print(np_array_2d)

[[  0  20  40]
 [ 60  80 100]]

As you can see, this dataset has six values arranged in a 2 by 3 NumPy array.

Now, we’ll compute the median of these values.

np.median(np_array_2d)

Which produces the following output:

50.0

This example is very similar to the previous example. The only difference is that in this example, the values are arranged into a 2-dimensional array instead of a 1-dimensional array.

Ultimately though, the result is the same.

If we use the np.median function on a 2-dimensional NumPy array, by default, it will just compute the median of all of the values in the array. Here in this example, we only have six values in the array, but we could also have a larger number of values … the function would work the same.

Moreover, the NumPy median function would also work this way for higher dimensional arrays. For example, if we had a 3-dimensional NumPy array, we could use the `median()` function to compute the median of all of the values.

However, with 2-d arrays (and multi-dimensional arrays) we can use the axis parameter to compute the median along rows, columns, or other axes.

Let’s take a look.

First, I’m going to show you how to compute the median of the columns of a 2-dimensional NumPy array.

To do this, we need to use the `axis` parameter. Remember from earlier in the tutorial that NumPy axes are like directions along the rows and columns of a NumPy array.

Remember: axis 0 is the direction that points down along the rows, and axis 1 is the direction that points horizontally across the columns (in a 2-d array).

So how exactly does the `axis` parameter control the behavior of np.median?

This is important: when you use the `axis` parameter, it controls which axis gets summarized. Said differently, it controls which axis gets collapsed.

So if you set `axis = 0` inside of np.median, you’re effectively telling NumPy to compute the medians *downward*. The medians will be computed down along axis 0. Essentially, it will *collapse* axis 0 and compute the medians down that axis.

In other words, it will compute the column medians.

This confuses many people, because they think that by setting `axis = 0`, it will compute the row medians. That’s not how it works.

Again, it helps to think of NumPy axes as directions. The axis parameter specifies the direction along which the medians will be computed.

Let me show you.

Here, we’re going to compute the column medians by setting `axis = 0`.

Again, we’ll start by creating a dataset.

np_array_2d = np.array([0,20,40,60,80,100]).reshape((2,3))

And we can examine this array by using the `print()` function.

print(np_array_2d)

[[  0  20  40]
 [ 60  80 100]]

Next, let’s compute the median while setting `axis = 0`.

# CALCULATE COLUMN MEDIANS
np.median(np_array_2d, axis = 0)

And here’s the output:

array([ 30., 50., 70.])

What happened here?

NumPy calculated the medians along axis 0. This effectively computes the column medians.

Again, this might seem counterintuitive, so remember what I said previously. The `axis` parameter controls which axis gets summarized. By setting axis = 0, we told NumPy median to summarize axis 0.

Now, let’s compute the row medians.

This example is almost identical to the previous example, except here we will set `axis = 1`.

Once again, we’ll create a dataset. (This is the same as the previous example, so if you’ve already run it, you don’t need to re-run it.)

np_array_2d = np.array([0,20,40,60,80,100]).reshape((2,3))

And here are the contents of `np_array_2d`:

[[  0  20  40]
 [ 60  80 100]]

It’s just a simple 2-d NumPy array.

Now, we’re going to calculate the median and set `axis = 1`. This will effectively calculate the row medians.

np.median(np_array_2d, axis = 1)

Here’s the output:

array([ 20., 80.])

If you’ve read this tutorial carefully so far, you should understand this. Still, I’ll explain.

The input array, `np_array_2d`, is a 2-d NumPy array. There are 2 rows and 3 columns.

When we use the `np.median` function on this array with `axis = 1`, we are telling the function to compute the medians along the direction of axis 1. Remember, in a 2-d array, axis 1 is the direction that runs horizontally across the columns.

When we use NumPy median with axis = 1, we’re basically telling NumPy to *summarize* axis 1. This amounts to computing the row medians.

This is fairly easy to understand, but you really need to understand how NumPy axes work. So if you’re still confused, make sure to read our NumPy axis tutorial, and then come back and read this example and the prior example.

Finally, let’s talk about how to use the `keepdims` parameter.

Remember, using the np.median function has the effect of *summarizing* or collapsing your data. As I showed you earlier, if you have an array of 6 values, and you use np.median on that array, it will summarize those values by computing a single value (the median).

Similarly, if you compute the median and use the `axis` parameter, the median function will also reduce the number of dimensions. Like we saw in one of the previous examples, if we use np.median on a 2-dimensional array with `axis = 0` or `axis = 1`, the np.median function will compute the column medians or row medians respectively. In either case, the input had 2 dimensions, but the output (e.g., the row medians) had only 1 dimension.

This reduction in dimensions is okay in many instances, but sometimes you want the output to have the same number of dimensions as the input.

To force that behavior, we can use the `keepdims` parameter.

By default, the `keepdims` parameter is set to `keepdims = False`. As explained above, this means that the dimensions of the output do not need to be the same as the dimensions of the input.

To change this, we must set `keepdims = True`.

Here’s an example. We’re going to create a 2-d NumPy array and then calculate the column medians:

np_array_2d = np.array([0,20,40,60,80,100]).reshape((2,3))

Quickly, let’s examine the number of dimensions of this array by examining the `ndim` attribute.

np_array_2d.ndim

Which shows us the number of dimensions:

2

The array has 2 dimensions.

Now, let’s compute the median with `axis = 0`, and examine the number of dimensions of the result, again by using the `ndim` attribute.

np.median(np_array_2d, axis = 0).ndim

When we run this code, the result is `1`. The output of `np.median(np_array_2d, axis = 0)` has 1 dimension. The explanation for this is just as I explained above: np.median summarizes data, which reduces the number of dimensions.

However, we can *keep* the same number of dimensions by setting `keepdims = True`.

Let’s run the operation and look at the number of dimensions of the output:

np.median(np_array_2d, axis = 0, keepdims = True).ndim

Which produces the following output:

2

What happened here?

The `keepdims` parameter forces the median function to keep the dimensions of the output the same as the dimensions of the input. The input array (`np_array_2d`) has 2 dimensions, so if we set `keepdims = True`, the output of np.median will also have 2 dimensions.
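The same effect is easy to see in the output *shapes*. A quick sketch using the same 2-d array:

```python
import numpy as np

np_array_2d = np.array([0, 20, 40, 60, 80, 100]).reshape((2, 3))

# Without keepdims, the collapsed axis disappears from the shape
print(np.median(np_array_2d, axis=0).shape)                 # (3,)

# With keepdims=True, the collapsed axis is retained with length 1
print(np.median(np_array_2d, axis=0, keepdims=True).shape)  # (1, 3)
```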

NumPy’s median function is one of several important functions in the NumPy module. Basically, if you’re new to NumPy, there’s a lot more to learn than what we covered here.

And NumPy is really important if you want to learn data science in Python. NumPy is critical for data manipulation in Python. If you want to learn data science in Python, you really need to study NumPy.

With that in mind, I suggest that you sign up for our email list.

Here at Sharp Sight, we regularly publish free tutorials about data science topics.

For example, we regularly publish tutorials about NumPy. If you want to learn NumPy, sign up now.

When you sign up, all of our tutorials will be sent to you. It’s like having a Python data science tutor right in your inbox.

When you sign up, you’ll get free tutorials on:

- NumPy
- Pandas
- Base Python
- Scikit learn
- Machine learning
- Deep learning
- … and more.

Want to learn data science in Python? Sign up now.


The post Pandas dataframe: a quick introduction appeared first on Sharp Sight.

At a high level, we’ll cover a few things:

- What Pandas is
- What Pandas DataFrames are
- How to create pandas DataFrames
- The basics of working with pandas DataFrames

Each of the above links will take you to the appropriate section, so if you’re looking for something specific, click on the link.

On the other hand, if you’re just getting started with Pandas and with data manipulation in Python, you should probably read the whole tutorial. Seriously. If you’re not in a hurry, just take a few minutes and read. Run some code. Stay a while.

Ok. Let’s get to it.

First, let’s just talk about Pandas.

With all due respect to the people who create modules for programming languages, I think that the names of many packages are outright ridiculous.

Pandas? Why do pandas have anything to do with data?

I don’t know. I really don’t know.

In all seriousness, I think that the name Pandas confuses some beginners because it doesn’t have anything to do with data. It’s not entirely clear what it’s about.

Let’s clear it up then.

Pandas is a data manipulation package for the Python programming language.

Pandas is actually one of a couple data manipulation packages in Python. The other core data manipulation package for Python is NumPy.

Although Pandas and NumPy both provide data manipulation tools, they focus on different things.

NumPy essentially focuses on numeric data that’s structured in an array. In fact, NumPy focuses almost exclusively on numeric data. **Num**eric data in **Py**thon. NumPy. Get it?

Importantly, NumPy arrays can be 1-dimensional, 2-dimensional, or multi-dimensional. So although they are limited in that they must contain numeric data, they are more flexible in that they can have an arbitrary number of dimensions. This can make them excellent for certain types of machine learning tasks (like deep learning).

Pandas, on the other hand, has a different focus. Pandas mostly focuses on a data structure called the “DataFrame,” which is strictly 2-dimensional (unlike the NumPy array) and can contain heterogeneous columns (also unlike the NumPy array).

Now that we’re talking about the DataFrame, let’s discuss the two data structures of Pandas – the Series and the DataFrame – and how they are related.

Pandas enables you to create two new types of Python objects: the Pandas `Series` and the Pandas `DataFrame`.

These two structures are related. In this tutorial, we’re going to focus on the DataFrame, but let’s quickly talk about the Series so you understand it.

The Pandas Series object is essentially a 1-dimensional array that has an index.

In simpler terms, a Series is like a column of data. A column of data with an index.

Importantly, a Series can contain data of any data type, as long as all of the data in the Series have the same type. So a Series object can contain integers, strings, floats, etc. … as long as all of the values have the same data type.

Again, you can think of a Series object as a column of data.
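A minimal sketch of this same-type behavior, using the `dtype` attribute (the values here are hypothetical):

```python
import pandas as pd

# All values share one dtype, just like a single typed column of data
s = pd.Series([10, 20, 30])
print(s.dtype)      # int64

# Mixing types forces Pandas to fall back to the generic 'object' dtype
mixed = pd.Series([10, 'twenty'])
print(mixed.dtype)  # object
```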

This brings us to the Pandas dataframe.

So then … what is a dataframe?

In Python, a DataFrame is a 2-dimensional data structure that enables you to store and work with heterogeneous columns of data.

If that’s a little confusing, let me explain them a little differently:

Essentially, Pandas DataFrames are like Excel spreadsheets.

Here, I’m assuming that you’re familiar with spreadsheets from Microsoft Excel.

Excel spreadsheets are fairly simple. They are 2-dimensional. And they have a row-and-column structure. All of the data are contained in columns that have the same data type.

Having said that, the different columns can have a different data type. So one column might have character data, and another column might have numeric data.

Pandas DataFrames are essentially the same as Excel spreadsheets in that they are 2-dimensional. They have a row-and-column structure. And the different columns can be of different data types.

Notably, Pandas DataFrames are essentially made up of one or more Pandas Series objects. Remember from a previous section that I mentioned how Pandas Series are like “columns” of data. Essentially, you can combine several of these column-like Series objects into a larger structure … a DataFrame.
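To make that Series-to-DataFrame relationship concrete, here’s a small sketch (with hypothetical data) that assembles a DataFrame from two column-like Series objects:

```python
import pandas as pd

# Two column-like Series objects
country = pd.Series(['USA', 'China', 'Japan'], name='country')
gdp = pd.Series([19390604, 12237700, 4872137], name='GDP')

# Combining them side by side yields a 2-dimensional DataFrame;
# each Series name becomes a column name
df = pd.concat([country, gdp], axis=1)
print(df.shape)  # (3, 2)
```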

When you’re working with dataframes, it’s very common to need to reference specific rows or columns. It’s also very common to reference *ranges* of rows and columns.

There are a couple of ways to do this, but one critical way to reference specific rows and columns is by *index*.

Every row and every column in a Pandas dataframe has an integer index.

You can use these indexes to retrieve specific rows and specific columns by their number.

Similarly, you can use these index values to retrieve *ranges* of data. For example, you could retrieve rows 1 through 4.

Working with these numeric index values isn’t that complicated, but there’s a fair amount of material that you’ll need to know to do it properly.

Later in this tutorial, I’ll show you some simple examples of how to retrieve rows and columns by index.

For more detailed explanations, you should check out our tutorial on the Pandas iloc method and our tutorial on the Pandas loc method.

You might be wondering, “why do we need DataFrames?”

As I noted earlier, DataFrames are more constrained than NumPy arrays in that they are strictly 2-dimensional. On the other hand, DataFrames can have different data types in different columns, whereas NumPy arrays need to have data that’s all of the same type.

So the structure of Pandas DataFrames makes them ideal for certain types of data tasks, and bad for others.

Specifically, Pandas DataFrames are good when you have heterogeneous data.

Moreover, DataFrames are good when you need to perform certain types of tasks like:

- pivot tables
- groupings and aggregations
- visualizations

Again, DataFrames aren’t perfect for all data science tasks, but for some things they are ideal.

We’ll talk more about how to work with DataFrames later.

First though, let’s take a look at how to actually create DataFrames in Python.

Ok. Here, we’re actually going to start working with some Python code.

We’ll start first by creating DataFrames with the Pandas module. Later in the tutorial, I’ll also show you some simple things that you can do to work with DataFrames.

But there’s one quick thing before we actually start working with the DataFrame code.

Before running any of the example code in the following sections, you need to import Pandas into your working environment.

To do this, you can run the following code:

import pandas as pd

Here, we’ve imported Pandas with the alias `pd`. This is extremely common in Python code. When we import Pandas with an alias like this, we can type `pd` in our code instead of typing `pandas`. This simplifies things a little bit and makes your code a little easier to write.

Keep in mind that you could also import Pandas with the code `import pandas`, in which case you would refer to the Pandas module as `pandas` in your code.

First, let’s create a Pandas DataFrame from a python dictionary.

As our first step in creating a DataFrame from a dictionary, we’ll create a Python dictionary.

country_gdp_dict = {
    'country': ['USA', 'China', 'Japan', 'Germany', 'UK', 'India'],
    'GDP': [19390604, 12237700, 4872137, 3677439, 2622434, 2597491]
}

This dictionary has two different items. In both cases, the “key” is a string, and the “value” is a list that contains some values associated with the key.

For example, the word `'country'` is a key in our dictionary, and the list of values (`['USA', 'China', 'Japan', 'Germany', 'UK', 'India']`) is the associated “value” of that key.

Essentially, when we turn this dictionary into a DataFrame, the key/value pairs will become the column names and the column data.

Before we do that though, let’s take a quick look at the data by using the `print()` function:

print(country_gdp_dict)

Which produces the following:

{'country': ['USA', 'China', 'Japan', 'Germany', 'UK', 'India'], 'GDP': [19390604, 12237700, 4872137, 3677439, 2622434, 2597491]}

This dictionary contains a few rows of data about country-level nominal GDP, from Wikipedia: en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal).

From here, we can use the pandas.DataFrame function to create a DataFrame out of the Python dictionary.

So now we have a dictionary that contains some data: `country_gdp_dict`.

Next, we’ll take this dictionary and use it to create a Pandas DataFrame object.

To do this, we’ll simply use the pandas.DataFrame function. Of course, because we’ve imported Pandas with the alias `pd`, we can call this function with the code `pd.DataFrame()`.

country_gdp_df = pd.DataFrame(country_gdp_dict)

And we can examine the DataFrame by printing it out:

print(country_gdp_df)

Which produces the following output:

        GDP  country
0  19390604      USA
1  12237700    China
2   4872137    Japan
3   3677439  Germany
4   2622434       UK
5   2597491    India

Here, you can see the structure of the DataFrame.

The DataFrame has two columns: `GDP` and `country`.

The DataFrame has six rows of data, and each row has an associated index. Notice that the row indexes start at 0, so the first row is row `0`, the second row is row `1`, etc. This is consistent with how Python handles indexes: the index values of essentially all Python sequences start at 0. For example, the index values of Python lists start at 0.

One other thing that I want to show you is how to re-order the columns of your DataFrame.

You might have noticed that the order of the columns in the final DataFrame was slightly different than the order we used when we created the dictionary that contained the data.

To fix this and give the columns a different order, you can use the `columns` parameter inside of the pd.DataFrame function.

country_gdp_df = pd.DataFrame(country_gdp_dict, columns = ['country','GDP'])

And if we print out the DataFrame, we will see the order of the columns:

print(country_gdp_df)

   country       GDP
0      USA  19390604
1    China  12237700
2    Japan   4872137
3  Germany   3677439
4       UK   2622434
5    India   2597491

Personally, I prefer this order. We’ll be looking at the data at a country level, so it helps to have the `country` variable in the first column position.

Here, I’ve shown you one way to create a Python DataFrame with Pandas … we created a DataFrame from a dictionary of lists.

There are other ways to create Python DataFrames though. You can create DataFrames from dictionaries of Series objects, a dictionary of dictionaries, etc. Moreover, you can simply import data from csv files and other file types into a DataFrame.

Essentially, there are many ways to create Pandas DataFrames.
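For reference, here are minimal sketches of two of those alternatives. Note that the `pd.Series` values and the `'country_gdp.csv'` filename below are just illustrative assumptions, not data from this tutorial:

```python
import pandas as pd

# Alternative 1: build a DataFrame from a dictionary of Series objects
series_data = {'country': pd.Series(['USA', 'China', 'Japan']),
               'GDP': pd.Series([19390604, 12237700, 4872137])}
df_from_series = pd.DataFrame(series_data)

# Alternative 2: build a DataFrame from a list of dictionaries (one dict per row)
row_data = [{'country': 'USA', 'GDP': 19390604},
            {'country': 'China', 'GDP': 12237700}]
df_from_rows = pd.DataFrame(row_data)

# Importing from a file would look something like this (hypothetical filename):
# df_from_csv = pd.read_csv('country_gdp.csv')
```

Either way, the result is an ordinary DataFrame that you can inspect and manipulate just like the one we built above.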

Having said that, I want to keep things simple here. Many of the other ways of creating DataFrames are less common or they require more explanation.

I’ll probably create separate tutorials to explain the other techniques.

Ok … now that I’ve shown you how to create a Python DataFrame, let’s look at some things that we can do with DataFrames.

Working with DataFrames is a pretty broad subject. In fact, a lot of basic data science in Python involves working with DataFrames in one way or another.

With that in mind, this section isn’t going to tell you *everything* about working with data frames.

However, it will show you some of the basics. Here, I’ll show you how to get the column and row names from a pandas DataFrame. I’ll show you basic indexing, and also basic information retrieval.

Just a quick reminder.

If you haven’t already done so, you need to import Pandas and create the DataFrame we’ll work with.

The following code is the same as the code above, so if you already ran it, you don’t need to. But if you *haven’t* run this already, go ahead.

Here’s the code to import Pandas.

import pandas as pd

And here’s the code to create the DataFrame that we’ll work with, `country_gdp_df`.

country_gdp_dict = {'country': ['USA', 'China', 'Japan', 'Germany', 'UK', 'India'],
                    'GDP': [19390604, 12237700, 4872137, 3677439, 2622434, 2597491]}

country_gdp_df = pd.DataFrame(country_gdp_dict, columns = ['country','GDP'])

Ok.

Here, we’re going to retrieve the column names from the DataFrame.

country_gdp_df.columns

Which produces the following output:

Index(['country', 'GDP'], dtype='object')

Essentially, when we retrieve the `columns` attribute from a Pandas DataFrame, it returns the columns as a Pandas `Index` object.

I’m not going to explain `Index` objects in depth here, but you can treat them as sequences. This enables you to do things like retrieving a column name by its position. Let me show you how.

Here, we’ll retrieve the first column name in the DataFrame. Remember, in Python, index values start at `0`, so if we want to retrieve the first column name, we need to retrieve column `0`.

So to retrieve the first column name of `country_gdp_df`, we request the `0`th element using bracket notation. It’s just like indexing into a Python list.

country_gdp_df.columns[0]

Which produces the following column name:

'country'

Essentially, when we use the code `country_gdp_df.columns[0]`, we are retrieving the first column name from the `country_gdp_df` DataFrame. Remember: the indexes for the column names start at `0`, so the `0`th column is the first column.
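Because the returned `Index` behaves like a sequence, you can also convert it to a plain Python list, or test whether a given name is among the columns. A quick sketch:

```python
import pandas as pd

country_gdp_df = pd.DataFrame(
    {'country': ['USA', 'China', 'Japan', 'Germany', 'UK', 'India'],
     'GDP': [19390604, 12237700, 4872137, 3677439, 2622434, 2597491]},
    columns=['country', 'GDP']
)

# Convert the Index to an ordinary list of column names
column_list = list(country_gdp_df.columns)
print(column_list)                       # ['country', 'GDP']

# Membership tests work too
print('GDP' in country_gdp_df.columns)   # True
```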

We can retrieve the row names from a DataFrame in a somewhat similar way.

One thing you need to know though: the row labels are called the “index” of the DataFrame. DataFrame indexes are a little technical and a little complicated for beginners, so in the interest of simplicity, I’m not going to write much about DataFrame indexes here.

What you really need to understand is that the `index` attribute returns the row names. And unless you’ve given the rows specific names (by specifying an index), the `index` attribute essentially returns the *row numbers*, starting at 0.

Let me show you.

Here, we’re just going to retrieve the `index` attribute using Python dot notation after the name of the DataFrame:

country_gdp_df.index

Which produces the following output:

RangeIndex(start=0, stop=6, step=1)

Again, this returns a type of `Index` object, but if you take a look you can see that it is a *range* starting at 0 and stopping at 6, in steps of 1. Remember that in Python, ranges go up to and *not* including the `stop` number. So essentially, this `RangeIndex` object includes the numbers from `0` to `5`.
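If seeing those numbers explicitly helps, you can convert the `RangeIndex` to a plain list. This is just a quick sanity-check sketch:

```python
import pandas as pd

country_gdp_df = pd.DataFrame(
    {'country': ['USA', 'China', 'Japan', 'Germany', 'UK', 'India'],
     'GDP': [19390604, 12237700, 4872137, 3677439, 2622434, 2597491]},
    columns=['country', 'GDP']
)

# The RangeIndex expands to the row numbers 0 through 5
row_labels = list(country_gdp_df.index)
print(row_labels)  # [0, 1, 2, 3, 4, 5]
```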

Before I wrap up this tutorial, I want to show you how to do some basic data inspection with Pandas DataFrames.

Here, I’m going to show you two methods that you can use to inspect your data: `head()` and `tail()`.

The `head()` method essentially shows you the first 5 rows of data.

To use it, specify the DataFrame that you want to inspect, and then use dot notation to call the `head()` method.

country_gdp_df.head()

Which produces the following output:

   country       GDP
0      USA  19390604
1    China  12237700
2    Japan   4872137
3  Germany   3677439
4       UK   2622434

Notice that this is the first 5 rows of data (rows `0` through `4`).

You can also use the tail() method in a similar way to inspect the last 5 rows of data.

Here’s some code to use the tail method:

country_gdp_df.tail()

Which prints out the following rows:

   country       GDP
1    China  12237700
2    Japan   4872137
3  Germany   3677439
4       UK   2622434
5    India   2597491

Notice that these are the *last* 5 rows of data, rows `1` to `5`. The first row (row `0`, the row for the USA) has been omitted.

This doesn’t look like much here, but when you have a dataset with hundreds or thousands of rows (or more!) this method can be very useful.
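One more detail worth knowing: `head()` and `tail()` both accept an optional integer argument that overrides the default of 5 rows. A small sketch using our DataFrame:

```python
import pandas as pd

country_gdp_df = pd.DataFrame(
    {'country': ['USA', 'China', 'Japan', 'Germany', 'UK', 'India'],
     'GDP': [19390604, 12237700, 4872137, 3677439, 2622434, 2597491]},
    columns=['country', 'GDP']
)

# Ask for a specific number of rows instead of the default 5
print(country_gdp_df.head(3))   # rows 0-2: USA, China, Japan
print(country_gdp_df.tail(2))   # rows 4-5: UK, India
```

This is handy on large datasets when even 5 rows is more than you need to check.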

After reading about the basics of Pandas DataFrames here in this tutorial, one of the next things you need to learn is how to subset your data.

That being the case, I strongly recommend that you read the following tutorials next:

- How to use the `iloc[]` method to subset a Pandas DataFrame
- How to use the `loc[]` method to subset a Pandas DataFrame

Those two tutorials will explain Pandas DataFrame subsetting. They can be a little complicated, so they have separate tutorials.

In the interest of brevity, this is a fairly quick introduction to Pandas DataFrames.

Honestly, there’s a lot more that you can (and should) learn about DataFrames in Python.

As I already mentioned, you should read our other tutorials about subsetting Pandas DataFrames.

You should also learn some of the basics of data visualization … really, what good is a DataFrame if you don’t do anything with it?

You can learn more about data visualization in Python by reading about creating scatterplots, how to create a histogram in Python, and more.

Even beyond those other tutorials, there’s still a lot more to learn about data science in Python.

What else specifically do you want to learn about? Leave a comment in the comments section at the bottom of the page and tell me.

Having said that, if you want to learn more about Pandas and more about data science in Python, sign up for our email list.

Here at Sharp Sight, we regularly post tutorials about data science topics. We have tutorials about data visualization and data manipulation. There are also regular tutorials about specific topics like Pandas, matplotlib, and more.

Additionally, we post articles about data science in R as well.

So if you’re serious about learning data science, sign up for our email list.

When you sign up, we’ll deliver our tutorials directly to your inbox, every week.

The post Pandas dataframe: a quick introduction appeared first on Sharp Sight.

The post How to make a matplotlib histogram appeared first on Sharp Sight.

If you’re interested in data science and data visualization in Python, then read on. This post will explain how to make a histogram in Python using matplotlib.

Here’s exactly what the tutorial will cover:

- A quick introduction to matplotlib
- The syntax for the matplotlib histogram
- Examples of how to make a histogram with matplotlib

Clicking on any of the above links will take you to the relevant section in the tutorial.

Having said that, if you’re a relative beginner, I recommend that you read the full tutorial.

Ok, let’s get started with a brief introduction to matplotlib.

If you’re new to Python – and specifically data science in Python – you might be a little confused about matplotlib.

Here’s a very brief introduction to matplotlib. If you want to skip to the section that’s specifically about matplotlib histograms, click here.

Matplotlib is a module for data visualization in the Python programming language.

If you’re interested in data science or data visualization in Python, matplotlib is very important. It will enable you to create very simple data visualizations like histograms and scatterplots in Python, but it will also enable you to create much more complicated data visualizations. For example, using matplotlib, you can create 3-dimensional plots of your data.

Data visualization is extremely important for data analysis and the broader data science workflow. So even if you’re not interested in data visualization per se, you really do need to master it if you want to be a good data scientist.

That means, if you’re doing data science in Python, you should learn matplotlib.

Related to matplotlib is *pyplot*.

You’ll often see pyplot mentioned and used in the context of matplotlib. Beginners often get confused about the difference between matplotlib and pyplot, because it’s often unclear how they are related.

Essentially, pyplot is a sub-module in matplotlib. It provides a set of convenient functions that enable you to create simple plots like histograms. For example, you can use `plt.plot()` to create a line chart, or you can use the `plt.bar()` function to create a bar chart. Both `plt.plot()` and `plt.bar()` are functions from the pyplot module.

In this tutorial, we’ll be using the `plt.hist()` function from pyplot. Just remember that a pyplot histogram is effectively a matplotlib histogram, because pyplot is a sub-module of matplotlib.

Now that I’ve explained what matplotlib and pyplot are, let’s take a look at the syntax of the `plt.hist()` function.

From this point forward, we’re going to be dealing with the pyplot `hist()` function, which makes a histogram.

The syntax is fairly straightforward in the simplest case. On the other hand, the `hist()` function has a variety of parameters that you can use to modify its behavior. Really. There are a lot of parameters.

In the interest of simplicity, we’re only going to work with a few of those parameters.

If you really need to control how the function works, and need to use the other parameters, I suggest you consult the documentation for the function.

There are 3 primary parameters that we’re going to cover in this tutorial: `x`, `bins`, and `color`.

The `x` parameter is essentially the input values that you’re going to plot. Said differently, it is the data that you want to plot on the x-axis of your histogram.

(If that doesn’t make sense, take a look at the examples later in the tutorial.)

This parameter will accept an “array or sequence of arrays.”

Essentially, this means that the numeric data that you want to plot in your histogram should be contained in an array-like structure, such as a Python list. Later in the tutorial, we’re actually going to provide our data in the form of a NumPy array, which is also acceptable.

The `bins` parameter controls the number of bins in your histogram. In other words, it controls the number of bars in the histogram; remember that a histogram is a collection of bars, where each bar represents a tally of the data within that part of the x-axis range.

More often than not, you’ll provide an *integer* value to the `bins` parameter. If you provide an integer, that value sets the number of bins. For example, if you set `bins = 30`, the histogram will have 30 bars.

You can also provide a string or a Python sequence to the `bins` parameter to get some additional control over the histogram bins. Having said that, using the `bins` parameter that way can be a little more complicated, and I don’t recommend it for beginners.
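For completeness, here’s a quick sketch of what passing a sequence looks like: the list sets the exact bin *edges* rather than a bin count. The edge values here are just an illustration, and the non-interactive backend line is only there so the sketch runs anywhere:

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so this runs without a display
import matplotlib.pyplot as plt

np.random.seed(42)
norm_data = np.random.normal(size=1000, loc=0, scale=1)

# Seven edges define six bins of width 1, spanning -3 to 3
counts, edges, patches = plt.hist(norm_data, bins=[-3, -2, -1, 0, 1, 2, 3])
print(len(counts))  # 6
```

Any data points outside the outermost edges are simply left out of the histogram.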

Also, keep in mind that the `bins` parameter is optional, which means that you don’t need to provide a value.

If you don’t provide a value, matplotlib will use a default. That default is defined in `matplotlib.rcParams`, which holds matplotlib’s settings. Assuming that you haven’t changed those settings, the `bins` parameter will default to 10 bins.

For examples of how to work with the bins parameter, consult the example below about histogram bins.

Finally, let’s talk about the `color` parameter.

As you might guess, the `color` parameter controls the color of the histogram. In other words, it controls the color of the histogram bars.

This parameter is optional, so if you don’t explicitly provide a color value, matplotlib will use a default (typically a sort of inoffensive blue).

If you decide to manually set the color, you can set it to a “named” color, like “red,” “green,” or “blue.” Python and matplotlib have a variety of named colors that you can specify, so take a look at the color options if you manipulate the `color` parameter this way.

You can also provide hexadecimal colors to the `color` parameter. This is actually my favorite way to specify colors in data visualizations, because it gives you tight control over the aesthetics of the chart. On the other hand, using hex colors is more complicated, because you need to understand how hex colors work. Hex colors are beyond the scope of this blog post, so I won’t explain them here.

Ok, now that I’ve explained the syntax and the parameters at a high level, let’s take a look at some examples of how to make a histogram with matplotlib.

Most of the examples that follow are simple. If you’re just getting started with matplotlib or Python, first just try running the examples exactly as they are. Once you understand them, try modifying the code little by little just to play around and build your intuition. For example, change the `color` parameter from “red” to something else. Basically, run the code and then play around a little.

One more thing before we get started with the examples.

Before you run the examples, make sure to run the following code:

import matplotlib
import numpy as np
import matplotlib.pyplot as plt

This code will import matplotlib, pyplot, and NumPy.

We’re going to be using matplotlib and pyplot in our examples, so you’ll need them.

Also, run this code to create the dataset that we’re going to visualize.

# CREATE NORMALLY DISTRIBUTED DATA
norm_data = np.random.normal(size = 1000, loc = 0, scale = 1)

This will create a dataset called `norm_data`, using the NumPy random normal function. This is essentially normally distributed data with a mean of 0 and a standard deviation of 1. How NumPy random normal works is beyond the scope of this post, so if you want to understand the code, consult our tutorial about np.random.normal.
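One small aside that the original code doesn’t include: random draws differ on every run, so if you want your `norm_data` (and therefore your histogram) to be reproducible, you can seed NumPy’s random number generator first. The seed value 42 here is arbitrary:

```python
import numpy as np

np.random.seed(42)  # any integer works; seeding makes the draws repeatable
norm_data = np.random.normal(size=1000, loc=0, scale=1)

print(norm_data.shape)  # (1000,)
```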

Ok, on to the actual examples.

Let’s start simple.

Here, we’ll use matplotlib to make a simple histogram.

# MAKE A HISTOGRAM OF THE DATA WITH MATPLOTLIB
plt.hist(norm_data)

And here is the output:

This is about as simple as it gets, but let me quickly explain it.

We’re calling `plt.hist()` and using it to plot `norm_data`.

`norm_data` contains normally distributed data, and you can see that in the visualization.

Aesthetically, the histogram is very simple. Because we didn’t use the `color` parameter or the `bins` parameter, the visualization has fallen back to the defaults. There are 10 bins (the current default) and the color has defaulted to blue. The plot is also relatively unformatted.

I will be honest. I think the default histogram is a little on the ugly side. At least, it’s rather plain. That’s OK if you’re just doing data exploration for yourself, but if you need to present your work to other people, you might need to format your chart to make it look more pleasing.

Let’s talk about how to change the color of the bars, which is one way to make your chart more visually appealing.

As noted above, we can change the color of the histogram bars using the `color` parameter.

As you saw in the previous example, the bar colors default to a sort of generic “blue” color.

Here, we’re going to manually set it to “red.”

plt.hist(norm_data, color = 'red')

The code produces the following output:

As you can see, the bars are now red.

The chart is still a little visually boring, but this at least shows you how you can change the color. As you become more skilled in data visualization, you can use the `color` parameter to make your histograms more visually appealing.

Now, let’s modify the number of bins.

Changing the number of bars can be important if your data are a little uneven. You can increase the number of bins to get a more fine-grained view of the data. Or, you can decrease the number of bins to smooth out abnormalities in your data.

Because this tutorial is really about how to create a Python histogram, I’m not going to talk a lot about histogram applications. However, I do want you to see *how* you can modify the `bins` parameter. That will give you more control over the visualization when you begin to apply the technique.

Here’s the code:

plt.hist(norm_data, bins = 50)

And here’s the output:

So what have we done here?

We increased the number of bins by setting `bins = 50`. As I noted above, the `bins` parameter generally defaults to 10 bins. Here, by increasing the number of bins to 50, we’ve generated a more fine-grained view of the data. This can help us see minor fluctuations in the data that are invisible when we use a smaller number of bins.

Now that we’ve covered some of the essential parameters of the plt.hist function, I want to show you a quick way to improve the appearance of your plot.

We’re going to use the seaborn module to change the default formatting of the plot.

To do this, we will first import seaborn.

# import seaborn module
import seaborn as sns

Next, we’ll use the `seaborn.set()` function to modify the default settings of the chart. As you’ll see in a moment, this changes the defaults for the background color, gridlines, and a few other things. Ultimately, it will just make your histogram look better.

# set plot defaults using seaborn formatting
sns.set()

Finally, let’s replot the data using plt.hist.

# plot histogram with matplotlib.pyplot
plt.hist(norm_data)

As you can see, the chart looks different. More professional, in my opinion.

The bar colors are slightly different, and the background has been changed. The changes are actually fairly minor, but I think they make a big difference in making the chart look better.

One quick note.

If you run the above code and use the `sns.set()` function to set the plot defaults with seaborn, you might run into an issue.

… you might find that all of your matplotlib charts have the new seaborn formatting.

How do you make that go away?

You can remove the seaborn formatting defaults by running the following code.

# REMOVE SEABORN FORMATTING
sns.reset_orig()

When you run this code, it will return the plot formatting to the matplotlib defaults.

Ok, let’s do one more example.

Here, I want to show you how to put the pieces together.

We’re going to modify several parameters at once to create a histogram:

# FINALIZED EXAMPLE
import seaborn as sns

sns.set()
plt.hist(norm_data, bins = 50, color = '#CC0000')

And here is the output:

What have we done here?

We used `plt.hist()` to plot a histogram of `norm_data`.

Using the `bins` parameter, we increased the number of bins to 50.

We used the `color` parameter to change the color of the bars to the hex color `#CC0000`, which is a shade of red.

Finally, we used the `sns.set()` function to change the plot defaults. This modified the background color and the gridlines.

Overall, I think this is a fairly professional looking chart, created with a small amount of code.

There’s definitely more that we could do to improve this chart (with titles, etc), but for a rough draft, it’s pretty good.
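As a sketch of those next steps, here’s one way to add a title and axis labels with pyplot. The label text and the output filename are my own placeholder choices, not part of the tutorial’s example, and the backend line just lets the sketch run without a display:

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so this runs without a display
import matplotlib.pyplot as plt

norm_data = np.random.normal(size=1000, loc=0, scale=1)

plt.hist(norm_data, bins=50, color='#CC0000')
plt.title('Distribution of simulated data')  # placeholder title
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.savefig('histogram.png')  # or plt.show() in an interactive session
```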

In this tutorial, we’re really just scratching the surface.

There’s a lot more that you can do with matplotlib, beyond just making a histogram.

To really get the most out of it, and to gain a solid understanding of data visualization in Python, you need to study matplotlib.

With that in mind, if you’re interested in learning (and mastering) data visualization and data science in Python, you should sign up for our email list right now.

Here at the Sharp Sight blog, we regularly post tutorials about a variety of data science topics … in particular, about matplotlib.

If you sign up for our email list, our Python data science tutorials will be delivered to your inbox.

You’ll get free tutorials on:

- Matplotlib
- NumPy
- Pandas
- Base Python
- Scikit learn
- Machine learning
- Deep learning
- … and more.

Want to learn data science in Python? Sign up now.
