The post Supervised vs Unsupervised Learning, Explained appeared first on Sharp Sight.

The tutorial will start by discussing some foundational concepts, and then it will explain supervised and unsupervised learning separately, in more detail.

If you need something specific, just click on one of the following links to jump to that section of the article.

**Table of Contents:**

- An Introduction to Supervised vs Unsupervised Learning
- Supervised Learning, Explained
- Unsupervised Learning, Explained

Having said that, if you’re confused about supervised vs unsupervised learning, you’ll probably want to read the whole article from start to finish.

If you’re somewhat new to machine learning, you’ve probably heard the terms “supervised” and “unsupervised” learning.

The difference is often confusing to machine learning beginners, and unfortunately, it’s poorly explained in most textbooks.

Let me quickly explain what these terms mean, and after that, I’ll dive into each topic separately, to go into a little more depth.

I think that the best way to think about the difference between supervised vs unsupervised learning is to look at the structure of the training data.

In supervised learning, the data has an output variable that we’re trying to predict.

But in a dataset for unsupervised learning, the target variable is absent. There are still “input” variables, but there’s no target.

So if we’re doing supervised learning for regression, the training data will have a numeric “target” variable that we’re trying to predict. Or if we’re doing supervised learning for classification, the dataset will have a column that contains the correct label for the row of data.
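To make this concrete, here’s a minimal sketch of what supervised training data can look like as a Pandas dataframe. The column names and values here are made up: two input (“X”) variables plus a numeric target column for a regression problem.

```python
import pandas as pd

# hypothetical training data for a regression problem:
# two input ("X") variables plus a numeric target ("y") column
training_data = pd.DataFrame({
    'x1': [1.0, 2.0, 3.0, 4.0],
    'x2': [10.0, 8.0, 6.0, 4.0],
    'y':  [2.1, 4.2, 5.9, 8.1],  # the target we're trying to predict
})

inputs = training_data[['x1', 'x2']]  # the features / predictors
target = training_data['y']           # the supervising target
```

For a classification problem, the target column would instead hold the correct class label for each row.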

This, of course, is somewhat simplified; I’m being a little imprecise for the sake of clarity.

That being the case, let’s take a look at each of these types of machine learning – supervised learning and unsupervised learning – one at a time.

As I mentioned above, in supervised learning, we have a dataset that contains “input” variables *and* an output variable.

To see this, let’s take a look at a simple example.

Below, we have a dataset.

This dataset has a set of input variables. Frequently, in machine learning we talk about these as the “X” variables, and they’re often referred to generically as x₁, x₂, …, xₙ. (Keep in mind though, that when you work on a specific machine learning problem, these input variables will often have specific names – the names of the input variables in your dataset.)

In the machine learning literature, the “input” variables collectively have several different names: they are referred to as features, predictors, or independent variables. Different people use different terminology, but feature, predictor, input variable, and independent variable all essentially mean the same thing.

But when we do supervised learning, the dataset will also have a target variable, y.

In supervised learning, this target variable is very important. In fact, in supervised learning, the task for the learning algorithm is to learn how to *predict* y on the basis of the input variables, x₁, x₂, …, xₙ.

But why do we call it “supervised” learning? The structure of the data in supervised learning actually provides some insight.

During the learning process, the target variable, y, *supervises* the learning process.

We feed the training data into the learning algorithm. During the learning process, the algorithm uses the target variable to produce the model.

In supervised learning, the result is a model that can make predictions. For any set of values for the input variables, the model will produce a predicted output that we can call ŷ.

But after the model is built, we can also use the original target to evaluate the model. To do this, we can compare ŷ to y.

Ideally, a good machine learning model will make good predictions. If the values in ŷ are close to the values in y, then it’s probably a good model. But if ŷ is too far away from y, then it might not be a good model. Or there might be a better model that we could build.

So ultimately, the comparison of the actual value to the predicted value provides feedback which “supervises” the creation of the model. The target variable also “supervises” the *evaluation* process in supervised learning.
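As a rough numeric sketch of this feedback (using made-up values), one common way to compare the actual values to the predictions is mean squared error:

```python
import numpy as np

y_actual = np.array([2.0, 4.0, 6.0, 8.0])     # the true target values
y_predicted = np.array([2.1, 3.9, 6.2, 7.8])  # the model's predictions

# mean squared error: the average squared distance between actual and predicted
mse = np.mean((y_actual - y_predicted) ** 2)
print(mse)  # ≈ 0.025
```

A smaller error suggests a better model; a large error suggests that there might be a better model we could build.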

Now to be clear, there’s a lot of nuance to this once you get into the details. I’m explaining this in a rough, imprecise way to help you understand what’s happening in supervised learning.

But the important thing to understand is that in supervised learning, we have a target variable. And that target variable “supervises” the model building process.

Before we move on to unsupervised learning, let’s quickly talk about some common supervised learning algorithms.

In machine learning, many of the most popular and most frequently used techniques are supervised learning algorithms.

For example, all of the following are supervised learning techniques:

- Linear Regression
- Logistic Regression
- Support Vector Machines
- Decision Trees (including Random Forests and Boosted Trees)
- Deep Neural Networks

(Although, there are some versions of deep networks that are *unsupervised* as well.)

What this means is that as a beginner, you’ll mostly work with supervised learning techniques. In fact, if you’re a beginner, I recommend that you mostly focus on supervised learning first, with 1 or 2 exceptions.

Now, let’s turn to *un*supervised learning.

If you already understand supervised learning, then you should be able to understand what unsupervised learning is by way of comparison.

Let’s start with the data. To me, much like when you’re trying to understand *supervised* learning, the best place to start when you’re trying to understand unsupervised learning is with the *input data*.

Similar to supervised learning, in *unsupervised* learning, our input data has “input” variables, x₁, x₂, …, xₙ.

But in contrast to supervised learning, there’s no supervising output variable in unsupervised learning. The so-called “target” variable is absent from the data. There’s nothing to predict. There isn’t a structured, well-defined output that the learning algorithm can generate.

That being the case, because the target variable is absent, we can’t use supervised learning techniques on such a dataset.

But, we can use *unsupervised* algorithms on such a dataset.

So what exactly would we use unsupervised learning for?

Isn’t machine learning about predicting things?

No, not always.

There are some types of problems where prediction is not the goal.

Instead, what’s often the case in unsupervised learning, is that we want to find *structure* in the data.

To understand this sort of problem, let’s look at a quick example.

A quintessential example of unsupervised learning is *clustering*.

There are several types of clustering algorithms, including K-means clustering, hierarchical clustering, and others.

Speaking roughly though, clustering algorithms have the same objective: to identify groups in a dataset.

To understand, let’s look at a visual example.

Let’s say that you have a dataset. And the dataset has only two variables: x₁ and x₂. (A “target” variable is completely absent from this dataset.)

Using a scatterplot, you can plot the input variables, like this:

When you plot these variables as a scatterplot, it should be obvious to you that there’s some structure in this dataset. It’s obvious to a *human* that there’s some structure here.

But how do we enable a *computer* to find that structure?

We can use unsupervised learning.

Unsupervised learning provides a set of tools that will enable a computer to identify this structure in a dataset.

So for example, we could use K-means clustering on this data, which is an unsupervised learning technique. By using K-means clustering, a computer could identify a set of “clusters” in this input data. In K-means clustering, “clusters” are groups of observations that are similar to each other.

So K-means is a technique that *enables a computer to find structure in an input dataset*.
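To make the idea concrete, here’s a minimal from-scratch sketch of the K-means loop in NumPy, run on made-up data with two well-separated “blobs.” (In practice, you’d typically use a library implementation such as scikit-learn’s `KMeans`; this is just an illustration of the mechanics.)

```python
import numpy as np

rng = np.random.default_rng(0)
# synthetic data: two well-separated "blobs" of 50 points each
X = np.vstack([rng.normal(0, 0.5, size=(50, 2)),
               rng.normal(5, 0.5, size=(50, 2))])

def kmeans(X, k, iters=10):
    centers = X[:k].copy()  # naive init: first k points (real implementations choose better)
    for _ in range(iters):
        # assign each point to its nearest center
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # move each center to the mean of its assigned points (keep old center if empty)
        centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                            for j in range(k)])
    return labels, centers

labels, centers = kmeans(X, 2)
```

After a few iterations, each of the two centers settles into one of the blobs, and the `labels` array records which cluster each observation belongs to.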

Now to be fair, this is really an ultra-simple example. Real-world clustering problems typically involve a lot more than two input variables. And the so-called “clusters” are almost never so clearly separated. In real-world clustering problems, you have messy, complex data, as well as hard choices about what actually constitutes a “cluster”.

Setting those complexities aside, this should give you a rough idea of what clustering looks like. And this, in turn, should help you understand what unsupervised learning is and why we use it: it’s often used to find structure in data.

Now before you get confused, I want to make a point. Clustering like what I showed you above is not the same thing as classification (a type of supervised learning).

In classification, there is a *well defined* set of possible output classes.

So for example, we might have a dataset where we’re trying to classify rows of data as X’s or O’s. In that case, the possible output classes are well defined: X or O.

Or in a different problem, we might have a dataset where we’re trying to classify dogs and cats. So in that problem, the output classes are well defined: dog or cat.

Classification problems have a well defined set of possible output classes.

However, in clustering, there are *not* well defined output classes. There’s nothing that we’re trying to predict. There’s just an input dataset, and the clustering algorithm tries to find distinct groups.

Unsupervised learning is somewhat less commonly used, especially by machine learning beginners. Having said that, there are still some important use cases, and a variety of techniques for different tasks.

Broadly, the most common uses for unsupervised learning are:

- dimension reduction
- clustering

For example, Principal Component Analysis (PCA) is an unsupervised learning technique. PCA is one of the most important techniques for “dimension reduction” (i.e., reducing the number of dimensions/predictors in a dataset).
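As a rough sketch of the core idea (not the full PCA workflow), you can compute principal components in NumPy by centering the data and taking its singular value decomposition; here we project made-up 5-dimensional data down to 2 dimensions:

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))  # made-up data: 100 rows, 5 input variables

X_centered = X - X.mean(axis=0)          # center each column
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
X_reduced = X_centered @ Vt[:2].T        # project onto the first 2 principal components

print(X_reduced.shape)  # (100, 2)
```

The first component captures the direction of greatest variance in the data, the second captures the greatest remaining variance, and so on.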

And, as mentioned previously, there are several different “clustering” techniques for grouping together rows of data into “clusters”.

That said, a few of the most common unsupervised learning techniques are:

- Principal Component Analysis
- K-Means Clustering
- Hierarchical Clustering

To be clear though, there are also quite a few other unsupervised learning techniques. I’ll leave those for another blog post.

Although supervised learning and unsupervised learning are the two most common categories of machine learning (especially for beginners), there are actually two other machine learning categories worth mentioning: semi-supervised learning and reinforcement learning.

Semi-supervised learning is somewhat similar to supervised learning.

Remember that in supervised learning, we have a so-called “target” vector, y. This contains the output values that we want to predict.

It’s important to remember that in supervised learning, the target variable has a value for every row.

But in semi-supervised learning, it’s a little different. In semi-supervised learning, there *is* a target variable. BUT, some of the values are missing.

So it’s similar to supervised learning, in the sense that there is a target variable that can supervise the modeling process.

But it’s slightly different in the sense that the target variable has some missing values. This introduces some challenges, and it calls for different tools.
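As a quick sketch of what such a dataset looks like (with made-up values), the target column exists, but some of its entries are missing:

```python
import pandas as pd
import numpy as np

# hypothetical semi-supervised dataset: x1 and x2 are inputs,
# y is the target, but only some rows are labeled
data = pd.DataFrame({
    'x1': [1.0, 2.0, 3.0, 4.0, 5.0],
    'x2': [5.0, 4.0, 3.0, 2.0, 1.0],
    'y':  [0, 1, np.nan, np.nan, 1],  # NaN = unlabeled row
})

labeled = data[data['y'].notna()]    # rows a supervised algorithm could use directly
unlabeled = data[data['y'].isna()]   # rows with inputs but no target
print(len(labeled), len(unlabeled))  # 3 2
```

Semi-supervised techniques try to make use of both groups of rows, rather than discarding the unlabeled ones.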

Reinforcement learning is another category of machine learning techniques.

This is a complicated subject, so I won’t explain this in depth.

But at a high level, here’s what it is:

In reinforcement learning, the learning system can perform certain actions. Favorable actions are rewarded and unfavorable actions are penalized.

Over time, the learning system “learns” the best actions to take by pursuing rewards and avoiding penalties.

Although this is a very complicated subject that’s beyond the scope of this blog post, you should note that reinforcement learning has some interesting applications. For example, machine learning systems that learn to play video games are frequently built using reinforcement learning. The AlphaGo system (built by DeepMind) also incorporated reinforcement learning.

Again though, reinforcement learning is a more advanced topic. So if you’re a relative beginner in machine learning, you should stick to supervised and unsupervised learning.

Do you still have questions about supervised vs unsupervised learning?

If you have specific questions or you think that there’s something I haven’t explained sufficiently, please leave your question in the comments section at the bottom of the page.

This article should have given you a good overview of supervised vs unsupervised learning.

But if you really want to master machine learning there is a lot more to learn. You’ll need to understand:

- regression vs classification
- model building
- model evaluation
- deployment

…. as well as a variety of specific supervised and unsupervised learning techniques.

If you’re interested in learning more about machine learning, then sign up for our email list. Through this year and into the foreseeable future, we’ll be posting detailed tutorials about different parts of the machine learning workflow.

We plan to publish detailed tutorials about the different machine learning techniques, like linear regression, logistic regression, decision trees, neural networks, and more.

So if you want to master machine learning, then sign up for our email list. When you sign up, we’ll send our new tutorials directly to your inbox as soon as they’re published.


The post A Quick Introduction to the Python Pandas Package appeared first on Sharp Sight.

Here’s a quick overview of the different sections of the article.

You can click on any of these links, and it will take you to the appropriate section if you need something specific:

**Table of Contents:**

- Introduction to Pandas
- Introduction to Dataframes
- Pandas Data Manipulation Methods
- How to “Chain” Pandas Methods Together

First off, let’s just quickly review what Pandas is.

Pandas is a data science toolkit for doing data wrangling in Python.

You’re probably aware that data wrangling (AKA, data manipulation) is extremely important in data science. In fact, there’s a saying in data science that “80% of your work in data science will be data wrangling.”

Although the reality is a bit more nuanced, that saying is mostly true.

So if you’re doing data science in Python, you need a toolkit for “wrangling” your data. That’s what Pandas provides.

Pandas gives you tools to modify, filter, transpose, reshape, and otherwise clean up your Python data.

But it’s mostly set up to work with data in a particular data structure … the Pandas dataframe.

That being the case, let’s quickly review what dataframes are.

The Pandas dataframe is a special data structure in Python.

Dataframes store data in a row-and-column format that’s very similar to an Excel spreadsheet.

So dataframes have rows and columns. Each column has a label (i.e., the column name), and each row can also have a special label called the “index.”

Keep in mind that dataframe indexes are a little complicated, so to understand them better, check out our tutorial on the Pandas dataframe indexes or our tutorial on the Pandas set index method.

Importantly, different columns can have different types of data.

So one column might have `string` data (i.e., character data), but another column might have `int` data (i.e., numeric integer data). Different columns can contain different datatypes.

Having said that, all of the data *within* a column needs to be of the *same* type. So for a column that has `string` data, all of the data will be string data. For a column that has `float` data, all of the data will be floating point numbers.
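Here’s a quick sketch of this, using a small made-up dataframe in which one column holds strings, one holds integers, and one holds floats:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['anna', 'ben', 'cara'],   # string (object) data
    'age':  [34, 28, 41],              # integer data
    'score': [88.5, 92.0, 79.5],       # floating point data
})

# .dtypes shows the datatype of each column
print(df.dtypes)
```

Each column reports its own datatype, but every value within a given column shares that type.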

In the examples that I’ve just shown you in the previous sections, the dataframes were pretty clean.

Sometimes, however, our data is messy. Maybe the data is in multiple files, so we need to join multiple files together.

Sometimes, we need to add a new column to the dataframe.

Or maybe we need to subset our data to retrieve specific rows or specific columns.

These sorts of operations are extremely common, and Pandas has a variety of tools for performing them.

Let’s take a look.

Pandas actually has a few dozen data manipulation functions, tools, and methods. To be honest, there are simply too many to cover in this introduction to the Pandas package (it is supposed to be a “quick” introduction, after all).

That being the case, I’ll simplify things a little by giving you a quick overview of how Pandas methods work. After that, I’ll show you the 5 most important data manipulation tools in Pandas that you need to know.

But first, I need to tell you *why* you should be using Pandas *methods*, instead of other ways to manipulate your data.

I need to be honest.

80 to 90% of the Python data manipulation code I see is absolutely terrible.

The reason is that most Python data scientists use what’s known as “bracket syntax” to wrangle their data.

So to retrieve a variable, they use brackets, like this: `dataframe['variable']`.

Then to add a new variable, you’ll see convoluted code like this:

```python
dataframe['new_var'] = dataframe['old_var_1'] / dataframe['old_var_2']
```

This style of code is often messy, in the sense that you need to repeatedly type the name of the dataframe.

Additionally, it’s dangerous, because in many cases, you directly overwrite your original dataframe.

It also makes it harder to perform complex, multi-step data manipulations, because you can’t perform multiple different data manipulations in series.

I have to be honest: I really dislike this style of Pandas syntax.

Instead, I strongly encourage you to use Pandas *methods*.

They’re often easier to use and easier to debug.

And even better, if you use Pandas *methods* to work with your data, you can combine multiple methods together (which I’ll show you later in this tutorial).

Really, if you aren’t using them already, you should start using Pandas methods to wrangle your Python data.

There are a few dozen Pandas methods, and they all work a little bit differently.

But in spite of their differences, there are some commonalities.

Let’s quickly review how the Pandas methods work, syntactically.

To call a Pandas method, you first type the name of the dataframe (which here, I’ve called `dataframe`).

Then you type a “dot”, and then the name of the method (e.g., `query`, `filter`, `agg`, etc.).

Then, inside the parentheses for the method, you will have some code that’s unique to that method and to exactly how you’re using it.

With this general syntax in mind, let’s take a look at 5 specific Pandas methods that you should learn first.

In my opinion, the most important data manipulation operations are:

- retrieving a subset of columns
- retrieving a subset of rows
- adding new variables
- sorting data
- aggregating data

That being the case, let’s look at the 5 Pandas methods that perform these:

- `filter()`
- `query()`
- `assign()`
- `sort_values()`
- `agg()`

There are quite a few other Pandas methods, but I strongly recommend that you learn these first. These are the tools that you’ll probably use the most often to wrangle your data.

Here, we’re going to take a look at examples of how to use `filter()`, `query()`, `sort_values()`, `assign()`, and `agg()`.

Before you run any of these examples though, you’ll need to run some preliminary code first:

```python
import pandas as pd
import seaborn as sns
import numpy as np
```

And also run the following:

```python
titanic = sns.load_dataset('titanic')
```

Since we’ll be using Pandas, obviously you’ll need to import Pandas first.

But we’ll need some data to operate on, so here, we’ve used the `sns.load_dataset()` function to load the `titanic` dataset.

Ok, let’s start with the filter method.

One of the most common tasks in data science is subsetting columns.

For example, what happens if your dataset has too many columns, and you just want to work with a few of them?

Specifically, what if you want to retrieve a subset of columns *by column name*?

In Pandas, you can do this with the `filter()` method.

Let’s take a look.

Let’s say that you’re working with the `titanic` dataframe, which has 15 columns. With 15 columns, it’s sometimes a little difficult to print, and difficult to work with. And let’s say that for right now, you only want to look at 3 columns: `sex`, `age`, and `survived`.

You can do that as follows:

```python
titanic.filter(['sex', 'age', 'survived'])
```

OUT:

```
        sex   age  survived
0      male  22.0         0
1    female  38.0         1
2    female  26.0         1
3    female  35.0         1
4      male  35.0         0
..      ...   ...       ...
886    male  27.0         0
887  female  19.0         1
888  female   NaN         0
889    male  26.0         1
890    male  32.0         0

[891 rows x 3 columns]
```

Notice what happened here. The `titanic` dataframe has 15 columns. But when we use the Pandas filter method, it enables us to retrieve a subset of columns by name.

Here, we retrieved 3 columns – `sex`, `age`, and `survived`.

To retrieve these, we used so-called “dot syntax” to call the filter method with the code `titanic.filter()`.

Then inside the parentheses, we provided a list of the names of the columns that we wanted to retrieve: `['sex', 'age', 'survived']`.

So when we use `filter`, we simply provide a list of column names and it will return that subset of columns. Notice that in the output, the columns are returned in the order they appear in the list … not in the order of the original dataframe.

Keep in mind that there are also other methods of subsetting columns, including `iloc`, which subsets rows and columns by numeric index, and `loc`, which subsets rows and columns by label. These work differently from `filter`, however, so you should learn those tools separately.

In any case, there’s still more for you to learn about `filter` as well, so for more information on how to use `filter()`, check out our tutorial on the Pandas filter method.

Next, let’s take a look at the `query()` method.

The `query()` method retrieves rows of data.

More specifically, it retrieves rows that match some logical condition that you specify.

Let’s take a look at an example, and then I’ll explain.

Here, we’re going to retrieve rows for people who embarked from Southampton.

Let’s run the code:

```python
titanic.query('embark_town == "Southampton"')
```

OUT:

```
     survived  pclass     sex   age  ...  deck  embark_town alive  alone
0           0       3    male  22.0  ...   NaN  Southampton    no  False
2           1       3  female  26.0  ...   NaN  Southampton   yes   True
3           1       1  female  35.0  ...     C  Southampton   yes  False
4           0       3    male  35.0  ...   NaN  Southampton    no   True
6           0       1    male  54.0  ...     E  Southampton    no   True
..        ...     ...     ...   ...  ...   ...          ...   ...    ...
883         0       2    male  28.0  ...   NaN  Southampton    no   True
884         0       3    male  25.0  ...   NaN  Southampton    no   True
886         0       2    male  27.0  ...   NaN  Southampton    no   True
887         1       1  female  19.0  ...     B  Southampton   yes   True
888         0       3  female   NaN  ...   NaN  Southampton    no  False

[644 rows x 15 columns]
```

Notice that the output has 644 rows of data instead of the original 891 rows from the full `titanic` dataframe. That’s because the output only contains rows where `embark_town` equals `'Southampton'`.

How did we do this?

Here, we called the `query()` method using dot syntax.

Inside of the method, we used the logical expression `'embark_town == "Southampton"'`.

Remember that `embark_town` is a column in the `titanic` dataframe. Additionally, `Southampton` is one of the values within that column.

So the expression `'embark_town == "Southampton"'` instructs the `query()` method to retrieve only those rows where `embark_town` equals `Southampton`.

A couple extra notes on this …

First, the logical expression (i.e., `'embark_town == "Southampton"'`) must be enclosed inside of quotation marks. Single quotes or double quotes will work. Essentially, the logical expression must be presented to `query()` as a `string`.

What that means is that if your logical expression *contains* a string value, you’ll need to use quotes for that string value as well. So if your logical expression is enclosed in single quotes, you need to enclose any string values in double quotes, or vice versa. (Effectively, you need to know how to work with strings to use `query` properly.)

Second, the logical condition that we used here was pretty simple. Having said that, it is possible to have fairly complex logical conditions.
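For example, here’s a rough sketch of a compound condition, using a small made-up dataframe rather than the full `titanic` data; conditions can be combined with `and`/`or` inside the expression string:

```python
import pandas as pd

df = pd.DataFrame({
    'embark_town': ['Southampton', 'Cherbourg', 'Southampton', 'Queenstown'],
    'age': [22.0, 38.0, 54.0, 31.0],
})

# keep rows where the town matches AND age is over 30
result = df.query('embark_town == "Southampton" and age > 30')
print(result)
```

Only the rows that satisfy *both* conditions are returned.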

Since there’s more to learn about this technique, you might want to check out our tutorial on the Pandas query method.

The `assign()` method adds new variables to a dataframe.

To be clear, operations to create new variables can be simple, but they can also be very complex depending on what exactly you want your new variable to contain.

For the sake of simplicity and clarity, we’ll work with an extremely simple toy example here.

In this example, we’re going to create a variable called `fare_10x`.

Imagine that you’re working with the `titanic` dataset, and you find out that the `fare` variable is off by a factor of 10. You want to create a new variable that’s equal to the original `fare` variable, multiplied by 10.

You can do this with `assign()`.

Let’s take a look:

```python
titanic.assign(fare_10x = titanic.fare * 10)
```

OUT:

```
     survived  pclass     sex   age  ...  embark_town alive  alone  fare_10x
0           0       3    male  22.0  ...  Southampton    no  False    72.500
1           1       1  female  38.0  ...    Cherbourg   yes  False   712.833
2           1       3  female  26.0  ...  Southampton   yes   True    79.250
3           1       1  female  35.0  ...  Southampton   yes  False   531.000
4           0       3    male  35.0  ...  Southampton    no   True    80.500
..        ...     ...     ...   ...  ...          ...   ...    ...       ...
886         0       2    male  27.0  ...  Southampton    no   True   130.000
887         1       1  female  19.0  ...  Southampton   yes   True   300.000
888         0       3  female   NaN  ...  Southampton    no  False   234.500
889         1       1    male  26.0  ...    Cherbourg   yes   True   300.000
890         0       3    male  32.0  ...   Queenstown    no   True    77.500

[891 rows x 16 columns]
```

Here, we’ve added a new variable to the output called `fare_10x`.

As I mentioned, this is equal to the value of the `fare` variable, times 10.

To create this, we simply called the `.assign()` method using “dot” syntax.

Then inside of the parentheses, we provided the expression `fare_10x = titanic.fare * 10`. This is a “name/value” expression that provides the name of the new variable on the left-hand side of the equal sign, and the value that we’ll assign to it on the right-hand side.

Again, to be fair, this is a bit of a toy example. It’s possible to create much more complicated variables based on various logical conditions, and other operations.

For example, we can create a 0/1 indicator variable called `adult_male_ind`, which is 1 if the person is an adult male, and 0 otherwise (this is often called “dummy encoding”).

To do this though, we need to use a special Numpy function called `np.where()`.

```python
titanic.assign(adult_male_ind = np.where(titanic.adult_male == True, 1, 0))
```

This is more complicated to do, and to accomplish it, you really need to know about Numpy.

All that being said, this example should get you started, but for more information, check out our tutorial on the Pandas assign method.

Additionally, if you want to become great at data manipulation, make sure to learn about Numpy!

Now, let’s look at the `sort_values()` method.

The `sort_values()` method *sorts* a dataframe.

This should be mostly self-explanatory, but let’s look at an example so you can see the method in action.

Here, we’re going to sort the `titanic` dataframe by the `age` variable.

```python
titanic.sort_values(['age'])
```

OUT:

```
     survived  pclass     sex   age  ...  deck  embark_town alive  alone
803         1       3    male  0.42  ...   NaN    Cherbourg   yes  False
755         1       2    male  0.67  ...   NaN  Southampton   yes  False
644         1       3  female  0.75  ...   NaN    Cherbourg   yes  False
469         1       3  female  0.75  ...   NaN    Cherbourg   yes  False
78          1       2    male  0.83  ...   NaN  Southampton   yes  False
..        ...     ...     ...   ...  ...   ...          ...   ...    ...
859         0       3    male   NaN  ...   NaN    Cherbourg    no   True
863         0       3  female   NaN  ...   NaN  Southampton    no  False
868         0       3    male   NaN  ...   NaN  Southampton    no   True
878         0       3    male   NaN  ...   NaN  Southampton    no   True
888         0       3  female   NaN  ...   NaN  Southampton    no  False

[891 rows x 15 columns]
```

By default, this sorted the data in ascending order.

We can also sort the data in *descending* order by setting `ascending = False`.

```python
titanic.sort_values(['age'], ascending = False)
```

OUT:

```
     survived  pclass     sex   age  ...  deck  embark_town alive  alone
630         1       1    male  80.0  ...     A  Southampton   yes   True
851         0       3    male  74.0  ...   NaN  Southampton    no   True
493         0       1    male  71.0  ...   NaN    Cherbourg    no   True
96          0       1    male  71.0  ...     A    Cherbourg    no   True
116         0       3    male  70.5  ...   NaN   Queenstown    no   True
..        ...     ...     ...   ...  ...   ...          ...   ...    ...
859         0       3    male   NaN  ...   NaN    Cherbourg    no   True
863         0       3  female   NaN  ...   NaN  Southampton    no  False
868         0       3    male   NaN  ...   NaN  Southampton    no   True
878         0       3    male   NaN  ...   NaN  Southampton    no   True
888         0       3  female   NaN  ...   NaN  Southampton    no  False

[891 rows x 15 columns]
```

Here, we sorted the data by the `age` variable … first in ascending order, and then in descending order.

Just like all of these tools, we called `sort_values` using “dot” syntax. We typed the name of the dataframe, then `.sort_values()`.

Importantly though, inside of the parentheses, we provided a list of variables that we wanted to sort on. Here, we only sorted on `age`, so we passed in the Python list `['age']` as the argument.

Keep in mind that you can pass in a list of *multiple* variables to sort the data on multiple variables.
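As a quick sketch (with a small made-up dataframe rather than the full `titanic` data), you can pass multiple column names, and even a matching list of sort directions:

```python
import pandas as pd

df = pd.DataFrame({
    'pclass': [3, 1, 3, 1],
    'age': [22.0, 38.0, 35.0, 26.0],
})

# sort by pclass ascending, then age descending within each class
sorted_df = df.sort_values(['pclass', 'age'], ascending=[True, False])
print(sorted_df)
```

The data is sorted on the first column, with ties broken by the second column.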

For more examples of how to sort a Pandas dataframe, check out our sort_values tutorial.

Finally, let’s look at the `agg()` method.

`agg()` summarizes your data.

For example, `agg()` can calculate the mean, median, sum, and other summary statistics.

Let’s look at a simple example.

Here, we’ll calculate the means of our numeric variables.

```python
titanic.agg('mean')
```

OUT:

```
survived       0.383838
pclass         2.308642
age           29.699118
sibsp          0.523008
parch          0.381594
fare          32.204208
adult_male     0.602694
alone          0.602694
dtype: float64
```

Notice that the output contains the means of the numeric variables. It also includes the means of variables that can be coerced to numeric (although some, like the mean of `adult_male`, have unintuitive interpretations).

To call this method, we’ve simply used dot notation to call `.agg()`. Inside of the method, we’ve provided the name of a summary statistic, `'mean'`.

Notice that the name of the statistic is being passed in as a string (i.e., it must be enclosed inside of quotation marks). Also take note that it’s possible to pass in a list of statistic names, like `['mean', 'median']`. Uses of the `agg` method can get quite complex.
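As a rough sketch with a small made-up dataframe, passing a list of statistic names returns one row per statistic:

```python
import pandas as pd

df = pd.DataFrame({
    'fare': [10.0, 20.0, 30.0, 40.0],
    'age': [20.0, 30.0, 40.0, 50.0],
})

# compute two summary statistics for every column at once
summary = df.agg(['mean', 'median'])
print(summary)
```

The output is itself a small dataframe, with one row for the mean and one row for the median.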

One quick note about these Pandas methods that we’ve looked at.

These Pandas methods create a *new* dataframe as an output.

What that means is that these data manipulation methods typically do *not* change the original dataframe. When we use these tools, the original dataframe remains intact.

This is really important, because that means that the changes that these methods make will not automatically be saved! This is really frustrating for beginners, who don’t understand what’s going on.

That being the case, you typically need to save the output of these Pandas methods, in one way or another.

If you want to save the output, you need to store it in a variable using the assignment operator.

Here is an example:

```python
titanic_embark_southampton = titanic.query('embark_town == "Southampton"')
```

Alternatively, we could overwrite and update the original dataset, also using the assignment operator:

```python
titanic = titanic.query('embark_town == "Southampton"')
```

But be careful when you do this.

If you re-use the name of your dataset like this, it will overwrite the original data. Although there are some cases where this is okay, there are other instances where you will want to keep your original dataset intact.

So again, be careful if and when you overwrite your data, and make sure that it’s exactly what you want to do.

Now that you’ve learned about the 5 most important Pandas methods, let’s quickly talk about how to combine them together.

There’s a traditional way to do this, which I don’t like much at all.

There’s also a “secret” way to do it that is much easier to write, read, and debug (this secret way is actually similar to dplyr pipes in R).

Let’s talk about the traditional way first, and then I’ll show you the newer, better way.

When you use Python methods, you can typically “chain” them together, one after another on a single line.

So let’s say that you want to first subset the rows where `embark_town == "Southampton"`

, but then you want to subset the columns to retrieve `sex`

, `age`

, and `survived`

.

You can do this by typing the name of the dataframe, and then using “dot syntax” to first call the `query` method, and then call the `filter` method immediately after it, on the same line:

titanic.query('embark_town == "Southampton"').filter(['sex', 'age', 'survived'])

In the above code, we’ve called `query()` and `filter()` in series, on the same line. The output will be a dataframe with both the rows and the columns subsetted.

This code works, but to me, there are some problems.

First, this code is extremely hard to read. The line is just too long horizontally, and everything runs together.

Second, what if we wanted to use 3 methods in a row? What about 4? This doesn’t scale well.

And third, this code will be hard to debug. If you have a problem with any one of the methods in the series and you want to *remove* it from the chain, you need to delete that section of the line, or copy the whole line and delete the section from the copy. Debugging code like this is a mess.

There’s got to be a better way, right?

Yes, and almost no one else will tell you how to do it.

But I’m a terribly generous guy, so I’ll show you the secret.

There’s actually a way to “chain” together multiple Pandas methods on *separate lines*.

The syntax is actually similar to how dplyr pipes work in the R programming language.

As an aside, I actually learned R before I learned Python. Although both are great languages, I actually think that R’s dplyr is better designed in many ways compared to Pandas.

One of the big reasons that I like R is that R’s Tidyverse functions are designed for using multiple functions in series, on separate lines. This is extremely useful for data wrangling and data analysis.

As soon as I started learning data science in Python, I wanted to replicate this behavior, but couldn’t find a way.

Eventually though, I discovered a simple way to do it, and my Pandas code has changed forever.

To create a “chain” of Pandas methods, you need to enclose the whole expression inside of parenthesis.

Once you do this, you can have multiple Pandas methods, all on separate lines. This makes writing, reading, and debugging your Pandas code *much* easier.

Here’s an example.

Let’s say that you want to use two Pandas methods, like we just did a couple sections ago. Let’s say that we want to first subset our rows with `query`

, and then we want to subset the columns with `filter`

.

Here’s how it will look with our special syntax:

This is a very powerful technique for manipulating your data. It enables you to do complex, multi-step data manipulations in an elegant way.

Before I show you an extreme case though, let’s look at a simple example.

Here, we’re going to re-do the “traditional” Pandas chain that we looked at a few sections ago.

First, we’re going to call `query()`

and then we’ll call `filter()`

:

(titanic
    .query('embark_town == "Southampton"')
    .filter(['sex', 'age', 'survived'])
)

This is effectively the same as our previous code, which looked like this:

titanic.query('embark_town == "Southampton"').filter(['sex', 'age', 'survived'])

Both pieces of code will produce the same output.

The difference is that this new version is *much* easier to read. Additionally, it’s easier to debug and modify.

For example, if you need to remove one of the methods, just put a hash mark in front of it to comment it out:

(titanic
    #.query('embark_town == "Southampton"')
    .filter(['sex', 'age', 'survived'])
)

Trust me, this is very convenient for debugging a chain of Pandas methods when you’re doing data manipulation.

It’s certainly useful even for simple chains of 2 methods, but it’s even more useful when you chain together multiple methods.

Using this chaining method, you can actually chain together as many Pandas methods as you want. You’re not limited to 2. You can do 3, 4 or more (although at some point, it will get a little ridiculous).

Let’s take a look at an example.

Here, we’re going to do 3 operations:

- subset the rows where `embark_town == "Southampton"`
- subset the columns down to `sex`, `age`, and `survived`
- sort the data in descending order by `age`

Let’s take a look:

(titanic
    .query('embark_town == "Southampton"')
    .filter(['sex', 'age', 'survived'])
    .sort_values(['age'], ascending = False)
)

OUT:

        sex   age  survived
630    male  80.0         1
851    male  74.0         0
672    male  70.0         0
745    male  70.0         0
33     male  66.0         0
..      ...   ...       ...
846    male   NaN         0
863  female   NaN         0
868    male   NaN         0
878    male   NaN         0
888  female   NaN         0

[644 rows x 3 columns]

Here, we did a somewhat complex, multi-step data manipulation with code that’s relatively easy to read and easy to work with.

And as I mentioned, we could actually add more methods if we needed to!
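For example, here’s a sketch of a four-step chain. It uses a small toy dataframe (hypothetical values) in place of the full titanic data, and adds `head()` to the three methods we just used:

```python
import pandas as pd

# A small toy dataframe (hypothetical values, standing in for the titanic data)
titanic_small = pd.DataFrame({
    'embark_town': ['Southampton', 'Cherbourg', 'Southampton', 'Southampton'],
    'sex': ['male', 'female', 'female', 'male'],
    'age': [22.0, 38.0, 26.0, 35.0],
    'survived': [0, 1, 1, 0],
})

# Four chained methods: subset rows, subset columns, sort, keep the top 2
result = (titanic_small
    .query('embark_town == "Southampton"')
    .filter(['sex', 'age', 'survived'])
    .sort_values(['age'], ascending=False)
    .head(2)
)
print(result)
```

Each method in the chain receives the dataframe produced by the method above it, so you can keep stacking steps on new lines.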

One additional comment about this multi-line chaining syntax.

When you read this code, you read each line with a “then”.

Let’s look at out multi-line chaining code again, and I’ll add some comments to show you what I mean.

(titanic                                     # Start with the titanic dataset
    .query('embark_town == "Southampton"')   # THEN retrieve the rows where embark_town equals Southampton
    .filter(['sex', 'age', 'survived'])      # THEN retrieve the columns for sex, age, and survived
    .sort_values(['age'], ascending = False) # THEN sort the data by age in descending order
)

We can read this code line-by-line as a series of procedures.

In my opinion, writing your code this way makes it 10X easier to read.

You should start doing this. It will make your code easier to write, easier to debug, and easier to read.

In addition to readability, using these multi-line Pandas chains makes it much easier to do complex data manipulations.

For example, in a previous tutorial series, we wrangled and analyzed covid-19 data. In one part of that analysis, we chained together *6 different Pandas tools*.

Doing data manipulation this way is *extremely powerful*. Once you master this technique, you’ll never go back.

It’s simple, powerful, and quite honestly, a joy to use.

Hopefully, this introduction to the Python Pandas package was helpful.

Data manipulation is a critical, core skill in data science, and the Python Pandas package is really necessary for data manipulation in Python. Like it or not, you need to know it if you want to do data science in Python.

Having said that, there’s a lot that this tutorial didn’t cover. There’s a *lot* more to learn about Pandas.

If you want to dive deeper into other parts of the Pandas package, and on data manipulation generally, check out these other tutorials:

- An Introduction to Pandas dataframes
- 3 Secrets for Mastering Data Manipulation
- Why data manipulation is the foundation of data science
- The 19 Pandas functions you need to memorize

Essentially, even though this tutorial should get you started, there’s a lot more to learn. Once you master *all* of the essentials of Pandas, you’ll be able to do much, much more.

Do you have questions about the Python Pandas package?

Is there something that you don’t understand, or something you think I missed?

If so, leave your question in the comments section near the bottom of the page.

If you’re really ready to master Pandas, and master data manipulation in Python, then you should enroll in our premium course, Pandas Mastery.

Pandas Mastery will teach you all of the essentials you need to do data manipulation in Python.

It covers all of the material in this tutorial, and a lot more. It will teach you all of the most important Pandas methods (a few dozen), and how to combine them.

But it will also show you a unique training system that will help you *memorize all of the syntax you learn*, and become “fluent” in Python data manipulation.

If you’re serious about mastering data manipulation in Python, and you want to learn Pandas as fast as possible, you should enroll.

We’re reopening Pandas Mastery next week, so to be notified as soon as the doors open click here and sign up for the waiting list:

Yes! Tell Me When Pandas Mastery Opens Enrollment

The post A Quick Introduction to the Python Pandas Package appeared first on Sharp Sight.

]]>The post A Quick Introduction to Machine Learning appeared first on Sharp Sight.

]]>Here’s a quick table of contents that will give you an overview of the article. If you want to read about something specific, just click on the link and it will take you to that section of the tutorial.

**Table of Contents:**

- WTF is Machine Learning?
- Examples of How Machine Learning is Used Today
- How Machine Learning Works
- The Model Building Process
- Different Machine Learning Techniques

Of course, if you’re really new to data science generally, and machine learning in particular, you’ll probably want to read the whole article. You’ll get a much better overview if you read the whole thing, start to finish.

Ok … let’s get to it.

So what is machine learning, anyway?

If you’ve done any reading about data science in the last few years, you’ve probably heard the term “machine learning.” Machine learning has become very popular in the tech community generally and in data science specifically.

You’ve probably already heard that machine learning is a form of artificial intelligence. This is true.

But machine learning is a particular type of artificial intelligence.

As opposed to older forms of AI, like rule-based systems where a programmer hard-coded IF/THEN statements to instruct a computer how to behave, machine learning takes a more data-driven approach.

Machine learning is a set of techniques that enable computers to learn from data.

As I suggested previously, most people consider machine learning to be a sub-discipline of AI. But given the data-driven approach, machine learning also has deep roots in statistics. As such, machine Learning sits at the intersection of data science, artificial intelligence, statistics, and computer science.

Having said that, saying that machine learning enables computers to “learn from data” might seem a little abstract. It might be easier to understand what machine learning is by looking at how it’s used today.

You can start to understand what machine learning is by looking at where and how it’s used. It’s actually being used increasingly in a wide range of technology products.

Here are a few examples of machine learning in our everyday lives:

- recommendation systems (on sites like Amazon and Netflix)
- self driving cars
- spam classification

Let’s take a quick look at these.

Have you ever considered purchasing something on Amazon, and noticed a small section on the page titled “Books you may like” or “Other products you may like”?

You might have seen a similar section on the Netflix home-screen titled “Because you watched …”. So if you watched The Avengers, this section of Netflix recommends other movies *similar to* or *related to* The Avengers.

Broadly, these types of systems are called “recommendation systems,” and they are typically built with various forms of machine learning tools.

At a high level, these systems “learn” from your past purchases and use history. Amazon has data on your past purchases, and they use that data to build a machine learning system which “learns” what you like. Once it knows what you like, the ML system can make suggestions (i.e., predictions) about what *other* items you might like.

Another cool application of machine learning is self driving cars.

Even a decade ago, it would have seemed impossibly futuristic to have cars that could drive themselves, even a little bit.

But today, Teslas are sold with an “autopilot” feature that enables the car to “to steer, accelerate and brake” automatically under certain conditions.

This self-driving feature still has limited capabilities at this stage, but it does work in some circumstances.

So, how did they use machine learning to enable this?

Teslas and similar cars are equipped with sensors, radar systems, and cameras. Those cameras and sensors produce a stream of data, which Tesla’s engineers have fed into a machine learning system (specifically, a “a deep neural network” system). That machine learning system has “learned” about different road features like cars, road signs, road markings, etc, and learned about appropriate responses to different road features.

Data feeds into the system, and the system has learned to evaluate and respond to different driving events.

Perhaps the most canonical example of how machine learning is used in everyday life is the spam filter for your email.

Almost all modern email clients, such as Google’s Gmail, have a spam filter.

The spam filter automatically evaluates incoming email messages and attempts to categorize any “junk” mail as “spam,” after which, it’s sent to the spam folder so you don’t have to see it.

This too is built with machine learning.

The contents of an email – things like words, grammar, titles, senders – can all be considered forms of data. Email companies have used historical email data to “train” machine learning systems. These systems have “learned” to categorize and identify “spam” messages based on the email contents.

Now, after training these spam classification systems, email services can use them to classify new incoming messages so your inbox stays relatively free of junk email.

What you’ll notice about all of these examples – recommendation systems, self-driving cars, and spam filters – is that there is a data stream.

The data stream is used to “train” the machine learning system. The machine learning system “learns” from the data. And then the system produces some output like a prediction, classification, or recommendation.

Although they all use different machine learning techniques, they all “learn” from some data stream.

Essentially, if a software system today appears to perform some type of prediction or classification, there’s a good chance that it’s using some machine learning.

So now that we’ve looked at a few examples of machine learning so you can see how it’s used today, let’s look at how machine learning works at a high level.

There are a variety of different types of machine learning tools which have different strengths and different applications.

But although there are differences from one system to another, there are some commonalities with regard to how these machine learning systems are created.

That being the case, we can examine how the machine learning model building process works at a high level. This will give you a rough overview of what actually happens when we use machine learning on a dataset.

Essentially, when we build a machine learning model, we have a dataset called a *training* dataset.

We then use an algorithm to extract some knowledge from that dataset. The machine learning algorithm “learns” from the training data.

Once this process is complete, we can deploy the model as a system that will accept *new* data, and will produce some output when it sees the new data. Speaking very generally, this output is frequently a prediction or classification of some type.

But the details about how we build, select, and deploy a model are of course, a little more involved.

Let’s quickly take a closer look at the process of how machine learning models are built.

For the most part, the model building process is a step-by-step process that follows a similar general path for each project.

Of course, there are always little differences from one project to another, but at a high level, there are a few typical steps when you build a machine learning system.

This image shows the model building process. It simplifies things quite a bit, but it gives you a rough idea of what happens when you build a machine learning system.

Typically, you start by clarifying the objectives of the machine learning system. In a business setting, this frequently involves talking to team members and business partners to generate system requirements and outcomes.

Then you get the data and clean the data. If you have any experience with data science already, you’ve probably heard that “data wrangling is 80% of the job” in data science. It’s a little more complicated than that, but it’s true that data preparation and exploration is a big part of the task when you create a new machine learning model.

Next you build several models. Notice that I said model*s*, plural. In almost all cases, you’ll need to build several different models. Sometimes this means building models that are very similar but with slight differences that change or modify the performance. But this can also mean using multiple different machine learning techniques to build different types of machine learning systems. For example, for a project, you might build a decision tree model, a support vector machine, and a logistic regression model, just to see how each different model type performs.

And finally, once you’ve built several models, you need to evaluate them, select the “best” model relative to your project requirements. Once you select one of the models, you finalize it and then deploy the model.

Notice also the arrows that sometimes point backwards to a previous step in the process. This is important. Building a machine learning model is highly iterative, and sometimes you need to go back and re-do a previous step of the process.

For example, you might start building your models, and realize that you need more data or different data. In that case, you’d need to go backwards, get new data, clean it, and then start moving through the steps of the model building process again.

Obviously, the process is a lot more complicated once you actually start doing the work. There are a lot of fine-grained details that I’m passing over for the sake of simplicity, clarity, and brevity.

But this should give you a rough idea of how the model building process works in machine learning.
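To make the process a bit more concrete, here’s a minimal sketch of the “build several models, evaluate them, select the best” loop described above. The article doesn’t name a specific library, so this sketch assumes scikit-learn and uses synthetic data in place of a real project dataset:

```python
# A minimal sketch of the model-building loop (assumes scikit-learn;
# the article itself doesn't prescribe a library)
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# 1. Get (synthetic) data and split it into training and test sets
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 2. Build several candidate models with different techniques
models = {
    'decision tree': DecisionTreeClassifier(random_state=0),
    'logistic regression': LogisticRegression(max_iter=1000),
    'support vector machine': SVC(),
}

# 3. Evaluate each model on held-out data and keep the best one
scores = {name: m.fit(X_train, y_train).score(X_test, y_test)
          for name, m in models.items()}
best_name = max(scores, key=scores.get)
print(best_name, round(scores[best_name], 3))
```

In a real project, the evaluation step is more involved (cross-validation, business-specific metrics), but the overall shape of the loop is the same.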

As I mentioned previously, there are many different techniques for doing machine learning.

And when we build a machine learning system, we often try many different techniques, and then evaluate the different models and compare them against each other.

The reason that we do this is that there are many different algorithms that can operate on datasets in order to produce useful outputs.

These different algorithms – these different machine learning techniques – have strengths and weaknesses.

Some work best when you have small datasets.

Others work comparatively better when you have large datasets.

Some work best on highly structured data …

Others work relatively better when you’re inputting unstructured data.

And so on …

There are dozens (even hundreds) of different machine learning techniques, depending on how you want to categorize them.

For example, a few of the common machine learning techniques are:

- linear regression
- logistic regression
- decision trees
- random forests
- boosted trees
- support vector machines
- deep neural networks

These are a few broad classes of machine learning techniques, and there are many variants of almost all of these tools.

Whenever you build a machine learning system, you’ll literally have dozens of possible techniques to use, and many ways you can customize or optimize each technique.

Moreover, choosing the right technique for a particular problem is both an art and a science. There are some general rules of thumb for choosing the right technique, but there’s also a bit of “art” involved, in the sense that it takes experience and intuition gained over months and years of practice.

This article should have given you a quick overview of what machine learning is and how it works.

But there’s still more to learn.

We still need to cover topics like:

- regression vs classification
- supervised vs unsupervised learning
- the bias/variance problem
- the problem of overfitting

… and quite a bit more.

If you have specific questions about machine learning that I haven’t addressed in this article, please leave your question in the comments section at the bottom of the page.

If you’re interested in learning more about machine learning, then sign up for our email list. Through this year and into the foreseeable future, we’ll be posting detailed tutorials about different parts of the machine learning workflow.

We’ll be addressing some of the other topics I just mentioned. We also plan to publish detailed tutorials about the different machine learning techniques, like linear regression, logistic regression, decision trees, neural networks, and more.

So if you want to learn and master machine learning, then sign up for our email list. When you sign up, we’ll send our new tutorials directly to your inbox as soon as they’re published.

The post A Quick Introduction to Machine Learning appeared first on Sharp Sight.

]]>The post How to use the R case_when function appeared first on Sharp Sight.

]]>It explains the syntax, and also shows clear examples in the examples section.

You can click on any of the links below, and it will take you to the appropriate section in the tutorial.

**Table of Contents:**

Having said that, the tutorial might make more sense if you read it start to finish.

With that in mind, let’s jump in.

Frequently, when we’re doing data manipulation in R, we need to modify data based on various possible conditions.

This is particularly true when we’re creating new variables with the mutate function from dplyr.

To show you this, let’s look at an example.

Let’s say that there’s a class of students in a statistics class. These students take a test, and they get a score of 0 to 100 on the test.

Based on their test score, each student will get a test grade:

- If the score is greater than or equal to 90, assign an ‘A’
- Else if the score is greater than or equal to 80, assign a ‘B’
- Else if the score is greater than or equal to 70, assign a ‘C’
- Else if the score is greater than or equal to 60, assign a ‘D’
- Else, assign an ‘F’

So you have one piece of information, and based on that information, you’re trying to generate new values based on conditions. You need to generate new information with some if-elif-else style logic.

How do you do this in R?

You can do it in R with the `case_when()` function.

To understand how, let’s look at the syntax.

Here, we’ll look at the syntax of `case_when`

.

The case_when syntax can be a little bit complex, especially if you use it with multiple possible cases and conditions.

That being the case, I’ll try to explain this in stages, to help you understand.

We’ll first look at the syntax for a very simple use of case_when, and then we’ll move on to a use that has multiple conditions.

Let’s first look at a simple example of the syntax.

We can use case_when to implement a simple sort of logic, where the function just tests for a single condition, and outputs a value if that condition is `TRUE`.

To do this syntactically, we simply type the name of the function: `case_when()`. Then, inside the parenthesis, there is an expression with a “left hand side” and a “right hand side,” which are separated by a tilde (`~`).

Inside the parenthesis of case_when, the left hand side is a conditional statement that should evaluate as `TRUE` or `FALSE`.

This condition is the condition that we’re looking for that indicates membership in a particular case.

This will almost always be a:

- Comparison operation (e.g., `>=`)
- Compound logical expression that combines multiple comparison operations with the and/or/not operators (`&`, `|`, `!`)

Essentially, the left hand side of the expression needs to be a logical expression that evaluates as `TRUE` or `FALSE`.

This is the “match condition” that we’re looking for to match a particular “case.”

The right hand side of the expression provides the replacement value.

So if the left hand side is looking for the values that match a particular case, the right hand side of the expression provides the output of `case_when()` for that case.

The explanation above covers how case_when() works if we have a single condition and case that we’re looking for.

But the real power of `case_when()` comes in when you’re using it to implement if/else logic, or if/elif/else logic with multiple cases.

Let’s take a look at the syntax for those

In the syntax explanation immediately above, I showed you how to use case_when with a simple condition, but nothing else.

Here, we’ll look at the syntax that searches for a condition and assigns an output if that condition is `TRUE`

. But if the condition is `FALSE`

, output a different value.
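In skeleton form, the if/else version looks like this (`condition`, `if-output-value`, and `else-output-value` are placeholders):

```
case_when(condition ~ if-output-value
          ,TRUE ~ else-output-value
          )
```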

In this syntax for if-else using case_when, you might have noticed the `TRUE` syntax in the second line. Why do we need this?

Remember from the earlier section that when we use case_when, we use two-sided expressions to evaluate a condition, and then output a value if that condition is `TRUE`. If the left hand side is `TRUE`, then `case_when()` outputs the value on the right hand side.

In this syntax example here, the second line hard-codes the value `TRUE` in that final two-sided expression. This forces case_when to output the “`else-output-value`” if none of the previous conditions were `TRUE`.

Now that we’ve looked at two examples with one condition, let’s look at how case_when() works when we have multiple cases.

The case_when syntax that tests for different cases is similar to the syntax for one case.

When we have multiple cases, we have “a sequence of two-sided formulas.” Said differently, the syntax will have a sequence of multiple formulas for a “test condition” and “output”.
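In skeleton form, that sequence of two-sided formulas looks like this (the condition and output names are placeholders):

```
case_when(condition-1 ~ output-value-1
          ,condition-2 ~ output-value-2
          ,condition-3 ~ output-value-3
          )
```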

So in this syntax, `condition-1` is a logical condition that tests for the first case, and `output-value-1` is the output. Then `condition-2` is a logical condition that tests for the second case. And so on.

Although this syntax shows formulas for three cases, we can technically have many more (although the code would get messy).

Before we look at some examples, there’s one last bit of syntax that I’ll explain.

When you’re using case_when with multiple cases, it’s like using multiple if-else statements, where you test the first condition, and then output a value if condition 1 is true. Then you test the second condition, and output a different value if condition 2 is true. And so on.

But typically, when you do multiple if-else statements, there’s a final “else” that provides an output if none of the previous conditions were true.

How do we do that with case_when?

I actually showed you this earlier in the syntax explanation for if/else logic, but let’s look at it here in the context of if/elif/else.

If we’re implementing if/elif/else logic, we need to have a final two-sided formula (after the other two-sided formulas), that specifies a value to output if none of the other conditions were true.

Notice exactly how we do this.

On the right hand side of the final two-sided equation, we have the “else” output value.

But the *left* hand side of the final two-sided equation is the boolean value `TRUE`.

Why?

Remember, for every two-sided formula, if the left hand side is `TRUE`, then it outputs the right hand side.

So for this final formula, we *force* it to evaluate as `TRUE` by literally using the value `TRUE`. This forces case_when to output the “`else-output-value`” for any remaining values that weren’t previously categorized.

It’s a bit of a syntactical hack to force case_when to categorize “everything else”.

I realize that all of this might seem a little abstract, and possibly a little difficult to understand.

Because of that, I think it’s very useful to look at examples of how to use case_when with real data.

So let’s do that.

Here we’ll take a look at several examples of how to use the R case_when function.

For simplicity and clarity, we’re going to start with a simple example of how to use case_when on an R vector.

But since we commonly use case_when with *dataframes*, the remaining examples will show you how to use case_when on an R dataframe.

You can click on any of the following links, and it will take you to the appropriate example.

**Examples:**

- Use case_when to perform a simple if_else
- Use case_when to perform if-elif-else
- Use case_when to do if-else, and create a new variable in a dataframe
- Create new variable by multiple conditions via mutate (if-elif-else)
- Create a new variable in a dataframe with case_when, using compound logical conditions

Before you run the examples, you’ll need to run some code to import the case_when function, and also to create some data that we’ll work with.

The case_when function is part of the `dplyr` library in R. That being the case, you’ll need to import `dplyr` explicitly, or import the `tidyverse` package (which includes `dplyr`).

You can do that by running the following:

library(dplyr)

Or alternatively, you can import the Tidyverse like this:

library(tidyverse)

In the following examples, we’re going to work with a vector of data, and also a dataframe.

You can run this code to create the vector:

test_score_vector <- c(94,90,88,75,66,65,45)

This vector contains several numbers that represent student test scores.

We'll also create a dataframe called `test_score_df` that contains related data.

test_score_df <- tribble(
  ~student, ~major, ~test_score
  ,'natascha', 'business', 94
  ,'arun', 'statistics', 90
  ,'mike', 'statistics', 88
  ,'steve', 'statistics', 75
  ,'james', 'business', 66
  ,'ashley', 'statistics', 65
  ,'oscar', 'statistics', 45
  )

The numbers in the `test_score` variable are the same numbers from `test_score_vector`.

But the `test_score_df` dataframe also contains student names and each student's major (in the `student` variable and `major` variable, respectively).

Once you run the code to create these datasets, you'll be ready to go.

First, we'll do a very simple example.

Here, we're going to operate on the vector `test_score_vector`, which contains test scores for seven students.

We're going to use case_when to assign a Pass/Fail grade for each score.

If the test score is greater than or equal to 60, case_when will return '`Pass`'. Otherwise, case_when will return '`Fail`'.

Let's take a look:

case_when(test_score_vector >= 60 ~ 'Pass'
          ,TRUE ~ 'Fail'
          )

OUT:

[1] "Pass" "Pass" "Pass" "Pass" "Pass" "Pass" "Fail"

This is fairly simple, but let me explain.

Inside the parenthesis of case_when, we have the expression `test_score_vector >= 60 ~ 'Pass'`. This checks each value of `test_score_vector` to see if the value is greater than or equal to 60. If the value meets this condition, case_when returns 'Pass'.

However, if a value does *not* match that condition, then case_when moves to the next condition.

You'll see on the second line, we have the expression `TRUE ~ 'Fail'`. This effectively assigns the value '`Fail`' to all of the values that didn't match the first condition.

This is like a catch-all "else" statement in a typical if/else statement.

Next, we're going to use `case_when()` on a vector of data, `test_score_vector`, but we're going to use it to test multiple cases and assign the following values:

- If `test_score_vector` is greater than or equal to 90, assign '`A`'
- Else if `test_score_vector` is greater than or equal to 80, assign '`B`'
- Else if `test_score_vector` is greater than or equal to 70, assign '`C`'
- Else if `test_score_vector` is greater than or equal to 60, assign '`D`'
- Else, assign '`F`'

So we're going to use `case_when()` as an if-elif-else statement, applied to a vector of data.

Let's take a look.

case_when(test_score_vector >= 90 ~ 'A'
          ,test_score_vector >= 80 ~ 'B'
          ,test_score_vector >= 70 ~ 'C'
          ,test_score_vector >= 60 ~ 'D'
          ,TRUE ~ 'F'
          )

OUT:

[1] "A" "A" "B" "C" "D" "D" "F"

So what happened here?

The input was the vector `test_score_vector`, which contained the values `c(94,90,88,75,66,65,45)`.

The output was the values `"A" "A" "B" "C" "D" "D" "F"`.

Essentially, case_when evaluated each number in the input vector, and assigned an output value depending on that input:

- If the value was greater than or equal to 90, it assigned the value '`A`'.
- Then, if the value was greater than or equal to 80, but less than 90, it assigned the value '`B`'.
- etc.

So depending on the input number, it assigned a letter score of A, B, C, D, or F ... just like most grading schemes in the USA.

Notice as well the final line of the case_when statement. The final line, `TRUE ~ 'F'`, effectively assigns the value '`F`' as an "else" value, if none of the previous conditions were `TRUE`.

Next, we're going to use case_when in the context of manipulating a *dataframe*.

This example will actually be almost exactly the same as example 1, but instead of operating on a vector, we'll operate on a dataframe.

So here, we're going to add a new variable to our dataframe, `test_score_df`. Specifically, we're going to add a variable called `pass_fail_grade`, which will assign '`Pass`' if the test score is greater than or equal to 60, and will assign '`Fail`' otherwise.

To do this, we're going to use `case_when`, but we're going to use it *inside* of the dplyr mutate function.

Remember: the dplyr mutate function adds new variables to an R dataframe.

Let's take a look.

test_score_df %>%
  mutate(pass_fail_grade = case_when(test_score_vector >= 60 ~ 'Pass'
                                     ,TRUE ~ 'Fail'
                                     )
         )

OUT:

# A tibble: 7 x 4
  student  major      test_score pass_fail_grade
1 natascha business           94 Pass
2 arun     statistics         90 Pass
3 mike     statistics         88 Pass
4 steve    statistics         75 Pass
5 james    business           66 Pass
6 ashley   statistics         65 Pass
7 oscar    statistics         45 Fail

What happened here?

Notice that the output dataframe has a new variable called `pass_fail_grade`.

This variable contains the values `Pass` or `Fail`, which have been assigned depending on the value of `test_score`. If `test_score` is greater than or equal to 60, then the assigned value is `Pass`, else the assigned value is `Fail`.

Also take note that in order to do this, we needed to use case_when *inside* of mutate.

So the code starts at the top with the name of the dataframe, `test_score_df`.

We used the pipe operator to pipe the dataframe into `mutate`, to create a new variable.

Inside of `mutate`, we call `case_when`.

`case_when` looks at the `test_score` variable, and tests different conditions for different cases, assigning a '`Pass`' if `test_score` is greater than or equal to 60, else assigning a value of '`Fail`'.

But importantly, the Pass/Fail output of `case_when` is being assigned to the new variable `pass_fail_grade`. This all happens inside of the `mutate` function.

I realize that this is a slightly more complicated application, but in reality, this is a very common way to use case_when in R. We commonly use case_when to create new variables in a dataframe, in conjunction with the mutate function.

Now, let's increase the complexity.

This example will be somewhat similar to example 3, in that we're going to operate on a dataframe.

But it's also similar to example 2, in the sense that we'll use case_when to look for multiple different cases.

Here, we're going to start with the `test_score_df` dataframe. We'll pipe that into the mutate function, to create a new variable called `test_grade`. Inside of mutate, to generate the specific values of `test_grade`, we'll use case_when.

Let's take a look.

test_score_df %>%
  mutate(test_grade = case_when(test_score_vector >= 90 ~ 'A'
                                ,test_score_vector >= 80 ~ 'B'
                                ,test_score_vector >= 70 ~ 'C'
                                ,test_score_vector >= 60 ~ 'D'
                                ,TRUE ~ 'F'
                                )
         )

OUT:

# A tibble: 7 x 4
  student  major      test_score test_grade
1 natascha business           94 A
2 arun     statistics         90 A
3 mike     statistics         88 B
4 steve    statistics         75 C
5 james    business           66 D
6 ashley   statistics         65 D
7 oscar    statistics         45 F

If you understood example 2 and example 3, then this should make some sense.

Here, we're using case_when inside of mutate to create a new categorical variable.

The case_when function is operating on test_score, and outputs:

- '`A`' if `test_score` is greater than or equal to 90
- '`B`' if `test_score` is greater than or equal to 80
- '`C`' if `test_score` is greater than or equal to 70
- '`D`' if `test_score` is greater than or equal to 60
- '`F`' if none of the previous conditions were true

It evaluates these conditions one at a time, from top to bottom, and if a condition is false, it just moves on to the next.

The output of case_when is being saved with the name `test_grade`, which mutate adds to the output dataframe.

Let's do one final example.

Here, we're going to add a variable with a Pass/Fail grade to our dataframe, `test_score_df`.

This is somewhat similar to example 3. Like example 3, we'll be adding a pass/fail variable to the dataframe.

*But*, there will be an important difference here.

In this example, we're going to use slightly more complex conditions to assign `Pass` or `Fail`.

We're going to assign the Pass/Fail grade based on two variables: test score and major.

Here, `case_when` will use the following logic:

- Everyone who gets a score over 70 will pass
- If a person gets above a 60, and is **not** a statistics major, they will also pass
- Everyone else will fail

So effectively, if a person gets between a 60 and 70 on the test, the Pass/Fail grade will depend on their major. In that range, people with a statistics major will fail, but everyone else will pass.

Let's take a look.

test_score_df %>%
  mutate(pass_fail_grade = case_when(test_score_vector >= 70 ~ 'Pass'
                                     ,(test_score_vector >= 60) & (major != 'statistics') ~ 'Pass'
                                     ,TRUE ~ 'Fail'
                                     )
         )

OUT:

# A tibble: 7 x 4
  student  major      test_score pass_fail_grade
1 natascha business           94 Pass
2 arun     statistics         90 Pass
3 mike     statistics         88 Pass
4 steve    statistics         75 Pass
5 james    business           66 Pass
6 ashley   statistics         65 Fail
7 oscar    statistics         45 Fail

So what happened here?

Notice that everyone with a test score of 70 or above received a `Pass` grade.

Notice that everyone with a test score below 60 received a `Fail` grade.

But in the range between 60 and 70, there are two special cases (the records for `james` and `ashley`).

James had a test score of 66, but he's a business major, so he passed.

Ashley received a score of 65, but she's a statistics major, so she failed.

The logic for this was in the second line of case_when, with the code `(test_score_vector >= 60) & (major != 'statistics') ~ 'Pass'`.

This code assigned a `Pass` grade if test score was greater than or equal to 60 AND major was **not** equal to '`statistics`'. Effectively, any row of data that had a grade between 60 and 70 and any major other than statistics would evaluate as `TRUE` on the left hand side of the expression, and would receive a `Pass`.

Rows of data with a test grade between 60 and 70 and a `statistics` major would evaluate as `FALSE`, which would then cause case_when to evaluate the row of data with the expression `TRUE ~ 'Fail'`, which would automatically assign a grade of '`Fail`'.

Effectively, with this grading scheme, statistics majors are evaluated more strictly and must earn a test score above 70 in order to pass, but other majors only need to score above 60.

Now that you've seen some examples of `case_when`, let's review some frequently asked questions about this function.

**Frequently asked questions:**

- How do you use case_when to perform if-else?
- How do you use case_when to perform if-elif-else?
- How do you use case_when to add a new variable to a dataframe?

To use case_when as an if-else generator, you simply have one test expression, and then a second catch-all expression at the end with the form `TRUE ~ 'else-value'`.

I covered this in example 1 and example 3.

Example 1 shows you how to do this with a vector of data.

Example 3 shows you how to do this with an R dataframe to create a new variable.

To use case_when as an if-elif-else function, you will have several test conditions in sequence, and then a final catch-all expression at the end with the form `TRUE ~ 'else-value'`.

I covered this in example 2 and example 4.

Example 2 shows you how to do if/elif/else with a vector of data.

Example 4 shows you how to do if/elif/else with an R dataframe to create a new variable.

To create a new variable in a dataframe using case_when, you need to use case_when inside of the dplyr mutate function.

I show examples of this in example 3, example 4, and example 5.

Do you have other questions about case_when?

If so, leave your question in the comments section below.

The case_when function is extremely useful for doing data manipulation in R.

But, it's really one tool among several dozen tools in dplyr and the Tidyverse.

If you want to master data manipulation in R, you really need to master all of the other functions like mutate, filter, group_by, and many more.

And beyond that, there's more to learn about data visualization and data analysis in R too.

Having said that, if you're serious about learning dplyr, and data science in R, you should consider joining our premium course called *Starting Data Science with R*.

Starting Data Science will teach you all of the essentials you need to do data science in R, including:

- How to manipulate your data with dplyr
- How to visualize your data with ggplot2
- Tidyverse helper tools, like tidyr and forcats
- How to analyze your data with ggplot2 + dplyr
- and more ...

Moreover, it will help you completely *master* the syntax within a few weeks. We'll show you a practice system that will enable you to *memorize* all of the R syntax you learn. If you have trouble remembering R syntax, this is the course you've been looking for.

Find out more here:

Learn More About Starting Data Science with R

The post How to use the R case_when function appeared first on Sharp Sight.

I’ll explain exactly what the function does, how the syntax works, and I’ll show you clear examples of how to use np.dot.

**Table of Contents:**

If you need something specific, you can click on any of the above links, and it will take you to the appropriate section of the tutorial.

Having said that, if you’re new to Numpy, or need a quick refresher about mathematical dot products, you should probably read the whole tutorial.

First of all, let’s start with the basics.

What does Numpy dot do?

At a high level, Numpy dot computes the dot product of two Numpy arrays.

If you’re a little new to Numpy though, or if you don’t completely understand dot products, that might not entirely make sense.

So let’s quickly review some basics about Numpy and about dot products.

Let’s start with Numpy.

As you’re probably aware, Numpy is an add-on package for the Python programming language.

We mostly use Numpy for data manipulation and scientific computing, but we use Numpy on specific types of data in specific data structures.

In particular, Numpy creates and operates on Numpy arrays.

A Numpy array is a data structure that stores numerical data in a row and column structure.

So, for example, a 2-dimensional Numpy array arranges its numbers in a grid of rows and columns.

Numpy arrays can come in a variety of shapes and sizes. For example, we can build 1-dimensional arrays, 2-dimensional arrays, and n-dimensional arrays.

Additionally, we can create Numpy arrays where the numbers have a variety of different properties. For example, we can create arrays that contain normally distributed numbers, numbers drawn from a uniform distribution, numbers that are all the same value, just to name a few.

So Numpy has a variety of functions for creating Numpy arrays with different types of properties.

In addition to having functions for creating Numpy arrays, the Numpy package also has functions for operating on and computing with Numpy arrays.

So Numpy has many functions for performing mathematical computations, like computing the sum of an array, computing exponentials of array values, and many more.

Additionally, Numpy has functions for doing more advanced operations, like operations from linear algebra.

So what does the Numpy dot function do?

The simple explanation is that `np.dot` computes dot products.

To paraphrase the entry on Wikipedia, the dot product is an operation that takes two equal-length sequences of numbers and returns a single number.

Having said that, the Numpy dot function works a little differently depending on the exact inputs.

There are three broad cases that we’ll consider with `np.dot`:

- both inputs are 1D arrays
- both inputs are 2D arrays
- one input is a scalar and one input is an array

Let’s take a look at how Numpy dot operates for these different cases.

Let’s say we have two Numpy arrays, `a` and `b`, and each array has 3 values.

Given two 1-dimensional arrays, `np.dot` will compute the dot product.

The dot product is computed by multiplying the two arrays together element by element, and then summing the results.

Notice what’s going on here. These arrays have the same length, and each array has 3 values.

When we compute the dot product, we multiply the first value of `a` by the first value of `b`, the second value of `a` by the second value of `b`, and the third value of `a` by the third value of `b`. Then, we take the resulting values, and sum them up.

The output is a single scalar value.

In mathematical terms, we can generalize the example above. If we have two vectors $a$ and $b$, and each vector has $n$ elements, then the dot product is given by the equation:

$a \cdot b = \sum_{i=1}^{n} a_i b_i$   (1)

Essentially, when we take the dot product of two Numpy arrays, we’re computing the *sum* of the *pairwise products* of the two arrays.
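To make the “sum of the pairwise products” idea concrete, here’s a minimal sketch that computes the dot product by hand and checks it against `np.dot` (the array values are illustrative choices of mine, not values from the post):

```python
import numpy as np

# Two illustrative 1D arrays (example values)
a = np.array([2, 3, 4])
b = np.array([10, 20, 30])

# The dot product is the sum of the pairwise products
manual_dot = sum(a_i * b_i for a_i, b_i in zip(a, b))

print(manual_dot)    # 2*10 + 3*20 + 4*30 = 200
print(np.dot(a, b))  # also 200
```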

The second case is when one input is a scalar value `r`, and one input is a Numpy array, which here we’ll call `b`.

If we use Numpy dot on these inputs with the code `np.dot(r,b)`, Numpy will perform *scalar multiplication* on the array.

So when we use Numpy dot with one scalar and one Numpy array, it multiplies every value of the array by the scalar and outputs a new Numpy array.
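Here’s a quick sketch of that behavior (the scalar and array values are my own illustrative choices):

```python
import numpy as np

r = 2                      # a scalar
b = np.array([5, 10, 15])  # a 1D Numpy array

# With one scalar input, np.dot performs scalar multiplication:
# every value of the array is multiplied by the scalar
result = np.dot(r, b)
print(result)  # [10 20 30]
```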

The final case that we’ll cover is when both of the input arrays are 2-dimensional arrays.

In this case, with two 2D arrays, the `np.dot` function will perform matrix multiplication.

A full explanation of matrix multiplication is beyond the scope of this tutorial, but let’s look at a quick example.

Let’s say that you have two 2D arrays, A and B.

Next, let’s multiply those arrays together using matrix multiplication.

During matrix multiplication, we multiply the values of the rows of A with the values of the columns of B, and sum them up.

The result is a new 2D array, C. Notice that this output array has the same number of rows as A and the same number of columns as B.

Additionally, each value in the output array is calculated by summing the products of the *i*th row of A and the *j*th column of B.

More generally, to compute the output array $C$, each value $C_{ij}$ of the output array is defined as:

$C_{ij} = \sum_{k} A_{ik} B_{kj}$   (2)
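To see this row-by-column definition in action, here’s a sketch that computes a matrix product with explicit loops and compares the result to `np.dot` (the two small arrays are illustrative values of my own):

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])   # shape (2, 3)
B = np.array([[7, 8],
              [9, 10],
              [11, 12]])    # shape (3, 2)

# Each output value C[i, j] is the sum over k of A[i, k] * B[k, j]
C = np.zeros((A.shape[0], B.shape[1]))
for i in range(A.shape[0]):
    for j in range(B.shape[1]):
        for k in range(A.shape[1]):
            C[i, j] += A[i, k] * B[k, j]

print(C)
print(np.dot(A, B))  # same values
```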

If you’re not familiar with linear algebra generally, and matrix algebra specifically, I realize that these equations can be a little confusing. Maybe even a little intimidating.

That said, if you want to learn more about these operations, I recommend that you read the book Linear Algebra and its Applications, by David Lay and colleagues. It’s a very approachable book about linear algebra that will help you understand some of the operations we’re performing with Numpy dot.

Ok. So now that we’ve looked at what Numpy dot does, let’s take a closer look at the syntax.

Here, I’ll explain the syntax of the Numpy dot function.

One thing before we look at the syntax.

In order to use Numpy functions, you need to import Numpy first. You can do that with the following code:

import numpy as np

This is important, because how you import Numpy will affect the syntax.

It’s the common convention among Python data scientists to import Numpy with the alias ‘`np`‘, and we’ll be sticking with that convention here.

The syntax of `np.dot` is really very simple.

Assuming that you’ve imported Numpy with the alias `np`, you call the function as `np.dot()`.

Then, inside the parenthesis, there are a few parameters that allow you to provide inputs to the function.

Let’s take a look at those inputs.

There are 2 core parameters for the `np.dot()` function:

- `a`
- `b`

Let’s quickly take a look at those, one by one.

`a` (required)

The `a` parameter allows you to specify the first input value or array to the function.

Technically, the argument to this parameter can be a scalar value, or any “array-like” object.

Because it allows array-like objects, this can be a proper Numpy array, or it can be a Python list, a tuple, etc. A scalar value like an `int` or `float` will also work.

Keep in mind that you must provide an argument to this parameter.

`b` (required)

The `b` parameter allows you to specify the second input value or array to the function.

Similar to the `a` parameter, the argument to `b` can be a scalar value, or any “array-like” object. So acceptable arguments to this parameter include Numpy arrays, Python lists, and tuples. Scalar values like an `int` or `float` are acceptable as well.

Keep in mind that you must provide an argument to this parameter.
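Since both `a` and `b` accept “array-like” objects, a Numpy array, a Python list, and a tuple all work as arguments and produce the same result. A quick sketch:

```python
import numpy as np

# The same dot product, with the inputs passed as arrays, lists, and tuples
from_arrays = np.dot(np.array([1, 2]), np.array([3, 4]))
from_lists  = np.dot([1, 2], [3, 4])
from_tuples = np.dot((1, 2), (3, 4))

print(from_arrays, from_lists, from_tuples)  # 11 11 11
```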

`out` (optional)

Note that `np.dot()` also has an `out` parameter. This is somewhat rarely used, so we’re not going to cover it here. For more information about this parameter, review the official documentation.

The output of `np.dot()` depends on the inputs.

There are a few cases:

- If both inputs are scalars, `np.dot()` will multiply the scalars together and output a scalar.
- If one input is a scalar and one is an array, `np.dot()` will multiply every value of the array by the scalar (i.e., scalar multiplication).
- If both inputs are 1-dimensional arrays, `np.dot()` will compute the dot product of the inputs.
- If both inputs are 2-dimensional arrays, then `np.dot()` will perform matrix multiplication.

So as you can see, the output really depends on how you use the function.

With that in mind, let’s take a look at some examples so you can see how it works, and see the different types of outputs that `np.dot()` produces given certain types of inputs.

Ok. Let’s work through some step-by-step examples.

If you need something specific, you can click on any of the following links, and it will take you to the appropriate example.

**Examples:**

- Multiply two numbers
- Multiply a Number and an Array
- Compute the Dot Product of Two 1D Arrays
- Perform Matrix Multiplication on Two 2D Arrays

Before you run any of the examples, you’ll need to import Numpy first.

You can do that with the following code:

import numpy as np

Once you’ve done that, you should be ready to go.

Ok. Let’s start with a simple example.

Here, we’re going to use two numbers (i.e., scalar values) as the inputs to `np.dot()`.

np.dot(2,3)

OUT:

6

This is maybe a little unexpected, but very simple.

Here, we called `np.dot()` with two scalar values, 2 and 3.

When we call `np.dot()` with two scalars, it simply multiplies them together.

Obviously, `2 * 3 = 6`.

Next, let’s provide an array and scalar as inputs.

Here, we’ll actually provide a scalar (an integer) and a Python list. Instead of a Python list, we could also provide a Numpy array, but I’ve used a Python list instead because it makes the operation a little easier to understand here.

Let’s take a look.

np.dot(2,[5,6])

OUT:

array([10, 12])

Again, this is very simple.

The first argument to the function is the scalar value 2.

The second argument to the function is the Python list `[5,6]`.

When we provide a scalar as one input and a list (or Numpy array) as the other input, `np.dot()` simply multiplies the values of the array by the scalar.

Mathematically, we’d consider this to be scalar multiplication of a vector or matrix.

Next, let’s input two 1-dimensional lists.

Here, we’ll use two Python lists, but we could also use 1D Numpy arrays. I’m using Python lists because it makes the operation a little easier to understand at a glance.

Let’s take a look.

np.dot([3,4,5],[7,8,9])

OUT:

98

What’s going on here?

Here, the `np.dot()` function is computing the dot product of the two inputs.

These inputs are 1-dimensional Python lists. And as I said earlier, we could also use 1D Numpy arrays.

Mathematically, 1D lists and 1D Numpy arrays are like *vectors*.

When we’re working with vectors and take the dot product, the dot product is computed by equation 1 that we saw earlier.

So when Numpy dot has two 1D lists or arrays as inputs, it takes the product of the pairwise elements and then adds them together: `3*7 + 4*8 + 5*9 = 21 + 32 + 45 = 98`.

And the output is a scalar value. In this case, the result is 98.

Finally, let’s look at what happens when we use Numpy dot on two 2-dimensional arrays.

First, let’s just create two 2-dimensional Numpy arrays.

To do this, we’ll use the `np.arange()` function to create a sequence of numbers, and then use the Numpy reshape method to reshape the numbers into a 2D shape.

A_array_2d = np.arange(start = 3, stop = 9).reshape((2,3))
B_array_2d = np.arange(start = 10, stop = 16).reshape((3,2))

And let’s print them out, just so you can see the contents.

print(A_array_2d)

OUT:

[[3 4 5]
 [6 7 8]]

print(B_array_2d)

OUT:

[[10 11]
 [12 13]
 [14 15]]

Notice that both of these Numpy arrays are 2-dimensional. Having said that, the number of columns in `A_array_2d` is the same as the number of rows in `B_array_2d`, which is exactly what matrix multiplication requires.

Ok. Now let’s use the Numpy dot function on these two arrays.

np.dot(A_array_2d, B_array_2d)

OUT:

array([[148, 160],
       [256, 277]])

So what happened here?

In this case, we used two 2-dimensional Numpy arrays as the inputs.

When we use 2D arrays as inputs, `np.dot()` computes the *matrix product* of the arrays.

When it does this, `np.dot()` calculates the values of the output array according to equation 2 that we saw earlier.

So under the hood, this is what Numpy is doing when we run the code `np.dot(A_array_2d, B_array_2d)`: it multiplies the values of the rows of A by the values of the columns of B, sums them up, and puts them into the final output array, with the following values:

148 = 3*10 + 4*12 + 5*14
160 = 3*11 + 4*13 + 5*15
256 = 6*10 + 7*12 + 8*14
277 = 6*11 + 7*13 + 8*15

This operation is known as the matrix product.

So again: when we use Numpy dot on two 2-dimensional arrays, `np.dot()` computes the matrix product.

I understand that this might be slightly confusing if you don’t have a lot of experience with linear algebra. If that’s the case, I recommend that you do some reading about linear algebra generally, and matrix multiplication in particular.

Now that we’ve looked at some examples, let’s look at a few common questions about the `np.dot()` function.

**Frequently asked questions:**

- What’s the difference between `np.dot()` and `np.matmul()`?
- What’s the difference between `np.dot()` and `ndarray.dot()`?

**What’s the difference between `np.dot()` and `np.matmul()`?**

Numpy dot and Numpy matmul are similar, but they behave differently for some types of inputs.

The two big differences are for:

- multiplication by scalars
- multiplication of high-dimensional Numpy arrays

Let’s look at these one at a time.

The first difference between `np.dot()` and `np.matmul()` is that `np.dot()` allows you to multiply by scalar values, but `np.matmul()` does not.

As we saw in example 2, when we use `np.dot()` with one scalar (e.g., an integer) and an array/list, Numpy dot will simply multiply every value of the array by the scalar value.

np.dot(2,[5,6])

OUT:

array([10, 12])

However, if you try to do this with `np.matmul()`, you’ll get an error:

np.matmul(2,[5,6])

OUT:

ValueError: matmul: Input operand 0 does not have enough dimensions (has 0, gufunc core with signature (n?,k),(k,m?)->(n?,m?) requires 1)

The second area where `np.dot()` and `np.matmul()` are different is when they operate on high-dimensional arrays.

According to the documentation …

When you use `np.dot`:

If a is an N-D array and b is a 1-D array, it is a sum product over the last axis of a and b.

If a is an N-D array and b is an M-D array (where M>=2), it is a sum product over the last axis of a and the second-to-last axis of b.

But when you use `np.matmul`:

If either argument is N-D, N > 2, it is treated as a stack of matrices residing in the last two indexes and broadcast accordingly.
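These two rules produce different output shapes for higher-dimensional inputs. Here’s a small sketch using arrays of ones (the shapes are my own illustrative choices):

```python
import numpy as np

a = np.ones((2, 2, 3))
b = np.ones((2, 3, 4))

# np.dot: sum product over the last axis of a and the second-to-last
# axis of b; the remaining axes of both inputs are kept
print(np.dot(a, b).shape)     # (2, 2, 2, 4)

# np.matmul: treats the inputs as stacks of 2x3 and 3x4 matrices
# and multiplies them pairwise
print(np.matmul(a, b).shape)  # (2, 2, 4)
```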

Ultimately, `np.dot()` and `np.matmul()` behave differently for scalar multiplication, and for multiplication of higher-dimensional inputs.

**What’s the difference between `np.dot()` and `ndarray.dot()`?**

`np.dot()` and `ndarray.dot()` are very similar, and effectively perform the same operations.

The difference is that `np.dot()` is a Python *function* and `ndarray.dot()` is a *Numpy array method*.

So they effectively do the same thing, but you call them in a slightly different way.

Let’s say we have two 2-dimensional arrays.

A_array_2d = np.arange(start = 3, stop = 9).reshape((2,3))
B_array_2d = np.arange(start = 10, stop = 16).reshape((3,2))

We can call the `np.dot()` function as follows:

np.dot(A_array_2d, B_array_2d)

But we use so-called “dot syntax” to call the `.dot()` method:

A_array_2d.dot(B_array_2d)

The output is the same, but the syntax is slightly different.
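We can confirm that the two calling styles return identical arrays:

```python
import numpy as np

A_array_2d = np.arange(start = 3, stop = 9).reshape((2,3))
B_array_2d = np.arange(start = 10, stop = 16).reshape((3,2))

# Function-style call vs method-style call
function_result = np.dot(A_array_2d, B_array_2d)
method_result = A_array_2d.dot(B_array_2d)

print(np.array_equal(function_result, method_result))  # True
```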

If you’re still confused about this, make sure to read more about the difference between Python *functions* and Python *methods*.

Do you still have questions about the Numpy dot function?

If so, leave your questions in the comments section below.

In this tutorial, I’ve explained how to use the `np.dot()` function to compute dot products of 1D arrays and perform matrix multiplication of 2D arrays.

This should help you understand `np.dot`, but if you really want to learn Numpy, there’s a lot more to learn.

If you’re serious about mastering Numpy, and serious about data science in Python, you should consider joining our premium course called *Numpy Mastery*.

Numpy Mastery will teach you everything you need to know about Numpy, including:

- How to create Numpy arrays
- What the “Numpy random seed” function does
- How to reshape, split, and combine your Numpy arrays
- How to use the Numpy random functions
- How to perform mathematical operations on Numpy arrays
- and more …

Moreover, this course will show you a practice system that will help you *master* the syntax within a few weeks. This practice system will enable you to *memorize* all of the Numpy syntax you learn. If you have trouble remembering Numpy syntax, this is the course you’ve been looking for.

Find out more here:

Learn More About Numpy Mastery

The post Numpy Dot, Explained appeared first on Sharp Sight.
