The post A quick introduction to the NumPy array appeared first on Sharp Sight.

That being the case, if you want to learn data science in Python, you’ll need to learn how to work with NumPy arrays.

In this blog post, I’ll explain the essentials of NumPy arrays, including:

- What NumPy arrays are
- How to create a NumPy array
- Attributes of NumPy arrays
- Retrieving individual values from NumPy arrays
- Slicing NumPy arrays
- Creating and working with 2-dimensional NumPy arrays

Let’s get started.

A NumPy array is a collection of elements that have the same data type.

You can think of it like a container that has several compartments that hold data, as long as the data is of the same data type.

Visually, we can represent a simple NumPy array sort of like this:

Let’s break this down.

We have a set of integers: 88, 19, 46, 74, 94. These values are all integers; they are all of the same type. You can see that these values are stored in “compartments” of a larger structure. The overall structure is the NumPy array.

Very quickly, I’ll explain a little more about some of the properties of a NumPy array.

As I mentioned above, NumPy arrays must contain data all of the same type.

That means that if your NumPy array contains integers, *all* of the values must be integers. If it contains floating point numbers, *all* of the values must be floats.

I won’t write extensively about data types and NumPy data types here. There is a section below in this blog post about how to create a NumPy array of a particular type. Having said that, if you want to learn a lot more about the various data types that are available in NumPy, then (as the saying goes) read the f*cking manual.

Each of the compartments inside of a NumPy array has an “address.” We call that address an “index.”

If you’re familiar with computing in general, and Python specifically, you’re probably familiar with indexes. Many data structures in Python have indexes, and the indexes of a NumPy array essentially work the same.

If you’re not familiar with indexes though, let me explain. Again, an index is sort of like an address. These indexes enable you to reference a specific value. We call this indexing.

Just like other Python structures that have indexes, the indexes of a NumPy array begin at zero:

So if you want to reference the value in the very first location, you need to reference location “0”. In the example shown here, the value at index 0 is `88`.

I’ll explain how exactly to use these indexes syntactically, but to do that, I want to give you working examples. To give you working examples, I’ll need to explain how to actually create NumPy arrays in Python.

There are a lot of ways to create a NumPy array. Really. A lot. Off the top of my head, I can think of at least a half dozen techniques and functions that will create a NumPy array. In fact, the purpose of many of the functions in the NumPy package is to create a NumPy array of one kind or another.

But, this blog post is intended to be a *quick* introduction to NumPy arrays. That being the case, I don’t want to show you every possible way to make a NumPy array. Let’s keep this simple. I’ll show you a few very basic ways to do it.

In particular, I’ll show you how to use the NumPy `array()` function.

To use the NumPy `array()` function, you call the function and pass in a Python list as the argument.

Let’s take a look at some examples. We’ll start by creating a 1-dimensional NumPy array.

Creating a 1-dimensional NumPy array is easy.

You call the function with the syntax `np.array()`. Keep in mind that before you call `np.array()`, you need to import the NumPy package with the code `import numpy as np`.

When you call the `array()` function, you’ll need to provide a list of elements as the argument to the function.

```python
# import NumPy
import numpy as np

# create a NumPy array from a list of 3 integers
np.array([1,2,3])
```

This isn’t complicated, but let’s break it down.

We’ve called the `np.array()` function. The argument to the function is a list of three integers: `[1,2,3]`. It produces a NumPy array of those three integers.

Note that you can also create NumPy arrays with other data types, besides integers. I’ll explain how to do that a little later in this blog post.

You can also create 2-dimensional arrays.

To do this using the `np.array()` function, you need to pass in a list of lists.

```python
# 2-d array
np.array([[1,2,3],[4,5,6]])
```

Pay attention to what we’re doing here, syntactically.

Inside of the call to `np.array()`, there is a list of two lists: `[[1,2,3],[4,5,6]]`. The first list is `[1,2,3]` and the second list is `[4,5,6]`. Those two lists are contained *inside* of a larger list; a list of lists. Then that list of lists is passed to the array function, which creates a 2-dimensional NumPy array.

This might be a little confusing if you’re just getting started with Python and NumPy. In that case, I highly recommend that you review Python lists.

There are also other ways to create a 2-d NumPy array. For example, you can use the `array()` function to create a 1-dimensional NumPy array, and then use the `reshape()` method to reshape the 1-dimensional NumPy array into a 2-dimensional NumPy array.

```python
# 2-d array
np.array([1,2,3,4,5,6]).reshape([2,3])
```

For right now, I don’t want to get too “in the weeds” explaining `reshape()`, so I’ll leave this as it is. I just want you to understand that there are a few ways to create 2-dimensional NumPy arrays.

I’ll write more about how to create and work with 2-dimensional NumPy arrays in a future blog post.

It’s also possible to create 3-dimensional NumPy arrays and N-dimensional NumPy arrays. However, in the interest of simplicity, I’m not going to explain how to create those in this blog post. I’ll address N-dimensional NumPy arrays in a future blog post.
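That said, just so you can see the shape of such a thing, here’s a minimal sketch of a 3-dimensional array, built from a list of lists of lists (the values are arbitrary):

```python
import numpy as np

# A 3-d array: two 2x3 "layers" stacked together (arbitrary example values)
array_3d = np.array([[[1, 2, 3],
                      [4, 5, 6]],

                     [[7, 8, 9],
                      [10, 11, 12]]])

print(array_3d.ndim)   # prints 3
print(array_3d.shape)  # prints (2, 2, 3)
```

Notice that each extra level of list nesting adds one dimension to the resulting array.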

Using the NumPy `array()` function, we can also create NumPy arrays with specific data types. Remember that in a NumPy array, all of the elements must be of the same type.

To do this, we need to use the `dtype` parameter inside of the `array()` function.

Here are a couple of examples:

**integer**

To create a NumPy array with integers, we can use the code `dtype = 'int'`.

```python
np.array([1,2,3], dtype = 'int')
```

**float**

Similarly, to create a NumPy array with floating point numbers, we can use the code `dtype = 'float'`.

```python
np.array([1,2,3], dtype = 'float')
```

These are just a couple of examples. Keep in mind that NumPy supports almost 2 dozen data types … many more than what I’ve shown you here.

Having said that, a full explanation of Python data types and NumPy data types is beyond the scope of this post. Just understand that you can specify the data type using the `dtype` parameter.

For more information on data types in NumPy, consult the documentation about the NumPy types that are available.
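For a quick taste, here’s a sketch of a few of those other `dtype` strings in action (the values here are arbitrary):

```python
import numpy as np

# A few more dtype strings that NumPy accepts (arbitrary example values)
print(np.array([1, 2, 3], dtype = 'float32').dtype)   # prints float32
print(np.array([1, 0, 1], dtype = 'bool').dtype)      # prints bool
print(np.array([1, 2, 3], dtype = 'complex').dtype)   # prints complex128
```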

I want to point out one common mistake that many beginners make when they try to create a NumPy array with the `np.array()` function.

As I mentioned above, when you create a NumPy array with `np.array()`, you need to provide a *list of values*.

Many beginners forget to do this and simply provide the values directly to the `np.array()` function, without enclosing them inside of a list. If you attempt to do that, it will cause an error:

```python
np.array([1,2,3,4,5])  # This works!
np.array(1,2,3,4,5)    # This will cause an error
```

In the two examples above, pay close attention to the syntax. The top example works properly because the integers are contained inside of a Python list. The second example causes an error because the integers are passed directly to `np.array()`, without enclosing them in a list.

So pay attention! Make sure that when you use `np.array()`, you’re passing the values as a list.

Again, if you’re confused about this or don’t understand Python lists, I strongly recommend that you go back and review lists and other basic “built-in types” in Python.

NumPy arrays have a set of attributes that you can access. These attributes include things like the array’s size, shape, number of dimensions, and data type.

Here’s an abbreviated list of attributes of NumPy arrays:

| Attribute | What it records |
|---|---|
| `shape` | The dimensions of the NumPy array |
| `size` | The total number of elements in the NumPy array |
| `ndim` | The number of dimensions of the array |
| `dtype` | The data type of the elements in the array |
| `itemsize` | The length of a single array element in bytes |

I want to show you a few of these. To illustrate them, let’s make a NumPy array and then investigate a few of its attributes.

Here, we’ll once again create a simple NumPy array using `np.random.randint()`.

```python
np.random.seed(72)
simple_array = np.random.randint(low = 0, high = 100, size=5)
```

`simple_array` is a NumPy array, and like all NumPy arrays, it has attributes.

You can access those attributes by using a dot after the name of the array, followed by the attribute you want to retrieve.

Here are some examples:

`ndim` is the number of dimensions.

```python
simple_array.ndim
```

Which produces the output:

1

What this means is that `simple_array` is a 1-dimensional array.

The `shape` attribute tells us the number of elements along each dimension.

```python
simple_array.shape
```

With the output:

(5,)

What this is telling us is that `simple_array` has 5 elements along the first axis. (And that’s the only information provided, because `simple_array` is 1-dimensional.)

The `size` attribute tells you the total number of elements in a NumPy array.

```python
simple_array.size
```

With the output:

5

This is telling us that `simple_array` has 5 total elements.

`dtype` tells you the type of data stored in the NumPy array.

Let’s take a look. We can access the `dtype` attribute like this:

```python
simple_array.dtype
```

Which produces the output:

dtype('int64')

This is telling us that `simple_array` contains *integers*.

Also remember: NumPy arrays contain data that are all of the same type.

Although we constructed `simple_array` to contain integers, we could have created an array with floats or other numeric data types.

For example, we can create a NumPy array with decimal values (i.e., floats):

```python
array_float = np.array([1.99,2.99,3.99])
array_float.dtype
```

Which gives the output:

dtype('float64')

When we construct the array with the above input values, you can see that `array_float` contains data of the `float64` datatype (i.e., numbers with decimals).

Now that I’ve explained attributes, let’s examine how to index NumPy arrays.

Indexing is very important for accessing and retrieving the elements of a NumPy array.

Recall what I wrote at the beginning of the blog post:

A NumPy array is like a container with many compartments. Each of the compartments inside of a NumPy array has an “address.” We call that address an “index.”

Notice again that the index of the first value is 0.

We can use the index to retrieve specific values in the NumPy array. Let’s take a look at how to do that.

First, let’s create a NumPy array using the function `np.random.randint()`.

```python
np.random.seed(72)
simple_array = np.random.randint(low = 0, high = 100, size=5)
```

You can print out the array with the following code:

```python
print(simple_array)
```

And you can see that the array has 5 integers.

[88 19 46 74 94]

For the sake of clarity though, here’s a visual representation of `simple_array`. Looking at this will help you understand array indexing:

In this visual representation, you can see the values stored in the array: `88, 19, 46, 74, 94`. But I’ve also shown you the index values associated with each of those elements.

These indexes enable us to retrieve values in specific locations.

Let’s take a look at how to do that.

The simplest form of indexing is retrieving a single value from the array.

To retrieve a single value from a particular location in the NumPy array, you need to provide the “index” of that location.

Syntactically, you need to use bracket notation and provide the index inside of the brackets.

Let me show you an example. Above, we created the NumPy array `simple_array`.

To get the value at index 1 from `simple_array`, you can use the following syntax:

```python
# Retrieve the value at index 1
simple_array[1]
```

Which returns the value `19`.

Visually though, we can represent this indexing action like this:

Essentially, we’re using a particular index (i.e., the “address” of a particular location in the array) to retrieve the value stored at that location.

So the code `simple_array[1]` is basically saying, “give me the value that’s at index location 1.” The result is `19` … `19` is the value at that index.

NumPy also supports negative index values. Using a negative index allows you to retrieve or reference locations starting from the *end* of the array.

Here’s an example:

```python
simple_array[-1]
```

This retrieves the value at the very end of the array.

We could also retrieve this value by using the index `4` (both will work). But sometimes you won’t know exactly how long the array is. This is a convenient way to reference items at the end of a NumPy array.
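To make the behavior concrete, here’s a sketch using the same values as the running example, written out explicitly with `np.array()` so the results are easy to verify:

```python
import numpy as np

simple_array = np.array([88, 19, 46, 74, 94])

print(simple_array[-1])   # prints 94 (the last element)
print(simple_array[-2])   # prints 74 (the second-to-last element)
```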

I just showed you simple examples of array indexing, but array indexing can be quite complex.

It’s actually possible to retrieve *multiple* elements from a NumPy array.

To do this, we still use bracket notation, but we can use a colon to specify a range of values. Here’s an example:

```python
simple_array[2:4]
```

This code is saying, “retrieve the values stored from index 2, up to but *excluding* index 4.”

Visually, we can represent this as follows:
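Here’s that slice as runnable code, again with the example values written out explicitly. Note that you can also omit an endpoint to slice all the way to the edge of the array:

```python
import numpy as np

simple_array = np.array([88, 19, 46, 74, 94])

# index 2, up to but excluding index 4
print(simple_array[2:4])   # prints [46 74]

# omitting an endpoint slices to the edge of the array
print(simple_array[:2])    # prints [88 19]
print(simple_array[3:])    # prints [74 94]
```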

Now that you’ve learned how to use indexes in 1-dimensional NumPy arrays, let’s review how to use indexes in 2-dimensional NumPy arrays.

Working with 2-d NumPy arrays is very similar to working with 1-d arrays. The major difference (with regard to indexes) is that 2-d arrays have 2 indexes, a row index and a column index.

To retrieve a value from a 2-d array, you need to provide the specific row and column indexes.

Here’s an example. We’ll create a 2-d NumPy array, and then we’ll retrieve a value.

```python
np.random.seed(72)
square_array = np.random.randint(low = 0, high = 100, size = 25).reshape([5,5])
square_array[2,1]
```

Here, we’re essentially retrieving the value at row index 2 and column index 1. The value at that position is `45`.

This is fairly straightforward. The major challenge is that you need to remember that the row index is first and the column index is second.
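If it helps, here’s a sketch with a deterministic array (built with `np.arange()` rather than random numbers), where you can verify the row-first, column-second rule by eye:

```python
import numpy as np

# a 5x5 array with the values 1 through 25, filled row by row
grid = np.arange(1, 26).reshape([5, 5])

# row index 2 is [11, 12, 13, 14, 15]; column index 1 within that row is 12
print(grid[2, 1])   # prints 12
print(grid[0, 4])   # prints 5
```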

Finally, let’s review how to retrieve slices from 2-d NumPy arrays. Slicing 2-d arrays is very similar to slicing 1-d arrays. The major difference is that you need to provide 2 ranges, one for the rows and one for the columns.

```python
np.random.seed(72)
square_array = np.random.randint(low = 0, high = 100, size = 25).reshape([5,5])
square_array[1:3,1:4]
```

Let’s break this down.

We’ve again created a 5×5 square NumPy array called `square_array`.

Then, we took a slice of that array. The slice included the rows from index 1 up to (but excluding) index 3. It also included the columns from index 1 up to (but excluding) index 4.

This might seem a little confusing if you’re a true beginner. In that case, I recommend working with 1-d arrays first, until you get the hang of them. Then, start working with relatively small 2-d NumPy arrays until you build your intuition about how indexing works with 2-d arrays.
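To build that intuition, here’s the same kind of 2-d slice on a deterministic array (values 1 through 25), so you can check the result by hand:

```python
import numpy as np

# a 5x5 array with the values 1 through 25, filled row by row
grid = np.arange(1, 26).reshape([5, 5])

# rows 1 up to (but excluding) 3, columns 1 up to (but excluding) 4
print(grid[1:3, 1:4])
# prints:
# [[ 7  8  9]
#  [12 13 14]]
```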

If you’re a beginner or you don’t have a lot of experience with NumPy arrays, this might seem a little overwhelming. It’s not that complicated, but there’s a lot here and it will take a while to learn and master.

That said, I want to know if you’re still confused about something.

What questions do you still have about NumPy arrays?

Leave your questions and challenges in the comments below …


The post How to do linear regression in R appeared first on Sharp Sight.

Although machine learning and artificial intelligence have developed much more sophisticated techniques, linear regression is still a tried-and-true staple of data science.

In this blog post, I’ll show you how to do linear regression in R.

Before I actually show you the nuts and bolts of linear regression in R though, let’s quickly review the basic concepts of linear regression.

Linear regression is fairly straightforward.

Let’s start with the simplest case of simple linear regression. We have two variables in a dataset, X and Y.

We want to predict Y. Y is the “target” variable.

We make the assumption that we can predict Y by using X. Specifically, we assume that there is a linear relationship between Y and X as follows:

Y = β₀ + β₁X

If you haven’t seen this before, don’t let the symbols intimidate you. If you’ve taken high school algebra, you probably remember the equation for a line, y = mx + b, where m is the slope and b is the intercept.

The equation for linear regression is essentially the same, except the symbols are a little different.

Basically, this is just the equation for a line. β₀ is the intercept and β₁ is the slope.

To clarify this a little more, let’s look at simple linear regression visually.

Essentially, when we use linear regression, we’re making predictions by drawing straight lines through a dataset.

To do this, we use an existing dataset as “training examples.” When we draw a line through those datapoints, we’re “training” a linear regression model. By the way, this input dataset is typically called a training dataset in machine learning and model building.

When we draw such a line through the training dataset, we’ll essentially have a little model of the form Y = β₀ + β₁X. Remember: a line that we draw through the data will have an equation associated with it.

This equation is effectively a model; we can use that linear model to make predictions.

For example, let’s say that after building the model (i.e., drawing a line through the training data), we have a *new* input value, x. To make a prediction with our simple linear regression model, we just need to use that datapoint as an input to our linear equation. If you know the x value, you can compute the predicted output value, y, by using the formula y = β₀ + β₁x.

We can visualize that as follows:

This should give you a good conceptual foundation of how linear regression works.

You:

- Obtain a training dataset
- Draw the “best fit” line through the training data
- Use the equation for the line as a “model” to make predictions

I’m simplifying a little, but that’s essentially it.

The critical step though is drawing the “best” line through your training data.

So the hard part in all of this is drawing the “best” straight line through the original training dataset. A little more specifically, this all comes down to computing the “best” coefficient values: β₀ and β₁ … the intercept and the slope.
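In case you’re curious, the standard ordinary-least-squares solution for simple linear regression has a simple closed form (stated here without derivation):

```latex
\hat{\beta}_1 = \frac{\sum_{i}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i}(x_i - \bar{x})^2},
\qquad
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}
```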

Mathematically it’s not terribly complicated, but I’m not going to explain the nuts and bolts of how it’s done.

Instead, I’ll show you now how to use R to perform linear regression. By using R (or another modern data science programming language), we can let software do the heavy lifting; we can use software tools to compute the best fit line.

With that in mind, let’s talk about the syntax for how to do linear regression in R.

There are several ways to do linear regression in R.

In this post, though, I’m going to show you how to do linear regression with base R.

I actually think that performing linear regression with R’s caret package is better, but using the `lm()` function from base R is still very common. Because the base R methodology is so common, I’m going to focus on the base R method in this post.

Performing a linear regression with base R is fairly straightforward. You need an input dataset (a dataframe). That input dataset needs to have a “target” variable and at least one predictor variable.

Then, you can use the `lm()` function to build a model. `lm()` will compute the best fit values for the intercept and slope, β₀ and β₁. It will effectively find the “best fit” line through the data … all you need to know is the right syntax.

The syntax for doing a linear regression in R using the `lm()` function is very straightforward.

First, let’s talk about the dataset. When we use the `lm()` function, we indicate the training dataframe using the `data =` parameter.

We also need to provide a “formula” that specifies the target we are trying to predict as well as the input(s) we will use to predict that target:

Notice the syntax: `target ~ predictor`. This syntax is basically telling the `lm()` function what our “target” variable is (the variable we want to predict) and what our “predictor” variable is (the x variable that we’re using as an input for the prediction). In other words, we will predict the “target” as a function of the “predictor” variable.

Keep in mind that this aligns with the equation that we talked about earlier. We’re trying to predict a “target” (which is typically denoted as Y) on the basis of a predictor, X.

When you use `lm()`, it’s going to take your training dataset (the input dataframe) and it will find the best fit line.

More specifically, the `lm()` function will compute the slope and intercept values, β₁ and β₀, that will fit the training dataset best. You give it the predictors and the targets, and `lm()` will find the remaining parts of the prediction equation: β₀ and β₁.

This might still seem a little abstract, so let’s take a look at a concrete example.

First, we’ll just create a simple dataset.

This is pretty straightforward … we’re just creating random numbers for x. The y value is designed to be equal to x, plus some random, normally distributed noise.

Keep in mind, this is a bit of a toy example. On the other hand, it’s good to use toy examples when you’re still trying to master syntax and foundational concepts. (When you practice, you should simplify as much as possible.)

```r
#--------------
# LOAD PACKAGES
#--------------
library(tidyverse)

#------------------------
# CREATE TRAINING DATASET
#------------------------
set.seed(52)
df <- tibble(x = runif(n = 70, min = 0, max = 100)
             ,y = x + rnorm(70, mean = 0, sd = 25)
             )

# INSPECT
df %>% glimpse()
```

And let’s make a quick scatterplot of the data:

```r
#----------------------------
# VISUALIZE THE TRAINING DATA
#----------------------------
ggplot(data = df, aes(x = x, y = y)) +
  geom_point()
```

It’s pretty clear that there’s a linear relationship between x and y. Now, let’s use `lm()` to identify that relationship.

Here, we’ll create a model using the `lm()` function.

```r
#===================
# BUILD LINEAR MODEL
#===================
model_linear1 <- lm(y ~ x, data = df)
```

We can also get a printout of the characteristics of the model.

To get this, just use the `summary()` function on the model object:

```r
#=====================================
# RETRIEVE SUMMARY STATISTICS OF MODEL
#=====================================
summary(model_linear1)
```

Notice that this summary tells us a few things:

- The coefficients
- Information about the residuals (which we haven’t really discussed in this blog post)
- Some “fit” statistics like “residual standard error” and “R squared”

Now that we have the model, we can visualize it by overlaying it over the original training data.

To do this, we’ll extract the slope and intercept from the model object and then plot the line over the training data using `ggplot2`.

```r
#====================
# VISUALIZE THE MODEL
#====================
model_intercept <- coef(model_linear1)[1]
model_slope <- coef(model_linear1)[2]

#-----
# PLOT
#-----
ggplot(data = df, aes(x = x, y = y)) +
  geom_point() +
  geom_abline(intercept = model_intercept, slope = model_slope, color = 'red')
```

As you look at this, remember what we’re actually doing here.

We took a training dataset and used `lm()` to compute the best fit line through those training data points. Ultimately, this yields a slope and intercept that enable us to draw a line of the form Y = β₀ + β₁X. That line is a model that we can use to make predictions.

As I said at the beginning of the blog post, linear regression is still an important technique. There are many techniques that are sexier and more powerful for specific applications, but linear regression is still an excellent tool to solve many problems.

Moreover, many advanced machine learning techniques are extensions of linear regression.

“Many fancy statistical learning approaches can be seen as extensions or generalizations of linear regression.”

– An Introduction to Statistical Learning

Learning linear regression will also give you a basic foundation that you can build on if you want to move on to more advanced machine learning techniques. Many machine learning concepts have roots in linear regression.

Having said that, make sure you study and practice linear regression.

What questions do you still have about linear regression and linear regression in R?

Leave your questions and challenges in the comments below …


The post How to make a scatter plot in R appeared first on Sharp Sight.

The scatter plot is so common that almost everyone knows how to make one in one way or another.

You see them in business, academia, media, news. Students use them.

Scatter plots are also extremely common in data science and analytics.

The scatter plot is everywhere, partially due to its simplicity and partially because of its incredible usefulness for finding and communicating insights.

As simple as it might be, if you want to master data science, one of your first steps should be mastering the scatter plot. It’s a fundamental technique that you absolutely need to know backwards and forwards.

In this blog post, I’ll show you *how to make a scatter plot in R*.

There’s actually more than one way to make a scatter plot in R, so I’ll show you two:

- How to make a scatter plot with base R
- How to make a scatter plot with ggplot2

I definitely have a preference for the ggplot2 version, but the base R version is still common. Because you’re likely to see the base R version, I’ll show you that version as well (just in case you need it).

Let’s get started. First, I’ll show you how to make a scatter plot in R using base R.

Let’s talk about how to make a scatter plot with base R.

I have to admit: I don’t like the base R method. I think that many of the visualization tools from base R are awkward to use and hard to remember. I also think that the resulting visualizations are a little ugly.

Having said that, you’ll still see visualizations made with base R, so I want to show you how it’s done.

Let’s take a step-by-step look at how to make a scatter plot using base R:

Here, we’ll quickly create a sample dataset.

```r
library(tibble)  # for tibble(); also loaded as part of the tidyverse

set.seed(55)
df <- tibble(x_var = runif(100, min = 0, max = 25)
             ,y_var = log2(x_var) + rnorm(100)
             )
```

And let’s print out the dataframe so we can take a look:

```r
print(df)
```

As you can see, the dataframe `df` contains two numeric variables, `x_var` and `y_var`.

Next, we’ll plot the scatter plot using the `plot()` function.

```r
plot(x = df$x_var, y = df$y_var)
```

Ok, let me explain how that code works.

We’re initiating plotting using the `plot()` function. Inside of the `plot()` function, the `x =` parameter and `y =` parameter allow us to specify which variables should be plotted on the x-axis and y-axis respectively.

Keep in mind though, the `plot()` function does not directly work with dataframes. Instead, the `plot()` function works with vectors.

The variables we want to plot are inside of the dataframe `df`. Because of this, we need to access those vectors; we need to “pull them out” of the dataframe and tell the `plot()` function where to get them. To do this, we need to use the `$` operator. The `$` operator enables us to extract specific columns from a dataframe. So notice the syntax: `df$x_var` is basically getting the `x_var` variable from `df`, and `df$y_var` is basically getting the `y_var` variable from `df`.

Essentially, we’re extracting our variables from the dataframe using the `$` operator, and then plotting them with the `plot()` function.

That’s basically it. You can do more with a scatter plot in base R, but as I said earlier, I really don’t like them. I strongly prefer to use ggplot2 to create almost all of my visualizations in R. That being the case, let me show you the ggplot2 version of a scatter plot.
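For what it’s worth, if you do need to dress up a base R scatter plot, the standard `plot()` arguments like `main`, `xlab`, `ylab`, `pch`, and `col` will get you a long way. Here’s a quick sketch (recreating the example data as plain vectors so it runs on its own):

```r
# recreate the example data as plain vectors
set.seed(55)
x_var <- runif(100, min = 0, max = 25)
y_var <- log2(x_var) + rnorm(100)

# add a title, axis labels, filled points, and a color
plot(x = x_var, y = y_var,
     main = "y_var vs. x_var",
     xlab = "x_var",
     ylab = "y_var",
     pch = 19,
     col = "darkblue")
```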

As I just mentioned, when using R, I strongly prefer making scatter plots with ggplot2. By default, a ggplot2 scatter plot is more refined. It just looks “better right out of the box.”

Having said that, ggplot2 can be a little intimidating for beginners, so let’s quickly review what ggplot2 is and how it works.

ggplot2 is an add-on package for the R programming language. The focus of ggplot2 is *data visualization*. It enables R users to create a wide range of data visualizations using a relatively compact syntax.

Although the syntax seems confusing to new users, it is extremely systematic. The systematic nature of ggplot2 syntax is one of its core advantages. Once you know how to use the syntax, creating simple visualizations like the scatter plot becomes easy. Moreover, more advanced visualizations become relatively easy as well.

The secret to using ggplot2 properly is understanding how the syntax works.

There are a few critical pieces you need to know:

- The `ggplot()` function
- The `data =` parameter
- The `aes()` function
- Geometric objects (AKA, “geoms”)

The `ggplot()` function is simply the function that we use to initiate a ggplot2 plot.

The `data` parameter tells ggplot2 the name of the dataframe that you want to visualize. When you use ggplot2, you need to use variables that are contained within a dataframe. The data parameter tells ggplot where to find those variables. Remember: ggplot2 operates on dataframes.

The `aes()` function tells `ggplot()` the “variable mappings.” This might sound complex, but it’s really straightforward once you understand. When we visualize data, we are essentially connecting variables in a dataframe to parts of the plot. For example, when we make a scatter plot, we “connect” one numeric variable to the x axis, and another numeric variable to the y axis. We “map” these variables to different axes within the visualization. The `aes()` function allows us to specify those mappings; it enables us to specify which variables in a dataframe should connect to which parts of the visualization. If this doesn’t make sense, just sit tight. I’ll show you an example in a minute.

Finally, a geometric object is the thing that we draw. When you create a bar chart, you are drawing “bar geoms.” When you create a line chart, you are drawing “line geoms.” And when you create a scatter plot, you are drawing “point geoms.” The geom is the thing that you draw. In ggplot2, we need to explicitly state the type of geom that we want to use (bars, lines, points, etc). When drawing a scatter plot, we’ll do this by using `geom_point()`.

Ok. Now that I’ve quickly reviewed how ggplot2 works, let’s take a look at an example of how to create a scatter plot in R with ggplot2.

First, you need to make sure that you’ve loaded the ggplot2 package. This also assumes that you’ve *installed* the ggplot2 package. (If you haven’t installed the ggplot2 package, do that before running this code.)

```r
library(ggplot2)
```

Next, we’ll need some data to plot.

We already created the dataframe, `df`, earlier in this post. But just in case you didn’t run that code yet, here it is again. (If you’ve already run it, you won’t need to run it a second time.)

```r
set.seed(55)
df <- tibble(x_var = runif(100, min = 0, max = 25)
             ,y_var = log2(x_var) + rnorm(100)
             )
df
```

This code creates a simple dataframe with two variables, `x_var` and `y_var`.

Now that we have our dataframe, `df`, we will plot it with ggplot2.

ggplot(data = df, aes(x = x_var, y = y_var)) + geom_point()

Ok, we have our scatter plot. It’s pretty straightforward, but let me explain it.

We’re initiating the ggplot2 plotting system by calling the `ggplot()` function.

Inside of the `ggplot()` function, we’re telling ggplot that we’ll be plotting data in the `df` dataframe. We do this with the syntax `data = df`.

Next, inside the `ggplot()` function, we’re calling the `aes()` function. Remember, the `aes()` function enables us to specify the “variable mappings.” Here, we’re telling ggplot2 to put our variable `x_var` on the x-axis, and put `y_var` on the y-axis. Syntactically, we’re doing that with the code `x = x_var`, which maps `x_var` to the x-axis, and `y = y_var`, which maps `y_var` to the y-axis.

Finally, on the second line, we’re using `geom_point()` to tell ggplot that we want to draw point geoms (i.e., points).

That’s it. That’s all there is to it. The syntax might look a little arcane to beginners, but once you understand how it works, it’s pretty easy.

Having said that, there are still a few enhancements we could make to improve the chart. Let’s talk about a few of those.

To change the color of the points in our ggplot scatterplot to a solid color, we need to use the `color` parameter.

ggplot(data = df, aes(x = x_var, y = y_var)) +
  geom_point(color = 'red')

Again, this is very straightforward. To do this, we just set `color = 'red'` inside of `geom_point()`. We do this inside of `geom_point()` because we’re changing the color of the points. (There are more complex examples where we have multiple geoms, and we need to be able to specify how to modify one geom layer at a time.)

To add a trend line, we can use the statistical operation `stat_smooth()`.

Keep in mind that the default trend line is a LOESS smooth line, which means that it will capture non-linear relationships.

But, you can also add a linear trend line.

To add a linear trend line, you can use `stat_smooth()` and specify the exact method for creating a trend line using the `method` parameter.

Specifically, you’ll use the code `method = 'lm'` as follows:

ggplot(data = df, aes(x = x_var, y = y_var)) +
  geom_point(color = 'red') +
  stat_smooth(method = 'lm')

This is essentially using the `lm()` function to build a linear model and fit a straight line to the data.

Finally, let’s add a quick title to the plot.

There are a few ways to add a title to a plot in ggplot2, but here we’ll just use the `labs()` function with the `title` parameter.

ggplot(data = df, aes(x = x_var, y = y_var)) +
  geom_point(color = 'red') +
  stat_smooth(method = 'lm') +
  labs(title = 'This is a scatter plot of x_var vs y_var')

Ok, I want to be clear: this is not a very good title. I’m only using this as an example (the whole chart is sort of a dummy example). Writing good chart titles is a bit of an art, and I’m not going to discuss it here.

I really just want you to understand that you can add a title to a ggplot scatterplot by using the `labs()` function with the `title` parameter.

There’s definitely more I could show you, but the examples above should get you started with making a scatter plot in R.

If you want to learn more about data visualization and data science in R, sign up for our email list.

When you sign up, you’ll receive weekly data science tutorials, delivered directly to your inbox.

You’ll also get immediate access to our FREE Data Science Crash Course.

If you want our free tutorials and our free Data Science Crash Course, sign up for our email list now.

The post How to make a scatter plot in R appeared first on Sharp Sight.

My general advice is “it depends.”

Or to clarify my response, I like to ask the question “who are you, and what are your goals?” The programming language you use depends on your background and your long term goals.

Having said that, there are typically only two major options that most new data science students should consider: R and Python.

The question is, which is better?

In this blog post, I’ll walk you through the pros and cons of R vs Python for data science. We’ll start with an analysis of the pros and cons of R, and then later, I’ll discuss the pros and cons of Python.

At the bottom of the post, I’ll summarize my recommendations.

First, let’s start with R.

If you’re interested in becoming a data scientist, R has some distinct advantages.

Let’s talk about the cases where R is the best choice (versus Python).

If you have limited programming experience, I would probably recommend learning R first.

This might seem counter-intuitive if you’ve read about the benefits of Python. Python is commonly lauded as a very easy-to-learn programming language. In particular, experts mention Python as a great programming language for people without any prior programming experience.

Fair enough. That’s probably true, *if* you want to become a software developer. For *programming and software development*, I think that Python is a great choice for your first programming language.

But data science and software development are not the same thing.

Python might be great for beginning *software developers*, but I think R is much better for beginning *data scientists*.

Let me explain why.

It comes down to a subtle difference in how data scientists use programming languages versus how software developers use them.

For beginning data scientists, “programs” should look like scripts, not software. This is a subtle difference, but it’s important.

Here’s what I mean. Let’s say you’re working with a dataframe. Here, I’ll show you a dataframe in R, the `Auto` dataframe:

library(ISLR)
data("Auto")

If you’re not familiar, a dataframe is data in a row-and-column format, sort of like an Excel spreadsheet.

In this particular dataframe, there is a variable called `weight`, which is just the weight of the cars in the data.

Let’s say that you want to create a new variable called `weight_kg`, which is the weight in kilograms.

There’s more than one way to do this. One way is to create a for-loop that cycles through all of the values and computes the value of the new variable. That’s sort of the hard way to do it.

A different way is to just use a pre-made function that will automatically compute the values of the new variable:

library(dplyr)
mutate(Auto, weight_kg = weight * .45)

This is much easier to do. You just need to know the right tool to use (in this case, the `mutate()` function from `dplyr`, one of the packages of R’s Tidyverse).

When you know the right functions and toolkits to use, data science “programs” become more like data processing “scripts.” You end up calling pre-built tools in sequence to process the data: input the data, clean the data, analyze the data.

What I’m getting at is that you should use pre-built functions and tools to perform these tasks. You should *not* try to build your own tools to accomplish these tasks.

That means that you shouldn’t need many traditional programming and software concepts. Ideally, you should avoid things like for-loops, classes, object oriented programming, and other software development concepts.

Having said all of that, I think that R is better than Python because R’s data toolkit is better developed and easier to use.

Specifically, I think that R’s toolkit requires less understanding of software development concepts. To be clear, Python does have pre-built data toolkits, just like R does. However, Python’s tools and syntax still have a “software dev” feel to them; they feel more reliant on software development concepts (like for-loops, classes, object orientation, etc). For example, if you skim through a few Python data science books, you’ll still see for-loops, class declarations, and other items that would be challenging to a beginner without a background in software development or computer science.

In contrast, R has a very clean set of tools for performing data processing tasks:

- `ggplot2` for data visualization
- `dplyr` for data manipulation
- `lubridate` for working with dates and times
- `stringr` for working with strings
- Etc …

In many cases, you can use these R tools without any knowledge of software development or computer science concepts.

Again, I’ll reiterate that Python has data processing libraries too. I’m not denying that. The difference is that the syntax for data science in R feels less like software development, and more like a toolkit for writing data processing scripts. R’s Tidyverse is much easier to learn for data science beginners with limited programming experience.

A related point is that R’s syntax is just a little simpler when performing many data science tasks.

Syntactically, the tools of R’s Tidyverse are very well designed. The functions and tools are very well named. Importantly, this makes them easy to use and easy to remember.

They are also designed in such a way that you can use them without running into small and subtle errors.

Let me show you an example.

Let’s say that you want to remove a variable from a dataframe. (Here, we’ll use the `iris` dataframe as an example because you can find it in both R and Python.)

Let’s take a look at the syntax for removing a variable for both R and Python:

**Python**

import seaborn as sns
import pandas as pd

iris = sns.load_dataset('iris')
iris.drop(['sepal_length', 'species'], axis = 1)

**R**

library(dplyr)
select(iris, -Sepal.Length, -Species)

(Note that R’s built-in `iris` dataframe capitalizes its column names, so the R call uses `Sepal.Length` and `Species`.)

Specifically, we’re comparing the final lines for each programming language; the “drop” syntax for Python and the “select” syntax for R.

Personally, I think that the R syntax is better and “cleaner” in some subtle but important ways.

Let’s first talk about the Python syntax. To remove a variable from a Python dataframe, we use the `drop()` method. That part is simple. `drop()` is well named and easy to remember. But after that, it gets a little more complicated.

The first argument of the `drop()` method is a list of the variables that we want to remove. A small problem here is that the variable names must be inside of brackets. Syntactically, there’s a reason for this (it’s a Python ‘list’ data structure). But no matter the reason, the fact that you need to use brackets around the variable names introduces a subtle bit of complexity that can be confusing for beginners. Oftentimes, a beginner will *forget* to use those brackets and will be very confused when the code doesn’t work.

Additionally, the variable names need to be enclosed in quotation marks. This is a very subtle bit of syntax, but if you don’t enclose the variables in quotations, you’ll get an error that says “`name 'sepal_length' is not defined`.” This is extremely subtle and it’s the sort of thing that a beginner will miss.

Finally, to drop the variables with Python’s `drop()`, we need to specify an “axis.” For a beginner, this begs the question: what the hell is an axis and why is it important?

To be honest, dataframe axes aren’t that hard to understand. However, my point is that these little syntactic quirks are the things that can confuse a beginner. And this is just one example. I can give you dozens of other examples where the Python data science syntax is confusing like this.
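To make those pitfalls concrete, here’s a short, runnable sketch in pandas. It’s a hypothetical example: it builds a tiny stand-in for the iris data by hand (rather than loading it through seaborn, so you can run it anywhere), and the `columns=` keyword shown at the end is an alternative spelling of `drop()` that sidesteps the “axis” question entirely.

```python
import pandas as pd

# A tiny, hand-built stand-in for the iris data (illustration only).
iris = pd.DataFrame({
    'sepal_length': [5.1, 4.9],
    'sepal_width':  [3.5, 3.0],
    'species':      ['setosa', 'setosa'],
})

# The names must be quoted strings, inside brackets (a Python list),
# and axis=1 tells pandas to drop columns rather than rows.
trimmed = iris.drop(['sepal_length', 'species'], axis=1)

# Equivalent, and arguably friendlier for beginners: the columns=
# keyword makes the "axis" question disappear.
trimmed_alt = iris.drop(columns=['sepal_length', 'species'])

print(list(trimmed.columns))  # ['sepal_width']
```

If you omit the quotes (`iris.drop([sepal_length, ...])`), Python looks for a variable named `sepal_length` and raises the `NameError` described above.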

Let’s contrast this with the R version.

To drop a variable in R, we’ll use the `select()` function from `dplyr`.

I think the worst thing about this syntax is that it’s called “select.” We need to use the `select()` function to drop a variable (it would probably be easier to remember if we could use a function called `drop()`).

However, once you remember that you need to use the `select()` function to remove a variable, the syntax is pretty straightforward. Inside the `select()` function, the first argument is the name of the dataframe. The next set of arguments is the set of variables that you want to drop, with minus signs in front of them, separated by commas.

This syntax feels much more intuitive to me. Yes, you need to remember to use the minus sign, but that feels intuitive. In a sense, we are “subtracting” the variables from the data. You also need to remember to separate the variables by commas. But again, that feels intuitive. Using the commas inside of `select()` feels like listing things in a sentence. Finally, in the R syntax, we don’t have to specify anything about an “axis,” like we did in the Python syntax.

Overall, R data science syntax feels intuitive. It almost feels like writing pseudocode. It’s easy to remember, easy to write, and easy to read. I strongly prefer the syntax for R’s Tidyverse over the data tools of Python.

Readers here at the Sharp Sight blog will know that we think that data analysis is a valuable and highly under-appreciated skill.

More often than not, a lot of junior-level data science simply amounts to hard-core data analysis. Lower level data science is like data analysis with power tools (such as R or Python).

At more advanced levels, data science can be more complicated than mere data analysis, but at lower levels, data analysis will be a large amount of your work.

That being the case, it pays to be able to do data analysis. You need to be able to explore datasets. You need to be able to clean datasets. You need to be able to find insights in data.

Traditionally, over the last few decades, the tool of choice for this was Microsoft Excel. More recently, the “data analysis” field evolved and became more advanced. Somewhere in the mid to late 90’s, “analysts” began using power tools like SQL, SAS and SPSS.

As the field evolved, you started seeing data analysis departments start calling themselves “analytics” departments. People in these departments used the data “power tools” of the time (SQL, SAS, SPSS) to create business value from larger amounts of data than was previously possible with Excel alone. In some sense, that’s all “analytics” was … it was just “data analysis” with power tools.

In many ways, “data science” is just the next evolution of analytics, which was just an evolution of data analysis. That is to say, “data science” is often just a really advanced version of data analysis.

If you find yourself in an environment where much of the data work is just “hard core data analysis” (instead of machine learning and advanced topics), I strongly recommend that you use R. Specifically, I recommend using the tools of the Tidyverse.

To put it simply, R’s Tidyverse packages (`ggplot2`, `dplyr`, `tidyr`, `stringr`, `lubridate`) are arguably the best set of tools on the market for manipulating, visualizing, and analyzing data. If your work consists mostly of creating large reports and ad-hoc data analyses, R and the Tidyverse are exceptional.

The reason why (as noted elsewhere in this post) is that the syntax for wrangling and analyzing data in the Tidyverse is superior. R’s Tidyverse syntax is easy to learn, easy to remember, and easy to use. Specifically, the various functions of the Tidyverse are very well named. You don’t have to remember a complicated function name with the Tidyverse. You don’t have to remember arcane syntax to get things done.

For example, if you want to “filter” your dataset and create a subset, there’s a simple function: `filter()`. If you want to select a specific column from a dataset, you can use the well-named `dplyr` function `select()`.

Moreover, the functions of the Tidyverse are highly modular. Every function does one thing, and it does that thing well. This modularity makes the functions easier to learn and remember. It also makes them work like building blocks. You can take many simple Tidyverse functions and “connect” them together to create a more complicated process. Using these modular functions is almost like snapping together little Lego building blocks. If you know how to put together simple pieces, you can perform analyses that are very complicated.

To put it simply, R is excellent for analyzing data and getting things done in an analytics environment. For analytics, R is superior to Python, in my opinion.

If you come from a statistics background and you’ve used R in the past, I think R might be a better fit than Python.

I’ve encountered many former statistics students who have used R in the past, but haven’t really done any programming.

In that case, I highly recommend R.

First, I specifically recommend the Tidyverse dialect of R, because it’s so easy to learn and use.

Moreover, there’s a strong ecosystem of statisticians and statistics experts in the R world. Anecdotally, more statisticians seem to use R than Python. For example, when looking at statistics textbooks, you’ll find that if they use code, they often use R.

You’ll also find that almost every statistical tool or algorithm has been implemented in R. Therefore, if you want to use some rare statistical techniques in your data science work, you’ll probably have an easier time finding those tools in R than Python.

Although I strongly recommend R for many beginning data science students, it’s not always the best choice.

For some people, Python is the best language to learn for data science. Python may be a better choice than R for people with particular backgrounds, goals, and interests.

Let’s talk about some cases where Python might be a better choice than R.

If you have experience in software development or computer science, I think that Python may be a better choice.

Here, I have in mind people who’ve learned basic programming and programming principles. For example, if you took a computer science class in college, or you were a CS major, Python may be a better choice than R. Similarly, if you come from a web development background, Python may be a better choice.

Now, I want to make it clear again that data science is not the same thing as software development. Frequently in data science, you’ll see fewer programming structures like for-loops, while-loops, and control structures. You’ll see more data manipulation or data visualization tools. More data wrangling. More charts and graphs. As I mentioned earlier, data science code often looks less like “software,” and more like a data analysis “script.” Of course, it’s not perfectly clear cut, but at entry levels, a lot of data science looks like data analysis scripting. It’s typically at more advanced levels where data scientists start creating proper software.

That being said, if you have programming experience, you might still feel more comfortable with Python.

Part of the reason for this is that in my opinion, Python is better for software.

I’ve already said that I think R is superior when you’re creating “data analysis scripts.” If you want to slice and dice some data, wrangle data, or visualize data, I think R’s Tidyverse packages are the best.

But if you want to build software *systems*, I think that Python is actually the better choice.

Writing software is where Python shines. For software, writing Python code just feels more effortless. As many experts have noted, writing Python code almost feels like you’re writing pseudocode.

Moreover, it’s commonly noted that Python is a better “all purpose” programming language. When discussing this, people frequently point out that Python is used more often by companies in production systems. People frequently comment that Python is more “production ready” and “all purpose” compared to R.

To be clear, I’m not saying that you can’t write software in R. I’m not saying that you can’t build production systems in R. I’m just saying that when a production system is necessary, many people prefer to build it in Python. Therefore, if you plan to create software systems as a data scientist, Python may be a better choice than R.

If you want to focus on machine learning in the long run, Python may be the best choice.

Now, I want to be clear: R does have a machine learning ecosystem. In particular, the `caret` package is well developed. `caret` has the ability to execute a wide variety of machine learning tasks. For example, with R’s `caret` package you can create regression models; you can create support vector machines; you can create decision trees (both regression and classification); and you can perform cross validation. R’s machine learning ecosystem is fairly well developed.

Having said that, Python comes out ahead here. Python’s scikit-learn provides a clean and easy-to-read syntax for implementing a variety of different machine learning techniques.

A big benefit here is just the simplicity of the scikit-learn syntax when comparing it to `caret`. R’s `caret` syntax feels a little clumsy sometimes. In particular, `caret` doesn’t integrate well with the Tidyverse ecosystem of R packages. Related to this, R’s tools for machine learning often produce outputs that are difficult to work with in the context of R’s data science ecosystem. In contrast, Python’s scikit-learn syntax feels better integrated into the broader Python ecosystem.

I think that Python also has better resources for studying machine learning. Although two of my favorite machine learning books use R code, I think that there is a broader set of books for machine learning that use Python.

All of that is to say, if you want to focus on machine learning, I think Python may be the better programming language.

Now that we’ve covered the strengths and weaknesses of R and Python, let’s talk about some other factors that might influence your decision.

This one is a big one.

If you already have friends or associates that use either R or Python, this might be a good reason to choose that particular language.

The reason for this should be obvious: you can learn a lot from people when you have direct contact with them.

So for example, if a good friend or associate is a highly skilled Python programmer, it might be a good reason to choose Python.

To put it simply, having a close community of people that you can learn from might trump the strength & weakness calculus that I discussed above.

Similar to the case where your friends use one particular language, you might want to choose a particular language if you have a specific career goal.

Specifically, if you want to work for a particular company, and you find out that they use a particular language, that could be a major influence on your decision.

For example, if you know you want to work at Google, and you find out that your ideal “team” at Google uses Python, that may be a reason to start learning Python.

Having said that, here’s a word of caution: don’t get your heart set on one particular company. In the short run, it can be difficult to get a dream job at the exact company of your choice. Landing a “dream job” takes hard work. You need the right skill set, and you often need the right network of friends, which will be tough to build.

So, targeting a specific company might influence your decision. On the other hand, it might be smart to keep your options open, just in case. Don’t let this be the only reason you choose one language over another.

So, which should you choose, R or Python?

I think there are pros and cons for both, so the ultimate answer is “it depends.”

R and Python are both great for data science, but they excel at different things.


I think that R and the Tidyverse are far superior for data visualization and analytics (i.e., finding insights in data). R also comes out ahead for most true beginners. If you’ve never done any programming or data science in the past, R is probably the better option.

On the other hand, Python – while being inferior for data visualization and analytics – is superior for machine learning. In my opinion, Python is also better for building software.

Where does that leave us? It depends. Who are you and what are your goals? If you want to be really good at data visualization, I think that R’s `ggplot2` is the best tool around. If you want to specialize in analytics and “finding insights,” I think that R is superior.

If you want to be a machine learning specialist, Python and scikit-learn are probably preferable.

Very quickly, I want to address a point raised by several other smart people.

This is not strictly an either/or decision. There is a third option: learn both.

I think that in the long run, a top-performing data scientist should know both R *and* Python. They are good at different things, so if you want to have a full toolkit, you should consider learning them both.

Having said that, you should focus on only one language at a time.

If you try to learn both at the same time, it will probably take longer. Dividing your attention reduces your focus. You’ll make much faster progress if you focus intensely on one language at a time.

That being said, you still should choose one right now, which leads me to my final point.

We’ve talked about the strengths and weaknesses of R vs Python, but now I want to bring up something that’s more important than making the “right” choice.

Making *a* choice is the most important thing.

Don’t spend months trying to figure out the “best” language for you.

Take a week or two to think about the pros and cons. Ask a few friends or mentors what they think. Think about your short, medium and long-term data science goals.

But then pick something.

Pick something and get started.

Take action. Start mastering the data science skill set. Whether you choose R or Python, you’ll still be able to learn data visualization and data analysis. Whether you choose R or Python, you’ll still be able to learn machine learning.

It’s important that you don’t get paralyzed trying to decide on the best language. Some people waste months trying to decide between languages, and they end up wasting time that they could spend mastering data skills.

Ultimately, whether you choose R or Python, it’s more important that you pick something.

There are pros and cons to each language. Both have strengths and weaknesses.

But either way, both R and Python are pretty damn good. It’s hard to make a mistake with either one unless you have very specific goals in mind that would require one over the other.

Here’s what I recommend. Think about this blog post. Re-read it again if you need to. If you have questions, send them to me at josh@sharpsightlabs.com.

Then give it a week to research and decide. After that, choose something. Get started.

The sooner you start, the sooner you’ll be prepared to actually work as a data scientist.

Whether you want to master R or Python, we can help.

Here at Sharp Sight, we teach data science.

And every week we publish data science tutorials to help you learn.

So if you’re interested in data science, sign up for our email list.

When you sign up for our email list, you’ll get free tutorials delivered to your inbox.

You’ll learn about data science in R, including `ggplot2`, `dplyr`, `tidyr`, `readr`, and the other packages of the Tidyverse.

You’ll also get tutorials about data science in Python, including tutorials about `numpy`, `pandas`, `matplotlib`, and `scikit-learn`.

The post R vs Python … which to learn for data science appeared first on Sharp Sight.

Readers here at the Sharp Sight blog will know how much we emphasize “foundational” data science skills.

If you want to be effective as a junior data scientist, you need to master the fundamental skills.

If you want to eventually move on to more advanced skills like machine learning and advanced data visualization, you need to master the fundamental skills.

One of those fundamental skills is data manipulation. To be a really effective data scientist, you need to be masterful at performing essential data manipulations. This is because a very large proportion of your work will just involve getting and cleaning data.

Among the simple data manipulation tasks that you need to be able to perform are:

- selecting columns from data
- subsetting rows of data
- aggregating data
- summarising data (calculating summary statistics)
- sorting data
- creating new variables

In this blog post, we’ll talk about the last skill in that list. Using mutate in R to create new variables.

Let’s quickly run through the basics of mutate.

Before we do that though, let’s talk about `dplyr`.

If you’re reading this blog post, you’re probably an R user. And there’s a good chance that you’re trying to figure out how to use the functions from `dplyr`.

If you’re not 100% familiar with it, `dplyr` is an add-on package for the R programming language. The `dplyr` package is a toolkit that is exclusively for data manipulation. More specifically, it is a toolkit for performing the data manipulation tasks that I listed above. It has one function for each of those core data manipulation tasks:

- `select()` selects columns from data
- `filter()` subsets rows of data
- `group_by()` aggregates data
- `summarise()` summarises data (calculating summary statistics)
- `arrange()` sorts data
- `mutate()` creates new variables

For the most part, `dplyr` only does these tasks. It essentially has one function for each of them. (Note that these `dplyr` “functions” are sometimes called “verbs.”)

Part of what makes `dplyr` great is that it is “compact.” There are only 5 or 6 major tools, and they are simple to use.

Now that we’ve discussed what `dplyr` is, let’s focus in on the `mutate()` function so you can learn how to use mutate in R.

The `mutate()` function is a function for creating new variables. Essentially, that’s all it does. Like all of the `dplyr` functions, it is designed to do one thing.

Using `mutate()` is very straightforward. In fact, using any of the `dplyr` functions is very straightforward, because they are quite well designed.

When you use `mutate()`, you typically need to specify 3 things:

- the name of the dataframe you want to modify
- the name of the new variable that you’ll create
- the value you will assign to the new variable

So when you use `mutate()`, you’ll call the function by name. Then the first argument is the dataframe that you want to manipulate.

For example, if you had a dataframe named `df`, that would be the first item inside of the parentheses (i.e., the first “argument” to the mutate function):

mutate(df, new_variable = existing_var * 2)

Remember that `mutate()` – like all of the `dplyr` functions – strictly operates on dataframes. It’s not set up to work with lists, matrices, vectors, or other data structures.

Ok, so the first argument is the name of the dataframe.

The second argument is a “name-value” pair. That might sound a little arcane, so let me explain it.

When you use `mutate()`, you’re basically creating a variable. The new variable needs a name, and it also needs a value that gets assigned to that name. So when you use mutate, you provide the name and the new value … a name-value pair.

Let’s take a look at our syntax example again:

mutate(df, new_variable = existing_var * 2)

You can see here in this dummy code example that we’re creating a new variable called `new_variable`. The value assigned to `new_variable` is the value of `existing_var` multiplied by 2. Note that in this example, we’re assuming a dataframe called `df` that already has a variable called `existing_var`.

That’s really it. To use mutate in R, all you need to do is call the function, specify the dataframe, and specify the name-value pair for the new variable you want to create.

The explanation I just gave is pretty straightforward, but to make it more concrete, let’s work with some actual data.

Here, I’ll show you how to use the `mutate()` function from `dplyr`.

First, let’s load a few packages. We’ll load `dplyr` so we have access to the `mutate()` function. We’ll also load the `ISLR` package. `ISLR` is a package that contains several datasets. For the record, this package is actually related to the excellent book, An Introduction to Statistical Learning … a book about machine learning. We won’t be doing any machine learning here, but if you’re interested, get that book.

Ok. Here’s the code to load the packages.

```r
#--------------
# LOAD PACKAGES
#--------------
library(dplyr)
library(ISLR)
```

We’ll be working with the `Auto` dataframe from `ISLR`.

Before we actually do anything with the data, let’s just inspect it.

Here we’ll print out the dataframe.

```r
#-------------
# INSPECT DATA
#-------------
print(Auto)
```

When you print it out, you can see that the data is a little hard to read. `print()` will print out every row of data.

Before we move on, let’s fix that.

The reason that the `print()` function prints out every row of data is because the `Auto` dataframe is an old-fashioned `data.frame` object, not a `tibble`. Tibbles print better. Keep in mind that tibbles actually __are__ dataframes, but they are modified dataframes. One of the things that is different about tibbles is that they print out with better formatting.
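You can verify this for yourself. Here’s a quick sketch (using a toy dataframe invented for illustration, not the `Auto` data) showing that a tibble’s class still includes `data.frame`:

```r
library(tibble)

# A toy dataframe, coerced to a tibble
tbl <- as_tibble(data.frame(x = 1:3))

# A tibble's class includes "data.frame" ... it still *is* a dataframe
class(tbl)
#> "tbl_df" "tbl" "data.frame"
```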

That being the case, I’m going to quickly coerce `Auto` to a `tibble` using `as_tibble()`. (Older tutorials use `as.tibble()`, which still works but is deprecated in recent versions of the `tibble` package.)

As I do this, I’ll also rename it to `auto_specs`. The name `Auto` is a little non-descript, and it starts with a capital letter, which I don’t like. So very quickly, I’ll rename it while I’m coercing it to a tibble.

```r
#-------------------------------
# RENAME DATA & COERCE TO TIBBLE
#-------------------------------
auto_specs <- as_tibble(Auto)
print(auto_specs)
```

This is much better.

You can see that when we print it out now, the `auto_specs` dataframe has a slightly more readable structure. This is because we coerced this data to a `tibble`.

Ok. Now we’re ready to use `mutate()`.

This is very straightforward.

We’re going to call the `mutate()` function, and the first argument (the first item inside the parentheses) is the dataframe we’re going to modify, `auto_specs`.

After that (and separated by a comma) we have the name-value pair for our new variable. The name of the new variable is `hp_to_weight` and the value is `horsepower` divided by `weight`.

```r
#----------------------------------
# CREATE NEW VARIABLE WITH mutate()
#----------------------------------
auto_specs_new <- mutate(auto_specs, hp_to_weight = horsepower / weight)
print(auto_specs_new)
```

That’s basically it. Using mutate in R to create a new variable is as simple as that.

There’s one thing that I want to point out. Notice that on the left-hand side of the `mutate()` call, I’ve used the assignment operator, `<-`.

Why?

I did this so that the output of `mutate()` is "saved" to a new dataframe, `auto_specs_new`.

All of the `dplyr` functions work with dataframes. The inputs to the `dplyr` functions are dataframes. The outputs are also dataframes. So, `mutate()` outputs a dataframe.

But by default, the `dplyr` functions send the output directly to the console.

What that means is that the `dplyr` functions do *not* automatically change the input dataframe.

Let me repeat that. The `dplyr` functions do *not* automatically change the input dataframe.

What that means is that if you *don't* use the assignment operator to save the output with a name, the changes will not be saved to the input dataset.

To see this, try running `mutate()` without saving the output to a new dataframe. Run the `mutate()` function, and then print out the column names of the original input dataframe.

```r
mutate(auto_specs, hp_to_weight = horsepower / weight)
colnames(auto_specs)
```

Take a look at the column names. `hp_to_weight` is not one of them!

That's because `mutate()` does not directly modify the input dataframe. It leaves the input dataframe unchanged and produces an output dataframe, which is sent to the console by default (i.e., the console just prints the output).

If you want to *save* the output, you need to use the assignment operator and save the output to an object name:

```r
auto_specs_new <- mutate(auto_specs, hp_to_weight = horsepower / weight)
print(auto_specs_new)
```

Notice that when we print out `auto_specs_new`, it now has the new variable `hp_to_weight`.

Let this be a reminder: if you want to add a new variable to a dataframe with `mutate()`, and you want that change to be permanent, you need to save the output by using the assignment operator.

The `mutate()` function is just one of several data manipulation tools that you'll need to learn if you want to master data science in R.

Here at Sharp Sight, we teach data science. And we want to help you *master* data science as fast as possible.

If you sign up now for our email list, you'll get more free tutorials about data manipulation in R.

You'll also get free tutorials about a variety of other data science topics like data visualization, geospatial visualization, and machine learning.

The post How to use mutate in R appeared first on Sharp Sight.
