You’ve probably heard it before: 80% of your work as a data scientist will be data wrangling.
While that’s sort of a rough number, experience bears out that data wrangling is a massive part of your job as a data scientist.
As such, it pays to know data manipulation. In fact, it pays to be really f*king good at data manipulation.
And when I say that it “pays,” I sort of mean that literally. If you want to get hired and get paid, your data wrangling skills should be solid.
At minimum, you need to know how to do several key data wrangling skills:
- Create new variables
- Sort data
- Summarise data (i.e. calculating summary statistics)
- Select specific columns
- Subset rows
In this blog post, we’ll talk about the last one: how to subset rows and filter your data.
What is the filter() function?
There are several ways to subset your data in R.
For better or for worse though, some ways of subsetting your data are better than others. Hands down, my preferred method is the
filter() function from
In this blog post, I’ll explain how the
filter() function works.
Before I do that though, let’s talk briefly about
dplyr, just so you understand what
dplyr is, how it relates to data manipulation. This will give you some context for learning about
A quick introduction to dplyr
For those of you who don’t know,
dplyr is a package for the R programing language.
dplyr is a set of tools strictly for data manipulation. In fact, there are only 5 primary functions in the dplyr toolkit:
filter()… for filtering rows
select()… for selecting columns
mutate()… for adding new variables
summarise()… for calculating summary stats
arrange()… for sorting data
dplyr also has a set of helper functions, so there’s more than these 5 tools, but these 5 are the core tools that you should know.
Subsetting data with dplyr filter
Let’s talk about some details.
How the dplyr filter function works
filter() and the rest of the functions of
dplyr all essentially work in the same way.
When you use the
dplyr functions, there’s a dataframe that you want to operate on. There’s also something specific that you want to do.
dplyr functions have a syntax that reflects this.
First, you just call the function by the function name.
Then inside of the function, there are at least two arguments.
The first argument is the name of the dataframe that you want to modify. In the above example, you can see that immediately inside the function, the first argument is the dataframe.
Next, is exactly how we want to filter the data. To specify how, we will use set of logical conditions to specify the rows that we want to keep. Everything else will get “filtered” out.
Using logic to filter your rows
Since we need to use logic to specify how to filter our data, let’s briefly review how logic works in R.
In R, we can make logic statements that are evaluated as true or false. Remember that R has special values for true and false:
Ok. Here’s a quick example of a logic statement that you can type into R:
10 > 1
Is 10 greater than 1? Yes! Of course.
As such, R will evaluate a this logic statement as
Although this logic statement is fairly simple, logic statements can be more complicated. We can use operators to combine simple logic statements into more complex logic statements.
The simple logical operators are:
We can use these to combine simple logic conditions into expressions that are more complex. For example:
# A TRUE statement made by combining simple logical expressions (10 > 1) & (1 != 2)
In the above example we have two simple logic expressions that have been combined with the ‘
Essentially, this statement is evaluating the following: Is 10 greater than 1 AND is 1 not equal to 2.
This statement is true. It’s true that 10 is greater than 1 and it’s also true that 1 is not equal to 2. Since both are true, the overall statement will be evaluated as true (remember … the
& operator requires both sides to be
This sort of logic is important if you want to use the dplyr filter function. That being the case, make sure that you understand logic in R. It’s beyond the scope of this blog post to completely explain logic, so if you’re confused by this, you’ll need to do more reading about logic in R.
Why we need logic to use dplyr’s filter function
So why the digression and review of logic in R?
Because you need to use logical expressions to use the
filter() function properly.
In some cases, we might want to filter the data with a simple expression … like keeping rows where a variable is greater than 10.
In other cases, we use much more complicated expressions.
But either way, logic is critical.
Examples of how to use the filter() function
So far, the explanation might seem a little abstract, so let’s take a look at some concrete examples.
We’ll start simple, and then increase the complexity.
Having said that, even before we actually filter the data, we’ll perform some preliminary work.
First, we’ll just load the
tidyverse package. Keep in mind that the
tidyverse package is a collection of packages that contains
ggplot2, and several other important data science packages.
In this blog post,
ggplot2 are important because we’ll be using both. Obviously, we’ll need
dplyr because we’re going to practice using the
filter() function from
Additionally, we’re going to work with the
txhousing dataset, which is included in the
ggplot2 package. So, we’ll need to have
ggplot2 loaded as well.
Again, loading the
tidyverse package will automatically load both
ggplot2, so we have them both covered.
#------------- # LOAD PACKAGE #------------- library(tidyverse)
Next, we’ll quickly inspect the data.
There are a few ways to do this, but I often use the
glimpse() provides quite a bit of information (like data types, row counts, etc) and the output is well formatted.
#------------------ # INSPECT DATAFRAME #------------------ glimpse(txhousing)
When inspecting your data, you’ll want to pay attention to a few things.
First, you’ll want to look at the variables. Knowing the variables and what they contain will give you a few ideas about how you can filter your data. For example, when looking at the data, I immediately think about filtering the data down to a particular year, or filtering to return records above a particular value for
median. Essentially, looking at the data will spark some ideas about how you might want to subset.
Second, pay attention to the number of rows. The total number of rows in a dataset can be a useful piece of information to capture. You might want to write it down in a little notebook as you’re analyzing your data. The reason is that as you filter, subset, and otherwise wrangle your data, it can be useful to know the original number of records.
For example, if you split your data and then perform some manipulations on that data, you might need to check that your two different datasets still collectively contain the same number of rows as the original dataset.
Essentially, knowing the original number of rows can help you “check your work” as you move through an analysis. I won’t discuss data analysis workflow in detail here, but understand that you should pay attention to the number of records.
A simple example of filter()
Let’s start with a very simple example.
filter(txhousing, year == 2001)
Pay attention to a few things.
First, at a quick glance, it appears that the records were filtered correctly. All of the rows have
year equal to
Second, there are 552 records in the output dataframe. At the very least, this tells us that the
filter() operation did create a subset.
As a quick check, we can take a look at the number of observations for every value of
txhousing %>% group_by(year) %>% summarise(record_count = n())
By summarizing the data by
year, we can look specifically at the number of records for the year
2001. We’re doing this to check that our filter operation worked correctly. This summary table shows that there is 522 records for the year
2001, which matches the number of records when we filtered our data with
filter(txhousing, year == 2001). So, it looked like our use of
filter() worked correctly.
Keep in mind, checking your data like this can be useful when you’re performing data manipulation.
Filter data using two logical conditions
In our last example, we filtered the data on a very simple logical condition. We filtered the data and kept only the records where year is exactly
What if we want to filter on several conditions?
To do that, we need to use logical operators.
Example: year equal to 2001 AND city equal to ‘Abilene’
Let’s take a look at a concrete example. In this new example, let’s extend the previous example. Previously, we filtered the data to keep only the records where
year == 2001.
Now, we’ll keep records where
year == 2001 and
city == 'Abilene'.
Syntactically, here’s what that looks like:
filter(txhousing, year == 2001 & city == 'Abilene')
And here’s the output:
This should make sense if you already understood the previous examples.
We’re still applying the dplyr
filter() function. The first argument is the dataframe that we’re manipulating,
The next argument (after the comma) is a mildly complex logical statement. Here, we’re telling the
filter() function that we only want to return rows of data where the
year variable is equal to
2001, and the
city variable is equal to
Again, this is pretty easy to understand, because the syntax almost reads like pseudocode.
A critical part of this syntax that you need to understand is the “and” operator:
&. This requires that both conditions be true. Year must be equal to 2001 and the city must be Abilene. If both conditions are not met for a particular row, that row will be “filtered” out.
So that’s how
filter() works. It’s checking logical conditions. If the logical condition or conditions are not met, then the row is filtered out.
Example: city equal Austin OR city equals Houston
Let’s try another example.
Here, we’ll keep rows where
city equals Austin or
city equals Houston.
To do this, we will use the ‘or’ operator, which is the vertical bar character: |.
filter(txhousing, city == 'Austin' | city == 'Houston')
This is very straightforward.
filter() will keep any row where
city == 'Austin' or
city == 'Houston'. All of the other rows will be filtered out.
Filtering using the %in% operator
Let’s say that you want to filter your data so that it’s in one of three values.
For example, let’s filter the data so the returned rows are for Austin, Houston, or Dallas.
One way of doing this is stringing together a series of statements using the ‘or’ operator, like this:
filter(txhousing, city == 'Austin' | city == 'Houston' | city == 'Dallas')
This works, but frankly, it’s a bit of a pain in the ass. It’s a little verbose.
There’s another way to do this in R using the
filter(txhousing, city %in% c('Austin', 'Houston', 'Dallas'))
Basically, this returns the records where
city is “in” the set of options consisting of Austin, Houston, or Dallas.
It can get more complicated, but master the basics first
Filtering data can get more complicated.
You can filter by using even more complicated logical expressions. There are also other more complicated techniques that you can perform by using functions and dplyr helper functions along with
Having said that, make sure you master the basic techniques first before you start working on more complicated techniques.
A quick warning: save your new dataframes with new names
Before wrapping this up, I want to mention one “quirk” about the filter() function.
This is important, so pay attention …
When you use the
filter() function, it does not modify the original dataframe.
Let me show you an example. Here, we’ll use the filter operation on
txhousing to filter the data to rows where
city == 'Houston'.
filter(txhousing, city == 'Houston')
Now, let’s inspect the data using
There’s still 8,602 rows. You can immediately see that the data still contains records where the
city variable is Abilene.
What the hell?! Didn’t we just filter the data?
Yes, but you need to understand that the
filter() function doesn’t change the original dataset. I repeat:
filter() does not filter rows out of the input dataset.
filter() returns a new dataset that contains the appropriate rows.
What that means is that if you run the examples I’ve shown you so far in this blog post, they will not change the original dataset. The new filtered data is just returned and sent directly to the terminal.
If you want to save this new filtered data (instead of having it sent directly to the terminal), you need to save it with a new name using the assignment operator.
For example, you could perform the filter operation above and give the output dataframe a new name:
txhousing_houston <- filter(txhousing, city == 'Houston')
Now, if we examine
txhousing_houston, you’ll see that it contains the appropriate filtered rows.
The data now have 187 rows, and at a quick glance, it appears that they are all for records where city is Houston. This looks correct.
Basically, I just want to remind you and reiterate that if you want to save and continue working with the filtered data that comes out of the
filter() function, you need to save that data with a new name.
Questions? Leave a comment below
Still confused about the dplyr filter function?
Leave your question in the comments below …