Long time readers at the Sharp Sight blog will know the emphasis I place on data manipulation.
You’ve heard me repeat the statistic many times: 80% of your work as a data scientist will be data manipulation.
Although it’s a little more complicated than that, roughly speaking, it’s true. A lot of your work will involve getting, cleaning, reshaping, and aggregating your data.
Data manipulation is not only one of the most common tasks, it’s also one of the most challenging for most people trying to learn data science.
Have you ever gotten stuck when trying to clean or wrangle a dataset?
You’re not alone.
Data manipulation is challenging.
A secret about data manipulation
Having said that, there’s a secret that you should know about data manipulation.
Even though many people get stuck, it’s much easier than you think, once you look at it the right way.
There are 19 “building blocks” for data manipulation in Python
To do data manipulation Python, you only need to be able to use a handful of techniques.
In particular, in Python, there are 19 “core” techniques that make up the vast majority of the work.
These 19 techniques are like building blocks …
They are like little components that you can just snap together to do data manipulation tasks.
When you look at it this way, data manipulation becomes much more simple.
For the most part, to master data manipulation, you only need to do two things:
- Learn the 19 data manipulation techniques
- Learn how to put the techniques together
When you think about it that way, it’s really not hard at all!
That’s why in our Python data manipulation course, Pandas Data Mastery, we help our students memorize these 19 techniques. Once you memorize the syntax for these techniques, data manipulation simply becomes a matter of “snapping them together” in the right order to get the end result.
Once you know the simple building blocks, you can put them together to perform fairly complex data manipulation in Python.
The 19 techniques you need to memorize
So what are these techniques?
To master data manipulation in Python using Pandas, here’s what you need to learn:
- read csv
- set index
- reset index
- sort values
All things considered, this is a pretty brief list.
You don’t need to learn 1000 things. Only 19.
You could do it in only a few weeks.
Let’s quickly review what each of these do.
When you start wrangling and cleaning your data, one of the first things that you typically need to do is just get your data.
Often, your data will be in an external data source, and one of the most common sources is
.csv files (“comma separated” text files).
That being said, you should know the Pandas
read_csv() function and how to use it.
After you get your data, you’ll often need to set an “index” for the rows of the dataframe.
Indexes make it easier to retrieve and subset your data. They are also used for “join” operations, when you need to join together multiple datasets.
That being the case, you need to know what indexes are, and you need to be able to use the Pandas
Sometimes, when you’re wrangling your data, you need to remove or change the index of the dataframe.
This is particularly true once you start doing more advanced computations … sometimes, you need to change the index or temporarily remove it to use a data visualization technique.
reset_index() will enable you to remove or change the dataframe index.
Assuming that you have your indexes set properly, one task that you will need to do over and over is create subsets.
It’s common to subset your data in the beginning stages of data cleaning (to inspect the data), but it’s also extremely common to create subsets at later stages as well … both later in the data manipulation phase, and also in more advanced data science phases like data visualization and analysis.
.loc method enables you to subset data by character index.
.iloc method is similar to
.loc in that it enables you to subset by an index, but when you use
.iloc, you subset by the integer index of the dataframe. (Dataframes have integers index values for every row and column.)
Again, subsetting is really important for data inspection, but also for data visualization and analysis. You need to know this.
In data manipulation and cleaning, there are times when you need to delete rows of data.
Sometimes, you want to delete rows permanently, because they don’t belong in your data for one reason or another.
Other times, you just want to remove rows temporarily so they don’t appear in a data visualization or data aggregation.
In Pandas, you can delete rows of data using the
Alternatively, you might only want to drop missing data.
So for example, if all of the data are missing for a row, it’s fairly likely that you’ll want to remove that row of data.
In other cases, you might not care if the data are missing for certain variables, but if the data are missing for an “important” variable, you will need to delete the row.
For example, if you have data for customers at a bank, it might be okay if the data for the customer’s age is missing, but it might be really bad if their name is missing. So you want to be able to selectively delete a row if the data are missing for a person’s name.
In Pandas, you can do this with
dropna() function enables you to delete rows with missing data … either if all of the data are missing, or if the data are missing for particular variables.
Dealing with missing values is an important part of data cleaning.
As mentioned above, you can delete rows with missing data.
But sometimes, you just want to fill in the missing data.
With Pandas, you can do this with
fillna() function lets you “fill in” the missing values with a value of your choice.
Once you fix problems in the rows of data, you commonly want to fix issues with the columns in your data.
One of the most common problems is that the names of the variables are incorrect or problematic in some way.
You can use the Pandas
rename() function to rename variables in a Pandas dataframe.
It’s also common to create new variables.
For example, let’s say that you have data about houses that were sold. The data has one variable called
price, which is the sale price of the house. And the data has another variable called
square_feet, which specifies the square feet of the home.
What if you want to do an analysis that involves price per square foot?
You’ll probably need to create new variable that contains price per square foot.
In Python, you can do this with the Pandas
assign() function is simple. It enables you to create new variables in a Pandas dataframe.
Sometimes, after you clean up the rows and columns, you need to “reshape” your data.
We call this process of “reshaping” your data transposition.
One of the tools for doing this in Pandas is the
The other major tool for reshaping your data is the
I’ll be honest: reshaping data with
pivot() often confuse people.
A lot of people get completely stuck at this stage.
The truth though, is that transposing data isn’t that hard. You just need someone to show you clear examples that build your intuition, and you need to master these two functions,
Once you know how they work, and once you’ve memorized the syntax, transposing your data is really not hard.
Once you’ve cleaned rows and columns and transposed your data into the right shape, you sometimes need to combine data. This is the case if you have multiple datasets.
One of the ways to combine data is with
concat() function stacks multiple datasets. So if you have multiple datasets with different rows that you need to combine together, you can use
concat() to “stack” them into one dataframe.
If you have multiple dataframes, you might also need to “merge” your data.
A merge operation is when you need to join two datasets together on an index or on the values for some row.
So let’s say that you have two different datasets. Each dataset has rows of data for particular people, and you want to “join” or combine the data based on the peoples’ names.
You can do this with the Pandas
merge() function (although it’s also possible to do it with the
join() function as well).
So once you have your data cleaned up, reshaped, and joined together, what do you do?
Often, you need to analyze your data.
This often requires you to create new subsets of data, and then sort, group, and aggregate.
So one thing that you need to do is subset down to specific columns.
filter() function enable you to do this. It enables you to select specific columns of a dataframe.
(To be clear, you also use the
filter() function during data cleaning.)
During analysis, it’s also extremely common to subset the rows.
Earlier, I mentioned
iloc, which enable you to subset rows based on index values.
But you can also subset rows based on logical conditions.
For example, if you were analyzing data with bank customers, you might want to retrieve data only for customers who are 18 or older. The logical condition,
age >= 18, acts as a condition for the subset … keep a row if
True and exclude the row if
You can perform operations like this in Python with the
Keep in mind that there are other ways to do this in Python besides the
query() function. Many people use a complicated “bracket” notation to retrieve data from Pandas dataframes based on logical conditions.
To be honest, the other ways to do this are a lot more complicated and harder to read.
If you aren’t already using it, you should be using the
query() function to subset your data based on logical conditions. The code is much easier to read and write than the alternatives.
Sometimes, when cleaning, inspecting, or analyzing your data, you need to sort the rows based on some variable.
So for example, if you have a dataset with different people, maybe you want to sort the rows of data by age.
In Pandas, you can do this with the
This is fairly common in data inspection.
But it’s also very common in data visualization and data analysis.
If you’ve ever needed to display your data in a specific, sorted order, this is the tool that you need.
In data analysis, it is extremely common to aggregate your data by a group.
That is, you commonly need to compute some metric by category. For example, lets say that you have bank data for individual bank customers. If you want to compute the average checking account balance by city, you’ll need to group your data by a
To do this in Pandas, you need to use the
I cannot emphasize how common this is. This is a critical data manipulation tool that you will use over and over again, both in data cleaning, but also particularly in data analysis.
If you feel like you can’t analyze your data well, you probably need to learn how to use this function better.
In Pandas, we use
agg() to compute summary statistics.
So if you have a large dataset, and you need to compute the average, maximum, minimum, or sum of some variable,
agg() is the tool you need.
This is a critical tool for data analysis. To analyze your data, you should be using this frequently.
Moreover, it becomes even more powerful once you combine it with
Data wrangling is much easier than you think
That’s about it.
The 19 tools that I just listed account for the vast majority of all data cleaning and data manipulation.
… that includes most of the data manipulation that you need for data visualization and analysis.
Think about it.
You only need to master 19 techniques to do the vast majority of data manipulation.
If you learned one technique per day, you could learn them all in less than a month.
Memorization is critical for mastery
But, but, but !!!!
The devil is always in the details.
By narrowing things down to 19 techniques, we’ve dramatically simplified data manipulation.
Still, mastering these techniques isn’t entirely easy, otherwise you probably would have done it already.
This is where memorization comes in.
If you want to master these 19 data manipulation tools FAST, then you need to memorize them.
No screwing around.
None of this “I’ll watch a few videos” and hope for the best.
If you want to master these data manipulation tools ultra fast, and dramatically improve your productivity, you need to focus on memorizing these techniques.
Because until you memorize them (one way or another), you’ll continue to forget them.
And if you forget them, you need to stop your work to look up the syntax.
This stop/start, cut-and-paste style coding is slow and horribly unproductive.
On the other hand, deliberately memorizing the syntax for these techniques removes one of the major bottlenecks of data science productivity: forgetting syntax.
Once you memorize and remember these techniques, your productivity will skyrocket.
Memorize these 19 techniques
So in some sense, the secret to mastering data manipulation …
… and in particular, the secret to mastering data manipulation FAST …
Is memorizing the syntax of these 19 special techniques.
If you’re serious about mastering data manipulation in Python, you should focus relentlessly on memorizing the syntax of these 19 techniques.
Once you do, your productivity and effectiveness will increase dramatically.
Discover how to memorize Pandas syntax
If you want a highly effective practice system for memorizing these Pandas data manipulation techniques, you should join our course, Pandas Mastery.
Pandas Mastery is our online course that will teach you these critical data manipulation tools, and show you how to memorize them.
Inside the course, you’ll discover our unique practice system that will enable you to memorize all 19 of these data manipulation tools.
And, it will only take a few weeks.
We’re re-opening Pandas Mastery for enrollment next week (July 21).
If you have questions about it, just leave your question in the comments section below.