3 Secrets for Mastering Data Manipulation

Many people struggle with data manipulation.

Have you ever started a data science project and gotten stuck?

Have you gotten stuck while trying to manipulate, clean, or “wrangle” your data?

I have.

The fact is, if you don’t know what you’re doing, data manipulation can be extremely complicated.

The vast majority of data science beginners become mired in this confusion. They get frustrated and many of them simply give up.

That’s a shame, because companies need skilled data workers.

But … data manipulation is easy once you know the secret

What if I told you, though, that learning, mastering, and using data manipulation is very easy once you approach it the right way?

It is.

… you just need to know a few secrets.

3 secrets for mastering data manipulation

There are a few secrets for mastering data manipulation, or anything for that matter.

I’m not going to give you all of the secrets (we teach most of them in our premium courses), but I will give you three, because I’m such a generous person. You’re welcome.

Here are 3 things that you need to do to master data manipulation:

  • identify the principal techniques
  • memorize the syntax of the principal techniques
  • learn to combine the principal techniques together

Let’s talk about each of these in depth.

Identify the principal techniques

The first thing that you need to do is simply identify what I call “the principal techniques.”

Essentially, you need to identify the unique set of techniques that (when combined together in the right ways) enable you to perform data manipulation.

In The 4-Hour Chef, a book nominally about cooking but actually about rapid learning, Tim Ferriss wrote about breaking a skill down into small pieces:

First and foremost … we answer the question: how do I break this amorphous “skill” into small, manageable pieces?

Ferriss wrote about this as “deconstructing” a skill.

But there’s actually a different way to think about it.

What we need to do is actually a lot like principal component analysis.

In principal component analysis, we take a complicated “high dimensional” space, and reduce it to a simpler space of uncorrelated directions (an orthogonal basis).

In principal component analysis and other dimension reduction techniques, we reduce complexity by identifying the unique, orthogonal, principal components.
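If you haven't seen principal component analysis before, here is a minimal sketch of the idea in Python using NumPy. The toy dataset and variable names are made up for illustration, and I'm computing the components with a singular value decomposition, which is one common way to do it, not the only one.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (an assumption for this sketch): 200 points in 3 dimensions,
# where the second column is almost a scaled copy of the first.
x = rng.normal(size=200)
data = np.column_stack([
    x,
    2 * x + rng.normal(scale=0.1, size=200),
    rng.normal(size=200),
])

# Center the data, then take the SVD. The rows of `vt` are the
# principal directions, ordered from most to least variance.
centered = data - data.mean(axis=0)
_, singular_values, vt = np.linalg.svd(centered, full_matrices=False)

# Share of the total variance captured by each component.
explained = singular_values**2 / np.sum(singular_values**2)

# The principal directions are orthonormal: vt @ vt.T is the identity.
gram = vt @ vt.T

# Keep only the first two components: a simpler, 2-dimensional space.
reduced = centered @ vt[:2].T
```

The point of the analogy is the last step: three correlated columns collapse into two uncorrelated directions that still carry almost all of the information.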

You can do something similar for data manipulation.

In data manipulation, we can reduce the apparent complexity of the problem by identifying the “principal techniques” that enable you to clean or manipulate a dataset.

When you distill data manipulation down into components, what are the principal techniques?

There are a few other things that you might need to do, but for the most part, these are the essential techniques that you need to wrangle your data. Almost everything in data manipulation or data cleaning is some combination of these.

So what are these in Pandas, if you want to do data manipulation in Python?

I actually addressed this in a previous blog post, where I wrote about the 19 techniques that you need to memorize.

In Pandas, they are:

  • read_csv
  • set_index
  • reset_index
  • loc
  • iloc
  • drop
  • dropna
  • fillna
  • assign
  • filter
  • query
  • rename
  • sort_values
  • agg
  • groupby
  • concat
  • merge
  • pivot
  • melt

When we identify the principal techniques of data manipulation in Pandas, it’s only 19 techniques.
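To make that list a little more concrete, here is a rough sketch that touches seven of those techniques on a tiny dataset. The sales data and column names are hypothetical, and the CSV is inlined as a string so the example runs without a file on disk.

```python
import io

import pandas as pd

# Hypothetical sales data, inlined so no external file is needed.
csv_text = """region,product,units,price
east,widget,10,2.5
west,widget,,2.5
east,gadget,4,10.0
west,gadget,7,10.0
"""

sales = pd.read_csv(io.StringIO(csv_text))            # read_csv
sales = sales.rename(columns={"units": "quantity"})   # rename
sales = sales.fillna({"quantity": 0})                 # fillna
sales = sales.assign(                                 # assign
    revenue=lambda df: df["quantity"] * df["price"]
)

big = sales.query("revenue > 20")                     # query
big = big.sort_values("revenue", ascending=False)     # sort_values
big = big.reset_index(drop=True)                      # reset_index

print(big)
```

Each line does one small, well-defined job, which is exactly why the individual techniques are worth learning one at a time.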

This simple act of identifying the principal techniques dramatically reduces the complexity.

Data manipulation becomes much simpler: you only need to know 19 techniques and how to apply them.

Think about it … if you learned one technique every two days, you could learn all 19 in 38 days, or a little over five weeks.

Memorize the syntax of the principal techniques

Your second goal is to memorize the syntax for these principal techniques.

Why?

One of the first productivity bottlenecks in data science is forgetting syntax.

Have you ever forgotten a piece of Python syntax or Pandas syntax, and needed to look it up on Google?

What happens?

You completely stop your progress and your train of thought, switch over to your web browser, and then have to search through a half dozen webpages (or more) to find the syntax and a good example that will show you how to use it.

Frequently, even after you find the syntax and copy-paste it into your code, you need to spend quite a bit of time modifying it to get it to fit into your code and to get it to work properly.

This whole process dramatically slows you down.

In fact, this is related to a well-known phenomenon in cognitive psychology called task switching. When you switch tasks or switch contexts, there’s a cost to your mental speed, accuracy, and performance.

You want to reduce this as much as possible. You want to reduce the need to stop and look up syntax.

Ideally, you want to write your code “fluently.” You want to write your code quickly, accurately, and from memory. The code should almost “flow” from your fingertips.

I often compare this state – the state of writing code fluently – to what athletes call being “in the zone.” You want to be in a state where everything just seems to happen effortlessly. That’s where you get real productivity.

But when you have to stop, search for code, cut-and-paste, and continue, the whole process grinds to a halt. You’re pulled out of the zone, and writing your code takes dramatically longer.

Forgetting code is the first major productivity bottleneck, because it pulls you out of the zone and slows you down by forcing you to look up syntax.

The way that you break through that bottleneck is you memorize the syntax.

When you memorize the syntax, you eliminate the problem of forgetting syntax, and you can focus on actually applying the syntax to solve problems.

Learn to combine the principal techniques together

This brings us to the third secret for mastering data manipulation: you need to know how to combine techniques together.

Recall what I said earlier: most data manipulation is performed by using a handful of “principal techniques”. These are discrete tools for some small data manipulation task, like deleting a row, filtering data, transposing data, etc.

Almost all data manipulation is performed by a combination of those principal techniques.

So once you’ve identified those techniques (secret 1) and you’ve memorized their syntax (secret 2), you just need to know how to put them together to perform more complex data manipulation operations.

I sometimes call this the Building Block Strategy.

In the Building Block Strategy, we’re just using the “principal techniques” as building blocks. We can combine those building blocks in different ways to achieve different results.

For example, when you clean your data, you’ll often apply most of the principal techniques (i.e., the “building blocks”) in order to filter, reshape, and join your data, as well as fix missing values.
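Here is one way a cleaning chain like that might look, as a sketch rather than a universal recipe. The two tables (`wide` and `stores`) and their columns are hypothetical, and treating missing sales as zero is an assumption you would need to check against your own data.

```python
import pandas as pd

# Hypothetical wide table: one row per store, one column per month.
wide = pd.DataFrame({
    "store_id": [1, 2, 3],
    "jan": [100.0, None, 80.0],
    "feb": [120.0, 90.0, None],
})

# Hypothetical lookup table with store details.
stores = pd.DataFrame({
    "store_id": [1, 2, 3],
    "city": ["Austin", "Boston", None],
})

clean = (
    wide
    .melt(id_vars="store_id", var_name="month", value_name="sales")  # reshape wide to long
    .fillna({"sales": 0})                       # fix missing values (assumes missing means zero)
    .merge(stores, on="store_id", how="left")   # join in the store details
    .dropna(subset=["city"])                    # drop rows for stores with no city
    .query("sales > 0")                         # filter out zero-sales rows
    .reset_index(drop=True)
)

print(clean)
```

Notice that every step in the chain is one of the principal techniques. The only new skill is choosing the order.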

When you need to aggregate your data, you’ll frequently filter rows, group by a categorical variable, and then summarize a numeric variable, all in combination.
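That filter, group, summarize pattern can be sketched in Pandas like this. The `orders` table and its column names are made up for the example.

```python
import pandas as pd

# Hypothetical orders data.
orders = pd.DataFrame({
    "category": ["books", "books", "toys", "toys", "toys"],
    "status": ["paid", "refunded", "paid", "paid", "paid"],
    "amount": [12.0, 30.0, 8.0, 15.0, 7.0],
})

summary = (
    orders
    .query("status == 'paid'")      # filter rows
    .groupby("category")            # group by a categorical variable
    .agg(                           # summarize a numeric variable
        total_amount=("amount", "sum"),
        mean_amount=("amount", "mean"),
    )
    .reset_index()
)

print(summary)
```

Three building blocks (query, groupby, agg), plus reset_index to turn the group labels back into a regular column.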

If you combine the right building blocks in the right order, you’ll get the right structure.

Similarly, if you combine the right data manipulation techniques in the right order, you’ll modify your data the right way.

Again: almost all data manipulation consists of combining the “principal techniques” in different combinations to achieve a specific result.

Mastering data manipulation is easier than you think

Learning and mastering data manipulation is much easier than you think, as long as you approach it the right way.

You need to …

  • identify the “principal techniques”
  • memorize the syntax of the principal techniques
  • learn to combine the principal techniques to do data manipulation tasks

And since there are only about 19 principal techniques for data manipulation in Pandas, those can be memorized in a few weeks (if you know how to memorize syntax).

Learning how to combine them together only takes a few more weeks, since there are repeatable “recipes” for doing most data manipulation tasks.

Ultimately, it’s possible to become highly skilled at data manipulation in about 8 weeks, give or take.

All you need is a clear system that guides you through the process.

Master data manipulation in Pandas in a few weeks

If you’re serious about mastering data manipulation with Pandas, you should join our course Pandas Mastery.

Pandas Mastery is our premium data science course to help you master data manipulation in Python using Pandas.

This course is the absolute fastest way to master Pandas, because it clearly identifies the “principal techniques” of data manipulation that we discussed in this blog post. Moreover, it will help you memorize the syntax for all of those techniques, and show you how to combine them to get things done.

If you’re ready to up your data science skills, this is the course you’ve been looking for.

The course will reopen for enrollment tomorrow, Tuesday July 21, and if you’re interested, you can sign up for the waitlist to be notified as soon as it opens.

Join the Pandas Mastery Waitlist

Joshua Ebner

Joshua Ebner is the founder, CEO, and Chief Data Scientist of Sharp Sight.   Prior to founding the company, Josh worked as a Data Scientist at Apple.   He has a degree in Physics from Cornell University.   For more daily data science advice, follow Josh on LinkedIn.
