Jesus.
I just spent about an hour playing with GPT, asking it to write some Pandas code for me, and I want to set my computer on fire.
Do you know the frustration and mild contempt you feel when something will make your life harder, but also offend your aesthetic sensibilities?
Like going to a rental car vendor and getting a 2005 Pontiac Aztek that also has broken cupholders.
I mean, it will probably be frustrating to use, but it’s also an affront to my good taste.
This is my reaction to GPT’s Pandas code.
To be fair: it didn’t do absolutely everything wrong.
But at least 80% of the time, it wrote code like a beginner who doesn’t know what the f*ck he’s doing, and has learned everything from other beginners posting snippets on Stack Overflow (I’m being a bit of an asshole … Stack Overflow occasionally has great content).
I’m Going to Show You GPT’s Bad Pandas Code
In this blog post, I’m going to show you the bad Pandas code that GPT generated for me.
I’ll walk through a small but enlightening range of examples, which should be enough to show the types of mistakes that GPT makes.
For clarification, these examples were made with GPT-3.5, which is the one that I can use most consistently without limits.
A Quick Caveat, For Beginners
Ok.
I’m about to show you some code that will probably look “fine” to you if you’re a beginner. I might even offend you by telling you that it’s bad, because there’s a good chance that you’ve written code like this or used code like this in the past.
I understand.
While I will criticize this code for being bad, it’s okay if you’ve written or used something similar. We were all beginners once, including me.
BUT, there is a better way. I have a strong perspective on Pandas code, and code in general.
If you write code like the GPT code I’m about to show you, just know that my jabs are mostly in good fun, and that I want to help you improve.
Note on Datasets
In the first few examples, I gave it a custom dataframe with comma delimited data as follows:
import pandas as pd

sales_data = pd.DataFrame({
    "name": ["William", "Emma", "Sofia", "Markus", "Edward", "Thomas", "Ethan", "Olivia", "Arun", "Anika", "Paulo"],
    "region": ["East", "North", "East", "South", "West", "West", "South", "West", "West", "East", "South"],
    "sales": [50000, 52000, 90000, 34000, 42000, 72000, 49000, 55000, 67000, 65000, 67000],
    "expenses": [42000, 43000, 50000, 44000, 38000, 39000, 42000, 60000, 39000, 44000, 45000],
})
I’ll also use the titanic dataframe from Seaborn in one or two of the later examples.
Subset Rows on a Single Numeric Condition
First, I asked GPT to write code to subset the rows of sales_data based on a value of a numeric variable.
Specifically, I asked it to retrieve rows where sales is greater than 60000.
Here was the conversation:
And here is the code.
sales_data[sales_data['sales'] > 60000]
Let me just say that I hate this style of Pandas code.
I call this “Bracket notation.”
You’ll see here that you need to reference the name of the dataframe multiple times: once to specify the dataframe that we’re subsetting, and again to retrieve the sales variable.
This is bad Pandas code. It’s messy, and harder to read than it should be. If you don’t understand why it’s bad yet, then keep reading the other examples.
The Better Way
Here’s how I would do it.
Use the Pandas query method instead:
sales_data.query('sales > 60000')
The code is simpler, but also more intuitive.
Here, we’re taking a dataframe and performing a “query” operation on it to retrieve a specific subset. It’s more human-readable. And the advantage of this will become more obvious as we move on to more complex examples.
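If you want to convince yourself that the two styles really return the same rows, here is a minimal sketch using the sales_data dataframe from the note on datasets. The variable names by_brackets and by_query are just illustrative names I've picked for this check.

```python
import pandas as pd

# The same sales_data dataframe defined earlier in the post
sales_data = pd.DataFrame({
    "name": ["William", "Emma", "Sofia", "Markus", "Edward", "Thomas",
             "Ethan", "Olivia", "Arun", "Anika", "Paulo"],
    "region": ["East", "North", "East", "South", "West", "West",
               "South", "West", "West", "East", "South"],
    "sales": [50000, 52000, 90000, 34000, 42000, 72000,
              49000, 55000, 67000, 65000, 67000],
    "expenses": [42000, 43000, 50000, 44000, 38000, 39000,
                 42000, 60000, 39000, 44000, 45000],
})

# Bracket notation: the dataframe name appears twice
by_brackets = sales_data[sales_data["sales"] > 60000]

# query method: the dataframe name appears once
by_query = sales_data.query("sales > 60000")

print(by_query)
print(by_brackets.equals(by_query))  # True: same rows, same columns
```

Both versions keep the five rows with sales above 60000. The only difference is how much you have to type and how easy it is to read back later.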
Subset Rows on a Single Categorical Condition
Next, we’ll do a similar example, but with a categorical variable used to subset the data instead of a numeric variable:
Here’s the actual code produced by GPT:
sales_data[sales_data['region'] == 'East']
You’ll notice that it’s the same issue as the previous example: GPT is using that terrible “bracket notation”.
This code is difficult to read, and unintuitive.
And as I said previously, this problem will compound as the examples get more complex.
The Better Way
Here’s how I would do it.
sales_data.query('region == "East"')
Like the previous example, you should be using the ‘query’ method.
It’s easier to read, easier to understand, and easier to modify.
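As one illustration of "easier to modify", here is a small sketch that goes a step beyond the example above: query lets you reference a Python variable by prefixing it with @. The variable name target_region is hypothetical, chosen only for this sketch.

```python
import pandas as pd

# The same sales_data dataframe defined earlier in the post
sales_data = pd.DataFrame({
    "name": ["William", "Emma", "Sofia", "Markus", "Edward", "Thomas",
             "Ethan", "Olivia", "Arun", "Anika", "Paulo"],
    "region": ["East", "North", "East", "South", "West", "West",
               "South", "West", "West", "East", "South"],
    "sales": [50000, 52000, 90000, 34000, 42000, 72000,
              49000, 55000, 67000, 65000, 67000],
    "expenses": [42000, 43000, 50000, 44000, 38000, 39000,
                 42000, 60000, 39000, 44000, 45000],
})

# Literal value written directly in the query string
east_literal = sales_data.query('region == "East"')

# Same filter, but the value comes from a Python variable (note the @ prefix)
target_region = "East"
east_variable = sales_data.query("region == @target_region")

print(east_variable)
```

To look at a different region, you change one variable and leave the query string alone.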
Subset on Two Conditions
Ok. Now, let’s move on to a more complex example.
Here, we’ll subset on two conditions. These will be similar to the conditions in the two previous examples (one on sales, one on region), but we’re going to ask GPT to do them both in one piece of code.
Here’s the interaction with GPT:
And here’s the code that it created:
sales_data[(sales_data['sales'] > 50000) & (sales_data['region'].isin(['East', 'West']))]
The code just gets uglier.
Notice how many times it uses the name of the data, sales_data.
And notice how many nested brackets and parentheses are in here.
This code is hard to read, and it will be harder to modify.
The Better Way
One better way to write this is with a single query statement, and an “and” operator.
sales_data.query('(sales > 50000) and (region in ["East", "West"])')
Notice that you only need to reference the dataframe once. The code is also a little easier to read.
But, you can also do this with Pandas method chaining.
(sales_data
    .query('sales > 50000')
    .query('region in ["East", "West"]')
)
Here, we’re actually filtering the rows twice … once based on sales, and once based on region.
Because we’re doing this in two steps, this code will be easier to modify (e.g., it’s easier to comment out one of the steps, if we need to).
This code is more intuitive, easier to read, and easier to use.
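Here is a short sketch, again on the sales_data dataframe, that puts the three versions side by side and checks that they agree. The names bracket_version, single_query, chained, and sales_only are illustrative names of my own.

```python
import pandas as pd

# The same sales_data dataframe defined earlier in the post
sales_data = pd.DataFrame({
    "name": ["William", "Emma", "Sofia", "Markus", "Edward", "Thomas",
             "Ethan", "Olivia", "Arun", "Anika", "Paulo"],
    "region": ["East", "North", "East", "South", "West", "West",
               "South", "West", "West", "East", "South"],
    "sales": [50000, 52000, 90000, 34000, 42000, 72000,
              49000, 55000, 67000, 65000, 67000],
    "expenses": [42000, 43000, 50000, 44000, 38000, 39000,
                 42000, 60000, 39000, 44000, 45000],
})

# GPT's bracket version
bracket_version = sales_data[
    (sales_data["sales"] > 50000) & (sales_data["region"].isin(["East", "West"]))
]

# One query call with an "and" operator
single_query = sales_data.query('(sales > 50000) and (region in ["East", "West"])')

# Two chained query calls, one condition per line
chained = (
    sales_data
    .query("sales > 50000")
    .query('region in ["East", "West"]')
)

# Temporarily dropping the region filter is a one-character change
sales_only = (
    sales_data
    .query("sales > 50000")
    # .query('region in ["East", "West"]')
)

print(chained)
print(bracket_version.equals(single_query), single_query.equals(chained))
```

All three filters return the same five rows. The chained version is the only one where switching a condition off doesn't require re-balancing parentheses.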
Sort Data in Descending Order
Data sorting is one of the few things that it did well.
Here, I’ve asked it to sort the data by sales, in descending order.
This is the code output:
sales_data.sort_values(by='sales', ascending=False)
This is fine. It’s exactly how I would have done it.
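For completeness, here is a minimal sketch of that sort running on the sales_data dataframe, with a head(2) step added by me to pull out the two highest sales figures. The names sorted_sales and top_two are illustrative.

```python
import pandas as pd

# The same sales_data dataframe defined earlier in the post
sales_data = pd.DataFrame({
    "name": ["William", "Emma", "Sofia", "Markus", "Edward", "Thomas",
             "Ethan", "Olivia", "Arun", "Anika", "Paulo"],
    "region": ["East", "North", "East", "South", "West", "West",
               "South", "West", "West", "East", "South"],
    "sales": [50000, 52000, 90000, 34000, 42000, 72000,
              49000, 55000, 67000, 65000, 67000],
    "expenses": [42000, 43000, 50000, 44000, 38000, 39000,
                 42000, 60000, 39000, 44000, 45000],
})

# Sort the whole dataframe by sales, largest first
sorted_sales = sales_data.sort_values(by="sales", ascending=False)

# The same sort as one step in a chain, followed by head(2)
top_two = (
    sales_data
    .sort_values(by="sales", ascending=False)
    .head(2)
)

print(top_two)
```

Because sort_values is already a method, it drops straight into a chain without any changes.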
Perform a Multistep Operation to Query, Filter, Sort
Now, I’ll show you one more example.
This should demonstrate how bad the GPT code is, and how much better it could be if you wrote the code the right way.
In this example, I asked GPT to operate on the titanic dataframe.
I’ll leave out the conversation that I had with GPT, but it actually took me 2 or 3 tries to get it to retrieve the titanic dataframe properly. This is because the titanic data exists in multiple packages. GPT named the loaded data differently than I wanted, so I needed to clarify the exact name to give it.
Setting that aside, before we look at the Pandas data wrangling, here is the code to load the data:
import seaborn as sns
titanic = sns.load_dataset('titanic')
Ok.
Once we got the data, I asked GPT to do a multistep data wrangling operation to do 3 things:
- subset the rows based on a categorical variable (return only rows with a particular value)
- subset the columns to retrieve only 3 specific columns
- sort the data in descending order by one of the variables
Here’s the conversation I had with GPT:
And here’s the code it produced.
titanic_subset = titanic[titanic['embark_town'] == 'Southampton'][['sex', 'age', 'survived']].sort_values(by='age', ascending=False)
To put it bluntly, I hate this code.
How many f*cking brackets do you need to do some simple data wrangling?
A lot.
And for 2 out of 3 of the steps, there are no human-readable commands that clearly say what the code is doing. We can see “sort_values” sorting the data, but is it immediately and totally clear where there’s a row-subset and a column-subset? Only if you’re well versed in this type of bad Pandas code. Any relative beginner will be confused.
And what if I want to temporarily remove one of these steps? It will be difficult to do, because the code is all on one line.
This is bad code. If you write code like this: there’s a better way.
The Better Way
The right way to do this is using Pandas chaining, with every operation on a separate line.
This will make the code easier to read, easier to modify, and it will make it aesthetically cleaner all around.
Let’s take a look.
(titanic
    .query('embark_town == "Southampton"')
    .filter(['sex', 'age', 'survived'])
    .sort_values(['age'], ascending = False)
)
Every operation has its own line.
The commands are relatively human readable.
If you want to remove any of these steps temporarily, you can simply comment the line out with a ‘#’ character at the beginning of the line.
This code is much better.
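To show the "comment a step out" trick in action without downloading anything, here is a sketch on a tiny stand-in dataframe. The rows in titanic_demo are hypothetical values I made up, using the same column names as the Seaborn titanic data; with the real data you would swap in titanic from sns.load_dataset('titanic').

```python
import pandas as pd

# Hypothetical stand-in rows (NOT real titanic records), same column names
titanic_demo = pd.DataFrame({
    "embark_town": ["Southampton", "Cherbourg", "Southampton", "Queenstown", "Southampton"],
    "sex": ["male", "female", "female", "male", "male"],
    "age": [22.0, 38.0, 26.0, 35.0, 54.0],
    "survived": [0, 1, 1, 0, 0],
    "fare": [7.25, 71.28, 7.92, 8.05, 51.86],
})

# Full chain: subset rows, subset columns, sort
full_chain = (
    titanic_demo
    .query('embark_town == "Southampton"')
    .filter(["sex", "age", "survived"])
    .sort_values(["age"], ascending=False)
)

# Same chain with the column subset temporarily commented out
no_column_subset = (
    titanic_demo
    .query('embark_town == "Southampton"')
    # .filter(["sex", "age", "survived"])
    .sort_values(["age"], ascending=False)
)

print(full_chain)
print(no_column_subset)
```

The second result keeps all five columns, and nothing else about the chain had to change.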
Use Pandas Methods, Use Pandas Chains, Ignore GPT
The first lesson here should be clear: GPT writes bad Pandas code.
Will it improve over time? Maybe.
The issue is that most humans write bad Pandas code, and GPT was probably trained on that code.
The second lesson is that there is a better way.
If you’re using Pandas, you should be using the Pandas methods, like query, filter, sort_values, assign, etc.
They are much better.
And when you do any multi-step data manipulation with Pandas, you should be using Pandas chains.
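The assign method doesn't appear in the examples above, so here is a hedged sketch of how it fits into a chain. The profit column and the 20000 threshold are assumptions of mine for illustration, built from the sales and expenses columns of sales_data.

```python
import pandas as pd

# The same sales_data dataframe defined earlier in the post
sales_data = pd.DataFrame({
    "name": ["William", "Emma", "Sofia", "Markus", "Edward", "Thomas",
             "Ethan", "Olivia", "Arun", "Anika", "Paulo"],
    "region": ["East", "North", "East", "South", "West", "West",
               "South", "West", "West", "East", "South"],
    "sales": [50000, 52000, 90000, 34000, 42000, 72000,
              49000, 55000, 67000, 65000, 67000],
    "expenses": [42000, 43000, 50000, 44000, 38000, 39000,
                 42000, 60000, 39000, 44000, 45000],
})

high_profit = (
    sales_data
    # Add a new column (hypothetical "profit") without modifying sales_data
    .assign(profit=lambda df: df["sales"] - df["expenses"])
    # Keep only rows above an arbitrary illustrative threshold
    .query("profit > 20000")
    .filter(["name", "region", "profit"])
    .sort_values("profit", ascending=False)
)

print(high_profit)
```

Each line does one thing, and the new column is available to every step that comes after it in the chain.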
Still not convinced?
About 3 years ago, I did an analysis of Covid-19 data. In that analysis, I needed to do several complex data manipulations with multiple steps.
For example:
(covid_data
    .query("date == datetime.date(2020, 3, 29)")
    .filter(['country','confirmed','dead','recovered'])
    .groupby('country')
    .agg('sum')
    .sort_values('confirmed', ascending = False)
    .reset_index()
)
To be fair: this code is still somewhat complex, but it would be dramatically more complicated if it was written by GPT.
Can you imagine all the brackets! Having all of this on a single line, as if that’s how humans read it?
And let’s not forget, you still need to describe all 6 steps to GPT in plain English, precisely enough to get it to do the exact, multi-step operation that you need.
GTFO.
For the time being, there’s no substitute for knowing what you’re doing.
If you want to be an effective data wrangler, you need to know Pandas. And you need to know how to use it the right way.
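The covid_data dataframe from that analysis isn't included in this post, so here is a sketch of the same chain shape (filter, groupby, agg, sort_values, reset_index) on sales_data instead. Treat it as an analogy under that assumption, not as the original analysis; the name sales_by_region is illustrative.

```python
import pandas as pd

# The same sales_data dataframe defined earlier in the post
sales_data = pd.DataFrame({
    "name": ["William", "Emma", "Sofia", "Markus", "Edward", "Thomas",
             "Ethan", "Olivia", "Arun", "Anika", "Paulo"],
    "region": ["East", "North", "East", "South", "West", "West",
               "South", "West", "West", "East", "South"],
    "sales": [50000, 52000, 90000, 34000, 42000, 72000,
              49000, 55000, 67000, 65000, 67000],
    "expenses": [42000, 43000, 50000, 44000, 38000, 39000,
                 42000, 60000, 39000, 44000, 45000],
})

sales_by_region = (
    sales_data
    .filter(["region", "sales", "expenses"])   # keep the grouping key and numeric columns
    .groupby("region")                         # one group per region
    .agg("sum")                                # total sales and expenses per region
    .sort_values("sales", ascending=False)     # largest total sales first
    .reset_index()                             # turn region back into a regular column
)

print(sales_by_region)
```

Read top to bottom, the chain is close to the English description of the task, which is the whole point.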
Tell me what you think
Do you agree with me that GPT writes bad Pandas code?
Or are you a masochist who absolutely loves brackets and ugly Python code?
I want to hear from you!
Leave your comments in the comments section below.
Hey mate. It does the exact same rubbish code in GPT-4! I have learnt an immense amount from your blogs about quality and readable code. I’m just glad I came across this sort of thing early into my career as my university in the UK taught me horrible bracket notation although I understand it, it’s not intuitive at all and you have shown clearly where the value lies in more sophisticated examples of sub queries. Keep up the good work.
Yeah …. I’ve heard Microsoft’s pair-programming AI makes similar mistakes.
It’s possible that future versions will correct these issues, but they’re so common among *human* data scientists, that I’m not optimistic.
Hey mate
Did you have a chance to time GPT’s code vs. yours on large amount of data?
The GPT style code will be faster in many cases, but compute is so cheap, it’s often not going to matter unless you’re working with extremely large datasets.
Many people – myself included – suggest to optimize for code readability instead of processing speed. When you’re doing data analytics (which is how I would be using most of the examples above), readability, ability to modify, ability to understand complex operations … those are central to the task.
Well, I guess by default it uses bracket notation, but you know you can tell ChatGPT to use the methods instead of brackets? It was able to produce all of your examples except the last just by asking “And can you use a method rather than brackets?” for the first prompt. To get the last example, it first produced
titanic_subset = titanic.loc[titanic["Embarked"] == "S", ["Sex", "Age", "Survived"]]
titanic_subset = titanic_subset.sort_values(by="Age", ascending=False)
But when I asked it to use .query and .filter instead, it produced the following:
titanic_subset = titanic.query('Embarked == "S"').filter(["Sex", "Age", "Survived"]).sort_values(by="Age", ascending=False)
I would only use chatGPT to speed up something that I already know how to do, and maybe that’s your larger point. And you still need to think a bit about prompts, since chatGPT models the whole internet, and not necessarily the best idioms. But I think the issue is bit more nuanced than you suggest here.
Yeah, it’s fair to point out that you can ask it to use methods.
But (as you also pointed out), you need to know what to ask it for.
You should know what you’re doing to some extent *before* you start leveraging GPT.
You can use it as a true beginner without much experience, but it will probably write a lot of bad code, or code that you don’t understand, or code that will teach you a lot of bad habits.
Hey mate,
When will you be opening up the python course? Any reason why it’s closed if it’s a self directed/paced course? And do you offer any student discounts? I saw it is like $900 on website so not sure if that’s correct? My price range is around $100-200
Probably reopening Python Data Mastery next month (May 2023).
Hey mate,
When will you be opening up the python course? Any reason why it’s closed if it’s a self directed/paced course? And do you offer any student discounts? I saw it is like $900 on website so not sure if that’s correct? My budget allows for around $100-200. I really want to take your python course.
Thank you
There’s a lot of administrative overhead when people join a course, so I like to batch enrollment into narrow windows.
Also a variety of other reasons.
In any case, it’s how I do things.
And no …. no discounts. If you can’t afford the full price right now, we have free blog posts that teach quite a bit about Pandas, Seaborn, Numpy, etc.
“How many f*cking brackets do you need to do some simple data wrangling?”
This made me scream with laughter
In many cases, when I write, I’m trying to amuse myself as much as the reader …
The sad thing is I just completed a Master’s Degree in data science and brackets is exactly how they taught us to do this stuff. Amazingly, I first learned about the ‘.query’ method from your website. I’ve seen and written so much of the bracket notation at this point that I actually do find it readable, but doing it the way you suggest is like coding for dummies….in a good way, I mean. I just shook my head, wondered why we didn’t learn this, tried your way for myself, and started doing it. It really is so much cleaner.
I’m sorry to say, many of these Master’s programs teach data science code that makes me shudder.
I understand your comment about “coding for dummies” but I think of it much differently.
I’m showing how to write code that’s clear, easy to write, easy to read, easy to modify, easy to maintain, and easy to share.
So-called “bracket” notation has none of these qualities.
Man I just watched a recent podcast episode featuring Matt Harrison. It’s called the ‘Jon Krohn Podcast’. Matt literally details how you’re in the top 1% of coders if you method chain.
I guess only guys from R are really familiar with this. In the Python community it is very rare. However what this shows me is that it’s relatively easy to rise to the top 5% of programmers as there’s not much competition. I assume most coders write lousy code like this but get used to it.
I appreciate you divulging these tips for us. I literally completed a Google professional certificate and the code was horrible, bracket notation also.
Yeah, if you can use the chaining methodology well, you’re definitely in the top few percent.
Honestly, I used R/Tidyverse before I used Python, and the “pipe” system used in the Tidyverse was a major influence on how I think about writing data science code. I looked for a way to do something similar in Python and figured out the method chaining method (actually, before Matt published his book on the topic). It’s been critical for making Python usable for me, for data science.
Concerning your certificate: unfortunately, almost all certificate programs and degree programs (even expensive ones!) teach the bracket method. A small tragedy …