More Terrible Pandas code from GPT

A few days ago, I was writing some code for a new machine learning course that I’m building, and I had a somewhat tricky problem to solve.

As I’ve said in the past, Pandas code (and data manipulation more generally) is very important for machine learning. Not just for wrangling your data into the proper shape, but also for analyzing the results after you build your models.

In this particular case, I wrote some code to compare multiple machine learning models, and I was trying to analyze the results contained within the output.

To do this, I wanted to subset the output down to the results for a specific model type, and then subset further to records where the training time was an exact number. (It’s unimportant why, but I was unsure whether the procedure was working properly, so I wanted to subset the results data to verify that it was correct.)

I had a few ideas for how to do this, but the problem is tricky in a subtle way, so I wanted to see how GPT would solve it.

I asked GPT the following question:

[Image: a conversation with GPT about Pandas code.]

To put it lightly, I was displeased with the results.

GPT Wrote Terrible Pandas Code

I’ll show you how I solved the problem later in this blog post, but first, we’ll take a look at the many bad solutions GPT came up with, and how it completely ignored my preferences about how to solve the problem.

Let’s take a look at how GPT solved this problem.

[Image: a conversation with GPT about Pandas code, continued.]

Can you read this code, quickly and easily?

Me neither.

And good god.

The brackets.

So. many. brackets.

Anyone who’s seen me write Pandas code before, or seen me critique it, knows that this code is problematic.

“I hate your solution, GPT”

I quickly let GPT know that I didn’t like the solution.

(I had had a long day, and was a bit of an a$$hole about it):

[Image: a conversation with GPT about Pandas code, continued.]

GPT then updated its response by using the dreaded “bracket notation” again, while also giving me an even more complex solution.

[Image: a conversation with GPT about Pandas code, continued.]

You’ll notice in this code that GPT is creating a new variable, but is using bracket notation to do it. It should be using the Pandas assign method.
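To make the difference concrete, here is a minimal sketch of the two styles. GPT's actual code is only in the screenshot, so the DataFrame, its values, and the `fit_time_rounded` column name below are all made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the results data; these values are made up.
results_df = pd.DataFrame({
    "model": ["LogReg", "LogReg", "Tree"],
    "fit_time": [0.6786654321, 0.5123456789, 0.6786651111],
})

# Bracket notation: a separate statement that modifies the DataFrame in place.
bracket_df = results_df.copy()
bracket_df["fit_time_rounded"] = bracket_df["fit_time"].round(6)

# assign(): returns a new DataFrame, so it fits inside a method chain
# and leaves the original data untouched.
assigned_df = results_df.assign(fit_time_rounded=results_df.fit_time.round(6))
```

Both versions produce the same columns, but only the `assign` version can be dropped into a chain of Pandas methods without breaking it up.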

It does use the Pandas query method, as I requested, but instead of using two separate query steps like I initially did in my code, it collapsed the two queries into one.

This is bad, because it makes it harder to remove one or the other query in case you need to change how you perform the operation or debug the code. Again, you should separate “and” queries whenever possible, because it will make your code easier to work with.
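Here is a small sketch of what I mean, using a made-up DataFrame (the values are chosen so that an exact comparison is safe). The combined query and the two chained queries return the same rows, but in the chained version each condition sits on its own line.

```python
import pandas as pd

# Hypothetical toy data; not the real model results.
results_df = pd.DataFrame({
    "model": ["LogReg", "LogReg", "Tree", "Tree"],
    "fit_time": [0.25, 0.5, 0.25, 0.75],
})

# One combined query: both conditions are tangled into a single string.
combined = results_df.query('model == "LogReg" and fit_time == 0.25')

# Two chained queries: each condition is its own step.
chained = (results_df
    .query('model == "LogReg"')
    .query('fit_time == 0.25')
)
```

With the chained version, you can comment out or delete either `.query()` line to check the intermediate result, without rewriting the other condition.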

GPT Doubles Down, Again

I politely let GPT know that I was displeased with the output, implying that I wanted it to correct the issues.

It turned out badly.

You can see next that GPT attempted to re-do the code, but just made the solution more complex.

[Image: a conversation with GPT about Pandas code, continued.]

I pointed out that the lambda function just made the code hard to read.

(Again, I was a bit of an a**hole about it, but it’s a computer program.)

After I pointed out the problems, GPT attempted to re-do the code yet again:

[Image: a conversation with GPT about Pandas code, continued.]

Is this any simpler? Easier to use? Easy to read?

No.

Yet again, another over-complicated solution.

The right way to solve this

Ok.

Now that you’ve seen the many complicated ways GPT tried to solve this, let’s look at the good solution:

Get data

Before I show you how to do this, we’ll need to get the data (so you can run the code yourself).

import pandas as pd
bootstrap_df = pd.read_csv('https://www.sharpsightlabs.com/datasets/model_results.csv')

Solve the Problem with 2 Query Steps

Now that you have the data, I’ll show you a better way to solve the problem.

Here, we’ll use 2 calls to the Pandas query method.

The first query will subset to rows for logistic regression, and the second query will subset to rows where fit_time is one particular value.

(bootstrap_df
 .query('model == "LogReg"')
 .query('fit_time.round(6) == .678665')
 )

Notice how simple this is.

The real secret to doing this is to use the round() method, instead of all the complicated code suggested by GPT.

It’s very simple. GPT got it wrong, in the sense that it made it way, way too complicated.
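If you want to see why rounding matters without downloading the file, here is a small sketch on made-up data. The stored fit times carry more decimal places than the number you would type into a query, so an exact comparison finds nothing, while the rounded comparison finds the row. The values below are hypothetical and are not taken from the real results file.

```python
import pandas as pd

# Hypothetical toy data: fit times stored with more than 6 decimal places.
results_df = pd.DataFrame({
    "model": ["LogReg", "LogReg", "Tree"],
    "fit_time": [0.6786654321, 0.5123456789, 0.6786654321],
})

# Exact comparison: the stored value is 0.6786654321, not 0.678665,
# so no rows match.
exact_match = (results_df
    .query('model == "LogReg"')
    .query('fit_time == .678665')
)

# Rounding to 6 decimal places first makes the comparison line up.
rounded_match = (results_df
    .query('model == "LogReg"')
    .query('fit_time.round(6) == .678665')
)

print(len(exact_match), len(rounded_match))
```

The only difference between the two chains is the `.round(6)` call in the second query step.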

You Need to Master Pandas

The point here is that you need to master Pandas on your own.

GPT can be a useful pair-programmer, and it can help you solve some types of problems.

But unfortunately, GPT writes terrible Pandas code.

Relying on GPT will only cause problems for you.

If you want to be a great data scientist, as I’ve said many times in the past …

… you need to master Pandas.

Tell me What You Think

Do you agree with me that GPT writes bad Pandas code?

Or do you prefer your code to be complex and virtually unreadable?

I want to hear from you!

Leave your comments in the comments section at the bottom of the page.

Joshua Ebner

Joshua Ebner is the founder, CEO, and Chief Data Scientist of Sharp Sight. Prior to founding the company, Josh worked as a Data Scientist at Apple. He has a degree in Physics from Cornell University. For more daily data science advice, follow Josh on LinkedIn.

3 thoughts on “More Terrible Pandas code from GPT”

  1. GPT writes horrible, unreadable pandas code and GPT is only usable if you know what you’re doing already. This is very dangerous as a teaching tool for beginners I fear. They will internalise bad practices. There are no shortcuts to becoming a great Data Scientist.

    Essentially, the good programmers will become great and the bad programmers will be replaced or just not paid very well. Let alone get into ML. GPT essentially eliminates the middle class of programmers so to speak.

    Did you use GPT 4 btw? GPT 3 is useless for these tasks. Also do you use Lambda functions at all? I have never understood them.

    Looking forward to more great ML content!

    P.S. Do you take donations?

    • I’m pretty sure I used GPT4 this time.

      I try to avoid lambda functions whenever possible. They’re hard to read and they’re particularly hard to understand for beginners. I often try to write my code another way.

      PS – I don’t take donations right now, but I’ll probably set up a subscription on Twitter soon that will allow people to make small monthly payments for extra content there. http://twitter.com/Josh_Ebner
