A few weeks ago one of my students sent me an email.

He’s a student of our Numpy Mastery course, and he was playing around with some Numpy techniques that he had learned in the course.

In his email, he sent me some sample code, along with a question.

“i am attaching a file and want to get feedback whether we should write such clean code while analysing data.

“Because as an analyst i believe it would be very time consuming to do so. right?”


An example of some Numpy code.

This is a good question.

As data scientists, we often focus on the highly technical side of the job, learning syntax, equations, or techniques.

But we often forget about some of the “softer” skills that are part of the job.

And one of the big things that beginners often forget about is writing clean code.

With that in mind, I want to write about it. It’s important, and it’s something that you need to be aware of when you write code. The earlier you learn how to write clean code – and the more you practice – the better you’ll be in the end.

What is Clean Code?

Let’s start out by briefly discussing what clean code is.

Clean code …

  • uses meaningful names
  • is well formatted
  • is easy to read
  • defines and uses small, simple functions
  • etc

It would take an entire book to explain all of the details for creating clean code. Since this is a simple blog post, I won’t go into all the details. If you want to really understand the specifics of how to write clean code, I recommend the book Clean Code by Robert Martin. It will explain exactly what clean code is in great detail.

Clean code vs. messy code

To better understand what “clean” code is, it might help to look at a pair of simple examples.

Here, I’ll show you two pieces of code that effectively do the same thing. But the “clean” code is easier to read and understand, whereas the “messy” code is harder to read and understand.

Example: messy code

First, let’s look at some messy code.

Before we look at the actual code, we’ll need to import pandas.

import pandas as pd

Ok, next, let’s run some Pandas code.

df=pd.DataFrame({'var1':['honda civic','nissan sentra','bugatti veyron'],'var2':[32, 29,7]})

Just by looking at this, can you tell what we’re doing?

It might be obvious that we’re creating a dataframe, and it might be obvious that the data have something to do with cars, but beyond that, it’s hard to read.

Notice some of the design elements of this line of code.

  • Everything is on one line.
  • All of the syntax runs together. There are almost no spaces between the syntactical elements like names, operators, etc.
  • Everything is poorly named. The variable names, var1 and var2, tell us nothing about the contents of those variables. The name of the dataframe, df, also tells us nothing about the data.

Overall, the characteristics of the code make it “messy” and hard to read.

Example: clean code

Now, let’s contrast that messy code with a “clean” alternative version.

car_mpg_data = pd.DataFrame({'make':['honda civic','nissan sentra','bugatti veyron']
                            ,'mpg_city':[32, 29, 7]
                           })

This code effectively does the same thing, but it’s structured much differently.

There are spaces around some of the operators, and the code to create different variables is on different lines. This all enhances readability.

The dataframe name, car_mpg_data, makes it obvious what’s in the data. Similarly, the variable names, make and mpg_city, make it obvious what’s in those variables: the car make and the city ‘miles per gallon’ performance.

This second example is simply easier to read and easier to use.
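If you want to convince yourself that the two versions really hold the same data, here’s a quick sketch. It rebuilds both dataframes, renames the messy variables, and compares the results. The name renamed_df is hypothetical and is used only for this comparison.

```python
import pandas as pd

# The "messy" version from earlier, kept exactly as written.
df=pd.DataFrame({'var1':['honda civic','nissan sentra','bugatti veyron'],'var2':[32, 29,7]})

# The "clean" version.
car_mpg_data = pd.DataFrame({'make':['honda civic','nissan sentra','bugatti veyron']
                            ,'mpg_city':[32, 29, 7]
                           })

# renamed_df is a hypothetical name, used only for this comparison.
renamed_df = df.rename(columns = {'var1':'make', 'var2':'mpg_city'})

# equals() returns True when the two dataframes contain the same elements.
print(renamed_df.equals(car_mpg_data))

# The same calculation on each dataframe gives the same number,
# but only the second line tells you what is being averaged.
print(df['var2'].mean())
print(car_mpg_data['mpg_city'].mean())
```

The data are identical; the only difference is how much the names tell the reader.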

And that’s really what clean code is …

At a high level, clean code is code that has been written in a way that makes it easy to read, easy to use, easy to share, and easy to debug. It’s code that’s written for clarity as much as executability.

Why we need Clean Code

So the question is, why is this important?

Ultimately, writing clean code is about creating long term value.

I recently wrote a blog post wherein I noted that your job as a data professional is to create value.

In the final analysis, this comes down to generating cash flow, both today and in the future. (A more technical definition of “value” is the sum of all present and future cash flows, discounted by a discount rate. That’s a bit technical for someone who hasn’t studied finance, so you might want to read the book Value: The Four Cornerstones of Corporate Finance.)
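As a rough illustration of that definition, the sketch below discounts a short series of yearly cash flows back to today. The cash flow figures, the 10% discount rate, and the function name are all made up for the example. It also assumes that the first cash flow arrives today and that each later one arrives a year after the previous one.

```python
def present_value(cash_flows, discount_rate):
    """Return the sum of cash flows, each discounted back to today.

    Assumes cash_flows[0] arrives today (year 0), cash_flows[1]
    arrives in one year, and so on.
    """
    return sum(
        cash_flow / (1 + discount_rate) ** year
        for year, cash_flow in enumerate(cash_flows)
    )

# Hypothetical numbers, chosen only to keep the arithmetic easy to follow.
projected_cash_flows = [100, 100, 100]
annual_discount_rate = 0.10

value_today = present_value(projected_cash_flows, annual_discount_rate)

# 100 + 100/1.1 + 100/1.21
print(round(value_today, 2))  # 273.55
```

Cash that arrives later counts for less, which is why the total is below the undiscounted 300.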

Ultimately, you need to remember that your job as a data professional is to help a business generate value (i.e., positive cash flow and profit) by either increasing revenue or decreasing expenses.

As I mentioned in that previous post about value creation, it’s a little bit difficult to think like this as a data scientist, because the connection between data science projects on one hand, and profits or expenses on the other, is not always immediately obvious.

Still, you need to think in terms of value creation.

One good way to do this as a data scientist is to focus on productivity, since greater productivity will almost always increase revenue or decrease expenses.

And one way to increase productivity on a data team is to write clean code.

Clean code is an asset

Why?

If your code is easier to reuse, you can save yourself time in the future.

If your code is easier to share, you will save your teammates time when they have to use your code.

And if your code is easier to read, it will be easier to maintain and debug. Again, this will save you time in the future.

Time is money, and money is value.

Clean code – by enhancing readability, reusability, and maintainability – increases productivity. This in turn helps generate more value.

Clean code is a kind of asset that helps generate value for a business.
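Here’s a small sketch of what that reuse can look like in practice. The function name, the second dataframe, and the 25 mpg cutoff are all hypothetical. The point is only that one small, clearly named function can be applied to any dataframe that has an mpg_city variable, instead of copying and editing the same filter every time.

```python
import pandas as pd

def filter_by_min_mpg_city(vehicle_data, min_mpg_city):
    """Return the rows of vehicle_data with mpg_city at or above min_mpg_city."""
    return vehicle_data[vehicle_data['mpg_city'] >= min_mpg_city]

car_mpg_data = pd.DataFrame({'make':['honda civic','nissan sentra','bugatti veyron']
                            ,'mpg_city':[32, 29, 7]
                           })

# Hypothetical second dataset with placeholder names and made-up values.
truck_mpg_data = pd.DataFrame({'make':['example truck a','example truck b']
                              ,'mpg_city':[26, 15]
                             })

# The same function works on both dataframes without modification.
efficient_cars = filter_by_min_mpg_city(car_mpg_data, min_mpg_city = 25)
efficient_trucks = filter_by_min_mpg_city(truck_mpg_data, min_mpg_city = 25)

print(efficient_cars)
print(efficient_trucks)
```

A teammate can read the function name and the docstring and use it without tracing through the filter logic, and if the logic ever needs to change, there’s only one place to fix it.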

A tradeoff between clean code and speed

Although clean code is valuable for a business, is there a limit? Can you take it too far?

Yes.

In an ideal working environment, where you have ample time and resources to get things done, you should write clean code all the time, every time.

But we often work in less-than-ideal conditions.

In elite business environments, there are often very (VERY) tight deadlines and you aren’t always given the resources you need. That’s just the reality.

And writing clean code takes more time and effort … time that you don’t always have.

So what do you do?

There’s a tradeoff.

It’s a tradeoff between writing the cleanest code that you can and getting the job done on time.

You need to use judgement to analyze the tradeoff in any given situation and calibrate your work to the circumstances.

For instance, let’s say that you’re given a one-off project with a very tight deadline. Let’s say that in this particular case, you probably won’t need to reuse the code, and you probably don’t need to share your code with anyone else. You’ll write the code once, and probably never use it again.

Does your code need to be perfect?

No.

In a situation like this it might just be best to write your code quickly (but still accurately!) just to get the job done and meet the deadline. Taking extra time to make sure your code is “clean” might be a bad use of your time, especially if it might cause you to miss your deadline.

But on the other hand, in a different situation where you actually will need to reuse or share your code, you should try to make your code as clean as possible, even if it will take more time (as long as you’re not jeopardizing your deadline).

As I mentioned, clean code helps increase long term, intra-team, and inter-team productivity, which ultimately generates more value for the business.

Write the cleanest code you can, within constraints

Ultimately, it’s a tradeoff, and you’ll need to learn how to adjust your performance to different situations in light of that tradeoff. (A good manager or mentor can give you guidance on this.)

You need to be mindful of the quality of your code. Always err on the side of making your code readable, sharable, and reusable.

And if you’re in an environment where everyone writes messy code and no one wants to implement higher standards, get out.

Of course, that’s a different blog post for another time.
