Why AI Coding Systems Will Cause an Explosion of Data


For years, people have been talking about the era of “big data.”

The data deluge.

As early as 2010, The Economist wrote about how the “quantity of information in the world is soaring.”

Now, 13 years later, businesses are still struggling to keep up with the volume of data.

This has been great for the data science industry generally, and professional data scientists specifically. It’s been a very lucrative career.

But with the advent of ChatGPT and other large language models, there’s suddenly been a question about whether data science will still be important.

If large language models can write code, will data scientists be needed?

Will the importance of data science increase or decrease in this new era?

I’m here to tell you that I think data science will be even more important going forward, and it’s because we’re going to see yet another explosion of data.

AI Coding Systems Will Dramatically Increase the Amount of Data

I think that data will increase dramatically over the next few years, and AI coding systems will be one of the key drivers.

Let’s think through why.

If it’s Valuable and it Decreases in Price, You Get More of It

When you have something useful, and the price of that useful thing decreases dramatically, then you’ll get more of that thing.

For example: integrated circuits.

Integrated circuits are useful devices that enable the creation of complex electrical circuits, and ultimately computer CPUs. Integrated circuits enabled the computational revolution that we’ve been living with since the mid 1960s.

Integrated circuits were valuable because they enabled us to make a larger number of electronic devices, smaller electronic devices, and more powerful electronic devices.

Yet the price of integrated circuits has decreased dramatically and consistently for decades. This is essentially a corollary of Moore’s Law.

So integrated circuits are valuable but the cost of integrated circuits has decreased.

What’s the result of this?

We have a lot more integrated circuits.

There was a rapid growth in the production of integrated circuits, starting in the 1970s, which continued for years.

And then since the 2000s and 2010s, we’ve had an explosion of integrated circuit manufacturing.

We have integrated circuits in everything: desktops, laptops, smartphones, wearable devices, miscellaneous sensors. Even most household appliances contain integrated circuits.

My point here is to provide an example: when something is valuable, and the price decreases, you’ll get a lot more of that thing.

We’re going to get a lot more software

Next, let’s bring this back to software.

We can agree that software is valuable, right?

We can use software for business applications like spreadsheets, accounting, and sales.

We can use software for communication.

We can use software for medical applications and health and wellness.

Software brings us value by increasing profits, increasing productivity, saving time, decreasing risk, and generally solving problems.

Yet now, ChatGPT and other AI coding systems will dramatically decrease the cost of building software.

In the past, an individual needed to study for years and practice for years to become talented at writing software. There was a large cost in terms of time and money for being able to write software: books, college courses, diplomas, and again … years of hard work.

But AI coding systems will enable anyone to write complex software for a few dollars a month (right now in April of 2023, ChatGPT costs about $20 per month).

And it’s worth noting that it writes software at a very high level.

I’m guesstimating here, but AI coding systems probably write software at the level of a high-level developer with at least 5 to 10 years of experience. Skill like that probably used to cost on the order of hundreds of thousands of dollars, in terms of education and time spent.

But now anyone will be able to get this skill for a few dollars.

All of that is to say, the cost of writing software has just decreased dramatically.

What does that mean?

Software is very useful and valuable …

And the cost has decreased dramatically …

So we’re going to get a lot more software.

Conservatively, I’d estimate that we’ll get at least 100x more software.

But thinking it through long term, I think we could see on the order of 1000x more software over the next decade or so.

More Software Means More Data

So we’re going to get more software, but why does this mean more data?

You need to remember that almost all software throws off data.

Some of the data is more passive, like log files that are generated by a lot of online software.

But some things are more actively recorded, like customer behavior. Most businesses with online logins record what you look at, what you “like,” what you save for later, what you buy, what you watch, etc.

My point is that almost all software generates data of one type or another.
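To make that concrete, here’s a minimal sketch of a toy application that leaves data behind in both ways. Everything in it is hypothetical and for illustration only: the `toy_shop` logger, the `view_product` function, and the in-memory `events` list stand in for a real log file and a real event database.

```python
import io
import json
import logging

# Hypothetical in-memory "log file" and event store, for illustration only.
log_buffer = io.StringIO()
logger = logging.getLogger("toy_shop")
logger.setLevel(logging.INFO)
logger.propagate = False  # keep output confined to our buffer
logger.addHandler(logging.StreamHandler(log_buffer))

events = []  # actively recorded customer behavior


def view_product(user_id, product_id):
    """Handle one page view; it leaves data behind in two places."""
    # Passive data: a log line written as a side effect of handling the request.
    logger.info("GET /product/%s user=%s", product_id, user_id)
    # Active data: a deliberately recorded behavioral event.
    events.append({"user": user_id, "action": "view", "product": product_id})


view_product("u1", "p42")
view_product("u2", "p42")

print(log_buffer.getvalue())
print(json.dumps(events, indent=2))
```

Two page views produce two log lines and two event records. Multiply that by every user and every piece of software, and the data adds up quickly.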

100x more Data

So what happens to the amount of data that’s generated if we 100x the amount of software?

We’re going to get a lot more f*cking data.

My estimate is that the amount of data that the world will generate will scale at least linearly with the amount of software (maybe even super-linearly).

100x the software? Then 100x the data.

1000x the software? Then 1000x the data.
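If you want to play with the arithmetic, here’s a rough sketch of that scaling assumption. The function name `projected_data` and the super-linear exponent of 1.2 are my own placeholders, not measured values. An exponent of 1.0 is the linear case described above.

```python
def projected_data(baseline_data, software_multiplier, exponent=1.0):
    """Project data volume if it scales as (software growth) ** exponent.

    exponent=1.0 is linear scaling; anything above 1.0 is super-linear.
    The exponent is an assumption for illustration, not an estimate.
    """
    return baseline_data * software_multiplier ** exponent


for multiplier in (100, 1000):
    linear = projected_data(1.0, multiplier)
    super_linear = projected_data(1.0, multiplier, exponent=1.2)
    print(f"{multiplier}x software: {linear:,.0f}x data (linear), "
          f"{super_linear:,.0f}x data (assumed exponent 1.2)")
```

Under linear scaling, the data multiplier simply matches the software multiplier. With any exponent above 1.0, the data grows faster than the software does.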

As AI coding systems enable the creation of a lot more software, we’re going to get a lot more data.

What this Means for Data Science

More data means a lot more data science work.

But, it’s complicated.

In a recent blog post here at Sharp Sight, I pointed out that for years, there’s been too much data science work for most data scientists to complete. In data science jobs, there has almost always been a backlog of work waiting to be done. In that post, I argued that as productivity tools, AI coding systems will enable data scientists to clear out those backlogs. The productivity boost will enable data scientists to do “all the things” that previously, they simply didn’t have time to do.

But the thinking in that blog post was predicated on the idea that new data creation would continue at a pace similar to that of the past.

Yet in this blog post, I’m suggesting that the pace of new data creation is going to dramatically accelerate.

What I’m getting at is that on one hand, AI coding systems will make data scientists much more productive and will enable them to complete a lot more work.

But on the other hand, we’re likely to see a dramatic increase in new data creation.

Where this shakes out is still somewhat uncertain, but I think that data science is going to overtake traditional software engineering in importance.

What To Do About It

This is a data science blog, so obviously I’m going to suggest that you “learn data science.”

But that’s more than just a perfunctory suggestion.

I strongly believe that the amount of new data in the world is going to explode.

And in turn, that means that data science will increase in importance.

In coming weeks and months, I’m going to write more about how to navigate this new era, but I think that the really big suggestion is to learn machine learning. Why? Because all this new data will potentially be useful to train new machine learning systems.

“Software that learns from data” is likely to be a big part of the future.

But, as always, I’ll point out that before you learn machine learning, you need to learn data science foundations, namely:

  • data wrangling
  • data visualization
  • data analysis

I’ve been beating this drum for the better part of 8 years, but those three skills are actually the foundations of machine learning.

And once you learn those, you can learn foundational machine learning concepts like training, resampling, bias/variance, regularization, and all of the most widely used ML algorithms (like linear regression, logistic regression, decision trees, boosted trees, etc.).

After learning conceptual foundations, you can move on to the sexy stuff like deep learning, natural language processing, etc.

But again: master the foundations first.

Tell Me What You Think

Do you agree with me that the amount of data will explode?

Do you agree with me that data science and machine learning will become more important?

Why?

I want to hear from you …

Write your comments and thoughts in the comments section below.

Joshua Ebner

Joshua Ebner is the founder, CEO, and Chief Data Scientist of Sharp Sight. Prior to founding the company, Josh worked as a Data Scientist at Apple. He has a degree in Physics from Cornell University. For more daily data science advice, follow Josh on LinkedIn.
