One of the common questions that I get from my readers is about “big data.”
This is particularly true of beginners, who frequently read or hear the term “big data,” and think that it’s extremely important, and that being able to work with big data is necessary for a job in the data industry.
For example, a couple of weeks ago, reader Emmanuel asked me this question:
Now before I move on, I want to make it clear that by highlighting this, I’m not trying to make fun of Emmanuel. It’s a legitimate question from a sincere beginner, looking for guidance.
Having said that, I’m going to treat this question in the same fair but tough-minded manner that I treat most things here at the Sharp Sight blog.
In the beginning, forget about “big data”
If you’re a relative beginner, then 95% of the time, you should forget about big data.
There are a few reasons for this, but the major reasons are:
- There’s no such thing as “big data”
- Big data will confuse you
- There’s a lot of work to be done with non-big data
Let’s talk about each of these.
There’s no such thing as “big data”
One of the problems with “big data” is that there’s not really any such thing as big data.
At least, there’s not any agreed upon definition of big data.
The best definition that I’ve heard is that big data is “data that won’t fit in memory.” And considering that you can always buy more memory (and that it keeps getting cheaper every year), what’s considered “big” today might be “average” tomorrow.
Big data is a moving target
What that means is that “big data” is sort of a moving target. What’s considered “big” actually changes over time (and pretty dramatically, I might add).
Here’s an example.
I started in the data industry over 15 years ago. For a couple of years, around 2008, I worked at a small, boutique consulting firm in Chicago.
Technically, we built databases. We would go into a business (like an insurance company, bank, etc.), gather all of the data from their old systems, build new databases, and put the old data into the new databases. This process is commonly called extract, transform, and load (ETL for short).
Technically, that’s what we did. We built databases.
But the company marketed itself in a very particular way.
As a company, we were unique amongst our competitors, because we focussed exclusively on “big data” problems.
Our sales teams would win business by talking to C-suite executives at Fortune 500 companies, explaining that we were experts in “big data.” We’d frequently win business when competing against larger and better-known competitors, like Deloitte.
We were big data experts, long before big data was really even a thing.
And we were a successful company largely because we knew how to use this term “big data.”
Now, here’s the funny part.
Guess how large “big data” was back then.
Guess.
Go ahead … think about it.
I’ll wait.
How big was “big data” back in 2008?
How big was this “big data” that enabled us to win multi-million dollar contracts against major competitors?
1 Terabyte.
You read that right.
The way that our company pitched our services was that we were “experts in ‘big data’ problems,” and in order for us to consider working with you, you needed to have 1 terabyte or more of data.
For context, right now in 2020, you can buy a 1 terabyte external hard drive on Amazon for $49.99.
To be fair, building a full 1 TB database is different than just buying raw storage, but it gives you a sense of how much things have changed.
Ultimately, “big data” is a moving target.
What’s big today will likely be average in only a few years.
Big data will confuse you
Another problem is that big data will simply be more confusing.
When you have very large datasets, it actually becomes harder to understand and work with the data.
Let’s say that a dataset has 1,000,000 rows and 200 columns. You’ll have a very difficult time just understanding the different columns, the different categories, the average properties of the different variables, etc.
(And to be clear, 1 million rows and 200 columns is not even that large.)
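To put a rough number on that, here is a back-of-the-envelope sketch in Python. It assumes that every value is stored as an 8-byte number (a 64-bit float or integer), which is my assumption, not a property of any particular dataset; tables with text columns can be considerably larger. The function name is hypothetical.

```python
def estimate_size_gb(n_rows: int, n_cols: int, bytes_per_value: int = 8) -> float:
    """Rough in-memory size of a purely numeric table, in gigabytes (10**9 bytes)."""
    return n_rows * n_cols * bytes_per_value / 10**9


# The example from the text: 1,000,000 rows and 200 columns.
size_gb = estimate_size_gb(1_000_000, 200)
print(f"{size_gb:.1f} GB")  # 1.6 GB
```

Under that assumption the table comes to roughly 1.6 GB, which many ordinary laptops can hold in memory.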
When the data gets too large, exploring and understanding your data becomes harder.
So if you’re a beginner, this adds a whole new level of complexity on to the problem of learning and mastering data science.
It will be much easier to learn and master data science skills when you have relatively small, clean, easy-to-understand data.
There’s a lot of work to be done with non-big data
Finally, a lot of work that’s done by data science and data analytics professionals is done with “average” data.
This, of course, varies from job to job and company to company.
But the point stands: a lot of data science and data analytics work deals with a few thousand to a few million rows of data. It’s somewhat rarer to find environments working with massive volumes of data that require special tools.
This is particularly true for many jobs at junior levels. As a data scientist or data analytics professional, it’s more likely that you’ll be put on projects with smaller datasets anyway. It’s unlikely that in your first job, you’ll be put on projects that involve massive amounts of data (although there are exceptions).
My point here is that there are a lot of data-related jobs that work with non-big amounts of data … data that fits in a csv file or data that you can pull with a query to a database and pull back to your computer.
Learn small data, then learn big data
In the end, my advice is that you should learn small data before you try to learn big data.
Small data is simply easier to work with, especially as a beginner.
For example, right now, I’m playing around with some machine learning techniques.
One thing that is abundantly clear is how much harder it is to work with a large data set.
Now, the data that I’m working with isn’t that large, but it’s large enough that my laptop takes a few minutes to execute some operations that I’m trying to run.
Need to load the dataset? It takes a minute or two.
Need to fit the model? It takes a minute or two.
Cross validation? Again … another minute or two.
It doesn’t sound that bad, but it all adds up.
If you’re like most people, you only have 20 or 30 minutes a day to learn a new skill … only 20 or 30 minutes a day to practice.
Don’t waste that time on processing time.
Use a small dataset that only takes a few seconds to load, fit, and validate.
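One way to get a small dataset out of a larger one is to draw a random sample and practice on that. The sketch below uses pandas, which is my choice of tool and not something this advice depends on; the DataFrame, its column names, and the sample size are all made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for a larger dataset (in practice this might come from a file).
big_df = pd.DataFrame({
    "customer_id": range(100_000),
    "purchase_amount": [(i % 500) * 1.5 for i in range(100_000)],
})

# Draw 1,000 random rows; random_state makes the sample reproducible.
small_df = big_df.sample(n=1_000, random_state=42)

print(small_df.shape)  # (1000, 2)
```

If the data lives in a CSV file, `pandas.read_csv` also accepts an `nrows` argument that reads only the first rows, so you never have to load the whole file.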
By working with smaller datasets, you simply get more practice in. You get more repetitions per practice session.
Over time, that increase in repetitions will add up and you’ll make much more rapid progress.
Start simple and then increase complexity
If you really want to master data science fast, start simple and increase the complexity.
Start with simple, small, clean datasets.
Start with data that is easy to understand and work with, and then practice modifying, reshaping, and visualizing that data.
Also, focus on simple operations:
- filtering rows based on logical conditions
- selecting columns
- creating new variables
- deleting variables
- transposing and reshaping data
- dealing with missing values
- creating essential charts and graphs (like line charts, bar charts, etc)
- etcetera
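To make the list concrete, here is a minimal sketch of most of these operations using pandas. The toy data and every column name are hypothetical, and pandas is only one of several tools you could practice with. Charting is left out to keep the example short.

```python
import pandas as pd

# Hypothetical toy data: one row per store, with one missing sales value.
df = pd.DataFrame({
    "store": ["A", "B", "C", "D"],
    "region": ["east", "west", "east", "west"],
    "sales_q1": [100.0, 250.0, None, 80.0],
    "sales_q2": [120.0, 240.0, 90.0, 70.0],
})

# Filter rows based on a logical condition.
east = df[df["region"] == "east"]

# Select columns.
sales_only = df[["store", "sales_q1", "sales_q2"]]

# Deal with missing values (here: treat a missing value as 0).
filled = df.fillna({"sales_q1": 0.0})

# Create a new variable.
filled = filled.assign(sales_total=filled["sales_q1"] + filled["sales_q2"])

# Delete a variable.
no_region = filled.drop(columns=["region"])

# Reshape from wide to long: one row per store and quarter.
long_df = df.melt(
    id_vars=["store", "region"],
    value_vars=["sales_q1", "sales_q2"],
    var_name="quarter",
    value_name="sales",
)

print(east["store"].tolist())          # ['A', 'C']
print(filled["sales_total"].tolist())  # [220.0, 490.0, 90.0, 150.0]
print(long_df.shape)                   # (8, 4)
```

With a dataset this small, each line runs instantly, so you can change a condition or a column name and see the effect right away.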
Before you move on to anything advanced, you should be able to do all of these things with your eyes closed.
I mean that almost literally. You should know the syntax backwards and forwards. (Once you do, you’ll actually have the skills to be a productive member of a data team.)
By working with small, clean, simple datasets, you’ll be able to focus on what matters: learning the syntax, and learning the principles for applying the syntax.
Once you know the basics, increase complexity
Once you know the essentials backwards and forwards, then you can increase complexity.
The best way to increase the complexity is to start working with messy, real-world data. This will be harder to do, because you’ll need to apply the basic techniques (filtering, subsetting, transposing) in new and more complex ways. You’ll need to clean your data and fix things that need fixing.
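As a rough illustration of what "fixing things" can look like, here is a small pandas sketch. The messy values are invented, and the specific problems shown (stray whitespace, inconsistent capitalization, numbers stored as text, a duplicated row) are just common examples I picked, not a complete list.

```python
import pandas as pd

# Hypothetical messy data: stray whitespace, inconsistent case,
# numbers stored as text, an unparseable value, and a duplicated row.
raw = pd.DataFrame({
    "city": [" Chicago", "chicago ", "Boston", "Boston", "Denver"],
    "revenue": ["1,200", "950", "700", "700", "n/a"],
})

clean = raw.copy()

# Normalize text: trim whitespace and use consistent capitalization.
clean["city"] = clean["city"].str.strip().str.title()

# Convert text to numbers; values that cannot be parsed become NaN.
clean["revenue"] = pd.to_numeric(
    clean["revenue"].str.replace(",", "", regex=False), errors="coerce"
)

# Remove exact duplicate rows and renumber the index.
clean = clean.drop_duplicates().reset_index(drop=True)

print(clean)
```

Notice that every step reuses a basic operation: modifying a column, handling missing values, and filtering out rows.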
But if you have the syntax mastered, you’ve at least passed the first bottleneck. You can focus on learning how to apply the syntax to more complicated problems, rather than looking up syntax every 3 minutes.
Finally, once you’re able to work with and clean up messy, real-world data, then you can move on to big data.
In many cases, that will mean learning new toolkits and new syntax.
But many of the principles that you learned while working with smaller, simpler datasets will still be valid.
Forget about “big data” … master the fundamentals
Ultimately, even if your goal is to master “big data” (whatever that will mean in 2020 and beyond), your first goal is to master small data.
I get it. It’s not cool. It’s not sexy. You don’t score any points with your friends by saying that you’re studying “small data.”
But honestly: who cares what those people think.
Your goal is to become a top performer in the shortest possible time.
You can’t think like everyone else.
You need to be disciplined, methodical, and deliberate.
So sit down and focus relentlessly on mastering small data first. Use “small data” to master the fundamental techniques.
As your skill grows, the data can grow too …
… and you’ll be working with big data in no time.
Excellent post, I have been seen people skirmishes about big data and its tools (hive, spark & Hadoop), complexity at its finest… I strongly believe that data professionals should first, master the basics as data modeling, DBMS, SQL, descriptive and inferential statistics, Python, or C++ for backend tasks and business operations for productive work.
Now I get it right. I get the whole point now.
You’re such a great explainer (pardon the grammar)… because I searched all over to get clear or detailed explanations on this but all were shabby… and kept confusing me the more… but with this that I’ve read, it’s really satisfying and it has drowned all my curiosity.
I really appreciate your efforts Josh.
Whether it’s a tutorial or just a speech, you always do it so well and your words are always so clear that even a 5 year old kid won’t find difficult to grab.
Thanks.