In a recent post, I wrote that when you’re starting out with data, you need to focus much more on process and technique than on syntax. Beginning students hear this, but it’s easy to ignore and get lost down the rabbit hole of syntax.
What frequently ends up happening is people start to sound more like software engineers: “I’m learning ‘for loops.’ I’m learning data types.” et cetera.
Your clients don’t need code
I think this is a mistake.
The problem is, your clients don’t need code.
When an executive or business partner comes to you, they will say “my conversion rate is down. Why?” Or, “we need to meet a sales goal of $X this quarter. What should we change to reach that goal?”
They’re really saying: “tell me what to do.”
They want to know how to improve their business.
They want insight.
Your clients want insight
It’s instructive to look at the meaning of the word “insight.”
According to Wiktionary, the meaning is:
- “A sight or view of the interior of anything.”
- “Power of acute observation and deduction; penetration; discernment; perception.”
- “The act or result of apprehending the inner nature of things or of seeing intuitively.”
Read those again: A sight into the interior of something. Penetrating sight. Discernment. Acute observation. Apprehending the inner nature of things.
These are about seeing differently. Seeing deeply. Understanding.
Let’s be clear: Insight is something that happens to your mind.
Whoa.
I know. That was deep.
We aren’t wired for data
Insight is a process that happens in the mind when we transition from not-seeing to seeing; when we transition from not-understanding to understanding.
Before this gets too abstract, let’s bring this back to data.
In data science, ultimately we’re trying to generate insight from raw data. We’re trying to use data to see things clearly.
The problem with this is that our minds don’t “speak data.” We’re not wired to understand tabular data with millions of rows (or even dozens for that matter).
But our minds are very visual. We’re wired for visual information.
Our brains are wired for visual information
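Our brains are highly visual. As Alberto Cairo discusses in his book The Functional Art, we perceive visual features such as position, length, area, and color, and these correspond to the aesthetic attributes of a plot. In a sense, our visual system evolved to process visual data in terms of these aesthetic attributes.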
Moreover, there’s a hierarchy of “importance” for these visual attributes. Some attributes are “more important” than others. To be more specific, our visual system appraises some visual cues more accurately than others.
For example: humans are much better at assessing position on a common scale (e.g., a scatterplot or bar chart) than area (e.g., a pie chart).
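To make the position-versus-area point concrete, here is a minimal sketch in ggplot2. The `shares` data frame and its values are made up for illustration; they are not from any real client. The same five numbers are drawn once as bars (position on a common scale) and once as a pie chart (area and angle).

```r
library(ggplot2)

# Hypothetical data: five categories with similar shares (values are invented).
shares <- data.frame(
  category = c("A", "B", "C", "D", "E"),
  share    = c(22, 21, 20, 19, 18)
)

# Position on a common scale: every bar starts at the same baseline,
# so small differences in height are easy to compare.
bar_plot <- ggplot(shares, aes(x = category, y = share)) +
  geom_col()

# Area and angle: the same values drawn as pie slices.
# A pie chart in ggplot2 is a stacked bar drawn in polar coordinates.
pie_plot <- ggplot(shares, aes(x = "", y = share, fill = category)) +
  geom_col(width = 1) +
  coord_polar(theta = "y")

print(bar_plot)
print(pie_plot)
```

With values this close together, you will likely find it much easier to rank the categories from the bar chart than from the pie chart.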
(In the interest of space, I won’t be covering the theory of visual cues here. However, Solomon Messing wrote a fairly detailed post on graphical perception. Nathan Yau and Alberto Cairo also cover visual cues in their respective books, Data Points and The Functional Art, both of which I highly recommend.)
All of this discussion of the mind and brain is important because the brain processes visual information according to fairly well understood mechanisms. The brain “sees” some types of things better (more accurately) than others.
The details of how the vision system works are important. If you know them, you can use them to create data visualizations that “speak the mind’s language”: visualizations that help your clients see clearly, and visualizations that build insight.
Visualizations are tools for building insight
a visualization is, above all, a tool
– Alberto Cairo, The Functional Art
This is why data visualization is so important.
Our brains and minds are wired for visual information. And how visual information operates on the mind is fairly well understood.
We can use visualization techniques to translate from data-space to aesthetic-space, thereby creating insight that clients ultimately need. Data visualizations literally allow people to see data and see important features in data.
More simply: data visualizations are tools for building insight.
Said differently, if you know how the vision system works, and you know how to create visualizations that take advantage of the operation of the vision system, you can generate (and deliver) the insight that clients want.
Don’t misunderstand: you still need to learn code
Some people will read the above comments and misinterpret, so I want to be very clear about this:
You need to learn how to program.
And as you progress in your career, you may even need to learn proper software development skills; you may need to create large systems for finding and delivering insights to your customers.
What I’m asserting, however, is that you should not approach learning data science from a software development point of view. You should not start with control flow statements, data types, etc.
You should start by learning how to find and deliver insights from data.
In fact, you need to be able to do this reliably at a relatively small scale before scaling it up into any larger software system. As Paul Graham, a Silicon Valley investor and essayist, has said, “do things that don’t scale.” For a beginning data scientist, that means generating insight at a relatively small scale first.
If it’s not clear from all of the above, I believe that data visualization is the best way to do this.
How to get started in only a few weeks (by learning visualization first)
Data visualization is the best, most reliable way to deliver insight when you’re first starting out.
And in particular, I recommend ggplot2 to most beginners because it allows you to focus on visualization.
Yes, you do still need to learn syntax. But ggplot2 and dplyr have a very compact syntax that you can learn in a few weeks (or faster if you’re really diligent).
By learning ggplot2, you learn powerful tools for creating insight within weeks.
You will be productive within weeks.
I’ve said before that when you’re first starting out, data visualization is one of the highest ROI data skills you can learn.
And while you learn the syntax, you’re also learning how to think about data visualization. As I’ve mentioned repeatedly, ggplot2’s syntactical structure is highly systematic. By learning that system, you will begin to understand how to approach visualization tasks.
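Here is a rough sketch of what “systematic” means in practice. It uses `mtcars`, a small dataset that ships with R, purely as a stand-in; the variable choices are my own for illustration. Every ggplot2 plot is built from the same pieces: a data frame, a mapping from variables to aesthetic attributes, and a geom that draws them.

```r
library(ggplot2)

# mtcars ships with R; it is used here only as a stand-in dataset.
# 1. data:     the data frame to plot
# 2. aes():    map variables to aesthetic attributes (x position, y position, color)
# 3. geom_*(): choose the geometric object that draws those mappings
scatter <- ggplot(data = mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point()

# Same data, same mappings, different geom: only one piece changes.
lines <- ggplot(data = mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_line()

print(scatter)
print(lines)
```

Once you see that a plot is just data plus mappings plus a geom, a new chart type stops being new syntax to memorize and becomes one more piece to swap in.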
Moreover, you don’t have to worry (yet) about loops, control flow, etc. You can focus on learning how to think about turning data into insights that clients want.
Later (within weeks, for many students) you can add dplyr to start manipulating your data and putting it into the right format. The best part is that when you begin to chain dplyr verbs together with ggplot2, you can find insight in more complicated data sets by zooming, filtering, and diving in to find more details (more on that in an upcoming post).
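As a small sketch of that kind of chaining, again with `mtcars` standing in for real client data (the weight cutoff and the choice of summary are arbitrary assumptions for illustration): filter to a subset, summarise by group, and pipe the result straight into ggplot2.

```r
library(dplyr)
library(ggplot2)

# mtcars ships with R and stands in for a client's data set here.
# Filter to a subset, summarise by group, then hand the result to ggplot2.
mpg_by_cyl <- mtcars %>%
  filter(wt < 4) %>%                  # "zoom in" on lighter cars (arbitrary cutoff)
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg))     # one row per cylinder count

# The summarised data frame becomes the data argument of ggplot().
mpg_plot <- mpg_by_cyl %>%
  ggplot(aes(x = factor(cyl), y = mean_mpg)) +
  geom_col()

print(mpg_plot)
```

Note that dplyr verbs are joined with `%>%`, while ggplot2 layers are joined with `+`. Mixing the two up is a common early stumbling block.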
Learn data visualization first. Focus on technique. Focus on learning to create insight visually on a small scale, using ggplot2.
When you’re ready for larger datasets, you can add dplyr. Only after you’ve got those tools nailed would I move on to larger software concepts.
I am a beginner of R. I started to use R Graphics Cookbook written by Winston Chang to learn ggplot2. I learn ggplot by just reading and typing the codes written in this book. I have finished reading chapter 5. I should spend more time but I already forgot some functions or don’t understand some parts in this book. What do you think is the best way to learn ggplot or R in general?
This web is awesome for a beginning data scientist.
Yup I m at right place for learning R