As a beginning data scientist, you’ll have quite a few subject areas that you need to learn (and eventually master).
While you’ll certainly need to learn some math and statistics, math and stats are not the first things I recommend to most beginners.
Almost always, I recommend that people start with data visualization.
The reason is that data visualization is critical to almost every part of getting things done as a data scientist: reporting, analysis, and exploratory analysis (e.g., EDA prior to machine learning). You need data visualization constantly. It’s necessary for nearly every data scientist at all levels.
Furthermore, I’ve argued that at junior levels of a data team job hierarchy, data visualization (when combined with data manipulation) is sufficient for being productive. If you’re a junior member of a data team, your core responsibilities may exclusively revolve around visualization (i.e., reporting, analysis, etc.).
Because it’s necessary (and in some cases, sufficient) for productivity, it’s a skill that you need to master early.
ggplot2 is the visualization tool I recommend
Of course, the question is, what tool should you use for data visualization?
Long-time readers of the Sharp Sight blog will know where I stand on this: I think that ggplot2 is the tool to learn.
As it turns out, a recent 2016 survey by O’Reilly media also showed that
ggplot2 teaches you how to think about visualization
But setting aside the popularity of
It teaches you how to think about visualization, because there are two deep principles that underlie the syntax (and a third principle that sort of arises as a result of the first two).
3 critical principles of visualization
Two important data visualization principles are sort of hard-wired into the structure of ggplot2:
- mapping data to aesthetics
- building plots in layers
There’s also a third principle that sort of arises as a result of layering:
- building plots iteratively
Understanding these will sharpen your intuition about how to visualize data and how to attack particular problems for which visual tools are a good solution.
To understand these principles, how they operate, and why they’re so important, let’s look at an example.
Principle 1: Mapping Data to aesthetics
Let’s say that we have a dataset:
    # LOAD PACKAGE: tidyverse
    library(tidyverse)

    # This is the data we're going to plot ...
    foo <- c(-122.419416, -121.886329, -71.05888, -74.005941, -118.243685,
             -117.161084, -0.127758, -77.036871, 116.407395, -122.332071,
             -87.629798, -79.383184, -97.743061, 121.473701, 72.877656,
             2.352222, 77.594563, -75.165222, -112.074037, 37.6173)
    bar <- c(37.77493, 37.338208, 42.360083, 40.712784, 34.052234,
             32.715738, 51.507351, 38.907192, 39.904211, 47.60621,
             41.878114, 43.653226, 30.267153, 31.230416, 19.075984,
             48.856614, 12.971599, 39.952584, 33.448377, 55.755826)
    zaz <- c(6471, 4175, 3144, 2106, 1450, 1410, 842, 835, 758, 727,
             688, 628, 626, 510, 497, 449, 419, 413, 325, 318)

    # CREATE DATA FRAME
    # (tibble() replaces the deprecated data_frame())
    df.dummy <- tibble(foo, bar, zaz)

    # INSPECT
    glimpse(df.dummy)
    head(df.dummy)
It has several numerical variables, so let’s make a quick scatterplot out of two of them, foo and bar.
Seemingly, not much to see here, but the code to accomplish this is pretty straightforward (if you’ve learned the basic ggplot2 syntax):
    #-----------------------------------------------------------
    # LOAD GGPLOT
    # note: strictly speaking, we don't need to load this
    #       since we already loaded "tidyverse"
    #       however, this _is_ a blog post about ggplot2 after all ...
    #-----------------------------------------------------------
    library(ggplot2)

    #----------
    # PLOT DATA
    #----------
    ggplot(data = df.dummy, aes(x = foo, y = bar)) +
      geom_point()
Again, syntactically this is uncomplicated. Importantly though, underneath the syntax is a deep data visualization principle at work.
Once you get this principle, your understanding of data visualization will change forever (and you’ll become much more proficient with ggplot2).
When we create this chart, we’re actually mapping data to aesthetic attributes.
To explain what that means, let’s dissect the example a little bit.
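The points in the scatterplot are “geometric objects” that we draw. In ggplot2, these are called geoms.
But all “geometric objects” have “aesthetic attributes.” Aesthetic attributes are things like:
- x-position
- y-position
- color
- size
- transparency
When we create a data visualization in ggplot2, we are mapping between the variables in our data and the aesthetic attributes of the geometric objects that we plot.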
I’m going to repeat that, because it’s very important:
When we visualize data, we are mapping between the variables in our data and the aesthetic attributes of the geometric objects that we plot.
To bring this back to our simple scatterplot example, when we create this plot, we are mapping foo to the x-position and bar to the y-position of the points.
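One way to see the mapping concretely is to ask ggplot2 what it computed for the point layer. The following is a minimal sketch, assuming only that ggplot2 is installed; df.small is just the first five rows of the dataset above (a subset I'm using to keep the example short), and layer_data() is the ggplot2 helper that returns the data behind a layer.

```r
library(ggplot2)

# First five rows of the post's dataset (a subset, to keep the sketch short)
df.small <- data.frame(
  foo = c(-122.419416, -121.886329, -71.05888, -74.005941, -118.243685),
  bar = c(37.77493, 37.338208, 42.360083, 40.712784, 34.052234)
)

# Same scatterplot as above: foo is mapped to x, bar is mapped to y
p <- ggplot(data = df.small, aes(x = foo, y = bar)) +
  geom_point()

# layer_data() returns the data ggplot2 computed for layer 1.
# Its columns are named after aesthetics, not after our variables.
built <- layer_data(p, 1)
print(built[, c("x", "y")])
```

Notice that the variable names foo and bar are gone from the computed data. What remains are the aesthetic attributes of the points, which is the mapping at work.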
Mapping variables is a really important concept …
I know what you’re thinking:
“Yeah, I get it,
… ‘foo’ on the x-axis and ‘bar’ on the y-axis.
… I can do that in Excel.”
Not so fast.
Understand: this is a simple example, but there’s a very deep principle at work here.
Theoretically, geometric objects (i.e., the things that we draw in a plot, like points) don’t just have attributes like x-position and y-position. As I mentioned above, geometric objects have a variety of other aesthetic attributes like transparency, color, size, etc. Moreover, if we can map variables to attributes like x-position and y-position, we should be able to map variables to attributes like color and size, right?
… and this is exactly what ggplot2 lets us do.
Mapping variables to parts of your plot is not limited to the x and y axes in ggplot2.
More importantly, it allows us to map variables to essentially any of these aesthetics.
To show you this, let’s extend our example and create a bubble chart.
Extended example: mapping a variable to size
All we need to do is map a new variable to the size aesthetic:
    #------------------------------------
    # NEW PLOT
    # - map a variable to size and replot
    #------------------------------------
    ggplot(data = df.dummy, aes(x = foo, y = bar)) +
      geom_point(aes(size = zaz))
What have we done here?
We’ve transformed the simple scatterplot into a bubble chart by mapping a new variable to the size aesthetic.
Let me say that again. We just changed a scatterplot to a bubble chart simply by mapping a new variable to the size aesthetic.
And it doesn’t end there.
As I already noted, there are other aesthetics to which you can map variables beyond x, y, and size.
In some simplistic sense, that’s all we’re really doing when we visualize data.
When we create a visualization, we’re ultimately creating a mapping from variables in the data to aesthetic attributes of the geometric objects that we draw.
It’s simple, but critical: any visualization you see can be deconstructed into geom specifications and mappings from data to the aesthetic attributes of those geometric objects.
That might not sound like a big deal, but once you “get it” – once you really understand what this means – your approach to visualizing data will be changed forever. You’ll look at more complex visualizations and understand that they are easy to produce, if you know what geom to specify and how to map your variables. Nearly all visualizations become much easier to produce.
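To make the "any aesthetic" idea a bit more tangible, here is a small sketch that maps the same variable to both size and color. It assumes ggplot2 is installed and again uses only the first five rows of the dataset (df.small is my name for that subset, not something from the original code).

```r
library(ggplot2)

# First five rows of the post's dataset (a subset, to keep the sketch short)
df.small <- data.frame(
  foo = c(-122.419416, -121.886329, -71.05888, -74.005941, -118.243685),
  bar = c(37.77493, 37.338208, 42.360083, 40.712784, 34.052234),
  zaz = c(6471, 4175, 3144, 2106, 1450)
)

# Same geom as before; we only add one more mapping (zaz -> color)
p.color <- ggplot(data = df.small, aes(x = foo, y = bar)) +
  geom_point(aes(size = zaz, color = zaz))

# The computed layer data now carries a size and a colour for every point
# (ggplot2 uses the spelling "colour" internally)
built.color <- layer_data(p.color, 1)
print(built.color[, c("x", "y", "size", "colour")])
```

The geom did not change at all. The only difference from the bubble chart is one extra entry inside aes().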
Principle 2: Build plots in layers
In addition to learning to conceptualize visualizations as “mappings from data to aesthetics” there’s another principle you need to understand: building plots in layers.
The principle of layering is important because to create more advanced visualizations, you’ll often need to:
- Plot multiple datasets, or
- Plot a dataset with additional contextual information that’s contained in a second dataset, or
- Plot summaries or statistical transformations over the raw data
To see what I mean, let’s modify the bubble chart that I just showed you above.
We’re going to:
- Get some additional information
- Store it in a new data frame
- Plot it as a new layer, underneath the bubbles
    #--------------------------
    # GET ANOTHER LAYER OF DATA
    #--------------------------
    library(maps)

    df.more_data <- map_data("world")

    # PLOT
    ggplot(data = df.dummy, aes(x = foo, y = bar)) +
      geom_polygon(data = df.more_data, aes(x = long, y = lat, group = group)) +
      geom_point(aes(size = zaz), color = "red")
And this is what the new chart looks like:
Are you starting to get it?
This is just the bubble chart from earlier in the post with a new layer added. That’s. It.
We just transformed a bubble chart into a new visualization called a “dot distribution map,” which is much more insightful and much more visually interesting.
In the beginning of the post (when we created our dataset), I didn’t tell you that this is geospatial data. I didn’t tell you, because I wanted you to see that this dot distribution map is essentially the same as a bubble chart, with a new layer of contextual information plotted underneath the bubbles.
Mapping and layering allow us to create complex charts
Moreover, as we saw earlier, the bubble chart is just a modified scatter plot. It’s a scatterplot with an additional variable mapped to the size aesthetic.
So, this dot distribution map is just a bubble chart, and the bubble chart was just a scatterplot.
Ultimately, we used two of our data visualization principles – mapping and layering – in order to build this visualization from a scatter plot, to bubble chart, to the dot distribution map that we now see:
- To create the scatterplot, we mapped foo to the x-aesthetic and mapped bar to the y-aesthetic
- To create the bubble chart, we mapped a new variable to the size-aesthetic
- To create the dot distribution map, we added a layer of polygon data under the bubbles.
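The layering step in particular can be sketched without the world map. In the example below, df.box is a hypothetical rectangle that I made up purely for illustration (it stands in for the country polygons), and df.small is the first five rows of the dataset. The point is only that a layer can bring its own data and its own mappings, and that layers are drawn in the order you add them.

```r
library(ggplot2)

# First five rows of the post's dataset (a subset, to keep the sketch short)
df.small <- data.frame(
  foo = c(-122.419416, -121.886329, -71.05888, -74.005941, -118.243685),
  bar = c(37.77493, 37.338208, 42.360083, 40.712784, 34.052234),
  zaz = c(6471, 4175, 3144, 2106, 1450)
)

# HYPOTHETICAL context layer: a plain rectangle, for illustration only.
# In the post, this role is played by the world map polygons.
df.box <- data.frame(
  long = c(-130, -65, -65, -130),
  lat  = c(25, 25, 50, 50)
)

p.layers <- ggplot(data = df.small, aes(x = foo, y = bar)) +
  # Layer 1: drawn first, so it sits underneath; uses its own data and mapping
  geom_polygon(data = df.box, aes(x = long, y = lat), fill = "grey80") +
  # Layer 2: drawn second, on top; uses the plot's default data
  geom_point(aes(size = zaz), color = "red")

# Each layer keeps its own computed data
context.layer <- layer_data(p.layers, 1)  # 4 rows: the rectangle's corners
points.layer  <- layer_data(p.layers, 2)  # 5 rows: one per observation
```

If you swap the two geom calls, the rectangle is drawn over the points instead of under them, which is why the order of the layers matters.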
Mapping and layering. That’s really the essence of it.
To become great at data visualization, you need to understand mapping variables to aesthetics and building plots in layers.
These are two critical ideas that you need to understand, both technically (in order to write
Once you understand mapping and layering, you’ll begin to see that many “complex” visualizations are, in fact, quite simple to make (if you know how to think about putting them together).
Principle 3: iteration
There’s actually a third principle at work here that I haven’t mentioned yet: building plots iteratively.
This principle is only related to the syntax in a cursory way, but it does arise as a consequence of the
Part of becoming a data scientist is not only learning syntax, but also learning workflow. You need to learn processes.
You won’t learn workflow directly when you learn
Let me explain.
When we build plots in layers, we are ultimately building a plot iteratively: we layer in new information, piece by piece, or modify existing parts of the plot, piece by piece.
As an example, let’s go back to the chart that we created above.
We ultimately created a dot distribution map, but step-by-step, how did we actually build it?
We followed this basic process:
- We plotted a scatterplot by mapping variables to the x and y aesthetics
- We created a bubble chart by modifying the scatterplot. We essentially mapped a new variable to the “size” aesthetic
- We layered in polygons to show the shape of the countries underneath the points.
Ultimately, we can break down the creation of the dot distribution map into discrete steps. We built the map iteratively.
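In practice, one way to work like this is to store each step in its own object and build the next step from it. This is only a sketch of that workflow, assuming ggplot2 is installed; the object names (p.base, p.step1, and so on) and the alpha value are my own choices for illustration, and df.small is the first five rows of the dataset.

```r
library(ggplot2)

# First five rows of the post's dataset (a subset, to keep the sketch short)
df.small <- data.frame(
  foo = c(-122.419416, -121.886329, -71.05888, -74.005941, -118.243685),
  bar = c(37.77493, 37.338208, 42.360083, 40.712784, 34.052234),
  zaz = c(6471, 4175, 3144, 2106, 1450)
)

# Start with the data and the x/y mappings only (no geoms yet)
p.base <- ggplot(data = df.small, aes(x = foo, y = bar))

# Step 1: scatterplot
p.step1 <- p.base + geom_point()

# Step 2: bubble chart (map zaz to size)
p.step2 <- p.base + geom_point(aes(size = zaz))

# Step 3: tweak an existing part (make the bubbles semi-transparent;
# 0.6 is an arbitrary value chosen for this sketch)
p.step3 <- p.base + geom_point(aes(size = zaz), alpha = 0.6)

# Compare what changed between steps
step1.data <- layer_data(p.step1, 1)
step2.data <- layer_data(p.step2, 1)
step3.data <- layer_data(p.step3, 1)
```

Print any of the step objects (for example, p.step2) to draw that stage. Because every stage is saved, you can look at each one, decide what to change, and move on.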
If we wanted to go further, we could continue to polish the map by performing additional steps:
- Add a legend title
- Modify the size scale
- Modify the colors (note that even getting the colors perfect requires a lot of iterative, trial-and-error tinkering)
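These polishing steps might look roughly like the sketch below. To be clear, the specific values here (the legend title, the size range, the hex color, and the alpha) are placeholders I picked for illustration, not the settings behind the finished chart. df.small is the first five rows of the dataset.

```r
library(ggplot2)

# First five rows of the post's dataset (a subset, to keep the sketch short)
df.small <- data.frame(
  foo = c(-122.419416, -121.886329, -71.05888, -74.005941, -118.243685),
  bar = c(37.77493, 37.338208, 42.360083, 40.712784, 34.052234),
  zaz = c(6471, 4175, 3144, 2106, 1450)
)

p.polished <- ggplot(data = df.small, aes(x = foo, y = bar)) +
  # Modify the colors: a fixed color and transparency (placeholder values)
  geom_point(aes(size = zaz), color = "#CC3333", alpha = 0.6) +
  # Modify the size scale: smallest bubble gets size 2, largest gets size 10
  scale_size_continuous(range = c(2, 10)) +
  # Add a legend title for the size legend (placeholder text)
  labs(size = "Value of zaz")

# Inspect the computed sizes and colors for the point layer
polished.data <- layer_data(p.polished, 1)
print(polished.data[, c("x", "y", "size", "colour", "alpha")])
```

Each of these lines is a separate, small change, so you can add them one at a time and re-plot after each one until the chart looks right.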
If we performed these last few steps, our work could ultimately lead to a chart like this:
To a beginner, this finalized chart probably looks difficult to create. But, once you understand how to build a plot iteratively (in layers) it becomes easy.
My point is that
The structure of the syntax sort of requires you to build plots in layers, and this in turn builds your intuition about iteration and data visualization workflow.
Ultimately, this knowledge about workflow is language-agnostic and transferable if you move to another tool.
To learn how to think about visualization, learn ggplot2
This is why I think that
- ggplot2 makes complex visualizations relatively easy, by allowing you to break down complex visualizations into simple mappings and layers
- ggplot2 enables, and in some sense encourages, iterative creation
- ggplot2 trains you how to think about visualization (i.e., it trains you to think about visualizations as mappings and layers, and encourages you to work iteratively)
Now, I will admit that
So by learning
In turn, by mastering visualization – a core, necessary skill – you’ll become a better data scientist. You’ll be better at getting things done. And when you want to move on to higher-level skills like advanced visualization or machine learning, you’ll have the foundation you need.