Data science gives you tools to find opportunities

A few months ago, I wrote a blog post about Amazon’s search for a location for a second headquarters. For those of you who don’t know about it, the technology giant, which currently has its headquarters in Seattle, Washington, announced last September that it plans to build a second headquarters.

Over the last few weeks, the question of where Amazon will build its second headquarters raised a broader question in my mind: which locations are generally great for business?

As an entrepreneur myself, I am personally interested in this question. Moreover, in a world that is increasingly driven by entrepreneurship and innovative business, this is something that we all need to consider. Whether you’re an entrepreneur or an employee, you need to think about where the best opportunities are. For example, as a data scientist, you need to think about where great jobs are today, but also where they might be 5 to 10 years from now.

Speaking more broadly though, this is the sort of question you’ll need to answer as a data scientist. In many cases, data scientists need to use visualization and other techniques to find opportunities.

Your business partners will need you to answer questions and find opportunities

Finding insights and identifying opportunities is a major part of the job for many data scientists. This is particularly true for junior data scientists. At the entry level, most data scientists are not going to be working on machine learning systems and large software projects. Instead, many junior data scientists analyze data in order to find insights.

So if your goal is to become a data scientist, you need to be able to analyze data, find insights, and identify opportunities.

For example, your business partners will want you to answer things like:

  • “what market is performing best?”
  • “which customer segment should we target?”
  • “where is our greatest opportunity for improvement?”

Of course, the specific questions will be different depending on what industry you’re working in, but the general purpose of the questions will be similar. Your business partners will typically want to improve something (e.g., profit, ROI, process performance) or gather information to make a good business decision. They’ll want you to find insights in data to help them with their problem.

Finding insights is a skill you need to develop if you want to get a data science job.

Practice finding insights by asking and answering questions

Finding insights in data is a skill, and like any skill, you’ll get better at it with practice.

A great way to practice finding insights in data is just answering your own questions about the world.

When you find yourself asking a question about the world, see if you can find a good dataset that might help you answer your question. If you can find a good dataset, analyze the data. Create some visualizations. See if you can use your data skills to find insights and answer your own questions.

This is particularly useful if you’re not working as a data scientist yet. To become a great data scientist, you need to practice by working on a lot of projects. Of course, if you already have a job as a data scientist, you’ll get plenty of projects at work (trust me). But if you haven’t landed your first data science job, you’ll need to find your own projects to help you practice.

So when you have a question like “what’s the best state for business,” go online, look for a dataset, and start writing a script to analyze it. You can use these questions as opportunities to work on a small project. Over time, working on small projects like this will sharpen your data skills.

Mapping the best states for business

After asking myself “what is the best state for business,” I went online and found some survey data from chiefexecutive.net to help answer the question. chiefexecutive.net surveyed several hundred CEOs and asked them to rank their favorite states for business. By visualizing this survey data, I’m able to gain some insight and answer the question “what is the best state for business.”

Again, this is very similar to the sorts of questions that you’ll get from business partners in a data science job. Therefore, when you ask yourself one of these questions, it’s a great opportunity to find a dataset and do some simple analysis. It’s a great opportunity to practice your data science skills. If you get into the habit of answering questions like these yourself, you’ll be ready when you’re working as a real data scientist.

Code: mapping the best states for business

After finding a dataset from chiefexecutive.net, I wrote a quick R script to visualize the data. You’ll find the code below.

In the code, pay attention to what I’m doing here. I’ve used several critical skills from the data science workflow: getting data, cleaning data, and visualizing data.

To get the data, I’ve used some tools from rvest to scrape the data from the webpage.
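If you want to try the scraping pattern without hitting the live page, here is a minimal sketch. It assumes a tiny made-up HTML table (the values are placeholders, not the survey data) and uses the same read_html(), html_nodes(), and html_table() calls as the full script.

```r
library(rvest)

# Hypothetical stand-in for the real page: a small inline HTML table,
# so the example runs without a network connection.
html.example <- read_html('
  <table>
    <tr><th>Rank</th><th>State</th></tr>
    <tr><td>1</td><td>Texas</td></tr>
    <tr><td>2</td><td>Florida</td></tr>
  </table>
')

# Find every <table> node, keep the first one, and parse it into a data frame
df.example <- html.example %>%
  html_nodes('table') %>%
  .[[1]] %>%
  html_table()

df.example
```

On a real page you would swap the inline HTML for the URL. If the page has several tables, you may need a different index than 1.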

I’ve used dplyr to “wrangle” the data. In particular, I’ve used dplyr::select() to reorder the variables. I also used dplyr::mutate() in combination with stringr::str_to_lower() to change some of the data to lower case.
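Here is a small sketch of that wrangling step on toy data. The rows and the map.ids vector are made up for illustration. The last line is an extra sanity check that is not in my script: it lists any state names that would fail to match the lowercase ids used by the map.

```r
library(dplyr)
library(stringr)

# Toy rows (illustrative values, not the survey data)
df.example <- tibble(
  rank_2017 = c(1, 2),
  state     = c('Texas', 'Florida'),
  text      = c('note a', 'note b')
)

df.example <- df.example %>%
  select(state, rank_2017, text) %>%     # reorder the columns
  mutate(state = str_to_lower(state))    # 'Texas' becomes 'texas'

# Hypothetical map ids, lowercase like the id column of a map data frame
map.ids <- c('texas', 'florida')

# States with no matching map id (an empty result means everything matches)
setdiff(df.example$state, map.ids)
```

Lowercasing matters because geom_map() matches rows to polygons by exact string. A state spelled 'Texas' in the data would not match 'texas' in the map.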

Also pay attention to some of the visualization techniques and tools I’m using. For starters, to visualize this I’m just using ggplot(), although using geom_map() makes it a little more complicated.
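To see how geom_map() fits together, here is a stripped-down sketch. It assumes a hypothetical "map" made of two squares instead of real state outlines, so it runs without any map package. The names map.toy, df.toy, region, and rank are invented for this example.

```r
library(ggplot2)

# Hypothetical map: two unit squares. geom_map() expects the map data frame
# to have long/lat (or x/y) coordinates and an id (or region) column.
map.toy <- data.frame(
  id   = rep(c('a', 'b'), each = 4),
  long = c(0, 1, 1, 0,  1, 2, 2, 1),
  lat  = c(0, 0, 1, 1,  0, 0, 1, 1)
)

# One row per region, keyed by the same ids as the map
df.toy <- data.frame(region = c('a', 'b'), rank = c(1, 2))

p <- ggplot(df.toy, aes(map_id = region)) +
  geom_map(aes(fill = rank), map = map.toy) +
  # geom_map() does not set the axis limits itself, so take them from the map
  expand_limits(x = map.toy$long, y = map.toy$lat)

p
```

The data frame holds the values to plot, the map data frame holds the polygons, and map_id links the two. That split is what makes geom_map() feel more complicated than a typical geom.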

What really makes this visualization look good, though, is the detailed formatting. With that in mind, look carefully at the theme code. I’ve used the ggplot2 theme() function to control the appearance of the visualization, and scale_fill_gradientn() to create the green/yellow/red color scale, which is critical for the appearance of the plot. The formatting and the color scale are what make this visualization work, so make sure that you learn how to use those tools.
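The formatting tools are easier to study in isolation. The sketch below applies the same kind of color scale and a cut-down theme to a row of toy tiles. The data frame and its values are made up, and the theme here is a shortened version of the one in my script.

```r
library(ggplot2)

# Toy data: five tiles with made-up ranks between 1 and 50
df.toy <- data.frame(x = 1:5, y = 1, rank = c(1, 10, 25, 40, 50))

p <- ggplot(df.toy, aes(x = x, y = y, fill = rank)) +
  geom_tile() +
  # Interpolate the fill through green, then yellow, then red
  scale_fill_gradientn(colours = c('#009900', 'yellow', 'red'),
                       breaks = c(1, 10, 20, 30, 40, 50)) +
  # Flip the colour bar so that rank 1 sits at the top
  guides(fill = guide_colourbar(reverse = TRUE)) +
  # Strip the axes and grid, and set a flat grey background
  theme(
    panel.background = element_rect(fill = '#DDDDDD'),
    panel.grid = element_blank(),
    axis.text  = element_blank(),
    axis.ticks = element_blank(),
    axis.title = element_blank()
  )

p
```

Try removing the theme() call or swapping the colours to see how much of the final look comes from these few lines.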

The full script below may look a little complicated to a beginner, but keep in mind that you only need to learn a few dozen tools to write it.

A few dozen tools. That’s it.

If you studied diligently and practiced, how fast could you memorize a few dozen R functions?

#==============
# LOAD PACKAGES
#==============
library(tidyverse)
library(rvest)
library(stringr)
library(fiftystater)


#=============
# READ WEBPAGE
#=============
html.business <- read_html("https://chiefexecutive.net/2017-best-worst-states-business/")


#============
# SCRAPE DATA
#============
html.business %>% 
  html_nodes('table') %>%
  .[[1]] %>%
  html_table() ->
  df.business


# CHECK COLUMN NAMES
df.business %>% colnames()


#====================
# CHANGE COLUMN NAMES
#====================
colnames(df.business) <- c('rank_2017'
                           ,'state'
                           ,'text'
                           ,'rank_2016'
                           ,'change_in_rank'
                           )


#=================
# COERCE TO TIBBLE
#=================
df.business %>%
  as_tibble() ->
  df.business


# INSPECT
df.business %>% glimpse()


#==================
# REORDER VARIABLES
#==================
df.business %>% 
  select(state
         ,rank_2016
         ,rank_2017
         ,change_in_rank
         ,text
         ) ->
  df.business


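#========
# GET MAP
#========
# Note: map.usa is only inspected here; the plot below draws
# fifty_states (from fiftystater), which also includes Alaska and Hawaii.
map.usa <- map_data('state')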

# inspect
map.usa %>% glimpse()


#==========================
# UN-CAPITALIZE STATE NAMES
#==========================
df.business %>%
  mutate(state = str_to_lower(state)) ->
  df.business


#=============
# CREATE THEME
#=============
theme.map <- theme(
  text = element_text(family = 'Helvetica Neue', color = '#444444')
  ,panel.background = element_rect(fill = '#DDDDDD')
  ,plot.background = element_rect(fill = '#DDDDDD')
  ,legend.background = element_blank()
  ,legend.position = c(.9, .4)
  ,legend.key = element_blank()
  ,panel.grid = element_blank()
  ,plot.title = element_text(size = 18, face = 'bold')
  ,plot.subtitle = element_text(size = 12)
  ,axis.text = element_blank()
  ,axis.ticks = element_blank()
  ,axis.title = element_blank()
  )


#=====
# PLOT
#=====

ggplot(df.business, aes(map_id = state)) +
  geom_map(aes(fill = rank_2017), map = fifty_states) +
  expand_limits(x = fifty_states$long, y = fifty_states$lat) +
  theme.map +
  scale_fill_gradientn(colours = c('#009900','yellow', 'red')
                       ,breaks = c(1,10,20,30,40,50)) +
  labs(fill = str_c('Rank of best states\n'
                    ,'for business, 2017\n'
                    )
       ,title = str_c('New York, California, and Illinois '
                      ,'rank very low among CEOs'
                      )
       ,subtitle = str_c('chiefexecutive.net surveyed hundreds of CEOs and asked which states they preferred.\n'
                         ,'Large "blue" states like California, Illinois, and New York ranked at the bottom\n'
                         ,'while Texas and Florida ranked at the top.  2017 is the 13th straight year with Texas\n'
                         ,'in the #1 position.\n'
                         )
       ) +
  guides(fill = guide_colourbar(reverse = TRUE)) +
  coord_map("albers", lat0 = 30, lat1 = 40) +
  annotate(geom = 'text'
           ,label = 'data source: chiefexecutive.net/2017-best-worst-states-business/'
           ,x = -85, y = 22
           ,size = 3
           ,color = '#666666'
           )

It should take only a few months to learn this

Using data visualization to find insights in data is a skill, and it can be learned.

The bad news is that many people take a very long time to learn to do this. Most people will take over a year to learn to write code like this. Even worse, many people will try and fail (or give up), so they will never learn to write code like this.

I’m not pointing this out to be mean. I’m sympathetic to the fact that learning data science is hard. It’s confusing. And it takes time.

Having said that, I believe that a motivated, disciplined student could learn to write code like this in a few months. And when I say “learn” I don’t mean watch a few videos and have a rough idea about how it works. I mean that a motivated student should be able to write this code “fluently,” mostly from memory, after about 3 to 6 months.

Discover how to master data science

Finding insights in data takes practice. Learning the tools of data science is hard work.

… but if you know how to practice, you can master data science faster than you thought possible.

Sign up now, and discover how to master data science fast.

In our tutorials, we will help you rapidly learn and master the tools you need to be a top-performing data scientist.

Moreover, if you sign up now, you’ll get access to our free Data Science Crash Course.

In the Data Science Crash Course, you’ll learn:

  • a step-by-step data science learning plan
  • the 1 programming language you need to learn
  • 3 essential data visualizations
  • how to do data manipulation in R
  • how to get started with machine learning
  • the difference between machine learning and statistics
