Data Science Crash Course SharpSight
DATA SCIENCE IS ONE OF THE MOST VALUABLE SKILLS OF THE 21ST CENTURY
Right now, the world has more data than we know what to do with. Some
writers have called it “the data deluge,” which sounds hyperbolic, but it's
Meanwhile, the world desperately needs insight. The world is overrun with
problems waiting to be solved; we just need insight into how to fix those
problems.
sharpsightlabs.com 1
That’s where you come in. Data science is one of the biggest
opportunities of this century. If you can master the skills of data, you'll not
fixing complex problems, but also capture some of the value you create (in
Want to change the world? Want to create real value and build wealth?
THE FIRST PROGRAMMING LANGUAGE YOU SHOULD LEARN
When people want to get started with data science, inevitably they ask
More often than not, the answer they receive is a list of every tool that
might be occasionally used for data science. I've seen lists of 25+ tools
That's BS. You don't have the time to learn 25 tools. And, most
businesses that are hiring data scientists are looking for expertise with
only a few core tools (typically R or Python, SQL, Excel and a few auxiliary
tools).
All that said, don't listen to people giving you lists of dozens of tools. As
with almost any task, learning data science is best performed when you
simplify and focus your efforts on the things that have the highest return on
investment. In this case, you want an "in demand" tool that has the
That tool is R.
In the long run, as a data professional, you'll want to develop skill in three
core areas: data acquisition and data shaping (sometimes called data
has excellent packages and toolsets for all of these skill areas. R is also
To put it simply: if you're getting started with data science, R is the best
tool to use.
With that in mind, I want to get you up-and-running with R within about an
HOW TO SET UP R
Install R
1. Go to http://cran.r-project.org/
When you get to the installation webpage, find the appropriate file, and
Install RStudio
programmer, you’ll know what an IDE is, but if you’re new to programming,
administer your code, as well as manage your data and the outputs of
1. Go to http://www.rstudio.com/products/rstudio/download/
2. Follow the instructions for installation for your particular platform (i.e.,
Install Packages
Packages are extensions of the R programming language. R has pre-built
functions and datasets that come with the stock R installation, but
sometimes you want to do things that are outside of R’s basic function set.
In this case, you can extend R with new functions by installing “packages.”
The primary packages that we need to install are ggplot2 and dplyr.
In future chapters of this book, you’ll learn a little about both data visualization
Data visualization is the best skill area to start with for a couple of reasons.
First, it's easy to get started. It's easy to find data sets that are ready to be
every part of the workflow. Whether you're creating a report, doing
it's one of the most necessary skills, but few analysts are truly exceptional
structured. I won’t go into detail here, but I’ll note that the structure of
ggplot2 will help you learn how to think about visualizing data. Once you
(Note: the following shows how this is done on a Mac. The basic
Install ggplot2 and dplyr
1. Open RStudio
4. Repeat instructions 2 and 3 for dplyr and caret
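If you prefer typing to clicking, the same packages can usually be installed from the R console with install.packages(). The following is a minimal sketch rather than the procedure shown in the screenshots: the helper missing_packages() is a hypothetical name introduced here, and the package list (ggplot2, dplyr, caret) is simply taken from the steps above.

```r
# Packages named in the installation steps above.
pkgs <- c("ggplot2", "dplyr", "caret")

# Hypothetical helper: return the packages that are not installed yet.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(pkgs)

# Install only what is missing. The interactive() guard is an assumption made
# for this sketch so that nothing is downloaded when the code runs in a script.
if (interactive() && length(to_install) > 0) {
  install.packages(to_install)
}
```

Once the packages are installed, library(ggplot2) and library(dplyr) should load without errors.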
A STEP-BY-STEP DATA SCIENCE LEARNING PLAN
You’ve probably read numerous articles telling you how to start learning
data science. Collectively, they tell you dozens of things you need to
learn. Learn Python. Learn R. Learn Hadoop. They tell you all the skills you
skills like manipulating vectors, matrices, loops. More tools like Pig, Hive,
It’s like standing at the base of Everest, saying to yourself, “how the hell
You need a path
Look, there are lots of people out there that aren’t actually skilled in data
science, telling you how to start learning data science (I’m looking at you,
HR professionals).
Then, there are people who are actual data scientists, but are terrible
teachers and communicators. I’m sure you know what I’m talking about.
The guy with the super-elite PhD who says “oh, it’s easy,” and then
proceeds to talk for 45 minutes about arcane math that nobody can
understand.
Let me break this down for you: you can’t learn everything at once. Your
time is limited.
Some skills are more useful than others, and some skills are easier to learn
than others.
Some skills are used every day by almost every data scientist, and other
skills are “specialty” skills, used either by a select few specialists or used
You need to be selective, and you need to learn things in the proper order.
You need to focus on learning the skills with the highest return on
investment (ROI). Focus on the skills that are easy to learn, easy to
implement, that yield the greatest results. (ahem. Do you know the results
clients actually want? Maybe you need to learn what clients want first.)
Learn R
I believe that it’s best to focus your efforts. Learn one tool.
O’Reilly Media just released their 2014 Data Science Salary Survey. In
that report, they note that R is the most commonly used programming
2. R has 2 packages that dramatically streamline the data science
workflow:
To be fair, I think you could also make a strong case for learning Python (it
As I’ve noted earlier, ggplot2 has a deep structure that underlies its
syntax. When you learn that structure, you begin to think about data
Similarly, dplyr’s syntax is easy to learn, easy to use, and operates in a
way that streamlines your workflow. dplyr is one of the best data
“chaining” (much like Unix pipes) you can rapidly explore your data with
you’re first starting out. In your first few months, it’s the highest leverage,
highest ROI skill. It’s also one of the most versatile tools: you can use
ggplot2 because it has a deep underlying structure to it. When you learn
the structure of the syntax, you are at the same time learning deep
• the scatterplot
1. These charts are the basis for more advanced visualizations. For
2. These charts have a structure to them. When you learn that structure,
you will start to learn how to think about visualizing your data.
3. The charts are the essential “tools of insight.” They are foundational
Once you learn the foundational tools of data visualization, you can “back
When you just start learning data science, you can use “dummy data” or
very simple data sets that don’t require much data reshaping.
As you advance though, the “shape” of your data will be a problem: you’ll
have multiple data files that you need to join together; you’ll need to subset
reach this point – when the shape of your data is a bottleneck – then you
should put more time into learning data manipulation. An example of this is
the recent tutorial analyzing ‘supercar’ data, where the data were found in
the 5 basic dplyr verbs, as well as “chaining” using the %>% operator.
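To make “chaining” concrete, here is a minimal sketch of several dplyr verbs linked with %>%. It is not taken from the supercar tutorial mentioned above; it assumes the diamonds dataset that ships with ggplot2, and the object name df.top_ideal is made up for illustration.

```r
library(dplyr)
library(ggplot2)  # loaded here only because it provides the diamonds dataset

# Each verb receives the result of the previous step, much like a Unix pipe.
df.top_ideal <- diamonds %>%
  filter(cut == "Ideal") %>%                   # keep only the "Ideal" cut rows
  select(carat, color, price) %>%              # keep three columns
  mutate(price_per_carat = price / carat) %>%  # add a derived variable
  arrange(desc(price_per_carat))               # sort, highest value first

head(df.top_ideal)
```

Read top to bottom, the chain says: take the data, then filter it, then select columns, then add a variable, then sort.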
I know: this is the opposite of what most other people are telling you.
The fact is, the vast majority of data jobs – particularly the entry level jobs
Think of a baseball team. There are core baseball skills (hitting, throwing,
and running). The vast majority of people on the team have a mix of skills.
Teams are built from individuals with mixes of the core baseball skills. And
then there’s one guy who is highly specialized in the most arcane of skills:
Machine learning is similar to pitching. It’s valuable, technical, a little
arcane, and difficult to do well. There are also fewer of those jobs.
manipulation: they are easier to learn, easier to implement, and the jobs
Not to mention: you need data visualization and data manipulation for
need to put your data into the right shape first. And when you’re done, in
most cases you’ll need to explore the results. Typically you’ll perform this
So to summarize my view on ML: in the beginning it’s the skill with the
in data manipulation and visualization), and there are fewer jobs requiring
machine learning.
Keep in mind: I’m not saying that you should never learn machine
learning. It’s extremely valuable. I’m just saying that you should learn it
manipulation.
• Learn data visualization first (with R’s ggplot2), using simple data or
• Learn data manipulation second (with R’s dplyr), and practice data
YOU NEED TO MASTER THE BASICS FIRST
He’s a web developer who primarily works in Ruby and Python, but also
the-mill web developer, but he’s confessed to me that he’s a bit bored and
Sharp Sight: “Do you know data visualization? Do you know data
wrangling techniques?”
Sharp Sight: “You don’t understand. Data visualization is a prerequisite for
machine learning. You need to learn how to dive into a dataset and
analyze it before you can make machine learning algorithms work. You
I gave him solid advice. I gave him the advice I told you earlier in this
really systematic about learning data visualization and data wrangling, and
His response?
“Jump in and figure it out” is a losing strategy
You have to know this guy. He’s young. He’s cocky. He doesn’t know what
“I’ll just figure it out” is code for, “I don’t want to do all that stuff you
recommended, so I’m just going to jump in, without the prerequisites, and
I’m fairly certain that if I grill him in 6 months, he won’t know many of the
I gave him a dataset, and asked him to write me some code (from scratch
and by memory) to implement a logistic regression, he wouldn’t be able to
do it.
It’s not because he’s unintelligent (he’s a fairly smart guy). It’s because his
To be clear, there are absolutely people who will be able to “figure it out.”
their goal. But the odds aren’t good, and it’s terribly inefficient (you’ll work
It’s understandable … everyone wants to learn the sexiest stuff first. This
People who begin learning a musical instrument do the same thing. They
say, “I want to play guitar” but they want to jump right into playing
advanced guitar solos, instead of meticulously and intensely mastering the
basics. And because they don’t want to master the basics, they fail to
learn critical skills and ultimately miss their target. They never learn the
If you look at top performers of all stripes, they are extremely methodical in
and strategic.
They don’t demand to start with the cool stuff. Top performers diligently
I get it. The cool stuff is why you want to get into data science in the first
place. For example, machine learning is very exciting right now. It’s
powering self-driving cars, intelligent IoT objects, and a variety of other
But ask yourself: do you want to be in the bottom 95% who fail to really
learn? The bottom 95% that under-perform? The bottom 95% who make
less money? The bottom 95% who say “I’ll just figure it out” but then fail, and
Or do you want to be in the top 5%? … the top 5% who earn most of the
money, get the best perks, and work on the coolest projects.
You can choose which group you fall into – the top 5% or the bottom 95%.
HOW TO CREATE 3 ESSENTIAL DATA VISUALIZATIONS
We’ve also talked about what you should learn first: data visualization.
And I’ve emphasized why: you need to master the basics first before you
Having said that, I’m going to show you 3 “core” data visualizations that
eventually be able to do these “with your eyes closed.” Getting there will
take practice. In the beginning though, you’ll be able to take this code,
3 core data visualizations
1. Bar chart
2. Scatter plot
3. Line chart
I want to point out that while these are fairly basic visualizations, when we
take a closer look, they show something important. Each one is made up
somewhat basic. But if you can master these charts and a few other basic
To run this code, you need to open up RStudio (by this point, you should
haven't installed them already, go back to the chapter with the installation
In RStudio, you need to type in the code in the following sections, highlight
the code, and click the "Run" button (see the screen shot).
Bar Chart Code
First, the bar chart. For the time being, you can type this code in. (You
learning strategy.)
library(ggplot2)
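A minimal sketch of a ggplot2 bar chart might look like the following. The data frame df.bar_data and its values are dummy data invented for illustration, so treat this as one possible version rather than the exact code the chapter intends.

```r
library(ggplot2)

# Hypothetical dummy data: one categorical variable and one numeric variable.
df.bar_data <- data.frame(
  category = c("A", "B", "C", "D"),
  value    = c(12, 7, 15, 9)
)

# geom_col() draws one bar per category, with height taken from `value`.
p.bar <- ggplot(data = df.bar_data, aes(x = category, y = value)) +
  geom_col()

print(p.bar)
```

Notice the structure: a dataset, a mapping of variables to the x and y axes inside aes(), and a geom that decides what shape gets drawn.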
Scatter Plot Code
Next, the scatter plot. The scatter plot is very, very common. We use it for
library(ggplot2)
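As a sketch of what a scatter plot in ggplot2 can look like, the following plots price against carat. It assumes the diamonds dataset that ships with ggplot2; the choice of dataset and variables is an assumption made for illustration.

```r
library(ggplot2)

# Map one numeric variable to each axis and draw one point per row.
p.scatter <- ggplot(data = diamonds, aes(x = carat, y = price)) +
  geom_point()

print(p.scatter)
```

Compared with the bar chart, only the data, the mapped variables, and the geom change; the overall shape of the code stays the same.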
Line Chart Code
For the line chart, we’re going to create a simple dummy dataset, then plot
it using ggplot2.
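The two steps described above (create a dummy dataset, then plot it) can be sketched as follows. The data frame name df.line_data and its values are invented for illustration and are not the original numbers.

```r
library(ggplot2)

# Step 1: a simple dummy dataset with an ordered x variable.
df.line_data <- data.frame(
  year  = 2010:2015,
  value = c(3, 5, 4, 7, 8, 10)
)

# Step 2: geom_line() connects the observations in order of the x variable.
p.line <- ggplot(data = df.line_data, aes(x = year, y = value)) +
  geom_line()

print(p.line)
```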
DPLYR: HOW TO DO DATA MANIPULATION WITH R
Ok. Here’s an ugly secret of the data world: lots of your work will be prep
work.
Of course, any maker, artist, or craftsman has the same issue: chefs have
their mise en place. Carpenters spend a hell of a lot of time measuring vs.
cutting. Etcetera.
So, you just need to be prepared that once you become a data scientist,
Getting data, aggregating data, subsetting data, cleaning it, and merging
When you’re just starting out with analytics and data science, you can get
away with doing only minimal data manipulation. In the beginning, your
And if you do need to do some basic formatting, you can use Excel if
it’s a small and simple dataset. (actually, Excel is a good tool in your
techniques, and you will need to put your data in the right format. And
dplyr: the essential data manipulation toolkit
• Sorting
• Aggregating
dplyr gives you tools to do these tasks, and it does so in a way that
almost perfectly suited to real data science work. (Data science as it’s
To be clear, these aren’t just the “basics.” They are the essentials. These
are tasks that you’ll be doing every. single. day. You really need to master
these.
Again though, dplyr makes them extremely easy. It’s the toolset that I
together the data wrangling tools of dplyr with the data visualization tools
of ggplot2. Once you start combining these together, you will have a
dplyr verbs
filter()
conditions.
Let’s look at an example.
library(dplyr)
library(ggplot2)
head(diamonds)
#--------
# FILTER
# - keep only the rows where cut is "Ideal"
#--------
df.diamonds_ideal <- filter(diamonds, cut == "Ideal")
Here, we’re subsetting (i.e., filtering) the diamonds dataset and keeping
select()
In the following code, we’re going to modify our data frame, selecting only
#-----------------------------------------
# SELECT
# - select specific columns from your data
#-----------------------------------------
df.diamonds_ideal <- select(df.diamonds_ideal, carat, cut, color, price, clarity)
head(df.diamonds_ideal)
# carat cut color price clarity
# 0.23 Ideal E 326 SI2
# 0.23 Ideal J 340 VS1
# 0.31 Ideal J 344 SI2
# 0.30 Ideal I 348 SI2
# 0.33 Ideal I 403 SI2
# 0.33 Ideal I 403 SI2
mutate()
#--------------------------------------
# MUTATE:
# - Add variables to your dataset
#--------------------------------------
df.diamonds_ideal <- mutate(df.diamonds_ideal, price_per_carat = price / carat)
head(df.diamonds_ideal)
# carat cut color price clarity price_per_carat
# 0.23 Ideal E 326 SI2 1417.391
# 0.23 Ideal J 340 VS1 1478.261
# 0.31 Ideal J 344 SI2 1109.677
# 0.30 Ideal I 348 SI2 1160.000
# 0.33 Ideal I 403 SI2 1221.212
# 0.33 Ideal I 403 SI2 1221.212
price_per_carat.
arrange()
arrange() sorts your data. To be clear, there are other sorting functions
that you can use from base R, but the syntax is a bit of a pain.
Ok, in the following example, I’ll show you how to use arrange(). Having
said that, we’re not going to use the diamonds dataset, because it’s too
So, we’ll create a simple data frame with a numeric variable. This numeric
variable has the numbers out of order and we’ll use arrange to reorder the
numbers.
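#------------------------
# ARRANGE: sort your data
#------------------------
# create a simple data frame with an out-of-order numeric variable
# (the column name num_var is illustrative)
df.disordered_data <- data.frame(num_var = c(2, 3, 5, 1, 4))
head(df.disordered_data)
# 2
# 3
# 5
# 1
# 4
# sort in ascending order
arrange(df.disordered_data, num_var)
# 1
# 2
# 3
# 4
# 5
# sort in descending order
arrange(df.disordered_data, desc(num_var))
# 5
# 4
# 3
# 2
# 1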
Sorting your data may not seem useful, but it’s used more often than you
your data. This is more useful than it might seem at first blush.
summarize()
Again: the following (simple) example might not seem useful, but
group_by().
#-------------------------------
# SUMMARIZE:
# aggregate your data
#-------------------------------
summarize(df.diamonds_ideal
, avg_price = mean(price, na.rm = TRUE) )
# avg_price
# 3457.542
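summarize() tends to become more useful when it is paired with group_by(). As a minimal sketch (assuming the diamonds dataset from ggplot2; the object name df.avg_price_by_cut is made up for illustration), the following computes the average price for each cut instead of for the Ideal cut alone.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

# group_by() splits the rows by cut; summarize() then returns one row per group.
df.avg_price_by_cut <- diamonds %>%
  group_by(cut) %>%
  summarize(avg_price = mean(price, na.rm = TRUE))

df.avg_price_by_cut
```

The result has one row per value of cut, which makes it a convenient input for a bar chart.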
dplyr: one of the data science essentials
master the basics first. That includes these 5 dplyr verbs. Master these
data manipulation tools, along with a few critical data visualization tools
(like the line chart, the bar chart, and the scatterplot) and you’ll have a
solid foundation that you can build on. You’ll have many of the tools you
WHAT IS MACHINE LEARNING AND WHY IS IT SO IMPORTANT?
Entrepreneur and thought leader Peter Diamandis says that this technology
will “do more to improve healthcare than all the biological sciences
Billionaire venture capitalist Vinod Khosla agrees, saying that over the next
What is it?
Machine learning.
popular press and technology press. Media outlets like TechCrunch, New
That said, it’s not always called “machine learning.” In some outlets, you’ll
hear about “statistical learning.” (This is less common in the popular news.
fair, there are minor differences between statistical learning and machine
predictive analytics, and automation. To be clear, these are not all identical
Machine learning, a quick and dirty definition
for writing programs that input data and output predictions or decisions.
Currently, most software does not adapt. It does not learn. It has no
intelligence and only limited ability to change how it works to meet the
In contrast, programs that incorporate machine learning can adapt. They
generalizations about the examples, and then apply those inferences and
The main takeaway here is that the key component in getting machine
the most powerful machine learning techniques not only require data, but
We may not recognize this, but much of what human beings do involves a
involve a variety of predictions, like “how far away is the object” and “how
Similarly, in the task of driving a car, humans make predictions about
options, where each option has its own predicted outcome. For example,
when we decide to watch one movie over another, we are, in some sense,
And it’s certainly not just inconsequential decisions like choosing the best
how to invest, and how to allocate our time, and how to select people as
consequences.
Good predictions are valuable
If you look closely, you can see that in a wide variety of areas, from driving
extraordinarily valuable.
tremendous value.
So, to recap where we’re at, machine learning is powered by data. Good
data is required for training machine learning algorithms. But at the same
In some sense, machine learning unlocks value from data.
What that also means is that machine learning will become more valuable
learning.
Google is sitting on more data (and better data) than almost anyone in the
value.
Why machine learning will become even more valuable in the
next decade
Having said that, for companies like Google, it’s not only about the data
that they currently have, but the data that they will have in the future.
The estimates vary, but common estimates state that the amount of data
Given this exponential increase in data, by 2020, it’s estimated that the
conceptualize volumes this large (and rates of change this fast). What we
can say, in very simplistic terms, is that data is growing very, very fast and
Sensors, exponential data, and connecting the physical &
digital worlds
This explosion of data is being driven not only by the internet itself, but
also the emerging connection between the virtual and physical worlds.
Mobile phones are instrumented with a variety of sensors that enable the
Just like in mobile phones, as sensors get smaller we will be able to add
more of the physical world. Effectively, this is what people mean when they
say “the Internet of Things.” The IoT isn’t so much an internet of things, but
Setting aside the large topic of the IoT (that’s a different article), the point
miniaturize and we begin to connect the physical world with the digital
world.
And again, the value of this data (and really, the value of the IoT) will
article for the Wall Street Journal making the bold claim that “Software is
Looking backwards, and seeing the rise of companies like AirBnB, Uber,
not to mention the continued success of Amazon and Netflix (as they
being built into the DNA of the most successful companies of our time.
Software is critical to the success of the companies who are disrupting old
driven companies, machine learning is part of the “secret” that drives their
success.
Take Amazon. Like nearly all businesses, what drives revenue at Amazon
purchase. It also uses similar techniques to target past customers with
In both cases, data is the fuel, and machine learning is the critical part of
the “engine” of those machines. And software sort of wires it all together.
Yes, these are software-driven businesses. And we can say that they are
alike need to understand, is that in a very critical way, these are machine
learning businesses.
As we instrument the world, and collect larger amounts of data from
everything, we can build software for the physical world that’s powered by
that data; software that interacts with the world (i.e., robotics) and also
software that digitizes and optimizes formerly physical services (like Uber
and AirBnB).
As we collect more data from the world, we’ll be able to use machine
Yes: software is eating the world, but machine learning is eating software.
machine learning. Machine learning will allow them to unlock the value of
big data.
In any industry that Google (or Alphabet) is in – search, mobile, the IoT,
robotics, and even healthcare – it will allow them to create software that
companies like Google and will have a large impact on almost every part
a few.
will be larger than the impact of mobile … Almost any area I look at,
So Google is rethinking everything because of machine learning. Investors
like Khosla are investing heavily, claiming that machine learning will have a
or not).
win.
Moreover, data-driven businesses that leverage machine learning and
Google understands this. Investors like Khosla understand this. You need
to understand it too.
So, once you’ve mastered the essential tools of data science (i.e., data
machine learning.
A QUICK INTRODUCTION TO MACHINE LEARNING IN R WITH caret
If you’ve learned the essential data visualization and data exploration
learning.
But before diving into caret, let’s quickly discuss what machine learning
What is machine learning?
Without going into extreme depth here, let’s unpack that by looking at an
example.
A simple example
Imagine that you want to understand the relationship between car weight
and car fuel efficiency (i.e., miles per gallon); how is fuel efficiency
To answer this question, you could obtain a dataset with several different
cars, and attempt to identify a relationship between weight (which we’ll call
A good starting point would be to simply plot the data, so first, we’ll create
library(ggplot2)
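A minimal sketch of that first plot, assuming the mtcars dataset that ships with R (the same dataset used with train() later in this chapter), with weight (wt) on the x axis and miles per gallon (mpg) on the y axis. The object name p.mtcars is made up for illustration.

```r
library(ggplot2)

# One point per car: weight on the x axis, fuel efficiency on the y axis.
p.mtcars <- ggplot(data = mtcars, aes(x = wt, y = mpg)) +
  geom_point()

print(p.mtcars)
```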
Just examining the data visually, it’s pretty clear that there’s some
Ultimately, once we have this mathematical function (a model), we can use
What I just wrote in the last few paragraphs about “estimating functions”
most of that math is taken care of for you. What I mean is that for the most
you. You just need to know which functions to use, and when to use them.
Here’s an analogy: if you were a carpenter, you wouldn’t need to build your
own power tools. You wouldn’t need to build your own drill and power saw.
tools from scratch. You could just go and buy a drill and saw “off the
shelf.” To be clear, you’d still need to learn how to use those tools, but you
operate them.
When you’re first getting started with machine learning, the situation is
very similar: you can learn to use some of the tools, without knowing the
Ok, so you don’t need to know that much math to get started, but you’re not
entirely off the hook. As I noted above, you still need to know how to use
and they are not always consistent in how they work. The syntax for some
of the machine learning tools is very awkward, and syntax from one tool to
the next is not always the same. If you don’t know where to start, machine
learning in R.
A quick introduction to caret
The caret package is a set of tools for building machine learning models
As the name implies, the caret package gives you a toolkit for building
• Data splitting
• Model evaluation
• Variable selection
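To give a flavor of the first item in that list, here is a minimal sketch of data splitting with caret's createDataPartition(). The 80/20 split, the seed, and the object names (idx.train, df.train, df.test) are assumptions chosen for illustration; mtcars is used only because it ships with R.

```r
library(caret)

set.seed(42)  # arbitrary seed so the random split can be reproduced

# Row indices for roughly 80% of the data, sampled with respect to mpg.
idx.train <- createDataPartition(mtcars$mpg, p = 0.8, list = FALSE)

df.train <- mtcars[idx.train, ]   # rows used to build a model
df.test  <- mtcars[-idx.train, ]  # held-out rows used to evaluate it

nrow(df.train)
nrow(df.test)
```

Every row lands in exactly one of the two data frames, so the model can later be checked on cars it has never seen.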
caret simplifies machine learning in R
While caret has broad functionality, the real reason to use caret is that it’s
that most of R’s different machine learning tools have different interfaces.
They almost all “work” a little differently from one another: the syntax is
slightly different from one modeling toolkit to the next; tools for different
parts of the machine learning workflow don’t always “work well” together;
caret solves this problem. To simplify the process, caret provides tools
for almost every part of the model building process, and moreover,
For example, caret provides a simple, common interface to almost every
validation and parameter tuning – are built directly into this common
interface.
To say that more simply, caret provides you with an easy-to-use toolkit for
building many different model types and executing critical parts of the ML
this iterative workflow will allow you to develop good models faster, with
caret’s syntax
Now that you’ve been introduced to caret, let’s return to the example
Again, imagine you want to learn the relationship between mpg and wt. As
will simplify the process somewhat: we’re going to assume that the
In terms of our modeling effort, this means that we’ll be using linear
Without going into the deep details of linear regression, let’s look at how
train() is the function that we use to “train” the model. That is, train()
is the function that will “learn” the relationship between mpg and wt.
Here is the syntax for a linear regression model, regressing mpg on wt.
#~~~~~~~~~~~~~~~~~~~~~~~~~~
# Build model using train()
#~~~~~~~~~~~~~~~~~~~~~~~~~~
library(caret)
# regress mpg on wt using linear regression ("lm")
model.mtcars_lm <- train(mpg ~ wt, data = mtcars, method = "lm")
That’s it. The syntax for building a linear regression is extremely simple
with caret.
Now that we have a simple model, let’s quickly extract the regression
coefficients and plot the model (i.e., plot the linear function that describes
Retrieve regression coefficients and plot the model
#~~~~~~~~~~~~~~~~~~~~~~~~~~
# Retrieve coefficients for
# - slope
# - intercept
#~~~~~~~~~~~~~~~~~~~~~~~~~~
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Plot scatterplot and regression line
# using ggplot()
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
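The two steps named in the comments above can be sketched as follows. This is one plausible way to do it, not necessarily the original listing: it assumes the model was trained with train() and method = "lm", relies on the fact that a train object stores the fitted model in its finalModel element, and uses illustrative object names (model.mtcars_lm, coef.icept, coef.slope, p.model).

```r
library(caret)
library(ggplot2)

set.seed(42)  # arbitrary seed; resampling inside train() is random
model.mtcars_lm <- train(mpg ~ wt, data = mtcars, method = "lm")

# The fitted lm object lives in $finalModel; coef() returns intercept, then slope.
coef.icept <- coef(model.mtcars_lm$finalModel)[1]
coef.slope <- coef(model.mtcars_lm$finalModel)[2]

# Scatterplot of the data with the fitted regression line drawn on top.
p.model <- ggplot(data = mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_abline(slope = coef.slope, intercept = coef.icept, color = "red")

print(p.model)
```

The slope comes out negative, which matches what the scatterplot suggests: heavier cars tend to get fewer miles per gallon.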
How caret’s syntax works
Now, let’s look more closely at the syntax and how it works.
When training a model using train(), you only need to tell it a few
things:
• The target variable you’re trying to predict (e.g., the mpg variable)
• The machine learning method you want to use (in this case “linear
regression”)
Formula notation
In caret’s syntax, you identify the target variable and input variables
Effectively, y ~ x tells caret “I want to predict y on the basis of a single
input, x.”
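A formula is an ordinary R object, so you can store it and inspect it before handing it to a modeling function. The short sketch below uses base R's lm() instead of caret so that it runs without extra packages; the names f.mpg_wt and fit are made up for illustration.

```r
# Store the formula: the target goes on the left of ~, the input on the right.
f.mpg_wt <- mpg ~ wt

class(f.mpg_wt)     # "formula"
all.vars(f.mpg_wt)  # "mpg" "wt"

# The same formula object can be passed to a modeling function.
fit <- lm(f.mpg_wt, data = mtcars)
coef(fit)
```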
Now, with this knowledge about caret’s formula syntax, let’s reexamine
the above code. Because we want to predict mpg on the basis of wt, we
use the formula mpg ~ wt. Again, this line of code is the “formula” that tells
train() our target variable and our input variable. If we translate this line
of code into English, we’re effectively telling train(), “build a model that
The train() function also has a data = parameter. This basically tells
the train() function what dataset we’re using to build the model. Said
differently, if we’re using the formula mpg ~ wt to indicate the target and
predictor variables, then we’re using the data = parameter to tell caret
So basically, data = mtcars tells the caret function that the data and
Although it’s beyond the scope of this book to discuss all of the possible
learning methods that we could use here, there are many different
that, train() would still predict mpg on the basis of wt, but would use a
Again, it’s beyond the scope of this book to discuss all of the different
model types. However, as you learn more about machine learning, and
want to try out more advanced machine learning techniques, this is how
you can implement them. You simply change the learning method by
your data, you just type in “lm” for the argument to method =; if you want
caret’s syntax allows you to very easily change the learning method. In
turn, this allows you to “try out” and evaluate many different learning
methods rapidly and iteratively. You can just re-run your code with different
values for the method parameter, and compare the results for each
method.
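As a rough sketch of that “re-run and compare” loop, the code below trains the same formula with two different values of method and collects the resampled error for each. The choice of "knn" as the second method, the seed, and the object names are assumptions made for illustration; the numbers you get will depend on the random resampling.

```r
library(caret)

# Two learning methods to compare; "lm" is linear regression and
# "knn" is k-nearest neighbors (chosen here only as an example).
methods <- c("lm", "knn")

models <- lapply(methods, function(m) {
  set.seed(42)  # same arbitrary seed for each method
  train(mpg ~ wt, data = mtcars, method = m)
})
names(models) <- methods

# Best resampled RMSE for each method (lower is better).
rmse.by_method <- sapply(models, function(m) min(m$results$RMSE))
rmse.by_method
```

Only the method string changes between runs; the formula and the data stay the same.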
Next steps
Keep in mind though, if you’re new to machine learning, there’s still lots
WHAT’S THE DIFFERENCE BETWEEN MACHINE LEARNING, STATISTICS, AND DATA MINING?
In the last few chapters, you’ve learned a little bit about machine learning,
why it’s valuable, and how to get started with machine learning in R.
However, machine learning isn’t the only subject in which we use data for
class has heard a similar definition about statistics itself. And if you talk to
someone who works in data-mining, you’ll hear the same thing: data
mining is about using data to make predictions and draw conclusions from
data.
This raises the question: what is the difference between machine learning,
The long answer has a bit of nuance (which we’ll discuss soon), but the
Larry Wasserman wrote a blog post about this a few years ago. If you’re
one of the premier universities for stats and ML. Moreover, I’ll point out that
universities in the world, I’d say that he’s uniquely positioned to answer this
question.
mining). However, I’ll add that his answer applies equally well to “data
mining”.
Having said that, Wasserman notes that if you look at some of the details,
With that in mind, let’s review the major similarities as well as the
differences, both to prove the point that they really are extremely similar,
First, let’s examine what makes machine learning, statistics, and data
The primary reason that these three subjects are effectively the same is
that they cover almost exactly the same material and use almost exactly
Let me give you an example. If you examine the table of contents of the
examine the contents a bit more closely, you’ll see sections concerning
Next, if you look at the syllabus for Andrew Ng’s machine learning course
on Coursera, you’ll see all of the topics I just listed. The material
covered in Ng’s machine learning course is almost exactly the same as the
learning” experts. With only minor exceptions, the material is exactly the
same.
If you perform the same exercise and look at the table of contents for the
book Data Mining by Witten, Frank, and Hall, you’ll find almost exactly the
So to summarize, the material covered in machine learning, statistics (and
statistical learning in particular), and data mining is so similar that the
three subjects are nearly identical.
So what’s going on here? They’re nearly identical?! Why are there three
The best answer is that even though they use the same methods, they
Let me give you an analogy: “machine learning” and “statistics” are like
analogy). They are quite nearly identical. And in fact, at a deep level, they
However, even though identical twins are identical in some sense, they
might still dress differently and hang out with different people.
To relate this to the topic of ML, statistics, and data mining, what I mean by
“dressed differently” is that they use different words and terms, and have
different notation. Much like human twins that have different groups of
friends, people in ML, stats, and data mining also have different and
emphasis. Although they use almost the exact same methods, they tend to
emphasize different things. A different way of saying this is that although
they use almost exactly the same methods, tools, and techniques, these
Machine learning is focused on software and systems
Actually, I’ll go a step further and state that machine learning isn’t just
This greater emphasis on systems (i.e., computer programs that learn from
though, I’ll repeat my caveat that these are very broad generalizations.
The purpose of data mining is finding patterns in databases
As I already noted, the tools that data miners use are almost exactly the
same as the tools used by statisticians and machine learning experts. The
major difference is how and why these tools are used. So how do data
mining operation – for example a gold mining operation – large piles of dirt
and material are extracted from the mine and then the miners sift through
the dirt to find nuggets of gold. Here you can think of a data warehouse as
the mine: it contains mostly useless data. This useless data is like the “dirt
and rubble” in a mine. Then after being extracted from the database, this
Sometimes, finding these insights requires simple exploratory data
learning.
So again, we find that the tools are almost identical, but the purpose is
To summarize: although ML, stats, and data mining use the same
methods, they have slightly different philosophies about how, when, and
Not only do they have slightly different emphases and purposes, they also
Professor Rob Tibshirani – one of the authors of the excellent book An
instructive. Most specifically, the chart drives home the point I’ve been
making in this chapter: machine learning and statistics are quite nearly
identical. They are so similar that they discuss almost all of the same
topics, methods, and concepts, but just have different names for many of
them.
Having said that, I want to add that frequently these terms are used
mining).
For example, if you examine Andrew Ng’s machine learning course, you’ll
learning culture”, but he readily uses the terms attributed to the “statistical
learning culture.” So, it appears that even though there are slight cultural
ML, stats, and data mining tend to favor different tools
Next, let’s talk about tools. There are some very rough generalizations we
can make about tool choices between statisticians, data miners, and
machine learning practitioners (but as I’ve pointed out several times, these
In the machine learning camp, you’re more likely to see Python and
machine learning, many of them use one of these two languages. For
On the statistics side, you’re much more likely to see R. You’re very likely
statistics classes at colleges and universities. Also, several of the best
I’ll repeat though that these are very hasty generalizations. You’ll see
you’re seeing people learn several tools. Many of the best people in stats
And of course, the list certainly isn’t limited to Python, R, and Matlab. In
both academia and industry, you’ll find statisticians, ML experts, and data
miners using a broad array of other technologies like SAS, SPSS, C++,
So while I’ll suggest that Python and Matlab are more popular among the
Finally, let’s talk briefly about the size and scale of the problems these
Andrew Gelman – who is also a very well respected professor of statistics
And finally, data mining also emphasizes large scale data. While I won’t
offer any quotes here, I can say from personal experience that people who
quite common to subset these large datasets down, to take samples, etc).
Again: there are far more similarities than differences
It should be clear by now that machine learning, statistics, and data
mining are all fundamentally similar. And to the extent that they have
converge. As Wasserman noted in his blog article, it’s clear that members
the lines between these fields are becoming increasingly blurred. So even
What this means for you, if you’re getting started with data science, is that
you can safely treat machine learning, statistics, and data mining as “the
same.” Learn the “surface level” differences so you can communicate with
people from the “different cultures,” but ultimately, treat machine learning,
statistics, and data mining each as subjects that you can learn from as you
TO MASTER DATA SCIENCE, YOU
NEED TO PRACTICE
There you have it. In this book, you’ve seen almost everything that you
Having said that though, you can’t learn data science in a day.
That’s what you need. In order to get a data science job, you need to
master the basic charts and graphs. You need to master data
manipulation.
You need to be able to write the code for these things “with your eyes
closed.”
So, review the material in this book. Run the code. Continue reading the
But when you’re ready to get to the next level, you’ll need a system for
practicing data science. You’ll need a system to develop intuition for the
Sharp Sight has just such a system. Several times per year, we open the
doors to a course that will show you how to practice. A course that will
Interested? Watch your inbox …
POSTSCRIPT: SEND
US YOUR
QUESTIONS AND PROBLEMS
What courses have you taken, and why aren’t they working for you?