Machine Learning With Applications in R

Contents

Preface
Some Terminology
Logistic Regression
Continuous Outcomes
Squared Error
Absolute Error
Negative Log-likelihood
R Example
Categorical Outcomes
Misclassification
Binomial log-likelihood
Exponential
Hinge Loss
Regularization
R Example
Bias-Variance Tradeoff
The Tradeoff
High Variance
High Bias
Cross-Validation
Leave-one-out Cross-Validation
Bootstrap
Other Stuff
Process Overview
Data Preparation
Feature Scaling
Feature Engineering
Discretization
Model Selection
Model Assessment
R Implementation
k-nearest Neighbors
Neural Nets
Other
Unsupervised Learning
Clustering
Graphical Structure
Imputation
Ensembles
Bagging
Boosting
Stacking
Bayesian Approaches
More Stuff
Summary
Cautionary Notes
Some Guidelines
Conclusion
Preface
The purpose of this document is to provide a conceptual introduction to statistical or machine learning (ML) techniques for those who might not normally be exposed to such approaches during their typical required statistical training. Machine learning can be described as a form of statistics, often even utilizing well-known and familiar techniques, that has a bit of a different focus than traditional analytical practice in the social sciences and other disciplines. The key notion is that flexible, automatic approaches are used to detect patterns within the data, with a primary focus on making predictions on future data.
If one surveys the techniques available in ML without context, the sheer number of approaches, along with the various tweaks and variations of them, will surely be overwhelming. However, the specifics of the techniques are not as important as the more general concepts that are applicable in nearly every ML setting, and indeed, many traditional ones as well. While there will be examples using the R statistical environment and descriptions of a few specific approaches, the focus here is more on ideas than application, and kept at the conceptual level as much as possible. However, some applied examples of more common techniques will be provided in detail.
As for prerequisite knowledge, I will assume a basic familiarity with regression analyses as typically presented to those in applied disciplines, particularly those of the social sciences. Regarding programming, one should be at least somewhat familiar with using R and RStudio, and either of my introductions here and here will be plenty. Note that I won't do as much explaining of the R code as in those introductions, and in some cases I will be more concerned with getting to a result than clearly detailing the path to it. Armed with the introductory knowledge found in those documents, if there are parts of the R code that are unclear, one has the tools to investigate and discover the details for oneself, which results in more learning anyway.
Respective departments of computer science and statistics now overlap more than ever as more relaxed views seem to prevail today, but there are potential drawbacks to placing too much emphasis on either approach historically associated with them. Models that just work have the potential to be dangerous if they are little understood. Situations in which much time is spent sorting out the details of an ill-fitting model suffer the converse problem: some (though often perhaps very little actual) understanding with little pragmatism. While this paper will focus on more algorithmic approaches, guidance will be provided with an eye toward their use in situations where the typical data modeling approach would be applied, thereby hopefully shedding some light on a path toward obtaining the best of both worlds.
Some Terminology
For those used to statistical concepts such as dependent variables, clustering, and predictors, you will have to get used to some differences in terminology, such as targets, unsupervised learning, and inputs. This doesn't take too much effort, even if it is somewhat annoying when one is first starting out. I won't be too beholden to either set of terms in this paper, and it should be clear from the context what is being referred to. Initially I will mostly use the non-ML terms and note the ML version in brackets to help the orientation along.
y ∼ N(μ, σ²)
μ = Xβ
Where y is a normally distributed vector of responses [target] with mean μ and constant variance σ². X is a typical model matrix, i.e. a matrix of predictor variables in which the first column is a vector of 1s for the intercept [bias], and β is the vector of coefficients [weights] corresponding to the intercept and predictors in the model.
What might be given less focus in applied courses however is how
often it wont be the best tool for the job or even applicable in the form
it is presented. Because of this many applied researchers are still hammering screws with it, even as the explosion of statistical techniques
of the past quarter century has rendered obsolete many current introductory statistical texts that are written for disciplines. Even so, the
concepts one gains in learning the standard linear model are generalizable, and even a few modifications of it, while still maintaining the
basic design, can render it still very effective in situations where it is
appropriate.
Typically in fitting [learning] a model we tend to talk about R-squared and the statistical significance of the coefficients for a small number of predictors. For our purposes, let the focus instead be on the residual sum of squares, with an eye toward its reduction and model comparison. We will not have a situation in which we are only considering one model fit, and so must find one that reduces the sum of the squared errors but without unnecessary complexity and overfitting, concepts we'll return to later. Furthermore, we will be much more
concerned with the model fit on new data [generalization].
Logistic Regression
Logistic regression is often used where the response is categorical in nature, usually with a binary outcome in which some event occurs or does not occur [label]. One could still use the standard linear model here, but you could end up with nonsensical predictions that fall outside the 0-1 range for the probability of the event occurring, to go along with other shortcomings. Furthermore, no effort is saved nor understanding lost in using a logistic regression over the linear probability model. It is also good to keep logistic regression in mind as we discuss other classification approaches later on.
Logistic regression is also typically covered in an introduction to statistics for applied disciplines because of the pervasiveness of binary responses, or responses that have been made as such. Like the standard linear model, just a few modifications can enable one to use it in a way that provides better performance, particularly with new data. The gist is, it is not the case that we have to abandon familiar tools in the move toward a machine learning perspective.
Generalized additive models (GAMs), which allow smooth, nonlinear functions of the predictors, are another familiar extension; really this just means that we have a little more work to do to get the desired level of understanding. GAMs can be seen as a segue toward more black box/algorithmic techniques. Compared to some of those techniques in machine learning, GAMs are notably more interpretable, though perhaps less so than GLMs. Also, part of the estimation process includes regularization and validation in determining the nature of the smooth function, topics we will return to later.
Continuous Outcomes
Squared Error
The classic loss function for linear models with continuous response is
the squared error loss function, or the residual sum of squares.
L(Y, f(X)) = Σ (y − f(X))²
Absolute Error
For an approach more robust to extreme observations, we might
choose absolute rather than squared error as follows. In this case,
predictions are a conditional median rather than a conditional mean.
L(Y, f(X)) = Σ |y − f(X)|
Negative Log-likelihood
We can also think of our usual likelihood methods learned in a standard applied statistics course as incorporating a loss function that is
the negative log-likelihood pertaining to the model of interest. If we
assume a normal distribution for the response we can note the loss
function as:
L(Y, f(X)) = n ln σ + 1/(2σ²) Σ (y − f(X))²
R Example
The following provides code that one could use with the optim function in R to find estimates of regression coefficients (beta) that minimize the squared error. X is a design matrix of our predictor variables with the first column a vector of 1s in order to estimate the intercept. y is the continuous variable to be modeled.
sqerrloss = function(beta, X, y) {
mu = X %*% beta
sum((y - mu)^2)
}
set.seed(123)
X = cbind(1, rnorm(100), rnorm(100))
y = rowSums(X[, -1] + rnorm(100))
out1 = optim(par = c(0, 0, 0), fn = sqerrloss, X = X, y = y)
out2 = lm(y ~ X[, 2] + X[, 3]) # check with lm
rbind(c(out1$par, out1$value), c(coef(out2), sum(resid(out2)^2)))
##      (Intercept)  X[, 2]  X[, 3]
## [1,]      0.2702  0.7336   1.048  351.1
## [2,]      0.2701  0.7337   1.048  351.1
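As a companion to the above (this example is not in the original text), one could instead minimize the negative log-likelihood loss noted earlier; with a normal response the coefficient estimates should be essentially identical. Here the log of σ is optimized so the scale parameter stays positive.

nllloss = function(pars, X, y) {
  beta = pars[-1]
  sigma = exp(pars[1])   # optimize log(sigma) so that sigma stays positive
  mu = X %*% beta
  -sum(dnorm(y, mean = mu, sd = sigma, log = TRUE))
}
out_nll = optim(par = c(0, 0, 0, 0), fn = nllloss, X = X, y = y)
out_nll$par[-1]   # compare with out1$par and coef(out2)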
Categorical Outcomes
Here we'll also look at some loss functions useful in classification problems. Note that a given loss function is not necessarily exclusive to continuous vs. categorical outcomes.
Misclassification
Probably the most straightforward is misclassification, or 0-1 loss. If
we note f as the prediction, and for convenience we assume a [-1,1]
response instead of a [0,1] response:
L(Y, f(X)) = Σ I(y ≠ sign(f))
In the above, I is the indicator function and so we are summing
misclassifications.
Binomial log-likelihood
L(Y, f(X)) = log(1 + e^(−2yf))
The above is in deviance form, but is equivalent to binomial log
likelihood if y is on the 0-1 scale.
Exponential
Exponential loss is yet another loss function at our disposal.
L(Y, f(X)) = e^(−yf)
Hinge Loss
A final loss function to consider, typically used with support vector
machines, is the hinge loss function.
L(Y, f(X)) = max(1 − yf, 0)
Which of these might work best may be specific to the situation, but the gist is that they penalize negative values (misclassifications) more heavily and increasingly so the worse the misclassification (except for misclassification error, which penalizes all misclassifications equally), with their primary difference being how heavy that penalty is. Below is a depiction of these losses as functions of yf, taken from Hastie et al. (2009).
[Figure: the classification losses (misclassification, exponential, binomial deviance, squared error, and hinge/support vector) plotted as a function of the margin yf.]
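As a small illustration (not from the original), the losses above can be computed and plotted directly for a [-1, 1] coded response, where yf is the margin:

yf = seq(-2, 2, length.out = 200)
misclass = ifelse(yf < 0, 1, 0)        # 0-1 loss
expo = exp(-yf)                         # exponential loss
deviance = log(1 + exp(-2 * yf))        # binomial deviance form
hinge = pmax(1 - yf, 0)                 # hinge (support vector) loss
matplot(yf, cbind(misclass, expo, deviance, hinge),
        type = "l", lty = 1, xlab = "yf", ylab = "Loss")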
Regularization
IT IS IMPORTANT TO NOTE that a model fit to a single data set might do very well with the data at hand, but then suffer when predicting independent data. Also, oftentimes we are interested in a best subset of predictors among a great many, and in this scenario the estimated coefficients are overly optimistic. This general issue can be improved by shrinking estimates toward zero, such that some of the performance in the initial fit is sacrificed for improvement with regard to prediction.
Penalized estimation will provide estimates with some shrinkage, and we can use it with little additional effort with our common procedures. Concretely, let's apply this to the standard linear model, where we are finding estimates of β that minimize the squared error loss.

β̂ = arg min Σ(y − Xβ)²
In words, we're finding the coefficients that minimize the sum of the squared residuals. With the approach here we just add a penalty component to the procedure as follows.
β̂ = arg min Σ(y − Xβ)² + λ Σ_{j=1}^p |β_j|
In the above equation, λ is our penalty term, for which larger values will result in more shrinkage. It is applied to the L1 or Manhattan norm of the coefficients β_1, β_2 ... β_p, i.e. not including the intercept β_0, which is simply the sum of their absolute values (this approach is commonly referred to as the lasso). For generalized linear and additive models, we can conceptually express a penalized likelihood as follows:

l_p(β) = l(β) − λ Σ_{j=1}^p |β_j|
R Example
In the following example, we take a look at the lasso approach for a standard linear model. We add the regularization component, with a fixed penalty lambda = 0.1 for demonstration purposes. However, you should insert your own values for lambda in the optim line to see how the results are affected.
sqerrloss_reg = function(beta, X, y, lambda=.1){
mu = X%*%beta
sum((y-mu)^2) + lambda*sum(abs(beta[-1]))
}
out3 = optim(par=c(0,0,0), fn=sqerrloss_reg, X=X, y=y)
rbind(c(out1$par, out1$value),
c(coef(out2),sum(resid(out2)^2)),
c(out3$par, out3$value) )
##      (Intercept)  X[, 2]  X[, 3]
## [1,]      0.2702  0.7336   1.048  351.1
## [2,]      0.2701  0.7337   1.048  351.1
## [3,]      0.2704  0.7328   1.047  351.3
From the above, we can see in this case that the predictor coefficients have indeed shrunk toward zero slightly while the residual sum
of squares has increased just a tad.
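In practice the lasso is usually fit with a dedicated package rather than optim. The following is a rough sketch (not from the original) using glmnet, which supplies its own intercept and uses a somewhat different penalty parameterization, so the coefficients will not match the above exactly.

library(glmnet)
fit_lasso = glmnet(X[, -1], y, alpha = 1, lambda = 0.1)   # alpha = 1 gives the lasso penalty
coef(fit_lasso)
# cv.glmnet(X[, -1], y) would instead choose lambda via cross-validation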
Bias-Variance Tradeoff
IN MOST OF SCIENCE we are concerned with reducing uncertainty in our knowledge of some phenomenon. The more we know about the factors involved or related to some outcome of interest, the better we can predict that outcome upon the influx of new information. The initial step is to take the data at hand, and determine how well a model or set of models fits the data in various fashions. In many applications however, this part is also more or less the end of the game.
Unfortunately, such an approach in which we only fit models to one data set does not give a very good sense of generalization performance, i.e. the performance we would see with new data. While typically not reported, most researchers, if they are spending appropriate time with the data, are actually testing a great many models, for which the best is then provided in detail in the end report. Without some check of generalization performance however, such performance is overstated when it comes to new data.
In the following consider a standard linear model scenario, e.g.
with squared-error loss function and perhaps some regularization,
and a data set in which we split the data in some random fashion into
a training set, for initial model fit, and a test set, which is a separate
and independent data set, to measure generalization performance.
We note training error as the (average) loss over the training set, and
test error as the (average) prediction error obtained when a model
resulting from the training data is fit to the test data. So in addition to
the previously noted goal of finding the best model (model selection),
we are interested further in estimating the prediction error with new
data (model performance).
The Tradeoff

[Figure: prediction error as a function of model complexity, ranging from the high bias/low variance regime at low complexity to the low bias/high variance regime at high complexity.]
Different problems (classification with 0-1 loss vs. a continuous response with squared error loss) and different techniques (a standard linear model vs. a regularized fit) will exhibit different bias-variance relationships. See Friedman (1996), On Bias, Variance, 0/1 Loss and the Curse of Dimensionality, for the unusual situations that can arise in dealing with classification error with regard to bias and variance.
Let's assume a regularized linear model with a standard data split into training and test sets. We will describe different scenarios with possible solutions.
[Figure: the classic illustration of bias and variance, showing points scattered about a target for each combination of high or low bias and high or low variance.]

Starting with the worst case scenario, poor models may exhibit both high bias and high variance. One thing that will not help this situation (perhaps contrary to intuition) is adding more data, i.e. increasing N. You can't make a silk purse out of a sow's ear (usually; see https://libraries.mit.edu/archives/exhibits/purse/), and adding more data just gives you a more accurate picture of how awful your model is. One might need to rework the model, e.g. adding new predictors or creating them via interaction terms, polynomials, or other smooth functions as in additive models, or simply collecting better and/or more relevant data.

High Variance

When variance is a problem, our training error is low while test error is relatively high (an overfitting problem). Implementing more shrinkage or other penalization of model complexity may help with the issue. In this case more data may help as well.

High Bias

With bias issues our training error is high and test error is not too different from the training error (an underfitting problem). Adding new predictors/features, interaction terms, polynomials etc. can help here. Additionally, reducing the penalty parameter would also work with even less effort, though generally it should be estimated rather than explicitly set.

Cross-Validation

As noted in the previous section, in machine learning approaches we are particularly concerned with prediction error on new data. The simplest validation approach would be to split the data available into a training set and a test set, as described previously.

One technique that might be utilized for larger data sets is to split the data into training, validation and final test sets. For example, one might take the original data and create something like a 60-20-20% split to create the needed data sets. The purpose of the initial validation set is to select the optimal model and determine the values of tuning parameters. These are parameters which generally deal with how complex a model one will allow, but for which one would have little inkling as to what they should be set at beforehand (e.g. our shrinkage parameter λ). We select models/tuning parameters that minimize the validation set error, and once the model is chosen, examine test set error performance. In this way performance assessment is still independent of the model development process.
K-fold Cross-Validation
In many cases we don't have enough data for such a split, and the split percentages are arbitrary anyway, with results specific to the particular split chosen. Instead we can take a typical data set and randomly split it into K = 10 equal-sized (or nearly so) parts. Take the first nine partitions and use them as the training set. With the chosen model, make predictions on the test set (the tenth partition). Now do the same, but this time use the 9th partition as the holdout set. Repeat the process until each of the initial 10 partitions of data has been used as the test set. Average the error across all procedures for our estimate of prediction error. With enough data, this (and the following methods) could be used as the validation procedure before eventual performance assessment on an independent test set with the final chosen model.
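In R, K-fold cross-validation is usually handled by a package rather than by hand; for example, with caret (used later in this document) a 10-fold setup can be declared once and passed to model fitting. The cv_opts name is an assumption chosen to match the later train() calls, whose control object is not otherwise shown in the text.

library(caret)
cv_opts = trainControl(method = "cv", number = 10)   # 10-fold cross-validation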
Leave-one-out Cross-Validation
Leave-one-out (LOO) cross-validation is pretty much the same thing but where K = N. In other words, we train a model on all observations except the ith one, assessing fit on the observation that was left out.
[Figure: an illustration of 3-fold cross-validation. Iteration 1 trains on partitions 1 and 2 and tests on partition 3; iteration 2 trains on partitions 1 and 3 and tests on partition 2; iteration 3 trains on partitions 2 and 3 and tests on partition 1.]
We then cycle through until all observations have been left out once to
obtain an average accuracy.
Of the two, K-fold may have relatively higher bias but less variance, while LOO has the converse problem, as well as possible computational issues. K-fold's additional bias is diminished with increasing sample sizes, and generally 5- or 10-fold cross-validation is recommended.
Bootstrap
With a bootstrap approach, we draw B random samples with replacement from our original data set, creating B bootstrapped data sets of
the same size as the original data. We use the B data sets as training
sets and, using the original data as the test set, average the prediction
error across the models.
Other Stuff
Along with the above there are variations such as repeated cross-validation, the .632 bootstrap and so forth. One would want to do a bit of investigating, but K-fold and bootstrap approaches generally perform well. If variable selection is part of the goal, one should be selecting subsets of predictors as part of the cross-validation process, not at some initial data step.
Consider, for example, the prediction of the occurrence of a rare disease. Guessing a non-event every time might result in 99.9% accuracy, but that isn't how we would prefer to go about assessing a classifier's performance. To demonstrate other sources of classification information, we will use the following 2x2 table that shows the values of some binary outcome (0 = non-event, 1 = event occurs) against the predictions made by some model for that response (an arbitrary model). Both a table of actual values, often called a confusion matrix, and an abstract version are provided.
             Predicted
               1     0
Actual   1    41    16
         0    21    13

             Predicted
               1     0
Actual   1     A     C
         0     B     D

True Positive, False Positive, True Negative, False Negative: above, these are A, B, D, and C respectively.
Accuracy: Number of correct classifications out of all predictions ((A+D)/Total). In the above example this would be (41+13)/91, about 59%.

Error Rate: 1 - Accuracy.

Sensitivity: the proportion of correctly predicted positives to all true positive events: A/(A+C). In the above example this would be 41/57, about 72%. High sensitivity would suggest a low type II error rate (see below), or high statistical power. Also known as the true positive rate.

Specificity: the proportion of correctly predicted negatives to all true negative events: D/(B+D). In the above example this would be 13/34, about 38%. High specificity would suggest a low type I error rate (see below). Also known as the true negative rate.

Positive Predictive Value (PPV): the proportion of true positives among those predicted positive: A/(A+B). In the above example this would be 41/62, about 66%.

Negative Predictive Value (NPV): the proportion of true negatives among those predicted negative: D/(C+D). In the above example this would be 13/29, about 45%.

Precision: See PPV.

Recall: See sensitivity.

Lift: the ratio of positive predictions given actual positives to the proportion of positive predictions out of the total: (A/(A+C))/((A+B)/Total). In the above example this would be (41/(41+16))/((41+21)/91), or 1.05.

F Score (F1 score): the harmonic mean of precision and recall: 2*(Precision*Recall)/(Precision+Recall). In the above example this would be 2*(.66*.72)/(.66+.72), about .69.

Type I Error Rate (false positive rate): the proportion of true negatives that are incorrectly predicted positive: B/(B+D). In the above example this would be 21/34, about 62%. Also known as alpha.

Type II Error Rate (false negative rate): the proportion of true positives that are incorrectly predicted negative: C/(C+A). In the above example this would be 16/57, about 28%. Also known as beta.

False Discovery Rate: the proportion of false positives among all positive predictions: B/(A+B). In the above example this would be 21/62, about 34%. Often used in multiple comparison testing in the context of ANOVA.

Phi coefficient: a measure of association: (A*D - B*C)/sqrt((A+C)*(D+B)*(A+B)*(D+C)). In the above example this would be .11.
                 Predicted
                   1                                  0
Actual   1   T+/N+ = TPR = sensitivity = recall   F−/N+ = FNR = Type II
         0   F+/N− = FPR = Type I                 T−/N− = TNR = specificity
There are many other measures, such as the area under a Receiver Operating Characteristic (ROC) curve, the odds ratio, and even more names for some of the above. The gist is that given any particular situation you might be interested in one or several of them, and it would generally be a good idea to look at a few.
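As a quick illustration (not in the original), several of the above can be computed directly in R from the example table, where A = 41, B = 21, C = 16, and D = 13:

A = 41; B = 21; C = 16; D = 13
accuracy    = (A + D) / (A + B + C + D)   # about 0.59
sensitivity = A / (A + C)                  # about 0.72
specificity = D / (B + D)                  # about 0.38
ppv         = A / (A + B)                  # about 0.66
npv         = D / (C + D)                  # about 0.45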
Process Overview
DESPITE THE FACADE OF A POLISHED PRODUCT one finds in published research, most statistical analysis of data is full of data preparation, starts and stops, debugging, reanalysis, tweaking and fine-tuning, etc. Statistical learning is no different in this sense. Before we begin with explicit examples, it might be best to give a general overview of the path we'll take.
Data Preparation
As with any typical statistical project, probably most of the time will be
spent preparing the data for analysis. Data is never ready to analyze
right away, and careful checks must be made in order to ensure the
integrity of the information. This would include correcting errors of
entry, noting extreme values, possibly imputing missing data and so
forth. In addition to these typical activities, we will discuss a couple
more things to think about during this initial data examination when
engaged in machine learning.
To begin, we will want a partition of the data such that we could safely conclude that the data in the test set comes from the same population as the training set. The training set is used to fit the initial models at various tuning parameter settings, with a best model being that which satisfies some criterion on the validation set (or via a general validation process). With the final model and parameters chosen, generalization error will be assessed with the performance of the final model on the test data.
Feature Scaling
Even with standard regression modeling, centering continuous variables (subtracting the mean) is a good idea so that intercepts and zero points in general are meaningful. Standardizing variables so that they have similar variances or ranges will help some procedures find their minimums faster. Another common transformation is min-max normalization, which will transform a variable to a new scale with some chosen minimum and maximum. Note that whatever approach is taken, it must be done after any explicit separation of data. So if you have separate training and test sets, they should be scaled separately.
Min-max normalization rescales a score via (score − min)/(max − min), optionally shifted to a chosen new minimum and maximum.
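A brief sketch of range scaling in R with caret's preProcess (hypothetical train_df and test_df data frames, each scaled separately as described above):

library(caret)
train_scaled = predict(preProcess(train_df, method = "range"), train_df)
test_scaled  = predict(preProcess(test_df,  method = "range"), test_df)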
Feature Engineering
If we're lucky we'll have ideas on potential combinations or other transformations of the predictors we have available. For example, in typical social science research there are two-way interactions one is often predisposed to try, or perhaps one can sum multiple items into a single scale score that may be more relevant. Another common technique is to use a dimension reduction scheme such as principal components, but this can (and probably should) actually be an implemented algorithm in the ML process.
One can implement a variety of such approaches in ML as well to create additional potentially relevant features, even automatically, but as a reminder, a key concern is overfitting, and doing broad construction of this sort with no contextual guidance would potentially be prone to such a pitfall. In other cases it may simply not be worth the time expense.
Discretization
While there may be some contextual exceptions to the rule, it is generally a pretty bad idea in standard statistical modeling to discretize/categorize continuous variables (see Harrell (2001) for a good summary of reasons why not to). However, some ML procedures will work better (or just faster) if dealing with discrete-valued predictors rather than continuous ones. Others even require them; for example, logic regression needs binary input. While one could pick arbitrary cut points, it is typically better to use some data-driven means of determining them.
Model Selection
With data prepared and ready to analyze, one can use a validation
process to come up with a viable model. Use an optimization procedure or a simple grid search over a set of specific values to examine
models at different tuning parameters. Perhaps make a finer search
once an initial range of good performing values is found, though one
should not split hairs over arbitrarily close performance. Select a best
model given some criterion such as overall accuracy, or if concerned
about overfitting, select the simplest model within one standard error
of the accuracy of the best, or perhaps the simplest within X% of the
best model. For highly skewed classes, one might need to use a different measure of performance besides accuracy. If one has a great many
predictor variables, one may use the model selection process to select
features that are most important.
Model Assessment
With tuning parameters/features chosen, we then examine performance on the independent test set (or via some validation procedure).
For classification problems, consider other statistics besides accuracy
as measures of performance, especially if classes are unbalanced. Consider other analytical techniques that are applicable and compare
performance among the different approaches. One can even combine
disparate models' predictions to possibly create an even better classifier.
The Dataset
We will use the wine data set from the UCI Machine Learning data
repository. The goal is to predict wine quality, of which there are 7 values (integers 3-9). We will turn this into a binary classification task to
predict whether a wine is good or not, which is arbitrarily chosen as
6 or higher. After getting the hang of things one might redo the analysis as a multiclass problem or even toy with regression approaches,
just note there are very few 3s or 9s so you really only have 5 values
to work with. The original data along with detailed description can
be found here, but aside from quality it contains predictors such as
residual sugar, alcohol content, acidity and other characteristics of the
wine.
The original data is separated into white and red data sets. I have
combined them and created additional variables: color and its numeric version white indicating white or red, and good, indicating scores
greater than or equal to 6 (denoted as Good). The following will show
some basic numeric information about the data.
wine = read.csv("http://www.nd.edu/~mclark19/learn/data/goodwine.csv")
summary(wine)
##                       Min.   1st Qu.  Median    Mean  3rd Qu.    Max.
## fixed.acidity          3.80     6.40     7.00    7.21     7.70   15.90
## volatile.acidity       0.08     0.23     0.29    0.34     0.40    1.58
## citric.acid            0.000    0.250    0.310   0.319    0.390   1.660
## residual.sugar         0.60     1.80     3.00    5.44     8.10   65.80
## chlorides              0.009    0.038    0.047   0.056    0.065   0.611
## free.sulfur.dioxide    1.0     17.0     29.0    30.5     41.0   289.0
## total.sulfur.dioxide   6       77      118     116      156     440
## density                0.987    0.992    0.995   0.995    0.997   1.039
## pH                     2.72     3.11     3.21    3.22     3.32    4.01
## sulphates              0.220    0.430    0.510   0.531    0.600   2.000
## alcohol                8.0      9.5     10.3    10.5     11.3    14.9
## quality                3.00     5.00     6.00    5.82     6.00    9.00
## white                  0.000    1.000    1.000   0.754    1.000   1.000
## color:  red 1599, white 4898
## good:   Bad 2384, Good 4113
R Implementation
I will use the caret package in R. Caret makes implementation of validation, data partitioning, performance assessment, prediction, and other procedures about as easy as it can be in this environment. However, caret mostly wraps other R packages that have more information about the specific functions underlying the process, and those should be investigated for additional information. Check out the caret home page for more detail. The methods selected here were chosen for breadth of approach, to give a good sense of the variety of techniques available.
In addition to caret, it's a good idea to use your computer's resources as much as possible, or some of these procedures may take a notably long time, and more so the more data you have. Caret will do this behind the scenes, but you first need to set things up. Say, for example, you have a quad core processor, meaning your processor has four cores essentially acting as independent CPUs. You can set up R for parallel processing with the following code, which will allow caret to allot tasks to three cores simultaneously.
library(doSNOW)
registerDoSNOW(makeCluster(3, type = "SOCK"))
library(corrplot)
corrplot(cor(wine[, -c(13, 15)]), method = "number", tl.cex = 0.5)
[Figure: correlation matrix plot of the wine predictors.]
We will put 80% of the observations into the training set. The function createDataPartition will produce indices to use as the training set. In addition to this, we will normalize the continuous variables to the [0,1] range. For the training data set, this will be done as part of the training process, so that any subsets under consideration are scaled separately, but for the test set we will go ahead and do it now.
library(caret)
set.seed(1234) #so that the indices will be the same when re-run
trainIndices = createDataPartition(wine$good, p = 0.8, list = F)
wanted = !colnames(wine) %in% c("free.sulfur.dioxide", "density", "quality",
"color", "white")
wine_train = wine[trainIndices, wanted] #remove quality and color, as well as density and others
wine_test = wine[-trainIndices, wanted]
Let's take an initial peek at how the predictors separate on the target. In the following I'm applying the preprocessing so as to get the transformed data. Again, we'll leave the preprocessing to the training part, but here it will put the variables on the same scale for visual display.
[Figure: distributions of the scaled features, split by whether the wine is classified as good or not.]
For the training set, it looks like alcohol content, volatile acidity and
chlorides separate most with regard to good classification. While this
might give us some food for thought, note that the figure does not give
insight into interaction effects, which methods such as trees will get at.
k-nearest Neighbors
Consider the typical distance matrix that is often used for cluster analysis of observations. If we choose something like Euclidean distance as a metric, each point in the matrix gives the value of how far an observation is from some other, given their respective values on a set of variables.
K-nearest neighbors approaches exploit this information for predictive purposes. Let us take a classification example, with k = 5 neighbors. For a given observation x_i, find the 5 closest neighbors in terms of Euclidean distance based on the predictor variables. The class that is predicted is whatever class the majority of those neighbors are labeled as (see the knn.ani function in the animation package for a visual demonstration). For continuous outcomes we might take the mean of those neighbors as the prediction.
So how many neighbors would work best? This is an example of a tuning parameter, i.e. k, for which we have no knowledge about its value without doing some initial digging. As such we will select the tuning parameter as part of the validation process.
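The train call that produced the following output does not appear in the source; a sketch consistent with it (using the cv_opts control object sketched in the cross-validation section, and the k values seen below) might look like the following. Note that older versions of caret named the tuning grid column .k, as with .mtry later.

results_knn = train(good ~ ., data = wine_train, method = "knn",
                    preProcess = "range", trControl = cv_opts,
                    tuneGrid = data.frame(k = c(3, 5, 7, 9, 10, 20, 50, 100)))
results_knn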
5199 samples
   9 predictors
   2 classes: 'Bad', 'Good'

Pre-processing: re-scaling to [0, 1]
Resampling: Cross-Validation (10 fold)
Summary of sample sizes: 4679, 4679, 4680, 4679, 4679, 4679, ...

Resampling results across tuning parameters:

  k    Accuracy  Kappa  Accuracy SD  Kappa SD
  3    0.8       0.5    0.02         0.04
  5    0.7       0.4    0.009        0.02
  7    0.7       0.4    0.02         0.04
  9    0.7       0.4    0.02         0.04
  10   0.7       0.4    0.02         0.04
  20   0.7       0.4    0.02         0.04
  50   0.7       0.4    0.02         0.04
  100  0.7       0.4    0.02         0.04
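The prediction on the test set is likewise not shown in the source; by analogy with the later examples it would be along these lines, producing the statistics below.

preds_knn = predict(results_knn, wine_test[, -10])
confusionMatrix(preds_knn, wine_test[, 10], positive = "Good")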
##                   Kappa : 0.407
##  Mcnemar's Test P-Value : 0.136
##
##             Sensitivity : 0.803
##             Specificity : 0.599
##          Pos Pred Value : 0.776
##          Neg Pred Value : 0.638
##              Prevalence : 0.633
##          Detection Rate : 0.508
##    Detection Prevalence : 0.656
dotPlot(varImp(results_knn))
[Figure: dot plot of variable importance for the knn model.]

Strengths
Intuitive approach.
Robust to outliers on the predictors.

Weaknesses
Susceptible to irrelevant features.
Susceptible to correlated inputs.
Ability to handle data of mixed types.
Big data. Though approaches are available that help in this regard.
Neural Nets
Neural nets have been around for a long while as a general concept
in artificial intelligence and even as a machine learning algorithm,
and often work quite well. In some sense they can be thought of as
nonlinear regression. Visually however, we can see them as layers of
inputs and outputs. Weighted combinations of the inputs are created
and put through some function (e.g. the sigmoid function) to produce
the next layer of inputs. This next layer goes through the same process
to produce either another layer or to predict the output, which is the
final layer. All the layers between the input and output are usually referred to as hidden layers. If there were no hidden layers then it becomes the standard regression problem.
One of the issues with neural nets is determining how many hidden layers and how many hidden units in a layer to use. Overly complex neural nets will suffer from high variance and will thus be less generalizable, particularly if there is less relevant information in the training data. Along with the complexity is the notion of weight decay; however, this is the same as the regularization function we discussed in a previous section, where a penalty term would be applied to a norm of the weights.
A comment about the following: if you are not set up for utilizing
multiple processors the following might be relatively slow. You can
replace the method with nnet and shorten the tuneLength to 3 which
will be faster without much loss of accuracy. Also, the function we're using has only one hidden layer, but the other neural net methods accessible via the caret package may allow for more, though the gains in prediction with additional layers are likely to be modest relative to complexity and computational cost. In addition, if the underlying function has additional arguments, you may pass those on in the
train function itself. Here I am increasing the maxit, or maximum
iterations, argument.
results_nnet = train(good~., data=wine_train, method="avNNet",
trControl=cv_opts, preProcess="range",
tuneLength=5, trace=F, maxit=1000)
results_nnet
[Figure: a neural net diagram with an input layer, a hidden layer, and the output.]
## 5199 samples
##    9 predictors
##    2 classes: 'Bad', 'Good'
##
## Pre-processing: re-scaling to [0, 1]
## Resampling: Cross-Validation (10 fold)
## Summary of sample sizes: 4679, 4679, 4680, 4679, 4679, 4679, ...
##
## Resampling results across tuning parameters:
##
##   size  decay  Accuracy  Kappa  Accuracy SD  Kappa SD
##   1     0      0.7       0.4    0.02         0.04
##   1     1e-04  0.7       0.4    0.02         0.04
##   1     0.001  0.7       0.4    0.02         0.04
##   1     0.01   0.7       0.4    0.02         0.04
##   1     0.1    0.7       0.4    0.02         0.04
##   3     0      0.8       0.5    0.02         0.04
##   3     1e-04  0.8       0.5    0.02         0.03
##   3     0.001  0.8       0.5    0.01         0.03
##   3     0.01   0.8       0.5    0.02         0.04
##   3     0.1    0.8       0.5    0.01         0.03
##   5     0      0.8       0.5    0.02         0.04
##   5     1e-04  0.8       0.5    0.02         0.04
##   5     0.001  0.8       0.5    0.02         0.05
##   5     0.01   0.8       0.5    0.02         0.03
##   5     0.1    0.8       0.5    0.01         0.03
##   7     0      0.8       0.5    0.02         0.05
##   7     1e-04  0.8       0.5    0.02         0.04
##   7     0.001  0.8       0.5    0.02         0.03
##   7     0.01   0.8       0.5    0.02         0.04
##   7     0.1    0.8       0.5    0.01         0.03
##   9     0      0.8       0.5    0.02         0.05
##   9     1e-04  0.8       0.5    0.01         0.02
##   9     0.001  0.8       0.5    0.02         0.04
##   9     0.01   0.8       0.5    0.01         0.03
##   9     0.1    0.8       0.5    0.01         0.03
We see that the best model has 9 hidden layer nodes and a decay parameter of 1e-04. Typically you might think of how many hidden units you want to examine in terms of the amount of data you have (i.e. the estimated parameters to N ratio), and here we have a decent amount. In this situation you might start with very broad values for the number of units (e.g. a sequence by 10s) and then narrow your focus (e.g. between 20 and 30), but with at least some weight decay you should be able to avoid overfitting. I was able to get an increase in test accuracy of about 1.5% using up to 50 hidden units.
preds_nnet = predict(results_nnet, wine_test[,-10])
confusionMatrix(preds_nnet, wine_test[,10], positive='Good')
## Confusion Matrix and Statistics
##
##           Reference
## Prediction Bad Good
##       Bad  295  113
##       Good 181  709
##
##                Accuracy : 0.773
##                  95% CI : (0.75, 0.796)
##     No Information Rate : 0.633
##     P-Value [Acc > NIR] : < 2e-16
##
##                   Kappa : 0.497
##  Mcnemar's Test P-Value : 9.32e-05
##
##             Sensitivity : 0.863
##             Specificity : 0.620
##          Pos Pred Value : 0.797
##          Neg Pred Value : 0.723
##              Prevalence : 0.633
##          Detection Rate : 0.546
##    Detection Prevalence : 0.686
##
##        'Positive' Class : Good
We note improved prediction with the neural net model relative to the k-nearest neighbors approach, with increases in accuracy
(77.35%), sensitivity, specificity etc.
Trees & Forests

[Figure: example classification trees — a simple tree splitting on X1 >= 5.75 and X2 < 3 into Negative/Positive nodes, and a tree for the wine data with an initial split on alcohol < 10.625 leading to Good leaf nodes.]
set.seed(1234)
rf_opts = data.frame(.mtry=c(2:6))
results_rf = train(good~., data=wine_train, method="rf",
preProcess='range',trControl=cv_opts, tuneGrid=rf_opts,
ntree=1000)
results_rf
## 5199 samples
##    9 predictors
##    2 classes: 'Bad', 'Good'
##
## Pre-processing: re-scaling to [0, 1]
## Resampling: Cross-Validation (10 fold)
## Summary of sample sizes: 4679, 4679, 4680, 4679, 4679, 4679, ...
##
## Resampling results across tuning parameters:
##
##   mtry  Accuracy  Kappa  Accuracy SD  Kappa SD
##   2     0.8       0.6    0.02         0.04
##   3     0.8       0.6    0.02         0.04
##   4     0.8       0.6    0.02         0.04
##   5     0.8       0.6    0.02         0.04
##   6     0.8       0.6    0.02         0.04
The initial results look promising with mtry = 3 producing the best
initial result. Now for application to the test set.
preds_rf = predict(results_rf, wine_test[,-10])
confusionMatrix(preds_rf, wine_test[,10], positive='Good')
## Confusion Matrix and Statistics
##
##           Reference
## Prediction Bad Good
##       Bad  333   98
##       Good 143  724
##
##                Accuracy : 0.814
##                  95% CI : (0.792, 0.835)
##     No Information Rate : 0.633
##     P-Value [Acc > NIR] : < 2e-16
##
##                   Kappa : 0.592
##  Mcnemar's Test P-Value : 0.00459
##
##             Sensitivity : 0.881
##             Specificity : 0.700
##          Pos Pred Value : 0.835
##          Neg Pred Value : 0.773
##              Prevalence : 0.633
##          Detection Rate : 0.558
##    Detection Prevalence : 0.668
##
##        'Positive' Class : Good
This is our best result so far, with 81.43% accuracy and a lower bound well beyond the 63% we'd have by guessing the majority class. Random forests do not suffer from some of the data-specific issues that might be influencing the other approaches, such as irrelevant and correlated predictors, and furthermore benefit from the combined information of many models. Such performance increases are not a given, but random forests are generally a good method to consider given their flexibility.
Incidentally, the underlying randomForest function here allows one to assess variable importance in a different manner, and there are other functions used by caret that can produce their own metrics as well. In this case, randomForest can provide importance based on a version of the decrease-in-accuracy approach we talked about before (as well as another index known as gini impurity). The same two predictors are found to be most important, and notably more so than the others: alcohol and volatile.acidity.
Support Vector Machines
Support Vector Machines (SVM) will be our last example, and is perhaps the least intuitive. SVMs seek to map the input space to a higher dimension via a kernel function, and in that transformed feature space, find a hyperplane that will result in maximal separation of the data. To better understand this process, consider an example of two inputs, x and y. Cursory inspection might show no easy separation between classes. However, if we can map the data to a higher dimension, we might find a clearer separation. Note that there are a number of choices in regard to the kernel function that does the mapping, but in that higher dimension the goal is the same: find a hyperplane that best separates the classes.
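As with the earlier models, the train call is not shown in the source; a sketch consistent with the linear-kernel results printed below (tuning the cost parameter C over the values shown) might be:

results_svm = train(good ~ ., data = wine_train, method = "svmLinear",
                    preProcess = "range", trControl = cv_opts,
                    tuneGrid = data.frame(C = c(0.2, 0.5, 1, 2, 4)))
results_svm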
## 5199 samples
##    9 predictors
##    2 classes: 'Bad', 'Good'
##
## Pre-processing: re-scaling to [0, 1]
## Resampling: Cross-Validation (10 fold)
## Summary of sample sizes: 4679, 4679, 4680, 4679, 4679, 4679, ...
##
## Resampling results across tuning parameters:
##
##   C    Accuracy  Kappa  Accuracy SD  Kappa SD
##   0.2  0.7       0.4    0.02         0.05
##   0.5  0.7       0.4    0.02         0.05
##   1    0.7       0.4    0.02         0.05
##   2    0.7       0.4    0.02         0.04
##   4    0.7       0.4    0.02         0.05
Results for the initial support vector machine do not match the
random forest for this data set, with accuracy of 74.5%. However, you
might choose a different kernel than the linear one used here, as well
as tinker with other options.
Other
IN THIS SECTION I NOTE SOME OTHER TECHNIQUES one may come across, and others that will provide additional insight into machine learning applications.
Unsupervised Learning
Unsupervised learning generally speaking involves techniques in which
we are utilizing unlabeled data. In this case we have our typical set of
features we are interested in, but no particular response to map them
to. In this situation we are more interested in the discovery of structure
within the data.
Clustering
Many of the techniques used in unsupervised learning are commonly taught in various applied disciplines as various forms of "cluster" analysis.
The gist is we are seeking an unknown class structure rather than
seeing how various inputs relate to a known class structure. Common
techniques include k-means, hierarchical clustering, and model based
approaches (e.g. mixture models).
Graphical Structure
Other techniques are available to understand structure among observations or features. Among the many approaches is the popular network analysis, where we can obtain similarities among observations and examine the structure of those data points visually, with observations that are more similar in nature placed closer together. In still other situations, we aren't so interested in the structure as we are in modeling the relationships and making predictions from the correlations of inputs.
[Figure: an example network graph of U.S. senators, with similar senators placed closer together.]

Imputation
We can also use these techniques when we are missing data, as a means to impute the missing values. While many are familiar with this problem and standard techniques for dealing with it, it may not be obvious that ML techniques may also be used. For example, both k-nearest neighbors and random forest techniques have been applied to imputation.
Beyond this we can infer values that are otherwise unavailable in a different sense. Consider Netflix, Amazon and other sites that suggest various products based on what you already like or are interested in. In this case the suggested products have missing values for the user, which are imputed or inferred based on their available data and other consumers similar to them who have rated the product in question. Such recommender systems are widely used these days.
Ensembles
In many situations we can combine the information of multiple models
to enhance prediction. This can take place within a specific technique,
e.g. random forests, or between models that utilize different techniques. I will discuss some standard techniques, but there are a great
variety of forms in which model combination might take place.
Bagging
Bagging, or bootstrap aggregation, uses bootstrap sampling to create
many data sets on which a procedure is then performed. The final
prediction is based on an average of all the predictions made for each
observation. In general, bagging helps reduce the variance while leaving bias unaffected. A conceptual outline of the procedure is provided.
Model Generation
For b = 1 to B iterations:
1. Sample N observations with replacement to create a bootstrapped data set of size N.
2. Apply the learning technique to the bootstrapped data set to create a model.
3. Store the model.

Classification
For each of the B models:
1. Predict the class of the N observations of the original data set.
2. Return, for each observation, the class predicted most often.
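A minimal sketch of the procedure in R, using classification trees via rpart on a hypothetical data frame dat with factor outcome y (an illustration under those assumptions, not code from the examples above):

library(rpart)
set.seed(1234)
B = 100
preds = replicate(B, {
  idx = sample(nrow(dat), replace = TRUE)            # bootstrap sample
  fit = rpart(y ~ ., data = dat[idx, ])              # fit a tree to the resample
  as.character(predict(fit, dat, type = "class"))    # predict the original data
})
# final prediction: the class predicted most often across the B models
bagged = apply(preds, 1, function(p) names(which.max(table(p))))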
Boosting
With boosting we take a different approach to refitting models. Consider a classification task in which we start with a basic learner and
apply it to the data of interest. Next the learner is refit, but with more
weight (importance) given to misclassified observations. This process
is repeated until some stopping rule is reached. An example of the
AdaBoost algorithm is provided (in the following I is the indicator
function).
Set the initial observation weights wi to 1/N.
For m = 1 to M:
  Fit a classifier f(m) to the training data using the weights wi.
  Compute the weighted error: err(m) = Σi wi I(yi ≠ f(m)(xi)) / Σi wi.
  Compute α(m) = log[(1 − err(m)) / err(m)].
  Update the weights: wi ← wi exp[α(m) I(yi ≠ f(m)(xi))].
The final classification is a weighted (by the α(m)) majority vote across the M classifiers.
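In R, boosting of this sort is available through packages such as gbm; a rough sketch (assuming a data frame dat with a 0/1 outcome y01; distribution = "adaboost" uses the exponential loss noted earlier):

library(gbm)
fit_boost = gbm(y01 ~ ., data = dat, distribution = "adaboost",
                n.trees = 1000, interaction.depth = 2, shrinkage = 0.01)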
Stacking
Stacking is a method that can generalize beyond a single fitting technique, though it can be applied in a fashion similar to boosting for a
single technique. While the term can refer to a specific technique, here
we will use it broadly to mean any method to combine models of different forms. Consider the four approaches we demonstrated earlier:
k-nearest neighbors, neural net, random forest, and the support vector
machine. We saw that they do not have the same predictive accuracy, though they weren't bad in general. Perhaps by combining their respective efforts, we could get even better prediction than using any particular one.
The issue then is how we might combine them. We really don't have to get too fancy with it, and can even use a simple voting scheme as in
bagging. For each observation, note the predicted class on new data
across models. The final prediction is the class that receives the most
votes. Another approach would be to use a weighted vote, where the
votes are weighted by their respective accuracies.
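A minimal sketch of such a vote across the four models (assuming prediction vectors preds_knn and preds_svm created in the same way as preds_nnet and preds_rf above):

votes = data.frame(knn = preds_knn, nnet = preds_nnet, rf = preds_rf, svm = preds_svm)
vote_class = apply(votes, 1, function(p) names(which.max(table(p))))
vote_class = factor(vote_class, levels = levels(wine_test$good))
confusionMatrix(vote_class, wine_test$good, positive = "Good")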
Another approach would use the predictions on the test set to create a data set of just the predicted probabilities from each learning
scheme. We can then use this data to train a meta-learner using the
test labels as the response. With the final meta-learner chosen, we then
retrain the original models on the entire data set (i.e. including the
test data). In this manner the initial models and the meta-learner are
trained separately and you get to eventually use the entire data set to
train the original models. Now when new data becomes available, you
feed them to the base level learners, get the predictions, and then feed
the predictions to the meta-learner for the final prediction.
Feature Selection & Importance

Once we obtain the initial data set, however, we may still want to trim the models under consideration.
In standard approaches we might have in the past used a forward or other selection procedure, or perhaps some more explicit model comparison approach. Concerning the content here, take for instance the lasso regularization procedure we spoke of earlier. Less important variables may be shrunk entirely to zero, and thus feature selection is an inherent part of the process, which is useful in the face of many, many predictors, sometimes outnumbering our sample points. As another example, consider any particular approach where the importance metric might be something like the drop in accuracy when the variable is excluded.
In the past, variable importance was given almost full weight in the discussion of typical applied research, based on statistical significance results from a one-shot analysis, with prediction on new data virtually ignored. We still have the ability to focus on feature performance with ML techniques, while shifting more of the focus toward prediction at the same time. For the uninitiated, it might require new ways of thinking about how one measures importance though.
Textual Analysis
In some situations the data of interest is not in a typical matrix form
but in the form of textual content, i.e. a corpus of documents (loosely
defined). In this case, much of the work (like in most analyses but perhaps even more so) will be in the data preparation, as text is rarely if
ever in a ready-to-analyze state. The eventual goals may include using
the word usage in the prediction of an outcome, perhaps modeling the
usage of select terms, or examining the structure of the term usage
graphically as in a network model. In addition, machine learning processes might be applied to sounds (acoustic data) to discern the speech
characteristics and other information.
Bayesian Approaches
It should be noted that the approaches outlined in this document
are couched in the frequentist tradition. But one should be aware
that many of the concepts and techniques would carry over into the
Bayesian perspective, and even some machine learning techniques
might only be feasible or make more sense within the Bayesian framework (e.g. online learning).
More Stuff
Aside from what has already been noted, there still exist a great many other ML topics and applications, such as data set shift, deep learning, semi-supervised learning, online learning, and many more.
Summary
Cautionary Notes
A standard mantra in machine learning and statistics generally is that there is no free lunch. All methods have certain assumptions, and if those don't hold the results will be problematic at best. Also, even if in truth learner A is better than B, B can often outperform A in the finite situations we actually deal with in practice. Furthermore, being more complicated doesn't mean a technique is better. As previously noted, incorporating regularization and cross-validation goes a long way toward improving standard techniques, and they may perform quite well in some situations.
Some Guidelines
Here are some thoughts to keep in mind, though these may be applicable to applied statistical practice generally.

More data beats a cleverer algorithm, but a lot of data is not enough by itself (Domingos, 2012).

Avoid overfitting.

Let the data speak for itself.

"Nothing is more practical than a good theory."

While getting used to ML, it might be best to start from simpler approaches and then work toward more black box ones that require more tuning. For example, naive Bayes → logistic regression → knn → svm.

Drawing up a visual path of your process is a good way to keep your analysis on the path to your goal. Some programs can even make this explicit (e.g. RapidMiner, Weka).

Keep the tuning parameter/feature selection process separate from the final test process for assessing error.

Learn multiple models, selecting the best or possibly combining them.
Conclusion
It is hoped that this document sheds some light on some areas that
might otherwise be unfamiliar to some applied researchers. The field
of statistics has rapidly evolved over the past two decades. The tools
available are myriad, and expanding all the time. Rather than being
overwhelmed, one should embrace the choice available, and have some
fun with your data.
References

Breiman, L. (2001). Statistical modeling: The two cultures (with comments and a rejoinder by the author). Statistical Science, 16(3):199–231.

Domingos, P. (2012). A few useful things to know about machine learning. Commun. ACM, 55(10).

Harrell, F. E. (2001). Regression Modeling Strategies: With Applications to Linear Models, Logistic Regression, and Survival Analysis. Springer.

Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2nd edition.

Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective. The MIT Press.

Wood, S. N. (2006). Generalized Additive Models: An Introduction with R, volume 66. CRC Press.
I had a lot of sources in putting this together, but I note these in particular as I feel they can either provide the appropriate context to begin,
help with the transition from more standard approaches, or serve as a
definitive reference for various methods.