ML_Introduction
Course Info
• Text book and resources
  • James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning (Vol. 112, pp. 3-7). New York: Springer.
  • http://faculty.marshall.usc.edu/gareth-james/ISL/
• Teaching Scheme
• Evaluation Scheme
Supervised Learning: data and corresponding labels are given
Supervised Learning: Important Concepts
• Data: labeled instances <xi, y>, e.g. emails marked spam/not spam
  • Training Set
  • Held-out Set
  • Test Set
• Features: attribute-value pairs which characterize each x
• Experimentation cycle (a minimal sketch follows this list)
  • Learn parameters (e.g. model probabilities) on the training set
  • (Tune hyper-parameters on the held-out set)
  • Compute accuracy on the test set
  • Very important: never “peek” at the test set!
• Evaluation
  • Accuracy: fraction of instances predicted correctly
• Overfitting and generalization
  • Want a classifier which does well on test data
  • Overfitting: fitting the training data very closely, but not generalizing well
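A minimal sketch of this experimentation cycle, assuming a generic scikit-learn classifier on synthetic data; the dataset, the model and the hyper-parameter values are illustrative only:

# Sketch of the train / held-out / test cycle (illustrative assumptions throughout).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=300, random_state=0)   # stand-in labeled data

# Split into training, held-out and test sets
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_held, X_test, y_held, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

best_model, best_acc = None, 0.0
for C in [0.01, 0.1, 1.0, 10.0]:                 # tune a hyper-parameter on the held-out set
    model = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)  # learn parameters on training set
    acc = accuracy_score(y_held, model.predict(X_held))
    if acc > best_acc:
        best_model, best_acc = model, acc

# Only now touch the test set, once, to report final accuracy (never "peek" earlier)
print("test accuracy:", accuracy_score(y_test, best_model.predict(X_test)))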
Difference between ML and related fields
• [Diagram: ML placed in relation to data mining, control theory, evolutionary models and neuroscience; over time, data mining works on existing data, while an ML model is a process used to predict the future.]
Classification and Regression
• Classification problems are supervised learning problems where the target/response variables take only discrete (finite/countable) values.
  • Example: Employability prediction
• Regression problems are supervised learning problems where the target/response is a continuous variable (or equivalently can take any real number).
  • Example: Predicting the price of a used car
Basic Maths
• If (x1,y1) = (1,2) and (x2,y2) = (2,4)
• What is the value of y3, if x3 = 3?
• y3 = 6, if x3 = 3. How?
  • The two points can be connected by a straight line
  • Slope of line = m = Δy/Δx = 2/1 = 2
  • Equation connecting the points is of the form y = f(x)
• What if (x1,y1) = (1,2) and (x2,y2) = (2,5)?
Basic Maths
• If (x1,y1) = (1,2), (x2,y2) = (2,5), (x3,y3) = (3,6)
• What is the value of y4, if x4 = 4?
  • Any two of the points can be connected by a straight line
  • What about the third point?
  • Slope of line = m = Δy/Δx = 3/1 = 3 (between the first two points)
  • OR slope of line = m = Δy/Δx = 1/1 = 1 (between the last two points)
  • OR slope of line = m = Δy/Δx = 4/2 = 2 (between the first and last points)
  • No single straight line passes through all three points, so the equation connecting them, y = f(x), has to be chosen as a best fit (see the sketch below)
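A small sketch of this idea using numpy's least squares polynomial fit on the three points above (the choice of numpy.polyfit is an illustrative assumption, not prescribed by the slides):

import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 5.0, 6.0])

# Fit a degree-1 polynomial y = m*x + c by least squares
m, c = np.polyfit(x, y, deg=1)
print("slope m =", m, "intercept c =", c)   # m = 2.0, c ≈ 0.33

# Predict y4 for x4 = 4 using the fitted line
x4 = 4.0
print("predicted y4 =", m * x4 + c)         # ≈ 8.33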
Prediction of values
• Predicting values based on the data available
• To predict values a relation is needed
  • x is the input variable or predictor
  • y is the output or predicted value
  • y has a relation with x (y = some function of x)
• The previous discussion was 2-dimensional
• What if we have 3 dimensions? x, y, z
  • Assume we re-label the axes as x1, x2, y
  • Now, y is some function of x1 and also of x2
  • y = f1(x1) and y = f2(x2)
  • OR combined: y = f(x1, x2) or y = f(X)
Unsupervised ML
• Learn relationships and structure from such data; no supervising output (y values, categories or groups are not known)
  • E.g. grouping documents based on their theme (no prior information about groups) – the data does not have existing labels (blue, orange, etc.)
• Only input variables; no output variable
Clustering
• E.g. customer grouping
  • Input: demographic data of customers (age, qualification, nationality, ethnicity, socioeconomic status, etc.)
  • Identify which customers are similar to each other by grouping individuals according to their observed characteristics – a clustering problem (a minimal sketch follows this list)
• E.g. gene grouping
  • Cluster cells based on the genes which are expressing (active); similar cancers tend to cluster together
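A minimal clustering sketch, using made-up demographic values and k-means as the grouping method (the slides do not prescribe a specific algorithm; both the data and the choice of k-means are illustrative assumptions):

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical customer data: each row = (age, years of education, income in $1000s)
customers = np.array([
    [25, 12, 30], [27, 14, 35], [24, 12, 28],    # younger, lower income
    [52, 16, 95], [55, 18, 110], [49, 16, 90],   # older, higher income
])

# Group customers into 2 clusters using only the input variables (no labels)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
print(kmeans.labels_)            # cluster assignment for each customer
print(kmeans.cluster_centers_)   # the "average" customer of each group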
Introduction – Regression (Supervised)
• Statistical learning: a vast set of tools for understanding data
• Tools can be supervised (existing output values are available)
  • a statistical model for predicting, or estimating, an output based on one or more inputs
  • E.g. predicting wage based on age, education, year (given some past data)
  • E.g. the line prediction examples (but the function could be non-linear)
• Example: Prediction of wage
  • Using age?
    • Yes, to an extent
    • Variability; non-linear (something else is also influencing)
  • Year? Education level?
  • Each parameter (predictor / variable) is not sufficient alone
• Regression – used for a quantitative (continuous, numerical value) output
  • a measure of the relation between the mean value of one variable (e.g. wage)
  • and corresponding values of other variables (e.g. age, education level)
• Predicting a non-numerical value
  • Classification of the output / prediction into two classes (up and down)
  • E.g. predict whether the stock market (S&P 500 index) will increase or decrease
  • In the example, the input is the last 5 years of day-wise data
  • Only the previous day, the two previous days, or even the 3 previous days is not seemingly adequate for predicting the next day's movement
• There is not much variation in the % change up or down considering (-)1 day, (-)2 days, (-)3 days of data. So in this case there is no good predictor (an illustrative sketch with simulated returns follows)
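An illustrative sketch of this classification setting; since the actual S&P 500 data is not reproduced here, it uses simulated daily returns and a logistic regression classifier (both are assumptions made purely for illustration):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Simulated daily % returns standing in for the market data mentioned above
returns = rng.normal(0.0, 1.0, size=1250)       # roughly 5 years of trading days

# Predictors: previous 1, 2 and 3 days' returns; target: does the next day go up?
X = np.column_stack([returns[2:-1], returns[1:-2], returns[:-3]])
y = (returns[3:] > 0).astype(int)

X_train, X_test = X[:1000], X[1000:]
y_train, y_test = y[:1000], y[1000:]

model = LogisticRegression().fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
# With these returns the accuracy stays near 0.5: the lagged returns are poor
# predictors, mirroring the observation above that there is no good predictor here.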
Statistical Learning
• Sales (Y = name of the output), in thousands of units, as a function of TV (X1), radio (X2), and newspaper (X3) budgets, in thousands of dollars, for 200 different markets
• A simple least squares fit of sales to each variable
  • Each blue line represents a simple model [Sales = f1(TV), Sales = f2(Radio), Sales = f3(Newspaper)]
  • Can be used to predict sales using TV, radio, and newspaper
• A least squares fit means selecting the line for which the sum of the squared distances of each of the points is the least.
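A sketch of these three simple fits, assuming the Advertising data from the ISL website has been downloaded as Advertising.csv with columns TV, radio, newspaper and sales (file name and column names are assumptions about that dataset):

import pandas as pd
from sklearn.linear_model import LinearRegression

# Advertising data: 200 markets, budgets in $1000s, sales in 1000s of units
ads = pd.read_csv("Advertising.csv")

# One simple least squares fit per predictor, as in the three blue lines
for predictor in ["TV", "radio", "newspaper"]:
    X = ads[[predictor]]             # single input variable
    y = ads["sales"]                 # output
    fit = LinearRegression().fit(X, y)
    print(predictor, "slope:", fit.coef_[0], "intercept:", fit.intercept_)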
Statistical Learning
• In general
  • quantitative response Y and p different predictors, X1, X2, . . ., Xp
  • assume there is some relationship between Y and X = (X1, X2, . . ., Xp), which can be written
    Y = f(X) + ε
  • f is some fixed but unknown function of X1, X2, . . ., Xp
  • ε is a random error term
Model Evaluation
• Metrics for Performance Evaluation
• How to evaluate the performance of a model?
Regression model
• Relation between variables where changes in some variables may “explain” or possibly “cause” changes in other variables.
• Explanatory variables are termed the independent variables and the variables to be explained are termed the dependent variables.
• A regression model estimates the nature of the relationship between the independent and dependent variables:
  – Change in the dependent variables that results from changes in the independent variables, i.e. the size of the relationship.
  – Strength of the relationship.
  – Statistical significance of the relationship.
Types of Relationships
• [Scatter plots of Y versus X illustrating strong relationships and weak relationships.]
Introduction to Regression Analysis
• Regression analysis is used to:
  • Predict the value of a dependent variable based on the value of at least one independent variable
  • Explain the impact of changes in an independent variable on the dependent variable
• Dependent variable: the variable we wish to predict or explain
• Independent variable: the variable used to predict or explain the dependent variable
Simple Linear Regression Model
• Only one independent variable, X
• The relationship between X and Y is described by a linear function
• Changes in Y are assumed to be related to changes in X
Linear Regression
• X = independent (explanatory) variable
• Y = dependent (response) variable
• Use instead of correlation when
  • the distribution of X is fixed by the researcher (i.e., a set number at each level of X)
  • studying functional dependency between X and Y
Regression Modeling Steps
• Define the problem or question
• Specify the model
• Collect data
• Do descriptive data analysis
• Estimate unknown parameters
• Evaluate the model
Multiple Regression Models
• [Taxonomy diagram: Linear and Non-Linear models, including Dummy Variable, Interaction, Linear, Polynomial, Square Root, Log, Reciprocal and Exponential forms.]
Multiple Regression Equations
Linear Model
• The relationship between one dependent & two or more independent variables is a linear function
• Y = β0 + β1X1 + β2X2 + · · · + βPXP + ε
  • β0: population Y-intercept
  • β1, . . ., βP: population slopes
  • ε: random error
  • Y: dependent (response) variable
  • X1, . . ., XP: independent (explanatory) variables
Method of Least Squares
• The straight line that best fits the data.
• Determine the straight line for which the differences between the actual values (Y) and the values that would be predicted from the fitted line of regression (Y-hat) are as small as possible.
Measures of Variation
• Explained variation (sum of squares due to regression)
• Unexplained variation (error sum of squares)
• Total sum of squares
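A small sketch showing how the three quantities relate, computed on made-up data (the numbers are illustrative only):

import numpy as np

# Made-up data for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

m, c = np.polyfit(x, y, deg=1)    # least squares line
y_hat = m * x + c

sst = np.sum((y - y.mean()) ** 2)       # total sum of squares
ssr = np.sum((y_hat - y.mean()) ** 2)   # explained variation (regression sum of squares)
sse = np.sum((y - y_hat) ** 2)          # unexplained variation (error sum of squares)

print(sst, ssr + sse)             # SST = SSR + SSE (up to rounding)
print("R^2 =", ssr / sst)         # share of the variation explained by the line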
Solved Example
• X = percent receiving reduced or free meal (RFM)
• [Scatter plots (X axis 0–100) of RFM against the outcome variable (HELM) showing an approximately linear relation; the least squares formulas determine the best line.]
Classification – A Two-Step Process
• Model construction: describing a set of predetermined classes
  • Each tuple/sample is assumed to belong to a predefined class, as determined by the class label
  • The set of tuples used for model construction is the training set
  • The model is represented as classification rules, decision trees, or mathematical formulae
• Model usage: for classifying future or unknown objects
  • Estimate the accuracy of the model
    • The known label of each test sample is compared with the classified result from the model
    • The test set is independent of the training set, otherwise over-fitting will occur
  • If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known
Classification Task
• Training Set (input to the learning algorithm, which induces the classification model):

  Tid  Attrib1  Attrib2  Attrib3  Class
  1    Yes      Large    125K     No
  2    No       Medium   100K     No
  3    No       Small    70K      No
  6    No       Medium   60K      No
  10   No       Small    90K      Yes

• Test Set (the learned model is applied to predict the unknown class labels):

  Tid  Attrib1  Attrib2  Attrib3  Class
  11   No       Small    55K      ?
  15   No       Large    67K      ?
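A minimal sketch of the two steps on the rows shown above, using a decision tree classifier; treating Attrib3 as a number of thousands and one-hot encoding the categorical attributes are assumptions made for illustration:

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Step 1: model construction on the training set (rows from the table above)
train = pd.DataFrame({
    "Attrib1": ["Yes", "No", "No", "No", "No"],
    "Attrib2": ["Large", "Medium", "Small", "Medium", "Small"],
    "Attrib3": [125, 100, 70, 60, 90],          # "125K" etc. taken as thousands
    "Class":   ["No", "No", "No", "No", "Yes"],
})
X_train = pd.get_dummies(train[["Attrib1", "Attrib2", "Attrib3"]])
model = DecisionTreeClassifier().fit(X_train, train["Class"])

# Step 2: model usage on unseen tuples (class label unknown, shown as "?")
test = pd.DataFrame({
    "Attrib1": ["No", "No"],
    "Attrib2": ["Small", "Large"],
    "Attrib3": [55, 67],
})
X_test = pd.get_dummies(test).reindex(columns=X_train.columns, fill_value=0)
print(model.predict(X_test))    # predicted class labels for Tid 11 and 15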
Issues: Data Preparation
• Data cleaning
  • Preprocess data in order to reduce noise and handle missing values
• Relevance analysis (feature selection)
  • Remove the irrelevant or redundant attributes
• Data transformation
  • Generalize data (to higher concepts, discretization)
  • Normalize attribute values
Classification Techniques
• Decision Tree based Methods
• Rule-based Methods
• Naïve Bayes and Bayesian Belief Networks
• Neural Networks
• Support Vector Machines
• and more...
Learning decision trees
Example Problem: decide whether to wait for a table at a restaurant, based on the following attributes:
1. Alternate: is there an alternative restaurant nearby?
2. Bar: is there a comfortable bar area to wait in?
3. Fri/Sat: is today Friday or Saturday?
4. Hungry: are we hungry?
5. Patrons: number of people in the restaurant (None, Some, Full)
6. Price: price range ($, $$, $$$)
7. Raining: is it raining outside?
8. Reservation: have we made a reservation?
9. Type: kind of restaurant (French, Italian, Thai, Burger)
10. WaitEstimate: estimated waiting time (0-10, 10-30, 30-60, >60)
Feature(Attribute)-based representations
• Examples described by feature (attribute) values (Boolean, discrete, continuous)
• E.g., situations where I will/won't wait for a table
Decision trees – Solved Example
• One possible representation for hypotheses
• E.g., here is the “true” tree for deciding whether to wait
Expressiveness
• Decision trees can express any function of the input attributes.
  • E.g., for Boolean functions, truth table row → path to leaf
• Trivially, there is a consistent decision tree for any training set with one path to a leaf for each example (unless f is nondeterministic in x), but it probably won't generalize to new examples
Decision tree learning
• Aim: find a small tree consistent with the training examples
• Idea: (recursively) choose the "most significant" attribute as the root of each (sub)tree
Decision Tree Construction Algorithm
• Principle
  • Basic algorithm (adopted by ID3, C4.5 and CART): a greedy algorithm
  • The tree is constructed in a top-down recursive divide-and-conquer manner
• Iterations
  • At the start, all the training tuples are at the root
  • Tuples are partitioned recursively based on selected attributes
  • Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain) – see the sketch below
• Stopping conditions
  • All samples for a given node belong to the same class
  • There are no remaining attributes for further partitioning – majority voting is employed for classifying the leaf
  • There are no samples left
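A small sketch of the greedy attribute-selection step, using information gain computed from entropy; the toy tuples below are made up for illustration (they loosely echo the restaurant example):

import numpy as np

def entropy(labels):
    # Entropy of a collection of class labels
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(labels, attribute_values):
    # Gain from partitioning the tuples by one attribute's values
    labels = np.asarray(labels)
    attribute_values = np.asarray(attribute_values)
    remainder = 0.0
    for v in np.unique(attribute_values):
        subset = labels[attribute_values == v]
        remainder += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - remainder

# Toy tuples: pick the attribute with the highest gain as the root test
labels = ["Yes", "Yes", "No", "No", "Yes", "No"]
attrs = {
    "Patrons": ["Some", "Some", "None", "Full", "Some", "Full"],
    "Type":    ["Thai", "French", "Thai", "Burger", "Italian", "Burger"],
}
gains = {name: information_gain(labels, values) for name, values in attrs.items()}
print(gains)                        # the greedy step chooses the largest gain
print(max(gains, key=gains.get))    # "Patrons" wins on this toy data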
Decision Tree Induction: Training Dataset
• [Worked example: the decision tree is built step by step from the training dataset over a series of slides; the figures are not reproduced here.]
Tree Induction
• Greedy strategy
  • Split the records based on an attribute test that optimizes a certain criterion
• Issues
  • Determine how to split the records
    • How to specify the attribute test condition?
    • How to determine the best split?
  • Determine when to stop splitting
Choosing an attribute
• Idea: a good attribute splits the examples into subsets that are (ideally) "all positive" or "all negative"
Measures of Node Impurity
• Information Gain (based on entropy)
• Gini Index
• Misclassification error
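A small sketch computing the three impurity measures for a single node's class labels (the example label sets are made up for illustration):

import numpy as np

def impurity_measures(labels):
    # Entropy, Gini index and misclassification error for one node
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    entropy = -np.sum(p * np.log2(p))
    gini = 1.0 - np.sum(p ** 2)
    misclass = 1.0 - p.max()
    return entropy, gini, misclass

print(impurity_measures(["Yes"] * 5 + ["No"] * 5))   # maximally impure node
print(impurity_measures(["Yes"] * 9 + ["No"] * 1))   # nearly pure node
print(impurity_measures(["Yes"] * 10))               # pure node: all measures are 0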