ML_Introduction

The document provides an introduction to machine learning, covering key concepts such as supervised, unsupervised, and reinforcement learning, along with their applications in predictive analytics and data mining. It discusses the importance of model evaluation, regression analysis, and the relationship between independent and dependent variables in statistical learning. Additionally, it outlines various learning tasks, including classification and regression, and emphasizes the significance of understanding data through statistical models.


Introduction to Machine Learning
Course Info
• Textbook and resources:
  • James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning (Vol. 112, pp. 3-7). New York: Springer.
  • http://faculty.marshall.usc.edu/gareth-james/ISL/

Teaching and Evaluation Scheme
• Lecture: 2 hours per week
• Practical: 2 hours per week
• Tutorial: -
• Credits: 3
• Internal Continuous Assessment (ICA): 50 marks, scaled to 50
• Term End Examination (TEE): 100 marks in question paper, scaled to 50


Machine Learning is...
• The study of algorithms that improve their performance P at some task T with experience E
• A well-defined learning task is a triple <P, T, E>


Machine Learning
• Machine learning: how to acquire a model on the basis of data / experience
  • Learning parameters (e.g. probabilities)
  • Learning structure (e.g. graphs)
  • Learning hidden concepts (e.g. clustering)
Machine Learning Areas
• Supervised Learning: data and corresponding labels are given
• Unsupervised Learning: only data is given; no labels are provided
• Semi-supervised Learning: some (but not all) labels are present
• Reinforcement Learning: an agent interacting with the world makes observations, takes actions, and is rewarded or punished; it should learn to choose actions so as to obtain as much reward as possible
Supervised Learning: Important Concepts
• Data: labeled instances <xi, yi>, e.g. emails marked spam/not spam
  • Training set
  • Held-out set
  • Test set
• Features: attribute-value pairs that characterize each x
• Experimentation cycle
  • Learn parameters (e.g. model probabilities) on the training set
  • Tune hyper-parameters on the held-out set
  • Compute accuracy on the test set
  • Very important: never "peek" at the test set!
• Evaluation
  • Accuracy: fraction of instances predicted correctly
• Overfitting and generalization
  • We want a classifier that does well on unseen test data
  • Overfitting: fitting the training data very closely, but not generalizing well
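The experimentation cycle above can be sketched in code. This is a minimal illustration, not from the slides: the split fractions and the fixed seed are arbitrary choices, and `accuracy` is the metric the slide defines (fraction of instances predicted correctly).

```python
import random

def three_way_split(data, train_frac=0.6, heldout_frac=0.2, seed=0):
    """Shuffle labeled instances and split them into train / held-out / test sets."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_heldout = int(len(shuffled) * heldout_frac)
    train = shuffled[:n_train]
    heldout = shuffled[n_train:n_train + n_heldout]
    test = shuffled[n_train + n_heldout:]   # never peeked at while tuning
    return train, heldout, test

def accuracy(predictions, labels):
    """Fraction of instances predicted correctly."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)
```

Parameters are learned on `train`, hyper-parameters tuned on `heldout`, and `accuracy` is reported once on `test`.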
Related Fields
Machine learning draws on, and differs from, several neighboring fields:
• data mining
• databases
• statistics
• information theory
• decision theory
• control theory
• cognitive science
• psychological models
• neuroscience
• evolutionary models

Machine learning is primarily concerned with the accuracy and effectiveness of the computer system.
Related to Data Science
The analytical spectrum runs from business intelligence (looking at the past) to data science (looking at the future):
• Business Intelligence: standard and ad hoc reporting, dashboards, alerts, queries on demand — "What happened last quarter?", "How many units sold?", "Where is the problem?"
• Data Science: explanatory analysis, predictive analytics and data mining — "Why is it happening?", "What will happen?", "What if...?", "What's the optimal scenario for the business?"
[Figure: the data mining / ML model process]
Classification and Regression
• Classification problems are supervised learning problems where the target/response variable takes only discrete (finite/countable) values.
  Example: employability prediction
• Regression problems are supervised learning problems where the target/response is a continuous variable (equivalently, it can take any real number).
  Example: predicting the price of a used car
Basic Maths
• If (x1, y1) = (1, 2) and (x2, y2) = (2, 4), what is the value of y3 if x3 = 3?
• y3 = 6. How?
  • The two points can be connected by a straight line
  • Slope of the line: m = Δy/Δx = 2/1 = 2
  • The equation connecting the points has the form y = f(x); here y = 2x
Basic Maths
• Now suppose (x1, y1) = (1, 2), (x2, y2) = (2, 5), (x3, y3) = (3, 6)
• What is the value of y4 if x4 = 4?
• Any two of the points can be connected by a straight line — but what about the third point?
  • Slope through points 1 and 2: m = Δy/Δx = 3/1 = 3
  • OR slope through points 2 and 3: m = Δy/Δx = 1/1 = 1
  • OR slope through points 1 and 3: m = Δy/Δx = 4/2 = 2
• No single straight line passes through all three points; we need an equation y = f(x) that fits them approximately
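The two-point case above can be checked directly. This sketch is my own illustration (not from the slides): `line_through` computes the slope-intercept form of the line through two points, and the pairwise slopes show why no single line fits the three points (1, 2), (2, 5), (3, 6).

```python
def line_through(p1, p2):
    """Slope-intercept form (m, c) of the line y = m*x + c through two points."""
    (x1, y1), (x2, y2) = p1, p2
    m = (y2 - y1) / (x2 - x1)   # slope = change in y / change in x
    c = y1 - m * x1             # intercept
    return m, c

# The first slide's example: (1, 2) and (2, 4) give y = 2x, so y3 = 6 at x3 = 3.
m, c = line_through((1, 2), (2, 4))
y3 = m * 3 + c
```

For the three non-collinear points, `line_through` gives a different slope for each pair, which motivates fitting a single approximate line instead.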
Prediction of Values
• Predicting values based on available data
• To predict values, a relation is needed
  • x is the input variable or predictor
  • y is the output or predicted value
  • y has a relation with x (y = some function of x)
• The previous discussion was 2-dimensional
  • What if we have 3 dimensions: x, y, z?
  • Assume we relabel the axes as x1, x2, y
  • Now y is some function of x1 and also of x2: y = f1(x1) and y = f2(x2)
  • OR combined: y = f(x1, x2), i.e. y = f(X)
Unsupervised ML
• Unsupervised learning: learn relationships and structure from data with no supervising output (y values, categories, or groups are not known)
  • E.g. grouping documents based on their theme, with no prior information about the groups
  • The data does not come with existing labels
• Only input variables; no output variable

Clustering
• E.g. customer grouping: given input demographic data of customers (age, qualification, nationality, ethnicity, socioeconomic status, etc.), identify which customers are similar to each other by grouping individuals according to their observed characteristics — a clustering problem
• E.g. gene grouping: cluster cells based on which genes are expressing (active); similar cancers tend to cluster together
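The customer-grouping idea can be illustrated with a tiny clustering sketch. This is my own minimal 1-D k-means (the slides name no algorithm): the naive initialization and the fixed iteration count are simplifying assumptions, not a production design.

```python
def kmeans_1d(values, k=2, iters=20):
    """Tiny 1-D k-means: group numbers (e.g. customer ages) into k clusters."""
    # Naive initialization: evenly spaced picks from the sorted values.
    centroids = sorted(values)[::max(1, len(values) // k)][:k]
    for _ in range(iters):
        # Assignment step: each value joins its nearest centroid's cluster.
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to its cluster's mean.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters
```

On ages like [22, 25, 24, 61, 58, 63] this separates a "younger" and an "older" group without any labels, which is the essence of the clustering examples above.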
Introduction – Regression (Supervised)
• Statistical learning: a vast set of tools for understanding data
• Supervised tools (existing output values are available): statistical models for predicting, or estimating, an output based on one or more inputs
  • E.g. predicting wage based on age, education, and year, given some past data
  • E.g. the line-prediction examples above (but the function could be non-linear)
Example: Prediction of Wage
• Using age? Yes, to an extent — but there is variability and non-linearity (something else is also influencing wage)
• Year? Education level? Each parameter (predictor/variable) is insufficient on its own
• Regression is used for quantitative (continuous, numerical) output: a measure of the relation between the mean value of one variable (e.g. wage) and corresponding values of other variables (e.g. age, education level)
Example: Stock Market Direction
• Predict a non-numerical value: classify the output/prediction into two classes (up and down)
• E.g. predict whether the stock market (S&P 500 index) will increase or decrease; in the example, the input is day-wise data for the last 5 years
• Using only the previous day, the two previous days, or even the 3 previous days does not seem adequate for predicting the next day's movement
• There is not much variation in the % change up or down considering (-)1 day, (-)2 day, or (-)3 day data, so in this case there is no good predictor
Statistical Learning
• Sales (Y, the name of the output), in thousands of units, as a function of TV (X1), radio (X2), and newspaper (X3) budgets, in thousands of dollars, for 200 different markets
• A simple least squares fit of sales to each variable: each blue line represents a simple model [Sales = f1(TV), Sales = f2(Radio), Sales = f3(Newspaper)]
• These can be used to predict sales using TV, radio, and newspaper budgets
• A least squares fit means selecting the line for which the sum of the squared distances of each of the points from the line is least
Statistical Learning
• In general, we have a quantitative response Y and p different predictors X1, X2, . . ., Xp
• We assume there is some relationship between Y and X = (X1, X2, . . ., Xp), which can be written

  Y = f(X) + ε

• f is some fixed but unknown function of X1, X2, . . ., Xp
• ε is a random error term
Model Evaluation
• Metrics for performance evaluation: how to evaluate the performance of a model?
• Methods for performance evaluation: how to obtain reliable estimates?
• Methods for model comparison: how to compare the relative performance among competing models?
Classification and Regression
• In regression we assign input vector x to one or more continuous target variables t
  – Linear regression has simple analytical and computational properties
• In classification we assign input vector x to one of K discrete classes Ck, k = 1, . . ., K
  – Common classification scenario: classes are considered disjoint
  – Each input is assigned to only one class
  – The input space is thereby divided into decision regions
Unsupervised Learning
• Feature tuple: (Zip Code, Family Income, # of visits in a month, Average money spent in a month)
• Response / target: none
• Unsupervised learning: discover groups of similar examples within the data set

  S.No.  Zip Code  Family Income  # of visits in a month  Average money spent in a month
  1      500078    11,50,000      4                       8,000
Classification and Regression - Examples
Classification
• Predicting whether a patient has a particular disease or not
• Handwritten digit recognition
• Email spam detection

Regression
• Predicting house/property price
• Predicting stock market price
• Predicting sales of a product
Linear Regression
Predicting sales of an item:

  Advertising            Sales
  (in lakhs of rupees)   (in lakhs of rupees)
  10                     520
  20                     625
  35                     700
  50                     780
  78                     ??
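The missing sales value can be estimated with a least-squares fit to the table. This sketch is my own (the slides do not show the computation): it uses the standard closed-form slope and intercept formulas for simple linear regression and then extrapolates to an advertising spend of 78.

```python
def fit_line(xs, ys):
    """Least-squares intercept a and slope b minimizing squared residuals."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return a, b

advertising = [10, 20, 35, 50]
sales = [520, 625, 700, 780]
a, b = fit_line(advertising, sales)
predicted_sales_at_78 = a + b * 78
```

Note that 78 lies outside the observed advertising range, so this is an extrapolation and the linear assumption carries extra risk there.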
Regression Model
• Relation between variables where changes in some variables may "explain" or possibly "cause" changes in other variables. Explanatory variables are termed the independent variables and the variables to be explained are termed the dependent variables.
• A regression model estimates the nature of the relationship between the independent and dependent variables:
  – the change in the dependent variable that results from changes in the independent variables, i.e. the size of the relationship
  – the strength of the relationship
  – the statistical significance of the relationship
Types of Relationships
[Figure: scatter plots of Y vs. X contrasting strong relationships (points close to the fitted line) with weak relationships (points widely dispersed about the line)]
Introduction to Regression Analysis
• Regression analysis is used to:
  • predict the value of a dependent variable based on the value of at least one independent variable
  • explain the impact of changes in an independent variable on the dependent variable
• Dependent variable: the variable we wish to predict or explain
• Independent variable: the variable used to predict or explain the dependent variable
Simple Linear Regression Model
• Only one independent variable, X
• The relationship between X and Y is described by a linear function
• Changes in Y are assumed to be related to changes in X
• X = independent (explanatory) variable; Y = dependent (response) variable
• Use linear regression instead of correlation:
  • when the distribution of X is fixed by the researcher (i.e., a set number at each level of X)
  • when studying functional dependency between X and Y
Regression Modeling Steps
• Define the problem or question
• Specify the model
• Collect data
• Do descriptive data analysis
• Estimate the unknown parameters
• Evaluate the model
• Use the model for prediction
Simple vs. Multiple Regression
• Simple: β represents the unit change in Y per unit change in X; it does not take into account any other variable besides the single independent variable
• Multiple: βi represents the unit change in Y per unit change in Xi; it takes into account the effect of the other βi's — the "net regression coefficient"
Assumptions
• Linearity – the Y variable is linearly related to the value of the X variable
• Independence of error – the error (residual) is independent for each value of X
• Homoscedasticity – the variation around the line of regression is constant for all values of X
• Normality – the values of Y are normally distributed at each value of X
Regression Goal
Develop a statistical model that can predict the values of a dependent (response) variable based upon the values of the independent (explanatory) variables.

Simple Regression
A statistical model that utilizes one quantitative independent variable "X" to predict the quantitative dependent variable "Y."

Multiple Regression
A statistical model that utilizes two or more quantitative and qualitative explanatory variables (x1, ..., xp) to predict a quantitative dependent variable Y. Have at least two or more quantitative explanatory variables (rule of thumb).
 H0: 1 = 2 = 3 = ... = P = 0
Hypotheses
 H1: At least one regression
coefficient is not equal to
zero
H0: i = 0
Hypotheses
(alternate
format)
H1: i  0
Types of Models
• Positive linear relationship
• Negative linear relationship
• No relationship between X and Y
• Positive curvilinear relationship
• U-shaped curvilinear relationship
• Negative curvilinear relationship
Multiple Regression Models
• Linear
  • Linear
  • Dummy variable
  • Interaction
• Non-linear
  • Polynomial
  • Square root
  • Log
  • Reciprocal
  • Exponential
Linear Model
The relationship between one dependent and two or more independent variables is a linear function:

  Y = β0 + β1X1 + β2X2 + ... + βPXP + ε

where β0 is the population Y-intercept, β1 ... βP are the population slopes, ε is the random error, Y is the dependent (response) variable, and X1 ... XP are the independent (explanatory) variables.
Method of Least Squares
• The straight line that best fits the data.
• Determine the straight line for which the differences between the actual values (Y) and the values that would be predicted from the fitted line of regression (Ŷ) are as small as possible.
Measures of Variation
• Explained variation (sum of squares due to regression)
• Unexplained variation (error sum of squares)
• Total sum of squares
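These three quantities can be computed directly from the observed values and the fitted values. This is my own sketch (not from the slides); note that the identity SST = SSR + SSE is guaranteed only when the fitted values come from a least-squares regression.

```python
def variation_decomposition(ys, y_hats):
    """Total, explained (regression), and unexplained (error) sums of squares."""
    mean_y = sum(ys) / len(ys)
    sst = sum((y - mean_y) ** 2 for y in ys)               # total sum of squares
    ssr = sum((yh - mean_y) ** 2 for yh in y_hats)         # explained variation
    sse = sum((y - yh) ** 2 for y, yh in zip(ys, y_hats))  # unexplained variation
    return sst, ssr, sse
```

The ratio SSR/SST is the familiar R² statistic, a measure of how much of the variation the regression explains.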
Solved Example
• X = percent receiving reduced or free meal (RFM)
• Y = percent using helmets (HELM)
• n = 12 (an outlier was removed to study the linear relation)
[Figure: scatter plot of Y (% using helmets) vs. X (% receiving reduced-fee school lunch)]
Regression Model (Equation)

  ŷ = a + bX   ("y hat")

where
• ŷ represents the predicted average of Y at a given X
• a represents the line's intercept
• b represents the line's slope
How Formulas Determine the Best Line
• Distance of points from the line = residuals (dotted)
• Minimize the sum of squared residuals
• The result is the least squares regression line
[Figure: scatter plot with fitted line and residuals, Y vs. X (% receiving reduced-fee school lunch)]
Predicting Average Y
• ŷ = a + bx, i.e. predicted Y = intercept + (slope)(x)
• HELM = 47.49 + (−0.54)(RFM)
• What is the predicted HELM when RFM = 50?
  ŷ = 47.49 + (−0.54)(50) = 20.5
  The average HELM is predicted to be 20.5 in a neighborhood where 50% of children receive a reduced or free meal.
• What is the average Y when x = 20?
  ŷ = 47.49 + (−0.54)(20) = 36.7
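The fitted equation from the solved example is trivial to encode and check; this one-liner simply reproduces the slide's line, with the coefficients 47.49 and −0.54 taken directly from the example.

```python
def predict_helm(rfm):
    """Fitted line from the solved example: HELM = 47.49 + (-0.54) * RFM."""
    return 47.49 + (-0.54) * rfm

# Rounding to one decimal reproduces the slide's values: 20.5 at RFM=50, 36.7 at RFM=20.
helm_at_50 = round(predict_helm(50), 1)
helm_at_20 = round(predict_helm(20), 1)
```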
Supervised Learning
• Learning a discrete function: classification
  • Boolean classification: each example is classified as true (positive) or false (negative)
• Learning a continuous function: regression
Classification — A Two-Step Process
• Model construction: describing a set of predetermined classes
  • Each tuple/sample is assumed to belong to a predefined class, as determined by its class label
  • The set of tuples used for model construction is the training set
  • The model is represented as classification rules, decision trees, or mathematical formulae
• Model usage: classifying future or unknown objects
  • Estimate the accuracy of the model
    • The known label of each test sample is compared with the classified result from the model
    • The test set is independent of the training set; otherwise over-fitting will occur
  • If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known
Illustrating the Classification Task

Training set (learning algorithm → induction → learn model):

  Tid  Attrib1  Attrib2  Attrib3  Class
  1    Yes      Large    125K     No
  2    No       Medium   100K     No
  3    No       Small    70K      No
  4    Yes      Medium   120K     No
  5    No       Large    95K      Yes
  6    No       Medium   60K      No
  7    Yes      Large    220K     No
  8    No       Small    85K      Yes
  9    No       Medium   75K      No
  10   No       Small    90K      Yes

Test set (apply model → deduction):

  Tid  Attrib1  Attrib2  Attrib3  Class
  11   No       Small    55K      ?
  12   Yes      Medium   80K      ?
  13   Yes      Large    110K     ?
  14   No       Small    95K      ?
  15   No       Large    67K      ?
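The induction/deduction split above can be made concrete with a deliberately trivial classifier. This is my own sketch, not a method from the slides: a majority-class model stands in for a real learner so that the two steps (learn on the training set, apply and score on a labeled test set) are visible in isolation.

```python
def train_majority(training_set):
    """Step 1 (induction): learn a trivial model — the most common class label."""
    labels = [label for _, label in training_set]
    return max(set(labels), key=labels.count)

def apply_model(model, test_set):
    """Step 2 (deduction): classify tuples and estimate accuracy on known labels."""
    predictions = [model for _ in test_set]   # majority model ignores attributes
    correct = sum(pred == label
                  for pred, (_, label) in zip(predictions, test_set))
    return predictions, correct / len(test_set)
```

A real classifier would use the attribute values instead of ignoring them, but the two-step structure — and the rule that the test set stays independent of the training set — is the same.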
Issues: Data Preparation
• Data cleaning: preprocess data in order to reduce noise and handle missing values
• Relevance analysis (feature selection): remove irrelevant or redundant attributes
• Data transformation: generalize data (to higher-level concepts, discretization) and normalize attribute values
Classification Techniques
• Decision tree based methods
• Rule-based methods
• Naïve Bayes and Bayesian belief networks
• Neural networks
• Support vector machines
• and more...
Learning Decision Trees
Example problem: decide whether to wait for a table at a restaurant, based on the following attributes:
1. Alternate: is there an alternative restaurant nearby?
2. Bar: is there a comfortable bar area to wait in?
3. Fri/Sat: is today Friday or Saturday?
4. Hungry: are we hungry?
5. Patrons: number of people in the restaurant (None, Some, Full)
6. Price: price range ($, $$, $$$)
7. Raining: is it raining outside?
8. Reservation: have we made a reservation?
9. Type: kind of restaurant (French, Italian, Thai, Burger)
10. WaitEstimate: estimated waiting time (0-10, 10-30, 30-60, >60)
Feature (Attribute)-Based Representations
• Examples are described by feature (attribute) values (Boolean, discrete, continuous)
• E.g., situations where I will/won't wait for a table
• Classification of examples is positive (T) or negative (F)
Decision Trees: Solved Example
• One possible representation for hypotheses
• E.g., here is the "true" tree for deciding whether to wait
Expressiveness
• Decision trees can express any function of the input attributes
• E.g., for Boolean functions: truth table row → path to leaf
• Trivially, there is a consistent decision tree for any training set, with one path to a leaf for each example (unless f is nondeterministic in x), but it probably won't generalize to new examples
• Prefer to find more compact decision trees
Decision Tree Learning
• Aim: find a small tree consistent with the training examples
• Idea: (recursively) choose the "most significant" attribute as the root of each (sub)tree
Decision Tree Construction Algorithm
• Principle
  • Basic algorithm (adopted by ID3, C4.5 and CART): a greedy algorithm
  • The tree is constructed in a top-down, recursive, divide-and-conquer manner
• Iterations
  • At the start, all the training tuples are at the root
  • Tuples are partitioned recursively based on selected attributes
  • Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
• Stopping conditions
  • All samples for a given node belong to the same class
  • There are no remaining attributes for further partitioning — majority voting is employed for classifying the leaf
  • There are no samples left
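The attribute-selection step can be sketched with the information gain measure the algorithm mentions. This is a minimal illustration of the standard formula (entropy of the node minus the weighted entropy of the partitions); the dictionary-based row format is my own assumption, not the slides' notation.

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in counts.values())

def information_gain(rows, attr, label_key):
    """Entropy reduction from partitioning rows on attribute attr."""
    labels = [r[label_key] for r in rows]
    before = entropy(labels)
    # Partition the labels by the value of the chosen attribute.
    groups = {}
    for r in rows:
        groups.setdefault(r[attr], []).append(r[label_key])
    # Weighted entropy remaining after the split.
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return before - remainder
```

ID3-style construction simply picks, at each node, the attribute with the highest information gain on the tuples that reached that node.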
Decision Tree Induction: Training Dataset
This follows an example of supervised ML with ID3 (analogous to the classic "playing tennis" data):

  age    income  student  credit_rating  buys_computer
  <=30   high    no       fair           no
  <=30   high    no       excellent      no
  31…40  high    no       fair           yes
  >40    medium  no       fair           yes
  >40    low     yes      fair           yes
  >40    low     yes      excellent      no
  31…40  low     yes      excellent      yes
  <=30   medium  no       fair           no
  <=30   low     yes      fair           yes
  >40    medium  yes      fair           yes
  <=30   medium  yes      excellent      yes
  31…40  medium  no       excellent      yes
  31…40  high    yes      fair           yes
  >40    medium  no       excellent      no
[Figures: worked example of the decision tree construction on the dataset above, step by step]
Tree Induction
• Greedy strategy: split the records based on an attribute test that optimizes a certain criterion
• Issues
  • Determine how to split the records
    • How to specify the attribute test condition?
    • How to determine the best split?
  • Determine when to stop splitting
Choosing an Attribute
• Idea: a good attribute splits the examples into subsets that are (ideally) "all positive" or "all negative"
Measures of Node Impurity
• Information gain
• Gini index
• Misclassification error

Choose attributes to split on so as to achieve minimum impurity.
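The three impurity measures listed above can be computed side by side for a single node. This is my own compact sketch using the standard definitions; all three are zero for a pure node and maximal for an evenly mixed one.

```python
from math import log2
from collections import Counter

def impurities(labels):
    """Entropy, Gini index, and misclassification error for one node's labels."""
    n = len(labels)
    probs = [c / n for c in Counter(labels).values()]
    node_entropy = -sum(p * log2(p) for p in probs)   # information-gain measure
    gini = 1 - sum(p * p for p in probs)              # Gini index
    misclass = 1 - max(probs)                         # misclassification error
    return node_entropy, gini, misclass
```

For a two-class node split 50/50, these evaluate to 1.0, 0.5, and 0.5 respectively; a split that drives them toward zero in the children is a good split.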
