Unit 1 & 2: Machine Learning

This unit discusses machine learning, emphasizing its significance and applications across fields such as web search, finance, and robotics. It outlines the types of learning — supervised, unsupervised, semi-supervised, and reinforcement — and explains the role of algorithms and the factors that affect the performance of machine learning systems. It also covers evaluation methods for classification accuracy and the relevance of precision and recall in assessing model performance.

A Few Quotes
• “A breakthrough in machine learning would be worth ten Microsofts” (Bill Gates, Chairman, Microsoft)
• “Machine learning is the next Internet” (Tony Tether, Director, DARPA)
• “Machine learning is the hot new thing” (John Hennessy, President, Stanford)
• “Web rankings today are mostly a matter of machine learning” (Prabhakar Raghavan, Dir. Research, Yahoo)
• “Machine learning is going to result in a real revolution” (Greg Papadopoulos, CTO, Sun)
• “Machine learning is today’s discontinuity”
So What Is Machine Learning?
• Automating automation
• Getting computers to program themselves
• Writing software is the bottleneck
• Let the data do the work instead!
Traditional Programming
Data + Program → Computer → Output

Machine Learning
Data + Output → Computer → Program
Magic?
No, more like gardening

• Seeds = Algorithms
• Nutrients = Data
• Gardener = You
• Plants = Programs
Sample Applications
• Web search
• Computational biology
• Finance
• E-commerce
• Space exploration
• Robotics
• Information extraction
• Social networks
• Debugging
Types of Learning
• Supervised (inductive) learning
– Training data includes desired outputs
• Unsupervised learning
– Training data does not include desired outputs
• Semi-supervised learning
– Training data includes a few desired outputs
• Reinforcement learning
– Rewards from a sequence of actions
Inductive Learning
• Given examples of a function (X, F(X))
• Predict function F(X) for new examples X
– Discrete F(X): Classification
– Continuous F(X): Regression
– F(X) = Probability(X): Probability estimation
Learning system model
[Diagram: input samples feed the system; a learning method trains it, and the trained system is then evaluated in testing.]
Training and testing
[Diagram: the universal set (unobserved) splits into a training set (observed), used during data acquisition, and a testing set (unobserved), standing in for practical usage.]
Training and testing
• Training is the process of making the system able to
learn.

• No free lunch rule:


– Training set and testing set come from the same distribution
– Need to make some assumptions or bias
Applications of ML
[A sequence of image slides with example applications.]
Performance
• There are several factors affecting the performance:
– Types of training provided
– The form and extent of any initial background knowledge
– The type of feedback provided
– The learning algorithms used

• Two important factors:


– Modeling
– Optimization
Algorithms
• The success of a machine learning system also depends on the algorithms.
• The algorithms control the search to find and build the knowledge structures.
• The learning algorithms should extract useful information from training examples.
Algorithms
• Supervised learning
– Prediction
– Classification (discrete labels), Regression (real values)
• Unsupervised learning
– Clustering
– Probability distribution estimation
– Finding associations (in features)
– Dimension reduction
• Semi-supervised learning
• Reinforcement learning
Algorithms
[Diagram: semi-supervised learning lies between supervised and unsupervised learning.]
Machine learning structure
• Supervised learning
Machine learning structure
• Unsupervised learning
Learning techniques
• Supervised learning categories and techniques
– Linear classifier (numerical functions)
– Parametric (Probabilistic functions)
• Naïve Bayes, Gaussian discriminant analysis (GDA), Hidden
Markov models (HMM), Probabilistic graphical models
– Non-parametric (Instance-based functions)
• K-nearest neighbors, Kernel regression, Kernel density
estimation, Local regression
– Non-metric (Symbolic functions)
• Classification and regression tree (CART), decision tree
– Aggregation
Learning techniques
• Linear classifier: y = sign(wᵀx), where w is a d-dimensional weight vector (learned)
• Techniques:
– Perceptron
– Logistic regression
– Support vector machine (SVM)
– Adaline
Learning techniques
Using the perceptron learning algorithm (PLA):
– Training error rate: 0.10
– Testing error rate: 0.156
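A minimal Python sketch of the PLA (for illustration only; this is not the code that produced the error rates above, and the arrays X, y are assumed given with labels in {-1, +1}):

import numpy as np

def pla(X, y, max_iter=1000):
    # Perceptron learning algorithm; a bias column is appended to X.
    Xb = np.hstack([X, np.ones((len(X), 1))])
    w = np.zeros(Xb.shape[1])
    for _ in range(max_iter):
        mistakes = np.where(np.sign(Xb @ w) != y)[0]
        if len(mistakes) == 0:
            break                  # every training point correctly classified
        i = mistakes[0]
        w += y[i] * Xb[i]          # rotate w toward the misclassified point
    return w

def error_rate(w, X, y):
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return float(np.mean(np.sign(Xb @ w) != y))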
Learning techniques
Using logistic regression:
– Training error rate: 0.11
– Testing error rate: 0.145
Learning techniques
• Non-linear case

• Support vector machine (SVM):


– Linear to nonlinear: Feature transform and kernel function
Learning techniques
• Unsupervised learning categories and techniques
– Clustering
• K-means clustering
• Spectral clustering
– Density Estimation
• Gaussian mixture model (GMM)
• Graphical models
– Dimensionality reduction
• Principal component analysis (PCA)
• Factor analysis
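To make the clustering branch concrete, here is a minimal k-means sketch in Python (a generic illustration assuming data in a NumPy array X, not tied to any figure in these slides):

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: nearest centroid for every point.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its points.
        new_centroids = np.array(
            [X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
             for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break                  # assignments have stabilized
        centroids = new_centroids
    return labels, centroids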
Why “Learn”?
• Machine learning is programming computers to
optimize a performance criterion using example
data or past experience.
• There is no need to “learn” to calculate payroll
• Learning is used when:
– Human expertise does not exist (navigating
on Mars),
– Humans are unable to explain their expertise
(speech recognition)
– Solution changes in time (routing on a computer network)
What We Talk About When We Talk About “Learning”
• Learning general models from data of particular examples
• Data is cheap and abundant (data warehouses, data marts); knowledge is expensive and scarce.
• Example in retail: customer transactions to consumer behavior:
People who bought “Da Vinci Code” also bought “The Five People You Meet in Heaven” (www.amazon.com)
• Build a model that is a good and useful approximation to the data.
Growth of Machine Learning
• Machine learning is the preferred approach to
– Speech recognition, natural language processing
– Computer vision
– Medical outcomes analysis
– Robot control
– Computational biology
• This trend is accelerating
– Improved machine learning algorithms
– Improved data capture, networking, faster computers
– Software too complex to write by hand
– New sensors / IO devices
Applications
• Association Analysis
• Supervised Learning
– Classification
– Regression/Prediction
• Unsupervised Learning
• Reinforcement Learning
Learning Associations
• Basket analysis:
P(Y | X): the probability that somebody who buys X also buys Y, where X and Y are products/services.
Example: P(chips | beer) = 0.7

Market-basket transactions
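A sketch of how such a conditional probability is estimated from transactions (the baskets below are invented for illustration):

def conditional_prob(transactions, x, y):
    # P(Y | X): fraction of baskets containing X that also contain Y.
    with_x = [t for t in transactions if x in t]
    if not with_x:
        return 0.0
    return sum(1 for t in with_x if y in t) / len(with_x)

baskets = [{"beer", "chips"}, {"beer", "chips", "diapers"},
           {"beer", "milk"}, {"bread", "milk"}]
print(conditional_prob(baskets, "beer", "chips"))  # 2/3, about 0.67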
Classification
• Example: credit scoring
• Differentiating between low-risk and high-risk customers from their income and savings

Discriminant: IF income > θ1 AND savings > θ2 THEN low-risk ELSE high-risk
[Figure: the two classes and the discriminant in the income–savings plane.]
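The discriminant is just a pair of learned thresholds; as code (θ1 and θ2 here are placeholder values, in practice fitted to the training data):

def credit_risk(income, savings, theta1=30_000, theta2=10_000):
    # IF income > theta1 AND savings > theta2 THEN low-risk ELSE high-risk
    return "low-risk" if income > theta1 and savings > theta2 else "high-risk"

print(credit_risk(45_000, 15_000))  # low-risk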
Classification: Applications
• Aka pattern recognition
• Face recognition: pose, lighting, occlusion (glasses, beard), make-up, hair style
• Character recognition: different handwriting styles
• Speech recognition: temporal dependency
– Use of a dictionary or the syntax of the language
– Sensor fusion: combine multiple modalities, e.g., visual (lip image) and acoustic for speech
• Medical diagnosis: from symptoms to illnesses
• Web advertising: predict whether a user clicks on an ad on the Internet
Face Recognition
Training examples of a person
Test images

AT&T Laboratories, Cambridge UK
http://www.uk.research.att.com/facedatabase.html
Prediction: Regression
• Example: price of a used car
• x: car attributes; y: price
• Model: y = g(x | θ), where g(·) is the model and θ its parameters; e.g., the linear model y = wx + w0
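A sketch of fitting y = wx + w0 by least squares (the car data is invented for illustration):

import numpy as np

x = np.array([1.0, 3.0, 5.0, 7.0, 9.0])     # hypothetical car ages (years)
y = np.array([20.0, 15.5, 11.8, 9.1, 6.9])  # hypothetical prices ($1000s)

w, w0 = np.polyfit(x, y, deg=1)             # least-squares fit of y = w*x + w0
print(f"predicted price of a 4-year-old car: {w * 4 + w0:.1f}k")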
Regression Applications
• Navigating a car: angle of the steering wheel (CMU NavLab)
• Kinematics of a robot arm: given a target position (x, y), find joint angles α1 = g1(x, y) and α2 = g2(x, y)
Supervised Learning: Uses
Example: decision-tree tools that create rules
• Prediction of future cases: use the rule to predict the output for future inputs
• Knowledge extraction: the rule is easy to understand
• Compression: the rule is simpler than the data it explains
• Outlier detection: exceptions that are not covered by the rule, e.g., fraud
Unsupervised Learning
• Learning “what normally happens”
• No output
• Clustering: grouping similar instances
• Other applications: summarization, association analysis
• Example applications
– Customer segmentation in CRM
Reinforcement Learning
• Topics:
– Policies: what actions should an agent take in a particular situation
– Utility estimation: how good is a state (→ used by the policy)
• No supervised output, but a delayed reward
• Credit assignment problem (what was responsible for the outcome)
• Applications:
– Game playing
Learning: An example application
• An emergency room in a hospital measures 17 variables (e.g., blood pressure, age, etc.) of newly admitted patients.
• A decision is needed: whether to put a new patient in an intensive-care unit.
• Due to the high cost of the ICU, patients who may survive less than a month are given higher priority.
Another application
• A credit card company receives thousands of applications for new cards. Each application contains information about an applicant:
– age
– marital status
– annual salary
– outstanding debts
– credit rating
– etc.
• Problem: to decide whether an application should be approved.
Machine learning and our focus
• Like human learning from past experiences.
• A computer does not have “experiences”.
• A computer system learns from data, which represent some “past experiences” of an application domain.
• Our focus: learn a target function that can be used to predict the values of a discrete class attribute, e.g., approved or not approved, and high-risk or low-risk.
The data and the goal
• Data: a set of data records (also called examples, instances, or cases) described by
– k attributes: A1, A2, …, Ak
– a class: each example is labelled with a pre-defined class
• Goal: to learn a classification model from the data that can be used to predict the classes of new cases.
An example: data (loan application)
[Table of loan-application records with their attributes and the class label: approved or not.]
An example: the learning task
• Learn a classification model from the data
• Use the model to classify future loan applications into
– Yes (approved) and
– No (not approved)
• What is the class for the following case/instance?
Supervised learning process: two steps
1. Learning (training): learn a model using the training data.
2. Testing: test the model using unseen test data to assess the model’s accuracy.
What do we mean by learning?
• Given
– a data set D,
– a task T, and
– a performance measure M,
a computer system is said to learn from D to perform the task T if, after learning, the system’s performance on T improves as measured by M.
• In other words, the learned model helps the system perform T better than without learning.
An example
• Data: loan application data
• Task: predict whether a loan should be approved or not.
• Performance measure: accuracy.

No learning: classify all future applications (test data) to the majority class (i.e., Yes).
Fundamental assumption of learning
Assumption: the distribution of training examples is identical to the distribution of test examples (including future unseen examples).

• In practice, this assumption is often violated to some degree.
• Strong violations will clearly result in poor classification accuracy.
• To achieve good accuracy on the test data, training examples must be sufficiently representative of the test data.
Evaluating classification methods
• Predictive accuracy
• Efficiency
– time to construct the model
– time to use the model
• Robustness: handling noise and missing values
• Scalability: efficiency on disk-resident databases
• Interpretability: understandability of, and insight provided by, the model
• Compactness of the model: e.g., the size of the tree or the number of rules
Evaluation methods
• Holdout set: the available data set D is divided into two disjoint subsets,
– the training set Dtrain (for learning a model)
– the test set Dtest (for testing the model)
• Important: the training set should not be used in testing, and the test set should not be used in learning.
– An unseen test set provides an unbiased estimate of accuracy.
• The test set is also called the holdout set. (The examples in the original data set D are all labeled with classes.)
• This method is mainly used when the data set D is large.
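A sketch of the holdout split (a random partition; X and y are assumed to be NumPy arrays):

import numpy as np

def holdout_split(X, y, test_frac=0.3, seed=0):
    # Split D into disjoint Dtrain and Dtest; the test set stays unseen.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_test = int(len(X) * test_frac)
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return X[train_idx], y[train_idx], X[test_idx], y[test_idx]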
Evaluation methods (cont…)
• n-fold cross-validation: the available data is partitioned into n equal-size disjoint subsets.
• Use each subset as the test set and combine the remaining n−1 subsets as the training set to learn a classifier.
• The procedure is run n times, which gives n accuracies.
• The final estimated accuracy of learning is the average of the n accuracies.
• 10-fold and 5-fold cross-validation are commonly used.
• This method is used when the available data is not large.
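A sketch of n-fold cross-validation written against generic train/predict callables (the function names are placeholders, not a specific library’s API):

import numpy as np

def cross_val_accuracy(train_fn, predict_fn, X, y, n_folds=10, seed=0):
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), n_folds)
    accuracies = []
    for test_idx in folds:
        train_mask = np.ones(len(X), dtype=bool)
        train_mask[test_idx] = False            # this fold is the test set
        model = train_fn(X[train_mask], y[train_mask])
        preds = predict_fn(model, X[test_idx])
        accuracies.append(np.mean(preds == y[test_idx]))
    return float(np.mean(accuracies))           # average of the n accuracies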
Evaluation methods (cont…)
• Leave-one-out cross-validation: this method is used when the data set is very small.
• It is a special case of cross-validation.
• Each fold of the cross-validation has only a single test example, and all the rest of the data is used in training.
• If the original data has m examples, this is m-fold cross-validation.
Evaluation methods (cont…)
• Validation set: the available data is divided into three subsets,
– a training set,
– a validation set, and
– a test set.
• A validation set is used frequently for estimating parameters in learning algorithms.
• In such cases, the values that give the best accuracy on the validation set are used as the final parameter values.
• Cross-validation can be used for parameter estimation as well.
Classification measures
• Accuracy is only one measure (error = 1 − accuracy).
• Accuracy is not suitable in some applications.
• In text mining, we may only be interested in the documents of a particular topic, which are only a small portion of a big document collection.
• In classification involving skewed or highly imbalanced data, e.g., network intrusion and financial fraud detection, we are interested only in the minority class.
– High accuracy does not mean any intrusion is detected.
– E.g., with 1% intrusion, a classifier can achieve 99% accuracy by doing nothing (predicting “no intrusion” for every case).
Precision and recall measures
• Used in information retrieval and text classification.
• We use a confusion matrix to introduce them (shown below).
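For a two-class problem, the confusion matrix has the standard layout:

                     Classified positive     Classified negative
Actual positive      TP (true positives)     FN (false negatives)
Actual negative      FP (false positives)    TN (true negatives)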
Precision and recall measures (cont…)
• Precision p is the number of correctly classified positive examples divided by the total number of examples that are classified as positive: p = TP / (TP + FP).
• Recall r is the number of correctly classified positive examples divided by the total number of actual positive examples: r = TP / (TP + FN).
An example
• This confusion matrix gives
– precision p = 100% and
– recall r = 1%
because we only classified one positive example correctly and no negative examples wrongly.
• Note: precision and recall only measure classification on the positive class.
F1-value (also called F1-score)
• It is hard to compare two classifiers using two measures. The F1 score combines precision and recall into one measure:
F1 = 2pr / (p + r)
• The harmonic mean of two numbers tends to be closer to the smaller of the two.
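A small Python sketch computing all three measures from confusion-matrix counts; the counts below reproduce the previous slide’s example, assuming 100 actual positives:

def precision_recall_f1(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0     # precision
    r = tp / (tp + fn) if tp + fn else 0.0     # recall
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# One positive classified correctly, no negatives classified wrongly,
# 99 actual positives missed:
print(precision_recall_f1(tp=1, fp=0, fn=99))  # (1.0, 0.01, ~0.0198)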
Resources: Datasets
• UCI Repository: http://www.ics.uci.edu/~mlearn/MLRepository.html
• UCI KDD Archive: http://kdd.ics.uci.edu/summary.data.application.html
• Statlib: http://lib.stat.cmu.edu/
• Delve: http://www.cs.utoronto.ca/~delve/
Resources: Journals
• Journal of Machine Learning Research (www.jmlr.org)
• Machine Learning
• IEEE Transactions on Neural Networks
• IEEE Transactions on Pattern Analysis and Machine Intelligence
• Annals of Statistics
• Journal of the American Statistical Association
Resources: Conferences
• International Conference on Machine Learning (ICML)
• European Conference on Machine Learning (ECML)
• Neural Information Processing Systems (NIPS)
• Conference on Computational Learning Theory (COLT)
• International Joint Conference on Artificial Intelligence (IJCAI)
• ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD)
• IEEE International Conference on Data Mining (ICDM)
References and acknowledgement
• Srihari, S.N., Govindaraju, V., Pattern Recognition, Chapman & Hall, London, 1034–1041, 1993.
• Sergios Theodoridis, Konstantinos Koutroumbas, Pattern Recognition, Elsevier (USA), 1982.
• R.O. Duda, P.E. Hart, and D.G. Stork, Pattern Classification, New York: John Wiley, 2001.
• W.L. Chao, J.J. Ding, “Integrated Machine Learning Algorithms for Human Age Estimation”, NTU, 2011.
• Semi-supervised Learning, Avrim Blum.
An Example
• Suppose that a fish-packing plant wants to automate the process of sorting incoming fish on a conveyor belt according to species,
• There are two species: sea bass and salmon.
An Example
• How to distinguish one species from the other? (length, width, weight, number and shape of fins, tail shape, etc.)
An Example
• Suppose somebody at the fish plant tells us that:
– Sea bass is generally longer than salmon.
• Then our model for the fish:
– Sea bass have some typical length, and this is greater than that for salmon.
An Example
• Then length becomes a feature.
• We might attempt to classify the fish by seeing whether or not the length of a fish exceeds some critical value (threshold) l*.
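As code, this single-feature rule is one comparison (a sketch; l* is the threshold still to be chosen, and the side assignment follows the model above):

def classify_by_length(length, l_star):
    # Sea bass are assumed longer than salmon, so lengths above l* -> sea bass.
    return "sea bass" if length > l_star else "salmon"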
An Example
• How to decide on the critical value (threshold) l*?
An Example
• How to decide on the critical value (threshold) l*?
– We could obtain some training samples of different types of fish,
– make length measurements, and
– inspect the results.
An Example
• Measurement results on the training samples for the two species. [Histograms of length for sea bass and salmon.]
An Example
• Can we reliably separate sea bass from salmon by using length as a feature?
• Remember our model: sea bass have some typical length, and this is greater than that for salmon.
An Example
• From the histograms we can see that this single criterion is quite poor.
An Example
• It is obvious that length is not a good feature.
• What can we do to separate sea bass from salmon?
An Example
• What can we do to separate sea bass from salmon?
• Try another feature:
– average lightness of the fish scales.
An Example
• Can we reliably separate sea bass from salmon by using lightness as a feature?
An Example
• Lightness is better than length as a feature, but again there are some problems.
An Example
• Suppose we also know that:
– Sea bass are typically wider than salmon.
• We can use more than one feature for our decision:
– lightness (x1) and width (x2).
An Example
• Each fish is now a point in two dimensions:
– lightness (x1) and width (x2)
[Scatter plot of the training samples in the lightness–width plane with a decision boundary.]
Cost of error
• The cost of different errors must be considered when making decisions.
• We try to design a decision rule so as to minimize such a cost.
• This is the central task of decision theory.
Cost of error
• For example, suppose the fish-packing company knows that:
– Customers who buy salmon will object if they see sea bass in their cans.
– Customers who buy sea bass will not be unhappy if they occasionally see some expensive salmon in their cans.
Decision boundaries
• We can perform better if we use more
complex decision boundaries.
Decision boundaries
• There is a trade-off between the complexity of the decision rules and their performance on unknown samples.
• Generalization: the ability of the classifier to produce correct results on novel patterns.
The design cycle
• Collect data:
– Collect train and test data
• Choose features:
– Domain dependence and prior information,
– Computational cost and feasibility,
– Discriminative features,
– Invariant features with respect to translation, rotation and
scale,
– Robust features with respect to occlusion, distortion,
deformation, and variations in environment.
Feature Selection & Extraction
• Selection vs. extraction
• How many, and which subset of, features to use in constructing the decision boundary?
• Some features may be redundant.
• Curse of dimensionality: the error rate may in fact increase with too many features.
Unsupervised Learning
• Definition of unsupervised learning:
Learning useful structure without labeled classes, optimization criterion, feedback signal, or any other information beyond the raw data
Unsupervised Learning
• Examples:
– Find natural groupings of Xs (X = human languages, stocks, gene sequences, animal species, …) → prelude to discovery of underlying properties
– Summarize the news for the past month → cluster first, then report centroids
– Sequence extrapolation: e.g. predict cancer incidence next decade; predict rise in antibiotic-resistant bacteria
• Methods
– Clustering (n-link, k-means, GAC, …)
– Taxonomy creation (hierarchical clustering)
Similarity Measures in Data Analysis
• General assumptions
– Each data item is a tuple (vector)
– Values of the tuple are nominal, ordinal, or numerical
– Similarity = (Distance)^(-1)
• Pure numerical tuples
– Sim(di, dj) = Σk di,k · dj,k (the dot product of the two vectors)
Similarity Measures in Data Analysis
• For ordinal values
– E.g. "small," "medium," "large," "X-large"
– Convert to numerical values assuming a constant Δ on a normalized [0, 1] scale, where max(v) = 1, min(v) = 0, and the others interpolate
– E.g. "small" = 0, "medium" = 0.33, etc.
– Then use the numerical similarity measures
Similarity Measures (cont.)
• For nominal values
– E.g. "Boston", "LA", "Pittsburgh", or "male", "female", or "diffuse", "globular", "spiral", "pinwheel"
– Binary rule: if di,k = dj,k, then sim = 1, else 0
– Or use an underlying semantic property: e.g. Sim(Boston, LA) = α·dist(Boston, LA)^(-1), or Sim(Boston, LA) = α·(|size(Boston) − size(LA)|)^(-1)
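A Python sketch of the three cases (the ordinal scale follows the slide; turning the ordinal gap into a similarity via 1 − |difference| is our assumption, not from the slides):

def sim_numeric(a, b):
    # Pure numerical tuples: Sim(di, dj) = sum_k d_{i,k} * d_{j,k}
    return sum(x * y for x, y in zip(a, b))

ORDINAL_SCALE = {"small": 0.0, "medium": 0.33, "large": 0.67, "x-large": 1.0}

def sim_ordinal(a, b, scale=ORDINAL_SCALE):
    # Map ordinal values onto a normalized [0, 1] scale, then compare.
    return 1.0 - abs(scale[a] - scale[b])   # assumed distance-to-similarity rule

def sim_nominal(a, b):
    # Binary rule: 1 if the values match, else 0.
    return 1.0 if a == b else 0.0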
Similarity Matrix

         tiny  little  small  medium  large  huge
tiny     1.0   0.8     0.7    0.5     0.2    0.0
little         1.0     0.9    0.7     0.3    0.1
small                  1.0    0.7     0.3    0.2
medium                        1.0     0.5    0.3
large                                 1.0    0.8
huge                                         1.0

– The diagonal must be 1.0
– The monotonicity property must hold
Supervised Learning
• Training samples (X, F(X))
• Discrete F(X): classification
• Continuous F(X): regression

• Example: (1, 2), (2, 6), (3, 12), (4, 20), …, (10, 110)
Y = F(X)
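The listed pairs are consistent with the hypothesis F(X) = X(X + 1), which a short check confirms (the hypothesis is inferred from the examples, not stated on the slide):

pairs = [(1, 2), (2, 6), (3, 12), (4, 20), (10, 110)]
F = lambda x: x * (x + 1)                  # hypothesized target function
assert all(y == F(x) for x, y in pairs)    # fits every training pair
print(F(5))                                # predict a new example: F(5) = 30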