Machine Learning: Foundations

Course Number 0368403401

Prof. Nathan Intrator

Teaching Assistants: Daniel Gill, Guy Amit


Course structure

• There will be 4 homework exercises
• They will be theoretical as well as programming
• All programming will be done in Matlab
• Course info can be accessed from www.cs.tau.ac.il/~nin
• The format of the final has not been decided yet
• Office hours: Wednesday 4-5 (contact via email)
Class Notes

• Groups of 2-3 students will be responsible for scribing class notes
• Submission of class notes by the next Monday (1 week), then corrections
and additions from Thursday to the following Monday
• 30% contribution to the grade
Class Notes (cont’d)

• Notes will be written in LaTeX and compiled into PDF via MiKTeX
(download from the School site)
• The style file can be found on the course web site
• Figures in GIF
Basic Machine Learning idea

• Receive a collection of observations associated with some action label
• Perform some kind of “machine learning” in order to:
– Receive a new observation
– “Process” it and generate an action label based on previous observations
• Main requirement: good generalization


Learning Approaches

• Store observations in memory and retrieve
– Simple, little generalization (distance measure?)
• Learn a set of rules and apply to new data
– Sometimes difficult to find a good model
– Good generalization
• Estimate a “flexible model” from the data
– Generalization issues, data size issues
Storage & Retrieval

• Simple, computationally intensive
– Little generalization
• How can retrieval be performed?
– Requires a “distance measure” between stored observations and the new
observation
• The distance measure can be given or “learned” (clustering)
Learning Set of Rules

• How to create a “reliable” set of rules from the observed data
– Tree structures
– Graphical models
• Complexity of the set of rules vs. generalization


Estimation of a flexible model

• What is a “flexible” model?
– A universal approximator
– Reliability and generalization; data size issues
Applications

• Control
– Robot arm
– Driving and navigating a car
– Medical applications:
• Diagnosis, monitoring, drug release, gene analysis
• Web retrieval based on user profile
– Customized ads: Amazon
– Document retrieval: Google
Related Disciplines

[Diagram: Machine Learning at the center, connected to related disciplines:
AI, decision theory, game theory, control theory, information theory,
biological evolution, probability & statistics, philosophy, optimization,
data mining, statistical mechanics, psychology, computational complexity
theory, neurophysiology]
Example 1: Credit Risk Analysis

• Typical customer: a bank.
• Database:
– Current clients’ data, including:
– basic profile (income, house ownership, delinquent accounts, etc.)
– Basic classification.
• Goal: predict/decide whether to grant credit.
Example 1: Credit Risk Analysis

• Rules learned from data:

IF Other-Delinquent-Accounts > 2 AND
   Number-Delinquent-Billing-Cycles > 1
THEN DENY CREDIT

IF Other-Delinquent-Accounts = 0 AND
   Income > $30k
THEN GRANT CREDIT
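
Since the course programming is in Matlab, here is a minimal sketch of how
such learned rules could be applied to a new client record; the field names
and the fallback case are illustrative assumptions, not part of the slides.

% Apply the two learned rules above to one client record.
% The field names (otherDelinquent, delinquentCycles, income) and the
% fallback case are hypothetical.
function decision = creditDecision(client)
    if client.otherDelinquent > 2 && client.delinquentCycles > 1
        decision = 'DENY CREDIT';
    elseif client.otherDelinquent == 0 && client.income > 30000
        decision = 'GRANT CREDIT';
    else
        decision = 'UNDECIDED';    % neither learned rule fires
    end
end

For example, creditDecision(struct('otherDelinquent',0,
'delinquentCycles',0, 'income',45000)) returns 'GRANT CREDIT'.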
Example 2: Clustering news

• Data: Reuters news / Web data
• Goal: basic category classification:
– Business, sports, politics, etc.
– Classify into subcategories (unspecified)
• Methodology:
– Consider “typical words” for each category.
– Classify using a “distance” measure (see the sketch below).
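
A toy Matlab illustration of this methodology; the category word lists and
the word-overlap score standing in for the distance measure are invented
for the example, not taken from the course.

% Toy sketch: "typical words" per category and an overlap score standing
% in for the distance measure. Word lists and the document are made up.
categories   = {'business', 'sports', 'politics'};
typicalWords = { {'market','stock','profit','bank'}, ...
                 {'game','team','score','league'}, ...
                 {'election','minister','vote','party'} };

doc = lower(strsplit('The team won the game with a late score'));

overlap = zeros(1, numel(categories));
for c = 1:numel(categories)              % count typical words in the doc
    overlap(c) = numel(intersect(doc, typicalWords{c}));
end
[~, best] = max(overlap);                % the "nearest" category wins
fprintf('Classified as: %s\n', categories{best});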
Example 3: Robot control

• Goal: control a robot in an unknown environment.
• Needs both
– to explore (new places and actions)
– to use acquired knowledge to gain benefits.
• The learning task “controls” what it observes!
Example 4: Medical Application

• Goal: monitor multiple physiological parameters.
• Needs both
– to explore (new places and actions)
– to use acquired knowledge to gain benefits.
• The learning task “controls” what it observes!
History of Machine Learning

• 1960’s and 70’s: Models of human learning
– High-level symbolic descriptions of knowledge, e.g., logical expressions
or graphs/networks, e.g., (Karpinski & Michalski, 1966), (Simon & Lea,
1974).
– Winston’s (1975) structural learning system learned logic-based
structural descriptions from examples.
• Minsky & Papert, 1969
• 1970’s: Genetic algorithms
– Developed by Holland (1975)
• 1970’s - present: Knowledge-intensive learning
– A tabula rasa approach typically fares poorly. “To acquire new
knowledge a system must already possess a great deal of initial
knowledge.” Lenat’s CYC project is a good example.
History of Machine Learning (cont’d)
• 1970’s - present: Alternative modes of learning (besides
examples)
– Learning from instruction, e.g., (Mostow, 1983) (Gordon &
Subramanian, 1993)
– Learning by analogy, e.g., (Veloso, 1990)
– Learning from cases, e.g., (Aha, 1991)
– Discovery (Lenat, 1977)
– 1991: The first of a series of workshops on Multistrategy Learning
(Michalski)
• 1970’s – present: Meta-learning
– Heuristics for focusing attention, e.g., (Gordon & Subramanian, 1996)
– Active selection of examples for learning, e.g., (Angluin, 1987),
(Gasarch & Smith, 1988), (Gordon, 1991)
– Learning how to learn, e.g., (Schmidhuber, 1996)
History of Machine Learning (cont’d)

• 1980 – The First Machine Learning Workshop was held at Carnegie-Mellon
University in Pittsburgh.
• 1980 – Three consecutive issues of the International Journal of Policy
Analysis and Information Systems were specially devoted to machine
learning.
• 1981 – Hinton, Jordan, Sejnowski, Rumelhart, McClelland at UCSD
– The back-propagation algorithm; the PDP book
• 1986 – The establishment of the Machine Learning journal.
• 1987 – The beginning of annual international conferences on machine
learning (ICML); the Snowbird ML conference.
• 1988 – The beginning of regular workshops on computational learning
theory (COLT).
• 1990’s – Explosive growth in the field of data mining, which involves the
application of machine learning techniques.
Bottom line from History

• 1960 – The Perceptron (Rosenblatt)
• 1960 – Bellman’s “Curse of Dimensionality”
• 1980 – Bounds on statistical estimators (C. Stone)
• 1990 – Beginning of high dimensional data (hundreds of variables)
• 2000 – High dimensional data (thousands of variables)
A Glimpse into the future

• Today’s status:
– First-generation algorithms:
– Neural nets, decision trees, etc.
• Future:
– Smart remote controls, phones, cars
– Data and communication networks, software
Types of models

• Supervised learning
– Given access to classified (labeled) data
• Unsupervised learning
– Given access to data, but no classification
– Important for data reduction
• Control learning
– Selects actions and observes consequences
– Maximizes long-term cumulative return
Learning: Complete Information

• A probability distribution D1 over one class (“smiley”) and D2 over the
other.
• The two classes are equally likely.
• Compute the probability of “smiley” given a point (x,y).
• Use Bayes’ formula.
• Let p be that probability.

[Figure: points from the two classes scattered in the plane, with a query
point (x,y)]
Task: generate a class label for a point at location (x,y)

P(S | (x,y)) = P((x,y) | S) P(S) / P((x,y))
             = P((x,y) | S) P(S) / [ P((x,y) | S) P(S) + P((x,y) | H) P(H) ]

• Decide between S and H by comparing P(S | (x,y)) to P(H | (x,y)).
• Clearly, one needs to know all these probabilities (see the sketch below).
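
A small Matlab sketch of this computation, assuming for illustration that
the two class-conditional densities are known 2-D Gaussians; mvnpdf
requires the Statistics and Machine Learning Toolbox, and all means and
covariances are made up.

% Posterior P(S | (x,y)) for two equally likely classes S and H,
% assuming (illustratively) known Gaussian class-conditional densities.
muS = [0 0];  muH = [2 2];         % hypothetical class means
Sigma = eye(2);                    % shared covariance, for simplicity
pS = 0.5;  pH = 0.5;               % equally likely classes

point = [1.2 0.7];                 % the new observation (x,y)

likS = mvnpdf(point, muS, Sigma);  % P((x,y) | S)
likH = mvnpdf(point, muH, Sigma);  % P((x,y) | H)

postS = likS * pS / (likS * pS + likH * pH);   % Bayes formula
if postS > 0.5, label = 'S'; else, label = 'H'; end
fprintf('P(S | (x,y)) = %.3f -> predict %s\n', postS, label);

Comparing the posterior to 1/2, as in the last two lines, is exactly the
zero-one-loss prediction discussed on the next slide.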
Predictions and Loss Model

• How do we determine the optimality of a prediction?
• We define a loss for every prediction and try to minimize it
– Predict a Boolean value.
– On each error we lose 1 (no error, no loss).
– Compare the probability p to 1/2.
– Predict deterministically with the higher value.
– This is the optimal prediction for zero-one loss.
• Cannot recover the probabilities!
Bayes Estimator

• A Bayes estimator associated with a prior distribution π and a loss
function L is an estimator δ that minimizes the Bayes risk r(π, δ). For
every x it is given by δ(x), the argument minimizing the posterior
expected loss over estimates d. The value r(π) = r(π, δ) is then called
the Bayes risk.
Other Loss Models

• Quadratic loss
– Predict a “real number” q for outcome 1.
– Loss (q − p)² for outcome 1
– Loss ([1−q] − [1−p])² for outcome 0
– Expected loss: (p − q)²
– Minimized for p = q (the optimal prediction; numeric check below)
• Recovers the probabilities
• Needs to know p to compute the loss!
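
A one-screen numerical check of the claim that the expected quadratic loss
(p − q)² is minimized at q = p; the value p = 0.3 is an arbitrary choice.

% Numerical check that the expected quadratic loss is minimized at q = p.
p = 0.3;
q = 0:0.01:1;
expectedLoss = p .* (q - p).^2 + (1 - p) .* ((1 - q) - (1 - p)).^2;
[~, idx] = min(expectedLoss);
fprintf('Minimizing q = %.2f (true p = %.2f)\n', q(idx), p);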
The basic PAC Model

• A batch learning model, i.e., the algorithm is trained over some fixed
data set
• Assumption: a fixed but unknown distribution D of x over a domain X
• The error of a hypothesis h w.r.t. a target concept f is

  e(h) = Pr_D[h(x) ≠ f(x)]

• Goal: given a collection of hypotheses H, find h in H that minimizes e(h).
The basic PAC Model

• As the distribution D is unknown, we are provided with a training set of
m samples S, on which we can estimate the error:

  e’(h) = (1/m) |{ x ∈ S : h(x) ≠ f(x) }|

• Basic question: how close is e(h) to e’(h)? (snippet below)
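
The empirical error is a one-liner in Matlab; h, f, and the sample here are
illustrative stand-ins, not from the slides.

% Empirical error of a hypothesis h on a sample S, as defined above.
h = @(x) x > 0.5;                 % hypothesis (assumed, for illustration)
f = @(x) x > 0.4;                 % target concept (assumed)
S = rand(1000, 1);                % m = 1000 samples from D
empErr = mean(h(S) ~= f(S));      % e'(h) = (1/m)|{x in S : h(x) ~= f(x)}|
fprintf('e''(h) = %.3f\n', empErr);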


Bayesian Theory

Prior distribution over H.

Given a sample S, compute a posterior distribution:

  Pr[h | S] = Pr[S | h] Pr[h] / Pr[S]

Maximum Likelihood (ML): maximize Pr[S | h]
Maximum A Posteriori (MAP): maximize Pr[h | S]
Bayesian Predictor: Σ_h h(x) Pr[h | S]
Some Issues in Machine Learning

• What algorithms can approximate functions well, and when?
• How does the number of training examples influence accuracy?
• How does the complexity of the hypothesis representation impact it?
• How does noisy data influence accuracy?
More Issues in Machine Learning

• What are the theoretical limits of learnability?
• How can prior knowledge of the learner help?
• What clues can we get from biological learning systems?
• How can systems alter their own representations?
Complexity vs. Generalization

• Hypothesis complexity versus observed error.
• More complex hypotheses have lower observed error on the training set,
• but might have higher true error (on the test set).
Criteria for Model Selection

Minimum Description Length (MDL):

  e’(h) + |code length of h|

Structural Risk Minimization (SRM):

  e’(h) + (log |H| / m)^(1/2),   where m = # of training samples

• The criteria differ in their assumptions about the a priori likelihood of h
• AIC and BIC are two other theory-based model selection methods
Weak Learning

Small class of predicates H.

Weak learning:
Assume that for any distribution D, there is some predicate h ∈ H that
predicts better than 1/2 + ε.

Multiple weak learners → strong learning
Boosting Algorithms

Functions: weighted majority of the predicates.

Methodology:
Change the distribution to target “hard” examples.
The weight of an example is exponential in the number of its incorrect
classifications.

Good experimental results and efficient algorithms (sketch below).
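
The slides do not fix a specific algorithm, but AdaBoost with threshold
(“decision stump”) predicates is a standard instance of this scheme. The
following Matlab sketch on made-up 1-D data shows the exponential
reweighting and the weighted-majority output; every constant is arbitrary.

% Minimal AdaBoost sketch with decision-stump weak predicates.
rng(0);
m = 100;
x = rand(m,1);  y = sign(x - 0.35);  y(y==0) = 1;   % toy labeled sample
T = 10;                                % boosting rounds
w = ones(m,1) / m;                     % distribution over examples
thr = zeros(T,1); pol = zeros(T,1); alpha = zeros(T,1);

for t = 1:T
    bestErr = inf;
    for s = [-1 1]                     % stump polarity
        for c = 0:0.05:1               % candidate thresholds
            pred = s * sign(x - c);  pred(pred==0) = s;
            err = sum(w .* (pred ~= y));     % weighted error
            if err < bestErr
                bestErr = err;  thr(t) = c;  pol(t) = s;
            end
        end
    end
    bestErr = max(bestErr, 1e-10);               % avoid log(0)
    alpha(t) = 0.5 * log((1 - bestErr) / bestErr);
    pred = pol(t) * sign(x - thr(t));  pred(pred==0) = pol(t);
    w = w .* exp(-alpha(t) * y .* pred);         % weight of an example is
    w = w / sum(w);                              % exponential in its errors
end

F = zeros(m,1);                        % weighted majority of the stumps
for t = 1:T
    pred = pol(t) * sign(x - thr(t));  pred(pred==0) = pol(t);
    F = F + alpha(t) * pred;
end
fprintf('Training error: %.3f\n', mean(sign(F) ~= y));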


Computational Methods

• How to find a hypothesis h from a collection H with low observed error?
• In most cases the computational task is provably hard.
• Some methods work only for a binary h; others for both.
Nearest Neighbor Methods

Classify using near examples (sketch below).

Assume a “structured space” and a “metric”.

[Figure: + and − labeled points in the plane; a new point “?” is assigned
the label of its near examples]
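
A minimal k-nearest-neighbor sketch in Matlab, assuming Euclidean distance
as the metric (the slides leave it unspecified); the stored points are toy
data, and the implicit expansion in (X - q) needs R2016b or later.

% Minimal k-nearest-neighbor sketch with Euclidean distance (assumed).
X = [1 1; 1 2; 2 1; 5 5; 6 5; 5 6];   % stored observations
y = [1; 1; 1; -1; -1; -1];            % their +/- labels
q = [2 2];                            % new observation to classify
k = 3;

d = sqrt(sum((X - q).^2, 2));         % distance to every stored example
[~, order] = sort(d);
label = sign(sum(y(order(1:k))));     % majority vote among the k nearest
fprintf('Predicted label: %+d\n', label);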
Separating Hyperplane

Perceptron: output = sign( Σ_i x_i w_i )

Find the weights w_1, ..., w_n (training sketch below).

Limited representation.

[Figure: a single unit computing the weighted sum of the inputs x_1 ... x_n
and applying the sign function]
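
A sketch of the classical perceptron training rule on toy separable data;
folding the bias in as a constant input is an assumption of this sketch,
not stated on the slide.

% Perceptron training sketch on toy separable data.
rng(1);
X = [randn(20,2) + 3; randn(20,2) - 3];   % two well-separated clouds
y = [ones(20,1); -ones(20,1)];
X = [X ones(40,1)];                       % append the constant bias input
w = zeros(3,1);

for epoch = 1:100
    mistakes = 0;
    for i = 1:40
        if sign(X(i,:) * w) ~= y(i)       % misclassified (or on the plane)?
            w = w + y(i) * X(i,:)';       % perceptron update
            mistakes = mistakes + 1;
        end
    end
    if mistakes == 0, break; end          % all points separated
end
fprintf('Separated the data after %d epochs\n', epoch);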
Neural Networks

Sigmoidal gates:
  a = Σ_i x_i w_i and output = 1/(1 + e^(−a))

[Figure: a layered network with inputs x_1 ... x_n feeding sigmoidal units]

Learning by “back propagation” of errors (single-unit sketch below)
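
To make the gate concrete, here is one forward pass and one gradient step
for a single sigmoidal unit under squared-error loss; a full multi-layer
back-propagation derivation is beyond this sketch, and all numbers are
made up.

% One forward pass and one gradient step for a single sigmoidal gate.
sigmoid = @(a) 1 ./ (1 + exp(-a));

x = [0.5; -1.2; 0.3];    % inputs x1..xn (arbitrary)
w = [0.1; 0.4; -0.2];    % weights w1..wn (arbitrary)
t = 1;                   % target output
eta = 0.5;               % learning rate

a   = w' * x;            % a = sum_i x_i w_i
out = sigmoid(a);        % output = 1/(1 + e^-a)

err  = out - t;                      % back-propagate the error:
grad = err * out * (1 - out) * x;    % chain rule through loss and sigmoid
w = w - eta * grad;                  % gradient-descent weight update
fprintf('Output %.3f before the update\n', out);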


Decision Trees

[Figure: a decision tree. The root tests x1 > 5; one branch predicts +1,
the other tests x6 > 2 and predicts +1 or −1]
Decision Trees

Top-down construction:
Construct the tree greedily, using a local index function:
Gini index G(x) = x(1−x), entropy H(x), ... (one-liners below)

Bottom-up model selection:
Prune the decision tree while maintaining a low observed error.
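
The two index functions named above, as Matlab one-liners; x stands for the
fraction of positive examples at a node.

% Local index functions for choosing a split.
G = @(x) x .* (1 - x);                              % Gini index
H = @(x) -x .* log2(x) - (1 - x) .* log2(1 - x);    % entropy
fprintf('G(0.3) = %.3f, H(0.3) = %.3f\n', G(0.3), H(0.3));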
Decision Trees

• Limited representation
• Highly interpretable
• Efficient training and retrieval algorithms
• Smart cost/complexity pruning
• Aim: find a small decision tree with a low observed error.
Support Vector Machine

[Figure: data projected from n dimensions to m dimensions, m >> n]
Support Vector Machine

Project the data to a high dimensional space.

Use a hyperplane in the LARGE space.

Choose a hyperplane with a large MARGIN (toy example below).

[Figure: + and − points separated by a hyperplane with a wide margin]
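
For illustration, Matlab’s fitcsvm (from the Statistics and Machine
Learning Toolbox) fits such a maximum-margin classifier; the polynomial
kernel plays the role of the projection to a higher-dimensional space. The
two point clouds are made up.

% Maximum-margin classifier with an implicit high-dimensional projection.
rng(2);
X = [randn(30,2); randn(30,2) + 3];
y = [ones(30,1); -ones(30,1)];
model = fitcsvm(X, y, 'KernelFunction', 'polynomial', 'PolynomialOrder', 2);
pred = predict(model, X);
fprintf('Training accuracy: %.2f\n', mean(pred == y));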
Reinforcement Learning

• Main idea: learning with a delayed reward
• Uses dynamic programming and supervised learning
• Addresses problems that cannot be addressed by regular supervised methods
• E.g., useful for control problems
• Dynamic programming searches for optimal policies (sketch below).
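
As a minimal illustration of the dynamic-programming side, here is value
iteration on a made-up 3-state chain MDP; this is a sketch of the idea,
not a method from the slides.

% Value iteration on a toy 3-state chain MDP (reward only in state 3).
nS = 3;  gamma = 0.9;
next = [1 2; 1 3; 2 3];       % next(s,a): deterministic left/right moves
R = [0; 0; 1];                % reward for occupying each state
V = zeros(nS, 1);             % state values

for iter = 1:200
    Vnew = zeros(nS, 1);
    for s = 1:nS
        Vnew(s) = R(s) + gamma * max(V(next(s,:)));   % Bellman backup
    end
    if max(abs(Vnew - V)) < 1e-6, break; end
    V = Vnew;
end
disp(V')    % values grow toward the rewarding state: ~[8.1 9.0 10.0]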


Genetic Programming

A search method. Example: decision trees.

• Local mutation operations: change a node in a tree
• Cross-over operations: replace a subtree by another tree
• Keep the “best” candidates: keep trees with a low observed error
Unsupervised learning: Clustering

[Two figure-only slides illustrating clustering of unlabeled points]
Basic Concepts in Probability

• For a single hypothesis h:
– Given an observed error,
– bound the true error.
• Markov inequality:

  Pr[x ≥ α] ≤ E[x] / α
Basic Concepts in Probability

• Chebyshev inequality:

  Pr[ |x − E[x]| ≥ α ] ≤ Var(x) / α²
Basic Concepts in Probability

• Chernoff inequality:

  x_i ∈ {0,1} i.i.d., Pr(x_i = 1) = p

  Pr[ |(1/n) Σ_{i=1}^n x_i − p| ≥ ε ] ≤ 2 e^(−2ε²n)

Convergence rate of the empirical mean to the true mean (check below).
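
A quick Monte Carlo check of this bound in Matlab; p, n, epsilon, and the
number of trials are arbitrary choices.

% Monte Carlo check of the Chernoff bound.
p = 0.3;  n = 200;  epsilon = 0.05;  trials = 10000;
rng(3);
samples  = rand(trials, n) < p;                 % Bernoulli(p) draws
empMeans = mean(samples, 2);                    % empirical mean per trial
observed = mean(abs(empMeans - p) > epsilon);   % large-deviation fraction
bound    = 2 * exp(-2 * epsilon^2 * n);
fprintf('Observed %.4f <= bound %.4f\n', observed, bound);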


Basic Concepts in Probability

• Switching from h1 to h2:
– Given the observed errors,
– predict whether h2 is better.
• Total error rate
• Cases where h1(x) ≠ h2(x)
– A more refined comparison
Course structure

• Store observations in memory and retrieve
– Simple, little generalization (distance measure?)
• Learn a set of rules and apply to new data
– Sometimes difficult to find a good model
– Good generalization
• Estimate a “flexible model” from the data
– Generalization issues, data size issues
• Some issues in Machine Learning
Fourier Transform

f(x) = Σ_z c_z χ_z(x),   where χ_z(x) = (−1)^⟨x,z⟩

Many simple classes are well approximated using the large coefficients.

Efficient algorithms exist for finding the large coefficients (brute-force
sketch below).
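
A brute-force Matlab sketch that computes all coefficients c_z of a small
Boolean function by enumeration; the example f (majority of 3 bits, as a
±1 function) is made up, and the efficient large-coefficient algorithms
alluded to above (e.g., Kushilevitz-Mansour) are beyond this sketch.

% All Fourier coefficients of a Boolean function over {0,1}^n in the
% parity basis chi_z(x) = (-1)^<x,z>.
n = 3;
xs = dec2bin(0:2^n-1) - '0';          % every point of {0,1}^n, one per row
f = double(sum(xs,2) >= 2) * 2 - 1;   % f(x) = +1 iff a majority of bits set

c = zeros(2^n, 1);
for zi = 1:2^n
    z = xs(zi,:);
    chi = (-1) .^ (xs * z');          % chi_z(x) for every x
    c(zi) = mean(f .* chi);           % c_z = E_x[ f(x) chi_z(x) ]
end
disp([xs c])    % large |c_z| identify the important parities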


General PAC Methodology

Minimize the observed error.

Search for a small-size classifier.

Hand-tailored search methods for specific classes.
Other Models

Membership Queries

[Figure: the learner queries a point x and receives its label f(x)]
