
Outline

• Learning agents
• Inductive learning
• Decision tree learning
Learning
• An agent is learning if it improves its performance on
future tasks after making observations about the world.
• Why would we want an agent to learn? Because the design of the
agent can be improved.
• Why wouldn't the designers just program in the
improvement to begin with?
– First, the designers cannot anticipate all possible situations that
the agent might find itself in.
– Second, the designers cannot anticipate all changes over time.
– Third, sometimes human programmers have no idea how to
program a solution themselves.

• Learning is essential for unknown environments,
– i.e., when designer lacks omniscience

• Learning is useful as a system construction method,
– i.e., expose the agent to reality rather than trying to write it down

• Learning modifies the agent's decision mechanisms to improve performance
Learning agents
Learning element
• Design of a learning element is affected by
– Which components of the performance element are to
be learned
– What prior knowledge the agent already has.
– What feedback is available to learn these components
– What representation is used for the components

• Type of feedback:
– Supervised learning: correct answers for each example
– Semi-supervised learning: correct answers for some
examples
– Unsupervised learning: correct answers not given
– Reinforcement learning: occasional rewards
Machine Learning Algorithms
• Supervised algorithms – the agent observes some example
input–output pairs and learns a function that maps from input to output
(see the sketch after this list):
1. Use training data which has correct answers (class labels).
2. Create a classification model (performance element) by
running the algorithm on the training data.
3. Test the model. If accuracy is low, regenerate the model after
changing features, training samples, etc.
4. Use the model to predict class labels for new incoming data.
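A minimal sketch of these four steps, assuming scikit-learn and its bundled iris dataset purely for illustration (the data, split ratio, and the final sample to classify are not from the slides):

```python
# Sketch of the supervised workflow (steps 1-4) with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# 1. Training data with correct answers (class labels)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# 2. Create a classification model by running the algorithm on the training data
model = DecisionTreeClassifier()
model.fit(X_train, y_train)

# 3. Test the model; if accuracy is low, revisit features, samples, parameters, etc.
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))

# 4. Use the model to predict the class label for new incoming data
print("prediction:", model.predict([[5.0, 3.4, 1.5, 0.2]]))
```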
• Unsupervised algorithms – find hidden relationships in data
– Do not use training data.
– Classes may not be known in advance.
– In unsupervised learning the agent learns patterns in the input
even though no explicit feedback is supplied.
– The most common unsupervised learning task is clustering:
detecting potentially useful clusters of input examples.
• Reinforcement learning –
– How an agent ought to take actions in an environment so as to
maximize some notion of long-term reward.
– In reinforcement learning the agent learns from a series of
reinforcements—rewards or punishments. For example, the lack of a
tip at the end of the journey gives the taxi agent an indication that it
did something wrong.
• In semi-supervised learning we are given a few labeled examples and
must make what we can of a large collection of unlabeled examples.
Supervised Learning
Prediction Problems: Classification
vs. Numeric Prediction
• Classification
– predicts categorical class labels (discrete or nominal)
– classifies data (constructs a model) based on the
training set and the values (class labels) in a
classifying attribute and uses it in classifying new data
• Numeric Prediction
– models continuous-valued functions, i.e., predicts
unknown or missing values
• Typical applications
– Credit/loan approval:
– Medical diagnosis: whether a tumor is cancerous or benign
– Fraud detection: whether a transaction is fraudulent
– Web page categorization: which category a page belongs to
Process (1): Model Construction

Training data → Classification algorithm → Classifier (Model)

Training Data:

NAME   RANK            YEARS  TENURED
Mike   Assistant Prof  3      no
Mary   Assistant Prof  7      yes
Bill   Professor       2      yes
Jim    Associate Prof  7      yes
Dave   Assistant Prof  6      no
Anne   Associate Prof  3      no

Classifier (Model):
IF rank = ‘professor’
OR years > 6
THEN tenured = ‘yes’
Process (2): Using the Model in Prediction

Testing data / unseen data → Classifier → predicted class label

Testing Data:

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Unseen Data:
(Jeff, Professor, 4) → Tenured?
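As a rough illustration (not from the original slides), the two-step process can be written out in Python: the "model" below is exactly the rule shown in Process (1), and the helper name `tenured` is our own:

```python
# Sketch of the two-step process on the tenure example: the "model" is the
# rule the classifier learned from the training data, applied to held-out data.
training_data = [
    ("Mike", "Assistant Prof", 3, "no"),
    ("Mary", "Assistant Prof", 7, "yes"),
    ("Bill", "Professor",      2, "yes"),
    ("Jim",  "Associate Prof", 7, "yes"),
    ("Dave", "Assistant Prof", 6, "no"),
    ("Anne", "Associate Prof", 3, "no"),
]

def tenured(rank, years):
    # Model learned in Process (1): IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
    return "yes" if rank == "Professor" or years > 6 else "no"

testing_data = [
    ("Tom",     "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George",  "Professor",      5, "yes"),
    ("Joseph",  "Assistant Prof", 7, "yes"),
]

# Process (2): evaluate on testing data, then classify the unseen tuple (Jeff, Professor, 4)
correct = sum(tenured(rank, years) == label for _, rank, years, label in testing_data)
print(f"test accuracy: {correct}/{len(testing_data)}")
print("Jeff tenured?", tenured("Professor", 4))   # -> yes
```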
Inductive learning

Simplest form: learn a function from examples
• Type of reasoning that involves moving from a set of
specific facts to a general conclusion.
• f is the target function
• An example is a pair (x, f(x))
• Problem: find a hypothesis h
– such that h ≈ f
– given a training set of examples
• This is a highly simplified model of real learning:
– Ignores prior knowledge
– Assumes examples are given
Inductive learning method
• Construct/adjust h to agree with f on training set
• (h is consistent if it agrees with f on all examples)
• E.g., curve fitting:
[Figures omitted: the same example points fitted by several different hypotheses, from a straight line to higher-degree curves.]
• How do we choose from among multiple consistent hypotheses? One answer is to
prefer the simplest hypothesis consistent with the data. This principle is called
Ockham's razor.
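A small curve-fitting sketch of this idea, assuming NumPy and invented data points: several polynomial hypotheses are fitted to the same examples, and Ockham's razor prefers the simplest one that fits well:

```python
# Sketch of curve fitting as hypothesis choice: each hypothesis h is a
# polynomial fitted to the same (x, f(x)) examples. Data points are invented.
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0.1, 0.9, 2.2, 2.8, 4.1, 4.9])   # roughly linear target f

for degree in (1, 3, 5):
    coeffs = np.polyfit(x, y, degree)            # hypothesis h: degree-d polynomial
    residual = np.sum((np.polyval(coeffs, x) - y) ** 2)
    print(f"degree {degree}: sum of squared errors on training set = {residual:.4f}")

# The degree-5 polynomial passes through all six points (consistent), but the
# degree-1 hypothesis is simpler and typically generalizes better here.
```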
Decision Trees (DT)
• Widely used, practical method for inductive inference.

• Method for approximating discrete-valued functions.

• Exhaustively search hypothesis space.

• Can be used for representing if-then-else rules.


DT Representation
• Classify instances by sorting them down a tree from the
root to some leaf node.
• A leaf node provides the classification of the instance.
• Each node in the tree specifies a test of some attribute
(feature) of the instance.
• Each branch descending from that node represents one
of the possible values for this attribute.
Decision Tree Induction: An Example

• Training data set: Buys_computer
• The data set follows an example of Quinlan's ID3 (Playing Tennis)

age      income  student  credit_rating  buys_computer
<=30     high    no       fair           no
<=30     high    no       excellent      no
31…40    high    no       fair           yes
>40      medium  no       fair           yes
>40      low     yes      fair           yes
>40      low     yes      excellent      no
31…40    low     yes      excellent      yes
<=30     medium  no       fair           no
<=30     low     yes      fair           yes
>40      medium  yes      fair           yes
<=30     medium  yes      excellent      yes
31…40    medium  no       excellent      yes
31…40    high    yes      fair           yes
>40      medium  no       excellent      no

• Resulting tree:

age?
  <=30   → student?
             no  → no
             yes → yes
  31..40 → yes
  >40    → credit_rating?
             excellent → no
             fair      → yes
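As an illustrative sketch (not from the slides), the resulting tree can be written as nested dictionaries, and classification then sorts an instance down from the root to a leaf:

```python
# The tree above as nested dictionaries; classification "sorts" an instance
# down the tree from the root until a leaf (a plain class label) is reached.
tree = {
    "age?": {
        "<=30":   {"student?": {"no": "no", "yes": "yes"}},
        "31..40": "yes",
        ">40":    {"credit_rating?": {"excellent": "no", "fair": "yes"}},
    }
}

def classify(node, instance):
    # Leaf nodes are plain labels; internal nodes test a single attribute.
    if not isinstance(node, dict):
        return node
    attribute, branches = next(iter(node.items()))
    return classify(branches[instance[attribute]], instance)

print(classify(tree, {"age?": "<=30", "student?": "yes", "credit_rating?": "fair"}))  # -> yes
```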
Algorithm for Decision Tree
Induction
• Basic algorithm (a greedy algorithm)
– Tree is constructed in a top-down recursive divide-and-
conquer manner
– At start, all the training examples are at the root
– Attributes are categorical (if continuous-valued, they are
discretized in advance)
– Examples are partitioned recursively based on selected
attributes
– Test attributes are selected on the basis of a heuristic or
statistical measure (e.g., information gain)
• Conditions for stopping partitioning
– All samples for a given node belong to the same class
– There are no remaining attributes for further partitioning –
majority voting is employed for classifying the leaf
– There are no samples left
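A compact Python sketch of this greedy, top-down induction, using entropy-based information gain as the selection measure; the helper names and the abbreviated example rows are our own, not Quinlan's original ID3 code:

```python
# Recursive divide-and-conquer tree induction on categorical attributes,
# selecting the attribute with the highest information gain at each node.
from collections import Counter
from math import log2

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def build_tree(rows, labels, attributes):
    # Stop: all samples in one class, or no attributes left (majority vote).
    if len(set(labels)) == 1:
        return labels[0]
    if not attributes:
        return Counter(labels).most_common(1)[0][0]

    # Pick the attribute with the highest information gain.
    def gain(attr):
        remainder = 0.0
        for value in set(row[attr] for row in rows):
            subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
            remainder += len(subset) / len(labels) * entropy(subset)
        return entropy(labels) - remainder

    best = max(attributes, key=gain)
    node = {best: {}}
    for value in set(row[best] for row in rows):
        sub_rows = [row for row in rows if row[best] == value]
        sub_labels = [lab for row, lab in zip(rows, labels) if row[best] == value]
        node[best][value] = build_tree(sub_rows, sub_labels, [a for a in attributes if a != best])
    return node

# Toy call on an abbreviated subset of the buys_computer attributes:
rows = [{"age": "<=30", "student": "no"}, {"age": "<=30", "student": "yes"},
        {"age": "31..40", "student": "no"}]
print(build_tree(rows, ["no", "yes", "yes"], ["age", "student"]))
```

On the full 14-tuple table, age has the highest information gain and becomes the root, which is why the example tree above tests age first.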
Bayesian Classification: Why?
• A statistical classifier: performs probabilistic prediction, i.e.,
predicts class membership probabilities
• Foundation: Based on Bayes’ Theorem.
• Performance: A simple Bayesian classifier, naïve Bayesian
classifier, has comparable performance with decision tree
and selected neural network classifiers
• Incremental: Each training example can incrementally
increase/decrease the probability that a hypothesis is correct
— prior knowledge can be combined with observed data
• Standard: Even when Bayesian methods are computationally
intractable, they can provide a standard of optimal decision
making against which other methods can be measured

Bayes’ Theorem: Basics

• Total probability theorem:  P(B) = Σ_{i=1..M} P(B|Ai) P(Ai)

• Bayes’ theorem:  P(H|X) = P(X|H) P(H) / P(X)

– Let X be a data sample (“evidence”): class label is unknown
– Let H be a hypothesis that X belongs to class C
– Classification is to determine P(H|X) (i.e., the posteriori probability): the
probability that the hypothesis holds given the observed data sample X
– P(H) (prior probability): the initial probability
• E.g., X will buy computer, regardless of age, income, …
– P(X): probability that the sample data is observed
– P(X|H) (likelihood): the probability of observing the sample X, given that
the hypothesis holds
• E.g., given that X will buy computer, the prob. that X is 31..40 with
medium income
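A tiny sketch of both formulas, with made-up numbers used only to exercise them:

```python
# Bayes' theorem and the total probability theorem as small helper functions.
def posterior(likelihood, prior, evidence):
    """P(H|X) = P(X|H) * P(H) / P(X)."""
    return likelihood * prior / evidence

def total_probability(likelihoods, priors):
    """P(X) = sum_i P(X|A_i) * P(A_i)."""
    return sum(l * p for l, p in zip(likelihoods, priors))

# e.g. P(X|H) = 0.30, P(H) = 0.40, P(X) = 0.25  ->  P(H|X) = 0.48 (illustrative numbers)
print(posterior(0.30, 0.40, 0.25))
```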
Prediction Based on Bayes’ Theorem

• Given training data X, the posteriori probability of a hypothesis
H, P(H|X), follows Bayes’ theorem:

    P(H|X) = P(X|H) P(H) / P(X)

• Informally, this can be viewed as
    posteriori = likelihood x prior / evidence
• Predicts X belongs to Ci iff the probability P(Ci|X) is the
highest among all the P(Ck|X) for all the k classes
• Practical difficulty: it requires initial knowledge of many
probabilities, involving significant computational cost
Classification Is to Derive the Maximum Posteriori
• Let D be a training set of tuples and their associated class
labels, and each tuple is represented by an n-D attribute
vector X = (x1, x2, …, xn)
• Suppose there are m classes C1, C2, …, Cm.
• Classification is to derive the maximum posteriori, i.e., the
maximal P(Ci|X)
• This can be derived from Bayes’ theorem:

    P(Ci|X) = P(X|Ci) P(Ci) / P(X)

• Since P(X) is constant for all classes, only P(X|Ci) P(Ci)
needs to be maximized
Naïve Bayes Classifier
• A simplified assumption: attributes are conditionally
independent (i.e., no dependence relation between
attributes):

    P(X|Ci) = Π_{k=1..n} P(xk|Ci) = P(x1|Ci) × P(x2|Ci) × … × P(xn|Ci)

• This greatly reduces the computation cost: only counts
the class distribution
• If Ak is categorical, P(xk|Ci) is the # of tuples in Ci having
value xk for Ak divided by |Ci,D| (# of tuples of Ci in D)
• If Ak is continuous-valued, P(xk|Ci) is usually computed
based on a Gaussian distribution with mean μ and
standard deviation σ:

    g(x, μ, σ) = (1 / (√(2π) σ)) · exp(-(x - μ)² / (2σ²))

and P(xk|Ci) is

    P(xk|Ci) = g(xk, μCi, σCi)
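A short sketch of the Gaussian density used for continuous-valued attributes; the mean, standard deviation, and sample value below are illustrative only:

```python
# g(x, mu, sigma) as defined above, used as P(xk|Ci) with the class-specific
# mean and standard deviation of attribute Ak.
from math import sqrt, pi, exp

def gaussian(x, mu, sigma):
    return (1.0 / (sqrt(2 * pi) * sigma)) * exp(-((x - mu) ** 2) / (2 * sigma ** 2))

# Likelihood of observing x = 5.0 for a class whose attribute values have
# mean 4.0 and standard deviation 1.5 (made-up numbers):
print(gaussian(5.0, 4.0, 1.5))
```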
Naïve Bayes Classifier: Training Dataset

Class:
C1: buys_computer = ‘yes’
C2: buys_computer = ‘no’

Data to be classified:
X = (age <= 30, income = medium, student = yes, credit_rating = fair)

age      income  student  credit_rating  buys_computer
<=30     high    no       fair           no
<=30     high    no       excellent      no
31…40    high    no       fair           yes
>40      medium  no       fair           yes
>40      low     yes      fair           yes
>40      low     yes      excellent      no
31…40    low     yes      excellent      yes
<=30     medium  no       fair           no
<=30     low     yes      fair           yes
>40      medium  yes      fair           yes
<=30     medium  yes      excellent      yes
31…40    medium  no       excellent      yes
31…40    high    yes      fair           yes
>40      medium  no       excellent      no
Naïve Bayes Classifier: An Example

• P(Ci):
  P(buys_computer = “yes”) = 9/14 = 0.643
  P(buys_computer = “no”)  = 5/14 = 0.357

• Compute P(X|Ci) for each class:
  P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222
  P(age = “<=30” | buys_computer = “no”)  = 3/5 = 0.6
  P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
  P(income = “medium” | buys_computer = “no”)  = 2/5 = 0.4
  P(student = “yes” | buys_computer = “yes”) = 6/9 = 0.667
  P(student = “yes” | buys_computer = “no”)  = 1/5 = 0.2
  P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
  P(credit_rating = “fair” | buys_computer = “no”)  = 2/5 = 0.4

• X = (age <= 30, income = medium, student = yes, credit_rating = fair)

  P(X|Ci):
  P(X|buys_computer = “yes”) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
  P(X|buys_computer = “no”)  = 0.6 x 0.4 x 0.2 x 0.4 = 0.019

  P(X|Ci) * P(Ci):
  P(X|buys_computer = “yes”) * P(buys_computer = “yes”) = 0.028
  P(X|buys_computer = “no”)  * P(buys_computer = “no”)  = 0.007

Therefore, X belongs to class (“buys_computer = yes”)
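The same numbers can be reproduced by counting over the 14 training tuples; a minimal sketch (our own code, not from the slides):

```python
# Naive Bayes by counting: class priors and per-attribute conditionals from
# the buys_computer training set, then the comparison of P(X|Ci) * P(Ci).
from collections import Counter

data = [
    ("<=30",   "high",   "no",  "fair",      "no"),
    ("<=30",   "high",   "no",  "excellent", "no"),
    ("31..40", "high",   "no",  "fair",      "yes"),
    (">40",    "medium", "no",  "fair",      "yes"),
    (">40",    "low",    "yes", "fair",      "yes"),
    (">40",    "low",    "yes", "excellent", "no"),
    ("31..40", "low",    "yes", "excellent", "yes"),
    ("<=30",   "medium", "no",  "fair",      "no"),
    ("<=30",   "low",    "yes", "fair",      "yes"),
    (">40",    "medium", "yes", "fair",      "yes"),
    ("<=30",   "medium", "yes", "excellent", "yes"),
    ("31..40", "medium", "no",  "excellent", "yes"),
    ("31..40", "high",   "yes", "fair",      "yes"),
    (">40",    "medium", "no",  "excellent", "no"),
]
attributes = ["age", "income", "student", "credit_rating"]
X = {"age": "<=30", "income": "medium", "student": "yes", "credit_rating": "fair"}

class_counts = Counter(row[-1] for row in data)          # {'yes': 9, 'no': 5}
for c in ("yes", "no"):
    prior = class_counts[c] / len(data)                  # P(Ci)
    likelihood = 1.0
    for j, attr in enumerate(attributes):
        match = sum(1 for row in data if row[-1] == c and row[j] == X[attr])
        likelihood *= match / class_counts[c]             # P(xk|Ci) by counting
    print(f"P(X|{c}) = {likelihood:.3f},  P(X|{c}) * P({c}) = {likelihood * prior:.3f}")
# -> 0.044 and 0.028 for "yes" vs 0.019 and 0.007 for "no", so predict "yes"
```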
