Outline
• Learning agents
• Inductive learning
• Decision tree learning
Learning
• An agent is learning if it improves its performance on
future tasks after making observations about the world.
• Why would we want an agent to learn? If the design of the
agent can be improved, why wouldn't the designers just
program in the improvement to begin with?
– First, the designers cannot anticipate all possible situations that
the agent might find itself in
– Second, the designers cannot anticipate all changes over time
– Third, sometimes human programmers have no idea how to
program a solution themselves.
• Learning is essential for unknown environments,
– i.e., when designer lacks omniscience
• Type of feedback:
– Supervised learning: correct answers for each example
– Semi-supervised learning: correct answers for some
examples
– Unsupervised learning: correct answers not given
– Reinforcement learning: occasional rewards
Machine Learning Algorithms
• Supervised algorithms – the agent observes example
input–output pairs and learns a function that maps from input
to output (see the workflow sketch after this list):
1. Use training data which has correct answers (class label).
2. Create a classification model (performance element) by
running the algorithm on the training data.
3. Test the model. If accuracy is low, regenerate model after
changing features, training samples, etc.
4. Use model to predict class label for new incoming data.
• Unsupervised algorithms – find hidden relationships in data
– Do not use training data.
– Classes may not be known in advance.
– In unsupervised learning the agent learns patterns in the input
even though no explicit feedback is supplied.
– The most common unsupervised learning task is clustering:
detecting potentially useful clusters of input examples.
• Reinforcement learning –
– How an agent ought to take actions in an environment so as to
maximize some notion of long-term reward.
– In reinforcement learning the agent learns from a series of
reinforcements—rewards or punishments. For example, the lack of a
tip at the end of the journey gives the taxi agent an indication that it
did something wrong.
• In semi-supervised learning we are
given a few labeled examples and
must make what we can of a large
collection of unlabeled examples.
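As a workflow sketch of the four supervised steps above (train on labeled data, build a model, test it, then predict for new data), here is a minimal example assuming scikit-learn and its bundled iris dataset; any labeled dataset would serve equally well.

# Minimal sketch of the supervised learning workflow (assumes scikit-learn).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Step 1: training data that has correct answers (class labels)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Step 2: create the classification model (performance element)
model = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)

# Step 3: test the model; if accuracy is low, change features/samples and retrain
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))

# Step 4: use the model to predict the class label for new incoming data
print("predicted class:", model.predict([[5.0, 3.4, 1.5, 0.2]]))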
Supervised Learning
Prediction Problems: Classification
vs. Numeric Prediction
• Classification
– predicts categorical class labels (discrete or nominal)
– classifies data (constructs a model) based on the
training set and the values (class labels) in a
classifying attribute and uses it in classifying new data
• Numeric Prediction
– models continuous-valued functions, i.e., predicts
unknown or missing values
• Typical applications (a small sketch contrasting the two tasks
follows this list)
– Credit/loan approval: whether to approve an application
– Medical diagnosis: whether a tumor is cancerous or benign
– Fraud detection: whether a transaction is fraudulent
– Web page categorization: which category a page belongs to
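The sketch below illustrates the distinction: a classifier predicts a categorical label, while a regressor models a continuous-valued function. It assumes scikit-learn; the tiny loan-approval data is invented purely for illustration.

# Classification vs. numeric prediction (assumes scikit-learn; toy data is made up).
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

X = [[20, 1], [45, 0], [30, 1], [50, 1]]      # [age, has_steady_income]

# Classification: predict a discrete class label (approve / reject)
labels = ["reject", "approve", "approve", "approve"]
clf = DecisionTreeClassifier().fit(X, labels)
print(clf.predict([[25, 1]]))                 # -> a categorical label

# Numeric prediction: predict a continuous value (e.g., a credit limit)
amounts = [500.0, 8000.0, 3000.0, 9000.0]
reg = DecisionTreeRegressor().fit(X, amounts)
print(reg.predict([[25, 1]]))                 # -> a real-valued estimate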
Process (1): Model Construction
• Diagram: training data is fed to a classification algorithm, which
constructs a classifier (the model); the classifier is then checked
against testing data and finally applied to unseen data, e.g. the
query (Jeff, Professor, 4) → Tenured? (a code sketch of this
process follows the table)
• Example data:

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes
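A hedged sketch of this process using the table above: encode the categorical RANK, construct a classifier, and pose the Tenured? question for the unseen tuple (Jeff, Professor, 4). The numeric rank encoding and the use of scikit-learn are illustrative choices, not part of the original slide.

# Model construction and use on the tenure table (assumes scikit-learn).
from sklearn.tree import DecisionTreeClassifier

rank_code = {"Assistant Prof": 0, "Associate Prof": 1, "Professor": 2}  # illustrative encoding

rows = [("Tom", "Assistant Prof", 2, "no"),
        ("Merlisa", "Associate Prof", 7, "no"),
        ("George", "Professor", 5, "yes"),
        ("Joseph", "Assistant Prof", 7, "yes")]

X = [[rank_code[rank], years] for _, rank, years, _ in rows]
y = [tenured for _, _, _, tenured in rows]

classifier = DecisionTreeClassifier().fit(X, y)              # model construction
print(classifier.predict([[rank_code["Professor"], 4]]))     # (Jeff, Professor, 4) -> Tenured?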
Inductive learning
Simplest form: learn a function from examples
• Type of reasoning that involves moving from a set of
specific facts to a general conclusion.
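As a minimal illustration of this idea, the sketch below (plain Python with NumPy; the sample points are invented) induces a general hypothesis h from a handful of specific (x, f(x)) examples and then applies it to an input it has never seen.

# Inductive learning in its simplest form: fit a function from examples.
import numpy as np

xs = np.array([1.0, 2.0, 3.0, 4.0])       # specific observed inputs
ys = np.array([2.1, 3.9, 6.2, 8.1])       # observed outputs of the unknown f

slope, intercept = np.polyfit(xs, ys, deg=1)   # induce a general hypothesis h(x)
h = lambda x: slope * x + intercept

print(h(5.0))   # generalize: predict f at an input not in the examples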
Algorithm for Decision Tree
Induction
• Basic algorithm (a greedy algorithm)
– Tree is constructed in a top-down recursive divide-and-
conquer manner
– At start, all the training examples are at the root
– Attributes are categorical (if continuous-valued, they are
discretized in advance)
– Examples are partitioned recursively based on selected
attributes
– Test attributes are selected on the basis of a heuristic or
statistical measure (e.g., information gain)
• Conditions for stopping partitioning
– All samples for a given node belong to the same class
– There are no remaining attributes for further partitioning –
majority voting is employed for classifying the leaf
– There are no samples left
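The following compact, ID3-style sketch mirrors the greedy top-down, divide-and-conquer procedure and the three stopping conditions above. It is plain Python; categorical attributes and the rows-of-dicts dataset format are assumptions made for illustration.

# Greedy top-down decision tree induction with information gain (a sketch).
import math
from collections import Counter

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(rows, labels, attr):
    base = entropy(labels)
    remainder = 0.0
    for value in {r[attr] for r in rows}:
        subset = [lab for r, lab in zip(rows, labels) if r[attr] == value]
        remainder += len(subset) / len(labels) * entropy(subset)
    return base - remainder

def build_tree(rows, labels, attributes, parent_majority=None):
    # Stopping condition: no samples left -> majority class of the parent node
    if not rows:
        return parent_majority
    # Stopping condition: all samples at this node belong to the same class
    if len(set(labels)) == 1:
        return labels[0]
    majority = Counter(labels).most_common(1)[0][0]
    # Stopping condition: no remaining attributes -> majority voting at the leaf
    if not attributes:
        return majority
    # Greedy choice: split on the attribute with the highest information gain
    best = max(attributes, key=lambda a: information_gain(rows, labels, a))
    remaining = [a for a in attributes if a != best]
    tree = {best: {}}
    for value in {r[best] for r in rows}:
        idx = [i for i, r in enumerate(rows) if r[best] == value]
        tree[best][value] = build_tree([rows[i] for i in idx],
                                       [labels[i] for i in idx],
                                       remaining, majority)
    return tree

# Usage sketch: build_tree(rows, labels, list(rows[0].keys()))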
Bayesian Classification: Why?
• A statistical classifier: performs probabilistic prediction, i.e.,
predicts class membership probabilities
• Foundation: Based on Bayes’ Theorem.
• Performance: A simple Bayesian classifier, naïve Bayesian
classifier, has comparable performance with decision tree
and selected neural network classifiers
• Incremental: Each training example can incrementally
increase/decrease the probability that a hypothesis is correct
— prior knowledge can be combined with observed data
• Standard: Even when Bayesian methods are computationally
intractable, they can provide a standard of optimal decision
making against which other methods can be measured
Bayes’ Theorem: Basics
• Total probability theorem: P(B) = Σ_{i=1..M} P(B|A_i) P(A_i)
• Bayes’ theorem: P(H|X) = P(X|H) P(H) / P(X)
– Let X be a data sample (“evidence”): class label is unknown
– Let H be a hypothesis that X belongs to class C
– Classification is to determine P(H|X) (i.e., the posterior probability): the
probability that the hypothesis holds given the observed data sample X
– P(H) (prior probability): the initial probability
• E.g., X will buy computer, regardless of age, income, …
– P(X): probability that sample data is observed
– P(X|H) (likelihood): the probability of observing the sample X, given that
the hypothesis holds
• E.g., Given that X will buy computer, the prob. that X is 31..40,
medium income
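A tiny numeric illustration of the theorem, with probabilities invented for the computer-purchase example (H = "X will buy a computer", X = the observed evidence about the customer):

# Bayes' theorem on made-up numbers for the buy-a-computer example.
p_h = 0.5          # P(H): prior probability of buying, before seeing evidence
p_x_given_h = 0.3  # P(X|H): likelihood of this evidence among buyers
p_x = 0.2          # P(X): probability of observing this evidence at all

p_h_given_x = p_x_given_h * p_h / p_x   # Bayes' theorem: posterior P(H|X)
print(p_h_given_x)                      # 0.75 with these illustrative numbers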
Prediction Based on Bayes’ Theorem
• Given training data X, the posterior probability of a hypothesis H,
P(H|X), follows Bayes’ theorem: P(H|X) = P(X|H) P(H) / P(X)
• For classification, P(H|X) needs to be maximized: X is assigned to
the class whose posterior probability is highest
Naïve Bayes Classifier
• A simplified assumption: attributes are conditionally
independent (i.e., no dependence relation between attributes):
P(X|Ci) = Π_{k=1..n} P(xk|Ci) = P(x1|Ci) × P(x2|Ci) × … × P(xn|Ci)
• This greatly reduces the computation cost: Only counts
the class distribution
• If Ak is categorical, P(xk|Ci) is the # of tuples in Ci having
value xk for Ak divided by |Ci, D| (# of tuples of Ci in D)
• If Ak is continuous-valued, P(xk|Ci) is usually computed
based on a Gaussian distribution with mean μ and
standard deviation σ:
g(x, μ, σ) = (1 / (√(2π) σ)) · exp(−(x − μ)² / (2σ²))
and P(xk|Ci) = g(xk, μ_Ci, σ_Ci)
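For the continuous case, the short sketch below (plain Python; the sample attribute values are invented) estimates μ and σ within a class and evaluates the Gaussian density as P(xk|Ci).

# Gaussian estimate of P(xk|Ci) for a continuous attribute (illustrative values).
import math
import statistics

ages_in_class = [25.0, 31.0, 28.0, 35.0, 30.0]   # values of Ak for tuples in Ci
mu = statistics.mean(ages_in_class)
sigma = statistics.stdev(ages_in_class)          # sample standard deviation

def gaussian(x, mu, sigma):
    return (1.0 / (math.sqrt(2 * math.pi) * sigma)) * \
           math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

print(gaussian(29.0, mu, sigma))   # used as P(xk = 29 | Ci)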
Naïve Bayes Classifier: Training Dataset

Class:
C1: buys_computer = ‘yes’
C2: buys_computer = ‘no’

Data to be classified:
X = (age <=30, income = medium, student = yes, credit_rating = fair)

age     income   student  credit_rating  buys_computer
<=30    high     no       fair           no
<=30    high     no       excellent      no
31…40   high     no       fair           yes
>40     medium   no       fair           yes
>40     low      yes      fair           yes
>40     low      yes      excellent      no
31…40   low      yes      excellent      yes
<=30    medium   no       fair           no
<=30    low      yes      fair           yes
>40     medium   yes      fair           yes
<=30    medium   yes      excellent      yes
31…40   medium   no       excellent      yes
31…40   high     yes      fair           yes
>40     medium   no       excellent      no
Naïve Bayes Classifier: An Example
(the example works through the training dataset from the previous slide)
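A sketch of the computation this example leads into, assuming the counting-based estimates from the previous slides: the class priors and the per-attribute conditional probabilities are read off the 14-tuple table, and X is assigned to the class with the larger P(X|Ci)P(Ci).

# Naive Bayes on the buys_computer training set (counts taken from the table above).
from collections import Counter

data = [  # (age, income, student, credit_rating, buys_computer)
    ("<=30", "high", "no", "fair", "no"),          ("<=30", "high", "no", "excellent", "no"),
    ("31...40", "high", "no", "fair", "yes"),      (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),          (">40", "low", "yes", "excellent", "no"),
    ("31...40", "low", "yes", "excellent", "yes"), ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),         (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"), ("31...40", "medium", "no", "excellent", "yes"),
    ("31...40", "high", "yes", "fair", "yes"),     (">40", "medium", "no", "excellent", "no"),
]
x = ("<=30", "medium", "yes", "fair")   # the tuple to classify

class_counts = Counter(row[-1] for row in data)
scores = {}
for c, count in class_counts.items():
    rows_c = [row for row in data if row[-1] == c]
    score = count / len(data)                       # P(Ci)
    for k, value in enumerate(x):                   # naive independence assumption
        score *= sum(1 for row in rows_c if row[k] == value) / count  # P(xk|Ci)
    scores[c] = score

print(scores)                       # roughly {'yes': 0.028, 'no': 0.007}
print(max(scores, key=scores.get))  # -> 'yes'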