Symbolic Machine Learning
M.S. Kaysar, M.Engg
CSE, IUB
Outline
• Man vs. Machine
• Machine Learning
• Application domains
• Algorithms
Man vs. Machine
Strengths
Human:
• Has common sense and a bigger knowledge base, and thus can perceive the environment better than a computer, given appropriate means (especially in visual form)
• Can think (synthesize) new rules 'out of the box'
• Psychologically, a human decision is more trusted than a computer expert system's decision
• Can detect trends, patterns, or anomalies in visualized data
• Good at learning
Computer:
• Speed: fast
• Reliable
• Endurance: does not get tired
• Unbiased
• Consistent
• Can try many more combinations than a human is capable of
Man vs. Machine
Weaknesses
Human:
• Easily tired and bored, and thus can only be utilized for short periods, perhaps as an 'oracle' only
• Cannot micromanage
• Biased and inconsistent
• Can make errors; not a perfect decision maker
• Effectively cannot see anything if the data is presented in an awkward manner
Computer:
• Difficult to synthesize new rules (cannot think 'out of the box')
• Limited knowledge base
• No common sense
Man vs. Machine
Key Difference: Intelligence
• How can we build intelligence into machines or intelligent agents?
• Solution: Artificial Intelligence (AI)
What is Intelligence?
• Intelligence (also called intellect) is an umbrella term
used to describe a property of the mind that
encompasses many related abilities, such as the
capacities to reason, to plan, to solve problems, to
think abstractly, to comprehend ideas, to use
language, & to learn [Wikipedia]
Where is mind?
What is intelligence?
“I could feel – I could smell – a new kind of intelligence across the table.”
– Garry Kasparov
Speech Recognition
Automated call centers
Navigation Systems
Museum Tour-Guide Robots
Minerva, 1998
Rhino, 1997
Mars Rovers (2003-now)
Europa Mission ~ 2018?
Humanoid Robots
Brain-Computer Interfaces
Singing, Dancing, Bride, …
How would it be possible?
Cons:
• Needs a lot of labeled data
• Error prone: usually impossible to get perfect accuracy
Machine Learning Applications
Countless…
• Machine perception
• Computer vision, including object recognition
• Natural language processing
• Syntactic pattern recognition
• Search engines
• Medical diagnosis
• Bioinformatics
• Brain-machine interfaces
• Cheminformatics
• Detecting credit card fraud
• Stock market analysis
• Classifying DNA sequences
• Sequence mining
• Speech and handwriting recognition
• Game playing
• Software engineering
• Adaptive websites
• Robot locomotion
• Computational advertising
• Computational finance
• Structural health monitoring
• Sentiment analysis (opinion mining)
• Affective computing
• Information retrieval
• Recommender systems
• Optimization and metaheuristics
A Few Examples
Learning to Predict Emergency C-Sections
Given: 9,714 patient records, each describing a pregnancy and birth; each record contains 215 features.
Learn to predict: classes of future patients at high risk for emergency Cesarean section.
Learning to Predict Emergency C-Sections
[Figure: one of 18 learned rules]
Credit Risk Analysis
[Figures: example data; rules learned from synthesized data]
Learning to detect objects in images
• Supervised learning
• Bayesian networks
• Hidden Markov models
• Unsupervised clustering
• Reinforcement learning
• ....
Related Disciplines
Machine learning sits at the intersection of several fields:
• Computer science
• Animal learning (cognitive science, psychology, neuroscience)
• Economics & organizational behavior
• Evolution
• Adaptive control theory
• Statistics
The ML niche is growing (why?)
[Figure: testing pipeline: Test Image → Image Features → Learned Model → Prediction]
How does classification work?
• A two-step process:
• Learning step: a classification model is constructed
  – Describes a set of predetermined classes
  – Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
  – The set of tuples used for model construction is the training set
  – The model is represented as classification rules, decision trees, or mathematical formulae
• Classification step: the model is used to predict class labels for given data
  – Used for classifying future or unknown objects
  – Estimate the accuracy of the model:
    • The known label of each test sample is compared with the model's prediction
    • The accuracy rate is the percentage of test-set samples correctly classified by the model
    • The test set must be independent of the training set (otherwise the estimate reflects overfitting)
  – If the accuracy is acceptable, use the model to classify new data (see the sketch below)
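As a rough illustration of the two-step process (not from the slides; it assumes scikit-learn and uses a synthetic dataset and a tree classifier purely as stand-ins):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Synthetic labeled data standing in for a real training/test corpus.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Keep the test set independent of the training set (otherwise overfitting).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier().fit(X_train, y_train)  # learning step
y_pred = model.predict(X_test)                          # classification step

# Accuracy rate: percentage of test-set samples correctly classified.
print(accuracy_score(y_test, y_pred))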
Process (1): Model Construction
[Figure: Training Data fed into Classification Algorithms, producing a Classifier (model)]
Process (2): Using the Model in Prediction
[Figure: the Classifier is evaluated on Testing Data, then applied to Unseen Data, e.g. (Jeff, Professor, 4) → Tenured?]

NAME     RANK            YEARS   TENURED
Tom      Assistant Prof  2       no
Merlisa  Associate Prof  7       no
George   Professor       5       yes
Joseph   Assistant Prof  7       yes
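A hedged sketch of Process (2) on the tenure table above; the rule encoded here (IF rank = 'Professor' OR years > 6 THEN tenured = 'yes') is an illustrative example of a learned classification rule, not necessarily the one produced from the (unshown) training data:

# Apply a learned rule to the test table, then to the unseen tuple (Jeff, Professor, 4).
def tenured(rank, years):
    # Example learned rule: IF rank = 'Professor' OR years > 6 THEN 'yes'.
    return "yes" if rank == "Professor" or years > 6 else "no"

test_set = [("Tom", "Assistant Prof", 2, "no"),
            ("Merlisa", "Associate Prof", 7, "no"),
            ("George", "Professor", 5, "yes"),
            ("Joseph", "Assistant Prof", 7, "yes")]

correct = sum(tenured(r, y) == label for _, r, y, label in test_set)
print(correct / len(test_set))   # 3/4: Merlisa is misclassified
print(tenured("Professor", 4))   # unseen data (Jeff) -> 'yes'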
Classification Problems
• Classify examples into a given set of categories
[Figure: labeled training examples are fed to an ML algorithm, which produces a classification rule; a new example passed through the rule yields a predicted classification]
Decision Trees
Decision Tree Learning
• Decision tree induction is the learning of
decision trees from class-labeled training tuples.
• A decision tree is a flowchart-like tree structure,
where each internal node (nonleaf node)
denotes a test on an attribute, each branch
represents an outcome of the test, and each leaf
node (or terminal node) holds a class label.
• The topmost node in a tree is the root node.
A decision tree for the concept buys_computer, indicating whether an
AllElectronics customer is likely to purchase a computer. Each internal
(nonleaf) node represents a test on an attribute. Each leaf node represents a
class (either buys_computer = yes or buys_computer = no).
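A small sketch of such a flowchart-like tree, assuming scikit-learn; the six-tuple buys_computer-style dataset is hypothetical, only meant to make the structure visible:

from sklearn.tree import DecisionTreeClassifier, export_text

# Encoded attributes: age (0=youth, 1=middle_aged, 2=senior), student (0/1).
X = [[0, 0], [0, 1], [1, 0], [1, 1], [2, 0], [2, 1]]
y = ["no", "yes", "yes", "yes", "no", "yes"]  # buys_computer labels

tree = DecisionTreeClassifier(criterion="entropy").fit(X, y)

# Internal nodes test an attribute, branches are test outcomes,
# and each leaf holds a class label (yes / no).
print(export_text(tree, feature_names=["age", "student"]))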
Example: Good versus Evil
• problem: identify people as good or bad from
their appearance
A Decision Tree Classifier
How to Build Decision Trees
Attribute selection measures:
• Information gain
• Gini index
• Gain ratio
Information Gain
Expected information (entropy) needed to classify a tuple in D:
Info(D) = -\sum_{i=1}^{m} p_i \log_2 p_i
After partitioning D on attribute A into v subsets, the expected information is:
Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)
Information gain is the attribute selection measure introduced by Quinlan [Qui86].
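A minimal Python sketch of Info(D) (an illustrative addition, standard library only):

from math import log2

def info(class_counts):
    # Expected information (entropy) needed to classify a tuple in D,
    # where p_i is the fraction of tuples belonging to class i.
    total = sum(class_counts)
    return -sum((c / total) * log2(c / total) for c in class_counts if c > 0)

print(info([9, 5]))  # ~0.940 bits for the 9-yes / 5-no example below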
How to choose best splitting criterion?
• The class label attribute, buys computer, has two
distinct values (namely, {yes, no});
• two distinct classes (i.e., m = 2).
• class C1 = yes
• class C2 = no
• 9 tuples of class = yes
• 5 tuples of class = no
• A (root) node N is created for the tuples in D.
• To find the splitting criterion for these tuples,
compute the information gain of each attribute.
• Expected information needed to classify a tuple in D:
Info(D) = -\sum_{i=1}^{m} p_i \log_2 p_i = -\frac{9}{14} \log_2 \frac{9}{14} - \frac{5}{14} \log_2 \frac{5}{14} = 0.940 bits
• Next, we need to compute the expected
information requirement for each attribute.
• Start with attribute: age
• age = “youth”: yes = 2 tuples, no = 3 tuples
• age = “middle_aged”: yes = 4 tuples, no = 0 tuples
• age = “senior”: yes = 3 tuples, no = 2 tuples
• The expected information needed to classify a tuple in D if the tuples are partitioned according to age:
Info_age(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j) = \frac{5}{14} \times 0.971 + \frac{4}{14} \times 0 + \frac{5}{14} \times 0.971 = 0.694 bits
• Hence, the gain in information from such a partitioning would be:
Gain(age) = Info(D) - Info_age(D) = 0.940 - 0.694 = 0.246 bits
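The same numbers can be reproduced with a short sketch (illustrative, standard library only):

from math import log2

def info(counts):
    # Entropy of a class distribution given as raw counts.
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

partitions = [(2, 3), (4, 0), (3, 2)]  # (yes, no) for youth, middle_aged, senior
D = 14                                 # |D| = 9 yes + 5 no

info_age = sum(sum(p) / D * info(p) for p in partitions)
gain_age = info([9, 5]) - info_age
print(round(info_age, 3), round(gain_age, 3))  # 0.694 and 0.247 (0.246 with the slide's rounding)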
Law of total probability:
P(B) = \sum_{i=1}^{M} P(B \mid A_i) P(A_i)
Bayes’ Theorem: Basics
• Bayes’ Theorem:
P(H \mid X) = \frac{P(X \mid H) P(H)}{P(X)}
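A toy numeric check of the theorem (the numbers are hypothetical):

# Hypothetical prior and likelihoods: P(H) = 0.3, P(X|H) = 0.8, P(X|~H) = 0.2.
p_h, p_x_given_h, p_x_given_not_h = 0.3, 0.8, 0.2

# Denominator P(X) via the law of total probability shown above.
p_x = p_x_given_h * p_h + p_x_given_not_h * (1 - p_h)

print(p_x_given_h * p_h / p_x)  # posterior P(H|X) ~= 0.632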
Naïve Bayes Classifier
• When to use
Moderate or large training set available
Attributes that describe instances are
conditionally independent given the class
• Applications
Diagnosis
Classifying text documents
Naïve Bayes Classifier
• A simplified assumption: attributes are conditionally
independent (i.e., no dependence relation between attributes):
P(X \mid C_i) = \prod_{k=1}^{n} P(x_k \mid C_i) = P(x_1 \mid C_i) \times P(x_2 \mid C_i) \times \cdots \times P(x_n \mid C_i)
• This greatly reduces the computation cost: only the class distributions need to be counted
• If Ak is categorical, P(xk|Ci) is the # of tuples in Ci having value xk
for Ak divided by |Ci, D| (# of tuples of Ci in D)
• If Ak is continuous-valued, P(xk|Ci) is usually computed from a Gaussian distribution with mean μ and standard deviation σ:
g(x, \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x-\mu)^2}{2\sigma^2}}
and P(x_k \mid C_i) = g(x_k, \mu_{C_i}, \sigma_{C_i})
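A sketch of the Gaussian density used for a continuous-valued attribute (illustrative; the mean/std values are hypothetical):

from math import exp, pi, sqrt

def gaussian(x, mu, sigma):
    # Density of N(mu, sigma^2) evaluated at x.
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sqrt(2 * pi) * sigma)

# E.g., if age values in class Ci have mean 38 and std 12,
# then P(age = 35 | Ci) is estimated as:
print(gaussian(35, 38, 12))  # ~0.032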
Predicting a class label using naïve Bayesian classification
• The data tuples are
described by the
attributes:
- age, income, student, &
credit_rating
Class:
C1: buys_computer = ‘yes’
C2: buys_computer = ‘no’
Data to be classified:
X = (age = youth, income = medium, student = yes, credit_rating = fair)
We need to maximize P(X|Ci)P(Ci), for i = 1, 2.
P(buys_computer = “yes”) = 9/14 = 0.643
P(buys_computer = “no”) = 5/14 = 0.357
Compute P(X|Ci) for each class
P(age = youth | buys_computer = yes) = 2/9 = 0.222
P(age = youth | buys_computer = no) = 3/5 = 0.6
P(income = medium | buys_computer = yes) = 4/9 = 0.444
P(income = medium | buys_computer = no) = 2/5 = 0.4
P(student = yes | buys_computer = yes) = 6/9 = 0.667
P(student = yes | buys_computer = no) = 1/5 = 0.2
P(credit_rating = fair | buys_computer = yes) = 6/9 = 0.667
P(credit_rating = fair | buys_computer = no) = 2/5 = 0.4
X = (age = youth , income = medium, student = yes, credit_rating = fair)
P(X|Ci):
P(X | buys_computer = “yes”) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
P(X | buys_computer = “no”) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
P(X|Ci)P(Ci):
P(X | buys_computer = “yes”) P(buys_computer = “yes”) = 0.044 x 0.643 = 0.028
P(X | buys_computer = “no”) P(buys_computer = “no”) = 0.019 x 0.357 = 0.007
Therefore, the naïve Bayesian classifier predicts buys_computer = yes for tuple X.
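The arithmetic above can be reproduced directly (an illustrative sketch, standard library only):

# Per-attribute likelihoods for X = (youth, medium, student, fair), from the slides.
likelihoods = {"yes": [2/9, 4/9, 6/9, 6/9],
               "no":  [3/5, 2/5, 1/5, 2/5]}
priors = {"yes": 9/14, "no": 5/14}

for c in ("yes", "no"):
    p_x_given_c = 1.0
    for p in likelihoods[c]:          # conditional-independence product
        p_x_given_c *= p
    print(c, round(p_x_given_c, 3), round(p_x_given_c * priors[c], 3))
# yes: 0.044 and 0.028; no: 0.019 and 0.007 -> predict buys_computer = yes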
SVM (Support Vector Machines)
Two-class problem:
Database D: (X_1, y_1), (X_2, y_2), …
X_i: training tuples
y_i: class label ∈ {+1, -1}
[Figure: several candidate separating hyperplanes; which one is better?]
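A minimal sketch of the answer, assuming scikit-learn: among all separating hyperplanes, a linear SVM chooses the one with the maximum margin, which is determined by the support vectors:

from sklearn.svm import SVC

X = [[1, 1], [2, 1], [1, 2],   # class +1
     [5, 5], [6, 5], [5, 6]]   # class -1
y = [+1, +1, +1, -1, -1, -1]

clf = SVC(kernel="linear").fit(X, y)
print(clf.support_vectors_)    # the tuples that pin down the margin
print(clf.predict([[3, 3]]))   # class for a new tuple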
SVM—Linearly Inseparable
Now what?
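One standard answer (an assumption here, since the slide stops at the question) is the kernel trick: implicitly map the tuples into a higher-dimensional space where they become separable. A sketch on circle-shaped data, assuming scikit-learn:

from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear = SVC(kernel="linear").fit(X, y)
rbf = SVC(kernel="rbf").fit(X, y)
print(linear.score(X, y))  # poor: no separating line exists in 2-D
print(rbf.score(X, y))     # near 1.0 after the implicit RBF mapping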
A Puzzle…