Unit 1
REF BOOK: MACHINE LEARNING BY TOM MITCHELL - MGH
REF BOOK: MACHINE LEARNING: AN ALGORITHMIC PERSPECTIVE BY STEPHEN MARSLAND, TAYLOR & FRANCIS
======================================================================
UNIT - 1
INTRODUCTION - WELL POSED LEARNING PROBLEMS, DESIGNING A LEARNING SYSTEM,
PERSPECTIVES AND ISSUES IN
MACHINE LEARNING.
CONCEPT LEARNING AND THE GENERAL TO SPECIFIC ORDERING - INTRODUCTION, A CONCEPT
LEARNING TASK, CONCEPT
LEARNING AS SEARCH, FIND-S: FINDING A MAXIMALLY SPECIFIC HYPOTHESIS, VERSION
SPACES AND THE CANDIDATE
ELIMINATION ALGORITHM, REMARKS ON VERSION SPACES AND CANDIDATE ELIMINATION,
INDUCTIVE BIAS.
The field of machine learning is concerned with the question of how to construct
computer programs that automatically improve with experience.
AI - ML - DL
DL is a subset of ML
ML is a subset of AI
The goal of this subject is to present the key algorithms and theory that
form the core of machine learning.
Online data sets and implementations of several algorithms are available via the
World Wide Web at http://www.cs.cmu.edu/~tom/mlbook.html.
These include neural network code and data for face recognition,
decision tree learning, code and data for financial loan analysis, and Bayes
classifier code
and data for analyzing text documents.
====================================================
WELL-POSED LEARNING PROBLEM IN MACHINE LEARNING:
Well-Posed Learning Problem - A computer program (or agent) is said to learn from
experience E with respect to some task T and some performance measure P, if its
performance on T, as measured by P, improves with experience E.
Any problem can be categorized as a well-posed learning problem if it has three traits -
Task
Performance Measure
Experience
Certain examples that efficiently illustrate the well-posed learning problem are -
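Two such problems, expressed through the three traits, can be sketched as below. The checkers entry follows Mitchell's classic example; the spam-filter entry and all wording are my own illustrative phrasing, not quotes from the text.

```python
# Hedged examples of well-posed learning problems expressed as T, P, E.
checkers_problem = {
    "Task T": "playing checkers",
    "Performance measure P": "percent of games won against opponents",
    "Experience E": "playing practice games against itself",
}
spam_problem = {
    "Task T": "classifying emails as spam or not spam",
    "Performance measure P": "percent of emails correctly classified",
    "Experience E": "a set of emails labelled spam/not-spam by the user",
}

for name, problem in [("Checkers", checkers_problem), ("Spam filter", spam_problem)]:
    print(name)
    for trait, value in problem.items():
        print(" ", trait, "-", value)
```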
======================================================================
Features:
x1 - No. of black pieces on the board
x2 - No. of red pieces on the board
x3 - No. of black kings on the board
x4 - No. of red kings on the board
x5 - No. of black pieces threatened by red (black pieces that can be captured by red)
x6 - No. of red pieces threatened by black
The performance system: The performance system solves the given performance task.
Critic: The critic takes the history of the game and generates training examples.
Generalizer: It outputs the hypothesis that is its estimate of the target function.
Experiment Generator: It creates a new problem, taking in the hypothesis, for the
performance system to explore.
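The modules above can be tied together with the checkers target function from Mitchell's Chapter 1: a linear evaluation V̂(b) = w0 + w1·x1 + ... + w6·x6 over the features x1..x6, trained with the LMS weight-update rule. The sketch below is illustrative, not the book's code, and the board values and training value are invented.

```python
# Sketch: linear evaluation V_hat(b) = w0 + w1*x1 + ... + w6*x6 and the
# LMS rule wi <- wi + eta * (V_train(b) - V_hat(b)) * xi.

def v_hat(weights, features):
    """Linear evaluation of a board from features x1..x6 (w0 is the bias)."""
    return weights[0] + sum(w * x for w, x in zip(weights[1:], features))

def lms_update(weights, features, v_train, eta=0.001):
    """One LMS step; x0 = 1 pairs with the bias weight w0."""
    error = v_train - v_hat(weights, features)
    xs = [1] + list(features)
    return [w + eta * error * x for w, x in zip(weights, xs)]

# Hypothetical training example from the critic: a board described by
# x1..x6 with a training value of 100 (a win). All numbers are made up.
weights = [0.0] * 7
board = [12, 12, 0, 0, 1, 2]
for _ in range(50):
    weights = lms_update(weights, board, v_train=100)
print(round(v_hat(weights, board), 1))   # converges toward 100.0
```

Repeated updates shrink the error on this example by a constant factor each step, which is why the estimate approaches the training value.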
======================================================================
Based on the type of data, we choose the particular algorithm that gives us the best
solution. From my point of view (perspective) a particular algorithm may suit me, but in
your case a different algorithm may suit you to solve the problem.
Issues in ML:
1. What algorithms should be used?
2. Which algorithms perform best for which types of problems?
3. How much training data and testing data is sufficient?
4. What kind of methods should be used?
5. What methods should be used to reduce learning overheads?
6. Which methods should be used for which type of data?
======================================================================
Concept Learning - Introduction, Concept Learning Task:
Different hypotheses represent different concepts.
Universal set: gadgets - tablets, smartphones, earphones, PCs, desktops, etc.
Circle: the entire space of concepts, of which we want only some part (the target
concept / hypothesis space).
Example:
Reality vs Hypothesis:
In the year 2000 the features of the mobile phone were x1, x2, x3.
All these features are assumptions/imaginations/hypotheses. People were expecting
small size, dual SIM, wireless charging, and more facilities.
In the year 2020: some of these features have been implemented; those are the
consistent ones.
======================================================================
Concept learning as search:
The main goal of this search is to find the hypothesis that best fits the training
examples.
Algorithm (Find-S):
Step 1: Initialize h with the most specific hypothesis (null, phi)
h0 = < null, null, null, null, null >   (5 attributes)
Step 2: For each positive training example, minimally generalize h so that it covers
the example; negative examples are ignored.
Example:
-----------------------------------------------------------------
Origin  Manufacturer  Colour  Year  Type  Class
-----------------------------------------------------------------
JP      HO            BLUE    1980  ECO   +VE (YES)
JP      TO            GREEN   1970  SPO   -VE (NO)
JP      TO            BLUE    1990  ECO   +VE
USA     AU            RED     1980  ECO   -VE
JP      HO            WHITE   1980  ECO   +VE
JP      TO            GREEN   1980  ECO   +VE
JP      HO            RED     1980  ECO   -VE
-----------------------------------------------------------------
Drawbacks/Disadvantages:
1. Consider only +ve samples
2. h6 (the final hypothesis) may not be the sole hypothesis that fits the complete data.
======================================================================
Version Spaces: Algorithm to find the version space, with example
H is the hypothesis space
D is the set of training examples
Check that h is consistent with the training examples: h(x) = c(x) for every example in D
======================================
CANDIDATE ELIMINATION ALGORITHM:
Uses the concept of the version space.
Considers both +ve and -ve samples (yes and no).
Considers both the specific and the general hypothesis boundaries.
For positive samples, move from specific to general.
For negative samples, move from general to specific.
Algorithm:
Example: EnjoySport
Sky Temperature Humidity Wind Water Forecast Enjoy
-------------------------------------------------------------------------
Sunny Warm Normal Strong Warm Same Yes
Sunny Warm High Strong Warm Same Yes
Rainy Cold High Strong Warm Change No
Sunny Warm High Strong Cool Change Yes
-------------------------------------------------------------------------
S0 = {null,null,null,null,null,null} and G0 = {?,?,?,?,?,?}
Specific to general
S2 = {'sunny', 'warm', ? , 'strong', 'warm', 'same'}
G2 = { ?,?,?,?,?,? } (G stays the same until a -ve sample is encountered)
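The two rules above (generalize S on positive samples, specialize G on negative samples) can be sketched as a compact implementation run on the EnjoySport table. This is illustrative code written for this example, not production code; attribute domains are taken from the data, '?' matches anything, and None marks the maximally specific (empty) hypothesis.

```python
def matches(h, x):
    return all(a == "?" or a == v for a, v in zip(h, x))

def more_general(h1, h2):
    """h1 is more general than (or equal to) h2."""
    return all(b is None or a == "?" or a == b for a, b in zip(h1, h2))

def min_generalize(s, x):
    """Minimal generalization of specific hypothesis s that covers x."""
    return tuple(v if a is None else (a if a == v else "?")
                 for a, v in zip(s, x))

def min_specialize(g, x, domains):
    """Minimal specializations of general hypothesis g that exclude x."""
    out = []
    for i, a in enumerate(g):
        if a == "?":
            for v in domains[i]:
                if v != x[i]:
                    out.append(g[:i] + (v,) + g[i + 1:])
    return out

def candidate_elimination(examples, domains):
    n = len(domains)
    S, G = {(None,) * n}, {("?",) * n}
    for x, positive in examples:
        if positive:
            G = {g for g in G if matches(g, x)}        # drop inconsistent g
            S = {min_generalize(s, x) for s in S}      # generalize S minimally
            S = {s for s in S if any(more_general(g, s) for g in G)}
        else:
            S = {s for s in S if not matches(s, x)}    # drop inconsistent s
            new_G = set()
            for g in G:
                if not matches(g, x):
                    new_G.add(g)
                else:                                  # specialize g minimally
                    for h in min_specialize(g, x, domains):
                        if any(more_general(h, s) for s in S):
                            new_G.add(h)
            G = {g for g in new_G                      # prune dominated members
                 if not any(h != g and more_general(h, g) for h in new_G)}
    return S, G

data = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), True),
    (("Sunny", "Warm", "High", "Strong", "Warm", "Same"), True),
    (("Rainy", "Cold", "High", "Strong", "Warm", "Change"), False),
    (("Sunny", "Warm", "High", "Strong", "Cool", "Change"), True),
]
domains = [sorted({x[i] for x, _ in data}) for i in range(6)]
S, G = candidate_elimination(data, domains)
print("S =", S)
print("G =", G)
```

Running it reproduces the textbook boundaries: S = {<Sunny, Warm, ?, Strong, ?, ?>} and G = {<Sunny, ?, ?, ?, ?, ?>, <?, Warm, ?, ?, ?, ?>}.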
==========================================================================
Inductive Bias
Candidate-Elimination Algorithm
Remarks on CEA
The Candidate-Elimination Algorithm will converge toward the true target concept provided
i) it is given accurate training examples without errors, and
ii) its initial hypothesis space contains the target concept.
Can we avoid this difficulty by using a hypothesis space that includes every possible
hypothesis?
The hypothesis space must contain the unknown target concept.
To get the solution: extend the hypothesis space to include every possible hypothesis.
----------------------------------------------------------------------
To illustrate, consider the EnjoySport example, in which we restricted the hypothesis
space to include only conjunctions of attribute values over Sky and the other attributes:
< Sunny, High, Normal, Strong, Cool, Same >
Here we consider only conjunctions of the attributes; we definitely get a hypothesis
space, but in some cases we miss the target concept. A hypothesis space that can miss
the target concept in this way is called a Biased Hypothesis Space.
G2 = < ?,?,?,?,?,?>
G1 = < ?,?,?,?,?,?>
G0 = < ?,?,?,?,?,?>
< "Sky = Sunny or Sky = Cloudy", Warm, Normal,Strong, Cool , Change>
It is called the Unbiased (Extended) Hypothesis Space. Here we also consider simple
disjunctive target concepts such as "Sky = Sunny or Sky = Cloudy."
Unbiased Learner: In general, the set of all subsets of a set X is called the
power set of X.
Example: EnjoySport
Total no. of instances for the 6 attributes: |X| = 96
|H'| = 2^|X| = 2^96 (vs |H| = 973, a strong bias)
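The counts quoted above follow from quick arithmetic, assuming (as in EnjoySport) that Sky takes 3 values and each of the other five attributes takes 2:

```python
# |X|: number of distinct instances (Sky has 3 values, the rest 2 each)
instances = 3 * 2 * 2 * 2 * 2 * 2
# |H|: semantically distinct conjunctive hypotheses - each attribute is '?'
# or one of its values, plus the single empty (all-null) hypothesis
semantically_distinct = 1 + 4 * 3 * 3 * 3 * 3 * 3
# |H'|: every possible target concept = every subset of X (the power set)
unbiased = 2 ** instances
print(instances, semantically_distinct)   # 96 973
```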
======================================================================
Decision tree learning is generally best suited to problems with the following
characteristics:
iv) The training data may contain missing attribute values. Decision tree methods
can be used
even when some training examples have unknown values.
Such problems, in which the task is to classify examples into one of a discrete set
of possible
categories, are often referred to as classification problems.
======================================================================
HYPOTHESIS SPACE SEARCH IN DECISION TREE LEARNING
ID3 can be characterized as searching a space of hypotheses for one that fits the
training examples.
The hypothesis space searched by ID3 is the set of possible decision trees.
ID3 performs a simple-to-complex, hill-climbing search through this hypothesis space,
beginning with the empty tree, then considering progressively more elaborate
hypotheses in search of
a decision tree that correctly classifies the training data.
The evaluation function that guides this hill-climbing search is the information
gain measure.
This search is depicted in Figure.
By viewing ID3 in terms of its search space and search strategy, we can get
some insight into its capabilities and limitations
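The information gain measure that guides this search can be sketched as code. This is an illustrative implementation, not the book's; the 9-positive/5-negative collection is a hypothetical example whose entropy is the familiar 0.940 bits.

```python
import math

def entropy(labels):
    """Entropy of a collection of class labels, in bits."""
    n = len(labels)
    result = 0.0
    for c in set(labels):
        p = labels.count(c) / n
        result -= p * math.log2(p)
    return result

def information_gain(examples, labels, attr_index):
    """Expected entropy reduction from splitting on one attribute."""
    n = len(labels)
    gain = entropy(labels)
    for v in set(x[attr_index] for x in examples):
        subset = [lab for x, lab in zip(examples, labels) if x[attr_index] == v]
        gain -= len(subset) / n * entropy(subset)
    return gain

labels = ["+"] * 9 + ["-"] * 5
print(round(entropy(labels), 3))   # 0.940
```

At each node ID3 picks the attribute with the highest information gain, which is what makes the search hill-climbing.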
----------------------------------------------------------------------
ii) ID3 maintains only a single current hypothesis as it searches through the
space of decision trees. ID3 loses the capabilities that follow from explicitly
representing
all consistent hypotheses.(Limitation)
iv) ID3 uses all training examples at each step in the search to make statistically
based decisions regarding how to refine its current hypothesis. One advantage of
using statistical properties of all the examples (e.g., information gain) is that
the resulting search is much less sensitive to errors in individual training
examples. ID3 can be easily extended to handle noisy training data by modifying its
termination
criterion to accept hypotheses that imperfectly fit the training data.
======================================================================
Exercise: [table of training examples: Instance, Classification, a1, a2]
(a) What is the entropy of this collection of training examples with respect to the
target function classification?
(b) What is the information gain of a2 relative to these training examples?
======================================================================
INDUCTIVE BIAS IN DECISION TREE LEARNING:
Given a collection of training examples, there are typically many decision trees
consistent with
these examples.
If there is only one such tree there is no issue and we classify the new (test)
examples with it; but when there is more than one tree, which tree should be used
to classify the new examples?
The solution to this is the inductive bias.
Describing the inductive bias of ID3 therefore consists of describing the basis by
which it chooses one of these consistent hypotheses over the others.
It chooses the first acceptable tree it encounters in its simple-to-complex,
hill-climbing search through the space of possible trees.
Why should one prefer shorter hypotheses over longer ones? (Occam's razor)
Arguments in favor:
a) There are fewer short hypotheses than long ones.
b) If a short hypothesis fits the data, it is unlikely to be a coincidence.
Arguments against:
a) Not every short hypothesis is a reasonable one.
What is the performance of the short tree on testing examples? If its performance is
lower, we go for the longer one.
- The term "razor" refers to the act of shaving away unnecessary assumptions to get
to the simplest explanation.
======================================================================
1. Overfitting the data: if the tree depends on the training data too much, then it
works on the training examples but not on the testing examples. This problem can be
addressed with two techniques:
1) Stop growing the tree earlier, before it reaches the point where it perfectly
classifies the training data.
2) Allow the tree to overfit the data, and then post-prune the tree.
Split the training data into two parts (training and validation) and use the
validation set to assess the utility of post-pruning.
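The split described above can be sketched as follows. The function name, the 2/3-1/3 ratio, and the fixed seed are illustrative assumptions, not a prescribed method.

```python
import random

# Hold out part of the training data as a validation set, so the utility
# of post-pruning can be measured on examples the tree was not grown on.
def train_validation_split(examples, validation_fraction=1/3, seed=0):
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)   # deterministic shuffle for the demo
    cut = int(len(shuffled) * (1 - validation_fraction))
    return shuffled[:cut], shuffled[cut:]

examples = list(range(30))                  # stand-ins for training examples
train, validation = train_validation_split(examples)
print(len(train), len(validation))          # 20 10
```

A pruning step is then kept only if it does not hurt accuracy on the validation part.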