Lecture 3
Classifiers (Support Vector Machines, Decision Trees, Nearest Neighbor Classification)
Supervised learning
Like human learning from past experiences. A computer does not have "experiences"; instead, a computer system learns from data, which represent some "past experiences" of an application domain.
Our focus: learn a target function that can be used to predict the values of a discrete class attribute. The task is commonly called supervised learning, classification, or inductive learning.
The data and the goal
Data: a set of data records (also called examples, instances, or cases) described by k attributes: A1, A2, …, Ak.
A class: each example is labelled with a pre-defined class.
Goal: learn a classification model from the data that can be used to predict the classes of new (future, or test) cases/instances.
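As a concrete illustration, here is a minimal sketch in Python of one common way to represent such data: a list of attribute vectors X and a parallel list of class labels y. The attribute names and values below are made up for illustration.

```python
# Each row is one example described by k = 3 attributes A1..A3
# (hypothetical names and values), plus a pre-defined class label.
X = [
    ["Yes", "Large",  125_000],   # A1, A2, A3 for example 1
    ["No",  "Medium", 100_000],   # example 2
    ["No",  "Small",   70_000],   # example 3
]
y = ["No", "No", "No"]            # class label for each example
```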
An example: data (loan application)
Each record describes a past loan application; the class attribute is "Approved or not" (Yes/No).
An example: the learning task
Learn a classification model from the data.
Use the model to classify future loan applications into Yes (approved) and No (not approved).
What is the class for the following case/instance?
Supervised vs. unsupervised learning
Supervised learning: classification is seen as supervised learning from examples.
Supervision: the data (observations, measurements, etc.) are labeled with pre-defined classes. It is as if a "teacher" gives the classes (supervision). Test data are classified into these classes too.
Unsupervised learning (clustering): class labels of the data are unknown. Given a set of data, the task is to establish the existence of classes or clusters in the data.
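A minimal sketch of the contrast in Python, assuming scikit-learn is available; the tiny data set below is made up:

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

X = [[1.0, 1.1], [0.9, 1.0], [8.0, 8.2], [7.9, 8.1]]  # four 2-D examples
y = ["A", "A", "B", "B"]                               # pre-defined class labels

# Supervised: a "teacher" supplies y; the model learns to predict it.
clf = KNeighborsClassifier(n_neighbors=1).fit(X, y)
print(clf.predict([[8.1, 8.0]]))   # -> ['B']

# Unsupervised: no labels; discover cluster structure from X alone.
km = KMeans(n_clusters=2, n_init=10).fit(X)
print(km.labels_)                  # e.g. [0 0 1 1] -- cluster ids, no class names
```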
Supervised learning process: two steps
Learning (training): learn a model using the training data.
Testing: test the model using unseen test data to assess the model's accuracy.
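A minimal sketch of the two steps, assuming scikit-learn and using its bundled Iris data:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Step 1 -- learning (training): fit a model on the training data.
model = DecisionTreeClassifier().fit(X_train, y_train)

# Step 2 -- testing: assess accuracy on unseen test data.
print(accuracy_score(y_test, model.predict(X_test)))
```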
Fundamental assumption of learning
Assumption: the distribution of training examples is identical to the distribution of test examples (including future unseen examples). When this assumption is strongly violated, accuracy measured on test data is no longer a reliable guide to performance on future cases.
Illustrating Classification Task

Training Set:
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small     70K     No
4    Yes      Medium   120K     No
5    No       Large     95K     Yes
6    No       Medium    60K     No
7    Yes      Large    220K     No
8    No       Small     85K     Yes
9    No       Medium    75K     No
10   No       Small     90K     Yes

Induction: a learning algorithm is applied to the training set to learn a model. Deduction: the learned model is then applied to the test set to predict its unknown classes.

Test Set:
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small     55K     ?
12   Yes      Medium    80K     ?
13   Yes      Large    110K     ?
14   No       Small     95K     ?
15   No       Large     67K     ?
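A minimal sketch of this induction/deduction loop in Python (scikit-learn assumed), using the tables above; one-hot encoding the categorical attributes via DictVectorizer is one common choice, not the only one:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier

# Training set from the table above (Attrib3 in thousands).
train_rows = [
    ("Yes", "Large", 125, "No"), ("No", "Medium", 100, "No"),
    ("No", "Small", 70, "No"),   ("Yes", "Medium", 120, "No"),
    ("No", "Large", 95, "Yes"),  ("No", "Medium", 60, "No"),
    ("Yes", "Large", 220, "No"), ("No", "Small", 85, "Yes"),
    ("No", "Medium", 75, "No"),  ("No", "Small", 90, "Yes"),
]
test_rows = [
    ("No", "Small", 55), ("Yes", "Medium", 80), ("Yes", "Large", 110),
    ("No", "Small", 95), ("No", "Large", 67),
]

def featurize(rows):
    # Map each row to a dict; DictVectorizer one-hot encodes the strings
    # and passes the numeric attribute through unchanged.
    return [{"A1": a1, "A2": a2, "A3": a3} for a1, a2, a3, *_ in rows]

vec = DictVectorizer()
X_train = vec.fit_transform(featurize(train_rows))
y_train = [r[-1] for r in train_rows]

model = DecisionTreeClassifier().fit(X_train, y_train)  # induction
X_test = vec.transform(featurize(test_rows))
print(model.predict(X_test))                            # deduction: fills in the '?'
```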
Examples of Classification Task
Predicting tumor cells as benign or malignant
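For instance, a minimal sketch with scikit-learn's bundled Wisconsin breast cancer data (benign vs. malignant), using a support vector machine:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

data = load_breast_cancer()            # 569 tumors, 30 numeric attributes each
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, random_state=0)

clf = SVC().fit(X_train, y_train)      # a support vector machine classifier
print(clf.score(X_test, y_test))       # fraction classified correctly on test data
```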
Issues: Data Preparation
Data cleaning: preprocess the data to reduce noise and handle missing values.
Relevance analysis (feature selection): remove irrelevant or redundant attributes.
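A minimal sketch of both preparation steps, assuming scikit-learn, on made-up data with one missing value and one uninformative attribute:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import VarianceThreshold

X = np.array([[1.0, np.nan, 7.0],
              [2.0, 5.0,    7.0],
              [1.5, 4.0,    7.0]])   # third attribute is constant, hence uninformative

# Data cleaning: fill the missing value with the column mean (one simple policy).
X_clean = SimpleImputer(strategy="mean").fit_transform(X)

# Relevance analysis: drop zero-variance attributes (a crude redundancy filter).
X_reduced = VarianceThreshold().fit_transform(X_clean)
print(X_reduced.shape)   # (3, 2): the constant third attribute was removed
```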
Resources: Datasets
UCI Repository:
http://www.ics.uci.edu/~mlearn/MLRepository.html
UCI KDD Archive:
http://kdd.ics.uci.edu/summary.data.application.html
Statlib: http://lib.stat.cmu.edu/
Delve: http://www.cs.utoronto.ca/~delve/
Classification Techniques
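As a preview of the three techniques named in the lecture title, here is a minimal sketch fitting each with scikit-learn (assumed) on the Iris data and comparing cross-validated accuracy:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
for clf in (SVC(),                       # support vector machine
            DecisionTreeClassifier(),    # decision tree
            KNeighborsClassifier()):     # nearest neighbor (k-NN)
    scores = cross_val_score(clf, X, y, cv=5)   # 5-fold cross-validated accuracy
    print(type(clf).__name__, round(scores.mean(), 3))
```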