
Data Mining
Practical Machine Learning Tools and Techniques

Slides for Chapter 7, Extending instance-based and linear models,
of Data Mining by I. H. Witten, E. Frank, M. A. Hall and C. J. Pal
Extending instance-based learning and linear models
• Instance-based learning
• Pruning and reducing the number of exemplars
• Weighted attributes
• Generalized exemplars and distance functions
• Extending linear models
• Support vector machines, kernel ridge regression, kernel perceptrons
• Multilayer perceptrons and radial basis function networks
• Gradient descent
• Numeric prediction with local linear models
• Model Trees
• Learning rule sets with model trees
• Locally weighted linear regression

2
Instance Based Learning

3
Instance-based learning

Practical problems of 1-nearest-neighbour scheme:


• Slow (but: fast tree-based approaches exist)
- Remedy: remove irrelevant data
• Noise (but: k -NN copes quite well with noise)
- Remedy: remove noisy instances
• All attributes deemed equally important
- Remedy: weight attributes (or simply select)
• Doesn’t perform explicit generalization
- Remedy: rule-based NN approach

4
Learning prototypes

• Only those instances involved in a decision need to be stored
• Noisy instances should be filtered out
• Idea: only use prototypical examples

5
Speed up classification, combat noise
• David Aha’s IB2: save memory, speed up classification
• Work incrementally
• Only incorporate misclassified instances
• Problem: noisy data gets incorporated
• David Aha’s IB3: deal with noise
• Discard instances that do not perform well
• Compute confidence intervals for
1. Each instance’s success rate
2. Default accuracy of the instance’s class
• Accept/reject instances according to performance
1. Accept if lower limit of 1 exceeds upper limit of 2
2. Reject if upper limit of 1 is below lower limit of 2

6
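The IB2 idea above can be sketched in a few lines of Python (an illustrative simplification, not Aha's exact algorithm: it omits attribute normalization and tie-breaking):

import numpy as np

def ib2_filter(X, y):
    """Keep only instances that the current store misclassifies (IB2 idea).
    X: 2-D array of numeric attributes, y: array of class labels.
    Returns indices of the stored exemplars. Simplified sketch only."""
    stored = [0]                          # seed the store with the first instance
    for i in range(1, len(X)):
        # 1-NN prediction using only the currently stored exemplars
        dists = np.linalg.norm(X[stored] - X[i], axis=1)
        pred = y[stored[int(np.argmin(dists))]]
        if pred != y[i]:                  # misclassified -> incorporate it
            stored.append(i)
    return stored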
Weight attributes

• David Aha’s IB4: weight each attribute (weights can be class-specific)
• Weighted Euclidean distance:
  sqrt( w1²(x1 − y1)² + ... + wk²(xk − yk)² )
• Update weights based on the nearest neighbor
• Class correct: increase weight
• Class incorrect: decrease weight
• Amount of change for the i-th attribute depends on |xi − yi|

7
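The following Python sketch illustrates the weighted distance and the flavor of the weight update (the exact IB4 update rule differs in detail; attribute values are assumed to be scaled to [0, 1], and the step size delta is an illustrative choice):

import numpy as np

def weighted_distance(x, y, w):
    """Weighted Euclidean distance with per-attribute weights w."""
    return np.sqrt(np.sum((w * (x - y)) ** 2))

def update_weights(w, x, neighbor, same_class, delta=0.05):
    """Illustrative IB4-style update (not the exact published rule):
    correct class -> increase weights, incorrect -> decrease, and the
    amount of change for attribute i depends on |x_i - y_i|."""
    diff = np.abs(x - neighbor)           # assumes attributes scaled to [0, 1]
    change = delta * (1.0 - diff)         # small |x_i - y_i| -> larger change
    return np.clip(w + change if same_class else w - change, 0.0, None)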
Generalized exemplars

• Generalize instances into hyperrectangles


• Online: incrementally modify rectangles
• Offline version: seek small set of rectangles that cover the
instances
• Important design decisions:
• Allow overlapping rectangles?
• Requires conflict resolution
• Allow nested rectangles?
• Dealing with uncovered instances?

8
Rectangular generalizations

• Nearest-neighbor rule is used outside rectangles


• Rectangles are rules! (But they can be more conservative than
“normal” rules.)
• Nested rectangles are rules with exceptions

9
Separating generalized exemplars

[Figure: two sets of generalized exemplars (class 1 and class 2) separated by a line]
10
Generalized distance functions

• Problem with Euclidean distance, etc.: only natural for purely numeric datasets
• Transformation-based approach to designing distance functions can
be applied more generally
• Given: some transformation operations on attributes
• K* similarity = probability of transforming
instance A into B by chance
• Average over all transformation paths
• Weight paths according to their probability
(need way of measuring this)
• Uniform way of dealing with different attribute types
• Easily generalized to give distance between sets of instances

11
Discussion and Bibliographic Notes

• Nearest-neighbor methods gained popularity in machine learning through the work of Aha (1992)
• Salzberg (1991) suggested that generalization with nested
exemplars can achieve high classification accuracy
• Wettschereck and Dietterich (1994) argued that these results
were fortuitous and did not hold in other domains
• Martin (1995) explored the idea that overgeneralization that
occurs when hyperrectangles nest or overlap is problematic
• The generalized distance function based on transformations is
described by Cleary and Trigg (1995)

12
Extending Linear Models

13
Support vector machines

• Support vector machines are algorithms for learning linear classifiers
• Resilient to overfitting because they learn a particular
linear decision boundary:
• The maximum margin hyperplane
• Fast in the nonlinear case
• Use a mathematical trick to avoid creating “pseudo-attributes”
• The nonlinear space is created implicitly

14
The maximum margin hyperplane

• The instances closest to the maximum margin hyperplane are called support vectors

15
Support vectors
• The support vectors define the maximum margin hyperplane
• All other instances can be deleted without changing its position
and orientation

• The hyperplane can be written as
  x = b + Σ(i is a support vector) αi yi a(i)·a,
  where a is a test instance, the a(i) are the support vectors,
  and yi is the class value (+1 or −1) of a(i)

16
Finding support vectors

• Support vector: training instance for which i > 0


• Determining i and b ?—
A constrained quadratic optimization problem
• Off-the-shelf tools for solving these problems
• However, special-purpose algorithms are faster
• Example: Platt’s sequential minimal optimization (SMO) algorithm
• Note: the method discussed so far assumes separable data!

17
Nonlinear SVMs
• We can create a nonlinear classifier by creating new “pseudo”
attributes from the original attributes in the data
• “Pseudo” attributes represent attribute combinations
• E.g.: all polynomials of degree 2 that can be formed from the original
attributes
• We can learn a linear SVM from this extended data
• The linear SVM in the extended space is a non-linear classifier
in the original attribute space
• Overfitting often not a significant problem with this approach
because the maximum margin hyperplane is stable
• There are often comparatively few support vectors relative to the size
of the training set
• Computation time still an issue
• Each time the dot product is computed, all the “pseudo attributes”
must be included
18
A mathematical trick

• Avoid computing the “pseudo attributes”


• Compute the dot product before doing the nonlinear
mapping
• Example: raise the dot product of the two instance vectors to the power n, i.e., use (x·y)^n
• This corresponds to a map into the instance space spanned by all products of n attributes

19
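A quick numeric check of the trick in Python: squaring the ordinary dot product of two vectors gives exactly the dot product of their explicitly expanded degree-2 "pseudo attributes" (the helper phi below is ours, for illustration only):

import numpy as np
from itertools import product

def phi(v):
    """Explicit map to all products of 2 attributes (with repetition)."""
    return np.array([v[i] * v[j] for i, j in product(range(len(v)), repeat=2)])

x = np.array([3.0, 1.0])
y = np.array([2.0, 4.0])

lhs = np.dot(x, y) ** 2          # kernel computed in the original space
rhs = np.dot(phi(x), phi(y))     # dot product of the expanded "pseudo" attributes
print(lhs, rhs)                  # both print 100.0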
Other kernel functions
• A function K(x, y) computed this way is called a “kernel function”
• Polynomial kernel: the dot product of two instances raised to a power n
• We can use others
• Only requirement: K(x, y) can be written as a dot product in a feature
  space created by an implicit feature mapping Φ()
• Examples: Gaussian (radial basis function) kernel, sigmoid kernel

20
Noise

• Have assumed that the data is separable (in original or transformed space)
• Can apply SVMs to noisy data by introducing a “noise”
parameter C
• Also known as regularization parameter
• C bounds the influence of any one training instance on the
decision boundary
• Based on the following constraint: 0  i  C
• A “soft” margin is maximized based on this constraint
• Still a quadratic optimization problem
• Have to determine C by experimentation

21
Sparse data

• SVM algorithms speed up dramatically if the data is sparse (i.e., many values are 0)
• Why? Because they compute lots and lots of dot products
• Sparse data ⇒ can compute dot products very efficiently
• Iterate only over non-zero values
• SVMs can process sparse datasets with 10,000s of attributes

22
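A minimal Python sketch of a sparse dot product, representing each vector as a dictionary of its non-zero entries (representation and function name are illustrative):

def sparse_dot(x, y):
    """Dot product of two sparse vectors stored as {index: value} dicts.
    Iterates only over the non-zero entries of the smaller vector."""
    if len(x) > len(y):
        x, y = y, x
    return sum(v * y.get(i, 0.0) for i, v in x.items())

# Example: vectors with thousands of attributes but only a few non-zeros
a = {3: 1.5, 1200: 2.0, 9876: -0.5}
b = {3: 4.0, 9876: 2.0}
print(sparse_dot(a, b))   # 6.0 - 1.0 = 5.0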
Support vector regression
• Maximum margin hyperplane only applies to classification
• However, idea of support vectors and kernel functions can
be used for regression
• Basic method is the same as in linear regression: want to
minimize error
• Difference A: ignore errors smaller than ε and use absolute error instead of squared error
• Difference B: simultaneously aim to maximize flatness of function
• User-specified parameter ε defines a “tube”

23
More on SVM regression
• If there are tubes that enclose all the training points, the flattest
of them is used
• E.g.: the mean is used if 2ε > range of target values
• Model can be written as: x = b + Σi αi a(i)·a
• Support vectors: points on or outside the tube
• Dot product can be replaced by a kernel function
• In contrast to the classification case, the coefficients αi may be
  negative (in the classification case, the class values supply the sign)
• No tube that encloses all training points?
• Requires trade-off between error and flatness
• Controlled by upper limit C on absolute value of coefficients αi

24
Examples
[Figure: support vector regression fits with tube widths ε = 2, ε = 1, and ε = 0.5]

25
Kernel Ridge Regression

• For classic linear regression using squared loss, only simple matrix operations are needed to find the parameters
• This is not the case for support vector regression because a
different loss function is used
• Requires use of numeric optimization technique such as
sequential minimal optimization
• Can we combine the power of the kernel trick with the
simplicity of standard least-squares regression?
• Yes! This yields kernel ridge regression

26
Comments on kernel ridge regression
• Like in an SVM, the predicted class value for a test
instance is expressed as a weighted sum of dot products
• But: all training instances are involved in this sum: Σj αj a(j)·a
• Unlike in an SVM, all training instances participate, not just the support vectors
• No sparseness in the solution (no support vectors)
• Also, the loss used in ridge regression does not ignore errors smaller than ε
• Moreover, squared error is used instead of absolute error,
  so the regression model is more sensitive to outliers
27
Performing kernel ridge regression
• The penalized loss function that is optimized by kernel ridge regression is
  the sum of squared errors plus a penalty on the size of the weights:
  Σi ( yi − f(a(i)) )² + λ‖w‖²,
  where in the kernelized setting f(a) = Σj αj a(j)·a and ‖w‖² = Σi Σj αi αj a(i)·a(j)
• The user-specified parameter λ determines closeness of fit to the training data
• The coefficients can be found using matrix operations
• Standard regression: invert an m × m matrix (O(m³)), m = #attributes
• Kernel ridge regression: invert an n × n matrix (O(n³)), n = #instances
• Has an advantage if
• a non-linear fit is desired or
• there are more attributes than training instances
28
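A compact numpy sketch of the computation described above, assuming a Gaussian kernel (function and parameter names are ours): fitting solves one n × n linear system, and prediction sums kernel values over all training instances:

import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    """Gaussian (RBF) kernel matrix between rows of A and rows of B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2 * sigma ** 2))

def kernel_ridge_fit(X, y, lam=0.1, sigma=1.0):
    """Solve (K + lam*I) alpha = y: one n-by-n system, O(n^3)."""
    K = rbf_kernel(X, X, sigma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def kernel_ridge_predict(X_train, alpha, X_test, sigma=1.0):
    """Prediction is a weighted sum of kernel values over all training instances."""
    return rbf_kernel(X_test, X_train, sigma) @ alpha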
The kernel perceptron
• We can use the “kernel trick” to make a non-linear classifier
using the perceptron learning rule
• Observation: in perceptron learning rule, weight vector is
modified by adding or subtracting training instances
• Hence, we can represent the learned weight vector using
  all instances that have been misclassified: wj = Σi y(i) a'(i)j,
  where the sum runs over the misclassified instances a'(i)
• This means we can use Σj Σi y(i) a'(i)j aj instead of Σj wj aj
  (where y(i) is either -1 or +1)
• Now swap the summation signs: Σi Σj y(i) a'(i)j aj
• Can be expressed as: Σi y(i) ( a'(i) · a )
• Can replace the dot product by a kernel: Σi y(i) K( a'(i), a )

29
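A short Python sketch of the kernel perceptron as described above (the polynomial kernel and parameter names are illustrative choices): instead of a weight vector, it stores the misclassified instances and sums their kernel evaluations to make predictions:

import numpy as np

def poly_kernel(x, y, degree=2):
    """Polynomial kernel: dot product in the space of attribute products."""
    return np.dot(x, y) ** degree

def kernel_perceptron(X, y, epochs=10, kernel=poly_kernel):
    """Kernel perceptron sketch: y must contain -1/+1 class values.
    Returns the stored (instance, class) pairs of misclassified examples."""
    mistakes = []
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            # prediction = sign of sum over stored mistakes of y(j) * K(x(j), x)
            s = sum(yj * kernel(xj, xi) for xj, yj in mistakes)
            if yi * s <= 0:              # misclassified (or no votes yet)
                mistakes.append((xi, yi))
    return mistakes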
Comments on kernel perceptron

• Finds separating hyperplane in space created by kernel function (if it exists)
• But: doesn't find maximum-margin hyperplane
• Easy to implement, supports incremental learning
• Perceptron can be made more stable by using all weight
vectors encountered during learning, not just the last one
(yields the voted perceptron)
• Weight vectors vote on prediction (vote based on number of
successful classifications since inception)

30
Multilayer perceptrons

• Using kernels is only one way to build a nonlinear classifier based on perceptrons
• Can create a network of perceptrons to approximate arbitrary target concepts
• A multilayer perceptron is an example of an artificial neural network built from perceptrons
• Consists of: input layer, hidden layer(s), and output layer
• Structure of MLP is usually found by experimentation
• Parameters can be found using backpropagation

31
Examples

32
Backpropagation
• How to learn the weights given a network structure?
• Cannot simply use perceptron learning rule because we have
hidden layer(s)
• Function we are trying to minimize: error
• Can use a general function minimization technique called
gradient descent
• Activation function needs to provide gradient information:
can use sigmoid function instead of threshold function

• Loss function also needs to provide gradient information: cannot use zero-one loss, but can use squared error

33
Threshold vs. sigmoid activation function

34
Gradient descent example
• Function: x² + 1
• Derivative: 2x
• Learning rate: 0.1
• Start value: 4

• Can only find a local minimum!


35
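The slide's example can be reproduced with a few lines of Python (a minimal sketch using the stated settings):

# Gradient descent on f(x) = x**2 + 1, matching the slide's settings.
def gradient_descent(start=4.0, rate=0.1, steps=50):
    x = start
    for _ in range(steps):
        x -= rate * 2 * x        # derivative of x**2 + 1 is 2x
    return x

print(gradient_descent())        # approaches the minimum at x = 0, where f(x) = 1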
Minimizing the error I
• Need to find partial derivative of error function with respect to
each parameter (i.e., weight):

36
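The derivative shown on the original slide is not reproduced here; the following is a standard reconstruction for a single sigmoid output unit f(x) = σ(w·a) trained with squared error (notation is ours, not necessarily the book's):

E = \tfrac{1}{2}\big(f(x) - y\big)^2, \qquad
\frac{\partial E}{\partial w_i} = \big(f(x) - y\big)\,\sigma'(w \cdot a)\,a_i, \qquad
\sigma'(z) = \sigma(z)\big(1 - \sigma(z)\big)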
Minimizing the error II
• What about the weights for the connections from the input to
the hidden layer? More application of the chain rule…

37
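Again as a hedged reconstruction in our notation: for a weight w_{ij} connecting input a_i to hidden unit j, with hidden output h_j = σ(u_j) and u_j = Σ_i w_{ij} a_i, one more application of the chain rule gives

\frac{\partial E}{\partial w_{ij}} = \big(f(x) - y\big)\,\sigma'(w \cdot h)\;w_j\,\sigma'(u_j)\,a_i

where w_j is the output-layer weight attached to hidden unit j.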
Remarks
• The same process works for multiple hidden layers and
multiple output units (e.g., for multiple classes)
• Can update weights after all training instances have been
processed or incrementally:
• batch learning vs. stochastic backpropagation
• Weights are initialized to small random values
• How to avoid overfitting?
• Early stopping: use validation set to check when to stop
• Weight decay: add penalty term to error function
• How to speed up learning?
• Momentum: re-use proportion of old weight change
• Use optimization method that employs 2nd derivative

38
Radial basis function networks

• RBF network: another type of feedforward network, with two layers (plus the input layer)
• Hidden units represent points in instance space and
activation depends on distance to these points
• To this end, distance is converted into a similarity score using a
Gaussian activation function
• Width of Gaussian may be different for each hidden unit
• Points of equal activation of units in hidden layer form hypersphere
(or hyperellipsoid) as opposed to hyperplane
• Output layer is the same as in a multi-layer perceptron

39
Learning RBF networks
• Parameters to be learned: centers and widths of the RBFs +
weights in output layer
• Can learn the two sets of parameters independently and still get
fairly accurate models
• E.g.: clusters from k-means can be used to form basis functions
• Linear model for output layer can be based on fixed RBFs found using
clustering, which makes learning very efficient
• However, for best accuracy it is best to train the entire network in
a fully supervised manner
• Can use the same methods that are used for training multilayer perceptrons
• Disadvantage of standard RBF networks: no built-in attribute
weighting based on relevance
• But: can introduce attribute weights into the distance function
• RBF networks are related to RBF SVMs, which have a basis
function centered on each support vector
40
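A compact Python sketch of the two-stage approach mentioned above (fix the basis functions first, then fit the linear output layer); for brevity the centers are random training points rather than k-means cluster centers, and all names are illustrative:

import numpy as np

def rbf_design_matrix(X, centers, width=1.0):
    """Gaussian activations of each hidden unit for each instance."""
    sq = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2 * width ** 2))

def fit_rbf_network(X, y, n_hidden=10, width=1.0, seed=0):
    """Stage 1: pick centers (random training points here; k-means is usual).
       Stage 2: fit the linear output layer by least squares on the activations."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=min(n_hidden, len(X)), replace=False)]
    H = rbf_design_matrix(X, centers, width)
    weights, *_ = np.linalg.lstsq(H, y, rcond=None)
    return centers, weights

def predict_rbf(X, centers, weights, width=1.0):
    return rbf_design_matrix(X, centers, width) @ weights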
Stochastic gradient descent
• We have seen gradient descent + stochastic gradient
descent for learning weights in a neural network
• Gradient descent is a general-purpose optimization
technique
• Can be applied whenever the objective function is differentiable
• Actually, can be used even when the objective function is not
completely differentiable!
• This is based on the concept of subgradients, which we will not get
into here
• One application: learning linear models – e.g. linear SVMs
or logistic regression
• Very fast, simple method for learning from large datasets

41
Stochastic gradient descent cont.

• Learning linear models using gradient descent is easier than optimizing non-linear neural networks
• Objective function has a single global minimum rather than
several local minima
• Stochastic gradient descent is fast, uses little memory and is
suitable for incremental online learning
• Let us look at how to apply stochastic gradient descent to
learn a linear support vector machine

42
Loss functions
• For SVMs, the error function (to be minimized) is called the hinge loss:
  E(z) = max(0, 1 − z), where z = y (w·a + b)
  and y is the class value (+1 or −1)

43
Optimizing the hinge loss
• In the linearly separable case, the hinge loss is 0 for a
function that successfully separates the data
• The maximum margin hyperplane is given by the smallest weight
vector that achieves 0 hinge loss
• Corresponding optimization problem that needs to be solved:
  minimize λ‖w‖² + Σi max(0, 1 − y(i) (w·a(i) + b)),
  where λ is a user-specified regularization parameter
• But: the hinge loss is not differentiable at z = 1; cannot
  compute a gradient for all values of z
• Can use a subgradient – something that resembles a gradient
• Can use 0 at z = 1
• In fact, the loss is 0 for z ≥ 1, so we can focus on z < 1 and proceed as
  usual with stochastic gradient descent
• Also yields a solution if the data is not separable
44
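A minimal Python sketch of stochastic subgradient descent for a linear SVM with hinge loss, using a Pegasos-style decreasing step size (the bias term is omitted and parameter names are illustrative):

import numpy as np

def sgd_linear_svm(X, y, lam=0.01, epochs=10):
    """Minimize lam*||w||^2/2 + average hinge loss by stochastic subgradient descent.
    y must contain -1/+1 class values; the bias term is omitted for brevity."""
    n, d = X.shape
    w = np.zeros(d)
    t = 0
    for _ in range(epochs):
        for i in np.random.permutation(n):
            t += 1
            eta = 1.0 / (lam * t)                 # decreasing learning rate
            if y[i] * np.dot(w, X[i]) < 1:        # inside the margin: hinge loss active
                w = (1 - eta * lam) * w + eta * y[i] * X[i]
            else:                                  # hinge loss (and its subgradient) is 0
                w = (1 - eta * lam) * w
    return w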
Discussion and Bibliographic Notes
• SVMs stem from statistical learning theory (Vapnik 1999)
• A good starting point for exploration is a tutorial by Burges (1998)
• Soft-margin SVMs were discussed by Cortes and Vapnik (1995)
• Tutorial on support vector regression: Smola and Schölkopf (2004)
• Schölkopf et al. (1999) present support vector regression with just
one parameter instead of two (C and ε)
• Fletcher (1987) covers constrained quadratic optimization
• The SMO algorithm for training SVMs is due to Platt (1998)
• Ridge regression was introduced by Hoerl and Kennard (1970)
• Hastie et al. (2009) give a good description of kernel ridge regression
• Kernel ridge regression is equivalent to Gaussian process regression,
a Bayesian approach that also provides estimates of uncertainty
45
Discussion and Bibliographic Notes
• The kernel perceptron is due to Freund and Schapire (1999)
• Cristianini and Shawe-Taylor (2000) provide an introduction
to support vector machines and kernel-based methods
• Shawe-Taylor and Cristianini (2004) and Schölkopf and Smola
(2002) cover kernel-based learning in detail
• Bishop (1995) provides an excellent introduction to both
multilayer perceptrons and RBF networks
• Kivinen et al. (2002), Zhang (2004) and Shalev-Shwartz et al.
(2007) explore gradient methods for SVMs
• Kivinen et al. and Shalev-Shwartz et al. provide heuristics for
setting the learning rate for gradient descent

46
Numeric Prediction with
Local Linear Models

47
Numeric prediction (aka regression)

• Counterparts exist for all classification schemes previously discussed
• Decision trees, rule learners, SVMs, etc.
• (Almost) all classification schemes can be applied to regression
problems using discretization:
• Discretize the class into intervals
• Predict weighted average of interval representatives (e.g.,
midpoints)
• Weight according to class probabilities
• We will cover a couple of approaches to regression that are
based on building local linear models
• Model trees (+ a rule learning algorithm based on them) and locally
weighted linear regression

48
Regression trees

• Like decision trees, but:


• Splitting criterion: minimize intra-subset variation
• Termination criterion: std. dev. becomes small
• Pruning criterion: based on numeric error measure
• Prediction: Leaf predicts average class value of instances
• Yields piecewise constant functions
• Easy to interpret
• More sophisticated version: model trees

49
Model trees
• Build a regression tree
• Each leaf  linear regression function
• Smoothing: factor in ancestor’s predictions
• Smoothing formula:

• Same effect can be achieved by incorporating ancestor models


into the leaves
• Need linear regression function at each node
• At each node, use only a subset of attributes to build
linear regression model
• Those occurring in subtree
• (+maybe those occurring in path to the root)
• Fast: tree usually uses only a small subset of the attributes
50
Building the tree
• Splitting: standard deviation reduction
  SDR = sd(T) − Σi ( |Ti| / |T| ) × sd(Ti)
• Termination of splitting process:
• Standard deviation < 5% of its value on the full training set
• Too few instances remain (e.g., < 4)
• Pruning:
• Heuristic estimate of the absolute error of linear regression models:
  (n + ν) / (n − ν) × average absolute error on the training data,
  where n = #training instances and ν = #parameters in the model
• Greedily remove terms from LR models to minimize estimated error


• Proceed bottom up: compare error of LR model at internal node to
error of subtree (this happens before smoothing is applied)
• Heavy pruning: single model may replace whole subtree

51
Nominal attributes
• Convert nominal attributes to binary ones
• Sort attribute values by their average class values
• If attribute has k values,
generate k – 1 binary attributes
• i th attribute is 0 if original nominal value is part of the first i
nominal values in the sorted list, and 1 otherwise
• Treat binary attributes as numeric in linear regression models
and when selecting splits
• Can prove: best SDR split on one of the new binary attributes
is the best (binary) SDR split on original nominal attribute
• In practice this process is not applied at every node of the
tree but globally at the root node of the tree
• Splits are no longer optimal but runtime and potential for overfitting
are reduced this way

52
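A small Python sketch of this conversion for a single nominal attribute (function name and representation are ours):

from collections import defaultdict

def nominal_to_binary(values, targets):
    """Convert one k-valued nominal attribute into k-1 binary attributes
    by ordering the values on their average target (class) value."""
    sums, counts = defaultdict(float), defaultdict(int)
    for v, t in zip(values, targets):
        sums[v] += t
        counts[v] += 1
    order = sorted(sums, key=lambda v: sums[v] / counts[v])   # sort by average class value
    rank = {v: i for i, v in enumerate(order)}
    # i-th binary attribute is 0 if the value is among the first i+1 values, else 1
    return [[0 if rank[v] <= i else 1 for i in range(len(order) - 1)]
            for v in values]

# Example with three nominal values and numeric target values
print(nominal_to_binary(["A", "B", "A", "C"], [1.0, 5.0, 2.0, 3.0]))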
Missing values

• Modify splitting criterion to account for missing values:
  SDR = ( m / |T| ) × [ sd(T) − Σi ( |Ti| / |T| ) × sd(Ti) ],
  where m is the number of instances with known values for the split attribute
• To determine which subset an instance goes into, use surrogate splitting
• Split on the attribute whose correlation with attribute whose value is
missing is greatest
• Problem: complex and time-consuming
• Simple solution: always use the class as surrogate attribute
• Class can only be used at training time
• Test set: replace missing value with average

53
Surrogate splitting based on class
• Choose split point based on instances with known values
• Split point divides instances into 2 subsets
• L (smaller class average)
• R (larger)
• m is the average of the two averages
• For an instance with a missing value:
• Choose L if class value < m
• Otherwise R
• Once full tree is built, replace missing values with averages of
corresponding leaf nodes
• Linear regression models can then be built on the completed
(“imputed”) dataset

54
Pseudo-code for M5'
• Let us consider the pseudo code for the model tree inducer M5’
• Four methods:
• Main method: MakeModelTree
• Method for splitting: split
• Method for pruning: prune
• Method that computes error: subtreeError
• We will briefly look at each method in turn
• We will assume that the linear regression method performs
attribute subset selection based on error (discussed previously)
• Nominal attributes are replaced globally at the root node

55
MakeModelTree

MakeModelTree (instances)
{
SD = sd(instances)
for each k-valued nominal attribute
convert into k-1 synthetic binary attributes
root = newNode
root.instances = instances
split(root)
prune(root)
printTree(root)
}

56
split

split(node)
{
if sizeof(node.instances) < 4 or
sd(node.instances) < 0.05*SD
node.type = LEAF
else
node.type = INTERIOR
for each attribute
for all possible split positions of attribute
calculate the attribute's SDR
node.attribute = attribute with maximum SDR
split node.instances on node.attribute into node.left.instances and node.right.instances
split(node.left)
split(node.right)
}

57
prune

prune(node)
{
if node.type = INTERIOR then
prune(node.leftChild)
prune(node.rightChild)
node.model = linearRegression(node)
if subtreeError(node) > error(node) then
node.type = LEAF
}

58
subtreeError

subtreeError(node)
{
l = node.left; r = node.right
if node.type = INTERIOR then
return (sizeof(l.instances)*subtreeError(l)
+ sizeof(r.instances)*subtreeError(r))
/sizeof(node.instances)
else return error(node)
}

59
Model tree for servo data

60
Rules from model trees
• PART algorithm generates classification rules by building
partial decision trees
• Can use the same method to build rule sets for regression
• Use model trees instead of decision trees
• Use variance instead of entropy to choose node to expand when
building a partial tree
• Rules that are generated will have linear models on the right-hand side
• Caveat: using smoothed trees may not be appropriate due
to the separate-and-conquer strategy used in rule learning
• Empirical evidence shows that smoothing does not help
• Full trees can be used instead of partial trees at the expense
of runtime

61
Locally weighted regression
• Locally weighted regression is a numeric prediction method
that combines
• instance-based learning
• linear regression
• It is a “lazy” learning method:
• Computes new regression function for each test instance at
prediction time
• Works incrementally
• Weights training instances
• according to distance to test instance
• builds linear regression model from weighted data
• requires weighted version of linear regression (straightforward)
• Advantage: nonlinear approximation
• Slow if implemented using brute-force search; however, fast
data structures can be used for the nearest-neighbor search
62
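A minimal numpy sketch of locally weighted linear regression for a single test instance, using a Gaussian weighting function (the weighting function and bandwidth parameter are illustrative choices; see the design decisions on the next slide):

import numpy as np

def locally_weighted_prediction(X, y, x_query, bandwidth=1.0):
    """Locally weighted linear regression for one test instance:
    weight each training instance by a Gaussian of its distance to the query,
    then solve a weighted least-squares problem at prediction time."""
    Xb = np.hstack([np.ones((len(X), 1)), X])          # add intercept column
    d = np.linalg.norm(X - x_query, axis=1)
    w = np.exp(-(d / bandwidth) ** 2)                  # Gaussian weighting function
    W = np.diag(w)
    beta = np.linalg.pinv(Xb.T @ W @ Xb) @ (Xb.T @ W @ y)
    return np.concatenate(([1.0], x_query)) @ beta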
Design decisions
• Weighting functions:
• Inverse Euclidean distance
• Gaussian kernel applied to Euclidean distance
• Triangular kernel used the same way
• etc.
• Empirically, performance does not appear to depend much on
the weighting method that is used
• Ideally, weighting function has bounded support so that most
training instances receive weight 0 and can be ignored
• Smoothing parameter is used to scale the distance function for
computation of the weights
• Multiply distance by inverse of this parameter
• Possible choice: distance to the kth nearest training instance (renders
choice of smoothing parameter data dependent)

63
Discussion and Bibliographic Notes
• Regression trees were introduced in the “classification and regression trees”, or CART, system (Breiman et al., 1984)
• The method of handling nominal attributes and the surrogate
device for dealing with missing values were included in CART
• M5 model trees were first described by Quinlan (1992)
• The M5’ version is given by Wang and Witten (1997)
• Using model trees (although not partial trees) for generating
rule sets has been explored by Hall et al. (1999)
• There are many variations of locally weighted learning.
• Statisticians have considered using locally quadratic models
• They have applied locally weighted logistic regression to classification
• Frank et al. (2003) evaluated the use of locally weighted learning in
conjunction with Naïve Bayes
• Atkeson et al. (1997) provide a survey on locally weighted learning

64
