ML Merged
Machine learning: "Field of study that gives computers the ability to
learn without being explicitly programmed"
• The science (and art) of programming computers so they can learn from data
• Engineering-oriented definition
• Algorithms that improve their performance P at some task T with experience E
Traditional programming: Data + Program → Computer → Output
Machine learning: Data + Output → Computer → Program
Defining the Learning Tasks
Improve on task T, with respect to performance metric P, based on experience E
• Example 1
• T: Playing checkers
• P: Percentage of games won against an arbitrary opponent
• E: Playing practice games against itself
• Example 2
• T: Recognizing hand-written words
• P: Percentage of words correctly classified
• E: Database of human-labeled images of handwritten words
• Example 3
• T: Driving on four-lane highways using vision sensors
• P: Average distance traveled before a human-judged error
• E: A sequence of images and steering commands recorded while observing a human driver.
• Example 4
• T: Categorize email messages as spam or legitimate.
• P: Percentage of email messages correctly classified.
• E: Database of emails, some with human-given labels
Traditional Approach to Spam Filtering
Spam typically uses words or phrases such as “4U,” “credit card,” “free,” and “amazing”
• Solution
• Write a detection algorithm for frequently appearing patterns in spam
• Test and update the detection rules until they are good enough
• Challenge
• The detection algorithm is likely to be a long list of complex rules
• hard to maintain
Machine Learning Approach
Automatically learns phrases that are good predictors of spam by detecting unusually
frequent word patterns in spam compared to "ham" (legitimate email)
• The program is much shorter, easier to maintain, and most likely more accurate.
A Classic Example of an ML Task
It is very hard to write down rules for what makes a "2" a "2", and not a "3" or a "7".
[Figure: samples of handwritten digits: all "2"s, a "3", a "7"]
More ML Usage Scenarios
Tasks that are best solved by using a learning algorithm
• Recognizing Patterns in images, text
• Facial identities or facial expressions
• Handwritten or spoken words
• Medical images
• Types of Applications
• Application Domains
• State of the Art Applications
Application Domains
• Internet
• Computational biology
• Finance
• E-commerce
• Space exploration
• Robotics
• Information extraction
• Social networks
• Software engineering
• System management
• Creative Arts
Example: Classification
Assign object/event to one of a given finite set of categories
• Medical Diagnosis
• Credit card applications or transactions
• Fraud detection in e-commerce
• Worm detection in network packets
• Spam filtering in email
• Recommended articles in a newspaper
• Recommended books, movies, music, or jokes
• Financial investments
• DNA sequences
• Spoken words
• Handwritten letters
• Astronomical images
Example: Planning, Control, Problem Solving
Performing actions in an environment in order to achieve a goal
• Playing checkers, chess, or backgammon
• Balancing a pole
• Driving a car or a jeep
• Flying a plane, helicopter, or rocket
• Controlling an elevator
• Controlling a character in a video game
• Controlling a mobile robot
Breakthrough in Automatic Speech Recognition
[Figure: Stanley, Sebastian Thrun's Stanford autonomous vehicle: laser terrain mapping and path planning]
Contemporary ML Based Solutions
There are so many different types of Machine Learning systems that it is useful to classify them in
broad categories based on:
1. Based on training: whether or not they are trained with human supervision (supervised,
unsupervised, semi-supervised, and Reinforcement Learning)
2. Based on stream of incoming data: Whether or not they can learn incrementally on the fly
(online versus batch learning)
3. How they generalize: Whether they work by simply comparing new data points to known
data points, or instead detect patterns in the training data and build a predictive model,
much like scientists do (instance-based versus model-based learning)
ML Terminologies
1. Labels: A label is the thing we're predicting, the y variable in simple linear regression. The
label could be the future price of gold, the kind of animal shown in a picture, the meaning of an
audio clip.
2. Features/attribute: A feature is an input variable—the x variable in simple linear
regression. A simple machine learning project might use a single feature, while a more
sophisticated machine learning project could use millions of features, specified as:
x1, x2, …, xn
In the spam detector example, the features could include the following:
• words in the email text
• sender's address
• time of day the email was sent
Another example: to predict the price (label) of a used car, use features such as:
1. mileage
2. age
3. brand
ML Terminologies
3. Examples/Instances
An example is a particular instance of data, x. (We put x in boldface to indicate that it is a
vector.) We break examples into two categories:
• labeled examples
• unlabeled examples
A labeled example includes both feature(s) and the label. That is:
labeled examples: {features, label}: (x, y)
An unlabeled example contains features but not the label. That is:
unlabeled examples: {features, ?}: (x, ?)
ML Terminologies
4. Models
A model defines the relationship between features and label. For example, a spam detection
model might associate certain features strongly with "spam". Let's highlight two phases of a
model's life:
1. Training means creating or learning the model. That is, you show the
model labeled examples and enable the model to gradually learn the
relationships between features and label.
2. Inference means applying the trained model to unlabeled examples.
That is, you use the trained model to make useful predictions (y'). For
example, during inference, you can predict HouseValue for new
unlabeled examples.
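As an illustration of the two phases, here is a minimal sketch using scikit-learn; the feature values, labels, and the LinearRegression choice are hypothetical stand-ins, not the course's prescribed setup.

```python
# Minimal sketch of training vs. inference (hypothetical data).
from sklearn.linear_model import LinearRegression

X = [[1200, 3], [1500, 4], [900, 2]]   # features: e.g., area, no. of rooms
y = [310000, 420000, 210000]           # labels: e.g., HouseValue

model = LinearRegression()
model.fit(X, y)                        # training: learn from labeled examples
y_pred = model.predict([[1100, 3]])    # inference: predict y' for an unlabeled example
print(y_pred)
```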
Types of Learning
Based on level of supervision
• Supervised (inductive) learning
• Given: training data, desired outputs (labels)
• Unsupervised learning
• Given: training data only (without desired outputs)
• Semi-supervised learning
• Given: training data and a few desired outputs
• Reinforcement learning
• Given: rewards from sequence of actions
Types of Learning : Supervised Learning
Based on level of supervision
• Regression vs. classification
• A regression model predicts continuous values. For example, regression models make
predictions that answer questions like the following:
1. What is the value of a house in Bangalore?
2. What is the probability that a user will click on this ad?
• A classification model predicts discrete values. For example, classification models make
predictions that answer questions like the following:
1. Is a given email message spam or not spam?
2. Is this an image of a dog, a cat, or a hamster?
Supervised Learning: Regression
• Given (x1, y1), (x2, y2), ..., (xn, yn)
• Learn a function f (x) to predict y given x
– y is real-valued
[Figure: September Arctic sea ice extent (1,000,000 sq km) vs. year, 1970–2020: a regression example with real-valued y]
Supervised Learning: Classification
• Given (x1, y1), (x2, y2), ..., (xn, yn)
• Learn a function f (x) to predict y given x
– y is categorical
Learned classifier: if tumor size x > T, predict malignant (y = 1); else predict benign (y = 0).
[Figure: 1-D classification on tumor size with decision threshold x = T]
Increasing Feature Dimension
• x can be multi-dimensional: each dimension corresponds to an attribute, e.g.,
– Clump Thickness
– Uniformity of Cell Size
– Uniformity of Cell Shape
– …
[Figure: 2-D feature space, Age vs. Tumor Size]
Example: Supervised Learning Techniques
• Linear Regression
• Logistic Regression
• Naïve Bayes Classifiers
• Support Vector Machines (SVMs)
• Decision Trees and Random Forests
• Neural networks
Unsupervised Learning
• Clustering
• k-Means
• Hierarchical Cluster Analysis
• Expectation Maximization
• Visualization and dimensionality reduction
• Principal Component Analysis (PCA)
• Kernel PCA
• Locally-Linear Embedding (LLE)
• t-distributed Stochastic Neighbor Embedding (t-SNE)
• Association rule learning
• Apriori
• Eclat
Data Visualization
Visualize 2/3D representation of complex unlabelled training data
• Preserve as much structure as possible
• e.g., trying to keep separate clusters in the input space from overlapping in the visualization
• Understand how the data is organized
• Identify unsuspected patterns.
[Figure: 2-D visualization of image classes: frog, cat, bird, dog, truck, automobile, deer, horse, ship, airplane]
Association Rule Mining
Applications: Unsupervised Learning
Genomics application: Group individuals by genetic similarity
[Figure: heatmap of genes × individuals, with individuals grouped by genetic similarity]
Applications: Unsupervised Learning
• Organize computing clusters
• Social network analysis
• For example, deep belief networks (DBNs) are based on unsupervised components called
restricted Boltzmann machines (RBMs) stacked on top of one another. RBMs are trained
sequentially in an unsupervised manner, and then the whole system is fine-tuned using
supervised learning techniques.
Reinforcement Learning
A learning agent
• observes the state of the environment,
• selects and performs actions,
• gets positive or negative rewards in return,
• learns the best strategy, aka a policy, to get the most reward over time.
• A policy is a mapping from states → actions
• Examples:
– Game playing, e.g., AlphaGo
– Robot in a maze
– Balance a pole on your hand
Types of Learning
Based on how training data is used
• Batch learning
• Uses all available data at a time during training
• Mini-batch learning
• Uses a subset of the available data at a time during training
• Online (incremental) learning
• Uses single training data instance at a time during training
Types of Learning
Based on how training data is used
• Instance Based Learning
• compare new data points to known data points
• Model Based learning
• detect patterns in the training data and build a predictive model
Challenges of Machine Learning
• Training Data
• Insufficient
• Non representative
• Poor Quality
• Irrelevant attributes
• Model Selection
• Overfitting
• Underfitting
• Testing and Validation
• Hyperparameters
Insufficient Training Data
Consider the trade-off between algorithm development effort and training-data capture
Non-representative Training Data
Training data must be representative of the new cases we want to generalize to
• Small sample size leads to sampling noise
• Missing data overemphasizes the role of wealth on happiness
• If the sampling process is flawed, even a large sample can suffer from sampling bias
[Figure: model fitted on the available data vs. the missing data]
Data Quality
Cleaning often needed for improving data quality
• Some instances have missing features
• e.g., 5% of customers did not specify their age
• Ignore the instances altogether, or drop the feature
• Fill in the missing values
• Train multiple ML models
• Some instances can be erroneous, noisy or outliers
• Human or machine generated
• Identify, discard or fix manually, as appropriate
Irrelevant Features
Feature engineering needed for coming up with a good set of features
• Feature selection
• more useful features to train on among existing features.
• Feature extraction
• combine existing features to produce a more useful one.
• Create new features by gathering new data
Model Selection
Overfitting or Underfitting
• Overfitting leads to high performance in training set but performs poorly on new data
• e.g., a high-degree polynomial life satisfaction model that strongly overfits the training data
• Small training set or sampling noise can lead to model following the noise than the underlying pattern
in the dataset
• Solution: regularization (constrains the model, e.g., forces a smaller slope)
[Figure: high-order polynomial fit with a large slope vs. a regularized fit with a smaller slope]
• Underfitting when the model is too simple to learn the underlying structure in the data
• Select a more powerful model, with more parameters
• Feed better features to the learning algorithm
• Reduce regularization
Testing and Validation
Performance of ML algorithms is statistical / predictive
• Good ML algorithms need to work well on test data
• But test data is often not accessible to the provider of the algorithm
• Common assumption is training data is representative of test data
• Randomly chosen subset of the training data is held out as validation set
• aka dev set
• Once ML model is trained, its performance is evaluated on validation data
• Expectation is ML model working well on validation set will work well on unknown test data
• Typically 20-30% of the data is randomly held out as validation data
Cross Validation
K-fold validation is often performed
• To reduce the bias of validation set selection process
• Often K is chosen as 10
• aka 10 fold cross validation
• 10 fold cross validation involves
• randomly selecting the validation set 10 times
• generating a model with each of the 10 resulting training sets
• evaluating each model's performance on its validation set
• averaging the performance over the validation sets
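A minimal sketch of 10-fold cross-validation with scikit-learn; the Iris data and the logistic-regression model are illustrative choices, not part of the original slides.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X, y, cv=10)   # 10 train/validation splits
print(scores.mean(), scores.std())             # performance averaged over folds
```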
Choice of Hyperparameters
Modern ML models often have many tunable settings that are not learned from the data
• Known as hyperparameters
• Model performance depends on the choice of hyperparameters
• Each hyperparameter can take a number of values
• Real numbers or categories
• Exponential number of hyperparameter combinations possible
• Best model correspond to best cross validation performance over the set of hyperparameter
combinations
• Expensive to perform
• Some empirical frameworks available for hyperparameter optimization
Thank You!
In our next session: End to End ML
• Outliers are data objects with characteristics that are considerably different than most of the
other data objects in the data set
• Case 1: Outliers are noise that interferes
with data analysis
• Data set may include data objects that are duplicates, or almost duplicates of one another
• Major issue when merging data from heterogeneous sources
• Examples:
• Same person with multiple email addresses
• Data cleaning
• Process of dealing with duplicate data issues
Data Preprocessing
• Aggregation
• Sampling
• Discretization and Binarization
• Attribute Transformation
• Dimensionality Reduction
• Feature subset selection
• Feature creation
Aggregation
Combining two or more attributes (or objects) into a single attribute (or object)
• Purpose
• Data reduction
• Change of scale
• More “stable” data
Sampling
• Stratified sampling
• Split the data into several partitions; then draw random samples from each partition
Discretization
Process of converting a continuous attribute into an ordinal attribute
• A potentially infinite number of values are mapped into a small number of categories
• Discretization is used in both unsupervised and supervised settings
[Figure: data consists of four groups of points and two outliers; the data is one-dimensional, but a random y component is added to reduce overlap]
Unsupervised Discretization
• Many classification algorithms work best if both the independent and dependent variables have
only a few values
• Example: Iris Plant data set.
• http://www.ics.uci.edu/~mlearn/MLRepository.html
• Three flower types (classes):
• Setosa
• Versicolour
• Virginica
• Four (non-class) attributes
• Sepal width and length
• Petal width and length
Supervised Discretization Example …
How can we tell what the best discretization is?
• Supervised discretization: Use class labels to find breaks
[Figure: histogram of petal length (counts 0–50 vs. petal length 0–8), illustrating class-based breaks]
Dimensionality Reduction: PCA
Increasing the # of components improves the quality of reconstruction
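A small sketch of this effect, assuming scikit-learn's PCA and the built-in digits data (an illustrative choice): the mean reconstruction error shrinks as the number of retained components grows.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)        # 64-dimensional images
for k in (2, 10, 30):
    pca = PCA(n_components=k).fit(X)
    X_hat = pca.inverse_transform(pca.transform(X))
    print(k, np.mean((X - X_hat) ** 2))    # reconstruction error decreases with k
```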
Feature Subset Selection
Another way to reduce dimensionality of data
• Redundant features
• Duplicate much or all of the information contained in one or more other attributes
• Example: purchase price of a product and the amount of sales tax paid
• Irrelevant features
• Contain no information that is useful for the ML task at hand
• Example: students' ID is often irrelevant to the task of predicting students' GPA
• Many techniques developed, especially for classification
Feature Creation
Create new attributes that can capture the important information in a data set
• More efficiently than the original attributes
• Three general methodologies
• Feature extraction
• Example: extracting edges from images
• Feature construction
• Example: dividing mass by volume to get density
• Mapping data to new space
• Example: Fourier and wavelet analysis
Mapping Data to a New Space
Fourier and wavelet transforms
[Figure: two sine waves + noise in the time domain, and the corresponding frequency-domain representation]
• The model to predict the median housing price in any district, given all the other metrics.
• Goodness of the model is determined by how close the model output is w.r.t. actual price for
unseen district data
Framing the Problem
Understand the business objective and context
• Goal is to predict a real-valued price based on multiple variables like population, income, etc.
• regression
• Output is based on input data at rest, not on rapidly changing data.
• h is the model, X is the training dataset, m is the number of instances, x(i) is the i-th instance, and y(i) is the
actual price for the i-th instance.
• MAE is preferred when there are a large number of outliers; it is aka the L1 norm or Manhattan distance/norm.
General Form
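The general-form equations were images in the original deck; a standard reconstruction, using the notation just defined (model h, dataset X with m instances x(i) and actual values y(i)), is:

$$\mathrm{RMSE}(X, h) = \sqrt{\frac{1}{m}\sum_{i=1}^{m}\bigl(h(x^{(i)}) - y^{(i)}\bigr)^{2}} \qquad\qquad \mathrm{MAE}(X, h) = \frac{1}{m}\sum_{i=1}^{m}\bigl|\,h(x^{(i)}) - y^{(i)}\,\bigr|$$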
Stepping Back
Check the Assumptions
• Verify that the downstream module actually uses real-valued prices rather than post-processing
them into, say, categories such as "cheap," "medium," or "expensive"
• If not, the problem should have been framed as a classification task, not a regression task.
• Such potential hazards need to be checked early in the design rather than finding out from
deployed systems
Objects
• The same attribute can be mapped to different attribute values
• Example: height can be measured in feet or meters
• Different attributes can be mapped to the same set of values
• Example: attribute values for ID and age are integers
• But the properties of an attribute can be different from the properties of the values used to represent the attribute
[Table: sample records with attributes Tid, Refund, Marital Status, Taxable Income, Evade]
Discrete and Continuous Attributes
• Discrete Attribute
• Has only a finite or countably infinite set of values
• Examples: zip codes, counts, or the set of words in a collection of documents
• Often represented as integer variables.
• Note: binary attributes are a special case of discrete attributes
• Continuous Attribute
• Has real numbers as attribute values
• Examples: temperature, height, or weight.
• Practically, real values can only be measured and represented using a finite number of digits.
• Continuous attributes are typically represented as floating-point variables.
Types and properties of Attributes
Types
• Nominal
• ID numbers, eye color, zip codes
• Ordinal
• rankings (e.g., taste of potato chips on a scale from 1-10), grades, height {tall, medium, short}
• Interval
• calendar dates, temperatures in Celsius or Fahrenheit.
• Ratio
• temperature in Kelvin, length, counts, elapsed time (e.g., time to run a race)
Properties
• Distinctness: =, ≠
• Order: <, >
• Differences are meaningful: +, −
• Ratios are meaningful: ×, ÷
Difference Between Ratio and Interval
Types of data sets
• Record
• Data Matrix
• Document Data
• Transaction Data
• Graph
• World Wide Web
• Molecular Structures
• Ordered
• Spatial Data
• Temporal Data
• Sequential Data
• Genetic Sequence Data
Record Data
• Data that consists of a collection of records, each of which consists of a fixed set of attributes
• If data objects have the same fixed set of numeric attributes, then the data objects can be
thought of as points in a multi-dimensional space, where each dimension represents a distinct
attribute
• Such a data set can be represented by an m by n matrix, where there are m rows, one for each
object, and n columns, one for each attribute
Document Data
Each document becomes a term vector; the value of each component is the number of times the corresponding term occurs in the document:

            timeout season coach game score play team win ball lost
Document 1        3      0     5    0     2    6    0   2    0    2
Document 2        0      7     0    2     1    0    0   3    0    0
Document 3        0      1     0    0     1    2    2   0    3    0
Transaction Data
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
Graph Data
[Figure: graph with nodes and weighted links]
[Figure: sequences of transactions: items/events, with an element of the sequence highlighted]
Ordered Data
Genomic sequence data
GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG
Ordered Data
SpatioTemporal Data
Average Monthly Temperature of land and ocean
• The mean is the most common measure of the location of a set of points.
• However, the mean is very sensitive to outliers.
• Thus, the median or a trimmed mean is also commonly used.
• Because of outliers, other measures are often used. Average Absolute Distance is given by
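The formula itself was an image in the original slide; one common form of the average absolute distance, taken around the mean of m values, is:

$$\mathrm{AAD}(x) = \frac{1}{m}\sum_{i=1}^{m}\bigl|\,x_i - \overline{x}\,\bigr|$$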
Similarity and Dissimilarity Measures
• Similarity measure
• Numerical measure of how alike two data objects are.
• Is higher when objects are more alike.
• Often falls in the range [0,1]
• Dissimilarity measure
• Numerical measure of how different two data objects are
• Lower when objects are more alike
• Minimum dissimilarity is often 0
• Upper limit varies
• Proximity refers to a similarity or dissimilarity
Similarity/Dissimilarity for Simple Attributes
The following table shows the similarity and dissimilarity between two objects, x and
y, with respect to a single, simple attribute.
Euclidean Distance
• Euclidean Distance
where n is the number of dimensions (attributes) and xk and yk are, respectively, the kth attributes
(components) of data objects x and y.
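The distance formula itself was an image; reconstructed from the definition just given:

$$d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{k=1}^{n}(x_k - y_k)^2}$$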
Example: four two-dimensional points

point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

Distance Matrix
      p1     p2     p3     p4
p1    0      2.828  3.162  5.099
p2    2.828  0      1.414  3.162
p3    3.162  1.414  0      2
p4    5.099  3.162  2      0
Minkowski Distance
• r = 2: Euclidean distance
• Do not confuse r with n; all these distances are defined for any number of dimensions.
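The general form appeared as an image; the standard Minkowski distance with parameter r is:

$$d(\mathbf{x}, \mathbf{y}) = \Bigl(\sum_{k=1}^{n}\lvert x_k - y_k\rvert^{r}\Bigr)^{1/r}$$

r = 1 gives the Manhattan (L1) distance and r → ∞ the supremum (L∞) distance.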
Mahalanobis Distance
$$\mathrm{mahalanobis}(\mathbf{x}, \mathbf{y}) = (\mathbf{x} - \mathbf{y})\,\Sigma^{-1}\,(\mathbf{x} - \mathbf{y})^{\top}$$
where Σ is the covariance matrix of the input data; for two attributes x and y,
$$\Sigma = \frac{1}{N}\begin{bmatrix}\sum_i (x_i - m_x)^2 & \sum_i (x_i - m_x)(y_i - m_y)\\ \sum_i (x_i - m_x)(y_i - m_y) & \sum_i (y_i - m_y)^2\end{bmatrix}, \qquad m_x = \frac{1}{N}\sum_i x_i, \quad m_y = \frac{1}{N}\sum_i y_i$$
Example: with $\Sigma = \begin{bmatrix}0.3 & 0.2\\ 0.2 & 0.3\end{bmatrix}$ and points A = (0.5, 0.5), B = (0, 1), C = (1.5, 1.5):
Mahal(A, B) = 5, Mahal(A, C) = 4
[Figure: points A, B, C over a correlated data cloud]
Common Properties of a Distance
• Distances, such as the Euclidean distance, have some well known properties.
1. d(x, y) ≥ 0 for all x and y, and d(x, y) = 0 if and only if x = y. (Positivity)
2. d(x, y) = d(y, x) for all x and y. (Symmetry)
3. d(x, z) ≤ d(x, y) + d(y, z) for all points x, y, and z. (Triangle Inequality)
where d(x, y) is the distance (dissimilarity) between points (data objects), x and y.
[Figure: scatter plots showing similarity (correlation) values from –1 to 1]
Correlation vs Cosine vs Euclidean Distance
• Compare the three proximity measures according to their behavior under variable transformation
• scaling: multiplication by a value
• translation: adding a constant
• Domain of application
• Similarity measures tend to be specific to the type of attribute and data
• Record data, images, graphs, sequences, 3D-protein structure, etc. tend to have different measures
• However, one can talk about various properties that you would like a proximity measure to have
• Symmetry is a common one
• Tolerance to noise and outliers is another
• Ability to find more types of patterns?
• Many others possible
• The measure must be applicable to the data and produce results that agree with domain
knowledge
1. Histograms
2. Two-Dimensional Histograms
3. Box Plot
4. Scatter Plots
5. Contour Plots
6. Matrix Plots
7. Correlation Matrix
8. Star Plots
9. Chernoff Faces
Visualization Techniques: Dataset Used
Visualization Techniques: Histograms
• Histogram
• Usually shows the distribution of values of a single variable
• Divide the values into bins and show a bar plot of the number of objects in each
bin.
• The height of each bar indicates the number of objects
• Shape of histogram depends on the number of bins
• Example: Petal Width (10 and 20 bins, respectively)
Two-Dimensional Histograms
[Figure: two-dimensional histogram of a pair of attributes]
Visualization Techniques: Box Plots
Display the distribution of data
[Figure: box plot marking the 10th, 25th, 50th, 75th, and 90th percentiles]
Visualization Techniques: Scatter Plots
Attributes values determine the position
• Scatter plots can compactly summarize the relationships of several pairs of attributes
• Example: Iris Plant dataset http://www.ics.uci.edu/~mlearn/MLRepository.html
• Three flower types (classes): Setosa, Versicolour, Virginica
• Four (non-class) attributes: Sepal width and length, Petal width and length
Visualization Techniques: Contour Plots
Continuous attribute is measured on a spatial grid
• They partition the plane into regions of similar values
• The contour lines that form the
boundaries of these regions connect
points with equal values
• Common example: contour maps of
elevation, temperature, rainfall, air
pressure, etc.
[Figure: contour map of sea surface temperature (Celsius)]
Visualization Techniques: Matrix Plots
Plot data matrix
• Often useful when objects are sorted according to class
• Plots of similarity or distance matrices can also be useful for visualizing the relationships
[Figure: matrix plot of the Iris data with each attribute standardized by its standard deviation]
Visualization of the Iris Correlation Matrix
Star Plots for Iris Data
Setosa
Versicolour
Virginica
Chernoff Faces for Iris Data
Setosa
Versicolour
Virginica
• rooms_per_household is more correlated with price than the total number of bedrooms or rooms.
[Figure: scatter plots of price against age, latitude, and # of households]
Data Cleaning
• Some of the instances don't have total_bedrooms values
• Since the median applies only to numerical data, create a copy of the data without ocean_proximity
• The imputer stores the median of each attribute in its statistics_ instance variable.
• Use this "trained" imputer to replace any missing values with the learned medians
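A minimal sketch of this step with scikit-learn's SimpleImputer; the tiny DataFrame stands in for the hypothetical numeric-only copy of the housing data described above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# hypothetical numeric-only copy of the data (ocean_proximity dropped)
housing_num = pd.DataFrame({"total_bedrooms": [4.0, np.nan, 3.0],
                            "median_income": [2.5, 3.1, 1.9]})

imputer = SimpleImputer(strategy="median")
imputer.fit(housing_num)             # per-attribute medians stored in imputer.statistics_
print(imputer.statistics_)           # [3.5, 2.5]
X = imputer.transform(housing_num)   # missing values replaced by the learned medians
```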
Handling Textual and Categorical Data
Convert the text attributes to numbers
• Scikit-learn provides a transformer
• ML algorithms assume closer values are more similar, but simple integer codes can place
similar categories like '<1H OCEAN' and 'NEAR OCEAN' far apart in value!
• Solution: use one-hot encoding
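A minimal sketch of one-hot encoding the text attribute with scikit-learn; the three sample category values are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# hypothetical sample of the text attribute
housing_cat = pd.DataFrame({"ocean_proximity": ["<1H OCEAN", "INLAND", "NEAR OCEAN"]})

encoder = OneHotEncoder()
housing_cat_1hot = encoder.fit_transform(housing_cat)  # one binary column per category
print(encoder.categories_)
print(housing_cat_1hot.toarray())
```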
Test Set Creation
• Test set size is typically 20% of the data; fixing the random seed ensures reproducibility of results
• Uses uniform random sampling of the data
• Not appropriate for heavy-tailed distributions
• income is very important for predicting house prices.
• It is important that the test set is representative of the various
categories of incomes in the dataset.
• Most income values are around $20–$50K, but some are well above $60K.
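A sketch of a stratified split on a derived income category, assuming scikit-learn and pandas; the synthetic incomes and the bin edges are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# hypothetical frame with a median_income column
housing = pd.DataFrame({"median_income": np.random.default_rng(42).uniform(0.5, 10, 1000)})

# derive an income category so sampling can preserve its distribution
housing["income_cat"] = pd.cut(housing["median_income"],
                               bins=[0.0, 1.5, 3.0, 4.5, 6.0, np.inf],
                               labels=[1, 2, 3, 4, 5])

train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42,
                                       stratify=housing["income_cat"])
```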
Module 3:
Big Picture: End-to-end Machine Learning
3.1 Model Selection and Training
3.1.1. Prediction Problem
3.1.2. Classification Problem
3.2 Evaluation
3.2.1. Prediction Problem
3.2.2. Classification Problem
3.3 Machine Learning Pipeline
In This Session
• Multi-class classification
• Multi-output classification
In this segment
Model Selection and Training
• Select based on training data
• If prediction label/output is available, use regression or classification model
• Regression if real valued output
• Classification if output is discrete (binary/integer)
• else, unsupervised model is used.
• For the house price prediction problem, use regression model since median house prices are
available along with training data (predictors)
• Example: Linear Regression
• Train a classification model for detecting '5'. The target output for a training instance
corresponding to an image of '5' is 1; otherwise the target output is 0.
• Perform cross-validation as in the regression problem and try out multiple classification models
to achieve acceptable performance.
Multiclass Classification
• Multiclass classifiers (aka multinomial classifiers) can distinguish between more than two classes.
• Some algorithms (such as Random Forest classifiers or naive Bayes classifiers) are capable of
handling multiple classes directly.
• Many (such as Support Vector Machine classifiers or Linear classifiers) are strictly binary
• One-versus-all (OvA) or One-versus-rest strategy using multiple binary classifiers.
• e.g., for MNIST classification, train 10 binary classifiers, one for each digit (a 0-detector, a 1-detector, a
2-detector, and so on).
• get the decision score from each classifier for that image and select the class whose classifier outputs
the highest score.
• One-versus-one (OvO) strategy
• train a binary classifier for every pair of digits: one to distinguish 0s and 1s, another to distinguish 0s and
2s, another for 1s and 2s, and so on.
• If there are N classes, you need to train N × (N – 1) / 2 classifiers.
• Run an image through all 45 classifiers and see which class wins the most duels.
• Main advantage of OvO is each classifier only needs to be trained on the part of the training set
for the two classes that it must distinguish
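As an illustration, scikit-learn provides explicit wrappers for both strategies; the digit data and LinearSVC base classifier below are illustrative choices.

```python
from sklearn.datasets import load_digits
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import LinearSVC

X, y = load_digits(return_X_y=True)
ovr = OneVsRestClassifier(LinearSVC(dual=False)).fit(X, y)  # one classifier per digit
ovo = OneVsOneClassifier(LinearSVC(dual=False)).fit(X, y)   # one per pair of digits
print(len(ovr.estimators_), len(ovo.estimators_))           # 10 and 45
```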
Multiclass Classification
One-versus-all (OvA)
Multiclass Classification
One-versus-one (OvO) strategy
Multi Label / Output Classification
• Confusion Matrix

                     PREDICTED CLASS
                     Class=Yes   Class=No
ACTUAL   Class=Yes   a (TP)      b (FN)
CLASS    Class=No    c (FP)      d (TN)

a: TP (true positive), b: FN (false negative), c: FP (false positive), d: TN (true negative)
Metrics for Performance Evaluation
Focus on the predictive capability of a model
• Confusion Matrix
Accuracy, Precision, Recall, F1 Score
(e.g., specificity: the ability to detect when a disease is not present)
Metrics for Performance Evaluation…
                     PREDICTED CLASS
                     Class=Yes   Class=No
ACTUAL   Class=Yes   a (TP)      b (FN)
CLASS    Class=No    c (FP)      d (TN)
• Most widely-used metric:
$$\text{Accuracy} = \frac{a + d}{a + b + c + d} = \frac{TP + TN}{TP + TN + FP + FN}$$
Cost Matrix
•True positives: data points labeled as positive that are actually positive
•False positives: data points labeled as positive that are actually negative
•True negatives: data points labeled as negative that are actually negative
•False negatives: data points labeled as negative that are actually positive
Limitation of Accuracy
• Accuracy can be misleading when classes are imbalanced: a model that always predicts the majority class still scores high accuracy.
$$\text{Precision } (p) = \frac{a}{a + c} \qquad\qquad \text{Recall } (r) = \frac{a}{a + b}$$
$$\text{F-measure } (F) = \frac{2rp}{r + p} = \frac{2a}{2a + b + c}$$
$$\text{Weighted Accuracy} = \frac{w_1 a + w_4 d}{w_1 a + w_2 b + w_3 c + w_4 d}$$
Accuracy, Precision, Recall, F1Score - Consolidated
Methods for Performance Evaluation
How to obtain a reliable estimate of performance?
• Performance of a model may depend on other factors besides the learning algorithm:
• Class distribution
• Cost of misclassification
• Size of training and test sets
Learning Curve
• Holdout
• Reserve 2/3 for training and 1/3 for testing
• Random subsampling
• Repeated holdout
• Cross validation
• Partition data into k disjoint subsets
• k-fold: train on k-1 partitions, test on the remaining one
• Leave-one-out: k=n
• Stratified sampling
• oversampling vs undersampling
• Bootstrap
• Sampling with replacement
Stratified Sampling: In stratified sampling, researchers divide subjects into subgroups called strata based on
characteristics that they share (e.g., race, gender, educational attainment). Once divided, each subgroup is
randomly sampled using another probability sampling method
ROC (Receiver Operating Characteristic)
At threshold t:
TP=0.5, FN=0.5, FP=0.12, TN=0.88
ROC Curve
(TPR, FPR):
• (0,0): declare everything
to be negative class
• (1,1): declare everything
to be positive class
• (1,0): ideal
• Diagonal line:
• Random guessing
• Below diagonal line:
• prediction is opposite of the true class
Using ROC for Model Comparison
ROC Curve:
Ten test instances sorted by predicted probability P; counts at each threshold:

Class         +     –     +     –     –     –     +     –     +     +
P           0.25  0.43  0.53  0.76  0.85  0.85  0.85  0.87  0.93  0.95
Threshold ≥ 0.25  0.43  0.53  0.76  0.85  0.85  0.85  0.87  0.93  0.95  1.00
TP             5     4     4     3     3     3     3     2     2     1     0
FP             5     5     4     4     3     2     1     1     0     0     0
TN             0     0     1     1     2     3     4     4     5     5     5
FN             0     1     1     2     2     2     2     3     3     4     5
TPR            1    0.8   0.8   0.6   0.6   0.6   0.6   0.4   0.4   0.2    0
FPR            1     1    0.8   0.8   0.6   0.4   0.2   0.2    0     0     0

(TPR = TP/(TP + FN); the FPR row is derived as FP/(FP + TN) from the FP and TN rows above.)
• Gradient Descent
• e.g., Learning rate, how long to run
• Mini-batch
• Batch size
• Regularization constant
• Many Others
• will be discussed in upcoming sessions
Hyperparameter Optimization
• Just fiddling with the parameters until you get the results you want does not scale
• As the number of parameters increases, the cost of grid search increases exponentially!
• Need some way to choose the grid properly
• Sometimes this can be as hard as the original hyperparameter optimization
• Grid search can't take advantage of any insight you have about the system!
Making Grid Search Fast: Random Search
• This is just grid search, but with randomly chosen points instead of points on a grid.
• scikit-learn: RandomizedSearchCV
• Problem: with random search, you are not guaranteed to get anywhere near the optimal parameters in
a finite sample.
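A minimal sketch of randomized search with scikit-learn; the random-forest model, parameter distributions, and digit data are illustrative assumptions.

```python
from scipy.stats import randint
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = load_digits(return_X_y=True)
param_dist = {"n_estimators": randint(10, 200),   # distributions, not a fixed grid
              "max_depth": randint(2, 20)}
search = RandomizedSearchCV(RandomForestClassifier(random_state=0),
                            param_dist, n_iter=10, cv=3, random_state=0)
search.fit(X, y)          # evaluates 10 randomly drawn combinations with 3-fold CV
print(search.best_params_)
```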
An Alternative: Bayesian Optimization
• Idea: learn a statistical model of the function from hyperparameter values to the loss function
• Then choose parameters to minimize the loss
• Main benefit: choose the hyperparameters to test not at random, but in a way that gives the most
information about the model
• This lets it learn faster than grid search
Effect of Bayesian Optimization
• Upside: empirically it has been demonstrated to get better results in fewer experiments
• Compared with grid search and random search
• Partition part of the available data to create a validation dataset that we don't use for training.
• What is MLOps
• DevOps vs MLOps
• Level 0 MLOps
• Continuous Training
• Level 1 MLOps
• Continuous Integration, Delivery
• Frameworks
What is MLOps?
Apply DevOps principles to ML systems
• An engineering culture and practice that aims at unifying ML system development (Dev) and ML
system operation (Ops).
• Automation and monitoring at all steps of ML system construction, including integration, testing,
releasing, deployment and infrastructure management.
• Data scientists can implement and train an ML model with predictive performance on an offline
validation (holdout) dataset, given relevant training data for their use case.
• However, the real challenge is building an integrated ML system and to continuously operate it in
production.
Ecosystem of ML System Components
A small fraction of a real-world ML system is composed of the ML code
DevOps Vs. MLOps
• DevOps for developing and operating large-scale software systems provides benefits such as
• shortening the development cycles
• increasing deployment velocity, and
• dependable releases.
• Two key concepts
• Continuous Integration (CI)
• Continuous Delivery (CD)
• An ML system is a software system, so similar practices apply to reliably build and operate at
scale.
• However, ML systems differ from other software systems
• Team skills: focus on exploratory data analysis, model development, and experimentation.
• Development: ML is experimental in nature.
• The challenge is tracking what worked and what did not, maintaining reproducibility, and maximizing code
reusability.
• Testing: Additional testing needed for data validation, trained model quality evaluation, and model
validation.
Level 1: ML Pipeline Automation
• Perform continuous training (CT) by automating the ML pipeline
• Achieves continuous delivery of the model prediction service
• Adds automated data and model validation steps to the pipeline
• Data validation: Required prior to model training to decide whether to retrain the model or stop
the execution of the pipeline based on following
• Data values skews: significant changes in the statistical properties of data, triggering retraining
• Data schema skews: downstream pipeline steps, including data processing and model training,
receives data that doesn't comply with the expected schema.
• stop the pipeline to release a fix or an update to the pipeline to handle these changes in the schema.
• Schema skews include receiving unexpected features, receiving features with unexpected values, or not receiving all the expected
features
• Model validation: Required after retraining the model with the new data. Evaluate and validate
the model before promoting to production. This offline model validation step consists of
• Producing evaluation metric using the trained model on test data to assess the model quality.
• Comparing the evaluation metrics of production model, baseline model, or other business-requirement
models.
• Ensuring the consistency of model performance on various data segments
• Test model for deployment, including infrastructure compatibility and API consistency
• Undergo online model validation—in a canary deployment or an A/B testing setup
Level 2: CI/CD and automated pipeline automation
Stages of CI/CD Automation Pipeline
1) Development and experimentation: iteratively try new ML algorithms and modeling. The
output is the source code of the ML pipeline steps that are then pushed to a source
repository.
2) Pipeline continuous integration: build source code and run various tests. The outputs of
this stage are pipeline components (packages, executables, and artifacts).
3) Pipeline continuous delivery: deploy artifacts produced by the CI stage to the target
environment.
4) Automated training: automatically executed in production based on a schedule or trigger.
The output is a trained model pushed to the model registry.
5) Model continuous delivery: serve the trained model as a prediction service for the
predictions.
6) Monitoring: collect statistics on the model performance based on live data. The output is a
trigger to execute the pipeline or to execute a new experiment cycle.
Stages of the CI/CD automated ML pipeline
Continuous Integration
• Pipeline and its components are built, tested, and packaged when
• new code is committed or
• pushed to the source code repository.
• Besides building packages, container images, and executables, CI process can include
• Unit testing feature engineering logic.
• Unit testing the different methods implemented in your model.
• For example, you have a function that accepts a categorical data column and encodes it as a one-hot feature.
• Testing for training convergence
• Testing for NaN values due to dividing by zero or manipulating small or large values.
• Testing that each component in the pipeline produces the expected artifacts.
• Testing integration between pipeline components.
Continuous Delivery
Regression: Examples
• For applications where the output is a real value, e.g., predicting a housing price or the price of a stock.
• Examples include
• predict weight from gender, height, age, …
• predict house price from locality, area, income, …
• predict Google stock price today from Google, Yahoo, MSFT prices yesterday
• predict each pixel intensity in the robot's current camera image from the previous image and
previous action
Visually Evaluating Correlation
[Figure: scatter plots showing similarity (correlation) values from –1 to 1]
Correlation measures the linear relationship between objects
Simple Linear Regression
Two Approaches
Two very different ways to train it:
• Using a direct "closed-form" equation that directly computes the model
parameters that best fit the model to the training set (i.e., the model
parameters that minimize the cost function over the training set).
• Using an iterative optimization approach (Gradient Descent) that gradually
tweaks the model parameters to minimize the cost function over the training set.
Notation: N = number of samples, D = dimension, i = index of the i-th sample
Least Squares Approach
Multiple Linear Regression
• Minimize MSE
• Errors are called Residuals
Least Squares Linear Regression
• For intuition, the figures that follow fix the intercept at θ0 = 0.
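The cost being minimized and its closed-form minimizer appeared as images in the original slides; a standard reconstruction in the notation above (N samples, parameter vector θ, design matrix X) is:

$$J(\theta) = \frac{1}{2N}\sum_{i=1}^{N}\bigl(\theta^{\top}x^{(i)} - y^{(i)}\bigr)^{2}, \qquad \hat{\theta} = (X^{\top}X)^{-1}X^{\top}y \ \ \text{(normal equation)}$$

(Conventions vary: some texts scale the cost by 1/N instead of 1/2N; the minimizer is unchanged.)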
Intuition Behind Cost Function
[Figure sequence: how the cost J varies as the model parameters change, and a gradient-descent step on the cost surface]
Gradient Descent
Gradient Descent is a very generic optimization algorithm capable of finding optimal
solutions to a wide range of problems.
The general idea of Gradient Descent is to tweak parameters iteratively in order to minimize
a cost function.
Concretely, you start by filling θ with random values (this
is called random initialization), and then you improve it
gradually, taking one baby step at a time, each step
attempting to decrease the cost function (e.g., the MSE),
until the algorithm converges to a minimum
Gradient Descent: Learning-Rate Hyperparameter
• If the learning rate is too small, the algorithm will have to go through many iterations to converge, which will take a long time.
• If the learning rate is too high, you might jump across the valley and end up on the other side, possibly even higher up than you were before. This might make the algorithm diverge, with larger and larger values, failing to find a good solution.
Gradient Descent- Pitfalls
• Finally, not all cost functions look like nice
regular bowls.
• There may be holes, ridges, plateaus, and
all sorts of irregular terrains, making
convergence to the minimum very difficult.
• Figure shows the two main challenges with
Gradient Descent:
1. if the random initialization starts the
algorithm on the left, then it will
converge to a local minimum,which is
not as good as the global minimum.
2. If it starts on the right, then it will
take a very long time to cross the
plateau, and if you stop too early you
will never reach the global minimum.
Intuition Behind Cost Function
• Example hypothesis: h(x) = 900 − 0.1x, i.e., (θ0, θ1) = (900, −0.1)
[Figure sequence: the fitted line and the corresponding point on the cost surface]
Basic Search Procedure
[Figure sequence: start with some parameter values and repeatedly move downhill to reduce J]
Gradient Descent for Linear Regression
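The update rule itself appeared as an image; the standard batch update for linear regression, assuming learning rate α and simultaneous updates of all θj, is:

$$\theta_j := \theta_j - \alpha\,\frac{1}{m}\sum_{i=1}^{m}\bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)\,x_j^{(i)}$$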
Gradient Descent: Numerical Example
[Figure sequence: successive parameter updates on a small worked example]
Gradient Descent
[Figure: contours of constant J; the iterates converge toward (θ0, θ1) = (900, −0.1)]
Running Gradient Descent
[Figure sequence: the hypothesis line and the contour plot as gradient descent iterates]
Choosing Step Size: Impact of Learning Rate
• For comparison, the closed-form (normal equation) solution costs O(d³) in the number of features d, which motivates gradient descent when d is large.
Extending to More Complex Model
Fitting a Polynomial Curve
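A minimal sketch of fitting a polynomial curve with a linear model, assuming scikit-learn; the synthetic 1-D data and degree-2 choice are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))                    # hypothetical 1-D inputs
y = 0.5 * X[:, 0] ** 2 + X[:, 0] + rng.normal(size=100)  # quadratic target + noise

model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X, y)                    # linear regression on the expanded features [x, x^2]
print(model.predict([[2.0]]))
```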
Thank You!
In our next session: Regularization
Linear Regression
Regularization
BITS Pilani, Pilani Campus (Dr. Bharathi R, CSE Department)
Session Content
Example: predicting house price from many features
• x1 = size of house
• x2 = no. of bedrooms
• x3 = no. of floors
• x4 = age of house
• x5 = average income in neighborhood
• x6 = kitchen size
• ⋮
• x100
[Figure: Price ($ in 1000's) vs. size in feet^2]
Addressing overfitting
• Regularization.
• Keep all the features, but reduce magnitude/values of parameters 𝜃𝑗 .
• Works well when we have a lot of features, each/many of which contributes a bit to predicting 𝑦.
Effect of Training Size on Overfitting
Size of training dataset needs to be large to prevent overfitting
• when higher order model is used.
Overfitting
Understanding Regularization
Underfitting
Ridge regression
Regularized Linear Regression
Ridge Regression
Further simplified
Lasso Regularization
Ridge Vs Lasso Regularization
[Figures: Ridge fits for λ = 0, 10, 100 (linear) and λ = 0, 10^-5, 1 (polynomial); Lasso fits for λ = 0, 0.1, 1 (linear) and λ = 0, 10^-7, 1 (polynomial). Larger λ gives flatter, more constrained models.]
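The cost functions referenced above appeared as images; in one standard form (conventions for the scaling of λ vary across texts):

$$J_{\text{ridge}}(\theta) = \mathrm{MSE}(\theta) + \lambda\sum_{j=1}^{n}\theta_j^{2} \qquad\qquad J_{\text{lasso}}(\theta) = \mathrm{MSE}(\theta) + \lambda\sum_{j=1}^{n}\lvert\theta_j\rvert$$

Note that the bias term θ0 is conventionally not regularized.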
Early Stopping
Do not overtrain; to prevent overfitting:
• Stop training once the error on the validation set starts showing an upward trend, even if the error
on the training set keeps decreasing
Thank You!
In our next session: Bias Variance Decomposition
Linear Regression
Bias Variance Decomposition
BITS Pilani, Pilani Campus (Dr. Bharathi R, CSE Department)
Session Content
• What is Classification?
• Linear Classifier
• Generative and Discriminatory Classifier
Classification
Definition
• Given a collection of records (training set )
• Each record is by characterized by a tuple (x,y), where x is the attribute (feature) set and y is the
class label
• x aka attribute, predictor, independent variable, input
• Y aka class, response, dependent variable, output
• Task
• Learn a model or function that maps each attribute set x into one of the predefined class labels y
[Figure: induction/deduction workflow. A learning algorithm fits a model on the Training Set of labeled records (Tid, Attrib1, Attrib2, Attrib3, Class); the model is then applied to a Test Set (e.g., records 11 and 15) whose class labels are unknown.]
Types of Classifiers
Linear Classifier
• Classes are separated by a linear decision surface (e.g., a straight line
in a 2-dimensional feature/attribute space)
• If, for a given record, the linear combination of the features xi satisfies
$$w_0 + \sum_i w_i x_i \ge 0,$$
it belongs to one class (say, y = 1); else it belongs to the other class (say,
y = 0 or −1)
• The wi are learned during the training (induction) phase of the classifier.
• The learned wi are applied to a test record during the deduction/inference phase.
[Figure: decision boundary in the (x1, x2) plane separating the y = 1 region from the y = 0 region]
[Figure: a discriminative classifier's decision boundary in feature space]
Thank You!
In our next session: Naïve Bayes Classifier
Classification Model I
Naïve Bayes Classifier
BITS Pilani, Pilani Campus (Dr. Bharathi R, CSE Department)
Bayes Classifier
A generative framework for solving classification problems
• Conditional probability:
$$P(Y \mid X) = \frac{P(X, Y)}{P(X)}, \qquad P(X \mid Y) = \frac{P(X, Y)}{P(Y)}$$
• Bayes theorem:
$$P(Y \mid X) = \frac{P(X \mid Y)\,P(Y)}{P(X)}$$
Using Bayes Theorem for Classification
Consider each attribute and class label as random variables
• Given a record with attributes (X1, X2, …, Xd)
• Goal is to predict class Y ("Evade")
given (X1, X2, X3) = (<Refund>, <Marital Status>, <Taxable Income>)
• Specifically, we want to find the value of Y that maximizes P(Y | X1, X2, …, Xd)
• Can we estimate P(Y | X1, X2, …, Xd) directly from data?

Tid  Refund  Marital Status  Taxable Income  Evade
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

(Refund and Marital Status are categorical, Taxable Income is continuous, Evade is the class.)
Using Bayes Theorem for Classification
Approach
• Compute the posterior probability P(Y | X1, X2, …, Xd) using the Bayes theorem
$$P(Y \mid X_1 X_2 \cdots X_d) = \frac{P(X_1 X_2 \cdots X_d \mid Y)\,P(Y)}{P(X_1 X_2 \cdots X_d)}$$
• Naïve Bayes assumes conditional independence of the attributes given the class: P(X1 X2 ⋯ Xd | Y) = Πi P(Xi | Y)
• Now we can estimate P(Xi | Yj) for all Xi and Yj combinations from the training data
• Discretization
• Partition the range into bins
• Replace continuous value with bin value
• Attribute changed from continuous to ordinal
• Probability density estimation
• Assume attribute follows a normal distribution
• Use data to estimate parameters of distribution
(e.g., mean and standard deviation)
• Once probability distribution is known, use it to estimate the conditional probability
P(Xi|Y)
Estimate Probabilities from Data
Normal distribution: one for each (Xi, Yj) pair
$$P(X_i \mid Y_j) = \frac{1}{\sqrt{2\pi\sigma_{ij}^2}}\;e^{-\frac{(X_i - \mu_{ij})^2}{2\sigma_{ij}^2}}$$
(The slide's worked example evaluates this density for Taxable Income = 120K with class mean 110K, hence the (120 − 110)² term.)
m-estimate for sparse counts:
$$P(X_i = c \mid y) = \frac{n_c + mp}{n + m}$$
where p is an initial estimate of P(Xi = c | y), known a priori, and m is a hyperparameter expressing our confidence in p.
Example of Naïve Bayes Classifier
A: attributes of the test record; M: mammals; N: non-mammals

Name           Give Birth  Can Fly  Live in Water  Have Legs  Class
human          yes         no       no             yes        mammals
python         no          no       no             no         non-mammals
salmon         no          no       yes            no         non-mammals
whale          yes         no       yes            no         mammals
frog           no          no       sometimes      yes        non-mammals
komodo         no          no       no             yes        non-mammals
bat            yes         yes      no             yes        mammals
pigeon         no          yes      no             yes        non-mammals
cat            yes         no       no             yes        mammals
leopard shark  yes         no       yes            no         non-mammals
turtle         no          no       sometimes      yes        non-mammals
penguin        no          no       sometimes      yes        non-mammals
porcupine      yes         no       no             yes        mammals
eel            no          no       yes            no         non-mammals
salamander     no          no       sometimes      yes        non-mammals
gila monster   no          no       no             yes        non-mammals
platypus       no          no       no             yes        mammals
owl            no          yes      no             yes        non-mammals
dolphin        yes         no       yes            no         mammals
eagle          no          yes      no             yes        non-mammals

For a test record A = (Give Birth = yes, Can Fly = no, Live in Water = yes, Have Legs = no):
$$P(A \mid M) = \frac{6}{7}\cdot\frac{6}{7}\cdot\frac{2}{7}\cdot\frac{2}{7} = 0.06 \qquad\quad P(A \mid N) = \frac{1}{13}\cdot\frac{10}{13}\cdot\frac{3}{13}\cdot\frac{4}{13} = 0.0042$$
$$P(A \mid M)\,P(M) = 0.06 \times \frac{7}{20} = 0.021 \qquad\quad P(A \mid N)\,P(N) = 0.0042 \times \frac{13}{20} = 0.0027$$
Since P(A | M) P(M) > P(A | N) P(N), predict mammals.
Baseline: Bag of Words Approach
Represent each document by a vector of word counts, e.g.:
aardvark 0, about 2, all 2, Africa 1, apple 0, anxious 0, …, gas 1, …, oil 1, …, Zaire 0
Text Classification: A Simple Example
• "close" doesn't appear in sentences with the sports tag, so P(close | sports) = 0, which makes
the whole product 0
Laplace smoothing
• Laplace smoothing: we add 1 (or, in general, a constant k) to every count so it's never zero.
• To balance this, we add the number of possible words to the divisor, so the result will never be
greater than 1.
• In our case, the 14 possible words are
{a, great, very, over, it, but, game, election, clean, close, the, was, forgettable, match}
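Written out, the smoothed estimate (with smoothing constant k, here k = 1, and vocabulary size |V| = 14) is:

$$P(w \mid c) = \frac{\mathrm{count}(w, c) + k}{\Bigl(\sum_{w' \in V}\mathrm{count}(w', c)\Bigr) + k\,\lvert V\rvert}$$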
Apply Laplace Smoothing
Experiment with NewsGroups
• Given 1000 training documents from each group, learn to classify new documents according to
which newsgroup they came from
Example: Character Recognition
• Given a test image X, calculate P(Y = yk | X) = P(yk) Πi P(Xi | yk) for all k
Estimating parameters: Y discrete, Xi continuous
Logistic regression
• Logistic Regression could help use predict, for example, whether the student passed or
failed. Logistic regression predictions are discrete (only specific values or categories are
allowed). We can also view probability scores underlying the model’s classifications.
• In comparison, Linear Regression could help us predict the student’s test score on a scale
of 0 - 100. Linear regression predictions are continuous (numbers in a range).
• Idea
• Naïve Bayes allows computing P(Y|X) by learning P(Y) and P(X|Y)
• Why not learn P(Y|X) directly?
Sigmoid/Logistic Function
Classification requires discrete output values
• For example, output y = 0 or 1 for a two-category classification problem
• In logistic regression, the sigmoid/logistic function hθ(x) takes a real vector x as input and outputs a
value between 0 and 1:
$$h_\theta(x) = g(\theta^{\top}x) = \frac{1}{1 + e^{-\theta^{\top}x}}$$
[Figure: the sigmoid curve g(z) as a function of z]
Logistic regression
$$h_\theta(x) = g(\theta^{\top}x), \qquad g(z) = \frac{1}{1 + e^{-z}}, \qquad z = \theta^{\top}x$$
• Predict "y = 1" if hθ(x) ≥ 0.5, i.e., z = θ⊤x ≥ 0
• Predict "y = 0" if hθ(x) < 0.5, i.e., z = θ⊤x < 0
Decision boundary
At the decision boundary, the output of the logistic regressor is 0.5
• hθ(x) = g(θ0 + θ1 x1 + θ2 x2)
• e.g., θ0 = −3, θ1 = 1, θ2 = 1
• Predict "y = 1" if −3 + x1 + x2 ≥ 0; the boundary is the line x1 + x2 = 3
[Figure: tumor size vs. age with a linear decision boundary]
Learning Model Parameters
• Training set: m examples, n features
• With the logistic hypothesis, the squared-error cost is "non-convex"; the log loss below is "convex":
$$\mathrm{Cost}\bigl(h_\theta(x), y\bigr) = -y\,\log h_\theta(x) - (1 - y)\,\log\bigl(1 - h_\theta(x)\bigr)$$
[Figure: cost vs. hθ(x) ∈ (0, 1) for y = 1 and for y = 0]
• J(θ) is convex
• Apply gradient descent on J(θ) w.r.t. θ to find optimal parameters
Gradient descent
$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\Bigl[\,y^{(i)}\log h_\theta\bigl(x^{(i)}\bigr) + \bigl(1 - y^{(i)}\bigr)\log\bigl(1 - h_\theta\bigl(x^{(i)}\bigr)\bigr)\Bigr]$$
• λ is “regularization” constant
• helps reduce overfitting
• keep weights nearer to zero
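A minimal NumPy sketch of logistic regression trained with batch gradient descent on the regularized log loss above; the synthetic data and hyperparameter values are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)   # synthetic labels
Xb = np.hstack([np.ones((100, 1)), X])      # prepend bias feature x0 = 1

theta, alpha, lam, m = np.zeros(3), 0.1, 0.01, len(y)
for _ in range(2000):
    h = sigmoid(Xb @ theta)
    grad = Xb.T @ (h - y) / m               # gradient of the log loss J(theta)
    grad[1:] += (lam / m) * theta[1:]       # regularize all weights except the bias
    theta -= alpha * grad
print(theta)
```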
Logistic regression more generally (multiclass)
• One set of weights per class, with separate expressions for P(Y = k | X) for k < R and for the reference class k = R (equations shown as images in the original slides)
Multi-class classification
[Figure: three classes in the (x1, x2) feature plane]
One-vs-all (one-vs-rest)
[Figure: three binary classifiers hθ(1)(x), hθ(2)(x), hθ(3)(x), each separating Class 1, Class 2, or Class 3 from the rest]
• hθ(i)(x) = P(y = i | x; θ) for i = 1, 2, 3 (Slide credit: Andrew Ng)
One-vs-all
• Train a logistic regression classifier hθ(i)(x) for each class i to predict the probability that y = i
• On a new input x, pick the class i that maximizes hθ(i)(x)
• Credit Card Fraud : Predicting if a given credit card transaction is fraud or not
• Health : Predicting if a given mass of tissue is benign or malignant
• Marketing : Predicting if a given user will buy an insurance product or not
• Banking : Predicting if a customer will default on a loan.
• Introduction
• Support Vectors
• Linear Support Vector Machine
• Maximizing Margin
• Handling non linearly separable data
• Non-linear Classification
• Kernel Functions
Support Vector Machines
Find a linear hyperplane (decision boundary) that will separate the data
[Figures: several candidate separating hyperplanes: one possible solution B1, another possible solution B2, and other possible solutions]
Find the hyperplane that maximizes the margin
• ⇒ B1 is better than B2
[Figure: B1 with margin boundaries b11 and b12; B2 with narrower margin boundaries b21 and b22]
Support Vector Machines
$$\mathbf{w}\cdot\mathbf{x} + b = 0 \ \ \text{(decision boundary B1)}, \qquad \mathbf{w}\cdot\mathbf{x} + b = \pm 1 \ \ \text{(margin hyperplanes } b_{11}, b_{12}\text{)}$$
$$f(\mathbf{x}) = \begin{cases} 1 & \text{if } \mathbf{w}\cdot\mathbf{x} + b \ge 1 \\ -1 & \text{if } \mathbf{w}\cdot\mathbf{x} + b \le -1 \end{cases} \qquad\qquad \text{Margin} = \frac{2}{\lVert\mathbf{w}\rVert}$$
Linear SVM
Linear model:
$$f(\mathbf{x}) = \begin{cases} 1 & \text{if } \mathbf{w}\cdot\mathbf{x} + b \ge 1 \\ -1 & \text{if } \mathbf{w}\cdot\mathbf{x} + b \le -1 \end{cases}$$
• Learning the model is equivalent to determining the values of w and b
• How to find w and b from the training data?
Learning Linear SVM
• Constraints: $y_i(\mathbf{w}\cdot\mathbf{x}_i + b) \ge 1,\ i = 1, 2, \ldots, N$, i.e.,
$$y_i = \begin{cases} 1 & \text{if } \mathbf{w}\cdot\mathbf{x}_i + b \ge 1 \\ -1 & \text{if } \mathbf{w}\cdot\mathbf{x}_i + b \le -1 \end{cases}$$
• Objective is to maximize the margin $\frac{2}{\lVert\mathbf{w}\rVert}$, which is equivalent to minimizing
$$L(\mathbf{w}) = \frac{\lVert\mathbf{w}\rVert^{2}}{2}$$
• This is a constrained optimization problem; solve it using the Lagrange multiplier method:
$$L(\mathbf{w}, b, \lambda_i) = \frac{1}{2}\lVert\mathbf{w}\rVert^{2} - \sum_i \lambda_i\bigl[y_i(\mathbf{w}^{\top}\mathbf{x}_i + b) - 1\bigr]$$
• The Lagrange multipliers λi are zero or positive
Example of Linear SVM
Support vectors are the training points with λi > 0:

x1      x2      y    λ
0.3858  0.4687   1   65.5261
0.4871  0.6110  -1   65.5261
0.9218  0.4103  -1   0
0.7382  0.8936  -1   0
0.1763  0.0579   1   0
0.4057  0.3529   1   0
0.9355  0.8132  -1   0
0.2146  0.0099   1   0
Learning Linear SVM
• How to classify using the SVM once w and b are found? Given a test record xi, predict by the sign of the decision function:
$$f(\mathbf{x}_i) = \begin{cases} 1 & \text{if } \mathbf{w}\cdot\mathbf{x}_i + b \ge 0 \\ -1 & \text{if } \mathbf{w}\cdot\mathbf{x}_i + b < 0 \end{cases}$$
Support Vector Machines
What if the problem is not linearly separable?
• Introduce slack variables ξi and minimize
$$L(\mathbf{w}) = \frac{\lVert\mathbf{w}\rVert^{2}}{2} + C\sum_{i=1}^{N}\xi_i^{k}$$
subject to
$$y_i = \begin{cases} 1 & \text{if } \mathbf{w}\cdot\mathbf{x}_i + b \ge 1 - \xi_i \\ -1 & \text{if } \mathbf{w}\cdot\mathbf{x}_i + b \le -1 + \xi_i \end{cases}$$
• If k is 1 or 2, this leads to a similar objective function as the linear SVM but with different constraints
Nonlinear Support Vector Machines
What if decision boundary is not linear?
Nonlinear Support Vector Machines
Transform the data into a higher dimensional space
Decision boundary:
$$\mathbf{w}\cdot\Phi(\mathbf{x}) + b = 0$$
Learning Nonlinear SVM
Optimization Problem
• which leads to the same set of equations (but involve (x) instead of x)
Learning NonLinear SVM
Issues
• What type of mapping function should be used?
• How to do the computation in high dimensional space?
• Most computations involve the dot product Φ(xi) · Φ(xj)
• Curse of dimensionality?
Learning Nonlinear SVM
• Kernel Trick:
$$\Phi(\mathbf{x}_i)\cdot\Phi(\mathbf{x}_j) = K(\mathbf{x}_i, \mathbf{x}_j)$$
• K(xi, xj) is a kernel function (expressed in terms of the coordinates in the original space)
• Examples are shown below.
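The examples were shown as images in the original deck; commonly cited kernel functions (the specific forms below are standard, not verbatim from the slides) include:

$$K(\mathbf{x}, \mathbf{y}) = (\mathbf{x}\cdot\mathbf{y} + 1)^{p} \qquad K(\mathbf{x}, \mathbf{y}) = e^{-\lVert\mathbf{x} - \mathbf{y}\rVert^{2}/(2\sigma^{2})} \qquad K(\mathbf{x}, \mathbf{y}) = \tanh(\kappa\,\mathbf{x}\cdot\mathbf{y} - \delta)$$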
Example of Nonlinear SVM
• Robust to noise
• Overfitting is handled by maximizing the margin of the decision boundary
• In some sense, the best linear model for classification.
• SVM can handle irrelevant and redundant data better than many other techniques
• The user needs to provide the type of kernel function and cost function
• Difficult to handle missing values
• What about categorical variables?
• Needs to be mapped to some metric space