ML Merged

Uploaded by

2022mt12068

What is Machine Learning?

BITS Pilani Dr. Bharathi R


CSE Department
Pilani Campus

SE ZG568/SS ZG568 , Applied Machine Learning


Lecture No.1- Module1
In this segment

• What is Machine Learning


• Why Machine Learning
• Connection with Other technical areas
What is Machine Learning?

Machine learning: “Field of study that gives computers the ability to
learn without being explicitly programmed”

---- Arthur Samuel (1959)


What is Machine Learning?

Input (X)            Output (Y)          Application
Email                Spam (0/1)          Spam filtering
English              Hindi               Machine translation
Audio                Text transcripts    Speech recognition
Medical images       Tumour (0/1)        Visual inspection
Image, radar info    Position of cars    Self-driving car
What is Machine Learning (ML)?

• The science (and art) of programming computers so they can learn from data

• More general definition


• Field of study that gives computers the ability to learn without being explicitly programmed

• Engineering-oriented definition
• Algorithms that improve their performance P at some task T with experience E

• A well-defined learning task is given by <P, T, E>

Traditional programming: Data + Program → Computer → Output
Machine learning: Data + Output → Computer → Program
Defining the Learning Tasks
Improve on task T, with respect to performance metric P, based on experience E
• Example 1
• T: Playing checkers
• P: Percentage of games won against an arbitrary opponent
• E: Playing practice games against itself
• Example 2
• T: Recognizing hand-written words
• P: Percentage of words correctly classified
• E: Database of human-labeled images of handwritten words
• Example 3
• T: Driving on four-lane highways using vision sensors
• P: Average distance traveled before a human-judged error
• E: A sequence of images and steering commands recorded while observing a human driver.
• Example 4
• T: Categorize email messages as spam or legitimate.
• P: Percentage of email messages correctly classified.
• E: Database of emails, some with human-given labels
Traditional Approach to Spam Filtering
Spam typically uses words or phrases such as “4U,” “credit card,” “free,” and “amazing”

• Solution
• Write a detection algorithm for patterns that appear frequently in spam
• Test and update the detection rules until they are good enough

• Challenge
• The detection algorithm is likely to become a long list of complex rules
• Such a list is hard to maintain
Machine Learning Approach
Automatically learns which words and phrases are good predictors of spam by detecting patterns
that are unusually frequent in spam compared to “ham” (legitimate email)

• The program is much shorter, easier to maintain, and most likely more accurate.
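As a rough illustration of the learning approach, the sketch below scores each word by how much more often it appears in spam than in ham. The training messages, the function names, and the add-one smoothing are assumptions made for this example and are not part of the lecture.

```python
from collections import Counter
import math

def train_spam_scores(spam_msgs, ham_msgs):
    """Return a log-odds score per word: positive means 'more typical of spam'."""
    spam_counts = Counter(w for m in spam_msgs for w in m.lower().split())
    ham_counts = Counter(w for m in ham_msgs for w in m.lower().split())
    vocab = set(spam_counts) | set(ham_counts)
    n_spam = sum(spam_counts.values())
    n_ham = sum(ham_counts.values())
    scores = {}
    for w in vocab:
        # Add-one smoothing so words seen in only one class do not give log(0).
        p_spam = (spam_counts[w] + 1) / (n_spam + len(vocab))
        p_ham = (ham_counts[w] + 1) / (n_ham + len(vocab))
        scores[w] = math.log(p_spam / p_ham)
    return scores

def is_spam(message, scores):
    """Flag a message when its words lean towards spam overall."""
    return sum(scores.get(w, 0.0) for w in message.lower().split()) > 0

# Made-up training messages, for illustration only.
spam = ["free credit card 4U", "amazing free offer"]
ham = ["meeting at noon", "lunch offer from the team"]
scores = train_spam_scores(spam, ham)
```

No rule list is written by hand here: retraining on new messages updates the scores automatically.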
A Classic example of ML Task
It is very hard to say what makes a “2”

[Figure: handwritten digit samples. Labels: all “2”s; a “3”; a “7”; one sample that is actually a “3”, not a “2”]
More ML Usage Scenarios
Tasks that are best solved by using a learning algorithm
• Recognizing Patterns in images, text
• Facial identities or facial expressions
• Handwritten or spoken words
• Medical images

• Recognizing Anomalies in structured dataset


• Unusual credit card transactions
• Unusual patterns of sensor readings in a nuclear power plant

• Prediction from time series data


• Future stock prices or currency exchange rates

• Generating new Patterns


• Generating images or motion sequences
Machine Learning Helps Human Learning
Sometimes reveals unsuspected correlations / new trends leading to better understanding
• Data Mining
• Applying ML techniques to dig into large amounts of data can help discover patterns
When to Use Machine Learning
ML Usage Scenarios
• Problems for which existing solutions require a lot of hand-tuning or long lists of rules
• Complex problems for which there is no good solution at all using a traditional approach
• Changing environments
• Get insights about complex problems and large amounts of data
Where does ML fit in?
Lecture No.2- Module1
In this segment

• Types of Applications
• Application Domains
• State of the Art Applications
Two Classes of Application

• Assign an object/event to (or predict for it) an element of a given set
• If the set is a finite set of categories (often encoded as integers), it is a classification problem
• If the set is the real numbers, it is a regression problem
• Predict a sequence of steps to achieve a goal
Application Domains

• Internet
• Computational biology
• Finance
• E-commerce
• Space exploration
• Robotics
• Information extraction
• Social networks
• Software engineering
• System management
• Creative Arts
Example: Classification
Assign object/event to one of a given finite set of categories
• Medical Diagnosis
• Credit card applications or transactions
• Fraud detection in e-commerce
• Worm detection in network packets
• Spam filtering in email
• Recommended articles in a newspaper
• Recommended books, movies, music, or jokes
• Financial investments
• DNA sequences
• Spoken words
• Handwritten letters
• Astronomical images
Example: Planning, Control, Problem Solving
Performing actions in an environment in order to achieve a goal
• Playing checkers, chess, or backgammon
• Balancing a pole
• Driving a car or a jeep
• Flying a plane, helicopter, or rocket
• Controlling an elevator
• Controlling a character in a video game
• Controlling a mobile robot
Breakthrough in Automatic Speech Recognition

ML is used to predict phone states from the sound spectrogram


Deep learning has state-of-the-art results
# Hidden Layers        1     2     4     8     10    12
Word Error Rate (%)    16.0  12.8  11.4  10.9  11.0  11.1

Baseline (Gaussian Mixture Model) performance = 15.4%


[Reference: Zeiler et al., “On rectified linear units for
speech recognition” ICASSP 2013]
Impact of Deep Learning in Speech Technology
Visual Question Answering
Answering questions about images
Autonomous Car Technology

[Figure labels: Path Planning; Laser Terrain Mapping; Learning from Human Drivers; Adaptive Vision; Sebastian; Stanley]
Contemporary ML Based Solutions

• Optical Character Recognition


• Product Image Classification on a production line
• Detecting tumors in brain scans
• Information extraction from document images
• Categorization of news articles
• Flagging offensive comments in discussion forums
• Document summarization
• Forecasting company revenue
• Chatbots / Voice activated App
• Detecting Credit Card Fraud
• Customer Segmentation
• Recommendation System for products, movies, news
Lecture No.3- Module1
In this segment

• Supervised Vs. Unsupervised Vs. Semi-Supervised Vs. Reinforcement


• Batch Vs. Online Learning
• Instance Based Vs. Model Based Learning
Types of Machine Learning Systems

Machine Learning systems come in many types, so it is useful to classify them into broad categories:
1. Based on training: whether or not they are trained with human supervision (supervised,
unsupervised, semi-supervised, and reinforcement learning)
2. Based on the stream of incoming data: whether or not they can learn incrementally on the fly
(online versus batch learning)
3. Based on how they generalize: whether they work by simply comparing new data points to known
data points, or instead detect patterns in the training data and build a predictive model,
much like scientists do (instance-based versus model-based learning)
ML Terminologies
1. Labels: A label is the thing we're predicting, the y variable in simple linear regression. The
label could be the future price of gold, the kind of animal shown in a picture, or the meaning of an
audio clip.
2. Features/attributes: A feature is an input variable, the x variable in simple linear
regression. A simple machine learning project might use a single feature, while a more
sophisticated machine learning project could use millions of features, specified as:
x1, x2, ..., xn
In the spam detector example, the features could include the following:
• words in the email text
• sender's address
• time of day the email was sent
Another example: to predict the price (label) of a used car, the features could be:
• mileage
• age
• brand
ML Terminologies
3. Examples/Instances
An example is a particular instance of data, x. (We put x in boldface to indicate that it is a
vector.) We break examples into two categories:
• labeled examples
• unlabeled examples
A labeled example includes both feature(s) and the label. That is:
labeled examples: {features, label}: (x, y)

An unlabeled example contains features but not the label. That is:
unlabeled examples: {features, ?}: (x, ?)
ML Terminologies
4. Models
A model defines the relationship between features and label. For example, a spam detection
model might associate certain features strongly with "spam". Let's highlight two phases of a
model's life:
1. Training means creating or learning the model. That is, you show the
model labeled examples and enable the model to gradually learn the
relationships between features and label.
2. Inference means applying the trained model to unlabeled examples.
That is, you use the trained model to make useful predictions (y'). For
example, during inference, you can predict HouseValue for new
unlabeled examples.
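The two phases can be sketched with a one-feature linear model fitted by ordinary least squares. The function names `train` and `infer`, the area feature, and the numbers are hypothetical choices for this illustration; the lecture does not prescribe a particular model.

```python
def train(examples):
    """Training: fit y' = w*x + b by least squares on labeled (x, y) examples."""
    n = len(examples)
    mean_x = sum(x for x, _ in examples) / n
    mean_y = sum(y for _, y in examples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in examples)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in examples)
    w = sxy / sxx
    b = mean_y - w * mean_x
    return w, b

def infer(model, x):
    """Inference: apply the trained model to an unlabeled example."""
    w, b = model
    return w * x + b

# Made-up labeled examples: (area, HouseValue).
labeled = [(50, 100.0), (80, 160.0), (100, 200.0)]
model = train(labeled)
prediction = infer(model, 120)  # y' for a new, unlabeled example
```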
Types of Learning
Based on level of supervision
• Supervised (inductive) learning
• Given: training data, desired outputs (labels)
• Unsupervised learning
• Given: training data only (without desired outputs)
• Semi-supervised learning
• Given: training data and a few desired outputs
• Reinforcement learning
• Given: rewards from sequence of actions
Types of Learning : Supervised Learning
Based on level of supervision
• Regression vs. classification
• A regression model predicts continuous values. For example, regression models make
predictions that answer questions like the following:
1. What is the value of a house in Bangalore?
2. What is the probability that a user will click on this ad?
• A classification model predicts discrete values. For example, classification models make
predictions that answer questions like the following:
1. Is a given email message spam or not spam?
2. Is this an image of a dog, a cat, or a hamster?
Supervised Learning: Regression
• Given (x1, y1), (x2, y2), ..., (xn, yn)
• Learn a function f (x) to predict y given x
– y is real-valued
[Figure: September Arctic Sea Ice Extent (1,000,000 sq km) versus Year, 1970 to 2020]
Supervised Learning: Classification
• Given (x1, y1), (x2, y2), ..., (xn, yn)
• Learn a function f (x) to predict y given x
– y is categorical

[Figure: Cancer (benign / malignant) plotted against Tumor Size (x); y = 1 is malignant, y = 0 is benign]


[Figure: the same Cancer (benign / malignant) versus Tumor Size plot with a threshold at x = T; predict benign below T and malignant above T]
Learnt classifier: if x > T, predict malignant; else predict benign
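One simple way to learn such a threshold is to try the midpoints between neighbouring tumor sizes and keep the one with the fewest training errors. This is only a sketch of the idea: the data values and function names are invented, and the lecture does not say how T is actually chosen.

```python
def learn_threshold(sizes, labels):
    """Pick the threshold T that misclassifies the fewest training examples."""
    candidates = sorted(set(sizes))
    best_t, best_errors = None, len(sizes) + 1
    # Try the midpoint between each pair of neighbouring sizes.
    for lo, hi in zip(candidates, candidates[1:]):
        t = (lo + hi) / 2
        errors = sum((x > t) != bool(y) for x, y in zip(sizes, labels))
        if errors < best_errors:
            best_t, best_errors = t, errors
    return best_t

def predict(x, t):
    """Return 1 (malignant) if x > t, else 0 (benign)."""
    return 1 if x > t else 0

# Made-up tumor sizes with labels (0 = benign, 1 = malignant).
sizes = [1.0, 1.5, 2.0, 3.5, 4.0, 5.0]
labels = [0, 0, 0, 1, 1, 1]
T = learn_threshold(sizes, labels)
```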
Increasing Feature Dimension

• x can be multi-dimensional
– Each dimension corresponds to an attribute
– e.g., Age, Tumor Size, Clump Thickness, Uniformity of Cell Size, Uniformity of Cell Shape
Example: Supervised Learning Techniques

• Linear Regression
• Logistic Regression
• Naïve Bayes Classifiers
• Support Vector Machines (SVMs)
• Decision Trees and Random Forests
• Neural networks
Unsupervised Learning

• Given x1, x2, ..., xn (without labels)


• Output hidden structure behind the x’s
• e.g., clustering
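As an illustration of finding hidden structure without labels, here is a minimal k-means sketch on one-dimensional data. The points, the starting centroids, and the fixed iteration count are assumptions made for the example.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Plain k-means on 1-D data, starting from the given centroids."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Made-up unlabeled data with two visible groups.
points = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
centroids, clusters = kmeans_1d(points, centroids=[0.0, 6.0])
```

The algorithm is never told which group a point belongs to; the grouping emerges from the distances alone.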
Example: Unsupervised Learning Techniques

• Clustering
• k-Means
• Hierarchical Cluster Analysis
• Expectation Maximization
• Visualization and dimensionality reduction
• Principal Component Analysis (PCA)
• Kernel PCA
• Locally-Linear Embedding (LLE)
• t-distributed Stochastic Neighbor Embedding (t-SNE)
• Association rule learning
• Apriori
• Eclat
Data Visualization
Visualize a 2D or 3D representation of complex unlabelled training data
• Preserve as much structure as possible
• e.g., trying to keep separate clusters in the input space from overlapping in the visualization
• Understand how the data is organized
• Identify unsuspected patterns.
[Figure: 2D visualization with clusters labelled frog, cat, bird, dog, truck, automobile, deer, horse, ship, airplane]
Association Rule Mining
Applications: Unsupervised Learning
Genomics application: group individuals by genetic similarity
[Figure: matrix of Genes versus Individuals]
Applications: Unsupervised Learning
• Organize computing clusters
• Social network analysis
• Market segmentation
• Astronomical data analysis

Semisupervised Learning
Partially labelled data – few labelled data and a lot of unlabelled data
• Combines unsupervised and supervised learning algorithms
• Example: a photo hosting service such as Google Photos
Semisupervised Learning Techniques

• Most semisupervised learning algorithms are combinations of unsupervised and supervised algorithms.

• For example, deep belief networks (DBNs) are based on unsupervised components called
restricted Boltzmann machines (RBMs) stacked on top of one another. RBMs are trained
sequentially in an unsupervised manner, and then the whole system is fine-tuned using
supervised learning techniques.
Reinforcement Learning
A learning agent
• observes the state of the environment
• selects and performs actions
• gets positive or negative rewards in return
• learns the best strategy, aka a policy, to get the most reward over time
• a policy is a mapping from states → actions

• Examples:
– Game playing, e.g., AlphaGo
– Robot in a maze
– Balance a pole on your hand
Types of Learning
Based on how training data is used
• Batch learning
• Uses all available data at once during training
• Mini-batch learning
• Uses a subset of the available data at a time during training
• Online (incremental) learning
• Uses a single training instance at a time during training
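The three regimes differ only in how many instances feed each update. The sketch below makes that explicit with a `batch_size` parameter for gradient descent on a one-parameter model; the model, learning rate, and data are assumptions for illustration. A real online learner would receive its instances as a stream rather than loop over a stored list.

```python
import random

def sgd_fit(data, batch_size, epochs=200, lr=0.05, seed=0):
    """Fit y = w*x by gradient descent on squared error.

    batch_size == len(data)      -> batch learning
    1 < batch_size < len(data)   -> mini-batch learning
    batch_size == 1              -> one instance per update (online-style)
    """
    rng = random.Random(seed)
    w = 0.0
    for _ in range(epochs):
        shuffled = data[:]
        rng.shuffle(shuffled)
        for start in range(0, len(shuffled), batch_size):
            batch = shuffled[start:start + batch_size]
            # Gradient of the mean squared error with respect to w on this batch.
            grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
            w -= lr * grad
    return w

# Made-up noiseless data generated from y = 3x.
data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0), (4.0, 12.0)]
w_batch = sgd_fit(data, batch_size=len(data))
w_mini = sgd_fit(data, batch_size=2)
w_online = sgd_fit(data, batch_size=1)
```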
Types of Learning
Based on how training data is used
• Instance Based Learning
• compare new data points to known data points
• Model Based learning
• detect patterns in the training data and build a predictive model

[Figure: instance-based learning versus model-based learning]


Lecture No.4- Module1
In this segment

• Training Data
• Insufficient
• Non representative
• Poor Quality
• Irrelevant attributes
• Model Selection
• Overfitting
• Underfitting
• Testing and Validation
• Hyperparameters
Insufficient Training Data
Consider the trade-off between algorithm development and training data capture
Non-representative Training Data
Training data must be representative of the new cases we want to generalize to
• A small sample size leads to sampling noise
• Missing data overemphasizes the role of wealth in happiness
• If the sampling process is flawed, even a large sample can suffer from sampling bias

[Figure legend: missing data; available data]
Data Quality
Cleaning often needed for improving data quality
• Some instances have missing features
• e.g., 5% of customers did not specify their age
• Ignore those instances altogether, or ignore the feature
• Fill in the missing values
• Train multiple ML models, e.g., one with the feature and one without it
• Some instances can be erroneous, noisy, or outliers
• Human or machine generated
• Identify, then discard or fix manually, as appropriate
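Two of the options for missing features can be sketched in a few lines. The customer records are invented, and using the median as the fill value is one common choice rather than something the lecture specifies.

```python
from statistics import median

# Made-up customer records; None marks a missing age.
customers = [
    {"id": 1, "age": 34}, {"id": 2, "age": None},
    {"id": 3, "age": 51}, {"id": 4, "age": 28},
]

# Option 1: ignore the instances that lack the feature.
dropped = [c for c in customers if c["age"] is not None]

# Option 2: fill in missing values with the median of the known ages.
known_median = median(c["age"] for c in customers if c["age"] is not None)
filled = [{**c, "age": c["age"] if c["age"] is not None else known_median}
          for c in customers]
```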
Irrelevant Features
Feature engineering needed for coming up with a good set of features
• Feature selection
• more useful features to train on among existing features.
• Feature extraction
• combine existing features to produce a more useful one.
• Create new features by gathering new data
Model Selection
Overfitting or Underfitting
• Overfitting: the model performs well on the training set but poorly on new data
• e.g., a high-degree polynomial life satisfaction model that strongly overfits the training data
• A small training set or sampling noise can lead the model to follow the noise rather than the underlying pattern in the dataset
• Solution: regularization
[Figure labels: large slope; smaller slope; high-order polynomial]

• Underfitting occurs when the model is too simple to learn the underlying structure in the data
• Select a more powerful model, with more parameters
• Feed better features to the learning algorithm
• Reduce regularization
Testing and Validation
Performance of ML algorithms is statistical / predictive
• Good ML algorithms need to work well on test data
• But test data is often not accessible to the provider of the algorithm
• Common assumption is training data is representative of test data
• Randomly chosen subset of the training data is held out as validation set
• aka dev set
• Once ML model is trained, its performance is evaluated on validation data
• Expectation is ML model working well on validation set will work well on unknown test data
• Typically 20-30% of the data is randomly held out as validation data
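A random hold-out split can be sketched as follows. The function name, the default 20% fraction, and the fixed seed are assumptions chosen for the example.

```python
import random

def train_validation_split(examples, validation_fraction=0.2, seed=42):
    """Shuffle a copy of the data and hold out a fraction for validation."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_val = int(round(len(shuffled) * validation_fraction))
    # The first n_val shuffled items are held out; the rest are used for training.
    return shuffled[n_val:], shuffled[:n_val]

train_set, val_set = train_validation_split(range(100))
```

Fixing the seed keeps the split reproducible, so the validation set stays the same across runs.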
Cross Validation
K-fold cross validation is often performed
• To reduce the bias of the validation set selection process
• Often K is chosen as 10
• aka 10-fold cross validation
• 10-fold cross validation involves
• randomly partitioning the data into 10 folds of roughly equal size
• training 10 models, each on 9 folds with the remaining fold held out as its validation set
• evaluating each model on its held-out fold
• averaging the performance over the 10 validation folds
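One common way to implement K-fold cross validation is to shuffle the instance indices once, cut them into K folds, and let each fold serve as the validation set exactly once. The sketch below assumes a caller-supplied `evaluate(train_idx, val_idx)` function that trains a model and returns a score; that function and all names here are hypothetical.

```python
import random

def k_fold_indices(n, k=10, seed=0):
    """Yield (train_idx, val_idx) pairs; each index is in exactly one validation fold."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        val = folds[i]
        train = [j for f_i, fold in enumerate(folds) if f_i != i for j in fold]
        yield train, val

def cross_validate(n, evaluate, k=10):
    """Average a user-supplied evaluate(train_idx, val_idx) score over the k folds."""
    scores = [evaluate(tr, va) for tr, va in k_fold_indices(n, k)]
    return sum(scores) / len(scores)
```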
Choice of Hyperparameters
Modern ML algorithms often have many settings that are chosen before training rather than learned from the data
• Known as hyperparameters
• Model performance depends on the choice of hyperparameters
• Each hyperparameter can assume a number of values
• Real numbers or categories
• The number of hyperparameter combinations grows exponentially with the number of hyperparameters
• The best model corresponds to the best cross-validation performance over the set of hyperparameter
combinations
• Expensive to perform
• Some empirical frameworks available for hyperparameter optimization
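An exhaustive grid search illustrates both the idea and the cost: it scores every combination of the listed values. In the sketch below, `toy_score` is a hypothetical stand-in for a cross-validation score, and the hyperparameter names and values are invented.

```python
from itertools import product

def grid_search(grid, score):
    """Try every combination in `grid`; return the best (params, score)."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        s = score(params)
        if s > best_score:
            best_params, best_score = params, s
    return best_params, best_score

def toy_score(p):
    # Hypothetical stand-in for a cross-validation score (higher is better).
    return -(p["learning_rate"] - 0.1) ** 2 - (p["depth"] - 3) ** 2

grid = {"learning_rate": [0.01, 0.1, 1.0], "depth": [1, 3, 5]}
best_params, best_score = grid_search(grid, toy_score)
```

With 2 hyperparameters of 3 values each there are 9 combinations; each added hyperparameter multiplies that count, which is why the search becomes expensive.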
Thank You!
In our next session: End to End ML
Lecture No.3- Module 2
In this segment
Data Preprocessing
• Data Quality
• Noise, Outlier, Missing attribute, Duplicate record
• Data Transformation
• Aggregation
• Sampling
• Discretization and Binarization
• Attribute Transformation
• Dimensionality Reduction
• Feature subset selection
• Feature creation
Data Quality

• Poor data quality negatively affects many data processing efforts


• For example, a classification model for detecting people who are loan risks is built using poor
data
• Some credit-worthy candidates are denied loans
• More loans are given to individuals that default
Data Quality …

• What kinds of data quality problems?


• How can we detect problems with the data?
• What can we do about these problems?

• Examples of data quality problems:


• Noise and outliers
• Wrong data
• Fake data
• Missing values
• Duplicate data
Noise

• For objects, noise is an extraneous object


• For attributes, noise refers to modification of original values
• Examples: distortion of a person’s voice when talking on a poor phone and “snow” on television screen
• The figures below show two sine waves of the same magnitude and different frequencies, the waves
combined, and the two sine waves with random noise
• The magnitude and shape of the original signal is distorted
Outliers

• Outliers are data objects with characteristics that are considerably different than most of the
other data objects in the data set
• Case 1: Outliers are noise that interferes
with data analysis

• Case 2: Outliers are the goal of our analysis


• Credit card fraud
• Intrusion detection
Missing Values

• Reasons for missing values


• Information is not collected
(e.g., people decline to give their age and weight)
• Attributes may not be applicable to all cases
(e.g., annual income is not applicable to children)
• Handling missing values
• Eliminate data objects or variables
• Estimate missing values
• Example: time series of temperature
• Example: census results
• Ignore the missing value during analysis
Duplicate Data

• Data set may include data objects that are duplicates, or almost duplicates of one another
• Major issue when merging data from heterogeneous sources

• Examples:
• Same person with multiple email addresses

• Data cleaning
• Process of dealing with duplicate data issues
Data Preprocessing

• Aggregation
• Sampling
• Discretization and Binarization
• Attribute Transformation
• Dimensionality Reduction
• Feature subset selection
• Feature creation
Aggregation
Combining two or more attributes (or objects) into a single attribute (or object)
• Purpose
• Data reduction
• Change of scale
• More “stable” data

[Figure: two panels, Standard Deviation of Average Monthly Precipitation and Standard Deviation of Average Yearly Precipitation]
Sampling
Primary technique used for data reduction
• It is often used for both the preliminary investigation of the data and the final data analysis.
• Statisticians often sample because obtaining the entire set of data of interest is too expensive or
time consuming.
• Sampling is typically used in machine learning because processing the entire set of data of
interest may be too expensive or time consuming.
• Key Principle
• Using a sample will work almost as well as using the entire data set, if the sample is representative
• A sample is representative if it has approximately the same properties as the original dataset

[Figure: the same data set shown with 8000 points, 2000 points, and 500 points]


Sampling …

• The key principle for effective sampling is the following:


• Using a sample will work almost as well as using the entire data set, if the sample is representative
• A sample is representative if it has approximately the same properties (of interest) as the original set of
data
• Simple Random Sampling
• There is an equal probability of selecting any particular item
• Sampling without replacement
• As each item is selected, it is removed from the population
• Sampling with replacement
• Objects are not removed from the population as they are selected for the sample.
• In sampling with replacement, the same object can be picked up more than once

• Stratified sampling
• Split the data into several partitions; then draw random samples from each partition
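The sampling schemes above can be sketched with Python's standard `random` module. The population, the grouping key, and the sample sizes are assumptions made for illustration.

```python
import random
from collections import defaultdict

rng = random.Random(0)
population = list(range(100))

# Sampling without replacement: a selected item cannot be picked again.
without_replacement = rng.sample(population, 10)

# Sampling with replacement: the same item can be picked more than once.
with_replacement = rng.choices(population, k=10)

def stratified_sample(items, key, per_group, rng):
    """Partition items by key(item), then draw a random sample from each partition."""
    groups = defaultdict(list)
    for item in items:
        groups[key(item)].append(item)
    sample = []
    for members in groups.values():
        sample.extend(rng.sample(members, min(per_group, len(members))))
    return sample

# Four partitions (value modulo 4), three items drawn from each.
strat = stratified_sample(population, key=lambda v: v % 4, per_group=3, rng=rng)
```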
Discretization
Process of converting a continuous attribute into an ordinal attribute
• A potentially infinite number of values are mapped into a small number of categories
• Discretization is used in both unsupervised and supervised settings

[Figure: data consisting of four groups of points and two outliers. The data is one-dimensional, but a random y component is added to reduce overlap.]
Unsupervised Discretization

• Equal interval width approach
• Equal frequency approach
• K-means approach
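The first two approaches can be sketched as below; the function names and sample values are invented for illustration. The sketch assumes the values are not all identical, since equal-width binning would otherwise divide by zero.

```python
def equal_width_bins(values, k):
    """Assign each value to one of k intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    # The maximum value would land in bin k, so clamp it into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]

def equal_frequency_bins(values, k):
    """Assign bins so that each holds roughly the same number of values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * k // len(values)
    return bins

values = [1, 2, 3, 4, 5, 6, 7, 8, 100]  # made-up data with one outlier
width_bins = equal_width_bins(values, 3)
freq_bins = equal_frequency_bins(values, 3)
```

With an outlier present, equal-width binning puts almost everything in the first bin, while equal-frequency binning still spreads the values evenly.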


Discretization in Supervised Settings

• Many classification algorithms work best if both the independent and dependent variables have
only a few values
• Example: Iris Plant data set.
• http://www.ics.uci.edu/~mlearn/MLRepository.html
• Three flower types (classes):
• Setosa
• Versicolour
• Virginica
• Four (non-class) attributes
• Sepal width and length
• Petal width and length
Supervised Discretization Example …
How can we tell what the best discretization is?
• Supervised discretization: Use class labels to find breaks
[Figure: histogram of Counts (0 to 50) versus Petal Length (0 to 8)]

• Petal width low or petal length low implies Setosa.


• Petal width medium or petal length medium implies Versicolour.
• Petal width high or petal length high implies Virginica.
Binarization
Maps a continuous or categorical attribute into one or more binary variables
• Often convert a continuous attribute to a categorical attribute and then convert a categorical
attribute to a set of binary attributes
• Examples: eye color and height measured as {low, medium, high}
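A minimal sketch of the two-step conversion for height: first discretize the continuous value into {low, medium, high}, then map each category to its own binary attribute. The cut-offs of 160 cm and 180 cm are hypothetical and chosen only for illustration.

```python
def one_hot(values):
    """Map a categorical attribute to one binary attribute per category."""
    categories = sorted(set(values))
    return categories, [[1 if v == c else 0 for c in categories] for v in values]

def discretize_height(cm):
    # Hypothetical cut-offs chosen only for illustration.
    if cm < 160:
        return "low"
    if cm < 180:
        return "medium"
    return "high"

heights_cm = [150, 172, 190]
levels = [discretize_height(h) for h in heights_cm]
categories, encoded = one_hot(levels)
```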
Attribute Transformation
Maps the entire set of values of a given attribute to a new set of replacement values
• Each original value can be identified with one of the new values
• Simple functions: x^k, log(x), e^x, |x|
• Normalization
• Refers to various techniques to adjust to differences among attributes in terms of frequency of occurrence,
mean, variance, range
• Take out unwanted, common signal, e.g., seasonality
• In statistics, standardization refers to subtracting off the means and dividing by the standard deviation
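Standardization as described in the last bullet can be sketched as follows. Using the population standard deviation (`pstdev`) rather than the sample version is an assumption of this example, as are the data values.

```python
from statistics import mean, pstdev

def standardize(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

# Made-up attribute values with mean 5 and standard deviation 2.
z = standardize([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
```

After the transformation the attribute has mean 0 and standard deviation 1, so attributes measured on different scales become comparable.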
Curse of Dimensionality

• When dimensionality increases, data becomes increasingly sparse in the space that it occupies
• Definitions of density and distance between points, which are critical for clustering and outlier detection, become less meaningful
• Experiment: randomly generate 500 points
• Compute the difference between the max and min distance between any pair of points
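The experiment can be approximated with the sketch below. Two choices here are assumptions rather than part of the slide: the difference is divided by the minimum distance to make dimensions comparable, and 100 points are used instead of 500 to keep the pairwise computation fast.

```python
import math
import random

def relative_contrast(dim, n_points=100, seed=0):
    """Return (max_dist - min_dist) / min_dist over all pairs of random points."""
    rng = random.Random(seed)
    # Points are drawn uniformly from the unit hypercube in `dim` dimensions.
    pts = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    dists = [math.dist(p, q) for i, p in enumerate(pts) for q in pts[i + 1:]]
    return (max(dists) - min(dists)) / min(dists)

contrast_low = relative_contrast(2)    # 2 dimensions
contrast_high = relative_contrast(50)  # 50 dimensions
```

The contrast shrinks as the dimension grows: the nearest and farthest pairs become nearly equidistant, which is why distance-based methods lose discriminating power.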
Dimensionality Reduction
Purpose
• Avoid curse of dimensionality
• Reduce amount of time and memory required by ML algorithms
• Allow data to be more easily visualized
• May help to eliminate irrelevant features or reduce noise
• Popular Techniques
• Principal Components Analysis (PCA)
• Singular Value Decomposition
• Find a projection that captures the largest amount of variation in data
[Figure: data plotted on axes x1 and x2 with a projection direction labelled e]
Dimensionality Reduction: PCA
Increasing the number of components improves the quality of reconstruction
Feature Subset Selection
Another way to reduce dimensionality of data
• Redundant features
• Duplicate much or all of the information contained in one or more other attributes
• Example: purchase price of a product and the amount of sales tax paid
• Irrelevant features
• Contain no information that is useful for the ML task at hand
• Example: students' ID is often irrelevant to the task of predicting students' GPA
• Many techniques developed, especially for classification
Feature Creation
Create new attributes that can capture the important information in a data set
• More efficiently than the original attributes
• Three general methodologies
• Feature extraction
• Example: extracting edges from images
• Feature construction
• Example: dividing mass by volume to get density
• Mapping data to new space
• Example: Fourier and wavelet analysis
Mapping Data to a New Space
[Figure: Fourier and wavelet transform of two sine waves plus noise, shown against frequency]
Lecture No.1- Module 2
In this Segment
Frame the Machine Learning Problem
• Business Context Understanding
• Model Selection
• Performance Metric Selection
• Check the assumptions
Key Elements of a Machine Learning Project

• Framing a machine learning problem


• Data types and representation
• Data Pre processing
• Data Visualization and Analysis
• Feature Engineering
• Model Building and testing
An Example Problem Statement
Start with the Big Picture

• Build a model of housing prices using the census data.


• Data attributes <population, median income, median housing price, ….> and so on for each district

• Districts are the smallest geographical unit (population ~600 – 3000)

• The model to predict the median housing price in any district, given all the other metrics.

• Goodness of the model is determined by how close the model output is w.r.t. actual price for
unseen district data
Framing the Problem
Understand the Business Objective and Context

• What is the expected usage and benefit?


• impacts the choice of algorithms, goodness measure, and effort in lifecycle management of the model

• What is the baseline method and its performance?


Choice of Model
Choose between Different techniques

• Supervised or unsupervised? Prediction or classification? Online vs. batch? Instance-based or
model-based?

• Analyze the dataset
• Each instance comes with the expected output, i.e., the district’s median housing price
• → supervised
• The goal is to predict a real-valued price based on multiple variables like population, income, etc.
• → regression
• The output is based on input data at rest, not rapidly changing data
• The dataset is small enough to fit in memory
• → batch
• So, it’s a supervised multivariate batch regression problem
Example- California Housing Prices Dataset
longitude  latitude  housing_median_age  total_rooms  total_bedrooms  population  households  median_income  ocean_proximity  median_house_value
-122.23    37.88     41                  880          129             322         126         8.3252         NEAR BAY         452600
-122.22    37.86     21                  7099         1106            2401        1138        8.3014         NEAR BAY         358500
-122.24    37.85     52                  1467         190             496         177         7.2574         NEAR BAY         352100
-122.25    37.85     52                  1274         235             558         219         5.6431         NEAR BAY         341300
-122.25    37.85     52                  1627         280             565         259         3.8462         NEAR BAY         342200
-122.25    37.85     52                  919          213             413         193         4.0368         NEAR BAY         269700
-122.25    37.84     52                  2535         489             1094        514         3.6591         NEAR BAY         299200
-122.25    37.84     52                  3104         687             1157        647         3.12           NEAR BAY         241400
-122.26    37.84     42                  2555         665             1206        595         2.0804         NEAR BAY         226700
Example- page 40
Choice of Performance Metrics
A typical Choice for Regression: Root Mean Square Error (RMSE)

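• h is the model, X is the training dataset, m is the number of instances, x^(i) is the i-th instance, and y^(i) is the
actual price for the i-th instance.

Mean Absolute Error (MAE)

• MAE is preferred when the data contains many outliers, because it is less sensitive to them than RMSE. MAE corresponds to the L1 norm (Manhattan norm), while RMSE corresponds to the L2 norm (Euclidean norm).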
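Using the notation above, RMSE is the square root of the mean of the squared differences between h(x^(i)) and y^(i) over the m instances, and MAE is the mean of their absolute differences. The sketch below computes both on invented numbers to show how a single large error affects each metric.

```python
import math

def rmse(predictions, targets):
    """Root mean square error: square root of the mean squared difference."""
    m = len(targets)
    return math.sqrt(sum((p - y) ** 2 for p, y in zip(predictions, targets)) / m)

def mae(predictions, targets):
    """Mean absolute error: mean of the absolute differences."""
    m = len(targets)
    return sum(abs(p - y) for p, y in zip(predictions, targets)) / m

# Made-up prices; the last prediction is far off (an outlier error).
targets = [100.0, 200.0, 300.0, 400.0]
predictions = [110.0, 190.0, 310.0, 300.0]
rmse_value = rmse(predictions, targets)
mae_value = mae(predictions, targets)
```

Because RMSE squares each error before averaging, the single error of 100 dominates it far more than it dominates MAE.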
General Form
Stepping Back
Check the Assumptions

• Verify that the downstream module actually uses real-valued prices rather than post-processing
them into categories (e.g., “cheap,” “medium,” or “expensive”)

• If it uses categories, the problem should have been framed as a classification task, not a regression task.

• Such potential hazards need to be checked early in the design rather than finding out from
deployed systems
Lecture No.2- Module2
In this segment
Data Types and Representations
• What is Data?
• Types and properties of attributes of data
• Categorization of attributes
• Characteristics of Data
• Types of Data
What is Data?
Collection of data objects and their attributes
• An attribute is a property or characteristic of an object
• Examples: eye color of a person, temperature, etc.
• aka variable, field, characteristic, dimension, or feature
• A collection of attributes describes an object
• aka record, point, case, sample, entity, or instance
• Attribute values are numbers or symbols assigned to an attribute for a particular object
• Distinction between attributes and attribute values
• The same attribute can be mapped to different attribute values
• Example: height can be measured in feet or meters
• Different attributes can be mapped to the same set of values
• Example: attribute values for ID and age are integers
• But the properties of an attribute can differ from the properties of the values used to represent it

Attributes (columns) and objects (rows):
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Discrete and Continuous Attributes

• Discrete Attribute
• Has only a finite or countably infinite set of values
• Examples: zip codes, counts, or the set of words in a collection of documents
• Often represented as integer variables.
• Note: binary attributes are a special case of discrete attributes
• Continuous Attribute
• Has real numbers as attribute values
• Examples: temperature, height, or weight.
• Practically, real values can only be measured and represented using a finite number of digits.
• Continuous attributes are typically represented as floating-point variables.
Types and properties of Attributes
Types
• Nominal
• ID numbers, eye color, zip codes
• Ordinal
• rankings (e.g., taste of potato chips on a scale from 1-10), grades, height {tall, medium, short}
• Interval
• calendar dates, temperatures in Celsius or Fahrenheit.
• Ratio
• temperature in Kelvin, length, counts, elapsed time (e.g., time to run a race)
Properties
• Distinctness: =, ≠
• Order: <, >
• Differences are meaningful: +, -
• Ratios are meaningful: *, /
Difference Between Ratio and Interval

• Is it physically meaningful to say that a temperature of 10 ° is twice that of 5° on


• the Celsius scale?
• the Fahrenheit scale?
• the Kelvin scale?
• Consider measuring the height above average
• If Bill’s height is three inches above average and Bob’s height is six inches above average, then would
we say that Bob is twice as tall as Bill?
• Is this situation analogous to that of temperature?
Important Characteristics of Data
• Dimensionality (number of attributes)
• High dimensional data brings a number of challenges
• Sparsity
• Only presence counts
• Resolution
• Patterns depend on the scale
• Size
• Type of analysis may depend on size of data
Types of data sets

• Record
• Data Matrix
• Document Data
• Transaction Data
• Graph
• World Wide Web
• Molecular Structures
• Ordered
• Spatial Data
• Temporal Data
• Sequential Data
• Genetic Sequence Data
Record Data

• Data that consists of a collection of records, each of which consists of a fixed set of attributes

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Data Matrix

• If data objects have the same fixed set of numeric attributes, then the data objects can be
thought of as points in a multi-dimensional space, where each dimension represents a distinct
attribute
• Such a data set can be represented by an m by n matrix, where there are m rows, one for each
object, and n columns, one for each attribute

Projection of x Load  Projection of y load  Distance  Load  Thickness
10.23                 5.27                  15.22     2.7   1.2
12.65                 6.25                  16.22     2.2   1.1
Document Data

• Each document becomes a ‘term’ vector


• Each term is a component (attribute) of the vector
• The value of each component is the number of times the corresponding term occurs in the document.

            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1  3     0      5     0     2      6     0    2     0        2
Document 2  0     7      0     2     1      0     0    3     0        0
Document 3  0     1      0     0     1      2     2    0     3        0
Transaction Data

• A special type of data, where


• Each transaction involves a set of items.
• For example, consider a grocery store. The set of products purchased by a customer during one
shopping trip constitute a transaction, while the individual products that were purchased are the items.
• Can represent transaction data as record data

TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
Graph Data

Example: Benzene Molecule: C6H6


Ordered Data
Sequence of transactions

Items/Events

An element of
the sequence
Ordered Data
Genomic sequence data

GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG
Ordered Data

SpatioTemporal Data
Average Monthly Temperature of land and ocean
Lecture No.4- Module 2
In this segment
Data Analysis
• Summary Statistics
• Categorical
• Numerical
• Distance Measures
• Euclidean
• Minkowski
• Mahalanobis
• Cosine
• Correlation
Summary Statistics of Data
Numbers that summarize properties of the data
• Summarized properties include frequency, location and spread
• Examples: location – mean, spread - standard deviation
• Most summary statistics can be calculated in a single pass through the data
Frequency and Mode
Typically used with categorical data
• Frequency of an attribute value is the percentage of time the value occurs in the data set
• The mode of an attribute is the most frequent attribute value
• The notions of frequency and mode are typically used with categorical data
Percentiles
Typically used for continuous data
• For continuous data, the notion of a percentile is more useful.
• Given an ordinal or continuous attribute x and a number p between 0 and 100, the pth percentile is a value x_p of x such that p% of the observed values of x are less than x_p.
• For example, the 50th percentile is the value x_50% such that 50% of all values of x are less than x_50%.
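A minimal sketch of frequency, mode, and percentile, assuming NumPy is available and reusing the Marital Status and Taxable Income columns of the record-data table shown earlier. The variable names are illustrative only.

```python
from collections import Counter

import numpy as np

# Columns copied from the 10-row record-data example (income in thousands).
marital = ["Single", "Married", "Single", "Married", "Divorced",
           "Married", "Divorced", "Single", "Married", "Single"]
income_k = np.array([125, 100, 70, 120, 95, 60, 220, 85, 75, 90])

# Frequency: fraction of objects that take each categorical value.
counts = Counter(marital)
frequency = {value: n / len(marital) for value, n in counts.items()}

# Mode: the most frequent value(s); ties are all reported.
top = max(counts.values())
modes = sorted(value for value, n in counts.items() if n == top)

# 50th percentile (median) of a continuous attribute.
median_income = float(np.percentile(income_k, 50))

print(frequency, modes, median_income)
```

Note that this categorical attribute has two modes, which is one reason the mode alone can be a weak summary.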
Measures of Location: Mean and Median

• The mean is the most common measure of the location of a set of points.
• However, the mean is very sensitive to outliers.
• Thus, the median or a trimmed mean is also commonly used.

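$\mathrm{mean}(x) = \bar{x} = \frac{1}{m} \sum_{i=1}^{m} x_i$

• Assuming sorted {xi}, i = 1, ..., m:

$\mathrm{median}(x) = x_{(r+1)}$ if m is odd (m = 2r + 1), and $\frac{1}{2}\left(x_{(r)} + x_{(r+1)}\right)$ if m is even (m = 2r)
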
Measures of Spread: Range and Variance

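• Range is the difference between the max and min

$\mathrm{range}(x) = \max(x) - \min(x)$

• The variance or standard deviation s_x is the most common measure of the spread of a set of points.

$s_x^2 = \frac{1}{m-1} \sum_{i=1}^{m} (x_i - \bar{x})^2$

• Because of outliers, other measures are often used. Average Absolute Distance (AAD) is given by

$\mathrm{AAD}(x) = \frac{1}{m} \sum_{i=1}^{m} |x_i - \bar{x}|$

where $\bar{x}$ is the mean of the m observed values.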
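The location and spread measures above can be sketched as follows, assuming NumPy and reusing the Taxable Income column (in thousands) from the record-data table. The trimming rule (drop the single smallest and largest value) is an illustrative assumption, not something the slides specify.

```python
import numpy as np

income_k = np.array([125, 100, 70, 120, 95, 60, 220, 85, 75, 90], dtype=float)

mean = income_k.mean()
median = float(np.median(income_k))

# Trimmed mean: drop the smallest and largest value, then average the rest.
trimmed_mean = np.sort(income_k)[1:-1].mean()

value_range = income_k.max() - income_k.min()
variance = income_k.var(ddof=1)           # sample variance, divides by m - 1
std_dev = income_k.std(ddof=1)
aad = np.abs(income_k - mean).mean()      # average absolute distance from the mean

print(mean, median, trimmed_mean, value_range, variance, std_dev, aad)
```

The single 220K value pulls the mean (104) well above the median (92.5), while the trimmed mean (95) stays close to the median.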
Similarity and Dissimilarity Measures

• Similarity measure
• Numerical measure of how alike two data objects are.
• Is higher when objects are more alike.
• Often falls in the range [0,1]
• Dissimilarity measure
• Numerical measure of how different two data objects are
• Lower when objects are more alike
• Minimum dissimilarity is often 0
• Upper limit varies
• Proximity refers to a similarity or dissimilarity
Similarity/Dissimilarity for Simple Attributes

The following table shows the similarity and dissimilarity between two objects, x and
y, with respect to a single, simple attribute.
Euclidean Distance

• Euclidean Distance

$d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{k=1}^{n} (x_k - y_k)^2}$

where n is the number of dimensions (attributes) and x_k and y_k are, respectively, the kth attributes (components) of data objects x and y.

• Standardization is necessary, if scales differ.


Euclidean Distance

point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

      p1     p2     p3     p4
p1    0      2.828  3.162  5.099
p2    2.828  0      1.414  3.162
p3    3.162  1.414  0      2
p4    5.099  3.162  2      0
Distance Matrix
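The distance matrix can be reproduced with a few lines of NumPy. This is a sketch using broadcasting; the array names are illustrative.

```python
import numpy as np

# Points p1..p4 from the slide, one row per point.
points = np.array([[0, 2], [2, 0], [3, 1], [5, 1]], dtype=float)

# Pairwise differences via broadcasting: shape (4, 4, 2).
diff = points[:, None, :] - points[None, :, :]

# Euclidean distance: square root of the summed squared differences.
dist = np.sqrt((diff ** 2).sum(axis=-1))

print(np.round(dist, 3))
```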
Minkowski Distance

• Minkowski Distance is a generalization of Euclidean Distance

$d(\mathbf{x}, \mathbf{y}) = \left( \sum_{k=1}^{n} |x_k - y_k|^{r} \right)^{1/r}$

where r is a parameter, n is the number of dimensions (attributes) and x_k and y_k are, respectively, the kth attributes (components) of data objects x and y.
Minkowski Distance: Examples

• r = 1. City block (Manhattan, taxicab, L1 norm) distance.


• A common example of this for binary vectors is the Hamming distance, which is just the
number of bits that are different between two binary vectors

• r = 2. Euclidean distance

• r → ∞. “supremum” (L_max norm, L_∞ norm) distance.


• This is the maximum difference between any component of the vectors

• Do not confuse r with n, i.e., all these distances are defined for all numbers of dimensions.
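A short sketch of the three cases, assuming NumPy. The helper `minkowski` is a hypothetical name written for this example, applied to points p1 = (0, 2) and p2 = (2, 0) from the Euclidean distance slide.

```python
import numpy as np

def minkowski(x, y, r):
    """Minkowski distance of order r; r = np.inf gives the supremum distance."""
    d = np.abs(np.asarray(x, dtype=float) - np.asarray(y, dtype=float))
    if np.isinf(r):
        return float(d.max())
    return float((d ** r).sum() ** (1.0 / r))

p1, p2 = (0, 2), (2, 0)
l1 = minkowski(p1, p2, 1)          # city block
l2 = minkowski(p1, p2, 2)          # Euclidean
l_inf = minkowski(p1, p2, np.inf)  # supremum

# Hamming distance as the r = 1 case on binary vectors.
hamming = minkowski([1, 0, 1, 1], [0, 0, 1, 0], 1)

print(l1, round(l2, 3), l_inf, hamming)
```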
Mahalanobis Distance

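$\mathrm{mahalanobis}(\mathbf{x}, \mathbf{y}) = \left( (\mathbf{x} - \mathbf{y})^{T} \Sigma^{-1} (\mathbf{x} - \mathbf{y}) \right)^{0.5}$

$\Sigma$ is the covariance matrix; for two attributes x and y observed on N objects:

$\Sigma = \frac{1}{N} \begin{pmatrix} \sum_i (x_i - m_x)^2 & \sum_i (x_i - m_x)(y_i - m_y) \\ \sum_i (x_i - m_x)(y_i - m_y) & \sum_i (y_i - m_y)^2 \end{pmatrix}$

$m_x = \frac{1}{N} \sum_i x_i, \qquad m_y = \frac{1}{N} \sum_i y_i$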

• For red points, the Euclidean distance is 14.7, Mahalanobis distance is 6.


Mahalanobis Distance
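Covariance Matrix:

$\Sigma = \begin{pmatrix} 0.3 & 0.2 \\ 0.2 & 0.3 \end{pmatrix}$

A: (0.5, 0.5)
B: (0, 1)
C: (1.5, 1.5)

Mahal(A,B) = 5
Mahal(A,C) = 4

These values are the quadratic form (x − y)^T Σ^{-1} (x − y), i.e., squared distances; taking the square root gives √5 ≈ 2.236 for (A,B) and 2 for (A,C). Either way, C is closer to A than B is, even though its Euclidean distance from A is larger (1.414 versus 0.707).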
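The example can be checked numerically. The sketch below assumes NumPy; `mahalanobis_sq` is a hypothetical helper that returns the quadratic form without the square root, which is the quantity that evaluates to 5 and 4 for these points.

```python
import numpy as np

cov = np.array([[0.3, 0.2],
                [0.2, 0.3]])
cov_inv = np.linalg.inv(cov)

A = np.array([0.5, 0.5])
B = np.array([0.0, 1.0])
C = np.array([1.5, 1.5])

def mahalanobis_sq(x, y):
    """Quadratic form (x - y)^T Sigma^{-1} (x - y), i.e. the squared distance."""
    d = x - y
    return float(d @ cov_inv @ d)

sq_ab = mahalanobis_sq(A, B)
sq_ac = mahalanobis_sq(A, C)

# Square roots give the distance in its usual form.
print(sq_ab, sq_ac, np.sqrt(sq_ab), np.sqrt(sq_ac))
```

C lies along the direction of high covariance, so it is "closer" to A in this metric than B is.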
Common Properties of a Distance

• Distances, such as the Euclidean distance, have some well known properties.
1. d(x, y) ≥ 0 for all x and y and d(x, y) = 0 if and only if x = y.
2. d(x, y) = d(y, x) for all x and y. (Symmetry)
3. d(x, z) ≤ d(x, y) + d(y, z) for all points x, y, and z. (Triangle Inequality)

where d(x, y) is the distance (dissimilarity) between points (data objects), x and y.

• A distance that satisfies these properties is a metric


Cosine Similarity

• If d1 and d2 are two document vectors, then


cos( d1, d2 ) = <d1,d2> / ||d1|| ||d2|| ,
where <d1,d2> indicates inner product or vector dot product of vectors, d1 and d2, and || d || is the length
of vector d.
• Example:
d1 = 3 2 0 5 0 0 0 2 0 0
d2 = 1 0 0 0 0 0 0 1 0 2
<d1, d2> = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5

|| d1 || = (3*3+2*2+0*0+5*5+0*0+0*0+0*0+2*2+0*0+0*0)^0.5 = (42)^0.5 = 6.481
|| d2 || = (1*1+0*0+0*0+0*0+0*0+0*0+0*0+1*1+0*0+2*2)^0.5 = (6)^0.5 = 2.449
cos(d1, d2) = 5 / (6.481 * 2.449) = 0.3150
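The same calculation as a small NumPy sketch; the function name `cosine_similarity` is chosen for this example.

```python
import numpy as np

d1 = np.array([3, 2, 0, 5, 0, 0, 0, 2, 0, 0], dtype=float)
d2 = np.array([1, 0, 0, 0, 0, 0, 0, 1, 0, 2], dtype=float)

def cosine_similarity(a, b):
    """Dot product divided by the product of the vector lengths."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

cos_d1_d2 = cosine_similarity(d1, d2)
print(round(cos_d1_d2, 4))
```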
Common Properties of a Similarity

• Similarities also have some well known properties.

1. s(x, y) = 1 (or maximum similarity) only if x = y. (does not always hold, e.g., cosine)
2. s(x, y) = s(y, x) for all x and y. (Symmetry)

where s(x, y) is the similarity between points (data objects), x and y.


Correlation measures the linear relationship between
objects
Visually Evaluating Correlation

Scatter plots
showing the
similarity from –1 to
1.
Correlation vs Cosine vs Euclidean Distance

• Compare the three proximity measures according to their behavior under variable transformation
• scaling: multiplication by a value
• translation: adding a constant

Property                              Cosine  Correlation  Euclidean Distance
Invariant to scaling (multiplication) Yes     Yes          No
Invariant to translation (addition)   No      Yes          No

• Consider the example
• x = (1, 2, 4, 3, 0, 0, 0, 0), y = (1, 2, 3, 4, 0, 0, 0, 0)
• ys = y * 2 (scaled version of y), yt = y + 5 (translated version)

Measure             (x, y)   (x, ys)  (x, yt)
Cosine              0.9667   0.9667   0.7940
Correlation         0.9429   0.9429   0.9429
Euclidean Distance  1.4142   5.8310   14.2127
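The comparison can be reproduced with NumPy. This sketch assumes the vectors carry four trailing zeros (eight components), which is the length that reproduces the tabulated values; the helper names are illustrative.

```python
import numpy as np

x = np.array([1, 2, 4, 3, 0, 0, 0, 0], dtype=float)
y = np.array([1, 2, 3, 4, 0, 0, 0, 0], dtype=float)
ys = y * 2   # scaled version of y
yt = y + 5   # translated version of y

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def correlation(a, b):
    # Pearson correlation between the two vectors.
    return float(np.corrcoef(a, b)[0, 1])

def euclidean(a, b):
    return float(np.linalg.norm(a - b))

results = {
    name: (cosine(x, other), correlation(x, other), euclidean(x, other))
    for name, other in [("y", y), ("ys", ys), ("yt", yt)]
}
for name, values in results.items():
    print(name, [round(v, 4) for v in values])
```

Only correlation is unchanged in all three columns, because it centers and rescales both vectors before comparing them.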


Comparison of Proximity Measures

• Domain of application
• Similarity measures tend to be specific to the type of attribute and data
• Record data, images, graphs, sequences, 3D-protein structure, etc. tend to have different measures
• However, one can talk about various properties that you would like a proximity measure to have
• Symmetry is a common one
• Tolerance to noise and outliers is another
• Ability to find more types of patterns?
• Many others possible
• The measure must be applicable to the data and produce results that agree with domain
knowledge
Lecture No.5- Module 2
In this segment
Data Visualization
• Conversion to Visual representation
• Visualization Plots
• Histogram
• Boxplot
• Contour
• Scatter Plot
• Matrix Plots
• Correlation Plots
Visualization
Conversion of data into a visual or tabular format
• Visualization of data is one of the most powerful and
appealing techniques for data exploration.
• Humans have a well developed ability to analyze large
amounts of information that is presented visually
• Can detect general patterns and trends
• Can detect outliers and unusual patterns
• Data objects, their attributes, and the relationships
among data objects are translated into graphical
elements such as points, lines, shapes, and colors.
• Example:
• Objects are often represented as points
• Their attribute values can be represented as the
position of the points or the characteristics of the
points, e.g., color, size, and shape
Sea Surface Temperature (SST) for July 1982
Arrangement
Placement of visual elements within a display
• Can make a large difference in how easy it is to understand the data
• Example:
Selection
Elimination or the de-emphasis of certain objects and attributes

• Selection may involve choosing a subset of attributes


• Dimensionality reduction is often used to reduce the number of dimensions to two or three
• Alternatively, pairs of attributes can be considered
• Selection may also involve choosing a subset of objects
• A region of the screen can only show so many points
• Can sample, but want to preserve points in sparse areas
Visualization Techniques:

1. Histograms
2. Two-Dimensional Histograms
3. Box Plot
4. Scatter Plots
5. Contour Plots
6. Matrix Plots
7. Correlation Matrix
8. Star Plots
9. Chernoff Faces
Visualization Techniques: Dataset Used
Visualization Techniques: Histograms

• Histogram
• Usually shows the distribution of values of a single variable
• Divide the values into bins and show a bar plot of the number of objects in each
bin.
• The height of each bar indicates the number of objects
• Shape of histogram depends on the number of bins
• Example: Petal Width (10 and 20 bins, respectively)
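A sketch of how the bin count changes the picture, assuming scikit-learn's bundled copy of the Iris data and NumPy. It computes the bar heights only; passing the same data to a plotting library would draw the bars.

```python
import numpy as np
from sklearn.datasets import load_iris

# Column 3 of the bundled Iris data is petal width (cm).
petal_width = load_iris().data[:, 3]

# Same data, two different bin counts.
counts_10, edges_10 = np.histogram(petal_width, bins=10)
counts_20, edges_20 = np.histogram(petal_width, bins=20)

print(counts_10)
print(counts_20)
```

Both histograms account for all 150 flowers, but the 20-bin version exposes gaps that the 10-bin version smooths over.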
Two-Dimensional Histograms

• Show the joint distribution of the values of two attributes


• Example: petal width and petal length
Visualization Techniques: Box Plots
Display distribution of data
outlier

90th percentile

75th percentile

50th percentile
25th percentile

10th percentile
Visualization Techniques: Box Plots
Display distribution of data
Visualization Techniques: Scatter Plots
Attributes values determine the position
• Scatter plots can compactly summarize the relationships of several pairs of attributes
• Example: Iris Plant dataset http://www.ics.uci.edu/~mlearn/MLRepository.html
• Three flower types (classes): Setosa, Versicolour, Virginica
• Four (non-class) attributes: Sepal width and length, Petal width and length
Visualization Techniques: Contour Plots
Useful when a continuous attribute is measured on a spatial grid
• They partition the plane into regions of similar values
• The contour lines that form the
boundaries of these regions connect
points with equal values
• Common example: contour maps of
elevation, temperature, rainfall, air
pressure, etc.

Celsius
Surface Sea Temperature
Visualization Techniques: Matrix Plots
Plot data matrix
• Often useful when objects are sorted according to class
• Plots of similarity or distance matrices can also be useful for visualizing the relationships

standard
deviation
Visualization of the Iris Correlation Matrix
Star Plots for Iris Data

Setosa

Versicolour

Virginica
Chernoff Faces for Iris Data

Setosa

Versicolour

Virginica
Lecture No.6- Module 2
In this segment
Software Tools for ML
• Notebooks and Colab
• Python and important libraries
• A few important functions
• Missing data
• Attribute Visualization
• Data Cleaning
• Attribute Combination
• Handling categorical attributes
• Test dataset Creation
Prerequisites
• Jupyter Notebook
• Open-source web application to create and share live code, equations, visualizations and narrative
text.
• Used for data cleaning and transformation, numerical simulation, statistical modeling, data
visualization, machine learning, and much more.
• Easy to tinker with code and execute it in steps
• Run in local browser: https://jupyter.org/try
• Local installation: https://jupyter.org/install.html
• Repository of Jupyter Notebooks: https://github.com/ageron/handson-ml
• Colab
• Google’s flavor of Jupyter notebooks tailored for machine learning and data analysis
• Runs entirely in the cloud.
• free access to hardware accelerators like GPUs and TPUs (with some restrictions).
• http://colab.research.google.com
• Popular Datasets
• UC Irvine Machine Learning Repository: http://archive.ics.uci.edu/ml/
• Kaggle datasets: https://www.kaggle.com/datasets
• Amazon’s AWS datasets: http://aws.amazon.com/fr/datasets/
Local Installation
Preferred Mode
• Install a virtual environment via Anaconda
• Free Anaconda Python distribution: https://www.anaconda.com/products/individual
• Download Python 3.* version
Prerequisites
Python and several scientific libraries
• Python
• high-level, dynamically typed multiparadigm programming language.
• almost like pseudocode, very readable
• Platform: Linux/Windows/MacOS
• Documentation: https://www.python.org/doc/
• Interactive Python tutorial: http://learnpython.org/
• Stand alone installation: (preferably 3.7 or higher) https://www.python.org/downloads/
• scikit-learn
• Open source library that supports supervised and unsupervised learning.
• Various model fitting, data preprocessing, model selection and evaluation tools, and other utilities.
• Platform: Linux/Windows/MacOS
• User guide: https://scikit-learn.org/stable/user_guide.html
• Stand alone installation: https://scikit-learn.org/stable/install.html
Prerequisites
Important libraries
• Numpy
• Scientific computing library with Python
• Provides high-performance multidimensional array (e.g., vector, matrix) and basic tools to compute with
and manipulate these arrays, linear algebra, random number generators
• https://numpy.org/doc/stable/reference/
• https://numpy.org/install/
• SciPy
• Builds on numpy, provides functions that operate on numpy arrays
• Useful for different types of scientific and engineering applications, integration, image processing, …
• https://docs.scipy.org/doc/scipy/reference/tutorial/index.html
• Standalone installation: https://scipy.org/install.html
• Matplotlib
• Plotting library
• https://matplotlib.org/
• https://matplotlib.org/tutorials/index.html
• https://matplotlib.org/users/installing.html
Important ML Steps and Functions
https://github.com/ageron/handson-ml/blob/master/02_end_to_end_machine_learning_project.ipynb

• 20,640 data instances in the dataset – fairly small


• 207 districts are missing total_bedrooms attribute
• except ocean_proximity, rest of the attributes numerical
• type of ocean_proximity is object, so it must be a text attribute - a repeated categorical attribute!
Attribute Visualization

Histograms of the numerical attributes (panels include age, latitude, # of households, longitude, price, income)


Observations from Histogram

• Median income attribute not expressed in US dollars (USD).


• scaled and capped at 15 for higher median incomes, and at 0.5 for lower median incomes.
• The housing median age and the median house value also capped.
• Machine Learning algorithms may learn that prices never go beyond that limit.
• If precise predictions beyond $500,000 are needed, either
• Collect proper labels for the districts whose labels were capped, or
• Remove those districts from the training set and also from the test set

• Attributes have very different scales.


• Finally, many histograms are tail heavy.
• Transformations may be needed to obtain more bell-shaped distributions
Visualizing Training Data for Insights
Correlations in Training Data

• House prices are strongly correlated with income


• small negative correlation between the latitude and
the median house value
• i.e., prices have a slight tendency to go down when you go north
• price cap clearly visible as a horizontal line at $500K.
• Additional horizontal lines around $450k, $350K,
perhaps one around $280K, ...
Experimenting with Attribute Combinations

• Total number of rooms in a district not very useful


• What may be important is the number of rooms per household.
• Similarly, the total number of bedrooms compared to the number of rooms may be more meaningful

• rooms_per_household is more correlated with price than total number of bedrooms or rooms.
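A sketch of deriving such ratio attributes with pandas. The three rows below are made-up illustrative values, not taken from the housing dataset, and the column `bedrooms_per_room` is an assumed name for the second ratio.

```python
import pandas as pd

# Tiny illustrative frame; the numbers are invented for this example.
housing = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0],
    "total_bedrooms": [129.0, 1106.0, 190.0],
    "households": [126.0, 1138.0, 177.0],
})

# Ratios are often more informative than district-level totals.
housing["rooms_per_household"] = housing["total_rooms"] / housing["households"]
housing["bedrooms_per_room"] = housing["total_bedrooms"] / housing["total_rooms"]

print(housing[["rooms_per_household", "bedrooms_per_room"]])
```

On the full dataset, the correlation of each new column with the target could then be inspected with `DataFrame.corr()`.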
Data Cleaning
• Some of the instances don’t have total_bedrooms values

• Imputer utility in scikit-learn (SimpleImputer in current releases)

• Since median applies only to numerical data, create a copy of the data without ocean_proximity

• The imputer stores the median of each attribute in its statistics_ instance variable.

• Use this “trained” imputer to replace any missing values by the learned medians
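A sketch of median imputation, assuming a recent scikit-learn where the utility is `sklearn.impute.SimpleImputer`. The small array stands in for the numerical columns of the housing data; its values are invented.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Two numerical columns (say total_rooms and total_bedrooms); one value missing.
X = np.array([[880.0, 129.0],
              [7099.0, np.nan],
              [1467.0, 190.0],
              [1274.0, 235.0]])

imputer = SimpleImputer(strategy="median")
imputer.fit(X)                    # learns one median per column

medians = imputer.statistics_     # learned medians, one per attribute
X_filled = imputer.transform(X)   # missing entries replaced by those medians

print(medians)
print(X_filled)
```

Fitting on the training set and reusing the same fitted imputer on the test set keeps the two consistent.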
Handling Textual and Categorical Data
Convert the text attributes to numbers
• Scikit-learn provides a transformer

• ML algorithms assume closer values are more similar. But ‘<1H OCEAN’ and ‘NEAR OCEAN’ are far
away in values!
• Solution! Use 1-hot encoding
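A sketch of one-hot encoding with scikit-learn's `OneHotEncoder`. The category strings follow the ocean_proximity examples named above plus 'INLAND', which is assumed here for illustration.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# One categorical column, shaped (n_samples, 1) as the encoder expects.
ocean = np.array([["<1H OCEAN"], ["INLAND"], ["NEAR OCEAN"], ["INLAND"]])

encoder = OneHotEncoder()
onehot = encoder.fit_transform(ocean).toarray()  # sparse result made dense

categories = list(encoder.categories_[0])
print(categories)
print(onehot)
```

Each row has exactly one 1, so no category is numerically "closer" to another.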
Test Set Creation

• Test set size is 20% of the dataset
• Fixing the random seed ensures reproducibility of results
• Uses uniform sampling of data.
• Not appropriate for heavy tailed distribution
• income very important to predict house prices.
• Important that test set is representative of the various
categories of incomes in the dataset.
• Most income values are around $20–$50K, but some >> $60K.

Histogram of income categories
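One way to keep the test set representative is to stratify the split on an income category. The sketch below uses synthetic incomes and assumed category boundaries (1.5, 3.0, 4.5, 6.0 in scaled units); both are illustrative choices, not values taken from the slides.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, right-skewed "median income" values (scaled units).
rng = np.random.default_rng(42)
income = rng.lognormal(mean=1.2, sigma=0.5, size=1000)

# Bucket incomes into 5 categories (0..4) using assumed boundaries.
income_cat = np.digitize(income, bins=[1.5, 3.0, 4.5, 6.0])

X = income.reshape(-1, 1)
X_train, X_test, cat_train, cat_test = train_test_split(
    X, income_cat, test_size=0.2, stratify=income_cat, random_state=42
)

# Category shares in the full data versus the test set.
overall_share = np.bincount(income_cat, minlength=5) / len(income_cat)
test_share = np.bincount(cat_test, minlength=5) / len(cat_test)
print(np.round(overall_share, 3))
print(np.round(test_share, 3))
```

With `stratify`, the category shares in the test set track those of the full data closely, which purely random sampling does not guarantee.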


Lecture No.3- Module 3
In This Session

Module 3:
Big Picture: End-to-end Machine Learning
3.1 Model Selection and Training
3.1.1. Prediction Problem
3.1.2. Classification Problem
3.2 Evaluation
3.2.1. Prediction Problem
3.2.2. Classification Problem
3.3 Machine Learning Pipeline
In This Session

• Model Selection and Training


• For regression problem
• For classification problem

• Multi-class classification
• Multi-output classification
In this segment
Model Selection and Training
• Select based on training data
• If prediction label/output is available, use regression or classification model
• Regression if real valued output
• Classification if output is discrete (binary/integer)
• else, unsupervised model is used.
• For the house price prediction problem, use regression model since median house prices are
available along with training data (predictors)
• Example: Linear Regression

• Better model is necessary for improving the prediction accuracy


Regression Model Selection and Training
• Decision Tree Based Regression produces low error on training data

• Cross-validation error is not satisfactory


• Better accuracy can be obtained using Random Forest based regressor
Classification Model
1. Binary Classification
2. Multilabel Classification
3. Multioutput Classification
Classification Model
MNIST Dataset Classification
• A set of 70,000 small images of handwritten digits
• Each image 28x28 pixel with intensity 0 (black) – 255 (white)
• Input data represented as 70000 x 784 matrix
• Each image is labeled with the digit it represents.
Classification Model Training
Detect a ‘5’
• Segment the dataset into 60,000 training images and 10,000 test images

• Shuffle the training dataset

• Train a classification model for detecting ‘5’. Target output for a training instance corresponding to an image of ‘5’ is +1, else target output is 0

• Perform cross validation as in the regression problem and try out multiple classification models to achieve acceptable performance.
Multiclass Classification

• Multiclass classifiers (aka multinomial classifiers) can distinguish between more than two classes.
• Some algorithms (such as Random Forest classifiers or naive Bayes classifiers) are capable of
handling multiple classes directly.
• Many (such as Support Vector Machine classifiers or Linear classifiers) are strictly binary
• One-versus-all (OvA) or One-versus-rest strategy using multiple binary classifiers.
• e.g., for MNIST classification, train 10 binary classifiers, one for each digit (a 0-detector, a 1-detector, a
2-detector, and so on).
• get the decision score from each classifier for that image and select the class whose classifier outputs
the highest score.
• One-versus-one (OvO) strategy
• train a binary classifier for every pair of digits: one to distinguish 0s and 1s, another to distinguish 0s and
2s, another for 1s and 2s, and so on.
• If there are N classes, you need to train N × (N – 1) / 2 classifiers.
• Run an image through all 45 classifiers and see which class wins the most duels.
• Main advantage of OvO is each classifier only needs to be trained on the part of the training set
for the two classes that it must distinguish
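The classifier counts can be checked with scikit-learn's wrappers. This sketch substitutes the small bundled 8x8 digits dataset for MNIST and uses `LinearSVC` as the binary base model; both are assumptions made to keep the example quick to run.

```python
from sklearn.datasets import load_digits
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import LinearSVC

# Bundled 8x8 digits (10 classes) as a stand-in for MNIST.
X, y = load_digits(return_X_y=True)
X = X / 16.0  # pixel values range from 0 to 16 in this dataset

base = LinearSVC(dual=False)  # a strictly binary linear classifier

ovo = OneVsOneClassifier(base).fit(X, y)   # one classifier per pair of digits
ovr = OneVsRestClassifier(base).fit(X, y)  # one classifier per digit

n_classes = len(set(y))
n_ovo = len(ovo.estimators_)
n_ovr = len(ovr.estimators_)
print(n_ovo, n_classes * (n_classes - 1) // 2)  # pairwise classifiers
print(n_ovr, n_classes)                         # one-versus-rest classifiers
```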
Multiclass Classification
One-versus-all (OvA)
Multiclass Classification
One-versus-one (OvO) strategy
Multi Label / Output Classification

• In multilabel classification, multiple classes for each instance should be output.


• e.g., the classifier has been trained to recognize three faces, A, B, and C; then when it is shown a
picture of A and C, it should output [1, 0, 1]

• In multioutput-multiclass or simply multioutput classification each label can be multiclass, i.e., it


can have more than two possible values.
• e.g., in a system to remove noise from images, input is a noisy digit image, output a clean digit image,
represented as an array of pixel intensities
• classifier’s output is multilabel - one label per pixel
• each label can have multiple values (pixel intensity ranges from 0 to 255)

• from sklearn.neighbors import KNeighborsClassifier


Thank You!
In our next session: Model Evaluation
Model Evaluation

• Metrics for Performance Evaluation


• How to evaluate the performance of a model?

• Methods for Performance Evaluation


Metrics for Performance Evaluation
Focus on the predictive capability of a model

• Confusion Matrix

                      PREDICTED CLASS
                      Class=Yes   Class=No
ACTUAL   Class=Yes    a           b
CLASS    Class=No     c           d

a: TP (true positive)
b: FN (false negative)
c: FP (false positive)
d: TN (true negative)
Accuracy, Precision, Recall, F1Score

Model performance is not just dependent on accuracy alone

Ability to detect
when disease is
not present
Metrics for Performance Evaluation…

                      PREDICTED CLASS
                      Class=Yes   Class=No
ACTUAL   Class=Yes    a (TP)      b (FN)
CLASS    Class=No     c (FP)      d (TN)

• Most widely-used metric:

$\mathrm{Accuracy} = \frac{a + d}{a + b + c + d} = \frac{TP + TN}{TP + TN + FP + FN}$
Cost Matrix
• True positives: data points predicted as positive that are actually positive
• False positives: data points predicted as positive that are actually negative
• True negatives: data points predicted as negative that are actually negative
• False negatives: data points predicted as negative that are actually positive
Limitation of Accuracy

• Consider a 2-class problem


• Number of Class 0 examples = 9990
• Number of Class 1 examples = 10

• If model predicts everything to be class 0, accuracy is 9990/10000 = 99.9 %


• Accuracy is misleading because model does not detect any class 1 example
Measures for Imbalanced Classes

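$\mathrm{Precision}\ (p) = \frac{a}{a + c}$

$\mathrm{Recall}\ (r) = \frac{a}{a + b}$

$\mathrm{F\text{-}measure}\ (F) = \frac{2rp}{r + p} = \frac{2a}{2a + b + c}$

$\mathrm{Weighted\ Accuracy} = \frac{w_1 a + w_4 d}{w_1 a + w_2 b + w_3 c + w_4 d}$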
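A plain-Python sketch of these measures, written directly from the confusion-matrix counts a (TP), b (FN), c (FP), d (TN). The function name and the choice to return 0.0 when a denominator is zero are assumptions of this example.

```python
def classification_metrics(tp, fn, fp, tn):
    """Accuracy, precision, recall and F-measure from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f_measure = 2 * tp / (2 * tp + fn + fp) if (2 * tp + fn + fp) else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_measure": f_measure}

# The imbalanced case: 9990 negatives, 10 positives, model predicts all negative.
all_negative = classification_metrics(tp=0, fn=10, fp=0, tn=9990)

# A model that finds 8 of the 10 positives at the cost of 20 false alarms.
some_recall = classification_metrics(tp=8, fn=2, fp=20, tn=9970)

print(all_negative)
print(some_recall)
```

The all-negative model scores 99.9% accuracy with zero recall, while the second model has slightly lower accuracy but actually detects the rare class.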
Accuracy, Precision, Recall, F1Score - Consolidated
Methods for Performance Evaluation
How to obtain a reliable estimate of performance?
• Performance of a model may depend on other factors besides the learning algorithm:
• Class distribution
• Cost of misclassification
• Size of training and test sets
Learning Curve

• Learning curve shows how accuracy changes with varying sample size
• Requires a sampling schedule for creating learning curve:
• Arithmetic sampling (Langley, et al)
• Geometric sampling (Provost et al)

Effect of small sample size:
- Bias in the estimate
- Variance of estimate
Methods of Estimation

• Holdout
• Reserve 2/3 for training and 1/3 for testing
• Random subsampling
• Repeated holdout
• Cross validation
• Partition data into k disjoint subsets
• k-fold: train on k-1 partitions, test on the remaining one
• Leave-one-out: k=n
• Stratified sampling
• oversampling vs undersampling
• Bootstrap
• Sampling with replacement
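A sketch of how k-fold and leave-one-out partitions are generated, assuming scikit-learn's `KFold`. The ten-row array is a placeholder for a real dataset.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10).reshape(-1, 1)  # ten placeholder instances

# k-fold with k = 5: each fold trains on 8 instances and tests on 2.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
folds = [(train_idx, test_idx) for train_idx, test_idx in kfold.split(X)]

# Leave-one-out is the special case k = n.
loo_folds = list(KFold(n_splits=len(X)).split(X))

for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
print(len(loo_folds))
```

Every instance appears in exactly one test partition, so all of the data is used for testing once.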

Stratified Sampling: In stratified sampling, researchers divide subjects into subgroups called strata based on
characteristics that they share (e.g., race, gender, educational attainment). Once divided, each subgroup is
randomly sampled using another probability sampling method
ROC (Receiver Operating Characteristic)

• Developed in 1950s for signal detection theory to analyze noisy signals


• Characterize the trade-off between positive hits and false alarms
• ROC curve plots TP rate (on the y-axis) against FP rate (on the x-axis)
• Performance of each classifier represented as a point on the ROC curve
• changing the threshold of algorithm, sample distribution or cost matrix changes the location of the point
ROC Curve

• 1-dimensional data set containing 2 classes (positive and negative)


• any point located at x > t is classified as positive

At threshold t:
TP=0.5, FN=0.5, FP=0.12, TN=0.88
ROC Curve

(TP,FP):
• (0,0): declare everything
to be negative class
• (1,1): declare everything
to be positive class
• (1,0): ideal

• Diagonal line:
• Random guessing
• Below diagonal line:
• prediction is opposite of the true class
Using ROC for Model Comparison

• No model consistently outperforms the other
• M1 is better for small FPR
• M2 is better for large FPR
• Area Under the ROC curve
• Ideal: Area = 1
• Random guess: Area = 0.5
How to Construct an ROC curve

• Use classifier that produces posterior probability for each test instance P(+|A)
• Sort the instances according to P(+|A) in decreasing order
• Apply threshold at each unique value of P(+|A)
• Count the number of TP, FP, TN, FN at each threshold
• TP rate, TPR = TP/(TP+FN)
• FP rate, FPR = FP/(FP + TN)

Instance  P(+|A)  True Class
1         0.95    +
2         0.93    +
3         0.87    -
4         0.85    -
5         0.85    -
6         0.85    +
7         0.76    -
8         0.53    +
9         0.43    -
10        0.25    +
How to construct an ROC curve

ROC Curve:

Class         +     -     +     -     -     -     +     -     +     +
Threshold >=  0.25  0.43  0.53  0.76  0.85  0.85  0.85  0.87  0.93  0.95  1.00
TP            5     4     4     3     3     3     3     2     2     1     0
FP            5     5     4     4     3     2     1     1     0     0     0
TN            0     0     1     1     2     3     4     4     5     5     5
FN            0     1     1     2     2     2     2     3     3     4     5
TPR           1     0.8   0.8   0.6   0.6   0.6   0.6   0.4   0.4   0.2   0
FPR           1     1     0.8   0.8   0.6   0.4   0.2   0.2   0     0     0
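The construction can be sketched in NumPy using the ten scored instances from the table. Following the "each unique value" rule, tied scores (the three instances at 0.85) are handled at a single threshold, so this version yields one ROC point for 0.85 rather than the three intermediate columns in the table.

```python
import numpy as np

# P(+|A) and true class (1 = '+', 0 = '-') for instances 1..10.
scores = np.array([0.95, 0.93, 0.87, 0.85, 0.85, 0.85, 0.76, 0.53, 0.43, 0.25])
labels = np.array([1, 1, 0, 0, 0, 1, 0, 1, 0, 1])

n_pos = int(labels.sum())
n_neg = len(labels) - n_pos

# One threshold per unique score, plus 1.00 so the curve reaches (0, 0).
thresholds = np.append(np.unique(scores), 1.0)

roc_points = []
for t in thresholds:
    predicted_pos = scores >= t
    tp = int((predicted_pos & (labels == 1)).sum())
    fp = int((predicted_pos & (labels == 0)).sum())
    roc_points.append((float(t), tp / n_pos, fp / n_neg))  # (threshold, TPR, FPR)

for t, tpr, fpr in roc_points:
    print(f"t >= {t:.2f}  TPR = {tpr:.1f}  FPR = {fpr:.1f}")
```

Plotting FPR on the x-axis against TPR on the y-axis for these points gives the ROC curve.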


Thank You!
In our next session: Hyperparameter Optimization
Hyperparameter optimization
Hyperparameters
Machine Learning systems use many parameters internally

• Gradient Descent
• e.g., Learning rate, how long to run
• Mini-batch
• Batch size
• Regularization constant
• Many Others
• will be discussed in upcoming sessions
Hyperparameter Optimization

• Also called metaparameter optimization


• Also called tuning

• How to find best values of hyperparameters?


Tuning By Hand

• Just fiddle with the parameters until you get the results you want

• Probably the most common type of hyperparameter optimization

• Upsides: the results are generally pretty good…

• Downsides: lots of effort, and no theoretical guarantees


Grid Search

• Define some grid of parameters you want to try


• Try all the parameter values in the grid
• By running the whole system for each setting of parameters
• Then choose the setting with the best result
• Essentially a brute force method
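A sketch of grid search using scikit-learn's `GridSearchCV`. The model (k-nearest neighbours), the Iris data, and the grid values are illustrative choices made for this example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Grid of hyperparameter values: 4 x 2 = 8 settings in total.
param_grid = {
    "n_neighbors": [1, 3, 5, 7],
    "weights": ["uniform", "distance"],
}

# Every setting is trained and scored with 5-fold cross-validation.
grid_search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
grid_search.fit(X, y)

n_settings = len(grid_search.cv_results_["params"])
print(n_settings)
print(grid_search.best_params_, round(grid_search.best_score_, 3))
```

Adding a third hyperparameter with four candidate values would raise the count from 8 to 32 settings, which is the exponential growth described on the next slide.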
Downsides of Grid Search

• As the number of parameters increases, the cost of grid search increases exponentially!
• Need some way to choose the grid properly
• Sometimes this can be as hard as the original hyperparameter optimization
• Can’t take advantage of any insight you have about the system!
Making Grid Search Fast

• Early stopping to the rescue


• Can run all the grid points for one epoch, then discard the half that performed worse, then run for another
epoch, discard half, and continue.

• Can take advantage of parallelism


• Run all the different parameter settings independently on different servers in a cluster.
• An embarrassingly parallel task.
• Downside: doesn’t reduce the energy cost.
One Variant: Random Search

• This is just grid search, but with randomly chosen points instead of points on a grid.
• RandomizedSearchCV in scikit-learn

• This mitigates the curse of dimensionality


• Don’t need to increase the number of grid points exponentially as the number of dimensions
increases.

• Problem: with random search, not necessarily going to get anywhere near the optimal parameters in
a finite sample.
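A sketch of random search with scikit-learn's `RandomizedSearchCV`, assuming SciPy for the sampling distribution. The model, data, and the range 1 to 29 for `n_neighbors` are illustrative choices.

```python
from scipy.stats import randint
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Distributions (or lists) to sample from, instead of a fixed grid.
param_distributions = {
    "n_neighbors": randint(1, 30),        # integers 1..29
    "weights": ["uniform", "distance"],
}

# The budget is set by n_iter, independent of how many hyperparameters there are.
random_search = RandomizedSearchCV(
    KNeighborsClassifier(), param_distributions,
    n_iter=10, cv=5, random_state=0,
)
random_search.fit(X, y)

n_sampled = len(random_search.cv_results_["params"])
print(n_sampled)
print(random_search.best_params_, round(random_search.best_score_, 3))
```

The best setting found is only the best of the ten sampled points, which illustrates the caveat above about not necessarily landing near the optimum.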
An Alternative: Bayesian Optimization

• Statistical approach for minimizing noisy black-box functions.

• Idea: learn a statistical model of the function from hyperparameter values to the loss function
• Then choose parameters to minimize the loss

• Main benefit: choose the hyperparameters to test not at random, but in a way that gives the most
information about the model
• This lets it learn faster than grid search
Effect of Bayesian Optimization

• Downside: it’s a pretty heavyweight method


• The updates are not as simple-to-implement as grid search

• Upside: empirically it has been demonstrated to get better results in fewer experiments
• Compared with grid search and random search

• Pretty widely used method


• Lots of research opportunities here.
Cross-Validation

• Partition part of the available data to create a validation dataset that we don’t use for training.

• Then use that set to evaluate the hyperparameters.

• Typically, multiple rounds of cross-validation are performed using different partitions


• Can get a very good sense of how good the hyperparameters are
• But at a significant computational cost!
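One common way to perform "multiple rounds using different partitions" is k-fold cross-validation. The sketch below only produces the index partitions; it assumes contiguous, unshuffled folds, which is a simplification (real pipelines usually shuffle first).

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


folds = list(k_fold_indices(10, 3))
for train_idx, val_idx in folds:
    print("train:", train_idx, "validate:", val_idx)
```

Each hyperparameter setting is then trained k times, once per fold, and its k validation scores are averaged. That factor of k is the computational cost mentioned above.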
Thank You!
In our next session: Machine Learning Pipeline
Machine Learning Pipeline
In This Session

• What is MLOps
• DevOps vs MLOps

• Level 0 MLOps
• Manual process

• Level 1 MLOps
• Continuous Training

• Level 2 MLOps
• Continuous Integration, Delivery

• Frameworks
What is MLOps?
Apply DevOps principles to ML systems
• An engineering culture and practice that aims at unifying ML system development (Dev) and ML
system operation (Ops).

• Automation and monitoring at all steps of ML system construction, including integration, testing,
releasing, deployment and infrastructure management.

• Data scientists can implement and train an ML model with predictive performance on an offline
validation (holdout) dataset, given relevant training data for their use case.

• However, the real challenge is building an integrated ML system and continuously operating it in
production.
Ecosystem of ML System Components
A small fraction of a real-world ML system is composed of the ML code
DevOps Vs. MLOps

• DevOps for developing and operating large-scale software systems provides benefits such as
• shortening the development cycles
• increasing deployment velocity, and
• dependable releases.
• Two key concepts
• Continuous Integration (CI)
• Continuous Delivery (CD)
• An ML system is a software system, so similar practices apply to reliably build and operate at
scale.
• However, ML systems differ from other software systems
• Team skills: focus on exploratory data analysis, model development, and experimentation.
• Development: ML is experimental in nature.
• The challenge is tracking what worked and what did not, maintaining reproducibility, and maximizing code
reusability.
• Testing: Additional testing needed for data validation, trained model quality evaluation, and model
validation.
DevOps Vs. MLOps

• Deployment: may require a multi-step pipeline to automatically retrain and deploy the model.
• This adds complexity
• Steps that data scientists perform manually before deployment (training and validating new models) need to be automated.
• Production: ML models can have reduced performance due to constantly evolving data
profiles.
• Need to track summary statistics of data and
• monitor the online performance of the model, to send notifications or roll back when values deviate from expectations
• ML and other software systems are similar in CI of source control, unit / integration testing,
and CD of the software module / package.
• However, in ML,
• CI is also about testing and validating data, data schemas, and models.
• CD is a system (an ML training pipeline) that automatically deploys another service (model prediction
service).
• Continuous training (CT) is a new property, unique to ML systems, that is concerned with
automatically retraining the model in production and serving the models.
Manual ML Steps
• Manual, script-driven, and interactive process.
• Disconnection between ML and operations, possibly leading to training-serving skew
• Infrequent release iterations. No CI, CD, active performance monitoring
• Deploy trained Model as a prediction service
• Deployment process is concerned only with deploying the trained model as a prediction service,
e.g., a microservice with a REST API
MLOps Level 1

• Perform continuous training (CT) by automating the ML pipeline

• Achieves continuous delivery of the model prediction service

• Adds automated data and model validation steps to the pipeline

• Needs pipeline triggers and metadata management
Data and Model Validation

• Data validation: Required prior to model training to decide whether to retrain the model or stop
the execution of the pipeline, based on the following
• Data values skews: significant changes in the statistical properties of data, triggering retraining
• Data schema skews: downstream pipeline steps, including data processing and model training,
receive data that doesn't comply with the expected schema.
• Stop the pipeline so that a fix or an update to the pipeline can be released to handle these changes in the schema.
• Schema skews include receiving unexpected features, receiving features with unexpected values, and not
receiving all the expected features

• Model validation: Required after retraining the model with the new data. Evaluate and validate
the model before promoting to production. This offline model validation step consists of
• Producing evaluation metrics using the trained model on test data to assess the model quality.
• Comparing the evaluation metrics of the newly trained model with those of the production model, a baseline
model, or other business-requirement models.
• Ensuring the consistency of model performance on various data segments
• Test model for deployment, including infrastructure compatibility and API consistency
• Undergo online model validation—in a canary deployment or an A/B testing setup
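The two data-validation checks described above (schema skew and value skew) could look roughly like this. Everything here is a hypothetical sketch: the schema, the feature names, and the use of a mean shift measured in training standard deviations are illustrative assumptions, not part of any specific MLOps framework.

```python
import statistics

# Hypothetical expected schema: feature name -> Python type.
EXPECTED_SCHEMA = {"age": float, "income": float, "country": str}


def find_schema_skew(record, schema):
    """Return a list of schema problems; an empty list means the record conforms."""
    problems = []
    for name, expected_type in schema.items():
        if name not in record:
            problems.append(f"missing feature: {name}")
        elif not isinstance(record[name], expected_type):
            problems.append(f"unexpected type for {name}")
    for name in record:
        if name not in schema:
            problems.append(f"unexpected feature: {name}")
    return problems


def mean_shift(train_values, new_values):
    """Shift of the mean of new data, in units of the training standard deviation."""
    spread = statistics.pstdev(train_values)
    return abs(statistics.mean(new_values) - statistics.mean(train_values)) / spread


print(find_schema_skew({"age": 31.0, "country": "IN", "zip": "560001"}, EXPECTED_SCHEMA))
print(mean_shift([0.0, 2.0], [3.0, 3.0]))
```

A pipeline might stop when `find_schema_skew` returns anything, and trigger retraining when `mean_shift` exceeds a chosen threshold. Both policies are design choices left to the team.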
Level 2: CI/CD pipeline automation
Stages of CI/CD Automation Pipeline

1) Development and experimentation: iteratively try new ML algorithms and modeling. The
output is the source code of the ML pipeline steps that are then pushed to a source
repository.
2) Pipeline continuous integration: build source code and run various tests. The outputs of
this stage are pipeline components (packages, executables, and artifacts).
3) Pipeline continuous delivery: deploy artifacts produced by the CI stage to the target
environment.
4) Automated training: automatically executed in production based on a schedule or trigger.
The output is a trained model pushed to the model registry.
5) Model continuous delivery: serve the trained model as a prediction service.
6) Monitoring: collect statistics on the model performance based on live data. The output is a
trigger to execute the pipeline or to execute a new experiment cycle.
Stages of the CI/CD automated ML pipeline
Continuous Integration

• Pipeline and its components are built, tested, and packaged when
• new code is committed or
• pushed to the source code repository.

• Besides building packages, container images, and executables, CI process can include
• Unit testing feature engineering logic.
• Unit testing the different methods implemented in your model.
• For example, you have a function that accepts a categorical data column and encodes it as a one-hot
feature.
• Testing for training convergence
• Testing for NaN values due to dividing by zero or manipulating small or large values.
• Testing that each component in the pipeline produces the expected artifacts.
• Testing integration between pipeline components.
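As an illustration of unit testing feature engineering logic in CI, here is a small one-hot encoder with a test for it. The encoder and the test name are hypothetical; a real project would test its own feature code, typically with a test runner such as pytest.

```python
def one_hot(column, categories):
    """Encode each value of a categorical column as a one-hot row."""
    index = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for value in column:
        row = [0] * len(categories)
        row[index[value]] = 1  # raises KeyError for a category not seen before
        rows.append(row)
    return rows


def test_one_hot_rows_have_exactly_one_hot_bit():
    encoded = one_hot(["red", "blue", "red"], ["blue", "green", "red"])
    assert encoded == [[0, 0, 1], [1, 0, 0], [0, 0, 1]]
    assert all(sum(row) == 1 for row in encoded)


test_one_hot_rows_have_exactly_one_hot_bit()
```

Running such tests on every commit catches regressions in feature code before a training pipeline is rebuilt from it.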
Continuous Delivery

• Continuously delivers new pipeline implementations to the target environment
• These in turn deliver prediction services of the newly trained model.
• For rapid and reliable continuous delivery of pipelines and models, consider
• Verifying the compatibility of the model with the target infrastructure
• e.g., required packages are installed in the serving environment
• Availability of memory, compute, and accelerator resources.
• Testing the prediction service by calling the service API for the updated model
• Testing prediction service performance, such as throughput, latency.
• Validating the data either for retraining or batch prediction.
• Verifying that models meet the predictive performance targets prior to deployment.
• Automated deployment to a test environment, triggered by new code to the development branch.
• Semi-automated deployment to a pre-production environment, triggered by code merging
• Manual deployment to production from pre-production.
Frameworks
Cloud vendors provide MLOps frameworks
• https://cloud.google.com/solutions/machine-learning/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning
• Kubeflow and Cloud Build

• Amazon AWS MLOps

• Microsoft Azure MLOps




Lecture No.4- Module 4
Module-4

• Linear Prediction Models


1. Linear Regression
2. Gradient Descent and Variants
3. Regularization
4. Bias and Variance
Session Content

• What is Linear regression


• Approaches to linear regression
• Least Squares Based Solution
• Direct Solution
Regression
• Wish to learn a function f : X → Y, where the predicted output Y is real-valued, given the n
training instances {<x1,y1>, …, <xn,yn>}.

• Used for applications where the output is a real value, e.g., predicting a house price or
a stock price.

• Examples include
• predict weight from gender, height, age, …
• Predict house price from locality, area, income, …
• predict Google stock price today from Google, Yahoo, MSFT prices yesterday
• predict each pixel intensity in robot’s current camera image, from previous image and
previous action
Regression- Examples
Visually Evaluating Correlation

Scatter plots showing the similarity from –1 to 1.
Correlation measures the linear relationship between objects.
Simple Linear Regression
Two Approaches
Two very different ways to train it:
• Using a direct “closed-form” equation that directly computes the model
parameters that best fit the model to the training set (i.e., the model
parameters that minimize the cost function over the training set).

• Using an iterative optimization approach, called Gradient Descent (GD), that


gradually tweaks the model parameters to minimize the cost function over the
training set, eventually converging to the same set of parameters as the first
method.
Least Squares Approach

N: number of samples
D: dimension (number of features)
i: index of the i-th sample
Least Squares Approach
Multiple Linear Regression

• Least squares fit, or linear regression
• Minimize MSE
• Errors are called residuals
Least Squares Linear Regression

How to minimize J(θ)?

Least Squares Based Solution
• Solve for optimal θ analytically
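The analytic solution can be written out as follows. This is a sketch under notation the slides do not spell out: X is assumed to be the N × (D+1) design matrix whose i-th row is the sample x^(i) with a leading 1, y is the vector of N targets, and J is taken to be the mean squared error. A different constant factor in J would not change the minimizer.

```latex
J(\theta) = \frac{1}{N} \sum_{i=1}^{N} \left( \theta^\top x^{(i)} - y^{(i)} \right)^2
          = \frac{1}{N} \lVert X\theta - y \rVert^2

\nabla_\theta J(\theta) = \frac{2}{N} X^\top (X\theta - y) = 0
\;\Longrightarrow\; X^\top X \,\theta = X^\top y
\;\Longrightarrow\; \hat{\theta} = (X^\top X)^{-1} X^\top y
```

The last step assumes that X^T X is invertible, which fails, for example, when two features are exactly collinear.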
Thank You!
In our next session: Gradient Descent
Linear Regression
Gradient Descent
BITS Pilani Dr. Bharathi R
CSE Department
Pilani Campus
Session Content - Gradient Descent

• Intuition Behind Regression Cost Function


• Gradient Descent Based Solution
• Types of Gradient Descent
• Batch
• Stochastic
• Minibatch
Intuition Behind Cost Function
[Figure slides on the cost function and gradient descent; simplified case θ0 = 0]
Gradient Descent
Gradient Descent is a very generic optimization algorithm capable of finding optimal
solutions to a wide range of problems.
The general idea of Gradient Descent is to tweak parameters iteratively in order to minimize
a cost function.
Concretely, you start by filling θ with random values (this
is called random initialization), and then you improve it
gradually, taking one baby step at a time, each step
attempting to decrease the cost function (e.g., the MSE),
until the algorithm converges to a minimum
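A minimal sketch of this loop for simple linear regression h(x) = θ0 + θ1·x with the MSE cost. Two assumptions to note: θ is initialized to zeros rather than random values, to keep the run reproducible, and the data set is a made-up noise-free line y = 1 + 2x.

```python
def gradient_descent(xs, ys, lr=0.1, n_steps=1000):
    """Batch gradient descent on the MSE of h(x) = theta0 + theta1 * x."""
    theta0, theta1 = 0.0, 0.0  # zeros instead of random initialization
    n = len(xs)
    for _ in range(n_steps):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        # Partial derivatives of (1/n) * sum(error^2) w.r.t. theta0 and theta1.
        grad0 = 2.0 / n * sum(errors)
        grad1 = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        # One small step against the gradient.
        theta0 -= lr * grad0
        theta1 -= lr * grad1
    return theta0, theta1


theta0, theta1 = gradient_descent([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(theta0, theta1)  # close to 1 and 2
```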
Gradient Descent- Hyper parameter
If the learning rate is too small, then the algorithm will have to go through many iterations to converge,
which will take a long time.
On the other hand, if the learning rate is too high, you might jump across the valley and end up on the
other side, possibly even higher up than you were before. This might make the algorithm diverge, with
larger and larger values, failing to find a good solution.
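Both failure modes can be seen on a one-parameter toy problem. The cost J(θ) = θ² is an assumption chosen for illustration: its gradient is 2θ, so each update multiplies θ by (1 − 2α).

```python
def run_gd(alpha, theta=1.0, n_steps=10):
    """Gradient descent on the toy cost J(theta) = theta**2 (gradient 2 * theta)."""
    for _ in range(n_steps):
        theta -= alpha * 2 * theta
    return theta


slow = run_gd(alpha=0.01)      # still far from the minimum at 0 after 10 steps
good = run_gd(alpha=0.4)       # very close to 0
diverged = run_gd(alpha=1.1)   # overshoots more on every step
print(slow, good, diverged)
```

For this cost, any α above 1 makes |1 − 2α| exceed 1, so the iterates grow without bound.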
Gradient Descent- Pitfalls
• Finally, not all cost functions look like nice regular bowls.
• There may be holes, ridges, plateaus, and all sorts of irregular terrains, making
convergence to the minimum very difficult.
• Figure shows the two main challenges with Gradient Descent:
1. If the random initialization starts the algorithm on the left, then it will
converge to a local minimum, which is not as good as the global minimum.
2. If it starts on the right, then it will take a very long time to cross the
plateau, and if you stop too early you will never reach the global minimum.
Intuition Behind Cost Function
[Figure slides: example hypothesis h(x) = 900 - 0.1x, i.e., (θ0, θ1) = (900, -0.1)]
Basic Search Procedure
Gradient Descent
Gradient Descent for Linear Regression
Gradient Descent-Numerical Example
Gradient Descent
[Figure slides: contours of constant J, with the point (900, -0.1) marked]
Running Gradient Descent
Choosing Step Size
Impact of Learning Rate

[Figure panels: α = 0.02, α = 0.1, α = 0.5]


Feature Normalization
Gradient Descent can be slow when feature scales are widely different
• Normalizing feature values to the same variance is needed for faster convergence
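One common normalization is standardization: subtract the mean and divide by the standard deviation, so every feature ends up with mean 0 and variance 1. The sketch below uses the population standard deviation and made-up house sizes; both are assumptions for illustration.

```python
import statistics


def standardize(column):
    """Rescale a feature column to zero mean and unit variance."""
    mean = statistics.mean(column)
    std = statistics.pstdev(column)
    return [(value - mean) / std for value in column]


sizes_sqft = [850.0, 1200.0, 1500.0, 2400.0]  # hypothetical feature values
scaled = standardize(sizes_sqft)
print(scaled)
```

The mean and standard deviation should be computed on the training set only and then reused unchanged for validation and test data.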
Impact Due To Batch Size
Choice of batch size impacts the rate of convergence of gradient descent (GD)
• In Batch GD, the entire training set is used to compute the training error and its gradient in each
iteration/epoch, and that gradient is used for the weight update
• Converges in the fewest iterations, i.e., rate of convergence is highest [O(1/iterations)]
• Computation requirement per iteration is highest
• Memory requirement is also highest
• In Stochastic gradient descent, a single randomly selected training instance is used per update
• Converges in the highest number of iterations, i.e., rate of convergence is slowest [~O(1/√iterations)]
• Computation requirement per iteration is lowest
• Memory requirement is also lowest
• In mini-batch GD, a subset of the training data of size, say, 64, 128, or 256 is used
• Very efficient implementations are possible by leveraging vector processing on GPUs
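The three variants differ only in how many instances feed each update, so one function with a `batch_size` parameter can cover all of them. This is a plain-Python sketch for h(x) = θ0 + θ1·x on a made-up noise-free data set; the learning rate, epoch count, and seed are arbitrary assumptions.

```python
import random


def minibatch_gd(xs, ys, batch_size, lr=0.05, n_epochs=500, seed=0):
    """batch_size=len(xs) gives batch GD, 1 gives stochastic GD, otherwise mini-batch GD."""
    rng = random.Random(seed)
    theta0, theta1 = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(n_epochs):
        rng.shuffle(indices)  # visit the training instances in a new random order
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            errors = [theta0 + theta1 * xs[i] - ys[i] for i in batch]
            # MSE gradient estimated from this batch only.
            grad0 = 2.0 / len(batch) * sum(errors)
            grad1 = 2.0 / len(batch) * sum(e * xs[i] for e, i in zip(errors, batch))
            theta0 -= lr * grad0
            theta1 -= lr * grad1
    return theta0, theta1


xs, ys = [0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0]  # y = 1 + 2x
results = {size: minibatch_gd(xs, ys, batch_size=size) for size in (1, 2, 4)}
print(results)
```

On noisy data, the single-instance variant would keep bouncing around the minimum unless the learning rate is reduced over time.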
Closed Form Solution Vs. Gradient Descent

O(d^3): cost of inverting the d × d matrix in the closed-form solution
Extending to More Complex Model
Fitting a Polynomial Curve
Thank You!
In our next session: Regularization
Linear Regression
Regularization
BITS Pilani Dr. Bharathi R
CSE Department
Pilani Campus
Session Content

• Overfitting of the training data


• Effect of Training data size on overfitting
• Regularization
• Ridge
• Lasso
• Early Stopping
Quality of Fit
Addressing overfitting

• 𝑥1 = size of house
• 𝑥2 = no. of bedrooms
• 𝑥3 = no. of floors
• 𝑥4 = age of house
• 𝑥5 = average income in neighborhood
• 𝑥6 = kitchen size
• ⋮
• 𝑥100

[Figure: Price ($) in 1000’s vs. Size in feet^2]
Addressing overfitting

• Reduce number of features.


• Manually select which features to keep.
• Model selection algorithm

• Regularization.
• Keep all the features, but reduce magnitude/values of parameters 𝜃𝑗 .
• Works well when we have a lot of features, each/many of which contributes a bit to predicting 𝑦.
Effect of Training Size on Overfitting
Size of training dataset needs to be large to prevent overfitting
• when higher order model is used.

Higher order model

Lower order model


Polynomial Fitting can lead to Overfitting
Underlying target function is quadratic
• Linear model results in underfitting with large bias
• Polynomial of order 300 results in a large variance
Regularization
Ridge Regularization
Understanding Regularization

Overfitting
Understanding Regularization

Underfitting
Ridge regression
Regularized Linear Regression
Ridge Regression

Further simplified
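For reference, the ridge cost and its closed-form minimizer can be written as below. This is a sketch under assumed conventions, since the slide equations did not survive extraction: the data term is the unscaled sum of squared errors, the bias θ0 is not penalized, and A denotes the (D+1) × (D+1) identity matrix with its top-left entry set to 0. With a 1/N or 1/(2N) factor on the data term, λ is rescaled accordingly.

```latex
J(\theta) = \lVert X\theta - y \rVert^2 + \lambda \sum_{j=1}^{D} \theta_j^2

\nabla_\theta J(\theta) = 2 X^\top (X\theta - y) + 2 \lambda A \theta = 0
\;\Longrightarrow\; \hat{\theta} = (X^\top X + \lambda A)^{-1} X^\top y
```

Setting λ = 0 recovers the ordinary least squares solution, and larger λ shrinks the weights θ1, …, θD towards zero.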
Lasso Regularization
Ridge Vs Lasso Regularization
[Figure: model fits under Ridge and Lasso for different regularization strengths. Panel legends:
λ = 0, 10, 100 and λ = 0, 10^-5, 1; λ = 0, 0.1, 1 and λ = 0, 10^-7, 1]
Early Stopping
Do not overtrain, to prevent overfitting
• Stop training once error on the validation set starts showing an upward trend, even if the error
on the training set keeps decreasing
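A minimal sketch of this stopping rule applied to a recorded sequence of validation errors. The `patience` parameter (how many non-improving epochs to tolerate) and the error values are assumptions added for illustration; the slide itself only states the "stop when validation error trends upward" idea.

```python
def early_stopping_epoch(val_errors, patience=1):
    """Return (best_epoch, best_error), scanning until `patience` epochs pass without improvement."""
    best_epoch, best_error = 0, float("inf")
    for epoch, err in enumerate(val_errors):
        if err < best_error:
            best_epoch, best_error = epoch, err
        elif epoch - best_epoch >= patience:
            break  # validation error has stopped improving
    return best_epoch, best_error


val_errors = [0.90, 0.60, 0.45, 0.40, 0.42, 0.47, 0.55]  # hypothetical values
best_epoch, best_error = early_stopping_epoch(val_errors, patience=2)
print(best_epoch, best_error)
```

In a real training loop the model parameters from `best_epoch` would be saved and restored after stopping.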
Thank You!
In our next session: Bias Variance Decomposition
Linear Regression
Bias Variance Decomposition
BITS Pilani Dr. Bharathi R
CSE Department
Pilani Campus
Session Content

• What is Bias, Variance


• Sources of generalization error
• Bias Variance Tradeoff
• Bias Variance Decomposition
Bias and Variance

• Bias: difference between what you expect to learn (“average” hypothesis) and truth
• Measures how well you expect to represent true solution
• Decreases with more complex model

• Variance: difference between what you expect to learn and what you learn from a particular dataset
• Measures how sensitive learner is to specific dataset
• Increases with more complex model
Bias and Variance
Sources of Error
Model’s generalization error can be expressed as the sum of three very different errors
• Bias
• due to wrong assumptions, such as model order is less than actual order
• most likely to underfit the training data.
• Variance
• Due to model order higher than actual
• excessive sensitivity to small variations in the training data.
• overfit the training data.
• Irreducible error
• due to the noisiness of the data itself.
• The only way to reduce this error is to clean up the data

• Bias Variance Tradeoff


• Increasing a model’s complexity will typically increase its variance and reduce its bias.
• Conversely, reducing a model’s complexity increases its bias and reduces its variance.
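For squared error, the three sources of error above add up exactly. The statement below assumes the standard setup, which the slides do not state explicitly: the target is y = f(x) + ε with E[ε] = 0 and Var(ε) = σ², the learned model is written f̂, and the expectation is taken over training sets and noise. Note that the bias enters squared.

```latex
\mathbb{E}\left[ \left( y - \hat{f}(x) \right)^2 \right]
= \underbrace{\left( \mathbb{E}[\hat{f}(x)] - f(x) \right)^2}_{\text{Bias}^2}
+ \underbrace{\mathbb{E}\left[ \left( \hat{f}(x) - \mathbb{E}[\hat{f}(x)] \right)^2 \right]}_{\text{Variance}}
+ \underbrace{\sigma^2}_{\text{Irreducible error}}
```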
Bias Variance Tradeoff
Bias/Variance is a Way to Understand Overfitting and Underfitting
Bias and Variance
Bias and Variance
Bias and Variance
Bias and Variance
Bias and Variance
Bias and Variance
References

1. Source of the slides: DeepLearning.AI




Lecture No.5.1- Module 5
Session Content

M5 Classification Models I (T1: Chapter 4, 5; T2: Chapter 5)

1. Logistic Regression
2. Support Vector Machines
3. Naïve Bayes
4. Comparative Analysis
Session Content

• What is Classification?
• Linear Classifier
• Generative and Discriminative Classifiers
Classification
Definition
• Given a collection of records (training set)
• Each record is characterized by a tuple (x, y), where x is the attribute (feature) set and y is the
class label
• x aka attribute, predictor, independent variable, input
• y aka class, response, dependent variable, output

• Task
• Learn a model or function that maps each attribute set x into one of the predefined class labels y

Task                          Attribute set, x                                            Class label, y
Categorizing email messages   Features extracted from email message header and content    spam or non-spam
Identifying tumor cells       Features extracted from x-rays or MRI scans                 malignant or benign cells
Cataloging galaxies           Features extracted from telescope images                    Elliptical, spiral, or irregular-shaped galaxies
General Approach for Building Classification Model

Training Set
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Induction: Training Set → Learning algorithm → Learn Model → Model
Deduction: Model → Apply Model → Test Set

Test Set
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?
Types of Classifiers
Linear Classifier
• Classes are separated by a linear decision surface (e.g., straight line
in 2-dimensional feature/attribute space)
• If for a given record, linear combination of features xi is >= 0, i.e.,
$w_0 + \sum_i w_i x_i \ge 0$
it belongs to one class (say, y = 1), else it belongs to the other class (say, y = 0 or -1)
• wi s are learned during the training (induction) phase of the classifier.
• Learnt wi s are applied to a test record during the deduction / inferencing phase.
[Figure: linear decision boundary separating y = 1 from y = 0 in the (x1, x2) plane]

• In nonlinear classification, classes are separated by a non-linear surface
Generative Vs. Discriminative Models
Generative Model
• Class-conditional probability distribution of attribute/feature set and prior probability of classes
are learnt during the training phase
• Given these learnt probabilities, during inferencing phase, probability of a test record belonging to
different classes are calculated and compared.
• Can result in linear or nonlinear decision surface
[Figure: generative model in the (x1, x2) plane]

Discriminative Model
• Given a training set, a function f is learnt that directly maps an attribute/feature vector x to the
output class (y = 1 or 0/-1)
• A linear function f results in linear decision surface
[Figure: discriminative model, decision boundary separating y = 1 from y = 0 in the (x1, x2) plane]
Thank You!
In our next session: Naïve Bayes Classifier
Classification Model I
Naïve Bayes Classifier


Lecture No.5.2- Module 5
Session Content

• What is Bayes Classifier?


• Bayes Theorem and its application in classification
• Conditional Independence and Naïve Bayes Classifier
• Laplace Smoothing
• Applications of Naïve Bayes Classification
• Text classification
• Image classification

Bayes Classifier
A generative framework for solving classification problems

• Conditional Probability:
$P(Y \mid X) = \frac{P(X, Y)}{P(X)}$
$P(X \mid Y) = \frac{P(X, Y)}{P(Y)}$
• Bayes theorem:
$P(Y \mid X) = \frac{P(X \mid Y)\, P(Y)}{P(X)}$
Using Bayes Theorem for Classification
Consider each attribute and class label as random variables
• Given a record with attributes (X1, X2,…, Xd)
• Goal is to predict class Y (“Evade”)
given (X1,X2,X3) = (<Refund>, <Marital Status>, <Taxable Income>)
• Specifically, we want to find the value of Y that maximizes P(Y| X1, X2,…, Xd )
• Can we estimate P(Y| X1, X2,…, Xd ) directly from data?

Tid  Refund (categorical)  Marital Status (categorical)  Taxable Income (continuous)  Evade (class)
1    Yes                   Single                        125K                         No
2    No                    Married                       100K                         No
3    No                    Single                        70K                          No
4    Yes                   Married                       120K                         No
5    No                    Divorced                      95K                          Yes
6    No                    Married                       60K                          No
7    Yes                   Divorced                      220K                         No
8    No                    Single                        85K                          Yes
9    No                    Married                       75K                          No
10   No                    Single                        90K                          Yes
Using Bayes Theorem for Classification
Approach
• Compute posterior probability P(Y | X1, X2, …, Xd) using the Bayes theorem

$P(Y \mid X_1 X_2 \ldots X_d) = \frac{P(X_1 X_2 \ldots X_d \mid Y)\, P(Y)}{P(X_1 X_2 \ldots X_d)}$

• Maximum a-posteriori: Choose Y that maximizes P(Y | X1, X2, …, Xd)

• Equivalent to choosing value of Y that maximizes P(X1, X2, …, Xd|Y) P(Y)

• How to estimate P(X1, X2, …, Xd | Y )?


Example Data
Given a Test Record
X = (Refund = No, Divorced, Income = 120K)
• Can we estimate P(Evade = Yes | X) and P(Evade = No | X)?
• Training data: the 10 records (Tid 1 to 10) listed under “Using Bayes Theorem for Classification”
Naïve Bayes Classifier
Assume conditional independence

• Among attributes Xi when class is given:
• P(X1, X2, …, Xd |Yj) = P(X1| Yj) P(X2| Yj)… P(Xd| Yj)

• Now we can estimate P(Xi| Yj) for all Xi and Yj combinations from the training data

• New point is classified to Yj if $P(Y_j) \prod_i P(X_i \mid Y_j)$ is maximal.


Naïve Bayes on Example Data
Given a Test Record
X = (Refund = No, Divorced, Income = 120K)

P(X | Yes) = P(Refund = No | Yes) × P(Divorced | Yes) × P(Income = 120K | Yes)

P(X | No) = P(Refund = No | No) × P(Divorced | No) × P(Income = 120K | No)
Estimate Probabilities from Data
Discrete Attributes
• P(y) = fraction of instances of class y
• e.g., P(No) = 7/10, P(Yes) = 3/10
• For categorical attributes P(Xi = c | y) = nc / n
• nc is number of instances having attribute value Xi = c and belonging to class y
• n is number of instances belonging to class y
• e.g., P(Status=Married | y=No) = 4/7, P(Refund=Yes | y=Yes) = 0
Estimate Probabilities from Data
Continuous Attributes

• Discretization
• Partition the range into bins
• Replace continuous value with bin value
• Attribute changed from continuous to ordinal
• Probability density estimation
• Assume attribute follows a normal distribution
• Use data to estimate parameters of distribution
(e.g., mean and standard deviation)
• Once probability distribution is known, use it to estimate the conditional probability
P(Xi|Y)
Estimate Probabilities from Data
Normal Distribution
• One for each (Xi, Yj) pair:
$P(X_i \mid Y_j) = \frac{1}{\sqrt{2\pi\sigma_{ij}^2}} \exp\left( -\frac{(X_i - \mu_{ij})^2}{2\sigma_{ij}^2} \right)$
• For (Income, Class=No)
• If Class=No
• sample mean = 110
• sample variance = 2975
$P(\text{Income} = 120 \mid \text{No}) = \frac{1}{\sqrt{2\pi}\,(54.54)} \exp\left( -\frac{(120 - 110)^2}{2(2975)} \right) = 0.0072$
Example of Naïve Bayes Classifier
Given a Test Record
X = (Refund = No, Divorced, Income = 120K)

P(Refund = Yes | No) = 3/7
P(Refund = No | No) = 4/7
P(Refund = Yes | Yes) = 0
P(Refund = No | Yes) = 1
P(Marital Status = Single | No) = 2/7
P(Marital Status = Divorced | No) = 1/7
P(Marital Status = Married | No) = 4/7
P(Marital Status = Single | Yes) = 2/3
P(Marital Status = Divorced | Yes) = 1/3
P(Marital Status = Married | Yes) = 0
For Taxable Income:
If class = No: sample mean = 110, sample variance = 2975
If class = Yes: sample mean = 90, sample variance = 25

• P(X | No) = P(Refund=No | No) × P(Divorced | No) × P(Income=120K | No)
= 4/7 × 1/7 × 0.0072 = 0.0006
• P(X | Yes) = P(Refund=No | Yes) × P(Divorced | Yes) × P(Income=120K | Yes)
= 1 × 1/3 × 1.2 × 10^-9 = 4 × 10^-10
• P(X|No)P(No) > P(X|Yes)P(Yes)
• Therefore, P(No|X) > P(Yes|X) => Class = No
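The worked example can be reproduced in a few lines of Python. This is a sketch, not a general Naïve Bayes implementation: it hard-codes the 10-record tax table, uses plain relative frequencies for the two categorical attributes, and fits a normal distribution to income with the sample variance (dividing by n − 1), which is the convention that yields 2975 and 25.

```python
import math
import statistics

# (Refund, Marital Status, Taxable Income in K, Evade)
TRAIN = [
    ("Yes", "Single", 125, "No"), ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"), ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"), ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"), ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"), ("No", "Single", 90, "Yes"),
]


def gaussian_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def class_score(refund, status, income, label):
    """P(label) * P(refund | label) * P(status | label) * P(income | label)."""
    rows = [r for r in TRAIN if r[3] == label]
    prior = len(rows) / len(TRAIN)
    p_refund = sum(r[0] == refund for r in rows) / len(rows)
    p_status = sum(r[1] == status for r in rows) / len(rows)
    incomes = [r[2] for r in rows]
    p_income = gaussian_pdf(income, statistics.mean(incomes), statistics.variance(incomes))
    return prior * p_refund * p_status * p_income


score_no = class_score("No", "Divorced", 120, "No")
score_yes = class_score("No", "Divorced", 120, "Yes")
prediction = "No" if score_no > score_yes else "Yes"
print(score_no, score_yes, prediction)
```

The scores include the priors P(No) = 7/10 and P(Yes) = 3/10, so they correspond to P(X|No)P(No) and P(X|Yes)P(Yes) in the comparison above.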
Issues with Naïve Bayes Classifier
Consider the table with Tid = 7 deleted

Tid  Refund  Marital Status  Taxable Income  Evade
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No    (deleted)
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

P(Refund = Yes | No) = 2/6
P(Refund = No | No) = 4/6
P(Refund = Yes | Yes) = 0
P(Refund = No | Yes) = 1
P(Marital Status = Single | No) = 2/6
P(Marital Status = Divorced | No) = 0
P(Marital Status = Married | No) = 4/6
P(Marital Status = Single | Yes) = 2/3
P(Marital Status = Divorced | Yes) = 1/3
P(Marital Status = Married | Yes) = 0/3
For Taxable Income:
If class = No: sample mean = 91, sample variance = 685
If class = Yes: sample mean = 90, sample variance = 25

Given X = (Refund = Yes, Divorced, 120K)

P(X | No) = 2/6 × 0 × 0.0083 = 0
P(X | Yes) = 0 × 1/3 × 1.2 × 10^-9 = 0

Naïve Bayes will not be able to classify X as Yes or No!
Issues with Naïve Bayes Classifier
If one of the conditional probabilities is zero, then the entire expression becomes zero
• Need to use other estimates of conditional probabilities than simple fractions
• Probability estimation:
original: $P(X_i = c \mid y) = \frac{n_c}{n}$
Laplace estimate: $P(X_i = c \mid y) = \frac{n_c + 1}{n + v}$
m-estimate: $P(X_i = c \mid y) = \frac{n_c + m p}{n + m}$
where
n: number of training instances belonging to class y
nc: number of instances with Xi = c and Y = y
v: total number of attribute values that Xi can take
p: initial estimate of P(Xi = c | y), known a priori
m: hyper-parameter for our confidence in p
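The two smoothed estimates are one-liners. The example values below come from the table with Tid = 7 deleted, where no "No" record is Divorced (nc = 0, n = 6) and Marital Status takes v = 3 values; the choice m = 3, p = 1/3 for the m-estimate is an assumption made here for illustration.

```python
def laplace_estimate(n_c, n, v):
    """Laplace estimate of P(Xi = c | y)."""
    return (n_c + 1) / (n + v)


def m_estimate(n_c, n, m, p):
    """m-estimate of P(Xi = c | y), with prior guess p weighted by m."""
    return (n_c + m * p) / (n + m)


# P(Marital Status = Divorced | No) after deleting Tid 7: the raw estimate is 0/6.
print(laplace_estimate(0, 6, 3))      # 1/9 instead of 0
print(m_estimate(0, 6, m=3, p=1 / 3))
```

With p = 1/v and m = v, the m-estimate coincides with the Laplace estimate, so Laplace smoothing can be read as assuming a uniform prior over the attribute values.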
Example of Naïve Bayes Classifier
Name           Give Birth  Can Fly  Live in Water  Have Legs  Class
human          yes         no       no             yes        mammals
python         no          no       no             no         non-mammals
salmon         no          no       yes            no         non-mammals
whale          yes         no       yes            no         mammals
frog           no          no       sometimes      yes        non-mammals
komodo         no          no       no             yes        non-mammals
bat            yes         yes      no             yes        mammals
pigeon         no          yes      no             yes        non-mammals
cat            yes         no       no             yes        mammals
leopard shark  yes         no       yes            no         non-mammals
turtle         no          no       sometimes      yes        non-mammals
penguin        no          no       sometimes      yes        non-mammals
porcupine      yes         no       no             yes        mammals
eel            no          no       yes            no         non-mammals
salamander     no          no       sometimes      yes        non-mammals
gila monster   no          no       no             yes        non-mammals
platypus       no          no       no             yes        mammals
owl            no          yes      no             yes        non-mammals
dolphin        yes         no       yes            no         mammals
eagle          no          yes      no             yes        non-mammals

Test record:
Give Birth  Can Fly  Live in Water  Have Legs  Class
yes         no       yes            no         ?

A: attributes, M: mammals, N: non-mammals

P(A | M) = 6/7 × 6/7 × 2/7 × 2/7 = 0.06
P(A | N) = 1/13 × 10/13 × 3/13 × 4/13 = 0.0042
P(A | M) P(M) = 0.06 × 7/20 = 0.021
P(A | N) P(N) = 0.004 × 13/20 = 0.0027

P(A|M)P(M) > P(A|N)P(N) => Mammals
Thank You!
In our next session: Applications in Text and Image
Classification Model I
Naïve Bayes Applications


Lecture No.5.3- Module 5
Naïve Bayes Classifier Applications

Baseline: Bag of Words Approach

aardvark 0
about 2
all 2
Africa 1
apple 0
anxious 0
...
gas 1
...
oil 1
...
Zaire 0
Text Classification: A Simple Example

Text                              Tag
“A great game”                    Sports
“The election was over”           Not sports
“Very clean match”                Sports
“A clean but forgettable game”    Sports
“It was a close election”         Not sports

• Which tag does the sentence “A very close game” belong to? i.e., P(sports | A very close game)
• Feature Engineering: Bag of words, i.e., use word frequencies without considering order
• Using Bayes Theorem:
$P(\text{sports} \mid \text{A very close game}) = \frac{P(\text{A very close game} \mid \text{sports})\, P(\text{sports})}{P(\text{A very close game})}$

• We assume that every word in a sentence is independent of the other ones

• “close” doesn’t appear in sentences of sports tag, so P(close | sports) = 0, which makes the
product 0
Laplace smoothing

• Laplace smoothing: we add 1 or in general constant k to every count so it’s never zero.
• To balance this, we add the number of possible words to the divisor, so the division will never be
greater than 1
• In our case, the 14 possible words are
{a,great,very,over,it,but,game,election,clean,close,the,was,forgettable,match}

Apply Laplace Smoothing

Word    P(word | Sports)     P(word | Not Sports)
a       (2+1) / (11+14)      (1+1) / (9+14)
very    (1+1) / (11+14)      (0+1) / (9+14)
close   (0+1) / (11+14)      (1+1) / (9+14)
game    (2+1) / (11+14)      (0+1) / (9+14)
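The smoothed table can be checked, and the sentence classified, with a short script. This is a sketch that hard-codes the five training sentences in lower case and multiplies in the class priors (3/5 and 2/5), which Bayes theorem requires even though the table lists only the word likelihoods.

```python
from collections import Counter

TRAIN = [
    ("a great game", "Sports"),
    ("the election was over", "Not sports"),
    ("very clean match", "Sports"),
    ("a clean but forgettable game", "Sports"),
    ("it was a close election", "Not sports"),
]

# All distinct words seen in training (14 of them).
vocab = {word for text, _ in TRAIN for word in text.split()}


def score(sentence, tag):
    """P(tag) times the product of Laplace-smoothed P(word | tag) over the sentence."""
    docs = [text for text, t in TRAIN if t == tag]
    counts = Counter(word for text in docs for word in text.split())
    total = sum(counts.values())
    prob = len(docs) / len(TRAIN)  # class prior
    for word in sentence.split():
        prob *= (counts[word] + 1) / (total + len(vocab))
    return prob


sports = score("a very close game", "Sports")
not_sports = score("a very close game", "Not sports")
print(sports, not_sports)
```

The "Sports" score comes out larger, so the sentence is tagged Sports even though "close" never appears in a Sports training sentence.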

Experiment with NewsGroups

• Given 1000 training documents from each group, learn to classify new documents according to
which newsgroup they came from

• Naive Bayes: 89% classification accuracy

comp.graphics, comp.os.ms-windows.misc, comp.sys.ibm.pc.hardware, comp.sys.mac.hardware, comp.windows.x
misc.forsale, rec.autos, rec.motorcycles, rec.sport.baseball, rec.sport.hockey
alt.atheism, soc.religion.christian, talk.religion.misc, talk.politics.mideast, talk.politics.misc, talk.politics.guns
sci.space, sci.crypt, sci.electronics, sci.med
Learning Curve for Newsgroups
Accuracy vs. Training set size
• 1/3 withheld for test
Image Classification
Example: Character Recognition

Xi is intensity at i-th pixel

• Given a test image X, calculate P(Y=yk | X) ∝ P(yk) Πi P(Xi | yk) for all k and choose the class with the largest value
Example: Character Recognition
Estimating parameters: Y discrete, Xi continuous

• δ(k) is Kronecker’s delta – equals 1 if argument is true
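
The estimation formulas for this slide did not survive extraction. As a hedged sketch, assuming each pixel intensity Xi is modeled as a Gaussian per class (the Gaussian Naïve Bayes setting), the maximum-likelihood estimates are per-class means and variances, where the Kronecker delta simply selects the training rows belonging to class yk. The function names and the toy data below are made up for illustration.

```python
import math

def fit_gaussian_nb(X, y):
    """Estimate P(y_k) and the per-feature mean/variance for each class (MLE)."""
    params = {}
    for label in set(y):
        # Rows where delta(Y = y_k) equals 1.
        rows = [x for x, t in zip(X, y) if t == label]
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        variances = [sum((v - m) ** 2 for v in col) / n
                     for col, m in zip(zip(*rows), means)]
        params[label] = (n / len(y), means, variances)
    return params

def log_posterior(x, prior, means, variances, eps=1e-9):
    """log P(y_k) + sum_i log N(x_i; mu_ik, sigma_ik^2), up to a shared constant."""
    total = math.log(prior)
    for xi, m, v in zip(x, means, variances):
        v += eps  # hypothetical guard against a zero variance
        total += -0.5 * math.log(2 * math.pi * v) - (xi - m) ** 2 / (2 * v)
    return total

def predict(x, params):
    """Pick the class with the largest (log) posterior."""
    return max(params, key=lambda label: log_posterior(x, *params[label]))

# Toy data (made up): two "pixels" per image, two classes.
X = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.7]]
y = ["A", "A", "B", "B"]
params = fit_gaussian_nb(X, y)
print(predict([0.85, 0.15], params))
```

Working in log space avoids underflow when the product runs over many pixels.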


Thank You!
In our next session: Logistic Regression
Classification Model I
Logistic Regression
Lecture No.5.4- Module 5
Session Content

• What is Logistic Regression?


• Sigmoid function
• Log loss Cost Function
• Optimization Using Gradient Descent
• Regularization
• Multi-class classification

Logistic regression

• Logistic Regression could help us predict, for example, whether a student passed or
failed. Logistic regression predictions are discrete (only specific values or categories are
allowed). We can also view the probability scores underlying the model’s classifications.

• In comparison, Linear Regression could help us predict the student’s test score on a scale
of 0 - 100. Linear regression predictions are continuous (numbers in a range).

• Idea
• Naïve Bayes allows computing P(Y|X) by learning P(Y) and P(X|Y)
• Why not learn P(Y|X) directly?
Sigmoid/Logistic Function
Classification requires discrete output values
• For example, output y = 0 or 1 for a two-category classification problem
• In logistic regression, the sigmoid/logistic function $h_\theta(x)$ takes a real vector x as input and outputs a
value between 0 and 1

• $\theta$ is a vector of logistic regression parameters

$$h_\theta(x) = \frac{1}{1 + e^{-\theta^\top x}}$$

(Figure: plot of the sigmoid g(z) against z)
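Logistic regression

$$h_\theta(x) = g(\theta^\top x), \qquad g(z) = \frac{1}{1 + e^{-z}}, \qquad z = \theta^\top x$$

• Suppose we predict “y = 1” if $h_\theta(x) \ge 0.5$, i.e. $z = \theta^\top x \ge 0$
• and predict “y = 0” if $h_\theta(x) < 0.5$, i.e. $z = \theta^\top x < 0$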
Decision boundary
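At the decision boundary, the output of the logistic regressor is 0.5

• $h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2)$
• e.g., $\theta_0 = -3$, $\theta_1 = 1$, $\theta_2 = 1$
• Predict “y = 1” if $-3 + x_1 + x_2 \ge 0$

(Figure: decision boundary plotted on axes labeled Tumor Size and Age)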
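
The sigmoid, the 0.5 threshold, and the example boundary with parameters (-3, 1, 1) can be checked numerically. This is a small sketch; the helper names `hypothesis` and `predict` are illustrative, and the input x is assumed to exclude the bias term.

```python
import math

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def hypothesis(theta, x):
    """h_theta(x) = g(theta^T x); theta[0] is theta_0, x excludes the bias term."""
    z = theta[0] + sum(t * xi for t, xi in zip(theta[1:], x))
    return sigmoid(z)

def predict(theta, x, threshold=0.5):
    """Predict 1 when h_theta(x) >= threshold, which for 0.5 means z >= 0."""
    return 1 if hypothesis(theta, x) >= threshold else 0

theta = [-3.0, 1.0, 1.0]                     # parameters from the slide
on_boundary = hypothesis(theta, [1.0, 2.0])  # x1 + x2 = 3, so z = 0
print(on_boundary, predict(theta, [2.0, 2.0]), predict(theta, [1.0, 1.0]))
```

Any point with x1 + x2 = 3 lies exactly on the boundary and gets an output of 0.5.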
Learning Model Parameters

• Training set: $\{(x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)})\}$
• m examples
• n features

• How to choose the parameters (feature weights) $\theta$?


MSE Cost Function
• The MSE cost function is non-convex for logistic regression because of the sigmoid function
• (Figure: two cost curves, one labeled “non-convex” and one labeled “convex”)
• On a non-convex cost, gradient descent may not find the global minimum
• So instead of Mean Squared Error, use an error/cost function called Cross-Entropy, also known
as Log Loss
Cost function for Logistic Regression
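Cross Entropy or Log Loss Function

$$\mathrm{Cost}(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}$$

Equivalently, in a single expression:

$$\mathrm{Cost}(h_\theta(x), y) = -y \log(h_\theta(x)) - (1 - y) \log(1 - h_\theta(x))$$

(Figure: cost plotted against $h_\theta(x)$ from 0 to 1, one panel for y = 1 and one for y = 0)

• J(θ), the average of this cost over the training set, is convex
• Apply gradient descent on J(θ) w.r.t. θ to find optimal parameters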
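
A short sketch of the log loss shows why it suits classification: a confident correct prediction costs almost nothing, while a confident wrong one is penalized heavily. The function names are illustrative, and predictions are assumed to lie strictly between 0 and 1 so that the logarithms are defined.

```python
import math

def cost(h, y):
    """Cross-entropy cost for one example: -y*log(h) - (1-y)*log(1-h)."""
    return -y * math.log(h) - (1 - y) * math.log(1 - h)

def log_loss(predictions, labels):
    """Average cost over all m examples."""
    return sum(cost(h, y) for h, y in zip(predictions, labels)) / len(labels)

confident_right = cost(0.99, 1)   # true label 1, predicted 0.99: small cost
confident_wrong = cost(0.01, 1)   # true label 1, predicted 0.01: large cost
print(confident_right, confident_wrong, log_loss([0.9, 0.2], [1, 0]))
```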
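Gradient descent

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - h_\theta(x^{(i)})\right) \right]$$

Goal: $\min_\theta J(\theta)$

Good news: Convex function!
Bad news: No analytical solution

Repeat {
    $\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$
}
(Simultaneously update all $\theta_j$)

$$\frac{\partial}{\partial \theta_j} J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$

Slide credit: Andrew Ng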
Regularization of Logistic Regression
Maximum a posteriori estimate of W = θ

• λ is “regularization” constant
• helps reduce overfitting
• keep weights nearer to zero
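
The gradient-descent update and the effect of the regularization constant λ can be sketched together. The slide's regularized objective was not preserved, so this example assumes the common L2 form, adding (λ / 2m) Σ θj² to the cost and leaving the bias θ0 unpenalized. The function name, learning rate, iteration count, and toy data are all illustrative choices.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, alpha=0.5, lam=0.0, iterations=2000):
    """Batch gradient descent for logistic regression with optional L2 penalty.

    X: list of feature lists (without the bias term); y: list of 0/1 labels.
    lam is the regularization constant; theta[0] (bias) is not penalized.
    """
    m, n = len(X), len(X[0])
    theta = [0.0] * (n + 1)
    rows = [[1.0] + list(x) for x in X]          # prepend x_0 = 1
    for _ in range(iterations):
        errors = [sigmoid(sum(t * v for t, v in zip(theta, row))) - label
                  for row, label in zip(rows, y)]
        grad = [sum(e * row[j] for e, row in zip(errors, rows)) / m
                for j in range(n + 1)]
        for j in range(1, n + 1):
            grad[j] += lam / m * theta[j]        # gradient of (lam / 2m) * theta_j^2
        # Simultaneous update of every theta_j.
        theta = [t - alpha * g for t, g in zip(theta, grad)]
    return theta

# Toy 1-D data (made up): the label is 1 when x is large.
X = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
y = [0, 0, 0, 1, 1, 1]
theta_plain = fit_logistic(X, y)
theta_reg = fit_logistic(X, y, lam=1.0)
print(theta_plain, theta_reg)
```

On this separable toy set the unregularized weight keeps growing with more iterations, while the regularized weight settles at a smaller value, which is the "weights nearer to zero" behavior described above.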
Logistic regression more generally

• Logistic regression when Y is not boolean (but still discrete-valued).
• Now $y \in \{y_1, \ldots, y_R\}$: learn R-1 sets of weights

For k<R

For k=R
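
The formulas for the two cases were lost in extraction. A standard form consistent with "R-1 sets of weights" treats the last class as a reference: for k < R the probability is exp(score_k) / (1 + Σ exp(score_j)), and for k = R it is 1 / (1 + Σ exp(score_j)). The sketch below assumes that form; the function name and the weights are hypothetical.

```python
import math

def class_probabilities(weights, x):
    """P(Y = y_k | x) for R classes using R-1 weight vectors.

    weights[k] = [w_k0, w_k1, ..., w_kn] for the first R-1 classes; the last
    class y_R is the reference class and has no weights of its own.
    """
    scores = [math.exp(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)))
              for w in weights]
    denom = 1.0 + sum(scores)
    return [s / denom for s in scores] + [1.0 / denom]

# Hypothetical weights for R = 3 classes and 2 features.
weights = [[0.5, 1.0, -1.0], [-0.5, -1.0, 1.0]]
probs = class_probabilities(weights, [2.0, 0.0])
print(probs)
```

With R = 2 this reduces to the ordinary sigmoid of binary logistic regression.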
Multi-class classification

(Figure: two scatter plots over features $x_1$ and $x_2$, one titled “Binary classification” and one titled “Multiclass classification”)
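One-vs-all (one-vs-rest)

• Class 1: $h_\theta^{(1)}(x)$
• Class 2: $h_\theta^{(2)}(x)$
• Class 3: $h_\theta^{(3)}(x)$

$$h_\theta^{(i)}(x) = P(y = i \mid x; \theta) \qquad (i = 1, 2, 3)$$

(Figure: three-class data over $x_1$ and $x_2$, with one binary decision boundary per class)

Slide credit: Andrew Ng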
One-vs-all

• Train a logistic regression classifier $h_\theta^{(i)}(x)$ for each class i to predict the probability that y = i

• Given a new input x, pick the class i that maximizes $h_\theta^{(i)}(x)$:

$$\max_i h_\theta^{(i)}(x)$$
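
The prediction step of one-vs-all can be sketched as follows. The parameters are hand-picked for illustration rather than trained, and the function name is hypothetical; in practice each θ would come from fitting one binary logistic regression per class.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def one_vs_all_predict(classifiers, x):
    """Return the class whose binary classifier gives the highest h_theta^(i)(x).

    classifiers maps a class label to its theta (theta[0] is the bias term).
    """
    def h(theta):
        return sigmoid(theta[0] + sum(t * xi for t, xi in zip(theta[1:], x)))
    return max(classifiers, key=lambda label: h(classifiers[label]))

# Hypothetical, hand-picked parameters for three classes over (x1, x2).
classifiers = {
    1: [-1.0, 2.0, -2.0],   # high score when x1 is much larger than x2
    2: [-1.0, -2.0, 2.0],   # high score when x2 is much larger than x1
    3: [3.0, -1.0, -1.0],   # high score when both features are small
}
print(one_vs_all_predict(classifiers, [3.0, 0.0]))
```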
Logistic Regression Applications

• Credit Card Fraud : Predicting if a given credit card transaction is fraud or not
• Health : Predicting if a given mass of tissue is benign or malignant
• Marketing : Predicting if a given user will buy an insurance product or not
• Banking : Predicting if a customer will default on a loan.



Thank You!
In our next session: Support Vector Machine
Classification Model I
Support Vector Machines
Lecture No.5.5- Module 5
Session Content

• Introduction
• Support Vectors
• Linear Support Vector Machine
• Maximizing Margin
• Handling non linearly separable data
• Non-linear Classification
• Kernel Functions

Support Vector Machines
Find a linear hyperplane (decision boundary) that will separate the data

• Many separating hyperplanes are possible, for example B1 and B2
(Figure: the same data set with candidate decision boundaries B1 and B2)

• Which one is better? B1 or B2?
• How do you define better?

Find the hyperplane that maximizes the margin
• The margin of B1 (between b11 and b12) is wider than the margin of B2 (between b21 and b22)
• => B1 is better than B2
Support Vector Machines
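• Decision boundary B1: $\vec{w} \cdot \vec{x} + b = 0$
• Margin hyperplanes b11 and b12: $\vec{w} \cdot \vec{x} + b = 1$ and $\vec{w} \cdot \vec{x} + b = -1$

$$f(\vec{x}) = \begin{cases} 1 & \text{if } \vec{w} \cdot \vec{x} + b \ge 1 \\ -1 & \text{if } \vec{w} \cdot \vec{x} + b \le -1 \end{cases} \qquad \text{Margin} = \frac{2}{\|\vec{w}\|}$$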
Linear SVM
Linear Model

$$f(\vec{x}) = \begin{cases} 1 & \text{if } \vec{w} \cdot \vec{x} + b \ge 1 \\ -1 & \text{if } \vec{w} \cdot \vec{x} + b \le -1 \end{cases}$$

• Learning the model is equivalent to determining the values of $\vec{w}$ and b
• How to find $\vec{w}$ and b from training data?
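Learning Linear SVM
• Objective is to maximize:
$$\text{Margin} = \frac{2}{\|\vec{w}\|}$$
• Which is equivalent to minimizing:
$$L(\vec{w}) = \frac{\|\vec{w}\|^2}{2}$$
• Subject to the following constraints:
$$y_i = \begin{cases} 1 & \text{if } \vec{w} \cdot \vec{x}_i + b \ge 1 \\ -1 & \text{if } \vec{w} \cdot \vec{x}_i + b \le -1 \end{cases}$$
or, equivalently,
$$y_i (\vec{w} \cdot \vec{x}_i + b) \ge 1, \quad i = 1, 2, \ldots, N$$
• This is a constrained optimization problem
• Solve it using the Lagrange multiplier method:
$$L(\vec{w}, b, \lambda) = \frac{1}{2} \|\vec{w}\|^2 - \sum_i \lambda_i \left[ y_i (\vec{w}^\top \vec{x}_i + b) - 1 \right]$$
• Lagrange multipliers $\lambda_i$ are 0 or positive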
Learning Linear SVM

• $\lambda_i$ is zero for every training point except the support vectors
Example of Linear SVM

Support vectors are the first two rows, the only ones with a non-zero λ.

x1       x2       y     λ
0.3858   0.4687    1    65.5261
0.4871   0.611    -1    65.5261
0.9218   0.4103   -1    0
0.7382   0.8936   -1    0
0.1763   0.0579    1    0
0.4057   0.3529    1    0
0.9355   0.8132   -1    0
0.2146   0.0099    1    0
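
The weight vector and bias can be recovered from this table. The sketch below relies on two standard results of the Lagrangian formulation that are not spelled out on the slides: w = Σ λi yi xi, and yi (w · xi + b) = 1 for every support vector. Variable names are illustrative.

```python
import math

# Training data from the table: ((x1, x2), y, lambda).
DATA = [
    ((0.3858, 0.4687),  1, 65.5261),
    ((0.4871, 0.6110), -1, 65.5261),
    ((0.9218, 0.4103), -1, 0.0),
    ((0.7382, 0.8936), -1, 0.0),
    ((0.1763, 0.0579),  1, 0.0),
    ((0.4057, 0.3529),  1, 0.0),
    ((0.9355, 0.8132), -1, 0.0),
    ((0.2146, 0.0099),  1, 0.0),
]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

# w = sum_i lambda_i * y_i * x_i; only the support vectors contribute.
w = [sum(lam * y * x[j] for x, y, lam in DATA) for j in range(2)]

# For a support vector, y_i * (w . x_i + b) = 1, so b = y_i - w . x_i.
support = [(x, y) for x, y, lam in DATA if lam > 0]
b = sum(y - dot(w, x) for x, y in support) / len(support)

margin = 2.0 / math.sqrt(dot(w, w))

def classify(x):
    """Sign of w . x + b."""
    return 1 if dot(w, x) + b >= 0 else -1

print(w, b, margin)
```

Averaging b over both support vectors smooths out the rounding in the tabulated values; every training point is then classified with its listed label.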
Learning Linear SVM

• Decision boundary depends only on support vectors

• If another data set has the same support vectors, the decision boundary will not change

• How to classify using SVM once w and b are found? Given a test record $\vec{x}_i$:

$$f(\vec{x}_i) = \begin{cases} 1 & \text{if } \vec{w} \cdot \vec{x}_i + b \ge 1 \\ -1 & \text{if } \vec{w} \cdot \vec{x}_i + b \le -1 \end{cases}$$
Support Vector Machines
What if the problem is not linearly separable?
• Introduce slack variables $\xi_i$
• Need to minimize
$$L(\vec{w}) = \frac{\|\vec{w}\|^2}{2} + C \left( \sum_{i=1}^{N} \xi_i^k \right)$$
• subject to
$$y_i = \begin{cases} 1 & \text{if } \vec{w} \cdot \vec{x}_i + b \ge 1 - \xi_i \\ -1 & \text{if } \vec{w} \cdot \vec{x}_i + b \le -1 + \xi_i \end{cases}$$
• If k is 1 or 2, this leads to a similar objective function as linear SVM but with different constraints
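
To make the slack variables concrete, the sketch below evaluates the soft-margin objective for a fixed w and b. It assumes each slack takes its smallest feasible value, max(0, 1 - yi (w · xi + b)), which is what the minimization would choose; the function name and the toy data are made up.

```python
def soft_margin_objective(w, b, X, y, C=1.0, k=1):
    """Return (||w||^2 / 2 + C * sum_i xi_i^k, slacks) for fixed w and b.

    xi_i = max(0, 1 - y_i * (w . x_i + b)) is the smallest slack that
    satisfies the relaxed constraint for point i.
    """
    slacks = [max(0.0, 1.0 - yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b))
              for xi, yi in zip(X, y)]
    objective = 0.5 * sum(wj * wj for wj in w) + C * sum(s ** k for s in slacks)
    return objective, slacks

# Toy 1-D data (made up); the last point sits on the wrong side of the boundary.
X = [[2.0], [3.0], [-2.0], [-3.0], [-0.5]]
y = [1, 1, -1, -1, 1]
objective, slacks = soft_margin_objective([1.0], 0.0, X, y, C=2.0)
print(objective, slacks)
```

Points outside the margin get zero slack; only the misplaced point contributes to the penalty term, and C controls how heavily that violation is weighted against a wide margin.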
Nonlinear Support Vector Machines
What if the decision boundary is not linear?
Transform the data into a higher-dimensional space

Decision boundary:
$$\vec{w} \cdot \Phi(\vec{x}) + b = 0$$
Learning Nonlinear SVM
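Optimization Problem

$$\min_{\vec{w}} \frac{\|\vec{w}\|^2}{2} \quad \text{subject to} \quad y_i (\vec{w} \cdot \Phi(\vec{x}_i) + b) \ge 1, \; i = 1, 2, \ldots, N$$

• which leads to the same set of equations (but involving $\Phi(x)$ instead of x)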
Learning NonLinear SVM
Issues
• What type of mapping function $\Phi$ should be used?
• How to do the computation in high dimensional space?
• Most computations involve the dot product $\Phi(x_i) \cdot \Phi(x_j)$
• Curse of dimensionality?
Learning Nonlinear SVM

• Kernel Trick:
• $\Phi(x_i) \cdot \Phi(x_j) = K(x_i, x_j)$
• $K(x_i, x_j)$ is a kernel function (expressed in terms of the coordinates in the original space)
• Examples:
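
The example kernels from this slide were not preserved. As one illustration of the kernel trick, assuming the homogeneous degree-2 polynomial kernel K(x, z) = (x · z)² on 2-D inputs, the sketch below checks that the kernel value equals the dot product under an explicit 3-D mapping. The particular `phi` shown is one possible mapping for this kernel, chosen for illustration.

```python
import math

def phi(x):
    """Explicit degree-2 feature map for a 2-D input (one possible choice)."""
    x1, x2 = x
    return [x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2]

def poly2_kernel(x, z):
    """K(x, z) = (x . z)^2, computed entirely in the original 2-D space."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

x, z = [1.0, 2.0], [3.0, 0.5]
explicit = dot(phi(x), phi(z))   # dot product in the 3-D feature space
implicit = poly2_kernel(x, z)    # same value without ever building phi
print(explicit, implicit)
```

The kernel needs only one 2-D dot product and a squaring, regardless of how large the implied feature space is.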
Example of Nonlinear SVM

SVM with polynomial degree 2 kernel


Learning Nonlinear SVM

• Advantages of using kernel:
• Don’t have to know the mapping function $\Phi$
• Computing the dot product $\Phi(x_i) \cdot \Phi(x_j)$ in the original space avoids the curse of dimensionality

• Not all functions can be kernels
• Must make sure there is a corresponding $\Phi$ in some high-dimensional space
• Mercer’s theorem (see textbook)
Thank You!
In our next session: Comparison and Applicability
Classification Model I
Comparison and Applicability
Lecture No.5.6- Module 5
Session Content

• Pros and Cons


• Naïve Bayes Classifier
• Logistic Regression
• Support Vector Machines
Generative vs. Discriminative Classifiers

• Training classifiers involves estimating f: X -> Y, or P(Y|X)

• Generative classifiers like Naïve Bayes
• Assume some functional form for the probabilities P(X|Y), P(Y)
• Estimate parameters of P(X|Y), P(Y) directly from training data
• Use Bayes rule to calculate P(Y|X= xi)

• Discriminative classifiers like logistic regression, support vector machine
• Assume some functional form for f: X -> Y
• Estimate parameters of the mapping function f directly from training data
Naïve Bayes – Advantages

• Algorithm is simple to implement and fast
• If conditional independence holds, it will converge more quickly than other methods
• Even in cases where conditional independence doesn’t hold, its results are quite acceptable
• Needs less training data (due to conditional independence assumption)
• Highly scalable, scales linearly with the number of predictors and training points
• Can be used for both binary and multi-class classification problems
• Handles continuous and discrete data
• Not sensitive to irrelevant features
• Less prone to overfitting because of its small model size (compared to other algorithms like Random
Forest)
• Handles missing values well
Naïve Bayes – When to use

• When the training data is small
• When the features are conditionally independent (mostly)
• When we have a large number of features with a small data set
• Ex: Text classification
Naïve Bayes versus Logistic Regression

• Naïve Bayes is a generative model
• Logistic Regression is a discriminative model
• As the training set size approaches infinity, logistic regression performs better than the
generative model Naive Bayes.
Logistic Regression and
Gaussian Naïve Bayes Classifier

• Interestingly, the parametric form of P(Y|X) used by Logistic Regression is precisely the form
implied by the assumptions of a Gaussian Naive Bayes classifier.

• Therefore, we can view Logistic Regression as a closely related alternative to GNB, though the
two can produce different results in many cases
Characteristics of SVM

• The learning problem is formulated as a convex optimization problem
• Efficient algorithms are available to find the global minimum
• Many of the other methods use greedy approaches and find locally optimal solutions
• High computational complexity for building the model

• Robust to noise
• Overfitting is handled by maximizing the margin of the decision boundary
• In some sense, the best linear model for classification.

• SVM can handle irrelevant and redundant data better than many other techniques
• The user needs to provide the type of kernel function and cost function
• Difficult to handle missing values
• What about categorical variables?
• They need to be mapped to some metric space

• Computationally expensive for large training data with many attributes


Thank You!
In our next session: Applications of ML
