
5. Learning

Table of Contents
1. Recap

2. Introduction to Machine Learning (ML)

Machine Perception

Pattern Recognition

Key ML Concepts

3. Supervised Learning

Classification

4. Nearest-Neighbor Classification

5. Perceptron Learning Rule

6. Support Vector Machines (SVM)

7. Regression

8. Evaluation of Hypotheses

Loss Functions

Overfitting

9. Cross-Validation and Regularization

10. Gradient Descent Algorithm

1. Recap
Key Topics Covered in Previous Lectures

Search Problems: Strategies for state-space exploration.

Adversarial Games: Focus on decision-making in multi-agent environments.

Bayesian Networks: Probabilistic models using variables and conditional dependencies.

Logic in AI: Role of reasoning in both low-level (reactive) and high-level intelligence.

2. Introduction to Machine Learning


Machine Perception
Goal: Design machines capable of recognizing patterns for:

Speech recognition.

Fingerprint identification.

Optical Character Recognition (OCR).

DNA sequence identification.

Reliable pattern recognition aids in automating critical processes.

What is Pattern Recognition?


Definition: Recognizing patterns and making decisions based on them.

Examples:

Visual: Classifying emotions (happy vs. sad).

Speech: Identifying if a speaker said "Yes" or "No".

Physics: Determining whether an observation is an atom or molecule.

Key Concepts in Pattern Recognition


Patterns arise from the laws of physics, brain activity, or behavioral relations.

Challenges:

Variations in lighting, expressions, and occlusions.

Machine Learning Core
Definition: Deriving insights from raw data to make decisions or predictions.

Patterns serve as the foundational structure in:

Object and signal recognition.

Decision-making based on input parameters.

3. Supervised Learning
Overview
Goal: Learn a function to predict outcomes from labeled input-output pairs.

Process
1. Training Phase:

Input a labeled dataset: $\{(x_1, y_1), \dots, (x_N, y_N)\}$.

Minimize prediction error.

2. Testing Phase:

Apply the learned function $f(x)$ to unseen examples.
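
A minimal sketch of the two phases using scikit-learn; the iris dataset, the k-NN model, and the 75/25 split are illustrative assumptions, not part of the notes:

```python
# Hypothetical illustration of the training/testing phases with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)  # labeled pairs {(x_i, y_i)}

# Training phase: fit f on labeled examples only.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)

# Testing phase: apply the learned function f(x) to unseen examples.
print("Test accuracy:", model.score(X_test, y_test))
```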

Classification

Assigns a discrete category (class) to inputs.

Examples:
Spam detection (binary classification).

Handwritten digit recognition (multi-class classification).

Fish Classification Example

This example demonstrates classification principles using features of fish. It illustrates how input features and classification labels work in supervised learning.

Features:

Length: Length of the fish in cm.

Lightness: The lightness of the fish's color.

Feature Space Representation:

Each fish is represented as a point in a two-dimensional feature space (length vs. lightness).

Each point is associated with a class, e.g., "Salmon" or "Sea Bass."

1. Histograms of Features

The histograms illustrate the distribution of features like lightness and length for
the fish dataset.

Histogram of Lightness:

Two distinct clusters may appear, representing "Salmon" (typically lighter) and "Sea Bass" (typically darker).

Histogram of Length:

Similarly, clusters or overlaps might indicate class distribution.

Key Insights from Histograms:

Overlapping regions indicate ambiguity, where additional features might be needed.

The lightness feature is more discriminative because the two species differ more clearly in lightness than in length.

2. Feature Space and Decision Boundaries


The graphs depict 2D feature spaces, with the following elements:

Axes:

X-axis: One feature (e.g., length).

Y-axis: Another feature (e.g., lightness).

Data Points:

Each fish is represented as a point, color-coded or labeled by its class.

Example: Red points for "Salmon," blue points for "Sea Bass."

Decision Boundary:

Definition: A line or curve that separates the feature space into regions, one for each class.

Linear Decision Boundary:

A straight line dividing "Salmon" and "Sea Bass."

Example: If lightness > 0.5 and length < 15, classify as "Salmon."

Non-Linear Decision Boundary:

A curve that handles more complex distributions.

Example: Incorporates additional feature relationships.
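
A toy sketch of the linear decision rule above; the thresholds come from the example, while the function name and feature values are made up for illustration:

```python
# Toy linear decision rule from the example above (thresholds are illustrative).
def classify_fish(lightness: float, length: float) -> str:
    """Classify a fish from two features using a hand-set linear boundary."""
    if lightness > 0.5 and length < 15:
        return "Salmon"
    return "Sea Bass"

print(classify_fish(lightness=0.7, length=12))  # -> Salmon
print(classify_fish(lightness=0.3, length=20))  # -> Sea Bass
```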

5. Commonly Used Terminologies


1. Pattern:

Combination of features that define an object.

Example: A fish’s size and color.

2. Features:

Variables used to characterize a pattern.

Example: Lightness, length.

3. Classes (Labels):

The category or label assigned to a pattern.

Example: "Salmon" or "Sea Bass."

4. Instance (Exemplar):

A single data point representing a pattern.

Classification Approaches:
1. Binary Classifiers:

Separate two groups, e.g.,


fraud detection.

Methods: Decision Trees,


Support Vector Machines
(SVMs), Neural Networks

2. Multi-class Classifiers:

Distinguish more than two


classes.

Approaches: One-vs-All, One-


vs-One.

3. Multi-label Classifiers:

Associate multiple labels with a


single instance, e.g.,

5.Learning 8
recognizing multiple objects in
an image.

Multi-Class Classifiers (Detailed Explanation)


Definition
A multi-class classifier is designed to distinguish between three or more
classes.

Unlike binary classification, where the task is to separate two classes, multi-class classification must identify one out of several possible classes for a given input.

Examples of Multi-Class Classification


1. Handwritten Digit Recognition:

Input: Image of a handwritten digit.

Output: A number between 0 and 9.

Challenges:

Variability in handwriting styles.

Noise in images or overlapping digits.

2. Image Classification:

Input: Image (e.g., a photo of an animal).

Output: Label corresponding to the type of animal (e.g., cat, dog, or bird).

3. Object Detection:

Input: An image containing multiple objects.

Output: Labels for objects and their locations.

Techniques for Multi-Class Classification

1. One-Versus-All (OvA):

Approach:

Train a separate binary classifier for each class.

For N classes, N binary classifiers are trained.

Each classifier distinguishes one class from the rest.

Example:

Handwritten digits: Train 10 classifiers, one for each digit (0, 1, ..., 9).

For a given input, each classifier outputs a score, and the class with
the highest score is chosen.

Advantages:

Simple to implement.

Disadvantages:

Classifiers may produce conflicting results.

Requires N binary classifiers.
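
A minimal One-Versus-All sketch using scikit-learn's OneVsRestClassifier on the digits dataset; the dataset and base classifier are illustrative choices, not prescribed by the notes:

```python
# Hypothetical One-Versus-All sketch: one binary classifier per digit class.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Trains 10 binary classifiers (digit k vs. the rest); the highest score wins.
ova = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ova.fit(X_train, y_train)
print("OvA accuracy:", ova.score(X_test, y_test))
```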

2. One-Versus-One (OvO):

Approach:

Train a binary classifier for every pair of classes.

For n classes, $\binom{n}{2} = \frac{n(n-1)}{2}$ classifiers are trained.

Each classifier determines the winner between two specific classes.

Example:

Handwritten digits: Train 45 binary classifiers for all pairs of digits (e.g., 0 vs. 1, 0 vs. 2, ..., 8 vs. 9).

Use a voting mechanism to determine the final class.

Advantages:

Each classifier focuses on distinguishing between two classes, making
it potentially more accurate.

Disadvantages:

Computationally expensive due to the large number of classifiers.
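
The One-Versus-One counterpart, again only a sketch with an assumed base classifier; scikit-learn's OneVsOneClassifier handles the pairwise training and voting:

```python
# Hypothetical One-Versus-One sketch: one classifier per pair of digit classes.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsOneClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Trains 10*9/2 = 45 pairwise classifiers; a voting scheme picks the final class.
ovo = OneVsOneClassifier(LogisticRegression(max_iter=1000))
ovo.fit(X_train, y_train)
print("OvO accuracy:", ovo.score(X_test, y_test))
```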

3. Multi-Class Logistic Regression:

Extends binary logistic regression to handle multiple classes by using a softmax function.

Outputs probabilities for each class; the class with the highest probability is selected, rather than a hard 0/1 vote, which makes the model more flexible.
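
A small NumPy sketch of the softmax function described above, with purely illustrative scores:

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    """Turn raw class scores (logits) into probabilities that sum to 1."""
    z = z - z.max()                 # subtract the max for numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

logits = np.array([2.0, 1.0, 0.1])  # illustrative scores for three classes
probs = softmax(logits)
print(probs, probs.sum())           # probabilities sum to 1
print("predicted class:", probs.argmax())  # highest probability wins
```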

4. Decision Trees and Random Forests:

Decision trees can naturally handle multi-class classification by creating branches for each class.

Random forests use an ensemble of decision trees for improved accuracy.

5. Neural Networks:

Can handle multi-class classification by using multiple output neurons, each corresponding to a class.

The softmax layer ensures that the outputs sum to 1, representing probabilities.

Challenges in Multi-Class Classification
1. Class Imbalance:

Some classes may have significantly more samples than others, leading to
biased models.

Solution: Use techniques like oversampling, undersampling, or class-weighted loss functions.

2. Overlapping Classes:

Classes may not have clearly defined boundaries in feature space.

Solution: Use advanced models like Support Vector Machines (SVMs) with
non-linear kernels or deep learning.

3. Scalability:

As the number of classes increases, the computational complexity can become prohibitive.

Solution: Dimensionality reduction or hierarchical classification.

Multi-Output vs Multi-Label Classification


1. Multi-Output Classifiers:

Predict multiple outputs simultaneously for a single input.

Example: Predicting temperature and humidity from weather data.

2. Multi-Label Classifiers:

Assign multiple labels to a single instance.

Example: Identifying all objects in an image (e.g., both a cat and a dog).

Summary of Techniques

Technique            Advantages                           Disadvantages
One-Versus-All       Simple to implement                  Conflicts between classifiers
One-Versus-One       More focused classifiers             Computationally expensive
Logistic Regression  Probabilistic interpretation         Assumes linear separability
Neural Networks      Handles complex relationships well   Requires large datasets and high computation

4. Nearest-Neighbor Classification
Rule:

Nearest-Neighbor Classification involves assigning a class label to a new data point based on the label of its closest point (prototype) in the feature space.

Example:

If a test point lies closest to a training point labeled "A," the model predicts "A."

k-Nearest Neighbor:

Algorithm:
1. Determine k:

k is the number of nearest neighbors to consider.

It should be odd to prevent ties.

Example: k=3 considers the 3 nearest points.

2. Compute Distance:

Measure the distance between the test point and all training points using a
distance metric.

3. Find Neighbors:

Identify the k closest neighbors based on the computed distances.

4. Majority Voting:

Determine the most common class among the k neighbors and assign it to
the test point.
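
A minimal NumPy sketch of the four steps above, assuming Euclidean distance and a tiny made-up dataset:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train: np.ndarray, y_train: np.ndarray,
                x_test: np.ndarray, k: int = 3) -> int:
    # Step 2: compute the distance from the test point to every training point (Euclidean).
    distances = np.linalg.norm(X_train - x_test, axis=1)
    # Step 3: find the k closest neighbors.
    neighbor_idx = np.argsort(distances)[:k]
    # Step 4: majority vote among the k neighbors' labels.
    return Counter(y_train[neighbor_idx]).most_common(1)[0][0]

# Illustrative data: two features (length, lightness); labels 0 = Sea Bass, 1 = Salmon.
X_train = np.array([[20.0, 0.3], [18.0, 0.4], [12.0, 0.8], [10.0, 0.7]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([11.0, 0.75]), k=3))  # -> 1 (Salmon)
```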

Choosing k:
Small k:

Sensitive to noise and outliers.

Example: k=1 (Nearest Neighbor Rule).

Large k:

Reduces the effect of noise but may blur class boundaries.

Example: k=7 averages over more neighbors.

Distance Metrics:
1. Euclidean Distance:

Formula:

$$d(x_1, x_2) = \sqrt{\sum_{i=1}^{d} (x_{1i} - x_{2i})^2}$$

Suitable for numerical features in a continuous space.

Example: Comparing fish length and lightness.

2. Manhattan Distance:

Formula:

$$d(x_1, x_2) = \sum_{i=1}^{d} |x_{1i} - x_{2i}|$$

Useful when features are on a grid or when differences are categorical.

3. Other Metrics:

Minkowski Distance (generalized version of Euclidean/Manhattan).

Cosine Similarity (used in text classification).
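
A quick illustration of these metrics using scipy.spatial.distance; the two example points are arbitrary:

```python
import numpy as np
from scipy.spatial import distance

x1 = np.array([15.0, 0.6])   # e.g., (length, lightness) of one fish
x2 = np.array([12.0, 0.8])

print("Euclidean:", distance.euclidean(x1, x2))
print("Manhattan:", distance.cityblock(x1, x2))
print("Minkowski (p=3):", distance.minkowski(x1, x2, p=3))
print("Cosine similarity:", 1 - distance.cosine(x1, x2))  # scipy returns cosine *distance*
```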

Strengths and Weaknesses:

Strengths:
Simple to understand and implement.

Non-parametric: No assumptions about data distribution.

Weaknesses:
Computationally expensive for large datasets.

Requires feature scaling (e.g., normalization) for fair distance computation.

Practical Application:
Recommender Systems:

Classify user preferences based on similar users.

Medical Diagnosis:

Predict diseases based on similar patient symptoms.

5. Perceptron Learning Rule


Overview:
A Perceptron is a type of linear classifier that separates data points using a
hyperplane.

It iteratively adjusts model parameters (weights) to minimize errors in classification.

Perceptron Learning Rule:

The Perceptron rule updates weights $w_i$ as follows:

$$w_i = w_i + \alpha(y - \hat{y})x_i$$

α: Learning rate (controls step size).

y: True label.

ŷ: Predicted label.

Algorithm:
1. Initialize Weights:

Start with random or zero values for w.

2. Iterate Through Training Data:

Compute the predicted label ŷ using the current weights:

$$\hat{y} = \begin{cases} 1 & \text{if } w \cdot x + b \ge 0 \\ 0 & \text{otherwise} \end{cases}$$

Update weights if ŷ ≠ y:

$$w = w + \alpha(y - \hat{y})x$$

3. Repeat Until Convergence:

Stop when classification errors are minimized or after a fixed number of iterations.

Features:

1. Threshold-Based Classification:

Uses a hard threshold to separate classes.

Example: If w·x + b ≥ 0, classify as positive.

2. Linearly Separable Data:

Works well when data can be perfectly separated by a hyperplane.

Example:

Dataset:

Points: (2, 3) labeled 1, (1, 1) labeled 0.

Initial weights: w = [0, 0], b = 0, α = 0.1.

Iteration 1:

Predicted ŷ for (2, 3): 0 (incorrect).

Update:

$$w = w + \alpha(1 - 0) \times [2, 3] = [0.2, 0.3]$$

Repeat until no errors remain.
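
A compact NumPy sketch of the perceptron loop, seeded with the toy dataset and learning rate from the example; the iteration cap is an added assumption:

```python
import numpy as np

# Toy dataset from the example: (2, 3) labeled 1, (1, 1) labeled 0.
X = np.array([[2.0, 3.0], [1.0, 1.0]])
y = np.array([1, 0])

w = np.zeros(2)   # initial weights
b = 0.0           # bias
alpha = 0.1       # learning rate

for epoch in range(100):                              # fixed upper bound on iterations
    errors = 0
    for x_i, y_i in zip(X, y):
        y_hat = 1 if np.dot(w, x_i) + b >= 0 else 0   # hard-threshold prediction
        if y_hat != y_i:
            w += alpha * (y_i - y_hat) * x_i          # perceptron update rule
            b += alpha * (y_i - y_hat)
            errors += 1
    if errors == 0:                                   # stop once every point is classified correctly
        break

print("weights:", w, "bias:", b)
```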

Applications:
Optical character recognition (OCR).

Simple image classification.

6. Support Vector Machines (SVM)

Overview:
SVMs are supervised learning models that find the optimal hyperplane to
separate classes in feature space.

The maximum margin separator ensures the boundary is robust to small variations.

Key Concepts:
1. Margin:

Distance between the hyperplane and the nearest data points (support
vectors) from each class.

Goal: Maximize this margin to improve generalization.

2. Hyperplane:

A decision boundary separating classes.

For two features:

$$w \cdot x + b = 0$$

Generalizes to higher dimensions.

3. Kernel Trick:

Extends SVMs to non-linear problems by mapping data into a higher-dimensional space.

Common Kernels:

Linear.

Polynomial.

Radial Basis Function (RBF).

SVM Optimization:
1. Objective:

Minimize:

$$\frac{1}{2}\|w\|^2$$

Subject to constraints:

$$y_i (w \cdot x_i + b) \ge 1 \quad \forall i$$

2. Slack Variables:

Allow some misclassifications to handle non-separable data.

Add a penalty C to control the trade-off between margin size and classification error.

(Figure: maximum margin separator)
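
A short scikit-learn sketch of fitting a soft-margin SVM, showing the C and kernel parameters discussed above; the synthetic dataset and parameter values are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# C controls the margin/misclassification trade-off; the RBF kernel handles
# non-linear boundaries via the kernel trick.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
model.fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))
```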

Strengths and Weaknesses:

Strengths:
Effective for high-dimensional data.

Works well for both linear and non-linear problems.

Weaknesses:
Sensitive to parameter choices (e.g., C and kernel parameters).

Computationally expensive for large datasets.

Applications:
1. Text Classification:

Spam detection in emails.

2. Image Recognition:

Face detection.

7. Regression
Regression is a supervised learning technique used to predict a continuous output value based on input features.

The goal is to find a function f(x) that maps inputs (x) to outputs (y).

Types of Regression:
1. Linear Regression:

Predicts a continuous target using a linear relationship between input features and output.

Hypothesis Function:

$$h_\theta(x) = \theta_0 + \theta_1 x$$

Cost Function:

$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$$

where:

m: Number of training examples.

h_θ(x): Predicted value.

y: Actual value.

2. Polynomial Regression:

Extends linear regression by introducing higher-degree terms of input features.

Example:

$$h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2$$

3. Logistic Regression:

Used for classification problems; predicts probabilities of class membership.

Hypothesis Function:

$$h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$$

Example: Predicting House Prices


Input: Size of a house (in square feet).

Output: Price (in dollars).

Steps:

1. Plot data to visualize the relationship.

2. Fit a regression line $h_\theta(x) = \theta_0 + \theta_1 x$.

3. Use the model to predict the price for a house of a given size.

If linear regression gives a high error, increase the model complexity gradually (e.g., add polynomial terms).
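
A NumPy sketch of the house-price steps; the data points are made up, and np.polyfit stands in for the least-squares fit:

```python
import numpy as np

# Illustrative data: house size (sq ft) vs. price (dollars).
sizes = np.array([800, 1000, 1200, 1500, 1800, 2100], dtype=float)
prices = np.array([150_000, 180_000, 210_000, 255_000, 300_000, 345_000], dtype=float)

# Fit the regression line h_theta(x) = theta_0 + theta_1 * x by least squares.
theta_1, theta_0 = np.polyfit(sizes, prices, deg=1)
print(f"h(x) = {theta_0:.0f} + {theta_1:.2f} x")

# Predict the price for a 1,600 sq ft house.
print("Predicted price:", theta_0 + theta_1 * 1600)

# If the linear fit has high error, increase complexity gradually, e.g. deg=2.
quadratic_coeffs = np.polyfit(sizes, prices, deg=2)
```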

8. Evaluation of Hypotheses
Evaluation metrics quantify how well a model's predictions match actual
outcomes.

Loss Functions:
1. 0-1 Loss:

Used in classification.

Outputs 0 for correct predictions and 1 for incorrect predictions.

2. L1 Loss (Absolute Error):

Measures the absolute difference between actual and predicted values.

Formula:

$$L(y, \hat{y}) = |y - \hat{y}|$$

3. L2 Loss (Squared Error):

Penalizes larger errors more heavily.

Formula:

$$L(y, \hat{y}) = (y - \hat{y})^2$$

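A tiny NumPy illustration of the three losses, with arbitrary example values:

```python
import numpy as np

y_true = np.array([3.0, -0.5, 2.0])
y_pred = np.array([2.5, 0.0, 2.0])

zero_one = (y_true != y_pred).astype(int)   # 0-1 loss (for classification labels)
l1 = np.abs(y_true - y_pred)                # L1 / absolute error
l2 = (y_true - y_pred) ** 2                 # L2 / squared error

print("0-1:", zero_one, "L1:", l1, "L2:", l2)
```
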
Overfitting and Underfitting:


Overfitting:

Model fits the training data too closely, failing to generalize to new data.

Indicators: Low training error but high test error.

Underfitting:

Model is too simple to capture the underlying pattern in the data.

Indicators: High training and test errors.

Tools for Evaluation:


1. Training and Testing Datasets:

Split the dataset into training (used for learning) and testing (used for
evaluation).

2. Cross-Validation:

Divide the dataset into k subsets and train/test the model k times.

Each subset serves as the test set once (more on this below).

9. Cross-Validation and Regularization
Cross-Validation
1. Holdout Method:

Divide the dataset into two parts: training and testing.

Limitation: Performance can depend on how the data is split.

2. K-Fold Cross-Validation:

Split the data into k subsets.

Train the model k times, each time using a different subset as the test set
and the rest as the training set.

Compute the average performance across all folds.

3. Leave-One-Out Cross-Validation:

Special case of k-fold cross-validation where k = m (the number of examples).

Each example is used as a test case once.
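
A short scikit-learn sketch of k-fold cross-validation; k = 5 and the model choice are illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)

# 5-fold CV: train 5 times, each fold serving as the test set exactly once.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print("Fold accuracies:", scores)
print("Average accuracy:", scores.mean())
```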

Regularization
Regularization prevents overfitting by penalizing overly complex models.

Techniques: (EXTRA)
1. L2 Regularization (Ridge Regression):

Adds a penalty term proportional to the square of the coefficients.

Modified Cost Function:

$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \lambda \sum_{j=1}^{n} \theta_j^2$$

2. L1 Regularization (Lasso Regression):

Adds a penalty term proportional to the absolute value of the coefficients.

Encourages sparsity by driving some coefficients to zero.

3. Elastic Net:

Combines L1 and L2 regularization.
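
A brief scikit-learn sketch comparing the three penalties; alpha plays the role of λ here, and the regression data is synthetic:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso, Ridge

X, y = make_regression(n_samples=100, n_features=20, noise=10.0, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)                    # L2 penalty (shrinks coefficients)
lasso = Lasso(alpha=1.0).fit(X, y)                    # L1 penalty (drives some coefficients to 0)
enet = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y)  # mix of L1 and L2

print("Non-zero Lasso coefficients:", (lasso.coef_ != 0).sum(), "of", X.shape[1])
```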

Applications:
Helps in feature selection by reducing less important features.

Example: Predicting house prices using many correlated features.

10. Gradient Descent Algorithm


Gradient Descent is used to minimize the cost function by iteratively adjusting
the model parameters (θ) in the direction of the steepest descent.

Key Steps:
1. Initialize Parameters:

Start with random or zero values for θ

2. Compute Gradients:

Calculate the partial derivatives of the cost function with respect to each
parameter:

$$\frac{\partial J(\theta)}{\partial \theta_j}$$

3. Update Parameters:

Adjust parameters using the learning rate (α):

$$\theta_j = \theta_j - \alpha \frac{\partial J(\theta)}{\partial \theta_j}$$

Variants:
1. Batch Gradient Descent:

Uses the entire dataset to compute gradients.

Pros: Converges to the global minimum.

Cons: Computationally expensive for large datasets.

2. Stochastic Gradient Descent (SGD):

Updates parameters for each training example.

Pros: Faster updates.

Cons: May not converge smoothly.

3. Mini-Batch Gradient Descent:

Combines batch and stochastic approaches by using small subsets of data.

Pros: Efficient and smooth convergence.

Learning Rate (α):


1. Too Small:

Convergence is slow.

2. Too Large:

May overshoot the minimum or fail to converge.

3. Adaptive Learning Rates:

Methods like Adam or RMSprop dynamically adjust the learning rate.

Application Example:
Linear Regression with Gradient Descent:

Dataset: House prices.

Hypothesis:

$$h_\theta(x) = \theta_0 + \theta_1 x$$

Cost Function:

$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$$

Update Rule:

$$\theta_0 = \theta_0 - \alpha \frac{\partial J(\theta)}{\partial \theta_0}, \quad \theta_1 = \theta_1 - \alpha \frac{\partial J(\theta)}{\partial \theta_1}$$

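A NumPy sketch of batch gradient descent for this linear-regression example; the data, learning rate, and iteration count are illustrative choices:

```python
import numpy as np

# Illustrative house data: size (thousands of sq ft) vs. price (thousands of dollars).
x = np.array([0.8, 1.0, 1.2, 1.5, 1.8, 2.1])
y = np.array([150.0, 180.0, 210.0, 255.0, 300.0, 345.0])

theta0, theta1 = 0.0, 0.0   # initialize parameters
alpha = 0.1                 # learning rate

for _ in range(5000):
    error = (theta0 + theta1 * x) - y
    # Partial derivatives of J(theta) with respect to theta0 and theta1.
    grad0 = error.mean()
    grad1 = (error * x).mean()
    # Simultaneous update in the direction of steepest descent.
    theta0 -= alpha * grad0
    theta1 -= alpha * grad1

print(f"h(x) = {theta0:.1f} + {theta1:.1f} x")  # roughly matches the least-squares line
```
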
Practical Challenges:
1. Local Minima:

Not an issue for convex cost functions like in linear regression.

2. Feature Scaling:

Gradient descent performs better when features are scaled (e.g., normalization or standardization).
