
Machine Learning (CSEN3203) 1-14

The document provides a comprehensive overview of various machine learning concepts, including the Perceptron Learning Algorithm, linear regression, and the differences between supervised and semi-supervised learning. It explains the workings of the PLA, derives formulas for linear regression in both single and multiple variable contexts, and discusses Hoeffding's inequality in relation to learning feasibility. Additionally, it outlines different learning paradigms such as supervised, unsupervised, semi-supervised, and reinforcement learning, along with their respective examples and components.

Uploaded by

Yash Jha
Copyright
© © All Rights Reserved

Machine Learning (CSEN3203) - Complete Answer Set

1. Describe the Perceptron Learning Algorithm (PLA) and briefly explain the
working principle of the algorithm.
The Perceptron Learning Algorithm (PLA) is a supervised learning algorithm for binary classifiers. It works
as follows:

1. Initialization: Start with arbitrary weights w (often zeros or small random values).
2. Iterative Process: For each training example (x, y), where x is the input feature vector and y is the target output (+1 or -1):
   - Calculate the predicted output: ŷ = sign(wᵀx)
   - If misclassified (ŷ ≠ y), update the weights: w = w + η·y·x (where η is the learning rate, typically set to 1)
   - If correctly classified, leave the weights unchanged
3. Termination: Repeat until no misclassifications occur in an entire pass through the dataset or until a maximum number of iterations is reached.

The working principle is based on iteratively adjusting the decision boundary (represented by the weight
vector) whenever a misclassification occurs. For linearly separable data, PLA is guaranteed to converge to
a solution in a finite number of updates.
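The steps above can be sketched in Python. This is a minimal illustration under stated assumptions, not a reference implementation: the function name `pla`, the toy dataset, and the convention of prepending a constant 1 to each input (so that the bias is stored in w[0]) are all choices made for this example.

```python
def sign(v):
    """Return +1, -1, or 0. A 0 never equals a label, so it counts as an error."""
    return (v > 0) - (v < 0)


def pla(data, eta=1.0, max_epochs=1000):
    """Perceptron Learning Algorithm.

    data: list of (x, y) pairs, x a list of floats, y in {+1, -1}.
    A constant 1 is prepended to every x so that w[0] acts as the bias.
    Returns (w, converged).
    """
    dim = len(data[0][0]) + 1
    w = [0.0] * dim                      # initialization: all zeros
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in data:
            xb = [1.0] + list(x)
            y_hat = sign(sum(wi * xi for wi, xi in zip(w, xb)))
            if y_hat != y:               # misclassified: w = w + eta * y * x
                w = [wi + eta * y * xi for wi, xi in zip(w, xb)]
                mistakes += 1
        if mistakes == 0:                # a full pass with no errors: stop
            return w, True
    return w, False                      # iteration cap reached


# Hypothetical, linearly separable toy data: the label is +1 when x1 + x2 > 3.
toy = [([1.0, 1.0], -1), ([2.0, 0.5], -1), ([0.5, 1.5], -1),
       ([3.0, 2.0], +1), ([2.0, 3.0], +1), ([4.0, 1.0], +1)]
w, converged = pla(toy)
print(w, converged)
```

The iteration cap matters in practice: if the data are not linearly separable, the loop never reaches a pass with zero mistakes.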

2. Linear Regression Problem with Student Attendance and Marks


Given the data:

Student   Attendance (x)   Marks (y)
1         28               43
2         27               39
3         23               27
4         27               36
5         24               34
6         28               39
7         26               36
8         21               36
9         22               31
10        28               37
 

To find marks for a student with 20 classes of attendance using linear regression:

Step 1: Calculate necessary values

Number of data points: n = 10
Sum of x: Σx = 254
Sum of y: Σy = 358
Mean of x: x̄ = 25.4
Mean of y: ȳ = 35.8
Sum of x²: Σx² = 6,516
Sum of xy: Σxy = 9,168

Step 2: Calculate the slope (m)
m = [n(Σxy) - (Σx)(Σy)] / [n(Σx²) - (Σx)²]
m = [10(9,168) - (254)(358)] / [10(6,516) - (254)²]
m = [91,680 - 90,932] / [65,160 - 64,516]
m = 748 / 644
m ≈ 1.161

Step 3: Calculate the y-intercept (b)
b = ȳ - m·x̄
b = 35.8 - 1.161(25.4)
b ≈ 35.8 - 29.5
b ≈ 6.3

Step 4: Regression equation
y = 1.161x + 6.3

Step 5: Predict marks for 20 classes
y = 1.161(20) + 6.3
y = 23.22 + 6.3
y = 29.52 ≈ 30 marks

Note that x = 20 lies just below the observed attendance range (21 to 28), so this prediction is a mild extrapolation.
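As a cross-check, the sums and the fitted line can be recomputed directly from the table. The short Python sketch below uses only the ten (attendance, marks) pairs given above; the variable names are chosen for this example.

```python
# Attendance (x) and marks (y) for the ten students in the table.
x = [28, 27, 23, 27, 24, 28, 26, 21, 22, 28]
y = [43, 39, 27, 36, 34, 39, 36, 36, 31, 37]

n = len(x)
sum_x, sum_y = sum(x), sum(y)
sum_xx = sum(xi * xi for xi in x)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))

# Least-squares slope and intercept from the raw sums.
m = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
b = (sum_y - m * sum_x) / n

prediction = m * 20 + b   # predicted marks for 20 classes attended
print(sum_x, sum_y, sum_xx, sum_xy)
print(round(m, 3), round(b, 3), round(prediction, 2))
```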

3. Derive the linear regression formula for single dependent variables.


For linear regression with a single independent variable x (simple linear regression), we look for the line y = mx + b that minimizes the sum of squared errors (SSE).

Given n data points (x₁, y₁), (x₂, y₂), ..., (xₙ, yₙ), the SSE is: SSE = Σ(yᵢ - (mxᵢ + b))²

To minimize SSE, we set partial derivatives with respect to m and b to zero:

∂SSE/∂m = -2Σ(yᵢ - mxᵢ - b)xᵢ = 0
∂SSE/∂b = -2Σ(yᵢ - mxᵢ - b) = 0

From the second equation:
Σyᵢ - mΣxᵢ - nb = 0
b = (Σyᵢ - mΣxᵢ)/n = ȳ - mx̄

Substituting into the first equation:
Σ(yᵢ - mxᵢ - (ȳ - mx̄))xᵢ = 0
Σ((yᵢ - ȳ) - m(xᵢ - x̄))xᵢ = 0
Σ(yᵢ - ȳ)xᵢ - mΣ(xᵢ - x̄)xᵢ = 0

Therefore:
m = Σ(yᵢ - ȳ)xᵢ / Σ(xᵢ - x̄)xᵢ

Because Σ(yᵢ - ȳ) = 0 and Σ(xᵢ - x̄) = 0, replacing the trailing xᵢ by (xᵢ - x̄) changes neither sum: Σ(yᵢ - ȳ)xᵢ = Σ(xᵢ - x̄)(yᵢ - ȳ) and Σ(xᵢ - x̄)xᵢ = Σ(xᵢ - x̄)². Hence:
m = Σ(xᵢ - x̄)(yᵢ - ȳ) / Σ(xᵢ - x̄)²
m = Cov(x,y) / Var(x)

And b = ȳ - mx̄

These can also be written as:
m = [n(Σxy) - (Σx)(Σy)] / [n(Σx²) - (Σx)²]
b = [Σy - m(Σx)] / n
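The centered form (covariance over variance) and the raw-sum form of the slope are algebraically the same. A quick numerical check is sketched below; the four data points are hypothetical and the function names are made up for this example.

```python
def slope_centered(x, y):
    """m = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    return s_xy / s_xx


def slope_raw_sums(x, y):
    """m = [n*sum(xy) - sum(x)*sum(y)] / [n*sum(x^2) - sum(x)^2]."""
    n = len(x)
    num = n * sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y)
    den = n * sum(xi * xi for xi in x) - sum(x) ** 2
    return num / den


# Hypothetical data, used only to compare the two formulas.
xs = [1.0, 2.0, 4.0, 7.0]
ys = [2.0, 3.0, 7.0, 12.0]
m1, m2 = slope_centered(xs, ys), slope_raw_sums(xs, ys)
print(m1, m2)
```

The centered form is usually the safer one numerically when the x values are large relative to their spread, because the raw-sum form subtracts two large, nearly equal numbers.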

4. Consider the perceptron in two dimensions: h(x) = sign(wᵀx) where w = [w₀, w₁, w₂]ᵀ and x = [1, x₁, x₂]ᵀ.
(i) Show that the regions on the plane where h(x) = +1 and h(x) = -1 are separated by a line. If we express
this line by the equation x₂ = ax₁ + b, what are the expressions for a and b in terms of w₀, w₁, w₂?

The perceptron function h(x) = sign(wᵀx) gives: h(x) = sign(w₀·1 + w₁x₁ + w₂x₂)

This means:

h(x) = +1 when w₀ + w₁x₁ + w₂x₂ > 0


h(x) = -1 when w₀ + w₁x₁ + w₂x₂ < 0

The boundary between these regions occurs when w₀ + w₁x₁ + w₂x₂ = 0

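Rearranging to solve for x₂ (assuming w₂ ≠ 0):
w₂x₂ = -w₀ - w₁x₁
x₂ = (-w₀ - w₁x₁)/w₂
x₂ = -(w₁/w₂)x₁ - w₀/w₂

Therefore, the line is x₂ = ax₁ + b where:
a = -w₁/w₂
b = -w₀/w₂

If w₂ = 0 and w₁ ≠ 0, the boundary is the vertical line x₁ = -w₀/w₁, which cannot be written in the form x₂ = ax₁ + b.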

(ii) Draw a picture for the cases w = [3, 2, 1]ᵀ and w = -[3, 2, 1]ᵀ.

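For w = [3, 2, 1]ᵀ:
a = -w₁/w₂ = -2/1 = -2
b = -w₀/w₂ = -3/1 = -3
So the line is x₂ = -2x₁ - 3. Here h(x) = +1 where 3 + 2x₁ + x₂ > 0, that is, above the line (the side containing the origin), and h(x) = -1 below it.

For w = -[3, 2, 1]ᵀ = [-3, -2, -1]ᵀ:
a = -w₁/w₂ = -(-2)/(-1) = -2
b = -w₀/w₂ = -(-3)/(-1) = -3
So the line is again x₂ = -2x₁ - 3, but now h(x) = +1 below the line and h(x) = -1 above it.

The two cases produce the same decision boundary, with the regions for h(x) = +1 and h(x) = -1 swapped. In the picture, the line passes through (0, -3) and (-1.5, 0).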
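The same computation can be sketched in Python. The helper names `boundary` and `h` are made up for this example; the weights are the two cases from part (ii).

```python
def boundary(w):
    """Return (a, b) of the line x2 = a*x1 + b for weights w = [w0, w1, w2]."""
    w0, w1, w2 = w
    if w2 == 0:
        raise ValueError("w2 = 0: the boundary is vertical, not of the form x2 = a*x1 + b")
    return -w1 / w2, -w0 / w2


def h(w, x1, x2):
    """Perceptron output sign(w0 + w1*x1 + w2*x2), returned as +1, -1, or 0."""
    s = w[0] + w[1] * x1 + w[2] * x2
    return (s > 0) - (s < 0)


w_pos = [3, 2, 1]
w_neg = [-3, -2, -1]

print(boundary(w_pos), boundary(w_neg))   # same line for both weight vectors
print(h(w_pos, 0, 0), h(w_neg, 0, 0))     # opposite labels at the origin
```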

5. Define Hoeffding's inequality in the context of feasibility of learning.


Hoeffding's inequality provides a statistical bound on how much the in-sample error Ein(h) might differ from the out-of-sample error Eout(h) for a single hypothesis h that is fixed before the data are drawn. In the context of feasibility of learning, it states:

P[|Ein(h) - Eout(h)| > ε] ≤ 2e^(-2ε²N)

Where:

Ein(h) is the in-sample error (training error)

Eout(h) is the out-of-sample error (test error)

N is the sample size

ε is the tolerance level

P[...] is the probability that the difference exceeds ε

This inequality demonstrates that with a large enough sample size N, the probability of having a
significant difference between training and test errors becomes very small, making learning feasible.
Specifically:

1. As N increases, the bound becomes tighter

2. For a fixed ε, increasing N makes the probability of a large deviation exponentially smaller

3. For a finite hypothesis set H, the union bound extends the guarantee to every hypothesis at once, including the one selected after seeing the data: P[∃h ∈ H such that |Ein(h) - Eout(h)| > ε] ≤ 2|H|e^(-2ε²N)

Thus, Hoeffding's inequality demonstrates that learning is feasible when we have:

A sufficiently large dataset

A finite hypothesis set, or a hypothesis set with controlled complexity
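To make the bound concrete, it can be inverted to ask how many samples are needed. Setting 2|H|e^(-2ε²N) ≤ δ and solving for N gives N ≥ ln(2|H|/δ) / (2ε²). The Python sketch below evaluates both directions; the function names and the symbol δ (the allowed failure probability) are introduced here for illustration.

```python
import math


def hoeffding_bound(n, eps, num_hypotheses=1):
    """Upper bound 2*|H|*exp(-2*eps^2*N) on P[|Ein - Eout| > eps]."""
    return 2 * num_hypotheses * math.exp(-2 * eps ** 2 * n)


def samples_needed(eps, delta, num_hypotheses=1):
    """Smallest integer N for which the bound is at most delta."""
    return math.ceil(math.log(2 * num_hypotheses / delta) / (2 * eps ** 2))


n_single = samples_needed(eps=0.05, delta=0.05)
n_hundred = samples_needed(eps=0.05, delta=0.05, num_hypotheses=100)
print(n_single, hoeffding_bound(n_single, 0.05))
print(n_hundred, hoeffding_bound(n_hundred, 0.05, 100))
```

Because |H| enters only through a logarithm, going from 1 to 100 hypotheses raises the required N far less than a hundredfold.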


6. Derive the linear regression formula for multiple dependent variables. Also
explain how the derived linear regression formula can be used for nonlinear
cases.

Multiple Linear Regression Formula Derivation


For multiple regression with p independent variables, we have: y = β₀ + β₁x₁ + β₂x₂ + ... + βₚxₚ + ε

In matrix form with n observations: Y = Xβ + ε

Where:

Y is the n×1 vector of observed values of the dependent variable

X is n×(p+1) matrix of independent variables (with first column of 1s)

β is (p+1)×1 vector of parameters

ε is n×1 vector of errors

The sum of squared errors (SSE) is:
SSE = (Y - Xβ)ᵀ(Y - Xβ)

To minimize SSE, we take the derivative with respect to β and set it to zero:
∂SSE/∂β = -2Xᵀ(Y - Xβ) = 0

Solving for β:
Xᵀ(Y - Xβ) = 0
XᵀY - XᵀXβ = 0
XᵀXβ = XᵀY
β = (XᵀX)⁻¹XᵀY (assuming XᵀX is invertible)

The system XᵀXβ = XᵀY is known as the normal equations, and β = (XᵀX)⁻¹XᵀY is their closed-form solution for multiple linear regression.
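A small pure-Python sketch of this closed-form solution follows. It builds XᵀX and XᵀY and solves the resulting linear system by Gaussian elimination rather than forming an explicit inverse; the helper names and the five data rows are hypothetical, with targets generated exactly from y = 1 + 2x₁ + 3x₂ so the recovered coefficients can be checked.

```python
def solve(a, b):
    """Solve the square system a z = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    aug = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular (X^T X not invertible)")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    z = [0.0] * n
    for i in range(n - 1, -1, -1):           # back substitution
        s = sum(aug[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (aug[i][n] - s) / aug[i][i]
    return z


def fit_normal_equation(rows, y):
    """Return beta solving (X^T X) beta = X^T y, with a leading column of ones in X."""
    X = [[1.0] + list(r) for r in rows]
    cols = list(zip(*X))                      # columns of X
    XtX = [[sum(p * q for p, q in zip(ci, cj)) for cj in cols] for ci in cols]
    Xty = [sum(p * t for p, t in zip(ci, y)) for ci in cols]
    return solve(XtX, Xty)


# Hypothetical data generated from y = 1 + 2*x1 + 3*x2 (no noise).
rows = [[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]]
y = [1 + 2 * x1 + 3 * x2 for x1, x2 in rows]
beta = fit_normal_equation(rows, y)
print(beta)
```

Solving the system directly is generally preferred to computing (XᵀX)⁻¹ explicitly, since it is cheaper and less sensitive to rounding.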

Non-linear Cases
For nonlinear relationships, we can transform the input features to higher-order terms or apply other
transformations, then use the same linear regression formula. Some approaches:

1. Polynomial regression: Include powers of variables (x², x³, etc.)
   Example: y = β₀ + β₁x + β₂x² + β₃x³
2. Feature interactions: Include products of variables (x₁x₂, etc.)
   Example: y = β₀ + β₁x₁ + β₂x₂ + β₃x₁x₂
3. Basis functions: Apply transformations such as logarithmic, exponential, or trigonometric functions
   Example: y = β₀ + β₁log(x) + β₂sin(x)
4. Kernel methods: Implicitly map data to higher-dimensional spaces
   Example: Radial basis function kernel

The process involves:

1. Transform the original features into the desired nonlinear features
2. Apply standard linear regression to the transformed features
3. The resulting model is nonlinear in the original feature space but still linear in the parameters β, which is why the same formula β = (XᵀX)⁻¹XᵀY applies
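The transform-then-fit process can be sketched for the polynomial case. In the Python example below, each input x is mapped to the features [1, x, x²] and ordinary least squares is run on those features. The helper names and the data are hypothetical; the targets are generated exactly from y = 2 + 0.5x² so the fit can be verified.

```python
def solve(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting (a assumed nonsingular)."""
    n = len(a)
    aug = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    z = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(aug[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (aug[i][n] - s) / aug[i][i]
    return z


def least_squares(X, y):
    """Solve the normal equations (X^T X) beta = X^T y."""
    cols = list(zip(*X))
    XtX = [[sum(p * q for p, q in zip(ci, cj)) for cj in cols] for ci in cols]
    Xty = [sum(p * t for p, t in zip(ci, y)) for ci in cols]
    return solve(XtX, Xty)


def poly_features(x, degree):
    """Step 1: transform a scalar x into [1, x, x^2, ..., x^degree]."""
    return [x ** k for k in range(degree + 1)]


# Hypothetical data from the nonlinear relationship y = 2 + 0.5*x^2.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys = [2 + 0.5 * x ** 2 for x in xs]

X = [poly_features(x, 2) for x in xs]     # step 1: nonlinear features
beta = least_squares(X, ys)               # step 2: ordinary linear regression
print(beta)                               # coefficients of 1, x, x^2
```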

7. Describe the differences between supervised and semi-supervised learning.


Supervised Learning:

Uses fully labeled training data (each example has input features and a target output)

The algorithm learns to map inputs to outputs based on these labeled examples

Goal is to generalize from training data to make predictions on unseen data

Examples: Classification, regression, object recognition

Requires extensive labeled data, which can be expensive and time-consuming to obtain

Semi-supervised Learning:

Uses a combination of labeled and unlabeled data for training

Typically includes a small amount of labeled data and a large amount of unlabeled data

Exploits the structure in the unlabeled data to improve the learning model

Key assumptions:
1. Continuity: Points close to each other are likely to have the same label

2. Cluster: Data points in the same cluster likely belong to the same class
3. Manifold: Data lies on a lower-dimensional manifold within the higher-dimensional space

Techniques include self-training, co-training, transductive SVM, and graph-based methods

Particularly useful when labeled data is scarce or expensive, but unlabeled data is abundant

Examples: Web content classification, image recognition with partially labeled datasets

The main advantage of semi-supervised learning is that it can significantly reduce the amount of labeled
data needed for training while still achieving good performance by leveraging the structure in the
unlabeled data.
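Self-training, one of the techniques listed above, can be sketched in a few lines. This is a deliberately simplified illustration: the base classifier (nearest class centroid in one dimension), the confidence score, and the data are all assumptions made for the example, and in real use pseudo-labels can be wrong and reinforce their own errors.

```python
def centroids(labeled):
    """Mean position of each class in a list of (x, label) pairs."""
    means = {}
    for label in (-1, +1):
        pts = [x for x, lab in labeled if lab == label]
        means[label] = sum(pts) / len(pts)
    return means


def self_train(labeled, unlabeled, per_round=2):
    """Repeatedly pseudo-label the most confident unlabeled points and refit."""
    labeled = list(labeled)
    pool = list(unlabeled)
    while pool:
        c = centroids(labeled)
        # Confidence: how much closer a point is to one centroid than to the other.
        ranked = sorted(pool,
                        key=lambda x: abs(abs(x - c[-1]) - abs(x - c[+1])),
                        reverse=True)
        for x in ranked[:per_round]:
            label = -1 if abs(x - c[-1]) < abs(x - c[+1]) else +1
            labeled.append((x, label))      # accept the pseudo-label
            pool.remove(x)
    return labeled


# Two labeled points and six unlabeled ones (hypothetical 1-D data).
seed = [(0.0, -1), (10.0, +1)]
unlabeled = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
result = dict(self_train(seed, unlabeled))
print(result)
```

The example relies on the cluster assumption: the unlabeled points fall into two well-separated groups, each containing one labeled seed.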

8. Explain supervised, unsupervised, semi-supervised, and reinforcement learning along with suitable examples.
Supervised Learning:

Definition: Learning from labeled training data to predict outputs for new inputs
Process: Given input-output pairs (x, y), learn a function f(x) = y

Examples:
1. Email spam classification (Input: email text; Output: spam/not spam)

2. House price prediction (Input: house features; Output: price)

3. Medical diagnosis (Input: patient symptoms; Output: disease classification)

Unsupervised Learning:

Definition: Learning patterns from unlabeled data without specific output targets
Process: Given inputs x, find interesting structures or patterns
Examples:
1. Customer segmentation (grouping similar customers)

2. Anomaly detection in network traffic

3. Topic modeling in text documents

4. Dimensionality reduction for visualization

Semi-supervised Learning:

Definition: Learning from a combination of labeled and unlabeled data


Process: Use small labeled dataset plus large unlabeled dataset to improve performance

Examples:
1. Web page classification with few labeled examples
2. Speech recognition with partially transcribed audio

3. Protein function prediction with some known functions

Reinforcement Learning:

Definition: Learning optimal actions through trial and error in an environment to maximize rewards
Process: Agent learns policy to maximize cumulative reward through environment interaction

Examples:
1. Game playing (AlphaGo, chess, Atari games)
2. Robotics (learning to walk, grasp objects)

3. Autonomous vehicles (learning driving policies)


4. Dynamic pricing strategies

Each paradigm addresses different problem types and data availability scenarios, making them suitable
for different real-world applications.

9. What are the different components of learning?


The different components of a learning system include:

1. Input Space (X):


The set of all possible inputs to the learning algorithm

Examples: images, text, numerical features

2. Output Space (Y):


The set of all possible outputs or predictions

Examples: class labels, continuous values, structured outputs


3. Training Data (D):
The dataset used to train the model

Contains examples (x, y) from which the algorithm learns

4. Hypothesis Space (H):


The set of all possible models/functions the learning algorithm can select from
Example: all possible linear functions, decision trees of certain depth

5. Learning Algorithm (A):


The procedure that selects a specific hypothesis/model from the hypothesis space
Uses training data to determine which hypothesis best fits the problem

6. Loss Function (L):


Measures how well a model's predictions match the actual outputs

Examples: squared error, cross-entropy, hinge loss

7. Final Hypothesis (g):


The specific model selected by the learning algorithm from the hypothesis space

Used to make predictions on new, unseen data

8. Regularization:
Controls model complexity to prevent overfitting

Examples: L1/L2 regularization, early stopping

9. Feature Extraction/Engineering:
Process of transforming raw data into features suitable for modeling

Examples: normalization, dimensionality reduction, one-hot encoding

10. Validation Process:


Methods to evaluate model performance (cross-validation, holdout sets)

Used to tune hyperparameters and compare models

These components work together to create a system that can learn patterns from data and make
predictions or decisions based on that learning.

10. Discuss with example the in-sample error and out-of-sample error.
In-sample Error (Ein) and Out-of-sample Error (Eout) are fundamental concepts in evaluating machine
learning model performance.

In-sample Error (Ein):

Error measured on the training data the model was built on

Represents how well the model fits the data it has seen
Formula: Ein(h) = (1/N) Σₙ L(h(xₙ), yₙ), summed over the N training examples
Usually optimistically biased (underestimates true error)

Out-of-sample Error (Eout):

Error measured on new, unseen data

Represents the model's generalization ability


Formula: Eout(h) = E_(x,y)[L(h(x), y)] (expected value over all possible examples)

The true measure of a model's performance in practice

Example: House Price Prediction

Suppose we have data on 1000 houses with features like size, location, number of rooms, etc., and we
want to predict house prices using linear regression.

We use 800 houses for training and develop a model: Price = 50,000 + 100×Size + 5,000×Rooms

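On the training data (800 houses), we calculate Mean Squared Error = 10,000 (Ein), in squared price units

On the testing data (200 houses never seen during training), we calculate Mean Squared Error = 25,000. This test-set error serves as an estimate of Eout, since Eout itself is an expectation over all possible houses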

The difference between Ein and Eout demonstrates:

1. The model fits training data better than test data (Ein < Eout)

2. Some overfitting has occurred (significant gap between Ein and Eout)

Another example: Polynomial Regression

Consider fitting different degree polynomials to predict a target variable:

Degree 1 (linear): Ein = 10.5, Eout = 10.2


Degree 3: Ein = 5.2, Eout = 6.1

Degree 10: Ein = 0.5, Eout = 15.7

As polynomial degree increases:

Ein decreases (better fit to training data)

Eout initially decreases then increases (model becomes too complex and overfits)

The goal of learning is not to minimize Ein but to find a model complexity that minimizes Eout, which
requires balancing bias and variance.
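The polynomial-degree effect can be reproduced with a small experiment. The Python sketch below is an illustration under assumptions: the target sin(πx), the noise level, the sample sizes, and the degrees (1, 3, 8) are all hypothetical choices, and the numbers it prints are not the ones quoted above. Eout is approximated by the error on a large fresh sample.

```python
import math
import random


def solve(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting (a assumed nonsingular)."""
    n = len(a)
    aug = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    z = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(aug[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (aug[i][n] - s) / aug[i][i]
    return z


def fit_poly(xs, ys, degree):
    """Least-squares polynomial coefficients (lowest power first) via the normal equations."""
    X = [[x ** k for k in range(degree + 1)] for x in xs]
    cols = list(zip(*X))
    XtX = [[sum(p * q for p, q in zip(ci, cj)) for cj in cols] for ci in cols]
    Xty = [sum(p * t for p, t in zip(ci, ys)) for ci in cols]
    return solve(XtX, Xty)


def mse(coeffs, xs, ys):
    """Mean squared error of the polynomial with the given coefficients."""
    preds = [sum(c * x ** k for k, c in enumerate(coeffs)) for x in xs]
    return sum((p - t) ** 2 for p, t in zip(preds, ys)) / len(xs)


def target(x):
    return math.sin(math.pi * x)          # hypothetical unknown target function


rng = random.Random(0)                    # fixed seed so the run is repeatable
noise = 0.3
train_x = [-1 + 2 * i / 14 for i in range(15)]
train_y = [target(x) + rng.gauss(0, noise) for x in train_x]
test_x = [rng.uniform(-1, 1) for _ in range(500)]
test_y = [target(x) + rng.gauss(0, noise) for x in test_x]

results = {}
for degree in (1, 3, 8):
    coeffs = fit_poly(train_x, train_y, degree)
    results[degree] = (mse(coeffs, train_x, train_y), mse(coeffs, test_x, test_y))
    print(degree, "Ein =", results[degree][0], "Eout estimate =", results[degree][1])
```

Ein can only go down as the degree grows, because each higher-degree model contains the lower-degree ones; the Eout estimate has no such guarantee.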

11. Describe classification and regression in context of supervised learning? Discuss their differences with suitable examples.
Classification and Regression in Supervised Learning
Both classification and regression are supervised learning tasks where the algorithm learns from labeled
training data, but they differ in the type of output variable they predict.

Classification:

Predicts discrete category labels or classes


Output space Y is a finite set of categories

Examples of classification algorithms:


Logistic Regression
Decision Trees

Support Vector Machines


Neural Networks

k-Nearest Neighbors

Regression:

Predicts continuous numerical values

Output space Y is typically ℝ (real numbers)

Examples of regression algorithms:


Linear Regression

Polynomial Regression

Ridge/Lasso Regression

Decision Tree Regression

Neural Networks

Key Differences:

1. Output Type:
Classification: Discrete categories (e.g., "spam/not spam")

Regression: Continuous values (e.g., house price $327,500)

2. Error Metrics:
Classification: Accuracy, precision, recall, F1-score, AUC-ROC
Regression: Mean Squared Error (MSE), Mean Absolute Error (MAE), R²

3. Decision Boundaries:
Classification: Creates boundaries between classes
Regression: Fits a curve/surface to data points

4. Loss Functions:
Classification: Cross-entropy, hinge loss
Regression: Squared error, absolute error

Examples:

Classification Example: Email Spam Detection

Input features: Word frequencies, sender information, subject line characteristics

Output: Binary classification (spam/not spam)


A decision tree might create rules like: "If email contains 'free money' AND sender is not in contacts,
then classify as spam"

Regression Example: House Price Prediction

Input features: Square footage, number of bedrooms, location, age of house

Output: Continuous price value ($250,000, $375,500, etc.)


A linear regression might create a model like: Price = $50,000 + $100×SquareFeet +
$15,000×Bedrooms - $2,000×Age

Some tasks can be approached as either classification or regression:

Age prediction: Regression (predict exact age) or Classification (predict age group)

Risk assessment: Regression (predict probability) or Classification (high/medium/low risk)
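The difference in error metrics can be made concrete with a short Python sketch. The function names and the small prediction lists are hypothetical; the formulas are the standard definitions of accuracy, precision, recall, F1, MSE, and MAE.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for discrete labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1


def regression_metrics(y_true, y_pred):
    """Mean squared error and mean absolute error for continuous targets."""
    n = len(y_true)
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    return mse, mae


# Hypothetical spam labels (1 = spam, 0 = not spam) and predictions.
acc, prec, rec, f1 = classification_metrics([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 1, 0])
# Hypothetical continuous targets and predictions.
mse, mae = regression_metrics([3.0, 5.0, 7.0], [2.5, 5.0, 8.0])
print(acc, prec, rec, f1)
print(mse, mae)
```

Classification metrics only ask whether each label is right or wrong, while regression metrics measure how far off each prediction is.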

12. What is the R² metric in Linear Regression?


R² (R-squared) Metric in Linear Regression

R² is a statistical measure that represents the proportion of the variance in the dependent variable that is
predictable from the independent variables. It's also known as the coefficient of determination.

Definition: R² = 1 - (SSres / SStot)

Where:

SSres (Sum of Squared Residuals) = Σ(yᵢ - ŷᵢ)² (observed - predicted)

SStot (Total Sum of Squares) = Σ(yᵢ - ȳ)² (observed - mean)

Interpretation:

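For ordinary least squares with an intercept, R² on the training data lies between 0 and 1; it can be negative on held-out data or for models fitted without an intercept, which indicates a fit worse than always predicting the mean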

R² = 1: Perfect fit, model explains all variability in the response data

R² = 0: Model doesn't explain any variability, equivalent to predicting the mean

R² = 0.7: Model explains 70% of the variance in the dependent variable

Example: For a house price prediction model:


R² = 0.85 means 85% of the variance in housing prices can be explained by the features in your
model

The remaining 15% is due to other factors not captured by the model or random variation

Properties:

1. R² increases (or stays the same) when more variables are added to the model, even if those variables are irrelevant, so it can be inflated artificially
2. R² doesn't indicate whether the coefficients and predictions are biased
3. R² doesn't indicate whether a regression model is adequate or if the right model was chosen

Adjusted R²: To address the issue of R² increasing with additional variables regardless of their
contribution:

Adjusted R² = 1 - [(1 - R²)(n - 1)/(n - p - 1)]

Where n is number of observations and p is number of predictors


Penalizes the addition of variables that don't improve the model significantly

R² helps quantify how well a linear regression model fits the data, making it a useful metric for model
evaluation and comparison.
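Both formulas translate directly into code. The Python sketch below uses hypothetical observed and predicted values; the function names are chosen for this example.

```python
def r_squared(y, y_hat):
    """R^2 = 1 - SSres / SStot."""
    y_bar = sum(y) / len(y)
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, y_hat))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


def adjusted_r_squared(r2, n, p):
    """Adjusted R^2 = 1 - (1 - R^2)(n - 1)/(n - p - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


# Hypothetical observations and predictions from a one-predictor model.
y = [1.0, 2.0, 3.0, 4.0]
y_hat = [1.1, 1.9, 3.2, 3.8]

r2 = r_squared(y, y_hat)
adj = adjusted_r_squared(r2, n=len(y), p=1)
print(r2, adj)

# Predicting the mean everywhere gives R^2 = 0; a perfect fit gives R^2 = 1.
mean_only = r_squared(y, [sum(y) / len(y)] * len(y))
perfect = r_squared(y, y)
```

With a single predictor and an imperfect fit, the adjusted value is always below the plain R², reflecting the penalty for the parameter used.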

13. Discuss the preamble to the theory of learning.


Preamble to the Theory of Learning

The preamble to the theory of learning establishes the foundational framework that makes machine
learning theoretically possible. It addresses several key questions:

1. Is Learning Possible?
The fundamental question: Can a model that performs well on training data also perform well on
unseen data?
This leads to the discussion of generalization from sample to population

2. The Learning Framework:


Input space X: The domain of all possible inputs

Output space Y: The range of possible outputs

Unknown target function f: X → Y (what we're trying to learn)


Training data D = {(x₁, y₁), (x₂, y₂), ..., (xₙ, yₙ)}, where yᵢ = f(xᵢ)
Hypothesis set H: Set of candidate functions h: X → Y

Learning algorithm A: Process that uses D to select a final hypothesis g ∈ H

3. Probability and Learning:


Assumes data points are drawn from an unknown probability distribution P(x, y)
The learning goal is to find h that minimizes expected error: E[L(h(x), y)]

Training data provides samples from this distribution

4. Inductive Learning:
The core principle that what works on the training set will likely work on unseen data from the
same distribution
This requires assumptions like: a) The training and test data come from the same distribution b)
The training set is sufficiently large and representative c) The hypothesis space has appropriate
complexity

5. The No Free Lunch Theorem:


States that, averaged over all possible problems, no learning algorithm outperforms any other
Learning is possible only with some inductive bias or assumptions about the problem

6. Approximation-Generalization Tradeoff:
Complex models can better approximate target functions but may not generalize well

Simple models may generalize better but might not capture complex patterns
This sets up the bias-variance tradeoff discussion

7. Feasibility of Learning:
Hoeffding's inequality provides statistical bounds on generalization error

Shows that with enough data, the probability of large deviation between training and test error
becomes very small

This preamble establishes that learning is theoretically possible under certain conditions and lays the
groundwork for more advanced learning theory concepts like VC dimension, generalization bounds, and
structural risk minimization.

14. Explain the contexts where linear regression is used. Write the linear
regression algorithm, in detail.
Contexts Where Linear Regression is Used:

1. Predictive Analysis:
Sales forecasting based on advertising spend

Predicting house prices based on features


Estimating crop yields based on rainfall and temperature

2. Relationship Analysis:
Understanding correlation between variables
Quantifying the effect of specific factors on an outcome
Medical research (e.g., relationship between dosage and response)

3. Trend Analysis:
Analyzing time series data

Identifying growth patterns

Economic forecasting

4. Quality Control:
Identifying factors affecting product quality

Process optimization

5. Financial Applications:
Portfolio risk assessment
Asset pricing models
Return forecasting

6. Feature Selection:
Identifying significant predictors

Variable importance analysis

7. As a Baseline Model:
Starting point for more complex models
Benchmark for comparing advanced algorithms

Linear Regression Algorithm in Detail:

Input: Training data D = {(x₁, y₁), (x₂, y₂), ..., (xₙ, yₙ)}, where xᵢ is a d-dimensional feature vector and yᵢ is the
target value

Output: Weight vector w and bias term b for the model y = w·x + b

Algorithm Steps:

1. Data Preparation:
Split data into training and validation sets
Normalize/standardize features if needed, using statistics computed from the training set only

Add a column of ones to X for the bias term if using matrix form

2. Model Specification:
Simple linear regression: y = b + w₁x₁

Multiple linear regression: y = b + w₁x₁ + w₂x₂ + ... + wₚxₚ


In matrix form: y = Xw (where first column of X is ones)

3. Parameter Estimation (choose one method):
   a) Analytical Solution (Normal Equation):
      w = (XᵀX)⁻¹Xᵀy
      Compute directly when n and d are not too large
   b) Gradient Descent:
      Initialize w randomly or to zeros
      Repeat until convergence:
         Compute predictions: ŷ = Xw
         Compute error: e = ŷ - y
         Compute gradient: ∇w = (2/n)Xᵀe
         Update weights: w = w - α∇w (where α is the learning rate)
      Monitor convergence using the cost function: J(w) = (1/n)‖Xw - y‖²

4. Model Evaluation:
Compute R² = 1 - (SSres/SStot)

Compute Mean Squared Error (MSE) = (1/n)Σ(yᵢ - ŷᵢ)²


Analyze residual plots for heteroscedasticity and normality

Check for multicollinearity using VIF (Variance Inflation Factor)

5. Statistical Testing:
Calculate standard errors for coefficients
Perform t-tests for coefficient significance: t = wᵢ/SE(wᵢ)
Calculate p-values and confidence intervals

Perform F-test for overall model significance

6. Prediction:
For new input x*, predict y* = w·x* + b

7. Regularization (Optional):
Ridge Regression (L2): Add λ‖w‖² to cost function

Lasso Regression (L1): Add λ‖w‖₁ to cost function


Elastic Net: Combine L1 and L2 regularization

The algorithm outputs a linear function that minimizes the sum of squared differences between predicted
and actual values in the training data.
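The gradient-descent branch of step 3 can be sketched in pure Python. The learning rate, iteration count, and data below are hypothetical choices for this example; the targets are generated exactly from y = 1 + 2x, and the bias is handled through a leading column of ones in X.

```python
def gradient_descent(X, y, alpha=0.05, iterations=2000):
    """Minimize J(w) = (1/n) * ||Xw - y||^2 by batch gradient descent."""
    n, d = len(X), len(X[0])
    w = [0.0] * d                                        # initialize to zeros
    for _ in range(iterations):
        preds = [sum(wj * xj for wj, xj in zip(w, row)) for row in X]
        errors = [p - t for p, t in zip(preds, y)]       # e = y_hat - y
        grad = [(2.0 / n) * sum(e * row[j] for e, row in zip(errors, X))
                for j in range(d)]                       # (2/n) * X^T e
        w = [wj - alpha * gj for wj, gj in zip(w, grad)] # w = w - alpha * grad
    return w


def cost(X, y, w):
    """J(w) = (1/n) * ||Xw - y||^2, used to monitor convergence."""
    preds = [sum(wj * xj for wj, xj in zip(w, row)) for row in X]
    return sum((p - t) ** 2 for p, t in zip(preds, y)) / len(y)


# Hypothetical data from y = 1 + 2x; the first column of X is the constant 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
X = [[1.0, x] for x in xs]
y = [1.0 + 2.0 * x for x in xs]

w = gradient_descent(X, y)
print(w, cost(X, y, w))
```

The learning rate has to be small enough for the updates to shrink rather than grow; with unscaled features that limit can be very tight, which is one reason step 1 suggests standardizing.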

15. Illustrate a simple learning model using the concept of input, output,
learning algorithm and hypothesis set.
Simple Learning Model Illustration

Let's illustrate a simple learning model for email spam classification:


1. Input Space (X):

The set of all possible emails

Each email is represented as a feature vector x = [x₁, x₂, ..., xₚ]


Features might include:
x₁: frequency of the word "free"

x₂: presence of excessive capitalization (0/1)


x₃: sender domain reputation score

x₄: number of recipients

2. Output Space (Y):

Binary classification: Y = {-1, +1} (the two values returned by sign)

y = -1: legitimate email

y = +1: spam email

3. Hypothesis Set (H):

Linear classifiers of the form h(x) = sign(w·x + b)

Each hypothesis h ∈ H corresponds to a different weight vector w and bias b


This set contains all possible linear decision boundaries in the feature space

4. Learning Algorithm (A):

Perceptron Learning Algorithm

Given training data D = {(x₁, y₁), (x₂, y₂), ..., (xₙ, yₙ)}

Initialize w = 0, b = 0
For each epoch:
    For each training example (xᵢ, yᵢ):
        Predict ŷᵢ = sign(w·xᵢ + b)
        If misclassified (ŷᵢ ≠ yᵢ):
            Update w = w + η
