Machine Learning (CSEN3203) 1-14
1. Describe the Perceptron Learning Algorithm (PLA) and briefly explain the
working principle of the algorithm.
The Perceptron Learning Algorithm (PLA) is a supervised learning algorithm for binary classifiers. It works
as follows:
1. Initialization: Start with arbitrary weights w (often zeros or small random values).
2. Iterative Process:
For each training example (x, y), where x is the input feature vector and y is the target output (+1
or -1)
Calculate the predicted output: ŷ = sign(w^T·x)
If misclassified (ŷ ≠ y), update the weights: w = w + η·y·x (where η is the learning rate, typically set
to 1)
3. Termination: Repeat until no misclassifications occur in an entire pass through the dataset or until
reaching a maximum number of iterations.
The working principle is based on iteratively adjusting the decision boundary (represented by the weight
vector) whenever a misclassification occurs. For linearly separable data, PLA is guaranteed to converge to
a solution in a finite number of updates.
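The update loop described above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the toy data set is made up, the bias is folded into the weight vector as w[0], and treating sign(0) as +1 is an assumption the text does not specify.

```python
def sign(value):
    """Return +1 for non-negative values and -1 otherwise (sign(0) = +1 is an assumption)."""
    return 1 if value >= 0 else -1


def train_perceptron(samples, labels, eta=1.0, max_epochs=100):
    """Run PLA; each sample is a feature list without the bias term."""
    # Prepend a constant 1 so that w[0] plays the role of the bias.
    augmented = [[1.0] + list(x) for x in samples]
    w = [0.0] * len(augmented[0])
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(augmented, labels):
            y_hat = sign(sum(wi * xi for wi, xi in zip(w, x)))
            if y_hat != y:
                # Update rule from the text: w = w + eta * y * x
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
                mistakes += 1
        if mistakes == 0:
            break  # a full pass with no misclassification
    return w


# Toy, linearly separable data (made up for illustration).
X = [[2.0, 3.0], [3.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]]
y = [1, 1, -1, -1]
w = train_perceptron(X, y)
predictions = [sign(w[0] + w[1] * a + w[2] * b) for a, b in X]
print(w, predictions)
```

On this toy set a single update is enough, and the second pass finds no misclassified point, so the loop stops early.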
Student   Attendance (x)   Marks (y)
1         28               43
2         27               39
3         23               27
4         27               36
5         24               34
6         28               39
7         26               36
8         21               36
9         22               31
10        28               37
To find marks for a student with 20 classes of attendance using linear regression:
Mean of y: ȳ = 35.8
Sum of x²: Σx² = 6,516
Step 2: Calculate the slope (m)
m = [n(Σxy) - (Σx)(Σy)] / [n(Σx²) - (Σx)²]
m = [10(9,168) - (254)(358)] / [10(6,516) - (254)²]
m = [91,680 - 90,932] / [65,160 - 64,516]
m = 748 / 644
m ≈ 1.161
Step 3: Calculate the y-intercept (b)
b = ȳ - m·x̄
b = 35.8 - 1.161(25.4)
b ≈ 35.8 - 29.5
b ≈ 6.3
Step 5: Predict marks for 20 classes
y = 1.161(20) + 6.3
y ≈ 23.2 + 6.3
y ≈ 29.5 ≈ 30 marks
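The sums, slope, intercept and prediction can be recomputed directly from the attendance table. The following is a small sketch in plain Python; the variable names are illustrative and not part of the original answer.

```python
attendance = [28, 27, 23, 27, 24, 28, 26, 21, 22, 28]  # x values from the table
marks = [43, 39, 27, 36, 34, 39, 36, 36, 31, 37]       # y values from the table

n = len(attendance)
sum_x = sum(attendance)
sum_y = sum(marks)
sum_xy = sum(x * y for x, y in zip(attendance, marks))
sum_x2 = sum(x * x for x in attendance)

# Closed-form least-squares estimates for y = m*x + b.
slope = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
intercept = (sum_y - slope * sum_x) / n

predicted = slope * 20 + intercept  # marks for 20 classes attended
print(sum_x, sum_y, sum_xy, sum_x2)
print(round(slope, 3), round(intercept, 3), round(predicted, 2))
```

Recomputing the sums from the raw table is a cheap check on hand arithmetic, since a small slip in one sum changes the slope noticeably.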
Given n data points (x₁, y₁), (x₂, y₂), ..., (xₙ, yₙ), the SSE is: SSE = Σ(yᵢ - (mxᵢ + b))²
Substituting into the first equation:
Σ(yᵢ - mxᵢ - (ȳ - mx̄))xᵢ = 0
Σ((yᵢ - ȳ) - m(xᵢ - x̄))xᵢ = 0
Σ(yᵢ - ȳ)xᵢ - mΣ(xᵢ - x̄)xᵢ = 0
Σ(yᵢ - ȳ)xᵢ - mΣ(xᵢ² - x̄xᵢ) = 0
Σ(yᵢ - ȳ)xᵢ - m(Σxᵢ² - x̄Σxᵢ) = 0
Solving for m:
m = Σ(yᵢ - ȳ)xᵢ / (Σxᵢ² - x̄Σxᵢ)
And b = ȳ - mx̄
These can also be written as:
m = [n(Σxy) - (Σx)(Σy)] / [n(Σx²) - (Σx)²]
b = [Σy - m(Σx)] / n
The perceptron function h(x) = sign(wᵀx) gives: h(x) = sign(w₀·1 + w₁x₁ + w₂x₂)
This means the boundary between the regions h(x) = +1 and h(x) = -1 is the line w₀ + w₁x₁ + w₂x₂ = 0.
Rearranging to solve for x₂:
w₂x₂ = -w₀ - w₁x₁
x₂ = (-w₀ - w₁x₁)/w₂ (assuming w₂ ≠ 0)
x₂ = -(w₀/w₂) - (w₁/w₂)·x₁
(ii) Draw a picture for the cases w = [3, 2, 1]ᵀ and w = -[3, 2, 1]ᵀ.
For w = [3, 2, 1]ᵀ:
a = -w₁/w₂ = -2/1 = -2
b = -w₀/w₂ = -3/1 = -3
So the line is x₂ = -2x₁ - 3
For w = -[3, 2, 1]ᵀ = [-3, -2, -1]ᵀ:
a = -w₁/w₂ = -(-2)/(-1) = -2
b = -w₀/w₂ = -(-3)/(-1) = -3
So the line is again x₂ = -2x₁ - 3
The two cases produce the same decision boundary line but with regions for h(x) = +1 and h(x) = -1
flipped between the two cases.
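A short check of this result, as a sketch in Python: the helper names boundary and h are illustrative, and mapping sign(0) to +1 is an assumption. The origin is used as a test point to show which side of the line each weight vector labels +1.

```python
def boundary(w):
    """Return slope a and intercept b of x2 = a*x1 + b for w = [w0, w1, w2]."""
    w0, w1, w2 = w
    if w2 == 0:
        raise ValueError("w2 must be non-zero to solve for x2")
    return -w1 / w2, -w0 / w2


def h(w, x1, x2):
    """Perceptron output sign(w0 + w1*x1 + w2*x2); a value of 0 is mapped to +1 here."""
    return 1 if w[0] + w[1] * x1 + w[2] * x2 >= 0 else -1


w_pos = [3, 2, 1]
w_neg = [-3, -2, -1]

line_pos = boundary(w_pos)
line_neg = boundary(w_neg)
origin_pos = h(w_pos, 0, 0)  # label of the origin under w
origin_neg = h(w_neg, 0, 0)  # label of the origin under -w
print(line_pos, line_neg, origin_pos, origin_neg)
```

Both weight vectors give the same slope and intercept, while the origin is classified +1 by w and -1 by -w.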
Where:
This inequality demonstrates that with a large enough sample size N, the probability of having a
significant difference between training and test errors becomes very small, making learning feasible.
Specifically:
2. For a fixed ε, increasing N makes the probability of a large deviation exponentially smaller
3. For a finite hypothesis set H, we can apply the union bound to obtain:
P[∃h ∈ H such that |Ein(h) - Eout(h)| > ε] ≤ 2|H|e^(-2ε²N)
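To get a feel for the numbers, the bound for a finite hypothesis set can be evaluated directly. This is a sketch only: the values ε = 0.05, |H| = 100 and the tolerance δ = 0.05 are illustrative assumptions, and δ is a symbol introduced here for the target probability.

```python
import math


def hoeffding_bound(epsilon, n_samples, num_hypotheses=1):
    """Upper bound 2|H|exp(-2 eps^2 N) on P[|Ein - Eout| > eps], capped at 1."""
    bound = 2 * num_hypotheses * math.exp(-2 * epsilon ** 2 * n_samples)
    return min(1.0, bound)


def samples_needed(epsilon, delta, num_hypotheses=1):
    """Smallest N for which the bound is at most delta."""
    return math.ceil(math.log(2 * num_hypotheses / delta) / (2 * epsilon ** 2))


# The bound shrinks exponentially as N grows (eps and |H| held fixed).
bounds = [hoeffding_bound(0.05, n, num_hypotheses=100) for n in (100, 1000, 10000)]
n_required = samples_needed(0.05, 0.05, num_hypotheses=100)
print(bounds, n_required)
```

For small N the raw expression exceeds 1 and says nothing, which is why the function caps it; the bound only becomes informative once N is large relative to ln|H| / ε².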
Where:
To minimize SSE, we take the derivative with respect to β and set it to zero: ∂SSE/∂β = -2Xᵀ(Y - Xβ) = 0
Non-linear Cases
For nonlinear relationships, we can transform the input features to higher-order terms or apply other
transformations, then use the same linear regression formula. Some approaches:
1. Polynomial regression: Include powers of variables (x², x³, etc.) Example: y = β₀ + β₁x + β₂x² + β₃x³
2. Feature interactions: Include products of variables (x₁x₂, etc.) Example: y = β₀ + β₁x₁ + β₂x₂ + β₃x₁x₂
4. Kernel methods: Implicitly map data to higher-dimensional spaces Example: Radial basis function
kernel
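Polynomial regression reuses the linear machinery unchanged: setting -2Xᵀ(Y - Xβ) to zero gives the normal equations XᵀXβ = XᵀY, and only the columns of X change. The sketch below assumes plain Python with a small Gaussian-elimination helper; the function names and the noise-free data are made up for illustration.

```python
def solve_linear_system(a, b):
    """Solve a·z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Build the augmented matrix [a | b] on copies so the inputs stay untouched.
    m = [list(map(float, row)) + [float(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n + 1):
                m[row][k] -= factor * m[col][k]
    # Back substitution on the upper-triangular system.
    z = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(m[row][k] * z[k] for k in range(row + 1, n))
        z[row] = (m[row][n] - tail) / m[row][row]
    return z


def fit_polynomial(xs, ys, degree):
    """Least-squares fit of y = b0 + b1*x + ... + bd*x^d via the normal equations."""
    design = [[x ** p for p in range(degree + 1)] for x in xs]  # rows [1, x, x^2, ...]
    cols = degree + 1
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(cols)] for i in range(cols)]
    xty = [sum(r[i] * y for r, y in zip(design, ys)) for i in range(cols)]
    return solve_linear_system(xtx, xty)


# Made-up data generated exactly from y = 1 + 2x + 3x^2.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys = [1 + 2 * x + 3 * x ** 2 for x in xs]
beta = fit_polynomial(xs, ys, degree=2)
print([round(b, 6) for b in beta])
```

Solving the normal equations directly is fine for a handful of features; for larger or ill-conditioned problems a library least-squares routine is usually the safer choice.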
Supervised Learning:
Uses fully labeled training data (each example has input features and a target output)
The algorithm learns to map inputs to outputs based on these labeled examples
Requires extensive labeled data, which can be expensive and time-consuming to obtain
Semi-supervised Learning:
Typically includes a small amount of labeled data and a large amount of unlabeled data
Exploits the structure in the unlabeled data to improve the learning model
Key assumptions:
1. Continuity: Points close to each other are likely to have the same label
2. Cluster: Data points in the same cluster likely belong to the same class
3. Manifold: Data lies on a lower-dimensional manifold within the higher-dimensional space
Particularly useful when labeled data is scarce or expensive, but unlabeled data is abundant
Examples: Web content classification, image recognition with partially labeled datasets
The main advantage of semi-supervised learning is that it can significantly reduce the amount of labeled
data needed for training while still achieving good performance by leveraging the structure in the
unlabeled data.
Supervised Learning:
Definition: Learning from labeled training data to predict outputs for new inputs
Process: Given input-output pairs (x, y), learn a function f(x) = y
Examples:
1. Email spam classification (Input: email text; Output: spam/not spam)
Unsupervised Learning:
Definition: Learning patterns from unlabeled data without specific output targets
Process: Given inputs x, find interesting structures or patterns
Examples:
1. Customer segmentation (grouping similar customers)
Semi-supervised Learning:
Examples:
1. Web page classification with few labeled examples
2. Speech recognition with partially transcribed audio
Reinforcement Learning:
Definition: Learning optimal actions through trial and error in an environment to maximize rewards
Process: Agent learns policy to maximize cumulative reward through environment interaction
Examples:
1. Game playing (AlphaGo, chess, Atari games)
2. Robotics (learning to walk, grasp objects)
Each paradigm addresses different problem types and data availability scenarios, making them suitable
for different real-world applications.
8. Regularization:
Controls model complexity to prevent overfitting
9. Feature Extraction/Engineering:
Process of transforming raw data into features suitable for modeling
These components work together to create a system that can learn patterns from data and make
predictions or decisions based on that learning.
10. Discuss with example the in-sample error and out-of-sample error.
In-sample Error (Ein) and Out-of-sample Error (Eout) are fundamental concepts in evaluating machine
learning model performance.
Represents how well the model fits the data it has seen
Formula: Ein(h) = (1/N) Σ L(h(xₙ), yₙ) for N training examples
Usually optimistically biased (underestimates true error)
Suppose we have data on 1000 houses with features like size, location, number of rooms, etc., and we
want to predict house prices using linear regression.
We use 800 houses for training and develop a model: Price = 50,000 + 100×Size + 5,000×Rooms
On the training data (800 houses), we calculate Mean Squared Error = $10,000 (Ein)
On the testing data (200 houses never seen during training), we calculate Mean Squared Error =
$25,000 (Eout)
1. The model fits training data better than test data (Ein < Eout)
2. Some overfitting has occurred (significant gap between Ein and Eout)
Eout initially decreases then increases (model becomes too complex and overfits)
The goal of learning is not to minimize Ein but to find a model complexity that minimizes Eout, which
requires balancing bias and variance.
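Both errors are the same average loss, computed on different data. The sketch below applies the house-price model from the example to a few made-up houses; the data points, the squared loss, and the helper names are illustrative assumptions, not figures from the example above.

```python
def squared_loss(prediction, target):
    """Pointwise loss L(h(x), y) used for regression."""
    return (prediction - target) ** 2


def average_error(h, data, loss=squared_loss):
    """(1/N) * sum of L(h(x_n), y_n) over the given (x, y) pairs."""
    return sum(loss(h(x), y) for x, y in data) / len(data)


def h(features):
    """Model from the example: Price = 50,000 + 100*Size + 5,000*Rooms."""
    size, rooms = features
    return 50_000 + 100 * size + 5_000 * rooms


# Made-up ((size, rooms), price) pairs, purely for illustration.
train = [((1000, 2), 160_050), ((1500, 3), 214_900), ((2000, 4), 270_100)]
test = [((1200, 3), 185_300), ((1800, 3), 244_600)]

e_in = average_error(h, train)   # error on data used for fitting
e_out = average_error(h, test)   # estimate of Eout from held-out data
print(e_in, e_out)
```

Strictly, the held-out average is only an estimate of Eout, which is defined over the whole input distribution; the estimate is trustworthy only if the test houses were never used during training.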
Classification:
k-Nearest Neighbors
Regression:
Polynomial Regression
Ridge/Lasso Regression
Neural Networks
Key Differences:
1. Output Type:
Classification: Discrete categories (e.g., "spam/not spam")
2. Error Metrics:
Classification: Accuracy, precision, recall, F1-score, AUC-ROC
Regression: Mean Squared Error (MSE), Mean Absolute Error (MAE), R²
3. Decision Boundaries:
Classification: Creates boundaries between classes
Regression: Fits a curve/surface to data points
4. Loss Functions:
Classification: Cross-entropy, hinge loss
Regression: Squared error, absolute error
Examples:
Age prediction: Regression (predict exact age) or Classification (predict age group)
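The error metrics listed above can be computed by hand for small cases. The following sketch covers accuracy, precision, recall and F1 for a binary classifier, and MSE and MAE for a regressor; the labels and values are made up, and AUC-ROC and R² are left out to keep it short.

```python
def classification_metrics(actual, predicted, positive=1):
    """Return (accuracy, precision, recall, f1) for binary labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    accuracy = sum(1 for a, p in pairs if a == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1


def regression_metrics(actual, predicted):
    """Return (mean squared error, mean absolute error)."""
    n = len(actual)
    mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n
    mae = sum(abs(a - p) for a, p in zip(actual, predicted)) / n
    return mse, mae


# Made-up predictions: 1 = spam, 0 = not spam.
labels = [1, 0, 1, 1, 0, 0]
guesses = [1, 0, 0, 1, 1, 0]
accuracy, precision, recall, f1 = classification_metrics(labels, guesses)

# Made-up ages (true vs. predicted) for the regression view of age prediction.
mse, mae = regression_metrics([30.0, 45.0, 60.0], [32.0, 44.0, 57.0])
print(accuracy, precision, recall, f1, mse, mae)
```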
R² is a statistical measure that represents the proportion of the variance in the dependent variable that is
predictable from the independent variables. It's also known as the coefficient of determination.
Where:
Interpretation:
R² ranges from 0 to 1 (can be negative for some models, indicating very poor fit)
The remaining 15% is due to other factors not captured by the model or random variation
Properties:
1. R² increases (or stays the same) when more variables are added to the model, even if those variables
are not significant
2. R² doesn't indicate whether the coefficients and predictions are biased
3. R² doesn't indicate whether a regression model is adequate or if the right model was chosen
4. Adding irrelevant variables can artificially increase R²
Adjusted R²: To address the issue of R² increasing with additional variables regardless of their
contribution:
R² helps quantify how well a linear regression model fits the data, making it a useful metric for model
evaluation and comparison.
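As a small worked sketch, R² can be computed as 1 - SSres/SStot, and adjusted R² with the standard definition 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of observations and p the number of predictors. The data values and function names below are made up for illustration.

```python
def r_squared(actual, predicted):
    """R^2 = 1 - SS_res / SS_tot."""
    mean_y = sum(actual) / len(actual)
    ss_res = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    ss_tot = sum((y - mean_y) ** 2 for y in actual)
    return 1 - ss_res / ss_tot


def adjusted_r_squared(r2, n_samples, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2)(n - 1)/(n - p - 1)."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)


# Made-up observations and model predictions.
actual = [3.0, 5.0, 7.0, 9.0, 11.0]
predicted = [2.8, 5.3, 6.9, 9.4, 10.6]

r2 = r_squared(actual, predicted)
adj_r2 = adjusted_r_squared(r2, n_samples=len(actual), n_predictors=1)
print(round(r2, 4), round(adj_r2, 4))
```

For any R² below 1 the adjusted value is smaller, and the gap widens as predictors are added without a matching drop in residual error.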
The preamble to the theory of learning establishes the foundational framework that makes machine
learning theoretically possible. It addresses several key questions:
1. Is Learning Possible?
The fundamental question: Can a model that performs well on training data also perform well on
unseen data?
This leads to the discussion of generalization from sample to population
4. Inductive Learning:
The core principle that what works on the training set will likely work on unseen data from the
same distribution
This requires assumptions like: a) The training and test data come from the same distribution b)
The training set is sufficiently large and representative c) The hypothesis space has appropriate
complexity
6. Approximation-Generalization Tradeoff:
Complex models can better approximate target functions but may not generalize well
Simple models may generalize better but might not capture complex patterns
This sets up the bias-variance tradeoff discussion
7. Feasibility of Learning:
Hoeffding's inequality provides statistical bounds on generalization error
Shows that with enough data, the probability of large deviation between training and test error
becomes very small
This preamble establishes that learning is theoretically possible under certain conditions and lays the
groundwork for more advanced learning theory concepts like VC dimension, generalization bounds, and
structural risk minimization.
14. Explain the contexts where linear regression is used. Write the linear
regression algorithm, in detail.
Contexts Where Linear Regression is Used:
1. Predictive Analysis:
Sales forecasting based on advertising spend
2. Relationship Analysis:
Understanding correlation between variables
Quantifying the effect of specific factors on an outcome
Medical research (e.g., relationship between dosage and response)
3. Trend Analysis:
Analyzing time series data
Economic forecasting
4. Quality Control:
Identifying factors affecting product quality
Process optimization
5. Financial Applications:
Portfolio risk assessment
Asset pricing models
Return forecasting
6. Feature Selection:
Identifying significant predictors
7. As a Baseline Model:
Starting point for more complex models
Benchmark for comparing advanced algorithms
Input: Training data D = {(x₁, y₁), (x₂, y₂), ..., (xₙ, yₙ)}, where xᵢ is a d-dimensional feature vector and yᵢ is the
target value
Output: Weight vector w and bias term b for the model y = w·x + b
Algorithm Steps:
1. Data Preparation:
Normalize/standardize features if needed
Split data into training and validation sets
Add a column of ones to X for the bias term if using matrix form
2. Model Specification:
Simple linear regression: y = b + w₁x₁
4. Model Evaluation:
Compute R² = 1 - (SSres/SStot)
5. Statistical Testing:
Calculate standard errors for coefficients
Perform t-tests for coefficient significance: t = wᵢ/SE(wᵢ)
Calculate p-values and confidence intervals
6. Prediction:
For new input x*, predict y* = w·x* + b
7. Regularization (Optional):
Ridge Regression (L2): Add λ‖w‖² to cost function
The algorithm outputs a linear function that minimizes the sum of squared differences between predicted
and actual values in the training data.
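For the single-feature case, the fitting, testing, prediction and regularization steps can be sketched in a few lines of plain Python. Assumptions to note: the data is made up, the ridge penalty is applied to the slope only (the bias is left unpenalized), and the t-statistic uses the usual ordinary-least-squares standard error, so it applies to the unregularized fit.

```python
import math


def fit_simple(xs, ys, lam=0.0):
    """Fit y = b + w*x; lam > 0 adds an L2 (ridge) penalty lam*w^2, bias unpenalized."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    w = s_xy / (s_xx + lam)
    return w, y_bar - w * x_bar


def slope_t_statistic(xs, ys, w, b):
    """t = w / SE(w) for an ordinary least-squares fit with one feature."""
    n = len(xs)
    x_bar = sum(xs) / n
    ss_res = sum((y - (w * x + b)) ** 2 for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    se_w = math.sqrt(ss_res / (n - 2) / s_xx)  # standard error of the slope
    return w / se_w


# Made-up training data, roughly y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

w_ols, b_ols = fit_simple(xs, ys)               # parameter estimation
t_stat = slope_t_statistic(xs, ys, w_ols, b_ols)  # statistical testing
y_star = w_ols * 6.0 + b_ols                    # prediction for x* = 6
w_ridge, b_ridge = fit_simple(xs, ys, lam=5.0)  # optional ridge variant
print(round(w_ols, 3), round(b_ols, 3), round(t_stat, 1), round(y_star, 2), round(w_ridge, 3))
```

The ridge fit pulls the slope toward zero, which is the shrinkage effect of adding λ‖w‖² to the cost function.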
15. Illustrate a simple learning model using the concept of input, output,
learning algorithm and hypothesis set.
Simple Learning Model Illustration
y = 1: spam email
Given training data D = {(x₁, y₁), (x₂, y₂), ..., (xₙ, yₙ)}
Initialize w = 0, b = 0
For each epoch:
For each training example (xᵢ, yᵢ):
Predict ŷᵢ = sign(w·xᵢ + b)
If misclassified (ŷᵢ ≠ yᵢ):
Update w = w + η