Unit 1,2,3

This document covers foundational concepts in machine learning, including well-posed learning problems, designing learning systems, and various machine learning algorithms like logistic regression and support vector machines (SVM). It also discusses instance-based learning techniques such as k-nearest neighbors and case-based reasoning, as well as the importance of evaluation and sampling theory in model validation. Key mathematical concepts, optimization methods, and practical examples using Python are provided to illustrate these principles.

Uploaded by

toy955086
Copyright: © All Rights Reserved

UNIT-1

1. Introduction to Machine Learning

Well-Posed Learning Problem (by Tom Mitchell):

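A learning problem is well-posed when its task T, performance measure P, and experience E are clearly specified. Mitchell's definition of learning: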

“A computer program is said to learn from experience E with respect to some class of tasks T and
performance measure P, if its performance at tasks in T, as measured by P, improves with experience
E.”

Example:

• T: Classifying emails as spam or not

• E: Historical labeled emails

• P: Classification accuracy

Visual:
Task (T): Email Classification

Experience (E): Labeled Emails

Performance (P): Accuracy

If accuracy on new emails improves as the program sees more labeled emails, the program is learning in Mitchell's sense.

2. Designing a Learning System

Steps:

1. Choose the task (e.g., classification, regression)

2. Choose the data representation (features)

3. Choose the learning algorithm (e.g., logistic regression, decision tree)

4. Train & test the model

5. Evaluate & refine

Visual Workflow:

[Collect Data] → [Choose Features] → [Select Model] → [Train] → [Test & Evaluate]

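
The workflow above can be sketched end to end with scikit-learn, the library used later in these notes. This is only an illustration: the synthetic dataset from `make_classification`, the 75/25 split, the scaling step, and the choice of logistic regression are all assumptions made for the example, not part of the original notes.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# [Collect Data] Synthetic binary-classification data (stands in for real data)
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Hold out a test set so evaluation uses data the model never saw
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# [Choose Features] + [Select Model] Scale the features, then fit a linear classifier
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# [Train]
model.fit(X_train, y_train)

# [Test & Evaluate]
test_accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Test accuracy: {test_accuracy:.2f}")
```

Keeping the scaler inside the pipeline means it is fitted on the training split only, so no information from the test split leaks into training.
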
3. Perspectives and Issues in Machine Learning

Perspectives:

• Statistical: Model data distribution and infer patterns

• Computational: Design efficient learning algorithms

• Biological: Understand learning in the brain

Perspectives & Issues

Perspective      Focus
Statistical      Pattern modeling, distributions
Computational    Algorithm efficiency
Biological       Brain-inspired models

Issues:

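• Overfitting: Model memorizes the training data (including its noise) and generalizes poorly

• Underfitting: Model too simple to capture the underlying pattern

• Data quality: Missing values, noise

• Model evaluation: Choosing the right metric (accuracy alone can mislead on imbalanced classes, where precision and recall are more informative)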

4. Logistic Regression

Used for binary classification.

Classification with Logistic Regression:

• Predicts the probability that input x belongs to class 1.

• Uses the sigmoid function to map the linear output to the interval (0, 1).

Sigmoid Function:

$$\sigma(z) = \frac{1}{1 + e^{-z}} \quad \text{where } z = w^T x + b$$

Diagram of Sigmoid Curve:

• S-shaped curve

• Output ranges between 0 and 1


Intuition:

• Outputs a probability between 0 and 1.

• Decision threshold at probability 0.5, which corresponds to z = 0.

Sigmoid Function Visual:

  1 |                ____
    |              /
0.5 |---------●---------
    |       /
  0 |_____/
       z<0   z=0   z>0

Example Problem: Logistic Regression

Problem:

Predict if a student will pass (1) or fail (0) based on hours studied.

Dataset:

Hours (x)    Pass (y)
1            0
2            0
3            0
4            1
5            1

Step 1: Linear Function

$$z = w \cdot x + b$$

Step 2: Sigmoid Function

$$\hat{y} = \frac{1}{1 + e^{-z}}$$

Suppose training yields the illustrative values:

• $w = 2.5$, $b = -6$

Then for x = 2 hours:

$$z = 2.5(2) - 6 = -1 \Rightarrow \hat{y} = \frac{1}{1 + e^{1}} \approx 0.27$$

For x = 5 hours:

$$z = 2.5(5) - 6 = 6.5 \Rightarrow \hat{y} \approx 0.998$$

Step 3: Classification Rule

• If $\hat{y} \geq 0.5$ → Predict Pass (1)

• Else → Predict Fail (0)

Note: these values of w and b only illustrate the arithmetic. They put the decision boundary at x = 6/2.5 = 2.4 hours, so they would predict Pass for x = 3, which the dataset labels as Fail.
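
The three steps above can be checked numerically with a few lines of Python. This is a minimal sketch that plugs in the example's w = 2.5 and b = -6; the function names `sigmoid` and `predict_pass` are made up for illustration.

```python
import math


def sigmoid(z):
    """Map a real-valued score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_pass(hours, w=2.5, b=-6.0, threshold=0.5):
    """Return (probability of passing, predicted class) for a study time in hours."""
    z = w * hours + b                  # Step 1: linear function
    p = sigmoid(z)                     # Step 2: sigmoid
    return p, int(p >= threshold)      # Step 3: classification rule


for hours in (2, 5):
    p, label = predict_pass(hours)
    print(f"x = {hours}: y_hat = {p:.3f}, predicted class = {label}")
```
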

5. Using Optimization to Find Best Coefficients

We train the model in three steps:

• Define a cost function that measures how bad the predictions are (e.g., cross-entropy loss) over the m training examples:

$$J(w, b) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]$$

• Optimize the cost using an algorithm such as gradient descent.

• Update the weights repeatedly to minimize the error on the training data, which yields the best values for w and b.

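
As a rough sketch of this optimization, the snippet below runs plain batch gradient descent on the cross-entropy cost for the hours-studied dataset from the example. The learning rate, iteration count, and zero initialization are assumptions chosen for illustration. Because this tiny dataset is perfectly separable, the weights keep growing the longer you train, so the fitted values depend on the stopping point and will not reproduce the illustrative w and b used earlier.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def cross_entropy(y, y_hat, eps=1e-12):
    """Mean cross-entropy cost J(w, b); clipping avoids log(0)."""
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))


def fit_logistic(x, y, lr=0.1, n_iters=20000):
    """Batch gradient descent for one-feature logistic regression."""
    w, b = 0.0, 0.0
    history = []
    for _ in range(n_iters):
        y_hat = sigmoid(w * x + b)
        history.append(cross_entropy(y, y_hat))
        # Gradients of the cost: dJ/dw = mean((y_hat - y) * x), dJ/db = mean(y_hat - y)
        grad_w = np.mean((y_hat - y) * x)
        grad_b = np.mean(y_hat - y)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b, history


hours = np.array([1, 2, 3, 4, 5], dtype=float)
passed = np.array([0, 0, 0, 1, 1], dtype=float)

w, b, history = fit_logistic(hours, passed)
predictions = (sigmoid(w * hours + b) >= 0.5).astype(int)
print(f"w = {w:.2f}, b = {b:.2f}, final cost = {history[-1]:.4f}")
print("Predictions:", predictions.tolist())
```
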
UNIT-2

1. Separating Data with the Maximum Margin

Support Vector Machines aim to classify data by finding the best possible separating hyperplane
between classes.

• The best hyperplane is the one that maximizes the margin between the classes.

• Margin = Distance between the hyperplane and the closest data points from each class
(called support vectors).

Visual:

Class A (●) Class B (○)

● ● ○ ○

● ← Margin → ○

———————— ← Max Margin Hyperplane

● ○

2. Finding the Maximum Margin

The hyperplane is defined as:

$$w \cdot x + b = 0$$

To maximize the margin, we solve the optimization problem:

$$\min \frac{1}{2} \|w\|^2 \quad \text{subject to:} \quad y_i(w \cdot x_i + b) \geq 1$$

• This is a convex quadratic optimization problem.

• The solution only depends on support vectors.

3. Efficient Optimization with the SMO Algorithm

SMO (Sequential Minimal Optimization) is a fast method to train SVMs:

• Breaks the large optimization problem into the smallest possible subproblems (solving for two Lagrange multipliers at a time).

• Efficient for large datasets.

• Avoids a general-purpose numerical QP solver, because each two-multiplier subproblem has a closed-form solution.
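
To make the two-multiplier idea concrete, here is a simplified SMO sketch for a linear soft-margin SVM. It is a teaching simplification, not Platt's full algorithm: the second multiplier is picked as the point whose error differs most from the first, and the function name `smo_linear` and its parameters (`tol`, `max_passes`, `max_sweeps`) are assumptions made for this example. It is run on the four-point dataset from the step-by-step SVM example later in this unit.

```python
import numpy as np


def smo_linear(X, y, C=1.0, tol=1e-3, max_passes=5, max_sweeps=1000):
    """Simplified SMO for a linear soft-margin SVM (teaching sketch)."""
    m = len(y)
    K = X @ X.T                          # Gram matrix of the linear kernel
    alpha = np.zeros(m)                  # Lagrange multipliers
    b = 0.0
    passes = 0                           # consecutive sweeps without any update
    for _ in range(max_sweeps):
        if passes >= max_passes:
            break
        changed = 0
        for i in range(m):
            E = (alpha * y) @ K + b - y  # E[k] = f(x_k) - y_k for every point
            violates_kkt = ((y[i] * E[i] < -tol and alpha[i] < C) or
                            (y[i] * E[i] > tol and alpha[i] > 0))
            if not violates_kkt:
                continue
            # Second multiplier: the point whose error differs most from E[i]
            gap = np.abs(E - E[i])
            gap[i] = -1.0
            j = int(np.argmax(gap))
            ai_old, aj_old = alpha[i], alpha[j]
            # Bounds that keep 0 <= alpha <= C and sum(alpha * y) unchanged
            if y[i] != y[j]:
                lo, hi = max(0.0, aj_old - ai_old), min(C, C + aj_old - ai_old)
            else:
                lo, hi = max(0.0, ai_old + aj_old - C), min(C, ai_old + aj_old)
            eta = 2 * K[i, j] - K[i, i] - K[j, j]
            if lo == hi or eta >= 0:
                continue
            # Closed-form optimum of the two-variable subproblem, clipped to the box
            aj_new = float(np.clip(aj_old - y[j] * (E[i] - E[j]) / eta, lo, hi))
            if abs(aj_new - aj_old) < 1e-8:
                continue
            ai_new = ai_old + y[i] * y[j] * (aj_old - aj_new)
            # Recompute the threshold b so the updated points satisfy the KKT conditions
            b1 = (b - E[i] - y[i] * (ai_new - ai_old) * K[i, i]
                  - y[j] * (aj_new - aj_old) * K[i, j])
            b2 = (b - E[j] - y[i] * (ai_new - ai_old) * K[i, j]
                  - y[j] * (aj_new - aj_old) * K[j, j])
            if 0 < ai_new < C:
                b = b1
            elif 0 < aj_new < C:
                b = b2
            else:
                b = (b1 + b2) / 2
            alpha[i], alpha[j] = ai_new, aj_new
            changed += 1
        passes = passes + 1 if changed == 0 else 0
    w = (alpha * y) @ X                  # primal weights from the multipliers
    return w, b, alpha


X = np.array([[2.0, 3.0], [3.0, 3.0], [2.0, 1.0], [3.0, 1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b, alpha = smo_linear(X, y)
print("w =", w, "b =", b, "alpha =", alpha)
```

Each inner step changes exactly two multipliers, which is the smallest change that can keep the constraint sum(alpha_i * y_i) = 0 satisfied.
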


4. Using Kernels for More Complex Data

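When data is not linearly separable, SVMs use kernel functions to implicitly map the data into a higher-dimensional space where a linear separator can be found. The kernel computes inner products in that space directly, so the mapping never has to be carried out explicitly.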

Visual:

Original Space (non-linear): Mapped Space (linear):

○○●● ○

● ○ ● ○ → (via kernel) → ●●○

○●○○ ○●

Common Kernels:

• Linear Kernel: $K(x, x') = x \cdot x'$

• Polynomial Kernel: $K(x, x') = (x \cdot x' + c)^d$

• RBF/Gaussian Kernel: $K(x, x') = \exp(-\gamma \|x - x'\|^2)$
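
The three kernels above translate directly into NumPy. The sketch below also checks, for one pair of points, that a degree-2 polynomial kernel with c = 0 equals an ordinary dot product after an explicit feature map, which is the idea behind the kernel trick. The default values of `c`, `d`, and `gamma` and the helper `phi_degree2` are assumptions chosen for the example.

```python
import numpy as np


def linear_kernel(x, x2):
    return float(np.dot(x, x2))


def polynomial_kernel(x, x2, c=1.0, d=2):
    return float((np.dot(x, x2) + c) ** d)


def rbf_kernel(x, x2, gamma=0.5):
    diff = np.asarray(x, dtype=float) - np.asarray(x2, dtype=float)
    return float(np.exp(-gamma * np.dot(diff, diff)))


def phi_degree2(x):
    """Explicit feature map for the 2D polynomial kernel with c = 0, d = 2."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])


a = np.array([1.0, 2.0])
b = np.array([3.0, 0.0])

k_lin = linear_kernel(a, b)
k_poly = polynomial_kernel(a, b)
k_rbf = rbf_kernel(a, b)
print(k_lin, k_poly, k_rbf)

# Kernel trick: the kernel value equals a dot product in the mapped space
k_implicit = polynomial_kernel(a, b, c=0.0, d=2)
k_explicit = float(np.dot(phi_degree2(a), phi_degree2(b)))
print(k_implicit, k_explicit)
```
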

Summary

Component          Explanation
Goal               Find the hyperplane with the maximum margin
Support Vectors    Closest points to the hyperplane
SMO                Efficient way to solve the SVM optimization
Kernels            Allow SVMs to handle nonlinear data

Step-by-Step SVM Example (Linearly Separable)

Dataset:

x1    x2    Class
2     3     +1
3     3     +1
2     1     -1
3     1     -1

Step 1: Plot the points

↑ ● (+1)

| ● (+1)

| ○ (-1)

| ○ (-1)

+--------------→ x

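Step 2: Find the optimal hyperplane

We separate the data with a line w · x + b = 0. Both +1 points have x2 = 3 and both -1 points have x2 = 1, so the maximum-margin line is the horizontal line x2 = 2, that is, w = (0, 1) and b = -2.

Step 3: Identify Support Vectors

In this simple case, all four points satisfy y_i(w · x_i + b) = 1, so they all lie exactly on the margin boundaries, at distance 1/||w|| = 1 from the hyperplane. A solver may still report only a subset of them as support vectors, because the dual solution is not unique here.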

Python Code Using scikit-learn

import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm

# 1. Sample data
X = np.array([[2, 3], [3, 3], [2, 1], [3, 1]])
y = [1, 1, -1, -1]

# 2. Train a linear SVM
clf = svm.SVC(kernel='linear', C=1.0)
clf.fit(X, y)

# 3. Plot the points and decision boundary
plt.scatter(X[:, 0], X[:, 1], c=y, cmap='bwr', s=100)
ax = plt.gca()
xlim = ax.get_xlim()
ylim = ax.get_ylim()

# Decision boundary: w[0]*x1 + w[1]*x2 + b = 0, solved for x2
xx = np.linspace(xlim[0], xlim[1])
w = clf.coef_[0]
b = clf.intercept_[0]
yy = -(w[0] * xx + b) / w[1]

# Margins: perpendicular distance 1/||w||, converted to a vertical offset
margin = 1 / np.sqrt(np.sum(w ** 2))
yy_down = yy - np.sqrt(1 + (w[0] / w[1]) ** 2) * margin
yy_up = yy + np.sqrt(1 + (w[0] / w[1]) ** 2) * margin

# Plot hyperplane and margins
plt.plot(xx, yy, 'k-')
plt.plot(xx, yy_down, 'k--')
plt.plot(xx, yy_up, 'k--')

# Circle the support vectors
plt.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1],
            s=200, facecolors='none', edgecolors='k')
plt.title("Linear SVM with Margins and Support Vectors")
plt.show()

Visualizing Kernel Trick: RBF Kernel on Nonlinear Data

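Dataset:

import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm
from sklearn.datasets import make_circles

# Two concentric rings: not linearly separable in 2D
X, y = make_circles(n_samples=100, factor=0.5, noise=0.05, random_state=0)

# Train RBF SVM
clf = svm.SVC(kernel='rbf', C=1.0, gamma='auto')
clf.fit(X, y)

# Evaluate the decision function on a grid; its zero level set is the boundary
xx, yy = np.meshgrid(np.linspace(-1.5, 1.5, 200), np.linspace(-1.5, 1.5, 200))
Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

# Plot the points, the decision boundary, and the support vectors
plt.scatter(X[:, 0], X[:, 1], c=y, cmap='bwr', s=50)
plt.contour(xx, yy, Z, levels=[0], colors='k')
plt.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1],
            s=150, facecolors='none', edgecolors='k')
plt.title("SVM with RBF Kernel (Nonlinear Data)")
plt.show()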

What You See:

• The two classes are not linearly separable in 2D

• The RBF kernel implicitly maps them into a higher-dimensional space

• The SVM finds a roughly circular boundary in the input space


Instance-Based Learning

• Core Idea: Instead of learning an explicit model, store instances (examples) and predict new
cases by comparing them to stored ones.

• Philosophy: Memory is intelligence — decisions emerge from experience, not abstraction.

k-Nearest Neighbor (k-NN) Learning

• How it works:

1. Store all training data.

2. To predict, find the k closest examples (neighbors).

3. Vote (for classification) or average (for regression) their outcomes.

• Power: Simplicity + flexibility.

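• Challenge: Every prediction searches the stored data, so it needs a suitable distance metric (with features on comparable scales) and efficient search structures (e.g., k-d trees).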
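
The three steps of k-NN fit in a few lines of NumPy. This is a minimal brute-force sketch for classification using Euclidean distance. The toy points, the labels "A" and "B", and the function name `knn_predict` are made up for illustration.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_query, k=3):
    """Classify x_query by majority vote among its k nearest training points."""
    # Step 1 is implicit: the "model" is simply the stored training data
    dists = np.linalg.norm(X_train - x_query, axis=1)   # Euclidean distances
    nearest = np.argsort(dists)[:k]                     # Step 2: k closest examples
    votes = Counter(y_train[i] for i in nearest)        # Step 3: majority vote
    return votes.most_common(1)[0][0]


X_train = np.array([[1, 1], [1, 2], [2, 1], [5, 5], [6, 5], [5, 6]], dtype=float)
y_train = ["A", "A", "A", "B", "B", "B"]

label_low = knn_predict(X_train, y_train, np.array([1.5, 1.5]))
label_high = knn_predict(X_train, y_train, np.array([5.5, 5.5]))
print(label_low, label_high)
```

For regression, the vote would be replaced by the average of the neighbors' target values.
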

Radial Basis Functions (RBF)

• Concept:

o Create localized functions (like Gaussians) centered on stored instances.

o Model complex patterns by blending these "bumps" across space.

• Use: Popular in function approximation, neural nets, and control systems.

• Magic: Smooth interpolation between known points.
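
A small sketch of this idea: place one Gaussian "bump" on each stored instance and solve for the weights that make the blended function pass through the known targets. The sample points, the sine targets, the width parameter `gamma`, and the function names are assumptions chosen for illustration.

```python
import numpy as np


def rbf_design(x, centers, gamma):
    """Matrix of Gaussian bumps: entry (i, j) = exp(-gamma * (x_i - c_j)^2)."""
    return np.exp(-gamma * (x[:, None] - centers[None, :]) ** 2)


# Stored instances (centers) and their known target values
centers = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
targets = np.sin(centers)
gamma = 1.0

# One weight per bump, chosen so the blend reproduces the targets exactly
weights = np.linalg.solve(rbf_design(centers, centers, gamma), targets)


def rbf_predict(x_new):
    """Evaluate the weighted sum of Gaussian bumps at new inputs."""
    x_new = np.atleast_1d(np.asarray(x_new, dtype=float))
    return rbf_design(x_new, centers, gamma) @ weights


fitted = rbf_predict(centers)          # reproduces the stored targets
between = rbf_predict([0.5, 1.5, 2.5]) # smooth interpolation between known points
print(fitted, between)
```
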

Case-Based Reasoning (CBR)

• Approach:

1. Remember past cases.

2. Retrieve the most similar case for a new problem.

3. Adapt its solution if needed.

• Strength: Solves novel problems by analogical thinking.

• Industries: Widely used in medical diagnosis, customer support, and law.
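
The remember, retrieve, and adapt loop can be sketched with a tiny case base. Everything below is hypothetical: the support cases, the attribute-matching similarity measure, and the very simple adaptation rule are stand-ins chosen for illustration, and real CBR systems use domain-specific similarity and adaptation knowledge.

```python
def similarity(case_a, case_b):
    """Fraction of shared attributes on which two cases agree."""
    keys = case_a.keys() & case_b.keys()
    if not keys:
        return 0.0
    return sum(case_a[k] == case_b[k] for k in keys) / len(keys)


# 1. Remember past cases: (problem description, solution that worked)
case_base = [
    ({"device": "router", "symptom": "no_signal", "led": "off"},
     "Check the power supply"),
    ({"device": "router", "symptom": "slow", "led": "green"},
     "Change the Wi-Fi channel"),
    ({"device": "laptop", "symptom": "no_signal", "led": "green"},
     "Re-enable the network adapter"),
]


def solve(problem):
    """Retrieve the most similar past case and reuse its solution."""
    # 2. Retrieve the most similar case
    best_case, best_solution = max(
        case_base, key=lambda case: similarity(problem, case[0])
    )
    score = similarity(problem, best_case)
    # 3. Adapt: here, only flag the reused solution when the match is partial
    if score < 1.0:
        best_solution += " (adapted from a partial match; verify)"
    return best_solution, score


solution, score = solve({"device": "router", "symptom": "no_signal", "led": "red"})
print(solution, round(score, 2))
```
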

In essence:
Instance-Based Learning turns memory into intelligence, favoring experience over theory. It's AI's
version of "street smarts."
UNIT-3

Evaluating Hypotheses: Motivation


Core Drive:
We build hypotheses (models) to generalize — not just fit the data we already have, but to
predict the unknown.

Without evaluation:

• We risk fooling ourselves with overfitting.

• We cannot trust that a “good” model will stay good in the real world.

• Progress cannot be measured because there is no objective benchmark.

With evaluation:

• We validate the learning process — separate genuine intelligence from memorization.

• We choose the best model among many contenders.

• We measure progress systematically, not emotionally.

In essence:
Evaluation turns "I think it works" into "I know it works."


Estimating Hypothesis Accuracy

Goal:
Approximate how well a hypothesis will perform on unseen data — without needing to test on
the whole universe.

How:

• Use a finite, random sample (test set) drawn independently of the training data.

• Measure how often the hypothesis is correct (or wrong) on that sample.

Formula:

$$\text{Estimated Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}}$$

Reality Check:

• The estimate is not perfect: it carries sampling error.

• By the law of large numbers, bigger samples give an estimate closer to the true accuracy.

Key Point:
Estimation is more than measurement: it is a statistically informed prediction of future performance.

In essence:
You peek into the unknown world by holding up a tiny, smart mirror.


Basics of Sampling Theory


Core Idea:
When you can't measure an entire population, sampling lets you predict population behavior
from a small, smart subset.

Key Principles:

• Random Sampling:
Every data point has an equal chance of being picked → unbiased estimate.

• Sample Mean ≈ Population Mean (with some fluctuation):
Bigger, random samples shrink that fluctuation (variance decreases).

• Law of Large Numbers:
As sample size grows, estimates converge to the true population values.

Critical Tools:

• Variance: How spread out the sample results are.

• Standard Error: How far the sample mean typically is from the true mean.

• Confidence Intervals: Statistical "safe zones" where the true answer probably lies.

Key Formula:
$$\text{Error margin} \propto \frac{1}{\sqrt{\text{Sample Size}}}$$

Meaning:
To double the precision (halve the error margin), you need four times as much data.
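
A quick simulation illustrates the square-root rule. This is a sketch under assumed settings: a synthetic normal population with mean 50 and standard deviation 10, sample sizes 100 and 400, and 2000 repeated samples per size are all choices made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "population" (an assumption standing in for real data)
population = rng.normal(loc=50.0, scale=10.0, size=100_000)


def spread_of_sample_means(n, n_repeats=2000):
    """Standard deviation of the sample mean over many random samples of size n."""
    means = [rng.choice(population, size=n).mean() for _ in range(n_repeats)]
    return float(np.std(means))


se_100 = spread_of_sample_means(100)
se_400 = spread_of_sample_means(400)
ratio = se_100 / se_400

print(f"Spread of the mean, n=100: {se_100:.3f}")
print(f"Spread of the mean, n=400: {se_400:.3f}")
print(f"Ratio: {ratio:.2f}")
```

With four times the data, the spread of the sample mean should come out at roughly half, up to simulation noise.
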

In essence:
Sampling turns uncertainty into power, and small windows into big insights.

General Approach for Deriving Confidence Intervals

1. Start with Your Sample Data

• Gather a random sample of data points from the population.

2. Calculate the Sample Statistic

• For example, the sample mean ($\bar{x}$) or sample proportion.

3. Determine the Standard Error (SE)

• The standard error estimates how much your sample statistic typically deviates from the true population parameter across repeated samples.
For the mean, it is calculated as:

$$SE = \frac{s}{\sqrt{n}}$$

Where:

• $s$ = sample standard deviation

• $n$ = sample size

4. Choose the Confidence Level

• Common choices are 95% or 99%.

• The confidence level is the fraction of such intervals, over repeated sampling, that would contain the true parameter. A higher level gives a wider interval.

5. Find the Z- or T-Value

• For large samples (a common rule of thumb is n ≥ 30), use the Z-score (from the standard normal distribution).

• For small samples, use the T-score (from Student's t-distribution).

For 95% confidence:

• Z-value ≈ 1.96

• T-value depends on degrees of freedom (n-1).

6. Calculate the Margin of Error


Multiply the standard error by the Z- or T-value (written Z below; substitute the T-value for small samples):

$$\text{Margin of Error} = Z \times SE$$

7. Derive the Confidence Interval

• For the mean:

$$\left( \bar{x} - \text{Margin of Error},\ \bar{x} + \text{Margin of Error} \right)$$

• For proportions:

$$\left( \hat{p} - \text{Margin of Error},\ \hat{p} + \text{Margin of Error} \right)$$

Where $\hat{p}$ is the sample proportion, with standard error $SE = \sqrt{\hat{p}(1 - \hat{p})/n}$.
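
The seven steps can be sketched with the Python standard library alone. This version always uses the Z-value (the large-sample case); for small samples the T-value would be substituted. The proportion interval uses the usual standard error sqrt(p(1 - p)/n). The figure of 85 correct predictions out of 100 is a hypothetical input, and the function names are made up for the example.

```python
import math
from statistics import NormalDist, mean, stdev


def z_value(confidence=0.95):
    """Two-sided critical value of the standard normal distribution."""
    return NormalDist().inv_cdf(0.5 + confidence / 2)


def mean_ci(sample, confidence=0.95):
    """Large-sample confidence interval for the population mean."""
    n = len(sample)
    se = stdev(sample) / math.sqrt(n)          # SE = s / sqrt(n)
    margin = z_value(confidence) * se          # Margin of Error = Z * SE
    center = mean(sample)
    return center - margin, center + margin


def proportion_ci(successes, n, confidence=0.95):
    """Large-sample confidence interval for a proportion (e.g., accuracy)."""
    p_hat = successes / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    margin = z_value(confidence) * se
    return p_hat - margin, p_hat + margin


# Hypothetical test result: 85 correct predictions out of 100
low, high = proportion_ci(85, 100)
print(f"95% CI for accuracy: ({low:.3f}, {high:.3f})")
```
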

Visualizing It

Think of the confidence interval as a range within which you’re confident the true population
parameter lies — the wider it is, the less precise your estimate.

In essence:
Confidence intervals are your statistical “safe zone” for the truth, giving you both precision and
flexibility.

Difference in Error of Two Hypotheses


Goal:
Determine if one hypothesis (model) performs significantly better than another — or if the
observed error difference is just random noise.

Steps:

1. Compute Error Rates:

o For both hypotheses (A and B), calculate their error rates (e.g., classification error, mean squared error).

o Error for hypothesis A: $E_A$

o Error for hypothesis B: $E_B$

2. Difference in Error:
Calculate the difference:

$$\Delta E = |E_A - E_B|$$

3. Statistical Significance:
The error difference needs to be statistically significant to conclude that one hypothesis is
truly better:

o Use paired t-test or permutation test:

▪ To compare the error rates across the same data (paired test).

▪ To assess whether the error difference is larger than what could occur by
random chance.

4. Confidence Intervals:

o Derive a confidence interval around the signed error difference $E_A - E_B$.

o If the interval contains 0, the difference is not significant.

5. Interpretation:

o If the difference is statistically significant (and large), one hypothesis outperforms the other.

o If not significant, the error difference may be due to random fluctuations.

Advanced Insights:

• Variance in Error:
Both hypotheses might have different variances in their errors — adjust for that.

• Bootstrap Resampling:
Use bootstrapping (random sampling with replacement) to estimate the distribution of error
differences, particularly for small sample sizes.

In essence:
The difference in error is about digging deeper, asking if the performance gap is real or just a
statistical illusion.
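
As one possible sketch of the bootstrap approach mentioned above, the code below resamples test examples with replacement and builds a percentile confidence interval for the signed difference in error rates. Both models are assumed to be evaluated on the same test examples (a paired setup). The 0/1 error vectors are hypothetical data invented for the example, and the function name is an assumption.

```python
import numpy as np


def bootstrap_error_difference(err_a, err_b, n_boot=5000, confidence=0.95, seed=0):
    """Percentile bootstrap CI for mean(err_a) - mean(err_b) on paired examples."""
    err_a = np.asarray(err_a, dtype=float)
    err_b = np.asarray(err_b, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(err_a)
    # Each row is one bootstrap resample of test-example indices (with replacement)
    idx = rng.integers(0, n, size=(n_boot, n))
    diffs = err_a[idx].mean(axis=1) - err_b[idx].mean(axis=1)
    tail = (1 - confidence) / 2
    low, high = np.quantile(diffs, [tail, 1 - tail])
    observed = float(err_a.mean() - err_b.mean())
    return observed, float(low), float(high)


# Hypothetical per-example errors (1 = misclassified) on the same 200 test examples
err_a = np.zeros(200)
err_a[:30] = 1            # model A: 30 errors
err_b = np.zeros(200)
err_b[:60] = 1            # model B: 60 errors

observed, low, high = bootstrap_error_difference(err_a, err_b)
print(f"E_A - E_B = {observed:.3f}, 95% CI = ({low:.3f}, {high:.3f})")
print("Significant" if low > 0 or high < 0 else "Not significant")
```
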


Comparing Learning Algorithms


Goal:
Evaluate different algorithms to determine which performs best on a given task — not based on
theoretical preference, but real-world data performance.

Steps for Effective Comparison:

1. Set a Standard Evaluation Metric:

o Accuracy: For classification problems.

o Mean Squared Error (MSE): For regression problems.


o Precision/Recall/F1 Score: For imbalanced classification.

o AUC-ROC: For binary classification problems.

Choose a metric that matches the business goal (e.g., high precision for fraud detection, high
recall for medical diagnoses).

2. Use Consistent Data:

o Same training/test set: Ensure fairness in comparison.

o Cross-validation: Use k-fold or leave-one-out cross-validation to evaluate how well each model generalizes. This reduces variability in results from a single test/train split.

3. Statistical Testing:

o Use tests (e.g., paired t-test, Wilcoxon test) to determine if the differences in
performance are statistically significant, not just due to random chance.

4. Consider Complexity:

o Model Complexity: Simpler models (e.g., linear regression) are often easier to
interpret but may underperform on complex data. Complex models (e.g., neural
networks) might overfit without enough data.

o Training Time: Evaluate not just performance, but also how fast the model trains
and predicts.

o Memory Usage: Some models (e.g., decision trees) use less memory than others
(e.g., deep learning models).

5. Robustness and Stability:

o Test how models perform under variations in data (e.g., noisy data, missing values).
Stability is key for real-world applications.

6. Hyperparameter Tuning:

o Hyperparameter optimization: Ensure each model has been optimized to its best
potential. Algorithms like grid search or random search can help find the optimal
settings.
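
Steps 1 to 3 can be sketched with scikit-learn and SciPy: fix one metric, score both models on identical cross-validation folds, then apply a paired t-test to the per-fold scores. The synthetic dataset, the two models, and the 10-fold setup are assumptions for illustration. Fold scores are not fully independent, so the p-value should be read as a rough guide rather than an exact significance level.

```python
from scipy.stats import ttest_rel
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for a real task
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# The same folds for both models, so the per-fold scores are paired
cv = KFold(n_splits=10, shuffle=True, random_state=0)

scores_lr = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="accuracy"
)
scores_dt = cross_val_score(
    DecisionTreeClassifier(random_state=0), X, y, cv=cv, scoring="accuracy"
)

# Paired t-test on the fold-by-fold accuracy differences
t_stat, p_value = ttest_rel(scores_lr, scores_dt)

print(f"Logistic regression: {scores_lr.mean():.3f} +/- {scores_lr.std():.3f}")
print(f"Decision tree:       {scores_dt.mean():.3f} +/- {scores_dt.std():.3f}")
print(f"Paired t-test: t = {t_stat:.2f}, p = {p_value:.3f}")
```
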

Visualization:

• Learning Curves: Plot training and validation error as a function of training size to
understand overfitting or underfitting.

• Box Plots or Violin Plots: Compare the distribution of errors across models.

Practical Considerations:

• Interpretability: In some cases, you might need to prioritize explainability (e.g., decision
trees, linear models) over performance.
• Scalability: Consider how each algorithm scales with increasing data size.

• Business Impact: A more complex model might outperform others but at a cost (e.g.,
resource usage, slower predictions).

In essence:
The best algorithm is the one that balances performance with practical constraints — there’s no
"one-size-fits-all". The winner is determined by context, not theory.
