Units 1, 2 & 3
“A computer program is said to learn from experience E with respect to some class of tasks T and
performance measure P, if its performance at tasks in T, as measured by P, improves with experience
E.”
Example: Email Classification
• Task (T): Classifying incoming emails as spam or not spam
• Performance measure (P): Classification accuracy (the fraction of emails classified correctly)
• Experience (E): A collection of emails already labeled as spam or not spam
4. Logistic Regression
Sigmoid Function:
\sigma(z) = \frac{1}{1 + e^{-z}}
• S-shaped curve that squashes any real-valued input into the range (0, 1), so the output can be read as a probability
Intuition:
  1  |              ________
     |            /
0.5  |-----------●-----------
     |          /
  0  |________/
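A minimal sketch that reproduces this curve with numpy and matplotlib (the range of z values is chosen arbitrarily for illustration):
import numpy as np
import matplotlib.pyplot as plt

# Plot the sigmoid sigma(z) = 1 / (1 + e^(-z)) to see the S-shape sketched above
z = np.linspace(-8, 8, 200)
sigma = 1 / (1 + np.exp(-z))
plt.plot(z, sigma)
plt.axhline(0.5, linestyle="--")   # 0.5 is the usual decision threshold
plt.xlabel("z = w·x + b")
plt.ylabel("sigma(z)")
plt.show()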
Problem:
Predict if a student will pass (1) or fail (0) based on hours studied.
Dataset:
Hours Studied (x)   Passed (y)
1                   0
2                   0
3                   0
4                   1
5                   1
For x = 5 hours: the model outputs P(pass) = σ(w · 5 + b) (see the sketch after the optimization note below).
• 🛠 Optimization:
• We minimize the cost (log loss) using gradient descent to find the best values for w and b.
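A minimal sketch with scikit-learn (its LogisticRegression minimizes the same log-loss cost with a built-in solver rather than hand-written gradient descent; the printed values are whatever the solver finds on this tiny dataset, not numbers from the notes):
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hours studied -> pass (1) / fail (0), the dataset from the table above
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([0, 0, 0, 1, 1])

model = LogisticRegression()
model.fit(X, y)                            # minimizes the log-loss cost

w, b = model.coef_[0][0], model.intercept_[0]
p_pass = 1 / (1 + np.exp(-(w * 5 + b)))    # sigma(w*5 + b) for x = 5 hours
print(f"w = {w:.2f}, b = {b:.2f}, P(pass | x = 5) = {p_pass:.2f}")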
UNIT – 2
Support Vector Machines aim to classify data by finding the best possible separating hyperplane
between classes.
• The best hyperplane is the one that maximizes the margin between the classes.
• Margin = Distance between the hyperplane and the closest data points from each class
(called support vectors).
Visual:
● ● ○ ○
● ← Margin → ○
● ○
Hyperplane equation: w \cdot x + b = 0. The margin width is \frac{2}{\lVert w \rVert}, so maximizing the margin means minimizing \lVert w \rVert.
• SMO (Sequential Minimal Optimization) breaks the large optimization problem into smaller sub-problems, solving for two Lagrange multipliers at a time.
When data is not linearly separable, SVMs use kernel functions to transform data into a higher-
dimensional space where a linear separator can be found.
Visual:
○○●● ○
○●○○ ○●
Common Kernels: Linear, Polynomial, RBF (Gaussian), Sigmoid.
Dataset:
x1   x2   Class
2    3    +1
3    3    +1
2    1    -1
3    1    -1
x2
↑    ● (+1)    ● (+1)
|
|    ○ (-1)    ○ (-1)
+----------------→ x1
In this simple case, all four points are support vectors because they lie exactly on the margin boundaries.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC
# 1. Sample data (the four points from the table above)
X = np.array([[2, 3], [3, 3], [2, 1], [3, 1]])
y = np.array([1, 1, -1, -1])
clf = SVC(kernel='linear', C=1000)   # large C -> (almost) hard margin
clf.fit(X, y)
# 2. Plot the points, hyperplane, margins and support vectors
fig, ax = plt.subplots()
ax.scatter(X[:, 0], X[:, 1], c=y, cmap='bwr')
xlim, ylim = ax.get_xlim(), ax.get_ylim()
xx = np.linspace(xlim[0], xlim[1])
w = clf.coef_[0]
b = clf.intercept_[0]
yy = -(w[0] * xx + b) / w[1]         # hyperplane: w·x + b = 0
ax.plot(xx, yy, 'k-')
# Margins: w·x + b = +1 and w·x + b = -1
ax.plot(xx, (1 - w[0] * xx - b) / w[1], 'k--')
ax.plot(xx, (-1 - w[0] * xx - b) / w[1], 'k--')
# Support vectors
ax.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1], s=150, facecolors='none', edgecolors='k')
ax.set_ylim(ylim)
plt.show()
A second dataset that is not linearly separable can be handled with the same workflow by switching to a non-linear kernel before calling clf.fit(X, y); a sketch follows.
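A minimal sketch of the kernel trick, using a synthetic two-ring dataset (make_circles is an assumed stand-in here, since the original dataset for this example is not shown):
import matplotlib.pyplot as plt
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Concentric rings: not linearly separable in 2-D, easy for an RBF kernel
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)
clf = SVC(kernel='rbf', gamma=2)
clf.fit(X, y)
print("training accuracy:", clf.score(X, y))

plt.scatter(X[:, 0], X[:, 1], c=y, cmap='bwr')
plt.show()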
• Core Idea: Instead of learning an explicit model, store instances (examples) and predict new
cases by comparing them to stored ones.
• How it works: store the training examples; when a new case arrives, find the k most similar stored examples (e.g., by Euclidean distance) and predict by majority vote (classification) or by averaging (regression).
• Concept: "lazy learning"; no explicit model is built during training, and generalization is deferred until a query arrives.
• Approach: typical algorithms are k-Nearest Neighbors (k-NN) and locally weighted regression (see the sketch below).
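A minimal k-NN sketch (the toy dataset below is made up purely for illustration):
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Made-up instances: [hours studied, previous score] -> pass (1) / fail (0)
X = np.array([[1, 40], [2, 50], [3, 45], [4, 70], [5, 80], [6, 75]])
y = np.array([0, 0, 0, 1, 1, 1])

# "Learning" is just storing the instances
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)

# Prediction compares the new case to the stored ones
print(knn.predict([[3.5, 60]]))      # majority vote of the 3 nearest neighbours
print(knn.kneighbors([[3.5, 60]]))   # distances and indices of those neighbours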
In essence:
Instance-Based Learning turns memory into intelligence, favoring experience over theory. It's AI's
version of "street smarts."
UNIT – 3
Without evaluation:
• We cannot trust that a “good” model will stay good in the real world.
With evaluation:
• We can measure how well a model generalizes to unseen data and compare models on evidence rather than intuition.
In essence:
Evaluation turns "I think it works" into "I know it works."
Goal:
Approximate how well a hypothesis will perform on unseen data — without needing to test on
the whole universe.
How:
• Draw a random sample of data the hypothesis has never seen (a test set).
• Measure how often the hypothesis is correct (or wrong) on that sample.
Formula:
\text{error}_S(h) = \frac{1}{n} \sum_{x \in S} \delta(f(x) \neq h(x))
where f(x) is the true label, h(x) is the hypothesis's prediction, and \delta(\cdot) is 1 when its argument is true and 0 otherwise (i.e., the fraction of the n sample examples that h gets wrong).
Reality Check:
The sample error is only an estimate of the true error; a random sample keeps the estimate unbiased, but it still varies from sample to sample.
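A minimal sketch of the idea (the dataset and model here are illustrative stand-ins):
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hold out a random sample the hypothesis never sees during training
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
h = make_pipeline(StandardScaler(), LogisticRegression()).fit(X_train, y_train)

# Sample error: fraction of held-out examples the hypothesis gets wrong
sample_error = (h.predict(X_test) != y_test).mean()
print(f"sample error ≈ {sample_error:.3f}")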
In essence:
You peek into the unknown world by holding up a tiny, smart mirror.
Key Principles:
• Random Sampling:
Every data point has an equal chance of being picked → unbiased estimate.
Critical Tools:
• Standard Error: How far the sample mean typically is from the true mean.
• Confidence Intervals: Statistical "safe zones" where the true answer probably lies.
Magic Formula:
\text{Error margin} \propto \frac{1}{\sqrt{\text{Sample Size}}}
Meaning:
Double the precision → quadruple the data.
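A quick numerical check of that relationship (the synthetic "population" below is an assumption made for the demo):
import numpy as np

rng = np.random.default_rng(0)

def standard_error(sample):
    # SE = s / sqrt(n): typical distance of the sample mean from the true mean
    return sample.std(ddof=1) / np.sqrt(len(sample))

population = rng.normal(loc=50, scale=10, size=1_000_000)   # hypothetical "universe"
for n in (100, 400, 1600):
    se = standard_error(rng.choice(population, size=n, replace=False))
    print(n, round(se, 2))   # SE roughly halves each time n is quadrupled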
In essence:
Sampling turns uncertainty into power, and small windows into big insights.
Here’s a sharp and practical guide for deriving confidence intervals:
• The standard error tells you how much your sample statistic deviates from the true
population parameter.
For the mean, it’s calculated as:
SE = \frac{s}{\sqrt{n}}
Where:
• s = sample standard deviation
• n = sample size
• The confidence level defines how confident you want to be that the true parameter lies
within your interval.
• For large samples, use the Z-score (from standard normal distribution).
• For 95% confidence, the Z-value ≈ 1.96, and Margin of Error = Z × SE.
• For proportions:
\left( \hat{p} - \text{Margin of Error},\ \hat{p} + \text{Margin of Error} \right)
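A minimal sketch of a 95% confidence interval for a classifier's accuracy (87 correct out of 100 is an invented example):
import numpy as np

# Accuracy measured on n held-out examples
n, correct = 100, 87
p_hat = correct / n
z = 1.96                                 # 95% confidence level
se = np.sqrt(p_hat * (1 - p_hat) / n)    # standard error of a proportion
margin = z * se
print(f"accuracy = {p_hat:.2f}, 95% CI = ({p_hat - margin:.3f}, {p_hat + margin:.3f})")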
Visualizing It
Think of the confidence interval as a range within which you’re confident the true population
parameter lies — the wider it is, the less precise your estimate.
In essence:
Confidence intervals are your statistical “safe zone” for the truth, giving you both precision and
flexibility.
Steps:
1. Error of Each Hypothesis:
o For both hypotheses (A and B), calculate their error rates (e.g., classification error, mean squared error).
2. Difference in Error:
Calculate the difference: d = error(A) − error(B).
3. Statistical Significance:
The error difference needs to be statistically significant to conclude that one hypothesis is
truly better:
▪ Use a statistical test (e.g., a paired t-test) to compare the error rates on the same data (paired test).
▪ The test assesses whether the error difference is larger than what could occur by random chance.
4. Confidence Intervals:
Build a confidence interval around the error difference; if it does not contain 0, the difference is significant.
5. Interpretation:
If the difference is significant, prefer the hypothesis with the lower error; otherwise treat the two as statistically indistinguishable.
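A minimal sketch of steps 1-3 with a paired test over the same cross-validation folds (the dataset and the two models are illustrative choices, not from the notes):
import numpy as np
from scipy import stats
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Evaluate hypotheses A and B on the SAME folds (paired comparison)
X, y = load_breast_cancer(return_X_y=True)
cv = KFold(n_splits=10, shuffle=True, random_state=0)
err_a = 1 - cross_val_score(make_pipeline(StandardScaler(), LogisticRegression()), X, y, cv=cv)
err_b = 1 - cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)

d = err_a - err_b                                  # per-fold difference in error
t_stat, p_value = stats.ttest_rel(err_a, err_b)    # paired t-test on the same folds
print(f"mean difference = {d.mean():.3f}, p-value = {p_value:.3f}")
# A small p-value means the gap is unlikely to be due to random chance alone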
Advanced Insights:
• Variance in Error:
Both hypotheses might have different variances in their errors — adjust for that.
• Bootstrap Resampling:
Use bootstrapping (random sampling with replacement) to estimate the distribution of error
differences, particularly for small sample sizes.
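A minimal bootstrap sketch for the error difference (the per-example losses below are simulated, purely to show the mechanics):
import numpy as np

rng = np.random.default_rng(0)
# Simulated per-example 0/1 losses of hypotheses A and B on the same 200 test cases
loss_a = (rng.random(200) < 0.18).astype(int)   # A wrong on ~18% of cases
loss_b = (rng.random(200) < 0.12).astype(int)   # B wrong on ~12% of cases
diff = loss_a - loss_b

# Resample the paired differences with replacement many times
boot = [rng.choice(diff, size=diff.size, replace=True).mean() for _ in range(10_000)]
low, high = np.percentile(boot, [2.5, 97.5])
print(f"95% bootstrap CI for the error difference: ({low:.3f}, {high:.3f})")
# If the interval excludes 0, the gap is unlikely to be a statistical illusion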
In essence:
The difference in error is about digging deeper, asking if the performance gap is real or just a
statistical illusion.
2. Choose the Right Metric:
o Choose a metric that matches the business goal (e.g., high precision for fraud detection, high recall for medical diagnoses).
3. Statistical Testing:
o Use tests (e.g., paired t-test, Wilcoxon test) to determine if the differences in
performance are statistically significant, not just due to random chance.
4. Consider Complexity:
o Model Complexity: Simpler models (e.g., linear regression) are often easier to
interpret but may underperform on complex data. Complex models (e.g., neural
networks) might overfit without enough data.
o Training Time: Evaluate not just performance, but also how fast the model trains
and predicts.
o Memory Usage: Some models (e.g., decision trees) use less memory than others
(e.g., deep learning models).
5. Robustness:
o Test how models perform under variations in data (e.g., noisy data, missing values). Stability is key for real-world applications.
6. Hyperparameter Tuning:
o Hyperparameter optimization: Ensure each model has been optimized to its best
potential. Algorithms like grid search or random search can help find the optimal
settings.
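A minimal grid-search sketch (the dataset, pipeline, and parameter grid are illustrative assumptions):
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Grid search: try every combination of candidate settings with cross-validation
X, y = load_breast_cancer(return_X_y=True)
pipe = make_pipeline(StandardScaler(), SVC())
params = {"svc__C": [0.1, 1, 10], "svc__kernel": ["linear", "rbf"]}
search = GridSearchCV(pipe, params, cv=5)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))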
Visualization:
• Learning Curves: Plot training and validation error as a function of training size to
understand overfitting or underfitting.
• Box Plots or Violin Plots: Compare the distribution of errors across models.
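A minimal learning-curve sketch (dataset and model chosen only for illustration):
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Training and validation error as a function of training-set size
X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression())
sizes, train_scores, val_scores = learning_curve(
    model, X, y, cv=5, train_sizes=np.linspace(0.1, 1.0, 8))
plt.plot(sizes, 1 - train_scores.mean(axis=1), label="training error")
plt.plot(sizes, 1 - val_scores.mean(axis=1), label="validation error")
plt.xlabel("training set size")
plt.ylabel("error")
plt.legend()
plt.show()
# A large, persistent gap between the curves suggests overfitting;
# two high, converging curves suggest underfitting.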
Practical Considerations:
• Interpretability: In some cases, you might need to prioritize explainability (e.g., decision
trees, linear models) over performance.
• Scalability: Consider how each algorithm scales with increasing data size.
• Business Impact: A more complex model might outperform others but at a cost (e.g.,
resource usage, slower predictions).
In essence:
The best algorithm is the one that balances performance with practical constraints — there’s no
"one-size-fits-all". The winner is determined by context, not theory.