ML-Notes
Overview
Working Mechanism
C4.5:
Splitting Criteria
o The attribute with the highest gain ratio (information gain
normalized by the split information) is chosen for the split.
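For reference, with H(S) the entropy of set S and S_v the subset where
attribute A takes value v, the two quantities are:

IG(S, A) = H(S) - \sum_{v} \frac{|S_v|}{|S|}\, H(S_v)

\mathrm{GainRatio}(S, A) = \frac{IG(S, A)}{-\sum_{v} \frac{|S_v|}{|S|} \log_2 \frac{|S_v|}{|S|}}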
Gini Impurity
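For a node with class proportions p_1, ..., p_C, the standard definition is

\mathrm{Gini} = 1 - \sum_{i=1}^{C} p_i^2

which is 0 for a pure node and largest when the classes are evenly mixed;
CART-style trees pick the split that most reduces this impurity.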
Decision trees are prone to overfitting because they can grow very
deep and model noise in the data.
Random Forests
Overview
Key Features
Important Properties
Overview
Linear SVM
For linearly separable data, SVM looks for the line (in 2D) or
hyperplane (in higher dimensions) that best separates the classes.
Among many possible dividing lines, SVM chooses the one that
maximizes the margin, i.e., the distance between the hyperplane
and the closest data points from each class (called support
vectors).
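As a minimal sketch with scikit-learn (a toy 2-D dataset is assumed for
illustration; the fitted model exposes the support vectors and hyperplane
parameters directly):

import numpy as np
from sklearn.svm import SVC

# Two small, linearly separable blobs in 2-D
X = np.array([[1, 2], [2, 3], [3, 3], [6, 5], [7, 8], [8, 8]])
y = np.array([0, 0, 0, 1, 1, 1])

# Linear kernel: fit the maximum-margin separating hyperplane
clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

print(clf.support_vectors_)       # the points that define the margin
print(clf.coef_, clf.intercept_)  # hyperplane parameters w and b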
Key Terminology
Term              Description
Support Vectors   Data points closest to the hyperplane; crucial for
                  defining the margin.
Elbow Method:
Plot the inertia (sum of squared distances) against the number of clusters
k and find the point where additional clusters provide minimal gain (the
"elbow").
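A quick sketch of the procedure with scikit-learn (random toy data is
assumed; in practice you would plot the inertias against k and look for
the bend):

import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(200, 2)  # toy data

# Fit k-means for a range of k and record each fit's inertia
inertias = []
for k in range(1, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)

# The "elbow" is the k beyond which inertia drops only marginally
for k, inertia in zip(range(1, 10), inertias):
    print(k, round(inertia, 2))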
Inertia:
Defined as the sum of squared distances between each point and its
assigned cluster's centroid.
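In symbols, with x_i the data points and \mu_{c(i)} the centroid of the
cluster that point i is assigned to:

\text{Inertia} = \sum_{i=1}^{n} \lVert x_i - \mu_{c(i)} \rVert^2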
1. Initialization:
Randomly select k centroids, where k is the number of desired
clusters. Each centroid represents the center of a cluster.
2. Expectation Step:
Assign each data point to the nearest centroid based on distance.
3. Maximization Step:
Recalculate the centroid as the mean of all points assigned to that
cluster.
4. Repeat:
Continue the expectation and maximization steps until the centroids
stabilize and assignments no longer change (see the sketch below).
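A minimal NumPy sketch of these four steps (illustrative only, assuming
numeric data; it skips refinements such as k-means++ initialization):

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Initialization: pick k distinct data points as starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # 2. Expectation: assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Maximization: recompute each centroid as the mean of its points
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # 4. Repeat until the centroids stop moving
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

labels, centroids = kmeans(np.random.rand(100, 2), k=3)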
o Bottom-Up (Agglomerative):
Starts with individual points and merges them into clusters.
o Top-Down (Divisive):
Starts with one large cluster and splits it into smaller clusters.
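A brief sketch of the bottom-up variant using scikit-learn's
AgglomerativeClustering (toy points assumed; Ward linkage merges the pair
of clusters that least increases within-cluster variance):

import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.array([[1, 1], [1, 2], [8, 8], [9, 8], [0, 1], [9, 9]])

# Bottom-up: every point starts as its own cluster; closest clusters merge
agg = AgglomerativeClustering(n_clusters=2, linkage="ward")
labels = agg.fit_predict(X)
print(labels)  # two groups, e.g. [0 0 1 1 0 1]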
DBSCAN groups together data points that are closely packed in feature
space and labels points that lie alone in low-density regions as noise
(outliers).
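A minimal sketch with scikit-learn (toy points assumed; eps is the
neighborhood radius and min_samples the number of points required to form
a dense region):

import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1.0, 1.0], [1.2, 0.9], [0.9, 1.1],
              [8.0, 8.0], [8.1, 7.9], [50.0, 50.0]])

db = DBSCAN(eps=0.5, min_samples=2).fit(X)
print(db.labels_)  # noise points get the label -1, e.g. [0 0 0 1 1 -1]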
Applications of Clustering
2. Semi-Supervised Learning:
Useful when only a small subset of data is labeled. Clustering helps
detect patterns in the unlabeled portion.
3. Image Segmentation:
Pixels with similar characteristics (color, intensity, texture) are grouped
into clusters. Each cluster forms a segment or region in the image.
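As a sketch of the idea with k-means (a random array stands in for a real
image here; each pixel's RGB value is its feature vector):

import numpy as np
from sklearn.cluster import KMeans

image = np.random.rand(40, 40, 3)  # toy H x W x 3 "image" of RGB values

# Flatten pixels into an (H*W, 3) feature matrix and cluster by color
pixels = image.reshape(-1, 3)
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(pixels)

# Recolor every pixel with its cluster centroid -> one flat color per segment
segmented = km.cluster_centers_[km.labels_].reshape(image.shape)
print(segmented.shape)  # (40, 40, 3)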
Dimensionality Reduction
Goal: Retain only the most informative features to simplify models and
improve their efficiency and accuracy.
Feature Selection
Techniques:
Filter Methods: Use statistical tests (e.g., correlation, chi-square,
ANOVA F-test). Example: remove features with low variance.
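For the low-variance example, scikit-learn's VarianceThreshold is the
usual tool (a toy matrix with a constant first column is assumed):

import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([[0.0, 2.0, 1.0],
              [0.0, 1.9, 3.0],
              [0.0, 2.1, 5.0]])  # first column is constant

# Default threshold 0.0 drops features whose variance does not exceed it
selector = VarianceThreshold()
X_reduced = selector.fit_transform(X)
print(X_reduced.shape)  # (3, 2): the zero-variance column is removed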
Feature Extraction
Linear Methods:
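As an assumed example (the notes name no methods here), PCA is the
standard linear technique: it projects the data onto the directions of
maximal variance. A minimal sketch with random data:

import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 10)  # 100 samples, 10 original features

# Keep the 2 directions of maximal variance
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
print(X_2d.shape)                     # (100, 2)
print(pca.explained_variance_ratio_)  # share of variance per component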
Non-Linear Methods:
Key Points: