ML-Notes

The document provides an overview of various machine learning algorithms, including Decision Trees, Support Vector Machines (SVM), K-Means Clustering, Hierarchical Clustering, and Dimensionality Reduction techniques. It explains the structure and functioning of Decision Trees, the principles behind SVM, and methods for clustering and reducing dimensionality in datasets. Key concepts such as overfitting, splitting criteria, and feature extraction methods like PCA are also discussed.


DECISION TREES

Overview

 A Decision Tree is a supervised learning algorithm used for classification and regression tasks.

 It has a hierarchical structure:

o Root node: starting point of the tree.

o Internal nodes: represent decisions based on feature tests.

o Branches: represent the outcomes of those tests.

o Leaf nodes: represent final class labels or predictions.

Working Mechanism

 The algorithm starts at the root.

 It tests attributes at each internal node.

 Based on test results, it follows a branch to the next node.

 Once a leaf node is reached, a prediction or class label is assigned.
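
To make the mechanism concrete, here is a minimal scikit-learn sketch; the dataset and parameter choices are illustrative, not part of the original notes:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load a small labeled dataset (features X, class labels y)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Internal nodes test features; leaf nodes assign class labels
model = DecisionTreeClassifier(criterion="entropy", random_state=42)
model.fit(X_train, y_train)

# Prediction walks each sample from the root down to a leaf
print(model.predict(X_test[:5]))
print("Accuracy:", model.score(X_test, y_test))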

Popular Decision Tree Algorithms

 ID3 (Iterative Dichotomiser 3):

o Developed by Ross Quinlan.

o Uses Entropy and Information Gain to evaluate splits.

 C4.5:

o Successor of ID3, also by Quinlan.

o Uses Gain Ratio or Information Gain.

 CART (Classification and Regression Trees):

o Developed by Leo Breiman.

o Uses Gini Impurity as the split criterion.

Splitting Criteria

Entropy and Information Gain


 Entropy: Measures impurity or randomness in the dataset.

o Entropy = 0 if all samples belong to one class.

o Maximum entropy (1 for binary classification) occurs when samples are equally divided among classes.

 Information Gain: Measures the reduction in entropy after a dataset is split on an attribute.

o The attribute with the highest information gain is chosen for the
split.
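
A small sketch of how entropy and information gain could be computed; the example split is made up for illustration:

import numpy as np

def entropy(labels):
    # H = -sum(p_i * log2(p_i)) over the classes present in `labels`
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, children):
    # Reduction in entropy: H(parent) minus the weighted entropy of the child subsets
    n = len(parent)
    weighted = sum(len(c) / n * entropy(c) for c in children)
    return entropy(parent) - weighted

# Hypothetical split of 10 samples by a binary attribute
parent = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])          # equally divided -> entropy 1.0
left, right = np.array([1, 1, 1, 1, 0]), np.array([1, 0, 0, 0, 0])
print(entropy(parent))                                      # 1.0
print(information_gain(parent, [left, right]))              # ~0.28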

Gini Impurity

 Measures the probability of misclassifying a randomly chosen instance.

 Lower Gini values are better.

 Similar to entropy but computationally simpler.
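
A matching sketch for Gini impurity; the label arrays are again made up:

import numpy as np

def gini(labels):
    # G = 1 - sum(p_i^2); 0 for a pure node, 0.5 is the maximum for two classes
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini(np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])))   # 0.5, perfectly mixed
print(gini(np.array([1, 1, 1, 1, 1])))                  # 0.0, pure node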

Overfitting and Instability

 Decision trees are prone to overfitting because they can grow very
deep and model noise in the data.

 Overfitting reduces generalization performance on unseen data.

 Instability: Small changes in training data can lead to very different tree structures.

 Pruning techniques help mitigate overfitting:

o Pre-pruning: Limit tree depth or minimum samples per node (e.g., max_depth, min_samples_split).

o Post-pruning: Grow the full tree, then cut back using complexity measures (e.g., ccp_alpha in Scikit-learn).
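
A minimal sketch of both pruning styles in scikit-learn; the parameter values are illustrative, not tuned:

from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Pre-pruning: stop growth early by limiting depth and node size
pre_pruned = DecisionTreeClassifier(max_depth=4, min_samples_split=20, random_state=0)
pre_pruned.fit(X, y)

# Post-pruning: grow fully, then cut back via cost-complexity pruning
post_pruned = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0)
post_pruned.fit(X, y)

print("Pre-pruned depth: ", pre_pruned.get_depth())
print("Post-pruned depth:", post_pruned.get_depth())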

Random Forests

Overview

 A powerful ensemble learning method for classification and regression.
 Composed of multiple decision trees.

 Built using a technique called bagging:

o Each tree trains on a random subset of the data (with replacement).

 The final prediction is made by majority vote (classification) or mean value (regression).

Key Features

 Reduces overfitting and instability of individual trees.

 Ensures diversity among trees using:

o Bootstrapping: Sampling training data with replacement for each tree.

o Feature randomness: Randomly selecting a subset of features for splitting at each node (controlled via max_features).

Important Properties

 Works best when trees are uncorrelated.

 Aggregating predictions from uncorrelated trees improves model accuracy and robustness.
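
A short scikit-learn sketch of a random forest with bagging and feature randomness; the hyperparameter values are arbitrary examples:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# bootstrap=True: each tree trains on a bootstrap sample of the data;
# max_features="sqrt": each split considers a random subset of features
forest = RandomForestClassifier(
    n_estimators=200, max_features="sqrt", bootstrap=True, random_state=0
)

# Final prediction aggregates the trees by majority vote
print("CV accuracy:", cross_val_score(forest, X, y, cv=5).mean())
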
SUPPORT VECTOR MACHINES (SVM)

Overview

 SVM is a supervised learning algorithm mainly used for classification tasks.

 It seeks to find the optimal hyperplane that separates data classes with the maximum margin.

 Originally designed for linear classification, but through the kernel trick, SVM can also handle non-linear problems.

Linear SVM

 For linearly separable data, SVM looks for the line (in 2D) or
hyperplane (in higher dimensions) that best separates the classes.

 Among many possible dividing lines, SVM chooses the one that
maximizes the margin, i.e., the distance between the hyperplane
and the closest data points from each class (called support
vectors).

 This large margin leads to better generalization on unseen data.

Key Terminology

 Hyperplane: A decision boundary that separates data into classes. In linear SVM, defined by wx + b = 0.

 Support Vectors: Data points closest to the hyperplane. Crucial for defining the margin.

 Margin: The distance between the hyperplane and the support vectors. Larger margins are better.

 Hard Margin: No misclassifications allowed. Used when data is perfectly separable.

 Soft Margin: Allows some misclassification to improve generalization, especially on noisy data.

 C: A regularization parameter that controls the trade-off between margin width and classification error. High C = less tolerance for misclassification.

 Hinge Loss: The loss function used to penalize misclassified points or margin violations.

Non-linear SVM and the Kernel Trick

 When data is not linearly separable, SVM uses kernel functions to map input data to a higher-dimensional space where it can be linearly separated.

 Common kernel functions:

o Linear Kernel: No transformation (for linearly separable data).

o Polynomial Kernel: Captures polynomial relationships.

o Gaussian (RBF) Kernel: Effective for complex, non-linear boundaries.

o Sigmoid Kernel: Similar to neural network activation functions.
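
A rough sketch of trying the different kernels and the C parameter in scikit-learn; the toy dataset and values are placeholders:

from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# A toy dataset that is not linearly separable
X, y = make_moons(n_samples=300, noise=0.2, random_state=0)

for kernel in ("linear", "poly", "rbf", "sigmoid"):
    # C controls the soft margin: higher C tolerates fewer misclassifications
    clf = SVC(kernel=kernel, C=1.0, gamma="scale")
    score = cross_val_score(clf, X, y, cv=5).mean()
    print(kernel, "kernel: CV accuracy =", round(score, 3))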

Robustness and Generalization

 SVM is robust to outliers due to its margin-maximization strategy.

 The soft margin formulation balances:

o Maximizing margin (to generalize well),

o Allowing some misclassification (to handle noise/outliers),

o Controlled via the C parameter.


UNSUPERVISED LEARNING – K-MEANS CLUSTERING
 k-means clustering takes the number of clusters k and a dataset of n objects as input.

 The algorithm outputs k clusters by minimizing within-cluster variances.

 Data is split into k clusters where:

o Objects in the same cluster have high similarity.

o Objects in different clusters have low similarity.

 Automatically classifies unlabeled data into groups based on feature similarity.

Choosing the Optimal Number of Clusters k

 Elbow Method:
Plot the inertia (sum of squared distances) and find the point where
additional clusters provide minimal gain (the "elbow").

 Silhouette Score Method:
Measures how similar an object is to its own cluster compared to other
clusters. Higher scores indicate better-defined clusters.

 Inertia:
Defined as the sum of squared distances between each point and its
assigned cluster's centroid.
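
A sketch of both heuristics with scikit-learn; the synthetic data and the range of k are arbitrary:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_: sum of squared distances to the nearest centroid (elbow method)
    # silhouette_score: cohesion vs. separation, higher is better
    print(k, round(km.inertia_, 1), round(silhouette_score(X, km.labels_), 3))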

k-Means Clustering Algorithm

1. Initialization:
Randomly select k centroids, where k is the number of desired
clusters. Each centroid represents the center of a cluster.

2. Expectation Step:
Assign each data point to the nearest centroid based on distance.

3. Maximization Step:
Recalculate the centroid as the mean of all points assigned to that
cluster.
4. Repeat:
Continue the expectation and maximization steps until centroids
stabilize and no further changes occur.
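
A from-scratch NumPy sketch of these four steps (simplified: random initialization rather than k-means++, and empty clusters are not handled):

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Initialization: pick k random data points as the starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # 2. Expectation: assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Maximization: recompute each centroid as the mean of its points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # 4. Repeat until the centroids stop moving
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

demo_rng = np.random.default_rng(1)
X = np.vstack([demo_rng.normal(size=(50, 2)), demo_rng.normal(size=(50, 2)) + 5])
labels, centroids = kmeans(X, k=2)
print(centroids)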

Unsupervised Learning – Hierarchical Clustering

 Divides the dataset across different levels to form a tree-like structure.

 Two main approaches:

o Bottom-Up (Agglomerative):
Starts with individual points and merges them into clusters.

o Top-Down (Divisive):
Starts with one large cluster and splits it into smaller clusters.

 The result is a tree diagram (dendrogram) where:

o The root represents the complete dataset.

o The leaves represent individual data points.
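
As an illustration, agglomerative clustering and its dendrogram can be built with SciPy; the linkage method and toy data are arbitrary choices:

import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(20, 2)), rng.normal(size=(20, 2)) + 4])

# Bottom-up (agglomerative): repeatedly merge the two closest clusters
Z = linkage(X, method="ward")

# Cut the tree into 2 flat clusters
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)

# dendrogram(Z) would draw the tree: root = whole dataset, leaves = single points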

Unsupervised Learning – Density-Based Clustering (DBSCAN)

 DBSCAN groups together data points that are closely packed in feature
space.

 Points in low-density regions are considered noise or outliers.

 It identifies clusters as dense regions that are separated by areas of lower density.
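
A minimal scikit-learn sketch; eps and min_samples are illustrative and in practice depend on the data:

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# eps: neighborhood radius; min_samples: points needed to form a dense region
db = DBSCAN(eps=0.2, min_samples=5).fit(X)

# The label -1 marks noise/outliers; other labels are cluster ids
n_clusters = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
print("Clusters found:", n_clusters)
print("Noise points:", (db.labels_ == -1).sum())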

Applications of Clustering

1. Preprocessing for Supervised Learning:
Clustering can be used to reduce dimensionality and simplify the data
before applying supervised algorithms.

2. Semi-Supervised Learning:
Useful when only a small subset of data is labeled. Clustering helps
detect patterns in the unlabeled portion.
3. Image Segmentation:
Pixels with similar characteristics (color, intensity, texture) are grouped
into clusters. Each cluster forms a segment or region in the image.

Dimensionality Reduction

Why Reduce Dimensionality?

Problems with Too Many Features:

 Curse of Dimensionality: As the number of features increases, data becomes sparse. Models struggle to generalize well.

 Overfitting: High-dimensional data can lead to the model learning noise instead of patterns.

 Computational Cost: More features mean slower training and prediction.

 Redundancy and Noise: Some features may be irrelevant or highly correlated.

Goal: Retain only the most informative features to simplify models and
improve their efficiency and accuracy.

What is Dimensionality Reduction?

Dimensionality reduction refers to the process of reducing the number of
input variables (features) in a dataset while preserving as much meaningful
information as possible.

This is done either by:

 Removing irrelevant/redundant features

 Combining or transforming features into a lower-dimensional space

Two Main Approaches:

1. Feature Selection – Selecting a subset of the original features.

2. Feature Extraction – Transforming or combining existing features to create new ones.

Feature Selection

Selects existing features without modifying them.

Techniques:
 Filter Methods: Use statistical tests (e.g., correlation, chi-square,
ANOVA F-test). Example: remove features with low variance.

 Wrapper Methods: Use a machine learning model to evaluate subsets of features. Example: Recursive Feature Elimination (RFE).

 Embedded Methods: Feature selection is integrated into the model itself. Example: Lasso Regression (L1 regularization reduces some coefficients to zero).
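
A sketch of one technique from each family in scikit-learn; thresholds and estimator choices are arbitrary:

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE, SelectFromModel, VarianceThreshold
from sklearn.linear_model import Lasso, LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# Filter: drop near-constant features based on a variance threshold
X_filtered = VarianceThreshold(threshold=0.1).fit_transform(X)

# Wrapper: Recursive Feature Elimination driven by a model's performance
rfe = RFE(LogisticRegression(max_iter=5000), n_features_to_select=10).fit(X, y)

# Embedded: L1 regularization (Lasso) shrinks some coefficients to exactly zero
embedded = SelectFromModel(Lasso(alpha=0.1)).fit(X, y)

print(X_filtered.shape[1], rfe.support_.sum(), embedded.get_support().sum())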

Feature Extraction

Transforms or combines original features into a new feature space.

Linear Methods:

 PCA (Principal Component Analysis): Projects data into a lower-dimensional space that captures the most variance.

 LDA (Linear Discriminant Analysis): Supervised technique that finds the axes maximizing class separability.

Non-Linear Methods:

 t-SNE (t-distributed Stochastic Neighbor Embedding): Good for visualizing high-dimensional data in 2D/3D; preserves local structure.

 UMAP (Uniform Manifold Approximation and Projection): Similar to t-SNE but faster and better at preserving global structure.

 Kernel PCA: Extension of PCA using kernel methods to capture non-linear patterns.
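
For illustration, a few of these transforms in scikit-learn; the component counts and kernel choice are placeholders:

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA, KernelPCA
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)          # 64-dimensional features

X_pca = PCA(n_components=10).fit_transform(X)                        # linear, variance-maximizing
X_kpca = KernelPCA(n_components=10, kernel="rbf").fit_transform(X)   # non-linear extension of PCA
X_tsne = TSNE(n_components=2, random_state=0).fit_transform(X)       # 2D embedding for visualization

print(X_pca.shape, X_kpca.shape, X_tsne.shape)
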
Principal Component Analysis (PCA)

PCA is a linear feature extraction technique introduced by Karl Pearson
(1901). It reduces dimensionality while retaining as much variability as
possible in the data.

Key Points:

 Transforms high-dimensional data into a lower-dimensional space.

 Maximizes variance in the new axes (principal components).

 Principal components are linear combinations of the original features.

 First principal component captures the most variance; the second captures the most of the remaining variance, and so on.

 Components are uncorrelated and not directly interpretable.

Steps to Perform PCA

1. Standardize the Data

o Normalize the range of the continuous variables so each has a mean of 0 and standard deviation of 1.

2. Compute the Covariance Matrix

o Shows how variables vary with each other.

o Covariance matrix is symmetric and has dimensions p x p, where p is the number of features.

3. Compute Eigenvectors and Eigenvalues

o Eigenvectors determine the direction of the new feature space (principal components).

o Eigenvalues measure the variance carried in each eigenvector.

o Rank eigenvalues in descending order to prioritize components.

4. Create the Feature Vector

o Choose top k eigenvectors (with the highest eigenvalues).

o Combine these into a matrix called the feature vector.

5. Recast the Data

o Multiply the original standardized data by the feature vector to transform it into the new feature space (principal components).

This results in a dataset with reduced dimensions while preserving as much variance as possible.
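
The five steps above, sketched directly in NumPy; the toy data and the number of components k are arbitrary:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                 # toy data: 100 samples, 5 features
k = 2                                         # number of components to keep

# 1. Standardize: zero mean and unit standard deviation per feature
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Covariance matrix (p x p)
cov = np.cov(X_std, rowvar=False)

# 3. Eigenvectors/eigenvalues, ranked by descending eigenvalue
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 4. Feature vector: the top-k eigenvectors as columns
W = eigvecs[:, :k]

# 5. Recast the data into the new k-dimensional space
X_pca = X_std @ W
print(X_pca.shape)                            # (100, 2)
print(eigvals / eigvals.sum())                # proportion of variance per component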
