ML Module 2,3,4

Module 2: Linear Algebra for Machine Learning

- System of Linear Equations:
  • A system of linear equations consists of multiple equations in multiple unknown variables.
  • It can be represented in matrix form as Ax = b and solved using techniques such as Gaussian elimination, matrix inversion, and matrix factorization (see the sketch below).
  • Linear algebra provides a powerful framework for solving systems of equations and understanding the properties of linear transformations.
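
A minimal NumPy sketch of solving a small system Ax = b (the coefficient values are invented for illustration; np.linalg.solve uses an LU, i.e. Gaussian-elimination style, factorization internally):

```python
import numpy as np

# 2x + y = 5
# x - 3y = -1
A = np.array([[2.0, 1.0],
              [1.0, -3.0]])
b = np.array([5.0, -1.0])

x = np.linalg.solve(A, b)  # solves Ax = b via LU factorization
print(x)                   # the solution vector [x, y]
```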

- Norms:
Norms are mathematical measures used to quantify the size or length of a vector in a vector space. In machine learning and linear algebra, two commonly used norms are the L1 norm and the L2 norm:

1. L1 Norm (Manhattan Norm or Taxicab Norm):
  • The L1 norm of a vector x is defined as the sum of the absolute values of its components:
    ||x||_1 = |x_1| + |x_2| + ... + |x_n|
  • Geometrically, the L1 norm represents the distance between the origin and the point defined by the vector when travelling only along horizontal and vertical paths (like a taxi cab navigating city blocks).
  • Properties of the L1 norm:
    - It is less sensitive to outliers than the L2 norm.
    - It tends to produce sparse solutions in optimization problems, which leads to feature selection.
    - It is commonly used in Lasso regularization for linear regression and in feature selection algorithms.
2. L2 Norm (Euclidean Norm):
  • The L2 norm of a vector x is defined as the square root of the sum of the squares of its components:
    ||x||_2 = sqrt(x_1^2 + x_2^2 + ... + x_n^2)
  • Geometrically, the L2 norm represents the Euclidean distance between the origin and the point defined by the vector in a multidimensional space.
  • Properties of the L2 norm:
    - It is sensitive to outliers because it squares the values of the vector components.
    - It tends to produce dense solutions in optimization problems, which may not be desirable in some scenarios.
    - It is commonly used in Ridge regularization for linear regression and as a distance metric in algorithms such as k-nearest neighbors (KNN).
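
A short NumPy sketch (example vector invented here) showing both norms:

```python
import numpy as np

x = np.array([3.0, -4.0, 1.0])

l1 = np.sum(np.abs(x))        # L1 norm: |3| + |-4| + |1| = 8
l2 = np.sqrt(np.sum(x ** 2))  # L2 norm: sqrt(9 + 16 + 1) ≈ 5.099

# The same values via NumPy's built-in norm function
assert np.isclose(l1, np.linalg.norm(x, ord=1))
assert np.isclose(l2, np.linalg.norm(x, ord=2))
```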

- Inner Product:
  • The inner product, also known as the dot product, is a binary operation that takes two vectors and returns a scalar quantity.
  • It measures the similarity or projection of one vector onto another and plays a fundamental role in defining distances, angles, and orthogonality in vector spaces.
  • The inner product is used in various mathematical and computational applications, including vector spaces, geometry, signal processing, and machine learning.
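
As a quick illustration (the example vectors are invented), the inner product and the cosine of the angle it induces can be computed as:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, -5.0, 6.0])

dot = np.dot(a, b)  # 1*4 + 2*(-5) + 3*6 = 12
cos_angle = dot / (np.linalg.norm(a) * np.linalg.norm(b))  # cosine similarity
print(dot, cos_angle)
```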

- Diagonalization:
  • Diagonalization is the process of transforming a square matrix A into a diagonal matrix by finding a basis of eigenvectors and writing A = P D P^{-1}, where the columns of P are the eigenvectors and D is a diagonal matrix of the corresponding eigenvalues.
  • Diagonalization simplifies matrix computations, facilitates eigenvalue analysis, and provides insight into the matrix's properties and behavior.
  • It is used in various mathematical and computational applications, including solving systems of linear equations, computing matrix powers, and solving differential equations.
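
A small NumPy sketch (example matrix chosen arbitrarily) of diagonalization and its use for computing matrix powers:

```python
import numpy as np

A = np.array([[4.0, 1.0],
              [1.0, 3.0]])

eigvals, P = np.linalg.eig(A)  # columns of P are eigenvectors of A
D = np.diag(eigvals)           # diagonal matrix of eigenvalues

# A can be reconstructed from its eigendecomposition: A = P D P^{-1}
assert np.allclose(A, P @ D @ np.linalg.inv(P))

# Diagonalization makes matrix powers cheap: A^5 = P D^5 P^{-1}
A_pow5 = P @ np.diag(eigvals ** 5) @ np.linalg.inv(P)
assert np.allclose(A_pow5, np.linalg.matrix_power(A, 5))
```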

- SVD and its Applications:
  • Singular Value Decomposition (SVD) is a matrix factorization technique that decomposes a matrix A into three matrices: A = U Σ V^T.
  • U is an orthogonal matrix containing the left singular vectors.
  • Σ is a diagonal matrix containing the singular values.
  • V^T is the transpose of an orthogonal matrix containing the right singular vectors.
  • SVD is a powerful tool for dimensionality reduction, data compression, and matrix approximation.
Its applications include:
  • Dimensionality Reduction: Retaining only the significant singular values and vectors reduces data dimensions while preserving essential information.
  • Matrix Approximation: Truncating Σ yields a lower-rank approximation of A, useful for denoising and matrix completion tasks.
  • Collaborative Filtering: Factorizing user-item rating matrices helps make personalized recommendations in recommender systems.
  • Principal Component Analysis (PCA): SVD identifies principal components, aiding dimensionality reduction while retaining variance in the data.
  • Image Processing: SVD enables image denoising, compression, and restoration by decomposing images into singular components.
  • Latent Semantic Analysis (LSA): Analyzing term-document matrices uncovers latent semantic structures in text for tasks like topic modeling and information retrieval.
  • Low-Rank Matrix Completion: SVD helps recover missing entries by approximating a matrix with a low-rank representation, useful in recommender systems and collaborative filtering.
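
A brief NumPy sketch (a random matrix is used purely for illustration) of SVD and a truncated rank-k approximation:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((6, 4))

U, s, Vt = np.linalg.svd(A, full_matrices=False)  # A = U @ diag(s) @ Vt
assert np.allclose(A, U @ np.diag(s) @ Vt)

k = 2  # keep only the 2 largest singular values
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]  # rank-2 approximation of A
print(np.linalg.norm(A - A_k))               # approximation error
```
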
Module 3: Regression and Support Vector Machine (SVM)

- Least-Squares Regression for classification:


Least-Squares Regression for classification, also known as Least-Squares
Classification (LSC), is a simple method used for binary classification tasks.
Here's a straightforward explanation:
1. Objective:
- Least-Squares Regression for classification aims to find a linear decision
boundary that separates the classes in the feature space.
- It seeks to minimize the squared error between the predicted class labels and
the actual class labels.
2. Model Representation:
- Given a dataset with input features and binary class labels (0 or 1), LSC fits
a linear regression model to the data.
- The model predicts the class labels using a linear equation of the form:
    y = w_0 + w_1*x_1 + w_2*x_2 + ... + w_n*x_n
3. Decision Boundary:
- The decision boundary is determined by applying a threshold (e.g., 0.5) to the model's predicted values.
- If the predicted value is above the threshold, the instance is classified as class 1; otherwise, it is classified as class 0.
4. Loss Function:
- LSC minimizes the squared error loss between the predicted class labels and
the actual class labels.
- The loss function penalizes misclassifications by squaring the difference
between the predicted and actual class labels.
5. Applications:
- Least-Squares Regression for classification is a simple and interpretable
method commonly used in situations where linear decision boundaries are
appropriate.
- It can be applied to various binary classification tasks, such as spam
detection, medical diagnosis, and sentiment analysis.
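
A minimal NumPy sketch of least-squares classification on a tiny invented dataset, fitting the linear model with np.linalg.lstsq and thresholding at 0.5:

```python
import numpy as np

X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])  # input features
y = np.array([0, 0, 1, 1])                                      # binary labels

Xb = np.hstack([np.ones((X.shape[0], 1)), X])  # add an intercept column
w, *_ = np.linalg.lstsq(Xb, y, rcond=None)     # least-squares fit of the weights

scores = Xb @ w                            # real-valued predictions
predictions = (scores >= 0.5).astype(int)  # threshold at 0.5
print(predictions)                         # [0 0 1 1]
```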

- Multivariate linear regression:


Multivariate linear regression is an extension of simple linear regression to
handle multiple independent variables.
In multivariate linear regression, the goal is to model the relationship between
multiple predictors (independent variables) and a single target variable
(dependent variable) by fitting a linear equation to the observed data.
1. Model Representation:
- The model takes the form y = β_0 + β_1*x_1 + β_2*x_2 + ... + β_p*x_p + ε, where y is the target variable, x_1, ..., x_p are the predictors, β_0, ..., β_p are the regression coefficients, and ε is the error term.
2. Applications:
- Multivariate linear regression is widely used in various fields such as:
  • Economics: Analyzing the impact of multiple factors on economic outcomes like GDP or employment rates.
  • Finance: Predicting stock prices based on multiple financial indicators such as interest rates, market indices, and company performance metrics.
  • Social Sciences: Investigating the relationships between demographic factors, social behaviors, and health outcomes.
  • Marketing: Predicting sales or market share based on advertising expenditure, pricing strategies, and consumer demographics.
  • Environmental Science: Modeling the relationships between environmental variables (temperature, humidity, pollution levels) and ecological outcomes (species abundance, biodiversity).
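
A short scikit-learn sketch of multivariate linear regression; the feature meanings and numbers below are invented purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Each row: [advertising spend, price, store count]; target: sales
X = np.array([[10.0, 5.0, 3.0],
              [20.0, 4.5, 4.0],
              [30.0, 4.0, 6.0],
              [40.0, 3.5, 8.0]])
y = np.array([25.0, 40.0, 58.0, 75.0])

model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_)      # beta_0 and (beta_1, ..., beta_p)
print(model.predict([[25.0, 4.2, 5.0]]))  # prediction for a new observation
```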

- Regularized regression:
Regularized regression is an extension of linear regression that introduces
penalty terms to the model's cost function, aiming to prevent overfitting and
improve predictive performance.

Model Representation:
  • The model equation resembles linear regression:
    y = β_0 + β_1*x_1 + ... + β_p*x_p + ε
  • Additional penalty terms are included in the cost function to regulate the size of the coefficients.

Types of Regularization:
The two primary types of regularization commonly used in regularized
regression are:

1. L1 Regularization (Lasso):
  • L1 regularization adds the sum of the absolute values of the coefficients to the cost function:
    Cost = Σ (y_i - ŷ_i)^2 + λ Σ |β_j|
  • L1 regularization encourages sparsity in the coefficient estimates, as it tends to shrink the coefficients of less relevant features to exactly zero.
  • L1 regularization facilitates feature selection by effectively removing irrelevant or redundant features from the model.

2. L2 Regularization (Ridge):
  • L2 regularization adds the sum of the squared values of the coefficients to the cost function:
    Cost = Σ (y_i - ŷ_i)^2 + λ Σ β_j^2
  • L2 regularization penalizes large coefficients, shrinking them towards zero, but typically not all the way to zero.
  • L2 regularization is effective in reducing the impact of multicollinearity among predictor variables by stabilizing the coefficient estimates.
Applications:
Regularized regression finds application across various domains:
  • Finance: Predicting stock prices, risk assessment.
  • Healthcare: Patient outcome prediction, disease diagnosis.
  • Marketing: Customer churn prediction, sales forecasting.
  • Environmental Science: Modeling environmental impacts on ecosystems.
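
A compact scikit-learn sketch contrasting Lasso (L1) and Ridge (L2) on synthetic data; the alpha values are arbitrary and would normally be tuned (e.g., by cross-validation):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
# Only the first two features matter; the other three are noise.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=50)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print(lasso.coef_)  # L1: irrelevant coefficients driven to exactly zero (sparse)
print(ridge.coef_)  # L2: coefficients shrunk toward zero but rarely exactly zero
```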

Difference between Lasso and Ridge Regression:
  • Penalty term: Lasso adds the sum of absolute coefficient values (L1); Ridge adds the sum of squared coefficient values (L2).
  • Effect on coefficients: Lasso can shrink coefficients to exactly zero; Ridge shrinks them towards zero but rarely to exactly zero.
  • Feature selection: Lasso performs implicit feature selection; Ridge retains all features.
  • Typical use: Lasso suits sparse problems with few relevant features; Ridge suits correlated predictors (multicollinearity).

- Support Vector Machine (SVM):


Support Vector Machine (SVM) is a supervised machine learning algorithm that
finds the optimal hyperplane in an n-dimensional space to classify data into
different classes. It works by identifying the best separation boundary between
classes, maximizing the margin between the classes.

• Model Representation:
o Given a training dataset with input features and corresponding
class labels, SVM finds the hyperplane that separates the classes
with the largest margin.
o The hyperplane is defined by a set of support vectors, which are the
data points closest to the decision boundary.

• Key Concepts:
o Margin: The distance between the hyperplane and the nearest data
point from each class. SVM aims to maximize this margin, leading
to better generalization.
o Kernel Trick: SVM can handle non-linearly separable data by
mapping input features into a higher-dimensional space using
kernel functions (e.g., polynomial, radial basis function) to find a
linear separation boundary.
o Regularization Parameter (C): Controls the trade-off between
maximizing the margin and minimizing the classification error on
the training data. Higher values of C allow for fewer margin
violations but may lead to overfitting.
o Kernel Parameters: Parameters specific to the chosen kernel
function, such as the degree for polynomial kernels and the gamma
parameter for radial basis function (RBF) kernels.

• Types of SVM:
SVMs can be categorized by the type of decision boundary they form. The main types are:

1. Linear SVM:
a. Linear SVMs classify data by finding the optimal hyperplane that
linearly separates the classes in the feature space.
b. The decision boundary is a straight line (in 2D), or a hyperplane (in
higher dimensions) that maximizes the margin between the classes.
c. Linear SVMs are suitable for linearly separable datasets where
classes can be separated by a straight line or plane.

2. Non-linear SVM:
a. Non-linear SVMs are used for datasets that are not linearly
separable in the original feature space.
b. They employ kernel functions to map the input features into a
higher-dimensional space where the classes become separable by a
hyperplane.
c. Common kernel functions include polynomial kernel, radial basis
function (RBF) kernel, sigmoid kernel, and custom kernels tailored
to specific data characteristics.
d. Non-linear SVMs are capable of capturing complex decision
boundaries and can handle more intricate patterns in the data.

• Applications:
SVM is widely used in various fields, including:
o Text classification (e.g., spam detection, sentiment analysis).
o Image recognition (e.g., object detection, facial recognition).
o Bioinformatics (e.g., gene expression classification, protein
structure prediction).
o Finance (e.g., credit scoring, stock market prediction).
o Medical diagnosis (e.g., disease classification, cancer detection).
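
A brief scikit-learn sketch of a linear SVM and an RBF-kernel SVM on the Iris dataset (dataset and hyperparameters chosen only for illustration):

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

linear_svm = SVC(kernel="linear", C=1.0).fit(X_train, y_train)
rbf_svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_train, y_train)

print(linear_svm.score(X_test, y_test))  # accuracy of the linear decision boundary
print(rbf_svm.score(X_test, y_test))     # accuracy using the kernel trick (RBF)
```

Increasing C penalizes margin violations more heavily, which fits the training data more closely at the risk of overfitting.
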
Module 4: Hebbian Learning and Expectation Maximization

- Hebbian learning rule:


The Hebbian learning rule is a concept in neuroscience and neural network
theory that describes a mechanism for synaptic plasticity, which is the ability of
synapses to strengthen or weaken over time based on their activity. Here's a
simplified explanation:
• Definition:
- The Hebbian learning rule states that "neurons that fire together,
wire together." It suggests that if two neurons are repeatedly
activated at the same time, the strength of the connection
(synaptic weight) between them should increase.
- Proposed by Donald Hebb in 1949, the rule provides a
foundational concept for understanding how learning and
memory formation occur in biological neural networks.
• Mechanism:
- When a presynaptic neuron repeatedly fires and causes the
postsynaptic neuron to fire, the connection between them
strengthens.
- If the presynaptic neuron consistently precedes the firing of the
postsynaptic neuron, the synaptic connection strengthens further.
- Conversely, if the presynaptic neuron consistently fails to cause
the postsynaptic neuron to fire, the connection weakens.
• Key Points:
- The Hebbian learning rule is based on the idea of correlation
between neuronal activity.
- It provides a mechanism for associative learning, where the co-activation of neurons leads to the formation of associations or memories.
- While the Hebbian learning rule offers a simple explanation for
synaptic plasticity, it doesn't account for all aspects of learning
and memory, and more complex rules have been proposed in
neuroscience and artificial neural network models.
• Applications:
- The Hebbian learning rule has inspired various computational
models of learning and memory in artificial neural networks.
- It forms the basis for unsupervised learning algorithms such as Hebbian-style weight updates and competitive learning, where networks self-organize based on input patterns (a minimal sketch follows).
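
A minimal sketch of the plain Hebbian update Δw = η·x·y for a single linear neuron; the input patterns, learning rate, and initialization are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
inputs = rng.normal(size=(100, 3))  # 100 presentations of a 3-dimensional input
w = rng.normal(scale=0.1, size=3)   # small random initial synaptic weights
eta = 0.01                          # learning rate

for x in inputs:
    y = np.dot(w, x)  # postsynaptic activity
    w += eta * y * x  # "fire together, wire together" update

# Note: the plain rule lets weights grow without bound; practical variants
# (e.g., Oja's rule) add a normalization term.
print(w)
```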

- Expectation maximization algorithm for clustering:


The Expectation Maximization (EM) algorithm for clustering is a powerful
method used to fit mixture models, particularly in unsupervised learning tasks
such as clustering. Here's a straightforward explanation:

1. Objective:
a. The EM algorithm for clustering aims to find the parameters of a
mixture model that best describe the underlying data distribution.
b. It iteratively estimates the parameters of the mixture model by
maximizing the likelihood of the observed data.

2. Model Representation:
a. The mixture model represents the data as a combination of multiple
probability distributions (e.g., Gaussian distributions) with
different parameters.
b. Each component of the mixture model represents a cluster in the
data.

3. Algorithm Steps:
a. Expectation (E) Step: In the E-step, the algorithm estimates the
probabilities of data points belonging to each cluster (i.e.,
computes the posterior probabilities or responsibilities).
b. Maximization (M) Step: In the M-step, the algorithm updates the
parameters of the mixture model (e.g., means and covariances of
Gaussian distributions) based on the estimated cluster assignments
obtained from the E-step.
4. Iterative Process:
a. The EM algorithm iterates between the E-step and M-step until
convergence, where the likelihood of the observed data stops
improving or reaches a predefined threshold.
b. Each iteration of the algorithm typically improves the fit of the
mixture model to the data, leading to better cluster assignments and
parameter estimates.

5. Initialization:
a. The performance of the EM algorithm can be sensitive to the initial
parameter values.
b. Common initialization strategies include random initialization, k-means clustering, or hierarchical clustering.

6. Applications:
a. The EM algorithm for clustering is widely used in various
domains, including image segmentation, document clustering, and
gene expression analysis.
b. It is particularly useful when the data contains hidden or latent
variables and when the underlying data distribution is complex and
cannot be easily modeled by a single probability distribution.
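
An illustrative scikit-learn sketch: GaussianMixture runs the E- and M-steps internally when fitting a Gaussian mixture to data (the two clusters below are synthetic):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two synthetic clusters drawn from different Gaussians
data = np.vstack([rng.normal(loc=0.0, scale=1.0, size=(100, 2)),
                  rng.normal(loc=5.0, scale=1.5, size=(100, 2))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(data)

labels = gmm.predict(data)      # hard cluster assignments
resp = gmm.predict_proba(data)  # E-step responsibilities for each point
print(gmm.means_)               # M-step estimates of the component means
```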
