MACHINE LEARNING Notes
UNIT 1:
Supervised Learning Basics
Definition:
o Supervised learning is a type of machine learning where an algorithm learns from labeled
data. This means the training data includes both the input features and the correct output
labels.
o The goal is for the algorithm to learn a mapping function that can predict the output for new,
unseen input data.
Key Characteristics:
o Labeled training data.
o Prediction of outputs based on inputs.
o Two main types: regression and classification.
Regression
Definition:
o Regression is used when the output variable is continuous. This means the output can take on
any value within a range.
o The goal is to predict a numerical value.
Examples:
o Predicting house prices based on size, location, and other features.
o Forecasting stock prices.
o Estimating temperature.
Common Algorithms:
o Linear Regression
o Polynomial Regression
o Support Vector Regression (SVR)
o Decision Tree Regression
o Random Forest Regression
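As a quick illustration of regression, here is a minimal sketch assuming scikit-learn is installed; the house sizes and prices below are made-up values for demonstration only.
```python
# Minimal regression sketch (assumes scikit-learn); data values are illustrative.
import numpy as np
from sklearn.linear_model import LinearRegression

# Input feature: house size in square metres; target: price in thousands (made up).
X = np.array([[50], [80], [120], [160], [200]])
y = np.array([150, 220, 310, 400, 480])

model = LinearRegression()
model.fit(X, y)                      # learn a mapping from size to price
print(model.predict([[100]]))        # predicts a continuous numerical value
```
The same fit/predict pattern applies to the other regression algorithms listed above (for example, decision tree or random forest regressors).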
Classification
Definition:
o Classification is used when the output variable is categorical. This means the output belongs
to a specific class or category.
o The goal is to predict which category an input belongs to.
Examples:
o Email spam detection (spam or not spam).
o Image recognition (identifying objects in images).
o Medical diagnosis (disease or no disease).
Common Algorithms:
o Logistic Regression
o Decision Trees
o Random Forests
o Support Vector Machines (SVMs)
o K-Nearest Neighbors (KNN)
o Naive Bayes
Evaluation Metrics:
o Accuracy
o Precision
o Recall
o F1-score
o Confusion Matrix
o ROC curves
Regression vs. Classification
Output:
o Regression: Continuous numerical values.
o Classification: Discrete categories.
Goal:
o Regression: Predict a quantity.
o Classification: Predict a class.
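To tie the classification workflow and the evaluation metrics together, here is a hedged sketch assuming scikit-learn; the built-in breast-cancer dataset is used purely for illustration.
```python
# Classification sketch (assumes scikit-learn): train a model, then evaluate it
# with the metrics listed above.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=5000)      # higher max_iter to ensure convergence
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("F1-score :", f1_score(y_test, y_pred))
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))
```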
Kernel Methods in SVMs
In many real-world scenarios, data points from different classes are not linearly separable. This
means you can't draw a straight line (or hyperplane) to perfectly separate them.
SVMs, in their basic form, are linear classifiers. So, they need a way to handle these non-linear
relationships.
The "kernel trick" is a powerful technique that allows SVMs to implicitly map data into a higher-
dimensional space.
In this higher-dimensional space, it's often possible to find a linear hyperplane that separates the
data.
The beauty of the kernel trick is that it performs this mapping without explicitly calculating the
coordinates of the data points in the higher-dimensional space. This saves a lot of computational
effort.
Instead of computing the transformed coordinates, kernel functions compute the inner products
between pairs of data points in the higher-dimensional space.
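The inner-product idea can be checked directly. The short numpy sketch below (an illustration, not part of the notes) shows that the degree-2 polynomial kernel K(x, z) = (x . z)^2 gives exactly the dot product of the points after an explicit quadratic feature map phi, without ever computing phi.
```python
import numpy as np

def phi(v):
    """Explicit map of a 2-D point into 3-D quadratic feature space."""
    x1, x2 = v
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

def poly_kernel(x, z):
    """Degree-2 polynomial kernel, computed entirely in the original space."""
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])

print(np.dot(phi(x), phi(z)))   # inner product in the higher-dimensional space
print(poly_kernel(x, z))        # same value, without ever computing phi
```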
Kernel Functions
Linear Kernel:
o This is the simplest kernel, and it's used when the data is linearly separable.
o It's essentially the standard dot product.
Polynomial Kernel:
o This kernel allows for curved decision boundaries.
o It raises the features to a certain power, allowing the SVM to capture polynomial
relationships.
Radial Basis Function (RBF) Kernel:
o Also known as the Gaussian kernel, this is one of the most popular kernels.
o It can handle highly non-linear data.
o It measures the similarity between data points based on their distance.
Sigmoid Kernel:
o This kernel is similar to the sigmoid function used in neural networks.
o It can be useful in certain applications, but it's not as commonly used as the RBF kernel.
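The four kernels above can be compared directly in scikit-learn (assumed to be available); the moons dataset and its noise level are arbitrary choices for illustration.
```python
# Sketch (assumes scikit-learn): compare SVM kernels on non-linearly separable data.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=500, noise=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    clf = SVC(kernel=kernel, gamma="scale")   # gamma only affects the non-linear kernels
    clf.fit(X_train, y_train)
    print(f"{kernel:8s} accuracy: {clf.score(X_test, y_test):.2f}")
```
Typically the RBF kernel gives the best accuracy here, since the two classes form curved, interleaved shapes that no straight line can separate.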
Why Kernels Matter
Kernels enable SVMs to handle complex datasets that would otherwise be impossible to classify.
They provide a flexible way to define similarity between data points, allowing SVMs to adapt to
various data distributions.
By avoiding explicit transformations, kernels make SVMs computationally efficient.
In essence, kernel methods extend the power of SVMs, allowing them to tackle a wider range of machine
learning problems.
UNIT 2:
Unsupervised learning
Unsupervised learning is a branch of machine learning in which algorithms learn from unlabeled data,
without human supervision. Because the data is not tagged or categorized in any way, the algorithm's task is
to discover hidden patterns, structures, or relationships within the data itself, without explicit guidance or
instruction. Here's a breakdown of key concepts:
Core Idea:
Unlike supervised learning, which uses labeled data to train a model for prediction, unsupervised
learning focuses on discovering inherent structures within unlabeled data.
The goal is to understand the underlying distribution of the data, identify clusters of similar data
points, or reduce the dimensionality of the data.
Key Tasks:
Clustering:
o Grouping similar data points together based on their features.
o Examples:
Customer segmentation: Grouping customers with similar purchasing behavior.
Document clustering: Organizing documents based on their content.
K-means clustering is a very common algorithm used for this.
Dimensionality Reduction:
o Reducing the number of features in a dataset while preserving its essential information.
o This helps to simplify the data, reduce noise, and improve the efficiency of subsequent
analyses.
o Examples:
Principal Component Analysis (PCA): Transforming data into a lower-dimensional
space while maximizing variance.
Autoencoders.
Association Rule Learning:
o Discovering relationships or dependencies between variables in a dataset.
o Example:
Market basket analysis: Identifying products that are frequently purchased together.
Anomaly Detection:
o Identifying data points that deviate significantly from the norm.
o This is very useful for fraud detection, or for detecting faults in machinery.
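As a small illustration of anomaly detection, the sketch below uses scikit-learn's Isolation Forest on synthetic data; the contamination rate is an assumed value.
```python
# Anomaly detection sketch (assumes scikit-learn); data is synthetic.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))     # typical data points
outliers = rng.uniform(low=-6, high=6, size=(10, 2))       # points far from the norm
X = np.vstack([normal, outliers])

detector = IsolationForest(contamination=0.05, random_state=0)
labels = detector.fit_predict(X)          # +1 = normal, -1 = flagged as anomalous
print((labels == -1).sum(), "points flagged as anomalies")
```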
Common Algorithms:
K-means Clustering:
o Partitions data into k clusters based on the distance to cluster centroids.
Hierarchical Clustering:
o Builds a hierarchy of clusters by iteratively merging or splitting data points.
Principal Component Analysis (PCA):
o Transforms data into a lower-dimensional space by identifying principal components.
Autoencoders:
o Neural networks that learn compressed representations of data.
Association Rule Learning (e.g., Apriori):
o Discovers relationships between variables, such as items that are frequently purchased together.
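A minimal sketch of the first of these algorithms, K-means, assuming scikit-learn; the synthetic blobs stand in for real unlabeled data.
```python
# K-means clustering sketch (assumes scikit-learn) on unlabeled data.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)  # true labels are ignored

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)        # assign each point to the nearest centroid

print(kmeans.cluster_centers_)        # learned cluster centroids
print(labels[:10])                    # cluster index discovered for each point
```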
Applications:
Unsupervised learning is crucial for extracting valuable insights from the vast amounts of unlabeled data
that exist in the real world, for example in customer segmentation, document organization, market basket
analysis, and fraud or fault detection.
Dimensionality Reduction: Principal Component Analysis (PCA) and Kernel PCA
Before we discuss the techniques, it's important to understand why dimensionality reduction is often
necessary:
Curse of Dimensionality: As the number of features (dimensions) increases, the amount of data
needed to generalize well grows exponentially. This can lead to sparse data and overfitting.
Computational Cost: Training and running machine learning models on high-dimensional data can
be computationally expensive and time-consuming.
Storage Requirements: High-dimensional data requires more storage space.
Visualization: It's difficult or impossible to visualize data in more than three dimensions, hindering
our understanding of the data.
Redundancy: High-dimensional datasets often contain redundant or irrelevant features that don't
contribute significantly to the underlying patterns.
1. Principal Component Analysis (PCA)
Core Idea: PCA is a linear dimensionality reduction technique that aims to find the directions
(principal components) of maximum variance in the data. It projects the data onto a lower-
dimensional subspace formed by these principal components, effectively capturing the most
important information.
How it Works:
1. Standardize the Data: Scale the data so that each feature has zero mean and unit variance.
This prevents features with larger ranges from dominating the principal components.
2. Compute the Covariance Matrix: Calculate the covariance matrix of the standardized data.
This matrix shows the relationships and variances between different features.
3. Compute Eigenvectors and Eigenvalues: Find the eigenvectors and eigenvalues of the
covariance matrix. Eigenvectors represent the principal components (directions of maximum
variance), and eigenvalues indicate the amount of variance explained by each principal
component.
4. Select Principal Components: Sort the eigenvectors by their corresponding eigenvalues in
descending order. Choose the top k eigenvectors (where k is the desired lower
dimensionality) that capture a significant portion of the total variance.
5. Project the Data: Project the original data onto the subspace spanned by the selected top k
eigenvectors. This results in a lower-dimensional representation of the data.
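The five steps above can be written out almost directly in numpy. This is a sketch for understanding, not an optimized implementation; random data stands in for a real dataset.
```python
# PCA from scratch in numpy, following the five steps above.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                     # 100 samples, 5 features

# 1. Standardize: zero mean, unit variance per feature
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Covariance matrix of the standardized data
cov = np.cov(X_std, rowvar=False)

# 3. Eigenvectors / eigenvalues (eigh is used because cov is symmetric)
eigvals, eigvecs = np.linalg.eigh(cov)

# 4. Sort by eigenvalue (descending) and keep the top k components
order = np.argsort(eigvals)[::-1]
k = 2
components = eigvecs[:, order[:k]]
explained = eigvals[order[:k]] / eigvals.sum()    # fraction of variance captured

# 5. Project the data onto the k principal components
X_reduced = X_std @ components
print(X_reduced.shape, explained)                 # (100, 2) and the variance ratios
```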
Key Characteristics:
o Linear Transformation: PCA performs a linear transformation of the data.
o Unsupervised: It doesn't require labeled data.
o Variance Maximization: It aims to preserve the maximum variance in the lower-
dimensional space.
o Orthogonal Components: The principal components are orthogonal (uncorrelated) to each
other.
Limitations: PCA is effective for linear relationships in the data. If the underlying structure is highly
non-linear, PCA may not be able to capture it effectively, leading to a significant loss of
information.
2. Kernel PCA
Core Idea: Kernel PCA is a non-linear extension of PCA that uses the "kernel trick" (similar to
Support Vector Machines) to perform PCA in a higher-dimensional feature space implicitly. This
allows it to capture non-linear relationships in the original data.
How it Works:
1. Choose a Kernel Function: Select a kernel function (e.g., polynomial, radial basis function
(RBF), sigmoid) that defines a non-linear transformation of the data into a higher-
dimensional space.
2. Construct the Kernel Matrix: Instead of explicitly computing the transformed data, Kernel
PCA computes a kernel matrix. The elements of this matrix represent the inner products
between pairs of data points in the higher-dimensional feature space, as defined by the chosen
kernel function.
3. Center the Kernel Matrix: Center the kernel matrix to ensure the data is centered in the
feature space.
4. Compute Eigenvectors and Eigenvalues of the Kernel Matrix: Find the eigenvectors and
eigenvalues of the centered kernel matrix.
5. Select Eigenvectors and Project: Select the top k eigenvectors corresponding to the largest
eigenvalues. The principal components in the original space are then obtained by projecting
the original data onto these eigenvectors in the kernel-induced feature space. The projection
of a new data point involves computing its kernel values with the training data and then using
the selected eigenvectors.
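In practice the steps above are usually delegated to a library. The sketch below assumes scikit-learn and applies an RBF kernel to concentric circles, a classic case where linear PCA cannot separate the structure; the gamma value is an arbitrary illustrative choice.
```python
# Kernel PCA sketch (assumes scikit-learn) on non-linearly structured data.
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA

X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

X_pca = PCA(n_components=2).fit_transform(X)                 # linear PCA cannot "unfold" the circles
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10.0)   # gamma chosen for illustration
X_kpca = kpca.fit_transform(X)

print(X_pca.shape, X_kpca.shape)   # both (400, 2), but the kernel projection tends to
                                   # separate the two circles along a single axis
```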
Key Characteristics:
o Non-linear Transformation: By using kernel functions, Kernel PCA can capture complex
non-linear patterns.
o Implicit Feature Mapping: It avoids explicitly computing the coordinates in the high-
dimensional space, making it computationally feasible.
o Flexibility: Different kernel functions can be chosen to suit the specific non-linear structure
of the data.
Advantages over Linear PCA:
o Handles Non-linearity: Effective in reducing the dimensionality of data with non-linear
relationships.
o Can Unfold Complex Structures: Can reveal underlying structures that are not apparent in
the original linear space.
Disadvantages:
o Computational Cost: Can be more computationally expensive than linear PCA, especially
for large datasets, due to the need to compute the kernel matrix.
o Kernel Selection: Choosing the appropriate kernel function and its parameters can be
challenging and requires domain knowledge or experimentation.
o Interpretability: The principal components obtained by Kernel PCA are in the kernel-
induced feature space, which can be less interpretable than the components from linear PCA
in the original feature space.
In Summary:
PCA is a powerful linear dimensionality reduction technique that finds directions of maximum
variance. It's computationally efficient and easy to interpret but limited to linear relationships.
Kernel PCA extends PCA to handle non-linear data by implicitly mapping the data to a higher-
dimensional space using kernel functions. It can capture complex patterns but comes with higher
computational cost and the challenge of kernel selection.
The choice between PCA and Kernel PCA depends on the nature of the data and the underlying
relationships between the features. If the data has strong linear correlations, PCA might suffice. However, if
non-linear patterns are suspected, Kernel PCA can be a valuable tool for dimensionality reduction.
UNIT 3:
Artificial Neural Networks (ANNs):
Artificial Neural Networks are a cornerstone of modern machine learning, particularly in the realm of deep
learning. Inspired by the structure and function of the human brain, they are powerful tools for learning
complex patterns from data. Let's break them down:
Core Idea:
An ANN is a network of simple processing units (neurons) organized in layers. Each connection between
neurons carries a weight, and the network learns by adjusting these weights so that inputs are mapped to the
desired outputs.
Basic Components:
Neurons (Nodes):
o The fundamental building blocks of an ANN.
o Each neuron receives input from other neurons or external sources.
o It performs a weighted sum of its inputs and applies an activation function to produce an
output.
Weights:
o Numerical values associated with the connections between neurons.
o They determine the strength of the connection and influence the signal passed between
neurons.
o During training, these weights are adjusted to minimize the difference between the network's
predictions and the actual target values.
Biases:
o An additional parameter associated with each neuron.
o Similar to an intercept in a linear equation, the bias allows the neuron to be activated even
when all inputs are zero.
Activation Functions:
o Non-linear functions applied to the weighted sum of inputs in a neuron.
o Non-linearity is crucial for ANNs to learn complex, non-linear relationships in the data.
Without it, the entire network would behave like a single linear layer.
o Common activation functions include:
Sigmoid: Outputs a value between 0 and 1. Historically used but less common in
deep networks now.
Tanh (Hyperbolic Tangent): Outputs a value between -1 and 1.
ReLU (Rectified Linear Unit): Outputs 0 if the input is negative, and the input itself
if it's positive. Very popular due to its simplicity and efficiency.
Leaky ReLU: Similar to ReLU but allows a small, non-zero gradient when the input
is negative, addressing the "dying ReLU" problem.
Softmax: Used in the output layer for multi-class classification, it converts a vector of
raw scores into a probability distribution over the classes.
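The activation functions listed above are simple to write down; a small numpy sketch (for illustration only) follows.
```python
# Common activation functions in numpy.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))          # squashes values into (0, 1)

def tanh(x):
    return np.tanh(x)                         # squashes values into (-1, 1)

def relu(x):
    return np.maximum(0.0, x)                 # 0 for negative inputs, identity otherwise

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)      # small slope for negative inputs

def softmax(x):
    e = np.exp(x - np.max(x))                 # shift for numerical stability
    return e / e.sum()                        # probabilities summing to 1

z = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
print(sigmoid(z), relu(z), softmax(z), sep="\n")
```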
Network Architecture:
Input Layer: Receives the raw input data. The number of neurons in this layer corresponds to the
number of features in the input data.
Hidden Layers: One or more layers between the input and output layers. These layers perform the
complex feature extraction and transformation. Deep learning networks have multiple hidden layers.
Output Layer: Produces the final output of the network. The number of neurons and the activation
function in this layer depend on the task (e.g., one neuron with a sigmoid for binary classification,
multiple neurons with softmax for multi-class classification, one neuron with no activation for
regression).
Training Process:
Forward Propagation: Input data is passed through the network layer by layer. Each neuron
calculates its output based on the weighted sum of its inputs and its activation function.
Loss Function: A function that measures the difference between the network's predictions and the
actual target values. The goal of training is to minimize this loss.
Backpropagation: An algorithm used to calculate the gradients of the loss function with respect to
the network's weights. These gradients indicate how much each weight contributes to the error.
Optimization: Algorithms like Gradient Descent (and its variants like Adam, RMSprop) use the
calculated gradients to update the weights in a direction that reduces the loss. This iterative process
continues until the network's performance on the training data (and ideally unseen data) is
satisfactory.
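To make forward propagation, the loss, backpropagation, and the optimization step concrete, here is a from-scratch numpy sketch of a tiny one-hidden-layer network trained on XOR. The layer sizes, learning rate, and epoch count are arbitrary illustrative choices; in practice a framework would compute the gradients automatically.
```python
# Tiny neural network trained from scratch: forward pass, MSE loss, backprop, gradient descent.
import numpy as np

rng = np.random.default_rng(1)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)            # XOR targets

W1, b1 = rng.normal(size=(2, 8)), np.zeros((1, 8))         # input -> hidden
W2, b2 = rng.normal(size=(8, 1)), np.zeros((1, 1))         # hidden -> output
lr = 1.0

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for epoch in range(10000):
    # Forward propagation: weighted sums + activations, layer by layer
    h = sigmoid(X @ W1 + b1)
    y_hat = sigmoid(h @ W2 + b2)

    # Loss: mean squared error between predictions and targets
    loss = np.mean((y_hat - y) ** 2)

    # Backpropagation: gradients of the loss w.r.t. weights and biases
    d_out = (2.0 / len(X)) * (y_hat - y) * y_hat * (1 - y_hat)
    dW2, db2 = h.T @ d_out, d_out.sum(axis=0, keepdims=True)
    d_hidden = (d_out @ W2.T) * h * (1 - h)
    dW1, db1 = X.T @ d_hidden, d_hidden.sum(axis=0, keepdims=True)

    # Optimization: plain gradient descent step
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

# Loss typically shrinks toward 0 and predictions approach the XOR targets
# (exact values depend on the random initialization).
print(loss, y_hat.round(2))
```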
There are many different architectures of neural networks designed for specific tasks, including:
Feedforward Neural Networks (FNNs): The basic type where information flows in one direction
from input to output.
Convolutional Neural Networks (CNNs): Particularly effective for image and video processing,
they use convolutional layers to automatically learn spatial hierarchies of features.
Recurrent Neural Networks (RNNs): Designed for sequential data (e.g., text, time series), they
have feedback connections that allow them to maintain a memory of past inputs.
Transformers: A more recent architecture that has shown remarkable success in natural language
processing and is also being applied to other domains. They rely on attention mechanisms to weigh
the importance of different parts of the input sequence.
Applications of ANNs:
Image Recognition and Computer Vision: Object detection, image classification, facial
recognition.
Natural Language Processing (NLP): Machine translation, sentiment analysis, text generation,
chatbots.
Speech Recognition: Converting spoken language into text.
Recommendation Systems: Suggesting products or content to users.
Robotics and Autonomous Systems: Perception, control, and decision-making.
Healthcare: Disease diagnosis, drug discovery.
Finance: Fraud detection, algorithmic trading.