Machine Learning
Machine learning is the branch of Artificial Intelligence that focuses on developing models and
algorithms that let computers learn from data and improve from previous experience without
being explicitly programmed for every task. In simple words, ML teaches systems to think and understand like humans by learning from data.
Types of Machine Learning Systems
There are four main types of machine learning systems:
1. Supervised Learning
2. Unsupervised Learning
3. Semi-supervised Learning
4. Reinforcement Learning
Supervised learning is when a model is trained on a "Labelled Dataset". Labelled datasets have both input and output parameters. Supervised learning algorithms learn to map inputs to the correct outputs, and both the training and validation datasets are labelled.
Example: Consider a scenario where you have to build an image classifier to differentiate between cats and dogs. If you feed labelled images of dogs and cats to the algorithm, the machine will learn to classify a dog versus a cat from these labelled images. When we input new dog or cat images that it has never seen before, it will use what it has learned to predict whether the image is a dog or a cat.
There are two main categories of supervised learning that are mentioned below:
Classification: Logistic Regression, Support Vector Machine, Random Forest, Decision Tree, K-Nearest Neighbors (KNN), Naive Bayes
Regression: Linear Regression, Ridge Regression, Lasso Regression
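As a quick illustration of supervised learning, the sketch below trains a classifier on a small labelled dataset and predicts labels for unseen points (a minimal sketch assuming scikit-learn is installed; the synthetic data simply stands in for labelled examples such as cat/dog features):

# Minimal supervised-learning sketch: train on labelled data, predict on new data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic labelled dataset (inputs X, outputs y).
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = KNeighborsClassifier(n_neighbors=5)   # a simple supervised classifier
model.fit(X_train, y_train)                   # learn the input -> output mapping
print("accuracy on unseen data:", model.score(X_test, y_test))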
Unsupervised learning, in contrast, works on unlabelled data: the algorithm discovers patterns, groupings, or structure without predefined output labels.
Example: Consider that you have a dataset that contains information about purchases made from a shop. Through clustering, the algorithm can group customers with similar purchasing behaviour, which reveals potential customer segments without predefined labels. This type of information can help businesses target customers as well as identify outliers.
There are two main categories of unsupervised learning that are mentioned below:
Clustering: K-Means Clustering algorithm, DBSCAN Algorithm, Hierarchical Clustering (Principal Component Analysis is also unsupervised, but it performs dimensionality reduction rather than clustering)
Association: Apriori Algorithm, FP-growth Algorithm
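A minimal clustering sketch along the lines of the purchase example above (assuming scikit-learn; the purchase figures are invented for illustration):

# Group customers by purchasing behaviour without any labels (unsupervised).
import numpy as np
from sklearn.cluster import KMeans

# Each row: [purchases per month, average spend] for one customer (made-up data).
purchases = np.array([[2, 15], [3, 20], [2, 18],
                      [20, 200], [22, 210], [19, 190]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(purchases)
print("cluster label per customer:", kmeans.labels_)
print("cluster centres:", kmeans.cluster_centers_)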
Reinforcement learning is a learning method in which an agent interacts with the environment by producing actions and discovering errors. Trial-and-error search and delayed reward are the most relevant characteristics of reinforcement learning. In this technique, the model keeps improving its performance using reward feedback to learn the desired behaviour or pattern. These algorithms are tuned to a specific problem, e.g. Google's self-driving car, or AlphaGo, where a bot competes with humans and even itself to become a better and better Go player. Each time we feed in data, the agent learns and adds the data to its knowledge, which becomes its training data. So, the more it learns, the better trained and hence the more experienced it becomes.
Deep Q-learning: Deep Q-learning is a combination of Q-learning and deep learning. Deep
Q-learning uses a neural network to represent the Q-function, which allows it to learn complex
relationships between states and actions.
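Deep Q-learning replaces the Q-table of classic Q-learning with a neural network. For intuition, the sketch below shows the underlying tabular Q-learning update (the environment, state/action sizes, rewards, and hyperparameters are invented for illustration):

import numpy as np

n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))        # Q-table: expected return for (state, action)
alpha, gamma = 0.1, 0.9                    # learning rate and discount factor

def q_update(state, action, reward, next_state):
    # Classic Q-learning update: move Q(s, a) towards reward + gamma * max_a' Q(s', a').
    best_next = np.max(Q[next_state])
    Q[state, action] += alpha * (reward + gamma * best_next - Q[state, action])

# One illustrative experience tuple: in state 0, action 1 gave reward 1 and led to state 2.
q_update(state=0, action=1, reward=1.0, next_state=2)
print(Q[0])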
Semi-supervised learning is a machine learning approach that sits between supervised and unsupervised learning, so it uses both labelled and unlabelled data. It is particularly useful when obtaining labelled data is costly, time-consuming, or resource-intensive, or when labelling requires specialist skills and relevant resources.
Example: Consider that we are building a language translation model; having labelled translations for every sentence pair would be resource-intensive. Semi-supervised learning allows the model to learn from both labelled and unlabelled sentence pairs, making it more accurate. This technique has led to significant improvements in the quality of machine translation services.
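A minimal semi-supervised sketch (assuming scikit-learn's SelfTrainingClassifier, where unlabelled samples are marked with the label -1; the data here is synthetic):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=5, random_state=0)
y_partial = y.copy()
y_partial[50:] = -1                      # pretend only the first 50 samples are labelled

base = SVC(probability=True, random_state=0)       # base learner must output probabilities
model = SelfTrainingClassifier(base).fit(X, y_partial)
print("labels inferred for some unlabelled samples:", model.transduction_[50:60])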
Challenges in Machine Learning:
1. Inadequate Training Data: Although data plays a vital role in machine learning, many data scientists find that inadequate, noisy, and unclean data severely hampers machine learning algorithms. Data quality can be affected by factors such as: a. Noisy data b. Incorrect data
2. Generalizing Output Data: Poor quality of data also makes it hard for a model to generalize. Noisy, incomplete, inaccurate, and unclean data lead to lower accuracy in classification and low-quality results. Hence, data quality is another major common problem when processing machine learning algorithms.
3. Slow Implementation: This is one of the common issues faced by machine learning professionals. Machine learning models can provide highly accurate results, but training them takes a tremendous amount of time. Slow programs, data overload, and excessive requirements mean it usually takes a lot of time to produce accurate results.
4. Overfitting: Overfitting occurs when a model learns the training data too closely, including its noise and bias, so it performs well on the training set but poorly on new data. Unfortunately, this is one of the significant issues faced by machine learning professionals. Techniques to reduce overfitting:
Increase training data.
Reduce model complexity.
Apply Ridge (L2) or Lasso (L1) regularization.
Use dropout for neural networks to tackle overfitting.
5. Underfitting: Underfitting occurs when the model is too simple to capture an accurate relationship between the input and output variables. To overcome this issue:
Increase the training time of the model.
Increase the complexity of the model.
Add more features to the data.
Reduce the regularization parameters.
Write down the applications of Machine Learning
2. Traffic prediction: If we want to visit a new place, we take help of Google Maps, which
shows us the correct path with the shortest route and predicts the traffic conditions.
It predicts the traffic conditions such as whether traffic is cleared, slow-moving, or heavily
congested.
3. Speech Recognition: You have certainly come across speech-recognition-based smart assistants such as Alexa and Siri and used them to communicate. These systems are designed to convert voice instructions into text.
4. Image Recognition:
Image recognition is one of the most common applications of machine learning. It is used to identify objects, persons, and places in digital images.
5. Social Media Features: Social media platforms use machine learning algorithms and
approaches to create some attractive and excellent features.
Machine learning (ML) offers numerous applications in business, ranging from enhancing
customer experiences to streamlining operations and making informed decisions.
1. Overfitting: Overfitting occurs when a model learns the training data too well, including its noise and bias, which negatively affects its performance on new, unseen data. Unfortunately, this is one of the significant issues faced by machine learning professionals.
For example, imagine fitting a very complicated curve to a set of points. The curve will go through every point, but it won't represent the actual pattern.
As a result, the model works great on training data but fails when tested on new data.
Reasons: high variance and low bias; the model is too complex; the training dataset is too small.
2. Underfitting
Underfitting is the opposite of overfitting. It happens when a model is too simple to capture what's going on in the data, so it is unable to establish an accurate relationship between the input and output variables.
For example, imagine drawing a straight line to fit points that actually follow a curve. The line misses most of the pattern. In this case, the model doesn't work well on either the training or testing data.
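The curve-fitting picture described above can be reproduced with a short sketch: fitting polynomials of different degrees to noisy data that actually follows a quadratic (the degrees and noise level are arbitrary choices for illustration):

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 20)
y = x**2 + rng.normal(scale=1.0, size=x.size)   # true pattern is a parabola plus noise

for degree in (1, 2, 9):                        # underfit, good fit, overfit
    coeffs = np.polyfit(x, y, degree)
    train_error = np.mean((np.polyval(coeffs, x) - y) ** 2)
    print(f"degree {degree}: training error = {train_error:.3f}")
# Degree 1 misses the curve (underfitting); degree 9 drives training error
# down by chasing the noise (overfitting) and generalizes poorly.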
Types of Linear Regression: Linear regression can be further divided into two types of algorithm:
Simple Linear Regression: If a single independent variable is used to predict the value of a numerical dependent variable, then such a Linear Regression algorithm is called Simple Linear Regression.
Multiple Linear Regression: If more than one independent variable is used to predict the value of a numerical dependent variable, then such a Linear Regression algorithm is called Multiple Linear Regression.
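A minimal simple-linear-regression sketch with one independent variable (assuming scikit-learn; the example data is invented):

import numpy as np
from sklearn.linear_model import LinearRegression

# One independent variable (e.g. years of experience) predicting a numerical target (e.g. salary).
X = np.array([[1], [2], [3], [4], [5]])          # 2-D shape: (n_samples, n_features)
y = np.array([30, 35, 41, 44, 50])

model = LinearRegression().fit(X, y)
print("slope:", model.coef_[0], "intercept:", model.intercept_)
print("prediction for x = 6:", model.predict([[6]])[0])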
Logistic regression is one of the most popular Machine Learning algorithms, which comes under
the Supervised Learning technique.
It is used for predicting the categorical dependent variable using a given set of independent variables.
Logistic regression predicts the output of a categorical dependent variable. Therefore the
outcome must be a categorical or discrete value. It can be either Yes or No, 0 or 1, True or
False, etc. but instead of giving the exact value as 0 and 1, it gives the probabilistic values
which lie between 0 and 1.
Logistic Regression is quite similar to Linear Regression except in how they are used. Linear Regression is used for solving regression problems, whereas Logistic Regression is used for solving classification problems. The curve produced by the logistic function indicates the likelihood of something, such as whether cells are cancerous or not, or whether a mouse is obese or not based on its weight.
Types of Logistic Regression:
1. Binomial
2. Multinomial
3. Ordinal
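A minimal logistic-regression sketch showing both the discrete prediction and the probabilistic output between 0 and 1 (assuming scikit-learn; the data is synthetic):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
clf = LogisticRegression().fit(X, y)

print("predicted class:", clf.predict(X[:1])[0])            # discrete output: 0 or 1
print("class probabilities:", clf.predict_proba(X[:1])[0])  # values between 0 and 1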
Explain In detail Multivariate Linear Regression
Ans: A decision tree is a flowchart-like model that represents decisions and their potential
outcomes. It's a powerful tool used in both decision analysis and machine learning to make
predictions or classify data. Decision trees are a type of supervised learning algorithm, meaning
they learn from labeled data to make future predictions
Structure of a Decision Tree:
Root Node: The starting point of the tree, representing the entire dataset or the initial question.
Internal Nodes: Represent tests or questions on features or attributes.
Branches: Represent the outcomes or results of the tests or questions.
Leaf Nodes: The terminal nodes, representing the final predictions or classifications.
2. How Decision Trees Work:
Splitting Data:
Decision trees work by recursively splitting data into subsets based on different features or
attributes.
Making Decisions:
At each internal node, the algorithm decides which feature to use for splitting based on certain
criteria (like information gain or Gini impurity).
Reaching Leaf Nodes:
The process continues until it reaches leaf nodes, which represent the final predictions or
classifications.
3. Key Concepts:
Supervised Learning: Decision trees learn from labeled data to make predictions on new,
unseen data.
Classification: Used to categorize data into different classes.
Regression: Used to predict continuous values.
Overfitting: When a tree becomes too complex and learns the training data too well, leading to poor generalization to new data.
Pruning: A technique used to simplify the tree by removing unnecessary branches to improve its
performance and prevent overfitting.
4. Advantages of Decision Trees:
Interpretability: Decision trees are relatively easy to understand and visualize.
Versatility: Can be used for both classification and regression tasks.
Flexibility: Can handle various data types and non-linear relationships.
Low Preprocessing Needs: Require minimal data preparation before training.
5. Applications:
Machine Learning: Building predictive models for classification and regression.
Decision Analysis: Analyzing complex decisions and identifying potential outcomes.
Operations Research: Optimizing resource allocation and making strategic decisions.
CART (Classification And Regression Trees) is a variation of the decision tree algorithm. It can handle both classification and regression tasks. Scikit-Learn uses the CART algorithm to train Decision Trees (also called "growing" trees). CART is a predictive algorithm used in machine learning that explains how the target variable's values can be predicted from the other variables. It is a decision tree in which each fork is a split on a predictor variable and each leaf node holds a prediction for the target variable.
The term CART serves as a generic term for the following categories of decision trees:
Classification Trees: The tree is used to determine which "class" the target variable most likely falls into when the target is categorical (discrete). CART for classification works by recursively splitting the training data into smaller and smaller subsets based on certain criteria. The goal is to split the data in a way that minimizes the impurity within each subset. Impurity is a measure of how mixed up the data is in a particular subset. For classification tasks, CART uses Gini impurity.
Gini Impurity- Gini impurity measures the probability of misclassifying a random instance from a
subset labeled according to the majority class. Lower Gini impurity means more purity of the
subset.
Splitting Criteria- The CART algorithm evaluates all potential splits at every node and chooses
the one that best decreases the Gini impurity of the resultant subsets. This process continues
until a stopping criterion is reached, like a maximum tree depth or a minimum number of
instances in a leaf node.
Regression Trees: These are used to predict a continuous variable's value. Regression CART works by splitting the training data recursively into smaller subsets based on specific criteria. The objective is to split the data in a way that minimizes the residual (squared) error within each subset.
Residual Reduction: Residual reduction is a measure of how much the average squared difference between the predicted values and the actual values of the target variable is reduced by splitting a subset. The greater the residual reduction, the better the model fits the data.
Splitting Criteria- CART evaluates every possible split at each node and selects the one that
results in the greatest reduction of residual error in the resulting subsets. This process is
repeated until a stopping criterion is met, such as reaching the maximum tree depth or having
too few instances in a leaf node.
Advantages of CART
Results are simple to interpret; classification and regression trees are nonparametric and nonlinear; outliers have no meaningful effect on CART; it requires minimal supervision and produces easy-to-understand models.
Limitations of CART
Overfitting, High Variance, low bias, the tree structure may be unstable.
Applications: For quick Data insights, In Blood Donors Classification,In the financial sectors.
Lasso and Ridge regression are regularization techniques used in machine learning to prevent
overfitting and improve model prediction accuracy.
Lasso Regression (L1 Regularization):
Its purpose is to penalize the absolute value of the coefficients, which can lead to some coefficients being set exactly to zero.
Lasso can perform feature selection by eliminating less important variables, resulting in a
simpler and more interpretable model.
Lasso tends to produce sparse models where a subset of features are selected, making the
model easier to understand and interpret.
Ridge Regression (L2 Regularization): Its purpose is to penalize the square of the coefficients, shrinking them towards zero without eliminating them entirely.
Ridge regression is effective in situations with multicollinearity, where predictor variables are
highly correlated, by reducing the impact of these correlated variables on the model.
Ridge regression does not set coefficients to zero, retaining all predictors in the model.
Suitable when all variables are expected to contribute to the prediction and multicollinearity is
present.
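A short sketch contrasting the two penalties: Lasso tends to drive the coefficients of uninformative features exactly to zero, while Ridge only shrinks them (assuming scikit-learn; the regularization strength alpha and the synthetic data are arbitrary choices):

from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Regression data where only 3 of the 10 features are actually informative.
X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)   # L1 penalty: sparse coefficients
ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: shrunk, but non-zero, coefficients

print("Lasso coefficients:", lasso.coef_.round(2))
print("Ridge coefficients:", ridge.coef_.round(2))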
The Gini index is a measure of impurity or randomness in a dataset, particularly used in
decision tree algorithms. It quantifies the probability of incorrectly classifying a randomly chosen
element if it were labeled according to the class distribution within the dataset. A Gini index of 0
indicates perfect purity (all elements belong to the same class), while a Gini index of 0.5 indicates maximum impurity for a two-class problem (elements are evenly distributed across the classes).
Key Points:
A lower Gini index indicates a more pure or homogenous group.
A higher Gini index indicates a more impure or mixed group.
The Gini index is a metric used to evaluate the impurity of a dataset or node in a decision tree.
Decision trees utilize the Gini index to guide the splitting process and create a tree that can accurately predict outcomes.
In decision tree algorithms, a node with a lower Gini index is preferred for splitting the data
because it represents a more pure or homogeneous group. The algorithm aims to create splits
that minimize the Gini index of each child node.
Example Calculation:
The Gini index is calculated as 1 minus the sum of squared probabilities of each class in the
dataset.
If a group has 80% "Yes" and 20% "No", the Gini index would be:
1 - (0.8^2 + 0.2^2) = 1 - (0.64 + 0.04) = 1 - 0.68 = 0.32.
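The calculation above can be generalized with a tiny helper function (a sketch; the 80/20 split is the worked example from the text):

def gini_index(class_probabilities):
    # Gini = 1 - sum of squared class probabilities.
    return 1.0 - sum(p ** 2 for p in class_probabilities)

print(gini_index([0.8, 0.2]))   # about 0.32, matching the worked example
print(gini_index([1.0, 0.0]))   # 0.0 -> perfectly pure node
print(gini_index([0.5, 0.5]))   # 0.5 -> maximum impurity for two classes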
Bagging (Bootstrap Aggregating) and Boosting
Bagging:
Process: Bagging involves creating multiple models by training them on different bootstrapped
samples of the original dataset. This means that each model is trained on a subset of the data,
where data points can be selected multiple times with replacement.
Goal: The primary goal of bagging is to reduce variance, which can help prevent overfitting and
improve the stability of the model.
Example: Random Forest is a well-known bagging algorithm that creates multiple decision trees
and combines their predictions to make a final prediction.
Boosting:
Process: Boosting trains models sequentially, with each new model focusing on correcting the errors made by the previous model. This iterative process allows the ensemble to gradually learn and improve its predictions.
Goal: The primary goal of boosting is to reduce bias by creating a strong learner that can make more accurate predictions.
Example: AdaBoost and Gradient Boosting are common boosting algorithms.
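A minimal sketch comparing the two ensemble styles on the same data (assuming scikit-learn; the ensemble sizes and synthetic data are arbitrary):

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Bagging: many trees trained on bootstrap samples, predictions combined by voting.
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)
# Boosting: models trained sequentially, each focusing on previous errors.
boosting = AdaBoostClassifier(n_estimators=50, random_state=0)

print("bagging accuracy :", cross_val_score(bagging, X, y, cv=5).mean())
print("boosting accuracy:", cross_val_score(boosting, X, y, cv=5).mean())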
There are several ways to combine multiple classifiers in machine learning, broadly categorized
as fusion and selection strategies. Fusion methods, like bagging, boosting, and stacking,
combine the predictions of multiple classifiers, while selection methods choose the best
classifier from a pool of candidates. A common approach is to use an aggregation function,
such as majority voting, to combine the individual predictions.
1. Fusion Methods:
Bagging (Bootstrap Aggregating): Trains multiple models on different subsets of the same data and then combines their predictions, e.g. Random Forest.
Boosting: Trains models sequentially, with each new model focusing on the errors of the previous ones, e.g. Gradient Boosting.
Stacking: Combines the predictions of multiple base classifiers using a meta-learner, which is
trained to predict the output of the ensemble.
Weighted Voting: Assigns weights to each classifier based on its individual performance and
then combines their predictions using a weighted average.
Majority Voting: Simple fusion where the class with the most votes among the classifiers is
selected as the final prediction.
2. Selection Methods:
MultiScheme: Selects the best classifier from a set of candidates using cross-validation.
Meta-classifier Approach: A meta-classifier is trained to choose the best classifier based on a
set of criteria.
Feature Selection: Identifies the most relevant features for classification and uses those to train
a classifier or a set of classifiers.
3. Hybrid Methods:
Some methods combine elements of both fusion and selection, such as using weighted voting
with a selection process to choose the best weighted classifiers.
limitations:
Computational Complexity: Some ensemble methods, like boosting, can be computationally
expensive.
Overfitting: Ensemble methods can be susceptible to overfitting if not properly regularized.
Data Requirements: Some ensemble methods, like stacking, require a larger amount of data
Q5. Discuss in detail K-fold cross-validation
Cross-validation is a technique used to check how well a machine learning model performs on unseen data. In K-fold cross-validation, the data is split into K equal parts (folds); the model is trained on K-1 folds and tested on the remaining fold, and this process is repeated K times so that each fold serves as the test set exactly once. Finally, the results from the K validation steps are averaged to produce a more reliable estimate of the model's performance.
The main purpose of cross-validation is to prevent overfitting. If you want to make sure your machine learning model is not just memorizing the training data but is capable of adapting to real-world data, cross-validation is a commonly used technique.
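A minimal K-fold cross-validation sketch with K = 5 (assuming scikit-learn; the data is synthetic):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=200, n_features=6, random_state=0)

kfold = KFold(n_splits=5, shuffle=True, random_state=0)   # 5 folds, each used once as test set
scores = cross_val_score(LogisticRegression(), X, y, cv=kfold)

print("accuracy per fold:", scores.round(3))
print("mean accuracy:", scores.mean().round(3))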
Explain clustering with an overview of distance metrics and major clustering approaches.
Least Square Regression for classification
Least squares regression aims to minimize the sum of the squared differences between
observed and predicted values. For classification, the idea is to adapt this approach to predict
class labels, typically by encoding classes numerically (e.g., -1 and 1 for binary classification).
➢ Process
• Encoding Class Labels: For binary classification, the class labels (e.g.,positive and negative)
are encoded as -1 and 1. For multi-class classification, labels are often encoded using a
one-vs-all strategy.
• Model Fitting: The regression model fits a linear function to these encoded labels by
minimizing the sum of squared errors.
This involves solving for the weight vector w in the linear equation:
y = Xw + ε
where X is the feature matrix, y is the vector of encoded labels, and ε is the error term.
• Prediction: The fitted linear model is used to predict the class label for new data points. For binary classification, the sign of the predicted value determines the class (positive if ŷ > 0, negative if ŷ < 0). For multi-class classification, the class with the highest predicted value is chosen.
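A minimal sketch of least squares used as a classifier, following the process described above (labels encoded as -1/+1 and prediction taken from the sign of the fitted linear function; the two-blob data is synthetic):

import numpy as np

rng = np.random.default_rng(0)
# Two Gaussian blobs as a toy binary problem, labels encoded as -1 and +1.
X = np.vstack([rng.normal(loc=-2, size=(50, 2)), rng.normal(loc=+2, size=(50, 2))])
y = np.array([-1] * 50 + [+1] * 50)

X_aug = np.hstack([X, np.ones((100, 1))])            # add a bias column
w, *_ = np.linalg.lstsq(X_aug, y, rcond=None)        # minimize ||X w - y||^2

predictions = np.sign(X_aug @ w)                     # the sign decides the class
print("training accuracy:", (predictions == y).mean())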
➢ Advantages
• Simplicity: Least squares regression is straightforward to implement and understand.
• Efficiency: Computationally efficient, especially for small to medium-sized datasets.
• Interpretability: The linear model’s coefficients can be easily interpreted to understand the
relationship between features and the target variable.
➢ Limitations
• Not Optimized for Classification: Least squares regression is designed for continuous
outcomes and may not handle classification boundaries effectively.
• Assumptions: Assumes linear separability of classes, which may not hold in practice.
• Sensitivity to Outliers: Can be sensitive to outliers, which may distort the classification
boundary.
What are activation functions? Explain the Binary, Bipolar, Continuous, and Ramp activation functions.
Activation functions are mathematical functions applied to the output of each neuron in a neural
network. They introduce non-linearity into the network, enabling it to learn complex patterns in
the data. Here’s an explanation of the Binary, Bipolar, Continuous, and Ramp activation
functions:
Binary Activation Function:
The binary (step) activation function outputs either 0 or 1 based on a threshold: if the input is below the threshold it outputs 0; if it is equal to or above the threshold it outputs 1. Mathematically: f(x) = 1 if x ≥ θ, else 0.
This function is typically used in binary classification tasks where the output should represent one of two classes.
Bipolar Activation Function:
The bipolar activation function outputs either -1 or 1 based on a threshold. If the input is below the threshold, it outputs -1; if it is equal to or above the threshold, it outputs 1.
Mathematically: f(x) = 1 if x ≥ θ, else -1.
It is similar to the binary activation function, but it allows for the representation of negative values.
Continuous Activation Function:
The continuous activation function produces a smooth, continuous output over the entire range
of inputs. Examples include sigmoid and hyperbolic tangent (tanh) functions.
Sigmoid function: σ(x) = 1 / (1 + e^(-x))
Tanh function: tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))
These functions are commonly used in hidden layers of neural networks for their smoothness
and suitability for gradient-based optimization algorithms like backpropagation.
Ramp Activation Function:
The ramp activation function linearly increases the output as the input increases up to a certain
threshold, after which it saturates to a constant value.
Mathematically: f(x) = 0 if x < 0, f(x) = x if 0 ≤ x ≤ 1, and f(x) = 1 if x > 1.
This function can be useful in situations where the network needs to gradually activate neurons
based on input strength, such as in certain types of regression tasks.
These activation functions serve different purposes and are chosen based on the requirements
of the neural network architecture and the nature of the problem being solved.
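A short sketch of the four activation functions described above (the thresholds are set to 0 and the ramp saturates at 1, which are common conventions but still assumptions here):

import numpy as np

def binary_step(x, theta=0.0):
    return np.where(x >= theta, 1.0, 0.0)          # outputs 0 or 1

def bipolar_step(x, theta=0.0):
    return np.where(x >= theta, 1.0, -1.0)         # outputs -1 or 1

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))                # smooth, continuous output in (0, 1)

def ramp(x):
    return np.clip(x, 0.0, 1.0)                    # linear between 0 and 1, then saturates

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for name, f in [("binary", binary_step), ("bipolar", bipolar_step),
                ("sigmoid", sigmoid), ("ramp", ramp)]:
    print(name, f(x))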
Draw and explain biological neural networks and compare them with artificial neural networks. [05]
The biological neural network is also composed of several processing pieces known as neurons
that are linked together via synapses. These neurons accept either external input or the results
of other neurons. The generated output from the individual neurons propagates its effect on the
entire network to the last layer, where the results can be displayed to the outside world. Every synapse has a processing value and a weight, which are determined during network training. The
performance and potency of the network fully depend on the neuron numbers in the network,
how they are connected to each other (i.e., topology), and the weights assigned to every
synapse.
1. Dendrite: Dendrites are branched extensions of a neuron that receive signals from other
neurons or sensory receptors. They act like antennae, collecting incoming electrical and
chemical signals and transmitting them towards the cell body.
2. Axon: An axon is a long, slender projection of a neuron that carries electrical impulses
away from the cell body toward other neurons, muscles, or glands. It’s like a cable
transmitting signals over long distances.
3. Nucleus: The nucleus is the central part of a cell that contains genetic material (DNA)
and controls the cell’s activities. In neurons, the nucleus is located in the cell body.
4. Cell Body (Soma): The cell body, also called the soma, is the main part of a neuron that
contains the nucleus and other organelles essential for the cell’s functioning. It
integrates incoming signals from dendrites and generates outgoing signals to the axon.
5. Synapse: A synapse is a junction between two neurons or between a neuron and a target
cell (such as a muscle or gland). It’s where the electrical signal from one neuron is
transmitted to another cell through chemical or electrical signaling. Synapses are crucial
for communication between neurons in the nervous system
Artificial Neural Networks contain artificial neurons, which are called units. These units are
arranged in a series of layers that together constitute the whole Artificial Neural Network in a
system. A layer can have only a dozen units or millions of units, as this depends on how the
complex neural networks will be required to learn the hidden patterns in the dataset. Commonly,
an Artificial Neural Network has an input layer, an output layer, as well as hidden layers. The
input layer receives data from the outside world, which the neural network needs to analyze or
learn about. Then, this data passes through one or multiple hidden layers that transform the
input into data that is valuable for the output layer. Finally, the output layer provides an output in
the form of a response of the Artificial Neural Networks to the input data provided.
The McCulloch-Pitts neural model, which was the earliest ANN model, has only two types of
inputs — Excitatory and Inhibitory. The excitatory inputs have weights of positive magnitude and
the inhibitory weights have weights of negative magnitude. The inputs of the McCulloch-Pitts
neuron could be either 0 or 1. It has a threshold function as its activation function: the output signal y_out is 1 if the weighted input sum y_sum is greater than or equal to a given threshold value, and 0 otherwise.
Simple McCulloch-Pitts neurons can be used to design logical operations. For that purpose, the connection weights and the threshold of the activation function need to be chosen correctly. Example: John carries an umbrella if it is sunny
or if it is raining. There are four given situations. I need to decide when John will carry the
umbrella. The situations are as follows:
Situation 1: It is neither raining nor sunny.
Situation 2: It is not raining, but it is sunny.
Situation 3: It is raining, but it is not sunny.
Situation 4: It is both raining and sunny.
To analyse the situations using the McCulloch-Pitts neural model, I can consider the input signals as follows: X1: Is it raining? X2: Is it sunny?
So, the value of each input can be either 0 or 1. We can set both weights (for X1 and X2) to 1 and the threshold to 1, so that the neuron fires whenever X1 + X2 ≥ 1. So, I can say that:
The truth table for the problem is as follows:
X1 X2 | y_sum | y_out
0  0  |   0   |   0
0  1  |   1   |   1
1  0  |   1   |   1
1  1  |   2   |   1
From the truth table, I can conclude that John needs to carry an umbrella in the situations where the value of y_out is 1. Hence, he will need to carry an umbrella in scenarios 2, 3 and 4.
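The umbrella example can be written out as a tiny McCulloch-Pitts neuron with both weights 1 and threshold 1, as chosen above:

def mcculloch_pitts_or(x1, x2, threshold=1):
    # Both weights are 1, so the neuron fires when x1 + x2 >= threshold.
    y_sum = 1 * x1 + 1 * x2
    return 1 if y_sum >= threshold else 0

for x1 in (0, 1):
    for x2 in (0, 1):
        print(f"raining={x1} sunny={x2} -> carry umbrella: {mcculloch_pitts_or(x1, x2)}")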
Draw a block diagram of the Error Back Propagation Algorithm and explain with a
flowchart the Error Back Propagation concept .
The Error Back Propagation Algorithm, commonly referred to as Backpropagation, is a
fundamental algorithm used for training artificial neural networks. It is a supervised learning
algorithm that aims to minimize the error between the predicted outputs and the actual outputs
by adjusting the weights of the network.
Steps in Backpropagation:
1. Forward pass: Feed the input through the network, layer by layer, to compute the predicted output.
2. Error computation: Compare the predicted output with the target output using a loss function (e.g. mean squared error).
3. Backward pass: Propagate the error backwards through the network, using the chain rule to compute the gradient of the loss with respect to every weight.
4. Weight update: Adjust each weight in the direction that reduces the error, typically with gradient descent and a chosen learning rate.
5. Repeat: Iterate over the training data for many epochs until the error converges.
Practical Considerations
● Learning Rate: A critical hyperparameter that needs to be set properly. Too high can lead
to divergence, too low can slow convergence.
● Overfitting: Use techniques like regularization (L1, L2), dropout, or early stopping to
prevent overfitting.
● Gradient Vanishing/Exploding: Deep networks may suffer from these issues. Techniques
like Batch Normalization, appropriate activation functions (ReLU), and gradient clipping
help mitigate them.
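A compact numpy sketch of the backpropagation loop for a one-hidden-layer network trained on XOR (the architecture, learning rate, and epoch count are illustrative choices, not a prescribed setup):

import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t = np.array([[0], [1], [1], [0]], dtype=float)          # XOR targets

W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)            # input -> hidden
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)            # hidden -> output
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
lr = 0.5

for _ in range(5000):
    # Forward pass
    h = sigmoid(X @ W1 + b1)
    y = sigmoid(h @ W2 + b2)
    # Backward pass: gradients of the squared error via the chain rule
    dy = (y - t) * y * (1 - y)
    dh = (dy @ W2.T) * h * (1 - h)
    # Weight updates (gradient descent)
    W2 -= lr * h.T @ dy;  b2 -= lr * dy.sum(axis=0)
    W1 -= lr * X.T @ dh;  b1 -= lr * dh.sum(axis=0)

print("final MSE:", float(np.mean((y - t) ** 2)))   # should be much lower than at the start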
List out and explain the applications of SVD.
Singular Value Decomposition (SVD) is a powerful matrix factorization technique with various
applications across different domains. Here are some common applications of SVD:
● Dimensionality reduction: SVD underlies Principal Component Analysis (PCA).
● Image compression: keeping only the largest singular values gives a compact low-rank approximation of an image.
● Recommender systems and latent semantic analysis: SVD uncovers latent factors in user-item or term-document matrices.
● Noise reduction and pseudo-inverse computation for least-squares problems.
PCA works by considering the variance of each attribute, because high variance indicates a good separation between the classes, and on this basis it reduces the dimensionality. Some real-world applications of PCA are image processing, movie recommendation systems, and optimizing the power allocation in various communication channels. It is a feature extraction technique, so it keeps the important variables and drops the least important ones.
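A minimal SVD sketch with numpy, including the kind of low-rank reconstruction used for compression and noise reduction (the matrix here is arbitrary):

import numpy as np

A = np.array([[3.0, 1.0, 1.0],
              [-1.0, 3.0, 1.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # A = U @ diag(s) @ Vt
print("singular values:", s)

k = 1                                              # keep only the largest singular value
A_rank1 = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]    # best rank-1 approximation of A
print("rank-1 approximation:\n", A_rank1)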
Perceptron is one of the simplest artificial neural network architectures. It was introduced by Frank Rosenblatt in 1957. It is the simplest type of feedforward neural network, consisting of a single layer of input nodes that are fully connected to a layer of output nodes, and it can learn linearly separable patterns. It uses a slightly different type of artificial neuron known as a threshold logic unit (TLU), which was first introduced by McCulloch and Walter Pitts in the 1940s.
Types of Perceptron
● Single-Layer Perceptron: This type of perceptron is limited to learning linearly separable patterns. It is effective for tasks where the data can be divided into distinct categories by a straight line.
● Multilayer Perceptron: Multilayer perceptrons possess enhanced processing capabilities as they
consist of two or more layers, adept at handling more complex patterns and relationships within
the data.
Basic Components of Perceptron A perceptron, the basic unit of a neural network, comprises essential
components that collaborate in information processing.
● Input Features: The perceptron takes multiple input features, each input feature represents a
characteristic or attribute of the input data.
● Weights: Each input feature is associated with a weight, determining the significance of each
input feature in influencing the perceptron’s output. During training, these weights are adjusted to
learn the optimal values.
● Summation Function: The perceptron calculates the weighted sum of its inputs using the
summation function. The summation function combines the inputs with their respective weights
to produce a weighted sum.
● Activation Function: The weighted sum is then passed through an activation function. The perceptron uses the Heaviside step function, which takes the weighted sum as input, compares it with a threshold, and outputs 0 or 1.
● Output: The final output of the perceptron is determined by the activation function's result. For example, in binary classification problems, the output might represent a predicted class (0 or 1).
● Bias: A bias term is often included in the perceptron model. The bias allows the model to make
adjustments that are independent of the input. It is an additional parameter that is learned during
training.
● Learning Algorithm (Weight Update Rule): During training, the perceptron learns by adjusting its
weights and bias based on a learning algorithm. A common approach is the perceptron learning
algorithm, which updates weights based on the difference between the predicted output and the
true output.
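A minimal perceptron sketch implementing the components above (step activation, bias, and the weight-update rule) on the linearly separable AND problem; the learning rate and number of epochs are arbitrary choices:

import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])                 # AND is linearly separable

w = np.zeros(2)
b = 0.0
lr = 0.1

for _ in range(20):                        # a few passes over the training data
    for xi, target in zip(X, y):
        output = 1 if np.dot(w, xi) + b >= 0 else 0    # Heaviside step activation
        error = target - output
        w += lr * error * xi               # perceptron weight-update rule
        b += lr * error

print("weights:", w, "bias:", b)
print("predictions:", [1 if np.dot(w, xi) + b >= 0 else 0 for xi in X])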
Explain Hebbian Learning rule. [05]
Hebbian Learning Rule, also known as the Hebb Learning Rule, was proposed by Donald O Hebb.
It is one of the first and also easiest learning rules in the neural network. It is used for pattern
classification. It is a single layer neural network, i.e. it has one input layer and one output layer.
The input layer can have many units, say n. The output layer only has one unit. Hebbian rule
works by updating the weights between neurons in the neural network for each training sample.
1. Set all weights to zero, wi = 0 for i=1 to n, and bias to zero.
2. For each input vector, S(input vector) : t(target output pair), repeat steps 3-5.
3. Set activations for input units with the input vector Xi = Si for i = 1 to n.
4. Set the corresponding output value to the output neuron, i.e. y = t.
5. Update the weights and bias by applying the Hebb rule for all i = 1 to n: wi(new) = wi(old) + xi * y, and b(new) = b(old) + y.
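A short sketch of the Hebb rule applied to the AND function with bipolar inputs and targets (the bipolar encoding is a common textbook choice and an assumption here):

import numpy as np

# Bipolar AND: inputs and targets are -1/+1.
samples = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]])
targets = np.array([1, -1, -1, -1])

w = np.zeros(2)     # step 1: weights and bias start at zero
b = 0.0

for x, t in zip(samples, targets):      # steps 2-5: one pass over the training pairs
    w += x * t                          # wi(new) = wi(old) + xi * y
    b += t                              # b(new)  = b(old)  + y

print("weights:", w, "bias:", b)        # expected: w = [2, 2], b = -2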
Explain Expectation-Maximization Algorithm.
E-step (Expectation Step): Estimates missing or hidden values using current parameter
estimates.
M-step (Maximization Step): Updates model parameters to maximize the likelihood based on
the estimated values from the E-step.
This process repeats until the model reaches a stable solution, as the likelihood improves with each iteration. It is widely used in clustering (e.g. Gaussian Mixture Models) and in handling missing data.
The Expectation-Maximization algorithm can also be used for latent variables (variables that are not directly observable and are instead inferred from the values of other observed variables) in order to predict their values, provided that the general form of the probability distribution governing those latent variables is known to us.
Algorithm:
1.The algorithm starts with initial parameter values and assumes the observed data comes
from a specific model.
2. E-Step (Expectation Step): Find the missing or hidden data based on the current parameters.
Calculate the posterior probability of each latent variable based on the observed data.
Compute the log-likelihood of the observed data using the current parameter estimates.
3. M-Step (Maximization Step): Update the model parameters to maximize the log-likelihood. The better the model, the higher this value.
4. Convergence: Check if the model parameters are stable and converging.
If the changes in log-likelihood or parameters are below a set threshold, stop. If not, repeat the
E-step and M-step until convergence is reached
Uses:
1. It can be used to fill the missing data in a sample.
2. It can be used as the basis of unsupervised learning of clusters.
3. It can be used for discovering the values of latent variables.
Advantages of EM algorithm-
1. It is always guaranteed that likelihood will increase with each iteration.
2. The E-step and M-step are often pretty easy for many problems in terms of implementation.
3. Solutions to the M-steps often exist in the closed form.
Disadvantages of EM algorithm -
1. It has slow convergence.
2. It converges to the local optima only.
3. It requires both the forward and backward probabilities.
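A minimal sketch of EM in practice via a Gaussian Mixture Model (assuming scikit-learn, whose fit() runs the E and M steps internally until convergence; the two-component data is synthetic):

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two hidden Gaussian components; which component generated each point is a latent variable.
data = np.vstack([rng.normal(0, 1, size=(100, 2)), rng.normal(5, 1, size=(100, 2))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(data)   # EM runs inside fit()
print("estimated means:\n", gmm.means_.round(2))
print("responsibilities of the first point:", gmm.predict_proba(data[:1]).round(3))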
Dimensionality reduction techniques aim to simplify datasets by reducing the number of features while
preserving important information. This is crucial for improving model efficiency, accuracy, and
interpretability, especially when dealing with high-dimensional data. These techniques can be broadly
classified into feature selection and feature extraction.
Feature Selection
Feature selection chooses the most relevant features from the dataset without altering them. It helps
remove redundant or irrelevant features, improving model efficiency. There are several methods for
feature selection including filter methods, wrapper methods and embedded methods.
Filter methods rank the features based on their relevance to the target variable.
Wrapper methods use the model performance as the criteria for selecting features.
Embedded methods combine feature selection with the model training process.
Feature Extraction
Feature extraction transforms the original high-dimensional data into a lower-dimensional space by
creating new features that capture the essential information. This approach often results in new features
that are combinations of the original ones, potentially losing some original information but gaining a more
concise representation. Common feature extraction methods include:
Principal Component Analysis (PCA):
PCA is a widely used linear dimensionality reduction technique that identifies the principal components
(directions of greatest variance) in the data and projects the data onto these components, reducing the
dimensionality while preserving as much variance as possible.
Linear Discriminant Analysis (LDA):
LDA is another linear technique, but it's specifically designed for classification tasks. It identifies linear
combinations of features that best separate different classes.
In essence, dimensionality reduction is a powerful tool for simplifying complex datasets while preserving
the most important information, leading to more efficient, accurate, and interpretable models and
analyses
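A minimal PCA sketch reducing a 4-dimensional dataset to 2 principal components (assuming scikit-learn and its bundled iris dataset):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data                       # 150 samples, 4 features
pca = PCA(n_components=2)                  # keep the 2 directions of greatest variance
X_reduced = pca.fit_transform(X)

print("reduced shape:", X_reduced.shape)                      # (150, 2)
print("variance explained:", pca.explained_variance_ratio_)   # most variance in component 1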
A Random Forest is a collection of decision trees that work together to make predictions.
The Random Forest algorithm is a powerful tree-based learning technique in machine learning: it builds many decision trees and then combines their outputs by voting (for classification) or averaging (for regression). Random Forest is based on ensemble learning and is widely used for classification and regression tasks.
It is a type of classifier that uses many decision trees to make predictions.
It takes different random parts of the dataset to train each tree and then combines the results. This approach helps improve the accuracy of predictions.
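A minimal Random Forest sketch (assuming scikit-learn; the number of trees is an arbitrary choice and the data is synthetic):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0)  # 100 decision trees
forest.fit(X_train, y_train)                                       # each tree sees a bootstrap sample
print("test accuracy:", forest.score(X_test, y_test))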
Linear Discriminant Analysis (LDA) is a statistical technique used for classification and dimensionality reduction, especially when dealing with high-dimensional data. It works by finding the linear combinations of features that best separate the different classes or groups. When working with high-dimensional datasets, it is important to apply dimensionality reduction techniques to make data exploration and modelling more efficient; LDA reduces the dimensionality of the data while retaining the features that are most significant for classification tasks.
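A minimal LDA sketch using it both as a classifier and as a supervised dimensionality-reduction step (assuming scikit-learn and its bundled iris dataset):

from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)            # projects 4 features onto 2 discriminant axes

print("reduced shape:", X_lda.shape)       # (150, 2)
print("training accuracy:", lda.score(X, y))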
LDA has various real-world applications in fields like image recognition, medical diagnosis, and customer
segmentation. Applications of LDA:
Image Recognition: LDA can reduce the dimensionality of image data (e.g., face images) while preserving
the essential features for distinguishing between different individuals or objects. This makes it efficient
and accurate for tasks like face recognition.
Medical Diagnosis: LDA can be used to classify patients into different disease states or severities based
on medical data like blood test results or imaging scans. It helps identify patterns and relationships in
patient data, aiding in diagnosis and treatment decisions.
Customer Segmentation: LDA can help identify customer segments based on demographics, purchasing
behavior, or other relevant data. This allows businesses to tailor marketing strategies and improve
customer experiences.
Quality Control and Manufacturing: LDA can be used to identify defective products or processes by
classifying items as defective or non-defective based on various measurements or attributes.
Document Classification: LDA can categorize documents into different topics or classes based on their
content and keywords.
Email spam detection primarily relies on machine learning, specifically supervised learning techniques.
Algorithms are trained on labeled datasets (spam vs. non-spam emails) to learn patterns and classify
new emails accordingly. Popular algorithms used for spam detection include Naive Bayes, Support Vector
Machines (SVMs), and K-nearest neighbors (KNN).
1. Supervised Learning:
A machine learning approach where the algorithm learns from labeled data (spam/not spam) to make
predictions on new, unlabelled data.
This is particularly well-suited for email spam detection, as it allows the algorithm to learn the
characteristics of spam and non-spam emails.
3. Data Preprocessing:
Tokenization: Breaking down the email text into individual words or tokens.
Stop-word removal: Removing common words like "the," "a," "is" that don't carry much meaning for
classification.
Stemming/Lemmatization: Reducing words to their root form (e.g., "running," "runs" to "run").
Feature extraction: Identifying important features from the email text, such as the presence of certain
keywords, the length of the email, or the sender's reputation.
By combining machine learning algorithms with appropriate data preprocessing and evaluation
techniques, spam filters can effectively identify and filter out unwanted email
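A minimal spam-filter sketch combining the preprocessing and supervised-learning steps above (assuming scikit-learn; the tiny labelled email set is invented for illustration):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

emails = ["win a free prize now", "meeting agenda for monday",
          "cheap loans click here", "lunch with the project team"]
labels = [1, 0, 1, 0]                      # 1 = spam, 0 = not spam

# CountVectorizer handles tokenization and stop-word removal.
spam_filter = make_pipeline(CountVectorizer(stop_words="english"), MultinomialNB())
spam_filter.fit(emails, labels)

print(spam_filter.predict(["free prize waiting, click now"]))   # likely predicts spam (1)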
Support Vector Machine (SVM)
Support Vector Machine (SVM) is a supervised machine learning algorithm used for classification and
regression tasks. While it can handle regression problems, SVM is particularly well-suited for
classification tasks.
SVM aims to find the optimal hyperplane in an N-dimensional space to separate data points into different
classes. The algorithm maximizes the margin between the closest points of different classes.
Support Vector Machine (SVM) Terminology
Hyperplane: A decision boundary that separates data points of different classes, represented by the equation wx + b = 0 in linear classification. In 2D space it is a line, and in 3D space it is a plane. SVM aims to find the hyperplane that maximizes the margin, the distance between the hyperplane and the closest data points (support vectors) from each class.
Support Vectors: support vectors are the data points closest to the hyperplane that separates different
classes. They are crucial because they determine the position and orientation of the hyperplane and the
margin between classes, influencing the classification accuracy. Essentially, they "support" the decision
boundary.
Margin: The margin is the distance between the hyperplane (the decision boundary) and the closest data points from each class (the support vectors). SVMs aim to find the hyperplane that maximizes this margin, separating the classes with the largest possible gap. Maximizing the margin helps improve the generalization ability of the model and reduces the risk of overfitting.
Kernel: The kernel is a function that implicitly maps the data into a higher-dimensional space. SVMs aim to find the optimal hyperplane (a decision boundary) that separates different classes of data points; however, in many real-world scenarios the data is not linearly separable in the original feature space. This is where kernels come into play: they map the data into a higher-dimensional space where it may become separable. This "kernel trick" is crucial for SVM's ability to handle complex, non-linear relationships in data.
Hard Margin: Seeks to find the optimal hyperplane that maximizes the margin between the two classes,
ensuring all data points are correctly classified.
This approach is most suitable for linearly separable data, where a clear boundary exists between the
classes.
If data contains outliers or is not linearly separable, hard margin SVMs may fail to find a suitable
hyperplane, leading to overfitting.
Soft Margin SVM: Introduces slack variables to allow for misclassifications, making it more robust to
outliers and non-linearly separable data.
The model penalizes misclassified data points, but allows for some errors to find a hyperplane that
generalizes better to unseen data.
This approach is more flexible and can handle a wider range of datasets, including those with overlapping
classes.
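A minimal SVM sketch with a soft margin and an RBF kernel on data that is not linearly separable (assuming scikit-learn; the value of C and the synthetic data are illustrative):

from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)   # not linearly separable

clf = SVC(kernel="rbf", C=1.0)       # kernel trick + soft margin (C controls the penalty)
clf.fit(X, y)

print("support vectors per class:", clf.n_support_)
print("training accuracy:", clf.score(X, y))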
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a powerful clustering algorithm
that groups data points based on their density.
It identifies clusters as dense regions separated by less dense areas, effectively handling noise and
outliers without requiring a predefined number of clusters.
DBSCAN uses two main parameters: eps (a distance threshold) and minPts (the minimum number of
points required to form a dense region).
Advantages of DBSCAN:
DBSCAN doesn't require the user to specify the number of clusters in advance.
It effectively identifies and separates outliers, making it suitable for data cleaning.
DBSCAN can identify clusters of different shapes and sizes, unlike some other algorithms that assume
spherical clusters.
DBSCAN can be used to identify unusual or anomalous data points.
Disadvantages of DBSCAN:
Choosing the right eps and minPts parameters can be challenging and require experimentation.
DBSCAN may struggle with datasets where clusters have significantly varying densities.
Applications of DBSCAN:
Anomaly Detection: Identifying unusual patterns in data.
Spatial Data Analysis: Clustering geographical data points.
Image Segmentation: Grouping pixels with similar characteristics.
Customer Segmentation: Identifying customer groups with similar buying patterns.
In essence, DBSCAN is a powerful and versatile clustering algorithm that excels at identifying clusters of
arbitrary shape, handling noise, and finding anomalies in datasets.
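A minimal DBSCAN sketch showing the eps and minPts parameters and the noise label (-1) (assuming scikit-learn; the parameter values and the added outlier are illustrative):

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)
X = np.vstack([X, [[3.0, 3.0]]])                 # add one obvious outlier

db = DBSCAN(eps=0.2, min_samples=5).fit(X)       # eps = distance threshold, min_samples = minPts
print("cluster labels found:", set(db.labels_))  # -1 marks noise/outliers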
Discuss in detail Singular Value Decomposition (SVD).