
ML

(Q.1) What is machine learning, machine learning application & model? Steps
in developing a machine learning application.
Solution:
What is machine learning:
- Field of study that gives computers the ability to learn without being
explicitly programmed.
- Uses techniques to give machines the ability to “LEARN FROM DATA”,
without being explicitly programmed.
- Enables machines to make data-driven decisions rather than being explicitly
programmed to carry out a certain task.
- “A computer program is said to learn from experience E with respect to
some class of tasks T and performance measure P, if its performance at
tasks in T, as measured by P, improves with experience E.”

Applications of machine learning:


1. Image Processing
2. Speech Recognition
3. Traffic Prediction
4. Product recommendation
5. Self Driven Cars
6. Email Spam Filter
7. Virtual Personal Assistant
8. Online Fraud Detection
9. Stock Market Prediction

1) Image Processing: Image processing in machine learning involves using


algorithms to analyze and manipulate images for various applications.
This includes tasks like object detection, facial recognition, medical
imaging analysis, and autonomous vehicle navigation.
2) Speech Recognition: Speech recognition in machine learning refers to
the ability of algorithms to understand and interpret spoken language.
This technology enables devices like smartphones, virtual assistants (e.g.,
Siri, Alexa), and automated customer service systems to comprehend and
respond to human speech.
3) Traffic Predictions: Traffic prediction in machine learning involves
using algorithms to forecast traffic conditions, such as congestion levels
and travel times, in specific areas or along particular routes.
4) Product Recommendation: Product recommendation in machine
learning refers to the process of suggesting items or services to users
based on their preferences, behaviours, and past interactions. This
technology is widely used in e-commerce platforms, streaming services,
and social media platforms to help users discover relevant products,
content, or services tailored to their individual interests and needs.
Example: Netflix, Amazon, etc.
5) Self-driven Cars: Self-driving cars utilize machine learning to perceive
and navigate the environment without human intervention. Through
sensors such as cameras, lidar, and radar, the car collects data about its
surroundings, including other vehicles, pedestrians, and road signs.
6) Email Spam Filter: Email spam filters use machine learning to
automatically identify and classify incoming emails as either spam or
legitimate. By analyzing the content, sender information, and other
features of emails, machine learning algorithms learn to recognize
patterns associated with spam messages.
7) Virtual Personal Assistant: Virtual personal assistants leverage machine
learning to understand and respond to user commands and queries,
performing tasks such as setting reminders, scheduling appointments, and
providing information.
8) Online Fraud Detection: Online fraud detection uses machine learning
to identify and prevent fraudulent activities conducted over the internet.
By analyzing patterns and anomalies in user behavior, transaction data,
and other relevant factors, machine learning algorithms can detect
suspicious activities indicative of fraud, such as unauthorized access,
identity theft, or payment fraud.
9) Stock Market Prediction: Stock market prediction utilizes machine
learning algorithms to analyze historical market data, trends, and various
factors influencing stock prices to forecast future price movements.
However, it's important to note that stock market prediction is inherently
uncertain, and predictions may not always be accurate due to the
complexity and unpredictability of financial markets.
Steps in developing a machine learning application:

● Step 1: Gathering Data


Once you know exactly what you want and the equipment is in hand, it takes
you to the first real step of machine learning- Gathering Data.
This step is very crucial as the quality and quantity of data gathered will
determine how good the predictive model will turn out to be.
The data collected is then tabulated and used for training as well as testing.
Data gathering can be done via APIs, web scraping, or public datasets.

● Step 2: Data preparation


Data preparation involves adjusting and manipulating the data, for example
normalization, error correction, handling missing values, and converting the
data into the usable format needed for the algorithm.

Once all data is ready, it is loaded into a suitable place and then the order is
randomized as the order of data should not affect what is learned.
Lastly, Data set is divided into training and testing data sets.

● Step 3: Choosing a model


The next step is choosing an appropriate model as per requirement. There are
many models that researchers and data scientists have created over the years.
Some are very well suited for image data, others for sequences (like text, or
music), some for numerical data, others for text-based data.

● Step 4: Train the Algorithm


The training step is the core step, often considered the bulk of machine
learning, where the data is used to incrementally improve the model’s ability
to predict.
The training process involves initializing some random values for the model’s
parameters (say A and B), predicting the output with those values, comparing
those predictions with the actual outputs, and then adjusting the parameter
values so that the predictions move closer to the actual outputs.
This process then repeats, and each cycle of updating is called one training step.

● Step 5: Evaluation
In this step, the testing dataset kept aside earlier is used to evaluate the
model and measure its efficiency.
Evaluation tests the model against data that has never been seen or used for
training and is meant to be representative of how the model might perform in
the real world.

● Step 6: Hyper Parameter Tuning


One such hyperparameter is the learning rate, which defines how large a step
the algorithm takes at each update, based on the information from the previous
training step. These values will have an impact on how long the training will take.
For models that are more complex, initial conditions play a significant role in
the determination of the outcome of training. Differences can be seen depending
on whether a model starts off training with values initialized to zeros versus
some distribution of values, which then leads to the question of which
distribution is to be used.

● Step 7: Prediction / Use It


Finally, use the model to predict outcomes on new data. If the results are not
satisfactory, revisit the earlier steps.
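
As an illustration, here is a minimal sketch of the steps above using scikit-learn. The bundled breast-cancer dataset and the choice of logistic regression are assumptions made for the example, not part of the original text.

```python
# Minimal sketch of the ML workflow steps above (illustrative assumptions:
# scikit-learn's bundled breast-cancer data and a logistic regression model).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Step 1: gather data (a bundled public dataset stands in for data collected
# via an API, web scraping, or a CSV file)
X, y = load_breast_cancer(return_X_y=True)

# Step 2: data preparation - shuffle, split into train/test sets, and scale
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=42)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Steps 3 and 4: choose a model and train it on the training data
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Step 5: evaluation on data the model has never seen
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))

# Step 6: hyperparameter tuning would repeat steps 4-5 with different
# settings (for example, the regularization strength C of the model).

# Step 7: prediction / use it on new, incoming examples
print("predictions for 5 unseen samples:", model.predict(X_test[:5]))
```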
(Q.2) What is supervised vs unsupervised? Explain different types of machine
learning algorithms.
Solution:

1. Supervised Learning:
Supervised learning algorithms are trained on labeled datasets, where each data
instance is associated with a corresponding target or output variable. The
algorithm learns a mapping between the input features and the target variable,
enabling it to make predictions or classifications on unseen data
Working:
● During training, data in the form of input-output pairs is fed to the
learning algorithm one example at a time.
● The algorithm is then allowed to predict the output for each example and is
given feedback as to whether it predicted the right answer or not.
● Over time, the algorithm learns to approximate the exact nature of the
relationship between input-output pairs.
● When fully-trained, the supervised learning algorithm will be able to observe
a new, never-before-seen example and predict a correct label/output for it.
Pros:
● Clear specific objective
● Easy to measure accuracy – Since actual output is known it's easy to design a
performance metric for the system.
● Controlled training process – which in turn gives a very specific,
predictable behaviour.
Cons:
● Intensive Labor - data requires labelling before the model is trained, which
can take hours of human effort.
● Needs a large amount of data.
● Limited insights - no freedom for the machine to explore other possibilities

2. Unsupervised Learning:
Unsupervised learning algorithms work with unlabelled data, where there are no
predefined target variables. The algorithms explore the patterns and structures in
the data to find inherent relationships, clusters, or patterns
Working:
● The very first step is to load the unlabeled data into the system.
● Once the data is loaded into the system, the algorithm analyzes the data
● As the analysis gets completed, the algorithm will look for patterns depending
upon the behavior or attributes of the dataset.
● Once pattern identification and grouping are done, it gives the output.
Pros:
● Fast Process - since no data labelling is required, fewer human resources
are needed to perform the task.
● Unique insights – unique, disruptive insights for a business to consider as it
interprets data on its own.
Cons:
● Difficult to measure accuracy - it is not easy to measure the accuracy since we
don’t have any expected or desired outcome to compare to.
● Data Dimensionality - when the dimensionality of the data and the number of
variables grow and need to be reduced in order to work with that data, human
involvement becomes necessary to clean the data.

3. Semi-Supervised Learning:
Semi-supervised learning algorithms deal with partially labelled datasets, where
only a small portion of the data instances have labels. These algorithms leverage
both labelled and unlabelled data to learn patterns and make predictions. They
combine elements of supervised and unsupervised learning. Self-training, co-
training, and generative models (e.g., generative adversarial networks - GANs)
can be used for semi-supervised learning tasks.
Pros:
● Reduces time required for labelling massive data.
● Avoids human biases which can be introduced due to labelling.
● Using lots of unlabelled data during the training process improves the
accuracy of the final model while reducing the time and cost spent building it.

4. Reinforcement Learning:
Reinforcement learning algorithms involve training an agent to interact with an
environment and learn optimal actions through a trial-and-error process. The
agent receives feedback in the form of rewards or penalties based on its actions
and uses this feedback to learn and improve its decision-making policy.
Reinforcement learning is often used in scenarios where there is no labeled
dataset, and the agent learns by exploration and exploitation of the environment.
Pros:
● Reinforcement learning can be used to solve very complex problems that
cannot be solved by conventional techniques.
● This learning model is very similar to the learning of human beings. Hence, it
is close to achieving perfection.
Cons:
● Computation Heavy and Time Consuming.
● The curse of dimensionality limits reinforcement learning heavily for real
physical systems. The curse of dimensionality refers to various phenomena that
arise when analyzing and organizing data in high-dimensional spaces that do not
occur in low-dimensional settings.

(Q.3) Explain the difference between biological neurons and ANN.


Solution:

Definition of Artificial Neural Network


An artificial neural network is a mathematical model, essentially inspired by
the biological neuron system in the human brain. The neural network is built
from a number of processing elements interlinked by weighted pathways to form
networks. The result of each element is computed using a non-linear function
of its weighted inputs. When these processing elements are merged into
networks, they can employ arbitrarily complex non-linear functions for
problems such as classification, prediction or optimization.
Similar to the human brain, these artificial neural networks learn from
experience, generalise from examples and can retrieve essential data from
noisy data. They can work in parallel, at higher speed, and are fault tolerant.

Definition of Biological Neural Network


The biological neural network is also made up of multiple processing elements,
known as neurons, which are interconnected by synapses. These neurons accept
either external input or the output of other neurons. The generated output
from the various neurons propagates through the whole network to the final
layer, where the results can be presented to the real world.

Each synapse has a processing value and weight, which is determined at the
time of training of the network. The network’s performance and potency
completely depend on the number of neurons in the network, how they are
connected with each other (i.e. the topology) and the value of the weights
assigned to each synapse.

Comparison of artificial and biological neural networks:

● Processing: An artificial neural network processes information sequentially
and in a centralised manner; a biological neural network processes information
in parallel and in a distributed manner.
● Rate: Artificial neural networks process information at a faster pace;
biological neurons are slow in processing information.
● Size: An artificial neural network is small; a biological neural network is large.
● Complexity: An artificial neural network is incapable of performing complex
pattern recognition; the enormous size and complexity of the connections give
the brain the capability of performing complex tasks.
● Fault tolerance: An artificial neural network is intolerant to failure; a
biological neural network is implicitly fault tolerant.
● Control mechanism: In an artificial neural network, a control unit monitors
all computing-related activities; in a biological neural network, all the
processing is centrally controlled.
● Feedback: Not provided in an artificial neural network; provided in a
biological neural network.
(Q.4) Explain the underfitting and overfitting issues with machine learning
models.
Solution:

Underfitting and overfitting are common issues that can arise when training
machine learning models. These issues relate to the model's ability to generalize
well to new, unseen data. Let's explain underfitting and overfitting in detail:
Underfitting:
Underfitting occurs when a machine learning model is too simple or lacks the
capacity to capture the underlying patterns and relationships in the training data.
The model's performance is poor, both on the training data and new, unseen
data. It fails to learn the complexity of the problem and produces high bias.
Signs of underfitting include:
1. High training error: The model struggles to fit the training data accurately,
resulting in a high training error rate.
2. High testing error: The model's performance on new, unseen data is also poor,
leading to a high testing error rate.
3. Oversimplified predictions: The model makes overly simplistic assumptions
or predictions, disregarding important features or patterns in the data.
Underfitting can occur due to various reasons, including:
1. Model simplicity: Using a model with insufficient complexity or too few
parameters to capture the complexity of the data.
2. Insufficient training: Inadequate training data or insufficient training time,
preventing the model from learning effectively.
3. Feature scarcity: Lack of informative features or not capturing the relevant
aspects of the problem.
To address underfitting, you can consider the following approaches:
1. Increase model complexity: Use a more complex model with higher capacity
to capture the underlying patterns in the data.
2. Feature engineering: Add more relevant features or transform existing
features to improve the model's ability to learn.
3. Gather more data: Increase the size of the training dataset to provide the
model with more examples to learn from.
4. Reduce regularization: If regularization techniques are applied, such as L1 or
L2 regularization, consider reducing their strength to allow the model to fit the
data better.
Overfitting:
Overfitting occurs when a machine learning model becomes overly complex
and starts to memorize the noise or random fluctuations in the training data,
rather than learning the underlying patterns. The model fits the training data
extremely well but fails to generalize to new, unseen data.
Signs of overfitting include:
1. Low training error: The model achieves very low training error, as it is able to
fit the training data closely.
2. High testing error: The model performs poorly on new, unseen data, leading
to a high testing error rate.
3. Overly complex predictions: The model may produce overly complex or
erratic predictions, capturing noise rather than true patterns.
Overfitting can occur due to various reasons, including:
1. Model complexity: Using a model with excessive complexity or too many
parameters that allows it to fit noise in the training data.
2. Limited training data: Insufficient training examples may cause the model to
overgeneralize patterns in the available data, including noise.
3. Feature overfitting: Overfitting can also occur when the model is given too
many irrelevant or noisy features.
To address overfitting, you can consider the following approaches:
1. Reduce model complexity: Use a simpler model or apply regularization
techniques to limit the model's capacity to fit noise.
2. Feature selection: Identify and remove irrelevant or noisy features that may
contribute to overfitting.
3. Increase training data: Gather more training examples to provide a more
representative sample and reduce overfitting.
4. Regularization: Apply techniques like L1 or L2 regularization to introduce
constraints on the model's parameters and prevent overfitting.
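
To make the two failure modes concrete, here is a small illustrative experiment (an assumption of this write-up, not part of the original answer): fitting polynomials of different degrees to noisy data shows an underfit model, a better fit, and an overfit model, and adding L2 regularization tames the overfit one.

```python
# Illustrative sketch: underfitting vs. overfitting with polynomial regression.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 1, 30)).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(scale=0.2, size=30)  # noisy samples
X_test = np.linspace(0, 1, 100).reshape(-1, 1)
y_test = np.sin(2 * np.pi * X_test).ravel()                         # noise-free truth

for degree, label in [(1, "underfit (degree 1)"),
                      (5, "better fit (degree 5)"),
                      (15, "overfit (degree 15)")]:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X, y)
    print(label,
          "| train MSE:", round(mean_squared_error(y, model.predict(X)), 3),
          "| test MSE:", round(mean_squared_error(y_test, model.predict(X_test)), 3))

# One remedy for the overfit model: add L2 regularization (Ridge).
regularized = make_pipeline(PolynomialFeatures(15), Ridge(alpha=1e-3))
regularized.fit(X, y)
print("regularized degree-15 test MSE:",
      round(mean_squared_error(y_test, regularized.predict(X_test)), 3))
```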
(Q.5) Error Back Propagation Algorithm(diagram)(flow chart).

(Q.6) Least Square Regression (Definition, Explanation, Sum).


Solution:

Definition:
Least squares regression lines are a specific type of model that analysts
frequently use to display relationships in their data. Statisticians call it “least
squares” because it minimizes the residual sum of squares.
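
For reference, since the formula images are not reproduced here, the standard least-squares slope and intercept for a line y = b + mx are:

```latex
m = \frac{n\sum xy \;-\; \sum x \sum y}{n\sum x^{2} \;-\; \left(\sum x\right)^{2}},
\qquad
b = \bar{y} - m\,\bar{x}
```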

Example:

Suppose we have data on hours studied (x) and test scores (y). After computing
the column sums from the data, we plug those sums into the slope formula.

Now that we have the slope (m), we can find the y-intercept (b) for the line.

Let’s plug the slope and intercept values in the least squares regression line
equation:
y = 11.329 + 1.0616x

This linear equation matches the one that the software displays on the graph.
We can use this equation to make predictions. For example, if we want to
predict the score for studying 5 hours, we simply plug x = 5 into the equation:

y = 11.329 + 1.0616 * 5 = 16.637

Therefore, the model predicts that people studying for 5 hours will have an
average test score of 16.637.

(Q.7) Diagonalization sum.


(Q.8) Explain SVD in detail. What are the different applications of SVD?
Solution:

Singular Value Decomposition (SVD) is a matrix factorization technique that


decomposes a matrix into three separate matrices. It is widely used in various
applications of machine learning, data analysis, and linear algebra. Here's a
detailed explanation of SVD:

SVD is applicable to any m x n matrix, where m represents the number of rows


and n represents the number of columns. Given a matrix A, SVD decomposes it
into three matrices:
A = UΣV^T
where U is an m x m orthogonal matrix, Σ is an m x n diagonal matrix with non-
negative real numbers on the diagonal, and V^T is the transpose of an n x n
orthogonal matrix V.
The diagonal elements of Σ are known as singular values and are arranged in
descending order. The columns of U are called left singular vectors, and the
columns of V are called right singular vectors. The singular vectors represent
the directions or axes along which the matrix A has the most significant
influence.
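
As a minimal sketch (illustrative, using NumPy; the matrix values are arbitrary), the decomposition and a rank-k truncation look like this:

```python
# Minimal sketch of SVD with NumPy (illustrative; values are arbitrary).
import numpy as np

A = np.array([[3.0, 1.0, 1.0],
              [-1.0, 3.0, 1.0]])           # a 2 x 3 matrix, so m = 2, n = 3

# full_matrices=True returns U (m x m), the singular values, and V^T (n x n)
U, s, Vt = np.linalg.svd(A, full_matrices=True)

# Rebuild the m x n diagonal matrix Sigma from the singular values
Sigma = np.zeros(A.shape)
Sigma[:len(s), :len(s)] = np.diag(s)

# Check the factorization A = U Sigma V^T
print(np.allclose(A, U @ Sigma @ Vt))      # True

# Keeping only the largest k singular values gives the best rank-k
# approximation of A, which underlies the compression and
# dimensionality-reduction applications listed below.
k = 1
A_k = U[:, :k] @ Sigma[:k, :k] @ Vt[:k, :]
print(A_k)
```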
Applications of SVD include:

1. Dimensionality Reduction:
SVD is used for dimensionality reduction by reducing the number of features in
a dataset while preserving the most important information. It allows us to
identify the most relevant singular values and vectors, which can be used as a
reduced set of features for further analysis or modeling.
2. Image Compression:
SVD is used in image compression techniques, such as the JPEG format. The
matrix representing an image can be decomposed using SVD, and the singular
values can be truncated to retain only the most significant ones. This results in a
compressed representation of the image, reducing storage requirements without
losing significant visual information.
3. Data Denoising:
SVD can be utilized for denoising data by separating the signal from the noise.
By decomposing a matrix into its singular values and vectors, it is possible to
identify the dominant components (signal) and remove the smaller components
(noise). This is particularly useful in signal processing and data cleaning tasks.
(Q.9) Explain EM algorithm along with its application.
Solution:

The Expectation-Maximization (EM) algorithm is an iterative optimization


algorithm used to estimate the parameters of probabilistic models when there
are latent (unobserved) variables. It is commonly applied in unsupervised
machine learning tasks where the goal is to learn the underlying structure or
hidden patterns in the data. The EM algorithm consists of two main steps: the E-
step (Expectation step) and the M-step (Maximization step). It alternates
between these steps until convergence, optimizing the model parameters to
maximize the likelihood of the observed data. Here's a high-level overview of
the EM algorithm:
1. Initialization: Start by initializing the model parameters (e.g., mean,
variance, mixing proportions) either randomly or based on prior knowledge.
2. E-step (Expectation step): Given the current parameter estimates, calculate
the expected values (or posterior probabilities) of the latent variables. This step
involves computing the probability distribution over the latent variables given
the observed data and the current parameter values.
3. M-step (Maximization step): With the expected values of the latent
variables, update the model parameters to maximize the likelihood of the
observed data. This step involves adjusting the parameters to increase the
likelihood of the observed data, based on the computed expected values from
the E-step.
4. Convergence check: Evaluate the change in the model parameters after each
iteration. If the change is below a predefined threshold or the maximum number
of iterations is reached, terminate the algorithm. Otherwise, go back to step 2
and repeat the E-step and M-step.
The EM algorithm is particularly useful when dealing with incomplete or
partially observed data, where there are hidden variables that need to be
estimated. It provides a framework for estimating the parameters of models that
involve missing data, clustering, mixture models, and more.
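
As a concrete illustration of the E-step and M-step, here is a minimal sketch of EM for a two-component, one-dimensional Gaussian mixture; the data and initial values are assumptions made for the example.

```python
# Minimal sketch of the EM loop for a two-component 1-D Gaussian mixture.
import numpy as np

rng = np.random.RandomState(0)
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 200)])

# Step 1: initialization of means, variances and mixing proportions
mu, var, pi = np.array([-1.0, 1.0]), np.array([1.0, 1.0]), np.array([0.5, 0.5])

def gauss(x, mu, var):
    # Gaussian probability density function
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

for _ in range(100):
    # Step 2 (E-step): posterior probability of each component for each point
    resp = pi * np.stack([gauss(x, mu[k], var[k]) for k in range(2)], axis=1)
    resp /= resp.sum(axis=1, keepdims=True)

    # Step 3 (M-step): re-estimate parameters from the responsibilities
    Nk = resp.sum(axis=0)
    mu = (resp * x[:, None]).sum(axis=0) / Nk
    var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / Nk
    pi = Nk / len(x)

    # Step 4: a convergence check on the parameters or the log-likelihood
    # would normally terminate the loop early (omitted for brevity).

print("means:", mu, "variances:", var, "weights:", pi)
```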
Applications of EM algorithm:
1. Data Clustering:
The EM algorithm is widely used in machine learning for clustering tasks. It
helps estimate the parameters of clustering models, such as Gaussian Mixture
Models (GMM), which assume that the data points are generated from a
mixture of Gaussian distributions. The EM algorithm is used to iteratively
update the cluster assignments and estimate the parameters of the distributions.
2. Computer Vision and NLP:
The EM algorithm finds applications in computer vision tasks, such as image
segmentation and object recognition. It helps estimate the parameters of models
that capture the underlying structure and patterns in images. In natural language
processing (NLP), the EM algorithm can be used for tasks like part-of-speech
tagging, text categorization, and language modelling.
3. Mixed Models:
The EM algorithm is widely used in statistics, especially in mixed models. It
helps estimate the parameters in models with both fixed and random effects.
For example, in quantitative genetics, the EM algorithm is used to estimate the
genetic parameters in models that capture the genetic and environmental factors
affecting traits.
4. Psychometrics:
The EM algorithm is utilized in psychometrics for estimating item parameters
and latent abilities in item response theory (IRT) models. IRT models are used
to analyse responses to test items and estimate the underlying latent traits or
abilities of individuals.
5. Medical and Healthcare Applications:
The EM algorithm has applications in medical imaging, such as image
reconstruction and segmentation. It helps estimate the parameters of imaging
models to enhance the quality of medical images. In structural engineering, the
EM algorithm can be used for structural health monitoring, estimating the
hidden states or variables related to the structural behaviour.
6. Gaussian Density Estimation:
The EM algorithm can be used to estimate the parameters of a Gaussian
distribution, which is commonly used to model continuous data. It helps
estimate the mean and covariance matrix of the distribution based on the
observed data.

Advantages of EM algorithm:

1. Handling Missing Data:


EM algorithm is particularly useful when dealing with missing data. It provides
a way to estimate the missing values based on the observed data and model
parameters, allowing for more complete and accurate analysis.
2. Unsupervised Learning:
EM algorithm is commonly used in unsupervised learning tasks, where the goal
is to uncover hidden patterns or structures in the data. It can effectively estimate
the parameters of models with latent variables, enabling the discovery of
underlying relationships.
3. Maximum Likelihood Estimation:
EM algorithm is an iterative optimization method that maximizes the likelihood
of the observed data. It provides a principled approach to estimating model
parameters in situations where direct maximum likelihood estimation is
challenging due to the presence of latent variables.
4. Flexibility:
EM algorithm is a versatile tool that can be applied to various probabilistic
models. It can handle different types of distributions and can be adapted to suit
specific modeling requirements.

Disadvantages of EM algorithm:
1. Local Optima:
EM algorithm is sensitive to the initial parameter values, and it can
converge to local optima rather than the global optimum. Multiple runs with
different initializations may be required to mitigate this issue.
2. Convergence:
Although the EM algorithm aims to converge to an optimal solution, it is
not guaranteed to reach the global optimum. Convergence can be slow or may
not occur if the likelihood surface is complex or if the algorithm gets stuck in
suboptimal solutions.
3. Computational Complexity:
The computational complexity of the EM algorithm can be a drawback,
especially for large datasets or complex models. Each iteration involves
computing the expected values and updating the model parameters, which can
be computationally expensive and time-consuming.
4. Assumption of Model Correctness:
The EM algorithm assumes that the model structure and distribution
assumptions are correct. If the model is misspecified or the assumptions do not
hold, the parameter estimates obtained from the EM algorithm may be biased or
inaccurate.
5. Sensitivity to Outliers:
The EM algorithm can be sensitive to outliers or extreme observations in the
data. Outliers can disproportionately influence the parameter estimates, leading
to biased results.
(Q.10) What is the activation function? Explain the common activation
functions used in neural networks.
Solution:

An Activation Function is a mathematical function which decides whether a


neuron should be activated or not. This means that it will decide whether the
neuron’s input to the network is important or not. The primary role of the
Activation Function is to transform the summed weighted input from the node
into an output value to be fed to the next hidden layer or as output. Well, the
purpose of an activation function is to add non-linearity to the neural network.
Activation functions introduce an additional step at each layer during the
forward propagation, but the extra computation is worth it. Here is why:
Let’s suppose we have a neural network working without the activation
functions.
In that case, every neuron will only be performing a linear transformation on the
inputs using the weights and biases. It’s because it doesn’t matter how many
hidden layers we attach in the neural network; all layers will behave in the same
way because the composition of two linear functions is a linear function itself.
Although the neural network becomes simpler, learning any complex task is
impossible, and our model would be just a linear regression model.
Types of Activation function:

Binary Step Function


Binary step function depends on a threshold value that decides whether a neuron
should be activated or not. The input fed to the activation function is compared
to a certain threshold; if the input is greater than it, then the neuron is activated,
else it is deactivated, meaning that its output is not passed on to the next hidden
layer.
Mathematically it can be represented as: f(x) = 1 if x ≥ 0, and f(x) = 0 if x < 0.

Here are some of the limitations of binary step function:

● It cannot provide multi-value outputs—for example, it cannot be used for


multi-class classification problems.
● The gradient of the step function is zero, which causes a hindrance in the
backpropagation process.

Linear Activation Function

The linear activation function, also known as "no activation" or the "identity
function" (the input multiplied by 1.0), is where the activation is
proportional to the input. The function doesn't do anything to the weighted
sum of the input; it simply outputs the value it was given.

Mathematically it can be represented as: f(x) = x.

However, a linear activation function has two major problems :


● It’s not possible to use backpropagation as the derivative of the function is a
constant and has no relation to the input x.
● All layers of the neural network will collapse into one if a linear activation
function is used. No matter the number of layers in the neural network, the last
layer will still be a linear function of the first layer. So, essentially, a linear
activation function turns the neural network into just one layer.
Sigmoid / Logistic Activation Function
This function takes any real value as input and outputs values in the range of 0
to 1.
The larger the input (more positive), the closer the output value will be to 1.0,
whereas the smaller the input (more negative), the closer the output will be to
0.0, as shown below.

Mathematically it can be represented as: f(x) = 1 / (1 + e^(-x)).

Here’s why sigmoid/logistic activation function is one of the most widely used
functions:
● It is commonly used for models where we have to predict the probability as an
output. Since probability of anything exists only between the range of 0 and 1,
sigmoid is the right choice because of its range.
● The function is differentiable and provides a smooth gradient, i.e., preventing
jumps in output values. This is represented by an S-shape of the sigmoid
activation function.

Tanh Function (Hyperbolic Tangent)


Tanh function is very similar to the sigmoid/logistic activation function, and
even has the same S-shape with the difference in output range of -1 to 1. In
Tanh, the larger the input (more positive), the closer the output value will be to
1.0, whereas the smaller the input (more negative), the closer the output will be
to -1.0.
Mathematically it can be represented as: f(x) = (e^x - e^(-x)) / (e^x + e^(-x)).

Advantages of using this activation function are:


● The output of the tanh activation function is Zero centered; hence we can
easily map the output values as strongly negative, neutral, or strongly positive.
● Usually used in hidden layers of a neural network as its values lie between -1
and 1; therefore, the mean for the hidden layer comes out to be 0 or very close to it.
It helps in centring the data and makes learning for the next layer much easier.

ReLU Function
ReLU stands for Rectified Linear Unit.
Although it gives an impression of a linear function, ReLU has a derivative
function and allows for backpropagation while simultaneously making it
computationally efficient. The main catch here is that the ReLU function does
not activate all the neurons at the same time. The neurons will only be
deactivated if the output of the linear transformation is less than 0.

Mathematically it can be represented as: f(x) = max(0, x).

The advantages of using ReLU as an activation function are as follows:


● Since only a certain number of neurons are activated, the ReLU function is far
more computationally efficient when compared to the sigmoid and tanh
functions.
● ReLU accelerates the convergence of gradient descent towards the global
minimum of the loss function due to its linear, non-saturating property.
Leaky ReLU Function
Leaky ReLU is an improved version of ReLU function to solve the Dying
ReLU problem as it has a small positive slope in the negative area.

Mathematically it can be represented as: f(x) = max(αx, x), where α is a small
positive constant such as 0.01.

The advantages of Leaky ReLU are the same as that of ReLU, in addition to the
fact that it does enable backpropagation, even for negative input values.
By making this minor modification for negative input values, the gradient of the
left side of the graph comes out to be a non-zero value. Therefore, we would no
longer encounter dead neurons in that region.

Exponential Linear Units (ELUs) Function


Exponential Linear Unit, or ELU for short, is also a variant of ReLU that
modifies the slope of the negative part of the function.
ELU uses a log curve to define the negative values unlike the leaky ReLU with
a straight line.

Mathematically it can be represented as: f(x) = x for x ≥ 0, and
f(x) = α(e^x - 1) for x < 0.


ELU is a strong alternative to ReLU because of the following advantages:
● ELU becomes smooth slowly until its output equals -α, whereas ReLU
smoothes sharply.
● Avoids dead ReLU problem by introducing log curve for negative values of
input. It helps the network nudge weights and biases in the right direction.
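
For reference, here is a minimal NumPy sketch of the activation functions discussed above; the slope constant for Leaky ReLU and the α for ELU are common illustrative choices.

```python
# Illustrative NumPy definitions of the activation functions discussed above.
import numpy as np

def binary_step(x):
    return np.where(x >= 0, 1, 0)

def linear(x):
    return x

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def relu(x):
    return np.maximum(0, x)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1))

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for fn in (binary_step, linear, sigmoid, tanh, relu, leaky_relu, elu):
    print(fn.__name__, fn(z))
```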

(Q.11 & Q.12) Multivariate Linear Regression & Regularized Regression.


Solution:
(Q.13) What is the curse of dimensionality? Explain PCA dimensionality
reduction technique in detail.
Solution:

What is the curse of dimensionality?


In machine learning, a feature of an object can be an attribute or a characteristic
that defines it. Each feature represents a dimension, and a group of dimensions
creates a data point. This represents a feature vector that defines the data point
to be used by a machine learning algorithm(s). When we say increase in
dimensionality it implies an increase in the number of features used to describe
the data. For example, in the field of breast cancer research, age and the number of
cancerous nodes can be used as features to define the prognosis of the breast
cancer patient. These features constitute the dimensions of a feature vector. But
other factors like past surgeries, patient history, type of tumor and other such
features help a doctor to better determine the prognosis. In this case by adding
features, we are theoretically increasing the dimensions of our data.

The curse of dimensionality describes the explosive nature of increasing data
dimensions and the resulting exponential increase in the computational effort
required for processing and/or analysis, because the amount of data demanded
grows exponentially. An increase in the dimensions can, in theory, add more
information to the data and thereby improve its quality, but in practice it
increases the noise and redundancy during analysis.

As the dimensionality increases, the number of data points required for good
performance of any machine learning algorithm increases exponentially. When
the number of dimensions in a dataset increases, the volume of the space
represented by the data also increases. As a result, the density of the data points
decreases, and more data is required to represent the underlying surface or
function that best approximates the data.

The curse of dimensionality can cause a number of problems in various fields of


machine learning and data analysis, some common problems are:
1. Overfitting: As the number of dimensions increases, it becomes more
difficult to estimate the underlying distribution of the data, and as a result,
models may overfit the training data.
2. Poor generalization: Because high-dimensional data is often sparse, models
trained on such data may not generalize well to new data, resulting in poor
performance on unseen data.
3. Computational complexity: As the number of dimensions increases, the
computational complexity of many algorithms increases exponentially, making
it difficult to perform tasks such as optimization, inference, and learning on
high-dimensional data.
4. Difficulty in visualizing high-dimensional data: It's hard to visualize high-
dimensional data, which makes it difficult to understand the underlying patterns
and relationships in the data.
5. Difficulty in building robust models: In high-dimensional space, models
can be sensitive to small variations in the data, and it can be difficult to build
models that are robust to noise and outliers.

PCA
PCA (Principal Component Analysis) is a widely used technique for
dimensionality reduction in machine learning and data analysis. It aims to
transform a high-dimensional dataset into a lower-dimensional space while
retaining the most important information or patterns in the data.
By reducing the dimensionality of the data, PCA allows for easier visualization,
analysis, and modeling while preserving the most significant patterns or
structures in the data. The lower-dimensional space is constructed in such a way
that the first principal component explains the maximum variance, followed by
the second component, and so on. The principal components are orthogonal to
each other, meaning they are uncorrelated.

The steps involved in PCA are as follows:


1. Standardization: The first step is to standardize the data by subtracting the
mean and dividing by the standard deviation of each feature. This ensures that
all features have the same scale and prevents any single feature from
dominating the analysis.
2. Covariance Matrix: Next, the covariance matrix is computed from the
standardized data. The covariance matrix provides information about the
relationships between pairs of features and helps identify the directions in which
the data varies the most.
3. Eigendecomposition: The covariance matrix is then eigen decomposed to
obtain its eigenvectors and eigenvalues. The eigenvectors represent the principal
components of the data, which are the directions along which the data varies the
most. The corresponding eigenvalues indicate the amount of variance explained
by each principal component.
4. Selecting Principal Components: The next step involves selecting a subset
of the principal components based on their corresponding eigenvalues. This
selection can be based on a threshold, such as retaining the components that
explain a certain percentage of the total variance (e.g., 95% variance explained).
5. Projection: Finally, the selected principal components are used to project the
original data onto a lower-dimensional space. The projection involves
multiplying the standardized data by the selected principal components, which
results in a new set of transformed features or principal components.
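
A minimal NumPy sketch of these five steps, on assumed random data, might look like this:

```python
# Minimal sketch of the PCA steps above using NumPy (illustrative data).
import numpy as np

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 5))              # 100 samples, 5 features

# 1. Standardization
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Covariance matrix of the standardized features
cov = np.cov(X_std, rowvar=False)

# 3. Eigendecomposition (eigh is suitable for symmetric matrices)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]          # sort by decreasing variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 4. Select enough components to explain, e.g., 95% of the variance
explained = np.cumsum(eigvals) / eigvals.sum()
k = np.searchsorted(explained, 0.95) + 1

# 5. Projection onto the top-k principal components
X_reduced = X_std @ eigvecs[:, :k]
print("kept", k, "of", X.shape[1], "dimensions; shape:", X_reduced.shape)
```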

(Q.14) Design a hebb net to implement OR function (consider bipolar inputs


and target).
Solution:

(Q.15) Draw Delta learning rule (LMS-Widrow Hoff) model and explain it with
training process flow chart
Solution:
(Q.16) Ridge Regression v/s Lasso Regression.
Solution:
Aspect-by-aspect comparison of Ridge and Lasso regression:

● Regularization type: Ridge uses L2 regularization (a penalty on the square of
the coefficients); Lasso uses L1 regularization (a penalty on the absolute
value of the coefficients).
● Objective function: Ridge minimizes RSS + λ Σ_{j=1..p} (β_j)^2; Lasso
minimizes RSS + λ Σ_{j=1..p} |β_j|.
● Coefficient shrinkage: Ridge shrinks coefficients towards zero but does not
eliminate them completely; Lasso can shrink coefficients all the way to zero,
effectively performing variable selection.
● Impact of multicollinearity: Ridge is effective in reducing the impact of
multicollinearity by shrinking coefficients; Lasso can handle multicollinearity
but may select only one variable from a group of highly correlated predictors.
● Feature selection: Ridge does not inherently perform feature selection; Lasso
can perform feature selection by setting some coefficients to zero.
● Computational complexity: Ridge is generally less computationally intensive;
Lasso can be more computationally intensive, especially with a large number of
predictors.
● Interpretability: Ridge coefficients tend to be smaller but may not be
exactly zero; Lasso can lead to sparse models with fewer predictors, enhancing
interpretability.
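
A short illustrative comparison with scikit-learn (the synthetic data and penalty strengths are assumptions for the example) shows the key practical difference: Lasso sets some coefficients exactly to zero while Ridge typically does not.

```python
# Illustrative Ridge vs. Lasso comparison on assumed synthetic data.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso

X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=1.0).fit(X, y)

# Ridge shrinks coefficients but rarely makes them exactly zero;
# Lasso drives the coefficients of uninformative features to exactly zero.
print("ridge zero coefficients:", np.sum(ridge.coef_ == 0))
print("lasso zero coefficients:", np.sum(lasso.coef_ == 0))
```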
(Q.17) Artificial Neural Networks.
Solution:
Artificial Neural Networks contain artificial neurons which are called units.
These units are arranged in a series of layers that together constitute the whole
Artificial Neural Network in a system. A layer can have only a dozen units or
millions of units as this depends on how the complex neural networks will be
required to learn the hidden patterns in the dataset. Commonly, Artificial
Neural Network has an input layer, an output layer as well as hidden layers.
The input layer receives data from the outside world which the neural network
needs to analyze or learn about. Then this data passes through one or multiple
hidden layers that transform the input into data that is valuable for the output
layer. Finally, the output layer provides an output in the form of a response of
the Artificial Neural Networks to input data provided.
In the majority of neural networks, units are interconnected from one layer to
another. Each of these connections has weights that determine the influence of
one unit on another unit. As the data transfers from one unit to another, the
neural network learns more and more about the data which eventually results
in an output from the output layer.

Neural Networks Architecture

The structures and operations of human neurons serve as the basis for artificial
neural networks. It is also known as neural networks or neural nets. The input
layer of an artificial neural network is the first layer, and it receives input from
external sources and releases it to the hidden layer, which is the second layer.
In the hidden layer, each neuron receives input from the previous layer
neurons, computes the weighted sum, and sends it to the neurons in the next
layer. These connections are weighted, meaning the effect of each input from
the previous layer is scaled up or down by assigning a different weight to
each input; these weights are adjusted during the training process to improve
model performance.
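
A minimal sketch of this layered, weighted forward pass (the layer sizes, weights, and sigmoid activation are arbitrary illustrative choices):

```python
# Minimal sketch of a forward pass through a small layered network.
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

rng = np.random.RandomState(0)
x = rng.normal(size=(4,))                  # input layer: 4 features

# Hidden layer: each neuron computes a weighted sum of the inputs plus a bias
W1, b1 = rng.normal(size=(4, 3)), np.zeros(3)
hidden = sigmoid(x @ W1 + b1)

# Output layer: weighted sum of the hidden activations
W2, b2 = rng.normal(size=(3, 1)), np.zeros(1)
output = sigmoid(hidden @ W2 + b2)
print("network output:", output)
```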
(Q.18) Feature Selection method in dimensionality reduction.
Solution:
Feature selection is the process of selecting a subset of relevant features from a
larger set of available features in a dataset. By reducing the number of features,
feature selection can improve the efficiency, interpretability, and generalization
performance of machine learning models. The importance of feature selection
arises from the fact that not all features in a dataset may be equally relevant or
contribute significantly to the target variable. Irrelevant or redundant features
can introduce noise, increase model complexity, and potentially lead to
overfitting. Feature selection helps to mitigate these issues by focusing on the
most informative features, which can lead to improved model performance,
reduced training time, and better understanding of the underlying data.
Types of Feature Selection Methods in ML
Filter Methods

Filter methods select the most important features based on their statistical
properties. These methods are faster and less computationally expensive than
wrapper methods. When dealing with high-dimensional data, it is
computationally cheaper to use filter methods.

Following are some of the statistical properties:


● Chi-square Test
The Chi-square test is used for categorical features in a dataset. We calculate
Chi-square between each feature and the target and select the desired number of
features with the best Chi-square scores. In order to correctly apply the
chi-squared test to the relation between various features in the dataset and the target
variable, the following conditions have to be met: the variables have to be
categorical, sampled independently, and values should have an expected
frequency greater than 5.

● Fisher’s Score
Fisher score is one of the most widely used supervised feature selection
methods. The algorithm will return the ranks of the variables based on the
fisher’s score in descending order. We can then select the variables as per the
case.

● Correlation Coefficient
Correlation is a measure of the linear relationship between 2 or more variables.
Through correlation, we can predict one variable from the other. The logic
behind using correlation for feature selection is that good variables correlate
highly with the target. Furthermore, variables should be correlated with the
target but uncorrelated among themselves. If two variables are correlated, we
can predict one from the other. Therefore, if two features are correlated, the
model only needs one, as the second does not add additional information.

● Variance Threshold
The variance threshold is a simple baseline approach to feature selection. It
removes all features whose variance doesn’t meet some threshold. By default, it
removes all zero-variance features, i.e., features with the same value in all
samples. We assume that features with a higher variance may contain more
useful information, but note that we are not taking the relationship between
feature variables or feature and target variables into account, which is one of the
drawbacks of filter methods.
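
An illustrative use of two of these filter methods with scikit-learn (the dataset and the thresholds are assumptions made for the example):

```python
# Illustrative filter-method feature selection with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.feature_selection import VarianceThreshold, SelectKBest, chi2

X, y = load_iris(return_X_y=True)

# Variance threshold: drop features whose variance is below the threshold
X_var = VarianceThreshold(threshold=0.2).fit_transform(X)

# Chi-square test: keep the k features most related to the target
# (chi2 requires non-negative feature values, as for these measurements)
X_chi2 = SelectKBest(score_func=chi2, k=2).fit_transform(X, y)

print("original:", X.shape,
      "after variance threshold:", X_var.shape,
      "after chi-square selection:", X_chi2.shape)
```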
(Q.19) Perceptron Neural Network (PNN).
Solution:
The perceptron neural model is a foundational concept in neural networks.
While it is a simple model with a single layer, it serves as a building block for
more complex neural network architectures, such as multi-layer perceptrons
(MLPs). It is a simple algorithm for binary classification. It is based on the
concept of a biological neuron and mimics its basic functionality. The
perceptron takes a set of input features and assigns weights to each feature.
These weighted inputs are then passed through an activation function to produce
an output.

Following are the key components of perceptron neural network:


1. Input Features:
● The perceptron model receives input features, which are numerical
representations of the input data.
● Each input feature represents a specific attribute or characteristic of the data.

2. Weights:
● Each input feature is associated with a weight, which determines the
importance or contribution of that feature to the overall prediction.
● The weights control the strength and direction of the connections between the
input features and the perceptron's output.

3. Summation Function:
● The weighted sum of the input features is computed using a summation
function.
● The summation function multiplies each input feature by its corresponding
weight and then adds them up.

4. Activation Function:
● The output of the summation function is passed through an activation
function.
● The activation function introduces non-linearity and determines the output of
the perceptron.
● Common activation functions used in perceptrons include step function, sign
function, sigmoid function, and ReLU (Rectified Linear Unit) function.
5. Threshold/Bias:
● A threshold or bias term is added to the weighted sum before passing it
through the activation function.
● The threshold/bias allows the perceptron to make decisions based on whether
the weighted sum exceeds a certain threshold.
● It acts as an offset or bias, influencing the activation function's output.

6. Output:
● The output of the activation function represents the prediction or decision
made by the perceptron.
● It can be a binary output (e.g., 0 or 1) or a continuous output, depending on
the problem being solved.

7. Learning Algorithm:
● The perceptron model utilizes a learning algorithm to adjust the weights and
bias term during the training process.
● The learning algorithm updates the weights based on the prediction error and
a specified learning rate.
● One common learning algorithm for perceptrons is the perceptron learning
rule.
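
A minimal sketch of a single perceptron trained with the perceptron learning rule, using the AND function as an assumed toy example; because AND is linearly separable, the rule converges to weights and a bias that classify all four inputs correctly.

```python
# Minimal sketch: a perceptron trained with the perceptron learning rule
# on the AND function (illustrative toy example).
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
t = np.array([0, 0, 0, 1])                 # AND targets

w, b, lr = np.zeros(2), 0.0, 0.1           # weights, bias, learning rate

def predict(x):
    # step activation applied to the weighted sum plus bias
    return 1 if x @ w + b > 0 else 0

for _ in range(20):                        # a few passes over the data
    for xi, ti in zip(X, t):
        error = ti - predict(xi)           # prediction error (0, +1 or -1)
        w += lr * error * xi               # weight update
        b += lr * error                    # bias update

print("weights:", w, "bias:", b, "outputs:", [predict(xi) for xi in X])
```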
(Q.20) Hebbian learning rule.
Solution:
The Hebbian rule is a principle in neuroscience and unsupervised learning that
describes how synapses between neurons can be strengthened or weakened
based on their co-activation. The Hebbian rule states: "Cells that fire together,
wire together." In other words, if two neurons are repeatedly activated together,
the synaptic connection between them will be strengthened. Conversely, if two
neurons are rarely activated together, the synaptic connection will weaken or
even be eliminated.
The Hebb learning rule assumes that:
● If two neighbouring neurons are activated and deactivated at the same time,
then the weight connecting these neurons should increase.
● For neurons operating in the opposite phase, the weight between them should
decrease.
● If there is no signal correlation, the weight should not change.
When the inputs of both nodes are either positive or negative, a strong
positive weight exists between the nodes.
If the input of one node is positive and that of the other is negative, a
strong negative weight exists between the nodes.
Hebb’s learning:
If neuron Xj is near enough to excite neuron Yk and repeatedly participates in
its activation, the synaptic connection between these neurons is strengthened
and neuron Yk becomes more sensitive to stimuli from neuron Xj.

Rules of Hebbian Learning


● If two neurons on either side of the connection are activated synchronously,
then the weight of that connection is increased.
● If two neurons on either side of the connection are activated asynchronously,
then the weight of that connection is decreased.
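
As a worked illustration (and the basis of the Hebb net for the OR function asked in Q.14), here is a minimal sketch assuming bipolar inputs and targets and weights initialized to zero; the per-pattern update is w_new = w_old + x·t and b_new = b_old + t.

```python
# Minimal sketch: Hebb rule applied to the bipolar OR function of Q.14.
# Update after each pattern: w += x * t, b += t (weights start at zero).
import numpy as np

X = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]])   # bipolar inputs
t = np.array([1, 1, 1, -1])                          # bipolar OR targets

w, b = np.zeros(2), 0.0
for xi, ti in zip(X, t):
    w += xi * ti
    b += ti

print("weights:", w, "bias:", b)                     # w = [2. 2.], b = 2.0
print("outputs:", np.sign(X @ w + b))                # [ 1.  1.  1. -1.]
```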
(Q.21) Expectation-Maximization algorithm for clustering.
Solution: (same as Q.9)
