ML Solutions
(Q.1) What is machine learning, machine learning application & model? Steps
in developing a machine learning application.
Solution:
What is machine learning:
- Field of study that gives computers the ability to learn without being
explicitly programmed.
- Uses techniques to give machines the ability to “LEARN FROM DATA”,
without being explicitly programmed.
- Enables machines to make data-driven decisions rather than being explicitly
programmed to carry out a certain task.
- “A computer program is said to learn from experience E with respect to
some class of tasks T and performance measure P, if its performance at
tasks in T, as measured by P, improves with experience E.”
Once all data is ready, it is loaded into a suitable place and its order is
randomized, since the order of the data should not affect what is learned.
Lastly, the dataset is divided into training and testing sets.
● Step 5: Evaluation
In this step, the testing dataset that was kept aside is used to evaluate the
efficiency of the model.
Evaluation tests the model against data that has never been used for training; it
is meant to be representative of how the model might perform in the real world.
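As an illustration of the split-and-evaluate workflow described above, here is a minimal sketch assuming scikit-learn is available; the Iris dataset, logistic regression model, and 80/20 split are illustrative choices, not part of the original steps:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Shuffle and split so the order of samples does not affect learning;
# 20% of the data is kept aside for testing (used in Step 5: Evaluation).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=42
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Evaluate on the held-out test set, i.e. data never used for training.
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```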
1. Supervised Learning:
Supervised learning algorithms are trained on labeled datasets, where each data
instance is associated with a corresponding target or output variable. The
algorithm learns a mapping between the input features and the target variable,
enabling it to make predictions or classifications on unseen data.
Working:
● During training, the data is given as input–output pairs and fed to the
learning algorithm one example at a time.
● The algorithm is then allowed to predict the output for each example and is
given feedback on whether it predicted the right answer or not.
● Over time, the algorithm learns to approximate the true nature of the
relationship between the input–output pairs.
● When fully-trained, the supervised learning algorithm will be able to observe
a new, never-before-seen example and predict a correct label/output for it.
Pros:
● Clear specific objective
● Easy to measure accuracy – since the actual output is known, it is easy to
design a performance metric for the system.
● Controlled training process – which in turn yields very specific behaviour.
Cons:
● Intensive Labor - data requires labelling before the model is trained, which
can take hours of human effort.
● Needs a large amount of data.
● Limited insights – no freedom for the machine to explore other possibilities.
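To make the idea of learning from labelled input–output pairs concrete, here is a minimal sketch; the decision tree classifier and the toy "hours studied vs. pass/fail" data are assumptions for illustration only:

```python
from sklearn.tree import DecisionTreeClassifier

X_train = [[1], [2], [3], [6], [7], [8]]   # inputs: hours studied
y_train = [0, 0, 0, 1, 1, 1]               # labels: 0 = fail, 1 = pass

# Train on the labelled pairs, then predict a never-before-seen example.
clf = DecisionTreeClassifier().fit(X_train, y_train)
print(clf.predict([[5]]))                  # predicted label for 5 hours of study
```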
2. Unsupervised Learning:
Unsupervised learning algorithms work with unlabelled data, where there are no
predefined target variables. The algorithms explore the patterns and structures in
the data to find inherent relationships, clusters, or patterns.
Working:
● The very first step is to load the unlabeled data into the system.
● Once the data is loaded into the system, the algorithm analyzes the data.
● Once the analysis is complete, the algorithm looks for patterns depending on
the behaviour or attributes of the dataset.
● Once pattern identification and grouping are done, it gives the output.
Pros:
● Fast process – no data labelling is required, so fewer human resources are
needed to perform the task.
● Unique insights – the algorithm interprets the data on its own, which can
surface unique, disruptive insights for a business to consider.
Cons:
● Difficult to measure accuracy - it is not easy to measure the accuracy since we
don’t have any expected or desired outcome to compare to.
● Data dimensionality – when the number of dimensions and variables grows
large and must be reduced before the data can be worked with, human
involvement becomes necessary to clean the data.
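A minimal unsupervised sketch of the working described above, assuming scikit-learn's KMeans; the toy 2-D points are illustrative only:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],    # one natural group
              [8.0, 8.2], [7.9, 8.1], [8.3, 7.8]])   # another natural group

# No labels are provided; the algorithm groups samples by the patterns it finds.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # cluster assignments discovered from the data
print(kmeans.cluster_centers_)  # centres of the discovered groups
```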
3. Semi-Supervised Learning:
Semi-supervised learning algorithms deal with partially labelled datasets, where
only a small portion of the data instances have labels. These algorithms leverage
both labelled and unlabelled data to learn patterns and make predictions. They
combine elements of supervised and unsupervised learning. Self-training, co-
training, and generative models (e.g., generative adversarial networks - GANs)
can be used for semi-supervised learning tasks.
Pros:
● Reduces time required for labelling massive data.
● Avoids human biases which can be introduced due to labelling.
● Using lots of unlabelled data during the training process improves the
accuracy of the final model while reducing the time and cost spent building it.
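A minimal semi-supervised sketch, assuming scikit-learn's LabelPropagation, where unlabelled samples are marked with -1 and receive labels inferred from the few labelled ones; the toy data is illustrative only:

```python
import numpy as np
from sklearn.semi_supervised import LabelPropagation

X = np.array([[1.0], [1.1], [0.9], [8.0], [8.2], [7.9]])
y = np.array([0,    -1,    -1,    1,    -1,    -1])   # -1 marks unlabelled samples

model = LabelPropagation().fit(X, y)
print(model.transduction_)   # labels propagated to the unlabelled samples
```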
4. Reinforcement Learning:
Reinforcement learning algorithms involve training an agent to interact with an
environment and learn optimal actions through a trial-and-error process. The
agent receives feedback in the form of rewards or penalties based on its actions
and uses this feedback to learn and improve its decision-making policy.
Reinforcement learning is often used in scenarios where there is no labeled
dataset, and the agent learns by exploration and exploitation of the environment.
Pros:
● Reinforcement learning can be used to solve very complex problems that
cannot be solved by conventional techniques.
● This learning model is very similar to how human beings learn by trial and
error, so the learned behaviour can come close to human-level performance.
Cons:
● Computation Heavy and Time Consuming.
● The curse of dimensionality limits reinforcement learning heavily for real
physical systems. The curse of dimensionality refers to various phenomena that
arise when analyzing and organizing data in high-dimensional spaces that do not
occur in low-dimensional settings.
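As a sketch of the trial-and-error idea described above (not a standard benchmark), here is tabular Q-learning on a tiny assumed "walk to the goal" chain environment; the learning rate, discount factor, and exploration rate are illustrative choices:

```python
import random

n_states, n_actions = 5, 2                 # actions: 0 = move left, 1 = move right
Q = [[0.0] * n_actions for _ in range(n_states)]
alpha, gamma, epsilon = 0.1, 0.9, 0.2      # learning rate, discount, exploration

for episode in range(500):
    s = 0                                  # start at the leftmost state
    while s != n_states - 1:               # the rightmost state is the goal
        # Epsilon-greedy: mostly exploit the best known action, sometimes explore.
        if random.random() < epsilon:
            a = random.randrange(n_actions)
        else:
            a = max(range(n_actions), key=lambda act: Q[s][act])
        s_next = max(0, s - 1) if a == 0 else s + 1
        r = 1.0 if s_next == n_states - 1 else 0.0     # reward only at the goal
        # Q-learning update: nudge Q(s, a) towards reward + discounted best future value.
        Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
        s = s_next

print(Q)   # learned values favour moving right, towards the rewarding goal
```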
Each synapse has a processing value and a weight, which are determined during
training of the network. The network's performance and potency depend
entirely on the number of neurons in the network, how they are connected to
each other (i.e., the topology), and the value of the weights assigned to each
synapse.
Underfitting and overfitting are common issues that can arise when training
machine learning models. These issues relate to the model's ability to generalize
well to new, unseen data. Let's explain underfitting and overfitting in detail:
Underfitting:
Underfitting occurs when a machine learning model is too simple or lacks the
capacity to capture the underlying patterns and relationships in the training data.
The model's performance is poor, both on the training data and new, unseen
data. It fails to learn the complexity of the problem and produces high bias.
Signs of underfitting include:
1. High training error: The model struggles to fit the training data accurately,
resulting in a high training error rate.
2. High testing error: The model's performance on new, unseen data is also poor,
leading to a high testing error rate.
3. Oversimplified predictions: The model makes overly simplistic assumptions
or predictions, disregarding important features or patterns in the data.
Underfitting can occur due to various reasons, including:
1. Model simplicity: Using a model with insufficient complexity or too few
parameters to capture the complexity of the data.
2. Insufficient training: Inadequate training data or insufficient training time,
preventing the model from learning effectively.
3. Feature scarcity: Lack of informative features or not capturing the relevant
aspects of the problem.
To address underfitting, you can consider the following approaches:
1. Increase model complexity: Use a more complex model with higher capacity
to capture the underlying patterns in the data.
2. Feature engineering: Add more relevant features or transform existing
features to improve the model's ability to learn.
3. Gather more data: Increase the size of the training dataset to provide the
model with more examples to learn from.
4. Reduce regularization: If regularization techniques are applied, such as L1 or
L2 regularization, consider reducing their strength to allow the model to fit the
data better.
Overfitting:
Overfitting occurs when a machine learning model becomes overly complex
and starts to memorize the noise or random fluctuations in the training data,
rather than learning the underlying patterns. The model fits the training data
extremely well but fails to generalize to new, unseen data.
Signs of overfitting include:
1. Low training error: The model achieves very low training error, as it is able to
fit the training data closely.
2. High testing error: The model performs poorly on new, unseen data, leading
to a high testing error rate.
3. Overly complex predictions: The model may produce overly complex or
erratic predictions, capturing noise rather than true patterns.
Overfitting can occur due to various reasons, including:
1. Model complexity: Using a model with excessive complexity or too many
parameters that allows it to fit noise in the training data.
2. Limited training data: Insufficient training examples may cause the model to
fit patterns specific to the available sample, including its noise.
3. Feature overfitting: Overfitting can also occur when the model is given too
many irrelevant or noisy features.
To address overfitting, you can consider the following approaches:
1. Reduce model complexity: Use a simpler model or apply regularization
techniques to limit the model's capacity to fit noise.
2. Feature selection: Identify and remove irrelevant or noisy features that may
contribute to overfitting.
3. Increase training data: Gather more training examples to provide a more
representative sample and reduce overfitting.
4. Regularization: Apply techniques like L1 or L2 regularization to introduce
constraints on the model's parameters and prevent overfitting.
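The following sketch illustrates underfitting and overfitting by varying model complexity; the noisy quadratic data and the polynomial degrees are assumptions chosen only to make the effect visible:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(60, 1))
y = X[:, 0] ** 2 + rng.normal(scale=1.0, size=60)   # noisy quadratic relationship

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Degree 1 tends to underfit (high train and test error), degree 2 fits well,
# and degree 15 tends to overfit (low train error, higher test error).
for degree in (1, 2, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    train_err = mean_squared_error(y_tr, model.predict(X_tr))
    test_err = mean_squared_error(y_te, model.predict(X_te))
    print(f"degree={degree:2d}  train MSE={train_err:6.2f}  test MSE={test_err:6.2f}")
```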
(Q.5) Error Back Propagation Algorithm (diagram) (flow chart).
Definition:
Least squares regression lines are a specific type of model that analysts
frequently use to display relationships in their data. Statisticians call it “least
squares” because it minimizes the residual sum of squares.
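For reference, the standard least-squares estimates for a line y = b + mx (not reproduced from the original example, but the usual formulas that minimize the residual sum of squares) are:

```latex
m = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
b = \bar{y} - m\,\bar{x}
```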
Example:
Now that we have the slope (m), we can find the y-intercept (b) for the line.
Let’s plug the slope and intercept values in the least squares regression line
equation:
y = 11.329 + 1.0616x
This linear equation matches the one that the software displays on the graph.
We can use this equation to make predictions. For example, if we want to
predict the score for studying 5 hours, we simply plug x = 5 into the equation:
y = 11.329 + 1.0616(5) = 11.329 + 5.308 = 16.637
Therefore, the model predicts that people studying for 5 hours will have an
average test score of 16.637.
1. Dimensionality Reduction:
SVD is used for dimensionality reduction by reducing the number of features in
a dataset while preserving the most important information. It allows us to
identify the most relevant singular values and vectors, which can be used as a
reduced set of features for further analysis or modeling.
2. Image Compression:
SVD can be used for image compression via low-rank approximation. The
matrix representing an image can be decomposed using SVD, and the singular
values can be truncated to retain only the most significant ones. This results in a
compressed representation of the image, reducing storage requirements without
losing significant visual information.
3. Data Denoising:
SVD can be utilized for denoising data by separating the signal from the noise.
By decomposing a matrix into its singular values and vectors, it is possible to
identify the dominant components (signal) and remove the smaller components
(noise). This is particularly useful in signal processing and data cleaning tasks.
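A minimal sketch of truncated SVD for compression/denoising, assuming NumPy; the random matrix and the choice k = 2 are illustrative only:

```python
import numpy as np

rng = np.random.RandomState(0)
A = rng.rand(6, 5)                               # stand-in for an image or data matrix

U, S, Vt = np.linalg.svd(A, full_matrices=False) # singular value decomposition

k = 2                                            # number of singular values to keep
A_k = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]      # rank-k (compressed) approximation

print(S)                                         # singular values, largest first
print(np.linalg.norm(A - A_k))                   # reconstruction error of the approximation
```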
(Q.9) Explain EM algorithm along with its application.
Solution:
Advantages of EM algorithm:
Disadvantages of EM algorithm:
1. Local Optima:
EM algorithm is sensitive to the initial parameter values, and it can
converge to local optima rather than the global optimum. Multiple runs with
different initializations may be required to mitigate this issue.
2. Convergence:
Although the EM algorithm aims to converge to an optimal solution, it is
not guaranteed to reach the global optimum. Convergence can be slow or may
not occur if the likelihood surface is complex or if the algorithm gets stuck in
suboptimal solutions.
3. Computational Complexity:
The computational complexity of the EM algorithm can be a drawback,
especially for large datasets or complex models. Each iteration involves
computing the expected values and updating the model parameters, which can
be computationally expensive and time-consuming.
4. Assumption of Model Correctness:
The EM algorithm assumes that the model structure and distribution
assumptions are correct. If the model is misspecified or the assumptions do not
hold, the parameter estimates obtained from the EM algorithm may be biased or
inaccurate.
5. Sensitivity to Outliers:
The EM algorithm can be sensitive to outliers or extreme observations in the
data. Outliers can disproportionately influence the parameter estimates, leading
to biased results.
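As a usage illustration (not part of the original answer), scikit-learn's GaussianMixture fits a Gaussian mixture model with the EM algorithm; the toy 1-D data and the choice of two components are assumptions, and n_init restarts EM to reduce the risk of a poor local optimum, as discussed above:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(0)
# Two well-separated Gaussian clusters in one dimension.
X = np.concatenate([rng.normal(0, 1, 200), rng.normal(6, 1, 200)]).reshape(-1, 1)

# n_init=5 runs EM from several initialisations and keeps the best result.
gmm = GaussianMixture(n_components=2, n_init=5, random_state=0).fit(X)

print(gmm.means_.ravel())   # estimated component means (near 0 and 6)
print(gmm.weights_)         # estimated mixing proportions (near 0.5 each)
```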
(Q.10) What is the activation function? Explain the common activation
functions used in neural networks.
Solution:
Here’s why sigmoid/logistic activation function is one of the most widely used
functions:
● It is commonly used for models where we have to predict the probability as an
output. Since probability of anything exists only between the range of 0 and 1,
sigmoid is the right choice because of its range.
● The function is differentiable and provides a smooth gradient, preventing
jumps in output values; this is reflected in the S-shape of the sigmoid
activation function.
ReLU Function
ReLU stands for Rectified Linear Unit.
Although it gives an impression of a linear function, ReLU has a derivative
function and allows for backpropagation while simultaneously making it
computationally efficient. The main catch here is that the ReLU function does
not activate all the neurons at the same time. The neurons will only be
deactivated if the output of the linear transformation is less than 0.
The advantages of Leaky ReLU are the same as that of ReLU, in addition to the
fact that it does enable backpropagation, even for negative input values.
By making this minor modification for negative input values, the gradient of the
left side of the graph comes out to be a non-zero value. Therefore, we would no
longer encounter dead neurons in that region.
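Plain NumPy sketches of the activation functions discussed above; the 0.01 slope used for Leaky ReLU is a common default assumed here:

```python
import numpy as np

def sigmoid(x):
    # Squashes any real input into the range (0, 1); smooth and differentiable.
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    # Passes positive inputs through unchanged; outputs 0 for negative inputs.
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # Like ReLU, but negative inputs keep a small non-zero slope (alpha * x).
    return np.where(x > 0, x, alpha * x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigmoid(x))
print(relu(x))
print(leaky_relu(x))
```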
As the dimensionality increases, the number of data points required for good
performance of any machine learning algorithm increases exponentially. When
the number of dimensions in a dataset increases, the volume of the space
represented by the data also increases. As a result, the density of the data points
decreases, and more data is required to represent the underlying surface or
function that best approximates the data.
PCA
PCA (Principal Component Analysis) is a widely used technique for
dimensionality reduction in machine learning and data analysis. It aims to
transform a high-dimensional dataset into a lower-dimensional space while
retaining the most important information or patterns in the data.
By reducing the dimensionality of the data, PCA allows for easier visualization,
analysis, and modeling while preserving the most significant patterns or
structures in the data. The lower-dimensional space is constructed in such a way
that the first principal component explains the maximum variance, followed by
the second component, and so on. The principal components are orthogonal to
each other, meaning they are uncorrelated.
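A minimal PCA sketch, assuming scikit-learn; the Iris dataset and the choice of two components are illustrative only:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)          # 150 samples, 4 features

pca = PCA(n_components=2)                  # keep the first two principal components
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                     # (150, 2): lower-dimensional representation
print(pca.explained_variance_ratio_)       # variance explained by each component
```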
(Q.15) Draw Delta learning rule (LMS-Widrow Hoff) model and explain it with
training process flow chart
Solution:
(Q.16) Ridge Regression v/s Lasso Regression.
Solution:
● Regularization type – Ridge: L2 regularization (penalty on the square of the coefficients). Lasso: L1 regularization (penalty on the absolute value of the coefficients).
● Objective function – Ridge minimizes $\text{RSS} + \lambda \sum_{j=1}^{p} \beta_j^2$; Lasso minimizes $\text{RSS} + \lambda \sum_{j=1}^{p} |\beta_j|$.
● Coefficient shrinkage – Ridge shrinks coefficients towards zero but does not eliminate them completely; Lasso can shrink coefficients all the way to zero, effectively performing variable selection.
● Impact of multicollinearity – Ridge is effective in reducing the impact of multicollinearity by shrinking coefficients; Lasso can handle multicollinearity but may select only one variable from a group of highly correlated predictors.
● Feature selection – Ridge does not inherently perform feature selection; Lasso can perform feature selection by setting some coefficients to zero.
● Computational complexity – Ridge is generally less computationally intensive; Lasso can be more computationally intensive, especially with a large number of predictors.
● Interpretability – Ridge coefficients tend to be smaller but may not be exactly zero; Lasso can lead to sparse models with fewer predictors, enhancing interpretability.
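The contrast in the comparison above can be seen in a short sketch; the synthetic regression data and the penalty strength alpha = 1.0 are assumptions:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso

# 10 features, only 3 of which actually influence the target.
X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=1.0).fit(X, y)

print(np.round(ridge.coef_, 2))   # shrunk coefficients, but none exactly zero
print(np.round(lasso.coef_, 2))   # several coefficients driven exactly to zero
```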
(Q.17) Artificial Neural Networks.
Solution:
Artificial Neural Networks contain artificial neurons which are called units.
These units are arranged in a series of layers that together constitute the whole
Artificial Neural Network in a system. A layer can have just a dozen units or
millions of units, depending on how complex the neural network needs to be in
order to learn the hidden patterns in the dataset. Commonly, an Artificial
Neural Network has an input layer, an output layer as well as hidden layers.
The input layer receives data from the outside world which the neural network
needs to analyze or learn about. Then this data passes through one or multiple
hidden layers that transform the input into data that is valuable for the output
layer. Finally, the output layer provides an output in the form of a response of
the Artificial Neural Networks to input data provided.
In the majority of neural networks, units are interconnected from one layer to
another. Each of these connections has weights that determine the influence of
one unit on another unit. As the data transfers from one unit to another, the
neural network learns more and more about the data which eventually results
in an output from the output layer.
The structures and operations of human neurons serve as the basis for artificial
neural networks. It is also known as neural networks or neural nets. The input
layer of an artificial neural network is the first layer, and it receives input from
external sources and releases it to the hidden layer, which is the second layer.
In the hidden layer, each neuron receives input from the previous layer
neurons, computes the weighted sum, and sends it to the neurons in the next
layer. These connections are weighted, which means the effect of each input
from the previous layer is scaled up or down by the weight assigned to it; these
weights are adjusted during the training process to improve model performance.
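A sketch of a single forward pass through a tiny fully connected network (2 inputs, 3 hidden units, 1 output) matching the description above; the random weights and sigmoid activation are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.RandomState(0)
W1, b1 = rng.randn(3, 2), rng.randn(3)   # weights/biases: input layer -> hidden layer
W2, b2 = rng.randn(1, 3), rng.randn(1)   # weights/biases: hidden layer -> output layer

x = np.array([0.5, -1.2])                # input received from the outside world

h = sigmoid(W1 @ x + b1)                 # each hidden neuron: weighted sum + activation
y = sigmoid(W2 @ h + b2)                 # output layer's response to the input

print(y)
```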
(Q.18) Feature Selection method in dimensionality reduction.
Solution:
Feature selection is the process of selecting a subset of relevant features from a
larger set of available features in a dataset. By reducing the number of features,
feature selection can improve the efficiency, interpretability, and generalization
performance of machine learning models. The importance of feature selection
arises from the fact that not all features in a dataset may be equally relevant or
contribute significantly to the target variable. Irrelevant or redundant features
can introduce noise, increase model complexity, and potentially lead to
overfitting. Feature selection helps to mitigate these issues by focusing on the
most informative features, which can lead to improved model performance,
reduced training time, and better understanding of the underlying data.
Types of Feature Selection Methods in ML
Filter Methods
Filter methods select the most important features based on their statistical
properties. These methods are faster and less computationally expensive than
wrapper methods. When dealing with high-dimensional data, it is
computationally cheaper to use filter methods.
● Fisher’s Score
Fisher's score is one of the most widely used supervised feature selection
methods. The algorithm returns the ranks of the variables based on the Fisher's
score in descending order. We can then select the variables as required.
● Correlation Coefficient
Correlation is a measure of the linear relationship between 2 or more variables.
Through correlation, we can predict one variable from the other. The logic
behind using correlation for feature selection is that good variables correlate
highly with the target. Furthermore, variables should be correlated with the
target but uncorrelated among themselves. If two variables are correlated, we
can predict one from the other. Therefore, if two features are correlated, the
model only needs one, as the second does not add additional information.
● Variance Threshold
The variance threshold is a simple baseline approach to feature selection. It
removes all features whose variance doesn’t meet some threshold. By default, it
removes all zero-variance features, i.e., features with the same value in all
samples. We assume that features with a higher variance may contain more
useful information, but note that we are not taking the relationship between
feature variables or feature and target variables into account, which is one of the
drawbacks of filter methods.
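A minimal filter-method sketch using scikit-learn's VarianceThreshold, which by default removes zero-variance (constant) features; the toy matrix is illustrative only:

```python
from sklearn.feature_selection import VarianceThreshold

X = [[0, 2.0, 1],   # the first column is constant across all samples,
     [0, 1.5, 3],   # so it carries no information and is removed
     [0, 2.5, 2]]

selector = VarianceThreshold()        # default threshold = 0.0 (zero-variance features)
X_selected = selector.fit_transform(X)

print(X_selected)                     # only the two informative columns remain
print(selector.get_support())         # mask of kept features: [False  True  True]
```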
(Q.19) Perceptron Neural Network (PNN).
Solution:
The perceptron neural model is a foundational concept in neural networks.
While it is a simple model with a single layer, it serves as a building block for
more complex neural network architectures, such as multi-layer perceptrons
(MLPs). It is a simple algorithm for binary classification. It is based on the
concept of a biological neuron and mimics its basic functionality. The
perceptron takes a set of input features and assigns weights to each feature.
These weighted inputs are then passed through an activation function to produce
an output.
2. Weights:
● Each input feature is associated with a weight, which determines the
importance or contribution of that feature to the overall prediction.
● The weights control the strength and direction of the connections between the
input features and the perceptron's output.
3. Summation Function:
● The weighted sum of the input features is computed using a summation
function.
● The summation function multiplies each input feature by its corresponding
weight and then adds them up.
4. Activation Function:
● The output of the summation function is passed through an activation
function.
● The activation function introduces non-linearity and determines the output of
the perceptron.
● Common activation functions used in perceptrons include step function, sign
function, sigmoid function, and ReLU (Rectified Linear Unit) function.
5. Threshold/Bias:
● A threshold or bias term is added to the weighted sum before passing it
through the activation function.
● The threshold/bias allows the perceptron to make decisions based on whether
the weighted sum exceeds a certain threshold.
● It acts as an offset or bias, influencing the activation function's output.
6. Output:
● The output of the activation function represents the prediction or decision
made by the perceptron.
● It can be a binary output (e.g., 0 or 1) or a continuous output, depending on
the problem being solved.
7. Learning Algorithm:
● The perceptron model utilizes a learning algorithm to adjust the weights and
bias term during the training process.
● The learning algorithm updates the weights based on the prediction error and
a specified learning rate.
● One common learning algorithm for perceptrons is the perceptron learning
rule.
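A sketch of the perceptron learning rule on the linearly separable AND function; the learning rate and number of epochs are illustrative assumptions:

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])                        # target: logical AND of the inputs

w, b, lr = np.zeros(2), 0.0, 0.1                  # weights, bias, learning rate

for epoch in range(20):
    for xi, target in zip(X, y):
        prediction = 1 if xi @ w + b > 0 else 0   # step activation on the weighted sum
        error = target - prediction
        w += lr * error * xi                      # perceptron learning rule
        b += lr * error

print(w, b)
print([1 if xi @ w + b > 0 else 0 for xi in X])   # reproduces the AND labels
```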
(Q.20) Hebbian learning rule.
Solution:
The Hebbian rule is a principle in neuroscience and unsupervised learning that
describes how synapses between neurons can be strengthened or weakened
based on their co-activation. The Hebbian rule states: "Cells that fire together,
wire together." In other words, if two neurons are repeatedly activated together,
the synaptic connection between them will be strengthened. Conversely, if two
neurons are rarely activated together, the synaptic connection will weaken or
even be eliminated.
The Hebb learning rule assumes that:
● If two neighbouring neurons are activated and deactivated at the same time,
then the weight connecting these neurons should increase.
● For neurons operating in the opposite phase, the weight between them should
decrease.
● If there is no signal correlation, the weight should not change.
When the inputs of both nodes are either both positive or both negative, a
strong positive weight develops between the nodes.
If the input of one node is positive and that of the other is negative, a strong
negative weight develops between the nodes.
Hebb’s learning:
If neuron Xj is near enough to excite neuron Yk and repeatedly participates in
its activation, the synaptic connection between these neurons is strengthened
and neuron Yk becomes more sensitive to stimuli from neuron Xj.
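A toy sketch of the Hebbian update, where the weight between two units grows when their activations agree in sign and shrinks when they differ; the learning rate and bipolar (+1/-1) activations are assumptions:

```python
import numpy as np

x = np.array([ 1, -1,  1,  1])   # pre-synaptic (Xj) activations over four steps
y = np.array([ 1, -1, -1,  1])   # post-synaptic (Yk) activations over the same steps

w, lr = 0.0, 0.5
for xi, yi in zip(x, y):
    w += lr * xi * yi            # Hebb rule: delta_w = lr * x * y
    print(w)                     # the weight strengthens when Xj and Yk fire together
```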