Need For Machine Learning
Machine learning enables a machine to automatically learn from data, improve performance from
experiences, and predict things without being explicitly programmed.
3. Traffic prediction:
When we want to visit a new place, we take the help of Google Maps, which shows us
the correct path with the shortest route and predicts the traffic conditions. It does so using:
o Real-time location of the vehicle from the Google Maps app and sensors
o Average time taken on past days at the same time of day
Everyone who uses Google Maps is helping to make the app better. It takes
information from the user and sends it back to its database to improve
performance.
4. Product recommendations:
Machine learning is widely used by various e-commerce and entertainment
companies, such as Amazon and Netflix, for product recommendation to
the user. Whenever we search for a product on Amazon, we start
getting advertisements for the same product while surfing the internet on the
same browser, and this is because of machine learning.
o Content Filter
o Header filter
o General blacklists filter
o Rules-based filters
o Permission filters
These assistants record our voice instructions, send them to the server on a
cloud, decode them using ML algorithms, and act accordingly.
8. Online Fraud Detection:
Machine learning is making our online transactions safe and secure by
detecting fraudulent transactions. Whenever we perform an online transaction,
there are various ways a fraudulent transaction can take place, such
as fake accounts, fake IDs, and stealing money in the middle of a
transaction. To detect this, a feed-forward neural network helps us by
checking whether it is a genuine transaction or a fraudulent one.
For each genuine transaction, the output is converted into some hash values,
and these values become the input for the next round. For each genuine
transaction there is a specific pattern, which gets changed for a fraudulent
transaction; hence the network detects it and makes our online transactions more
secure.
1. Automation
2. Scope of Improvement
1. Data Acquisition
The whole concept of machine learning is about identifying useful data. The
outcome will be incorrect if a credible data source is not provided. The
quality of the data is also significant: if the user or institution needs
higher-quality data, it has to wait for it, which causes delays in producing
the output. So, machine learning significantly depends on the data and its quality.
2. Time and Resources
The data that machines process remains huge in quantity and differs greatly.
Machines require time so that their algorithms can adjust to the environment
and learn from it. Trial runs are held to check the accuracy and reliability of the
machine. It requires massive and expensive resources and high-quality
expertise to set up that quality of infrastructure. Trial runs are costly in
terms of both time and expense.
3. Results Interpretation
Errors committed during the initial stages are huge and, if not corrected at
that time, create havoc. Bias and wrongness have to be dealt with
separately; they are not interconnected. Machine learning depends on two
factors, i.e., data and algorithm. All the errors are dependent on these two
variables, and any incorrectness in either variable would have huge repercussions
on the output.
5. Social Changes
8. Highly Expensive
This software is highly expensive, and not everybody can own it.
Government agencies, big private firms, and enterprises mostly own it. It
needs to be made accessible to everybody for wide use.
9. Privacy Concern
As we know, one of the pillars of machine learning is data. The collection
of data has raised the fundamental question of privacy. The way data is
collected and used for commercial purposes has always been a contentious
issue. In India, the Supreme Court has declared privacy a
fundamental right of Indians. Without the user's permission, data cannot be
collected, used, or stored. However, many cases have come up in which big firms
collect data without the user's knowledge and use it for their
commercial gains.
Machine learning is an evolving concept. This area has not yet seen any major
developments that fully revolutionize any economic sector. The area
requires continuous research and innovation.
Technological Singularity:
Although this topic attracts a lot of public attention, many scientists
are not concerned with the notion of AI exceeding human intelligence anytime
in the immediate future. This is often referred to as superintelligence,
which Nick Bostrom defines as "any intelligence that far
surpasses the top human brains in virtually every field, which includes
general wisdom, scientific creativity and social abilities." In spite of the fact
that superintelligence and strong AI are not yet a reality,
the concept poses some interesting questions when we contemplate the
potential use of autonomous systems, such as self-driving vehicles. It is
unrealistic to expect that a car with no driver would never be involved in an
accident, but who would be responsible and liable in those
situations? Should we continue to develop fully autonomous vehicles, or
should we restrict the use of this technology to semi-autonomous
cars that promote driver safety? The jury is still out on this issue.
However, these kinds of ethical debates are being fought as new and
genuine AI technology is developed.
AI Impact on Jobs:
While much of the public perception of artificial intelligence centres
around job loss, the concern should probably be reframed. With each new and
disruptive technology, we see shifts in the demand for certain job roles.
For instance, in the automotive industry, many
manufacturers such as GM are focusing their efforts on electric vehicles to align
with green policies. The energy sector isn't going away, but its primary
source of fuel is shifting from a fuel-based economy to an
electric one. Artificial intelligence should be viewed in a similar way:
it is expected to shift the demand for jobs to other
areas. There will need to be people who can manage these systems as data expands
and changes every day. Resources will still be needed to solve more
complex problems in the sectors most likely to be affected by demand
shifts, such as customer service. The most important aspect of artificial
intelligence and its impact on the job market will be helping
individuals transition to the new areas of market demand.
Privacy:
Privacy is frequently discussed in the context of data privacy, data
protection, and data security. These concerns have allowed policymakers to advance
their efforts in recent years. For instance, in 2016, GDPR legislation was introduced
to safeguard the personal data of individuals in the
European Union and European Economic Area, giving individuals more
control over their data. Within the United States, individual states are
developing policies, such as the California Consumer Privacy Act (CCPA), which
require businesses to inform consumers about the collection and processing of their
data. This legislation is forcing companies to rethink how they handle
and store personally identifiable information (PII). As a result, security
investments have become a business priority, as companies try to eliminate any
vulnerabilities and opportunities for surveillance, hacking, and cyberattacks.
Bias and Discrimination:
Discrimination and bias in different intelligent machines have brought up
several ethical issues about using artificial intelligence. How can we protect
ourselves from bias and discrimination when training data could be biased?
While most companies have well-meaning intentions with regard to their
automation initiatives, Reuters highlights the unexpected effects of
incorporating AI into hiring practices. In its effort to automate and simplify
the process, Amazon unintentionally biased potential candidates by
gender for technical roles, which led the company to scrap the project.
As events like these come to light, Harvard Business Review (link resides
outside IBM) has raised pertinent questions about the use of AI in
hiring practices: for example, what kind of data should you be able to analyse when
evaluating a candidate for a particular job?
Discrimination and bias aren't just limited to the human resource function.
They are present in a variety of applications ranging from software for facial
recognition to algorithms for social media.
Accountability:
There is no significant legislation to regulate AI practices, and no real
enforcement mechanism to ensure that ethical AI is practised. The current
incentives for companies to adhere to these standards are the negative
repercussions of an unethical AI system on their bottom line. To fill the gap,
ethical frameworks have emerged as part of a collaboration between ethicists and
researchers to govern the construction and distribution of AI models. However, for
the time being, they only serve to provide guidance for the development of AI models.
Research shows that the combination of distributed responsibility and a lack of
foresight into potential consequences is not conducive to preventing harm to society.
Based on the methods and way of learning, machine learning is divided into
mainly four types, which are:
o Classification
o Regression
a) Classification
b) Regression
o Since supervised learning works with a labelled dataset, we can have an
exact idea about the classes of objects.
o These algorithms are helpful in predicting the output on the basis of prior
experience.
Disadvantages:
o Image Segmentation:
Supervised Learning algorithms are used in image segmentation. In this
process, image classification is performed on different image data with pre-
defined labels.
o Medical Diagnosis:
Supervised algorithms are also used in the medical field for diagnosis
purposes. It is done by using medical images and past labelled data with
labels for disease conditions. With such a process, the machine can identify a
disease for the new patients.
o Fraud Detection - Supervised Learning classification algorithms are used
for identifying fraud transactions, fraud customers, etc. It is done by using
historic data to identify the patterns that can lead to possible fraud.
o Spam detection - In spam detection & filtering, classification algorithms are
used. These algorithms classify an email as spam or not spam. The spam
emails are sent to the spam folder.
o Speech Recognition - Supervised learning algorithms are also used in
speech recognition. The algorithm is trained with voice data, and various
identifications can be done using the same, such as voice-activated
passwords, voice commands, etc.
In unsupervised learning, the models are trained with the data that is neither
classified nor labelled, and the model acts on that data without any
supervision.
So, now the machine will discover patterns and differences on its own, such as colour
difference and shape difference, and predict the output when it is tested with the
test dataset.
1) Clustering
The clustering technique is used when we want to find the inherent groups
from the data. It is a way to group the objects into a cluster such that the
objects with the most similarities remain in one group and have fewer or no
similarities with the objects of other groups. An example of the clustering
algorithm is grouping the customers by their purchasing behaviour.
2) Association
3. Semi-Supervised Learning
Semi-Supervised learning is a type of Machine Learning algorithm
that lies between Supervised and Unsupervised machine learning. It
represents the intermediate ground between Supervised (With Labelled
training data) and Unsupervised learning (with no labelled training data)
algorithms and uses the combination of labelled and unlabeled datasets
during the training period.
Disadvantages:
4. Reinforcement Learning
Reinforcement learning works on a feedback-based process, in
which an AI agent (a software component) automatically explores its
surroundings by hit and trial, taking actions, learning from
experiences, and improving its performance. The agent gets rewarded for
each good action and punished for each bad action; hence the goal of a
reinforcement learning agent is to maximize the rewards.
Disadvantage
Mathematical foundations:
Let’s dive deeper for each subject to know what they are.
Linear Algebra
What is Linear Algebra? This is a branch of mathematics that
concerns the study of vectors and certain rules to
manipulate them. When we formalize intuitive
concepts, the common approach is to construct a set of
objects (symbols) and a set of rules to manipulate these
objects. This is what we know as algebra.
Vector
Matrix
Linear Equation
Distance Function
Inner Product
Matrix Decomposition
Matrix Decomposition is a study concerning ways of
reducing a matrix into its constituent parts. Matrix
Decomposition aims to simplify more complex matrix
operations by performing them on the decomposed matrix
rather than on the original matrix.
Vector Calculus
Calculus is the mathematical study concerned with
continuous change, which mainly consists of functions and
limits. Vector calculus itself is concerned with the
differentiation and integration of vector fields. Vector
calculus is often called multivariate calculus, although it
has a slightly different study case: multivariate calculus
deals with the application of calculus to functions of
multiple independent variables.
Partial Derivative
Gradient
Optimization
In the learning objective, training a machine learning model
is all about finding a good set of parameters. What we
consider “good” is determined by the objective function or
the probabilistic models. This is what optimization
algorithms are for; given an objective function, we try to
find the best value.
Conclusion
Machine Learning is an everyday tool that data scientists use
to obtain the valuable patterns we need. Learning the math
behind machine learning can give you an edge in your
work. There are many math subjects out there, but the
following matter the most when we are starting to learn
machine learning math:
Linear Algebra
Analytic Geometry
Matrix Decomposition
Vector Calculus
Optimization
What is Bayes Theorem?
Bayes' theorem is one of the most popular machine learning concepts; it
helps to calculate the probability of one event occurring, under uncertain
knowledge, given that another event has already occurred.
Bayes' theorem can be derived using the product rule and the conditional probability
of event X with known event Y:
P(X|Y) = P(Y|X) P(X) / P(Y)
Here, P(X) and P(Y) are the prior probabilities of X and Y, P(Y|X) is the likelihood of
Y given X, and P(X|Y) is the posterior probability of X given Y.
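As a minimal numerical sketch of applying the theorem (the prior, likelihood, and evidence values below are invented for illustration, not from the notes):
def bayes_posterior(p_y_given_x, p_x, p_y):
    # Bayes' theorem: P(X|Y) = P(Y|X) * P(X) / P(Y)
    return p_y_given_x * p_x / p_y

# hypothetical example: likelihood 0.90, prior 0.01, evidence 0.108
print(bayes_posterior(p_y_given_x=0.90, p_x=0.01, p_y=0.108))  # ~0.083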
Decision theory is a study of an agent's rational choices that supports all kinds of
progress in technology such as work on machine learning and artificial intelligence.
Decision theory looks at how decisions are made, how multiple decisions influence one
another, and how decision-making parties deal with uncertainty.
There are two branches of decision theory – Normative Decision Theory and Optimal Decision
Theory.
There are 4 basic elements in decision theory: acts, events, outcomes, and payoffs.
The greater the degree of surprise in the statements, the greater the
information contained in the statements. For example, let’s say commuting
from place A to B takes 3 hours on average and is known to everyone. If
somebody makes this statement, the statement provides no information at all
as this is already known to everyone. Now, if someone says that it takes 2
hours to go from place A to B provided a specific route is taken, then this
statement consists of good bits of information as there is an element of surprise
in the statement.
The extent of information required to describe an event depends upon the
possibility of occurrence of that event. If the event is a common event, not
much information is required to describe the event. However, for unusual
events, a good amount of information will be needed to describe such
events. Unusual events have a higher degree of surprise and hence greater
associated information.
The amount of information associated with event outcomes depends upon the
probability distribution associated with that event. In other words, the amount
of information is related to the probability distribution of event outcomes.
Recall that the event and its outcomes can be represented as the different
values of the random variable, X from the given sample space. And, the random
variable has an associated probability distribution with a probability associated
with each outcome including the common outcomes consisting of less
information and rare outcomes consisting of a lot of information. The higher
the probability of an event outcome, the lesser the
information contained if that outcome happens. The smaller the probability
of an event outcome, the greater the information contained if that
outcome with lesser probability happens.
How do we measure the information?
There are the following requirements for measuring the information associated
with events:
What is Entropy?
Entropy represents the amount of information associated with a random
variable as a function of the probability distribution of that random variable,
whether that distribution is a probability density function (PDF) or a
probability mass function (PMF). The following is the formula for the entropy
of a discrete random variable:
H(X) = − Σ p(x) log p(x), where the sum runs over all possible values x of X.
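A short sketch of this calculation in Python (the probability vectors below are made-up example distributions):
import numpy as np

def entropy(p, base=2):
    # H(X) = -sum(p * log(p)) over outcomes with non-zero probability
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p)) / np.log(base)

print(entropy([0.5, 0.5]))    # 1.0 bit: maximum surprise for two equally likely outcomes
print(entropy([0.99, 0.01]))  # ~0.08 bits: a very common outcome carries little information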
The performance of the machine learning models depends upon how close is
the estimated probability distribution of the random variable (representing the
response variable of ML models) against their true probability distribution.
This can be measured in terms of the entropy loss between the true probability
distribution and the estimated probability distribution of the response variable.
This is also termed cross-entropy loss as it represents entropy loss between
two probability distributions. Recall that entropy can be calculated as a
function of probability distribution related to different outcomes of the random
variables.
The goal of training a classification machine learning model is to come up with
a model which predicts the probability of the response variable belonging to
different classes, as close to true probability. If the model predicts the class as
0 when the true class is 1, the entropy is very high. If the model predicts the
class as 0 when the true class is 0, the entropy is very low. The goal is to
minimize the difference between the estimated probability and true probability
that a particular data set belongs to a specific class. In other words, the goal is
to minimize the cross-entropy loss – the difference between the true and
estimated probability distribution of the response variable (random variable).
The goal is to maximize the occurrence of the data set including the predictor
dataset and response data/label. In other words, the goal is to estimate the
parameters of the models that maximize the occurrence of the data set. The
occurrence of the dataset can be represented in terms of probability. Thus,
maximizing the occurrence of the data set can be represented as maximizing
the probability of occurrence of the data set including class labels and
predictor dataset. The following represents the probability that needs to be
maximized based on the estimation of parameters. This is also
called maximum likelihood estimation. The probability of occurrence of data
can be represented as the joint probability of occurrence of each class label.
Assuming that every outcome of the event is independent of others, the
probability of occurrence of the data can be represented as the following:
In the case of softmax regression, for any pair of true label vs predicted label
over Q classes, the loss function can be calculated as the following cross-entropy:
L = − Σ_{q=1..Q} y_q log(ŷ_q)
where y_q is 1 for the true class and 0 otherwise, and ŷ_q is the predicted
probability of class q.
While training machine learning models for the classification problems, the
goal remains to minimize the loss function across all pairs of true and predicted
labels. The goal is to minimize the cross-entropy loss.
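A minimal sketch of this loss for one sample (the one-hot label and predicted probabilities below are illustrative values, not from the notes):
import numpy as np

def cross_entropy(y_true, y_pred, eps=1e-12):
    # L = -sum(y_q * log(y_hat_q)) over the Q classes
    y_pred = np.clip(y_pred, eps, 1.0)
    return -np.sum(y_true * np.log(y_pred))

y_true = np.array([0, 1, 0])          # true class is class 1 (one-hot)
y_pred = np.array([0.2, 0.7, 0.1])    # predicted class probabilities
print(cross_entropy(y_true, y_pred))  # ~0.357; a confident wrong prediction gives a much larger loss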
UNIT-2
Introduction
Supervised machine learning is a type of machine learning that learns the relationship
between input and output. The inputs are known as features or ‘X variables’ and output
is generally referred to as the target or ‘y variable’. The type of data which contains both
the features and the target is known as labeled data. It is the key difference between
supervised and unsupervised machine learning, two prominent types of machine
learning. In this tutorial you will learn:
Discriminative Models?
Logistic regression
Support vector machines(SVMs)
Traditional neural networks
Nearest neighbor
Conditional Random Fields (CRFs)
Decision Trees and Random Forest
These models use probability estimates and likelihood to model data points
and differentiate between different class labels present in a dataset. Unlike
discriminative models, these models can also generate new data points.
Generative models typically work as follows (see the sketch after the list of examples below):
o Assume some functional form for the probabilities, such as P(Y) and P(X|Y)
o With the help of training data, estimate the parameters of P(X|Y) and P(Y)
o Use Bayes' theorem to calculate the posterior probability P(Y|X)
Naïve Bayes
Bayesian networks
Markov random fields
Hidden Markov Models (HMMs)
Latent Dirichlet Allocation (LDA)
Generative Adversarial Networks (GANs)
Autoregressive Model
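As a rough illustration of those three steps, here is a minimal Gaussian Naive Bayes sketch (scikit-learn and the toy arrays below are assumptions for illustration, not part of the notes):
import numpy as np
from sklearn.naive_bayes import GaussianNB

# toy labelled data: two features, binary class label
X = np.array([[1.0, 2.1], [1.2, 1.9], [3.8, 4.2], [4.1, 3.9]])
y = np.array([0, 0, 1, 1])

model = GaussianNB()   # assumes Gaussian form for P(X|Y); estimates P(Y) and P(X|Y) from the data
model.fit(X, y)
print(model.predict_proba([[3.5, 4.0]]))  # posterior P(Y|X) obtained via Bayes' theorem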
Linear Regression
If you recall the “line of best fit” from school days, this is exactly what linear
regression is. Predicting a person's weight based on their height is a
straightforward example of this concept.
PROS: Performs exceptionally well for linearly separable data.
CONS: Assumes linearity between features and the target variable.
KEY TAKEAWAYS
The least squares method is a statistical procedure to find the best fit
for a set of data points.
The method works by minimizing the sum of the offsets or residuals of
points from the plotted curve.
Least squares regression is used to predict the behavior of dependent
variables.
The least squares method provides the overall rationale for the
placement of the line of best fit among the data points being studied.
Traders and analysts can use the least squares method to identify
trading opportunities and economic or financial trends.
Understanding the Least Squares Method
The least squares method is a form of regression analysis that provides the
overall rationale for the placement of the line of best fit among the data points
being studied. It begins with a set of data points using two variables, which
are plotted on a graph along the x- and y-axis. Traders and analysts can use
this as a tool to pinpoint bullish and bearish trends in the market along with
potential trading opportunities.
For instance, an analyst may use the least squares method to generate a line
of best fit that explains the potential relationship between independent and
dependent variables. The line of best fit determined from the least squares
method has an equation that highlights the relationship between the data
points.
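A minimal sketch of a least-squares line of best fit using NumPy (the height/weight numbers below are invented for illustration):
import numpy as np

# hypothetical data: heights (cm) and weights (kg)
heights = np.array([150, 160, 170, 180, 190], dtype=float)
weights = np.array([52, 59, 68, 76, 85], dtype=float)

# np.polyfit minimizes the sum of squared residuals for a degree-1 polynomial
slope, intercept = np.polyfit(heights, weights, deg=1)
print(f"weight ≈ {slope:.2f} * height + {intercept:.2f}")
print("prediction for 175 cm:", slope * 175 + intercept)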
Overfitting
Overfitting occurs when our machine learning model tries to cover all the
data points or more than the required data points present in the given
dataset. Because of this, the model starts caching noise and inaccurate
values present in the dataset, and all these factors reduce the efficiency and
accuracy of the model. The overfitted model has low bias and high
variance.
How to avoid overfitting in the model:
o Cross-Validation
o Training with more data
o Removing features
o Early stopping the training
o Regularization
o Ensembling
Underfitting
Underfitting occurs when our machine learning model is not able to capture
the underlying trend of the data. To avoid overfitting in the model, the
feeding of training data can be stopped at an early stage, due to which the model
may not learn enough from the training data. As a result, it may fail to find
the best fit of the dominant trend in the data.
In the case of underfitting, the model is not able to learn enough from the
training data, and hence it reduces the accuracy and produces unreliable
predictions.
Lasso Regression :
Lasso regression stands for Least Absolute Shrinkage and Selection Operator.
It adds a penalty term to the cost function: the absolute sum of the
coefficients. As the value of a coefficient increases from 0, this term penalizes it,
causing the model to decrease the value of the coefficients in order to reduce loss. The
difference between ridge and lasso regression is that lasso tends to shrink
coefficients to absolute zero, whereas ridge never sets the value
of a coefficient to absolute zero.
Limitation of Lasso Regression:
Lasso sometimes struggles with some types of data. If the number of
predictors (p) is greater than the number of observations (n), Lasso will pick
at most n predictors as non-zero, even if all predictors are relevant (or may
be used in the test set).
If there are two or more highly collinear variables, then lasso regression
selects one of them randomly, which is not good for the interpretation of the data.
Elastic Net :
Sometimes, lasso regression can cause a small bias in the model, where the
prediction is too dependent upon a particular variable. In these cases, Elastic
Net has proved to be better: it combines the regularization of both lasso and ridge.
Its advantage is that it does not easily eliminate highly collinear
coefficients.
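A brief sketch comparing the three penalized regressions in scikit-learn (the random data and alpha values are illustrative assumptions):
import numpy as np
from sklearn.linear_model import Lasso, Ridge, ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=100)  # only two informative features

for model in (Ridge(alpha=1.0), Lasso(alpha=0.1), ElasticNet(alpha=0.1, l1_ratio=0.5)):
    model.fit(X, y)
    # Lasso/ElasticNet drive irrelevant coefficients toward exactly zero; Ridge only shrinks them
    print(type(model).__name__, np.round(model.coef_, 3))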
Logistic Regression
Logistic Regression is a classification algorithm that uses the Sigmoid function instead of a linear
function to model data.
log[y / (1 − y)] = b0 + b1x1 + b2x2 + … + bnxn
Gradient Descent
Gradient Descent is an iterative optimization algorithm that tries to find the
optimum value (Minimum/Maximum) of an objective function. It is one of the
most used optimization techniques in machine learning projects for updating the
parameters of a model in order to minimize a cost function.
The main aim of gradient descent is to find the best parameters of a model
which gives the highest accuracy on training as well as testing datasets. In
gradient descent, The gradient is a vector that points in the direction of the
steepest increase of the function at a specific point. Moving in the opposite
direction of the gradient allows the algorithm to gradually descend towards
lower values of the function, eventually reaching the minimum of the
function.
Steps Required in Gradient Descent Algorithm
t ← 0
max_iterations ← 1000
w, b ← initialize randomly
Function:
Our model's aim is to minimize the cost function
J(θ) = (1/2m) Σ_{i=1..m} (h_θ(x^(i)) − y^(i))²
and store the parameters which make it minimum.
Gradient Descent Algorithm For Linear Regression
Gradient descent works by moving downward toward the pits or valleys in the
graph to find the minimum value. This is achieved by taking the derivative of the
cost function, as illustrated in the figure below. During each iteration, gradient
descent step-downs the cost function in the direction of the steepest descent. By
adjusting the parameters in this direction, it seeks to reach the minimum of the
cost function and find the best-fit values for the parameters. The size of each
step is determined by parameter α known as Learning Rate.
In the Gradient Descent algorithm, one can infer two points :
If slope is +ve : θj = θj − (+ve value). Hence the value of θj decreases.
The choice of a correct learning rate is very important, as it ensures that Gradient
Descent converges in a reasonable time:
If we choose α to be very large, Gradient Descent can overshoot the
minimum. It may fail to converge or even diverge.
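A minimal sketch of batch gradient descent for simple linear regression (the synthetic data, learning rate, and iteration count below are assumptions for illustration):
import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, 100)
y = 2.5 * x + 1.0 + rng.normal(scale=0.5, size=100)   # true slope 2.5, intercept 1.0

w, b = 0.0, 0.0                        # initialize parameters
alpha, max_iterations = 0.01, 1000     # learning rate and iteration budget
m = len(x)
for t in range(max_iterations):
    y_hat = w * x + b
    # gradients of J = (1/2m) * sum((y_hat - y)^2) with respect to w and b
    dw = (1 / m) * np.sum((y_hat - y) * x)
    db = (1 / m) * np.sum(y_hat - y)
    w -= alpha * dw                    # step in the direction opposite to the gradient
    b -= alpha * db
print(round(w, 2), round(b, 2))        # should approach the true parameters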
A Support Vector Machine (SVM) is a supervised classification and regression algorithm that
uses the concept of hyperplanes. These hyperplanes can be understood as multi-dimensional
linear decision boundaries that separate groups of unequal data points. An example of a
hyperplane is shown below.
An optimal fit of the SVM occurs when a hyperplane is furthest from the training data points of
any of the classes—the larger this distance margin, the lower the classifier's error.
To better understand how the SVM works, consider a group of data points like the one shown in
the diagram. It is a good fit if the hyperplane separates the points in the space, so they are
clustered according to their labels. If not, further iterations of the algorithm are performed.
Kernel Methods
Kernels or kernel methods (also called kernel functions) are sets of different
types of algorithms used for pattern analysis. They are used for
classification and regression problems. The SVM uses what is called the "Kernel
Trick", where the data is transformed and an optimal boundary is found for the
transformed points. A hyperplane is one dimension less than the ambient space;
e.g., if we have 2 dimensions representing the ambient space, the boundary which
divides or classifies the space is one dimension less, i.e. a line. When the
points are randomly distributed, there may be no good linear boundary able to
classify the red and the green dots. Here comes the use of the kernel
function, which (implicitly) takes the points to a higher dimension, solves the problem
over there, and returns the output. Think of it this way: the
green dots may be enclosed in some perimeter area while the red ones lie
outside it, and in such scenarios the classifier, i.e. the hyperplane in the original
space, will not be a straight line.
As an illustration, consider the mapping f that sends a 3-dimensional point to the
9-dimensional vector of all pairwise products of its components, so
f(3, 4, 5) = (9, 12, 15, 12, 16, 20, 15, 20, 25)
As we find out, f(x) . f(y) and K(x, y) = (x . y)² give us the same result, but the
former requires explicitly mapping the 3 dimensions into 9 dimensions,
while using the kernel it is much easier.
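A short sketch verifying the kernel-trick identity numerically (the second vector y is an arbitrary illustrative choice; only x = (3, 4, 5) appears in the notes):
import numpy as np

def phi(v):
    # explicit feature map: all pairwise products v_i * v_j (3 dims -> 9 dims)
    return np.outer(v, v).ravel()

x = np.array([3.0, 4.0, 5.0])
y = np.array([1.0, 2.0, 3.0])   # hypothetical second vector

print(phi(x))                   # [ 9. 12. 15. 12. 16. 20. 15. 20. 25.]
print(phi(x) @ phi(y))          # explicit 9-dimensional dot product
print((x @ y) ** 2)             # kernel K(x, y) = (x . y)^2 gives the same value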
Some popular kernel functions used in SVM:
1. Linear Kernel
Let us say that we have two vectors named x1 and x2; then the linear kernel is
defined by their dot product:
K(x1, x2) = x1 . x2
2. Polynomial Kernel
A polynomial kernel is defined by the following equation:
K(x1, x2) = (x1 . x2 + 1)^d
where d is the degree of the polynomial and x1 and x2 are vectors.
3. Gaussian Kernel
This kernel is very much used and popular among support vector machines; it is a
radial basis function kernel of the form K(x, y) = exp(−||x − y||² / (2σ²)).
4. Exponential Kernel
This is closely related to the Gaussian kernel, with only the square of the norm
removed: K(x, y) = exp(−||x − y|| / (2σ²)).
5. Laplacian Kernel
This type of kernel is less prone to changes and is essentially equal to the
previously discussed exponential kernel: K(x, y) = exp(−||x − y|| / σ). The sigmoid
(hyperbolic tangent) kernel, whose activation function is the bipolar sigmoid
function, can also handle problems just like the Gaussian and Laplacian kernels.
There are a lot more types of kernel methods; we have discussed the
most commonly used kernels. Which kernel to use depends purely on the type of problem.
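A minimal scikit-learn sketch showing how these kernels are selected in practice (the toy circular data is an illustrative assumption):
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1).astype(int)   # circular classes: not linearly separable

for kernel in ("linear", "poly", "rbf", "sigmoid"):
    clf = SVC(kernel=kernel, degree=3, gamma="scale")
    clf.fit(X, y)
    print(kernel, "training accuracy:", round(clf.score(X, y), 2))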
Instance-based learning
The Machine Learning systems which are categorized as instance-
based learning are the systems that learn the training examples by
heart and then generalizes to new instances based on some similarity
measure. It is called instance-based because it builds the hypotheses
from the training instances. It is also known as memory-based
learning or lazy-learning (because they delay processing until a new
instance must be classified). The time complexity of this algorithm
depends upon the size of training data. Each time whenever a new
query is encountered, its previously stores data is examined. And
assign to a target function value for the new instance.
The worst-case time complexity of this algorithm is O (n), where n is
the number of training instances. For example, If we were to create a
spam filter with an instance-based learning algorithm, instead of just
flagging emails that are already marked as spam emails, our spam
filter would be programmed to also flag emails that are very similar to
them. This requires a measure of resemblance between two emails. A
similarity measure between two emails could be the same sender or
the repetitive use of the same keywords or something else.
Advantages:
1. Instead of estimating for the entire instance set, local
approximations can be made to the target function.
2. This algorithm can adapt to new data easily, one which is collected
as we go .
Disadvantages:
1. Classification costs are high
2. Large amount of memory required to store the data, and each query
involves starting the identification of a local model from scratch.
Some of the instance-based learning algorithms are :
1. K Nearest Neighbor (KNN)
2. Self-Organizing Map (SOM)
3. Learning Vector Quantization (LVQ)
4. Locally Weighted Learning (LWL)
5. Case-Based Reasoning
K Nearest Neighbors
For instance, suppose we had a graph with two distinct groups of data points
that were located in close proximity to one another and named Group A and
Group B, respectively. Each of these groups of data points would be
represented by a point on the graph. When we add a new data point, the
group of that instance will depend on which group the new point is closer to.
PROS: Makes no assumption about the data; intuitive and simple.
CONS: Prediction is slow for large datasets, since KNN is a lazy learner that defers
computation until query time; KNN works well with a small number of features, but as
the number of features grows it struggles to predict accurately.
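A brief scikit-learn sketch of the Group A / Group B idea above (the toy points and k value are assumptions for illustration):
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Group A near (1, 1), Group B near (5, 5)
X = np.array([[1, 1], [1.2, 0.8], [0.9, 1.1], [5, 5], [5.2, 4.9], [4.8, 5.1]])
y = np.array(["A", "A", "A", "B", "B", "B"])

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)
print(knn.predict([[1.5, 1.0], [4.5, 5.0]]))   # new points join the group they are closer to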
Decision Tree
Decision trees are tree-based decision models that use a root node internal structure followed by
successive child leaf nodes. The leaf nodes are a placeholder for the classification label, and the
branches show the outcomes of the decision. The paths from the tree's root to the leaves
represent the classifier rules. Each tree and sub-tree models a single decision and enumerates all
the possible decisions to choose the best one. A Decision tree can be optimal if it represents most
of the data with the least number of levels.
Decision trees are helpful for classification but can be extended for Regression using different
algorithms. These trees are computationally efficient, and many tree-based optimizations have
been created over the years to make them perform even faster.
ID3 algorithm:
The ID3 algorithm builds the tree by choosing, at each node, the attribute with the
highest information gain:
Entropy(S) = − Σ p_i log2(p_i)
Gain(S, A) = Entropy(S) − Σ_v (|S_v| / |S|) Entropy(S_v)
where p_i is the proportion of examples in S belonging to class i, and S_v is the
subset of S for which attribute A has value v.
Entropy = 0 implies it is of pure class, which means all examples are of the same category.
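A small sketch of the entropy and information-gain computation used by ID3 (the toy label arrays and split are illustrative assumptions):
import numpy as np

def entropy(labels):
    # Entropy(S) = -sum(p_i * log2(p_i)) over the classes present in S
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, splits):
    # Gain = Entropy(parent) - weighted average entropy of the child subsets
    n = len(parent)
    return entropy(parent) - sum(len(s) / n * entropy(s) for s in splits)

labels = np.array(["yes", "yes", "no", "no", "yes", "no"])
split = [labels[:3], labels[3:]]   # a hypothetical split on some attribute
print(entropy(labels), information_gain(labels, split))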
Random Forest
Random Forest models use a forest of Decision Trees to make better decisions by combining
each tree's decisions. The most popular decision across the trees for a task is the best after the
aggregation. This technique of aggregating multiple results from similar processes is
called Ensembling.
The second component of the Random Forest pertains to another technique called Bagging.
Bagging differs from Ensembling because, in Bagging, the data is different for every model,
while in Ensembling, the different models are run on the same data.
In Bagging, a random sample with replacement is chosen multiple times to create a data sample.
These data samples are then used to train the model independently. After training all these
models, the majority vote is taken to find a better data estimate.
Random forests combine the concepts of Bagging and Ensembling to decide the best feature
splits and select subsets of the same. This algorithm is better than a single Decision Tree as it
reduces bias and the net variance, generating better predictions.
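A minimal scikit-learn sketch of a Random Forest, which bags bootstrap samples and ensembles the trees' votes (the synthetic data and hyperparameters are illustrative assumptions):
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(7)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # a simple synthetic target

# bootstrap=True draws each tree's training sample with replacement (Bagging);
# the forest then aggregates the trees' votes (Ensembling)
forest = RandomForestClassifier(n_estimators=100, bootstrap=True, random_state=0)
forest.fit(X, y)
print(forest.predict(X[:5]), forest.score(X, y))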
Bagging and Ensembling might seem like they help model the joint probability distribution, but
that is not the case. Understanding the difference between Generative and Discriminative models
can clear this confusion.
Classification Algorithm
The Classification algorithm is a Supervised Learning technique that is used
to identify the category of new observations on the basis of training data. In
Classification, a program learns from the given dataset or observations and
then classifies new observations into a number of classes or groups, such
as Yes or No, 0 or 1, Spam or Not Spam, cat or dog, etc. Classes can
be called targets/labels or categories.
1. Lazy Learners: Lazy Learner firstly stores the training dataset and wait until
it receives the test dataset. In Lazy learner case, classification is done on the
basis of the most related data stored in the training dataset. It takes less time
in training but more time for predictions.
Example: K-NN algorithm, Case-based reasoning
2. Eager Learners: Eager Learners develop a classification model based on a
training dataset before receiving a test dataset. Opposite to Lazy learners,
Eager Learner takes more time in learning, and less time in
prediction. Example: Decision Trees, Naïve Bayes, ANN.
o Linear Models
o Logistic Regression
o Support Vector Machines
o Non-linear Models
o K-Nearest Neighbours
o Kernel SVM
o Naïve Bayes
o Decision Tree Classification
o Random Forest Classification
1. Log Loss or Cross-Entropy Loss: −(y log(p) + (1 − y) log(1 − p))
2. Confusion Matrix:
3. AUC-ROC curve:
o ROC curve stands for Receiver Operating Characteristics Curve and AUC
stands for Area Under the Curve.
o It is a graph that shows the performance of the classification model at
different thresholds.
o To visualize the performance of the multi-class classification model, we use
the AUC-ROC Curve.
o The ROC curve is plotted with TPR and FPR, where TPR (True Positive Rate) on
Y-axis and FPR(False Positive Rate) on X-axis.
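A short scikit-learn sketch of these evaluation tools (the true/predicted labels and scores are invented for illustration):
import numpy as np
from sklearn.metrics import log_loss, confusion_matrix, roc_auc_score

y_true = np.array([0, 0, 1, 1, 1, 0])
y_prob = np.array([0.1, 0.4, 0.8, 0.9, 0.3, 0.2])   # predicted probability of class 1
y_pred = (y_prob >= 0.5).astype(int)                # threshold at 0.5

print(log_loss(y_true, y_prob))          # -(y*log(p) + (1-y)*log(1-p)) averaged over samples
print(confusion_matrix(y_true, y_pred))  # rows: true class, columns: predicted class
print(roc_auc_score(y_true, y_prob))     # area under the ROC curve (TPR vs FPR across thresholds)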
UNIT-3
Unsupervised learning:
In unsupervised learning, the model is provided with a set of unlabelled data, which it is
required to analyze and find patterns in. Examples are dimensionality reduction and
clustering.
Here we are discussing mainly popular Clustering algorithms that are widely
used in machine learning:
Applications of Clustering
Below are some commonly known applications of clustering technique in
Machine Learning:
Partitioning Clustering
It is a type of clustering that divides the data into non-hierarchical groups. It
is also known as the centroid-based method. The most common example
of partitioning clustering is the K-Means Clustering algorithm.
In this type, the dataset is divided into a set of k groups, where K is used to
define the number of pre-defined groups. The cluster center is created in
such a way that the distance between the data points of one cluster is
minimum as compared to another cluster centroid.
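A minimal K-Means sketch in scikit-learn (the toy data and K = 2 are illustrative assumptions):
import numpy as np
from sklearn.cluster import KMeans

# toy 2-D data forming two loose groups
X = np.array([[1, 2], [1.5, 1.8], [1.2, 2.2], [8, 8], [8.5, 7.8], [7.9, 8.3]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)   # K = number of pre-defined groups
labels = kmeans.fit_predict(X)
print(labels)                    # cluster assignment of each point
print(kmeans.cluster_centers_)   # the learned cluster centroids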
Hierarchical Clustering
Hierarchical clustering can be used as an alternative for the partitioned
clustering as there is no requirement of pre-specifying the number of
clusters to be created. In this technique, the dataset is divided into clusters
to create a tree-like structure, which is also called a dendrogram. The
observations or any number of clusters can be selected by cutting the tree at
the correct level. The most common example of this method is
the Agglomerative Hierarchical algorithm.
Cluster validity:
The term cluster validation is used to describe the procedure of evaluating the goodness of
clustering algorithm results. This is important to avoid finding patterns in random data, as
well as in situations where you want to compare two clustering algorithms.
Generally, clustering validation statistics can be categorized into 3 classes
1. Internal cluster validation, which uses the internal information of the clustering process to
evaluate the goodness of a clustering structure without reference to external information. It
can be also used for estimating the number of clusters and the appropriate clustering
algorithm without any external data.
2. External cluster validation, which consists in comparing the results of a cluster analysis to
an externally known result, such as externally provided class labels. It measures the extent to
which cluster labels match externally supplied class labels. Since we know the “true” cluster
number in advance, this approach is mainly used for selecting the right clustering algorithm
for a specific data set.
3. Relative cluster validation, which evaluates the clustering structure by varying different
parameter values for the same algorithm (e.g.,: varying the number of clusters k). It’s
generally used for determining the optimal number of clusters.
Internal measures for cluster validation
In this section, we describe the most widely used clustering validation indices. Recall that the
goal of partitioning clustering algorithms (Part @ref(partitioning-clustering)) is to split the
data set into clusters of objects, such that:
the objects in the same cluster are similar as much as possible,
and the objects in different clusters are highly distinct
That is, we want the average distance within cluster to be as small as possible; and the
average distance between clusters to be as large as possible.
Internal validation measures often reflect the compactness, the connectedness and
the separation of the cluster partitions.
1. Compactness or cluster cohesion: Measures how close are the objects within the same
cluster. A lower within-cluster variation is an indicator of a good compactness (i.e., a good
clustering). The different indices for evaluating the compactness of clusters are based on
distance measures such as the cluster-wise within average/median distances between
observations.
2. Separation: Measures how well-separated a cluster is from other clusters. The indices used
as separation measures include:
distances between cluster centers
3. the pairwise minimum distances between objects in different clusters
4. Connectivity: corresponds to what extent items are placed in the same cluster as their
nearest neighbors in the data space. The connectivity has a value between 0 and infinity and
should be minimized.
Generally most of the indices used for internal clustering validation combine compactness
and separation measures as follow:
Index=(α×Separation)/(β×Compactness)
The silhouette analysis measures how well an observation is clustered and it estimates
the average distance between clusters. The silhouette plot displays a measure of how close
each point in one cluster is to points in the neighboring clusters.
For each observation i, the silhouette width si is calculated as follows:
1. For each observation i, calculate the average dissimilarity ai between i and all other points
of the cluster to which i belongs.
2. For all other clusters C, to which i does not belong, calculate the average
dissimilarity d(i,C) of i to all observations of C. The smallest of these d(i,C) is defined
as bi = min_C d(i,C). The value of bi can be seen as the dissimilarity between i and
its “neighbor” cluster, i.e., the nearest one to which it does not belong.
3. Finally the silhouette width of the observation i is defined by the
formula: Si=(bi−ai)/max(ai,bi)
Silhouette width can be interpreted as follow:
Observations with a large Si (almost 1) are very well clustered.
A small Si (around 0) means that the observation lies between two clusters.
Observations with a negative Si are probably placed in the wrong cluster.
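A brief sketch computing silhouette widths with scikit-learn (the data and k are illustrative assumptions):
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, silhouette_samples

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 0.5, (30, 2)), rng.normal(5, 0.5, (30, 2))])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(silhouette_score(X, labels))        # average Si over all observations (close to 1 = well clustered)
print(silhouette_samples(X, labels)[:5])  # per-observation silhouette widths Si = (bi - ai) / max(ai, bi)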
Dunn index
The Dunn index is another internal clustering validation measure which can be computed as
follow:
1. For each cluster, compute the distance between each of the objects in the cluster and the
objects in the other clusters
2. Use the minimum of this pairwise distance as the inter-cluster separation (min.separation)
3. For each cluster, compute the distance between the objects in the same cluster.
4. Use the maximal intra-cluster distance (i.e maximum diameter) as the intra-cluster
compactness
5. Calculate the Dunn index (D) as follow:
D=min.separation/max.diameter
If the data set contains compact and well-separated clusters, the diameter of the clusters is
expected to be small and the distance between the clusters is expected to be large. Thus,
Dunn index should be maximized.
The aim is to compare the identified clusters (by k-means, pam or hierarchical clustering) to
an external reference.
It’s possible to quantify the agreement between partitioning clusters and external reference
using either the corrected Rand index and Meila’s variation index VI, which are implemented
in the R function cluster.stats()[fpc package].
The corrected Rand index varies from -1 (no agreement) to 1 (perfect agreement).
External clustering validation, can be used to select suitable clustering algorithm for a given
data set.
Dimensionality Reduction
The number of input features, variables, or columns present in a given
dataset is known as dimensionality, and the process to reduce these features
is called dimensionality reduction.
It is commonly used in the fields that deal with high-dimensional data, such
as speech recognition, signal processing, bioinformatics, etc. It can
also be used for data visualization, noise reduction, cluster analysis,
etc.
The Curse of Dimensionality
Handling the high-dimensional data is very difficult in practice, commonly
known as the curse of dimensionality. If the dimensionality of the input
dataset increases, any machine learning algorithm and model becomes more
complex. As the number of features increases, the number of samples also
gets increased proportionally, and the chance of overfitting also increases. If
the machine learning model is trained on high-dimensional data, it becomes
overfitted and results in poor performance.
o By reducing the dimensions of the features, the space required to store the
dataset also gets reduced.
o Less Computation training time is required for reduced dimensions of
features.
o Reduced dimensions of features of the dataset help in visualizing the data
quickly.
o It removes the redundant features (if present) by taking care of
multicollinearity.
Feature Selection
Feature selection is the process of selecting the subset of the relevant
features and leaving out the irrelevant features present in a dataset to build
a model of high accuracy. In other words, it is a way of selecting the optimal
features from the input dataset.
1. Filters Methods
In this method, the dataset is filtered, and a subset that contains only the
relevant features is taken. Some common techniques of filters method are:
o Correlation
o Chi-Square Test
o ANOVA
o Information Gain, etc.
2. Wrappers Methods
The wrapper method has the same goal as the filter method, but it takes a
machine learning model for its evaluation. In this method, some features are
fed to the ML model, and evaluate the performance. The performance
decides whether to add those features or remove to increase the accuracy of
the model. This method is more accurate than the filtering method but
complex to work. Some common techniques of wrapper methods are:
o Forward Selection
o Backward Selection
o Bi-directional Elimination
o LASSO
o Elastic Net
o Ridge Regression, etc.
Feature Extraction:
Feature extraction is the process of transforming the space containing many
dimensions into space with fewer dimensions. This approach is useful when
we want to keep the whole information but use fewer resources while
processing the information.
PCA works by considering the variance of each attribute, because high
variance indicates a good split between the classes, and hence it reduces the
dimensionality. Some real-world applications of PCA are image processing,
movie recommendation systems, and optimizing the power allocation in
various communication channels.
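A minimal PCA sketch with scikit-learn (the random high-dimensional data and number of components are illustrative assumptions):
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 10))           # 100 samples, 10 original features

pca = PCA(n_components=2)                # keep the 2 directions of highest variance
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                   # (100, 2)
print(pca.explained_variance_ratio_)     # share of total variance captured by each component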
o In this technique, firstly, all the n variables of the given dataset are taken to
train the model.
o The performance of the model is checked.
o Now we will remove one feature each time and train the model on n-1
features for n times, and will compute the performance of the model.
o We will check the variable that has made the smallest or no change in the
performance of the model, and then we will drop that variable or features;
after that, we will be left with n-1 features.
Recommendation systems
Product recommendation is a popular application of machine learning that
aims to personalize the customer shopping experience. By analyzing
customer behavior, preferences, and purchase history, a recommendation
engine can suggest products more likely to interest a particular customer.
One way to handle the cold-start problem is to use a hybrid approach that
combines content-based filtering and demographic information. For example,
suppose a new customer is browsing for men's clothing. In that case, the
recommendation engine can suggest products based on the most popular
men's clothing items and the customer's age and location.
Types of recommendation systems
There are several types of recommendation systems in machine learning,
including:
EM algorithm
The Expectation-Maximization (EM) algorithm is defined as the combination
of various unsupervised machine learning algorithms, which is used to
determine the local maximum likelihood estimates (MLE) or maximum
a posteriori estimates (MAP) for unobservable variables in statistical
models. Further, it is a technique to find maximum likelihood estimation
when the latent variables are present. It is also referred to as the latent
variable model.
Key Points:
EM Algorithm
The EM algorithm is the combination of various unsupervised ML algorithms,
such as the k-means clustering algorithm. Being an iterative approach, it
consists of two modes. In the first mode, we estimate the missing or latent
variables. Hence it is referred to as the Expectation/estimation step (E-
step). Further, the other mode is used to optimize the parameters of the
models so that it can explain the data more clearly. The second mode is
known as the maximization-step or M-step.
o Expectation step (E - step): It involves the estimation (guess) of all
missing values in the dataset so that after completing this step, there should
not be any missing value.
o Maximization step (M - step): This step involves the use of estimated data
in the E-step and updating the parameters.
o Repeat E-step and M-step until the convergence of the values occurs.
The primary goal of the EM algorithm is to use the available observed data of
the dataset to estimate the missing data of the latent variables and then use
that data to update the values of the parameters in the M-step.
Steps in EM Algorithm
The EM algorithm is completed mainly in 4 steps, which include
Initialization Step, Expectation Step, Maximization Step, and
convergence Step. These steps are explained as follows:
o 1st Step: The very first step is to initialize the parameter values. Further, the
system is provided with incomplete observed data with the assumption that
data is obtained from a specific model.
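A minimal sketch of EM in practice via a Gaussian mixture model in scikit-learn, which alternates the E and M steps internally (the two-component data below is an illustrative assumption):
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(11)
X = np.concatenate([rng.normal(0, 1, 200), rng.normal(6, 1, 200)]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, max_iter=100, random_state=0)
gmm.fit(X)                 # E-step: estimate responsibilities; M-step: update parameters; repeat
print(gmm.means_.ravel())  # estimated component means (near 0 and 6)
print(gmm.converged_)      # True once the log-likelihood stops improving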
Reinforcement Learning
o Reinforcement Learning is a feedback-based Machine learning technique in
which an agent learns to behave in an environment by performing the actions
and seeing the results of actions. For each good action, the agent gets
positive feedback, and for each bad action, the agent gets negative feedback
or penalty.
o In Reinforcement Learning, the agent learns automatically using feedback,
without any labeled data, unlike supervised learning.
o Since there is no labeled data, so the agent is bound to learn by its
experience only.
o RL solves a specific type of problem where decision making is sequential, and
the goal is long-term, such as game-playing, robotics, etc.
o The agent interacts with the environment and explores it by itself. The
primary goal of an agent in reinforcement learning is to improve the
performance by getting the maximum positive rewards.
o The agent learns with the process of hit and trial, and based on the
experience, it learns to perform the task in a better way. Hence, we can say
that "Reinforcement learning is a type of machine learning method
where an intelligent agent (computer program) interacts with the
environment and learns to act within that." How a Robotic dog learns
the movement of his arms is an example of Reinforcement learning.
o It is a core part of Artificial intelligence, and all AI agent works on the concept
of reinforcement learning. Here we do not need to pre-program the agent, as
it learns from its own experience without any human intervention.
o Example: Suppose there is an AI agent present within a maze environment,
and his goal is to find the diamond. The agent interacts with the environment
by performing some actions, and based on those actions, the state of the
agent gets changed, and it also receives a reward or penalty as feedback.
o The agent continues doing these three things (take action, change
state/remain in the same state, and get feedback), and by doing these
actions, he learns and explores the environment.
o The agent learns which actions lead to positive feedback or rewards and
which actions lead to negative feedback or penalty. As a positive reward, the
agent gets a positive point, and as a penalty, it gets a negative point.
Elements of Reinforcement Learning
There are four main elements of Reinforcement Learning, which are given
below:
1. Policy
2. Reward Signal
3. Value Function
4. Model of the environment
3) Value Function: The value function gives information about how good
the situation and action are and how much reward an agent can expect. A
reward indicates the immediate signal for each good and bad action,
whereas a value function specifies the good state and action for the
future. The value function depends on the reward as, without reward, there
could be no value. The goal of estimating values is to achieve more rewards.
The model is used for planning, which means it provides a way to take a
course of action by considering all future situations before actually
experiencing those situations. The approaches for solving the RL
problems with the help of the model are termed as the model-based
approach. Comparatively, an approach without using a model is called
a model-free approach.
Factor Graphs
These are a form of PGM with round nodes representing variables and square nodes
representing factors (local probability distributions), with edges expressing the
conditional relationships between them. They offer a broad framework for modelling
the joint distribution of a set of random variables.
Bayesian Methods
The first essential concept allowing this new machine learning architecture
is Bayesian inference/learning. Latent/hidden parameters are represented in
MBML as random variables with probability distributions. This provides for a
consistent and rational approach to quantifying uncertainty in model
parameters. Again when the observed variables in the model are locked to
their values, the Bayes’ theorem is used to update the previously assumed
probability distributions.
In contrast, the classical ML framework assigns model parameters to average
values derived by maximizing an objective function. Bayesian inference on big
models with millions of variables is accomplished similarly, but in a more
complicated way, employing Bayes' theorem. This is because exact Bayesian
inference is intractable when applied to
huge datasets. The rise in the processing capacity of computers over the last
decade has enabled the research and innovation of algorithms that can scale
to enormous data sets.
Probabilistic Programming
Describe the Model: Using factor graphs, describe the process that created
the data.
Condition on Reported Data: Make the observed variables equal to their
known values.
Perform Backward Inference: Backward reasoning is used to update the prior
distribution over the latent constructs or parameters, estimating the Bayesian
probability distributions of the latent constructs based on the observed variables.
The quantity rt+1 + γV(st+1) − V(st) is commonly called the TD Error. Here the TD
error is the difference between the current estimate for V(st), the discounted
value estimate of V(st+1), and the actual reward gained from transitioning
between st and st+1. The TD error at each time step is the error in the calculation
made at that time. Because the TD error at step t relies on the next state and
next reward, it is not available until step t + 1. When we update the value
function with the TD error, it is called a backup. The TD error is related to the
Bellman equation.
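A minimal sketch of a single TD(0) value update using this error (the states, reward, discount, and step size below are hypothetical illustration values):
# one TD(0) backup: V(s_t) <- V(s_t) + alpha * [r_{t+1} + gamma * V(s_{t+1}) - V(s_t)]
V = {"s0": 0.0, "s1": 0.5}      # current value estimates for two states
alpha, gamma = 0.1, 0.9         # step size and discount factor

s_t, s_next, reward = "s0", "s1", 1.0
td_error = reward + gamma * V[s_next] - V[s_t]   # the TD error, available at step t+1
V[s_t] += alpha * td_error                       # the backup
print(td_error, V["s0"])                         # 1.45 and 0.145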
UNIT-4
Introduction
Maximum likelihood is an approach commonly used for such density estimation
problems, in which a likelihood function is defined to get the probabilities of the
distributed data. It is imperative to study and understand the concept of maximum
likelihood as it is one of the primary and core concepts essential for learning other
advanced machine learning and deep learning techniques and algorithms.
In this article, we will discuss the likelihood function, the core idea behind that, and how
it works with code examples. This will help one to understand the concept better and
apply the same when needed.
Let us dive into the likelihood first to understand the maximum likelihood estimation.
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.linear_model import LogisticRegression
# assumes X_train, X_test, y_train, y_test were already split from a labelled, single-feature dataset
lr = LogisticRegression()
lr.fit(X_train, y_train)
lr_pred = lr.predict(X_test)
df_pred = pd.DataFrame({"X": X_test.ravel(), "lr_pred": lr_pred})  # assumed frame used by the plot below
sns.regplot(x="X", y="lr_pred", data=df_pred, logistic=True, ci=None)
The above code will fit the logistic regression for the given dataset and generate the
line plot for the data representing the distribution of the data and the best fit according
to the algorithm.
Key Takeaways
Maximum likelihood relies on a likelihood function that describes the data points and
how likely they are under the model, in order to find the best fit.
Maximum likelihood is different from probabilistic methods, where probabilistic
methods work on the principle of calculating probabilities. In contrast, the
likelihood method tries to maximize the likelihood of the data observations according
to the data distribution.
Maximum likelihood is an approach used for solving the problems like density
distribution and is a base for some algorithms like logistic regression.
The approach is very similar and is predominantly known as the perceptron trick
in terms of deep learning methods.
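As an illustrative sketch (not from the original text), the code below estimates
the mean of a Gaussian by maximizing the log-likelihood over a grid of candidate
means; the sample, the known standard deviation and the candidate grid are all
assumptions made for the example.

# Maximum likelihood for a Gaussian mean: pick the parameter value that
# maximizes the (log-)likelihood of the observed sample.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=200)   # assumed sample

candidate_means = np.linspace(0, 10, 1001)
log_likelihoods = [norm.logpdf(data, loc=m, scale=2.0).sum() for m in candidate_means]

mle_mean = candidate_means[np.argmax(log_likelihoods)]
print("MLE of the mean:", mle_mean)               # close to the sample mean, data.mean()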
This algorithm was proposed by R. Agrawal and R. Srikant in 1994.
It is mainly used for market basket analysis and helps to find those products
that can be bought together. It can also be used in the healthcare field to
find drug reactions for patients.
Frequent itemsets are those itemsets whose support is greater than the
threshold value, or the user-specified minimum support. The Apriori property
means that if A & B form a frequent itemset together, then A and B must
individually also be frequent itemsets.
Step-1: Determine the support of the itemsets in the transactional database,
and select the minimum support and confidence.
Step-2: Take all the itemsets in the transactions with a support value higher
than the minimum or selected support value.
Step-3: Find all the rules of these subsets that have a confidence value higher
than the threshold or minimum confidence. (A toy calculation of support and
confidence follows below.)
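A toy sketch of the support and confidence calculations behind these steps; the
transactions and the rule below are invented purely for illustration.

# Support and confidence for the rule A -> B over a toy set of transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "butter"},
]

def support(itemset):
    # fraction of transactions that contain every item in the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

rule_from, rule_to = {"bread"}, {"milk"}
conf = support(rule_from | rule_to) / support(rule_from)   # confidence(A -> B)

print("support(bread, milk) =", support(rule_from | rule_to))
print("confidence(bread -> milk) =", conf)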
B F P (A=T) P (A=F)
T T 0.95 0.05
T F 0.94 0.06
F T 0.29 0.71
F F 0.001 0.999
The alarm ‘A’ node can be ‘true’ or ‘false’ ( i.e may have rung or
may not have rung). It has two parent nodes burglary ‘B’ and fire ‘F’
which can be ‘true’ or ‘false’ (i.e may have occurred or may not
have occurred) depending upon different conditions.
Person ‘P1’ –
A P (P1=T) P (P1=F)
T 0.95 0.05
F 0.05 0.95
The person ‘P1’ node can be ‘true’ or ‘false’ (i.e may have called the
person ‘gfg’ or not) . It has a parent node, the alarm ‘A’, which can
be ‘true’ or ‘false’ (i.e may have rung or may not have rung ,upon
burglary ‘B’ or fire ‘F’).
Person ‘P2’ –
A P (P2=T) P (P2=F)
T 0.80 0.20
F 0.01 0.99
The person ‘P2’ node can be ‘true’ or false’ (i.e may have called the
person ‘gfg’ or not). It has a parent node, the alarm ‘A’, which can
be ‘true’ or ‘false’ (i.e may have rung or may not have rung, upon
burglary ‘B’ or fire ‘F’).
Solution: Considering the observed probabilistic scan –
With respect to the question — P ( P1, P2, A, ~B, ~F) , we need to
get the probability of ‘P1’. We find it with regard to its parent node –
alarm ‘A’. To get the probability of ‘P2’, we find it with regard to its
parent node — alarm ‘A’.
We find the probability of alarm ‘A’ node with regard to ‘~B’ & ‘~F’
since burglary ‘B’ and fire ‘F’ are parent nodes of alarm ‘A’.
From the observed probabilistic scan, we can deduce –
P ( P1, P2, A, ~B, ~F)
= P (P1 | A) * P (P2 | A) * P (A | ~B, ~F) * P (~B) * P (~F)
= 0.95 * 0.80 * 0.001 * 0.999 * 0.998
≈ 0.00076
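The same chain-rule factorization can be checked numerically; the prior values
P(~B) = 0.999 and P(~F) = 0.998 are taken directly from the calculation above.

# Joint probability P(P1, P2, A, ~B, ~F) for the burglary-alarm Bayesian network.
p_p1_given_a = 0.95             # P(P1=T | A=T)
p_p2_given_a = 0.80             # P(P2=T | A=T)
p_a_given_not_b_not_f = 0.001   # P(A=T | B=F, F=F)
p_not_b = 0.999                 # P(B=F)
p_not_f = 0.998                 # P(F=F)

joint = p_p1_given_a * p_p2_given_a * p_a_given_not_b_not_f * p_not_b * p_not_f
print(round(joint, 6))          # about 0.000758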
Probabilistic modeling
Probabilistic modeling is a statistical technique used to take into account the impact of
random events or actions in predicting the potential occurrence of future outcomes.
3. Structure Learning
In the simplest case, a BN is specified by an expert and is then used to perform
inference. In other applications, the task of defining the network is too complex
for humans. In that case, the network structure and the parameters of the local
distributions must be learned from data.
With the help of Data Science, industries are able to apply machine
learning and predictive modeling to develop tools for the
recognition of unusual patterns in the fraud-detection ecosystem.
Naive Bayes is one of the important algorithms used for
fraud detection in industry.
The kernel density estimate is
fh(x) = (1/n) * Σ Kh(x − xi) = (1/(n*h)) * Σ K((x − xi)/h), summed over i = 1..n
where,
K -> kernel (non-negative function)
h -> bandwidth (smoothing parameter, h > 0)
Kh -> scaled kernel, Kh(x) = (1/h) K(x/h)
fh(x) -> density (to calculate)
n -> no. of samples in the random sample.
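A brief sketch of a Gaussian kernel density estimate matching the formula above;
the sample, the bandwidth and the evaluation point are assumed for illustration.

# Kernel density estimate f_h(x) = (1/(n*h)) * sum_i K((x - x_i)/h), Gaussian kernel.
import numpy as np

rng = np.random.default_rng(1)
sample = rng.normal(size=100)          # random sample of n points
h = 0.4                                # bandwidth (smoothing parameter)

def kde(x, data, h):
    u = (x - data) / h
    kernel = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)   # Gaussian kernel K
    return kernel.sum() / (len(data) * h)

print(kde(0.0, sample, h))             # estimated density at x = 0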
A sample plot for nonparametric density estimation is given below.
The RNN
The outputs from one recurrent unit at each time step can be
fed as input to the next unit at the same time step. This
forms a deep sequential model that can model a larger range
of more complex sequences than a single recurrent unit.
Long Term Dependencies
2. Hidden Markov models. These are used to represent systems with some
unobservable states. In addition to showing states and transition rates,
hidden Markov models also represent observations and observation
likelihoods for each state. Hidden Markov models are used for a range of
applications, including thermodynamics, finance and pattern recognition.
Another two commonly applied types of Markov model are used when the
system being represented is controlled -- that is, when the system is
influenced by a decision-making agent. These are Markov decision processes
and partially observable Markov decision processes. Other Markov models are
based on the chain representations but with added information, such as
observations and observation likelihoods.
The transition matrix below represents shifting gears in a car with a manual
transmission. Six states are possible, and a transition from any given state to
any other state depends only on the current state -- that is, where the car
goes from second gear isn't influenced by where it was before second gear.
Such a transition matrix might be built from empirical observations that show,
for example, that the most probable transitions from first gear are to second or
neutral.
The image below represents the toss of a coin. Two states are possible:
heads and tails. The transition from heads to heads or heads to tails is equally
probable (.5) and is independent of all preceding coin tosses. The circles
represent the two possible states -- heads or tails -- and the arrows show the possible
states the system could transition to in the next step. The number .5 represents the
probability of that transition occurring.
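A small sketch of sampling from such a two-state Markov chain, using the
coin-toss transition probabilities above (the chain length and the starting
state are arbitrary choices for the example):

# Simulate a two-state Markov chain (heads/tails) from its transition matrix.
import numpy as np

states = ["heads", "tails"]
P = np.array([[0.5, 0.5],     # from heads: P(heads), P(tails)
              [0.5, 0.5]])    # from tails: P(heads), P(tails)

rng = np.random.default_rng(42)
current = 0                   # start in "heads"
chain = [states[current]]
for _ in range(10):
    current = rng.choice(2, p=P[current])   # next state depends only on the current state
    chain.append(states[current])
print(chain)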
The basic idea behind an HMM is that the hidden states generate the
observations, and the observed data is used to estimate the hidden state
sequence; this estimation is often done with the forward-backward algorithm.
o Speech Recognition
One of the most well-known applications of HMMs is speech recognition. In
this field, HMMs are used to model the different sounds and phones that
make up speech. The hidden states, in this case, correspond to the different
sounds or phones, and the observations are the acoustic signals that are
generated by the speech. The goal is to estimate the hidden state sequence,
which corresponds to the transcription of the speech, based on the observed
acoustic signals. HMMs are particularly well-suited for speech recognition
because they can effectively capture the underlying structure of the speech,
even when the data is noisy or incomplete. In speech recognition systems,
the HMMs are usually trained on large datasets of speech signals, and the
estimated parameters of the HMMs are used to transcribe speech in real
time.
o Natural Language Processing
Another important application of HMMs is natural language processing. In this
field, HMMs are used for tasks such as part-of-speech tagging, named
entity recognition, and text classification. In these applications, the
hidden states are typically associated with the underlying grammar or
structure of the text, while the observations are the words in the text. The
goal is to estimate the hidden state sequence, which corresponds to the
structure or meaning of the text, based on the observed words. HMMs are
useful in natural language processing because they can effectively capture
the underlying structure of the text, even when the data is noisy or
ambiguous. In natural language processing systems, the HMMs are usually
trained on large datasets of text, and the estimated parameters of the HMMs
are used to perform various NLP tasks, such as text classification, part-of-
speech tagging, and named entity recognition.
o Bioinformatics
HMMs are also widely used in bioinformatics, where they are used to model
sequences of DNA, RNA, and proteins. The hidden states, in this case,
correspond to the different types of residues, while the observations are the
sequences of residues. The goal is to estimate the hidden state sequence,
which corresponds to the underlying structure of the molecule, based on the
observed sequences of residues. HMMs are useful in bioinformatics because
they can effectively capture the underlying structure of the molecule, even
when the data is noisy or incomplete. In bioinformatics systems, the HMMs
are usually trained on large datasets of molecular sequences, and the
estimated parameters of the HMMs are used to predict the structure or
function of new molecular sequences.
o Finance
Finally, HMMs have also been used in finance, where they are used to model
stock prices, interest rates, and currency exchange rates. In these
applications, the hidden states correspond to different economic states, such
as bull and bear markets, while the observations are the stock prices, interest
rates, or exchange rates. The goal is to estimate the hidden state sequence,
which corresponds to the underlying economic state, based on the observed
prices, rates, or exchange rates. HMMs are useful in finance because they
can effectively capture the underlying economic state, even when the data is
noisy or incomplete. In finance systems, the HMMs are usually trained on
large datasets of financial data, and the estimated parameters of the HMMs
are used to make predictions about future market trends or to develop
investment strategies.
UNIT-5
Neural networks are artificial systems that were inspired by biological neural
networks. These systems learn to perform tasks by being exposed to various
datasets and examples without any task-specific rules. The idea is that the
system generates identifying characteristics from the data it has been
passed, without any pre-programmed understanding of these datasets.
Neural networks are based on computational models for
threshold logic. Threshold logic is a combination of algorithms and
mathematics. Neural networks are based either on the study of the brain or on
the application of neural networks to artificial intelligence. The work has led to
improvements in finite automata theory. Components of a typical neural network
include neurons, connections (known as synapses), weights, biases, a
propagation function, and a learning rule. A neuron j receives inputs
from predecessor neurons and has an activation aj(t), a threshold θj, an
activation function f, and an output function fout. Connections carry
weights and biases, which govern how one neuron transfers its output to the
next. The propagation function computes a neuron's input as the weighted sum
of the outputs of its predecessor neurons.
The learning of a neural network basically refers to the adjustment of its free
parameters, i.e., the weights and biases. The learning process is basically a
sequence of three events.
These include:
1. The neural network is stimulated by a new environment.
2. The free parameters of the neural network are changed as a result of this
stimulation.
3. The neural network then responds in a new way to the environment because
of the changes in its free parameters.
The Perceptron model is also regarded as one of the simplest and best-known
types of Artificial Neural Network. It is a supervised learning algorithm for
binary classifiers. We can consider it a single-layer neural network
with four main parameters, i.e., input values, weights and bias, net
sum, and an activation function.
o Activation Function:
These are the final and important components that help to determine
whether the neuron will fire or not. Activation Function can be considered
primarily as a step function.
o Sign function
o Step function, and
o Sigmoid function
The data scientist uses the activation function to take a decision suited to the
problem statement at hand and to produce the desired outputs.
The choice of activation function (e.g., Sign, Step, or Sigmoid) may differ
between perceptron models, depending on whether the learning process is slow or
suffers from vanishing or exploding gradients.
How does Perceptron work?
In Machine Learning, Perceptron is considered as a single-layer neural
network that consists of four main parameters named input values (Input
nodes), weights and Bias, net sum, and an activation function. The
perceptron model begins with the multiplication of all input values and their
weights, then adds these values together to create the weighted sum. Then
this weighted sum is applied to the activation function 'f' to obtain the
desired output. This activation function is also known as the step
function and is represented by 'f'.
This step function or Activation function plays a vital role in ensuring that
output is mapped between required values (0,1) or (-1,1). It is important to
note that the weight of input is indicative of the strength of a node. Similarly,
an input's bias value gives the ability to shift the activation function curve up
or down.
Step-1
In the first step, multiply all input values by their corresponding weight
values and then add them up to determine the weighted sum. Mathematically,
we can calculate the weighted sum as follows:
∑wi*xi + b
Step-2
In the second step, the weighted sum is passed through an activation function f
to obtain the output:
Y = f(∑wi*xi + b)
o Forward Stage: Activation functions start from the input layer in the
forward stage and terminate on the output layer.
o Backward Stage: In the backward stage, weight and bias values are
modified as per the model's requirement. In this stage, the error between the
actual output and the desired output is propagated backward, starting at the
output layer and ending at the input layer.
f(x)=1; if w.x+b>0
otherwise, f(x)=0
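A minimal sketch of this perceptron decision rule in code; the weights, bias and
sample input are made-up values used only for illustration.

# Perceptron forward pass: weighted sum followed by a step activation function.
import numpy as np

def perceptron(x, w, b):
    weighted_sum = np.dot(w, x) + b          # sum of wi*xi plus bias b
    return 1 if weighted_sum > 0 else 0      # step function f

w = np.array([0.6, -0.4])   # assumed weights
b = -0.1                    # assumed bias
x = np.array([1.0, 0.5])    # assumed input values

print(perceptron(x, w, b))  # fires (outputs 1) only if w.x + b > 0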
Characteristics of Perceptron
The perceptron model has the following characteristics.
Future of Perceptron
The future of the Perceptron model is bright and significant, as it helps
to interpret data by building intuitive patterns and applying them in the
future. Machine learning is a rapidly growing area of Artificial
Intelligence that is continuously evolving; hence perceptron technology will
continue to support and facilitate analytical behavior in machines, which will,
in turn, add to the efficiency of computers.
In a feed-forward neural network, there are no feedback loops or
connections in the network. There is simply an input layer, a hidden layer, and
an output layer.
There can be multiple hidden layers, depending on what kind of data you
are dealing with. The number of hidden layers is known as the depth of the
neural network. A deeper neural network can learn more complex functions.
The input layer first provides the neural network with data, and the output
layer then makes predictions on that data based on a series of functions.
The ReLU function is the most commonly used activation function in deep
neural networks.
1) The first input is fed to the network, represented as the vector x1, x2,
and 1, where 1 is the bias value.
2) Each input is multiplied by its weight with respect to the first and second
model to obtain the probability of the point being in the positive region in each
model.
As we know, to obtain the probability of the point being in the positive
region of each model, we take the sigmoid, thus producing our final output
in the feed-forward process.
Let us take the neural network we had previously, with linear models in the
hidden layer that combine to form the non-linear model in the output layer.
We will use this non-linear model to produce an output that
describes the probability of the point being in the positive region. The point
is represented by (2, 2); along with the bias, we represent the input accordingly.
Recall the equation that defined the first linear model in the hidden layer:
in the first layer, to obtain its linear combination, the inputs are
multiplied by -4 and -1, and the bias value is multiplied by twelve.
The inputs are multiplied by the weights -1/5 and 1, and the bias is multiplied
by three, to obtain the linear combination of that same point in our second
model.
Now, to obtain the probability that the point is in the positive region relative
to both models, we apply the sigmoid to both linear combinations.
The second layer contains the weights that dictate how the linear models from
the first layer combine into the non-linear model of the second layer. These
weights are 1.5 and 1, with a bias value of 0.5.
Now, we have to multiply the probabilities from the first layer by this
second set of weights.
This is the complete math behind the feed-forward process, in which the inputs
traverse the entire depth of the neural network. In this example,
there is only one hidden layer. Whether there is one hidden layer or twenty,
the computational process is the same for all hidden layers.
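As a rough sketch of the worked example above. The pairing of the stated weights
with the inputs is an assumption on my part, since the original figures and
equations are not reproduced in the text.

# Feed-forward pass for the two-hidden-unit example: point (2, 2), bias 1.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([2.0, 2.0])

# Hidden layer: two linear models with the weights and biases stated above.
h1 = np.dot([-4.0, -1.0], x) + 12.0        # first linear model
h2 = np.dot([-1.0 / 5.0, 1.0], x) + 3.0    # second linear model
a1, a2 = sigmoid(h1), sigmoid(h2)          # probabilities from the first layer

# Output layer: combine the two probabilities with weights 1.5, 1 and bias 0.5.
out = sigmoid(1.5 * a1 + 1.0 * a2 + 0.5)
print(a1, a2, out)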
Backpropagation Process in Deep Neural
Network
Backpropagation is one of the important concepts of a neural network. Our
task is to classify our data as well as possible. For this, we have to update
the weights and biases, but how can we do that in a deep neural network? In the
linear regression model, we use gradient descent to optimize the parameters.
Similarly, here we also use the gradient descent algorithm, with the gradients
computed by backpropagation.
Initial weight
W1=0.15 w5=0.40
W2=0.20 w6=0.45
W3=0.25 w7=0.50
W4=0.30 w8=0.55
Bias Values
b1=0.35 b2=0.60
Target Values
T1=0.01
T2=0.99
Forward Pass
To find the value of H1 we first multiply the input value from the weights as
H1=x1×w1+x2×w2+b1
H1=0.05×0.15+0.10×0.20+0.35
H1=0.3775
Similarly,
H2=x1×w3+x2×w4+b1
H2=0.05×0.25+0.10×0.30+0.35
H2=0.3925
To calculate the final results of H1 and H2, we apply the sigmoid function:
sigmoid(0.3775) = 0.593269992 and sigmoid(0.3925) = 0.596884378.
These sigmoid outputs are used as H1 and H2 in the output layer.
To find the value of y1, we multiply these outputs by the corresponding weights:
y1=H1×w5+H2×w6+b2
y1=0.593269992×0.40+0.596884378×0.45+0.60
y1=1.10590597
y2=H1×w7+H2×w8+b2
y2=0.593269992×0.50+0.596884378×0.55+0.60
y2=1.2249214
Our target values are 0.01 and 0.99. After applying the sigmoid function to y1
and y2, the outputs do not match our target values T1 and T2.
Now, we will find the total error, which is simply the difference between the
outputs and the target outputs, summed over both output neurons.
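A short sketch reproducing this forward pass in code, using the weights, biases
and inputs x1 = 0.05, x2 = 0.10 stated above; the squared-error form of the total
error is an assumption, since the error formula itself is not shown in the text.

# Forward pass for the worked backpropagation example.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x1, x2 = 0.05, 0.10
w1, w2, w3, w4 = 0.15, 0.20, 0.25, 0.30
w5, w6, w7, w8 = 0.40, 0.45, 0.50, 0.55
b1, b2 = 0.35, 0.60
T1, T2 = 0.01, 0.99

H1 = sigmoid(x1 * w1 + x2 * w2 + b1)      # 0.3775 -> 0.59327
H2 = sigmoid(x1 * w3 + x2 * w4 + b1)      # 0.3925 -> 0.59688
y1 = sigmoid(H1 * w5 + H2 * w6 + b2)      # output 1
y2 = sigmoid(H1 * w7 + H2 * w8 + b2)      # output 2

E_total = 0.5 * (T1 - y1) ** 2 + 0.5 * (T2 - y2) ** 2   # assumed squared-error loss
print(y1, y2, E_total)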
Activation Functions
The activation function of a neuron defines its output given
its inputs. We will be talking about 4 popular activation
functions (sketched in code after the list):
1. Sigmoid Function:
2. Tanh Function:
3. Softmax Function:
4. ReLU Function:
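The four activation functions listed above can be written in a few lines; the
sketch below uses NumPy and is only illustrative.

# The four activation functions mentioned above, implemented with NumPy.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))          # squashes to (0, 1)

def tanh(z):
    return np.tanh(z)                         # squashes to (-1, 1)

def softmax(z):
    e = np.exp(z - np.max(z))                 # subtract max for numerical stability
    return e / e.sum()                        # outputs sum to 1 (a probability vector)

def relu(z):
    return np.maximum(0.0, z)                 # zero for negative inputs, identity otherwise

z = np.array([-1.0, 0.0, 2.0])
print(sigmoid(z), tanh(z), softmax(z), relu(z))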
Loss Functions
The other key aspect in setting up the neural network
infrastructure is selecting the right loss functions. With
neural networks, we seek to minimize the error (difference
between actual and predicted value) which is calculated by
the loss function. We will be discussing 3 popular loss
functions:
Mean Squared Error Loss:
Range: (0, inf)
Pros: Preferred loss function if the distribution of the target
variable is Gaussian, as it has good derivatives and helps the
model converge quickly.
Cons: It is not robust to outliers in the data (unlike loss
functions such as Mean Absolute Error) and penalizes large over- and
under-predictions very heavily (unlike loss functions such as Mean
Squared Logarithmic Error Loss). A short code comparison with Mean
Absolute Error follows below.
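A quick sketch of the mean squared error described above, alongside mean absolute
error for comparison; the example arrays are invented for illustration.

# Mean Squared Error vs. Mean Absolute Error on a small example with one outlier.
import numpy as np

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 9.0])   # last prediction is a large miss

mse = np.mean((y_true - y_pred) ** 2)     # squares the errors, so the outlier dominates
mae = np.mean(np.abs(y_true - y_pred))    # more robust to the outlier

print("MSE:", mse, "MAE:", mae)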
5. Computational Resources
Machine learning algorithms can be computationally expensive, and they may require a
lot of resources to be successfully trained. This may be a major barrier, particularly for
people or smaller companies who want access to high-performance computing
resources. Distributed and cloud computing can be used to get around this restriction,
however the project's cost might go up.
For huge datasets and complex models, machine learning approaches can be
computationally expensive. The scalability and feasibility of machine learning
algorithms may be hampered by the need for significant processing resources. The
availability of computational resources like processor speed, memory, and storage is
another limitation on machine learning.
Using cloud computing is one way to overcome the computational resource barrier.
Users can scale up or decrease their use of computer resources according to their
demands using cloud computing platforms like Amazon Web Services (AWS) and
Microsoft Azure, which offer on-demand access to computing resources. The cost and
difficulty of maintaining computational resources can be greatly decreased.
To lower the computing demands, optimizing the data preprocessing pipelines and
machine learning algorithms is crucial. This may entail the use of more effective
algorithms, a decrease in the data's dimensionality, and the removal of pointless or
redundant information.
6. Lack of Causality
Predictions based on correlations in the data are frequently made using machine
learning algorithms. Machine learning algorithms may not shed light on the underlying
causal links in the data because correlation does not always imply causation. This may
reduce our capacity for precise prediction when causality is crucial.
The absence of causation is one of machine learning's main drawbacks. The main
purpose of machine learning algorithms is to find patterns and correlations in data;
however, they cannot establish causal links between different variables. In other words,
machine learning models can forecast future events based on seen data, but they
cannot explain why such events occur.
A major drawback of using machine learning models to judge is the absence of
causality. For instance, if a machine learning model is used to forecast the likelihood
that a consumer would buy a product, it may find factors like age, income, and gender
that are connected with buying behavior. The model, however, is unable to determine if
these variables are the source of the buying behaviour or whether there are further
underlying causes.
To get over this restriction, machine learning may need to be integrated with other
methodologies like experimental design. Researchers can identify causal relationships
by manipulating variables and observing how those changes impact a result using an
experimental design. However, compared to traditional machine learning techniques,
this approach may require more time and resources.
Machine learning can be a useful tool for predicting outcomes from observable data,
but it's crucial to be aware of its limitations when making decisions based on these
predictions. The lack of causation is a basic flaw in machine learning systems. To
establish causation, it could be necessary to use methods other than machine learning.
7. Ethical Considerations
Machine learning models can have major social, ethical, and legal repercussions when
used to make judgments that affect people's lives. Machine learning models, for
instance, may have a differential effect on groups of individuals when used to make
employment or lending choices. Privacy, security, and data ownership must also be
addressed when adopting machine learning models.
The ethical issue of bias and discrimination is a major one. If the training data is biased
or the algorithms are not created in a fair and inclusive manner, biases and
discrimination in society may be perpetuated and even amplified by machine learning
algorithms.
Another important ethical factor is privacy. Machine learning algorithms can collect and
process large amounts of personal data, which raises questions about how that data is
utilized and safeguarded.
Accountability and transparency are also crucial ethical factors. It is essential to ensure
that machine learning algorithms are visible and understandable and that systems are
in place to hold the creators and users of these algorithms responsible for their actions.
Finally, there are ethical issues around how machine learning will affect society. More
sophisticated machine learning algorithms may have far-reaching social, economic, and
political repercussions that require careful analysis and regulation.
Deep Learning
Deep learning is the branch of machine learning which is based on artificial
neural network architecture. An artificial neural network or ANN uses layers of
interconnected nodes called neurons that work together to process and learn
from the input data.
In a fully connected Deep neural network, there is an input layer and one or
more hidden layers connected one after the other. Each neuron receives input
from the previous layer neurons or the input layer. The output of one neuron
becomes the input to other neurons in the next layer of the network, and this
process continues until the final layer produces the output of the network. The
layers of the neural network transform the input data through a series of
nonlinear transformations, allowing the network to learn complex
representations of the input data.
Today Deep learning has become one of the most popular and visible areas of
machine learning, due to its success in a variety of applications, such as
computer vision, natural language processing, and Reinforcement learning.
Deep learning can be used for supervised, unsupervised, as well as
reinforcement machine learning. It uses a variety of ways to process these.
Supervised Machine Learning: Supervised machine learning is
the machine learning technique in which the neural network learns to make
predictions or classify data based on labeled datasets. Here we input
both the input features and the target variables. The neural network learns
to make predictions based on the cost or error that comes from the
difference between the predicted and the actual target; this process is known
as backpropagation. Deep learning algorithms like Convolutional Neural
Networks and Recurrent Neural Networks are used for many supervised tasks
like image classification and recognition, sentiment analysis, language
translation, etc.
Unsupervised Machine Learning: Unsupervised machine learning is
the machine learning technique in which the neural network learns to
discover patterns or to cluster the dataset based on unlabeled datasets.
Here there are no target variables; the machine has to determine the hidden
patterns or relationships within the datasets on its own. Deep learning
algorithms like autoencoders and generative models are used for
unsupervised tasks like clustering, dimensionality reduction, and anomaly
detection.
Reinforcement Machine Learning: Reinforcement machine learning is
the machine learning technique in which an agent learns to make decisions
in an environment so as to maximize a reward signal. The agent interacts with
the environment by taking actions and observing the resulting rewards. Deep
learning can be used to learn policies, or sets of actions, that maximize the
cumulative reward over time. Deep reinforcement learning algorithms like
Deep Q-Networks and Deep Deterministic Policy Gradient (DDPG) are used
for reinforcement learning tasks like robotics, game playing, etc.
Convolution Neural Network
A Convolutional Neural Network (CNN) is a type of Deep Learning
neural network architecture commonly used in Computer Vision.
Computer vision is a field of Artificial Intelligence that enables a
computer to understand and interpret the image or visual data.
When it comes to Machine Learning, Artificial Neural Networks perform
really well. Neural Networks are used in various datasets like images,
audio, and text. Different types of Neural Networks are used for
different purposes, for example for predicting the sequence of words
we use Recurrent Neural Networks more precisely an LSTM,
similarly for image classification we use Convolution Neural networks.
In this blog, we are going to build a basic building block for CNN.
In a regular Neural Network there are three types of layers:
1. Input Layers: It’s the layer in which we give input to our model.
The number of neurons in this layer is equal to the total number of
features in our data (number of pixels in the case of an image).
2. Hidden Layer: The input from the Input layer is then fed into the
hidden layer. There can be many hidden layers depending upon our
model and data size. Each hidden layer can have different numbers
of neurons which are generally greater than the number of features.
The output from each layer is computed by matrix multiplication of
output of the previous layer with learnable weights of that layer and
then by the addition of learnable biases followed by activation
function which makes the network nonlinear.
3. Output Layer: The output from the hidden layer is then fed into a
logistic function like sigmoid or softmax which converts the output
of each class into the probability score of each class.
The data is fed into the model, and the output obtained from each layer in
the above step is called the feedforward output. We then calculate the
error using an error function; some common error functions are cross-
entropy, squared loss error, etc. The error function measures how well
the network is performing. After that, we backpropagate through the model
by calculating the derivatives. This step is
called Backpropagation, which is basically used to minimize the loss.
Convolution Neural Network
Convolutional Neural Network (CNN) is an extended version
of the artificial neural network (ANN) that is predominantly used to
extract features from grid-like matrix datasets, for example
visual datasets like images or videos, where data patterns play an
extensive role.
CNN architecture
Now imagine taking a small patch of this image and running a small
neural network, called a filter or kernel on it, with say, K outputs and
representing them vertically. Now slide that neural network across the
whole image; as a result, we will get another image with a different
width, height, and depth. Instead of just the R, G, and B channels, we
now have more channels but a smaller width and height. This operation is
called Convolution. If the patch size is the same as that of the image
it will be a regular neural network. Because of this small patch, we
have fewer weights.
Now let’s talk about a bit of mathematics that is involved in the whole
convolution process.
Convolution layers consist of a set of learnable filters (or kernels)
having small widths and heights and the same depth as that of input
volume (3 if the input layer is image input).
For example, if we run a convolution on an image with
dimensions 34x34x3, the possible size of the filters can be a×a×3,
where 'a' can be anything like 3, 5, or 7, but smaller than
the image dimensions.
During the forward pass, we slide each filter across the whole input
volume step by step where each step is called stride (which can
have a value of 2, 3, or even 4 for high-dimensional images) and
compute the dot product between the kernel weights and patch
from input volume.
As we slide our filters, we'll get a 2-D output for each filter; stacking
them together gives an output volume with a depth equal to the number of
filters. The network will learn all the filters. (A bare-bones sketch of
this sliding dot product is given below.)
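A bare-bones sketch of the sliding-window dot product for a single 2-D filter
(single channel, stride 1, no padding; the toy image and kernel values are made
up for illustration).

# Naive 2-D convolution (really cross-correlation, as in most CNN libraries):
# slide the kernel over the image and take the dot product at each position.
import numpy as np

def conv2d(image, kernel, stride=1):
    kh, kw = kernel.shape
    oh = (image.shape[0] - kh) // stride + 1
    ow = (image.shape[1] - kw) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = image[i*stride:i*stride+kh, j*stride:j*stride+kw]
            out[i, j] = np.sum(patch * kernel)    # dot product of kernel and patch
    return out

image = np.arange(25, dtype=float).reshape(5, 5)   # toy 5x5 single-channel "image"
kernel = np.array([[1.0, 0.0, -1.0]] * 3)          # 3x3 vertical-edge filter
print(conv2d(image, kernel))                       # 3x3 output feature map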
Recurrent Neural Network (RNN)?
Recurrent Neural Network (RNN) is a type of Neural Network where the output
from the previous step is fed as input to the current step. In traditional neural
networks, all the inputs and outputs are independent of each other, but in cases
when it is required to predict the next word of a sentence, the previous words
are required and hence there is a need to remember the previous words. Thus
RNN came into existence, which solved this issue with the help of a Hidden
Layer. The main and most important feature of RNN is its Hidden state, which
remembers some information about a sequence. The state is also referred to
as Memory State since it remembers the previous input to the network. It uses
the same parameters for each input as it performs the same task on all the
inputs or hidden layers to produce the output. This reduces the complexity of
parameters, unlike other neural networks.
Recurrent Neural Network
The current state is computed as ht = f(ht-1, xt)
where:
ht -> current state
ht-1 -> previous state
xt -> input state
Formula for applying the activation function (tanh):
ht = tanh(Whh × ht-1 + Wxh × xt)
where:
Whh -> weight at the recurrent neuron
Wxh -> weight at the input neuron
The formula for calculating the output:
Yt = Why × ht
where:
Yt -> output
Why -> weight at the output layer
These parameters are updated using Backpropagation. However, since RNN
works on sequential data here we use an updated backpropagation which is
known as Backpropagation through time.
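A compact sketch of the recurrence above for a single sequence; the layer sizes,
random weights and sequence length are assumed values used only for illustration.

# One forward pass of a vanilla RNN: h_t = tanh(Whh @ h_{t-1} + Wxh @ x_t), y_t = Why @ h_t.
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size, output_size = 3, 4, 2

Wxh = rng.normal(scale=0.1, size=(hidden_size, input_size))   # input-to-hidden weights
Whh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # hidden-to-hidden weights
Why = rng.normal(scale=0.1, size=(output_size, hidden_size))  # hidden-to-output weights

sequence = [rng.normal(size=input_size) for _ in range(5)]    # 5 time steps of input
h = np.zeros(hidden_size)                                     # initial hidden (memory) state

outputs = []
for x_t in sequence:
    h = np.tanh(Whh @ h + Wxh @ x_t)   # same parameters reused at every time step
    outputs.append(Why @ h)
print(outputs[-1])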
Backpropagation Through Time (BPTT)
In an RNN, the network is arranged in an ordered fashion: each variable is
computed one at a time, in a specified order, e.g., first h1, then h2, then h3,
and so on. Hence we apply backpropagation through all of these hidden time
states sequentially.
We already know how to compute this step, as it is the same as in any simple
deep neural network's backpropagation.
One to One
This type of RNN behaves the same as any simple neural network; it is also
known as a Vanilla Neural Network. In this neural network, there is only one
input and one output.
One to One RNN
One To Many
In this type of RNN, there is one input and many outputs associated with it. One
of the most used examples of this network is image captioning, where, given an
image, we predict a sentence consisting of multiple words.
Many to One
In this type of network, many inputs are fed to the network at several states of
the network, generating only one output. This type of network is used in
problems like sentiment analysis, where we give multiple words as input and
predict only the sentiment of the sentence as output.
Many to One RNN
Many to Many
In this type of neural network, there are multiple inputs and multiple outputs
corresponding to a problem. One Example of this Problem will be language
translation. In language translation, we provide multiple words from one
language as input and predict multiple words from the second language as
output.
Use cases: