ML Document-1 - Merged

The minimum description length:

The minimum description length (MDL) principle is a powerful
method of inductive inference, the basis of statistical modeling,
pattern recognition, and machine learning. It holds that the best
explanation, given a limited set of observed data, is the one that
permits the greatest compression of the data. MDL methods are
particularly well-suited for dealing with model selection, prediction,
and estimation problems in situations where the models under
consideration can be arbitrarily complex, and overfitting the data is a
serious concern. This extensive, step-by-step introduction to the MDL
Principle provides a comprehensive reference (with an emphasis on
conceptual issues) that is accessible to graduate students and
researchers in statistics, pattern classification, machine learning, and
data mining, to philosophers interested in the foundations of
statistics, and to researchers in other applied sciences that involve
model selection, including biology, econometrics, and experimental
psychology.
Information Theory:
This chapter provides a brief introduction to some of the
fundamental concepts in Information Theory, as it relates to machine
learning. Aside: The field of Information Theory was originally
developed to model the capacity of a noisy channel through which
signals are coded and transmitted, but its applications extend far
beyond those origins. The landmark paper in the field, called A
Mathematical Theory of Communication, by Claude Shannon, was
published in 1948; it is very accessible and highly recommended.
Entropy:
At the heart of information theory is the notion of entropy. For our
purposes, entropy provides a measure of uncertainty associated with
a random variable or random process. For a discrete random variable
x ∈ {1, ..., K}, with probabilities P_i ≡ P(x = i), entropy (often denoted
H) is defined as
$$H = \sum_{i=1}^{K} P_i \log_2 \frac{1}{P_i} = -\sum_{i=1}^{K} P_i \log_2 P_i$$
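The definition above can be checked numerically. The following is a minimal sketch, assuming the distribution is given as a plain list of probabilities that sum to 1; the function name `entropy` and the example distributions are illustrative and not part of the original notes.

```python
import math

def entropy(probs):
    """Shannon entropy, in bits, of a discrete distribution."""
    # Outcomes with P_i = 0 are skipped: p * log2(p) tends to 0 as p -> 0.
    return -sum(p * math.log2(p) for p in probs if p > 0)

fair_coin = entropy([0.5, 0.5])      # maximal uncertainty for two outcomes
biased_coin = entropy([0.9, 0.1])    # more predictable, so lower entropy
uniform_four = entropy([0.25] * 4)   # four equally likely outcomes
```

A fair coin carries one bit of uncertainty, and a uniform choice among four outcomes carries two bits; any bias lowers the entropy.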
Conditional Entropy:
Conditional entropy is the expected entropy in one random variable
x, when conditioned on a random variable y:
$$H(x|y) = -\sum_{i,j} P(x_i, y_j) \log_2 P(x_i|y_j) = -\sum_j P(y_j) \sum_i P(x_i|y_j) \log_2 P(x_i|y_j) = \sum_j P(y_j) H(x|y_j)$$
Mutual Information:
Mutual information is one of the most fundamental concepts in
information theory. It is a measure of the information shared by two
random variables; i.e., a measure of how much about the state of
one such variable is known when conditioned on the state of the
other. It is defined in terms of entropy and conditional entropy, i.e.,
I(x; y) = H(x) − H(x|y) = H(y) − H(y|x)
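As a rough illustration of these two definitions, the sketch below computes H(x|y) and I(x; y) from a joint probability table. It assumes the table is a list of rows with `joint[i][j] = P(x_i, y_j)`; the function names and the example tables are hypothetical.

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def conditional_entropy(joint):
    """H(x|y) = sum_j P(y_j) * H(x | y_j), with joint[i][j] = P(x_i, y_j)."""
    n_x, n_y = len(joint), len(joint[0])
    total = 0.0
    for j in range(n_y):
        p_y = sum(joint[i][j] for i in range(n_x))
        if p_y == 0:
            continue  # a value of y that never occurs contributes nothing
        p_x_given_y = [joint[i][j] / p_y for i in range(n_x)]
        total += p_y * entropy(p_x_given_y)
    return total

def mutual_information(joint):
    """I(x; y) = H(x) - H(x|y)."""
    p_x = [sum(row) for row in joint]
    return entropy(p_x) - conditional_entropy(joint)

independent = [[0.25, 0.25], [0.25, 0.25]]   # x and y share no information
identical = [[0.5, 0.0], [0.0, 0.5]]         # y determines x completely
mi_independent = mutual_information(independent)
mi_identical = mutual_information(identical)
```

For independent variables the mutual information is zero, and when one variable determines the other it equals the full entropy H(x).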
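Cross Entropy:
The cross entropy between two distributions Q and P is given by
$$H(Q, P) = -\sum_i Q_i \log_2 P_i$$
It is the expected 'surprise' of outcomes scored under the distribution
P, with the expectation taken with respect to Q. You might note
that this quantity shows up in the definition of the KL divergence
(relative entropy), which is the cross entropy minus the entropy of Q:
$$D_{KL}(Q \| P) = \sum_i Q_i \log_2 \frac{Q_i}{P_i} = H(Q, P) - H(Q)$$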
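A small numerical sketch of cross entropy and its relation to the KL divergence follows. It assumes both distributions are lists over the same outcomes and that P_i > 0 wherever Q_i > 0; the function names and example values are illustrative.

```python
import math

def cross_entropy(q, p):
    """H(Q, P) = -sum_i Q_i log2 P_i (assumes p_i > 0 wherever q_i > 0)."""
    return -sum(qi * math.log2(pi) for qi, pi in zip(q, p) if qi > 0)

def entropy(q):
    # The entropy of Q is the cross entropy of Q with itself.
    return cross_entropy(q, q)

def kl_divergence(q, p):
    """KL(Q || P) = H(Q, P) - H(Q)."""
    return cross_entropy(q, p) - entropy(q)

q = [0.5, 0.5]   # distribution the data actually follows
p = [0.9, 0.1]   # mismatched model distribution
extra_bits = kl_divergence(q, p)
```

The divergence is zero when the two distributions match and positive otherwise, which is why cross entropy is never smaller than the entropy of Q.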
What is active learning?
Active learning is a machine learning framework in which the
learning algorithm can interactively query a user (teacher or oracle)
to label new data points with the true labels. Active learning is also
referred to as optimal experimental design.
By using active learning, we can leverage a system like
crowd-sourcing to ask human experts to label only some items
in the data set, rather than the entirety. The algorithm
iteratively selects the most informative examples based on some
value metric and sends those unlabelled examples to a labelling
oracle, who returns the true labels for those queried examples back
to the algorithm.
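The query loop described above can be sketched as follows. This is a toy example under several assumptions: the data is one-dimensional, the model is a single decision threshold, the "value metric" is closeness to the current decision boundary (uncertainty sampling), and the oracle is simulated by a hidden true threshold of 0.62. All names are hypothetical.

```python
def oracle(x, true_threshold=0.62):
    """Hypothetical labelling oracle: returns the true label for x."""
    return 1 if x >= true_threshold else 0

def fit_threshold(labelled):
    """Place the boundary midway between the two classes seen so far."""
    zeros = [x for x, y in labelled if y == 0]
    ones = [x for x, y in labelled if y == 1]
    return (max(zeros) + min(ones)) / 2

def active_learning(pool, n_queries):
    pool = sorted(pool)
    # Seed with the two extreme points (assumed to belong to different classes).
    labelled = [(pool[0], oracle(pool[0])), (pool[-1], oracle(pool[-1]))]
    unlabelled = pool[1:-1]
    for _ in range(n_queries):
        if not unlabelled:
            break
        boundary = fit_threshold(labelled)
        # Most informative example: the one closest to the current boundary,
        # i.e. the point the current model is least certain about.
        query = min(unlabelled, key=lambda x: abs(x - boundary))
        unlabelled.remove(query)
        labelled.append((query, oracle(query)))
    return fit_threshold(labelled), len(labelled)

pool = [i / 100 for i in range(101)]
boundary, n_labels = active_learning(pool, n_queries=8)
```

With only ten labels out of 101 points, the learned boundary ends up within one grid step of the hidden threshold, because each query roughly halves the region of uncertainty.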
OCCAM’S RAZOR

• Occam’s razor is commonly employed in machine learning to guide
model selection and prevent overfitting. Overfitting occurs when a
model becomes overly complex and fits the training data too closely,
resulting in poor generalization to new, unseen data. Occam's razor
helps address this issue by favouring simpler models that are less
likely to overfit.

• In machine learning, Occam's razor can be understood through the
bias-variance trade-off. Bias refers to the error introduced by
approximating a real-world problem with a simplified model, while
variance refers to the model's sensitivity to fluctuations in the
training data. The goal is to find the optimal balance between bias
and variance to achieve good generalization. This preference for
simplicity can be represented mathematically using regularization
techniques such as L1 and L2 regularization, which add a penalty on
model complexity to the training loss:
• REGULARIZED OBJECTIVE = LOSS + REGULARIZATION
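As one possible reading of the objective above, the sketch below adds an L1 or L2 penalty to the mean squared error of a linear model. The strength parameter `lam` (often written λ), the choice of squared error as the loss, and the function name are assumptions made for illustration.

```python
def regularized_objective(weights, xs, ys, lam, penalty="l2"):
    """Mean squared error of a linear model plus lam times a weight penalty."""
    n = len(xs)
    loss = sum(
        (sum(w * xi for w, xi in zip(weights, x)) - y) ** 2
        for x, y in zip(xs, ys)
    ) / n
    if penalty == "l1":
        regularization = sum(abs(w) for w in weights)   # L1: sum of |w|
    else:
        regularization = sum(w * w for w in weights)    # L2: sum of w^2
    return loss + lam * regularization

xs = [[1.0, 2.0], [2.0, 1.0]]
ys = [5.0, 4.0]
weights = [1.0, 2.0]   # fits this toy data exactly, so the loss term is 0
objective_l2 = regularized_objective(weights, xs, ys, lam=0.1, penalty="l2")
objective_l1 = regularized_objective(weights, xs, ys, lam=0.1, penalty="l1")
```

Even a model that fits the data perfectly pays a price for large weights, which is how the penalty pushes the optimizer toward simpler solutions.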

FORWARD SELECTION:

Various techniques can be employed to implement Occam's razor in
feature selection. One common approach is called "forward selection,"
where features are incrementally added to the model based on their
individual contribution to its performance. Starting with an empty
set of features, the algorithm iteratively selects the most
informative feature at each step, considering its impact on the
model's performance. This process continues until a stopping
criterion, such as reaching a desired level of performance or a
predetermined number of features, is met.
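A minimal sketch of forward selection is shown below. It assumes a caller-supplied `score` function that returns the performance of a model trained on a given feature subset; here `toy_score` and its `USEFULNESS` table are made-up stand-ins for a real validation score, with a small cost per feature.

```python
def forward_selection(all_features, score, max_features=None, min_gain=1e-9):
    selected = []                      # start with an empty set of features
    best_score = score(selected)
    remaining = list(all_features)
    while remaining and (max_features is None or len(selected) < max_features):
        # Try adding each remaining feature; keep the one that helps most.
        new_score, best_feature = max((score(selected + [f]), f) for f in remaining)
        if new_score - best_score < min_gain:
            break                      # stopping criterion: no useful gain left
        selected.append(best_feature)
        remaining.remove(best_feature)
        best_score = new_score
    return selected, best_score

# Hypothetical per-feature contributions used only to make the example runnable.
USEFULNESS = {"age": 0.20, "income": 0.15, "zip_code": 0.0, "shoe_size": 0.0}

def toy_score(subset):
    return 0.5 + sum(USEFULNESS[f] for f in subset) - 0.01 * len(subset)

selected, final_score = forward_selection(list(USEFULNESS), toy_score)
```

The two informative features are added in order of their contribution, and the loop stops as soon as adding another feature no longer improves the score.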
BACKWARD ELIMINATION:
Another approach is "backward elimination," where all features are
initially included in the model, and features are gradually
eliminated based on their contribution or lack thereof. The algorithm
removes the least informative feature at each step, re-evaluates the
model's performance, and continues eliminating features until the
stopping criterion is satisfied.
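Backward elimination can be sketched in the same spirit. As an assumption for illustration, `score` is a caller-supplied function returning the performance of a model trained on a feature subset, and `toy_score` with its `USEFULNESS` table is a made-up stand-in for a real validation score.

```python
def backward_elimination(all_features, score, min_features=1, tolerance=0.0):
    selected = list(all_features)      # start with every feature included
    best_score = score(selected)
    while len(selected) > min_features:
        # Score the model with each single feature left out in turn.
        new_score, least_informative = max(
            (score([f for f in selected if f != drop]), drop) for drop in selected
        )
        if new_score < best_score - tolerance:
            break                      # every possible removal hurts performance
        selected.remove(least_informative)
        best_score = new_score
    return selected, best_score

# Hypothetical per-feature contributions used only to make the example runnable.
USEFULNESS = {"age": 0.20, "income": 0.15, "zip_code": 0.0, "shoe_size": 0.0}

def toy_score(subset):
    return 0.5 + sum(USEFULNESS[f] for f in subset) - 0.01 * len(subset)

selected, final_score = backward_elimination(list(USEFULNESS), toy_score)
```

The two uninformative features are dropped one at a time, and elimination stops once removing any remaining feature would lower the score.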

EXAMPLES OF OCCAM’S RAZOR:

• LOST KEYS: Instead of assuming you’ve been robbed or there’s a
household ghost, Occam’s razor would suggest you simply misplaced
them.
• ANIMAL CROP CIRCLES: Rather than believing mysterious forces or
aliens are responsible for crop circles, a simpler explanation is
that some deer trampled the crops.

Cross-validation:
In cross-validation, the data is repeatedly divided into two groups,
called a training set and a testing set. To do this, the data is first
separated into a number of groups or subsets called folds. Each fold
contains about the same amount of data. The number of folds used
depends on factors like the dataset size, the data type, and the model.
For example, if you separate your data into 10 subsets or folds, you
would use nine as the training group and only one as a testing group.
1. Partition the data
Divide the data set into 10 subsets or folds. Each fold contains an
equal proportion of the data.
2. Train and test model
Train and test the model 10 times, using a different fold as the test
set in each iteration. In the first iteration, the first fold is the test set,
and you train the model on the remaining nine folds. In the second
iteration, the second fold is the test set, and the process continues in
this way until every fold has served as the test set once.
3. Calculate performance metrics
After each iteration, calculate your model's performance metrics
based on the model’s predictions on the test set.
4. Aggregate results
The performance metrics gathered in each iteration are usually
aggregated (for example, averaged) to generate an overall assessment
of the model's performance.
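The four steps can be sketched in plain Python as below. The helper names, the fixed shuffling seed, and the toy threshold classifier used as the model are all assumptions chosen to keep the example self-contained; any `fit` and `predict` pair could be plugged in instead.

```python
import random

def k_fold_splits(n_samples, k=10, seed=0):
    """Step 1: partition shuffled indices into k folds of near-equal size."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, folds[i]

def cross_validate(xs, ys, fit, predict, k=10):
    scores = []
    for train_idx, test_idx in k_fold_splits(len(xs), k):
        # Step 2: train on k-1 folds, test on the held-out fold.
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        # Step 3: performance metric (accuracy) on the test fold.
        correct = sum(predict(model, xs[i]) == ys[i] for i in test_idx)
        scores.append(correct / len(test_idx))
    # Step 4: aggregate the per-fold metrics by averaging.
    return sum(scores) / len(scores), scores

def fit_threshold(xs, ys):
    """Toy model: a threshold halfway between the two class means."""
    mean0 = sum(x for x, y in zip(xs, ys) if y == 0) / ys.count(0)
    mean1 = sum(x for x, y in zip(xs, ys) if y == 1) / ys.count(1)
    return (mean0 + mean1) / 2

def predict_threshold(threshold, x):
    return 1 if x >= threshold else 0

xs = list(range(100))
ys = [1 if x >= 50 else 0 for x in xs]
mean_accuracy, fold_scores = cross_validate(xs, ys, fit_threshold, predict_threshold)
```

Every sample is used for testing exactly once and for training nine times, so the averaged score uses all of the data without ever testing on a point the model was trained on.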

Underfitting in Machine Learning


A statistical model or a machine learning algorithm is said to be
underfitting when the model is too simple to capture the complexities
of the data. It represents the inability of the model to learn the
training data effectively, which results in poor performance on both
the training and the testing data. In simple terms, underfit models are
inaccurate, especially when applied to new, unseen examples. It mainly
happens when we use a very simple model with overly simplified
assumptions. To address the underfitting problem, we need to use more
complex models, with enhanced feature representation, and less
regularization.
Reasons for Underfitting
1. The model is too simple, so it may not be capable of representing
the complexities in the data.
2. The input features used to train the model are not adequate
representations of the underlying factors influencing the
target variable.
3. The size of the training dataset used is not enough.
4. Excessive regularization is used to prevent overfitting,
which constrains the model from capturing the data well.
5. Features are not scaled.
Techniques to Reduce Underfitting
1. Increase model complexity.
2. Increase the number of features by performing feature
engineering.
3. Remove noise from the data.
4. Increase the number of epochs or increase the duration of
training to get better results.

Overfitting in Machine Learning


A statistical model is said to be overfitted when it fits the training
data closely but does not make accurate predictions on testing data.
When a model is fitted too closely to its training data, it starts
learning from the noise and inaccurate data entries in the data set.
Testing it on test data then reveals high variance: the model does not
categorize the data correctly, because it has learned too many details
and too much noise. Overfitting is more likely with non-parametric and
non-linear methods, because these types of machine learning algorithms
have more freedom in building the model based on the dataset and can
therefore build unrealistic models. A solution to avoid overfitting is
using a linear algorithm if we have linear data, or limiting parameters
like the maximal depth if we are using decision trees.
Reasons for Overfitting:
1. High variance and low bias.
2. The model is too complex.
3. The training dataset is too small.
Techniques to Reduce Overfitting
1. Improving the quality of the training data reduces overfitting by
focusing on meaningful patterns and mitigating the risk of fitting
noise or irrelevant features.
2. Increasing the training data can improve the model’s ability to
generalize to unseen data and reduce the likelihood of
overfitting.

Example:

STUDENT   FEATURES          CLASS TEST   SEMESTER
A         Guessing          50%          97%        (underfitting)
B         Memory            98%          89%        (overfitting)
C         Problem solving   92%          90%        (best fitting)
• Underfitting: both the training error and the testing error are large.
• Overfitting: the training error is small and the testing error is large.

Bias and Variance in Machine Learning

• Bias: Bias refers to the error due to overly simplistic
assumptions in the learning algorithm. These assumptions
make the model easier to comprehend and learn but might not
capture the underlying complexities of the data. It is the error
due to the model’s inability to represent the true relationship
between input and output accurately. When a model performs
poorly on both the training and the testing data, it has high
bias, typically because the model is too simple.
• Variance: Variance, on the other hand, is the error due to the
model’s sensitivity to fluctuations in the training data. It’s the
variability of the model’s predictions for different instances of
training data. High variance occurs when a model learns the
training data’s noise and random fluctuations rather than the underlying
pattern. As a result, the model performs well on the training data but
poorly on the testing data, indicating overfitting.
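The contrast between high bias and high variance can be illustrated on a small hand-made dataset. This is a toy sketch under stated assumptions: the true rule is "label 1 when x >= 0.5", three training labels are flipped as noise, the test labels are kept noise-free for simplicity, and the three models (majority class, 1-nearest-neighbour, single threshold) are chosen only for illustration.

```python
def error_rate(predict, xs, ys):
    return sum(predict(x) != y for x, y in zip(xs, ys)) / len(xs)

# Training set: label is 1 when x >= 0.5, with three labels flipped as noise.
train_x = [i / 20 for i in range(20)]
train_y = [1 if x >= 0.5 else 0 for x in train_x]
for noisy in (3, 12, 17):
    train_y[noisy] = 1 - train_y[noisy]

# Test set: fresh points from the same rule.
test_x = [(i + 0.3) / 20 for i in range(20)]
test_y = [1 if x >= 0.5 else 0 for x in test_x]

# High bias (underfitting): always predict the majority training class.
majority = 1 if sum(train_y) * 2 > len(train_y) else 0
def too_simple(x):
    return majority

# High variance (overfitting): 1-nearest-neighbour memorizes every label.
def memorizer(x):
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]

# Balanced model: one threshold halfway between the two class means.
mean0 = sum(x for x, y in zip(train_x, train_y) if y == 0) / train_y.count(0)
mean1 = sum(x for x, y in zip(train_x, train_y) if y == 1) / train_y.count(1)
def threshold_model(x):
    return 1 if x >= (mean0 + mean1) / 2 else 0

models = [("too simple", too_simple), ("memorizer", memorizer),
          ("threshold", threshold_model)]
results = {name: (error_rate(m, train_x, train_y), error_rate(m, test_x, test_y))
           for name, m in models}
```

In `results`, each entry is a pair (training error, testing error). The majority-class model has large error on both sets, the memorizer has zero training error but reproduces the noise on the test set, and the threshold model ignores the noisy training labels and generalizes.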
What is a Decision Tree?
A decision tree is a flowchart-like structure used to make decisions or
predictions. It consists of nodes representing decisions or tests on attributes,
branches representing the outcome of these decisions, and leaf nodes representing
final outcomes or predictions. Each internal node corresponds to a test on an
attribute, each branch corresponds to the result of the test, and each leaf node
corresponds to a class label or a continuous value.
o A decision tree is a supervised learning technique that can be used for both
classification and regression problems, but it is mostly preferred for solving
classification problems.

Structure of a Decision Tree


1. Root Node: Represents the entire dataset and the initial decision to be
made.
2. Internal Nodes: Represent decisions or tests on attributes. Each
internal node has two or more branches.
3. Branches: Represent the outcome of a decision or test, leading to
another node.
4. Leaf Nodes: Represent the final decision or prediction. No further
splits occur at these nodes.

How Do Decision Trees Work?

The process of creating a decision tree involves:

1. Selecting the Best Attribute: Using a metric like Gini impurity,
entropy, or information gain, the best attribute to split the data is
selected.
2. Splitting the Dataset: The dataset is split into subsets based on the
selected attribute.
3. Repeating the Process: The process is repeated recursively for each
subset, creating a new internal node or leaf node until a stopping
criterion is met (e.g., all instances in a node belong to the same class or
a predefined depth is reached).
ALGORITHM:
Step-1: Begin the tree with the root node, say S, which contains the
complete dataset.

Step-2: Find the best attribute in the dataset using an Attribute Selection
Measure (ASM).

Step-3: Divide S into subsets, one for each possible value of the best
attribute.

Step-4: Generate the decision tree node, which contains the best attribute.

Step-5: Recursively make new decision trees using the subsets of the dataset
created in Step-3. Continue this process until a stage is reached where you
cannot further classify the nodes; each such final node is called a leaf node.
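The five steps can be sketched as a short recursive function. This is an ID3-style illustration under assumptions: attributes are categorical, rows are dictionaries, information gain is used as the attribute selection measure, and the tiny weather dataset is invented for the example.

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum(c / total * math.log2(c / total) for c in Counter(labels).values())

def best_attribute(rows, labels, attributes):
    """Step 2: pick the attribute with the highest information gain."""
    def gain(attr):
        groups = {}
        for row, label in zip(rows, labels):
            groups.setdefault(row[attr], []).append(label)
        remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
        return entropy(labels) - remainder
    return max(attributes, key=gain)

def build_tree(rows, labels, attributes):
    # Leaf node: the subset is pure, or there is nothing left to split on.
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    attr = best_attribute(rows, labels, attributes)
    node = {"attribute": attr, "children": {}}            # Step 4
    remaining = [a for a in attributes if a != attr]
    for value in sorted(set(row[attr] for row in rows)):  # Step 3
        pairs = [(r, l) for r, l in zip(rows, labels) if r[attr] == value]
        node["children"][value] = build_tree(             # Step 5
            [r for r, _ in pairs], [l for _, l in pairs], remaining)
    return node

def predict(tree, row):
    # Follow branches until a leaf (a plain label) is reached.
    # Attribute values never seen in training are not handled in this sketch.
    while isinstance(tree, dict):
        tree = tree["children"][row[tree["attribute"]]]
    return tree

rows = [{"outlook": "sunny", "wind": "calm"}, {"outlook": "sunny", "wind": "strong"},
        {"outlook": "rainy", "wind": "calm"}, {"outlook": "rainy", "wind": "strong"}]
labels = ["yes", "yes", "yes", "no"]
tree = build_tree(rows, labels, ["outlook", "wind"])      # Step 1: root holds all data
predictions = [predict(tree, row) for row in rows]
```

On this dataset neither attribute alone separates the classes, so the tree needs two levels of tests before every leaf is pure.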

Attribute Selection Measures


While implementing a decision tree, the main issue is how to select the
best attribute for the root node and for the sub-nodes. To solve this problem there
is a technique called the Attribute Selection Measure, or ASM. Using this
measure, we can easily select the best attribute for the nodes of the tree. There
are two popular techniques for ASM, which are:

o Information Gain
o Gini Index

1. Information Gain:
o Information gain is the measurement of the change in entropy after the
segmentation of a dataset based on an attribute.
o It calculates how much information a feature provides us about a class.
o According to the value of information gain, we split the node and build the
decision tree.
o A decision tree algorithm always tries to maximize the value of information
gain, and the node/attribute having the highest information gain is split first.
It can be calculated using the below formula:

Information Gain = Entropy(S) − [(Weighted Avg) × Entropy(each feature)]

Written out, with S_v denoting the subset of S in which attribute A takes the value v:
$$\text{Gain}(S, A) = \text{Entropy}(S) - \sum_{v \in \text{Values}(A)} \frac{|S_v|}{|S|} \, \text{Entropy}(S_v)$$

Entropy: Entropy is a metric to measure the impurity in a given set of
samples. It specifies the randomness in the data. For a two-class (yes/no)
problem, entropy can be calculated as:

Entropy(S) = −P(yes) log2 P(yes) − P(no) log2 P(no)

Where,
o S = the set of samples
o P(yes) = probability of yes
o P(no) = probability of no
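A short sketch of how entropy and information gain can be computed for a categorical attribute follows. The row format (a list of dictionaries), the function names, and the four-row dataset are assumptions made for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum(c / total * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """Entropy(S) minus the weighted average entropy of the subsets of S
    produced by splitting on the given attribute."""
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attribute], []).append(label)
    weighted = sum(len(s) / len(labels) * entropy(s) for s in subsets.values())
    return entropy(labels) - weighted

rows = [{"outlook": "sunny", "windy": False}, {"outlook": "sunny", "windy": True},
        {"outlook": "rainy", "windy": False}, {"outlook": "rainy", "windy": True}]
labels = ["no", "no", "yes", "yes"]

gain_outlook = information_gain(rows, labels, "outlook")  # splits classes perfectly
gain_windy = information_gain(rows, labels, "windy")      # tells us nothing
```

Here `outlook` removes all uncertainty about the class while `windy` removes none, so a tree built by maximizing information gain would split on `outlook` first.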

2. Gini Index:
o Gini index is a measure of impurity or purity used while creating a decision
tree in the CART (Classification and Regression Tree) algorithm.
o An attribute with a low Gini index should be preferred over one with a
high Gini index.
o The CART algorithm creates only binary splits, and it uses the Gini index to
choose them.
o Gini index can be calculated using the below formula, where P_j is the
proportion of samples belonging to class j:

$$\text{Gini Index} = 1 - \sum_j P_j^2$$
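The Gini index is straightforward to compute from class counts. In the sketch below, `gini_index` follows the formula above; `gini_of_split`, which weights the two sides of a binary split by their sizes, is an added assumption about how a split is usually scored and is not stated in the notes.

```python
from collections import Counter

def gini_index(labels):
    """1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

def gini_of_split(left_labels, right_labels):
    """Size-weighted Gini index of the two sides of a binary split."""
    total = len(left_labels) + len(right_labels)
    return (len(left_labels) / total * gini_index(left_labels)
            + len(right_labels) / total * gini_index(right_labels))

pure = gini_index(["yes", "yes", "yes", "yes"])
mixed = gini_index(["yes", "yes", "no", "no"])
clean_split = gini_of_split(["yes", "yes"], ["no", "no"])
poor_split = gini_of_split(["yes", "no"], ["yes", "no"])
```

A pure node scores 0 and an evenly mixed two-class node scores 0.5, so the split with the lower weighted Gini index is the better one.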

Advantages of the Decision Tree


o It is simple to understand, as it follows the same process that a human
follows while making a decision in real life.
o It can be very useful for solving decision-related problems.
o It helps to think about all the possible outcomes for a problem.
o There is less requirement of data cleaning compared to other algorithms.

Disadvantages of the Decision Tree


o A decision tree can contain many layers, which makes it complex.
o It may have an overfitting issue, which can be reduced using the Random
Forest algorithm.
o With more class labels, the computational complexity of the decision tree may
increase.
Neural Network Learning
Introduction

Neural network learning is a fundamental concept in machine learning, inspired
by the structure and function of the human brain. Neural networks consist of
interconnected nodes or neurons that process and learn from data, enabling tasks
such as pattern recognition and decision making. Neural
networks are sometimes called artificial neural networks (ANNs) or simulated
neural networks (SNNs). They are a subset of machine learning, and at the heart
of deep learning models.
Neural networks are complex systems that mimic some features of the
functioning of the human brain. A network is composed of an input layer, one or
more hidden layers, and an output layer, each made up of artificial neurons that
are connected to one another. The two stages of the basic process are called
forward propagation and backpropagation.

Artificial Neural Networks (ANNs)

• Composed of interconnected nodes (neurons) and edges (weights).
• Each neuron receives inputs, applies an activation function, and produces
an output.
• ANNs can learn complex patterns and relationships in data.

Learning Process

1. Training: Feed the network with labelled data (inputs and desired
outputs).
2. Forward Propagation: Inputs flow through the network, generating
predictions.
3. Error Calculation: Compare predictions with actual outputs,
calculating the error.
4. Backpropagation: Propagate the error backward through the network
and adjust weights and biases to reduce it.
5. Optimization: Repeat steps 2-4 until convergence or until a stopping
criterion is met.
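The learning process can be sketched with the smallest possible network, a single sigmoid neuron. Several choices here are assumptions made for illustration: the task (logical OR), cross-entropy as the error measure, a learning rate of 0.5, and 1000 passes over the data. With only one neuron there are no hidden layers, so step 4 reduces to a direct gradient update.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# 1. Training data: inputs paired with desired outputs (logical OR).
data = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0), ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]
weights, bias, learning_rate = [0.0, 0.0], 0.0, 0.5

def predict(x):
    # 2. Forward propagation: weighted sum followed by the activation function.
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

def mean_loss():
    # 3. Error calculation: mean cross-entropy between outputs and targets.
    return -sum(t * math.log(predict(x)) + (1 - t) * math.log(1 - predict(x))
                for x, t in data) / len(data)

losses = [mean_loss()]
for epoch in range(1000):                      # 5. repeat steps 2-4
    for x, target in data:
        error = predict(x) - target
        # 4. Weight update: for a sigmoid unit with cross-entropy loss, the
        # gradient with respect to each weight is (output - target) * input.
        for i in range(len(weights)):
            weights[i] -= learning_rate * error * x[i]
        bias -= learning_rate * error
    losses.append(mean_loss())

predictions = [round(predict(x)) for x, _ in data]
```

After training, rounding the neuron's outputs reproduces the OR truth table, and the recorded loss is lower than at the start.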

Key Concepts

• Activation Functions: Introduce non-linearity, enabling ANNs to learn
complex relationships.
• Loss Functions: Measure the difference between predictions and actual
outputs.
• Optimization Algorithms: Update weights and biases to minimize the
loss function.
• Overfitting: When the network becomes too specialized to the training
data, failing to generalize well.

Types of Neural Networks

• Feedforward Networks: Simple, one-way flow of information.
• Recurrent Neural Networks (RNNs): Feedback connections enable
sequential data processing.
• Convolutional Neural Networks (CNNs): Designed for image and signal
processing.

Examples

1. Image Classification:
• Input: Images of animals (e.g., dogs, cats, birds).
• Output: Labels indicating the type of animal.
• Neural network learns to identify patterns in images and classify them
correctly.
2. Speech Recognition:
• Input: Audio recordings of spoken words.
• Output: Transcribed text.
• Neural network learns to recognize patterns in audio signals and transcribe
them into text.
3. Predicting Stock Prices:
• Input: Historical stock price data.
• Output: Predicted future stock prices.
• Neural network learns to identify patterns in stock price fluctuations and
make predictions.
4. Sentiment Analysis:
• Input: Text reviews or comments.
• Output: Sentiment labels (positive, negative, neutral).
• Neural network learns to identify patterns in language and determine
sentiment.

Some real-world examples of neural networks in action include:

• Google DeepMind's AlphaGo, which defeated a human world champion in Go.
• Self-driving cars using CNNs for image recognition and navigation.
• Virtual assistants like Siri, Alexa, and Google Assistant using RNNs for
speech recognition.
• Medical diagnosis systems using neural networks to analyze medical
images and predict diseases.
MULTI LAYER NEURAL NETWORKS & BACK PROPAGATION
Linear vs Non Linear Functions:
Linear functions are those whose graph is a single straight line, i.e., those which have a
constant slope.
Eg: y=mx+c or y=c.
Non linear functions are those which do not have a constant slope. For example,
all polynomials with the highest exponent greater than 1 are non linear
functions.
Eg: y=x^2.

[Figure: Linear vs non-linear functions]

Why do we need Back Propagation in Multi Layer Neural Networks?

First, let’s see what a multi layer neural network looks like. In a multi layer neural
network, there is one input layer, one output layer and one or more hidden layers.
[Figure: Representation of a Multi Layer Neural Network]
Every node in the nth layer is connected to every node in the (n-1)th
layer (n>1). The input from the input layer is multiplied by the associated weights of every
link and passed forward, layer by layer, until the output layer produces the final output. In case of any error, unlike
in a perceptron, we might need to update several weight vectors in many hidden
layers. This is where Back Propagation comes into play. It is the updating of the
weight vectors in the hidden layers according to the training error or the loss produced in
the output layer.
BACK PROPAGATION ALGORITHM
Considering multiple output units rather than a single output unit, the formula for
calculating the training error of a neural network can be represented as follows:

$$E = \frac{1}{2} \sum_{d \in D} \sum_{k \in outputs} (t_{kd} - o_{kd})^2$$
Error function in multi-layer neural networks

• D is the set of training data points
• outputs is the set of output units in the network
• d is the data point
• t_kd and o_kd are the target value and the output value produced by the network for the kth
output unit for data point ‘d’.
Now that we have the error function and the input and output units, we need to know the rule for
updating the weight vectors. Before that, let’s look at one of the most common activation
functions used in multi layer neural networks, i.e. the sigmoid function.
A sigmoid function is a continuously differentiable, S-shaped function. The logistic sigmoid,
which is built from e^x, produces its output in the range of 0 to 1 (not including 0 and 1); the
hyperbolic tangent (tanh) is a closely related S-shaped function whose output lies between -1 and 1.
The logistic sigmoid can be represented as:
$$\sigma(y) = \frac{1}{1 + e^{-y}}$$
Sigmoid Function
where y is the linear combination of the input vector and the weight vector at a given node.
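The logistic sigmoid and its derivative take only a few lines of code. The sketch below assumes the standard logistic form 1 / (1 + e^(-y)); the derivative identity σ'(y) = σ(y)(1 − σ(y)) is what makes this activation convenient for back propagation. The function names are illustrative.

```python
import math

def sigmoid(y):
    """Logistic sigmoid: squashes any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-y))

def sigmoid_derivative(y):
    """Derivative expressed through the output itself: s * (1 - s)."""
    s = sigmoid(y)
    return s * (1.0 - s)

midpoint = sigmoid(0.0)               # 0.5, the centre of the S-shaped curve
steepest = sigmoid_derivative(0.0)    # 0.25, the largest slope of the curve
low, high = sigmoid(-30.0), sigmoid(30.0)
```

Large negative inputs map close to 0 and large positive inputs close to 1, while the slope is greatest at y = 0 and shrinks toward the tails.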
Now, let’s see how the weight vectors are updated in multi layer networks according to the Back
Propagation Algorithm.
Updating weights in Back Propagation
The algorithm can be represented in a step-wise manner:
• Input the first data point into the network and calculate the output ‘o’ of
every unit ‘u’.
• For each output unit ‘k’, the training error ‘δ’ can be calculated by the given formula:
$$\delta_k = o_k (1 - o_k)(t_k - o_k)$$
• For each hidden unit ‘h’, the training error ‘δ’ can be calculated by the given formula, in
which the training error of the output units to which the hidden unit is connected is taken
into consideration (w_hk is the weight from hidden unit h to output unit k):
$$\delta_h = o_h (1 - o_h) \sum_{k \in outputs} w_{hk} \, \delta_k$$
• Update weight vectors by the given formula:
$$w_{ji} \leftarrow w_{ji} + \eta \, \delta_i \, x_{ji}$$
• The weight from the jth node to the ith node is updated using the above formula, in which ‘η’ is
the learning rate, ‘δ_i’ is the training error of the ith node and ‘x_ji’ is the input that the ith node receives from the jth node.
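One back propagation step for a small network can be sketched as below. It assumes sigmoid units throughout, a single hidden layer, the squared error defined earlier, and a bias handled as an extra weight on a constant input of 1.0; the network size, learning rate, random initial weights and the single training point are arbitrary choices made for illustration.

```python
import math
import random

def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))

def forward(x, w_hidden, w_out):
    x_b = x + [1.0]                                   # inputs plus bias input
    hidden = [sigmoid(sum(w * v for w, v in zip(row, x_b))) for row in w_hidden]
    h_b = hidden + [1.0]                              # hidden outputs plus bias
    outputs = [sigmoid(sum(w * v for w, v in zip(row, h_b))) for row in w_out]
    return x_b, h_b, outputs

def squared_error(x, t, w_hidden, w_out):
    outputs = forward(x, w_hidden, w_out)[2]
    return 0.5 * sum((tk - ok) ** 2 for tk, ok in zip(t, outputs))

def backprop_step(x, t, w_hidden, w_out, eta):
    x_b, h_b, o = forward(x, w_hidden, w_out)
    # Error term of each output unit: o(1 - o)(t - o).
    delta_out = [ok * (1 - ok) * (tk - ok) for ok, tk in zip(o, t)]
    # Error term of each hidden unit: o_h(1 - o_h) times the weighted sum of
    # the error terms of the output units it feeds into.
    delta_hidden = [
        h_b[h] * (1 - h_b[h]) * sum(w_out[k][h] * delta_out[k]
                                    for k in range(len(w_out)))
        for h in range(len(w_hidden))
    ]
    # Weight update: weight += eta * (error term of receiving unit) * input.
    for k, row in enumerate(w_out):
        for h in range(len(row)):
            row[h] += eta * delta_out[k] * h_b[h]
    for h, row in enumerate(w_hidden):
        for i in range(len(row)):
            row[i] += eta * delta_hidden[h] * x_b[i]

rng = random.Random(0)
w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(3)] for _ in range(2)]
w_out = [[rng.uniform(-0.5, 0.5) for _ in range(3)]]
x, t = [1.0, 0.0], [1.0]
error_before = squared_error(x, t, w_hidden, w_out)
for _ in range(50):
    backprop_step(x, t, w_hidden, w_out, eta=0.5)
error_after = squared_error(x, t, w_hidden, w_out)
```

Both sets of error terms are computed before any weight is changed, so the hidden-layer terms use the same output weights that produced the prediction. Repeating the step on the same data point drives the error down.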
Termination Criterion for Multi layer networks
The above algorithm is applied repeatedly to all data points until a
termination criterion is met, which can be specified in any of these three ways:
• Training the network for a fixed number of epochs (iterations).
• Setting a threshold on the error: if the error goes below the given threshold, we can
stop training the neural network further.
• Creating a validation sample of data: after every iteration we validate our model with
this data, and the iteration with the highest accuracy can be considered as the final
model.
The first way of termination might not yield the best results. The most recommended way
is the third, as it keeps us aware of the accuracy of our model so far.
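The three criteria can be combined in one training loop, sketched below. The callbacks `train_one_epoch`, `training_error` and `validation_accuracy` are hypothetical placeholders for a real network's training and evaluation code; the toy versions here only simulate an error that keeps falling and a validation accuracy that peaks at epoch 5.

```python
import copy

def train_with_termination(model, train_one_epoch, training_error,
                           validation_accuracy, max_epochs=100,
                           error_threshold=0.01):
    best_accuracy, best_model, best_epoch = -1.0, copy.deepcopy(model), 0
    epoch = 0
    for epoch in range(1, max_epochs + 1):       # way 1: fixed number of epochs
        train_one_epoch(model)
        accuracy = validation_accuracy(model)    # way 3: validate every iteration
        if accuracy > best_accuracy:
            best_accuracy, best_epoch = accuracy, epoch
            best_model = copy.deepcopy(model)    # remember the best model so far
        if training_error(model) < error_threshold:
            break                                # way 2: error below threshold
    return best_model, best_epoch, epoch

# Toy stand-ins: the "model" only counts how many epochs it has been trained.
def toy_train(model):
    model["epochs"] += 1

def toy_error(model):
    return 1.0 / (1 + model["epochs"])           # keeps decreasing

def toy_accuracy(model):
    return 0.9 - 0.01 * abs(model["epochs"] - 5) # peaks at epoch 5, then drops

best_model, best_epoch, last_epoch = train_with_termination(
    {"epochs": 0}, toy_train, toy_error, toy_accuracy,
    max_epochs=20, error_threshold=0.08)
```

In this simulation training continues until the error threshold is reached at epoch 12, but the model returned is the copy saved at epoch 5, where validation accuracy was highest.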

Conclusion:
This covers the mathematics behind multi layer neural networks. Multilayer neural
networks, with their multiple layers and nonlinear activations, excel at capturing complex
patterns in data. Backpropagation is the key training algorithm, which propagates errors
backward from the output layer toward the input layer, allowing for weight adjustments using gradient
descent. This process enables the network to learn effectively. Together, these techniques
form the backbone of modern deep learning, leading to significant advancements in areas like
computer vision, natural language processing, and beyond.
