Basic Concepts of Machine Learning For Beginners
- V V NAGARAJU DUGGIRALA
DEFINITION:
Machine Learning is a set of methods that can automatically detect patterns in data and then use
the uncovered patterns to predict future data or to perform other kinds of decision making under
uncertainty (such as planning how to collect more data).
LEARNING PROBLEM:
A computer program is said to learn from experience E with respect to some class of tasks T and
performance measure P, if its performance at tasks in T, as measured by P, improves with
experience E.
In general, to have a well-defined learning problem we must identify these three features: the class of tasks (T), the measure of performance to be improved (P), and the source of training experience (E).
E-mail classification:
• y = f(x)
o y = output, x = feature representation, f( ) is prediction function
• Training: Given a training set, estimate the prediction function f( ) by minimizing the
prediction error.
• Testing: Apply f( ) to an unknown test sample x; the predicted value (output) is y.
• y = f(w, x) (for linear model)
o y = output, x = feature representation, w = weight, f(w, ) = prediction function
• Training: Given a training set, estimate the prediction function, f( ) by minimizing the
prediction error.
• Testing: Apply f(w, ) to an unknown test sample x; the predicted value (output) is y.
o Parameters: the primary problem is to find the parameters w (a minimal fitting sketch is shown below).
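As a rough illustration of the training and testing steps above, here is a minimal sketch in Python (NumPy) that estimates the weights w of a linear model y = f(w, x) by minimizing the squared prediction error; the data arrays are made up purely for illustration.

```python
import numpy as np

# Toy training data (made up for illustration): 100 samples, 3 features.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 3))
true_w = np.array([2.0, -1.0, 0.5])
y_train = X_train @ true_w + rng.normal(scale=0.1, size=100)

# Training: estimate w by minimizing the squared prediction error (least squares).
w, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

# Testing: apply f(w, x) to an unseen sample x to get the predicted output y.
x_test = np.array([1.0, 0.0, -2.0])
y_pred = x_test @ w
print(w, y_pred)
```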
The following are the various steps involved in problem solving using AI/ML:
Training Set:
Test Set:
• Test set is used to get a final, unbiased estimate of how well the learning method works.
• We expect this estimate to be worse than the error measured on the training/validation set.
• The test set is unbiased, and this is what distinguishes it from the training set.
• Like the training set, the test set is a finite sample and bound to have variance due to sample
size, but the test set doesn’t have an optimistic or pessimistic bias.
• The test set doesn’t affect the outcome of the learning process, which uses only the training set.
Validation Set:
• Validation set is not used for learning but is for avoiding overfitting.
• The idea of a ‘validation set’ is similar to that of a ‘test set’, but there is an important difference.
• We remove a subset from data and this subset is not used for training.
• Although validation set is not directly used for training, it will be used in making certain
choices in the learning process. As the set affects the learning process, it is no longer a test
set.
• The more we use validation set to fine tune the model, the more validation set becomes like
a training set.
• The validation and test sets are roughly the same sizes, much smaller than the size of the
training set. The learning algorithm cannot use examples from these two subsets to build the
model.
• Validation set is used for selecting the model complexity.
In the past, the rule of thumb was to use 70% of the dataset for training, 15% for validation and 15%
for testing. However, in the age of big data, datasets often have millions of examples. In such cases,
it could be reasonable to keep 95% for training and 2.5%/2.5% for validation/testing.
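A minimal sketch of such a split, assuming the examples are stored in NumPy arrays and using the 95/2.5/2.5 ratio mentioned above:

```python
import numpy as np

def split_dataset(X, y, train=0.95, val=0.025, seed=0):
    """Shuffle and split a dataset into training, validation and test sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_train = int(train * len(X))
    n_val = int(val * len(X))
    train_idx = idx[:n_train]
    val_idx = idx[n_train:n_train + n_val]
    test_idx = idx[n_train + n_val:]   # the remainder goes to the test set
    return ((X[train_idx], y[train_idx]),
            (X[val_idx], y[val_idx]),
            (X[test_idx], y[test_idx]))
```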
Details of the training set, test set and validation set are shown below:
TYPES OF MACHINE LEARNING
• Supervised Learning
• Unsupervised Learning
• Semi-supervised Learning
• Reinforcement Learning
1. Train the model – for example, provide images of apples along with the expected response to the
model. Train the algorithm (reverse-engineer the relationship between x and y), generate the
learned function (y = f(x)) and predict for new instances using y = f(x).
2. Test the model – in this step the trained model is given new images without the expected output
and generates predictions for them.
• Classification
• Regression
Classification:
Regression:
• Predict tomorrow’s stock market price given current market conditions and other possible
side information.
• Predict the age of a viewer watching a given video on YouTube.
• Predict the location in 3D space of a robot arm end effector, given control signals (torques)
sent to its various motors.
• Predict the amount of prostate specific antigen (PSA) in the body as a function of a number
of different clinical measurements.
• Predict the temperature at any location inside a building using weather data, time, door
sensors etc.,
• Logistic Regression
• K-Nearest Neighbors (KNN)
• Decision Trees
• Random Forest
• Naïve Bayes Classifier (NBC)
• Support Vector Machines (SVM)
• Neural Networks
LOGISTIC REGRESSION:
Use cases:
• Loan sanction
• Customer segmentation
• Spam filter
• Exam Result prediction
Applications:
• Weather Forecast
o The binary dependent variables of sun, cloud, storm and rain are regressed against
independent variables that define weather properties.
o The data is used to conclude whether the weather will be sunny/stormy/cloudy/ rainy.
• Cancer Prediction
o The outcome will either be malignant or benign.
Advantages:
• Easy to understand.
Disadvantages:
Sigmoid Probability:
• The probability in the logistic regression is represented by the Sigmoid function (logistic
function or the S-curve).
• The sigmoid function gives an ‘S’ shaped curve.
• The curve has finite limits: the output is bounded between 0 and 1.
• The output represents the probability that the binary outcome y equals 1. This is called the
sigmoid probability (see the sketch below).
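A minimal sketch of the sigmoid (logistic) function and of how logistic regression turns a linear score into a probability; the weights shown are placeholders, not a trained model:

```python
import numpy as np

def sigmoid(z):
    """Logistic (S-curve) function: maps any real score to a value in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Linear score w.x + b squashed into a probability of the positive class.
w, b = np.array([0.8, -0.4]), 0.1      # placeholder parameters
x = np.array([2.0, 1.0])
p = sigmoid(w @ x + b)
label = int(p >= 0.5)                  # threshold at 0.5 for a binary decision
print(p, label)
```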
Steps in KNN:
1. For each unknown sample find the distance from all labelled samples.
2. Sort these labels by their distance from the unknown sample.
3. Select the ‘k’ labelled sample nearest to the unknown sample.
4. Identify the majority label.
5. Assign the majority label to the unknown sample.
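A minimal sketch of the steps above, assuming Euclidean distance and NumPy arrays (a simplification; practical implementations use faster data structures such as KD-trees):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest labelled samples."""
    distances = np.linalg.norm(X_train - x_new, axis=1)    # step 1: distances
    nearest = np.argsort(distances)[:k]                    # steps 2-3: sort, take k
    majority_label, _ = Counter(y_train[nearest]).most_common(1)[0]  # step 4
    return majority_label                                  # step 5: assign the label
```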
Advantages of KNN:
• Simple to implement.
• Works well in practice provided it is given a good distance metric and has enough labeled
training data.
• Does not require building a model, making assumptions, or tuning parameters.
• Can be extended easily with new examples.
Disadvantages of KNN:
Applications of KNN:
• Information retrieval.
• Handwritten character classification using nearest neighbor in large databases.
• Recommender systems
• Breast cancer diagnosis
• Medical data mining.
• Pattern recognition
• Is one of the predictive modelling approaches used in statistics, data mining and machine
learning.
• Uses a decision tree as a predictive model and maps observations about an item to
conclusions about its target value.
• When the target variable can take a finite set of values, the tree models are called
classification trees. In these tree structures, leaves represent class labels and branches
represent conjunctions of features that lead to those class labels.
• Decision trees where the target variable can take continuous values (typically real numbers)
are called regression trees.
• Makes sequential, hierarchical decisions about the outcome variable based on the predictor
data
• Is a graphical representation of all possible solutions to a decision
• A series of yes/no rules based on the features, forming a tree, to match all possible
outcomes of a decision.
• Use predictive models to achieve results
• Is drawn upside down with its root at the top
• Splits into branches based on a condition or internal node
• Doesn’t split the end of the branch if it is the decision/leaf
• Predicts continuous values like price of a house
• Referred to as CART (Classification and Regression Tree)
• Represents a single input variable (x) and a split point on that variable
• Goal is to create a model that predicts the value of a target variable based on several input
variables.
• Each internal (non-leaf) node is labeled with an input feature.
• The arcs coming from a node labeled with an input feature are labeled with each of the
possible values of the target or output feature or the arc leads to a subordinate decision
node on a different input feature.
• Each leaf of the tree is labeled with a class or a probability distribution over the classes,
signifying that the data set has been classified by the tree into either a specific class or into a
probability distribution.
• A tree is built by splitting the source set, constituting the root node of the tree into subsets
which constitute the successor children.
• The splitting is based on a set of splitting rules based on classification features.
• Makes sequential, hierarchical decisions about the outcome variable based on the predictor
data
• A decision tree is a tree whose internal nodes are tests on input patterns and whose leaf
nodes are categories of patterns.
• A decision tree assigns a class number or output to an input pattern by filtering the pattern
down through the tests in the tree.
• Each test has mutually exclusive and exhaustive outcomes.
• Are defined by recursively partitioning the input space and defining a local model in each
resulting region of input space which can be represented by a tree with one leaf per region.
• A flowchart shaped like a tree where every internal node denotes a check on an attribute
• Each branch represents the outcome of the test and each leaf node represents the class
label
• A path from root to leaf represents classification rules
• Follow a natural if-then-else construction
• Are best for categorization problems where attributes are systematically checked
• The goal is to make the optimal choice at the end of each node
• Has applications in medical diagnosis and credit risk analysis
• Decision Tree (a.k.a CART) network will be as shown below:
Terms used in Decision Tree:
• Root Node – the entire population or sample that further gets divided
• Splitting – Division of nodes into two or more sub nodes
• Decision Node – A sub node splits into further sub nodes
• Leaf/Terminal Node – Node that doesn’t split
• Pruning – process of removing sub nodes
• Branch/Sub tree – A subsection of the entire tree
• Parent Node – A node which is divided into sub nodes and the sub nodes are the child node
• Makes the decision to split a node based on purity
• Need to reach maximum purity to optimize the model
1. Place the best attribute of the dataset at the root of the tree.
2. Split the training set into subsets.
3. Repeat step 1 and step 2 on each subset until you find leaf nodes in all the branches of the
tree.
1. Entropy
a. measures the impurity of a collection of examples
b. depends on the distribution of the random variable
c. Measures the amount of information in a random variable
2. Information Gain
a. Expected reduction in entropy caused by partitioning the examples on an attribute
b. Higher the information gain, the more effective the attribute in classifying training
data
The attribute with the highest information gain is selected as the splitting attribute.
A branch with entropy 0 is a leaf node; a branch with entropy greater than 0 needs further splitting.
The algorithm iterates on the non-leaf branches until all nodes become leaf nodes.
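A minimal sketch of entropy and information gain for a candidate split, assuming categorical labels and attribute values stored in NumPy arrays:

```python
import numpy as np

def entropy(labels):
    """Impurity of a collection of examples, in bits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(labels, attribute_values):
    """Expected reduction in entropy from partitioning on an attribute."""
    gain = entropy(labels)
    for v in np.unique(attribute_values):
        subset = labels[attribute_values == v]
        gain -= (len(subset) / len(labels)) * entropy(subset)
    return gain
```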
• Can be very non-robust and unstable. A small change in the training data can result in a large
change in the tree due to the hierarchical nature of the tree-growing process causing errors
at the top to affect the rest of the tree and consequently the final predictions.
• Decision tree learners can create over complex trees that do not generalize well from the
training data, known as ‘overfitting’.
• Do not predict very accurately compared to other kinds of models, partly due to the greedy
nature of the tree construction algorithm.
• Not suitable for prediction of continuous attribute
• Does not handle non-rectangular regions well
• Computationally expensive to train
• Tends to overfit
• Large decision trees may be hard to understand
• Requires fixed length feature vectors
• Not often used on its own for prediction because it’s also often too simple and not powerful
enough for complex data.
Overfitting:
• Occurs when the learning algorithm continues to develop hypotheses that reduce training
set error at the cost of an increased test set error
• The decision trees will act very well on the training data at the expense of accuracy with the
entire distribution
• The sparsity and large margin principles are necessary to prevent overfitting, i.e., to ensure
that we do not use all the basis functions.
• It is not always desirable to construct trees that perfectly model the training data due to
overfitting.
1. Allow the tree to grow until it overfits and then post-prune the tree
2. Prevent the tree from growing too deep before it reaches the point where it perfectly
classifies the training data
RANDOM FOREST:
• Classification
o Land cover classification
o Cloud/shadow screening
• Regression
o Continuous fields (percent cover) mapping
o Biomass mapping
• Uses even weaker decision trees than Random Forest, that are increasingly focused on ‘hard’
examples.
• Used for regression and classification problems.
• It produces a prediction model in the form of an ensemble of weak prediction models
typically decision trees.
• It builds the model in a stage-wise fashion and it generalizes them by allowing optimization
of an arbitrary differentiable loss function.
• It develops an ensemble of tree-based models by training each of the trees in the ensemble
on different labels and then combining the trees.
• Differs from Bagging and Random Forests in that it can reduce bias in addition to reducing
variance.
• With Gradient Boosting, tree depth is only required to the extent that there is a significant
interaction between variables.
• The basic difference in principle between bagging and boosting is that boosting constantly
monitors its cumulative error and uses that residual for subsequent training. This difference
accounts for GB only needing tree depth when there is significant interaction among various
attributes in the problem.
• High performance.
• A small change in the feature set or training set can create radical changes in the model.
• Not easy to understand predictions.
Applications of SVM:
• Image classification
• Hand-written character recognition
• Biological and other sciences.
Advantages of SVM:
• Fast algorithms
• State of the art accuracy
• Power and flexibility from kernels
• Theoretical justification
Epoch:
NEURAL NETWORKS
PERCEPTRON:
• Is a neural network unit (an artificial neuron) that does certain computations to detect
features or business intelligence in the input data.
• Is a machine learning algorithm for supervised learning of binary classifiers
• Building blocks of ANN (mother of all ANNs)
• The perceptron is simply a fancy name for the simple neuron model with the step activation
function. It was among the very first formal models of neural computation and because of its
fundamental role in the history of neural networks, it wouldn’t be unfair to call it the
“mother of all artificial neural networks”.
• It can be used as a simple classifier in binary classification tasks.
• A method for learning the weights of perceptron from data, called the Perceptron algorithm,
was introduced by the psychologist Frank Rosenblatt in 1957. Suffice to say that it is just
about as simple as the nearest neighbor classifier. Basic principle is to feed the network
training data one example at a time. Each misclassification leads to an update in the weight.
• Follows the ‘feed-forward’ model, meaning inputs are sent into the neuron, are processed
and result in an output.
• A simplest neural network possible, a computational model of a single neuron.
• Consists of one or more inputs, a processor and a single output.
• Perceptron Algorithm:
o For every input, multiply that input by its weight
o Sum all the weighted inputs
o Compute the output of the perceptron based on that sum passed through an
activation function.
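A minimal sketch of the perceptron computation and a Rosenblatt-style weight update as described above; the learning rate and training loop details are illustrative assumptions:

```python
import numpy as np

def step(z):
    """Step activation: fire (1) if the weighted sum is non-negative, else 0."""
    return 1 if z >= 0 else 0

def perceptron_train(X, y, epochs=10, lr=0.1):
    """Feed examples one at a time; each misclassification updates the weights."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            y_hat = step(w @ x_i + b)     # weighted sum passed through the activation
            error = y_i - y_hat
            w += lr * error * x_i         # update only when the example is misclassified
            b += lr * error
    return w, b
```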
• Perceptron architecture is shown below:
• In the biological lingo, we call the wires that provide the input to the neurons dendrites.
Sometimes, depending on the incoming signals, the neuron may fire and send a signal out
for the other neurons to receive.
• The wire that transmits the outgoing signal is called an axon.
• Each axon may be connected to one or more dendrites at intersections that are called
synapses.
• The following figure shows axon, synapse, dendrite, neuron.
Features of Artificial Neuron:
• Input Layer
• Hidden Layer
• Output Layer
Input layer – vector data, each input collects one feature/dimension of the data and passes it on to
the (first) hidden layer. It has many sensors to collect data from the outside world.
Hidden Layer – Each hidden unit computes a weighted sum of all the units from the input layer (or
any previous layer) and passes it through a non-linear activation function. These are hidden between
input and output layers. Each additional layer adds further complexity in training the network, but
would provide better results in most of the situations. It can be thought of as a classifier or feature
detector.
Output Layer – Each output unit computes a weighted sum of all the hidden units and passes it
through a (possibly nonlinear) threshold function. This gives the result predicted by the network.
The wider and deeper the network the more complicated the mapping. Cross Validation or hyper
parameter search methods are used to determine the number of layers and hidden units in a NN.
A simple neural network example with terminology is given below:
• Universality – given a large enough layer of hidden units (or multiple layers) a neural
network can represent any function.
• Representation Learning: classic statistical machine learning is about learning functions that
map input data to output. Neural networks, and especially deep learning, are more about
learning a representation in order to perform classification or some other task.
• Each hidden unit emits an output that is a nonlinear activation function of its net activation
• This is essential to the power of neural networks; if the activation were linear, the whole network
would reduce to linear regression. The output is thus passed through this nonlinear activation function.
Activation Functions:
Sigmoid:
• The Sigmoid function is used to represent a probability distribution over a binary variable.
• Sigmoid is now discouraged except for final layer to obtain probabilities.
• tanh is most used and often performs best for deep networks.
• Formula for tanh: tanh(x) = (e^x − e^−x) / (e^x + e^−x)
ReLU:
• Cannot learn via gradient-based methods on examples for which the activation is zero.
• It has zero gradient whenever the unit is not active. This may cause units that are not active
initially to never become active, as gradient-based optimization will not adjust their weights.
• It may slow down the training process due to the constant zero gradients.
• The discontinuity of ReLU at 0 may hurt the performance of backpropagation.
Leaky ReLU:
Softmax:
• A special kind of activation layer usually at the end of Fully Connected layer outputs.
• Can be viewed as a fancy normalizer.
• Produce a discrete probability distribution vector
• The Softmax function of z is a generalization of the Sigmoid function that represents a
probability distribution over a discrete variable with n possible values.
• Softmax functions are often used as the output units of a classifier.
• Cost functions that do not use a log to undo the exp of the softmax cause a failure to learn
when the argument to the exp becomes very negative causing the gradient to vanish.
• Very convenient when combined with cross entropy loss
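Minimal sketches of the activation functions discussed above (sigmoid, tanh, ReLU, leaky ReLU and softmax); the leaky slope of 0.01 is a common but arbitrary choice:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    return np.tanh(z)                       # (e^z - e^-z) / (e^z + e^-z)

def relu(z):
    return np.maximum(0.0, z)               # zero gradient when the unit is inactive

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)    # small negative slope keeps gradients flowing

def softmax(z):
    """Normalizes a score vector into a discrete probability distribution."""
    e = np.exp(z - np.max(z))               # subtract the max for numerical stability
    return e / e.sum()
```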
There are countless types of neural networks. The following are some of the most relevant types:
• Feed Forward
• Radial Basis
• Kohonen
• Recurrent
• Modular
Feed Forward Neural Networks
• Used in computer vision and speech recognition when classifying the target classes are
complicated.
• Responsive to noisy data.
• Easy to maintain.
• Does not have input at each step
• Has different parameters for each layer
Radial Basis Neural Networks:
• Collection of different networks work independently & contribute towards the final output.
• Increases computation speed (through the breakdown of a complicated computational
process into simpler computations), but processing time is subject to the number of
neurons.
• Very slow to train, because they often have a very complex architecture.
• Almost impossible to understand predictions.
• Diminishing gradient inhibits multiple layers
• Can get stuck in local minimums
• Training time can be extensive
Confusion Matrix:
• A specific table layout that allows visualization of the performance of a supervised learning
algorithm.
• Each column of the matrix represents the instances in a predicted class while each row
represents the instances in an actual class or vice-versa.
• The name stems from the fact that it makes it easy to see if the system is confusing two
classes (i.e. commonly mislabeling one as another).
• For multiple label categorization problems, the confusion matrix can be used as an
evaluation metric.
• Confusion Matrix for a multi-class classification demonstrates how the model of
classification is confused when it makes projections.
Accuracy:
• The accuracy metric provides a summary of how accurately predictions are being made.
o Accuracy = [True Positive (TP) + True Negative (TN)] / Total
• Accuracy may not be useful measure in cases where
o There is a large class skew (for example, is 98% accuracy good if 97% of the instances
are negative?)
o There are differential misclassification costs – say, getting a positive wrong, costs
more than getting a negative wrong. Consider a medical domain in which a false
positive, results in an extraneous test but a false negative, results in a failure to treat
a disease.
o We are most interested in a subset of high-confidence predictions.
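A minimal sketch of computing accuracy, together with the TPR and FPR used by the ROC curve below, from the counts in a binary confusion matrix:

```python
def binary_metrics(tp, tn, fp, fn):
    """Accuracy, true positive rate and false positive rate from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    tpr = tp / (tp + fn)        # recall / sensitivity
    fpr = fp / (fp + tn)
    return accuracy, tpr, fpr

# Example of class skew: 97 negatives correctly rejected, 1 positive found, 2 missed.
print(binary_metrics(tp=1, tn=97, fp=0, fn=2))   # high accuracy, but low TPR
```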
ROC Curve:
• Classifier evaluation also includes the ROC (Receiver Operating Characteristics) Curve and AUC
(Area Under the ROC Curve).
• ROC curve compares the model’s TPR (True Positive Rate) and FPR (False Positive Rate) to
the ones from a random assignment.
• It is a performance measurement for classification problem at numerous threshold settings.
• ROC is a probability curve and is becoming more popular in ML.
• The ROC curve was developed during WWII to statistically model false positive and false negative
detections by radar operators
• ROC measure has better statistical foundations than most other measures
• ROC is a standard measure in medicine and biology
• AUC represents degree or measure of separability – the higher the AUC, the better the
model is at predicting.
• AUC measures the entire two-dimensional area under the entire ROC curve.
• ROC curve (similar to PR curves, Precision-Recall):
o allow predictive performance to be assessed at various levels of confidence
o Assume binary classification tasks
o Sometimes summarized by calculating area under the curve
• ROC curves are insensitive to changes in class distribution (ROC curve does not change if the
proportion of positive and negative instances in the test set varied)
• ROC curves identify optimal classification thresholds for tasks with differential
misclassification costs.
• A typical ROC curve is shown below. Different methods can work better in different parts of
ROC space. This depends on the cost of FP vs. FN.
Properties of ROC:
DROPOUT:
• Means randomly ignore certain units during training, don’t update them via gradient
descent, leads to hidden units that specialize
• The key idea is to randomly drop units (along with their connections) from the neural
network during training.
• Is a method for regularization of machine learning.
• With probability p, don’t include a weight in the gradient updates.
• The key idea of a dropout layer is to randomly disable input units after each iteration of the
training.
• A neural network with n units that employs dropout can be seen as a collection of 2^n
possible neural networks.
• These possible networks have a smaller number of units but still share weights such that the
total number of parameters is unchanged.
• Training a network with dropout can be seen as training a collection of 2^n smaller
networks with extensive weight sharing.
• At test time it is easy to approximate the effect of averaging the predictions of the smaller
networks by using a single network without dropout that has smaller weights. This
significantly reduces overfitting and gives significant improvements over other
regularization methods.
• Proven to be very effective in reducing overfitting.
• Dropout can prevent the network from becoming too dependent on any one or any small
combination of neurons and can force the network to be accurate even in the absence of
certain information.
• Dropout introduces a significant amount of noise in the gradients compared to standard
SGD.
• Dropout introduces an extra hyperparameter – the probability of retaining a unit, which
controls the intensity of dropout.
• The central idea of dropout is to take a large model that overfits easily and repeatedly
sample and train smaller sub-models from it.
• Besides feed forward neural networks, dropout can also be applied to Restricted Boltzmann
Machines (RBM).
Advantages of Dropout:
Drawbacks of Dropout:
• It increases training time. A dropout network typically takes 2 to 3 times longer to train
than a standard neural network of the same architecture.
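A minimal sketch of (inverted) dropout applied to a layer's activations during training, assuming a retention probability p_keep; at test time the layer is used without the mask:

```python
import numpy as np

def dropout(activations, p_keep=0.8, training=True, seed=0):
    """Randomly zero out units during training; scale so expectations match test time."""
    if not training:
        return activations                       # no dropout at test time
    rng = np.random.default_rng(seed)            # seed fixed only for this sketch
    mask = rng.random(activations.shape) < p_keep
    return activations * mask / p_keep           # 'inverted' dropout scaling
```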
BACKPROPAGATION ALGORITHM
• Is the most common network learning method and has been successfully applied to a
variety of learning tasks such as handwriting recognition and robot control
• Considers hypothesis space of all functions that can be represented by assigning weights
to the given, fixed network of interconnected units.
• Backpropagation searches the space of possible hypotheses using gradient descent to
iteratively reduce the error in the network fit to the training examples.
• One of the most intriguing properties of backpropagation is its ability to invent new
features that are not explicit in the input to the network.
• Most common ANN learning algorithm
• Squared error over the entire training set is computed in this algorithm.
• This algorithm computes the gradient vector of the NLL by applying the chain rule of
calculus.
• Backpropagation (also known as Chain rule) is the procedure to compute gradients of
the loss with respect to parameters in a multi-layer neural network.
• Allows to fit models with hidden layers
• Layer 1 errors can be computed by passing the layer 2 errors back through the W matrix,
hence the term ‘backpropagation’.
• Propagates backwards the gradients through the network starting at the end, that is why
the term ‘backpropagation’.
• Given the backpropagation algorithm you can directly run gradient descent using it as a
subroutine for computing the gradients.
• Key property is that we can compute the gradients locally: each node only needs to
know about its immediate neighbors
• This is supposed to make the algorithm ‘neurally plausible’, although this interpretation
is somewhat controversial.
• Backpropagation requires computing the gradient of the output with respect to the
weights (among other things)
• Forward Propagation is the process of computing the output of the network given its
input.
• Backpropagation work with any a.e. differentiable transformation in addition to ReLU
layer.
• The computational cost of Backpropagation is about twice Forward Propagation as there
is a need to compute gradients with respect to input and parameters at every layer.
• FPROP and BPROP are dual of each other.
Process:
• Input encoding
• Output encoding
• Network graph structure
• Other learning algorithm parameters
Heuristics for training Neural Networks for improving Backpropagation:
• Fewer hidden nodes – just enough complexity to work, not so much that the network overfits.
• Train multiple networks with different sizes and search for the best design
• Validation set – train on the training set until the error on the validation set starts to rise, then
evaluate on the test set
• Try different activation functions: tanh, ReLU, ELU….
• Dropout – randomly ignore certain units during training, don’t update them via gradient
descent, leads to hidden units that specialize
• Modify learning rate over time (cooling schedule)
• The backpropagation algorithm iterates through many cycles of two processes. Each cycle is
known as an epoch.
• As the network contains no existing knowledge, the starting weights are typically set at
random.
• Then the algorithm iterates through the processes, until a stopping criterion is reached.
• Each epoch in the backpropagation algorithm includes:
o A forward phase in which the neurons are activated in sequence from the input
layer to the output layer, applying each neuron’s weights and activation function
along the way. Upon reaching the final layer, an output signal is produced.
o A backward phase in which the network’s output signal resulting from the forward
phase is compared to the true target value in the training data. The difference
between the network’s output signal and the true value results in an error that is
propagated backwards in the network to modify the connection weights between
neurons and reduce future errors.
• Over time, the network uses the information sent backward to reduce the total error of the
network.
• By applying gradient descent, the algorithm determines how much a weight should be
changed even though the relationship between each neuron’s inputs & outputs is complex.
• The backpropagation algorithm uses the derivative of each neuron’s activation function to
identify the gradient in the direction of each of the incoming weights and hence the
importance of having a differentiable activation function.
• The gradient suggests how steeply the error will be reduced or increased for a change in the
weight. The algorithm will attempt to change the weights that result in the greatest
reduction in error by an amount known as the learning rate.
• The greater the learning rate, the faster the algorithm will attempt to descend the gradients.
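A minimal sketch of one epoch of backpropagation (forward phase, backward phase and a gradient-descent weight update) for a one-hidden-layer network with sigmoid units and squared error; the shapes and learning rate are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_epoch(X, y, W1, W2, lr=0.1):
    """One forward/backward pass over the training set, then update the weights."""
    # Forward phase: activate neurons layer by layer.
    h = sigmoid(X @ W1)                                  # hidden layer activations
    y_hat = sigmoid(h @ W2)                              # network output signal
    # Backward phase: propagate the error back through the network.
    err_out = (y_hat - y) * y_hat * (1 - y_hat)          # output-layer delta
    err_hidden = (err_out @ W2.T) * h * (1 - h)          # pass errors back through W2
    # Gradient-descent update, scaled by the learning rate.
    W2 -= lr * h.T @ err_out
    W1 -= lr * X.T @ err_hidden
    return W1, W2
```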
Drawbacks with back propagation:
• Deep Learning has been around for a while, but it returned to the headlines in 2016 when
Google’s AlphaGo program crushed Lee Sedol, one of the highest-ranking Go players in the
world.
• With advancements in machine learning algorithms and deep learning chipsets, DL is being
more actively implemented.
• Deep Learning is being applied across industries from healthcare to finance to retail and
everything in between.
• The global deep learning market is expected to reach USD 10.2 Billion by 2025.
• On a less conceptual and more tactical level, deep learning (DL) is a subset of machine
learning (ML) which focuses on learning data representations.
• The focus is on learning relationships and representations rather than only on tasks, as in
classical ML algorithms; this creates transferable solutions. It’s the difference between being able to identify a cat as a whole as
opposed to understanding the different concepts defining a cat (like the paws, tail and ears)
and the way they are nested.
• This is one of the key reasons why deep learning is more powerful than classical machine
learning – it creates transferable solutions. That is, the concepts of paw, tail and ears can be
easily reused to understand what a dog is as well.
• Deep Learning algorithms are able to create transferable solutions through neural networks:
that is layers of neurons/units.
• Neurons make up neural networks and those in turn allow machines (via deep learning) to
‘learn’ like humans. To have robust understanding, it’s important to understand how those
underlying neurons actually work.
• Deep Learning uses several layers of neural networks stacked on top of one another.
• Deep learning refers to certain kinds of machine learning techniques where several “layers”
of simple processing units are connected in a network so that the input to the system is
passed through each one of them in turn.
• This architecture has been inspired by the processing of visual information in the brain
coming through the eyes and captured by the retina.
• This depth allows the network to learn more complex structures without requiring
unrealistically large amounts of data.
• Deep Learning deals with finding features automatically from raw data.
• In deep learning the raw data is fed to an algorithm that extracts hierarchical features
automatically based on optimizing the performance of the algorithm on the task.
• Deep learning models automatically learn to associate inputs and desired outputs from
examples.
• Deep Learning is a subset of machine learning that involves multiple layers of
representations that allow a computer to learn and deduce outputs from data.
• Deep Learning is learning hierarchical models. ConvNets are the most successful example.
• DL is a general-purpose framework for representation learning
o Given an objective
o Learn representation that is required to achieve objective
o Directly from raw inputs
o Using minimal domain knowledge
• Manufacturing
• Automotive
• Hospitality
• Health Care
• Banking, Insurance and Finance
• Agriculture
• Entertainment
• IT/Security
• Retail, Supply Chain and Logistics
• Linear transformations
• Non-linear activation functions
• A loss function on the output
o Mean Squared Error (MSE)
o Log likelihood
• Accuracy
o The accuracy of the model in terms of the top-5 error on datasets such as ImageNet.
Also, the type of data augmentation used (e.g., multiple crops, ensemble models)
should be reported.
o The accuracy determines if it can perform the given task.
• Network architecture
o The Network Architecture of the model should be reported, including number of
layers, filter sizes, number of filters and number of channels.
• Number of weights
o The number of weights impact the storage requirement of the model and should be
reported. If possible, the number of non-zero weights should be reported since this
reflects the theoretical minimum storage requirements.
• Number of Multiply and Accumulates (MACs)
o The number of MACs that needs to be performed should be reported as it is
somewhat indicative of the number of operations and potential throughput of the
given DNN. If possible, the number of non-zero MACs should also be reported since
this reflects the theoretical minimum compute requirements.
• Similar to feed-forward neural networks but dominate computer vision because of their much
higher accuracy.
• A form of Multi-Layer Perceptron (MLP) which is particularly well suited to one-dimensional
signals like speech or text, or two-dimensional signals like images, is the ‘Convolutional Neural
Network’ (CNN).
• A network with convolutional layers is called ‘Convolutional Neural Network’.
• An MLP in which the hidden units have local receptive fields and in which the weights are tied
or shared across the image in order to reduce the number of parameters.
• CNN consists of an input and an output layer, as well as multiple hidden layers.
• The hidden layers of a CNN typically consist of convolutional layers, pooling layers, fully
connected layers and normalization layers
• Original goal of CNN is ‘how to create good representations of the visual world in a way it
could be used to support recognition?’
• CNNs are developed from examining the way our own visual detection cortex works.
• The idea of CNNs was neurobiologically motivated by the findings of locally-sensitive and
orientation-selective nerve cells in the visual cortex.
• Inventors of CNN designed a network structure that implicitly extracts relevant features.
• CNNs are a special kind of multilayer neural networks.
• A CNN starts with an input image, extracts primitive features, combines them to form parts of
the object, and finally pulls the various parts together to form the object itself.
• A standard neural network applied to images scales quadratically with the size of the input and
does not leverage stationarity; CNNs solve this through the use of convolutional layers.
• It is hierarchical way of seeing objects.
o Layer 1 – very simple features are detected
o Layer 2 – combined to form more complicated shape of object and so on in
subsequent layers.
• A CNN is a set of layers, each of which is responsible for detecting a set of features. These
features become more abstract in the deeper layers of the network.
• CNNs automatically find the features and classify them.
• CNN is a type of neural network empowered with specific hidden layers including the
convolution layer, pooling layer and the fully connected layer.
• CNN typical architecture is shown below:
• Local connections - Represent how each set of neurons in a cluster is connected to each other,
which in turn represents a set of features.
• Layering – represents the hierarchy in features that are learned
• Spatial Invariance – represents the capability of CNNs to learn abstractions invariant of size,
contrast, rotation and variation.
Popular CNNs:
• LeNet
• AlexNet
• VGGNet
• ResNet
CNN Architectures:
• VGGNet
o 16 layers
o Only 3*3 convolutions
o 138 million parameters
• ResNet
o 152 layers
o ResNet50
• Parameter sharing – a feature detector (such as a vertical edge detector) that’s useful in one
part of the image is probably useful in another part of the image. Convolution shares the same
parameters across all spatial locations. Traditional matrix multiplication does not share any
parameters. A parameter sharing example is shown below:
• Sparsity of connections – in each layer, each output value depends only on small number of
inputs
• Main purpose of a convolutional layer is to detect different patterns or features from an input
image (for example its edges)
• Convolution function is a simple matrix multiplication.
• Convolution layer connects each hidden unit to a small patch of the input and shares the
weight across space.
• The size of the output of a convolutional layer and its computational cost are proportional to the
number of filters and depend on the stride.
• If kernels have size KxK, input has size DxD, stride is 1, and there are M input feature maps
and N output feature maps then:
o The input has M feature maps, each of size DxD
o The output has N feature maps, each of size (D-K+1) x (D-K+1)
o The kernels have MxNxKxK coefficients (which have to be learned)
o Computational Cost = M*K*K*N*(D-K+1)*(D-K+1)
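A minimal sketch that computes these quantities for given K, D, M and N (stride 1, no padding, as assumed above):

```python
def conv_layer_stats(D, K, M, N):
    """Output size, number of kernel coefficients and MACs for a stride-1 convolution."""
    out = D - K + 1                          # spatial size of each output feature map
    coefficients = M * N * K * K             # learnable kernel weights
    macs = M * K * K * N * out * out         # multiply-accumulate operations
    return out, coefficients, macs

print(conv_layer_stats(D=6, K=3, M=1, N=1))  # 6x6 input, 3x3 kernel -> 4x4 output
```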
Feature Map:
Stride:
• Governs how many cells the filter is moved in the input to calculate the next cell in the result
• Stride dictates the sliding behavior of the max pooling.
• If stride = 2, the window moves by 2 pixels every time.
• Determines the number of pixels of overlap between adjacent filter applications
Padding – benefits:
• Allows to use a CONV layer without necessarily shrinking the height and width of the volumes.
This is important for building deeper networks, since otherwise the height/width would shrink
as we go to deeper layers.
• It helps us to keep more of the information at the border of an image. Without padding, very
few values at the next layer would be affected by pixels at the edges of an image.
Zero padding:
Pooling (subsampling):
Weight sharing:
Translation invariance:
• When input is changed spatially (translated or shifted), the corresponding output to recognize
the object should not be changed
• CNN can produce the same output even though the input image is shifted due to weight
sharing
Pooling Layer:
• Max pooling
• Average pooling
• Stochastic pooling
• ROI pooling
• L2 pooling
• L2 pooling over features
The various forms of pooling listed above are shown below:
Lp Pooling:
CNN Architecture:
• The CNN architecture comprises multiple combinations of convolution and pooling layers.
• The reduced image from these layers (convolution + pooling) is then passed through the
activation function.
• A simple CNN architecture is shown below:
• Task dependent
• Cross Validation
• The more data: the more layers and the more kernels
o Look at the number of parameters at each layer
o Look at the number of flops at each layer
• Computational resources
• Be creative.
• Data Augmentation
• Weight Initialization
• Stochastic Gradient Descent
• Batch Normalization
• Shortcut connections
• Training diverges:
o Learning rate may be too large and decrease learning rate
o BPROP is buggy and solution is numerical gradient checking
• Parameters collapse / loss is minimized but accuracy is low
o Check Loss function:
▪ Is it appropriate for the task you want to solve?
▪ Does it have degenerate solutions? Check ‘pull-up’ term.
• Network is underperforming
o Compute FLOPs and the number of parameters. If too small, make the network larger
o Visualize hidden units/params and fix optimization
• Network is too slow
o Compute FLOPs and the number of parameters. Solutions: use a GPU or a distributed
framework, or make the network smaller.
Layers of CNN:
• Based on the connection pattern and operations, we can think of a layer in a CNN as:
o Convolutional – A layer can have multiple channels
o Non-linear (often not drawn)
o Max-pooling
o Fully connected
o Soft Max
Working of CNNs:
• Learning an Image - CNN focuses on smaller and specific patterns than the whole image. It’s
convenient and effective to present a smaller region with fewer parameters, thereby reducing
computational complexity.
• Convolutional Layer – CNN is a neural network with convolutional layers and other layers. A
convolutional layer has several filters that perform the convolution operation.
• Convolution Operation:
o Consider a 6x6 image convolved with 3x3 filter(s) to give an output of size 4x4. Filters
can be considered as network parameters to be learned.
o Shift the filter around the input matrix (commonly known as stride) once a convolved
output is achieved. If the stride size is changed, the convolved output will vary (only
outputting intense pixels).
o The convolution operation gets repeated for each filter resulting in a feature map.
o When RGB image is used as input to CNN, the depth of filter is always equal to the
depth of image (3 in case of RGB).
• Decomposing larger filters into smaller filters is shown below
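Separately from the filter-decomposition figure referenced above, here is a minimal sketch of the basic convolution operation described earlier (a single-channel 6x6 input, a 3x3 filter and stride 1, producing a 4x4 feature map); the filter values are placeholders, since in a CNN they are learned parameters:

```python
import numpy as np

def convolve2d(image, kernel, stride=1):
    """Slide the kernel over the image and compute one weighted sum per position."""
    K = kernel.shape[0]
    out = (image.shape[0] - K) // stride + 1
    feature_map = np.zeros((out, out))
    for i in range(out):
        for j in range(out):
            patch = image[i*stride:i*stride+K, j*stride:j*stride+K]
            feature_map[i, j] = np.sum(patch * kernel)
    return feature_map

image = np.arange(36, dtype=float).reshape(6, 6)                       # toy 6x6 input
kernel = np.array([[1, 0, -1], [1, 0, -1], [1, 0, -1]], dtype=float)   # placeholder 3x3 filter
print(convolve2d(image, kernel).shape)                                 # (4, 4)
```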
Applications of CNN:
CROSS VALIDATION:
• The term cross-validation is used loosely in literature, where practitioners and researchers
sometimes refer to the train/test holdout method as a cross-validation technique.
• It is a crossing over of training and validation stages in successive rounds.
• Main idea behind CV is that each sample in the dataset has the opportunity of being tested.
• The predictive performance of the models is assessed using Cross Validation.
• The motivation to use CV is that when we fit a model, we fit it to a training dataset. Without
CV, we only have information on how the model performs on the training data.
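A minimal sketch of k-fold cross-validation, in which each sample gets a turn in the held-out fold; the model-fitting and scoring functions are placeholders you would supply:

```python
import numpy as np

def k_fold_cross_validation(X, y, fit, score, k=5, seed=0):
    """Rotate a held-out fold through the data; average the k validation scores."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)
    scores = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train_idx], y[train_idx])            # placeholder training routine
        scores.append(score(model, X[val_idx], y[val_idx]))
    return np.mean(scores)
```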
NO FREE LUNCH THEOREM:
• Proposed by Wolpert.
• If one algorithm tends to perform better than another on a given class of problems, then the
reverse will be true on a different class of problems.
• There is no universally best model.
MODEL COMPRESSION
• Recent deep learning models are becoming more complex. For example, LeNet-5 with 1M
parameters and VGG-16 with 133M parameters.
• The following are the problems with complex deep learning models:
o Huge storage (memory, disk) requirement
o Computationally expensive
o Uses lots of energy
o Hard to deploy models on small devices (like smart phones)
• The goal of model compression is to make a lightweight model that is fast, memory-efficient
and energy-efficient, and is especially useful for edge devices.
• Key technique to allow using AI everywhere
• Mitigates energy problem.
• There are several approaches to model compression: training a lightweight model from the
start, compressing a trained model, or combining different techniques.
• Refers to reducing the number of parameters in the convolutional layers and FC layers.
• Approach to reduce the computational cost of CNN.
• The three-stage compression method consisting of pruning, quantization and encoding is
shown below. The input is the original model and the output is the compression model.
• Pruning
• Weight Sharing
• Quantization
• Low-rank Approximation
• Sparse Regularization
• Distillation
• Hashing
Pruning:
• Reduces the number of parameters and operations in CNNs by permanently dropping less
important connections, which enables smaller networks to inherit knowledge from the large
predecessor networks and maintain comparable performance.
• Results in a smaller number of weights
• Pruning works well with quantization and motivated by how real brain learns
• Remove weights for which |weight| < threshold, then retrain after pruning
• Learn effective connections by iterative pruning
• Convolutional layer is more sensitive to pruning compared to fully-connected layer.
• Given a well-trained CNN, prune less important filters together with their connecting feature
maps in order to reduce computation cost.
• Minimum weight and smallest activation are the two criteria for pruning CNN.
• Given an RNN, prune weights during initial training in order to gain sparsity of the model.
• Results become much better when using both pruning and quantization methods together.
The model can be compressed down to about 3% of its original size.
• Robust to various settings and can achieve good performance.
• Can support both train from scratch and pre-trained model.
• Pruning process is explained as below:
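As a rough illustration, here is a minimal sketch of magnitude-based pruning of a single weight matrix; this is a simplification of the iterative prune-and-retrain loop described above, and the threshold value is an assumption:

```python
import numpy as np

def prune_weights(W, threshold=0.05):
    """Permanently drop connections whose |weight| falls below the threshold."""
    mask = np.abs(W) >= threshold
    pruned = W * mask                     # removed connections become exact zeros
    sparsity = 1.0 - mask.mean()          # fraction of weights removed
    return pruned, mask, sparsity

# After pruning, the network would normally be retrained with the mask held fixed.
```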
Drawbacks of Pruning:
Quantization:
• VQ is a method for compressing densely connected layers to make CNN models smaller.
• Similar to scalar quantization where a large set of numbers is mapped to a smaller set, VQ
quantizes groups of numbers together rather than addressing them one at a time.
• Used to significantly reduce the number of dynamic parameters in deep models.
• VQ methods have a clear gain over existing matrix factorization methods.
• Quantization works well on pruned network. For example, unpruned AlexNet has 60 million
weights to quantize, while pruned AlexNet has only 6.7 million weights to quantize. Given the
same amount of centroids, the latter has less error.
• The main idea is to train the DNN with binary weights during the forward and backward
propagations, while retaining precision of the stored weights in which gradients are
accumulated in order to regularize all the parameters.
• Network quantization compresses the original network by reducing the number of bits
required to represent each weight.
• The following example shows Trained Ternary Quantization (TTQ) in a DNN.
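A highly simplified sketch of ternary weight quantization in that spirit (not the full TTQ training procedure, which learns the positive and negative scaling factors during training); the threshold ratio is an assumption:

```python
import numpy as np

def ternarize(W, t_ratio=0.05):
    """Map weights to {-Wn, 0, +Wp}: a threshold plus one scale per sign."""
    t = t_ratio * np.max(np.abs(W))              # threshold relative to the largest weight
    pos, neg = W > t, W < -t
    Wp = W[pos].mean() if pos.any() else 0.0     # positive scale
    Wn = -W[neg].mean() if neg.any() else 0.0    # negative scale (stored as positive)
    return np.where(pos, Wp, np.where(neg, -Wn, 0.0))
```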
Drawbacks of Quantization:
• The accuracy of the binary nets is significantly lowered when dealing with large CNNs such as
GoogleNet.
• Another drawback of such binary nets is that existing binarization schemes are based on
simple matrix approximations and ignore the effect of binarization on the accuracy loss.
Hashing:
• Designing a proper hashing technique to accelerate the training of CNNs or save memory
space is also an interesting problem.
• HashedNets is a recent technique to reduce model sizes by using a hash function to group
connection weights into hash buckets and all connections within the same hash bucket share
a single parameter value.
• HashedNets shrink the storage costs of neural networks significantly while mostly preserving
generalization performance in image classification.
• Sparsity minimizes hash collisions, making feature hashing even more effective.
• HashedNets may be used together with pruning to give even better parameter savings.
• Weight sharing is determined by a hash function before the networks see any training data.
• Given a DNN, compress the neural network with weight sharing.
• HashedNets use a low-cost hash function to randomly group connection weights into hash
buckets (weights within the same hash bucket share the same value) in order to reduce the
model size.
• An example of Hashed Net is shown below:
Drawbacks of Low-rank Approximation:
• The implementation is not that easy since it involves a decomposition operation, which is
computationally expensive.
• Another issue is that current methods perform low-rank approximation layer by layer and thus
cannot perform global parameter compression which is important as different layers hold
different information.
• Finally, factorization requires extensive model retraining to achieve convergence when
compared to the original model.
• A typical framework for the low rank regularization method is shown below. The left is the
original convolutional layer and the right is the low rank constraint convolutional layer with
rank k.
Knowledge Distillation:
Drawbacks of KD:
• KD can only be applied to classification tasks with softmax loss function which hinders its
usage.
• Another drawback is the model assumptions sometimes are too strict to make the
performance competitive with other type of approaches.
LOSS FUNCTION:
Hinge Loss:
• Is usually used to train large margin classifiers such as the Support Vector Machine (SVM).
Softmax Loss:
Contrastive Loss:
• Is commonly used to train Siamese Network which is a weakly-supervised scheme for learning
a similarity measure from pairs of data instances labelled as matching or non-matching.
• A single margin loss function.
• Causes a dramatic drop in retrieval results when fine-tuning the network on all pairs.
• Calculates the similarity between the images.
Triplet Loss:
Limitations of RNN:
• The main difference between CNN and RNN is the ability to process temporal information or
data that comes in sequences such as a sentence for example. CNN and RNN are used for
completely different purposes and there are differences in the structures of the neural
networks themselves to fit those different use cases.
• CNNs employ filters within convolutional layers to transform data. Whereas RNN reuse
activation functions from other data points in the sequence to generate the next output in a
series.
• Boltzmann Machine is a stochastic recurrent neural network with stochastic binary units and
undirected edges between units.
• Unfortunately, learning in Boltzmann machines is impractical and has a scalability issue; as a
result, the Restricted Boltzmann Machine (RBM) was introduced.
• An RBM has one layer of hidden units and restricts connections between hidden units.
• This allows for a more efficient learning algorithm.
• Represented by a bipartite graph, with symmetric, weighted connections.
• One layer has visible nodes and the other hidden variables as shown below:
• We restrict the connectivity to make learning easier.
• In an RBM, the hidden units are conditionally independent given the visible states.
• As the name indicates, it is a multi-layer belief network. Each layer is a RBM and they are
stacked on each other to construct DBN.
• Consists of two different types of neural networks – Belief Networks and Restricted
Boltzmann Machines.
• DBN training is unsupervised, in contrast to perceptrons and backpropagation neural
networks.
• Multi-layer belief networks.
• Each layer is a Restricted Boltzmann Machine, and the layers are stacked on each other to construct the DBN.
• The first step of training a DBN is to learn a layer of features from the visible units, using the
Contrastive Divergence algorithm.
• The next step is to treat the activations of previously trained features as visible units and
learn features of features in a second hidden layer.
• Finally, the whole DBN is trained when the learning for the final hidden layer is achieved.
Learning DBN:
• It is easy to generate an unbiased example at the leaf nodes, so we can see what kinds of
data the network believes in.
• It is hard to infer the posterior distribution over all possible configurations of hidden causes.
• It is hard to even get a sample from the posterior.
Applications of DBN:
AUTOENCODERS
• An autoencoder is a neural network that is trained to copy its input to its output, with the
typical purpose of dimension reduction – the process of reducing the number of random
variables under consideration.
• An Autoencoder is a kind of unsupervised neural network that is used for dimensionality
reduction and feature discovery. Also known as auto-associative learning.
• It features an encoder function to create a hidden layer (or multiple layers) which contains a
code to describe the input. Is a neural network which is trained to reproduce its input.
• There is then a decoder which creates a reconstruction of the input from the hidden layer.
• The most intuitive application of autoencoders is data compression. Given a 256 x 256 pixel
image, for example, a 28 x 28 pixel representation may be learned which is easier to
handle.
• An autoencoder can then become useful by having a hidden layer smaller than the input layer,
forcing it to create a compressed representation of the data in the hidden layer by learning
correlations in the data.
• This facilitates classification, visualization, communication and storage of data.
• Autoencoders are a form of unsupervised learning meaning that an autoencoder only needs
unlabeled data – a set of input data rather than input-output pairs.
• A standard way to train an autoencoder is to ensure that the hidden layer is narrower than
the visible layer. This prevents the model from learning the identity function.
• An interesting variation of the MLP whose output is the same as its input.
• The key is to make the hidden layer much smaller than the input and output layers, so the
network can’t just learn to copy the input to the hidden layer and the hidden layer to the
output, in which case we may as well throw the whole thing out. But if the hidden layer is
small, something interesting happens: the network is forced to encode the input in fewer bits,
so it can be represented in the hidden layer and then decode those bits back to full size.
• It could learn to encode a million-pixel image as just the seven-character word or some short
code invented by itself and simultaneously learn to decode the image into another image.
• An autoencoder is like a file compression tool, with two important advantages: it figures out
how to compress things on its own and like Hopfield networks it can turn a noisy, distorted
image into a nice clean one.
• Autoencoders were very hard to learn in the beginning (1980s) even though they had a single
hidden layer. Figuring out how to pack a lot of information into the same few bits is a very
difficult problem.
• These were MLPs where the inputs and outputs were clamped together so that the (smaller
number of) hidden nodes produced a lower dimensional representation of the inputs.
• Is directional in that the weights run from input to hidden node to output.
• is a feed-forward neural network whose job is to take an input X and predict X.
• To make this non-trivial, we need to add a bottleneck layer whose dimension is much smaller
than the input.
• Autoencoders offer an alternative to manifold learning for conducting nonlinear feature
fusion.
• An autoencoder is a type of artificial neural network used to learn efficient data coding in an
unsupervised manner.
• Unlike PCA, autoencoders can analyze nonlinear components; without nonlinear activation
functions, an autoencoder achieves essentially the capacity of PCA.
Applications of Autoencoders:
• Dimensionality reduction
o Visualization
o Feature Extraction
o High prediction accuracy
o High speed of prediction
o Low memory requirements
• Semantic Hashing
• Unsupervised pretraining
• Image denoising
• Anomaly detection
• Feature Learning
Components of an Autoencoder:
• Encoder – reduces the input dimensions and compresses the input data into an encoded
representation
• Bottleneck – contains the compressed representation of the input data in its lowest possible
form
• Decoder – reconstructs the data from the encoded representation to be as close to the original
as possible
• Reconstruction Loss – measures the performance of the decoder and the similarities between
the input and the output
• A simple autoencoder is shown below:
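A minimal sketch of the encoder, bottleneck and decoder structure in PyTorch; the layer sizes, the MSE reconstruction loss and the optimizer settings are illustrative assumptions:

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, n_inputs=784, n_bottleneck=32):
        super().__init__()
        # Encoder: compress the input into the bottleneck representation.
        self.encoder = nn.Sequential(nn.Linear(n_inputs, 128), nn.ReLU(),
                                     nn.Linear(128, n_bottleneck))
        # Decoder: reconstruct the input from the bottleneck.
        self.decoder = nn.Sequential(nn.Linear(n_bottleneck, 128), nn.ReLU(),
                                     nn.Linear(128, n_inputs))

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = Autoencoder()
loss_fn = nn.MSELoss()                       # reconstruction loss
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.rand(16, 784)                      # toy unlabeled batch
loss = loss_fn(model(x), x)                  # the target is the input itself
optimizer.zero_grad()
loss.backward()
optimizer.step()
```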
Training an Autoencoder:
Types of Autoencoders:
Deep Autoencoders:
• Training Deep Autoencoder models using back propagation does not work well, because the
gradient signal becomes too small as it passes back through multiple layers and the learning
algorithm often gets stuck in poor local minima.
• One solution to the drawback with Deep Autoencoders as mentioned above, is to greedily
train a series of RBMs and to use these to initialize an auto-encoder. The whole system can
then be fine-tuned using backprop in the usual fashion.
• This approach works much better than trying to fit the deep autoencoder directly starting with
random weights.
• The following are the steps in training a Deep Autoencoder:
o First, we greedily train some RBMs.
o Then we construct the Autoencoder by replicating the weights.
o Finally, we fine-tune the weights using back propagation.
Denoising Autoencoder:
Siamese Network:
• Consists of two identical neural networks (sister networks) each taking one of the two input
images.
• The last layers of the two networks are then fed to a ‘contrastive loss’ function, which
calculates the similarity between the images.
• Input: a pair of input signatures
• Output (Target): A label, 0 for similar, 1 otherwise.
• No one ‘architecture’ fits all.
• Design largely governed by what performs well empirically on the task at hand.
• Matching
• Retrieval
• Recognition
• Re-identification
Advantages of GANs:
• Excellent test of our ability to use high dimensional, complicated probability distributions
• Simulate possible futures for planning or simulated RL
• Missing data
• Semi-supervised learning
• Multi-modal outputs
• Realistic generation tasks
Loss functions in GANs:
Training Generator:
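The notes do not spell out these losses, but in the standard formulation the discriminator is trained to score real samples as 1 and generated samples as 0, while the generator is trained (in the common non-saturating form) to make the discriminator score its samples as 1. A hedged sketch on top of PyTorch's binary cross-entropy, with discriminator outputs assumed to be probabilities:

import torch
import torch.nn.functional as F

def discriminator_loss(d_real, d_fake):
    # d_real, d_fake: discriminator outputs on real and on generated samples
    real_loss = F.binary_cross_entropy(d_real, torch.ones_like(d_real))    # push D(x) toward 1
    fake_loss = F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake))   # push D(G(z)) toward 0
    return real_loss + fake_loss

def generator_loss(d_fake):
    # non-saturating generator loss: push D(G(z)) toward 1
    return F.binary_cross_entropy(d_fake, torch.ones_like(d_fake))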
Types of Regression Algorithms:
• Linear Regression
• Multiple Linear Regression
• Polynomial Regression
• Ridge Regression
• Lasso Regression
• ElasticNet Regression
Linear Regression:
• Is a popular regression learning algorithm that learns a model which is a linear combination
of features of the input example.
• Is a statistical model used to predict the relationship between independent and dependent
variables denoted by x and y respectively.
• Can be described as the ‘best fit’ line through all data points.
• Predictions in linear regression are numerical.
• Easy to understand
• We can clearly see what the biggest drivers of the model are.
• Model form: y = c0 + c1x1 + c2x2 + … + cnxn, where c = coefficient (the weights learned from
the data).
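A minimal fit-and-predict sketch with scikit-learn; the toy data (y roughly equal to 3x + 2 plus noise) is an assumption for illustration.

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3 * X[:, 0] + 2 + rng.normal(0, 1, size=100)

model = LinearRegression().fit(X, y)      # estimate the coefficients and the intercept
print(model.coef_, model.intercept_)      # the biggest drivers of the model are visible directly
print(model.predict([[5.0]]))             # numerical prediction for a new example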
Polynomial Regression:
• The relationship between the dependent variable y and the independent variable x is
modeled as an nth degree polynomial in x.
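A short sketch of fitting an nth degree polynomial (here degree 3) by expanding the features and then fitting a linear model, assuming scikit-learn; the cubic toy data is an assumption.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = 0.5 * X[:, 0] ** 3 - X[:, 0] + rng.normal(0, 1, size=200)

# Expand x into [x, x^2, x^3] and fit an ordinary linear model on the expanded features
model = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())
model.fit(X, y)
print(model.predict([[2.0]]))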
Ridge Regression:
• Is a technique for analyzing multiple regression data that suffer from multicollinearity.
• By adding a degree of bias to the regression estimates, ridge regression reduces the
standard errors.
• Works better statistically and is also easier to fit numerically.
• Uses a ‘soft’ weighting of all the dimensions.
• Ridge regression shrinks the regression coefficients by imposing a penalty on their size. The
ridge coefficients minimize a penalized residual sum of squares.
• Ridge regression can have better prediction error than linear regression in a variety of
scenarios, depending on the choice of lambda.
• The first step in Ridge regression is to standardize the variables (both dependent and
independent) by subtracting their means and dividing by their standard deviations.
• All ridge regression calculations are based on standardized variables.
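A minimal sketch of ridge regression with the standardization step described above, assuming scikit-learn; the penalty value (alpha, i.e. lambda) and the synthetic data are assumptions.

from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

# Standardize the variables, then minimize the penalized residual sum of squares
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
model.fit(X, y)
print(model.named_steps["ridge"].coef_)   # shrunk ('soft-weighted') coefficients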
Lasso Regression:
ElasticNet Regression:
• R-square is the most common metric to judge the performance of regression models.
• R-square lies between 0 and 100%.
• The disadvantage with R-squared is that it assumes every independent variable in the model
causes variations in the dependent variable. This can be solved using Adjusted R squared,
which is adjusted for the number of predictors in the model.
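A small helper computing adjusted R-squared from R-squared, the number of samples n and the number of predictors p, using the standard adjustment formula (the numbers in the example are made up).

def adjusted_r2(r2, n, p):
    # Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(adjusted_r2(0.90, n=100, p=5))      # roughly 0.8947: slightly lower than the raw R^2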
Methods:
• K-Means Clustering
• Hierarchical Clustering
• DBSCAN
• Gaussian mixtures
• Spectral Clustering
1. Input unlabeled data – for example provide images of different kinds of fruits without
expected output
2. Test the model – in this step the model identifies patterns like shape, color, size and groups
the fruits based on these features, attributes or qualities.
Clustering:
Applications of clustering:
• Segmenting customers into groups with similar buying patterns for targeted marketing
campaigns.
• Detecting anomalous behavior such as unauthorized network intrusions by identifying
patterns of use falling outside the known clusters.
• Simplifying extremely large datasets by grouping features with similar values into a smaller
number of homogeneous categories.
• Clustering is often used in marketing in order to group users according to multiple
characteristics, such as location, purchasing behavior, age, and gender.
• Also used in scientific research for example to find population clusters within DNA data.
K-Means Clustering:
• Used with unlabeled data (i.e., data without defined categories or groups)
• One of the most popular examples of clustering
• Is a way to find clusters or groups in the data
• Is a set of steps that work iteratively to find the groups and label the data
• ‘k’ is a variable that represents the number of groups
• We need to run the k-Means clustering algorithm for a range of k values and compare the
results to find the value of k that best represents the number of clusters in the data
• Algorithm works iteratively to assign each data point to one of k groups based on the
features that are provided.
• Data points are clustered based on feature similarity
• The results of the k-means clustering algorithm are the centroids of the k clusters which can
be used to label new data
• Uses iterative refinement to produce a result.
• The algorithm inputs are the number of clusters k and the data set
• Data set is a collection of features for each data point.
• Resulting clusters are always convex sets.
In the data assignment step, each centroid defines one of the clusters, and each data point is
assigned to its nearest centroid based on the squared Euclidean distance.
In the centroid update step, the centroids are recomputed by taking the mean of all data points
assigned to that centroid’s cluster.
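A minimal sketch of the assignment/update procedure via scikit-learn's KMeans; the three-blob toy data and the choice k = 3 are assumptions.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc, 0.5, size=(50, 2)) for loc in ([0, 0], [5, 5], [0, 5])])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.cluster_centers_)            # the centroids of the k clusters
print(kmeans.predict([[4.8, 5.1]]))       # the centroids can be used to label new data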
Dendrogram:
• Agglomerative
• Divisive
Agglomerative is a bottom-up approach: each observation starts in its own cluster, and pairs of
clusters are merged as one moves up the hierarchy. This strategy starts with each element in a
separate cluster and merges them according to a given similarity criterion. Divisive is a top-down
approach: all observations start in one cluster, and splits are performed recursively as one moves
down the hierarchy.
• To decide which clusters should be combined (for agglomerative) or where a cluster should
be split (for divisive) a measure of dissimilarity between sets of observations is required.
• This is achieved by use of an appropriate metric (a measure of distance between pairs of
observations), and a linkage criterion which specifies the dissimilarity of sets as a function of
the pairwise distances of observations in the sets.
• In single linkage hierarchical clustering, the distance between two clusters is defined as the
shortest distance between two points in each cluster. In complete linkage hierarchical
clustering the distance between two clusters is defined as the longest distance between two
points in each cluster.
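A short sketch of agglomerative clustering with single versus complete linkage using SciPy; the two-blob toy data and the cut into two clusters are assumptions.

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, size=(20, 2)), rng.normal(3, 0.3, size=(20, 2))])

Z_single = linkage(X, method="single")       # cluster distance = shortest pairwise distance
Z_complete = linkage(X, method="complete")   # cluster distance = longest pairwise distance

labels = fcluster(Z_complete, t=2, criterion="maxclust")   # cut the dendrogram into 2 clusters
print(labels)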
DBSCAN Algorithm:
Steps in DBSCAN:
• Select a distance to use as the radius and select a point within your dataset – all other data
points within that radius of the initial point are added to the cluster.
• Repeat the process for each new point added to the cluster, and keep repeating until no new
points fall within the radii of the most recently added points.
• Then choose another point within the dataset and build another cluster using the same approach.
Advantages of DBSCAN:
• Is intuitive.
Disadvantages of DBSCAN:
• Its effectiveness and output rely heavily on what you choose for the radius.
• Won’t react well to certain types of distributions, e.g. clusters of widely varying density.
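A minimal scikit-learn sketch of DBSCAN; the radius (eps) and min_samples values are assumptions and, as noted above, the result depends heavily on them.

import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.2, size=(50, 2)), rng.normal(4, 0.2, size=(50, 2))])

db = DBSCAN(eps=0.5, min_samples=5).fit(X)   # eps plays the role of the radius discussed above
print(set(db.labels_))                       # cluster labels; -1 marks noise points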
Gaussian mixtures:
• Gaussian Mixture Models (GMMs) have been until very recently regarded as the most
powerful model for estimating the probabilistic distribution of speech signals associated
with each of Hidden Markov Model (HMM) states.
• GMM is a density model where we combine a finite number of Gaussian distributions.
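A minimal density-model sketch with scikit-learn's GaussianMixture; the number of components and the toy data are assumptions.

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1.0, size=(100, 1)), rng.normal(6, 0.5, size=(100, 1))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)   # combine 2 Gaussians
print(gmm.means_.ravel(), gmm.weights_)       # estimated component means and mixing weights
print(gmm.score_samples([[0.0]]))             # log-density of a new point under the mixture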
Spectral Clustering:
Clustering Vs. Classification: The following table shows the differences between Clustering and
Classification.
• The process of converting a set of data having large number of dimensions into data with
smaller number of dimensions is called Dimensionality Reduction.
• When dealing with high dimensional data, it is often useful to reduce the dimensionality by
projecting the data to a lower dimensional subspace which captures the ‘essence’ of the
data. This is called ‘Dimensionality Reduction’.
• The motivation behind this technique is that although the data may appear high
dimensional, there may only be a small number of degrees of variability, corresponding to
latent factors. For example, when modeling the appearance of face images, there may only
be a few underlying latent factors which describe most of the variability, such as lighting,
pose, identity etc.
• The following are the benefits of dimensionality reduction:
o Compresses the Data and reduces the storage space required
o Requires lesser computation time
o Removes redundant features
o Reducing the dimensions of data to 2D or 3D may allow us to plot and visualize it
precisely
o Potentially reduces the noise
• There are two types of dimensionality reduction:
o Feature Extraction – finds new features in the data after it has been transformed
from a high dimensional space to a low dimensional space
o Feature Selection – finds the features most relevant to a problem. This is done by
obtaining a subset of the key features among the original variables.
Data Visualization:
Feature Selection:
• Is used to select those features that contribute most to the prediction variable that we are
interested in
• Benefits of feature selection are:
o Reduces overfitting by making data less redundant
o Reduces training time by eliminating the misleading data
o Improves accuracy by collating fewer data points
• Statistical relationships among variables are captured through regression and correlation
• Two techniques used for feature selection are:
o Regression
o Factor Analysis
• Until the late 2000s, the broader class of systems that fell into the category ‘machine
learning’ relied heavily on feature engineering.
• Features are transformations of input data resulting in numerical features that facilitate a
downstream algorithm such as a classifier to produce correct outcomes on new data.
• Feature engineering aims to take the original data and come up with representations of the
same data that can be fed to an algorithm to solve a problem.
REGRESSION:
MULTICOLLINEARITY PROBLEM:
Eigenvalue:
• A measure of the variance that a factor explains for the observed variables. A factor with an
eigenvalue less than 1 explains less variance than a single observed variable.
Applications of PCA:
• In Biology – PCA is used to interpret gene microarray data, to account for the fact that each
measurement is usually the result of many genes which are correlated in their behavior because
they belong to common biological pathways.
• In NLP – a variant of PCA called ‘latent semantic analysis’ is used for document retrieval in
Natural Language Processing.
• In Signal processing – variant of PCA, Independent Component Analysis (ICA) is used to
separate signals into their different sources.
• In Computer Graphics – it is common to project motion capture data to a low dimensional
space and use it to create animations.
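A minimal sketch of PCA recovering a small number of latent factors from apparently high dimensional data, assuming scikit-learn; the synthetic data with three latent factors is an assumption.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))                                          # three latent factors
X = latent @ rng.normal(size=(3, 20)) + 0.1 * rng.normal(size=(200, 20))    # 20 observed dimensions

pca = PCA(n_components=3).fit(X)
print(pca.explained_variance_ratio_)      # most of the variance sits in a few components
print(pca.transform(X).shape)             # (200, 3): data projected to the low dimensional subspace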
Drawbacks of PCA:
LINEAR DISCRIMINANT ANALYSIS (LDA)
• Reduces dimensions
• Searches for the linear combination of variables that best separates two classes
• Reduces the degree of overfitting
• Determines how to classify a new observation out of a group of classes
• The decision boundary between any two classes is a straight line.
LOCALLY LINEAR EMBEDDING (LLE) ALGORITHM
Steps in LLE:
ISOMETRIC FEATURE MAPPING (ISOMAP)
• Is a nonlinear dimensionality reduction method; its name derives from Isometric feature Mapping.
• Connects each data point in a high dimensional space (a face, for example) to all nearby points
(very similar faces), computes the shortest distances between all pairs of points along the
resulting network, and finds the reduced coordinates that best approximate these distances.
• In contrast to PCA, the coordinates of faces in this space are often quite meaningful: one may
represent which direction the face is facing (left profile, three quarters, head on, etc.); another
how the face looks (very sad, a little sad, neutral, happy, very happy, etc.); and so on.
• From understanding motion in video to detecting emotion in speech, IsoMap has a surprising
ability to zero in on the most important dimensions of complex data.
• Maps points on a high-dimensional non-linear manifold to a lower dimensional set of
coordinates.
• Is a multi-dimensional scaling (MDS) method that uses geodesic distances, so it can
capture manifold structure.
• Geodesic - In differential geometry, a geodesic is a curve representing in some sense the
shortest path between two points in a surface, or more generally in a Riemannian manifold.
It is a generalization of the notion of a "straight line" to a more general setting.
IsoMap algorithm:
• The connectivity of each data point in the neighborhood graph is defined as its nearest k
Euclidean neighbors in the high-dimensional space. This step is vulnerable to ‘short-circuit
errors’ if k is too large with respect to the manifold structure or if noise in the data moves the
points slightly off the manifold.
• Even a single short-circuit error can alter many entries in the geodesic distance matrix which
in turn can lead to a drastically different and incorrect low-dimensional embedding.
• Conversely, if k is too small, the neighborhood graph may become too sparse to approximate
geodesic paths accurately.
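A minimal scikit-learn sketch showing where the neighborhood size k (n_neighbors) enters the algorithm; the swiss-roll data and parameter values are assumptions.

from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap

X, _ = make_swiss_roll(n_samples=1000, random_state=0)     # a classic nonlinear manifold

# Too large an n_neighbors risks 'short-circuit' errors; too small gives a too-sparse graph
embedding = Isomap(n_neighbors=10, n_components=2).fit_transform(X)
print(embedding.shape)    # (1000, 2): low dimensional coordinates approximating geodesic distances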
t-distribution STOCHASTIC NEIGHBORHOOD EMBEDDING (t-SNE) ALGORITHM
t-SNE converts the higher dimensional data into lower dimensional data through the following steps:
• It measures the similarity between every pair of data points: similar data points get a high
similarity value and dissimilar data points get a low value.
• Then it converts these similarity distances into probabilities (joint probabilities) according to a
normal distribution.
• It does this similarity check for every point, which gives a similarity matrix ‘S1’. This is all the
calculation it does for the data points lying in the higher dimensional space.
• Next, t-SNE arranges all of the data points randomly in the required lower dimensional space.
• It then does the same calculation for the lower dimensional data points as for the higher
dimensional ones – computing similarity distances – but with one major difference: it assigns
probabilities according to a t-distribution instead of a normal distribution, which is why it is
called t-SNE and not simple SNE.
• This gives a similarity matrix for the lower dimensional data points as well; call it ‘S2’.
• t-SNE then compares matrices S1 and S2 and tries to make the difference between them as
small as possible (by minimizing the divergence between the two sets of probabilities with
gradient descent).
• At the end, we have lower dimensional data points that try to capture even the complex
relationships at which PCA fails.
• On a very high level, this is how t-SNE works.
• The visualizations produced by t-SNE are significantly better than those produced by the other
techniques on almost all of the datasets.
• t-SNE uses random walks on neighborhood graphs to allow the implicit structure of all of the
data to influence the way in which a subset of the data is displayed.
• t-SNE is better than existing techniques at creating a single map that reveals structure at many
different scales.
• t-SNE is capable of retaining local structure of the data while also revealing some important
global structure of the data (such as clusters at multiple scales).
• Both the computational and memory complexity of t-SNE are O(n^2)
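A minimal scikit-learn sketch of t-SNE for visualization; the dataset and the perplexity value are assumptions, and the O(n^2) cost noted above limits how large n can get for the exact method.

from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, _ = load_digits(return_X_y=True)          # 64-dimensional digit images

X_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(X_2d.shape)                            # (1797, 2): a 2D map suitable for plotting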
Drawbacks of t-SNE:
• The goal in semi-supervised learning is to use both labeled and unlabeled data to build
better learners, rather than using each one alone.
• In the transductive setting, semi-supervised learning is applied directly to the test data.
• SSL algorithms can use unlabeled data to help improve prediction accuracy if data satisfies
appropriate assumptions.
• Uses unlabeled data to help solve a supervised task.
• The following are the semi-supervised learning algorithms:
o Self Training
o Generative Models
o S3VMs
o Graph Based Algorithms
o Multiview algorithms
Self Training:
Advantages:
Generative Models:
Examples:
Advantages:
Advantages:
Disadvantages:
Graph-based Algorithms:
Advantages:
Disadvantages:
Multiview Algorithms:
Co-training idea:
• Co-training refers to two views of an item, for example image and HTML text.
• Train an image classifier and text classifier.
• The two classifiers teach each other.
• Assumes that features can be split into two sets and each sub-feature set is sufficient to
train a good classifier.
• Initially, two separate classifiers are trained with the labeled data on the two sub-feature sets.
• Each classifier then classifies the unlabeled data and teaches the other classifier with the few
unlabeled examples (and their predicted labels) about which it is most confident.
• Each classifier is retrained with the additional training examples given by the other classifier
and the process repeats.
Pros of Co-training algorithm:
Cons of Co-training algorithm:
• Natural feature splits may not exist, and models using both feature sets together may do better.
• Reinforcement Learning (RL) solves a very specific kind of problem where the decision
making is sequential.
• RL is a class of learning problems in which an agent interacts with an unfamiliar, dynamic
and stochastic environment.
• It’s called reinforcement learning because the agent gets positive reinforcement for tasks
done well and negative reinforcement for tasks done poorly.
• The purpose of RL is to learn the optimal policy based only on received rewards.
• RL provides a general-purpose framework for AI
• RL problems can be solved by end-to-end deep learning
• A single agent can now solve many challenging tasks
• Value-based RL
o Estimate the optimal value function
o This is the maximum value achievable under any policy
• Policy-based RL
o Search directly the optimal policy
o This is the policy achieving maximum future rewards
• Model-based RL
o Build a model of the environment
o Plan (e.g. by lookahead) using model
Markov Property:
• At each time step t, the agent chooses an action which depends on the current state S_t.
• The current state completely characterizes the state of the world.
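A hedged sketch of tabular Q-learning, one common value-based method, on a hypothetical toy environment with discrete states and actions; all hyperparameters and the reward structure are assumptions.

import numpy as np

n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))           # value estimates for each (state, action) pair
alpha, gamma, epsilon = 0.1, 0.9, 0.1         # learning rate, discount factor, exploration rate

def step(state, action):
    # Hypothetical environment: moving 'right' (action 1) from the next-to-last state pays reward 1
    next_state = min(state + action, n_states - 1)
    reward = 1.0 if (state == n_states - 2 and action == 1) else 0.0
    return next_state, reward

state = 0
for _ in range(1000):
    # Epsilon-greedy: mostly exploit the current Q estimates, sometimes explore
    action = np.random.randint(n_actions) if np.random.rand() < epsilon else int(Q[state].argmax())
    next_state, reward = step(state, action)
    # Q-learning update: move Q(s, a) toward reward + gamma * max_a' Q(s', a')
    Q[state, action] += alpha * (reward + gamma * Q[next_state].max() - Q[state, action])
    state = 0 if next_state == n_states - 1 else next_state   # restart the episode at the goal
print(Q)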
Bayesian Learning:
Pros:
Cons:
WORD2VEC
• Is a Machine Learning model used to generate Word Embeddings, in which words that are
similar to each other lie in close proximity in vector space.
• If we want to train a machine learning model on textual input or to find relations between
words, first we need to convert the text to vectors. These vectors are called Word Embeddings.
• Word2Vec can automatically capture the relations between words: for example, Paris, Beijing,
Tokyo, Delhi and New York are all clustered together in vector space; similarly, Cat, Dog, Rat
and Duck are all clustered together in vector space.
• If the model is trained on a large enough corpus of data, Word2Vec can be used to perform
relational operations, extract the most similar words to a given word, measure how similar two
words are, and pick the odd word out of a group of words.
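A minimal sketch assuming the gensim library (version 4.x parameter names); the tiny corpus is an assumption, and as noted above a useful model needs a much larger corpus.

from gensim.models import Word2Vec

sentences = [
    ["paris", "is", "a", "city"],
    ["tokyo", "is", "a", "city"],
    ["cat", "and", "dog", "are", "animals"],
    ["rat", "and", "duck", "are", "animals"],
]

model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=100)
print(model.wv.most_similar("cat", topn=3))            # nearby words in the embedding space
print(model.wv.similarity("paris", "tokyo"))           # how similar two words are
print(model.wv.doesnt_match(["cat", "dog", "paris"]))  # pick the odd word out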
Representing words:
GENERALIZATION
• Refers to the capability of applying learned knowledge to previously unseen data, i.e., to
make predictions on novel inputs.
• Without generalization there is no learning, but just memorizing
• If the validation loss decreases as well, the learned patterns seem to generalize
• Data augmentation is key to improve generalization:
o Random translation
o Left/right flipping
o Scaling
• The following are ways to improve Generalization:
o Weight sharing (greatly reduce the number of parameters)
o Data Augmentation (e.g., jittering, noise injection etc.,)
o Dropout
o Weight decay (L2, L1)
o Sparsity in the hidden units
o Multi-task (unsupervised learning)
OVERFITTING
• A model overfits when it simply memorizes the data, e.g., a curve that fits through every training data point.
• If the overfitted model is tested on the training data, the model will give zero training error.
• To test the model generalization ability, the model should be tested on unseen test cases.
• Underfitting, Good and Overfitting are shown below:
BIAS
• Defined as the average squared difference between predictions and true values.
• It’s a measure of how good your model fits the data
• Zero bias would mean that the model captures the true data generating process perfectly.
Both training and validation loss would go to zero. That’s unrealistic, as data is almost always
noisy, so some bias is inevitable.
• Bias error is useful to quantify how much, on average, the predicted values differ from
the actual values.
• A high bias error means we have an under-performing model. Thus, we aim at low bias.
VARIANCE
• Variance quantifies how different the predictions made on the same observation are from each
other.
• A high variance model will over-fit on your training population and perform badly on any
observation beyond training. Thus, we aim at low variance.
• The tradeoff between Bias/Variance is shown below:
Occam’s Razor:
• This is one of the three principles of learning from data; others being Sampling Bias and Data
Snooping.
• The principle is: the simplest model that fits the data is also the most plausible.
• Pick the simplest model that adequately explains the data.
• Select the simplest hypothesis (solution) that fits the data.
• When Occam’s Razor says that simpler is better, it doesn’t mean simpler is more elegant. It
means simpler has a better chance of being right.
• This principle is about performance, not about aesthetics. If a complex explanation of the
data performs better, we will take it.
• Has been formally proved under different sets of idealized conditions.
BAG OF WORDS
TF-IDF:
• One of the approaches to scoring, in which the frequency of words is rescaled by how often
they appear in all documents, so that the scores of frequent words like ‘the’ that are also
frequent across all documents are penalized.
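A minimal sketch with scikit-learn's TfidfVectorizer on an assumed toy corpus:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["this is interesting", "is this interesting", "the old bike", "the used bike"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)            # rows: documents, columns: vocabulary terms
print(vectorizer.get_feature_names_out())     # the learned vocabulary
print(X.toarray().round(2))                   # terms appearing in many documents get lower weight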
Term Frequency (TF):
Advantages of BoW:
Limitations of BoW:
• Vocabulary: Has a dimensionality issue, as the total dimension is the vocabulary size, and the
model can easily overfit. The vocabulary requires careful design, most specifically in order to
manage its size, which impacts the sparsity of the document representations. A remedy is to
apply a well-known dimensionality reduction technique to the input data.
• Sparsity: Sparse representations are harder to model both for computational reasons (space
and time complexity) and also for information reasons, where the challenge is for the models
to harness so little information in such a large representational space.
• Meaning: The BoW representation doesn’t consider the semantic relation between words.
Generally, the neighboring words in a sentence should be useful for predicting the target word.
Discarding word order ignores the context, and in turn the meaning of words in the document
(semantics). Context and meaning can offer a lot to the model; if modeled, they could tell the
difference between the same words arranged differently (‘this is interesting’ vs ‘is this
interesting’), synonyms (‘old bike’ vs ‘used bike’) and much more.
RECOMMENDER SYSTEMS
• Recommender Systems are software tools and techniques providing suggestions for items to
be of use to a user. The suggestions relate to various decision-making processes, such as what
items to buy, what music to listen to or what online news to read.
• A recommender system is an information-filtering technique that provides users with
recommendations for items they might be interested in.
• The following are the benefits of Recommender Systems to companies:
o Increase the number of items sold
o Sell more diverse items
o Increase the user satisfaction
o Increase user fidelity
o Better understand what the user wants
• The following are the benefits of Recommender Systems to users:
o Find some good items
o Find all good items
o Annotation in context
o Recommend a sequence
o Recommend a bundle
• Predictive perspective
• Interaction perspective
• Conversion perspective
• Retrieval perspective
• Recommendation perspective
Collaborative Filtering:
• Collaborative filtering systems make recommendations based on historic preferences of the
users.
• It uses item based nearest neighbor or user based nearest neighbor method.
• User based nearest neighbor recommends items by finding users similar to the active user.
• Item based nearest neighbor is easy to scale and can be computed offline and served without
constant retraining.
• Item based nearest neighbor can be implemented best through KNN model.
• A common example of CF is predicting which movies people will want to watch based on how
they and other people have rated movies which they have already seen. The key idea is that
the prediction is not based on features of the movie or user, but merely on a ratings matrix.
Cosine similarity:
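A minimal sketch of cosine similarity between item rating vectors, as used in item-based nearest neighbor CF; the toy ratings matrix (0 meaning 'not rated') is an assumption.

import numpy as np

R = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)                               # rows: users, columns: items

def cosine_similarity(a, b):
    # cos(a, b) = (a . b) / (||a|| * ||b||)
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Item-based CF compares item (column) vectors against each other
print(cosine_similarity(R[:, 0], R[:, 1]))    # items 0 and 1 are rated similarly: high similarity
print(cosine_similarity(R[:, 0], R[:, 2]))    # items 0 and 2 are rated differently: low similarity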
Association Rule:
• Association rule mining uses machine learning models to analyze data for patterns or co-
occurrence in a database.
• Each transaction is a list of items.
• Association rule finds all rules that correlate the presence of one set of items with that of
another set of items.
• It identifies frequent patterns and is most commonly used for market basket analysis.
• Performance measures for association rules include support, confidence and lift.
• Support indicates how frequently the items appear in the data; it is the fraction of
transactions that contain both X and Y.
• Confidence indicates how often the if-then statement is found to be true, i.e., how often X and
Y occur together relative to the number of times X occurs.
• Lift compares the actual confidence with the expected confidence and indicates the strength
of a rule over the random co-occurrence of X and Y.
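A small worked sketch of support, confidence and lift for a rule X -> Y on an assumed list of transactions.

transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]
X, Y = {"bread"}, {"milk"}

n = len(transactions)
count_X = sum(1 for t in transactions if X <= t)           # transactions containing X
count_Y = sum(1 for t in transactions if Y <= t)           # transactions containing Y
count_XY = sum(1 for t in transactions if (X | Y) <= t)    # transactions containing X and Y

support = count_XY / n                     # fraction of transactions containing X and Y
confidence = count_XY / count_X            # how often Y appears when X appears
lift = confidence / (count_Y / n)          # confidence relative to Y's baseline frequency

print(support, confidence, lift)           # 0.6, 0.75, 0.9375 for this toy data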
Apriori Algorithm:
Steps:
• Ensemble methods refer to combining many different machine learning models in order to
get a more powerful prediction.
• Thus ensemble methods increase the accuracy of the predictions.
• A commonly used class of ensemble methods are forests of randomized trees.
• Ensemble Methods are used in order to:
o Decrease variance (bagging)
o Decrease bias (boosting)
o Improve predictions (stacking)
Bagging:
• Bagging (Bootstrap Aggregation) averages a given procedure over many samples to
reduce its variance.
• Bagging builds multiple models on the data by sampling it with replacement, i.e., it utilizes
bootstrapping.
• This reduces the noise and variance by utilizing multiple samples.
• Each hypothesis has the same weight as all the others, and the outputs of the various models
are then aggregated.
• Averaging the prediction over a collection of unstable predictors generated from bootstrap
samples (both classification and regression)
• A technique for reducing the variance of an estimated prediction function.
• Bootstrap: randomly drawn datasets with replacement from the training data, each sample
the same size as the original training set. Sampling the training data with replacement.
• Sample the training data ‘k’ times to obtain datasets Di, and train a classifier Ci on each Di.
• Each test sample is classified by all ‘k’ classifiers.
• Results are averaged to obtain the final decision.
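A minimal scikit-learn sketch of bagging unstable predictors (the default base estimator is a decision tree); the synthetic data and the number of estimators are assumptions.

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# k bootstrap samples -> k classifiers; their predictions are aggregated for the final decision
bag = BaggingClassifier(n_estimators=50, random_state=0)
bag.fit(X_train, y_train)
print(bag.score(X_test, y_test))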
Boosting:
NATURAL LANGUAGE PROCESSING (NLP)
• NLP is a field of computer science, AI and computational linguistics concerned with the
interactions between computers and human (natural) languages.
Language Model:
Two types:
• Sparsity – Solved
• Word Similarity – Solved
• Finite context – Not solved
• Computational Complexity – Softmax
N-gram model:
Corpus:
• A body of text is called ‘Corpus’ from the Latin word for ‘body’.
Smoothing:
• Text classification can be done with Naïve Bayes n-gram models or with any of the
classification algorithms.
Applications of NLP:
• Language Modeling (Speech recognition, Machine Translation)
• Word-sense Learning
• Reasoning over Knowledge Bases
• Acoustic Modeling
• Parts-Of-Speech Tagging
• Chunking
• Named Entity Recognition
• Semantic Role Labeling
• Parsing
• Sentiment Analysis
• Paraphrasing
• Question-Answering
https://lnkd.in/gaJtbwcu