
UNIT-2

DEEP NEURAL NETWORKS


Deep neural network
• Deep neural networks, or deep learning networks, have several hidden layers
with millions of artificial neurons linked together.
• A number, called a weight, represents the connection between one node and
another.
• The weight is a positive number if one node excites another, or negative if one
node suppresses the other.
• A deep neural network (DNN) is an artificial neural network (ANN) with
multiple layers between the input and output layers.
• There are different types of neural networks but they always consist of the same
components: neurons, synapses, weights, biases, and functions.
•While deep learning is certainly not new, it is experiencing explosive growth because of the intersection of
deeply layered neural networks and the use of GPUs to accelerate their execution. Big data has also fed this
growth.
•Because deep learning relies on training neural networks with example data and rewarding them based
on their success, the more data, the better to build these deep learning structures.
•The range of architectures and algorithms used in deep learning is wide and varied. This section
explores six deep learning architectures spanning the past 20 years.
•Notably, long short-term memory (LSTM) and convolutional neural networks (CNNs) are two of the
oldest approaches in this list but also two of the most used in various applications.
•Deep learning architectures can be divided into supervised and unsupervised learning. This section
introduces several popular deep learning architectures:
1. convolutional neural networks
2. recurrent neural networks (RNNs)
3. long short-term memory/gated recurrent unit (GRU)
4. self-organizing map (SOM)
5. autoencoders (AE)
6. restricted Boltzmann machine (RBM)
• It also gives an overview of deep belief networks (DBN) and deep stacking
networks (DSNs).
• Artificial neural network (ANN) is the underlying architecture behind deep
learning. Based on ANN, several variations of the algorithms have been invented.
Supervised deep learning
•Supervised learning refers to the problem space wherein the target to be
predicted is clearly labelled within the data that is used for training.
•Deep learning uses supervised learning in situations such as image classification
or object detection, as the network is used to predict a label or a number (the
input and the output are both known).
•As the labels of the images are known, the network is used to reduce the error
rate, so it is “supervised”.
•We introduce, at a high level, two of the most popular supervised deep learning
architectures:
1. convolutional neural networks and
2. recurrent neural networks, as well as some of their variants.
Convolutional neural networks
• A convolutional neural network, or CNN, is a deep learning neural network designed
for processing structured arrays of data such as images.
• Convolutional neural networks are widely used in computer vision and have become the
state of the art for many visual applications such as image classification, and have also
found success in natural language processing for text classification.
• Convolutional neural networks are very good at picking up on patterns in the input
image, such as lines, gradients, circles, or even eyes and faces. It is this property that
makes convolutional neural networks so powerful for computer vision.
• Unlike earlier computer vision algorithms, convolutional neural networks can operate
directly on a raw image and do not need any preprocessing.
• A convolutional neural network is a feed-forward neural network, often with up to 20
or 30 layers. The power of a convolutional neural network comes from a special kind of
layer called the convolutional layer.
• A CNN is a multilayer neural network that was biologically inspired by the animal visual cortex. The
architecture is particularly useful in image-processing applications.
• The first CNN was created by Yann LeCun; at the time, the architecture focused on handwritten
character recognition, such as postal code interpretation.
• As a deep network, early layers recognize features (such as edges), and later layers recombine these
features into higher-level attributes of the input.
• The LeNet CNN architecture is made up of several layers that implement feature extraction and then
classification (see the following image).
• The image is divided into receptive fields that feed into a convolutional layer, which then extracts
features from the input image.
• The next step is pooling, which reduces the dimensionality of the extracted features (through
down-sampling) while retaining the most important information (typically, through max pooling).
• Another convolution and pooling step is then performed that feeds into a fully connected multilayer
perceptron.
• The final output layer of this network is a set of nodes that identify features of the image (in this case,
a node per identified number). You train the network by using back-propagation.
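To make this layer sequence concrete, here is a minimal LeNet-style model sketch in Python, assuming the TensorFlow/Keras library is available; the exact layer sizes and activations are illustrative, not a faithful reproduction of LeCun's original network.

```python
# A minimal LeNet-style CNN sketch (assumes TensorFlow/Keras is installed).
# It follows the pattern described above:
# convolution -> pooling -> convolution -> pooling -> fully connected layers.
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(28, 28, 1)),              # grayscale input image
    layers.Conv2D(6, kernel_size=5, activation='tanh', padding='same'),
    layers.AveragePooling2D(pool_size=2),         # down-sample the feature maps
    layers.Conv2D(16, kernel_size=5, activation='tanh'),
    layers.AveragePooling2D(pool_size=2),
    layers.Flatten(),
    layers.Dense(120, activation='tanh'),         # fully connected MLP
    layers.Dense(84, activation='tanh'),
    layers.Dense(10, activation='softmax'),       # one output node per digit class
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.summary()
```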
The use of deep layers of processing, convolutions, pooling, and a fully connected
classification layer opened the door to various new applications of deep learning neural
networks. In addition to image processing, the CNN has been successfully applied to
video recognition and various tasks within natural language processing.
• A Convolutional Neural Network (CNN) is a type of deep learning algorithm that is
particularly well-suited for image recognition and processing tasks.
• It is made up of multiple layers, including convolutional layers, pooling layers, and fully
connected layers.
• The convolutional layers are the key component of a CNN, where filters are applied to the
input image to extract features such as edges, textures, and shapes.
• The output of the convolutional layers is then passed through pooling layers, which are
used to down-sample the feature maps, reducing the spatial dimensions while retaining the
most important information.
• The output of the pooling layers is then passed through one or more fully connected
layers, which are used to make a prediction or classify the image.
• CNNs are trained using a large dataset of labeled images, where the network learns to
recognize patterns and features that are associated with specific objects or classes.
• Once trained, a CNN can be used to classify new images, or extract features for use in other
applications such as object detection or image segmentation.
• CNNs have achieved state-of-the-art performance on a wide range of
image recognition tasks, including object classification, object
detection, and image segmentation.
• They are widely used in computer vision, image processing, and
other related fields, and have been applied to a wide range of
applications, including self-driving cars, medical imaging, and
security systems.
Convolutional Neural Network Design
• A convolutional neural network is constructed as a multi-layered feed-forward neural
network, made by assembling many hidden layers on top of each other in a particular order.
• It is this sequential design that allows the CNN to learn hierarchical features.
• In a CNN, the hidden layers are typically convolutional layers followed by activation
layers, some of them followed by pooling layers.
• The architecture of a ConvNet is analogous to the connectivity pattern of neurons in
the human brain and was motivated by the organization of the visual cortex.
• Different Types of CNN Models:
1. LeNet
2. AlexNet
3. ResNet
4. GoogLeNet
5. MobileNet
6. VGG
Applications of CNN
• Decoding Facial Recognition
• Understanding Climate
• Collecting Historic and Environmental Elements
• Image recognition
• Video analysis
• Natural language processing
Recurrent neural networks
• The RNN is one of the foundational network architectures from which other
deep learning architectures are built.
• The primary difference between a typical multilayer network and a recurrent
network is that rather than completely feed-forward connections, a recurrent
network might have connections that feed back into prior layers (or into the
same layer).
• This feedback allows RNNs to maintain memory of past inputs and model
problems in time. RNNs consist of a rich set of architectures (we'll look at one
popular topology called LSTM next).
• The key differentiator is feedback within the network, which could manifest itself
from a hidden layer, the output layer, or some combination thereof.
• RNNs can be unfolded in time and trained with standard back-propagation or
by using a variant of back-propagation that is called back-propagation through time
(BPTT).
• A Recurrent Neural Network (RNN) is a type of neural network where the output from
the previous step is fed as input to the current step.
• In traditional neural networks, all the inputs and outputs are independent of each
other.
• However, in cases where it is required to predict the next word of a sentence, the
previous words are needed, and hence there is a need to remember them.
How does an RNN differ from a feedforward neural network?
• Artificial neural networks that do not have looping nodes are called feedforward
neural networks. Because all information is only passed forward, this kind of
neural network is also referred to as a multi-layer neural network.
• Information moves unidirectionally from the input layer to the output layer, through
any hidden layers that are present. These networks are appropriate for tasks such as
image classification, where input and output are independent. Nevertheless, their
inability to retain previous inputs automatically renders them less useful for
sequential data analysis.
• Thus the RNN came into existence, which solved this issue with the help of a hidden
layer. The main and most important feature of the RNN is its hidden state, which
remembers some information about a sequence.
• The state is also referred to as the memory state, since it remembers the previous input to
the network.
• The RNN uses the same parameters for each input, as it performs the same task on all the
inputs or hidden layers to produce the output. This parameter sharing reduces the number
of parameters, unlike other neural networks.
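As a concrete illustration of the feedback and parameter sharing described above, the following NumPy sketch implements a single recurrent step; the weight shapes, random toy data, and the rnn_step helper are illustrative assumptions, not a library API.

```python
# A minimal sketch of one recurrent step, using NumPy only.
# The same weight matrices (W_xh, W_hh) are reused at every time step,
# which is the parameter sharing described above.
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size = 4, 8
W_xh = rng.normal(scale=0.1, size=(hidden_size, input_size))   # input -> hidden
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # hidden -> hidden (feedback)
b_h = np.zeros(hidden_size)

def rnn_step(x_t, h_prev):
    """h_t = tanh(W_xh @ x_t + W_hh @ h_prev + b_h)"""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

h = np.zeros(hidden_size)                    # initial memory state
sequence = rng.normal(size=(5, input_size))  # 5 time steps of toy input
for x_t in sequence:
    h = rnn_step(x_t, h)                     # hidden state carries past information
print(h)
```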
Types of RNN
There are four types of RNNs based on the number of inputs and outputs in the network.
1. One to One
2. One to Many
3. Many to One
4. Many to Many
One to One
This type of RNN behaves the same as any simple neural network; it is also known as a vanilla
neural network. In this configuration, there is only one input and one output.
One To Many
In this type of RNN, there is one input and many outputs associated with it. One of the most common
examples of this network is image captioning, where, given an image, we predict a sentence consisting
of multiple words.
Many to One
In this type of network, many inputs are fed to the network at several time steps, generating
only one output. This type of network is used in problems like sentiment analysis, where we
give multiple words as input and predict only the sentiment of the sentence as output.
Many to Many
In this type of neural network, there are multiple inputs and multiple outputs corresponding to a
problem. One example is language translation, where we provide multiple words from one
language as input and predict multiple words in the second language as output.
Advantages
1. An RNN remembers each piece of information through time. It is useful in time-series
prediction because of this ability to remember previous inputs. (Long Short-Term
Memory networks, discussed below, extend this capability.)
2. Recurrent neural networks are even used with convolutional layers to extend the
effective pixel neighborhood.
Disadvantages
1. Gradient vanishing and exploding problems.
2. Training an RNN is a very difficult task.
3. It cannot process very long sequences when using tanh or ReLU as the activation
function.
Applications of Recurrent Neural Network
1. Language Modelling and Generating Text
2. Speech Recognition
3. Machine Translation
4. Image Recognition, Face detection
5. Time series Forecasting
LSTM networks
•The LSTM was created in 1997 by Hochreiter and Schmidhuber, but it has grown in popularity in recent years as an
RNN architecture for various applications. You'll find LSTMs in products that you use every day, such as smartphones.
IBM applied LSTMs in IBM Watson® for milestone-setting conversational speech recognition.
•LSTM (Long Short-Term Memory) is a recurrent neural network (RNN) architecture widely used in deep learning. It
excels at capturing long-term dependencies, making it ideal for sequence prediction tasks.
•The LSTM departed from typical neuron-based neural network architectures and instead introduced the concept of a
memory cell. The memory cell can retain its value for a short or long time as a function of its inputs, which allows the
cell to remember what's important and not just its last computed value.
•The LSTM memory cell contains three gates that control how information flows into or out of the cell.
•The input gate controls when new information can flow into the memory.
•The forget gate controls when an existing piece of information is forgotten, allowing the cell to remember new data.
•Finally, the output gate controls when the information that is contained in the cell is used in the output from the cell. The
cell also contains weights, which control each gate. The training algorithm, commonly BPTT, optimizes these weights
based on the resulting network output error.
•Recent applications of CNNs and LSTMs produced image and video captioning systems in which an image or video is
captioned in natural language. The CNN implements the image or video processing, and the LSTM is trained to convert
the CNN output into natural language.
•Example applications: Image and video captioning systems
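The following NumPy sketch shows how the three gates described above might interact in a single LSTM step; the weight layout (one matrix per gate over the concatenated input and hidden state) and the toy dimensions are illustrative assumptions, not a library API.

```python
# A minimal sketch of a single LSTM step in NumPy, following the three-gate
# description above (input, forget, and output gates).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_in, n_hid = 4, 8
# One weight matrix per gate, plus one for the candidate memory content.
W_i, W_f, W_o, W_c = (rng.normal(scale=0.1, size=(n_hid, n_in + n_hid))
                      for _ in range(4))
b_i, b_f, b_o, b_c = (np.zeros(n_hid) for _ in range(4))

def lstm_step(x_t, h_prev, c_prev):
    z = np.concatenate([x_t, h_prev])
    i = sigmoid(W_i @ z + b_i)        # input gate: how much new info flows in
    f = sigmoid(W_f @ z + b_f)        # forget gate: how much old memory to keep
    o = sigmoid(W_o @ z + b_o)        # output gate: how much memory to expose
    c_tilde = np.tanh(W_c @ z + b_c)  # candidate memory content
    c = f * c_prev + i * c_tilde      # updated memory cell
    h = o * np.tanh(c)                # hidden state / output
    return h, c

h = c = np.zeros(n_hid)
for x_t in rng.normal(size=(5, n_in)):
    h, c = lstm_step(x_t, h, c)
```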
The advantages of LSTM
• Long-term dependencies can be captured by LSTM networks. They have a
memory cell that is capable of long-term information storage.
• In traditional RNNs, there is a problem of vanishing and exploding gradients
when models are trained over long sequences. By using a gating mechanism that
selectively recalls or forgets information, LSTM networks deal with this
problem.
• LSTM enables the model to capture and remember important context,
even when there is a significant time gap between relevant events in the
sequence. LSTMs are therefore used where understanding context is important,
e.g., machine translation.
The disadvantages of LSTM
• Compared to simpler architectures like feed-forward neural networks, LSTM
networks are computationally more expensive. This can limit their scalability for
large-scale datasets or resource-constrained environments.
• Training LSTM networks can be more time-consuming compared to simpler
models due to their computational complexity, so training LSTMs often requires
more data and longer training times to achieve high performance.
• Since sequences are processed word by word in a sequential manner, it is hard to
parallelize the work of processing sentences.
Applications of LSTM
Language Modeling: LSTMs have been used for natural language processing tasks such as
language modeling, machine translation, and text summarization. They can be trained to
generate coherent and grammatically correct sentences by learning the dependencies
between words in a sentence.
Speech Recognition: LSTMs have been used for speech recognition tasks such as transcribing
speech to text and recognizing spoken commands. They can be trained to recognize patterns
in speech and match them to the corresponding text.
Time Series Forecasting: LSTMs have been used for time series forecasting tasks such as
predicting stock prices, weather, and energy consumption. They can learn patterns in time
series data and use them to make predictions about future events.
Anomaly Detection: LSTMs have been used for anomaly detection tasks such as detecting
fraud and network intrusion. They can be trained to identify patterns in data that deviate from
the norm and flag them as potential anomalies.
Recommender Systems: LSTMs have been used for recommendation tasks such as
recommending movies, music, and books. They can learn patterns in user behavior and use
them to make personalized recommendations.
Video Analysis: LSTMs have been used for video analysis tasks such as object detection,
activity recognition, and action classification. They can be used in combination with other
neural network architectures, such as Convolutional Neural Networks (CNNs), to analyze video
data and extract useful information.
GRU networks
• Gated Recurrent Unit (GRU) is a type of recurrent neural network (RNN) that was
introduced by Cho et al. Like LSTM, GRU can process sequential data such as text,
speech, and time-series data.
• In 2014, a simplification of the LSTM was introduced called the gated recurrent unit.
This model has two gates, getting rid of the output gate present in the LSTM model.
• These gates are an update gate and a reset gate.
• The update gate indicates how much of the previous cell contents to maintain.
• The reset gate defines how to incorporate the new input with the previous cell
contents. A GRU can model a standard RNN simply by setting the reset gate to 1 and the
update gate to 0. The reset gate determines how much of the previous hidden state
should be forgotten, while the update gate determines how much of the new input
should be used to update the hidden state.
• The output of the GRU is calculated based on the updated hidden state.
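A minimal NumPy sketch of one GRU step is shown below, following the convention above in which the update gate indicates how much of the previous state to maintain; the shapes and helper names are illustrative assumptions, and biases are omitted for brevity.

```python
# A minimal sketch of one GRU step in NumPy, matching the two-gate
# description above (update gate z and reset gate r).
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

rng = np.random.default_rng(0)
n_in, n_hid = 4, 8
W_z, W_r, W_h = (rng.normal(scale=0.1, size=(n_hid, n_in + n_hid))
                 for _ in range(3))

def gru_step(x_t, h_prev):
    xh = np.concatenate([x_t, h_prev])
    z = sigmoid(W_z @ xh)  # update gate: how much previous state to maintain
    r = sigmoid(W_r @ xh)  # reset gate: how much past state feeds the new input
    h_tilde = np.tanh(W_h @ np.concatenate([x_t, r * h_prev]))  # candidate state
    # With the update gate at 0 and the reset gate at 1, this reduces to a
    # standard RNN step, as noted above.
    return z * h_prev + (1 - z) * h_tilde

h = np.zeros(n_hid)
for x_t in rng.normal(size=(5, n_in)):
    h = gru_step(x_t, h)
```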
The GRU is simpler than the LSTM, can be trained more quickly, and can be
more efficient in its execution. However, the LSTM can be more expressive and
with more data can lead to better results.
Example applications: Natural language text compression, handwriting
recognition, speech recognition, gesture recognition, image captioning
Unsupervised deep learning
•Unsupervised learning refers to the problem space wherein there is no target label within the data that
is used for training.
•Unsupervised learning in artificial intelligence is a type of machine learning that learns from data
without human supervision. Unlike supervised learning, unsupervised machine learning models are
given unlabeled data and allowed to discover patterns and insights without any explicit guidance or
instruction.
• Unsupervised learning is a type of machine learning that learns from unlabeled data. This means that the
data does not have any pre-existing labels or categories. The goal of unsupervised learning is to
discover patterns and relationships in the data without any explicit guidance.
• Unsupervised learning is the training of a machine using information that is neither classified nor
labeled and allowing the algorithm to act on that information without guidance. Here the task of the
machine is to group unsorted information according to similarities, patterns, and differences without
any prior training of data.
• This section discusses three unsupervised deep learning architectures: self-organizing maps,
autoencoders, and restricted Boltzmann machines. We also discuss how deep belief networks and deep
stacking networks are built based on these underlying unsupervised architectures.
Self-organizing maps
• The self-organizing map (SOM) was invented by Dr. Teuvo Kohonen in 1982 and is popularly
known as the Kohonen map.
• SOM is an unsupervised neural network that creates clusters of the input data set by reducing
the dimensionality of the input. SOMs vary from the traditional artificial neural network in quite
a few ways.
• SOM is used for clustering and mapping (or dimensionality reduction), mapping
multidimensional data onto a lower-dimensional space, which reduces complex
problems to a form that is easier to interpret.
• SOM has two layers, one is the Input layer and the other one is the Output layer.
• The first significant variation is that weights serve as a characteristic of the node. After the inputs
are normalized, a random input is first chosen.
• Random weights close to zero are initialized for each feature of the input record. These weights
represent the input node, and different combinations of these random weights represent variations of
the input node.
• The Euclidean distance between each of these output nodes and the input node is calculated.
The node with the least distance is declared the most accurate representation of the input and is
marked as the best matching unit, or BMU.
• With these BMUs as center points, other units are evaluated in the same way and assigned to the
cluster of the BMU they are closest to.
• The weights of the nodes within a radius around the BMU are updated based on their proximity to it,
and the radius shrinks as training proceeds.
• Next, in an SOM, no activation function is applied, and because there are no target labels to
compare against, there is no concept of calculating error or backpropagation.
• Example applications: Dimensionality reduction, clustering high-dimensional inputs to a
2-dimensional output, and cluster visualization
Algorithm
Training:
Step 1: Initialize the weights wij to small random values. Initialize the learning
rate α.
Step 2: Calculate the squared Euclidean distance for each output unit j:
D(j) = Σ (wij − xi)^2, where i = 1 to n and j = 1 to m
Step 3: Find the index J for which D(j) is minimum; J is the winning unit.
Step 4: For each unit j within a specified neighborhood of J, and for all i, calculate the new
weight:
wij(new) = wij(old) + α[xi − wij(old)]
Step 5: Update the learning rate:
α(t+1) = 0.5 α(t)
Step 6: Test the stopping condition.
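The training steps above can be sketched in NumPy as follows; the one-dimensional map, the immediate-neighbor neighborhood, and the toy data are illustrative assumptions.

```python
# A minimal sketch of the SOM training steps above, in NumPy.
import numpy as np

rng = np.random.default_rng(0)
n_features, n_units = 3, 10             # m = 10 output units on a 1-D map
W = rng.random((n_units, n_features))   # Step 1: small random initial weights
alpha = 0.5                             # initial learning rate
X = rng.random((100, n_features))       # unlabeled training data

for epoch in range(20):
    for x in X:
        D = np.sum((W - x) ** 2, axis=1)   # Step 2: squared Euclidean distance
        J = np.argmin(D)                   # Step 3: winning unit (BMU)
        # Step 4: update the BMU and its immediate neighbors on the map.
        for j in range(max(0, J - 1), min(n_units, J + 2)):
            W[j] += alpha * (x - W[j])
    alpha *= 0.5                           # Step 5: alpha(t+1) = 0.5 * alpha(t)
# Step 6 (stopping condition) is represented here by the fixed epoch count.
```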
Autoencoders
•Though the history of when autoencoders were invented is hazy, the first known usage of autoencoders was
found to be by LeCun in 1987. This variant of an ANN is composed of 3 layers: input, hidden, and output
layers.
•First, the input layer is encoded into the hidden layer using an appropriate encoding function. The number
of nodes in the hidden layer is much less than the number of nodes in the input layer.
•This hidden layer contains the compressed representation of the original input. The output layer aims to
reconstruct the input layer by using a decoder function.
•During the training phase, the difference between the input and the output layer is calculated
using an error function, and the weights are adjusted to minimize the error.
•Unlike traditional unsupervised learning techniques, where there is no data to compare the
outputs against, autoencoders learn continuously using backward propagation. For this reason,
autoencoders are classified as self-supervised algorithms.
• The input layer takes the raw input data.
• The hidden layers progressively reduce the dimensionality of the input, capturing important
features and patterns. These layers compose the encoder.
• The bottleneck layer (latent space) is the final hidden layer, where the dimensionality is
significantly reduced. This layer represents the compressed encoding of the input data.
•Example applications: Dimensionality reduction, data interpolation, and data
compression/decompression
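A minimal autoencoder sketch, assuming TensorFlow/Keras is available, is shown below; the layer sizes and the 784-dimensional input (a flattened 28x28 image) are illustrative assumptions.

```python
# A minimal autoencoder sketch (assumes TensorFlow/Keras). The bottleneck layer
# holds the compressed representation; the decoder reconstructs the input, and
# training minimizes the reconstruction error described above.
from tensorflow.keras import layers, models

input_dim, bottleneck_dim = 784, 32      # e.g. flattened 28x28 images

autoencoder = models.Sequential([
    layers.Input(shape=(input_dim,)),
    layers.Dense(128, activation='relu'),            # encoder
    layers.Dense(bottleneck_dim, activation='relu'), # latent space (bottleneck)
    layers.Dense(128, activation='relu'),            # decoder
    layers.Dense(input_dim, activation='sigmoid'),   # reconstruction of input
])
# The input is also the target: the error is the difference between
# the input and the reconstructed output.
autoencoder.compile(optimizer='adam', loss='mse')
# autoencoder.fit(x_train, x_train, epochs=10)  # train on unlabeled data
```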
Types of Autoencoders
1. Denoising Autoencoder
Denoising autoencoder works on a partially corrupted input and trains to recover the original
undistorted image. As mentioned above, this method is an effective way to constrain the network from
simply copying the input and thus learn the underlying structure and important features of the data.
2. Sparse Autoencoder
This type of autoencoder typically contains more hidden units than the input but only a few are allowed
to be active at once. This property is called the sparsity of the network. The sparsity of the network can be
controlled by either manually zeroing the required hidden units, tuning the activation functions or by
adding a loss term to the cost function.
3. Variational Autoencoder
Variational autoencoder makes strong assumptions about the distribution of latent variables and uses
the Stochastic Gradient Variational Bayes estimator in the training process.
4. Convolutional Autoencoder
Convolutional autoencoders are a type of autoencoder that use convolutional neural networks (CNNs) as
their building blocks. The encoder consists of multiple layers that take an image or a grid as input and
pass it through different convolution layers, forming a compressed representation of the input. The
decoder is the mirror image of the encoder: it deconvolves the compressed representation and tries to
reconstruct the original image.
Restricted Boltzmann Machines
•Though RBMs became popular much later, they were originally invented by Paul Smolensky in 1986 and
were known as the Harmonium.
•An RBM is a 2-layered neural network. The layers are the visible (input) layer and the hidden layer. As
shown in the following figure, in RBMs every node in the hidden layer is connected to every node in the
visible layer.
•In a traditional Boltzmann Machine, nodes within the input and hidden layer are also connected. Due to
computational complexity, nodes within a layer are not connected in a Restricted Boltzmann Machine.
•During the training phase, RBMs calculate the probability distribution of the training set using a
stochastic approach. When the training begins, each neuron gets activated at random.
•Also, the model contains respective hidden and visible biases. While the hidden bias is used in the forward
pass to build the activation, the visible bias helps in reconstructing the input.
•Because in an RBM the reconstructed input is always different from the original input, they are also
known as generative models.
•Also, because of the built-in randomness, the same input can produce different outputs across runs. In
fact, this is the most significant difference from an autoencoder, which is a deterministic model.
•Example applications: Dimensionality reduction and collaborative filtering
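The following NumPy sketch illustrates the stochastic forward pass and reconstruction described above; the layer sizes, weight scale, and binary toy input are illustrative assumptions, and the weight-update (training) step is omitted.

```python
# A minimal sketch of an RBM forward pass and reconstruction in NumPy,
# showing the stochastic activations and the hidden/visible biases above.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_visible, n_hidden = 6, 4
W = rng.normal(scale=0.1, size=(n_visible, n_hidden))
b_visible = np.zeros(n_visible)   # visible bias: helps reconstruct the input
b_hidden = np.zeros(n_hidden)     # hidden bias: used in the forward pass

v = rng.integers(0, 2, n_visible).astype(float)   # a binary input vector

# Forward pass: sample binary hidden units from their activation probabilities.
p_h = sigmoid(v @ W + b_hidden)
h = (rng.random(n_hidden) < p_h).astype(float)    # stochastic activation

# Reconstruction: sample the visible layer back from the hidden sample.
p_v = sigmoid(h @ W.T + b_visible)
v_reconstructed = (rng.random(n_visible) < p_v).astype(float)
# Because of the sampling, v_reconstructed differs from v between runs,
# which is why RBMs are described as generative rather than deterministic.
```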
Multimodal fusion architectures
Multimodal fusion is the integration of heterogeneous data from different modalities to take
advantage of the complementarity of the data in order to provide better prediction performance.
Each modality contains information that is useful and complementary to the other modalities.
Multimodal fusion architectures are designed to combine information from multiple modalities at
various levels of the network. Here are some common approaches to multimodal fusion in deep
neural networks (a sketch contrasting the first two appears after the list):
1. Early Fusion:
1. In early fusion, features from different modalities are combined at the input layer of the
neural network.
2. For example, in a task involving both images and text, early fusion might concatenate
image features and text embeddings before passing them through the neural network.
2. Late Fusion:
1. Late fusion involves processing each modality independently through separate neural
network branches and combining their outputs at a later stage, typically at the final layers.
2. This approach allows the network to learn modality-specific representations before making a
joint decision.
3. Intermediate Fusion:
1. Intermediate fusion combines features from different modalities at an intermediate layer of
the network.
2. This allows the model to capture both early and late fusion characteristics, leveraging
modality-specific information while also facilitating joint learning.
4.Attention Mechanisms:
Attention mechanisms, such as self-attention or cross-modal attention, can be used to dynamically weigh the
importance of different modalities or parts of modalities.
These mechanisms enable the network to focus on relevant information for a given task.
5.Multimodal Transformers:
Transformer architectures, initially developed for natural language processing, have been adapted for multimodal
tasks.
Multimodal Transformers can process sequences of data from different modalities, enabling effective fusion for
tasks like image captioning or video understanding.
6.Graph Neural Networks (GNNs):
GNNs can be applied when the relationships between modalities can be represented as a graph structure.
Nodes in the graph may correspond to different modalities, and edges may represent relationships or interactions
between them.
7.Memory Networks:
Memory-augmented neural networks can be used to store and retrieve information from different modalities
dynamically during the course of processing.
This allows the network to maintain context and relevant information across modalities.
8.Hybrid Architectures:
Hybrid architectures combine elements from various fusion strategies to create a custom solution tailored to the
specific requirements of the task.
For instance, a model might use early fusion for one set of modalities and late fusion for another.
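To make the contrast between the first two approaches concrete, here is a minimal NumPy sketch; the feature dimensions and the random linear classifiers standing in for trained network branches are illustrative assumptions.

```python
# A minimal sketch contrasting early and late fusion with NumPy arrays.
# The feature extractors are stand-ins (random vectors), purely illustrative.
import numpy as np

rng = np.random.default_rng(0)
image_features = rng.random(512)     # e.g. output of a CNN image branch
text_features = rng.random(300)      # e.g. a text embedding

# Early fusion: concatenate modality features at the input of a joint model.
early_input = np.concatenate([image_features, text_features])   # shape (812,)
W_joint = rng.normal(scale=0.01, size=(10, early_input.size))
early_scores = W_joint @ early_input

# Late fusion: each modality is classified independently, and the
# final scores of each unimodal branch are combined (here, averaged).
W_img = rng.normal(scale=0.01, size=(10, image_features.size))
W_txt = rng.normal(scale=0.01, size=(10, text_features.size))
late_scores = (W_img @ image_features + W_txt @ text_features) / 2
```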
•Multimodal fusion: in a multimodal setting, it is very common to transfer models trained on the individual
modalities and merge them at a single point.
•The merge can happen at the deepest layers, known as late fusion, which is relatively successful on a number of
multimodal tasks.
•The goal is to find good ways to combine multimodal features to better exploit the information embedded at
different layers in deep learning models for classification.
•In vision, for example, lower layers are known to serve as edge detectors with different orientations and extents,
while later layers capture more complex information such as semantic concepts, like faces, trees, animals, etc.
•For example, learning to classify furry animals might require analysis of lower-level visual features that can be
used to build up the concept of fur, whereas classes like chirping birds or growling might require analysis of more
complex audio-visual attributes.
•Indeed, features from different layers and different modalities can give different insights into the input data. The
problem of multimodal classification can then be posed directly as a combinatorial search over fusion points.
•To categorize the different recent approaches to deep multimodal fusion, we define two main paths:
1. architectures and
2. constraints.
•The first path focuses on building the best possible fusion architectures, e.g., by finding at which depths the
unimodal layers should be fused.
•Late fusion is often defined by the combination of the final scores of each unimodal branch.
Deep multiple instance learning
• Multiple instance learning (MIL) is a variation of supervised learning where a single class label
is assigned to a bag of instances. In this, we state the MIL problem as learning the Bernoulli
distribution of the bag label where the bag label probability is fully parameterized by neural
networks.
• Multiple instance learning can be used to learn the properties of the sub-images that
characterize the target scene. From there on, these frameworks have been applied to a wide
spectrum of applications, ranging from image concept learning and text categorization to stock
market prediction.
• It is a form of weakly supervised learning.
• Training instances are arranged in sets, called bags.
• A label is provided for entire bags rather than for the individual instances contained in them.
Thus, in MIL, we aim to learn a concept given labels for bags of instances.
• MIL is a variation of supervised learning that is more suitable to pathology applications. The technique
involves assigning a single class label to a collection of inputs — in this context, referred to as a bag of
instances.
• While it is assumed that labels exist for each instance within a bag, there is no access to those labels and they
remain unknown during training. A bag is typically labeled as negative if all instances in the bag are negative,
or positive if there is at least one positive instance (known as the standard MIL assumption).
• A simple example is shown in the figure in which we only know whether a keychain contains the key that
can open a given door. This allows us to infer that the green key can open the door.
There are various assumptions upon which we can base our MIL model, but here we use the standard
MIL assumption:
a bag may be labeled negative if all the instances in the bag are negative, or positive if there is at
least one positive instance. This formulation naturally fits various problems in computer vision and
document classification.
For example, we might have access to medical images for which only overall patient diagnoses are
available instead of costly local annotations provided by an expert.
Definition of the standard MIL assumption
• Training instances are arranged in sets generally called bags.
• A label is given to bags but not to individual instances.
• Negative bags do not contain positive instances.
• Positive bags may contain negative and positive instances.
• Positive bags contain at least one positive instance
Relaxed MIL assumptions
• In many applications, the standard MIL assumption is too restrictive. MIL can alternatively be formulated as:
• A bag is positive when it contains a sufficient number of positive instances.
• A bag is positive when it contains a certain combination of positive instances.
• Positive and negative bags differ by their instance distributions.
Example of relaxed MIL assumptions
•Both sand and water segments are positive instances for beach pictures.
•However, a picture of a beach must contain both sand and water segments; otherwise, it could be a picture
of a desert or of the sea.
Tasks that can be performed in MIL:
There are two main MIL approaches:
1. Instance-based: the function f classifies each instance individually, and MIL
pooling combines the instance labels to assign the bag to a class (g is the identity
function). However, since individual labels are not known, it is possible that the
instance-level classifier is not trained sufficiently, thereby introducing
error into the final prediction (a pooling sketch follows this list).
2. Embedding-based: instead of classifying the instances individually, the
function f maps instances to a low-dimensional embedding. MIL pooling is then
used to obtain a bag representation that is independent of the number of instances in
the bag. g then classifies these bag representations to provide ϴ(X). A downside of
this method is that it lacks interpretability.
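The sketch below illustrates, in NumPy, both the standard MIL bag-labeling rule and max pooling over instance scores as the simplest instance-based pooling; the instance scores are illustrative toy values.

```python
# A minimal sketch of the standard MIL assumption and instance-based pooling
# in NumPy. Max pooling implements "positive if at least one instance is positive".
import numpy as np

def bag_label(instance_labels):
    """Standard MIL assumption: a bag is positive iff any instance is positive."""
    return int(np.any(instance_labels))

# Instance-based approach: classify instances, then pool the scores.
instance_scores = np.array([0.1, 0.05, 0.92, 0.3])   # f applied to each instance
bag_score = instance_scores.max()                    # MIL max pooling
print(bag_label(np.array([0, 0, 1, 0])), bag_score)  # -> 1 0.92
```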
