Data Science Interview Questions #Week3
INTERVIEW
PREPARATION
(30 Days of Interview
Preparation)
# DAY 15
Q1. What is Autoencoder?
Answer:
An autoencoder is an unsupervised machine learning algorithm that applies backpropagation with the target values set equal to the inputs: it is trained to copy its input to its output. Internally, it has a hidden layer that describes a code used to represent the input. In other words, it learns an approximation to the identity function, producing an output x̂ that is similar to the input x.
Autoencoders belong to the neural network family, but they are also closely related to PCA (principal component analysis). Although an autoencoder is conceptually similar to PCA, it is much more flexible: an autoencoder can learn both linear and non-linear transformations in its encoding, whereas PCA can only perform a linear transformation. Autoencoders can also be stacked to form deep networks because of their layered representation.
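A minimal sketch of a fully connected autoencoder in PyTorch. The 784-dimensional input (e.g., flattened MNIST images), layer sizes, and single training step are illustrative assumptions, not details from the text above.

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim=784, code_dim=32):
        super().__init__()
        # Encoder compresses the input into a low-dimensional code.
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 128), nn.ReLU(),
            nn.Linear(128, code_dim),
        )
        # Decoder reconstructs the input from the code.
        self.decoder = nn.Sequential(
            nn.Linear(code_dim, 128), nn.ReLU(),
            nn.Linear(128, input_dim), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = Autoencoder()
criterion = nn.MSELoss()                      # reconstruction loss: output vs. input
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.rand(64, 784)                       # a dummy batch standing in for real data
x_hat = model(x)
loss = criterion(x_hat, x)                    # the target is the input itself
loss.backward()
optimizer.step()
```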
Types of Autoencoders:
1. Denoising autoencoder
Autoencoders are neural networks used for feature selection and extraction. However, when there are more nodes in the hidden layer than there are inputs, the network risks learning the so-called "identity function" (also called the "null function"), meaning that the output simply equals the input, which makes the autoencoder useless.
Denoising autoencoders solve this problem by corrupting the input on purpose, randomly setting some of the input values to zero. In general, about 50% of the input nodes are set to zero; other sources suggest a lower fraction, such as 30%. The right amount depends on how much data and how many input nodes you have.
2. Sparse autoencoder
An autoencoder takes an input image or vector and learns a code dictionary that changes the raw input from one representation to another. A sparse autoencoder adds a sparsity enforcer that directs a single-layer network to learn a code dictionary which minimizes the error in reproducing the input while restricting the number of code words used for reconstruction.
The sparse autoencoder consists of a single hidden layer, which is connected to the input vector by a weight matrix forming the encoding step. The hidden layer then outputs a reconstruction vector, using a tied weight matrix to form the decoder.
Lexical or Word Level Similarity
When people talk about text similarity, they usually mean how similar two pieces of text are at the surface level. For example, how similar are the phrases "the cat ate the mouse" and "the mouse ate the cat food" if we look only at the words? On the surface, considering only word-level similarity, these two phrases (with determiners disregarded) appear very similar, since 3 of the 4 unique words are an exact overlap.
Semantic Similarity:
Another notion of similarity, mostly explored by the NLP research community, is how similar in meaning any two phrases are. Looking at the phrases "the cat ate the mouse" and "the mouse ate the cat food", we know that while the words overlap significantly, the two phrases have different meanings. Extracting meaning from phrases is usually the harder task, as it requires a deeper level of analysis. For example, we can look at a simple aspect like word order: "cat ==> ate ==> mouse" versus "mouse ==> ate ==> cat food". The words overlap, but the order of occurrence differs, and from that alone we can tell that the two phrases have different meanings. This is just one example; most approaches use syntactic parsing to help with semantic similarity. Have a look at the parse trees for these two phrases. What can you get from them?
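A small illustration of surface-level (lexical) similarity, using a Jaccard measure over lowercased tokens with determiners removed. The measure and the stop-word list are my own choices for the sketch, not something prescribed by the text.

```python
def jaccard_similarity(text_a, text_b, stop_words=frozenset({"the", "a", "an"})):
    """Word-level similarity: shared unique tokens / all unique tokens."""
    tokens_a = {w for w in text_a.lower().split() if w not in stop_words}
    tokens_b = {w for w in text_b.lower().split() if w not in stop_words}
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

print(jaccard_similarity("the cat ate the mouse",
                         "the mouse ate the cat food"))  # 0.75 -> lexically very similar
```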
Q3. What is dropout in neural networks?
Answer:
When we train a neural network (or any model) by updating each of its weights, it can become too dependent on the dataset we are using. When such a model then has to make a prediction or classification, it will not give satisfactory results. This is known as over-fitting. A real-world analogy: if a science student learns only one chapter of a book and then takes a test on the whole syllabus, he will probably fail.
To overcome this problem, we use a technique that was introduced by Geoffrey Hinton in 2012. This
technique is known as dropout.
Dropout refers to ignoring a randomly chosen set of units (i.e., neurons) during training. By "ignoring", we mean that these units are not considered during a particular forward or backward pass.
At each training stage, individual nodes are either dropped out of the net with probability 1-p or kept
with probability p, so that a reduced network is left; incoming and outgoing edges to a dropped-out
node are also removed.
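A minimal sketch of dropout in PyTorch; the layer sizes and the drop probability p = 0.5 are illustrative.

```python
import torch
import torch.nn as nn

net = nn.Sequential(
    nn.Linear(20, 64),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # each hidden unit is zeroed with probability 0.5 during training
    nn.Linear(64, 2),
)

x = torch.randn(8, 20)
net.train()              # dropout active: a different random subset of units is dropped per pass
out_train = net(x)
net.eval()               # dropout disabled at inference time
out_eval = net(x)
```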
Q4. What is Forward Propagation?
Answer:
Input x provides the initial information, which then propagates through the hidden units at each layer and finally produces the output y. The architecture of the network entails determining its depth, width, and the activation functions used in each layer. Depth is the number of hidden layers. Width is the number of units (nodes) in each hidden layer, since we control neither the input-layer nor the output-layer dimensions. There are quite a few activation functions, such as the Rectified Linear Unit (ReLU), sigmoid, and hyperbolic tangent. Research suggests that deeper networks often outperform shallower networks that simply have more hidden units per layer, although deeper networks can also be harder to train.
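A bare-bones forward pass in NumPy for a network with one hidden layer; the dimensions, the ReLU/sigmoid choices, and the random weights are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Layer dimensions: 4 inputs -> 8 hidden units -> 1 output
W1, b1 = rng.normal(size=(4, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 1)), np.zeros(1)

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x):
    h = relu(x @ W1 + b1)         # hidden-layer activation
    y_hat = sigmoid(h @ W2 + b2)  # output-layer activation
    return y_hat

x = rng.normal(size=(3, 4))       # a batch of 3 examples
print(forward(x))                 # shape (3, 1)
```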
Q6. What is Information Extraction?
Answer:
Information extraction (IE) is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents. In most cases, this activity concerns processing human-language texts using natural language processing (NLP).
Information extraction depends on named entity recognition (NER), a sub-tool used to find targeted information to extract. NER first recognizes entities as belonging to one of several categories, such as location (LOC), person (PER), or organization (ORG). Once the information category is recognized, an information extraction utility extracts the named entity's related information and constructs a machine-readable document from it, which algorithms can process further to extract meaning. IE finds meaning by way of other subtasks as well, including co-reference resolution, relationship extraction, language and vocabulary analysis, and sometimes audio extraction.
Q7. What is Text Generation?
Answer:
Text generation is a type of language modelling problem. Language modelling is the core problem for several natural language processing tasks such as speech-to-text, conversational systems, and text summarization. A trained language model learns the likelihood of occurrence of a word based on the previous sequence of words in the text. Language models can operate at the character level, n-gram level, sentence level, or even paragraph level.
A language model is at the core of many NLP tasks, and is simply a probability distribution over a sequence of words, P(w_1, ..., w_m). It can also be used to estimate the conditional probability of the next word in a sequence, P(w_m | w_1, ..., w_{m-1}).
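A toy word-level text generator built from bigram counts; the corpus and the sampling scheme are illustrative (real systems use neural language models).

```python
import random
from collections import defaultdict

corpus = "the cat ate the mouse and the mouse ran away from the cat".split()

# Count bigram transitions: word -> list of observed next words
transitions = defaultdict(list)
for current_word, next_word in zip(corpus, corpus[1:]):
    transitions[current_word].append(next_word)

def generate(start="the", length=8, seed=0):
    random.seed(seed)
    words = [start]
    for _ in range(length - 1):
        candidates = transitions.get(words[-1])
        if not candidates:                        # dead end: no observed continuation
            break
        words.append(random.choice(candidates))   # sample proportionally to bigram counts
    return " ".join(words)

print(generate())
```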
Q8. What is Text Summarization?
Answer:
We all interact with applications that use text summarization. Many of these applications are platforms that publish daily articles on news, entertainment, and sports. With our busy schedules, we like to read a summary of an article before deciding whether to read the whole thing. Reading a summary helps us identify whether it falls in our area of interest and gives a brief context of the story.
Text summarization is a subdomain of Natural Language Processing (NLP) that deals with extracting summaries from huge chunks of text. There are two main families of techniques used for text summarization: classical NLP-based techniques and deep learning-based techniques.
Text summarization refers to the technique of shortening long pieces of text. The intention is to create a coherent and fluent summary that outlines only the main points of the document.
How text summarization works:
There are two types of summarization: abstractive and extractive.
1. Abstractive summarization: It selects words based on semantic understanding, including words that did not appear in the source documents. It aims at reproducing the important material in a new way, interpreting and examining the text with advanced natural language techniques to generate new, shorter text that conveys the most critical information from the original. It can be compared to the way a human reads a text article or blog post and then summarizes it in their own words.
2. Extractive summarization: This approach weights the most important parts of the sentences and uses them, unchanged, to form the summary. Different algorithms and techniques are used to define weights for the sentences and rank them based on importance and similarity among each other.
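A tiny extractive-summarization sketch that scores sentences by word frequency and keeps the top ones; the scoring scheme is the simplest possible choice, purely for illustration.

```python
import re
from collections import Counter

def summarize(text, num_sentences=2):
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    words = re.findall(r"[a-z']+", text.lower())
    freq = Counter(words)                      # word-frequency "importance" weights

    def score(sentence):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[t] for t in tokens) / max(len(tokens), 1)

    ranked = sorted(sentences, key=score, reverse=True)[:num_sentences]
    # Keep the selected sentences in their original order for readability.
    return " ".join(s for s in sentences if s in ranked)

doc = ("Text summarization shortens long documents. It keeps only the main points. "
       "Extractive methods select existing sentences. Abstractive methods write new sentences.")
print(summarize(doc))
```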
Q9. What is Topic Modelling?
Answer:
Topic Modelling is the task of using unsupervised learning to extract the main topics (represented as
a set of words) that occur in a collection of documents.
Topic modeling, in the context of Natural Language Processing, is described as a method of
uncovering hidden structure in a collection of texts.
Dimensionality Reduction:
Topic modeling is a form of dimensionality reduction. Rather than representing a text T in its word-feature space as {Word_i: count(Word_i, T) for Word_i in V}, we can represent it in its topic space as {Topic_i: weight(Topic_i, T) for Topic_i in Topics}.
Unsupervised learning:
Topic modeling can be compared to clustering. As in clustering, the number of topics, like the number of clusters, is a hyperparameter. In topic modeling, we build clusters of words rather than clusters of texts. A text is thus a mixture of all the topics, each with a certain weight.
A Form of Tagging:
If document classification is assigning a single category to a text, topic modeling is assigning multiple tags to a text. A human expert can label the resulting topics with human-readable names and use different heuristics to convert the weighted topics into a set of tags.
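A hedged sketch of topic modeling with scikit-learn's LatentDirichletAllocation; the toy corpus and the choice of 2 topics are illustrative (get_feature_names_out requires scikit-learn 1.0 or newer).

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the stock market fell as investors sold shares",
    "investors watch the stock price and the market",
    "the team won the football match last night",
    "fans cheered as the football team scored",
]

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)                 # document-term count matrix

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)                  # each row: topic weights for a document

terms = vectorizer.get_feature_names_out()
for topic_idx, weights in enumerate(lda.components_):
    top_terms = [terms[i] for i in weights.argsort()[-4:]]
    print(f"Topic {topic_idx}: {top_terms}")
print(doc_topics.round(2))                         # documents as mixtures of topics
```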
Q10. What are Hidden Markov Models?
Answer:
Hidden Markov Models (HMMs) are a class of probabilistic graphical models that allow us to predict a sequence of unknown (hidden) variables from a set of observed variables. A simple example of an HMM is predicting the weather (the hidden variable) based on the type of clothes that someone wears (the observation). An HMM can be viewed as a Bayes net unrolled through time, with observations made at a sequence of time steps being used to predict the best sequence of hidden states.
The diagram below, from Wikipedia, shows an HMM and its transitions. The scenario is a room that contains urns X1, X2, and X3, each of which holds a known mix of balls, with each ball labeled y1, y2, y3, or y4. A sequence of four balls is drawn at random. The user observes the sequence of balls y1, y2, y3, and y4 and attempts to discern the hidden state: the sequence of urns (of the three types) that these four balls were drawn from.
To make this point clear, let us consider the scenario below where the weather, the hidden variable,
can be hot, mild or cold, and the observed variables are the type of clothing worn. The arrows
represent transitions from a hidden state to another hidden state or from a hidden state to an observed
variable.
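A compact Viterbi decoder for the weather example above, with made-up transition and emission probabilities; the numbers are invented purely to make the sketch runnable.

```python
# Hidden states and observations for the toy weather/clothing HMM.
states = ["Hot", "Cold"]
start_p = {"Hot": 0.6, "Cold": 0.4}                                  # assumed priors
trans_p = {"Hot": {"Hot": 0.7, "Cold": 0.3},
           "Cold": {"Hot": 0.4, "Cold": 0.6}}                        # assumed transitions
emit_p = {"Hot": {"Tshirt": 0.6, "Coat": 0.1, "Hoodie": 0.3},
          "Cold": {"Tshirt": 0.1, "Coat": 0.6, "Hoodie": 0.3}}       # assumed emissions

def viterbi(observations):
    """Return the most likely hidden-state sequence for the observations."""
    # trellis[t][s] = (best probability of reaching state s at time t, best predecessor)
    trellis = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            prob, prev = max(
                (trellis[-1][p][0] * trans_p[p][s] * emit_p[s][obs], p) for p in states
            )
            column[s] = (prob, prev)
        trellis.append(column)
    # Backtrack from the best final state.
    best_state = max(trellis[-1], key=lambda s: trellis[-1][s][0])
    path = [best_state]
    for column in reversed(trellis[1:]):
        path.append(column[path[-1]][1])
    return list(reversed(path))

print(viterbi(["Coat", "Coat", "Tshirt"]))  # ['Cold', 'Cold', 'Hot']
```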
-----------------------------------------------------------------------------------------------------------------
# Day-16
Q1. What is Statistical Learning?
Answer:
Statistical learning is a framework for understanding data based on statistics, and it can be classified as supervised or unsupervised. Supervised statistical learning involves building a statistical model for predicting, or estimating, an output based on one or more inputs, while in unsupervised statistical learning there are inputs but no supervising output; nevertheless, we can learn relationships and structure from such data.
Prediction Accuracy and Model Interpretability:
Of the many methods that we use for statistical learning, some are less flexible and more restrictive. When inference is the goal, there are clear advantages to using simple and relatively inflexible statistical learning methods. When we are only interested in prediction, we use the most flexible models available.
Q2. What is ANOVA?
Answer:
ANOVA stands for "Analysis of Variance" and is an extremely important tool for the analysis of data (both one-way and two-way ANOVA are used). It is a statistical method for comparing the population means of two or more groups by analyzing variance: the means are judged significantly different only when the variation between groups is large relative to the variation within groups.
An ANOVA test is a way to find out whether survey or experiment results are significant. In other words, it helps us figure out whether we need to reject the null hypothesis or accept the alternative hypothesis. We are testing groups to see if there is a difference between them. Examples of when we might want to test different groups:
A group of psychiatric patients is trying three different therapies: counseling, medication, and biofeedback. We want to see if one therapy is better than the others.
A manufacturer has two different processes to make light bulbs and wants to know which one is better.
Students from different colleges take the same exam. We want to see if one college outperforms the others.
Types of ANOVA:
One-way ANOVA
Two-way ANOVA
One-way ANOVA is a hypothesis test in which only one categorical variable, or single factor, is taken into consideration. With the help of the F-distribution, it enables us to compare the means of three or more samples. The null hypothesis (H0) is that all population means are equal, while the alternative hypothesis is that at least one mean differs.
Two-way ANOVA examines the effect of two independent factors on a dependent variable. It also studies the inter-relationship, if any, between the independent variables influencing the values of the dependent variable.
When comparing two or more continuous response variables by a single factor, a one-way MANOVA is appropriate (e.g., comparing 'test score' and 'annual income' together by 'level of education'). A two-way MANOVA also involves two or more continuous response variables, but compares them by at least two factors (e.g., comparing 'test score' and 'annual income' together by both 'level of education' and 'zodiac sign').
So the main difference is that, in the classifier approach, the algorithm predicts the class that occurs most frequently among the nearest neighbours, while in the regression approach the response is the average value of the nearest neighbours.
z-test: It is a statistical test used to determine whether two population means are different when the variances are known and the sample size is large. The test statistic is assumed to follow a normal distribution, and nuisance parameters such as the standard deviation should be known for an accurate z-test to be performed.
Another way to put it: a z-test is a type of hypothesis test. Hypothesis testing is just a way to figure out whether results from a test are valid or repeatable. For example, if someone said they had found a new drug that cures cancer, you would want to be sure it was probably true. A hypothesis test will tell you whether it is probably true or probably not true. A z-test is used when the data are approximately normally distributed.
How z-tests work:
Tests that can be conducted as z-tests include the one-sample location test, the two-sample location test, the paired difference test, and the maximum likelihood estimate. Z-tests are closely related to t-tests, but t-tests are best performed when an experiment has a small sample size. Also, t-tests assume the standard deviation is unknown, while z-tests assume it is known. If the standard deviation of the population is unknown but the sample is large, the sample variance is assumed to equal the population variance.
When can we run a z-test?
Different types of tests are used in statistics (e.g., the F-test, the chi-square test, the t-test). You would use a z-test if your sample size is large (typically greater than 30), the data points are independent of each other, the data are approximately normally distributed, and the population standard deviation is known.
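A minimal one-sample z-test in Python; the sample values, hypothesized mean, and "known" population standard deviation are made up for illustration.

```python
import numpy as np
from scipy import stats

sample = np.array([5.1, 4.9, 5.3, 5.2, 5.0, 5.4, 5.1, 5.2])  # toy data
mu0 = 5.0          # hypothesized population mean
sigma = 0.2        # population standard deviation, assumed known for a z-test

z = (sample.mean() - mu0) / (sigma / np.sqrt(len(sample)))
p_value = 2 * (1 - stats.norm.cdf(abs(z)))   # two-sided p-value

print(f"z = {z:.3f}, p = {p_value:.3f}")     # reject H0 if p is below the chosen alpha
```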
Chi-square (χ2) statistic: It is a test that measures how expectations compare to actual observed data (or model results). The data used in calculating a chi-square statistic must be random, raw, mutually exclusive, drawn from independent variables, and drawn from a large enough sample. For example, the results of tossing a coin 100 times meet these criteria.
The chi-square test is intended to test how likely it is that an observed distribution is due to chance. It is also called the "goodness of fit" statistic because it measures how well the observed distribution of the data fits the distribution that is expected if the variables are independent.
The chi-square test is designed to analyze categorical data, that is, data that have been counted and divided into categories. It will not work with parametric or continuous data (such as height in inches). For example, if you want to test whether attending class influences how students perform on an exam, using test scores (from 0 to 100) as data would not be appropriate for a chi-square test; arranging students into the categories "Pass" and "Fail" would be. Additionally, the data in a chi-square grid should not be in the form of percentages, or anything other than frequency (count) data.
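A small chi-square test of independence on a hypothetical attendance-versus-result contingency table, using SciPy; the counts are made up.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows: attended class / skipped class; columns: Pass / Fail (made-up counts)
observed = np.array([[45, 15],
                     [20, 30]])

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}, dof = {dof}")
print(expected)   # counts expected if attendance and results were independent
```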
Q10. What are correlation and covariance in statistics?
Answer:
Covariance and correlation are two mathematical concepts widely used in statistics. Both establish a relationship and measure the dependency between two random variables; they do similar work, but in mathematical terms they are different from each other.
Correlation: It is a statistical technique that can show whether, and how strongly, pairs of variables are related. For example, height and weight are related: taller people tend to be heavier than shorter people. The relationship isn't perfect. People of the same height vary in weight, and you can easily think of two people you know where the shorter one is heavier than the taller one. Nonetheless, the average weight of people 5'5" is less than the average weight of people 5'6", and their average weight is less than that of people 5'7", and so on. Correlation can tell you just how much of the variation in people's weights is related to their heights.
Covariance: It measures the directional relationship between the returns on two assets. A positive covariance means the asset returns move together, while a negative covariance means they move inversely. Covariance is calculated by analyzing at-return surprises (standard deviations from the expected return) or by multiplying the correlation between the two variables by the standard deviation of each variable.
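Computing covariance and correlation for two toy variables with NumPy; the data are invented for illustration.

```python
import numpy as np

height_cm = np.array([160, 165, 170, 175, 180, 185])
weight_kg = np.array([55, 60, 63, 70, 72, 80])

cov_matrix = np.cov(height_cm, weight_kg)        # 2x2 covariance matrix
corr_matrix = np.corrcoef(height_cm, weight_kg)  # 2x2 correlation matrix

print("covariance :", cov_matrix[0, 1])
print("correlation:", corr_matrix[0, 1])         # close to 1 -> strong positive relationship
```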
------------------------------------------------------------------------------------------------------------------------
# DAY 17
Q1. What is ERM (Empirical Risk Minimization)?
Answer:
Empirical risk minimization (ERM) is a principle in statistical learning theory that defines a family of learning algorithms and is used to give theoretical bounds on their performance. The idea is that we don't know exactly how well an algorithm will work in practice (its true "risk") because we don't know the true distribution of the data the algorithm will work on; as an alternative, we can measure its performance on a known set of training data.
We assume that our samples come from this distribution and use our dataset as an approximation. If we compute the loss using the data points in our dataset, it is called the empirical risk. It is "empirical" and not "true" because we are using a dataset that is only a subset of the whole population.
When we build our learning model, we have to pick a function that minimizes the empirical risk, i.e., the gap between the predicted output and the actual output for the data points in the dataset. The process of finding such a function is called empirical risk minimization (ERM). Ideally, we would like to minimize the true risk, but we don't have the information to do that, so we hope that the empirical risk is almost the same as the true risk.
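A tiny sketch of what "empirical risk" means in code; the hypothesis, loss function, and data points are invented purely for illustration.

```python
import numpy as np

def empirical_risk(h, X, y, loss):
    """Average loss of hypothesis h over the observed dataset (the empirical risk)."""
    return np.mean([loss(h(x), target) for x, target in zip(X, y)])

squared_loss = lambda pred, target: (pred - target) ** 2
h = lambda x: 2.0 * x                          # a candidate hypothesis
X, y = np.array([1.0, 2.0, 3.0]), np.array([2.1, 3.9, 6.2])

print(empirical_risk(h, X, y, squared_loss))   # ERM would pick the h minimizing this value
```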
Let's get a better understanding with an example.
Suppose we want to build a model that can differentiate between males and females based on specific features. If we select 150 random people where the women happen to be very short and the men very tall, the model might incorrectly conclude that height is the differentiating feature. To build a truly accurate model, we would have to gather all the women and men in the world and extract the differentiating features. Unfortunately, that is not possible, so we select a small number of people and hope that this sample is representative of the whole population.
Q2. What is PAC (Probably Approximately Correct)?
Answer:
PAC: In computational learning theory, probably approximately correct (PAC) learning is a framework for the mathematical analysis of machine learning.
The learner receives samples and must pick a generalization function (called the hypothesis) from a specified class of possible functions. The goal is that, with high probability, the selected function will have low generalization error. The learner must be able to learn the concept given any arbitrary approximation ratio, probability of success, or distribution of the samples.
A hypothesis class is PAC (Probably Approximately Correct) learnable if there exist a function m_H and an algorithm such that, for any labeling function f, any distribution D over the domain of inputs X, and any delta and epsilon, given m ≥ m_H samples the algorithm produces a hypothesis h that, with probability at least 1 - delta, has a true error lower than epsilon. A labeling function is nothing other than a specific function f that labels the data in the domain.
Q3. What is ELMo?
Answer:
ELMo is a novel way to represent words as vectors (embeddings). These word embeddings helped achieve state-of-the-art (SOTA) results in several NLP tasks.
It is a deep contextualized word representation that models both complex characteristics of word use (e.g., syntax and semantics) and how these uses vary across linguistic contexts. The word vectors are learned functions of the internal states of a deep bidirectional language model (biLM), which is pre-trained on a large text corpus. They can easily be added to existing models and significantly improve the state of the art across a broad range of challenging NLP problems, including question answering, textual entailment, and sentiment analysis.
Q4. What is Pragmatic Analysis in NLP?
Answer:
Pragmatic analysis (PA) deals with outside-world knowledge, i.e., knowledge that is external to the documents and queries. It focuses on reinterpreting what was described as what was actually meant, deriving the various aspects of language that require real-world knowledge.
It deals with the overall communicative and social content and its effect on interpretation. It means abstracting the meaningful use of language in situations. In this analysis, the main focus is always on reinterpreting what was said as what was intended.
It helps users discover this intended effect by applying a set of rules that characterize cooperative dialogues.
E.g., "close the window?" should be interpreted as a request rather than an order.
Q6. What is ULMFit?
Answer:
Transfer learning in NLP (Natural Language Processing) was an area that had not been explored with great success until, in May 2018, Jeremy Howard and Sebastian Ruder came out with the paper Universal Language Model Fine-tuning for Text Classification (ULMFiT), which explores the benefits of using a pre-trained language model for text classification. It proposes ULMFiT, a transfer learning method that can be applied to any task in NLP; the method outperformed the state of the art on six text classification tasks.
ULMFiT uses a regular LSTM-based, state-of-the-art language model architecture (AWD-LSTM). The LSTM network has three layers. A single architecture is used throughout, for pre-training as well as for fine-tuning.
ULMFiT achieves state-of-the-art results using novel techniques such as:
Discriminative fine-tuning
Slanted triangular learning rates
Gradual unfreezing
Discriminative Fine-Tuning
Different layers of a neural network capture different types of information, so they should be fine-tuned to different extents. Instead of using the same learning rate for all layers of the model, discriminative fine-tuning allows us to tune each layer with a different learning rate.
Slanted triangular learning rates
The model should quickly converge to a suitable region of the parameter space at the beginning of training and then refine its parameters later. Using a constant learning rate throughout training is not the best way to achieve this behaviour. Instead, Slanted Triangular Learning Rates (STLR) linearly increase the learning rate at first and then linearly decay it.
Gradual Unfreezing
Gradual unfreezing is the practice of unfreezing the layers gradually, which avoids catastrophic loss of the knowledge possessed by the model. It first unfreezes the top layer and fine-tunes all the unfrozen layers for one epoch. It then unfreezes the next lower frozen layer and repeats, until all the layers have been fine-tuned and the model converges at the last iteration.
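A hedged sketch of discriminative fine-tuning using PyTorch parameter groups; the model, layer split, and learning-rate values are illustrative, not the paper's exact settings.

```python
import torch
import torch.nn as nn

class TinyClassifier(nn.Module):
    def __init__(self, vocab=10_000, emb=128, hidden=256, classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab, emb)           # lower layer
        self.lstm = nn.LSTM(emb, hidden, batch_first=True)  # middle layer
        self.head = nn.Linear(hidden, classes)              # task-specific top layer

    def forward(self, tokens):
        x = self.embedding(tokens)
        output, _ = self.lstm(x)
        return self.head(output[:, -1])                     # classify from the last time step

model = TinyClassifier()

# Discriminative fine-tuning: smaller learning rates for lower layers,
# larger ones for the layers closest to the task.
optimizer = torch.optim.Adam([
    {"params": model.embedding.parameters(), "lr": 1e-5},
    {"params": model.lstm.parameters(),      "lr": 1e-4},
    {"params": model.head.parameters(),      "lr": 1e-3},
])
```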
How does it work?
Traditional context-free models (like word2vec or GloVe) generate a single word-embedding representation for each word in the vocabulary, which means the word "right" would have the same context-free representation in "I'm sure I'm right" and "Take a right turn." BERT, however, represents each word based on both its previous and next context, making it bidirectional. While the concept of bidirectionality had been around for a long time, BERT was the first of its kind to successfully pre-train deep bidirectional representations in a neural network.
Q8.What is XLNet?
Answer:
XLNet is a BERT-like model rather than a totally different one, but it is an auspicious and promising one. In one sentence, XLNet is a generalized autoregressive pretraining method.
An autoregressive (AR) language model is a kind of model that uses the context words to predict the next word, where the context is constrained to a single direction, either forward or backward. The advantage of AR language models is that they are good at generative NLP tasks: because generation usually proceeds in the forward direction, AR language models naturally work well on such tasks.
But autoregressive language models have a disadvantage: they can only use the forward context or the backward context, not both at the same time.
"The Transformers" is a Japanese band. That band was formed in 1968, during the height of Japanese music history.
In the above example, the words "that band" in the second sentence refer to the band "The Transformers" introduced in the first sentence. When you read about the band in the second sentence, you know that it is referring to "The Transformers". That may be important for translation. To translate sentences like these, a model needs to figure out such dependencies and connections. Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs) have been used to deal with this problem because of their properties.
-------------------------------------------------------------------------------------------------------------
# Day-18
Q1. What is Levenshtein Algorithm?
Answer:
Levenshtein distance is a string metric for measuring the difference between two sequences. The
Levenshtein distance between two words is the minimum number of single-character edits (i.e.
insertions, deletions or substitutions) required to change one word into the other.
Mathematically, the Levenshtein distance between two strings a and b (of length |a| and |b| respectively) is given by lev_{a,b}(|a|, |b|), where

lev_{a,b}(i, j) = max(i, j)                                          if min(i, j) = 0,
lev_{a,b}(i, j) = min( lev_{a,b}(i-1, j) + 1,
                       lev_{a,b}(i, j-1) + 1,
                       lev_{a,b}(i-1, j-1) + 1_(a_i ≠ b_j) )          otherwise.

Here 1_(a_i ≠ b_j) is the indicator function, equal to 0 when a_i = b_j and equal to 1 otherwise, and lev_{a,b}(i, j) is the distance between the first i characters of a and the first j characters of b.
Example:
The Levenshtein distance between "HONDA" and "HYUNDAI" is 3, since the following three edits change one into the other, and there is no way to do it with fewer than three edits: substitute O with Y (HONDA → HYNDA), insert U (HYNDA → HYUNDA), and insert I (HYUNDA → HYUNDAI).
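A direct dynamic-programming implementation of the Levenshtein distance in plain Python, for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    # dp[i][j] = distance between the first i characters of a and the first j of b
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dp[i][0] = i                       # delete all i characters of a
    for j in range(len(b) + 1):
        dp[0][j] = j                       # insert all j characters of b
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            substitution_cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + substitution_cost)
    return dp[len(a)][len(b)]

print(levenshtein("HONDA", "HYUNDAI"))  # 3
```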
Q2. What is Soundex?
Answer:
Soundex attempts to find similar names or homophones using phonetic notation. The algorithm retains letters according to detailed rules in order to match names that sound alike, for purposes of large-volume record searching.
The Soundex phonetic algorithm indexes strings according to their English pronunciation. It is used to group homophones: words that are pronounced the same but spelt differently.
For example, we can take a handful of names, encode each one with the Soundex algorithm, and see that names which sound alike receive the same code; a small sketch follows below.
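A simplified Soundex encoder in plain Python. This is a sketch of the standard American Soundex recipe, not a reference implementation, and it skips some edge cases (e.g., name prefixes).

```python
def soundex(name: str) -> str:
    """Simplified American Soundex: first letter plus three digits."""
    codes = {**dict.fromkeys("BFPV", "1"), **dict.fromkeys("CGJKQSXZ", "2"),
             **dict.fromkeys("DT", "3"), "L": "4",
             **dict.fromkeys("MN", "5"), "R": "6"}
    name = name.upper()
    encoded = name[0]
    previous = codes.get(name[0], "")
    for ch in name[1:]:
        digit = codes.get(ch, "")              # vowels, H, W, Y map to no digit
        if digit and digit != previous:        # drop consecutive duplicate codes
            encoded += digit
        previous = previous if ch in "HW" else digit   # H/W do not separate duplicates
        if len(encoded) == 4:
            break
    return encoded.ljust(4, "0")               # pad with zeros to four characters

for word in ["Robert", "Rupert", "Ashcraft", "Tymczak"]:
    print(word, "->", soundex(word))           # Robert and Rupert both map to R163
```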
Such approaches convert a parse tree into a sequence by following a depth-first traversal, so that sequence-to-sequence models can be applied to it. The linearized version of the parse tree above looks as follows: (S (N) (VP V N)).
Given a document d, a topic z is present in that document with probability P(z|d).
Given a topic z, a word w is drawn from z with probability P(w|z).
The joint probability of seeing a given document and word together is therefore:
P(d, w) = P(d) Σ_z P(z|d) P(w|z)
In this model, P(d), P(z|d), and P(w|z) are the parameters. P(d) can be determined directly from the corpus. P(z|d) and P(w|z) are modelled as multinomial distributions and can be trained using the expectation-maximization (EM) algorithm.
In lda2vec, the document weight vector represents the "weights" of each topic in a document, and the topic matrix represents each topic together with its corresponding vector embedding. Together, a document vector and a word vector generate "context" vectors for each word in a document. lda2vec's power lies in the fact that it not only learns word embeddings for words; it simultaneously learns topic representations and document representations as well.
Q8. What is Expectation-Maximization Algorithm(EM)?
Answer:
The Expectation-Maximization algorithm, EM for short, is an approach for maximum likelihood estimation in the presence of latent variables.
It is an iterative approach that cycles between two modes. The first mode attempts to estimate the missing or latent variables and is called the estimation step, or E-step. The second mode attempts to optimize the parameters of the model to best explain the data and is called the maximization step, or M-step.
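EM is what scikit-learn's GaussianMixture uses under the hood to fit a mixture of Gaussians; a tiny example follows. The synthetic data and the choice of two components are illustrative.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two latent clusters; which cluster produced each point is the hidden variable.
data = np.concatenate([rng.normal(0.0, 1.0, 300),
                       rng.normal(6.0, 1.5, 300)]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=0)  # fitted via EM iterations
gmm.fit(data)

print("estimated means:", gmm.means_.ravel())
print("responsibilities of first 3 points:\n", gmm.predict_proba(data[:3]).round(3))
```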
------------------------------------------------------------------------------------------------------------------
# DAY 19
Q1. What is LSI(Latent Semantic Indexing)?
Answer:
Latent Semantic Indexing (LSI) is an indexing and retrieval method that uses a mathematical technique called SVD (singular value decomposition) to find patterns in the relationships between the terms and concepts contained in an unstructured collection of text. It is based on the principle that words used in the same contexts tend to have similar meanings.
For example, "Tiger" appearing together with "Woods" tends to refer to the golfer rather than the animal, and "Paris" appearing together with "Hilton" tends to refer to the celebrity rather than the city.
Example:
If you use LSI to index a collection of articles and the words "fan" and "regulator" appear together frequently enough, the search algorithm will notice that the two terms are semantically close. A search for "fan" will therefore return items containing that word, but also items that contain just the word "regulator". LSI does not understand what the words mean; by examining a sufficient number of documents, it only learns that the two terms are interrelated. It then uses that information to provide an expanded set of results with better recall than a plain keyword search.
The diagram below illustrates the difference between LSI and keyword searches; W stands for a document.
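A hedged LSI sketch using TF-IDF plus truncated SVD in scikit-learn; the corpus and the choice of 2 latent dimensions are illustrative.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the ceiling fan needs a new regulator",
    "this regulator controls the fan speed",
    "the football fans cheered loudly",
    "cheering fans filled the stadium",
]

tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(docs)                        # document-term matrix

svd = TruncatedSVD(n_components=2, random_state=0)   # the SVD step behind LSI
doc_concepts = svd.fit_transform(X)                  # documents in a 2-dimensional "concept" space

print(doc_concepts.round(2))                         # electrical docs cluster apart from sports docs
```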
Q2. What is Named Entity Recognition? And tell some use cases of
NER?
Answer:
Named-entity recognition (NER), also known as entity extraction or entity identification, is a subtask of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, and places, expressions of time, quantities, monetary values, percentages, and more.
In any text document, particular terms represent specific entities that are more informative and have a distinct context. These entities are called named entities, which more precisely refer to terms that represent real-world objects like people, places, organizations or institutions, and so on, and which are often denoted by proper names. A naive approach would be to find these by looking at the noun phrases in text documents. NER is also known as entity chunking/extraction, a popular technique used in information extraction to identify and segment named entities and categorize or classify them under various predefined classes.
Now, if we pass such a complaint (for example, one mentioning a Fitbit bought in Bangalore) through a named entity recognition model, it pulls out the entities Bangalore (location) and Fitbit (product). This can then be used to categorize the complaint and assign it to the relevant department within the organization that should be handling it.
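A quick NER example with spaCy. It assumes the en_core_web_sm model has been downloaded, and the sample sentence is made up.

```python
import spacy

# Requires: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

text = "My Fitbit stopped syncing after I moved to Bangalore last March."
doc = nlp(text)

for ent in doc.ents:
    print(ent.text, "->", ent.label_)   # e.g. Bangalore -> GPE, last March -> DATE
```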
Q4. What is the language model?
Answer:
Language modelling (LM) is one of the essential parts of modern NLP. There are many applications of language modelling, such as machine translation, spell correction, speech recognition, summarization, question answering, and sentiment analysis. Each of those tasks requires the use of a language model. The language model is needed to represent text in a form understandable from the machine's point of view.
A statistical language model is a probability distribution over sequences of words. Given such a sequence, say of length m, it assigns a probability to the whole sequence.
It provides context to distinguish between words and phrases that sound similar. For example, in American English, the phrases "wreck a nice beach" and "recognize speech" sound alike but mean different things.
Data sparsity is a significant problem in building language models: most possible word sequences are never observed in training. One solution is to assume that the probability of a word depends only on the previous n words. This is known as an n-gram model, or a unigram model when n = 1. The unigram model is also known as the bag-of-words model.
How does this Language Model help in NLP Tasks?
The probabilities returned by a language model are mostly useful for comparing the likelihood that different sentences are "good sentences." This is useful in many practical tasks, for example:
Spell checking: we observe a word that is not identified as a known word in a sentence. Using the edit distance algorithm, we find the closest known words to the unknown word; these are the candidate corrections. For example, we observe the word "wurd" in the context of the sentence "I like to write this wurd." The candidate corrections are ["word", "weird", "wind"]. How can we select, among these candidates, the most likely correction for the suspected error "wurd"?
Automatic speech recognition: we receive as input a string of phonemes; a first model predicts candidate words for sub-sequences of the stream of phonemes; the language model helps in ranking the most likely sequence of words compatible with the candidate words produced by the acoustic model.
Machine translation: each word from the source language is mapped to multiple candidate words in the target language; the language model of the target language can rank the most likely sequence of candidate target words.
Q5. What is Word Embedding?
Answer:
A word embedding is a learned representation of text in which words that have similar meanings have similar representations.
It is a form of word representation that bridges the human understanding of language and that of a machine. Word embeddings are distributed representations of text in an n-dimensional space, and they are essential for solving most NLP problems.
Another point worth considering is how we obtain word embeddings, as no two sets of word embeddings are the same. Word embeddings aren't random; they're developed by training a neural network. A well-known word embedding model comes from Google and is named Word2Vec; it is trained by predicting the words that appear next to other words in a language. For example, for the word "cat", the neural network will predict words like "kitten" and "feline". This intuition that related words come out "near" each other allows us to place them meaningfully in a vector space.
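A hedged Word2Vec example with gensim. The toy corpus is far too small for meaningful vectors; note also that the dimensionality parameter is called vector_size in gensim 4.x (it was size in older releases).

```python
from gensim.models import Word2Vec

sentences = [
    ["the", "cat", "chased", "the", "mouse"],
    ["the", "kitten", "chased", "a", "mouse"],
    ["dogs", "and", "cats", "are", "pets"],
]

# Tiny illustrative model: 50-dimensional vectors, context window of 2 words.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, seed=42)

print(model.wv["cat"][:5])                   # first few dimensions of the "cat" vector
print(model.wv.most_similar("cat", topn=3))  # nearest words (not meaningful on toy data)
```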
window over a word, because no internal structure of the word is taken into account; as long as the characters are within this window, the order of the n-grams doesn't matter.
fastText works well with rare words: even if a word wasn't seen during training, it can be broken down into character n-grams to get its embedding. Word2vec and GloVe both fail to provide any vector representation for words that are not in the model dictionary, which is a huge advantage of fastText over them.
GloVe aims to achieve two goals:
Gensim is an excellent library for processing text, working with word-vector models (such as FastText and Word2Vec), and building topic models. Another significant advantage of gensim is that it lets us handle large text files without having to load the entire file into memory.
In other words, it is an open-source library for unsupervised topic modeling and natural language processing, using modern statistical machine learning. Gensim is implemented in Python and Cython, and it is designed to handle extensive text collections using data streaming and incremental online algorithms, which differentiates it from most other machine learning packages that target only in-memory processing.
Q9. What is Encoder-Decoder Architecture?
Answer:
Encoder:
The encoder takes the input data and trains on it, then passes the final state of its recurrent layer as the initial state of the first recurrent layer of the decoder part.
Decoder:
The decoder takes the final state of the encoder's last recurrent layer and uses it as the initial state of its own first recurrent layer; the decoder's inputs are the target sequences we want to produce (for example, the French sentences in a translation task).
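A bare-bones encoder-decoder sketch in PyTorch using GRUs; the vocabulary sizes, dimensions, and teacher-forced decoder input are all illustrative.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab=1000, emb=64, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.rnn = nn.GRU(emb, hidden, batch_first=True)

    def forward(self, src):
        _, hidden = self.rnn(self.embed(src))
        return hidden                          # final state summarizes the source sequence

class Decoder(nn.Module):
    def __init__(self, vocab=1200, emb=64, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.rnn = nn.GRU(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, tgt, encoder_state):
        # The encoder's final state initializes the decoder's recurrent layer.
        outputs, _ = self.rnn(self.embed(tgt), encoder_state)
        return self.out(outputs)               # scores over the target vocabulary per step

encoder, decoder = Encoder(), Decoder()
src = torch.randint(0, 1000, (2, 7))           # batch of 2 source sentences, length 7
tgt = torch.randint(0, 1200, (2, 5))           # shifted target sentences (teacher forcing)
logits = decoder(tgt, encoder(src))
print(logits.shape)                            # torch.Size([2, 5, 1200])
```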
Q10. What is Context2Vec?
Answer:
Consider a sentence like "I can't find May." The word "May" may refer to a month's name or to a person's name. We use the words surrounding it (the context) to determine the best-suited option. This problem is known as the word sense disambiguation task, in which we investigate the actual sense of a word based on several semantic and linguistic techniques.
The Context2Vec idea is taken from the original CBOW Word2Vec model, but instead of relying on averaging the embeddings of the context words, it relies on a much more complex parametric model based on one layer of a bidirectional LSTM. Figure 1 shows the architecture of the CBOW model.
Figure1
Context2Vec applies the same concept of windowing, but instead of using a simple average function, it uses three stages to learn a complex parametric network:
A Bi-LSTM layer that processes left-to-right and right-to-left representations.
A feedforward network that takes the concatenated hidden representations and produces a hidden representation by learning the network parameters.
Finally, the objective function is applied to the network output; the Word2Vec negative-sampling idea is used to get better performance while calculating the loss value.
The following are some samples of the closest words to a given context.
# DAY 20
Q1. Do you have any idea about Event2Mind in NLP?
Answer:
Yes, it is based on an NLP research paper about common-sense inference from sentences: Event2Mind: Commonsense Inference on Events, Intents, and Reactions.
The study of "commonsense reasoning" in NLP deals with teaching computers how to acquire and use common-sense knowledge. NLP systems require common sense to adapt quickly and to understand humans the way we understand each other when talking in a natural environment.
This paper proposes a new task to teach systems commonsense reasoning: given an event described
in a short “event phrase” (e.g. “PersonX drinks coffee in the morning”), the researchers teach a system
to reason about the likely intents (“PersonX wants to stay awake”) and reactions (“PersonX feels
alert”) of the event’s participants.
Understanding a narrative requires common-sense reasoning about the mental states of people in relation to events. For example, if "Robert is dragging his feet at work," the pragmatic implication about Robert's intent is that "Robert wants to avoid doing things." You can also infer that Robert's emotional reaction might be feeling "bored" or "lazy." Furthermore, while it is not explicitly mentioned, you can assume that people other than Robert are affected by the situation, and that these people are likely to feel "impatient" or "frustrated."
This type of pragmatic inference can likely be useful for a wide range of NLP applications that require
accurate anticipation of people’s intents and emotional reactions, even when they are not expressly
mentioned. For example, an ideal dialogue system should react in empathetic ways by reasoning
about the human user’s mental state based on the events the user has experienced, without the user
explicitly stating how they are feeling. Furthermore, advertisement systems on social media should
be able to reason about the emotional reactions of people after events such as mass shootings and
remove ads for guns, which might increase social distress. Also, the pragmatic inference is a
necessary step toward automatic narrative understanding and generation. However, this type of
commonsense social reasoning goes far beyond the widely studied entailment tasks and thus falls
outside the scope of existing benchmarks.
This type of natural language inference (NLI) requires common-sense reasoning, substantially broadening the scope of prior work that focused primarily on linguistic entailment. Whereas the dominant entailment paradigm asks whether two natural language sentences (the 'premise' and the 'hypothesis') describe the same set of possible worlds, here we focus on whether a (multiple-choice) ending represents a possible (future) world that can arise from the situation described in the premise, even when it is not strictly entailed. Making such inferences necessitates a rich understanding of everyday physical situations, including object affordances and frame semantics.
Q3. What is the Pix2Pix network?
Answer:
The Pix2Pix network is a conditional GAN (cGAN) that learns a mapping from an input image to an output image.
Image-to-image translation is the process of translating one representation of an image into another representation.
Image-to-image translation is another example of a task that GANs (Generative Adversarial Networks) are ideally suited for; these are tasks for which it is nearly impossible to hand-code a loss function. Most studies on GANs are concerned with novel image synthesis, translating a random vector z into an image. Image-to-image translation converts one image into another, for example the edges of a bag into a photo of the bag. Another exciting example is shown below:
In the experiments, the authors report that they found the most success with the lambda parameter
equal to 100.
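For reference, the combined objective from the Pix2Pix paper pairs the conditional-GAN loss with an L1 reconstruction term weighted by lambda (the lambda = 100 mentioned above), written here in standard notation:

```latex
G^{*} = \arg\min_{G}\max_{D}\; \mathcal{L}_{cGAN}(G, D) + \lambda\, \mathcal{L}_{L1}(G),
\qquad
\mathcal{L}_{L1}(G) = \mathbb{E}_{x,y,z}\bigl[\,\lVert y - G(x, z)\rVert_{1}\,\bigr]
```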
How does it work?
The UNet architecture looks like a 'U', which justifies its name. It consists of three sections: the contraction, the bottleneck, and the expansion section. The contraction section is made of many contraction blocks. Each block takes an input and applies two 3x3 convolution layers followed by a 2x2 max pooling. The number of feature (kernel) maps doubles after each block so that the architecture can learn complex structures. The bottommost layer mediates between the contraction section and the expansion section; it uses two 3x3 convolution layers followed by a 2x2 up-convolution layer.
But the heart of this architecture lies in the expansion section. Similar to the contraction section, it also consists of several expansion blocks. Each block passes the input through two 3x3 convolution layers followed by a 2x2 upsampling layer. After each block, the number of feature maps used by the convolutional layers is halved to maintain symmetry. However, each time, the input is also appended with the feature maps of the corresponding contraction layer. This ensures that the features learned while contracting the image are used to reconstruct it. The number of expansion blocks is the same as the number of contraction blocks. After that, the resulting map passes through another 3x3 convolution layer with the number of feature maps equal to the number of segments desired.
Q5. What is pair2vec?
Answer:
This paper pre-trains word-pair representations by maximizing the pointwise mutual information of pairs of words with their context. This encourages the model to learn more meaningful representations of word pairs than more general objectives, such as language modelling, would. The pre-trained representations are useful in tasks like SQuAD and MultiNLI that require cross-sentence inference. We can expect to see more pretraining tasks that capture properties particularly suited to specific downstream tasks and are complementary to more general-purpose tasks like language modeling.
Reasoning about implied relationships between pairs of words is crucial for cross-sentence inference problems like question answering (QA) and natural language inference (NLI). In NLI, for example, given a premise such as "golf is prohibitively expensive," inferring that the hypothesis "golf is a cheap pastime" is a contradiction requires one to know that expensive and cheap are antonyms. Recent work has shown that current models, which rely heavily on unsupervised single-word embeddings, struggle to grasp such relationships. The pair2vec paper shows that they can be learned with word-pair vectors (pair2vec), which are trained, unsupervised, at a huge scale, and which significantly improve performance when added to existing cross-sentence attention mechanisms.
Unlike single-word representations, which are typically trained by modeling the co-occurrence of a target word x with its context c, these word-pair representations are learned by modeling the three-way co-occurrence between two words (x, y) and the context c that ties them together, as illustrated in the table above. While a similar training signal has been used to learn models for ontology construction and knowledge-base completion, this paper shows, for the first time, that large-scale learning of pairwise embeddings can be used to directly improve the performance of neural cross-sentence inference models.
The goal of meta-learning is to train a model on a variety of learning tasks, such that it can solve new learning tasks with only a small number of training samples. It tends to focus on finding model-agnostic solutions, whereas multi-task learning remains deeply tied to model architecture.
Thus, meta-level AI algorithms make AI systems:
· Learn faster
· Generalizable to many tasks
· Adaptable to environmental changes, as in reinforcement learning
One can solve any problem with a single model, but meta-learning should not be confused with one-shot learning.
Q9. What is Dropout Neural Networks?
Answer:
The term “dropout” refers to dropping out units (both hidden and visible) in a neural network.
At each training stage, individual nodes are either dropped out of the net with probability 1-p or kept
with probability p, so that a reduced network is left; incoming and outgoing edges to a dropped-out
node are also removed.
Why do we need Dropout?
The answer to this question is "to prevent over-fitting."
A fully connected layer occupies most of the parameters, and hence neurons develop co-dependency amongst each other during training, which curbs the individual power of each neuron and leads to over-fitting of the training data.
Q10. What is GAN?
Answer:
A generative adversarial network (GAN) is a class of machine learning systems invented by Ian Goodfellow and his colleagues in 2014. Two neural networks contest with each other in a game (in the sense of game theory, often but not always in the form of a zero-sum game). Given a training set, this technique learns to generate new data with the same statistics as the training set. For example, a GAN trained on photographs can produce original pictures that look at least superficially authentic to human observers, having many realistic characteristics. Though initially proposed as a form of generative model for unsupervised learning, GANs have also proven useful for semi-supervised learning,[2] fully supervised learning, and reinforcement learning.
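For reference, the standard GAN minimax objective from Goodfellow et al. (2014), where D is the discriminator, G the generator, and z the noise vector:

```latex
\min_{G}\max_{D} V(D, G) =
\mathbb{E}_{x \sim p_{\text{data}}(x)}\bigl[\log D(x)\bigr] +
\mathbb{E}_{z \sim p_{z}(z)}\bigl[\log\bigl(1 - D(G(z))\bigr)\bigr]
```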
Example of GAN
Given an image of a face, the network can construct an image showing how that person might look when they are older.
# Day21
Q1. Explain Grad-CAM architecture?
Answer:
According to the research paper: "We propose a technique for making Convolutional Neural Network (CNN)-based models more transparent by visualizing the regions of input that are 'important' for predictions, i.e., producing visual explanations. Our approach, called Gradient-weighted Class Activation Mapping (Grad-CAM), uses class-specific gradient information to localize the important regions. These localizations are combined with existing pixel-space visualizations to create a novel high-resolution, class-discriminative visualization called Guided Grad-CAM. These methods help to better understand CNN-based models, including image captioning and visual question answering (VQA) models. We evaluate our visual explanations by measuring their ability to discriminate between classes and to inspire trust in humans, and their correlation with occlusion maps. Grad-CAM provides a new way to understand CNN-based models."
In short, Grad-CAM is a technique for making CNN (Convolutional Neural Network)-based models more transparent by visualizing the regions of the input that are "important" for predictions from these models, i.e., visual explanations.
The visualization is both high-resolution (when the class of interest is 'tiger cat', it identifies crucial 'tiger cat' features like stripes, pointy ears, and eyes) and class-discriminative (it highlights the 'tiger cat' but not the 'boxer (dog)').
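A hedged Grad-CAM sketch in PyTorch using forward/backward hooks on a torchvision ResNet-18. The target layer, the random input standing in for a preprocessed image, and the class choice are illustrative; the weights argument assumes a recent torchvision, and this is not the authors' reference code.

```python
import torch
import torch.nn.functional as F
from torchvision import models

model = models.resnet18(weights="IMAGENET1K_V1").eval()
activations, gradients = {}, {}

def save_activation(module, inp, out):
    activations["value"] = out.detach()

def save_gradient(module, grad_in, grad_out):
    gradients["value"] = grad_out[0].detach()

# Hook the last convolutional block; its spatial feature maps drive the heatmap.
target_layer = model.layer4
target_layer.register_forward_hook(save_activation)
target_layer.register_full_backward_hook(save_gradient)

image = torch.randn(1, 3, 224, 224)        # stand-in for a preprocessed image
logits = model(image)
class_idx = logits.argmax(dim=1).item()
logits[0, class_idx].backward()            # gradients of the chosen class score

# Grad-CAM: channel weights = global-average-pooled gradients; weighted sum + ReLU.
weights = gradients["value"].mean(dim=(2, 3), keepdim=True)
cam = F.relu((weights * activations["value"]).sum(dim=1, keepdim=True))
cam = F.interpolate(cam, size=image.shape[2:], mode="bilinear", align_corners=False)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)   # normalize to [0, 1]
print(cam.shape)                           # (1, 1, 224, 224) heatmap over the input
```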
Q2.Explain squeeze-net architecture?
Answer:
Nowadays, technology is at its peak: self-driving cars and IoT will be household terms in the next few years. Much of this is controlled remotely; in self-driving cars, for example, the system needs to communicate with servers regularly. Accordingly, if we have a model that is small in size, we can quickly deploy it in the cloud and on devices. That is why we need an architecture that is smaller in size yet achieves the same level of accuracy that other architectures achieve.
Its architecture:
Replace 3x3 filters with 1x1 filters: we use as many 1x1 filters as possible, since a 1x1 filter has 9x fewer parameters than a 3x3 filter. We might think that replacing 3x3 filters with 1x1 filters would perform badly because a 1x1 filter has less information to work on, but this is not the case: a typical 3x3 filter captures the spatial information of pixels close to each other, while a 1x1 filter zeros in on a single pixel and captures relationships among its channels.
Decrease the number of input channels to 3x3 filters: to keep the total number of parameters in a CNN small, it is crucial not only to decrease the number of 3x3 filters, but also to decrease the number of input channels to those 3x3 filters. We reduce the number of input channels to 3x3 filters using squeeze layers. The authors use a building block called the "fire module," which consists of a squeeze layer and an expand layer. The squeeze layer uses only 1x1 filters, while the expand layer uses a combination of 3x3 and 1x1 filters; limiting the number of inputs to the 3x3 filters reduces the number of parameters in the layer.
Downsample late in the network so that convolution layers have large activation maps: having reduced the sheer number of parameters, the question is how the model gets the most out of the parameters that remain. The authors downsample the feature maps late in the network, and this increases accuracy. This is an interesting contrast to networks like VGG, where a large feature map is produced early and then shrinks as the network approaches the end. The authors cite a paper by K. He and H. Sun that similarly applies delayed downsampling, leading to higher classification accuracy.
This architecture is built from fire modules, which enable it to bring down the number of parameters.
Another surprising aspect is the lack of fully connected (dense) layers at the end, which one would see in a typical CNN architecture. The dense layers at the end of a typical CNN learn the relationships between the high-level features and the classes the network is trying to identify: fully connected layers are what learn that noses and ears make up a face, or that wheels and lights indicate cars. In this architecture, however, that extra learning step seems to be embedded within the transformations between the various fire modules.
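A minimal PyTorch sketch of a fire module; the channel counts follow the general squeeze/expand pattern described above, but the exact values are illustrative.

```python
import torch
import torch.nn as nn

class FireModule(nn.Module):
    def __init__(self, in_channels, squeeze_channels, expand_channels):
        super().__init__()
        # Squeeze layer: 1x1 convolutions reduce the channels fed to the 3x3 filters.
        self.squeeze = nn.Conv2d(in_channels, squeeze_channels, kernel_size=1)
        # Expand layer: a mix of 1x1 and 3x3 convolutions, concatenated on the channel axis.
        self.expand1x1 = nn.Conv2d(squeeze_channels, expand_channels, kernel_size=1)
        self.expand3x3 = nn.Conv2d(squeeze_channels, expand_channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.squeeze(x))
        return torch.cat([self.relu(self.expand1x1(x)),
                          self.relu(self.expand3x3(x))], dim=1)

fire = FireModule(in_channels=96, squeeze_channels=16, expand_channels=64)
feature_map = torch.randn(1, 96, 55, 55)
print(fire(feature_map).shape)   # torch.Size([1, 128, 55, 55]) -> 64 + 64 expand channels
```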
SqueezeNet can achieve accuracy nearly equal to AlexNet's with 50x fewer parameters. The most impressive part is that applying Deep Compression to the already smaller model reduces SqueezeNet to a size 510x smaller than AlexNet.
Q3.ZFNet architecture
Answer:
The architecture of the network is an optimized version of the previous year's winner, AlexNet. The authors spent some time finding the bottlenecks of AlexNet and removing them, achieving superior performance.
(Figure from the original paper: (a) first-layer ZFNet features without feature-scale clipping; (b) first-layer features from AlexNet, note the many dead features, i.e., kernels where the network did not learn any patterns; (c) first-layer features from ZFNet, with only a few dead features; (d) second-layer features from AlexNet, where the grid-like patterns are so-called aliasing artifacts, which appear when the receptive fields of convolutional neurons overlap and neighboring neurons learn similar structures; (e) second-layer features from ZFNet, with no aliasing artifacts.)
In particular, they reduced the filter size in the 1st convolutional layer from 11x11 to 7x7, which
resulted in fewer dead features learned in the first layer (see the image below for an example of that).
A dead feature is a situation where a convolutional kernel fails to learn any significant representation.
Visually it looks like a monotonic single-color image, where all the values are close to each other.
In addition to changing the filter size, the authors of ZFNet doubled the number of filters in all
convolutional layers and the number of neurons in the fully connected layers compared to
AlexNet. In AlexNet there were 48-128-192-192-128-2048-2048 kernels/neurons, and in
ZFNet these were all doubled to 96-256-384-384-256-4096-4096. This modification allowed the network
to increase the complexity of its internal representations and, as a result, decrease the error rate from
15.4% for the previous year’s winner to 14.8%, making it the winner in 2013.
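As a hedged illustration of the first-layer change described above (the stride values here follow the commonly cited settings of the two papers and are not stated in this text), the difference amounts to something like:

```python
import torch.nn as nn

# AlexNet-style first layer: large 11x11 receptive field with a big stride.
alexnet_conv1 = nn.Conv2d(3, 96, kernel_size=11, stride=4)

# ZFNet-style first layer: smaller 7x7 filters and a smaller stride, which
# the authors found reduces dead and aliased first-layer features.
zfnet_conv1 = nn.Conv2d(3, 96, kernel_size=7, stride=2)
```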
In the NAS algorithm, a controller Recurrent Neural Network (RNN) samples building blocks and
puts them together to create an end-to-end architecture. The architecture generally follows the
same style as state-of-the-art (SOTA) networks such as DenseNets or ResNets, but uses a very
different combination and configuration of blocks.
This new network architecture is then trained to convergence to obtain its accuracy on the held-
out validation set. The resulting accuracies are used to update the controller so that the controller
will generate better architectures over time, perhaps by selecting better blocks or making better
connections. The controller weights are updated with a policy gradient. The whole end-to-end setup
is shown below.
It’s a reasonably intuitive approach! In simple terms: have an algorithm grab different blocks and
put those blocks together to make a network; train and test that network; and, based on the results,
adjust the blocks used to build the network and how they are put together.
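In pseudocode-like Python, the loop described above might look roughly like this; every helper name here is a hypothetical placeholder, not a real NAS library API:

```python
def neural_architecture_search(controller, num_iterations):
    """Rough sketch of the NAS loop: the controller RNN samples building
    blocks, the resulting child network is trained and evaluated, and the
    validation accuracy is fed back as a reward via policy gradient.
    All helper functions below are hypothetical placeholders."""
    for _ in range(num_iterations):
        blocks = controller.sample_architecture()       # controller RNN picks blocks
        child = build_network(blocks)                   # assemble an end-to-end child network
        train_to_convergence(child)                     # train the candidate architecture
        reward = evaluate_on_validation(child)          # held-out validation accuracy
        update_with_policy_gradient(controller, blocks, reward)  # improve the controller
    return controller
```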
SENets (Squeeze-and-Excitation Networks) introduce a building block for CNNs that
improves channel interdependencies at almost no computational cost. They were used in the 2017
ImageNet competition and helped improve the previous year’s result by 25%. Besides this large
performance boost, they can be easily added to existing architectures. The idea is this:
Let’s add parameters to each channel of a convolutional block so that the network can adaptively
adjust the weighting of each feature map.
As simple as it may sound, that is it. So let’s take a closer look at why this works so well.
Why does it work so well?
A CNN uses its convolutional filters to extract hierarchical information from images. Lower layers
find small pieces of context like edges or high frequencies, while upper layers can detect faces, text,
or other complex geometric shapes. They extract whatever is necessary to solve the task accurately.
All of this works by fusing the spatial and channel information of an image. The different filters first
find the spatial features in each input channel before adding the information up across all available
channels.
All we need to understand for now is that the network weights each of its channels equally when
creating the output feature maps. SE blocks change this by adding a content-aware mechanism that
weights each channel adaptively. In its most basic form, this could mean adding a single parameter to
each channel, a linear scalar expressing how relevant that channel is.
However, the authors push it a little further. First, they get a global understanding of each channel
by squeezing each feature map to a single numeric value. This results in a vector of size n, where n is
equal to the number of convolutional channels. Afterward, it is fed through a two-layer neural
network, which outputs a vector of the same size. These n values can now be used as weights on the
original feature maps, scaling each channel based on its importance.
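A minimal sketch of such a squeeze-and-excitation block in PyTorch (the reduction ratio of 16 in the two-layer network is a common choice assumed here, not taken from this text):

```python
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze: global-average-pool each channel to a single value.
    Excite: a small two-layer network outputs one weight per channel,
    which is used to rescale the original feature maps."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)              # squeeze: n values, n = channels
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),  # first layer of the small network
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),  # back to one value per channel
            nn.Sigmoid(),                                # channel weights in (0, 1)
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                                     # reweight each channel adaptively
```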
The Bottom-Up Pathway
The bottom-up pathway is the feedforward computation of the backbone ConvNet. One
pyramid level is defined for each stage. The output of the last layer of each stage is used as the
reference set of feature maps for enriching the top-down pathway via lateral connections.
Top-Down Pathway and Lateral Connection
The feature maps from higher pyramid levels are spatially coarser but semantically stronger,
and they are upsampled to a higher resolution. More specifically, the spatial resolution
is upsampled by a factor of 2 using nearest-neighbor interpolation for simplicity.
Each lateral connection merges feature maps of the same spatial size from the bottom-up
pathway and the top-down pathway.
Specifically, the feature maps from the bottom-up pathway undergo 1×1
convolutions to reduce their channel dimensions.
Then the feature maps from the bottom-up pathway and the top-down pathway are merged
by element-wise addition.
Prediction in FPN
Finally, a 3×3 convolution is appended to each merged map to generate the final feature
map; this reduces the aliasing effect of upsampling. This last set of feature maps is
called {P2, P3, P4, P5}, corresponding to {C2, C3, C4, C5}, which respectively have the same
spatial sizes.
Because all levels of the pyramid use shared classifiers/regressors, as in a traditional featurized
image pyramid, the feature dimension at the output is fixed at d = 256. Thus, all extra
convolutional layers have 256-channel outputs.
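As a rough sketch of one top-down merge step described above (the channel count of the bottom-up map is illustrative; d = 256 follows the convention stated in the text):

```python
import torch.nn as nn
import torch.nn.functional as F

d = 256  # fixed feature dimension shared by all pyramid levels

# 1x1 lateral convolution reduces a bottom-up map (e.g. a 1024-channel C4) to d channels.
lateral_conv = nn.Conv2d(1024, d, kernel_size=1)
# 3x3 convolution applied to the merged map to reduce upsampling aliasing.
smooth_conv = nn.Conv2d(d, d, kernel_size=3, padding=1)

def merge_level(top_down, bottom_up):
    """One FPN merge: upsample the coarser top-down map by 2x (nearest
    neighbor), add the 1x1-reduced bottom-up map element-wise, and then
    smooth with a 3x3 convolution to produce the corresponding P level."""
    upsampled = F.interpolate(top_down, scale_factor=2, mode="nearest")
    merged = upsampled + lateral_conv(bottom_up)
    return smooth_conv(merged)
```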
The steps in black are the components that already existed in R-CNN. The steps in red do not
appear in R-CNN.
1. Selective Search
First, color similarity, texture similarity, region size, and region filling are used for non-
object-based segmentation. This yields many small segmented areas, as shown
at the bottom left of the image above.
Then a bottom-up approach is used: the small segmented areas are merged to form
larger segmented areas.
In this way, about 2K region proposals (bounding-box candidates) are generated, as shown
in the above image.
2. Box Rejection
R-CNN is used to reject bounding boxes that are most likely to be the background.
3. Pretraining Using Object-Level Annotations
Usually, pretraining is done on image-level annotations. This is not ideal when an object is small within
the image, because the object should occupy a large area within the bounding box created by
selective search.
Thus, pretraining is done on object-level annotations instead. The deep learning (DL) model can be any
model, such as ZFNet, VGGNet, or GoogLeNet.
4. Def-Pooling Layer
For the def-pooling path, the output from conv5 goes through a conv layer, then through the def-
pooling layer, and then through a max-pooling layer.
In simple terms, the summation of a_c multiplied by d_{c,n} is the 5×5 deformation penalty in the figure
above: the penalty for placing an object part away from its assumed central position.
By training DeepID-Net, object parts of the object to be detected will give a high activation value
after the def-pooling layer if they are close to their anchor positions. This output is then connected to
the 200-class scores for improvement.
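As a rough sketch of the idea (not the paper's exact formulation), def-pooling can be thought of as a penalized max-pooling: each candidate placement's activation is reduced by a learned deformation penalty that grows as the part moves away from its anchor position:

```python
import torch

def def_pooling(activation, penalty):
    """Toy def-pooling step: subtract the deformation penalty from the
    part-filter responses and keep the best penalized response.

    activation: (H, W) responses of one object-part filter
    penalty:    (H, W) deformation penalty, e.g. the 5x5 map in the figure
    """
    return (activation - penalty).max()

# Toy example: a 5x5 response map and a penalty growing with distance
# from the central anchor position.
act = torch.randn(5, 5)
ys, xs = torch.meshgrid(torch.arange(5.0), torch.arange(5.0), indexing="ij")
pen = (ys - 2) ** 2 + (xs - 2) ** 2
score = def_pooling(act, pen)
```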
5. Context Modeling
In the object-detection task in ILSVRC there are 200 classes. There is also a classification
competition task in ILSVRC for classifying and localizing 1000-class objects, whose contents are more
diverse than those of the object-detection task. Hence, the 1000-class scores obtained by the
classification network are used to refine the 200-class scores.
In the above picture: a simple fractal expansion (left), recursive stacking of the fractal
expansion as one block (middle), and 5 blocks cascaded as FractalNet (right).
For the base case, f1(z) is a single convolutional layer, and each expansion step composes the
previous structure with itself and joins it with a fresh convolutional layer:
f1(z) = conv(z)
fC+1(z) = join[ (fC ∘ fC)(z), conv(z) ]
where C is the number of columns, as in the middle of the above figure. The number of
convolutional layers along the deepest path within a block is 2^(C-1). In this case C = 4, so
the deepest path has 2³ = 8 convolutional layers.
For the join layer (green), the element-wise mean is computed; it is not concatenation or addition.
With five blocks (B = 5) cascaded as FractalNet, as at the right of the figure, the number of
convolutional layers along the deepest path within the whole network is B×2^(C-1), i.e., 5×2³ = 40
layers.
Between two blocks, 2×2 max pooling is applied to reduce the size of the feature maps. Batch Norm and
ReLU are used after each convolution.
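A small sketch in plain Python just to make the depth counting concrete (nothing here is taken from the paper's code):

```python
def deepest_path_layers(num_columns: int) -> int:
    """Number of convolutional layers on the deepest path of one fractal
    block: each expansion doubles the longest chain, giving 2**(C - 1)."""
    return 2 ** (num_columns - 1)

# C = 4 columns -> 8 conv layers in one block;
# B = 5 blocks  -> 5 * 8 = 40 layers on the deepest path of the whole network.
assert deepest_path_layers(4) == 8
assert 5 * deepest_path_layers(4) == 40
```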
Conventionally, at the transition from the conv layers to the FC layers, there is a single pooling
layer or even no pooling layer. SPPNet suggests having multiple pooling layers with different
scales.
In the figure, a 3-level SPP is used. Suppose the conv5 layer has 256 feature maps. Then, at the SPP layer:
1. First, each feature map is pooled to a single value (grey), forming a 256-d vector.
2. Then, each feature map is pooled into four values (green), forming a 4×256-d
vector.
3. Similarly, each feature map is pooled into 16 values (blue), forming a 16×256-d
vector.
4. The above three vectors are concatenated to form a single fixed-length 1-d vector.
5. Finally, this 1-d vector goes into the FC layers as usual.
With SPP, you don’t need to crop the image to a fixed size, as in AlexNet, before feeding it into the
CNN. Images of any size can be used as input.
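A minimal PyTorch sketch of the 3-level pooling described above (pool sizes of 1, 2, and 4 bins per side give the 1, 4, and 16 values per feature map):

```python
import torch
import torch.nn.functional as F

def spatial_pyramid_pool(feature_maps, levels=(1, 2, 4)):
    """Pool each feature map into level x level bins with adaptive max
    pooling and concatenate everything into one fixed-length vector,
    regardless of the input spatial size."""
    batch = feature_maps.size(0)
    pooled = [F.adaptive_max_pool2d(feature_maps, level).view(batch, -1)
              for level in levels]
    return torch.cat(pooled, dim=1)

# conv5 output with 256 feature maps of arbitrary spatial size:
x = torch.randn(1, 256, 13, 13)
vec = spatial_pyramid_pool(x)   # shape: (1, 256 * (1 + 4 + 16)) = (1, 5376)
```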