Data Science Interview Questions #Week3

The document provides a comprehensive overview of key concepts in data science, including autoencoders, text similarity, dropout in neural networks, forward propagation, text mining, information extraction, text generation, text summarization, topic modeling, and hidden Markov models. It also covers statistical learning, ANOVA, and the differences between parametric and non-parametric methods. Each concept is explained with definitions, examples, and applications relevant to machine learning and natural language processing.

DATA SCIENCE

INTERVIEW
PREPARATION
(30 Days of Interview
Preparation)

# DAY 15

Q1. What is Autoencoder?
Answer:
Autoencoder neural network: It is an unsupervised machine learning algorithm that applies
backpropagation, setting the target values to be equal to the inputs. It is trained to attempt to copy its
input to its output. Internally, it has a hidden layer that describes a code used to represent the input.

It tries to learn an approximation to the identity function, producing an output x̂ that is similar to the input x.
Autoencoders belong to the neural network family, but they are also closely related to PCA
(principal components analysis).
Although autoencoders are quite similar to PCA, they are much more flexible: autoencoders can represent
both linear and non-linear transformations in the encoding, whereas PCA can only perform linear
transformations. Autoencoders can also be stacked in layers to form a deep network.
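A minimal sketch of this setup, assuming TensorFlow/Keras is available and using random vectors in place of real inputs (the layer sizes are illustrative):

```python
import numpy as np
from tensorflow.keras import layers, models

# Stand-in data: 256 flattened "images" of dimension 784 (replace with real inputs).
x_train = np.random.rand(256, 784).astype("float32")

autoencoder = models.Sequential([
    layers.Input(shape=(784,)),
    layers.Dense(32, activation="relu"),      # encoder: the hidden code
    layers.Dense(784, activation="sigmoid"),  # decoder: reconstruction of the input
])
autoencoder.compile(optimizer="adam", loss="mse")

# The targets are the inputs themselves, as described above.
autoencoder.fit(x_train, x_train, epochs=5, batch_size=32, verbose=0)
```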
Types of Autoencoders:
1. Denoising autoencoder
Autoencoders are neural networks which are used for feature selection and extraction.
However, when there are more nodes in the hidden layer than there are inputs, the network
risks learning the so-called "identity function", also called the "null function", meaning that the
output equals the input, rendering the autoencoder useless.

Denoising Autoencoders solve this problem by corrupting the data on purpose by randomly
turning some of the input values to zero. In general, the percentage of input nodes which
are being set to zero is about 50%. Other sources suggest a lower count, such as 30%. It
depends on the amount of data and input nodes you have.

2. Sparse autoencoder
An autoencoder takes an input image or vector and learns a code dictionary that changes the
raw input from one representation to another. A sparse autoencoder adds a sparsity enforcer that
directs a single-layer network to learn a code dictionary which minimizes the error in reproducing
the input while restricting the number of code words needed for reconstruction.
The sparse autoencoder consists of a single hidden layer, which is connected to the input vector
by a weight matrix forming the encoding step. The hidden layer then outputs to a
reconstruction vector, using a tied weight matrix to form the decoder.

Q2. What Is Text Similarity?


Answer:
When talking about text similarity, different people have slightly different notions of what text
similarity means. In essence, the goal is to compute how ‘close’ two pieces of text are in (1) meaning
or (2) surface form. The first is referred to as semantic similarity, and the latter is referred to
as lexical similarity. Although methods for lexical similarity are often used to approximate semantic
similarity (to a certain extent), achieving true semantic similarity is usually much more involved.

Lexical or Word Level Similarity
Lexical similarity refers to how similar two pieces of text are at the surface
level. For example, how similar are the phrases “the cat ate the mouse” and “the mouse ate the cat
food” judging only by the words? On the surface, if you consider only word-level similarity, these
two phrases (with determiners disregarded) appear very similar, since 3 of the 4 unique words are an
exact overlap.
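As a minimal sketch of this word-level notion, assuming scikit-learn is available, a bag-of-words cosine similarity scores these two phrases as highly similar even though their meanings differ:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

phrases = ["the cat ate the mouse", "the mouse ate the cat food"]
# stop_words="english" drops determiners such as "the"
bow = CountVectorizer(stop_words="english").fit_transform(phrases)

print(cosine_similarity(bow[0], bow[1]))  # high surface-level similarity (~0.87)
```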

Semantic Similarity:
Semantic similarity, mostly explored by the NLP research community, asks how similar in meaning
any two phrases are. If we look at the phrases “the cat ate the mouse” and “the mouse ate the cat
food”, we know that while the words overlap significantly, the two phrases have different
meanings. Extracting meaning from phrases is usually the more difficult task, as it requires a deeper
level of analysis. For example, we can look at a simple aspect like the order of
words: “cat==>ate==>mouse” and “mouse==>ate==>cat food”. The words overlap, but the
order of occurrence is different, and from that we can tell that these two phrases have different meanings.
This is just one example; syntactic parsing is often used to help with semantic
similarity. Let’s have a look at the parse trees for these two phrases. What can you get from them?

Q3. What is dropout in neural networks?
Answer:
When we train our neural network (or model) by updating each of its weights, it might become
too dependent on the dataset we are using. Then, when the model has to make a prediction or
classification, it will not give satisfactory results. This is known as over-fitting. We can understand
this problem through a real-world example: if a student of science learns only one chapter of a book
and then takes a test on the whole syllabus, he will probably fail.
To overcome this problem, we use a technique that was introduced by Geoffrey Hinton in 2012. This
technique is known as dropout.
Dropout refers to ignoring a randomly chosen set of units (i.e., neurons) during the training phase.
By “ignoring”, we mean that these units are not considered during a particular
forward or backward pass.
At each training stage, individual nodes are either dropped out of the net with probability 1-p or kept
with probability p, so that a reduced network is left; incoming and outgoing edges to a dropped-out
node are also removed.
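A minimal Keras sketch of where dropout layers typically sit, assuming TensorFlow is available (the layer sizes and the rate p = 0.5 are illustrative):

```python
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(20,)),
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.5),          # each unit dropped with probability 0.5, at training time only
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
```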

Q4. What is Forward Propagation?
Answer:
Input X provides the information that propagates to the hidden units at each layer and finally
produces the output ŷ. The architecture of the network entails determining its depth, width, and the
activation functions used on each layer. Depth is the number of hidden layers. Width is the
number of units (nodes) in each hidden layer, since we control neither the input layer nor the output
layer dimensions. There are quite a few activation functions, such as the Rectified Linear Unit (ReLU),
sigmoid, and hyperbolic tangent. Empirically, deeper networks often outperform shallower networks
with more hidden units per layer, although deeper networks can also be harder to train.
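A minimal NumPy sketch of a single forward pass through a network of depth 2 (the widths, weights, and activation choices are illustrative):

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, params):
    """Propagate input x through two hidden layers to produce the output."""
    h1 = relu(params["W1"] @ x + params["b1"])
    h2 = relu(params["W2"] @ h1 + params["b2"])
    return sigmoid(params["W3"] @ h2 + params["b3"])

rng = np.random.default_rng(0)
params = {
    "W1": rng.normal(size=(8, 4)), "b1": np.zeros(8),
    "W2": rng.normal(size=(8, 8)), "b2": np.zeros(8),
    "W3": rng.normal(size=(1, 8)), "b3": np.zeros(1),
}
print(forward(rng.normal(size=4), params))
```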

Q5. What is Text Mining?


Answer:
Text mining: Also referred to as text data mining and roughly equivalent to text analytics, it is the process
of deriving high-quality information from text. High-quality information is typically derived by
devising patterns and trends through means such as statistical pattern learning. Text mining
usually involves the process of structuring the input text (usually parsing, along with the addition of
some derived linguistic features and the removal of others, and subsequent insertion into a database),
deriving patterns within the structured data, and finally evaluation and interpretation of the output.
'High quality' in text mining usually refers to some combination of relevance, novelty, and interest.
Typical text mining tasks include text categorization, text clustering, concept/entity extraction,
production of granular taxonomies, sentiment analysis, document summarization, and entity relation
modeling (i.e., learning relations between named entities).

Q6. What is Information Extraction?
Answer:
Information extraction (IE): It is the task of automatically extracting structured information from
unstructured and/or semi-structured machine-readable documents. In most cases, this activity
involves processing human language texts using natural language processing (NLP).
Information extraction depends on named entity recognition (NER), a sub-tool used to find targeted
information to extract. NER first recognizes entities as belonging to one of several categories, such as location
(LOC), person (PER), or organization (ORG). Once the information category is recognized, an
information extraction utility extracts the named entity's related information and constructs a
machine-readable document from it, which algorithms can further process to extract meaning. IE
finds meaning by way of other subtasks, including co-reference resolution, relationship extraction,
language and vocabulary analysis, and sometimes audio extraction.

Q7. What is Text Generation?
Answer:
Text Generation: It is a type of language modelling problem. Language modelling is the core
problem for several natural language processing tasks such as speech-to-text, conversational
systems, and text summarization. A trained language model learns the likelihood of occurrence
of a word based on the previous sequence of words in the text. Language models can
operate at the character level, n-gram level, sentence level, or even paragraph level.
A language model is at the core of many NLP tasks, and is simply a probability distribution over a
sequence of words:

P(w1, w2, ..., wm)

It can also be used to estimate the conditional probability of the next word in a sequence:

P(wm | w1, w2, ..., wm-1)
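As an illustrative sketch (not from the original document), a toy bigram model estimated from counts can already generate text by repeatedly sampling the next word given the current one:

```python
import random
from collections import defaultdict

corpus = "the cat sat on the mat . the cat ate the mouse .".split()

# Collect bigram transitions; sampling from them approximates P(next word | current word).
transitions = defaultdict(list)
for current, nxt in zip(corpus, corpus[1:]):
    transitions[current].append(nxt)

def generate(start, length=8):
    word, output = start, [start]
    for _ in range(length):
        word = random.choice(transitions[word])  # sample the next word
        output.append(word)
    return " ".join(output)

print(generate("the"))
```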

Q8. What is Text Summarization?
Answer:
We all interact with applications that use text summarization. Many of these applications are
platforms that publish articles on daily news, entertainment, and sports. With our busy
schedules, we like to read a summary of those articles before deciding to read the entire
article. Reading a summary helps us identify the areas of interest and gives a brief context of the story.
Text summarization is a subdomain of Natural Language Processing (NLP) that deals with extracting
summaries from huge chunks of text. There are two main types of techniques used for text
summarization: NLP-based techniques and deep learning-based techniques.
Text summarization: It refers to the technique of shortening long pieces of text. The intention is to
create a coherent and fluent summary containing only the main points outlined in the document.
How text summarization works:
There are two types of summarization: abstractive and extractive summarization.
1. Abstractive Summarization: It selects words based on semantic understanding, even
words that did not appear in the source documents. It aims at producing the important material in a
new way, interpreting and examining the text using advanced natural language techniques
to generate a new, shorter text that conveys the most critical information from the original
text.
This is comparable to the way a human reads an article or blog post and then summarizes it in
their own words.

2. Extractive Summarization: It attempts to summarize articles by selecting the subset of words
that retain the most important points.
This approach weights the most important sentences and uses them to form the
summary. Different algorithms and techniques are used to define the weights for the
sentences and then rank them based on importance and similarity to one another.
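A minimal sketch of the extractive idea, assuming only the Python standard library: score each sentence by the frequency of its words and keep the top-scoring sentences in their original order (a real system would use better sentence scoring):

```python
import re
from collections import Counter

def extractive_summary(text, n_sentences=2):
    """Score sentences by summed word frequency and keep the top ones."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"\w+", text.lower()))
    scores = {s: sum(freq[w] for w in re.findall(r"\w+", s.lower())) for s in sentences}
    top = sorted(sentences, key=scores.get, reverse=True)[:n_sentences]
    # Keep the chosen sentences in their original order.
    return " ".join(s for s in sentences if s in top)

article = ("Text summarization shortens long documents. It keeps only the main points. "
           "Many news platforms use it. Readers can then decide whether to read the full article.")
print(extractive_summary(article))
```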

Q9. What is Topic Modelling?
Answer:
Topic Modelling is the task of using unsupervised learning to extract the main topics (represented as
a set of words) that occur in a collection of documents.
Topic modeling, in the context of Natural Language Processing, is described as a method of
uncovering hidden structure in a collection of texts.

Dimensionality Reduction:
Topic modeling is a form of dimensionality reduction. Rather than representing a text T in its
feature space as {Word_i: count(Word_i, T) for Word_i in V}, we can represent the text in its topic
space as {Topic_i: weight(Topic_i, T) for Topic_i in Topics}.
Unsupervised learning:
Topic modeling can be compared to clustering. As in the case of clustering, the number of topics,
like the number of clusters, is a hyperparameter. By doing topic modeling, we build clusters of
words rather than clusters of texts. A text is thus a mixture of all the topics, each having a certain
weight.

A Form of Tagging
If document classification is assigning a single category to a text, topic modeling is assigning multiple
tags to a text. A human expert can label the resulting topics with human-readable labels and use
different heuristics to convert the weighted topics to a set of tags.
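A minimal sketch of this unsupervised workflow, assuming a recent scikit-learn is available (the toy documents and the choice of two topics are illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the stock market fell as investors sold shares",
    "the team won the match with a late goal",
    "shares of the bank rose after strong earnings",
    "the striker scored twice in the final game",
]
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
terms = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    print(f"Topic {i}:", [terms[j] for j in topic.argsort()[-4:]])  # top words per topic

print(lda.transform(X))  # per-document topic weights ("a text is a mixture of topics")
```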

Q10.What is Hidden Markov Models?
Answer:
Hidden Markov Models (HMMs) are a class of probabilistic graphical models that allow us to
predict a sequence of unknown (hidden) variables from a set of observed variables. A simple
example of an HMM is predicting the weather (hidden variable) based on the type of clothes that
someone wears (observed). An HMM can be viewed as a Bayes net unrolled through time, with
observations made at a sequence of time steps being used to predict the best sequence of hidden
states.

The diagram below, from Wikipedia, shows an HMM and its transitions. The scenario is a room
that contains urns X1, X2, and X3, each of which contains a known mix of balls, with each ball labeled y1,
y2, y3, or y4. A sequence of four balls is randomly drawn. In this particular case, the user observes
the sequence of balls y1, y2, y3, and y4 and is attempting to discern the hidden state, which is the right
sequence of three urns that these four balls were pulled from.

Why "Hidden" Markov Model?


The reason it is called the Hidden Markov Model is because we are constructing an inference model
based on the assumptions of a Markov process. The Markov process assumption is simply that the
“future is independent of the past given the present”.

To make this point clear, let us consider the scenario below where the weather, the hidden variable,
can be hot, mild or cold, and the observed variables are the type of clothing worn. The arrows
represent transitions from a hidden state to another hidden state or from a hidden state to an observed
variable.
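As an illustrative sketch of decoding such a weather/clothing HMM, the Viterbi algorithm below recovers the most likely hidden state sequence; the transition and emission probabilities are made-up numbers, not values from the original text:

```python
import numpy as np

states = ["hot", "mild", "cold"]                 # hidden weather states
observations = ["t-shirt", "jacket", "coat"]     # observed clothing, by index
start_p = np.array([0.4, 0.4, 0.2])
trans_p = np.array([[0.6, 0.3, 0.1],             # P(next state | current state)
                    [0.3, 0.4, 0.3],
                    [0.1, 0.3, 0.6]])
emit_p = np.array([[0.7, 0.2, 0.1],              # P(observation | state)
                   [0.2, 0.6, 0.2],
                   [0.1, 0.3, 0.6]])

def viterbi(obs_indices):
    """Return the most likely sequence of hidden states for the observed indices."""
    T, N = len(obs_indices), len(states)
    delta = np.zeros((T, N))
    backptr = np.zeros((T, N), dtype=int)
    delta[0] = start_p * emit_p[:, obs_indices[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] * trans_p * emit_p[:, obs_indices[t]]
        backptr[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0)
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.insert(0, int(backptr[t][path[0]]))
    return [states[i] for i in path]

print(viterbi([0, 1, 2]))  # t-shirt, jacket, coat -> likely hot, mild, cold
```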

-----------------------------------------------------------------------------------------------------------------

DATA SCIENCE
INTERVIEW
PREPARATION
(30 Days of Interview
Preparation)
# Day-16
Q1. What is Statistical Learning?
Answer:
Statistical learning: It is a framework for understanding data based on statistics, which can be
classified as supervised or unsupervised. Supervised statistical learning involves building a
statistical model for predicting, or estimating, an output based on one or more inputs, while
in unsupervised statistical learning there are inputs but no supervising output; we can still learn
relationships and structure from such data.

Y = f(X) + ε, where X = (X1, X2, ..., Xp),

f is an unknown function and ε is a random error term; the prediction error has a reducible part (from
estimating f) and an irreducible part (from ε).


Prediction & Inference:
In situations where the set of inputs X is readily available but the output Y is not easily obtained, we
often treat f as a black box (not concerned with the exact form of f), as long as it yields
accurate predictions for Y. This is prediction.
There are also situations where we are interested in understanding the way that Y is affected as X
changes. In this type of situation, we wish to estimate f, but our goal is not necessarily to make
predictions for Y. Here we are more interested in understanding the relationship between X and
Y. Now f cannot be treated as a black box, because we need to know its exact form. This
is inference.

Parametric & Non-parametric methods


Parametric statistics: These are statistical tests based on underlying assumptions about the data's
distribution. In other words, they are based on the parameters of the normal curve. Because parametric
statistics are based on the normal curve, the data must meet certain assumptions, or parametric statistics
cannot be used. Before running any parametric test, you should always test the
assumptions of the tests that you are planning to run. A parametric model also assumes a fixed
functional form for f, for example the linear form:

f(X) = β0 + β1X1 + β2X2 + . . . + βpXp


As the name suggests, nonparametric statistics are not based on the parameters of the normal curve. Therefore,
if our data violate the assumptions of the usual parametric test, a nonparametric statistic might better
describe the data, so try running the nonparametric equivalent of the parametric test. We should also
consider using nonparametric equivalents when we have limited sample sizes (e.g., n < 30).
Though nonparametric statistical tests have more flexibility than parametric statistical tests, they
are generally less statistically powerful; therefore, most statisticians recommend that, when their
assumptions hold, parametric statistics are preferred.
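As an illustrative sketch (not from the original text), the contrast between a parametric and a non-parametric fit can be seen with scikit-learn by fitting a linear model and a KNN regressor to the same non-linear data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.2, size=200)        # non-linear truth plus noise

parametric = LinearRegression().fit(X, y)                      # assumes f(X) = b0 + b1*X
nonparametric = KNeighborsRegressor(n_neighbors=10).fit(X, y)  # no fixed functional form

print(parametric.score(X, y), nonparametric.score(X, y))       # the flexible model fits far better here
```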

Prediction Accuracy and Model Interpretability:
Of the many methods we use for statistical learning, some are less flexible and more
restrictive. When inference is the goal, there are clear advantages to using simple and
relatively inflexible statistical learning methods. When we are only interested in prediction, we
use the most flexible models available.
Q2. What is ANOVA?
Answer:
ANOVA stands for "Analysis of Variance" and is an extremely important tool for the analysis of data
(both one-way and two-way ANOVA are used). It is a statistical method for comparing the population
means of two or more groups by analyzing variance: the group means are judged to differ only when
the variation between groups is large relative to the variation within groups.
An ANOVA test is a way to find out whether survey or experiment results are significant. In other words, it
helps us figure out whether we need to reject the null hypothesis or accept the alternative hypothesis. We
are testing groups to see if there's a difference between them. Examples of when we might want to
test different groups:

 A group of psychiatric patients try three different therapies: counseling, medication,
and biofeedback. We want to see if one therapy is better than the others.
 A manufacturer has two different processes for making light bulbs and wants to know which
one is better.
 Students from different colleges take the same exam. We want to see if one college
outperforms the others.

Types of ANOVA:
 One-way ANOVA
 Two-way ANOVA

One-way ANOVA is a hypothesis test in which only one categorical variable, or a single
factor, is taken into consideration. With the help of the F-distribution, it enables us to compare the
means of three or more samples. The null hypothesis (H0) is that all population
means are equal, while the alternative hypothesis is that at least one mean differs.
Two-way ANOVA examines the effect of two independent factors on a dependent
variable. It also studies the interaction between the independent variables influencing the
values of the dependent variable, if any.
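A minimal sketch of the one-way case, assuming SciPy is available; the three groups echo the therapy example above, and the numbers are made up for illustration:

```python
from scipy import stats

counseling  = [24, 27, 31, 29, 26]
medication  = [30, 33, 35, 32, 34]
biofeedback = [25, 28, 27, 30, 26]

f_stat, p_value = stats.f_oneway(counseling, medication, biofeedback)
print(f_stat, p_value)   # reject H0 of equal means if p_value is below the chosen significance level
```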

Q3. What is ANCOVA?


Answer:
Analysis of Covariance (ANCOVA): It is the inclusion of a continuous variable, in addition to the
variables of interest (the dependent and independent variables), as a means of control. Because
ANCOVA is an extension of ANOVA, the researcher can still assess main effects and
interactions to answer their research hypotheses. The difference between ANCOVA and ANOVA
is that an ANCOVA model includes a "covariate" that is correlated with the dependent variable, and the
means of the dependent variable are adjusted for the effect the covariate has on it. Covariates can also
be used in many ANOVA-based designs, such as between-subjects, within-subjects (repeated
measures), and mixed (between- and within-subjects) designs. Thus, this technique answers the question
of whether group differences remain once the covariate is taken into account.
In simple terms, the difference between ANOVA and ANCOVA is the letter "C", which stands
for 'covariance'. Like ANOVA, "Analysis of Covariance" (ANCOVA) has a single continuous
response variable. Unlike ANOVA, ANCOVA compares the response variable by both a factor and
a continuous independent variable (for example, comparing test score by both 'level of education' and
'number of hours spent studying'). The term for the continuous independent variable (IV) used in
ANCOVA is "covariate".
Q4. What is MANOVA?
Answer:
MANOVA (multivariate analysis of variance): It is a type of multivariate analysis used to analyze
data that involve more than one dependent variable at a time. MANOVA allows us to test hypotheses
regarding the effect of one or more independent variables on two or more dependent variables.
The obvious difference between ANOVA and "Multivariate Analysis of Variance" (MANOVA)
is the "M", which stands for multivariate. In basic terms, MANOVA is an ANOVA with two or more
continuous response variables. Like ANOVA, MANOVA comes in both a one-way flavor and a two-way
flavor. The number of factor variables involved distinguishes a one-way MANOVA from a two-way
MANOVA.

When comparing two or more continuous response variables by a single factor, a one-way
MANOVA is appropriate (e.g. comparing 'test score' and 'annual income' together by 'level of
education'). A two-way MANOVA also entails two or more continuous response variables, but
compares them by at least two factors (e.g. comparing 'test score' and 'annual income' together by
both 'level of education' and 'zodiac sign').

Q5. What is MANCOVA?


Answer:
Multivariate analysis of covariance (MANCOVA): It is a statistical technique that is an extension of
analysis of covariance (ANCOVA); that is, it is a multivariate analysis of variance (MANOVA) with one
or more covariates. In MANCOVA, we assess statistical differences on multiple continuous dependent
variables by an independent grouping variable, while controlling for a third variable called the
covariate; multiple covariates can be used, depending on the sample size. Covariates are added to
reduce error terms and to eliminate the covariates' effect on the
relationship between the independent grouping variable and the continuous dependent variables.
As with ANOVA and ANCOVA, the main difference between MANOVA and MANCOVA is the "C,"
which again stands for "covariance." Both MANOVA and MANCOVA feature two or more
response variables, but the key difference between the two is the nature of the IVs. While
MANOVA can include only factors, an analysis evolves from MANOVA to MANCOVA when one
or more covariates are added to the mix.
Q6. Explain the differences between KNN classifier
and KNN regression methods.
Answer:
They are quite similar. Given a value for K and a prediction point x0, KNN regression first
identifies the K training observations that are closest to x0, represented by N0. It then
estimates f(x0) using the average of all the training responses in N0. In other words,

f̂(x0) = (1 / K) * Σ_{i ∈ N0} yi

So the main difference is that in the classification approach, the algorithm assigns the most common
class among the K nearest neighbours, whereas in the regression approach, the response is the average
value of the nearest neighbours.
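A minimal scikit-learn sketch of the contrast, with toy data chosen only to make the neighbourhood behaviour visible:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

X = np.array([[1], [2], [3], [10], [11], [12]])
y_class = np.array([0, 0, 0, 1, 1, 1])
y_reg = np.array([1.2, 1.9, 3.1, 10.2, 11.1, 11.8])

clf = KNeighborsClassifier(n_neighbors=3).fit(X, y_class)
reg = KNeighborsRegressor(n_neighbors=3).fit(X, y_reg)

x0 = [[2.5]]
print(clf.predict(x0))   # majority class among the 3 nearest neighbours
print(reg.predict(x0))   # average response of the 3 nearest neighbours
```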

Q7. What is t-test?


Answer:
To understand the t-distribution, consider a situation where you want to compare the performance of
two workers in your company by checking the average sales made by each of them, or to compare
the performance of a worker by comparing his average sales with a standard value. In
such everyday situations, the t-distribution is applicable.
A t-test is a type of inferential statistic used to determine whether there is a significant difference between
the means of two groups, which may be related in certain features. It is mostly used when the data
sets, like the set of outcomes recorded from flipping a coin 100 times, would follow a normal
distribution and may have unknown variances. A t-test is used as a hypothesis testing tool, which
allows testing of an assumption applicable to a population.
Understand the t-test with an example: let's say you have a cold, and you try a naturopathic remedy. Your
cold lasts a couple of days. The next time you have a cold, you buy an over-the-counter
pharmaceutical, and the cold lasts a week. You survey your friends, and they all tell you that their
colds were of a shorter duration (an average of 3 days) when they took the naturopathic remedy.
What you want to know is: are these results repeatable? A t-test can tell you by comparing the means
of the two groups and letting you know the probability of those results happening by chance.
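A minimal SciPy sketch of the two-sample comparison above; the day counts are made up for illustration, and Welch's variant is used since the variances are unknown:

```python
from scipy import stats

naturopathic = [2, 3, 3, 4, 2, 3, 3]       # cold duration in days
pharmaceutical = [5, 6, 7, 6, 7, 5, 6]

t_stat, p_value = stats.ttest_ind(naturopathic, pharmaceutical, equal_var=False)
print(t_stat, p_value)   # a small p-value suggests the difference in means is not due to chance
```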

Q8. What is Z-test?


Answer:

z-test: It is a statistical test used to determine whether two population means are different when
the variances are known and the sample size is large. The test statistic is assumed to have a normal
distribution, and nuisance parameters such as the standard deviation should be known for an accurate z-
test to be performed.
Another definition of a z-test: a z-test is a type of hypothesis test. Hypothesis testing is just a way
for you to figure out whether results from a test are valid or repeatable. For example, if someone said they had
found a new drug that cures cancer, you would want to be sure it was probably true. A hypothesis
test will tell you whether it's probably true or probably not true. A z-test is used when your data are
approximately normally distributed.
How z-tests work:
Tests that can be conducted as z-tests include a one-sample location test, a two-sample location
test, a paired difference test, and a maximum likelihood estimate. Z-tests are related to t-tests, but t-
tests are best performed when an experiment has a small sample size. Also, t-tests assume the
standard deviation is unknown, while z-tests assume that it is known. If the standard deviation of
the population is unknown, the assumption that the sample variance equals the population
variance is made.
When can we run a z-test?
Different types of tests are used in statistics (e.g., F-test, chi-square test, t-test). You would use a
z-test if the following hold (a short worked sketch follows this list):

 Your sample size is greater than 30. Otherwise, use a t-test.
 Data points should be independent from each other. In other words, one data point is not
related to and does not affect another data point.
 Your data should be normally distributed. However, for large sample sizes (over 30), this
doesn’t always matter.
 Your data should be randomly selected from a population, where each item has an equal
chance of being selected.
 Sample sizes should be equal, if at all possible.
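As the sketch referenced above, a one-sample z-test with a known population standard deviation can be computed directly with NumPy and SciPy (the sample values, hypothesized mean, and sigma are made up for illustration):

```python
import numpy as np
from scipy import stats

# One-sample z-test: is the sample mean different from mu0, given a known population sigma?
sample = np.array([102, 98, 105, 110, 99, 104, 101, 103, 107, 100,
                   106, 97, 108, 102, 105, 99, 103, 104, 101, 106,
                   100, 102, 98, 107, 103, 105, 101, 99, 104, 102])
mu0, sigma = 100.0, 4.0

z = (sample.mean() - mu0) / (sigma / np.sqrt(len(sample)))
p_value = 2 * (1 - stats.norm.cdf(abs(z)))   # two-tailed p-value
print(z, p_value)
```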
Q9. What is Chi-Square test?
Answer:

Chi-square (χ2) statistic: It is a test that measures how expectations compare to actual observed data
(or model results). The data used in calculating a chi-square statistic must be random, raw, mutually
exclusive, drawn from independent variables, and drawn from a large enough sample. For example,
the results of tossing a coin 100 times meet these criteria.
The chi-square test is intended to test how likely it is that an observed distribution is due to chance. It is also
called the "goodness of fit" statistic because it measures how well the observed distribution of the
data fits the distribution that is expected if the variables are independent.
The chi-square test is designed to analyze categorical data, meaning data that have been counted
and divided into categories. It will not work with parametric or continuous data (such as height in
inches). For example, if you want to test whether attending class influences how students perform on
an exam, using test scores (from 0-100) as data would not be appropriate for a chi-square test.
However, arranging students into the categories "Pass" and "Fail" would be. Additionally, the data in a
chi-square grid should not be in the form of percentages, or anything other than frequency (count)
data.
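A minimal SciPy sketch of the attendance example above, using a 2x2 table of made-up counts:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Contingency table of counts: rows = attended class / did not attend, columns = Pass / Fail.
observed = np.array([[45, 15],
                     [25, 35]])

chi2, p_value, dof, expected = chi2_contingency(observed)
print(chi2, p_value)   # a small p-value suggests attendance and exam outcome are not independent
```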
Q10. What is correlation and covariance in statistics?
Answer:
Covariance and correlation are two mathematical concepts widely used in statistics. Both
establish the relationship and measure the dependency between two random variables. Although
they do similar work, they are mathematically different from each other.
Correlation: It is the statistical technique that can show whether and how strongly pairs of variables
are related. For example, height and weight are related; taller people tend to be heavier than shorter
people. The relationship isn't perfect. People of the same height vary in weight, and you can easily
think of two people you know where the shorter one is heavier than the taller one. Nonetheless, the
average weight of people 5'5'' is less than the average weight of people 5'6'', and their average weight
is less than that of people 5'7'', etc. Correlation can tell you just how much of the variation in peoples'
weights is related to their heights.

Covariance: It measures the directional relationship between the returns on two assets. The positive
covariance means that asset returns move together while a negative covariance means they move
inversely. Covariance is calculated by analyzing at-return surprises (standard deviations from the
expected return) or by multiplying the correlation between the two variables by the standard
deviation of each variable.
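A minimal NumPy sketch of both quantities on the height/weight example (the numbers are illustrative):

```python
import numpy as np

heights = np.array([160, 165, 170, 175, 180, 185])    # cm
weights = np.array([55, 60, 66, 70, 78, 83])          # kg

print(np.cov(heights, weights)[0, 1])        # covariance: the sign gives the direction of the relationship
print(np.corrcoef(heights, weights)[0, 1])   # correlation: strength on a [-1, 1] scale
```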

------------------------------------------------------------------------------------------------------------------------
DATA SCIENCE
INTERVIEW
PREPARATION
(30 Days of Interview
Preparation)

# DAY 17

Q1. What is ERM (Empirical Risk Minimization)?
Answer:
Empirical risk minimization (ERM): It is a principle in statistical learning theory which defines a
family of learning algorithms and is used to give theoretical bounds on their performance. The idea
is that we don’t know exactly how well an algorithm will work in practice (the true "risk") because
we don't know the true distribution of data that the algorithm will work on, but as an alternative we
can measure its performance on a known set of training data.
We assume that our samples come from this distribution and use our dataset as an approximation.
If we compute the loss using the data points in our dataset, it's called the empirical risk. It is "empirical"
and not "true" because we are using a dataset that is only a subset of the whole population.
When our learning model is built, we have to pick a function that minimizes the empirical risk, that
is, the average loss between predicted output and actual output over the data points in the dataset. This
process of finding such a function is called empirical risk minimization (ERM). We really want to minimize
the true risk, but we don't have the information that would allow us to do that, so we hope that the
empirical risk is close to the true risk.
Let's get a better understanding with an example.
Suppose we want to build a model that can differentiate between a male and a female based on specific
features. If we select 150 random people in which the women happen to be very short and the men very
tall, the model might incorrectly conclude that height alone is the differentiating feature. To build a truly
accurate model, we would have to gather all the women and men in the world to extract the differentiating
features. Unfortunately, that is not possible! So we select a small number of people and hope that this
sample is representative of the whole population.

Q2. What is PAC (Probably Approximately Correct)?
Answer:
PAC: In computational learning theory, probably approximately correct (PAC) learning is a
framework for mathematical analysis of machine learning.
The learner receives samples and must pick a generalization function (called the hypothesis)
from a specific class of possible functions. Our goal is that, with high probability, the selected
function will have low generalization error. The learner must be able to learn the concept given any
arbitrary approximation ratio, probability of success, or distribution of the samples.
A hypothesis class is PAC (Probably Approximately Correct) learnable if there exist a function m_H and an
algorithm such that, for any labeling function f, any distribution D over the domain of inputs X, and
any delta and epsilon, running the algorithm on m ≥ m_H samples produces, with probability 1 - delta, a
hypothesis h whose true error is lower than epsilon. The labeling function is nothing other than a
specific function f that labels the data in the domain.

Q3. What is ELMo?
Answer:
ELMo is a novel way to represent words as vectors, or embeddings. These word embeddings help
achieve state-of-the-art (SOTA) results in several NLP tasks.

It is a deep contextualized word representation that models both complex characteristics of word use
(e.g., syntax and semantics) and how these uses vary across linguistic contexts. These word vectors
are learned functions of the internal states of a deep biLM (bidirectional language model), which is pre-
trained on a large text corpus. They can easily be added to existing models and significantly improve
the state of the art across a broad range of challenging NLP problems, including question answering,
textual entailment, and sentiment analysis.

Q4. What is Pragmatic Analysis in NLP?
Answer:
Pragmatic Analysis (PA): It deals with outside-world knowledge, meaning knowledge that is
external to the documents and queries. PA reinterprets what was described in terms of what it
actually meant, deriving the aspects of language that require real-world knowledge.
It deals with the overall communicative and social content and its effect on interpretation. It means
abstracting the meaningful use of language in situations. In this analysis, the main focus is always on
reinterpreting what was said in terms of what was intended.
It helps users to discover this intended effect by applying a set of rules that characterize cooperative
dialogues.
E.g., "close the window?" should be interpreted as a request instead of an order.

Q5. What is Syntactic Parsing?


Answer:
Syntactic Parsing or Dependency Parsing: It is the task of recognizing a sentence and assigning a
syntactic structure to it. The most widely used syntactic structure is the parse tree, which can be
generated using parsing algorithms. These parse trees are useful in various applications like
grammar checking and, more importantly, play a critical role in the semantic analysis stage. For
example, to answer the question “Who is the point guard for the LA Lakers in the next game?” we
need to figure out its subject, objects, and attributes to help us figure out that the user wants the point
guard of the LA Lakers specifically for the next game.
Example:
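A hedged sketch of inspecting such a parse with spaCy's dependency parser (assuming spaCy and its en_core_web_sm model are installed):

```python
import spacy

nlp = spacy.load("en_core_web_sm")   # assumes the small English model has been downloaded
doc = nlp("Who is the point guard for the LA Lakers in the next game?")

for token in doc:
    # token.dep_ is the dependency label and token.head is the syntactic parent
    print(f"{token.text:<8} {token.dep_:<10} head={token.head.text}")
```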

Q6. What is ULMFit?
Answer:
Transfer learning in NLP (Natural Language Processing) is an area that had not been explored with
great success until recently. In May 2018, Jeremy Howard and Sebastian Ruder came up with the paper
Universal Language Model Fine-tuning for Text Classification (ULMFiT), which explores the
benefits of using a pre-trained model for text classification. It proposes ULMFiT (Universal Language
Model Fine-tuning for Text Classification), a transfer learning method that can be applied to any
task in NLP. The method outperforms the state of the art on six text classification tasks.
ULMFiT uses a regular LSTM-based architecture, the state-of-the-art AWD-LSTM language model.
The LSTM network has three layers. A single architecture is used throughout, for pre-training
as well as for fine-tuning.
ULMFiT achieves the state-of-the-art result using novel techniques like:

 Discriminative fine-tuning
 Slanted triangular learning rates
 Gradual unfreezing
Discriminative Fine-Tuning

Different layers of a neural network capture different types of information so they should be fine-
tuned to varying extents. Instead of using the same learning rates for all layers of the model,
discriminative fine-tuning allows us to tune each layer with different learning rates.
Slanted Triangular Learning Rates

The model should quickly converge to a suitable region of the parameter space at the beginning of
training and then later refine its parameters. Using a constant learning rate throughout training is not
the best way to achieve this behaviour. Instead, Slanted Triangular Learning Rates (STLR) linearly
increase the learning rate at first and then linearly decay it.
Gradual Unfreezing
Gradual unfreezing is the concept of unfreezing the layers gradually, which avoids the catastrophic
loss of knowledge possessed by the model. It first unfreezes the top layer and fine-tunes all the
unfrozen layers for 1 epoch. It then unfreezes the next lower frozen layer and repeats until all the
layers have been fine-tuned until convergence at the last iteration.

Q7. What is BERT?


Answer:
BERT (Bidirectional Encoder Representations from Transformers) is an open-sourced NLP pre-
training model developed by researchers at Google in 2018. A direct descendant of GPT (Generative
Pre-Training), BERT has outperformed several models in NLP and provided top results in
Question Answering, Natural Language Inference (MNLI), and other frameworks.
What makes it unique is that it is the first deeply bidirectional,
unsupervised language representation, pre-trained using only a plain-text corpus. Since it is open-
sourced, anyone with machine learning knowledge can easily build an NLP model without the need
for sourcing massive datasets for training the model, thus saving time, energy, knowledge, and
resources.

How does it work?
Traditional context-free models (like word2vec or GloVe) generate a single word-embedding
representation for each word in the vocabulary, which means the word “right” would have the same
context-free representation in “I’m sure I’m right” and “Take a right turn.” BERT, however,
represents each word based on both its previous and next context, making it bidirectional. While the
concept of bidirectionality had been around for a long time, BERT was the first of its kind to successfully
pre-train a deeply bidirectional representation with a deep neural network.
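A minimal sketch of obtaining these contextual vectors, assuming the Hugging Face transformers library, PyTorch, and the bert-base-uncased checkpoint are available:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentences = ["I'm sure I'm right.", "Take a right turn."]
batch = tokenizer(sentences, padding=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**batch)

# One contextual vector per token; the vectors for "right" differ between the two sentences.
print(outputs.last_hidden_state.shape)   # (batch, tokens, hidden_size)
```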

Q8.What is XLNet?
Answer:
XLNet is a BERT-like model rather than a totally different one, but it is a promising
one. In one phrase, XLNet is a generalized autoregressive pretraining method.
An autoregressive (AR) language model is a kind of model that uses the context words to predict the
next word. Here the context is constrained to a single direction, either forward or backward.

The advantage of AR language models is that they are good at generative NLP tasks:
because generation usually proceeds in the forward direction, AR language models naturally
work well on such tasks.

However, autoregressive language models have a disadvantage: they can only use forward context or
backward context, which means they cannot use forward and backward context at the same time.

Q9. What is the transformer?


Answer:
Transformer: It is a deep machine learning model introduced in 2017, used primarily in the field
of natural language processing (NLP). Like recurrent neural networks (RNNs), it is designed to handle
ordered sequences of data, such as natural language, for various tasks like machine
translation and text summarization. However, unlike RNNs, Transformers
do not require that the sequence be processed in order. So, if the data in question is natural
language, the Transformer does not need to process the beginning of a sentence before it processes
the end. Due to this feature, the Transformer allows for much more parallelization than RNNs during
training.
Transformers were developed to solve the problem of sequence transduction, previously handled by
recurrent neural networks; that is, any task that transforms an input sequence into an output sequence. This
includes speech recognition, text-to-speech transformation, etc.
For models to perform sequence transduction, it is necessary to have some sort of memory. For
example, let us say that we are translating the following sentence to another language (French):

“The Transformers” is a Japanese band. That band was formed in 1968, during the height of
Japanese music history.
In the above example, the words “the band” in the second sentence refer to the band “The
Transformers” introduced in the first sentence. When you read about the band in the second sentence,
you know that it is referring to “The Transformers” band. That may be important for translation.
To translate sentences like these, a model needs to figure out these sorts of dependencies and
connections. Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs) have
been used to deal with this problem because of their properties.

Q10. What is Text summarization?


Answer:
Text summarization: It is the process of shortening a text document to create a summary of the
significant points of the original document.
Types of Text Summarization Methods:
Text summarization methods can be classified into different types, the main ones being extractive
and abstractive summarization.


-------------------------------------------------------------------------------------------------------------

DATA SCIENCE
INTERVIEW
PREPARATION
(30 Days of Interview
Preparation)
# Day-18
Q1. What is Levenshtein Algorithm?
Answer:
Levenshtein distance is a string metric for measuring the difference between two sequences. The
Levenshtein distance between two words is the minimum number of single-character edits (i.e.
insertions, deletions or substitutions) required to change one word into the other.
Mathematically, the Levenshtein distance between two strings a and b (of length |a| and |b|
respectively) is given by lev_a,b(|a|, |b|), where:

lev_a,b(i, j) = max(i, j)                                  if min(i, j) = 0,
               min( lev_a,b(i-1, j) + 1,
                    lev_a,b(i, j-1) + 1,
                    lev_a,b(i-1, j-1) + 1_(ai ≠ bj) )      otherwise,

where 1_(ai ≠ bj) is the indicator function, equal to 0 when ai = bj and equal to 1 otherwise,
and lev_a,b(i, j) is the distance between the first i characters of a and the first j characters of b.
Example:
The Levenshtein distance between "HONDA" and "HYUNDAI" is 3, since the following three edits
change one into the other, and there is no way to do it with fewer than three edits: substitute O with Y,
insert U, and insert I.
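A minimal pure-Python sketch of the standard dynamic-programming computation of this distance:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between strings a and b using the two-row dynamic-programming method."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution (or match)
        prev = curr
    return prev[-1]

print(levenshtein("HONDA", "HYUNDAI"))   # 3
```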
Q2. What is Soundex?
Answer:
Soundex attempts to find similar names or homophones using phonetic notation: it keeps the first
letter of a word and encodes the remaining consonants as digits according to fixed rules, so that
names that sound alike receive the same code.
The Soundex phonetic algorithm indexes strings depending on their English pronunciation. The
algorithm is used to describe homophones, words that are pronounced the same but spelt differently.
Suppose we have a DataFrame, sourceDF, containing the words listed in the results below.

Let's run the code below and see how the Soundex algorithm encodes these words.
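A minimal PySpark sketch that produces the encodings summarized below, assuming an available Spark session and a single column named word (the DataFrame contents follow the word pairs discussed here):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import soundex

spark = SparkSession.builder.getOrCreate()
source_df = spark.createDataFrame(
    [("two",), ("to",), ("break",), ("brake",), ("hear",), ("here",), ("free",), ("tree",)],
    ["word"],
)

# Add a column with the Soundex encoding of each word.
source_df.withColumn("encoding", soundex("word")).show()
```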
Let’s summarize the above results:

 "two" and "to" both are encoded as T000


 "break" and "brake" both are encoded as B620
 "hear" and "here" both are encoded as H600
 "free" is encoded as F600 and "tree" is encoded as T600: Encodings are similar, but word is
different
The Soundex algorithm was often used to compare first names that were spelt differently.

Q3. What is Constituency parse?


Answer:
A constituency parse tree breaks a text into sub-phrases. Non-terminals in the tree are types of
phrases, the terminals are the words in the sentence, and the edges are unlabeled. For a simple
sentence, "John sees Bill", a constituency parse would be:

Some approaches convert the parse tree into a sequence following a depth-first traversal, so that
sequence-to-sequence models can be applied to it. The linearized version of the above parse tree looks as
follows: (S (N) (VP V N)).

Q4. What is LDA(Latent Dirichlet Allocation)?


Answer:
LDA is used to classify text in a document to a specific topic. It builds a topic-per-document
model and a words-per-topic model, modelled as Dirichlet distributions.

 Each document is modeled as a distribution of topics, and each topic is modelled as a
multinomial distribution of words.
 LDA assumes that every chunk of text we feed into it will contain words that are somehow
related. Therefore choosing the right corpus of data is crucial.
 It also assumes documents are produced from a mixture of topics. Those topics then generate
words based on their probability distribution.
The Bayesian version of PLSA is LDA. It uses Dirichlet priors for the word-topic and document-
topic distributions, lending itself to better generalization.
What does LDA give us?
It is a probabilistic method. For every document, the results give us a mixture of topics that make up
the document. To be precise, we get a probability distribution over the k topics for every document.
Every word in the document is attributed to a particular topic with a probability given by that distribution.
The topics themselves are defined as probability distributions over the vocabulary. Our results are
two sets of probability distributions:

 The collection of distributions of topics for each document


 The collection of distributions of words for each topic.
Q5.What is LSA?
Answer:
Latent Semantic Analysis (LSA): It is a theory and method for extracting and representing the
contextual meaning of words by statistical computations applied to a large corpus of text.
It is an information retrieval technique which analyzes and identifies patterns in an unstructured
collection of texts and the relationships between them.
Latent Semantic Analysis itself is an unsupervised way of uncovering synonyms in a collection of
documents.
Why LSA (Latent Semantic Analysis)?
LSA is a technique for creating a vector representation of a document. Having a vector representation
of a document gives us a way to compare documents for their similarity by calculating the distance
between the vectors. This, in turn, means we can do handy things such as classifying documents to find
out which of a set of known topics they most likely belong to.
Classification implies we have some known topics that we want to group documents into, and that
you have some labelled training data. If you want to identify natural groupings of the documents
without any labelled data, you can use clustering.
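A minimal scikit-learn sketch of LSA: TF-IDF vectors are reduced with truncated SVD into a latent "concept" space, where document similarity can be compared (the toy documents are illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the car is driven on the road",
    "the truck is driven on the highway",
    "the chef cooked a delicious meal",
]
tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)

lsa = TruncatedSVD(n_components=2, random_state=0)
doc_vectors = lsa.fit_transform(tfidf)        # documents in the latent semantic space

print(cosine_similarity(doc_vectors))         # the car/truck documents end up close together
```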
Q6. What is PLSA?
Answer:
PLSA stands for Probabilistic Latent Semantic Analysis; it uses a probabilistic method instead of SVD
to tackle the problem. The main idea is to find a probabilistic model with latent topics that can
generate the data we observe in our document-term matrix. Specifically, we want a model P(D, W)
such that, for any document d and word w, P(d, w) corresponds to that entry in the document-term matrix.
Each document is modelled as a mixture of topics, and each topic consists of a collection of words.
PLSA adds a probabilistic spin to these assumptions:

 Given document d, topic z is available in that document with the probability P(z|d)
 Given the topic z, word w is drawn from z with probability P(w|z)

The joint probability of seeing the given document and word together is:

P(d, w) = P(d) Σ_z P(z|d) P(w|z)

In the above case, P(D), P(Z|D), and P(W|Z) are the parameters of our model. P(D) can be
determined directly from the corpus. P(Z|D) and P(W|Z) are modelled as multinomial
distributions and can be trained using the expectation-maximization algorithm (EM).

Q7. What is LDA2Vec?


Answer:
It is inspired by LDA: the word2vec model is expanded to simultaneously learn word, document, and
topic vectors.
lda2vec is obtained by modifying the skip-gram word2vec variant. In the original skip-gram method,
the model is trained to predict context words based on a pivot word. In lda2vec, the pivot word vector
and a document vector are added to obtain a context vector. This context vector is then used to predict
context words.
At the document level, we know how to represent text as a mixture of topics. At the word level,
we typically use something like word2vec to obtain vector representations. lda2vec is an extension of
word2vec and LDA that jointly learns word, document, and topic vectors.
How does it work?
It builds on top of the skip-gram model of word2vec to generate word vectors: a neural net
that learns word embeddings by trying to use the input word to predict surrounding context words.
With lda2vec, rather than using the word vector directly to predict context words, you leverage
a context vector to make the predictions. The context vector is created as the sum of two other vectors:
the word vector and the document vector.
The same skip-gram word2vec model generates the word vector. The document vector is the more
interesting part: it is really a weighted combination of two other components:

 the document weight vector, representing the “weights” of each topic in a document
 the topic matrix, representing each topic and its corresponding vector embedding.
Together, a document vector and word vector generate “context” vectors for each word in a
document. lda2vec's power lies in the fact that it not only learns word embeddings for words; it
simultaneously learns topic representations and document representations as well.
Q8. What is Expectation-Maximization Algorithm(EM)?
Answer:
The Expectation-Maximization Algorithm, in short, EM algorithm, is an approach for maximum
likelihood estimation in the presence of latent variables.
This algorithm is an iterative approach that cycles between two modes. The first mode attempts to
predict the missing or latent variables called the estimation-step or E-step. The second mode attempts
to optimize the parameters of the model to best explain the data, called the maximization-step or M-
step.

 E-Step. Estimate the missing variables in the dataset.


 M-Step. Maximize the parameters of the model in the presence of the data.
The EM algorithm can be applied quite widely, although it is perhaps most well known in machine
learning for use in unsupervised learning problems, such as density estimation and clustering.
For a detailed explanation of EM, let us first consider an example. Say that we are in a school and are
interested in learning the height distribution of female and male students in the school. The most sensible
thing to do, as you would probably agree, is to randomly take a sample of N students of both
genders, collect their height information, and estimate the mean and standard deviation for males and
females separately by the maximum likelihood method.
Now say that you are not able to record the gender of the students while you collect their height
information, so there are two things you have to guess/estimate: (1) whether each individual height sample
belongs to a male or a female, and (2) the parameters (μ, σ) for each gender, which are
now unobservable. This is tricky because only with the knowledge of who belongs to which group
can we make reasonable estimates of the group parameters separately. Similarly, only if we know the
parameters that define the groups can we assign a subject properly. How do you break out of this
infinite loop? Well, the EM algorithm simply says to start with initial random guesses.
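A minimal scikit-learn sketch of exactly this situation: a two-component Gaussian mixture fitted with EM to unlabelled heights (the data are simulated, so the true parameters are known only to us):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
heights = np.concatenate([
    rng.normal(162, 6, 500),    # unlabelled "female" heights
    rng.normal(176, 7, 500),    # unlabelled "male" heights
]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=0).fit(heights)   # fitted via EM iterations
print(gmm.means_.ravel(), np.sqrt(gmm.covariances_).ravel())         # recovered (mu, sigma) per group
print(gmm.predict_proba(heights[:3]))   # E-step-style responsibilities for the first samples
```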
Q9.What is Text classification in NLP?
Answer:
Text classification, also known as text tagging or text categorization, is the process of categorizing
text into organized groups. By using NLP, text classification can automatically analyze text and then
assign a set of pre-defined tags or categories based on its content.
Unstructured text is everywhere on the internet, such as emails, chat conversations, websites, and
social media, but it's hard to extract value from this data unless it's organized in a certain way.
Doing so used to be a difficult and expensive process, since it required spending time and resources
to manually sort the data or creating handcrafted rules that are difficult to maintain. Text classifiers
built with NLP have proven to be a great alternative to structure textual data in a fast, cost-effective, and
scalable way.
Text classification is becoming an increasingly important part of businesses, as it allows us to get
insights from data and automate business processes quickly. Some of the most common
examples and use cases for automatic text classification include the following:

 Sentiment Analysis: the process of understanding whether a given text is talking positively or
negatively about a given subject (e.g. for brand monitoring purposes).
 Topic Detection: the task of identifying the theme or topic of a piece of text (e.g. knowing
whether a product review is about Ease of Use, Customer Support, or Pricing when analysing
customer feedback).
 Language Detection: the procedure of detecting the language of a given text (e.g. knowing whether an
incoming support ticket is written in English or Spanish, for automatically routing tickets to
the appropriate team).
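A minimal scikit-learn sketch of a text classifier in the sentiment-analysis spirit; the four training examples are made up, and a real system would need far more labelled data:

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts = [
    "I love this product, works great",
    "terrible support, very disappointed",
    "amazing quality and fast delivery",
    "the item broke after one day, awful",
]
labels = ["positive", "negative", "positive", "negative"]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, labels)

print(clf.predict(["fast delivery and great quality"]))   # expected: ['positive']
```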
Q10. What is Word Sense Disambiguation (WSD)?
Answer:
WSD (Word Sense Disambiguation) is a solution to the ambiguity which arises due to the different
meanings a word can have in different contexts.
In natural language processing, word sense disambiguation (WSD) is the problem of determining
which "sense" (meaning) of a word is activated by the use of the word in a particular context, a
process which appears to be mostly unconscious in people. WSD is a natural classification problem:
given a word and its possible senses, as defined by a dictionary, classify an occurrence of the word
in context into one or more of its sense classes. The features of the context (such as the
neighbouring words) provide the evidence for classification.
For example, consider the two sentences below.
“The bank will not be accepting the cash on Saturdays.”
“The river overflowed the bank.”
The word “bank” in the first sentence refers to a commercial (finance) bank, while in the second
sentence it refers to a riverbank. The ambiguity that arises from this is tough for a machine to
detect and resolve. Detecting the ambiguity is the first issue; resolving it and displaying the correct output
is the second issue.
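A minimal NLTK sketch of this very example using the simplified Lesk algorithm (assuming NLTK plus its wordnet and punkt resources are installed; Lesk is a heuristic, so the chosen senses may not always match intuition):

```python
# Requires: pip install nltk, then nltk.download("wordnet") and nltk.download("punkt")
from nltk.tokenize import word_tokenize
from nltk.wsd import lesk

s1 = word_tokenize("The bank will not be accepting the cash on Saturdays")
s2 = word_tokenize("The river overflowed the bank")

print(lesk(s1, "bank"))   # WordNet synset chosen for the finance context (may vary)
print(lesk(s2, "bank"))   # WordNet synset chosen for the river context (may vary)
```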

------------------------------------------------------------------------------------------------------------------
DATA SCIENCE
INTERVIEW
PREPARATION
(30 Days of Interview
Preparation)

# DAY 19

Q1. What is LSI(Latent Semantic Indexing)?
Answer:
Latent Semantic Indexing (LSI): It is an indexing and retrieval method that uses a mathematical
technique called SVD(Singular value decomposition) to find patterns in relationships between terms
and concepts contained in an unstructured collection of text. It is based on the principle that words
that are used in the same contexts tend to have similar meanings.
For example, when they occur together, “Tiger” and “Woods” are associated with the golfer rather
than the animal or the material, and “Paris” and “Hilton” are associated with the celebrity rather than
the city or the hotel chain.
Example:
If you use LSI to index a collection of articles and the words “fan” and “regulator” appear together
frequently enough, the search algorithm will notice that the two terms are semantically close. A
search for “fan” will, therefore, return a set of items containing that term, but also items that contain
just the word “regulator”. LSI does not understand the meanings of the words themselves, but by
examining a sufficient number of documents, it learns that the two terms are interrelated. It then uses
that information to provide an expanded set of results with better recall than a plain keyword search.
The diagram below illustrates the difference between LSI and keyword searches; W stands for a document.

Q2. What is Named Entity Recognition? And tell some use cases of
NER?
Answer:
Named-entity recognition (NER): Also known as entity extraction or entity identification, it is a
subtask of information extraction that seeks to locate and classify atomic elements in
text into predefined categories like the names of persons, organizations, places, expressions of
time, quantities, monetary values, percentages, and more.
In each text document, particular terms represent specific entities that are more informative and have
a distinct context. These entities are called named entities, which more precisely refers to terms
that represent real-world objects like people, places, organizations or institutions, and so on, which
are often denoted by proper names. A naive approach could be to find these by looking at
the noun phrases in text documents. NER is also known as entity chunking/extraction, which is a popular
technique used in information extraction to analyze and segment the named entities and categorize
or classify them under various predefined classes.

Named Entity Recognition use-cases

 Classifying content for news providers:
NER can automatically scan entire articles and reveal the significant people, organizations, and places discussed in them. Knowing the relevant tags for each article helps in automatically categorizing articles in defined hierarchies and enables smooth content discovery.
 Customer Support:
Let's say we are handling the customer support department of an electronics store with multiple branches worldwide; we go through a number of mentions in our customers' feedback, such as a complaint about a Fitbit purchased in Bangalore.

Now, if we pass this complaint through a Named Entity Recognition API, it pulls out the entities Bangalore (location) and Fitbit (product). This can then be used to categorize the complaint and assign it to the relevant department within the organization that should be handling it.
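As a quick, hedged illustration (not part of the original answer), spaCy's small English model can extract such entities; the example sentence is made up, and it assumes the en_core_web_sm model has been installed.

```python
import spacy

# Assumes the small English model has been installed via:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("My Fitbit stopped syncing two days after I bought it in Bangalore.")
for ent in doc.ents:
    # Typically tags "Bangalore" as GPE (a geopolitical entity) and "two days" as DATE.
    print(ent.text, ent.label_)
```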

Q3. What is perplexity?


Answer:
Perplexity: It is a measurement of how well a probability model predicts a sample. In the context of NLP, perplexity is one way to evaluate language models: for a test set of N words, the perplexity is the inverse probability of the test set normalized by the number of words, i.e., the exponential of the average negative log-probability per word.
The term perplexity has three closely related meanings. It is a measure of how easy a probability distribution is to predict. It is a measure of how variable a prediction model is. And it is a measure of prediction error. The third meaning is calculated slightly differently, but all three share the same fundamental idea.
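For concreteness, here is a small sketch (not from the original) that computes the perplexity of a toy unigram model on a test sentence; the word probabilities are invented purely for illustration.

```python
import math

# Toy unigram probabilities (assumed for illustration only).
unigram_prob = {"the": 0.2, "cat": 0.1, "sat": 0.05, "on": 0.1, "mat": 0.05}

test_words = ["the", "cat", "sat", "on", "the", "mat"]

# Perplexity = exp( -(1/N) * sum(log p(w_i)) ); lower is better.
avg_neg_log_prob = -sum(math.log(unigram_prob[w]) for w in test_words) / len(test_words)
print(math.exp(avg_neg_log_prob))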

Q4. What is the language model?
Answer:
Language Modelling (LM): It is one of the essential parts of modern NLP. There are many applications of language modelling, such as machine translation, spell correction, speech recognition, summarization, question answering, sentiment analysis, etc. Each of these tasks requires a language model. The language model is needed to represent the text in a form understandable from the machine's point of view.
A statistical language model is a probability distribution over a sequence of words. Given such a sequence, say of length m, it assigns a probability to the whole sequence.
It provides context to distinguish between words and phrases that sound similar. For example, in American English, the phrases "wreck a nice beach" and "recognize speech" sound alike but mean different things.
Data sparsity is a significant problem in building language models: most possible word sequences are never observed in training. One solution is to assume that the probability of a word depends only on the previous n words. This is called an n-gram model (or a unigram model when n = 1). The unigram model is also known as the bag-of-words model.
How does this language model help in NLP tasks?
The probabilities returned by a language model are most useful for comparing the likelihood that different sentences are "good sentences." This is useful in many practical tasks, for example:
Spell checking: You observe a word that is not identified as a known word in a sentence. Using the edit-distance algorithm, we find the closest known words to the unknown word; these are the candidate corrections. For example, we observe the word "wurd" in the context of the sentence "I like to write this wurd." The candidate corrections are ["word", "weird", "wind"]. How can we select among these candidates the most likely correction for the suspected error "wurd"? The language model scores each candidate sentence and picks the most probable one.
Automatic speech recognition: we receive as input a stream of phonemes; a first model predicts candidate words for sub-sequences of the stream of phonemes; the language model helps rank the most likely sequence of words compatible with the candidate words produced by the acoustic model.
Machine translation: each word from the source language is mapped to multiple candidate words in the target language; the language model in the target language can rank the most likely sequence of candidate target words.
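A minimal sketch (not from the original) of a bigram language model estimated by counting, used to score the spell-check candidates above; the tiny training corpus is invented for illustration.

```python
from collections import Counter

corpus = "i like to write this word . i like to read this word .".split()

# Count bigrams and unigrams to estimate P(w_i | w_{i-1}) = count(w_{i-1}, w_i) / count(w_{i-1}).
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def sentence_prob(words, smoothing=1e-6):
    prob = 1.0
    for prev, cur in zip(words, words[1:]):
        prob *= (bigrams[(prev, cur)] + smoothing) / (unigrams[prev] + smoothing * len(unigrams))
    return prob

candidates = ["word", "weird", "wind"]
scores = {c: sentence_prob(f"i like to write this {c}".split()) for c in candidates}
print(max(scores, key=scores.get))   # "word" gets the highest probability
```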

Q5. What is Word Embedding?
Answer:
A word embedding is a learned representation for text in which words that have similar meanings have similar representations.
It is basically a form of word representation that bridges the human understanding of language and that of a machine. Word embeddings are distributed representations of text in an n-dimensional space, and they are essential for solving most NLP problems.
Another point worth considering is how we obtain word embeddings, as no two sets of word embeddings are identical. Word embeddings aren't random; they are developed by training a neural network. A powerful recent word embedding model comes from Google and is named Word2Vec; it is trained by predicting words that appear next to other words in a language. For example, for the word "cat", the neural network would predict words like "kitten" and "feline." This intuition that related words occur "near" each other allows us to place them in a vector space.
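A quick sketch (not from the original) of training Word2Vec with Gensim on a toy corpus; the sentences and hyperparameters are placeholders chosen only for illustration.

```python
from gensim.models import Word2Vec

sentences = [
    ["the", "cat", "chased", "the", "mouse"],
    ["the", "kitten", "played", "with", "the", "cat"],
    ["dogs", "and", "cats", "are", "common", "pets"],
]

# Skip-gram (sg=1) model with 50-dimensional vectors and a context window of 2.
model = Word2Vec(sentences, vector_size=50, window=2, sg=1, min_count=1, epochs=100)

print(model.wv.most_similar("cat", topn=3))   # words whose vectors lie closest to "cat"
```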

Q6. Do you have an idea about fastText?


Answer:
fastText: It is another word embedding method and an extension of the word2vec model. Instead of learning vectors for words directly, it represents each word as a bag of character n-grams. So, for example, take the word "artificial" with n=3; the fastText representation of this word is <ar, art, rti, tif, ifi, fic, ici, ial, al>, where the angular brackets indicate the beginning and end of the word.
This helps capture the meaning of shorter words and allows the embeddings to understand prefixes and suffixes. Once the word has been represented using character n-grams, a skip-gram model is trained to learn the embeddings. This model is considered to be a bag-of-words model with a sliding window over a word, because no internal structure of the word is taken into account: as long as the characters are within the window, the order of the n-grams doesn't matter.
fastText works well with rare words. Even if a word wasn't seen during training, it can be broken down into n-grams to get its embeddings.
Word2vec and GloVe both fail to provide any vector representation for words that are not in the model dictionary. This is a huge advantage of the fastText method.
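A brief Gensim-based sketch (not from the original) showing that FastText can return a vector even for an out-of-vocabulary word; the corpus and hyperparameters are placeholders.

```python
from gensim.models import FastText

sentences = [
    ["artificial", "intelligence", "is", "fascinating"],
    ["machine", "learning", "powers", "artificial", "agents"],
]

# Character n-grams between 3 and 6 characters, as in the usual fastText setup.
model = FastText(sentences, vector_size=50, window=3, min_count=1, min_n=3, max_n=6, epochs=50)

# "artificially" was never seen in training, but a vector is built from its character n-grams.
print(model.wv["artificially"][:5])
```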

Q7. What is GloVe?


Answer:
GloVe (Global Vectors) is a model for word representation. GloVe is an unsupervised learning algorithm developed at Stanford for obtaining word embeddings by aggregating a global word-word co-occurrence matrix from a corpus. The resulting embeddings show interesting linear substructures of the words in vector space.
The GloVe model produces a vector space with meaningful substructure, as evidenced by its performance of 75% on a word analogy task. It also outperforms related models on similarity tasks and named entity recognition.
How does GloVe find meaning in statistics?

GloVe aims to achieve two goals:

 (1) Create word vectors that capture meaning in vector space

 (2) Take advantage of global count statistics instead of only local information

Unlike word2vec, which learns by streaming over sentences, GloVe learns from a co-occurrence matrix and trains word vectors so that their differences predict co-occurrence ratios.
GloVe weights the loss based on word frequency.
Somewhat surprisingly, word2vec and GloVe turn out to be remarkably similar, despite starting from entirely different starting points.
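As a hedged illustration (not from the original), pre-trained GloVe vectors can be loaded through Gensim's downloader and used for the classic analogy test; the model name below refers to the 100-dimensional Wikipedia/Gigaword vectors distributed via gensim-data, which are downloaded on first use.

```python
import gensim.downloader as api

# Downloads the pre-trained 100-d GloVe vectors on first use.
glove = api.load("glove-wiki-gigaword-100")

# Linear substructure: vector("king") - vector("man") + vector("woman") ~ vector("queen").
print(glove.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```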

Q8. Explain Gensim?


Answer:
Gensim: It is billed as a Natural Language Processing package that does 'Topic Modeling for Humans'. But it is practically much more than that.
If you are unfamiliar with topic modeling, it is a technique for extracting the underlying topics from large volumes of text. Gensim provides algorithms like LDA and LSI (which we have already seen in previous interview questions) and the necessary sophistication to build high-quality topic models.
It is an excellent library for processing texts, working with word vector models (such as FastText, Word2Vec, etc.) and building topic models. Another significant advantage of Gensim is that it lets us handle large text files without having to load the entire file into memory.
In other words, it is an open-source library for unsupervised topic modeling and natural language processing, using modern statistical machine learning.
Gensim is implemented in Python and Cython. It is designed to handle extensive text collections using data streaming and incremental online algorithms, which differentiates it from most other machine learning packages that target only in-memory processing.
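A short sketch (not from the original) of building an LDA topic model with Gensim on a toy corpus; the documents and topic count are placeholders.

```python
from gensim import corpora
from gensim.models import LdaModel

docs = [
    ["machine", "learning", "model", "training", "data"],
    ["neural", "network", "deep", "learning", "layers"],
    ["stock", "market", "price", "trading", "investment"],
    ["bank", "interest", "rate", "investment", "finance"],
]

dictionary = corpora.Dictionary(docs)                 # maps each token to an integer id
corpus = [dictionary.doc2bow(doc) for doc in docs]    # bag-of-words representation

lda = LdaModel(corpus, num_topics=2, id2word=dictionary, passes=20, random_state=0)
for topic_id, words in lda.print_topics():
    print(topic_id, words)                            # top words per discovered topic
```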

Q9. What is Encoder-Decoder Architecture?
Answer:

The encoder-decoder architecture consists of two main parts:

 Encoder:
The encoder takes the input sequence and processes it, then passes the final state of its recurrent layer as the initial state of the first recurrent layer of the decoder.

 Decoder:
The decoder takes the final state of the encoder's last recurrent layer and uses it as the initial state of its own first recurrent layer; the decoder's input is the target sequence we want to produce (for example, the French sentences in an English-to-French translation system).
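Below is a minimal Keras sketch (not from the original) of this idea for sequence-to-sequence translation: the encoder LSTM's final states initialize the decoder LSTM, and the decoder is fed the target sequence during training (teacher forcing). Vocabulary sizes and dimensions are placeholders.

```python
from tensorflow.keras import layers, Model

src_vocab, tgt_vocab, latent_dim = 5000, 6000, 256

# Encoder: consume the source sequence and keep only the final LSTM states.
enc_inputs = layers.Input(shape=(None,))
enc_emb = layers.Embedding(src_vocab, latent_dim)(enc_inputs)
_, state_h, state_c = layers.LSTM(latent_dim, return_state=True)(enc_emb)

# Decoder: initialized with the encoder states, fed the target sequence (teacher forcing).
dec_inputs = layers.Input(shape=(None,))
dec_emb = layers.Embedding(tgt_vocab, latent_dim)(dec_inputs)
dec_outputs, _, _ = layers.LSTM(latent_dim, return_sequences=True, return_state=True)(
    dec_emb, initial_state=[state_h, state_c]
)
dec_outputs = layers.Dense(tgt_vocab, activation="softmax")(dec_outputs)

model = Model([enc_inputs, dec_inputs], dec_outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```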


Q10. What is Context2Vec?

Answer:
Assume you have a sentence like "I can't find May." The word "May" could refer to a month's name or a person's name. You use the words surrounding it (the context) to help determine the most suitable option. This problem is essentially the Word Sense Disambiguation task, in which you investigate the actual semantics of a word based on several semantic and linguistic techniques.
The Context2Vec idea is taken from the original CBOW Word2Vec model, but instead of relying on averaging the embeddings of the context words, it relies on a much more complex parametric model based on one layer of Bi-LSTM. (Figure 1 of the paper shows the architecture of the CBOW model.)

Context2Vec applies the same concept of windowing, but instead of using a simple average function, it uses three stages to learn a complex parametric network:
 A Bi-LSTM layer that produces left-to-right and right-to-left representations
 A feedforward network that takes the concatenated hidden representations and produces a hidden representation by learning the network parameters
 Finally, an objective function applied to the network output

The authors use the Word2Vec negative-sampling idea to get better performance while calculating the loss value.
The paper also shows samples of the closest words retrieved for a given context.
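A rough Keras sketch (not from the paper's code) of the context encoder described above: a Bi-LSTM over the context words followed by a small feedforward network that outputs a context vector the same size as the word embeddings; the negative-sampling objective would then score this vector against target-word embeddings. All sizes are placeholders.

```python
from tensorflow.keras import layers, Model

vocab_size, embed_dim, hidden_dim = 10000, 128, 256

# Input: indices of the context words around the (removed) target word.
context_in = layers.Input(shape=(None,), dtype="int32")
emb = layers.Embedding(vocab_size, embed_dim, mask_zero=True)(context_in)

# Stage 1: Bi-LSTM gives concatenated left-to-right and right-to-left representations.
bi = layers.Bidirectional(layers.LSTM(hidden_dim))(emb)

# Stage 2: feedforward network on the concatenated hidden representation.
hidden = layers.Dense(hidden_dim, activation="relu")(bi)
context_vec = layers.Dense(embed_dim)(hidden)   # same size as the target-word embeddings

encoder = Model(context_in, context_vec)
encoder.summary()
# Stage 3 (not shown): a negative-sampling objective that scores context_vec against
# the true target word's embedding and against randomly sampled "negative" words.
```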


# DAY 20
Q1. Do you have any idea about Event2Mind in NLP?
Answer:
Yes, it is based on an NLP research paper that studies common-sense inference from sentences.
Event2Mind: Commonsense Inference on Events, Intents, and Reactions
The study of "commonsense reasoning" in NLP deals with teaching computers how to gain and employ common-sense knowledge. NLP systems require common sense to adapt quickly and understand humans, as we talk to each other in a natural environment.
This paper proposes a new task to teach systems commonsense reasoning: given an event described in a short "event phrase" (e.g., "PersonX drinks coffee in the morning"), the researchers teach a system to reason about the likely intents ("PersonX wants to stay awake") and reactions ("PersonX feels alert") of the event's participants.

Understanding a narrative requires common-sense reasoning about the mental states of people in relation to events. For example, if "Robert is dragging his feet at work," the pragmatic implication about Robert's intent is that "Robert wants to avoid doing things" (see the figure in the paper). You can also infer that Robert's emotional reaction might be feeling "bored" or "lazy." Furthermore, while not explicitly mentioned, you can assume that people other than Robert are affected by the situation, and these people are likely to feel "impatient" or "frustrated."

This type of pragmatic inference can likely be useful for a wide range of NLP applications that require
accurate anticipation of people’s intents and emotional reactions, even when they are not expressly
mentioned. For example, an ideal dialogue system should react in empathetic ways by reasoning
about the human user’s mental state based on the events the user has experienced, without the user
explicitly stating how they are feeling. Furthermore, advertisement systems on social media should
be able to reason about the emotional reactions of people after events such as mass shootings and
remove ads for guns, which might increase social distress. Also, the pragmatic inference is a
necessary step toward automatic narrative understanding and generation. However, this type of
commonsense social reasoning goes far beyond the widely studied entailment tasks and thus falls
outside the scope of existing benchmarks.

Q2. What is SWAG in NLP?


Answer:
SWAG stands for Situations With Adversarial Generations; it is a dataset consisting of 113k multiple-choice questions about a rich spectrum of grounded situations.
SWAG: A Large-Scale Adversarial Dataset for Grounded Commonsense Inference
According to the NLP research paper on SWAG: given a partial description like "he opened the hood of the car," humans can reason about the situation and anticipate what might come next ("then, he examined the engine"). The paper introduces the task of grounded commonsense inference, unifying natural language inference (NLI) and common-sense reasoning.
The authors present SWAG, a dataset with 113k multiple-choice questions about a rich spectrum of grounded situations. To address recurring challenges of annotation artifacts and human biases found in many existing datasets, they propose AF (Adversarial Filtering), a novel procedure that constructs a de-biased dataset by iteratively training an ensemble of stylistic classifiers and using them to filter the data. To account for the aggressive adversarial filtering, they use state-of-the-art language models to massively oversample a diverse set of potential counterfactuals. Empirical results show that while humans can solve the resulting inference problems with high accuracy (88%), various competitive models struggle on the task. The paper provides a comprehensive analysis that indicates significant opportunities for future research.
When we read a story, we bring to it a large body of implied knowledge about the physical world. For instance, given the context "on stage, a man takes a seat at the piano," we can easily infer what the situation might look like: a man is giving a piano performance, with a crowd watching him. We can furthermore infer his likely next action: he will most likely set his fingers on the piano keys and start playing.

This type of natural language inference (NLI) requires common-sense reasoning, substantially broadening the scope of prior work that focused primarily on linguistic entailment. Whereas the dominant entailment paradigm asks whether two natural language sentences (the 'premise' and the 'hypothesis') describe the same set of possible worlds, here the focus is on whether a (multiple-choice) ending describes a possible (future) world that can follow from the situation described in the premise, even when it is not strictly entailed. Making such inferences necessitates a rich understanding of everyday physical situations, including object affordances and frame semantics.

Q3. What is the Pix2Pix network?
Answer:
Pix2Pix network: It is a conditional GAN (cGAN) that learns a mapping from an input image to an output image.

Image-to-image translation is the process of translating one representation of an image into another representation.
Image-to-image translation is another example of a task that GANs (Generative Adversarial Networks) are ideally suited for. These are tasks in which it is nearly impossible to hard-code a loss function. Most studies on GANs are concerned with novel image synthesis, translating a random vector z into an image. Image-to-image translation instead converts one image to another, for example the edges of a bag into a photo image of the bag.

In Pix2Pix, a dual objective function combines an adversarial loss with an L1 loss.

A naive way to do image-to-image translation would be to discard the adversarial framework altogether: a source image would simply be passed through a parametric function, and the difference between the resulting image and the ground-truth output would be used to update the weights of the network. However, designing this loss function with standard distance measures such as L1 and L2 fails to capture many of the essential distinctive characteristics between these images. The authors do, however, find value in the L1 loss function as a weighted sidekick to the adversarial loss function.
The conditional-adversarial loss (generator versus discriminator) is formulated as:

L_cGAN(G, D) = E_{x,y}[log D(x, y)] + E_{x,z}[log(1 - D(x, G(x, z)))]

The L1 loss function mentioned previously is:

L_L1(G) = E_{x,y,z}[ || y - G(x, z) ||_1 ]

Combining these functions results in the final objective:

G* = arg min_G max_D L_cGAN(G, D) + λ L_L1(G)

In the experiments, the authors report that they found the most success with the lambda (λ) parameter equal to 100.
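A small TensorFlow sketch (not from the paper's code) of the combined generator objective with λ = 100; the variable names are illustrative.

```python
import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)
LAMBDA = 100  # weighting of the L1 term reported in the paper

def pix2pix_generator_loss(disc_fake_logits, generated_image, target_image):
    # Adversarial term: the generator wants the discriminator to score its output as real (label 1).
    adversarial = bce(tf.ones_like(disc_fake_logits), disc_fake_logits)
    # L1 term: pixel-wise closeness of the generated image to the ground truth.
    l1 = tf.reduce_mean(tf.abs(target_image - generated_image))
    return adversarial + LAMBDA * l1
```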

Q4. Explain UNet Architecture?


Answer:
U-Net architecture: It is built upon the Fully Convolutional Network (FCN) and modified in a way that yields better segmentation in medical imaging. Compared to FCN-8, the two main differences are that (a) U-Net is symmetric and (b) the skip connections between the downsampling path and the upsampling path apply a concatenation operator instead of a sum. These skip connections provide local information to the global information while upsampling. Because of its symmetry, the network has a large number of feature maps in the upsampling path, which allows information transfer. By comparison, the underlying FCN architecture only had (number of classes) feature maps in its upsampling path.

How does it work?

The U-Net architecture looks like a 'U,' which justifies its name. It consists of three sections: the contraction, the bottleneck, and the expansion section. The contraction section is made of many contraction blocks. Each block takes an input and applies two 3x3 convolution layers, followed by a 2x2 max pooling. The number of feature maps (kernels) doubles after each block so that the architecture can learn complex structures. The bottommost layer mediates between the contraction section and the expansion section; it uses two 3x3 CNN layers followed by a 2x2 up-convolution layer.
The heart of this architecture lies in the expansion section. Similar to the contraction section, it also has several expansion blocks. Each block passes the input through two 3x3 CNN layers followed by a 2x2 upsampling layer. After each block, the number of feature maps used by the convolutional layers is halved to maintain symmetry. However, each time, the input is also appended with the feature maps of the corresponding contraction block. This ensures that the features learned while contracting the image are used to reconstruct it. The number of expansion blocks is the same as the number of contraction blocks. After that, the resultant mapping passes through another 3x3 CNN layer with the number of feature maps equal to the number of segments desired.
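A compact Keras sketch (not from the original) of a small U-Net with three contraction/expansion blocks, 3x3 convolutions, 2x2 pooling and up-convolutions, and concatenation skip connections; the input shape, depth, and filter counts are placeholders.

```python
from tensorflow.keras import layers, Model

def conv_block(x, filters):
    # Two 3x3 convolutions, as in each contraction/expansion block.
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    return x

def build_unet(input_shape=(128, 128, 1), n_classes=2, base_filters=64):
    inputs = layers.Input(input_shape)
    skips, x = [], inputs
    # Contraction path: conv block + 2x2 max pooling, doubling filters each step.
    for i in range(3):
        x = conv_block(x, base_filters * 2 ** i)
        skips.append(x)
        x = layers.MaxPooling2D(2)(x)
    x = conv_block(x, base_filters * 8)               # bottleneck
    # Expansion path: 2x2 up-convolution + concatenation with the matching skip connection.
    for i in reversed(range(3)):
        x = layers.Conv2DTranspose(base_filters * 2 ** i, 2, strides=2, padding="same")(x)
        x = layers.Concatenate()([x, skips[i]])
        x = conv_block(x, base_filters * 2 ** i)
    outputs = layers.Conv2D(n_classes, 1, activation="softmax")(x)  # per-pixel class scores
    return Model(inputs, outputs)

model = build_unet()
model.summary()
```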

Q5. What is pair2vec?
Answer:
This paper pretrains word-pair representations by maximizing the pointwise mutual information of pairs of words with their context. This encourages a model to learn more meaningful representations of word pairs than more general objectives, like language modeling, would. The pre-trained representations are useful in tasks like SQuAD and MultiNLI that require cross-sentence inference. You can expect to see more pretraining tasks that capture properties particularly suited to specific downstream tasks and are complementary to more general-purpose tasks like language modeling.
Reasoning about implied relationships between pairs of words is crucial for cross-sentence inference problems like question answering (QA) and natural language inference (NLI). In NLI, for example, given a premise such as "golf is prohibitively expensive," inferring that the hypothesis "golf is a cheap pastime" is a contradiction requires one to know that expensive and cheap are antonyms. Recent work has shown that current models, which rely heavily on unsupervised single-word embeddings, struggle to grasp such relationships. The pair2vec paper shows that these relationships can be learned with pair2vec (word-pair vectors), which are trained, unsupervised, at a huge scale, and which significantly improve performance when added to existing cross-sentence attention mechanisms.

Unlike single-word representations, which are typically trained by modeling the co-occurrence of a target word x with its context c, these word-pair representations are learned by modeling the three-way co-occurrence between two words (x, y) and the context c that ties them together, as illustrated in the table in the paper. While a similar training signal has been used to learn models for ontology construction and knowledge-base completion, this paper shows, for the first time, that large-scale learning of pairwise embeddings can be used to directly improve the performance of neural cross-sentence inference models.

Q6. What is Meta-Learning?


Answer:
Meta-learning: It is an exciting area of research that tackles the problem of learning to learn. The goal is to design models that can learn new skills or adapt quickly to new environments with a minimum of training examples. Not only does this dramatically speed up and improve the design of ML (machine learning) pipelines or neural architectures, it also allows us to replace hand-engineered algorithms with novel approaches learned in a data-driven way.
The goal of meta-learning is to train a model on a variety of learning tasks, such that it can solve new learning tasks with only a small number of training samples. It tends to focus on finding model-agnostic solutions, whereas multi-task learning remains deeply tied to model architecture.
Thus, meta-level AI algorithms make AI systems:
· Learn faster
· Generalizable to many tasks
· Adaptable to environmental changes, as in Reinforcement Learning
One can aim to solve many tasks with a single model in this way, but meta-learning should not be confused with one-shot learning.

Q7. What is ALiPy(Active Learning in Python)?


Answer:
Supervised ML methods usually require a large set of labeled examples for model training. However, in many real applications, there is ample unlabeled data but limited labeled data, and the acquisition of labels is costly. Active learning (AL) reduces labeling costs by iteratively selecting the most valuable data points and querying their labels from the annotator.
Active learning is the leading approach to learning with limited labeled data. It tries to reduce human effort on data annotation by actively querying the most informative examples.
ALiPy is a Python toolbox for active learning (AL) that is suitable for various users. On the one hand, the entire process of active learning has been well implemented: users can efficiently perform experiments with a few lines of code, covering the entire process from data pre-processing to result visualization. More than 20 commonly used active learning (AL) methods have been implemented in the toolbox, providing users with many choices.
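To make the query loop concrete, here is a generic uncertainty-sampling sketch with scikit-learn (illustrative only; this is not the ALiPy API, and the dataset and budget are placeholders).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
labeled = list(range(10))                      # start with a few labeled points
unlabeled = [i for i in range(len(X)) if i not in labeled]

model = LogisticRegression(max_iter=1000)
for _ in range(20):                            # query budget of 20 labels
    model.fit(X[labeled], y[labeled])
    proba = model.predict_proba(X[unlabeled])
    uncertainty = 1 - proba.max(axis=1)        # least-confident sampling
    query = unlabeled[int(np.argmax(uncertainty))]
    labeled.append(query)                      # the "annotator" provides y[query]
    unlabeled.remove(query)
```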

Q8. What is the Lingvo model?


Answer:
Lingvo: It is a TensorFlow framework offering a complete solution for collaborative deep learning research, with a particular focus on sequence-to-sequence models. These models are composed of modular building blocks that are flexible and easily extensible, and experiment configurations are centralized and highly customizable. Distributed training and quantized inference are supported directly within the framework, and it contains existing implementations of a large number of utilities, helper functions, and the newest research ideas. The framework has been used collaboratively by dozens of researchers in more than 20 papers over the last two years.
Why does this Lingvo research matter?
The process of establishing a new deep learning (DL) system is quite complicated. It involves exploring a large space of design choices involving training data, data-processing logic, the size and type of model components, the optimization procedures, and the path to deployment. This complexity requires a framework that quickly facilitates the production of new combinations and modifications from existing code and experiments, and makes it easy to share these new results. It is a workspace ready to be used by deep learning researchers or developers. Nguyen says: "We have researchers working on state-of-the-art (SOTA) products and research algorithms, basing their research off of the same codebase. This ensures that code is battle-tested. Our collective experience is encoded in means of good defaults and primitives that we have found useful over these tasks."

Q9. What is Dropout in Neural Networks?
Answer:
The term "dropout" refers to dropping out units (both hidden and visible) in a neural network.
At each training stage, individual nodes are either dropped out of the net with probability 1-p or kept with probability p, so that a reduced network is left; incoming and outgoing edges to a dropped-out node are also removed.
Why do we need Dropout?
The answer to this question is "to prevent over-fitting."
A fully connected layer occupies most of the parameters, and hence neurons develop co-dependency amongst each other during training, which curbs the individual power of each neuron and leads to over-fitting of the training data.
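A brief Keras sketch (not from the original) of a fully connected classifier with dropout between the dense layers; note that Keras' Dropout(rate) drops each unit with probability rate (i.e., 1-p in the notation above). The layer sizes are placeholders.

```python
from tensorflow.keras import layers, models

# A small fully connected network with dropout after each hidden dense layer.
model = models.Sequential([
    layers.Dense(256, activation="relu", input_shape=(784,)),
    layers.Dropout(0.5),           # each unit is dropped with probability 0.5 during training
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.summary()
```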

Q10. What is GAN?
Answer:
A generative adversarial network (GAN): It is a class of machine learning systems invented by Ian Goodfellow and his colleagues in 2014. Two neural networks contest with each other in a game (in the sense of game theory, often but not always in the form of a zero-sum game). Given a training set, this technique learns to generate new data with the same statistics as the training set. For example, a GAN trained on photographs can produce original pictures that look at least superficially authentic to human observers, having many realistic characteristics. Though originally proposed as a form of generative model for unsupervised learning, GANs have also proven useful for semi-supervised learning,[2] fully supervised learning, and reinforcement learning.
Example of GAN:

 Given an image of a face, the network can construct an image that represents how that person could look when they are old.

Generative Adversarial Networks take a game-theoretic approach, unlike a conventional neural network. The network learns to generate from a training distribution through a two-player game. The two entities are the Generator and the Discriminator, and these two adversaries are in constant battle throughout the training process.
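A minimal TensorFlow sketch (not from the original) of the two adversarial loss functions that drive this two-player game; the variable names are illustrative.

```python
import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)

def discriminator_loss(real_logits, fake_logits):
    # The discriminator tries to label real samples 1 and generated samples 0.
    return bce(tf.ones_like(real_logits), real_logits) + bce(tf.zeros_like(fake_logits), fake_logits)

def generator_loss(fake_logits):
    # The generator tries to fool the discriminator into labelling its fakes as real.
    return bce(tf.ones_like(fake_logits), fake_logits)
```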
# DAY 21

Q1. Explain Grad-CAM architecture?
Answer:
According to the research paper: "We propose a technique for making Convolutional Neural Network (CNN)-based models more transparent by visualizing input regions that are 'important' for predictions – producing visual explanations. Our approach is called Gradient-weighted Class Activation Mapping (Grad-CAM), which uses class-specific gradient information to localize the crucial regions. These localizations are combined with existing pixel-space visualizations to create a new high-resolution, class-discriminative visualization called Guided Grad-CAM. These methods help to better understand CNN-based models, including image captioning and visual question answering (VQA) models. We evaluate our visual explanations by measuring their ability to discriminate between classes, their ability to inspire trust in humans, and their correlation with occlusion maps. Grad-CAM provides a new way to understand CNN-based models."
In short, it is a technique for making CNN (Convolutional Neural Network)-based models more transparent by visualizing the regions of the input that are "important" for predictions from these models – i.e., visual explanations.

This visualization is both high-resolution (when the class of interest is 'tiger cat,' it identifies important 'tiger cat' features like stripes, pointy ears, and eyes) and class-discriminative (it highlights the 'tiger cat' but not the 'boxer (dog)').
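Here is a hedged sketch (not from the paper's code) of the core Grad-CAM computation for a Keras model: gradients of the class score with respect to the last convolutional feature maps are global-average-pooled into channel weights, the feature maps are combined with those weights, and a ReLU keeps only positive evidence. The model and layer name passed in are assumptions.

```python
import numpy as np
import tensorflow as tf

def grad_cam(model, image, last_conv_layer_name, class_index):
    # Model that maps the input to (last conv feature maps, predictions).
    grad_model = tf.keras.Model(
        model.inputs,
        [model.get_layer(last_conv_layer_name).output, model.output],
    )
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(image[np.newaxis, ...])
        class_score = preds[:, class_index]
    grads = tape.gradient(class_score, conv_out)                       # d(score) / d(feature maps)
    weights = tf.reduce_mean(grads, axis=(1, 2))                       # global-average-pool the gradients
    cam = tf.reduce_sum(conv_out * weights[:, tf.newaxis, tf.newaxis, :], axis=-1)
    cam = tf.nn.relu(cam)[0]                                           # keep only positive influence
    return (cam / (tf.reduce_max(cam) + 1e-8)).numpy()                 # normalised heatmap
```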

Q2. Explain the SqueezeNet architecture?
Answer:
Nowadays, technology is at its peak; self-driving cars and IoT are going to be household talk in the next few years. Everything is controlled remotely – in self-driving cars, for example, the system needs to communicate with servers regularly. Accordingly, if we have a model that is small in size, we can quickly deploy it in the cloud. That is why we need an architecture that is smaller in size and still achieves the same level of accuracy that other architectures achieve.
Its Architecture

 Replace 3x3 filters with 1x1 filters – We plan to use the maximum number of 1x1 filters, as using a 1x1 filter rather than a 3x3 filter reduces the number of parameters by 9x. We might think that replacing 3x3 filters with 1x1 filters would perform badly, as the 1x1 filter has less information to work on, but this is not the case. A 3x3 filter captures the spatial information of pixels close to each other, while the 1x1 filter zeros in on a single pixel and captures features amongst its channels.

 Decrease the number of input channels to 3x3 filters – To maintain a small total number of parameters in a CNN, it is crucial not only to decrease the number of 3x3 filters, but also to decrease the number of input channels to those 3x3 filters. We reduce the number of input channels to 3x3 filters using squeeze layers. The authors of this paper use a term called the "fire module," which consists of a squeeze layer and an expand layer. In the squeeze layer we use 1x1 filters, while in the expand layer we use a combination of 3x3 filters and 1x1 filters. The authors limit the number of inputs to the 3x3 filters to reduce the number of parameters in the layer.

 Downsample late in the network so that the convolution layers have large activation maps – Having contracted the sheer number of parameters we are working with, the question is how the model gets the most out of the remaining parameters. The authors downsample the feature maps in later layers, and this increases accuracy. This is in stark contrast to networks like VGG, where a large feature map is taken early and then gets smaller as the network approaches the end. This different approach is interesting, and they cite the paper by K. He and H. Sun that similarly applies delayed downsampling, which leads to higher classification accuracy.
This architecture consists of fire modules, which enable it to bring down the number of parameters.

Another surprising thing is the lack of fully connected (dense) layers at the end, which one would see in a typical CNN architecture. The dense layers at the end of a typical CNN learn the relationships between the high-level features and the classes the network is trying to identify: fully connected layers are what learn that noses and ears make up a face, and that wheels and lights indicate cars. However, in this architecture, that extra learning step seems to be embedded within the transformations between the various fire modules.

SqueezeNet can accomplish an accuracy nearly equal to AlexNet with 50x fewer parameters. The most impressive part is that if we apply Deep Compression to the already smaller model, it can reduce the size of the SqueezeNet model to about 510x smaller than AlexNet.
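A short Keras sketch (not from the paper's code) of the fire module described above: a 1x1 squeeze layer followed by parallel 1x1 and 3x3 expand layers whose outputs are concatenated along the channel axis; the filter counts are placeholders.

```python
from tensorflow.keras import layers

def fire_module(x, squeeze_filters=16, expand_filters=64):
    # Squeeze: 1x1 convolutions limit the number of input channels fed to the 3x3 filters.
    s = layers.Conv2D(squeeze_filters, 1, activation="relu", padding="same")(x)
    # Expand: a mix of 1x1 and 3x3 convolutions, concatenated along the channel axis.
    e1 = layers.Conv2D(expand_filters, 1, activation="relu", padding="same")(s)
    e3 = layers.Conv2D(expand_filters, 3, activation="relu", padding="same")(s)
    return layers.Concatenate()([e1, e3])
```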

Q3. ZFNet architecture
Answer:

The architecture of the network is an optimized version of the previous year's winner, AlexNet. The authors spent some time finding the bottlenecks of AlexNet and removing them, achieving superior performance.

(a): First-layer ZFNet features without feature-scale clipping. (b): First-layer features from AlexNet. Note that there are a lot of dead features – ones where the network did not learn any patterns. (c): First-layer features for ZFNet. Note that there are only a few dead features. (d): Second-layer features from AlexNet. The grid-like patterns are so-called aliasing artifacts; they appear when the receptive fields of convolutional neurons overlap and neighboring neurons learn similar structures. (e): Second-layer features for ZFNet. Note that there are no aliasing artifacts. Source: original paper.

In particular, they reduced the filter size in the first convolutional layer from 11x11 to 7x7, which resulted in fewer dead features learned in the first layer (see the image above for an example). A dead feature is a situation where a convolutional kernel fails to learn any significant representation; visually, it looks like a monotonic single-color image where all the values are close to each other.
In addition to changing the filter size, the authors of ZFNet doubled the number of filters in all convolutional layers and the number of neurons in the fully connected layers compared to AlexNet. In AlexNet there were 48-128-192-192-128-2048-2048 kernels/neurons, and in ZFNet all of these were doubled to 96-256-384-384-256-4096-4096. This modification allowed the network to increase the complexity of its internal representations and, as a result, decrease the error rate from 15.4% for the previous year's winner to 14.8%, becoming the winner in 2013.

Q4. What is NAS (Neural Architecture Search)?


Answer:
Developing neural network models often requires significant architecture engineering. We can sometimes get by with transfer learning, but if we want the best possible performance, it is usually best to design our own network. This requires specialized skills and is challenging in general; we may not even know the limits of the current state-of-the-art (SOTA) techniques. It is a lot of trial and error, and the experimentation itself is time-consuming and expensive.
This is where NAS (Neural Architecture Search) comes in. NAS is an algorithm that searches for the best neural network architecture. Most NAS algorithms work in the following way: start by defining the set of "building blocks" that can be used for the network. For example, the state-of-the-art (SOTA) NASNet paper proposes commonly used blocks for an image recognition network.

In the NAS algorithm, a controller Recurrent Neural Network (RNN) samples these building blocks, putting them together to create some end-to-end architecture. The architecture generally follows the same style as state-of-the-art (SOTA) networks, such as DenseNets or ResNets, but uses a much different combination and configuration of blocks.
This new network architecture is then trained to convergence to obtain an accuracy on a held-out validation set. The resulting accuracies are used to update the controller so that the controller will generate better architectures over time, perhaps by selecting better blocks or making better connections. The controller weights are updated with a policy gradient.

It is a reasonably intuitive approach! In simple terms: have an algorithm grab different blocks and put those blocks together to make a network; train and test that network; and, based on the results, adjust the blocks used to make the network and how they are put together.

Q5. What is SENets?


Answer:

SENets stands for Squeeze-and-Excitation Networks. It introduces a building block for CNNs that improves channel interdependencies at almost no computational cost. SE blocks were used in the 2017 ImageNet competition and helped to improve the previous year's result by 25%. Besides this large performance boost, they can be easily added to existing architectures. The idea is this:
Let's add parameters to each channel of a convolutional block so that the network can adaptively adjust the weighting of each feature map.
As simple as it may sound, this is it. So, let's take a closer look at why this works so well.
Why does it work so well?
A CNN uses its convolutional filters to extract hierarchical information from images. Lower layers find small pieces of context like high frequencies or edges, while upper layers can detect faces, text, or other complex geometric shapes. They extract whatever is necessary to solve the task precisely.
All of this works by fusing the spatial and channel information of an image. The different filters first find spatial features in each input channel before adding the information across all available output channels.
All we need to understand for now is that the network weights each of its channels equally when creating output feature maps. SE blocks are all about changing this by adding a content-aware mechanism to weight each channel adaptively. In its most basic form, this could mean adding a single parameter to each channel and giving it a linear scalar indicating how relevant that channel is.
However, the authors push it a little further. First, they get a global understanding of each channel by squeezing the feature maps to a single numeric value. This results in a vector of size n, where n is equal to the number of convolutional channels. Afterwards, this vector is fed through a two-layer neural network, which outputs a vector of the same size. These n values can now be used as weights on the original feature maps, scaling each channel based on its importance.
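A minimal Keras sketch (not from the paper's code) of such a squeeze-and-excitation block; the reduction ratio is a typical choice, not a value taken from this document.

```python
from tensorflow.keras import layers

def se_block(x, reduction=16):
    channels = x.shape[-1]
    s = layers.GlobalAveragePooling2D()(x)                            # squeeze: one value per channel
    s = layers.Dense(channels // reduction, activation="relu")(s)     # two-layer bottleneck MLP
    s = layers.Dense(channels, activation="sigmoid")(s)               # per-channel weights in [0, 1]
    s = layers.Reshape((1, 1, channels))(s)
    return layers.Multiply()([x, s])                                  # excitation: rescale each channel
```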

Q6. Feature Pyramid Network (FPN)


Answer:

The Bottom-Up Pathway
The bottom-up pathway is the feedforward computation of the backbone ConvNet. One pyramid level is defined for each stage. The output of the last layer of each stage will be used as the reference set of feature maps for enriching the top-down pathway through lateral connections.
Top-Down Pathway and Lateral Connections

 The higher-resolution features are upsampled from spatially coarser, but semantically stronger, feature maps from higher pyramid levels. More specifically, the spatial resolution is upsampled by a factor of 2 using nearest-neighbour interpolation for simplicity.
 Each lateral connection merges feature maps of the same spatial size from the bottom-up pathway and the top-down pathway.
 Specifically, the feature maps from the bottom-up pathway undergo 1x1 convolutions to reduce the channel dimensions.
 The feature maps from the bottom-up pathway and the top-down pathway are then merged by element-wise addition.
Prediction in FPN

 Finally, a 3x3 convolution is appended to each merged map to generate the final feature map, which reduces the aliasing effect of upsampling. This final set of feature maps is called {P2, P3, P4, P5}, corresponding to {C2, C3, C4, C5}, which are respectively of the same spatial sizes.
 Because all levels of the pyramid use shared classifiers/regressors, as in a traditional featured image pyramid, the feature dimension at the output is fixed at d = 256. Thus, all extra convolutional layers have 256-channel outputs.
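A small Keras sketch (not from the paper's code) of one FPN merge step: a lateral 1x1 convolution on the bottom-up map, nearest-neighbour upsampling of the coarser top-down map, element-wise addition, and the final 3x3 convolution. It assumes the two inputs differ in spatial size by exactly a factor of 2.

```python
from tensorflow.keras import layers

def fpn_merge(c_bottom_up, p_top_down, out_channels=256):
    # Lateral connection: 1x1 conv reduces the channel dimension of the bottom-up map.
    lateral = layers.Conv2D(out_channels, 1, padding="same")(c_bottom_up)
    # Top-down pathway: upsample the coarser map by a factor of 2 (nearest neighbour).
    upsampled = layers.UpSampling2D(2, interpolation="nearest")(p_top_down)
    merged = layers.Add()([lateral, upsampled])          # element-wise addition
    # 3x3 conv on the merged map to reduce the aliasing effect of upsampling.
    return layers.Conv2D(out_channels, 3, padding="same")(merged)
```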

Q7. DeepID-Net (Def-Pooling Layer)


Answer:
A new def-pooling (deformation constrained pooling) layer is used to model the deformation of object parts with geometric constraints and penalties. That is, besides detecting the whole object directly, it is also important to identify object parts, which can then assist in detecting the whole object.

In the paper's pipeline figure, the steps in black already existed in R-CNN, while the stages in red do not appear in R-CNN.
1. Selective Search

 First, colour similarities, texture similarities, region sizes, and region filling are used for non-object-based segmentation; this produces many small segmented areas, as shown at the bottom left of the figure.
 Then, a bottom-up approach is used in which small segmented areas are merged to form larger segmented areas.
 Thus, about 2,000 region proposals (bounding-box candidates) are generated, as shown in the figure.

2. Box Rejection
R-CNN is used to reject bounding boxes that are most likely to be the background.

3. Pretrain Using Object-Level Annotations

Usually, pretraining is done on image-level annotations. This is not good when an object is too small within the image, because the object should occupy a large area within the bounding box created by selective search.
Thus, pretraining is done on object-level annotations. The deep learning (DL) model can be any model, such as ZFNet, VGGNet, or GoogLeNet.
4. Def-Pooling Layer

For the def-pooling path, the output from conv5 goes through a conv layer, then through the def-pooling layer, and then a max-pooling layer.
In simple terms, the summation of ac multiplied by dc,n is the 5x5 deformation penalty shown in the paper's figure: the penalty for placing an object part away from its assumed central position.
By training DeepID-Net, object parts of the object to be detected will give a high activation value after the def-pooling layer if they are close to their anchor positions. This output is then connected to the 200-class scores to improve them.
5. Context Modeling
In the object detection task in ILSVRC, there are 200 classes. There is also a classification-and-localization task in ILSVRC with 1000 classes, whose contents are more diverse than those of the object detection task. Hence, the 1000-class scores obtained by the classification network are used to refine the 200-class scores.

6. Model Averaging

Multiple models are used to increase accuracy, and the results from all models are averaged. This technique has been used since AlexNet, LeNet, and so on.
7. Bounding Box Regression
Bounding-box regression fine-tunes the bounding-box location; it has been used in R-CNN.

Q8. What is FractalNet Architecture?


Answer:
In 2015, after the invention of ResNet, which won numerous competitions, plenty of researchers worked on how to improve ResNet, for example Pre-Activation ResNet, RiR, RoR, Stochastic Depth, and WRN. Here, in contrast, a non-residual-network approach, FractalNet, is briefly reviewed. While VGGNet starts to degrade when it goes from 16 layers (VGG-16) to 19 layers (VGG-19), FractalNet can go up to 40 layers or even 80 layers.
Architecture

In the paper's figure: a simple fractal expansion (left), recursive stacking of the fractal expansion as one block (middle), and 5 blocks cascaded as FractalNet (right).
For the base case, f1(z) is a single convolutional layer:

f_1(z) = conv(z)

After that, the recursive fractal expansion is:

f_{C+1}(z) = join[ (f_C ∘ f_C)(z), conv(z) ]

i.e., two stacked copies of f_C in one branch, joined with a parallel convolutional layer in the other branch.
Here C is the number of columns, as in the middle of the figure. The number of convolutional layers along the deepest path within the block is 2^(C-1). In this case C = 4, so the number of convolutional layers is 2^3 = 8.
For the join layer (green in the figure), the element-wise mean is computed; it is not concatenation or addition.
With five blocks (B = 5) cascaded as FractalNet, as at the right of the figure, the number of convolutional layers along the deepest path within the whole network is B x 2^(C-1), i.e., 5 x 2^3 = 40 layers.
Between two blocks, 2x2 max pooling is applied to reduce the size of the feature maps. Batch Normalization and ReLU are used after each convolution.
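A recursive Keras sketch (not from the paper's code) of this fractal expansion rule, joining the deep branch and the parallel convolution with an element-wise mean; the filter count and number of columns are placeholders.

```python
from tensorflow.keras import layers

def conv_bn_relu(x, filters):
    x = layers.Conv2D(filters, 3, padding="same")(x)
    x = layers.BatchNormalization()(x)
    return layers.ReLU()(x)

def fractal_block(x, filters, columns):
    # Base case: a single convolutional layer, f_1(z) = conv(z).
    if columns == 1:
        return conv_bn_relu(x, filters)
    # Recursive case: two stacked fractal sub-blocks joined (element-wise mean) with a parallel conv.
    deep = fractal_block(fractal_block(x, filters, columns - 1), filters, columns - 1)
    shallow = conv_bn_relu(x, filters)
    return layers.Average()([deep, shallow])
```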

Q9. What is the SPPNet architecture?


Answer:
SPPNet introduced a new technique in CNNs called Spatial Pyramid Pooling (SPP) at the transition between the convolutional layers and the fully connected layers. This work is from Microsoft.

Conventionally, at the transition between the conv layers and the FC layers, there is one single pooling layer or even no pooling layer. In SPPNet, the suggestion is to have multiple pooling layers with different scales.
In the paper's figure, a 3-level SPP is used. Suppose the conv5 layer has 256 feature maps. Then, at the SPP layer,
1. First, each feature map is pooled to become one value (grey in the figure), forming a 256-d vector.
2. Then, each feature map is pooled to four values (green), forming a 4x256-d vector.
3. Similarly, each feature map is pooled to 16 values (blue), forming a 16x256-d vector.
4. The above three vectors are concatenated to form a single 1-d vector.
5. Finally, this 1-d vector goes into the FC layers as usual.
With SPP, you don't need to crop the image to a fixed size, as in AlexNet, before feeding it into the CNN; any image size can be input.
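A hedged TensorFlow sketch of a 3-level spatial pyramid pooling layer using the window/stride rule from the paper (window = ceil(size/bins), stride = floor(size/bins)); it assumes the spatial dimensions of the feature map are known statically.

```python
import math
import tensorflow as tf

def spatial_pyramid_pool(x, levels=(1, 2, 4)):
    # x: (batch, H, W, C) with static H and W; returns a fixed-length (batch, sum(b*b)*C) vector.
    h, w = int(x.shape[1]), int(x.shape[2])
    outputs = []
    for bins in levels:
        # Window = ceil(size / bins), stride = floor(size / bins), as in the SPPNet paper.
        ksize = [math.ceil(h / bins), math.ceil(w / bins)]
        strides = [math.floor(h / bins), math.floor(w / bins)]
        pooled = tf.nn.max_pool2d(x, ksize=ksize, strides=strides, padding="VALID")
        outputs.append(tf.reshape(pooled, (tf.shape(x)[0], -1)))
    return tf.concat(outputs, axis=1)

# Example: a conv5-like map of 13x13 with 256 channels -> (1 + 4 + 16) * 256 = 5376 features.
feature_map = tf.random.normal((2, 13, 13, 256))
print(spatial_pyramid_pool(feature_map).shape)   # (2, 5376)
```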

