
DATA SCIENCE

INTERVIEW
PREPARATION
(30 Days of Interview
Preparation)

# DAY 08

Q1. What is Tensorflow?
Ans:
TensorFlow: TensorFlow is an open-source software library released in 2015 by Google to make it
easier for developers to design, build, and train deep learning models. It originated as an internal
library that Google developers used to build models in house, and additional functionality continues
to be added to the open-source version as it is tested and vetted internally. TensorFlow is only one of
several options available to developers; it is widely chosen because of its thoughtful design and ease
of use.
At a high level, TensorFlow is a Python library that allows users to express arbitrary computation as
a graph of data flows. Nodes in this graph represent mathematical operations, whereas edges
represent data that is communicated from one node to another. Data in TensorFlow are represented
as tensors, which are multidimensional arrays. Although this framework for thinking about
computation is valuable in many different fields, TensorFlow is primarily used for deep learning in
practice and research.
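To make the graph-of-data-flows idea concrete, here is a minimal sketch (assuming a TensorFlow 2
installation; the matrices are made up for illustration). tf.function traces a Python function into a
graph whose nodes are operations such as tf.matmul and whose edges carry tensors:

```python
import tensorflow as tf

# Two constant tensors; matmul and the addition are nodes of the graph,
# and the tensors flowing between them are its edges.
a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
b = tf.constant([[5.0, 6.0], [7.0, 8.0]])

@tf.function  # traces the Python function into a TensorFlow graph
def compute(x, y):
    return tf.matmul(x, y) + x

print(compute(a, b))
```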

Q2. What are Tensors?


Ans:
Tensor: In mathematics, a tensor is an algebraic object that describes a linear mapping from one set of
algebraic objects to another. Objects that tensors may map between include, but are not limited to,
vectors, scalars and, recursively, even other tensors (for example, a matrix is a map between vectors
and is thus a tensor; therefore a linear map between matrices is also a tensor). Tensors are inherently
related to vector spaces and their dual spaces and can take several different forms: for example, a
scalar, a vector, a dual vector at a point, or a multi-linear map between vector spaces. Euclidean
vectors and scalars are simple tensors. While tensors are defined independently of any basis, the
physics literature often refers to them by their components in a basis tied to a particular coordinate
system.

Q3. What is TensorBoard?


Ans:
TensorBoard is a suite of visualisation tools offered by the creators of TensorFlow. It lets you
visualise the computation graph, plot quantitative metrics about its execution, and display additional
data such as images that pass through it.

A minimal example of wiring TensorBoard into a training run is sketched below.
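This sketch assumes TensorFlow 2 with the bundled Keras API; the toy model, random data, and the
"logs" directory name are all illustrative choices. After running it, `tensorboard --logdir logs`
opens the dashboard:

```python
import tensorflow as tf

# Toy model; the TensorBoard callback writes the graph and scalar metrics to ./logs.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

x = tf.random.normal((256, 4))
y = tf.random.normal((256, 1))

tb = tf.keras.callbacks.TensorBoard(log_dir="logs")
model.fit(x, y, epochs=5, callbacks=[tb], verbose=0)
```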


Q4. What are the features of TensorFlow?
Ans:
• One of the main features of TensorFlow is its ability to build neural networks.
• Using these neural networks, machines can learn and make decisions in a way loosely similar to
humans.
• It also provides tensor operations for the rest of the pipeline, such as data loading, preprocessing,
calculation, state management and outputs.
• It is not only a deep learning framework but also a general library for tensor calculations, so it can
describe basic numerical processing as well.
• TensorFlow describes every computation as a calculation graph, no matter how simple the
calculation is.

Q5. What are the advantages of TensorFlow?


Ans:
• It allows Deep Learning.
• It is open-source and free.
• It is reliable (and without major bugs)
• It is backed by Google and a good community.
• It is a skill recognised by many employers.
• It is easy to implement.

Q6. List a few limitations of Tensorflow.


Ans:

• It has GPU memory conflicts with Theano if both are imported in the same scope.
• It has dependencies on other libraries.
• It requires prior knowledge of advanced calculus and linear algebra, along with a good
understanding of machine learning.

Q7. What are the use cases of Tensor flow?
Ans:
TensorFlow is an important tool for deep learning. It has five main use cases:

• Time Series
• Image recognition
• Sound Recognition
• Video detection
• Text-based Applications

Q8. What are the main steps in the TensorFlow architecture?


Ans:
The three main steps in the TensorFlow architecture are:

• Pre-process the Data


• Build a Model
• Train and evaluate the model

Q9. What is Keras?
Ans:
Keras: It is an open-source neural network library written in Python that runs on top of Theano
or TensorFlow. It is designed to be modular, fast and easy to use. It was developed by François
Chollet, a Google engineer.
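As a quick illustration (a minimal sketch assuming TensorFlow's bundled Keras; the layer sizes and the
20-feature, 3-class setup are arbitrary), a small classifier can be defined in a few lines:

```python
from tensorflow import keras

# A small fully connected classifier, built with the Keras Sequential API.
model = keras.Sequential([
    keras.Input(shape=(20,)),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(3, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```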

Q10. What is a pooling layer?


Ans:
Pooling layer: In a convolutional neural network, a pooling layer is used to reduce the spatial
dimensions (height and width) of the feature maps, not their depth.
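For example (a sketch assuming Keras; the input shape and filter count are arbitrary), a 2x2 max-pooling
layer halves the height and width while keeping the number of channels:

```python
from tensorflow import keras

# 2x2 max pooling halves the spatial dimensions while leaving the depth unchanged.
model = keras.Sequential([
    keras.Input(shape=(28, 28, 1)),
    keras.layers.Conv2D(16, (3, 3), activation="relu"),   # output: 26x26x16
    keras.layers.MaxPooling2D(pool_size=(2, 2)),          # output: 13x13x16
])
model.summary()
```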

Q11. What is the difference between CNN and RNN?


Ans:
CNN (Convolutional Neural Network)

• Best suited for spatial data like images


• CNN is considered more powerful than RNN for spatial tasks
• This network takes fixed-size inputs and produces fixed-size outputs
• It is ideal for image and video processing

RNN (Recurrent Neural Network)

• Best suited for sequential data


• RNN supports a smaller feature set than CNN.
• This network can manage arbitrary input and output lengths.
• It is ideal for text and speech analysis.

Q12. What are the benefits of Tensorflow over other libraries?


Ans:
TensorFlow offers the following benefits over other libraries:

• Scalability
• Visualisation of Data
• Debugging facility
• Pipelining

-------------------------------------------------------------------------------------------------------------

DATA SCIENCE
INTERVIEW
PREPARATION
(30 Days of Interview
Preparation)

# DAY 09

Q1: How would you define Machine Learning?
Ans:
Machine learning: It is an application of artificial intelligence (AI) that gives systems the ability
to learn automatically and to improve from experience without being explicitly programmed. It
focuses on the development of computer applications that can access data and use it to learn for
themselves. The process of learning starts with observations or data, such as examples, direct
experience, or instruction, in order to look for patterns in the data and make better decisions in the
future based on the examples that we provide. The primary aim is to allow computers to learn
automatically without human intervention or assistance and to adjust their actions accordingly.

Q2. What is a labeled training set?


Ans:
A labeled training set is data in which every example is paired with the known value of the response
(target) variable. Supervised learning relies on such labeled data, split into a training portion and a
test portion. The training set is used to train the algorithm, and the trained model is then applied to
the test set to predict response values that are already known. The final step is to compare the
predicted responses against the actual (observed) responses to see how close they are; the difference
is the test error metric. Depending on the test error, you can go back to refine the model and repeat
the process until you are satisfied with the accuracy.
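A minimal sketch of this workflow with scikit-learn (the Iris data, logistic regression model, and 25%
hold-out are just illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Labeled data: X holds the features, y the known class labels.
X, y = load_iris(return_X_y=True)

# Hold out 25% of the labeled examples as a test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))
```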

Q3. What are the two common supervised tasks?
Ans:
The two common supervised tasks are regression and classification.
Regression-
A regression problem is one where the output variable is a real or continuous value, such as “salary”
or “weight.” Many different models can be used; the simplest is linear regression, which tries to fit
the data with the best hyper-plane passing through the points.

Classification
Classification is the other type of supervised learning. It specifies the class to which data elements
belong and is best used when the output takes finite, discrete values. It predicts a class label for an
input variable.

Q4. Can you name four common unsupervised tasks?


Ans:
The common unsupervised tasks include clustering, visualization, dimensionality reduction, and
association rule learning.
Clustering

It is a machine learning technique that involves grouping data points. Given a set of data points, we
can use a clustering algorithm to assign each data point to a specific group. In theory, data points that
lie in the same group should have similar properties and/or features, and data points in different
groups should have highly dissimilar properties and/or features. Clustering is an unsupervised
learning method and a common technique for statistical data analysis used in many fields.
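A minimal clustering sketch with scikit-learn (the synthetic blob data and the choice of 3 clusters are
purely illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Unlabeled points; KMeans groups them into 3 clusters based on distance alone.
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(labels[:10])
```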

Visualization
Data visualization is a technique that uses an array of static and interactive visuals within a
specific context to help people understand and make sense of large amounts of data. The data
is often displayed in a story format that exposes patterns, trends, and correlations that might
otherwise go unnoticed. It is also regularly used as an avenue to monetize data as a product. An
example of combining monetization and data visualization is Uber: the app combines visualization
with real-time data so that customers can request a ride.

Q5. What type of Machine Learning algorithm would we use to allow a robot to walk in various
unknown terrains?
Ans:
Reinforcement Learning is likely to perform best if we want a robot to learn how to walk in
various unknown terrains, since this is typically the type of problem that reinforcement learning
tackles. It may be possible to express the problem as a supervised or semi-supervised learning
problem, but it would be less natural.
Reinforcement Learning-
It is about taking suitable actions to maximize reward in a particular situation. It is employed by
various software systems and machines to find the best possible behaviour or path to take in a specific
situation. Reinforcement learning differs from supervised learning in that, in supervised learning,
the training data comes with an answer key, so the model is trained with the correct answer itself,
whereas in reinforcement learning there is no answer and the reinforcement agent decides what to do
to perform the given task. In the absence of a training dataset, it is bound to learn from its experience.

Q6. What type of algorithm would we use to segment your customers into multiple groups?
Ans:
If we don’t know how to define the groups, then we can use the clustering algorithm (unsupervised
learning) to segment our customers into clusters of similar customers. However, if we know what
groups we would like to have, then we can feed many examples of each group to a classification
algorithm (supervised learning), and it will classify all your customers into these groups.

Q7: What is online machine learning?


Ans:
Online machine learning: It is a method of machine learning in which data becomes available in
sequential order and is used to update the best predictor for future data at each step, as opposed to
batch learning techniques that generate the best predictor by learning on the entire training data set at
once. Online learning is a common technique in areas of machine learning where it is
computationally infeasible to train over the entire dataset, requiring the use of out-of-core
algorithms. It is also used in situations where the algorithm must adapt to new patterns in the
data dynamically, or when the data itself is generated as a function of time, for example, stock
price prediction. Online learning algorithms may be prone to catastrophic interference, a
problem that can be addressed by incremental learning approaches.

Q8: What is out-of-core learning?


Ans:
Out-of-core: It refers to processing data that is too large to fit into a computer’s main
memory. Typically, when a dataset fits neatly into main memory, randomly accessing sections of the
data carries a (relatively) small performance penalty.
When data must be stored on a medium like a large spinning hard drive or an external computer
network, it becomes very expensive to seek an arbitrary section of data randomly or to process the
same data multiple times. In such a case, an out-of-core algorithm tries to access all the relevant
data in sequence.
However, modern computers have a deep memory hierarchy, and replacing random access with
sequential access can increase performance even on datasets that fit within memory.

Q9. What is the Model Parameter?
Ans:
Model parameter: It is a configuration variable that is internal to a model and whose value can be
estimated from the data.

 The model parameters are needed when making predictions.

 Their values define the skill of the model on the problem.
 They are estimated or learned from data.
 They are often not set manually by the practitioner.
 They are often saved as part of the learned model.

Parameters are key to machine learning algorithms. They are part of the model that is learned from
historical training data.

Q11: What is Model Hyperparameter?


Ans:
Model hyperparameter: It is a configuration that is external to a model and whose values cannot
be estimated from the data.

 It is often used in processes that help estimate the model parameters.

 It is often specified by the practitioner.
 It can often be set using heuristics.
 It is tuned for a given predictive modeling problem.
We cannot know the best value of a model hyperparameter for a given problem in advance. We may
use rules of thumb, copy values used on other problems, or search for the best value by trial and
error.

Q12. What is cross-validation?


Ans:
Cross-validation: It is a technique for evaluating Machine Learning models by training several
Machine Learning models on subsets of available input data and evaluating them on the
complementary subset of data. Use cross-validation to detect overfitting, i.e., failing to generalize a
pattern.
The three steps involved in cross-validation are as follows (a short scikit-learn sketch follows the list):

 Reserve some portion of the sample dataset.
 Train the model using the rest of the dataset.
 Test the model using the reserved portion of the dataset.
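A minimal sketch with scikit-learn's cross_val_score (the Iris data, the decision tree, and the choice of
5 folds are just illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation: each fold is held out once while the model
# is trained on the remaining four folds.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
print(scores, scores.mean())
```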

--------------------------------------------------------------------------------------------------------------

DATA SCIENCE
INTERVIEW PREPARATION
(30 Days of Interview
Preparation)

# DAY 10

Q1. What is a Recommender System?

Answer:
Recommender systems are today widely deployed in multiple fields such as movie recommendations, music
preferences, social tags, research articles, search queries and so on. Recommender systems work using
collaborative filtering, content-based filtering, or a personality-based approach. This type of system
builds a model from a person’s past behaviour in order to predict future behaviour, such as which products
people will buy, which movies they will watch or which books they will read. It can also create a filtering
approach using the discrete characteristics of items in order to recommend additional items.

Q2. Compare SAS, R and Python programming?

Answer:
SAS: It is one of the most widely used analytics tools, adopted by some of the biggest companies on earth.
It has some of the best statistical functions and a graphical user interface, but it comes with a price tag and
hence cannot be readily adopted by smaller enterprises.

R: The best part about R is that it is an open-source tool and hence used generously by academia and
the research community. It is a robust tool for statistical computation, graphical representation and
reporting. Due to its open-source nature it is constantly being updated with the latest features, which are
then readily available to everybody.
Python: Python is a powerful open source programming language that is easy to learn, works well with
most other tools and technologies. The best part about Python is that it has innumerable libraries and
community created modules making it very robust. It has functions for statistical operation, model
building and more.

Q3. Why is data cleansing important in data analysis?

Answer:
With data coming in from multiple sources, it is important to ensure that the data is good enough for
analysis. This is where data cleansing becomes extremely vital. Data cleansing deals extensively with the
process of detecting and correcting data records, ensuring that the data is complete and accurate and that
the components of the data that are irrelevant are deleted or modified as per the needs. This process can be
deployed in concurrence with data wrangling or batch processing.

Once the data is cleaned, it conforms to the rules of the data sets in the system. Data cleansing is an
essential part of data science because data can be prone to error due to human negligence,
corruption during transmission or storage, among other things. Data cleansing takes a huge chunk of a
Data Scientist’s time and effort because of the multiple sources from which data emanates and the speed at
which it arrives.

Q4. What are the various aspects of a Machine Learning process?

Answer:
Here we will discuss the components involved in solving a problem using machine learning.
Domain knowledge
This is the first step, wherein we need to understand how to extract the various features from the data and
learn more about the data we are dealing with. It has more to do with the type of domain we are dealing
with and familiarizing the system with it so that it can learn more about it.

Feature Selection
In this step we select features from the full set of features that we have. Sometimes there are a great many
features, and we have to make an intelligent decision about which ones to select before going ahead with
our machine learning endeavor.
Algorithm
This is a vital step since the algorithms that we choose will have a very major impact on the entire process
of machine learning. You can choose between the linear and nonlinear algorithm. Some of the algorithms
used are Support Vector Machines, Decision Trees, Naïve Bayes, K-Means Clustering, etc.
Training
This is the most important part of the machine learning technique, and it is where machine learning differs
from traditional programming. Training is done based on the data that we have, providing the system with
more real-world experience. With each subsequent training step the machine gets better and smarter and is
able to make improved decisions.
Evaluation
In this step we evaluate the decisions taken by the machine in order to decide whether they are up to the
mark or not. There are various metrics involved in this process, and we have to closely apply each of them
to decide on the efficacy of the whole machine learning endeavor.
Optimization
This process involves improving the performance of the machine learning process using various
optimization techniques. Optimization of machine learning is one of the most vital components wherein
the performance of the algorithm is vastly improved. The best part of optimization techniques is that
machine learning is not just a consumer of optimization techniques but it also provides new ideas for
optimization too.
Testing
Here various tests are carried out, some of them on unseen sets of test cases. The data is partitioned into
test and training sets. There are various testing techniques, such as cross-validation, to deal with
multiple situations.

Q4. What is Interpolation and Extrapolation?
Answer:
The terms interpolation and extrapolation are extremely important in any statistical analysis.
Extrapolation is the determination or estimation of a value by extending a known set of values or facts
into an area or region that is unknown. It is the technique of inferring something using data that is
available.
Interpolation, on the other hand, is the method of determining a value which falls between a certain set of
values or a sequence of values. This is especially useful when you have data at the two extremities of a
certain region but not enough data points at the specific point of interest. This is when you deploy
interpolation to determine the value that you need.

Q5. What does P-value signify about the statistical data?
Answer:
P-value is used to determine the significance of results after a hypothesis test in statistics. P-value helps
the readers to draw conclusions and is always between 0 and 1.
• P- Value > 0.05 denotes weak evidence against the null hypothesis which means the null hypothesis
cannot be rejected.
• P-value <= 0.05 denotes strong evidence against the null hypothesis which means the null hypothesis
can be rejected.
• P-value = 0.05 is the marginal value, indicating it is possible to go either way.

Q6. During analysis, how do you treat missing values?


Answer:
The extent of the missing values is identified after identifying the variables with missing values. If any
patterns are identified, the analyst has to concentrate on them, as they could lead to interesting and
meaningful business insights. If no patterns are identified, then the missing values can be substituted with
mean or median values (imputation), or they can simply be ignored.

There are various factors to be considered when answering this question:
• Understand the problem statement and the data before deciding on a treatment; getting into the data is
important.
• A default value can be assigned, which may be the mean, minimum or maximum value.
• If it is a categorical variable, a default category can be assigned as the missing value.
• If the data follows a normal distribution, the mean is a reasonable default for imputation.
• Whether we should treat missing values at all is another important point to consider: if 80% of the
values for a variable are missing, then you can answer that you would drop the variable instead of
treating the missing values.
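A small pandas sketch of the imputation options above (the column names and values are made up for
illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "salary": [50000, np.nan, 62000, 58000, np.nan],
    "city":   ["Pune", "Delhi", None, "Delhi", "Pune"],
})

# Numeric column: impute with the median; categorical column: use a default label.
df["salary"] = df["salary"].fillna(df["salary"].median())
df["city"] = df["city"].fillna("Unknown")

# If most of a column were missing, dropping it may be preferable:
# df = df.drop(columns=["salary"])
print(df)
```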

Q7. Explain the difference between a Test Set and a Validation Set?

Answer:
The validation set can be considered part of the training set, as it is used for parameter selection and to
avoid overfitting of the model being built. The test set, on the other hand, is used for testing or evaluating
the performance of a trained machine learning model.
In simple terms, the differences can be summarized as:
The training set is used to fit the parameters, i.e. the weights.
The test set is used to assess the performance of the model, i.e. to evaluate its predictive power and
generalization.
The validation set is used to tune the hyperparameters.

Q8. What is the curse of dimensionality? Can you list some ways to
deal with it?
Answer:

The curse of dimensionality is when the training data has a high feature count, but the dataset does not
have enough samples for a model to learn correctly from so many features. For example, a training dataset
of 100 samples with 100 features will be very hard to learn from because the model will find random
relations between the features and the target. However, if we had a dataset of 100k samples with 100
features, the model could probably learn the correct relationships between the features and the target.

There are different options to fight the curse of dimensionality:

 Feature selection. Instead of using all the features, we can train on a smaller subset of features.
 Dimensionality reduction. There are many techniques that allow us to reduce the dimensionality
of the features. Principal component analysis (PCA) and autoencoders are examples of
dimensionality reduction techniques (see the PCA sketch after this list).
 L1 regularization. Because it produces sparse parameters, L1 helps to deal with high-
dimensionality input.
 Feature engineering. It is possible to create new features that sum up multiple existing
features. For example, we can compute statistics such as the mean or median.
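A minimal dimensionality-reduction sketch with scikit-learn's PCA (the random 100x100 data and the
choice of 10 components mirror the example above and are purely illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# 100 samples with 100 features: roughly the hard case described above.
X = np.random.rand(100, 100)

# Standardize the features, then keep only 10 principal components.
X_scaled = StandardScaler().fit_transform(X)
X_reduced = PCA(n_components=10).fit_transform(X_scaled)
print(X_reduced.shape)  # (100, 10)
```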

Q9. What is data augmentation? Can you give some examples?
Answer:

Data augmentation is a technique for synthesizing new data by modifying existing data in such a way
that the target is not changed, or it is changed in a known way.

Computer vision is one of the fields where data augmentation is very useful. There are many modifications
that we can make to images:

 Resize
 Horizontal or vertical flip
 Rotate
 Add noise
 Deform
 Modify colors

Each problem needs a customized data augmentation pipeline. For example, on OCR, doing flips will
change the text and won’t be beneficial; however, resizes and small rotations may help.

Q10. What is stratified cross-validation and when should we use it?


Answer:

Cross-validation is a technique for dividing data between training and validation sets. In typical cross-
validation this split is done randomly, but in stratified cross-validation the split preserves the ratio of
the categories in both the training and validation datasets.
For example, if we have a dataset with 10% of category A and 90% of category B, and we use stratified
cross-validation, we will have the same proportions in training and validation. In contrast, if we use
simple cross-validation, in the worst case we may find that there are no samples of category A in the
validation set.

Stratified cross-validation may be applied in the following scenarios:

 On a dataset with multiple categories. The smaller the dataset and the more imbalanced the
categories, the more important it will be to use stratified cross-validation.
 On a dataset with data of different distributions. For example, in a dataset for autonomous
driving, we may have images taken during the day and at night. If we do not ensure that both
types are present in training and validation, we will have generalization problems.
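A minimal sketch with scikit-learn's StratifiedKFold (the 10%/90% toy labels mirror the example above):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.arange(20).reshape(-1, 1)
y = np.array([0] * 2 + [1] * 18)   # 10% category A, 90% category B

# Each fold keeps roughly the same class ratio as the full dataset.
skf = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)
for train_idx, val_idx in skf.split(X, y):
    print("validation class counts:", np.bincount(y[val_idx]))
```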

DATA SCIENCE
INTERVIEW PREPARATION
(30 Days of Interview
Preparation)

# DAY 11

Q1. What are tensors?
Answer:

Tensors are nothing more than the way data is represented in deep learning. Put simply, tensors are
multidimensional arrays that allow developers to represent data with any number of dimensions, so that
high-level data sets can be expressed with each dimension corresponding to a different feature.

The foremost benefit of working with tensors in a framework such as TensorFlow is the much-needed
platform flexibility, and models built on them are easy to train on CPUs. In addition, the framework offers
automatic differentiation and advanced support for queues, threads, and asynchronous computation, all of
which also make it customizable.

Q2. Define the concept of RNN?

Answer:
An RNN is an artificial neural network that was created to analyze and recognize patterns in sequences
of data. Due to their internal memory, RNNs can remember important things about the inputs they
receive.

Most common issues faced with RNN

Although RNNs have been around for a while and use backpropagation, there are some common issues
faced by developers who work with them. The most common issues are:
 Exploding gradients
 Vanishing gradients

Q3. What is a ResNet, and where would you use it? Is it efficient?
Answer:

Among the various neural networks that are used for computer vision, ResNet (Residual Neural
Networks), is one of the most popular ones. It allows us to train extremely deep neural networks, which
is the prime reason for its huge usage and popularity. Before the invention of this network, training
extremely deep neural networks was almost impossible.

To understand why, we must look at the vanishing gradient problem, which is an issue that arises when
the gradient is backpropagated through all the layers. As a large number of multiplications are performed,
the gradient keeps shrinking until it becomes extremely small, so the earlier layers stop learning and the
network starts performing badly. ResNet helps to counter the vanishing gradient problem.

The efficiency of this network is highly dependent on the concept of skip connections. Skip connections
are a method of allowing a shortcut path through which the gradient can flow, which in effect helps
counter the vanishing gradient problem.

In a skip connection, the output of an earlier layer is added directly to the output of a later layer,
bypassing the layers in between.

In general, a skip connection allows us to skip the training of a few layers. Skip connections are also
called identity shortcut connections as they allow us to directly compute an identity function by just
relying on these connections and not having to look at the whole network.

The skipping of these layers makes ResNet an extremely efficient network.

Q4. Transfer learning is one of the most useful concepts today. Where
can it be used?

Answer:

Pre-trained models are probably one of the most common use cases for transfer learning.

For anyone who does not have access to huge computational power, training complex models is always
a challenge. Transfer learning aims to help by both improving the performance and speeding up your
network.

In layman’s terms, transfer learning is a technique in which a model that has already been trained to do
one task is reused for another without much change. It is closely related to multi-task learning.

Many models that are pre-trained are available online. Any of these models can be used as a starting
point in the creation of the new model required. After just using the weights, the model must be refined
and adapted on the required data by tuning the parameters of the model.

The general idea behind transfer learning is to transfer knowledge not data. For humans, this task is easy
– we can generalize models that we have mentally created a long time ago for a different purpose. One
or two samples is almost always enough. However, in the case of neural networks, a huge amount of data
and computational power are required.

Transfer learning should generally be used when we don’t have a lot of labeled training data, or if there
already exists a network for the task you are trying to achieve, probably trained on a much more massive
dataset. Note, however, that the input of the model must have the same size during training. Also, this
works only if the tasks are fairly similar to each other, and the features learned can be generalized. For
example, something like learning how to recognize vehicles can probably be extended to learn how to
recognize airplanes and helicopters.

Q5. What does tuning of hyperparameters signify? Explain with


examples.

Answer:

A hyperparameter is just a variable that defines the structure of the network. Let’s go through some
hyperparameters and see the effect of tuning them.

1. A number of hidden layers – Most times, the presence or absence of a large number of hidden
layers may determine the output, accuracy and training time of the neural network. Having a
large number of these layers may sometimes cause an increase in accuracy.
2. Learning rate – This is simply a measure of how fast the neural network will change its
parameters. A large learning rate may lead to the network not being able to converge, but might
also speed up learning. On the other hand, a smaller value for the learning rate will probably slow
down the network but might lead to the network being able to converge.
3. Number of epochs – This is the number of times the entire training data is run through the
network. Increasing the number of epochs often improves accuracy, up to the point where the
network starts to overfit.
4. Momentum – Momentum is a measure of how and where the network will go while taking into
account all of its past actions. A proper measure of momentum can lead to a better network.
5. Batch Size – Batch size determines the number of subsamples that are inputs to the network
before every parameter update.

Q6. Why are deep learning models referred as black boxes?

Answer:

Lately, the concept of deep learning being a black box has been floating around. A black box is a system
whose functioning cannot be properly grasped, but the output produced can be understood and utilized.

Now, since most models are mathematically sound and are created based on legit equations, how is it
possible that we do not know how the system works?

First, it is almost impossible to visualize the functions that are generated by a system. Most machine
learning models end up with such complex output that a human can't make sense of it.

Second, there are networks with millions of parameters. As humans, we can grasp around 10 to 15
parameters, but analysing millions of them is out of the question.

Third and most important, it becomes very hard, if not impossible, to trace back why the system made
the decisions it did. This may not sound like a huge problem to worry about, but consider the case of a
self-driving car. If the car hits someone on the road, we need to understand why that happened and prevent
it. But this isn’t possible if we do not understand how the system works.

To make a deep learning model not be a black box, a new field called Explainable Artificial Intelligence
or simply, Explainable AI is emerging. This field aims to be able to create intermediate results and trace
back the decision-making process of a system.

Q7. Why do we have gates in neural networks?

Answer:

To understand gates, we must first understand recurrent neural networks.

Recurrent neural networks allow information to be stored as a memory using loops. Thus, the output of
a recurrent neural network is not only based on the current input but also the past inputs which are stored
in the memory of the network. Backpropagation is done through time, but in general, the truncated
version of this is used for longer sequences.

Gates are generally used in networks that are dependent on time. In effect, any network which would
require memory, so to speak, would benefit from the use of gates. These gates are generally used to keep
track of any information that is required by the network without leading to a state of either vanishing or
exploding gradients. Such a network can also preserve the error through time. Since a sense of constant
error is maintained, the network can learn better.

These gated units can be considered as units with recurrent connections. They also contain additional
neurons, which are gates. If you relate this process to a signal processing system, the gate is used to
regulate which part of the signal passes through. A sigmoid activation function is used which means that
the values taken are from 0 to 1.

An advantage of using gates is that it enables the network to either forget information that it has already
learned or to selectively ignore information either based on the state of the network or the input the gate
receives.

Gates are extensively used in recurrent neural networks, especially in Long Short-Term Memory (LSTM)
networks. A standard LSTM unit has three gates: an input gate, a forget gate, and an output gate.

Q8. What is a Sobel filter?


Answer:

The Sobel filter performs a two-dimensional spatial gradient measurement on a given image, which then
emphasizes regions that have a high spatial frequency. In effect, this means finding edges.

In most cases, Sobel filters are used to find the approximate absolute gradient magnitude for every point
in a grayscale image. The operator consists of a pair of 3×3 convolution kernels. One of these kernels is
rotated by 90 degrees.

These kernels respond to edges that run horizontal or vertical with respect to the pixel grid, one kernel
for each orientation. A point to note is that these kernels can be applied either separately or can be
combined to find the absolute magnitude of the gradient at every point.

The Sobel operator has a large convolution kernel, which ends up smoothing the image to a greater
extent, and thus, the operator becomes less sensitive to noise. It also produces higher output values for
similar edges compared to other methods.

To overcome the problem of the operator’s output values overflowing the maximum pixel value allowed
by the image type, use an image type that supports a larger range of pixel values.

Q9. What is the purpose of a Boltzmann Machine?

Answer:

Boltzmann machines are algorithms that are based on physics, specifically thermal equilibrium. A special
and more well-known case of Boltzmann machines is the Restricted Boltzmann machine, which is a type
of Boltzmann machine where there are no connections between hidden layers of the network.

The concept was coined by Geoff Hinton, who most recently won the Turing award. In general, the
algorithm uses the laws of thermodynamics and tries to optimize a global distribution of energy in the
system.

In discrete mathematical terms, a restricted Boltzmann machine can be called a symmetric bipartite
graph, i.e. two symmetric layers. These machines are a form of unsupervised learning, which means that
there are no labels provided with data. It uses stochastic binary units to reach this state.

Boltzmann machines are derived from Markov state machines. A Markov State Machine is a model that
can be used to represent almost any computable function. The restricted Boltzmann machine can be
regarded as an undirected graphical model. It is used in dimensionality reduction, collaborative filtering,
learning features as well as modeling. It can also be used for classification and regression. In general,
restricted Boltzmann machines are composed of a two-layer network, which can then be extended further.

Note that these models are probabilistic since each of the nodes present in the system learns low-level
features from items in the dataset. For example, if we take a grayscale image, each node that is
responsible for the visible layer will take just one-pixel value from the image.

A part of the process of creating such a machine is a feature hierarchy where sequences of activations
are grouped in terms of features. In thermodynamics principles, simulated annealing is a process that the
machine follows to separate signal and noise.

Q10. What are the types of weight initialization?

Answer:

There are two major types of weight initialization:- zero initialization and random initialization.

Zero initialization: In this process, biases and weights are initialised to 0. If the weights are set to
0, all derivatives with respect to the loss functions in the weight matrix become equal. Hence, none of
the weights change during subsequent iterations. Setting the bias to 0 cancels out any effect it may
have.

All hidden units become symmetric due to zero initialization. In general, zero initialization is not very
useful or accurate for classification and thus must be avoided when any classification task is required.

Random initialization: As compared to zero initialization, this involves setting random values for the
weights. The only disadvantage is that setting very high values will increase the learning time, because
the sigmoid activation function saturates close to 1. Likewise, if very low values are set, the learning time
increases because the activation function saturates close to 0.

Setting too high or too low values thus generally leads to the exploding or vanishing gradient problem.

New types of weight initialization like “He initialization” and “Xavier initialization” have also
emerged. These are based on specific equations and are not mentioned here due to their sheer
complexity.

DATA SCIENCE
INTERVIEW
PREPARATION
(30 Days of Interview
Preparation)

# DAY 12

Q1. Where is the confusion matrix used? Which module would you
use to show it?

Answer:

In machine learning, a confusion matrix is one of the easiest ways to summarize the performance of
your algorithm.

At times, it is difficult to judge the quality of a model by looking at accuracy alone, because of
problems like unequal class distribution. So, a better way to check how good your model is, is to use a
confusion matrix.

First, let’s look at some key terms.

Classification accuracy – This is the ratio of the number of correct predictions to the number of
predictions made

True positives – cases that are actually positive and are correctly predicted as positive

False positives – cases that are actually negative but are incorrectly predicted as positive

True negatives – cases that are actually negative and are correctly predicted as negative

False negatives – cases that are actually positive but are incorrectly predicted as negative.

The confusion matrix is now simply a matrix containing true positives, false positives, true
negatives, false negatives.
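On the "which module" part of the question: scikit-learn's sklearn.metrics provides both the matrix and a
plotting helper. A minimal sketch (the label vectors are made up for illustration):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
print(cm)

ConfusionMatrixDisplay(cm).plot()
plt.show()
```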

Q2: What is Accuracy?
Answer:
It is the most intuitive performance measure: it is simply the ratio of correctly predicted observations to
the total observations. We might say that if we have high accuracy, then our model is good. Accuracy is a
great measure, but only when you have symmetric datasets where false positives and false negatives are
almost equally important.

Accuracy = (True Positive + True Negative) / (True Positive + False Positive + False
Negative + True Negative)

Q3: What is Precision?


Answer:
It is also called the positive predictive value: the number of correct positive predictions made by the
model compared to the total number of positive predictions it makes.

Precision = True Positives / (True Positives + False Positives)

Precision = True Positives / Total predicted positives

It is the number of positive elements predicted correctly divided by the total number of positive
elements predicted.
We can say precision is a measure of exactness, quality, or accuracy. High precision
means that most or all of the positive results you predicted are correct.

Q4: What is Recall?
Answer:
Recall is also called sensitivity or the true positive rate.
It is the number of positives that our model predicts compared to the actual number of positives in our data.
Recall = True Positives / (True Positives + False Negatives)
Recall = True Positives / Total actual positives

Recall is a measure of completeness. High recall means that our model classified most or all
of the possible positive elements as positive.

Q5: What is F1 Score?


Answer:
We use precision and recall together because they complement each other in describing the
effectiveness of a model. The F1 score combines the two as the harmonic mean of precision and
recall.

F1 Score = 2 * (Precision * Recall) / (Precision + Recall)
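All four metrics above are available in scikit-learn's sklearn.metrics; a minimal sketch with made-up
label vectors:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1 score :", f1_score(y_true, y_pred))
```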

Q6: What is Bias and Variance trade-off?
Answer:
Bias
Bias is how far the predicted values are, on average, from the actual values. If the average predicted
values are far off from the actual values, then we say the model has high bias.
When our model has high bias, it means that the model is too simple and does not capture the
complexity of the data, thus underfitting it.

Variance
Variance occurs when our model performs well on the training dataset but does not do well on a dataset
that it was not trained on, such as a test or validation dataset. It tells us how scattered the predicted values
are from the actual values.

High variance causes overfitting, which implies that the algorithm models the random noise present in
the training data.
When a model has high variance, it becomes very flexible and tunes itself to the data points of the
training set.

Bias-variance decomposition essentially decomposes the learning error of any algorithm into the bias, the
variance, and a bit of irreducible error due to noise in the underlying dataset.
Essentially, if we make the model more complex and add more variables, we lose bias but gain some
variance. To get the optimally reduced amount of error, we have to trade off bias and variance: we want
neither high bias nor high variance in the model.

Bias and variance using bulls-eye diagram

Q7. What is data wrangling? Mention three points to consider in the


process.

Answer:

Data wrangling is a process by which we convert and map data. This changes data from its raw
form to a format that is a lot more valuable.

Data wrangling is the first step for machine learning and deep learning. The end goal is to provide
data that is actionable and to provide it as fast as possible.

There are three major things to focus on while talking about data wrangling –

1. Acquiring data
The first and probably the most important step in data science is the acquiring, sorting and cleaning
of data. This is an extremely tedious process and requires the most amount of time.

One needs to:

 Check if the data is valid and up-to-date.


 Check if the data acquired is relevant for the problem at hand.

Sources for data collection Data is publicly available on various websites like
kaggle.com, data.gov ,World Bank, Five Thirty Eight Datasets, AWS Datasets, Google
Datasets.

2. Data cleaning
Data cleaning is an essential component of data wrangling and requires a lot of patience. To make
the job easier, it is essential to first format the data to make it readable for humans.

The essentials involved are:

 Format the data to make it more readable


 Find outliers (data points that do not match the rest of the dataset) in data
 Find missing values and remove them from the data set (without this, any model being
trained becomes incomplete and useless)

3. Data Computation
At times, your machine may not have enough resources to run your algorithm, e.g. you might not have a
GPU. In these cases, you can use publicly available APIs to run your algorithm. These are standard
endpoints found on the web which allow you to use computing power over the web and process
data without having to rely on your own system. An example would be the Google Colab platform.

Q8. Why is normalization required before applying any machine
learning model? What module can you use to perform normalization?

Answer:

Normalization is a process that is required when an algorithm uses something like distance
measures. Examples would be clustering data, finding cosine similarities, creating recommender
systems.

Normalization is not always required and is done to prevent variables that are on higher scale from
affecting outcomes that are on lower levels. For example, consider a dataset of employees’ income.
This data won’t be on the same scale if you try to cluster it. Hence, we would have to normalize the
data to prevent incorrect clustering.

A key point to note is that normalization does not distort the differences in the range of values.

A problem we might face if we don’t normalize the data is that gradient descent can take a very long
time to converge to the minimum.

For numerical data, normalization is generally done between the range of 0 to 1.

The general formula is:

Xnew = (x-xmin)/(xmax-xmin)
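To answer the "which module" part: scikit-learn's sklearn.preprocessing provides MinMaxScaler, which
applies exactly this formula column by column. A minimal sketch with made-up age and income values:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Incomes are on a much larger scale than ages.
X = np.array([[25, 40000],
              [32, 95000],
              [47, 150000],
              [51, 62000]], dtype=float)

# Rescales every column to the [0, 1] range: (x - xmin) / (xmax - xmin).
X_scaled = MinMaxScaler().fit_transform(X)
print(X_scaled)
```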

Q9. What is the difference between feature selection and feature
extraction?

Feature selection and feature extraction are two major ways of fixing the curse of dimensionality

1. Feature selection:
Feature selection is used to filter a subset of input variables on which the attention should focus.
Every other variable is ignored. This is something which we, as humans, tend to do subconsciously.

Many domains have tens of thousands of variables, most of which are irrelevant or redundant.
Feature selection limits the training data and reduces the amount of computational resources used.
It can significantly improve a learning algorithm’s performance.

In summary, we can say that the goal of feature selection is to find out an optimal feature subset.
This might not be entirely accurate, however, methods of understanding the importance of features
also exist. Some modules in python such as Xgboost help achieve the same.

2. Feature extraction
Feature extraction involves transformation of features so that we can extract features to improve the
process of feature selection. For example, in an unsupervised learning problem, the extraction of
bigrams from a text, or the extraction of contours from an image are examples of feature extraction.

The general workflow involves applying feature extraction on given data to extract features and
then apply feature selection with respect to the target variable to select a subset of data. In effect,
this helps improve the accuracy of a model.

Q10. Why is polarity and subjectivity an issue?
Polarity and subjectivity are terms which are generally used in sentiment analysis.
Polarity is the variation of emotions in a sentence. Since sentiment analysis is widely dependent on
emotions and their intensity, polarity turns out to be an extremely important factor.

In most cases, opinions and sentiment analysis are evaluations. They fall under the categories of
emotional and rational evaluations.

Rational evaluations, as the name suggests, are based on facts and rationality while emotional
evaluations are based on non-tangible responses, which are not always easy to detect.

Subjectivity in sentiment analysis, is a matter of personal feelings and beliefs which may or may
not be based on any fact. When there is a lot of subjectivity in a text, it must be explained and
analysed in context. On the contrary, if there was a lot of polarity in the text, it could be expressed
as a positive, negative or neutral emotion.

Q11. When would you use ARIMA?

Answer:

ARIMA is a widely used statistical method which stands for Auto Regressive Integrated Moving
Average. It is generally used for analyzing time series data and time series forecasting. Let’s take a
quick look at the terms involved.

Auto Regression is a model that uses the relationship between an observation and some number of
lagged observations.

Integrated means use of differences in raw observations which help make the time series stationary.

Moving Average is a model that uses the relationship and dependency between an observation and the
residual errors from a moving average model applied to the lagged observations.

Note that each of these components are used as parameters. After the construction of the model, a
linear regression model is constructed.

Data is prepared by:

 Finding out the differences

 Removing trends and structures that will negatively affect the model

 Finally, making the model stationary.
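Putting this together, a minimal sketch with statsmodels (the toy series and the (2, 1, 1) order are
arbitrary illustrative choices, and the API assumes a reasonably recent statsmodels release):

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# A toy trending series standing in for real time series data.
series = np.cumsum(np.random.randn(200)) + np.arange(200) * 0.1

# ARIMA(p=2, d=1, q=1): 2 lag terms, first differencing, 1 moving-average term.
result = ARIMA(series, order=(2, 1, 1)).fit()
print(result.forecast(steps=5))
```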

DATA SCIENCE
INTERVIEW
PREPARATION
(30 Days of Interview
Preparation)
# DAY 13

Q1. What is Autoregression?
Answer:
The autoregressive (AR) model is commonly used to model time-varying processes and solve
problems in the fields of natural science, economics, finance, and others. The models have always
been discussed in the context of random processes and are often perceived as statistical tools for time
series data.
A regression model, like linear regression, models an output value based on a linear combination of
input values.
Example: y^ = b0 + b1*X1
Where y^ is the prediction, b0 and b1 are coefficients found by optimising the model on training
data, and X is an input value.
This model technique can be used on the time series where input variables are taken as observations
at previous time steps, called lag variables.
For example, we can predict the value for the next time step (t+1) given the observations at the last
two time steps (t-1 and t-2). As a regression model, this would look as follows:
X(t+1) = b0 + b1*X(t-1) + b2*X(t-2)
Because the regression model uses the data from the same input variable at previous time steps, it is
referred to as an autoregression.
The notation AR(p) refers to the autoregressive model of order p. The AR(p) model is written

Q2. What is Moving Average?
Answer:
Moving average: With this technique we get an overall idea of the trends in a dataset; it is an average of
any subset of the numbers. The moving average is extremely useful for forecasting long-term trends, and
we can calculate it for any period. For example, if we have sales data for twenty years, we can calculate a
five-year moving average, a four-year moving average, a three-year moving average and so on. Stock
market analysts will often use a 50- or 200-day moving average to help them see trends in the stock
market and (hopefully) forecast where the stocks are headed.

The notation MA(q) refers to the moving average model of order q:
X(t) = mu + e(t) + c1*e(t-1) + ... + cq*e(t-q)
where mu is the mean of the series, e(t) are white-noise error terms and c1 ... cq are the model parameters.

Q3. What is Autoregressive Moving Average (ARMA)?


Answer:
ARMA: It is a forecasting model in which the methods of autoregression (AR) analysis and moving
average (MA) are both applied to well-behaved time-series data. In ARMA it is assumed that the time
series is stationary, and that when it fluctuates, it does so uniformly around a particular level.
AR (Autoregression model)-
Autoregression (AR) model is commonly used in current spectrum estimation.

The following is the procedure for using ARMA.

 Select the AR model and then equalize its output to the signal being studied when the input is
an impulse function or white noise. It should at least be a good approximation of the signal.
 Find the number of model parameters using the known autocorrelation function or the data.
 Use the derived model parameters to estimate the power spectrum of the signal.
Moving Average (MA) model-
It is a commonly used model in the modern spectrum estimation and is also one of the methods of
the model parametric spectrum analysis. The procedure for estimating MA model’s signal spectrum
is as follows.

 Select the MA model and then equalize its output to the signal under study when the input is an
impulse function or white noise. It should at least be a good approximation of the signal.
 Find the model parameters using the known autocorrelation function.
 Estimate the signal’s power spectrum using the derived model parameters.
In the estimation of the ARMA parameter spectrum, the AR parameters are first estimated, and then
the MA parameters are estimated based on these AR parameters. The spectral estimates of the ARMA
model are then obtained. The parameter estimation of the MA model is, therefore often calculated as
a process of ARMA parameter spectrum association.
The notation ARMA(p, q) refers to the model with p autoregressive terms and q moving-average
terms. This model combines the AR(p) and MA(q) models:
X(t) = c + b1*X(t-1) + ... + bp*X(t-p) + e(t) + c1*e(t-1) + ... + cq*e(t-q)

Q4. What is Autoregressive Integrated Moving Average (ARIMA)?
Answer:
ARIMA: It is a statistical analysis model that uses time-series data to either better understand the data
set or to predict future trends.
An ARIMA model can be understood by the outlining each of its components as follows-

 Autoregression (AR): It refers to a model that shows a changing variable that regresses on
its own lagged, or prior, values.
 Integrated (I): It represents the differencing of raw observations to allow for the time series
to become stationary, i.e., data values are replaced by the difference between the data values
and the previous values.
 Moving average (MA): It incorporates the dependency between an observation and the
residual error from the moving average model applied to the lagged observations.
Each component functions as the parameter with a standard notation. For ARIMA models, the
standard notation would be the ARIMA with p, d, and q, where integer values substitute for the
parameters to indicate the type of the ARIMA model used. The parameters can be defined as-

 p: It is the number of lag observations in the model, also known as the lag order.
 d: It is the number of times that the raw observations are differenced, also known as the degree
of differencing.
 q: It is the size of the moving average window, also known as the order of the moving average.

Q5.What is SARIMA (Seasonal Autoregressive Integrated Moving-
Average)?
Answer:
Seasonal ARIMA: It is an extension of ARIMA that explicitly supports the univariate time series
data with the seasonal component.
It adds three new hyper-parameters to specify the autoregression (AR), differencing (I) and the
moving average (MA) for the seasonal component of the series, as well as an additional parameter
for the period of the seasonality.

Configuring the SARIMA requires selecting hyperparameters for both the trend and seasonal
elements of the series.

Trend Elements
Three trend elements require configuration.
They are the same as in the ARIMA model, specifically-

p: It is the trend autoregression order.
d: It is the trend difference order.
q: It is the trend moving average order.

Seasonal Elements-

There are four seasonal elements that are not part of ARIMA and must be configured; they are-
P: It is the seasonal autoregressive order.
D: It is the seasonal difference order.
Q: It is the seasonal moving average order.
m: It is the number of time steps for a single seasonal period.

Together, the notation for the SARIMA model is specified as-

SARIMA(p, d, q)(P, D, Q)m

The elements can be chosen through careful analysis of the ACF and PACF plots looking at the
correlations of recent time steps.
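A minimal sketch using statsmodels' SARIMAX class with no exogenous input, which is the usual way to fit a SARIMA(p,d,q)(P,D,Q)m model in Python; the orders and the monthly period m = 12 used here are illustrative assumptions.

import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX

y = np.random.randn(120)                                    # placeholder monthly series
model = SARIMAX(y, order=(1, 1, 1), seasonal_order=(1, 1, 1, 12))
result = model.fit(disp=False)
print(result.forecast(steps=12))                            # one seasonal cycle ahead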

Q6. What is Seasonal Autoregressive Integrated Moving-Average with Exogenous Regressors (SARIMAX)?
Answer:
SARIMAX: It is an extension of the SARIMA model that also includes the modelling of the
exogenous variables.
Exogenous variables are also called covariates and can be thought of as parallel input sequences
that have observations at the same time steps as the original series. The primary series may be
referred to as the endogenous data to contrast it with the exogenous sequence(s). The observations for
exogenous variables are included in the model directly at each time step and are not modelled in the
same way as the primary endogenous sequence (e.g. as an AR, MA, etc. process).

The SARIMAX method can also be used to model the subsumed models with exogenous variables,
such as ARX, MAX, ARMAX, and ARIMAX.
The method is suitable for univariate time series with trend and/or seasonal components and
exogenous variables.
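A minimal sketch of SARIMAX with one exogenous regressor; the series, the orders and the exogenous variable are illustrative assumptions. Note that exogenous values must also be supplied for the forecast horizon.

import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX

y = np.random.randn(120)                  # endogenous series
exog = np.random.randn(120, 1)            # parallel exogenous series (covariate)
model = SARIMAX(y, exog=exog, order=(1, 1, 1), seasonal_order=(1, 1, 1, 12))
result = model.fit(disp=False)
future_exog = np.random.randn(6, 1)       # exogenous values for the six forecast steps
print(result.forecast(steps=6, exog=future_exog))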

Q7. What is Vector autoregression (VAR)?
Answer:
VAR: It is a stochastic process model used to capture the linear interdependencies among
multiple time series. VAR models generalise the univariate autoregressive model (AR model) by
allowing for more than one evolving variable. All variables in the VAR enter the model in the same
way: each variable has an equation explaining its evolution based on its own lagged values, the lagged
values of the other model variables, and an error term. VAR modelling does not require as much
knowledge about the forces influencing the variable as do structural models with simultaneous
equations: The only prior knowledge required is a list of variables which can be hypothesised to
affect each other intertemporally.
A VAR model describes the evolution of a set of k variables over the same sample period (t = 1,
..., T) as a linear function of only their past values. The variables are collected in the k-
vector ((k × 1)-matrix) y_t, whose i-th element, y_(i,t), is the observation at time t of
the i-th variable. Example: if the i-th variable is GDP, then y_(i,t) is the value of GDP at time “t”.

A VAR(p) model can then be written as

y_t = c + A_1·y_(t−1) + A_2·y_(t−2) + ... + A_p·y_(t−p) + e_t

where the observation y_(t−i) is called the i-th lag of y, c is a k-vector of constants (intercepts), each A_i is
a time-invariant (k × k)-matrix, and e_t is a k-vector of error terms (zero mean and no serial correlation).
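A minimal statsmodels sketch of a VAR on two parallel series; the random data and the lag order are illustrative placeholders.

import numpy as np
from statsmodels.tsa.api import VAR

data = np.random.randn(200, 2)               # k = 2 variables observed over 200 periods
result = VAR(data).fit(2)                    # VAR(2): two lags of every variable
print(result.summary())
print(result.forecast(data[-2:], steps=5))   # forecasting needs the last `lag` observations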

Q8. What is Vector Autoregression Moving-Average (VARMA)?


Answer:
VARMA: It is a method that models the next step in each time series using an ARMA model. It is the
generalisation of ARMA to multiple parallel time series, i.e., multivariate time series.

The notation for a model involves specifying the order for the AR(p) and the MA(q) models as
parameters to the VARMA function, e.g. VARMA (p, q). The VARMA model can also be used to
develop VAR or VMA models.
This method is suitable for multivariate time series without trend and seasonal components.
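A minimal sketch of a VARMA(p, q) fit using statsmodels' VARMAX class with no exogenous variables; the order (1, 1) and the random two-variable data are illustrative.

import numpy as np
from statsmodels.tsa.statespace.varmax import VARMAX

data = np.random.randn(200, 2)                        # two parallel (multivariate) series
result = VARMAX(data, order=(1, 1)).fit(disp=False)   # p = 1 AR terms, q = 1 MA terms
print(result.forecast(steps=5))                       # next five steps for both series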

Q9. What is Vector Autoregression Moving-Average with Exogenous Regressors (VARMAX)?
Answer:
VARMAX: It is an extension of the VARMA model that also includes the modelling of the
exogenous variables. It is the multivariate version of the ARMAX method.
Exogenous variables are also called covariates and can be thought of as parallel input sequences
that have observations at the same time steps as the original series. The primary series are referred
to as the endogenous data to contrast them with the exogenous sequence(s). The observations for the
exogenous variables are included in the model directly at each time step and are not modelled in the
same way as the primary endogenous sequence (e.g. as an AR, MA, etc. process).
This method can also be used to model subsumed models with exogenous variables, such as VARX
and the VMAX.
This method is suitable for multivariate time series without trend and seasonal components and
exogenous variables.
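A minimal VARMAX sketch: a VARMA(p, q) model on two endogenous series with one exogenous regressor; all values here are illustrative assumptions, and exogenous values must be supplied for the forecast horizon.

import numpy as np
from statsmodels.tsa.statespace.varmax import VARMAX

data = np.random.randn(200, 2)            # endogenous multivariate series
exog = np.random.randn(200, 1)            # exogenous covariate
result = VARMAX(data, exog=exog, order=(1, 1)).fit(disp=False)
print(result.forecast(steps=5, exog=np.random.randn(5, 1)))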

Q10. What is Simple Exponential Smoothing (SES)?
Answer:
SES: This method models the next time step as an exponentially weighted linear function of observations
at prior time steps.
This method is suitable for univariate time series without trend and seasonal components.
Exponential smoothing is a rule-of-thumb technique for smoothing time series data using the
exponential window function. Whereas in the simple moving average the past observations are
weighted equally, exponential functions are used to assign exponentially decreasing weights over
time. It is an easily learned and easily applied procedure for making some determination based on prior
assumptions by the user, such as seasonality. Exponential smoothing is often used for analysis of
time-series data.
Exponential smoothing is one of many window functions commonly applied to smooth data in signal
processing, acting as low-pass filters to remove high-frequency noise.
The raw data sequence is often represented by {x_t} beginning at time t = 0, and the output of the
exponential smoothing algorithm is commonly written as {s_t}, which may be regarded as a best
estimate of what the next value of x will be. When the sequence of observations begins at time t = 0,
the simplest form of exponential smoothing is given by the formulas below.
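In the standard formulation (a reconstruction of the usual simple exponential smoothing equations), s_0 = x_0 and s_t = α·x_t + (1 − α)·s_(t−1) for t > 0, where 0 < α < 1 is the smoothing factor. A minimal Python sketch with statsmodels, using an illustrative series and a hand-picked α, follows.

import numpy as np
from statsmodels.tsa.holtwinters import SimpleExpSmoothing

x = np.random.randn(100) + 10             # placeholder series with no trend/seasonality
fit = SimpleExpSmoothing(x).fit(smoothing_level=0.2, optimized=False)   # alpha = 0.2
print(fit.fittedvalues[:5])               # the smoothed values s_t
print(fit.forecast(1))                    # best estimate of the next value of x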

DATA SCIENCE
INTERVIEW
PREPARATION
(30 Days of Interview
Preparation)

# DAY 14

Q1. What is Alexnet?
Answer:
Alex Krizhevsky, Geoffrey Hinton and Ilya Sutskever created the neural network architecture
called ‘AlexNet’ and won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2012. They trained their
network on 1.2 million high-resolution images across 1000 different classes, with 60 million parameters
and 650,000 neurons. The training was done on two GPUs with a split-layer concept because GPUs
were a little slow at that time.
AlexNet is the name of a convolutional neural network which has had a large impact on the field
of machine learning, specifically in the application of deep learning to machine vision. The network
had a very similar architecture to LeNet by Yann LeCun et al., but was deeper, with more filters per
layer and with stacked convolutional layers. It consists of 11×11, 5×5 and 3×3 convolutions, max
pooling, dropout, data augmentation, ReLU activations and SGD with momentum, with a ReLU activation
after every convolutional and fully connected layer. AlexNet was trained for
six days simultaneously on two Nvidia GeForce GTX 580 GPUs, which is why the
network is split into two pipelines.
Architecture

AlexNet contains eight layers with weights; the first five are convolutional, and the remaining three are
fully connected. The output of the last fully-connected layer is fed to a 1000-way softmax which
produces a distribution over the 1000 class labels. The network maximises the multinomial logistic
regression objective, which is equivalent to maximising the average across training cases of the log-
probability of the correct label under the prediction distribution. The kernels of the second, fourth, and
fifth convolutional layers are connected only to those kernel maps in the previous layer which
reside on the same GPU. The kernels of the third convolutional layer are connected to all the kernel maps
in the second layer. The neurons in the fully connected layers are connected to all the neurons in the previous
layer.

In short, AlexNet contains five convolutional layers and three fully connected layers. ReLU is applied
after every convolutional and fully connected layer. Dropout is applied before the first and
second fully connected layers. The network has 62.3 million parameters and needs about 1.1 billion
computation units in a forward pass. The convolutional layers, which account for only about 6% of
all the parameters, consume roughly 95% of the computation. A simplified Keras sketch of the layer stack is shown below.
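This sketch is an approximation for illustration only: the original network split its kernels across two GPUs with grouped convolutions and used local response normalisation, both of which are omitted here, so the exact parameter count differs slightly from the published figure.

from tensorflow.keras import layers, models

alexnet = models.Sequential([
    layers.Conv2D(96, 11, strides=4, activation='relu', input_shape=(227, 227, 3)),
    layers.MaxPooling2D(3, strides=2),
    layers.Conv2D(256, 5, padding='same', activation='relu'),
    layers.MaxPooling2D(3, strides=2),
    layers.Conv2D(384, 3, padding='same', activation='relu'),
    layers.Conv2D(384, 3, padding='same', activation='relu'),
    layers.Conv2D(256, 3, padding='same', activation='relu'),
    layers.MaxPooling2D(3, strides=2),
    layers.Flatten(),
    layers.Dense(4096, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(4096, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(1000, activation='softmax'),       # 1000-way ImageNet classifier
])
alexnet.summary()   # roughly 60M+ parameters, dominated by the fully connected layers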

Q2. What is VGGNet?


Answer:
VGGNet consists of 16 convolutional layers and is very appealing because of its very uniform
architecture. Similar to AlexNet, it uses only 3x3 convolutions, but lots of filters. It was trained on 4 GPUs for 2–
3 weeks. It is currently the most preferred choice in the community for extracting features from
images. The weight configuration of VGGNet is publicly available and has been used in many
other applications and challenges as a baseline feature extractor. However, VGGNet consists of 138
million parameters, which can be a bit challenging to handle.
There are multiple variants of VGGNet (VGG16, VGG19, etc.) which differ only in the total number
of layers in the network. The structural details of the VGG16 network are described in the next question.

The idea behind having fixed-size kernels is that all the variable-size convolutional kernels used
in AlexNet (11x11, 5x5, 3x3) can be replicated by using multiple 3x3 kernels as the
building blocks. The replication is in terms of the receptive field covered by the kernels.

Let’s consider an example. Say we have an input layer of size 5x5x1. Implementing a conv
layer with a kernel size of 5x5 and stride 1 results in an output feature map of size 1x1. The same
output feature map can be obtained by stacking two 3x3 conv layers with a stride of 1.

Now, let’s look at the number of variables that need to be trained. For a 5x5 conv layer filter, the
number of variables is 25. On the other hand, two conv layers of kernel size 3x3 have a total of
3x3x2 = 18 variables (a reduction of 28%). The short check below confirms both the output size and the parameter counts.
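A small Keras sketch of this comparison (a single input and output channel with no bias terms, so the counts are exactly 25 versus 9 + 9 = 18):

from tensorflow.keras import layers, models

one_5x5 = models.Sequential([layers.Conv2D(1, 5, use_bias=False, input_shape=(5, 5, 1))])
two_3x3 = models.Sequential([
    layers.Conv2D(1, 3, use_bias=False, input_shape=(5, 5, 1)),
    layers.Conv2D(1, 3, use_bias=False),
])
print(one_5x5.output_shape, one_5x5.count_params())   # (None, 1, 1, 1) and 25 weights
print(two_3x3.output_shape, two_3x3.count_params())   # (None, 1, 1, 1) and 18 weights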

Q3. What is VGG16?
Answer:
VGG16: It is a convolutional neural network model proposed by K. Simonyan and A. Zisserman
from the University of Oxford in the paper “Very Deep Convolutional Networks for Large-Scale
Image Recognition”. The model achieves 92.7% top-5 test accuracy on ImageNet, which is a dataset
of over 14 million images belonging to 1000 classes. It was one of the famous models submitted
to ILSVRC-2014. It improves on AlexNet by replacing the large kernel-sized filters (11 and 5 in the first
and second convolutional layers, respectively) with multiple 3×3 kernel-sized filters one after another.
VGG16 was trained for weeks on NVIDIA Titan Black GPUs.

The Architecture
The VGG16 architecture is as follows.

The input to the conv1 layer is a fixed-size 224 x 224 RGB image. The image is passed through
a stack of convolutional (conv.) layers, where the filters are used with a very small receptive field:
3×3 (which is the smallest size to capture the notion of left/right, up/down, centre). In one of the
configurations, it also utilises 1×1 convolution filters, which can be seen as a linear
transformation of the input channels. The convolution stride is fixed to 1 pixel, and the spatial padding
of the conv. layer input is such that the spatial resolution is preserved after the convolution, i.e. the
padding is 1 pixel for 3×3 conv. layers.

Spatial pooling is carried out by five max-pooling layers,
which follow some of the conv. layers. Max-pooling is performed over a 2×2 pixel window, with
stride 2.
Three Fully-Connected (FC) layers follow the stack of convolutional layers (which has a different
depth in different architectures): the first two have 4096 channels each, and the third performs 1000-way
ILSVRC classification and thus contains 1000 channels. The final layer is the softmax layer. The
configuration of the fully connected layers is the same in all the networks.
All hidden layers are equipped with the rectification (ReLU) non-linearity. It is also noted that none of
the networks (except for one) contains Local Response Normalisation (LRN); such normalisation
does not improve performance on the ILSVRC dataset but leads to increased memory
consumption and computation time.
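A minimal sketch of using the pre-trained VGG16 model from tensorflow.keras.applications for inference; the image path 'cat.jpg' is a placeholder, and the ImageNet weights are downloaded on first use.

import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input, decode_predictions
from tensorflow.keras.preprocessing import image

model = VGG16(weights='imagenet')           # 224x224x3 input, 1000-way softmax output
img = image.load_img('cat.jpg', target_size=(224, 224))        # placeholder image path
x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
print(decode_predictions(model.predict(x), top=5))              # top-5 class predictions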

Q4. What is ResNet?


Answer:
At ILSVRC 2015, the so-called Residual Neural Network (ResNet) by Kaiming He et al.
introduced a novel architecture with “skip connections” and features heavy batch normalisation.
Such skip connections are also known as gated units or gated recurrent units and have a strong
similarity to recent successful elements applied in RNNs. Thanks to this technique, they were able
to train a neural network with 152 layers while still having lower complexity than VGGNet. It achieved a
top-5 error rate of 3.57%, which beats human-level performance on this dataset.
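A minimal Keras sketch of a single residual block of the kind ResNet stacks (a simplified identity block; it assumes the input already has `filters` channels so the skip connection can be added directly):

from tensorflow.keras import layers

def residual_block(x, filters):
    shortcut = x                                        # identity / skip path
    y = layers.Conv2D(filters, 3, padding='same')(x)
    y = layers.BatchNormalization()(y)
    y = layers.Activation('relu')(y)
    y = layers.Conv2D(filters, 3, padding='same')(y)
    y = layers.BatchNormalization()(y)
    y = layers.Add()([shortcut, y])                     # the skip connection
    return layers.Activation('relu')(y)

inputs = layers.Input(shape=(32, 32, 64))
outputs = residual_block(inputs, 64)                    # channel counts must match for the Add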

Q5. What is HAAR CASCADE?


Answer:
Haar Cascade: It is a machine learning object detection algorithm used to identify objects in
an image or a video, based on the concept of features proposed by Paul Viola and Michael
Jones in their paper "Rapid Object Detection using a Boosted Cascade of Simple Features" in 2001.
It is a machine learning-based approach where a cascade function is trained from a large number of positive
and negative images. It is then used to detect objects in other images.
The algorithm has four stages:
 Haar Feature Selection
 Creating Integral Images
 Adaboost Training
 Cascading Classifiers
It is well known for being able to detect faces and body parts in an image but can be trained to
identify almost any object.
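A short OpenCV (cv2) sketch of face detection with the pre-trained frontal-face Haar cascade that ships with the opencv-python package; 'photo.jpg' is a placeholder path.

import cv2

cascade = cv2.CascadeClassifier(cv2.data.haarcascades + 'haarcascade_frontalface_default.xml')
img = cv2.imread('photo.jpg')                     # placeholder input image
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)      # the cascade works on greyscale images
faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
for (x, y, w, h) in faces:
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)   # draw a box per face
cv2.imwrite('faces.jpg', img)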

Q6. What is Transfer Learning?


Answer:
Transfer learning: It is a machine learning method where a model developed for one task is
reused as the starting point for a model on a second task.
Transfer learning differs from traditional machine learning in that it uses pre-trained
models that have been built for another task to jump-start the development process on a new task
or problem.

The benefit of transfer learning is that it can cut down the time it takes to develop and
train a model by reusing pieces or modules of already developed models. This helps to
speed up the model training process and accelerate results.
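A minimal Keras sketch of this idea: a pre-trained VGG16 base is frozen and used as a feature extractor, and only a small new head is trained for a hypothetical 10-class problem (the number of classes and the input size are assumptions).

from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

base = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
base.trainable = False                            # freeze the pre-trained weights

model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(256, activation='relu'),
    layers.Dense(10, activation='softmax'),       # new task-specific classifier
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])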

Q7. What is Faster, R-CNN?


Answer:
Faster R-CNN: It has two networks: a region proposal network (RPN) for generating region
proposals and a network that uses these proposals to detect objects. The main difference from
Fast R-CNN is that the latter uses selective search to generate the region proposals. The time
cost of generating region proposals is much smaller in the RPN than in selective search, since the
RPN shares most of its computation with the object detection network. In brief, the RPN ranks region
boxes (called anchors) and proposes the ones most likely to contain objects.
Anchors
Anchors play a very important role in Faster R-CNN. An anchor is a box. In the default
configuration of Faster R-CNN, there are nine anchors at each position of an image; for example,
nine anchors at the position (320, 320) of an image of size (600, 800).

Region Proposal Network:
The output of the region proposal network is a set of boxes/proposals that will be examined
by a classifier and regressor to eventually check for the occurrence of objects. To be more
precise, the RPN predicts the possibility of an anchor being background or foreground, and refines
the anchors.
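A small NumPy sketch of how nine anchors (three scales × three aspect ratios) can be generated around one position such as (320, 320); the scales and ratios below are commonly used defaults, not values taken from the text above.

import numpy as np

def anchors_at(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    boxes = []
    for s in scales:
        for r in ratios:
            w = s * np.sqrt(r)            # width and height chosen so that w * h == s * s
            h = s / np.sqrt(r)
            boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return np.array(boxes)

print(anchors_at(320, 320).shape)         # (9, 4): nine anchor boxes as (x1, y1, x2, y2)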

Q8. What is RCNN?


Answer:
To bypass the problem of selecting a huge number of regions, Ross Girshick et al. proposed a
method that uses selective search to extract just 2000 regions from the image, which they
called region proposals. Therefore, instead of trying to classify a huge number of
regions, you can work with just 2000 regions.

Problems with R-CNN:

 It still takes a huge amount of time to train the network, as we would have to classify
2000 region proposals per image.
 It cannot be implemented in real time, as it takes around 47 seconds for each test image.
 The selective search algorithm is a fixed algorithm, so no learning happens
at that stage. This can lead to the generation of bad candidate region proposals.

Q9.What is GoogLeNet/Inception?
Answer:
The winner of the ILSVRC 2014 competition was GoogLeNet from Google. It achieved a top-5 error
rate of 6.67%! This was very close to human-level performance, which the organisers of the challenge
were now forced to evaluate. As it turns out, this was rather hard to do and required some human
training to beat GoogLeNet's accuracy. After a few days of training, the human expert (Andrej
Karpathy) was able to achieve a top-5 error rate of 5.1% (single model) and 3.6% (ensemble). The
network used a CNN inspired by LeNet but implemented a novel element dubbed the
inception module. It used batch normalisation, image distortions and RMSprop. This module is based
on several very small convolutions to drastically reduce the number of parameters. The
architecture consists of a 22-layer deep CNN but reduces the number of parameters from 60
million (AlexNet) to 4 million.
It contains 1×1 convolutions in the middle of the network, and global average pooling is used at the end
of the network instead of fully connected layers. These two techniques are from another
paper, “Network In Network” (NIN). Another technique, the inception module, is to have different
sizes/types of convolutions for the same input and to stack (concatenate) all the outputs, as sketched below.
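A minimal Keras sketch of an inception-style module. This is a simplified version: the filter counts are illustrative, and the 1×1 dimensionality-reduction convolutions that GoogLeNet places before its 3×3 and 5×5 branches are omitted.

from tensorflow.keras import layers

def inception_module(x, f1=64, f3=128, f5=32, fp=32):
    b1 = layers.Conv2D(f1, 1, padding='same', activation='relu')(x)   # 1x1 branch
    b3 = layers.Conv2D(f3, 3, padding='same', activation='relu')(x)   # 3x3 branch
    b5 = layers.Conv2D(f5, 5, padding='same', activation='relu')(x)   # 5x5 branch
    bp = layers.MaxPooling2D(3, strides=1, padding='same')(x)         # pooling branch
    bp = layers.Conv2D(fp, 1, padding='same', activation='relu')(bp)
    return layers.Concatenate()([b1, b3, b5, bp])                     # stack outputs by channel

inputs = layers.Input(shape=(28, 28, 192))
outputs = inception_module(inputs)
print(outputs.shape)   # (None, 28, 28, 256): 64 + 128 + 32 + 32 channels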

Q10. What is LeNet-5?
Answer:
LeNet-5, a pioneering 7-level convolutional network by LeCun et al. in 1998 that classifies digits,
was applied by several banks to recognise hand-written numbers on checks (cheques) digitised into
32x32 pixel greyscale input images. The ability to process higher-resolution images requires larger
and more convolutional layers, so the availability of computing resources constrains this technique.

LeNet-5 is a very simple network. It has only seven layers, among which there are three convolutional
layers (C1, C3 and C5), two sub-sampling (pooling) layers (S2 and S4), and one fully connected layer
(F6), followed by the output layer. Convolutional layers use 5x5 convolutions with stride 1.
Sub-sampling layers are 2x2 average pooling layers. Tanh sigmoid activations are used
throughout the network. Several interesting architectural choices were made in LeNet-5 that are not
very common in the modern era of deep learning.
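A close Keras sketch of this layer stack. The filter counts 6/16/120 and the 84-unit F6 layer follow the original description; the 10-way softmax output is the usual digit classifier and stands in for the original RBF output layer.

from tensorflow.keras import layers, models

lenet5 = models.Sequential([
    layers.Conv2D(6, 5, activation='tanh', input_shape=(32, 32, 1)),   # C1
    layers.AveragePooling2D(2),                                        # S2
    layers.Conv2D(16, 5, activation='tanh'),                           # C3
    layers.AveragePooling2D(2),                                        # S4
    layers.Conv2D(120, 5, activation='tanh'),                          # C5
    layers.Flatten(),
    layers.Dense(84, activation='tanh'),                               # F6
    layers.Dense(10, activation='softmax'),                            # output layer
])
lenet5.summary()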

------------------------------------------------------------------------------------------------------------------------

