Data Science Interview Questions #Week2
(30 Days of Interview Preparation)
# DAY 08
Q1. What is TensorFlow?
Ans:
TensorFlow: It is an open-source software library released in 2015 by Google to make it easier for developers to design, build, and train deep learning models. TensorFlow originated as an internal library that Google developers used to build models in house, and additional functionality continues to be added to the open-source version as it is tested and vetted internally. Although TensorFlow is only one of several options available to developers, it is a popular choice because of its thoughtful design and ease of use.
At a high level, TensorFlow is a Python library that allows users to express arbitrary computation as
a graph of data flows. Nodes in this graph represent mathematical operations, whereas edges
represent data that is communicated from one node to another. Data in TensorFlow are represented
as tensors, which are multidimensional arrays. Although this framework for thinking about
computation is valuable in many different fields, TensorFlow is primarily used for deep learning in
practice and research.
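Below is a minimal sketch of this idea using TensorFlow 2.x (an assumption about the version), where tensors flow through operations such as matrix multiplication; in TF 2.x the computation is executed eagerly by default.

```python
import tensorflow as tf

# Two constant tensors (multidimensional arrays)
a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
b = tf.constant([[5.0, 6.0], [7.0, 8.0]])

# Operations are the nodes of the computation; tensors are the edges
c = tf.matmul(a, b)   # matrix multiplication
d = tf.reduce_sum(c)  # sum of all elements of c

print(c.numpy())
print(d.numpy())
```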
Some limitations of TensorFlow:
• It can have GPU memory conflicts with Theano if both are imported in the same scope.
• It has dependencies on other libraries.
• It requires prior knowledge of advanced calculus and linear algebra, along with a pretty good understanding of machine learning.
Q7. What are the use cases of TensorFlow?
Ans:
TensorFlow is an important tool for deep learning. It has mainly five use cases:
• Time Series
• Image recognition
• Sound Recognition
• Video detection
• Text-based Applications
Q9. What is Keras?
Ans:
Keras: It is an open-source neural network library written in Python that runs on top of Theano or TensorFlow. It is designed to be modular, fast, and easy to use. It was developed by François Chollet, a Google engineer.
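A minimal sketch of defining and compiling a small Keras model is shown below (Keras now also ships with TensorFlow as tf.keras; the layer sizes and the 10-class output are illustrative assumptions).

```python
from tensorflow import keras
from tensorflow.keras import layers

# A small fully connected network for a 10-class classification task
model = keras.Sequential([
    keras.Input(shape=(20,)),                 # 20 input features (assumed)
    layers.Dense(64, activation="relu"),
    layers.Dense(64, activation="relu"),
    layers.Dense(10, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```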
Some other notable features of TensorFlow include support for RNNs (Recurrent Neural Networks), along with:
• Scalability
• Visualisation of data
• Debugging facility
• Pipelining
-------------------------------------------------------------------------------------------------------------
# DAY 09
Q1: How would you define Machine Learning?
Ans:
Machine learning: It is an application of artificial intelligence (AI) that provides systems the ability to learn automatically and to improve from experience without being explicitly programmed. It focuses on the development of computer applications that can access data and use it to learn for themselves.
The process of learning starts with observations or data, such as examples, direct experience, or instruction, in order to look for patterns in the data and make better decisions in the future based on the examples that we provide. The primary aim is to allow computers to learn automatically, without human intervention or assistance, and adjust their actions accordingly.
Q3. What are the two common supervised tasks?
Ans:
The two common supervised tasks are regression and classification.
Regression-
A regression problem is one where the output variable is a real or continuous value, such as “salary” or “weight.” Many different models can be used; the simplest is linear regression, which tries to fit the data with the best hyperplane that goes through the points.
Classification
It is a type of supervised learning. It specifies the class to which data elements belong and is best used when the output has finite and discrete values. It predicts a class label for an input as well.
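A minimal scikit-learn sketch of both tasks is below; the synthetic datasets and model choices are illustrative assumptions.

```python
from sklearn.datasets import make_regression, make_classification
from sklearn.linear_model import LinearRegression, LogisticRegression

# Regression: predict a continuous target
X_reg, y_reg = make_regression(n_samples=200, n_features=3, noise=0.1, random_state=0)
reg = LinearRegression().fit(X_reg, y_reg)
print("Predicted value:", reg.predict(X_reg[:1]))

# Classification: predict a discrete class label
X_clf, y_clf = make_classification(n_samples=200, n_features=5, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_clf, y_clf)
print("Predicted class:", clf.predict(X_clf[:1]))
```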
Clustering
It is a machine learning technique that involves grouping data points. Given a set of data points, we can use a clustering algorithm to classify each data point into a specific group. In theory, data points in the same group should have similar properties and/or features, while data points in different groups should have highly dissimilar properties and/or features. Clustering is a method of unsupervised learning and a common technique for statistical data analysis used in many fields.
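As a hedged sketch, k-means from scikit-learn is one common way to do this; the blob data below is synthetic.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Unlabelled data with three natural groups
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print(kmeans.labels_[:10])       # cluster assignment for each point
print(kmeans.cluster_centers_)   # coordinates of the learned centroids
```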
Visualization
Data visualization is a technique that uses an array of static and interactive visuals within a specific context to help people understand and make sense of large amounts of data. The data is often displayed in a story format that visualizes patterns, trends, and correlations that might otherwise go unnoticed. It is regularly used as an avenue to monetize data as a product. An example of combining monetization and data visualization is Uber: the app combines visualization with real-time data so that customers can request a ride.
Q9. What is the Model Parameter?
Ans:
Model parameter: It is a configuration variable that is internal to a model and whose value is estimated (learned) from the data.
Parameters are key to machine learning algorithms. They are the part of the model that is learned from historical training data.
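For instance, in a linear regression the coefficients and the intercept are model parameters learned from the data, as in this small sketch (the numbers are made up).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.1, 4.2, 5.9, 8.1])

model = LinearRegression().fit(X, y)
# The slope and intercept are model parameters estimated from the data
print("coefficient:", model.coef_)
print("intercept:", model.intercept_)
```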
Train models using the remaining portion of the dataset.
Test the model using the reserved portion of the dataset.
--------------------------------------------------------------------------------------------------------------
# DAY 10
Q1. What is a Recommender System?
Answer:
A recommender system is widely deployed today in multiple fields such as movie recommendations, music preferences, social tags, research articles, search queries, and so on. Recommender systems work using collaborative or content-based filtering, or by deploying a personality-based approach. This type of system builds a model from a person’s past behaviour in order to predict future behaviour, such as which products a person will buy, which movies they will watch, or which books they will read. It can also create a filtering approach using the discrete characteristics of items while recommending additional items.
Answer:
SAS: It is one of the most widely used analytics tools, used by some of the biggest companies on earth. It has some of the best statistical functions and a graphical user interface, but it comes with a price tag and hence cannot be readily adopted by smaller enterprises.
R: The best part about R is that it is an open-source tool, and hence it is used generously by academia and the research community. It is a robust tool for statistical computation, graphical representation, and reporting. Due to its open-source nature, it is constantly updated with the latest features, which are then readily available to everybody.
Python: Python is a powerful open-source programming language that is easy to learn and works well with most other tools and technologies. The best part about Python is that it has innumerable libraries and community-created modules, making it very robust. It has functions for statistical operations, model building, and more.
Answer:
With data coming in from multiple sources, it is important to ensure that the data is good enough for analysis. This is where data cleansing becomes extremely vital. Data cleansing extensively deals with the process of detecting and correcting data records, ensuring that the data is complete and accurate, and that the components of the data that are irrelevant are deleted or modified as per the needs. This process can be deployed in concurrence with data wrangling or batch processing.
Once the data is cleaned, it conforms to the rules of the data sets in the system. Data cleansing is an essential part of data science because data can be prone to error due to human negligence, corruption during transmission or storage, among other things. Data cleansing takes a huge chunk of a data scientist's time and effort because of the multiple sources from which data emanates and the speed at which it arrives.
Answer:
Here we will discuss the components involved in solving a problem using machine learning.
Domain knowledge
This is the first step, wherein we need to understand how to extract the various features from the data and learn more about the data that we are dealing with. It has more to do with the type of domain that we are dealing with and with familiarizing the system with that domain so it can learn more about it.
Feature Selection
This step has more to do with the features that we select from the set of features that we have. Sometimes there are a lot of features, and we have to make an intelligent decision regarding which features we want to select to go ahead with our machine learning endeavour.
Algorithm
This is a vital step, since the algorithm that we choose will have a major impact on the entire process of machine learning. You can choose between linear and nonlinear algorithms. Some of the algorithms used are Support Vector Machines, Decision Trees, Naïve Bayes, K-Means Clustering, etc.
Training
This is the most important part of the machine learning technique, and this is where it differs from traditional programming. The training is done based on the data that we have, providing the machine with more real-world experience. With each consequent training step, the machine gets better and smarter and is able to take improved decisions.
Evaluation
In this step, we evaluate the decisions taken by the machine in order to decide whether they are up to the mark or not. Various metrics are involved in this process, and we have to apply each of them closely to judge the efficacy of the whole machine learning endeavour.
Optimization
This process involves improving the performance of the machine learning pipeline using various optimization techniques. Optimization is one of the most vital components, wherein the performance of the algorithm is vastly improved. The best part of optimization techniques is that machine learning is not just a consumer of optimization techniques; it also provides new ideas for optimization.
Testing
Here various tests are carried out, and some of these use unseen sets of test cases. The data is partitioned into test and training sets. There are various testing techniques, such as cross-validation, in order to deal with multiple situations.
Q4. What is Interpolation and Extrapolation?
Answer:
The terms interpolation and extrapolation are extremely important in any statistical analysis.
Extrapolation is the determination or estimation of a value by extending a known set of values or facts into an area or region that is unknown. It is the technique of inferring something using the data that is available.
Interpolation, on the other hand, is the method of determining a value which falls between a certain set of values or a sequence of values. It is especially useful when you have data at the two extremities of a certain region but not enough data points at the specific point of interest. This is when you deploy interpolation to determine the value that you need.
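A tiny NumPy sketch of both ideas, with made-up numbers, is shown below.

```python
import numpy as np

xp = np.array([0.0, 10.0])       # known points at the two extremities
fp = np.array([100.0, 200.0])

# Interpolation: estimate a value inside the known range
print(np.interp(4.0, xp, fp))    # 140.0

# Extrapolation: extend the same linear trend outside the known range
slope = (fp[1] - fp[0]) / (xp[1] - xp[0])
print(fp[1] + slope * (15.0 - xp[1]))   # 250.0
```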
Q5. What does P-value signify about the statistical data?
Answer:
P-value is used to determine the significance of results after a hypothesis test in statistics. P-value helps
the readers to draw conclusions and is always between 0 and 1.
• P-value > 0.05 denotes weak evidence against the null hypothesis, which means the null hypothesis cannot be rejected.
• P-value <= 0.05 denotes strong evidence against the null hypothesis, which means the null hypothesis can be rejected.
• P-value = 0.05 is the marginal value, indicating it is possible to go either way.
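As a hedged sketch, a one-sample t-test from scipy.stats shows where a p-value comes from; the sample here is synthetic.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=0.3, scale=1.0, size=50)

# Null hypothesis: the population mean is 0
t_stat, p_value = stats.ttest_1samp(sample, popmean=0.0)
print("p-value:", p_value)
if p_value <= 0.05:
    print("Strong evidence against the null hypothesis: reject it")
else:
    print("Weak evidence against the null hypothesis: fail to reject it")
```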
There are various factors to be considered when answering this question:
Understand the problem statement and understand the data before deciding on a treatment. One option is assigning a default value, which can be the mean, minimum, or maximum value; getting into the data is important. If it is a categorical variable, a default value is assigned. If the data follows a roughly normal distribution, the mean value can be used for the missing values.
Whether we should even treat the missing values is another important point to consider. If 80% of the values for a variable are missing, you can answer that you would drop the variable instead of treating the missing values.
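A small pandas sketch of these treatments is below; the toy DataFrame and column names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":  [25, np.nan, 31, 40, np.nan],
    "city": ["Pune", "Delhi", None, "Mumbai", "Delhi"],
})

# Numerical column: fill with the mean (reasonable for roughly normal data)
df["age"] = df["age"].fillna(df["age"].mean())

# Categorical column: fill with a default value such as the most frequent one
df["city"] = df["city"].fillna(df["city"].mode()[0])

# If most values of a column were missing, dropping it may be better:
# df = df.drop(columns=["mostly_missing_column"])
print(df)
```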
Q7. Explain the difference between a Test Set and a Validation Set?
Answer:
The validation set can be considered a part of the training set, as it is used for parameter selection and to avoid overfitting of the model being built. On the other hand, the test set is used for testing or evaluating the performance of a trained machine learning model.
In simple terms, the differences can be summarized as:
• Training set: used to fit the parameters, i.e. the weights.
• Validation set: used to tune the hyperparameters and select the model.
• Test set: used to assess the performance of the model, i.e. to evaluate its predictive power and generalization.
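One way to carve out these three sets with scikit-learn is sketched below; the 60/20/20 proportions and the Iris data are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# First hold out the test set, then split the rest into train and validation
X_train_full, X_test, y_train_full, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_train_full, y_train_full, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 90 / 30 / 30 samples
```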
Q8. What is the curse of dimensionality? Can you list some ways to
deal with it?
Answer:
The curse of dimensionality is when the training data has a high feature count, but the dataset does not
have enough samples for a model to learn correctly from so many features. For example, a training dataset
of 100 samples with 100 features will be very hard to learn from because the model will find random
relations between the features and the target. However, if we had a dataset of 100k samples with 100
features, the model could probably learn the correct relationships between the features and the target.
Some ways to deal with it (a PCA sketch follows the list):
• Feature selection. Instead of using all the features, we can train on a smaller subset of features.
• Dimensionality reduction. There are many techniques that allow us to reduce the dimensionality of the features. Principal component analysis (PCA) and autoencoders are examples of dimensionality reduction techniques.
• L1 regularization. Because it produces sparse parameters, L1 helps to deal with high-dimensionality input.
• Feature engineering. It is possible to create new features that sum up multiple existing features. For example, we can compute statistics such as the mean or median.
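The following sketch shows dimensionality reduction with PCA on a synthetic 100-sample, 100-feature dataset, matching the example above; the choice of 10 components is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

# 100 samples with 100 features: too few samples for so many dimensions
X, y = make_classification(n_samples=100, n_features=100, n_informative=10,
                           random_state=0)

# Project onto the 10 directions that explain the most variance
pca = PCA(n_components=10)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                       # (100, 10)
print(pca.explained_variance_ratio_.sum())   # fraction of variance retained
```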
Q9. What is data augmentation? Can you give some examples?
Answer:
Data augmentation is a technique for synthesizing new data by modifying existing data in such a way
that the target is not changed, or it is changed in a known way.
Computer vision is one of the fields where data augmentation is very useful. There are many modifications that we can make to images:
• Resize
• Horizontal or vertical flip
• Rotate
• Add noise
• Deform
• Modify colors
Each problem needs a customized data augmentation pipeline. For example, in OCR, doing flips will change the text and won't be beneficial; however, resizes and small rotations may help.
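A minimal NumPy-only sketch of a few of these modifications is below; real pipelines would more likely use a library such as tf.image or a dedicated augmentation package.

```python
import numpy as np

def augment(image, rng):
    """Return a randomly modified copy of an (H, W, C) image array."""
    out = image.copy()
    if rng.random() < 0.5:
        out = np.fliplr(out)                           # horizontal flip
    out = np.rot90(out, k=int(rng.integers(0, 4)))     # rotate by a multiple of 90 degrees
    noise = rng.normal(0.0, 5.0, size=out.shape)       # small Gaussian noise
    return np.clip(out + noise, 0, 255)

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(32, 32, 3)).astype(float)
print(augment(image, rng).shape)   # (32, 32, 3): same shape, same label as the original
```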
Cross-validation is a technique for dividing data between training and validation sets. In typical cross-validation, this split is done randomly, but in stratified cross-validation the split preserves the ratio of the categories in both the training and validation datasets.
For example, if we have a dataset with 10% of category A and 90% of category B, and we use stratified
cross-validation, we will have the same proportions in training and validation. In contrast, if we use
simple cross-validation, in the worst case we may find that there are no samples of category A in the
validation set.
Stratified cross-validation is especially useful (a sketch follows below):
• On a dataset with multiple categories. The smaller the dataset and the more imbalanced the categories, the more important it is to use stratified cross-validation.
• On a dataset with data from different distributions. For example, in a dataset for autonomous driving, we may have images taken during the day and at night. If we do not ensure that both types are present in training and validation, we will have generalization problems.
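A hedged scikit-learn sketch of stratified splitting on an imbalanced synthetic dataset:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold

# Imbalanced data: roughly 10% of class 1 and 90% of class 0
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, val_idx in skf.split(X, y):
    # Each validation fold keeps roughly the same 90/10 class ratio
    print("fraction of class 1 in this fold:", round(y[val_idx].mean(), 3))
```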
# DAY 11
Q1. What are tensors?
Answer:
Tensors are nothing more than a way of representing data in deep learning. Put in simple terms, tensors are just multidimensional arrays, which allow developers to represent data with several dimensions, where each dimension can correspond to a different feature of the dataset.
The foremost benefit of using tensors (as in TensorFlow) is that they provide much-needed platform flexibility and are easy to train on a CPU or GPU. Apart from this, the framework provides auto-differentiation capabilities and advanced support for queues, threads, and asynchronous computation. All these features also make it customizable.
Answer:
RNNs are artificial neural networks that were created to analyze and recognize patterns in sequences of data. Due to their internal memory, RNNs can remember important things about the inputs they receive.
Most common issues faced with RNNs
Although RNNs have been around for a while and use backpropagation, there are some common issues faced by developers who work with them. The most common ones are:
• Exploding gradients
• Vanishing gradients
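As a minimal sketch, a recurrent layer in Keras looks like the following; the sequence length, feature count, and binary output are illustrative assumptions.

```python
from tensorflow import keras
from tensorflow.keras import layers

# A small recurrent network for sequences of 10 time steps with 8 features each
model = keras.Sequential([
    keras.Input(shape=(10, 8)),
    layers.SimpleRNN(32),                    # the internal state acts as memory
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.summary()
```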
Q3. What is a ResNet, and where would you use it? Is it efficient?
Answer:
Among the various neural networks that are used for computer vision, ResNet (Residual Neural
Networks), is one of the most popular ones. It allows us to train extremely deep neural networks, which
is the prime reason for its huge usage and popularity. Before the invention of this network, training
extremely deep neural networks was almost impossible.
To understand why, we must look at the vanishing gradient problem, which is an issue that arises when the gradient is backpropagated through all the layers. As a large number of multiplications are performed, the gradient keeps shrinking until it becomes extremely small, and thus the earlier layers stop learning and the network starts performing badly. ResNet helps to counter the vanishing gradient problem.
The efficiency of this network is highly dependent on the concept of skip connections. Skip connections
are a method of allowing a shortcut path through which the gradient can flow, which in effect helps
counter the vanishing gradient problem.
In general, a skip connection allows us to skip the training of a few layers. Skip connections are also
called identity shortcut connections as they allow us to directly compute an identity function by just
relying on these connections and not having to look at the whole network.
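A residual (skip) connection can be sketched with the Keras functional API as below; the input shape and filter counts are assumptions, not the exact ResNet configuration.

```python
from tensorflow import keras
from tensorflow.keras import layers

inputs = keras.Input(shape=(32, 32, 64))

# Two convolutions followed by an identity (skip) connection
x = layers.Conv2D(64, 3, padding="same", activation="relu")(inputs)
x = layers.Conv2D(64, 3, padding="same")(x)
x = layers.Add()([x, inputs])          # the gradient can flow through this shortcut
outputs = layers.Activation("relu")(x)

block = keras.Model(inputs, outputs)
block.summary()
```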
Q4. Transfer learning is one of the most useful concepts today. Where
can it be used?
Answer:
Pre-trained models are probably one of the most common use cases for transfer learning.
For anyone who does not have access to huge computational power, training complex models is always
a challenge. Transfer learning aims to help by both improving the performance and speeding up your
network.
In layman's terms, transfer learning is a technique in which a model that has already been trained to do one task is used for another without much change. It is closely related to multi-task learning.
Many pre-trained models are available online, and any of these can be used as a starting point in the creation of the new model required. After reusing the weights, the model must be refined and adapted to the required data by fine-tuning the parameters of the model.
The general idea behind transfer learning is to transfer knowledge, not data. For humans, this task is easy: we can generalize models that we mentally created a long time ago for a different purpose, and one or two samples are almost always enough. In the case of neural networks, however, a huge amount of data and computational power are required.
Transfer learning should generally be used when we don’t have a lot of labeled training data, or if there
already exists a network for the task you are trying to achieve, probably trained on a much more massive
dataset. Note, however, that the input to the new model must have the same size as the input the original model was trained on. Also, this works only if the tasks are fairly similar to each other and the features learned can be generalized. For
example, something like learning how to recognize vehicles can probably be extended to learn how to
recognize airplanes and helicopters.
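A hedged Keras sketch of reusing a pre-trained ImageNet backbone for a new 5-class task is below; the choice of MobileNetV2 and the class count are assumptions (and loading the "imagenet" weights requires a download).

```python
from tensorflow import keras
from tensorflow.keras import layers

# Reuse convolutional features learned on ImageNet; train only a new head
base = keras.applications.MobileNetV2(include_top=False, weights="imagenet",
                                      input_shape=(224, 224, 3), pooling="avg")
base.trainable = False                        # freeze the pre-trained weights

model = keras.Sequential([
    base,
    layers.Dense(5, activation="softmax"),    # 5 classes in the new task
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```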
Answer:
A hyperparameter is a variable that is set before training and that defines the structure of the network or how it is trained. Let's go through some hyperparameters and see the effect of tuning them (a short Keras sketch follows the list).
1. Number of hidden layers – Most times, the presence or absence of a large number of hidden layers determines the output, accuracy, and training time of the neural network. Having a large number of these layers may sometimes increase accuracy.
2. Learning rate – This is simply a measure of how fast the neural network will change its parameters. A large learning rate may speed up learning but can prevent the network from converging; a smaller learning rate will probably slow the network down but makes convergence more likely.
3. Number of epochs – This is the number of times the entire training data is run through the network. Increasing the number of epochs tends to improve accuracy, up to the point where the model starts to overfit.
4. Momentum – Momentum is a measure of how and where the network will go while taking into account all of its past updates. A proper amount of momentum can lead to a better network.
5. Batch size – The batch size determines the number of samples that are fed to the network before every parameter update.
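The sketch below shows where some of these hyperparameters are set in Keras; the data is random and the specific values are illustrative assumptions.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

X = np.random.rand(1000, 20)
y = np.random.randint(0, 2, size=1000)

model = keras.Sequential([
    keras.Input(shape=(20,)),
    layers.Dense(64, activation="relu"),      # one hidden layer (a hyperparameter)
    layers.Dense(1, activation="sigmoid"),
])

# Learning rate and momentum live on the optimizer;
# the number of epochs and the batch size on the training call.
opt = keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)
model.compile(optimizer=opt, loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=10, batch_size=32, verbose=0)
```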
Q6. Why are deep learning models referred to as black boxes?
Answer:
Lately, the concept of deep learning being a black box has been floating around. A black box is a system
whose functioning cannot be properly grasped, but the output produced can be understood and utilized.
Now, since most models are mathematically sound and are created based on legit equations, how is it
possible that we do not know how the system works?
First, it is almost impossible to visualize the functions that are generated by a system. Most machine
learning models end up with such complex output that a human can't make sense of it.
Second, there are networks with millions of parameters. As humans, we can grasp around 10 to 15 parameters, but analysing a million of them is out of the question.
Third and most important, it becomes very hard, if not impossible, to trace back why the system made
the decisions it did. This may not sound like a huge problem to worry about, but consider the case of a self-driving car. If the car hits someone on the road, we need to understand why that happened and prevent it from happening again. But this isn't possible if we do not understand how the system works.
To make deep learning models less of a black box, a new field called Explainable Artificial Intelligence, or simply Explainable AI, is emerging. This field aims to produce intermediate results and trace back the decision-making process of a system.
Answer:
Recurrent neural networks allow information to be stored as a memory using loops. Thus, the output of
a recurrent neural network is not only based on the current input but also the past inputs which are stored
in the memory of the network. Backpropagation is done through time, but in general, the truncated
version of this is used for longer sequences.
Gates are generally used in networks that are dependent on time. In effect, any network which would
require memory, so to speak, would benefit from the use of gates. These gates are generally used to keep
track of any information that is required by the network without leading to a state of either vanishing or
exploding gradients. Such a network can also preserve the error through time. Since a sense of constant
error is maintained, the network can learn better.
These gated units can be considered as units with recurrent connections. They also contain additional
neurons, which are gates. If you relate this process to a signal processing system, the gate is used to
regulate which part of the signal passes through. A sigmoid activation function is used which means that
the values taken are from 0 to 1.
An advantage of using gates is that it enables the network to either forget information that it has already
learned or to selectively ignore information either based on the state of the network or the input the gate
receives.
Gates are extensively used in recurrent neural networks, especially in Long Short-Term Memory (LSTM) networks. A standard LSTM cell has three gates: an input gate, a forget gate, and an output gate.
The Sobel filter performs a two-dimensional spatial gradient measurement on a given image, which then
emphasizes regions that have a high spatial frequency. In effect, this means finding edges.
In most cases, Sobel filters are used to find the approximate absolute gradient magnitude for every point
in a grayscale image. The operator consists of a pair of 3×3 convolution kernels. One of these kernels is
rotated by 90 degrees.
These kernels respond to edges that run horizontal or vertical with respect to the pixel grid, one kernel
for each orientation. A point to note is that these kernels can be applied either separately or can be
combined to find the absolute magnitude of the gradient at every point.
The Sobel operator has a large convolution kernel, which ends up smoothing the image to a greater
extent, and thus, the operator becomes less sensitive to noise. It also produces higher output values for
similar edges compared to other methods.
The output values of the operator can overflow the maximum allowed pixel value for the image type. To overcome this problem, an image type that supports a larger range of pixel values should be used for the output.
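A small sketch of applying a Sobel filter with SciPy is below; the synthetic image with a single vertical edge is an assumption for illustration.

```python
import numpy as np
from scipy import ndimage

# A synthetic grayscale image with a vertical edge down the middle
image = np.zeros((64, 64))
image[:, 32:] = 255.0

gx = ndimage.sobel(image, axis=1)   # horizontal gradient (responds to vertical edges)
gy = ndimage.sobel(image, axis=0)   # vertical gradient (responds to horizontal edges)

magnitude = np.hypot(gx, gy)        # approximate absolute gradient magnitude
print(int(magnitude[32].argmax()))  # the strongest response sits at the edge column
```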
Answer:
Boltzmann machines are algorithms that are based on physics, specifically thermal equilibrium. A special and more well-known case of the Boltzmann machine is the Restricted Boltzmann Machine, a type of Boltzmann machine in which there are no connections between units within the same layer (hidden units are not connected to each other, and neither are visible units).
The concept was coined by Geoff Hinton, who most recently won the Turing award. In general, the
algorithm uses the laws of thermodynamics and tries to optimize a global distribution of energy in the
system.
In discrete mathematical terms, a restricted Boltzmann machine can be called a symmetric bipartite
graph, i.e. two symmetric layers. These machines are a form of unsupervised learning, which means that
there are no labels provided with data. It uses stochastic binary units to reach this state.
Boltzmann machines are derived from Markov state machines. A Markov State Machine is a model that
can be used to represent almost any computable function. The restricted Boltzmann machine can be
regarded as an undirected graphical model. It is used in dimensionality reduction, collaborative filtering,
learning features as well as modeling. It can also be used for classification and regression. In general,
restricted Boltzmann machines are composed of a two-layer network, which can then be extended further.
Note that these models are probabilistic since each of the nodes present in the system learns low-level
features from items in the dataset. For example, if we take a grayscale image, each node that is
responsible for the visible layer will take just one-pixel value from the image.
A part of the process of creating such a machine is building a feature hierarchy, where sequences of activations are grouped in terms of features. Following thermodynamic principles, simulated annealing is a process that the machine uses to separate signal from noise.
Q10. What are the types of weight initialization?
Answer:
There are two major types of weight initialization: zero initialization and random initialization.
Zero initialization: In this process, biases and weights are initialised to 0. If the weights are set to 0, all derivatives with respect to the loss function in the weight matrix become equal, and hence none of the weights change during subsequent iterations. Setting the bias to 0 cancels out any effect it may have.
All hidden units become symmetric due to zero initialization. In general, zero initialization is not very
useful or accurate for classification and thus must be avoided when any classification task is required.
Random initialization: Compared to zero initialization, this involves setting random values for the weights. The only disadvantage is that setting very high values increases the learning time, as the sigmoid activation function maps close to 1. Likewise, if very low values are set, the learning time increases as the activation function maps close to 0.
Setting too high or too low values thus generally leads to the exploding or vanishing gradient problem.
New types of weight initialization like “He initialization” and “Xavier initialization” have also
emerged. These are based on specific equations and are not mentioned here due to their sheer
complexity.
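A hedged Keras sketch of choosing initializers per layer is below; the layer sizes are assumptions, and the commented-out zero initializer is shown only to illustrate what to avoid.

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(100,)),
    # Random normal initialization: small random weights break symmetry
    layers.Dense(64, activation="relu",
                 kernel_initializer=keras.initializers.RandomNormal(stddev=0.05)),
    # He initialization: well suited to ReLU activations
    layers.Dense(64, activation="relu", kernel_initializer="he_normal"),
    # Xavier (Glorot) initialization: well suited to tanh/sigmoid activations
    layers.Dense(10, activation="softmax", kernel_initializer="glorot_uniform"),
])
# layers.Dense(64, kernel_initializer="zeros") would make all hidden units symmetric
model.summary()
```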
# DAY 12
Q1. Where is the confusion matrix used? Which module would you
use to show it?
Answer:
In machine learning, the confusion matrix is one of the easiest ways to summarize the performance of your algorithm.
At times, it is difficult to judge a model by just looking at its accuracy because of problems like an unequal class distribution. So a better way to check how good your model is, is to use a confusion matrix.
Classification accuracy – this is the ratio of the number of correct predictions to the total number of predictions made.
The confusion matrix is simply a matrix containing the counts of true positives, false positives, true negatives, and false negatives.
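It is commonly computed with the sklearn.metrics module, as in this small sketch (the labels here are made up); ConfusionMatrixDisplay can plot it if matplotlib is available.

```python
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

cm = confusion_matrix(y_true, y_pred)
print(cm)   # rows = actual classes, columns = predicted classes

# Optional visualisation (requires matplotlib):
# ConfusionMatrixDisplay(cm).plot()
```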
Q2: What is Accuracy?
Answer:
It is the most intuitive performance measure, and it is simply the ratio of correctly predicted observations to the total observations. We might say that if we have high accuracy, then our model is best; however, accuracy is a great measure only when you have symmetric datasets where the numbers of false positives and false negatives are almost the same.
Accuracy = (True Positives + True Negatives) / (True Positives + False Positives + False Negatives + True Negatives)
Precision is the number of positive elements predicted correctly divided by the total number of positive elements predicted:
Precision = True Positives / (True Positives + False Positives)
We can say precision is a measure of exactness or quality. High precision means that most or all of the positive results you predicted are correct.
Q4: What is Recall?
Answer:
Recall is also called sensitivity or the true positive rate.
It is the number of positives that our model predicts correctly compared to the actual number of positives in our data:
Recall = True Positives / (True Positives + False Negatives)
Recall = True Positives / Total Actual Positives
Recall is a measure of completeness. High recall means that our model classified most or all of the possible positive elements as positive.
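These metrics are available in sklearn.metrics, as this small sketch with made-up labels shows.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("Recall   :", recall_score(y_true, y_pred))     # TP / (TP + FN)
```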
Q6: What is Bias and Variance trade-off?
Answer:
Bias
Bias refers to how far the predicted values are from the actual values. If the average predicted values are far off from the actual values, we say the model has high bias.
When our model has high bias, it means that the model is too simple and does not capture the complexity of the data, thus underfitting the data.
Variance
It occurs when the model performs well on the training dataset but does not do well on a dataset that it was not trained on, such as a test or validation dataset. Variance tells us how scattered the predicted values are from the actual values.
High variance causes overfitting, which implies that the algorithm models the random noise present in the training data.
When a model has high variance, it becomes very flexible and tunes itself to the data points of the training set.
Bias-variance decomposition essentially decomposes the learning error of any algorithm into the bias, the variance, and a bit of irreducible error due to noise in the underlying dataset. Essentially, if we make the model more complex and add more variables, we lose bias but gain variance; to get the optimally reduced amount of error, we have to trade off bias and variance. We don't want either high bias or high variance in our model.
Answer:
Data wrangling is a process by which we convert and map data. This changes data from its raw form to a format that is a lot more valuable.
Data wrangling is often the first step in machine learning and deep learning projects. The end goal is to provide data that is actionable and to provide it as fast as possible.
There are three major things to focus on while talking about data wrangling –
1. Acquiring data
The first and probably the most important step in data science is acquiring, sorting, and cleaning data. This is an extremely tedious process and requires the most time.
Sources for data collection: data is publicly available on various websites and portals such as kaggle.com, data.gov, the World Bank, FiveThirtyEight datasets, AWS datasets, and Google datasets.
2. Data cleaning
Data cleaning is an essential component of data wrangling and requires a lot of patience. To make the job easier, it is first essential to format the data and make it readable for humans.
3. Data Computation
At times, your machine may not have enough resources to run your algorithm, e.g. you might not have a GPU. In these cases, you can use publicly available APIs and platforms to run your algorithm. These are standard endpoints found on the web which allow you to use computing power over the web and process data without having to rely on your own system. An example would be the Google Colab platform.
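A tiny pandas sketch of the cleaning step is below; the toy records and column names are assumptions for illustration.

```python
import pandas as pd

raw = pd.DataFrame({
    "name":   [" Alice ", "Bob", "Bob", None],
    "salary": ["50,000", "62000", "62000", "58000"],
})

clean = (raw
         .drop_duplicates()                             # remove repeated records
         .dropna(subset=["name"])                       # drop incomplete rows
         .assign(name=lambda d: d["name"].str.strip(),  # tidy the formatting
                 salary=lambda d: d["salary"].str.replace(",", "").astype(int)))
print(clean)
```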
Q8. Why is normalization required before applying any machine
learning model? What module can you use to perform normalization?
Answer:
Normalization is a process that is required when an algorithm uses something like distance
measures. Examples would be clustering data, finding cosine similarities, creating recommender
systems.
Normalization is not always required; it is done to prevent variables that are on a higher scale from dominating variables that are on a lower scale. For example, consider a dataset of employees' incomes: this feature won't be on the same scale as other features, so if you try to cluster the data, we would have to normalize it first to prevent incorrect clustering.
A key point to note is that normalization does not distort the differences in the range of values.
A problem we might face if we don't normalize data is that gradients would take a very long time to descend and reach the global minimum.
One common normalization (min-max scaling) is: Xnew = (x - xmin) / (xmax - xmin)
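A common module for this is sklearn.preprocessing, whose MinMaxScaler applies exactly the formula above; the income/experience numbers below are made up.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Income is on a much larger scale than years of experience
X = np.array([[25000.0, 2.0],
              [48000.0, 5.0],
              [99000.0, 12.0]])

scaler = MinMaxScaler()          # applies (x - xmin) / (xmax - xmin) per column
X_scaled = scaler.fit_transform(X)
print(X_scaled)                  # every column now lies in [0, 1]
```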
Q9. What is the difference between feature selection and feature
extraction?
Feature selection and feature extraction are two major ways of addressing the curse of dimensionality.
1. Feature selection:
Feature selection is used to filter a subset of input variables on which the attention should focus.
Every other variable is ignored. This is something which we, as humans, tend to do subconsciously.
Many domains have tens of thousands of variables out of which most are irrelevant and redundant.
Feature selection limits the training data and reduces the amount of computational resources used. It can significantly improve a learning algorithm's performance.
In summary, we can say that the goal of feature selection is to find an optimal feature subset. Determining this subset exactly is not always feasible; however, methods for understanding the importance of features exist, and some Python modules, such as XGBoost, help achieve this.
2. Feature extraction
Feature extraction involves the transformation of the original features in order to derive new, more informative features. For example, the extraction of bigrams from a text, or the extraction of contours from an image, are examples of feature extraction.
The general workflow is to apply feature extraction on the given data to derive features and then apply feature selection with respect to the target variable to select a subset of them. In effect, this helps improve the accuracy of the model.
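A hedged scikit-learn sketch contrasting the two ideas on a synthetic dataset:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(n_samples=200, n_features=30, n_informative=5,
                           random_state=0)

# Feature selection: keep the 5 most informative original columns
X_selected = SelectKBest(score_func=f_classif, k=5).fit_transform(X, y)

# Feature extraction: build 5 new features as combinations of all columns
X_extracted = PCA(n_components=5).fit_transform(X)

print(X_selected.shape, X_extracted.shape)   # (200, 5) (200, 5)
```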
Q10. Why are polarity and subjectivity an issue?
Answer:
Polarity and subjectivity are terms which are generally used in sentiment analysis.
Polarity is the variation of emotions in a sentence. Since sentiment analysis is widely dependent on
emotions and their intensity, polarity turns out to be an extremely important factor.
In most cases, opinions and sentiment analysis are evaluations. They fall under the categories of
emotional and rational evaluations.
Rational evaluations, as the name suggests, are based on facts and rationality while emotional
evaluations are based on non-tangible responses, which are not always easy to detect.
Subjectivity in sentiment analysis is a matter of personal feelings and beliefs, which may or may not be based on any fact. When there is a lot of subjectivity in a text, it must be explained and analysed in context. Polarity, on the other hand, can be expressed as a positive, negative, or neutral emotion.
Answer:
ARIMA is a widely used statistical method which stands for Auto Regressive Integrated Moving
Average. It is generally used for analyzing time series data and time series forecasting. Let’s take a
quick look at the terms involved.
Auto Regression is a model that uses the relationship between an observation and some number of lagged observations.
Integrated refers to the use of differencing of raw observations, which helps make the time series stationary.
Moving Average is a model that uses the relationship and dependency between an observation and the residual errors from a moving average model applied to the lagged observations.
Note that each of these components is used as a parameter. After the construction of the model, a linear-regression-style model is fitted. Preparing the data also involves removing trends and structures that would negatively affect the model.
# DAY 13
Q1. What is Autoregression?
Answer:
The autoregressive (AR) model is commonly used to model time-varying processes and solve
problems in the fields of natural science, economics and finance, and others. The models have always
been discussed in the context of random process and are often perceived as statistical tools for time
series data.
A regression model, such as linear regression, models an output value based on a linear combination of input values.
Example: y^ = b0 + b1*X1
Where y^ is the prediction, b0 and b1 are coefficients found by optimising the model on training
data, and X is an input value.
This model technique can be used on the time series where input variables are taken as observations
at previous time steps, called lag variables.
For example, we can predict the value for the next time step (t+1) given the observations at the last
two time steps (t-1 and t-2). As a regression model, this would look as follows:
X(t+1) = b0 + b1*X(t-1) + b2*X(t-2)
Because the regression model uses the data from the same input variable at previous time steps, it is
referred to as an autoregression.
The notation AR(p) refers to the autoregressive model of order p. The AR(p) model can be written as
X(t) = c + b1*X(t-1) + b2*X(t-2) + ... + bp*X(t-p) + e(t),
where c is a constant, b1, ..., bp are the model coefficients, and e(t) is white noise.
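A hedged sketch of fitting an autoregressive model with statsmodels is below; the synthetic AR(2)-like series and the lag order are assumptions.

```python
import numpy as np
from statsmodels.tsa.ar_model import AutoReg

# A synthetic series driven by its own two previous values plus noise
rng = np.random.default_rng(0)
x = np.zeros(200)
for t in range(2, 200):
    x[t] = 0.6 * x[t - 1] - 0.2 * x[t - 2] + rng.normal()

model = AutoReg(x, lags=2).fit()
print(model.params)                        # estimated constant and lag coefficients
print(model.predict(start=200, end=204))   # forecast the next 5 steps
```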
Q2. What is Moving Average?
Answer:
Moving average: With this technique, we get an overall idea of the trends in a dataset; it is an average of any subset of numbers. The moving average is extremely useful for forecasting long-term trends, and we can calculate it for any period. For example, if we have sales data for twenty years, we can calculate a five-year moving average, a four-year moving average, a three-year moving average, and so on. Stock market analysts will often use a 50- or 200-day moving average to help them see trends in the stock market and (hopefully) forecast where the stocks are headed.
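A one-line pandas sketch of a three-period moving average on made-up sales figures:

```python
import pandas as pd

sales = pd.Series([12, 15, 14, 18, 21, 19, 24, 26, 25, 30])

# Three-period moving average smooths out short-term fluctuations
print(sales.rolling(window=3).mean())
```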
Autoregressive (AR) model –
The procedure for using the AR model is as follows:
• Select the AR model and then equalize the output to equal the signal being studied if the input is an impulse function or white noise. It should be at least a good approximation of the signal.
• Find the model's parameters (and their number) using the known autocorrelation function or the data.
• Use the derived model parameters to estimate the power spectrum of the signal.
Moving Average (MA) model –
It is a commonly used model in modern spectrum estimation and is also one of the methods of model-parametric spectrum analysis. The procedure for estimating the MA model's signal spectrum is as follows:
• Select the MA model and then equalise the output to equal the signal under study in the case where the input is an impulse function or white noise. It should be at least a good approximation of the signal.
• Find the model's parameters using the known autocorrelation function.
• Estimate the signal's power spectrum using the derived model parameters.
In the estimation of the ARMA parameter spectrum, the AR parameters are first estimated, and then the MA parameters are estimated based on these AR parameters. The spectral estimates of the ARMA model are then obtained. The parameter estimation of the MA model is therefore often carried out as part of the ARMA parameter spectrum estimation.
The notation ARMA(p, q) refers to the model with p autoregressive terms and q moving-average terms; this model contains the AR(p) and MA(q) models.
Q4. What is Autoregressive Integrated Moving Average (ARIMA)?
Answer:
ARIMA: It is a statistical analysis model that uses time-series data to either better understand the data
set or to predict future trends.
An ARIMA model can be understood by outlining each of its components as follows-
Autoregression (AR): It refers to a model that shows a changing variable that regresses on
its own lagged, or prior, values.
Integrated (I): It represents the differencing of raw observations to allow for the time series
to become stationary, i.e., data values are replaced by the difference between the data values
and the previous values.
Moving average (MA): It incorporates the dependency between an observation and the
residual error from the moving average model applied to the lagged observations.
Each component functions as a parameter with a standard notation. For ARIMA models, the standard notation is ARIMA(p, d, q), where integer values substitute for the parameters to indicate the type of ARIMA model used. The parameters can be defined as-
p: the number of lag observations in the model; also known as the lag order.
d: the number of times that the raw observations are differenced; also known as the degree of differencing.
q: the size of the moving average window; also known as the order of the moving average.
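A hedged statsmodels sketch of fitting an ARIMA(p, d, q) model on a synthetic trending series; the order (2, 1, 1) is an illustrative assumption.

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
series = np.cumsum(rng.normal(loc=0.1, size=200))   # a trending series

# p = 2 lag terms, d = 1 difference to remove the trend, q = 1 moving-average term
results = ARIMA(series, order=(2, 1, 1)).fit()
print(results.params)
print(results.forecast(steps=5))   # the next five predicted values
```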
Q5.What is SARIMA (Seasonal Autoregressive Integrated Moving-
Average)?
Answer:
Seasonal ARIMA: It is an extension of ARIMA that explicitly supports the univariate time series
data with the seasonal component.
It adds three new hyper-parameters to specify the autoregression (AR), differencing (I) and the
moving average (MA) for the seasonal component of the series, as well as an additional parameter
for the period of the seasonality.
Configuring the SARIMA requires selecting hyperparameters for both the trend and seasonal
elements of the series.
Trend Elements
Three trend elements require configuration. They are the same as in the ARIMA model, specifically:
p: trend autoregression order.
d: trend difference order.
q: trend moving average order.
Seasonal Elements
Four seasonal elements are not part of ARIMA and must be configured; they are:
P: seasonal autoregressive order.
D: seasonal difference order.
Q: seasonal moving average order.
m: the number of time steps in a single seasonal period.
Together, the notation is written as SARIMA(p,d,q)(P,D,Q)m.
The elements can be chosen through careful analysis of the ACF and PACF plots looking at the
correlations of recent time steps.
Q6. What is Seasonal Autoregressive Integrated Moving-Average
with Exogenous Regressors (SARIMAX) ?
Answer:
SARIMAX: It is an extension of the SARIMA model that also includes the modelling of the
exogenous variables.
Exogenous variables are also called the covariates and can be thought of as parallel input sequences
that have observations at the same time steps as the original series. The primary series may be
referred to as the endogenous data to contrast it with the exogenous sequence(s). The observations for
exogenous variables are included in the model directly at each time step and are not modeled in the
same way as the primary endogenous sequence (e.g. as an AR, MA, etc. process).
The SARIMAX method can also be used to model the subsumed models with exogenous variables,
such as ARX, MAX, ARMAX, and ARIMAX.
The method is suitable for univariate time series with trend and/or seasonal components and
exogenous variables.
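A hedged statsmodels sketch of a SARIMAX fit with one exogenous regressor; the synthetic data, the orders, and the 12-step season are illustrative assumptions.

```python
import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX

rng = np.random.default_rng(0)
n = 120
season = 10 * np.sin(2 * np.pi * np.arange(n) / 12)        # a repeating seasonal pattern
exog = rng.normal(size=(n, 1))                             # one exogenous (parallel) input
endog = 50 + season + 3 * exog[:, 0] + rng.normal(size=n)  # the primary (endogenous) series

results = SARIMAX(endog, exog=exog,
                  order=(1, 0, 1),               # (p, d, q)
                  seasonal_order=(1, 0, 1, 12)   # (P, D, Q, m)
                  ).fit(disp=False)
print(results.forecast(steps=3, exog=rng.normal(size=(3, 1))))
```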
Q7. What is Vector autoregression (VAR)?
Answer:
VAR: It is a stochastic process model used to capture the linear interdependencies among
multiple time series. VAR models generalise the univariate autoregressive model (AR model) by
allowing for more than one evolving variable. All variables in the VAR enter the model in the same
way: each variable has an equation explaining its evolution based on its own lagged values, the lagged
values of the other model variables, and an error term. VAR modelling does not require as much knowledge about the forces influencing the variables as structural models with simultaneous equations do: the only prior knowledge required is a list of variables which can be hypothesised to affect each other intertemporally.
A VAR model describes the evolution of a set of k variables over the same sample period (t = 1, ..., T) as a linear function of only their past values. The variables are collected in the k-vector ((k × 1)-matrix) y(t), whose i-th element, y(i,t), is the observation at time t of the i-th variable. For example, if the i-th variable is GDP, then y(i,t) is the value of GDP at time t.
A p-th order VAR, denoted VAR(p), can be written as
y(t) = c + A1*y(t-1) + A2*y(t-2) + ... + Ap*y(t-p) + e(t),
where the observation y(t−i) is called the i-th lag of y, c is a k-vector of constants (intercepts), each Ai is a time-invariant (k × k) matrix, and e(t) is a k-vector of error terms.
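A hedged statsmodels sketch of a two-variable VAR on synthetic, interdependent series; the variable names and lag order are assumptions.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.api import VAR

rng = np.random.default_rng(0)
n = 200
gdp, rate = np.zeros(n), np.zeros(n)
for t in range(1, n):                  # each series depends on lags of both series
    gdp[t] = 0.5 * gdp[t - 1] + 0.2 * rate[t - 1] + rng.normal()
    rate[t] = 0.3 * gdp[t - 1] + 0.4 * rate[t - 1] + rng.normal()

data = pd.DataFrame({"gdp": gdp, "rate": rate})
results = VAR(data).fit(maxlags=2)
print(results.params)                                          # one equation per variable
print(results.forecast(data.values[-results.k_ar:], steps=3))  # joint 3-step forecast
```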
VARMA (Vector Autoregressive Moving Average): The notation for the model involves specifying the order of the AR(p) and MA(q) components as parameters to the VARMA function, e.g. VARMA(p, q). The VARMA model can also be used to develop VAR or VMA models.
This method is suitable for multivariate time series without trend and seasonal components.
Q10. What is Simple Exponential Smoothing (SES)?
Answer:
SES: This method models the next time step as an exponentially weighted linear function of observations at prior time steps.
This method is suitable for univariate time series without trend and seasonal components.
Exponential smoothing is a rule-of-thumb technique for smoothing time series data using the exponential window function. Whereas in the simple moving average the past observations are weighted equally, exponential functions are used to assign exponentially decreasing weights over time. It is an easily learned and easily applied procedure for making some determination based on prior assumptions by the user, such as seasonality. Exponential smoothing is often used for the analysis of time-series data.
Exponential smoothing is one of many window functions commonly applied to smooth data in signal
processing, acting as low-pass filters to remove high-frequency noise.
The raw data sequence is often represented by {xt}, beginning at time t = 0, and the output of the exponential smoothing algorithm is commonly written as {st}, which may be regarded as a best estimate of what the next value of x will be. When the sequence of observations begins at time t = 0, the simplest form of exponential smoothing is given by the formulas:
s0 = x0
st = α*xt + (1 − α)*s(t−1), for t > 0,
where α is the smoothing factor, with 0 < α < 1.
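A hedged statsmodels sketch of SES on a synthetic, trend-free series; the smoothing level of 0.3 is an illustrative assumption.

```python
import numpy as np
from statsmodels.tsa.holtwinters import SimpleExpSmoothing

rng = np.random.default_rng(0)
series = 20 + rng.normal(size=50)            # a noisy level with no trend or seasonality

fit = SimpleExpSmoothing(series).fit(smoothing_level=0.3, optimized=False)
print(fit.fittedvalues[:5])                  # the smoothed estimates s_t
print(fit.forecast(3))                       # flat forecast equal to the last s_t
```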
# DAY 14
Q1. What is Alexnet?
Answer:
Alex Krizhevsky, Geoffrey Hinton, and Ilya Sutskever created the neural network architecture called ‘AlexNet’ and won the ImageNet Classification Challenge (ILSVRC) in 2012. They trained the network on 1.2 million high-resolution images belonging to 1000 different classes, with 60 million parameters and 650,000 neurons. The training was done on two GPUs with a split-layer concept because GPUs were a little bit slow at that time.
AlexNet is the name of a convolutional neural network which has had a large impact on the field of machine learning, specifically in the application of deep learning to machine vision. The network had a very similar architecture to LeNet by Yann LeCun et al., but it was deeper, with more filters per layer and with stacked convolutional layers. It consists of 11×11, 5×5, and 3×3 convolutions, max pooling, dropout, data augmentation, ReLU activations, and SGD with momentum, with ReLU activations attached after every convolutional and fully connected layer. AlexNet was trained for six days simultaneously on two Nvidia GeForce GTX 580 GPUs, which is the reason why the network is split into two pipelines.
Architecture
AlexNet contains eight layers with weights; the first five are convolutional, and the remaining three are fully connected. The output of the last fully-connected layer is fed to a 1000-way softmax, which produces a distribution over the 1000 class labels. The network maximises the multinomial logistic regression objective, which is equivalent to maximising the average across training cases of the log-probability of the correct label under the prediction distribution. The kernels of the second, fourth, and fifth convolutional layers are connected only to those kernel maps in the previous layer which reside on the same GPU. The kernels of the third convolutional layer are connected to all the kernel maps in the second layer. The neurons in the fully connected layers are connected to all the neurons in the previous layer.
In short, AlexNet contains five convolutional layers and three fully connected layers. ReLU is applied after every convolutional and fully connected layer. Dropout is applied before the first and second fully connected layers. The network has 62.3 million parameters and needs 1.1 billion computation units in a forward pass. We can also see that the convolutional layers, which account for about 6% of all the parameters, consume about 95% of the computation.
The idea behind using fixed-size kernels is that all the variable-size convolutional kernels used in AlexNet (11x11, 5x5, 3x3) can be replicated by using multiple 3x3 kernels as building blocks. The replication is in terms of the receptive field covered by the kernels.
Let's consider an example. Say we have an input layer of size 5x5x1. Implementing a conv layer with a kernel size of 5x5 and stride 1 results in an output feature map of 1x1. The same output feature map can be obtained by implementing two 3x3 conv layers, each with a stride of 1, one after the other.
Now, let's look at the number of variables that need to be trained. For a 5x5 conv layer filter, the number of variables is 25. On the other hand, two conv layers with kernel size 3x3 have a total of 3x3x2 = 18 variables (a reduction of 28%).
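This count can be checked with a short Keras sketch (biases are disabled so that only the kernel weights are counted).

```python
from tensorflow import keras
from tensorflow.keras import layers

# One 5x5 convolution vs. two stacked 3x3 convolutions on a 5x5x1 input
single = keras.Sequential([keras.Input((5, 5, 1)),
                           layers.Conv2D(1, 5, use_bias=False)])
stacked = keras.Sequential([keras.Input((5, 5, 1)),
                            layers.Conv2D(1, 3, use_bias=False),
                            layers.Conv2D(1, 3, use_bias=False)])

print(single.output_shape, single.count_params())    # (None, 1, 1, 1) 25
print(stacked.output_shape, stacked.count_params())  # (None, 1, 1, 1) 18
```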
Q3. What is VGG16?
Answer:
VGG16: It is a convolutional neural network model proposed by K. Simonyan and A. Zisserman from the University of Oxford in the paper “Very Deep Convolutional Networks for Large-Scale Image Recognition”. The model achieves 92.7% top-5 test accuracy on ImageNet, which is a dataset of over 14 million images belonging to 1000 classes. It was one of the famous models submitted to ILSVRC-2014. It improves on AlexNet by replacing the large kernel-sized filters (11 and 5 in the first and second convolutional layers, respectively) with multiple 3×3 kernel-sized filters one after another. VGG16 was trained for weeks using NVIDIA Titan Black GPUs.
The Architecture
The architecture of VGG16 is as follows.
The input to the Conv1 layer is a fixed-size 224 x 224 RGB image. The image is passed through a stack of convolutional (conv.) layers, where the filters have a very small receptive field: 3×3 (which is the smallest size that captures the notion of left/right, up/down, and centre). In one of the configurations, it also utilises 1×1 convolution filters, which can be seen as a linear transformation of the input channels. The convolution stride is fixed to 1 pixel, and the spatial padding of the conv. layer input is such that the spatial resolution is preserved after the convolution, i.e. the padding is 1 pixel for the 3×3 conv. layers. Spatial pooling is carried out by five max-pooling layers, which follow some of the conv. layers. Max-pooling is performed over a 2×2 pixel window, with stride 2.
Three Fully-Connected (FC) layers follow the stack of convolutional layers (which has a different depth in different architectures): the first two have 4096 channels each, and the third performs 1000-way ILSVRC classification and thus contains 1000 channels. The final layer is a softmax layer. The configuration of the fully connected layers is the same in all the networks.
All hidden layers are equipped with rectification (ReLU) non-linearity. It is also noted that none of the networks (except for one) contain Local Response Normalisation (LRN); such normalisation does not improve performance on the ILSVRC dataset, but it leads to increased memory consumption and computation time.
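The pre-trained VGG16 is available in keras.applications and can be inspected with a couple of lines (loading the ImageNet weights triggers a sizeable download).

```python
from tensorflow import keras

# Load VGG16 with ImageNet weights and inspect its layer stack
model = keras.applications.VGG16(weights="imagenet", include_top=True)
model.summary()   # 13 convolutional layers + 3 fully connected, ~138M parameters
```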
The benefit of transfer learning is that it can reduce the time it takes to develop and train a model by reusing these pieces or modules of already developed models. This helps to speed up the model training process and accelerate results.
Region Proposal Network:
The output of the region proposal network (RPN) is a bunch of boxes/proposals that will be examined by a classifier and regressor to eventually check for the occurrence of objects. To be more precise, the RPN predicts the possibility of an anchor being background or foreground, and refines the anchor.
Problems with R-CNN:
• It still takes a huge amount of time to train the network, as we would have to classify 2000 region proposals per image.
• It cannot be implemented in real time, as it takes around 47 seconds for each test image.
• The selective search algorithm is a fixed algorithm; therefore, no learning is happening at that stage. This can lead to the generation of bad candidate region proposals.
Q9.What is GoogLeNet/Inception?
Answer:
The winner of the ILSVRC 2014 competition was GoogLeNet from Google. It achieved a top-5 error
rate of 6.67%! This was very close to human-level performance which the organisers of the challenge
were now forced to evaluate. As it turns out, this was rather hard to do and required some human
training to beat GoogLeNet's accuracy. After a few days of training, the human expert (Andrej Karpathy) was able to achieve a top-5 error rate of 5.1% (single model) and 3.6% (ensemble). The
network used the CNN inspired by LeNet but implemented a novel element which is dubbed an
inception module. It used batch normalisation, image distortions and RMSprop. This module is based
on several very small convolutions in order to reduce the number of parameters drastically. Their architecture consisted of a 22-layer deep CNN but reduced the number of parameters from 60 million (AlexNet) to 4 million.
It contains 1×1 convolutions at the middle of the network, and global average pooling is used at the end of the network instead of fully connected layers. These two techniques are from another paper, “Network In Network” (NIN). Another technique, called the inception module, is to have different sizes/types of convolutions for the same input and to stack all the outputs.
Q10. What is LeNet-5?
Answer:
LeNet-5, a pioneering 7-level convolutional network by LeCun et al. in 1998 that classifies digits, was applied by several banks to recognise hand-written numbers on cheques, digitised into 32x32-pixel greyscale input images. The ability to process higher-resolution images requires larger and more numerous convolutional layers, so the availability of computing resources constrains this technique.
LeNet-5 is a very simple network. It has only seven layers, among which there are three convolutional layers (C1, C3, and C5), two sub-sampling (pooling) layers (S2 and S4), and one fully connected layer (F6), followed by the output layer. The convolutional layers use 5x5 convolutions with stride 1, and the sub-sampling layers are 2x2 average pooling layers. Tanh activations are used throughout the network. Several interesting architectural choices were made in LeNet-5 that are not very common in the modern era of deep learning.
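A Keras sketch of a LeNet-5-style network following this description (the exact original used custom connection tables and scaled tanh, so this is an approximation):

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(32, 32, 1)),
    layers.Conv2D(6, 5, activation="tanh"),        # C1
    layers.AveragePooling2D(2),                    # S2
    layers.Conv2D(16, 5, activation="tanh"),       # C3
    layers.AveragePooling2D(2),                    # S4
    layers.Conv2D(120, 5, activation="tanh"),      # C5
    layers.Flatten(),
    layers.Dense(84, activation="tanh"),           # F6
    layers.Dense(10, activation="softmax"),        # output
])
model.summary()
```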
------------------------------------------------------------------------------------------------------------------------