Data Science Interview Questions

The document provides a comprehensive set of interview questions and answers for data science roles, focusing on key Python concepts and libraries. It covers a variety of topics including data manipulation, analysis, and programming techniques relevant to data science. The questions range from basic to intermediate levels, helping candidates prepare effectively for their interviews.


Ace the upcoming Data Science Interview

You can't anticipate every question an interviewer will ask. However, there are many critical
questions that you can prepare before the interview.

Our hiring partners have helped us curate a set of interview questions on key skills, which will help
you prepare better for the data science job roles.


1. Name a function that is most useful for converting a multidimensional array into a one-dimensional array. For this function, will changing the output array affect the original array?

Basic Python

The flatten() method can be used to convert a multidimensional array into a 1D array. Modifying the output array returned by flatten() will not affect the original array, because this method returns a copy of the original array.
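A minimal sketch of this copy behaviour (the array values are illustrative):

import numpy as np

arr = np.array([[1, 2], [3, 4]])
flat = arr.flatten()   # flatten() returns a copy of the data
flat[0] = 99
print(flat)            # [99  2  3  4]
print(arr[0, 0])       # still 1 - the original array is unchanged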

2. If there are two variables defined as 'a = 3' and 'b = 4', will the id() function return the same value for a and b?
Basic Python

The id() function in Python returns the identity of an object, which in CPython is its memory address. Since this identity is unique for every distinct object during its lifetime, id() will not return the same value for a and b.
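A quick illustration; the exact id values printed will vary between runs:

a = 3
b = 4
print(id(a))            # e.g. 140711693620000
print(id(b))            # a different value, since 3 and 4 are distinct objects
print(id(a) == id(b))   # False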

3. What is the Beautiful Soup library used for?

Basic Python

4. In Python, if we create two variables 'mean = 7' and 'Mean = 7', will both of them be considered equivalent?

Basic Python

Python is a case-sensitive language. It distinguishes between uppercase and lowercase letters, and hence the variables 'mean = 7' and 'Mean = 7' refer to two different variables and will not be considered equivalent.

5. What is the use of 'inplace' in pandas functions?

Basic Python

'inplace' is a parameter available in a number of pandas functions, and it affects how the function executes. With 'inplace=True', the original dataframe is modified and the function returns None. The default behaviour is 'inplace=False', which returns a modified copy of the dataframe without affecting the original dataframe.
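A small sketch of the difference, using an illustrative dataframe with a missing value:

import pandas as pd
import numpy as np

df = pd.DataFrame({"a": [1, np.nan, 3]})

clean = df.dropna()              # inplace=False (default): df itself is untouched
print(len(df), len(clean))       # 3 2

df.dropna(inplace=True)          # modifies df itself and returns None
print(len(df))                   # 2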

6. How can you change the index of a dataframe in python?

Basic Python

DataFrame.set_index(keys, drop=True, append=False, inplace=False, verify_integrity=False)


keys: label, array-like, or list of labels/arrays. This parameter can be either a single column key, a single array of the same length as the calling DataFrame, or a list containing an arbitrary combination of column keys and arrays. Here, "array" encompasses Series, Index, np.ndarray, and instances of Iterator.
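A brief usage sketch with a hypothetical dataframe:

import pandas as pd

df = pd.DataFrame({"id": [101, 102, 103], "score": [0.5, 0.7, 0.9]})

# Use the 'id' column as the new index; drop=True removes it as a regular column
df = df.set_index("id", drop=True)
print(df.loc[102, "score"])   # 0.7
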
7. How would you check whether a number is prime or not using Python?

Basic Python

# taking input from the user
number = int(input("Enter any number: "))

# a prime number is always greater than 1
if number > 1:
    for i in range(2, number):
        if (number % i) == 0:
            print(number, "is not a prime number")
            break
    else:
        # the loop completed without finding a divisor
        print(number, "is a prime number")
# if the entered number is less than or equal to 1, it is not a prime number
else:
    print(number, "is not a prime number")

8. What is the difference between univariate and bivariate analysis? What different functions can be used in Python?

Basic Python

Univariate analysis summarizes only one variable at a time, while bivariate analysis compares two variables. Below are a few functions which can be used in univariate and bivariate analysis:
1. To find the population proportions of different types of blood disorders: df.Thal.value_counts()
2. To plot the distribution of a variable: sns.distplot(df.Variable.dropna())
3. To find the minimum, maximum, average, and standard deviation of the data: the describe() function returns the minimum, maximum, mean, etc. of the numerical variables of the dataframe.
4. To find the mean of a variable: df.Variable.dropna().mean()
5. Boxplot to observe outliers: sns.boxplot(x=' ', y=' ', hue=' ', data=df)
6. Correlation matrix: df.corr()

9. What is the difference between 'for' loop and 'while' loop?

Basic Python

- A 'for' loop is used to iterate over a sequence; the number of iterations to be performed is known in advance.
- In a 'while' loop, the number of iterations is not known in advance. The body keeps executing until the loop condition becomes false.

10. Differentiate between Call by value and Call by reference.

Basic Python

11. How will you import multiple excel sheets in a data frame?
Basic Python

The Excel sheets can be read into dataframes using the 'pd.read_excel()' function with sheet_name=None (which returns all sheets), and the results can then be concatenated using 'pd.concat()'. Syntax: df = pd.concat(pd.read_excel('file.xlsx', sheet_name=None), ignore_index=True)

12. What is the difference between 'Append' and 'Extend' function?

Basic Python

The append() method adds a single item to the end of the list. Syntax: list.append(item). The extend() method, on the other hand, extends the list by adding each element from an iterable. Syntax: list.extend(iterable).
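A short illustration of the difference:

a = [1, 2]
a.append([3, 4])      # the whole list is added as a single element
print(a)              # [1, 2, [3, 4]]

b = [1, 2]
b.extend([3, 4])      # each element of the iterable is added individually
print(b)              # [1, 2, 3, 4]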

13. What are the data types available in Python?

Basic Python

Python has the following standard data types:
- Boolean
- Set
- Mapping type: dictionary
- Sequence type: list, tuple, string
- Numeric type: int, float, complex

14. Can you write a function using python to impute outliers?

Basic Python

import numpy as np

def removeOutliers(x, outlierConstant):
    a = np.array(x)
    upper_quartile = np.percentile(a, 75)
    lower_quartile = np.percentile(a, 25)
    IQR = (upper_quartile - lower_quartile) * outlierConstant
    quartileSet = (lower_quartile - IQR, upper_quartile + IQR)
    resultList = []
    for y in a.tolist():
        # keep only the values that fall inside the IQR-based fences
        if y >= quartileSet[0] and y <= quartileSet[1]:
            resultList.append(y)
    return resultList

15. Can any type of string be converted into an int, in Python?

Basic Python

Not every string can be converted. Python offers the int() function, which takes a string as an argument and returns an integer, but the string must represent a whole number (optionally with a sign); otherwise a ValueError is raised. Keep these special cases in mind: passing a float object (a number with a fractional part) directly to int() returns the number truncated towards zero, and a string such as '3.7' must first be converted with float() before it can be passed to int().

16. How would you check whether a number is an Armstrong number using Python?

Basic Python

# Python program to check if the number is an Armstrong number or not
# take input from the user
num = int(input("Enter a number: "))

# initialize sum
sum = 0

# find the sum of the cube of each digit
temp = num
while temp > 0:
    digit = temp % 10
    sum += digit ** 3
    temp //= 10

# display the result
if num == sum:
    print(num, "is an Armstrong number")
else:
    print(num, "is not an Armstrong number")

17. What is the difference between list, array and tuple in Python?

Basic Python

A list is an ordered, mutable collection. Lists are dynamic and can contain objects of different data types, and list elements can be accessed by index number. An array is an ordered, mutable collection of elements of the same data type, and its elements can also be accessed by index number. A tuple is immutable and can store any type of data; it is defined using parentheses () and its elements cannot be changed or replaced once created.

18. What is the difference between iloc and loc?

Basic Python

loc gets rows (or columns) with particular labels from the index. iloc gets rows (or columns)
at particular positions in the index and it only takes integers.
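A minimal sketch of the difference, using a hypothetical dataframe with a non-default index:

import pandas as pd

df = pd.DataFrame({"score": [10, 20, 30]}, index=[100, 101, 102])

print(df.loc[101, "score"])    # 20 - selection by index label
print(df.iloc[1]["score"])     # 20 - selection by integer position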

19. How does the reverse function work in Python?

Basic Python

The built-in reverse() method reverses the contents of a list object in place. That means it does not return a new list; instead, it directly modifies the original list object. Syntax: list.reverse()
20. What is the apply function in Python? How does it work?

Basic Python

pandas apply() allows users to pass a function and apply it to every single value of a pandas Series. Syntax: s.apply(func, convert_dtype=True, args=())
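A brief usage sketch:

import pandas as pd

s = pd.Series([1, 2, 3])
squared = s.apply(lambda x: x ** 2)   # applies the function to every value
print(squared.tolist())               # [1, 4, 9]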

21. How do you get the frequency of a categorical column of a dataframe using python?

Basic Python

Using df['column'].value_counts(), where df is the dataframe. The value_counts() function returns the counts of the distinct elements in a dataframe column, sorted in descending order by default.

22. Will range(5) include '5' in its output?

Basic Python

The range() function in Python always excludes the stop value from the result. Here it will generate a numeric series from 0 to (5-1)=4, so it will not include 5.

23. How can you drop a column in python?

Basic Python

Pandas 'drop()' method is used to remove specific rows and columns. To drop a column, the
parameter 'axis' should be set as 'axis = 1'. This parameter determines whether to drop labels
from the columns or rows (index). The default behaviour is axis=0. Syntax:
df.drop('column_name', axis=1)

24. How do NaN values behave when compared with themselves?

Basic Python

A NaN value never compares equal to itself (NaN == NaN evaluates to False). That's why checking whether a variable is equal to itself is a popular way to look for NaN values: if it isn't, it's most likely a NaN value.
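A quick illustration of this behaviour:

x = float("nan")
print(x == x)        # False - NaN is never equal to itself

def is_nan(value):
    return value != value   # True only for NaN

print(is_nan(x))     # True
print(is_nan(3.0))   # False
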
25. How can we convert a python series object into a dataframe?

Basic Python

to_frame() is a function that converts a Series object into a DataFrame. Syntax: Series.to_frame(name=None), where name substitutes the existing Series name when creating the DataFrame column.

26. How do you read a file without using Pandas?

Basic Python

27. Can you plot 3D plots using matplotlib? Describe the function.

Intermediate Python

Yes. For example:

import numpy as np
import matplotlib.pyplot as plt

fig = plt.figure()
ax = plt.axes(projection='3d')

28. How is get_dummies() different from OneHotEncoder?

Intermediate Python

In older versions of scikit-learn, OneHotEncoder could not process string values directly; if your nominal features were strings, you first needed to map them to integers. pandas.get_dummies is roughly the opposite: by default, it only converts string (object/categorical) columns into a one-hot representation, unless specific columns are passed.

29. Name a tool that can be used to convert categorical columns into a numeric column.

Intermediate Python

Two of the most popular tools are LabelEncoder and OneHotEncoder, both provided as part of the sklearn library. LabelEncoder can be used to transform categorical data into integers:

from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
x = ['Apple', 'Orange', 'Apple', 'Pear']
y = label_encoder.fit_transform(x)
print(y)   # [0 1 0 2]

OneHotEncoder can be used to transform categorical data into a one-hot encoded array:

from sklearn.preprocessing import OneHotEncoder

onehot_encoder = OneHotEncoder(sparse=False)
y = y.reshape(len(y), 1)
onehot_encoded = onehot_encoder.fit_transform(y)
print(onehot_encoded)

30. How will you remove duplicate data from a dataframe?

Intermediate Python

The drop_duplicates() function in pandas eliminates redundant rows from the DataFrame and returns the result. Syntax: DataFrame.drop_duplicates(subset=None, keep='first', inplace=False). subset: takes a column or list of column labels; the default value is None. When columns are passed, only they are considered when identifying duplicates. keep: controls which duplicates to keep. It has three possible values ('first', 'last', False) and the default is 'first'.

31. How do you select a sample of dataframe?

Intermediate Python

Depending on the situation, there are a few possible ways to select a sample from the dataframe:
1. Randomly select a single row: df = df.sample()
2. Randomly select a specified number of rows, e.g. n=3: df = df.sample(n=3)
3. Allow the same row to be selected more than once: df = df.sample(n=3, replace=True)
4. Randomly select a specified fraction of the total number of rows: df = df.sample(frac=0.50)

32. How does the groupby function work in Python?

Intermediate Python

Pandas DataFrame.groupby() is used to split the data into groups based on some criteria. pandas objects can be split on any of their axes.
Syntax: DataFrame.groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, squeeze=False, **kwargs)
by: mapping, function, str, or iterable
axis: int, default 0
level: if the axis is a MultiIndex (hierarchical), group by a particular level or levels
as_index: for aggregated output, return an object with the group labels as the index; only relevant for DataFrame input
sort: sort group keys. Turning this off can give better performance. Note that this does not influence the order of observations within each group; groupby preserves the order of rows within each group
group_keys: when calling apply, add group keys to the index to identify pieces
squeeze: reduce the dimensionality of the return type if possible, otherwise return a consistent type
Returns: a GroupBy object
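A short usage sketch with a hypothetical dataframe:

import pandas as pd

df = pd.DataFrame({
    "team": ["A", "A", "B", "B"],
    "points": [10, 12, 7, 9],
})

# Split the rows by team, then aggregate each group
print(df.groupby("team")["points"].mean())
# team
# A    11.0
# B     8.0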

33. How do you check the distribution of data in python?

Intermediate Python

A simple and commonly used plot to quickly check the distribution of a sample of data is the histogram:

from matplotlib import pyplot
pyplot.hist(data)

34. Which libraries in SciPy have you worked with in your project?

Intermediate Python

SciPy contains modules for optimization, linear algebra, integration, interpolation, special functions, FFT, signal and image processing, ODE solvers, etc. Subpackages include: scipy.cluster, scipy.constants, scipy.fftpack, scipy.integrate, scipy.interpolate, scipy.linalg, scipy.io, scipy.ndimage, scipy.odr, scipy.optimize, scipy.signal, scipy.sparse, scipy.spatial, scipy.special, and scipy.stats.

35. How is the Python series different from a single column dataframe?

Intermediate Python

A pandas Series is the data structure for a single column of a DataFrame, not only conceptually but literally: the data in a DataFrame is actually stored in memory as a collection of Series. A Series is a one-dimensional object that can hold any data type such as integers, floats, and strings, and it has at most a single name rather than column headers, whereas a single-column DataFrame has a column name.

36. What does the function zip() do?

Intermediate Python

The zip() function takes iterables (zero or more), aggregates their elements into tuples, and returns an iterator of those tuples. Syntax: zip(*iterables)
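A short illustration:

names = ["a", "b", "c"]
scores = [1, 2, 3]

pairs = list(zip(names, scores))
print(pairs)   # [('a', 1), ('b', 2), ('c', 3)]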

37. Can a lambda function be used within a user-defined function?


Intermediate Python

Yes. A lambda function evaluates an expression for a given argument. It can be used as an
anonymous function within another function.

38. What does [::-1] do in python?

Intermediate Python

[::] produces a copy of all the elements in order, while [::-1] produces a copy of all the elements in reverse order.

39. How do you check missing values in a dataframe using python?

Intermediate Python

The pandas isnull() function detects missing values in the given object. It returns a boolean object of the same size indicating whether the values are NA: missing values are mapped to True and non-missing values are mapped to False.

40. Explain a scenario where negative indices are used in python

Intermediate Python

Python supports negative indexing of sequences, something that is not available in most other programming languages. Negative indexing starts from the end of the sequence: index -1 gives the last element and -2 gives the second-to-last element. This is handy, for example, for grabbing the last few elements of a list (lst[-3:]) or reversing a sequence (lst[::-1]) without knowing its length.

41. Python or R, which one would you prefer for text analytics?

Intermediate Python

42. What different methods can be used to standardize data using Python?
Intermediate Python

- Min-Max Scaler
- Standard Scaler
- Max-Abs Scaler
- Robust Scaler
- Quantile Transformer Scaler
- Power Transformer Scaler
- Unit Vector Scaler
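A minimal sketch of two of these scalers using scikit-learn; the data is illustrative:

import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

X = np.array([[1.0], [2.0], [3.0], [10.0]])

standardized = StandardScaler().fit_transform(X)   # zero mean, unit variance
min_maxed = MinMaxScaler().fit_transform(X)        # scaled to the [0, 1] range

print(standardized.ravel())
print(min_maxed.ravel())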

43. How would you define a block in Python?

Intermediate Python

A block is a group of statements in a program or script. Usually it consists of at least one statement, plus any declarations the block needs, depending on the programming or scripting language. A language that allows grouping statements into blocks is called a block-structured language. In Python, blocks are defined by indentation rather than braces.

44. How do you do Up-sampling of data? Name a python function or explain the code.

Intermediate Python

Up-sampling is the process of randomly duplicating observations from the minority class in
order to reinforce its signal. There are several heuristics for doing so, but the most common
way is to simply resample with replacement. Module for resampling in Python: from
sklearn.utils import resample
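A brief sketch of up-sampling with sklearn.utils.resample, using a small illustrative dataframe with an imbalanced binary 'target' column:

import pandas as pd
from sklearn.utils import resample

# Small illustrative dataframe with an imbalanced binary target
df = pd.DataFrame({
    "feature": range(10),
    "target":  [0] * 8 + [1] * 2,
})

majority = df[df["target"] == 0]
minority = df[df["target"] == 1]

# Randomly duplicate minority rows (sampling with replacement) up to the majority size
minority_upsampled = resample(
    minority, replace=True, n_samples=len(majority), random_state=42
)

df_balanced = pd.concat([majority, minority_upsampled])
print(df_balanced["target"].value_counts())   # 8 of each class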

45. What is machine learning?

Basic Machine Learning

Machine learning is a branch of artificial intelligence (AI) that focuses on the use of data and
algorithms to mimic the way that humans learn. It aims to gradually improve by learning from
the events that happened in the past (data captured in past), assuming that the past data is a
good representation of the future. There are various machine learning algorithms available to
build a model that can learn the hidden patterns from the past data, known as training data,
in order to make predictions for the future data or the unseen data, based on which
decisions can be taken. For example: Predicting the prices of a house based on attributes of
the property.

46. Machine learning helps in summarising the patterns in the data in a mathematically
precise way. What exactly is the mathematical outcome of any (machine learning) model
building exercise?

Intermediate Machine Learning

Machine learning models take data as input to find the hidden patterns in it and try to
summarize the patterns that exist in the data by establishing a relationship between the
predictors and the predicted values in a mathematically precise way. The mathematical
outcome of a model can be as simple as an equation that relates the predictors to the target
variable. For example, the relationship between salary and years of experience of an
individual.

47. Machine learning automates the process of building mathematical models out of data.
Explain/elaborate on this statement in the light of the linear regression algorithm.

Advanced Machine Learning

Linear regression is a linear model which tries to fit the best fit line through the data and
establish the relationship between the independent variables and the dependent variable in
a form of a linear equation. The equation of the best fit line can be given as: Y = ax1 + bx2 +
c Where a and b are the coefficients of x1 and x2 variables respectively and c is the
constant. The linear regression tries to fit the line in such a way that the errors are
minimized, that is, the predicted values are closer to the observed values. The machine-
learning algorithm of linear regression automates the process of model building i.e it
automatically finds the best fit line which has the minimum error or predicts the values that
are closest to the observed values. This means that the process of finding the relationship
between independent variables and the dependent variable is automated.

48. If your model performs very well on the data it was trained on but not on data it has not seen before, how will you address that performance gap? Why is it important to address that gap?

Intermediate Machine Learning

Data generally contains information as well as noise. When we fit a model on the training
data, it learns both the information and noise. If the model learns too much noise and fails to
capture the required information then we see that there is a performance gap between the
training performance and the performance on the unseen data (test set). This performance
gap indicates that the model is overfitting, i.e. failing to replicate the performance of the
training set on the test set. To address this performance gap between the training and the
test set various regularization techniques can be applied. In linear models like linear
regression, regularization techniques like ridge regression and lasso regression can be used.
In non-linear models like decision trees, pruning techniques such as pre-pruning and post-pruning can be used to deal with the performance gap. Also, the technique of
cross-validation can be implemented to determine the performance of the model on the
unseen dataset.

49. When a model gets to production, it will have to make predictions on data that it has not seen before. How can we ensure that the model performs well on this data?

Advanced Machine Learning

Before sending the model to production, we can check the performance and validity of the model using the following methods.
Train-validation split: In this method, we divide the training set into two parts; one part is kept for training and the other is kept for validating model performance. We train the model on the training set and test it against the validation set. Based on the performance of the model on the validation set, we tune the hyperparameters of the model to get a generalized model with good performance.
K-fold cross-validation: In this method, we divide the training set into k folds, where k can be any number ranging from 2 up to the number of records in the dataset (generally 10 folds are preferred). If we set k to 5, then in each iteration 4 folds are used for training the model and the left-out fold is used as a test set. The same procedure is repeated for all the folds, i.e. each fold is used once as the test set. To determine the model performance, the average of the metric across all folds is taken. With this method, we can be more confident about the model's performance because it has been tested across several different subsets of the data.
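A brief sketch of k-fold cross-validation with scikit-learn; the model and dataset here are only illustrative:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(random_state=42)

# 5-fold cross-validation: each fold is used once as the held-out test set
scores = cross_val_score(model, X, y, cv=5)
print(scores)          # accuracy per fold
print(scores.mean())   # average performance across folds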

50. What is supervised learning?

Basic Machine Learning

Supervised learning is a type of machine learning method in which algorithms are trained
using well "labeled" training data, that is independent variables are already tagged against a
defined target variable. With this technique, we can make predictions and compare them
against the ground truth. For example, Determining if a client might default on a loan or not.

51. What is unsupervised learning?

Basic Machine Learning


Unsupervised learning is a type of machine learning method in which models are trained
using an unlabeled dataset i.e there is no defined target variable against the independent
variables. Since there is no defined target, there is no specific way to compare model
performance in most unsupervised learning methods. Hence, unsupervised learning
algorithms generally perform the task by clustering the dataset into groups according to
certain measures of similarities. For example, Advertising companies segment the population
into smaller groups with similar demographics and purchasing habits to reach their target
market with relevant ads.

52. How do we measure performance of a supervised learning model?

Intermediate Machine Learning

There are several performance metrics available to measure the performance of a supervised learning model. In the case of a regression problem, some of the available metrics are R2, adjusted R2, RMSE, MAE, etc. In the case of a classification problem, some of the available metrics are accuracy, precision, recall, F1-score, etc.
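A small sketch computing a few of these metrics with scikit-learn; the values are illustrative:

from sklearn.metrics import accuracy_score, f1_score, mean_absolute_error, r2_score

# Regression example
y_true_reg = [3.0, 5.0, 7.5]
y_pred_reg = [2.8, 5.4, 7.0]
print(r2_score(y_true_reg, y_pred_reg))
print(mean_absolute_error(y_true_reg, y_pred_reg))

# Classification example
y_true_clf = [0, 1, 1, 0, 1]
y_pred_clf = [0, 1, 0, 0, 1]
print(accuracy_score(y_true_clf, y_pred_clf))   # 0.8
print(f1_score(y_true_clf, y_pred_clf))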

53. How do we measure performance of an unsupervised learning model?

Intermediate Machine Learning

There are several performance metrics available to measure the performance of an unsupervised learning model, such as the silhouette score, the cophenetic correlation coefficient, etc.

54. What is the difference between correlation and multicollinearity?

Advanced Machine Learning

Correlation is a statistical measure that expresses the strength of a linear relationship between two quantitative variables. A correlation can be positive or negative. In a positive
correlation, the two variables move in the same direction i.e. when one variable increases,
the other variable also increases, and vice versa. Whereas in a negative correlation, the two
variables move in the opposite direction i.e. when one variable increases, the other variable
will decrease, and vice versa. Correlation gives a sense of the relationship between two
variables, known as pair-wise correlation. When two or more variables have a strong linear
relationship they are said to be multicollinear. Multicollinearity is a challenge in linear models
because when two or more independent variables display high correlation the model is not
able to distinguish between the individual effects of the independent variables on the
dependent variable. Multicollinearity can be detected using the Variance Inflation Factor
(VIF).

55. How does multicollinearity affect the performance of a linear regression model?

Intermediate Machine Learning

Multicollinearity doesn't affect the predictive performance of a linear regression model; it only affects the interpretation of the model. Multicollinearity is a challenge in linear
regression because when two or more independent variables display high correlation the
model is not able to distinguish between the individual effects of the independent variables
on the dependent variable.

56. Which evaluation metric should you use to evaluate a linear regression model built on
a dataset that has a lot of outliers in it?

Intermediate Machine Learning

MAE would be a good metric in that case because it is most robust to outliers. MSE or
RMSE is extremely sensitive to outliers and penalizes the outliers more.

57. What is the difference between r-squared and adjusted r-squared?

Intermediate Machine Learning

R-squared (R2) is a statistical measure that represents the proportion of the variance that is
explained in the dependent variable by the independent variables. For example, if the R2 of a
model is 0.70, then 70% of the variation can be explained by the model's inputs. Adjusted R-
squared is a modified version of R-squared that has been adjusted for the number of
independent variables in the model and penalizes the model performance for adding
variables that do not improve the existing model. If we add a new independent variable in
the model, the R2 of the model will always increase. However, the adjusted R-squared
increases only when the new independent variable improves the model more than expected
by chance. It decreases when the independent variable improves the model by less than
expected.

58. How will you explain Decision Tree to a non-tech person?


Advanced Machine Learning

A decision tree can be considered as an inverted tree representation that grows from top to
bottom instead of bottom to top. It tries to mimic the human decision-making process and
tries to represent all the possible solutions to a decision based on certain conditions. For
example, if you have to decide whether or not to go out for a coffee at a nearby place, a simple decision tree could look like this: start with the main question, "To go out for coffee?" The decision to go out depends on the location of the place, so the second question becomes "Is the place nearby?" If 'yes', then go for coffee; otherwise don't.

59. Why are decision trees prone to overfitting?

Intermediate Machine Learning

The main aim of the decision tree is to achieve homogeneity among the leaf nodes i.e any
split made by the decision tree should result in pure leaves which contain one type of
decision only. For example, If we are trying to predict whether a person will default on a loan
or not and we use the decision tree to make this prediction then the result from the decision
tree split must result in all the defaulters in one leaf and all the non-defaulters on another
leaf node. If the composition of the leaf node is 50% defaulters and 50% non-defaulters then
the leaf is considered completely impure. If a decision tree is built without any restrictions
the tree will grow to its full length and will try to achieve homogeneity by capturing complex
patterns as well as noise present in the data during this process. Due to this, it ends up
learning all the patterns that are present in the training data but fails to replicate the
performance on unseen data, i.e. it leads to overfitting.

60. How can you improve the performance of an overfitting Decision Tree model?

Advanced Machine Learning

To avoid overfitting in decision trees and get a generalized model which performs well on
training as well as the test set we can use Pruning techniques. There are two ways to prune a
decision tree: a) Pre-Pruning: In this method, the decision tree is restricted before it can grow
to its full length by bounding the depth of the tree. There are several other hyperparameters
that are available in the SKlearn implementation of the Decision tree which help in restricting
the growth of the tree. This method is also known as the early stopping of tree. b) Post-
Pruning: In this method, the tree is allowed to grow to its full length and then the sub-trees
of the decision tree are pruned. The sub-trees that are pruned in this process are the ones
that do not provide any significant information to the model. The significance of the sub-tree
is calculated by removing it and checking the error between the full-grown tree and the tree
from which the sub-tree was removed. If the error is large, it signifies that the removed sub-tree is important for prediction; if the error is small, it signifies that the sub-tree does not contribute much to the prediction.
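A short sketch of both approaches using scikit-learn's DecisionTreeClassifier; the dataset and hyperparameter values are illustrative (ccp_alpha enables scikit-learn's cost-complexity post-pruning):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Pre-pruning: restrict the tree's growth before training
pre_pruned = DecisionTreeClassifier(max_depth=4, min_samples_leaf=10, random_state=42)
pre_pruned.fit(X_train, y_train)

# Post-pruning: grow the tree fully, then prune via cost-complexity pruning (ccp_alpha)
post_pruned = DecisionTreeClassifier(ccp_alpha=0.01, random_state=42)
post_pruned.fit(X_train, y_train)

print(pre_pruned.score(X_test, y_test), post_pruned.score(X_test, y_test))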

61. How is a random forest model different from just using 'n' decision trees?

Advanced Machine Learning

Let’s say we build ‘n’ decision trees and a Random Forest model with ‘n’ decision tree
estimators. The Random Forest model will be different from the ‘n’ decision trees because it
will employ bootstrapping of rows along with random sampling of columns. Each decision tree in the random forest model will be built on a different dataset because the rows are sampled with replacement and a random subset of the features is considered. The final output of the random forest will be decided on
the basis of voting or averaging of the results from ‘n’ decision tree estimators built in the
random forest thereby making the prediction more robust. Whereas if we train ‘n’ decision
the outcome will be the same because the underlying training data for each of the decision
trees is the same.

62. How is AUC different from ROC?

Intermediate Machine Learning

The ROC curve (receiver operating characteristic curve) is a curve showing the performance
of a classification model at different thresholds. This curve plots two parameters, False
Positive Rate (FPR) on the x-axis and True Positive Rate (TPR) on the y-axis. AUC stands for
"Area under the ROC Curve" i.e, AUC measures the entire area under the ROC curve. These
two metrics are typically used together to check the performance of a binary classification
problem.

63. How are k-means and hierarchical clustering different?

Intermediate Machine Learning

K-Means is a centroid-based clustering algorithm, whereas hierarchical clustering is a connectivity-based clustering algorithm. In centroid-based clustering methods, the idea of
similarity is defined as the closeness of data point from the center of the cluster whereas, in
connectivity clustering methods, the idea of similarity is defined as the closeness of data
points with each other. There are several differences between K-Means and Hierarchical
Clustering like: 1. K-Means uses a pre-defined number of clusters that is before starting to
cluster the data points we have to mention the number of clusters. Whereas in Hierarchical
clustering all the data points are considered as separate clusters and there is no requirement
for mentioning the number of clusters beforehand. 2. K-Means clustering uses the mean (or median, in some variants) of the points to find the centroid of a cluster, whereas different linkage methods like ward, single, etc. can be used to measure the similarity between two or more clusters in
hierarchical clustering. 3. The computation complexity is higher for hierarchical clustering for
larger datasets whereas K-Means clustering is computationally less expensive for larger
datasets. 4. K Means clustering starts with a random choice of clusters, the results produced
by running the algorithm many times may differ. Whereas the results of hierarchical
clustering are reproducible.

64. How would you identify the optimal number of clusters in your dataset?

Advanced Machine Learning

The most common method to identify the optimal number of clusters in K-Means clustering
is the elbow method. In the elbow method, we iterate over a range of K values i.e number of
clusters, and for each value of K within-cluster sum of squares (WCSS) is calculated that is
the distance between each point and the centroid in a cluster. When we plot the WCSS with
the number of clusters or K value, the plot looks like an Elbow because as the number of
clusters increases, the WCSS value will start to decrease. The K value is chosen where a
rapid decrease in the WCSS is observed or the point, where the line in the plot starts to
move almost parallel to the X-axis. The K value corresponding to this point is the optimal
number of clusters.
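A compact sketch of the elbow method using scikit-learn's KMeans on synthetic data:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

wcss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wcss.append(km.inertia_)   # within-cluster sum of squares (WCSS) for this k

# Plotting WCSS against k gives the elbow plot; here the elbow should appear near k = 4
for k, value in zip(range(1, 11), wcss):
    print(k, round(value, 1))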

65. Why is it important to understand the bias variance trade-off when applying data
science?

Advanced Machine Learning

It is important to understand the bias-variance trade-off because a model high on the bias
fails to identify the underlying patterns on the training data which leads to the creation of a
simple model that fails to perform well on the training set as well as the test set leading to
high errors on training and test sets or underfitting. Whereas a model high on the variance
will be too complex and learn all the patterns as well the noise on the training set perfectly
but will fail to replicate the same performance on the test set leading to high errors on the
test set or overfitting. To avoid such issues, it is important to understand the trade-off
between bias and variance while working on a business problem and come up with an
optimal solution that maintains a balance between bias and variance so that model is neither
underfitting nor overfitting but is a good fit.

66. What is an activation function, and why does a neural network need one?

Basic Deep Learning


Activation Functions are mathematical functions that apply a transformation on the output
of a layer in a neural network, which generally tends to be a linear combination of the nodes
of the previous layer with weights and biases. Activation Functions are crucial because they
introduce non-linearity into the neural network - without this, a neural net is simply a large
linear combination of its nodes, and hence, no more powerful than a linear regressor or
classifier. Neural networks are needed to find patterns and draw decision boundaries in
problems that can be highly complex and non-linear, and this makes Activation Functions
extremely important to their functioning. Some examples of Activation Functions are the
Sigmoid function, the Tanh function, and the ReLU function.

67. Why is the Sigmoid activation function not preferred in hidden layers of deep neural
networks?

Basic Deep Learning


The Sigmoid function takes in any real number and outputs a continuous numeric value
between 0 and 1, which can then be discretized using a threshold (Ex: 0.5) and converted
into either 0 or 1 - hence its use as a binary classifier. Therefore, the Sigmoid function is
generally preferred in the output layer of a binary classification neural network. It is not
recommended to use it in the hidden layers because of the vanishing gradient problem i.e, if
your input is on the higher side in terms of magnitude (where the sigmoid function goes flat),
then the gradient will be close to zero. Due to the calculus of the chain rule of derivatives
used in backpropagation, this would result in multiple small values being multiplied with each
other to determine the final step size in gradient descent, and that would be an extremely
small step, meaning the neural network's learning speed would be negligible. Hence, we do
not prefer using the Sigmoid function in the hidden layers of deep neural networks.

68. Why is it not a good idea to use the Sigmoid function in the output layer of a neural
network meant for multi-class classification problems?
Basic Deep Learning


The Sigmoid function merely outputs the probability / likelihood of that option being correct,
without taking into account the other options in a multi-class problem, and the fact that the
probabilities of all the multiple classes should add up to 1. This is actually done by the
Softmax activation function, which is a generalized version of the Sigmoid for multi-class
problems. Hence, we usually use the Softmax function in the output layer of a neural
network when dealing with multi-class classification, so that we can get the output in a
probabilistic shape taking all the options into account, and not just one.

69. What are the potential pitfalls of using neural networks for supervised learning?

Basic Deep Learning


The first problem with traditional fully connected neural networks is that they are very
computationally intensive, so they may take significantly longer to train and come up with
predictions than a more traditional machine learning algorithm, due to their vast number of
parameters and their hierarchical non-linear complexity - especially in deep neural networks.
This drawback means that naturally, neural networks would need to significantly outperform
a competing ML model in terms of the evaluation metrics for us to even consider using them
for supervised learning - and this tends to happen only once we cross a certain threshold in
terms of the volume of training data, usually in the order of millions of training examples.
Hence neural networks should not be used on smaller or intermediate sized training datasets
in supervised learning problems, because an ML model would likely perform as well or better
at a fraction of the compute cost with that size of data. Another problem with neural
networks is their black-box nature - we often don't know how or why the NN came up with a
certain output. Since its internal working is often not interpretable, it is often out of the
question to consider using neural networks in sensitive use cases where the explainability of
a model is paramount, such as healthcare or criminal justice. These are the potential pitfalls
of using neural networks that one should keep in mind before applying them to supervised
learning problems.

70. What hyperparameters can you tune inside neural networks?

Basic Deep Learning

The architecture of a neural network, in terms of the number of neurons, the number of
layers and the activation function at various layers, is the first obvious set of
hyperparameters that can be tuned. The learning characteristics of the network, such as its
learning rate, the number of epochs and the batch size, are also an important set of
hyperparameters which can be tuned to improve the network's performance. There are smaller
and more nuanced hyperparameters that can also help in fine-tuning the neural net, such as
momentum parameters, a decay in the learning rate, the dropout ratio, the weight
initialization scheme and the batch normalization hyperparameters.

71. What are the pros and cons of using Batch Gradient Descent vs Stochastic Gradient
Descent?

Intermediate Deep Learning


Batch Gradient Descent suffers from computational cost, especially for larger datasets,
because it accepts the entire training dataset as one batch. This means each epoch will take a
long time to complete. So in case of a large training dataset, Stochastic Gradient Descent
may be preferred. However, the convergence characteristics of Batch Gradient Descent are
better - it converges directly to a minima, whereas Stochastic Gradient Descent will oscillate
in the near vicinity of the minima without properly reaching it, although Stochastic Gradient
Descent does converge and reach that point faster. Stochastic Gradient Descent also shows
very noisy learning characteristics, due to the variability between each training example
used. Another drawback of Stochastic Gradient Descent is that since we use only one
example at a time, we lose the compute advantage of vectorized implementation on it. So
Batch Gradient Descent is generally preferred for smaller datasets, while Stochastic Gradient
Descent is used for larger datasets. However due to the significant drawbacks of each
approach, a compromise called Mini-Batch Gradient Descent is often preferred among vanilla
optimization algorithms that don't use momentum or adaptive gradient, albeit with the cost
of an additional hyperparameter to tune, which is the mini-batch size.

72. Is the bias-variance tradeoff in Machine Learning applicable to Deep Neural Networks?
Why do you say so?

Advanced Deep Learning


The biggest advantage of neural networks is that unlike traditional machine learning
algorithms, they appear to have no limit to the sheer complexity of the decision boundaries
they can create. This means that although they are data hungry, when they are actually
provided with larger and larger volumes of data, their performance tends to continually
improve when the number of nodes and layers in the network is increased, as opposed to
machine learning algorithms, whose performance tends to stagnate beyond a point even
after access to larger amounts of data. All of this means that the traditional bias-variance
tradeoff seen in machine learning may not strictly be applicable in deep learning; neural
networks merely appear to move to a new stage of the tradeoff when the volume of data
and the complexity of the neural network are correspondingly increased.

73. Let's say you have two neural networks. One of them has one hidden layer with sixteen
nodes, while the other has four hidden layers with four nodes each, so they both have
sixteen neurons, just in different configurations. Which of these is likely to perform better
on a complex supervised learning task and why?

Intermediate Deep Learning


Although the width (number of neurons in a layer) and depth (number of layers) of neural
networks are both important factors in determining its performance, complex supervised
learning tasks such as classifying a picture as a dog / cat appear to be best solved by
introducing a hierarchy in the neural network, that can progressively learn more and more
complex patterns in the data. In such an example, the second network, with four layers of
four nodes each, would be likely to perform better on the task than the first network, since it
has multiple layers and hence provides the network with a hierarchical mode of learning,
where the deeper layers may be able to understand more complex shapes and patterns in
the data. The depth of the neural network seems to increase its ability to learn complex
representations of the data more than its width - Ex: Some of the most famous neural
networks like GPT-3 have nearly a hundred layers.

74. How different is the decision boundary created by a neural network in comparison to
other non-linear ML algorithms such as Decision Trees and Random Forests? Which of
these techniques can create the most flexible non-linear decision boundary and why?

Advanced Deep Learning


Neural networks can create the most complex decision boundaries out of all the alternatives
listed, due to their hierarchical nature of complexity and the fact that each node or layer
added in the network increases the flexibility of the model. Although Decision Trees,
Random Forests and Neural Networks are all non-linear approaches, the nature of the non-
linearity in the decision boundary differs among them. Decision Trees create "piecewise"
non-linearity - they create orthogonal / linear splits on every individual feature and create
rectangular boundaries based on that. This approach is more flexible than linear, but perhaps
not as flexible as a curved non-linear boundary. Random Forests attempt to aggregate
multiple trees and hence approach a curved boundary by combining multiple linear splits, but
they still only approximate curved non-linearity and don't actually accomplish it. Neural
networks do, however, create curved non-linear decision boundaries because they combine
multiple linear nodes and apply non-linear transformations in the form of activation
functions at each layer, and that level of flexibility in creating curved non-linearity is
unrivalled by any other machine learning algorithm.

75. What would be a good use case for implementing fully connected or other kinds of
neural networks for supervised learning over other ML models and why?

Intermediate Deep Learning


The use case for neural networks in supervised learning should ideally be in those scenarios
where traditional machine learning algorithms are known to fail or be inadequate for solving
the problem. This could be for highly unstructured kinds of data such as images, text or
audio, where the algorithm itself has to extract the features relevant to the prediction from
the dataset, and hence a traditional machine learning approach wouldn't work. Another use
case for neural networks is when the size of the dataset is quite large, and we would like that
increased dataset size to translate to improved pattern detection by the model. So when we
have an extremely large dataset (in the order of millions of examples) or unstructured data,
neural networks may be preferred over ML models.

76. Would applying a neural network make sense in a healthcare setting where we need to
predict the diagnosis and medication to offer a patient based on the symptoms displayed?
Why do you think so?

Intermediate Deep Learning


No, neural networks should ideally not be applied for any use case where the interpretability
or explainability of a model's decision making is paramount. Healthcare is a highly sensitive
domain, where decision making around diagnosis and medication for symptoms can make a
huge difference to the health condition of the patient, and medical practitioners cannot
afford to make mistakes in that process. Hence, the model used needs absolute transparency
rather than top performance which is not explainable, and neural networks would not be as
preferred for a healthcare use-case as decision trees or random forests.

77. What is the role of the Convolution operation in helping a neural network understand
images?

Basic Deep Learning


Convolution is a mathematical operation which takes two inputs such as image matrix and a
filter or kernel. It is the first layer to extract features from an input image in a CNN.
Convolution helps to retain the relationship between pixels by learning image features using
small squares of input data. The way the convolution operation mathematically works is by
using the dot product of the filter vector and pixel vector to replace the image pixels with
new values (modified image), and these dot product values are higher when the pattern of
the filter matches the pattern of the pixels. Hence, convolution excels at detecting patterns
and features in the image that match the patterns of the filters, and this is how feature
extraction is performed on the image.

78. Why do we mostly use the ReLU activation function in the feature extraction stage of
convolutional neural networks (CNNs)?

Intermediate Deep Learning


ReLU has the advantage of being simple to compute and also avoiding the vanishing
gradient problem, due to its constant derivative of 1. This is useful in CNNs which are deep
networks, as the error from backpropagation is easily propagated for the neural network's
learning.

79. What is the role of fully connected layers in a CNN?

Intermediate Deep Learning


The output from the convolutional layers represents high-level features in the data. While
that output could be flattened and connected to the output layer, adding a fully-connected
layer is a (usually) cheap way of learning non-linear combinations of these features.
Essentially the convolutional layers are providing a meaningful, low-dimensional, and
somewhat invariant feature space, and the fully-connected layer is learning a (possibly non-
linear) function in that space.

80. What are some drawbacks of using Convolutional Neural Networks on image datasets,
and how can they be addressed?

Intermediate Deep Learning


Although CNNs are optimized to work on image data and perform better and more
efficiently on images than fully connected neural networks, they still suffer from some
drawbacks which should be kept in mind. CNNs require quite a lot of labelled image data in
order to reach near-human levels of performance in image related tasks, and such data may
not readily be available. In that case, it may be better to use Transfer Learning to import the
weights and architecture of a pre-trained model and only fine tune its last few layers to apply
it to the problem at hand. CNNs may also be susceptible to spurious patterns in the data
(such as the sky always being present in car images - so it wrongly learns that having a sky is
important to classify something as a car), and this susceptibility can be resolved by
diversifying the training dataset to ensure nothing else about the images is consistent other
than the exact pattern we want the CNN to learn. CNNs can also be susceptible to small
perturbations in the dataset, for example: not being rotationally invariant, and this problem
should be addressed through the technique of data augmentation through various image
modification techniques such as flipping, rotation, cropping, mirroring, color modification etc.

81. Why is text pre-processing an essential part of NLP? What happens if we fail to pre-
process text data?

Basic Deep Learning


Text preprocessing helps us get rid of the unhelpful parts of the data, or noise, by converting all characters to lowercase, removing punctuation marks, and removing stop words and typos. Removing noise comes in handy when you want to do text analysis on pieces of data like comments or tweets, as it strips out the text that interferes with the analysis. If the text is not pre-processed, you may receive errors or your model will not perform as expected.
82. In case you're working on an NLP application such as sentiment analysis of Twitter
posts, describe the text pre-processing steps that would most likely be required?

Intermediate Deep Learning


1] Lowercasing for consistency.
2] Stemming to reduce words to their root form.
3] Lemmatization to map words to their dictionary root form.
4] Stop-word removal, because stop words carry low information and don't contribute to the sentiment.
5] Noise removal, including digits, hashtags (since these are Twitter posts), and special characters.
6] Removing emoticons (again, since these are Twitter posts), because they are noise.

83. Which evaluation metric is suitable to measure the performance of sentiment analysis
and why?

Intermediate Deep Learning


Sentiment analysis is a classification problem, thus, it uses the metrics of Precision, Recall, F-
score, and Accuracy. Also, average measures like macro, micro, and weighted F1 scores are
useful for multi-class problems. Accuracy is used when the True Positives and True negatives
are more important while F1-score is used when the False Negatives and False Positives are
crucial. F1 scores also are helpful when there is a lot of class imbalance. As sentiment
analysis is a real-time problem, we can expect a lot of class imbalance. Thus, F1 scores are
mostly used.

84. What is the difference between stemming and lemmatization? Could you provide an
example?

Basic Deep Learning


Stemming and Lemmatization both generate the foundation of the inflected words. The
difference is that the stem may not be an actual word, whereas the lemma is an actual
language word. For eg: beautiful and beautifully will be stemmed to beauti which has no
meaning in the English dictionary. The same words are, however, lemmatised to beautiful and beautifully respectively, without changing the meaning of the words.
85. Would you consider Logistic Regression to be a special case of using Neural Networks?
If so, how?

Basic Deep Learning


Yes, logistic regression is a specialized case of a one-node neural network, where we use
the Sigmoid activation function and the cost function being minimized is the Binary Cross-
Entropy function.

86. How do you compare categorical values? How would you know whether a categorical value is related to the target variable?

Basic Advanced Stats

Comparing categorical values: when the predictor has three or more levels/categories and the target variable is nominal, the degree of association between the predictor and the target variable can be measured with statistics such as the chi-squared test.

Checking whether a categorical value is related to the target variable:

- When there is one continuous target variable, one or more categorical independent variables, and no control variable, you can go for ANOVA.

- Similarly, when there is one continuous target variable, only one categorical independent variable (dichotomous, e.g. pass/fail), and no control variable, go for a t-test.

87. What is Linear regression? Explain the assumptions.

Basic Advanced Stats

Linear regression is an analysis that assesses whether one or more predictor variables explain the
dependent (criterion) variable. The regression has five key assumptions:

1) Linear relationship: Linear regression needs the relationship between the independent and dependent
variables to be linear. The linearity assumption can best be tested with scatter plots.

2) Normality: The error terms must be normally distributed. To check normality, one can look at a Q-Q plot, or perform statistical tests of normality such as the Kolmogorov-Smirnov test or the Shapiro-Wilk test.

3) Multicollinearity: Linear regression assumes that there is little or no multicollinearity in the data.
Multicollinearity occurs when the independent variables are too highly correlated with each other.
Multicollinearity may be tested with three central criteria: Correlation matrix, Tolerance, VIF

4) No auto-correlation: Linear regression analysis requires that there is little or no autocorrelation in the
data. Autocorrelation occurs when the residuals are not independent of each other. For instance, this
typically occurs in stock prices, where the price is not independent of the previous price.

5) Homoscedasticity: The error terms must have constant variance. This phenomenon is known as
homoskedasticity. The presence of non-constant variance is referred to as heteroskedasticity.

88. Explain mathematically how Linear Regression works?

Basic Advanced Stats

The idea behind simple linear regression is to "fit" the observations of two variables into a linear
relationship between them. Graphically, the task is to draw the line that is "best-fitting" or "closest" to the
points (x_i,y_i), where x_i and y_i are observations of the two variables which are expected to depend
linearly on each other.

Although many measures of best fit are possible, for most applications the best-fitting line is found using
the method of least squares. The method finds the linear function L which minimizes the sum of the
squares of the errors in the approximations of the y_i by L(x_i)

For example, to find the line y = mx + b of best fit through N points (x_i, y_i), the goal is to choose m and b so as to minimize the sum of the squared differences between the observed y-coordinates and the y-coordinates predicted by the line from the corresponding x-coordinates.
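A minimal numerical sketch of fitting such a line by least squares; the data points are synthetic:

import numpy as np

# Noisy observations around the line y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

# np.polyfit minimizes the sum of squared errors sum((y_i - (m*x_i + b))**2)
m, b = np.polyfit(x, y, deg=1)
print(m, b)   # slope and intercept close to 2 and 1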

89. In your project, why was classification chosen over regression?

Basic Advanced Stats

Classification is used when the output variable is a category such as “red” or “blue”, “spam” or “not spam”.
It is used to draw a conclusion from observed values. Differently from regression which is used when the
output variable is a real or continuous value like “age”, “salary”, etc.

When we must identify the class that the data belongs to, we use classification over regression, for example predicting whether a person is male or female from their name, rather than predicting a continuous quantity about the person.

90. Explain the working of logistic regression?


Basic Advanced Stats


91. Evaluation metrics of regression/classification model?

Basic Advanced Stats


92. Build a credit card fraud detection model

Advanced Advanced Stats

93. Evaluation Metrics (Difference between R-Square and Adjusted R-Square)

Basic Advanced Stats

R-squared (coefficient of determination) measures the proportion of the variation in your dependent
variable (Y) explained by your independent variables (X) for a linear regression model.

R² = Explained variation / Total Variation

Adjusted R-squared adjusts the statistic based on the number of independent variables in the model.

It is possible that R Square has improved significantly yet Adjusted R Square is decreased with the
addition of a new predictor when the newly added variable brings in more complexity than the power to
predict the target variables.

Adj. R² = 1 - ((1 - R²) * (n - 1) / (n - p - 1)), where p is the number of predictors and n is the number of observations.
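
As a small sketch, both quantities can be computed in Python (r2_score comes from scikit-learn; the example values and n_predictors=2 are hypothetical):

from sklearn.metrics import r2_score

def adjusted_r2(y_true, y_pred, n_predictors):
    # Adj. R² = 1 - (1 - R²) * (n - 1) / (n - p - 1)
    r2 = r2_score(y_true, y_pred)
    n = len(y_true)
    return 1 - (1 - r2) * (n - 1) / (n - n_predictors - 1)

print(adjusted_r2([3.0, 5.0, 7.5, 9.0], [2.8, 5.4, 7.0, 9.3], n_predictors=2))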

94. Difference between logistic regression and CART?

Basic Advanced Stats

1. CART works best locally, while logistic regression works best globally.

2. CART is useful for identifying interactions between variables.

3. CART can predict both categorical and quantitative targets, while logistic regression can only predict categorical/ordinal targets.

4. CART is easy to run and interpret.

5. CART can lead to overfitting if splitting is not stopped (i.e. if the tree is not pruned).

6. CART works best with a larger dataset, while logistic regression works well on a smaller dataset.

7. CART is non-parametric, while logistic regression is parametric.

95. What are the limitations of Logistic Regression

Basic Advanced Stats

1. A major limitation of Logistic Regression is the assumption of linearity between the log-odds of the dependent variable and the independent variables.

2. It can only be used to predict discrete outcomes. Hence, the dependent variable of Logistic Regression is bound to a discrete set of classes.

3. Non-linear problems can’t be solved with logistic regression because it has a linear decision surface, and linearly separable data is rarely found in real-world scenarios.

4. Logistic Regression requires little or no multicollinearity between the independent variables.

5. If the number of observations is lesser than the number of features, Logistic Regression should not be
used, otherwise, it may lead to overfitting.

96. Name the library used to implement logistic Regression

Basic Advanced Stats

Python:

from sklearn.linear_model import LogisticRegression

R:

glm(Target ~.,family=binomial(link='logit'),data=train)

97. What is confusion matrix?

Basic Advanced Stats


A Confusion matrix is an N x N matrix used for evaluating the performance of a classification model,
where N is the number of target classes. The matrix compares the actual target values with those
predicted by the machine learning model. This gives us a holistic view of how well our classification model
is performing and what kinds of errors it is making.

True Positive (TP): The actual value was positive and the model predicted a positive value

True Negative (TN): The actual value was negative and the model predicted a negative value

False Positive (FP) – Type 1 error: The actual value was negative but the model predicted a positive value

False Negative (FN) – Type 2 error: The actual value was positive but the model predicted a negative
value
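
A minimal sketch of extracting the four cells with scikit-learn (the label vectors are made up):

from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # actual labels (hypothetical)
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # model predictions (hypothetical)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP:", tp, "TN:", tn, "FP:", fp, "FN:", fn)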

98. What is vif? What is the precision of Vif ?

Basic Advanced Stats

VIF, the Variance Inflation Factor, is used during regression analysis to assess whether certain
independent variables are correlated to each other and the severity of this correlation. If your VIF number
is greater than 10, the included variables are highly correlated to each other. Since the ability to make
precise estimates is important to many companies, generally people aim for a VIF within the range of 1-5.
A cutoff number of 5 is commonly used.
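
A rough sketch of computing a VIF per predictor with statsmodels; the small DataFrame below is hypothetical illustration data:

import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

X = pd.DataFrame({
    "income": [40, 55, 60, 75, 90, 100],
    "spend":  [20, 28, 30, 40, 45, 52],   # deliberately correlated with income
    "age":    [25, 32, 41, 38, 50, 47],
})

X_const = add_constant(X)   # VIF is usually computed with an intercept term included
vif = pd.Series(
    [variance_inflation_factor(X_const.values, i) for i in range(X_const.shape[1])],
    index=X_const.columns,
)
print(vif)   # values above roughly 5-10 indicate problematic collinearity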

99. How do you deal with multi-colinearity and conditional probability?

Intermediate Advanced Stats

Potential solutions to deal with multicollinearity:

- Remove some of the highly correlated independent variables.

- Linearly combine the independent variables, such as adding them together.

- Perform an analysis designed for highly correlated variables, such as principal components analysis or
partial least squares regression.

100. Is logistic regression a part of Linear regression?

Basic Advanced Stats


Logistic regression is not a special case of linear regression, but both are generalized linear models: in logistic regression, the outcome depends on a linear combination (weighted sum) of the inputs and parameters, passed through the logit link function.


101. Write the equation of the linear Regression? Explain residuals?

Basic Advanced Stats

The actual value of the dependent variable is yi.

The predicted value of yi is defined to be y^i = a xi + b, where y = a x + b is the regression equation.

The residual is the error that is not explained by the regression equation:

ei = yi - y^i.

A residual plot plots the residuals on the y-axis vs. the predicted values of the dependent variable on the
x-axis. We would like the residuals to be unbiased: have an average value of zero in any thin vertical strip,
and homoscedastic, which means "same stretch": the spread of the residuals should be the same in any
thin vertical strip.

The residuals are heteroscedastic if they are not homoscedastic.

102. Explain homoscedasticity ?

Intermediate Advanced Stats

The assumption of homoscedasticity (meaning “same variance”) is central to linear regression models.
Homoscedasticity describes a situation in which the error term (that is, the “noise” or random disturbance
in the relationship between the independent variables and the dependent variable) is the same across all
values of the independent variables. Heteroscedasticity (the violation of homoscedasticity) is present when the size of the error term differs across values of an independent variable. The impact of violating the assumption of homoscedasticity is a matter of degree, increasing as the heteroscedasticity increases.

103. Performance measures of linear Regression?

Basic Advanced Stats

Most commonly known evaluation metrics include:

R-squared (R2), which is the proportion of variation in the outcome that is explained by the predictor
variables. In multiple regression models, R2 corresponds to the squared correlation between the observed
outcome values and the predicted values by the model. The Higher the R-squared, the better the model.

Root Mean Squared Error (RMSE), which measures the average error performed by the model in
predicting the outcome for an observation. Mathematically, the RMSE is the square root of the mean
squared error (MSE), which is the average squared difference between the observed actual outcome
values and the values predicted by the model. So, MSE = mean((observeds - predicteds)^2) and RMSE =
sqrt(MSE). The lower the RMSE, the better the model.

Residual Standard Error (RSE), also known as the model sigma, is a variant of the RMSE adjusted for the
number of predictors in the model. The lower the RSE, the better the model. In practice, the difference
between RMSE and RSE is very small, particularly for large multivariate data.

Mean Absolute Error (MAE), like the RMSE, the MAE measures the prediction error. Mathematically, it is
the average absolute difference between observed and predicted outcomes, MAE = mean(abs(observeds
- predicteds)). MAE is less sensitive to outliers compared to RMSE.

Additionally, there are four other important metrics - AIC, AICc, BIC and Mallows Cp

The lower these metrics, the better the model.

AIC stands for Akaike’s Information Criteria. The basic idea of AIC is to penalize the inclusion of additional variables in a model: it adds a penalty that increases the error when additional terms are included. The lower the AIC, the better the model.

AICc is a version of AIC corrected for small sample sizes.

BIC (or Bayesian information criteria) is a variant of AIC with a stronger penalty for including additional
variables to the model.

Mallows Cp: A variant of AIC developed by Colin Mallows.
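
A short sketch of the distance-based metrics using scikit-learn (the observed and predicted values are made up):

import numpy as np
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

y_true = np.array([3.0, 5.0, 7.5, 9.0])   # observed values (hypothetical)
y_pred = np.array([2.8, 5.4, 7.0, 9.3])   # model predictions (hypothetical)

print(r2_score(y_true, y_pred))                      # R-squared
print(np.sqrt(mean_squared_error(y_true, y_pred)))   # RMSE = sqrt(MSE)
print(mean_absolute_error(y_true, y_pred))           # MAE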


104. Explain prior probability, likelihood and marginal likelihood in context of naiveBayes
algorithm?

Basic Advanced Stats

105. Derive logistic regression equation.

Intermediate Advanced Stats

In logistic regression, the output must be a probability between 0 and 1, and, based on a cut-off, the final prediction is 0 or 1. A plain linear equation does not work because its output ranges from -infinity to +infinity, which is why the linear equation is transformed through the sigmoid (logistic) function.

Transformation of the linear regression equation into the logistic regression equation:

1. The linear regression equation is Y = b0 + b1*X, where Y can take any value from -infinity to +infinity.

2. To keep the output positive (eliminating -infinity), take the exponential: e^Y.

3. To keep the output below 1 (eliminating +infinity), divide by (e^Y + 1):

P = e^Y / (e^Y + 1)

This is the sigmoid function, so P always lies between 0 and 1.

4. Define the odds, where P is the probability of success and 1 - P the probability of failure:

odds = P / (1 - P)

5. Substituting P = e^Y / (e^Y + 1):

P / (1 - P) = (e^Y / (e^Y + 1)) / (1 / (e^Y + 1))
            = (e^Y / (e^Y + 1)) * ((e^Y + 1) / 1)
            = e^Y

6. So the odds can be written as P / (1 - P) = e^Y.

7. Taking the natural log of both sides removes the exponential:

log(P / (1 - P)) = Y

log(P / (1 - P)) = b0 + b1*X

i.e. the log-odds (logit) of the probability is a linear function of the predictors.

106. Explain how SVM works.

Intermediate Advanced Stats

A simple linear SVM classifier works by making a straight line between two classes.

That means all of the data points on one side of the line will represent a category and the data points on
the other side of the line will be put into a different category. This means there can be an infinite number
of lines to choose from.

What makes the linear SVM algorithm better than some of the other algorithms, like k-nearest neighbors, is that it chooses the best line to classify your data points: the line that separates the data and is as far away from the closest data points as possible.

A 2-D example helps to make sense of all the machine learning jargon. Basically, you have some data
points on a grid. You're trying to separate these data points by the category they should fit in, but you
don't want to have any data in the wrong category. That means you're trying to find the line between the
two closest points that keeps the other data points separated.

So the two closest data points give you the support vectors you'll use to find that line. That line is called
the decision boundary.

The decision boundary doesn't have to be a line. It's also referred to as a hyperplane because you can find
the decision boundary with any number of features, not just two.

Types of SVMs:

Simple SVM: Typically used for linear regression and classification problems.

Kernel SVM: Has more flexibility for non-linear data because you can add more features to fit a
hyperplane instead of a two-dimensional space.
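
A minimal sketch of both variants with scikit-learn, on a synthetic dataset used purely for illustration:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

linear_svm = SVC(kernel="linear").fit(X_train, y_train)   # simple (linear) SVM
kernel_svm = SVC(kernel="rbf").fit(X_train, y_train)      # kernel SVM for non-linear data

print(linear_svm.score(X_test, y_test), kernel_svm.score(X_test, y_test))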

107. How will you handle class imbalance problem? What are the various approaches?
Intermediate Advanced Stats

Imbalanced data typically refers to a problem with classification problems where the classes are not
represented equally.

Few tactics To Combat Imbalanced Training Data:

- Collect More Data

- Try Changing Your Performance Metric

- Try Resampling Your Dataset

- Try Generating Synthetic Samples (the most popular such algorithm is SMOTE, the Synthetic Minority Over-sampling Technique; see the sketch after this list)

- Try Different Algorithms

- Try Penalized Models
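
A minimal sketch of two of these tactics on a toy imbalanced dataset; the penalized model only needs scikit-learn, while SMOTE assumes the separate imbalanced-learn package is installed:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import SMOTE   # from the imbalanced-learn package

# a roughly 95/5 imbalanced toy dataset
X, y = make_classification(n_samples=1000, weights=[0.95], random_state=0)

# penalized model: class_weight="balanced" up-weights the minority class
clf_weighted = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)

# resampling: generate synthetic minority samples, then fit on the balanced data
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
clf_smote = LogisticRegression(max_iter=1000).fit(X_res, y_res)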

108. Why do we use sigmoid and not any increasing function from 0 to 1?

Intermediate Advanced Stats

The main reason why we use the sigmoid function is that it exists between (0 to 1). Therefore, it is
especially used for models where we have to predict the probability as an output. Since the probability of
anything exists only between the range of 0 and 1, sigmoid is the right choice.

109. What are various evaluation parameters of regression and classification to evaluate
the model?

Intermediate Advanced Stats

Regression evaluation metrics: R-squared (R2), Root Mean Squared Error (RMSE), Residual Standard Error (RSE) and Mean Absolute Error (MAE), plus the information criteria AIC, AICc, BIC and Mallows Cp. These are described in detail under question 103 above; for all of the error and information-criterion metrics, lower is better, while a higher R-squared is better.

Classification evaluation metrics:

- Average classification accuracy, representing the proportion of correctly classified observations.

- Confusion matrix, which is 2x2 table showing four parameters, including the number of true positives,
true negatives, false negatives and false positives.

- Precision, Recall and Specificity, which are three major performance metrics describing a predictive
classification model

- ROC curve, which is a graphical summary of the overall performance of the model, showing the
proportion of true positives and false positives at all possible values of probability cutoff. The Area Under
the Curve (AUC) summarizes the overall performance of the classifier.
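
A short sketch of these classification metrics with scikit-learn (the labels and scores are made up):

from sklearn.metrics import (accuracy_score, confusion_matrix,
                             classification_report, roc_auc_score)

y_true  = [1, 0, 1, 1, 0, 0, 1, 0]                     # actual labels (hypothetical)
y_pred  = [1, 0, 0, 1, 0, 1, 1, 0]                     # predicted labels
y_score = [0.9, 0.2, 0.4, 0.8, 0.1, 0.7, 0.6, 0.3]     # predicted probabilities

print(accuracy_score(y_true, y_pred))
print(confusion_matrix(y_true, y_pred))
print(classification_report(y_true, y_pred))   # precision, recall, f1 per class
print(roc_auc_score(y_true, y_score))          # area under the ROC curve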

110. In your project, If we use regression model, what would be the outcome?

Intermediate Advanced Stats

Regression analysis generates an equation to describe the statistical relationship between one or more
predictor variables and the response variable (continuous in nature). Where the response variable is the
target variable.
111. List out some common problems faced while analyzing the data.

Basic Advanced Stats

112. OLS is to linear regression. Maximum likelihood is to logistic regression. Explain the
statement.

Intermediate Advanced Stats

113. Is rotation necessary in PCA? If yes, Why? What will happen if you don’t rotate the
components?

Advanced Advanced Stats

114. What are the metrics chosen to evaluate model performance

Intermediate Data Mining


115. How will you treat missing values?

Basic Data Mining

116. Explain Random Forest algorithm

Basic Data Mining


117. Explain Decision Tree algorithm


Basic Data Mining

The Decision Tree algorithm belongs to the family of supervised learning algorithms. Unlike many other supervised learning algorithms, a decision tree can be used for solving both regression and classification problems.

The goal of using a decision tree is to create a model that can be used to predict the class or value of the target variable by learning simple decision rules inferred from prior (training) data.

In Decision Trees, for predicting a class label for a record we start from the root of the tree. We compare
the values of the root attribute with the record’s attribute. On the basis of comparison, we follow the
branch corresponding to that value and jump to the next node.

Types of decision trees: ID3, CART, C4.5, CHAID, MARS
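
A minimal sketch with scikit-learn's CART implementation, using the built-in iris dataset purely for illustration:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
print(tree.score(X_test, y_test))   # accuracy on the held-out data
print(export_text(tree))            # the learned decision rules as plain text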

118. Differentiate between random forest and decision trees?

Basic Data Mining


119. Why did you choose Random forest or Decision trees model ?

Basic Data Mining

Random forests consist of multiple single trees, each based on a random sample of the training data. They are typically more accurate than single decision trees: the decision boundary becomes more accurate and stable as more trees are added.

Two reasons why random forests outperform single decision trees:

- Trees are unpruned. While a single decision tree like CART is often pruned, a random forest tree is fully
grown and unpruned, and so, naturally, the feature space is split into more and smaller regions.

- Trees are diverse. Each random forest tree is learned on a random sample, and at each node, a random
set of features are considered for splitting. Both mechanisms create diversity among the trees.

120. Discuss Customer segmentation by Clustering

Intermediate Data Mining


The objective of any clustering algorithm is to ensure that the distance between data points within a cluster is very low compared to the distance between two clusters, i.e. members of the same group are very similar, and members of different groups are very dissimilar.

For e.g., k-means clustering can be used for creating customer segments based on their income and
spend data
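
A minimal sketch of such a segmentation with scikit-learn; the income/spend values are randomly generated stand-ins for real customer data:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
income = rng.normal(60, 15, 200)    # hypothetical annual income (in thousands)
spend = rng.normal(30, 10, 200)     # hypothetical annual spend (in thousands)
X = np.column_stack([income, spend])

X_scaled = StandardScaler().fit_transform(X)    # scale before distance-based clustering
segments = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_scaled)
print(np.bincount(segments))                    # number of customers in each segment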

121. List the drawbacks and advantages of decision trees

Basic Data Mining

Advantages:

- Compared to other algorithms decision trees requires less effort for data preparation during pre-
processing.

- A decision tree does not require normalization of data.

- A decision tree does not require scaling of data as well.

- Missing values in the data also do NOT affect the process of building a decision tree to any
considerable extent.

- A Decision tree model is very intuitive and easy to explain to technical teams as well as stakeholders.

Disadvantages:

- A small change in the data can cause a large change in the structure of the decision tree causing
instability.

- For a decision tree, the calculations can sometimes become far more complex than for other algorithms.

- Decision trees often take longer to train.

- Decision tree training is relatively expensive because of the higher complexity and time taken.

- A single decision tree is often inadequate for regression, i.e. for predicting continuous values precisely, because its predictions are piecewise constant.

122. How to reduce number of variables in Logistic regression and random forest?

Basic Data Mining

Seven techniques for dimensionality reduction:

- Missing Values Ratio: Data columns with a ratio of missing values greater than a given threshold can be
removed. The higher the threshold, the more aggressive the reduction.
- Low Variance Filter: Data columns with a variance lower than a given threshold can be removed. Notice
that the variance depends on the column range, and therefore normalization is required before applying
this technique.

- High Correlation Filter: Calculate the Pearson product-moment correlation coefficient between numeric
columns and Pearson’s chi-square value between nominal columns. For the final classification, we only
retain one column of each pair of columns whose pairwise correlation exceeds a given threshold. Notice
that correlation depends on the column range, and therefore, normalization is required before applying
this technique.

- Principal Component Analysis (PCA): First principal component has the largest possible variance; each
succeeding principal component has the highest possible variance under the constraint that it is
orthogonal to (i.e., uncorrelated with) the preceding principal components. Keeping only the first m < n
principal components reduces the data dimensionality while retaining most of the data information, i.e.,
variation in the data.

- Backward Feature Elimination: We remove one input column (from training model on n columns) at a
time and train the same model on n-1 columns. The input column whose removal has produced the
smallest increase in the error rate is removed, leaving us with n-1 input columns. The classification is then
repeated using n-2 columns, and so on. Each iteration k produces a model trained on n-k columns and an
error rate e(k). By selecting the maximum tolerable error rate, we define the smallest number of columns
necessary to reach that classification performance with the selected machine learning algorithm.

- Forward Feature Construction. This is the inverse process to backward feature elimination. We start
with one column only, progressively adding one column at a time, i.e., the column that produces the
highest increase in performance.

- Multicollinearity check using the Variance Inflation Factor (VIF), typically used for logistic regression:

The VIF provides information on how large the standard error is compared with what it would be if the
variables were uncorrelated with the other predictor variables in the model. It is calculated for each
explanatory variable and those with high values are removed. Common thumb-rule classifies a VIF value
of >=5 significantly high implying high multicollinearity. A cut-off VIF value of <=2 is used by most
businesses since it offers a more stringent and clear rule.

123. How will you decide the number of clusters in K-Means?

Basic Data Mining

The optimal number of clusters can be found with the elbow method:

1. Compute the clustering algorithm (e.g. k-means) for different values of k, for instance varying k from 1 to 10 clusters.

2. For each k, calculate the total within-cluster sum of squares (WSS).

3. Plot the curve of WSS against the number of clusters k.

4. The location of a bend (knee/elbow) in the plot is generally considered an indicator of the appropriate number of clusters.
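
A small sketch of this elbow procedure with scikit-learn, on synthetic data used only for illustration:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)   # toy data

wss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
       for k in range(1, 11)]   # inertia_ is the within-cluster sum of squares

plt.plot(range(1, 11), wss, marker="o")
plt.xlabel("k")
plt.ylabel("WSS")
plt.show()   # the bend ("elbow") in the curve suggests the number of clusters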

124. List out some of the best practices for data cleaning

Basic Data Mining

125. What are assumptions of clustering algorithm

Intermediate Data Mining

K-Means clustering method considers two assumptions regarding the clusters –

first that the clusters are spherical and second that the clusters are of similar size.

Spherical assumption helps in separating the clusters when the algorithm works on the data and forms
clusters. If this assumption is violated, the clusters formed may not be what one expects. On the other
hand, assumption over the size of clusters helps in deciding the boundaries of the cluster. This assumption
helps in calculating the number of data points each cluster should have. This assumption also gives an
advantage. Clusters in K-means are defined by taking the mean of all the data points in the cluster. With
this assumption, one can start with the centers of clusters anywhere. Keeping the starting points of the
clusters anywhere will still make the algorithm converge with the same final clusters as keeping the
centers as far apart as possible.

126. What is a waterfall chart and when do we use it?

Intermediate Excel

A waterfall chart is used to represent the changes in a given value over a period of time. The changes are usually tracked as positives (rises in the value) and negatives (dips in the value). The beginning and ending values are represented as solid columns and the intermediate changes are shown as floating columns. For example, waterfall charts can be used to represent a company's financial performance (profit, loss) over a period, or to display the changes in a product's value over time.

127. What Is a one- and two-variable data table?


Intermediate Excel

This again follows the what-if analysis principle.

A two-variable data table lets us check two input values at the same time for the same formula in a data table. It is primarily used when the formula depends on several values, two of which are varied as the inputs.

A one-variable data table is similar to a two-variable data table, but it varies only one input at a time.

128. What are the different sections of a Pivot Table

Intermediate Excel

There are 4 different sections in a Pivot Table:

1. Rows - if a field is to be viewed in the rows of the Pivot Table, drag the field to the Rows section.

2. Columns - if a field is to be viewed in the columns of the Pivot Table, drag the field to the Columns section.

3. Values - while the rows and columns of the table are fixed using the Rows and Columns sections, the values summarised in the table (e.g. sums, counts, averages) are set using the Values section.

4. Filters - the Filters area is used to place filters on the Pivot Table.

129. What are the most common questions you should ask a client before creating a
dashboard?

Advanced Excel

1. What inference is to be made from the dashboard?

2. The audience - who will view the dashboard, and how much detail do they need?

3. Does the data carry present or past (historical) information?

130. How can we select all blank cells in Excel?

Basic Excel

1. Select the whole data set.

2. Press F5 to open the Go To dialogue box.

3. Click the Special... button to open the Go To Special dialogue box.

4. Select Blanks and click OK.

This selects all the blank cells in your dataset.

131. Can we sort multiple columns at one time?

Basic Excel

Yes, using sort dialog box.

132. What is the difference between absolute and relative cell references?

Intermediate Excel

Absolute: An absolute reference in Excel is a reference that is "locked" so that the rows and columns won't change when the formula is copied.

Relative: A relative address will change when copied to another location in a worksheet because it describes the "offset" to another cell, rather than a fixed address.

133. What formula would you use to find the length of a text string in a cell?

Basic Excel

"=LEN(cell)"

The above formula can be used to find the length of the text string in the specified cell.
134. What are slicers in Excel

Intermediate Excel

Slicers are visual filters. The objective of slicers is the same as that of filters, but with slicers the filter values are visible on the sheet. They are mainly used in Pivot Tables.

135. How can you Combine Data from Multiple tables into 1 pivot table

Intermediate Excel


136. Explain Goal Seek and Solver.

Intermediate Excel

Goal Seek adjusts the value of an input cell until a formula reaches the goal (target) value; it works backwards from the desired result to the required input, much like a consultant figuring out what is needed to meet a target.

Solver uses a trial-and-error principle: it iterates through a series of candidate solutions for a specific problem statement and shows how the output changes for different inputs.

137. What are named ranges in excel

Intermediate Excel

Named ranges are used to give a group of cells (or a single cell) a common name. The name can then be used inside formulas instead of typing the cell range, which is easier to read.

138. Explain wildcard characters in Excel

Basic Excel

Wildcards are used to find strings in cells that are not exact matches but are similar to the search text. There are three wildcard characters:

1. * (asterisk) - matches any number of characters. For example, sh* would match shirt, short, shell, shall, shore, etc.

2. ? (question mark) - matches exactly one character. For example, ra? would match rap, ran, rat, raw, etc.

3. ~ (tilde) - used when the search string itself contains a wildcard character. For example, if you need to search for ki* in your data, * would normally be treated as a wildcard, so the formula may not fetch the desired output. Searching for ki~* returns ki*.

139. Explain the functions (VLOOKUP, COUNTIF, SUMIF, IFERROR, INDEX / MATCH)

Intermediate Excel

VLOOKUP - stands for vertical lookup. It is used to look up data that is organised vertically (in columns).

COUNTIF - conditional counting. It counts all the values in a range that meet a given criterion.

SUMIF - conditional summing. Like COUNTIF, SUMIF sums all the values in a range that meet a given condition.

IFERROR - catches errors in a formula. It takes two arguments: the value/formula to evaluate, and the value to return if that formula results in an error.

MATCH - returns the position of a value within a given range.

INDEX - returns the value at a given position in a range. INDEX takes up to three arguments: the range (array), the row number, and optionally the column number of the value to be returned.

140. What are the 5 V’s in Big Data ?

Basic Hadoop

Volume, Velocity, Variety, Veracity and Value

141. List the different daemons in Hadoop cluster

Basic Hadoop

Core Hadoop = HDFS + YARN

- HDFS Daemons --> NameNode, DataNode, StandbyNameNode


- YARN Daemons --> ResourceManager, NodeManager

142. What is HDFS ?

Basic Hadoop

Hadoop's Distributed File System which is fault-tolerant, reliable and scalable. Designed to store big
files efficiently in a distributed manner

143. What are the functions of the daemons in the Hadoop cluster ?

Basic Hadoop

Core Hadoop = HDFS + YARN

HDFS Daemons

HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server
that manages the file system namespace and regulates access to files by clients. In addition, there are a
number of DataNodes, usually one per node in the cluster, which manage storage attached to the nodes
that they run on. HDFS exposes a file system namespace and allows user data to be stored in files.
Internally, a file is split into one or more blocks and these blocks are stored in a set of DataNodes. The
NameNode executes file system namespace operations like opening, closing, and renaming files and
directories

The SecondaryNameNode is a checkpoint node where the metadata is periodically backed up at a specific interval. The SecondaryNameNode approach is now outdated: NameNode High Availability is achieved by replacing the SecondaryNameNode with a Standby NameNode.

YARN Daemons - ResourceManager, NodeManager

The ResourceManager and the NodeManager form the data-computation framework. The
ResourceManager is the ultimate authority that arbitrates resources among all the applications in the
system. The NodeManager is the per-machine framework agent who is responsible for containers,
monitoring their resource usage and reporting the same to the ResourceManager/Scheduler.

144. What is Yarn ?

Basic Hadoop
Cluster Resource Management System responsible for allocation of compute resources to all the jobs
submitted to the Hadoop cluster

145. What is new in Hadoop 3 when compared to Hadoop 2?

Intermediate Hadoop

146. What are active and passive NameNodes ?

Basic Hadoop

In a typical High Availability cluster, two separate machines are configured as NameNodes. At any point in
time, exactly one of the NameNodes is in an Active state, and the other is in a Standby state. The Active
NameNode is responsible for all client operations in the cluster, while the Standby is simply acting as a
slave, maintaining enough state to provide a fast failover if necessary.

147. What happens when two clients try to access the same file in HDFS ?

Basic Hadoop

Concurrent writes are not allowed to HDFS at the same time, concurrent reads are fine

148. What is a checkpoint ?

Basic Hadoop

The HDFS namespace is stored by the NameNode. The NameNode uses a transaction log called the
EditLog to persistently record every change that occurs to file system metadata. The entire file system
namespace is stored in another file called the FsImage. Both EditLogs and FSImage files are stored as a file
in the NameNode’s local file system. The NameNode keeps an image of the entire file system namespace
and file Blockmap in memory. When the NameNode starts up, or a checkpoint is triggered by a
configurable threshold, it reads the FsImage and EditLog from disk, applies all the transactions from the
EditLog to the in-memory representation of the FsImage, and flushes out this new version into a new
FsImage on disk. In a cluster with no high-availability, the checkpointing is taken care of by the
SecondaryNameNode
149. How does NameNode handle DataNode failure ?

Basic Hadoop

As soon as a DataNode is declared dead/non-functional, all the data blocks it hosted are re-replicated onto other DataNodes, using the remaining replicas of those blocks as the source. This is how the NameNode handles DataNode failures.

150. What are the steps of action when NameNode is down ?

Intermediate Hadoop

1. Use the file system metadata replica (FsImage) to start a new NameNode.
2. Then, configure the DataNodes and clients so that they can acknowledge this new NameNode,
that is started.
3. Now the new NameNode will start serving the client after it has completed loading the last
checkpoint FsImage (for metadata information) and received enough block reports from the
DataNodes.

151. How is HDFS fault tolerant ?

Intermediate Hadoop

HDFS also maintains the replication factor by creating a replica of data on other available machines in
the cluster if suddenly one machine fails.

152. What is the reason we use HDFS for large datasets instead of a lot of small files ?

Basic Hadoop

As the NameNode performs storage of metadata for the file system in RAM, the amount of memory
limits the number of files in HDFS file system. In simple words, more files will generate more metadata,
that will, in turn, require more memory (RAM).

153. What is a ‘block’ in HDFS ?


Basic Hadoop

In Hadoop, HDFS splits huge files into small chunks called blocks. A block is the smallest unit of data in the file system.

154. What are the default sizes of a Hadoop block in Hadoop 3 and Hadoop 1 ?

Basic Hadoop

The default block size is 128 MB in Hadoop 3 and 64 MB in Hadoop 1.

155. How do we change the block size in Hadoop ?

Basic Hadoop

The block size can be changed by setting the dfs.blocksize property to the required value (default 128 MB, 64 MB in older versions) in the hdfs-site.xml file, or by specifying dfs.blocksize when writing a file.

156. What does the ‘jps’ command do ?

Basic Hadoop

The jps command lists the Java processes running on a node, which in a Hadoop cluster shows the running daemons (NameNode, DataNode, ResourceManager, NodeManager, etc.). Internally, it uses the Java launcher to find the class name and arguments passed to the main method.

157. What is Rack Awareness in Hadoop ?

Intermediate Hadoop

A rack is a collection of nodes (usually tens of nodes) that are physically stored close together and connected to the same network switch. When a user requests a read/write in a large Hadoop cluster, the NameNode chooses a DataNode that is closer to the client (on the same or a nearby rack) in order to reduce network traffic. This is called rack awareness.

158. What is speculative execution in Hadoop ?


Intermediate Hadoop

Hadoop doesn't try to diagnose and fix slow running tasks, instead, it tries to detect them and runs
backup tasks for them. This is called speculative execution in Hadoop. These backup tasks are called
Speculative tasks in Hadoop

159. How do you restart all the daemons ?

Intermediate Hadoop

You can stop the NameNode individually using the /sbin/hadoop-daemon.sh stop namenode command, and then start it again using /sbin/hadoop-daemon.sh start namenode.

To restart all the daemons at once, use /sbin/stop-all.sh followed by /sbin/start-all.sh; the first command stops all the daemons and the second starts them again.

160. What are the different modes Hadoop can run in ?

Intermediate Hadoop

Standalone Mode
Pseudo-distributed Mode
Fully-Distributed Mode.

161. What is MapReduce ?

Basic Hadoop

MapReduce is a programming model or pattern within the Hadoop framework that is used to access big
data stored in the Hadoop File System (HDFS). It is a core component, integral to the functioning of the
Hadoop framework.

162. What is the syntax to run a MapReduce program ?

Basic Hadoop

hadoop jar jar_name package_name.class_name input_path_in_hdfs output_path_in_hdfs


163. What are the main configuration parameters in a MapReduce program ?

Intermediate Hadoop

Input location of Jobs in the distributed file system.

Output location of Jobs in the distributed file system.

The input format of data.

The output format of data.

The class which contains the map function.

The class which contains the reduce function.

The JAR file containing the mapper, reducer and driver classes.

164. What does “RecordReader” do in Hadoop ?

Intermediate Hadoop

RecordReader, typically, converts the byte-oriented view of the input, provided by the InputSplit, and
presents a record-oriented view for the Mapper and Reducer tasks for processing. It thus assumes the
responsibility of processing record boundaries and presenting the tasks with keys and values.

165. How do reducers communicate with each other ?

Intermediate Hadoop

Reducers always run in isolation and they can never communicate with each other as per the Hadoop
MapReduce programming paradigm

166. What does a MapReducer Partitioner do ?

Intermediate Hadoop

A partitioner partitions the key-value pairs of intermediate Map-outputs. It partitions the data using a
user-defined condition, which works like a hash function. The total number of partitions is the same as
the number of Reducer tasks for the job.
167. What does a combiner do ?

Basic Hadoop

A Combiner, also known as a semi-reducer, is an optional class that operates by accepting the inputs
from the Map class and thereafter passing the output key-value pairs to the Reducer class. The main
function of a Combiner is to summarize the map output records with the same key.

168. Explain Distributed cache in a MapReduce framework

Advanced Hadoop

A distributed cache is a mechanism supported by the Hadoop MapReduce framework through which we can broadcast small or moderate-sized (read-only) files to all the worker nodes where the map/reduce tasks for a given job are running.

169. What is the reason we can’t perform aggregation in mapper ? Why do we need the
reducer for this ?

Advanced Hadoop

The aggregation can not be done at Mapper phase because aggregation requires sorting of data, and
mapper executes per input split ( a Data Blocks ), so it is not possible in a mapper because it loses
previous input split every time a new instance is taken as input. The data processed by mapper is then
stored in local disk through shuffling and sorting process before the reducer phase. The latency of writing
this data directly to disk and then transferring data across the network is an expensive operation in the
processing of a MapReduce job. Hence there is a necessity to reduce the amount of data that needs to be
sent across the network to reducer whenever possible.

170. What is XGBoost?

Basic ML


171. How do you deploy a model to cloud


Intermediate ML

The workflow can be broken down into following basic steps:

- Training a machine learning model on a local system

- Wrapping the inference logic into a flask application

- Using Docker to containerize the flask application

- Hosting the docker container on an AWS ec2 instance and consuming the web-service
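
A minimal sketch of the Flask wrapping step, assuming a model trained locally and pickled to a hypothetical file called model.pkl:

import pickle
from flask import Flask, request, jsonify

app = Flask(__name__)
with open("model.pkl", "rb") as f:   # hypothetical serialized model
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    # expects JSON like {"features": [[5.1, 3.5, 1.4, 0.2]]}
    features = request.get_json()["features"]
    prediction = model.predict(features).tolist()
    return jsonify({"prediction": prediction})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)

The Docker and EC2 steps then package and host this application as a web service.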

172. How will you make models out of the tweets for the pharma company

Advanced ML


173. Make 4 segments (product category, competitors etc) and identify which medicine a
doctor is likely to recommend

Intermediate ML


174. Working of ensemble methods such as bagging, boosting, random forest.

Intermediate ML


175. What is clustering and KNN?

Basic ML

k-Means Clustering is an unsupervised learning algorithm that is used for clustering whereas KNN is a
supervised learning algorithm used for classification.
The “k” in k-means denotes the number of clusters you want to have in the end. If k = 5, you will have 5
clusters on the data set. “k” in K-Nearest Neighbors is the number of neighbours it checks. It is
supervised because you are trying to classify a point based on the known classification of other points.

176. What is bagging and boosting?

Basic ML

Bagging and Boosting decrease the variance of your single estimate as they combine several estimates
from different models. So the result may be a model with higher stability.

Bagging is used when the goal is to reduce the variance of a decision tree classifier. Here the objective is
to create several subsets of data from the training sample chosen randomly with replacement. Each
collection of subset data is used to train their decision trees. As a result, we get an ensemble of different
models. The average of all the predictions from the different trees is used, which is more robust than a single decision tree classifier.

Boosting is used to create a collection of predictors. In this technique, learners are learned sequentially
with early learners fitting simple models to the data and then analysing data for errors. Consecutive trees
(random sample) are fit and at every step, the goal is to improve the accuracy from the prior tree. When
an input is misclassified by a hypothesis, its weight is increased so that next hypothesis is more likely to
classify it correctly. This process converts weak learners into a better performing model

177. What is ADA boosting?

Basic ML

AdaBoost is an ensemble classifier: it combines multiple weak classifiers to form a strong classifier. A single weak learner may classify the objects poorly, but if we combine multiple classifiers, re-selecting the training set at every iteration and assigning the right amount of weight in the final voting, we can achieve a good accuracy score for the overall classifier.

178. Explain Gradient boosting and Extreme Gradient Boosting?

Basic ML

XGBoost stands for Extreme Gradient Boosting; it is a specific implementation of the Gradient Boosting
method which uses more accurate approximations to find the best tree model. It employs a number of
nifty tricks that make it exceptionally successful, particularly with structured data.
The most important are:

1) computing second-order gradients, i.e. second partial derivatives of the loss function (similar to
Newton’s method), which provides more information about the direction of gradients and how to get to
the minimum of our loss function. While regular gradient boosting uses the loss function of our base
model (e.g. decision tree) as a proxy for minimizing the error of the overall model, XGBoost uses the 2nd
order derivative as an approximation.

2) And advanced regularization (L1 & L2), which improves model generalization.

XGBoost has additional advantages: training is very fast and can be parallelized/distributed across
clusters.
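
A brief sketch using the xgboost package's scikit-learn-style wrapper on synthetic data (the hyperparameter values here are arbitrary):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier   # requires the xgboost package

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# reg_alpha / reg_lambda are the L1 / L2 regularization terms mentioned above
model = XGBClassifier(n_estimators=200, learning_rate=0.1, max_depth=3,
                      reg_alpha=0.1, reg_lambda=1.0)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))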

179. What is Bootstrap sampling?

Basic ML

Bootstrap Sampling is a method that involves drawing of sample data repeatedly with replacement from
a data source to estimate a population parameter.

180. What to be done on the dataset if the assumptions are not met?

Intermediate ML

1. If you create a scatter plot of values for x and y and see that there is not a linear relationship between
the two variables, then one can do the following:

- Apply a nonlinear transformation to the independent and/or dependent variable. e.g. log, square root,
or reciprocal of the independent and/or dependent variable

- Add another independent variable to the model.

2. If residuals are not independent then one can do the following:

- For positive serial correlation, consider adding lags of the dependent and/or independent variable to
the model.

- For negative serial correlation, check to make sure that none of your variables is overdifferenced.

- For seasonal correlation, consider adding seasonal dummy variables to the model

3. If Residuals do not have constant variance, then one can do the following:

- Transform the dependent variable

- Use weighted regression


4. If Residuals are not normally distributed, then one can do the following:

- First, verify that any outliers aren’t having a huge impact on the distribution. If there are outliers
present, make sure that they are real values and that they aren’t data entry errors

- Next, you can apply a nonlinear transformation to the independent and/or dependent variable. e.g. log,
square root, or the reciprocal of the independent and/or dependent variable

181. How to apply ML Algorithms in Mfg/Production Environment ?

Intermediate ML

1. Specify Performance Requirements (This may be as accurate or false positives or whatever metrics are
important to the business)

2. Separate Prediction Algorithm From Model Coefficients

2a. Select or Implement The Prediction Algorithm

2b. Serialize Your Model Coefficients

3. Develop Automated Tests For Your Model

4. Develop Back-Testing and Now-Testing Infrastructure

5. Challenge Then Trial Model Updates (For example, perhaps you set up a grid or random search of
model hyperparameters that runs every night and spits out new candidate models)

182. Difference between Classification and Linear Regression?

Basic ML

1. Fundamentally, classification is about predicting a label and regression is about predicting a quantity.

i.e. Classification is the task of predicting a discrete class label while Regression is the task of predicting
a continuous quantity

2. Classification predictions can be evaluated using accuracy, whereas regression predictions cannot.

Regression predictions can be evaluated using root mean squared error, whereas classification
predictions cannot.

3. A regression algorithm can predict a discrete value which is in the form of an integer quantity

A classification algorithm can predict a continuous value if it is in the form of a class label probability
183. Which model to use to check whether a patient is diabetic or not?

Basic ML

Classification algorithm such as Logistic regression, Random forest etc

184. Explain missing values and outlier treatment

Basic ML

185. What is logistic regression? The output for logistic regression?

Basic ML

a. Logistic regression models the probabilities for classification problems with two possible outcomes. It's
an extension of the linear regression model for classification problems.

b. Log likelihood – This is the log likelihood of the final model

c. Number of obs – This is the number of observations that were used in the analysis

d. LR chi2(3) – This is the likelihood ratio (LR) chi-square test. The number in the parenthesis indicates
the number of degrees of freedom

e. Prob > chi2 – This is the probability of obtaining the chi-square statistic given that the null hypothesis is true. In this example, the model is statistically significant because the p-value (reported as 0.000) is below the usual 0.05 threshold.

f. Pseudo R2 – This is the pseudo R-squared.

186. What is Ensemble techniques and it's working? some models?

Basic ML

A group of weak learners coming together to form a strong learner, thus increasing the accuracy of any
Machine Learning model is called an ensemble model

Simple ensemble techniques: hard voting classifier, averaging, weighted averaging.

Advanced ensemble techniques: stacking, bagging and pasting (e.g. Random Forest), and boosting (AdaBoost, XGBoost, etc.).
187. What is Decision tree and Random forest?

Basic ML

- A decision tree is a supervised machine learning algorithm that can be used for both classification and
regression problems. A decision tree is simply a series of sequential decisions made to reach a specific
result

- Random Forest is a tree-based machine learning algorithm that leverages the power of multiple
(randomly created) decision trees for making decisions. i.e. The Random Forest Algorithm combines the
output of multiple (randomly created) Decision Trees to generate the final output.

- Random Forest is suitable for situations when we have a large dataset, and interpretability is not a major
concern. Decision trees are much easier to interpret and understand. Since a random forest combines
multiple decision trees, it becomes more difficult to interpret.

- The decision tree model gives high importance to a particular set of features. But the random forest
chooses features randomly during the training process.

188. How to deal with underfitting and overfitting

Basic ML

Handling Overfitting:

Cross-validation

This is done by splitting your dataset into ‘test’ data and ‘train’ data. Build the model using the ‘train’ set.
The ‘test’ set is used for in-time validation.

Regularization

This is a form of regression, that regularizes or shrinks the coefficient estimates towards zero. This
technique discourages learning a more complex model

Early stopping

When training a learner with an iterative method, you stop the training process before the final
iteration. This prevents the model from memorizing the dataset.

Pruning

This technique applies to decision trees.

Pre-pruning: Stop ‘growing’ the tree earlier before it perfectly classifies the training set.

Post-pruning: Allows the tree to ‘grow’ and perfectly classify the training set, and then prunes the tree afterwards.

Dropout

This is a technique where randomly selected neurons are ignored during training.

Regularize the weights

Handling Underfitting:

Get more training data

Increase the size or number of parameters in the model

Increase the complexity of the model

Increasing the training time, until cost function is minimised

189. What is bias variance tradeoff

Basic ML

The goal of any supervised machine learning algorithm is to achieve low bias(the difference between the
average prediction of our model and the correct value which we are trying to predict) and low
variance(variability of model prediction for a given data point or a value which tells us spread of our data).

If our model is too simple and has very few parameters then it may have high bias and low variance. On
the other hand, if our model has a large number of parameters then it’s going to have high variance and
low bias.

Increasing the bias will decrease the variance. Increasing the variance will decrease bias.

So we need to find the right/good balance without overfitting and underfitting the data.

This tradeoff in complexity is why there is a tradeoff between bias and variance.

190. How will you explain machine learning to a 5 year old.

Intermediate ML

Just like a human, a computer can learn from three sources.

One is Observing what others did in similar situations. The other is observing a situation and trying to
come up with the best possible logic on the spot to decide/conclude. The third is learning from previous
mistakes/success. These three methods correspond to three branches of Machine learning, Supervised,
Unsupervised and Reinforcement learning respectively.
- In Supervised Learning, a computer can tell what word in a sentence is the name of a city, given it is
shown example sentences which may or may not contain names of cities and every occurrence of a city
name is tagged in these examples.

- Unsupervised is where we ask the computer to make decisions based on raw data attributes and a set of
measurable quantities. Some examples would include asking a computer to come up with localities in a
dataset where Lat-Long of the house is given. It would use Lat Long to find distances and form localities
of house.

- The third type of learning is Reinforcement Learning. This is a method in which computer starts with
making random decisions, and then learns based on errors it makes and successes it encounters as it
goes. A recent discovery was an algorithm which could play many different arcade games after learning
the correct/wrong moves. These algorithms would start by making a lot of failures in the beginning and
then get better as they go.

191. What do you do in data exploration?

Basic ML

192. You are given a data set on cancer detection. You’ve built a classification model and achieved an accuracy of 96%. Why shouldn’t you be happy with your model performance? What can you do about it?

Intermediate ML

193. You are working on a time series data set. Your manager has asked you to build a high accuracy model. You start with the decision tree algorithm, since you know it works fairly well on all kinds of data. Later, you tried a time series regression model and got higher accuracy than the decision tree model. Can this happen? Why?

Advanced ML
194. You came to know that your model is suffering from low bias and high variance.
Which algorithm should you use to tackle it? Why?

Intermediate ML

195. How is kNN different from kmeans clustering?

Basic ML

196. After analyzing the model, your manager has informed that your regression model is
suffering from multicollinearity. How would you check if he’s true? Without losing any
information, can you still build a better model?

Intermediate ML

197. When is Ridge regression favorable over Lasso regression?

Basic ML

198. While working on a data set, how do you select important variables? Explain your
methods.

Basic ML

199. What is the difference between covariance and correlation?

Intermediate ML

200. Both being tree based algorithm, how is random forest different from Gradient
boosting algorithm (GBM)?
Basic ML

201. You’ve got a data set to work with having p (no. of variables) > n (no. of observations). Why is Ordinary Least Squares (OLS) a bad option to work with? Which techniques would be best to use? Why?

Advanced ML

202. We know that one-hot encoding increases the dimensionality of a data set, but label encoding doesn’t. How?

Intermediate ML

203. You are given a data set consisting of variables having more than 30% missing values?
Let’s say, out of 50 variables, 8 variables have missing values higher than 30%. How will
you deal with them?

Basic ML

204. ‘People who bought this, also bought…’ recommendations seen on Amazon are a result of which algorithm?

Intermediate ML

205. What do you understand by Type I vs Type II error ?

Basic ML

206. You have been asked to evaluate a regression model based on R², adjusted R² and
tolerance. What will be your criteria?
Basic ML

207. Considering the long list of machine learning algorithm, given a data set, how do you
decide which one to use?

Basic ML

208. When does regularization becomes necessary in Machine Learning?

Basic ML

209. What do you understand by Bias Variance trade off?

Basic ML

210. How can you prove that one improvement you've brought to an algorithm is really an
improvement over not doing anything?

Basic ML

211. Explain what resampling methods are and why they are useful. Also explain their
limitations.

Basic ML

- Repeatedly drawing samples from a training set and refitting a model of interest on each sample in
order to obtain additional information about the fitted model

- Example: repeatedly draw different samples from training data, fit a linear regression to each new
sample, and then examine the extent to which the resulting fit differ

- The most common methods are cross-validation and the bootstrap. Cross-validation uses random sampling with no replacement; the bootstrap uses random sampling with replacement.

- Cross-validation is used for evaluating model performance and for model selection (selecting the appropriate level of flexibility).

- The bootstrap is mostly used to quantify the uncertainty associated with a given estimator or statistical learning method.

212. Is it better to have too many false positives, or too many false negatives? Explain.

Basic ML

False positives and false negatives are two problems we have to deal with while evaluating a model.

In medical, a false positive can lead to unnecessary treatment and a false negative can lead to a false
diagnostic, which is very serious since the disease has been ignored.

However, we can minimize the errors by collecting more information, considering other variables,
adjusting the sensitivity (true positive rate) and specificity (true negative rate) of the test, or conducting
the test multiple times.

Even so, it is still hard, since reducing one type of error means increasing the other type of error.
Sometimes one type of error is preferable to the other, so data scientists have to evaluate the
consequences of each error and make a decision.

213. What is selection bias, why is it important and how can you avoid it

Basic ML

Selection bias occurs if a data set's examples are chosen in a way that is not reflective of their real-world
distribution.

How to avoid selection biases

Mechanisms for avoiding selection biases include:

- Using random methods when selecting subgroups from populations.

- Ensuring that the subgroups selected are equivalent to the population at large in terms of their key
characteristics (this method is less of a protection than the first since typically the key characteristics are
not known).
214. Differentiate between univariate, bivariate and multivariate analysis.

Basic ML

Univariate statistics summarize only one variable at a time.

Bivariate statistics compare two variables.

Multivariate statistics compare more than two variables.

215. What is the difference between Cluster and Systematic Sampling?

Basic ML

Systematic sampling and cluster sampling are both statistical measures used by researchers, analysts,
and marketers to study samples of a population.

Systematic sampling involves selecting members from the larger population at a fixed interval to create the sample.

Cluster sampling divides the population into groups (clusters) and then randomly selects entire clusters to form the sample.

216. Can you cite some examples where both false positive and false negatives are equally
important?

Intermediate ML

Let us take an example of a medical field where:

A false positive = person is considered as sick but actually is healthy

A false negative = person is considered as healthy but is actually sick

What does it mean?

False-positive cases lead to overspending due to unnecessary care and damaging the health of an
otherwise healthy person due to unnecessary side effects of the therapy.

A false negative case means that your patients get sicker or die.

In this case, both false positive and false negatives are equally important since it concerns a person’s life

217. Explain Lasso regression


Basic ML

Lasso regression is a type of linear regression that uses shrinkage. Shrinkage is where data values are
shrunk towards a central point, like the mean. The lasso procedure encourages simple, sparse models (i.e.
models with fewer parameters)

Lasso regression performs L1 regularization, which adds a penalty equal to the absolute value of the
magnitude of the coefficients. This type of regularization can result in sparse models with few coefficients;
some coefficients can become zero and be eliminated from the model. Larger penalties result in coefficient
values closer to zero, which is ideal for producing simpler models.
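
As a quick, hedged illustration with scikit-learn (the synthetic data set and the alpha value below are arbitrary choices for demonstration only):

from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# only 3 of the 10 features are actually informative
X, y = make_regression(n_samples=100, n_features=10, n_informative=3, noise=0.1, random_state=0)

model = Lasso(alpha=1.0).fit(X, y)   # alpha controls the strength of the L1 penalty
print(model.coef_)                   # coefficients of uninformative features are typically shrunk to exactly 0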

218. Explain Gradient Descent Algorithm

Intermediate ML

Gradient descent is an optimization algorithm that's used when training a machine learning model.

It's based on a convex function and tweaks its parameters iteratively to minimize a given cost function to
its local minimum.

You start by defining the initial parameter values, and from there gradient descent uses calculus to
iteratively adjust those values so that they minimize the given cost function (a gradient measures how
much the output of a function changes if you change the inputs a little bit).
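
A tiny illustrative sketch, minimizing the toy cost function f(w) = (w - 3)^2 (chosen only for demonstration):

def gradient_descent(lr=0.1, n_iters=100):
    w = 0.0                      # initial parameter value
    for _ in range(n_iters):
        grad = 2 * (w - 3)       # derivative of (w - 3)**2 with respect to w
        w -= lr * grad           # step against the gradient
    return w

print(gradient_descent())        # converges towards 3.0, the minimum of the cost function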

219. How is machine learning deployed in real-world scenarios?

Advanced ML

Typically, models run as Python jobs on AWS or Azure instances, either on manual schedules or triggered
automatically, for example when new data arrives. A suite of services usually constitutes the deployment
environment for such models.

Storage - the model needs to be stored somewhere (as a pickle, joblib, or framework-specific model object),
e.g. S3 on AWS or Blob Storage on Azure.

Computing instance - a compute environment that contains Python and can communicate with every
platform that is relevant to the deployment context.

Job scheduler - DevOps is the norm now: automated pipelines that procure data, process it, and
load/retrain/predict with the packaged model.

Final layer - either BI tools like Tableau or QlikView, SQL/NoSQL databases, or Excel reports.
220. What is cosine similarity?

Intermediate ML

Cosine similarity is a metric used to measure how similar the documents are irrespective of their size.
Mathematically, it measures the cosine of the angle between two vectors projected in a multi-dimensional
space. The cosine similarity is advantageous because even if the two similar documents are far apart by
the Euclidean distance (due to the size of the document), chances are they may still be oriented closer
together. The smaller the angle, the higher the cosine similarity.
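
A minimal sketch with NumPy (the example vectors are arbitrary):

import numpy as np

def cosine_similarity(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity([1, 2, 3], [2, 4, 6]))   # 1.0 - same orientation, different magnitude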

221. How do you implement a program in TensorFlow?

Intermediate ML

The usual workflow of running a program in TensorFlow is as follows:

Build a computational graph, this can be any mathematical operation TensorFlow supports.

Initialize variables, to compile the variables defined previously

Create a session, this is where the magic starts!

Run graph in session, the compiled graph is passed to the session, which starts its execution.

Close session, shut down the session.
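
A minimal sketch of this graph/session workflow, assuming TensorFlow 2.x with the tf.compat.v1 compatibility module (in TensorFlow 1.x the same calls live directly under tf):

import tensorflow as tf

tf.compat.v1.disable_eager_execution()

# 1. build a computational graph
a = tf.constant(2.0)
b = tf.constant(3.0)
c = a * b

# 2./3. create a session (variables, if any, would be initialized here)
with tf.compat.v1.Session() as sess:
    # 4. run the graph in the session
    print(sess.run(c))           # 6.0
# 5. the session is closed automatically by the context manager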

222. What is part of speech (POS) tagging? What is the simplest approach to building a
POS tagger that you can imagine?

Basic NLP

POS tagging is the process of marking up each word in a corpus with a corresponding part-of-speech tag,
based on its context and definition. The most common approach is the lexicon-based approach, which
uses a lexicon to assign a tag to each word. The lexicon is constructed from a gold-standard annotated
corpus, where each word type is coupled with its most frequent associated tag in the gold-standard
corpus.

223. How would you build a part of speech (POS) tagger from scratch given a corpus of
annotated sentences? How would you deal with unknown words?
Basic NLP

First, we will create features from words (like last 2,3 letters, the previous word, next word, etc.). Then we
will train a classifier to find the POS tag. HMM, CRF and RNNs can be used to train the model. Unknown
words can also be predicted by generating the features (position of the word, suffix, etc) from them.

224. How would you train a model that identifies whether the word “Apple” in a sentence
belongs to the fruit or the company?

Basic NLP

This particular task is known as NER (Named Entity Recognition) tagging. HMM, CRF and RNNs can be
used to train a model for NER

225. How would you find all the occurrences of quoted text in a news article?

Basic NLP

Train a classifier model to look at the constituent parts of a news article and assign a probability that,
taken together, they compose valid quoted text. (A simpler baseline is a regular expression that captures
text between quotation marks.)

226. How would you build a system that auto-corrects text that has been generated by a
speech recognition system?

Basic NLP

It can be done in multiple ways, but the simplest way would be to take the unknown words and compare
them with similar words from our dictionary. Distances can be calculated using algorithms like
Levenshtein and if the result is satisfactory, the words can be exchanged

227. Which are some popular models other than word2vec?

Basic NLP

Some popular models other than word2vec are GloVe, Adagram, FastText, etc
228. What is latent semantic indexing and where can it be applied?

Basic NLP

Latent semantic indexing (LSI) is a concept used by search engines to discover how a term and content
work together to mean the same thing, even if they do not share keywords or synonyms. Search engines
use LSI to judge the quality of the content on a page by checking for words that should appear alongside
a given search term or keyword

229. Explain some metrics to test out a Named Entity recognition model.

Basic NLP

When you train an NER system, the most typical evaluation method is to measure precision, recall, and
F1-score, and to inspect the confusion matrix, at the token level.

230. List out some popular Python libraries that are used for NLP.

Basic NLP

Some popular libraries for NLP are, NLTK, Gensim, spaCy, TextBlob, etc.

231. What are some popular applications of NLP?

Basic NLP

Some popular applications are Text summarization, Machine translation, Sentiment Analysis, chatbots,
etc.

232. What is the difference between search function and match function?

Basic NLP

The re.search() method finds a pattern anywhere in the string and returns a match object, whereas the
re.match() method finds a pattern only at the beginning of the string and returns a match object.
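
For example:

import re

text = "data science"
print(re.match(r"science", text))    # None - "science" is not at the start of the string
print(re.search(r"science", text))   # match object, found at position 5
print(re.match(r"data", text))       # match object - "data" is at the beginning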
233. What is tokenization, chinking, chunking?

Basic NLP

Tokenization is a way of separating a piece of text into smaller units called tokens. Here, tokens can be
either word, characters, or subwords. Chunking means a grouping of words/tokens into chunks. Chunking
can break sentences into phrases that are more useful than individual words and yield meaningful results.
Chinking is a lot like chunking, it is basically a way for you to remove a chunk from a chunk.

234. What is the skip-gram model?

Basic NLP

Skip-gram is an unsupervised algorithm to find word embeddings. It tries to predict the source context
words (surrounding words) given a target word (the center word)

235. What is a CBOW model?

Basic NLP

CBOW is an unsupervised algorithm to find word embeddings. It tries to predict the target word (the
center word) given the source context words (surrounding words).

236. How can you create your own word embeddings?

Basic NLP

You can use gensim library to implement word2vec model, you can train the word2vec model on your
text corpus and then generate word embeddings.
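
A minimal sketch, assuming gensim 4.x (older gensim versions use size instead of vector_size); the toy sentences are made up for illustration:

from gensim.models import Word2Vec

sentences = [["data", "science", "is", "fun"],
             ["machine", "learning", "is", "fun"]]

model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)  # sg=1 -> skip-gram
vector = model.wv["data"]                 # the learned embedding for the word "data"
print(model.wv.most_similar("fun"))       # words closest to "fun" in the embedding space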

237. What is the difference between stemming and lemmatization?

Basic NLP

Stemming and lemmatization, both are used to derive root (base) word from their inflected form. A stem
might not be an actual word whereas a lemma will be an actual word.
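
A small illustration with NLTK (assumes the wordnet corpus has been downloaded via nltk.download):

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("studies"))          # studi - a stem, not an actual word
print(lemmatizer.lemmatize("studies"))  # study - a lemma, an actual word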
238. How would you build a system to translate English text to Greek and vice-versa?

Basic NLP

One can use Neural Machine Translation to translate English text to Greek and vice-versa. A sequence
to sequence model can be created using RNNs.

239. How would you build a system that automatically groups news articles by subject?

Basic NLP

There can be different ways to do this task, if you have annotated data, you can train a classifier model
to classify different articles

240. What are stop words? Describe an application in which stop words should be
removed.

Basic NLP

Stop words are frequently used words that do not add much meaning to a sentence or do not help in
prediction. We typically need to remove stop words while performing sentiment analysis.

241. How would you design a model to predict whether a movie review was positive or
negative?

Basic NLP

We will need to perform sentiment analysis on the reviews, It can be done in multiple ways, one simple
way to do this is by training a classifier using ML algorithms or RNNs (LSTM or GRU).

242. What is entropy? How would you estimate the entropy of the English language?

Basic NLP

Entropy is a measure of randomness in the information. One possible way of calculating the entropy of
English uses N-grams. One can statistically calculate the entropy of the next letter when the previous N -
1 letters are known.
243. What is the TF-IDF score of a word and in what context is this useful?

Basic NLP

TF-IDF is a statistical measure that evaluates how relevant a word is to a document in a collection of
documents. This is done by multiplying two metrics: how many times a word appears in a document, and
the inverse document frequency of the word across a set of documents. TF-IDF is used to convert text
corpus into a matrix on which Machine learning algorithms can be implemented
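
A minimal sketch with scikit-learn's TfidfVectorizer (the toy documents are made up; get_feature_names_out assumes scikit-learn 1.0+):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat", "the dog sat on the log"]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)          # sparse document-term matrix of TF-IDF scores

print(vectorizer.get_feature_names_out())   # vocabulary learned from the corpus
print(X.toarray())                          # each row is a document, each column a term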

244. What is dependency parsing?

Basic NLP

Dependency parsing is the process of analyzing the grammatical structure of a sentence based on the
dependencies between the words in that sentence.

245. What are the difficulties in building and using an annotated corpus of text such as
the Brown Corpus and what can be done to mitigate them?

Basic NLP

246. What tools for training NLP models (NLTK, Apache OpenNLP, GATE, MALLET etc…)
have you used?

Basic NLP

To train NLP models, I have used NLTK, Gensim, Spacy and a few others

247. Are you familiar with WordNet or other related linguistic resources?

Basic NLP

WordNet is the lexical database i.e. dictionary for the English language, specifically designed for NLP.
Synset is a special kind of a simple interface that is present in NLTK to look up words in WordNet.
248. Problems faced in NLP and how you tackled them?

Basic NLP

Most of the challenges I faced in NLP are due to data complexity, characteristics such as sparsity,
diversity, dimensionality, etc., and the dynamic nature of the datasets. With a special focus on
addressing NLP challenges, one can build accelerators and robust, scalable, domain-specific knowledge
bases and dictionaries that bridge the gap between user vocabulary and domain nomenclature.

249. What are some of the common problems using fixed window neural models?

Advanced NLP

The main problem faced while using a fixed window neural model is that the window size can be too small
for long sentences, making the model unable to capture the complete context.

250. What are some common examples of sequential data?

Advanced NLP

Some common examples of sequential data are text corpus, DNA sequence, and time-series data

251. What are some problems with N-gram language models?

Advanced NLP

An issue when using n-gram language models is out-of-vocabulary (OOV) words. They are encountered in
computational linguistics and natural language processing when the input includes words which were not
present in a system's dictionary or database during its preparation.

252. What are some limitations of RNNs?

Advanced NLP

RNNs are prone to the exploding and vanishing gradient problems. RNNs also fail to keep track of long-term
dependencies.
253. What are Vanishing gradient problems?

Advanced NLP

As more layers using certain activation functions are added to neural networks, the gradients of the loss
function approach zero, making the network hard to train.

254. What is exploding gradients in RNN?

Advanced NLP

Exploding gradients are a problem where large error gradients accumulate and result in very large
updates to neural network model weights during training.

255. Can you give me an example of many-to-one architecture in sequence models?

Advanced NLP

An example of a many-to-one architecture in sequence models would be sentiment analysis, where the
inputs are words and the output is a sentiment.

256. What activation layer is used in the hidden units of an RNN?

Advanced NLP

The tanh activation function is used in the hidden units of an RNN.

257. What is the use of the Forget Gate in LSTMs?

Advanced NLP

In LSTM, the forget gate controls the extent to which a value remains in the cell

258. Why is there a specific need for an architecture like GRU or LSTM?
Advanced NLP

RNNs suffer from short-term memory. If a sequence is long enough, they will have a hard time carrying
the information from the earlier timesteps to the later ones. This is called the vanishing gradient problem.
To solve this issue, GRUs and LSTMs are used.

259. What problems of RNNs do LSTMs address?

Advanced NLP

RNNs suffer from short-term memory. If a sequence is long enough, they will have a hard time carrying
the information from the earlier timesteps to the later ones. This is called the vanishing gradient problem.
To solve this issue, GRUs and LSTMs are used.

260. What is the primary difference between an LSTM and GRU?

Advanced NLP

The main difference between GRU and LSTM is that a GRU has 2 gates whereas an LSTM has 3 gates, so a
GRU is faster than an LSTM. However, LSTMs generally perform better at remembering longer sequences
than GRUs.

261. What kind of datasets are RNNs known best to work on?

Advanced NLP

RNNs are good at making predictions when the data is sequential.


262. What are the different possible architectures in RNNs and give examples of the
same?

Advanced NLP

Different possible architectures for RNN are the following:

1. One-to-Many: ex. Auto-Image captioning


2. Many-to-Many: ex. Neural Machine Translation
3. Many-to-one: ex. Sentiment Analysis

263. What are some of the ways to address the exploding gradients problem in RNNs?

Advanced NLP

Some of the ways to address the exploding gradient problem are:

1. Gradient clipping: limit the size of gradients during the training of your network.
2. Weight regularization: apply a penalty to the network's loss function for large weight values.
3. Using LSTM or GRU units.

264. Explain encoder-decoder architecture?

Advanced NLP

An encoder-decoder architecture was developed in which an input sequence is read in its entirety and
encoded into a fixed-length internal representation. A decoder network then uses this internal
representation to output words. This architecture is generally used in machine translation.

265. What are the drawbacks of attention mechanisms?

Advanced NLP

The main disadvantage of the attention mechanism is that it adds more weights to train, thus increasing
the training time of the model.

266. What is BERT? What are the applications of it?


Advanced NLP

BERT stands for Bidirectional Encoder Representations from Transformers. BERT is pre-trained on a large
corpus of unlabelled text. It is bidirectional meaning it learns information from both the left and the right
side of a token’s context during the training phase. BERT is used for text summarization, knowledge
extraction, chatbots etc.

267. What is XLNet?

Advanced NLP

XLNet is an auto-regressive language model which outputs the joint probability of a sequence of tokens
based on the transformer architecture with recurrence.

268. What are the Transformers?

Advanced NLP

The Transformer in NLP is a novel architecture that aims to solve sequence-to-sequence tasks while
handling long-range dependencies with ease.

269. What is the time complexity of LSTM?

Advanced NLP

270. Why do we need attention mechanisms?

Advanced NLP

The standard seq2seq model is generally unable to accurately process long input sequences, the
attention mechanism allows the model to focus and place more “Attention” on the relevant parts of the
input sequence as needed.

271. What are the different types of attention mechanisms?


Advanced NLP

There are 2 different types of attention mechanism

1. Bahdanau Attention
2. Luong Attention

272. What are the advantages of BERT?

Advanced NLP

Since the BERT model is deeply bidirectional, it is able to generate more accurate word representations.
Since BERT uses transformers, it allows parallelization and is thus faster to train on large datasets.

273. What information is stored in the hidden and cell state of an LSTM?

Advanced NLP

The cell state ( also called long-term memory) contains the information from the past. Hidden State (also
called working memory) contains the information from the current state that needs to be taken to the
next state

274. Why is the transformer better than LSTMs?

Advanced NLP

Transformers are better than the other architectures because they avoid recurrence entirely, processing
sentences as a whole and learning relationships between words using multi-head attention mechanisms
and positional embeddings.
275. What are the differences between BERT and ALBERT v2?

Advanced NLP

BERT is an expensive model in terms of memory and time consumed on computations, even with GPU.
ALBERT v2 is lighter and faster than BERT. Cross-layer parameter sharing is the most significant change
in BERT architecture that created ALBERT.

276. What are the different variants of BERT?

Advanced NLP

There are 2 different variants of BERT

1. BERT Base: 12 layers (transformer blocks), 12 attention heads, and 110 million parameters
2. BERT Large: 24 layers (transformer blocks), 16 attention heads and, 340 million parameters

277. What is the state of the art model currently in NLP?

Advanced NLP

Following are the state of the art model currently in NLP

1. BERT
2. GPT-3
3. XLNet

278. What are the most challenging NLP problems that researchers/industries are
working on currently?

Advanced NLP

Following are the challenges faced currently in NLP

1. Extraction of meaning from a variety of complex, multi-format documents.


2. Support for multiple languages
3. Integration of pre-existing, text-based knowledge
279. What are built-in functions in Python?

Basic Python

Hint?

280. Differentiate between Call by value and Call by reference

Basic Python

Hint?

281. How do you read a file (without using Pandas)?

Intermediate Python

Hint?

282. What is NaN in python?

Basic Python

Hint?

283. What is the use of ID() function in python?

Basic Python

Hint?

284. How will you import multiple excel sheets in a data frame?

Basic Python

Hint?
285. What are the different types of data types?

Basic Python

Hint?

286. Difference between lists/ tuples/ dictionaries?

Basic Python

Hint?

287. How would you check whether a number is prime or not using Python?

Basic Python

# taking input from user
number = int(input("Enter any number: "))

# prime number is always greater than 1
if number > 1:
    for i in range(2, number):
        if (number % i) == 0:
            print(number, "is not a prime number")
            break
    else:
        print(number, "is a prime number")
# if the entered number is less than or equal to 1
# then it is not a prime number
else:
    print(number, "is not a prime number")

288. How would you check whether a number is an Armstrong number using Python?


Basic Python

# Python program to check if the number is an Armstrong number or not
# (this classic version cubes each digit, so it is valid for 3-digit numbers)

# take input from the user
num = int(input("Enter a number: "))

# initialize sum
sum = 0

# find the sum of the cube of each digit
temp = num
while temp > 0:
    digit = temp % 10
    sum += digit ** 3
    temp //= 10

# display the result
if num == sum:
    print(num, "is an Armstrong number")
else:
    print(num, "is not an Armstrong number")

289. What is an Append Function?

Basic Python

The append() method adds an item to the end of the list.

The syntax of the append() method is:

list.append(item)

290. What is the Beautiful Soup library used for?

Basic Python

Hint?
291. Which function is most useful to convert a multidimensional array into a one-dimensional
array?

Basic Python

Hint?

292. Python or R – Which one would you prefer for text analytics?

Intermediate Python

293. What is the lambda function in Python?

Intermediate Python

In Python, anonymous functions are defined using the lambda keyword

Syntax of Lambda Function in python

lambda arguments: expression
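
For example:

square = lambda x: x ** 2
print(square(4))                                               # 16

# lambdas are often used inline, e.g. as a sort key
print(sorted([(1, "b"), (2, "a")], key=lambda pair: pair[1]))  # [(2, 'a'), (1, 'b')]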

294. How are negative indices used in Python?

Intermediate Python

Python programming language supports negative indexing of arrays, something which is not available in
arrays in most other programming languages. This means that the index value of -1 gives the last element,
and -2 gives the second last element of an array. The negative indexing starts from where the array ends.
This means that the last element of the array is the first element in the negative indexing which is -1.
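
For example:

nums = [10, 20, 30, 40]
print(nums[-1])   # 40 - the last element
print(nums[-2])   # 30 - the second last element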

295. How is the Python series different from a single column dataframe?

Intermediate Python

Python series is the data structure for a single column of a DataFrame, not only conceptually, but
literally, i.e. the data in a DataFrame is actually stored in memory as a collection of Series
Series is a one-dimensional object that can hold any data type such as integers, floats and strings and it
does not have any name/header whereas the dataframe has column names.

296. Which libraries in SciPy have you worked with in your project?

Intermediate Python

SciPy contains modules for optimization, linear algebra, integration, interpolation, special functions, FFT,
signal and image processing, ODE solvers etc

Subpackages include:

scipy.cluster

scipy.constants

scipy.fftpack

scipy.integrate

scipy.interpolate

scipy.linalg

scipy.io

scipy.ndimage

scipy.odr

scipy.optimize

scipy.signal

scipy.sparse

scipy.spatial

scipy.special

scipy.stats

scipy.weave

297. How does the groupby function work in Python?

Intermediate Python
Pandas dataframe.groupby() function is used to split the data into groups based on some criteria. pandas
objects can be split on any of their axes.

Syntax: DataFrame.groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True,


squeeze=False, **kwargs)

Parameters :

by: mapping, function, str, or iterable

axis: int, default 0

level: If the axis is a MultiIndex (hierarchical), group by a particular level or levels

as_index: For aggregated output, return object with group labels as the index. Only relevant for
DataFrame input. as_index=False is effectively “SQL-style” grouped output

sort: Sort group keys. Get better performance by turning this off. Note this does not influence the order
of observations within each group. groupby preserves the order of rows within each group.

group_keys: When calling apply, add group keys to index to identify pieces

squeeze: Reduce the dimensionality of the return type if possible, otherwise return a consistent type

Returns: GroupBy object
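
A small illustrative example (the column names are made up):

import pandas as pd

df = pd.DataFrame({"team": ["A", "A", "B"], "score": [10, 20, 30]})
print(df.groupby("team")["score"].mean())   # mean score per team: A -> 15.0, B -> 30.0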

298. What does [::-1] do in python?

Intermediate Python

[::] just produces a copy of all the elements in order

[::-1] produces a copy of all the elements in reverse order

299. What are python packages?

Basic Python

Packages are namespaces which contain multiple packages and modules themselves. They are simply
directories.

Each package in Python is a directory which MUST contain a special file called __init__.py. This file can be
empty, and it indicates that the directory it contains is a Python package, so it can be imported the same
way a module can be imported.
If we create a directory called foo, which marks the package name, we can then create a module inside
that package called bar. We also must not forget to add the __init__.py file inside the foo directory.

300. How do you check missing values in a dataframe using python?

Intermediate Python

The pandas isnull() function detects missing values in the given object. It returns a boolean object of the
same size indicating whether the values are NA. Missing values get mapped to True and non-missing values
get mapped to False.
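
For example:

import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, np.nan, 3], "b": [4, 5, np.nan]})
print(df.isnull())          # boolean mask of missing values
print(df.isnull().sum())    # number of missing values per column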

301. How do you get the frequency of a categorical column of a dataframe using python?

Basic Python

Using Series.value_counts()

302. Can you write a function using python to impute outliers?

Basic Python

import numpy as np

def removeOutliers(x, outlierConstant):
    a = np.array(x)
    upper_quartile = np.percentile(a, 75)
    lower_quartile = np.percentile(a, 25)
    IQR = (upper_quartile - lower_quartile) * outlierConstant
    quartileSet = (lower_quartile - IQR, upper_quartile + IQR)
    resultList = []
    for y in a.tolist():
        if y >= quartileSet[0] and y <= quartileSet[1]:
            resultList.append(y)
    return resultList
303. How can we convert a python series object into a dataframe?

Basic Python

Series.to_frame(name=None)

304. How can you change the index of a dataframe in python?

Basic Python

DataFrame.set_index(keys, drop=True, append=False, inplace=False, verify_integrity=False)

keys: label or array-like or list of labels/arrays

This parameter can be either a single column key, a single array of the same length as the calling
DataFrame, or a list containing an arbitrary combination of column keys and arrays. Here, “array”
encompasses Series, Index, np.ndarray, and instances of Iterator.

305. Is Python case sensitive?

Basic Python

Yes

306. What all ways have you used to convert categorical columns into numerical data
using python?

Intermediate Python

One of the most used and popular ones are LabelEncoder and OneHotEncoder.

Both are provided as parts of sklearn library.

LabelEncoder can be used to transform categorical data into integers:

from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()

x = ['Apple', 'Orange', 'Apple', 'Pear']

y = label_encoder.fit_transform(x)
print(y)

array([0, 1, 0, 2])

OneHotEncoder can be used to transform categorical data into one hot encoded array:

from sklearn.preprocessing import OneHotEncoder

onehot_encoder = OneHotEncoder(sparse=False)

y = y.reshape(len(y), 1)

onehot_encoded = onehot_encoder.fit_transform(y)

print(onehot_encoded)

307. How is get_dummies() different from OneHotEncoder?

Intermediate Python

OneHotEncoder cannot process string values directly. If your nominal features are strings, then you
need to first map them into integers.

pandas.get_dummies is kind of the opposite. By default, it only converts string columns into one-hot
representation, unless columns are specified.
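
A small illustration (the column names are made up):

import pandas as pd

df = pd.DataFrame({"fruit": ["Apple", "Orange", "Apple"], "price": [10, 20, 15]})
print(pd.get_dummies(df))   # only the string column 'fruit' is one-hot encoded by default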

308. How do you check the distribution of data in python?

Intermediate Python

A simple and commonly used plot to quickly check the distribution of a sample of data is the histogram.

from matplotlib import pyplot

pyplot.hist(data)

309. What is the difference between iloc and loc?

Basic Python

loc gets rows (or columns) with particular labels from the index.

iloc gets rows (or columns) at particular positions in the index (so it only takes integers).
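
For example:

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]}, index=["x", "y", "z"])
print(df.loc["y"])    # select by index label
print(df.iloc[1])     # select by integer position (the same row in this case)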
310. Difference between univariate and bivariate analysis? What all different functions
can be used in python?

Basic Python

Univariate statistics summarize only one variable at a time.

Bivariate statistics compare two variables.

Below are a few functions which can be used in the univariate and bivariate analysis:

1. To find the population proportions with different types of blood disorders.

df.Thal.value_counts()

2. To make a plot of the distribution :

sns.distplot(df.Variable.dropna())

3. Find the minimum, maximum, average, and standard deviation of data.

There is a function called ‘describe’

4. Find the mean of the Variable

df.Variable.dropna().mean()

5. Boxplot to observe outliers

sns.boxplot(x = "", y = "", hue = "", data=df)

6. Correlation plot:

data.corr()

311. What all different methods can be used to standardize the data using python?

Intermediate Python

Min Max Scaler.

Standard Scaler.

Max Abs Scaler.

Robust Scaler.

Quantile Transformer Scaler.


Power Transformer Scaler.

Unit Vector Scaler.

312. What is the apply function in Python? How does it work?

Basic Python

Pandas.apply allows users to pass a function and apply it on every single value of a Pandas series.

Syntax:

s.apply(func, convert_dtype=True, args=())

313. How do you do upsampling of data? Name a python function or explain the code.

Intermediate Python

Up-sampling is the process of randomly duplicating observations from the minority class in order to
reinforce its signal.

There are several heuristics for doing so, but the most common way is to simply resample with
replacement.

Module for resampling in Python:

from sklearn.utils import resample
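
A minimal sketch, assuming a dataframe df with a binary 'label' column where class 1 is the minority (all names here are illustrative):

import pandas as pd
from sklearn.utils import resample

df = pd.DataFrame({"feature": range(10), "label": [0] * 8 + [1] * 2})

majority = df[df.label == 0]
minority = df[df.label == 1]

minority_upsampled = resample(minority,
                              replace=True,              # sample with replacement
                              n_samples=len(majority),   # match the majority class size
                              random_state=42)

balanced = pd.concat([majority, minority_upsampled])
print(balanced.label.value_counts())                     # both classes now have 8 rows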

314. Can you plot 3D plots using matplotlib? Name the function.

Intermediate Python

Yes

Function:

import numpy as np

import matplotlib.pyplot as plt

fig = plt.figure()

ax = plt.axes(projection ='3d')
315. How can you drop a column in python?

Basic Python

DataFrame.drop(labels=None, axis=0, index=None, columns=None, level=None, inplace=False,


errors='raise')

316. What is the use of ‘inplace’ in python functions?

Basic Python

In-place operation is an operation that changes directly the content of a given linear algebra, vector,
matrices(Tensor) with/without making a copy

When inplace = True is used, it performs an operation on data and nothing is returned.

When inplace=False is used, it performs an operation on data and returns a new copy of data.

317. How do you select a sample of dataframe?

Intermediate Python

1. Randomly select a single row: df = df.sample()

2. Randomly select a specified n number of rows: df = df.sample(n=3)

3. Allow a random selection of the same row more than once: df = df.sample(n=3,replace=True)

4. Randomly select a specified fraction of the total number of rows: df = df.sample(frac=0.50)

318. How would you define a block in Python?

Intermediate Python

A block is a group of statements in a program or script. Usually, it consists of at least one statement and
declarations for the block, depending on the programming or scripting language. A language which allows
grouping with blocks is called a block-structured language

319. How will you remove duplicate data from a dataframe?


Intermediate Python

DataFrame.drop_duplicates(subset=None, keep=’first’, inplace=False)

subset: takes a column or a list of column labels. Its default value is None. After passing columns, only
those columns are considered when identifying duplicates.

keep: controls how duplicate values are treated. It has only three distinct values and the default is
'first'.

320. Can you convert a string into an int? When and how?

Basic Python

Python offers the int() function, which takes a string (or number) as an argument and returns an integer.
A string can be converted when it represents a whole number; a string such as "3.7" cannot be converted
directly and raises a ValueError.

But keep this special case in mind:

Passing a float (a number with a fractional part) as an argument returns the integer part, i.e. the float
truncated towards zero (for positive numbers this is the same as rounding down).

321. What does the function zip() do?

Intermediate Python

The zip() function takes iterables (can be zero or more), aggregates them in a tuple, and return it.

The syntax of the zip() function is:

zip(*iterables)
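
For example:

names = ["a", "b", "c"]
scores = [1, 2, 3]
print(list(zip(names, scores)))   # [('a', 1), ('b', 2), ('c', 3)]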

322. How many arguments can the range() function take?

Basic Python

It can take mainly three arguments.

start: integer starting from which the sequence of integers is to be returned

stop: integer before which the sequence of integers is to be returned.

The range of integers ends at stop – 1.


step: integer value which determines the increment between each integer in the sequence

323. What is the difference between list, array and tuple in Python?

Basic Python

List:

A list is an ordered collection of items of any data type.

A list is mutable.

Lists are dynamic and can contain objects of different data types.

List elements can be accessed by index number.

Array:

An array is an ordered collection of elements of the same data type.

An array is mutable.

Array elements can be accessed by index number.

Tuple:

Tuples are immutable and can store any data type.

A tuple is defined using ().

It cannot be changed or replaced since it is an immutable data type.

324. Write a Sorting Algorithm in R?

Intermediate R

There are multiple algorithms for sorting data in the R programming language. The different types of
sorting functions are listed below.

Bubble Sort

Insertion Sort
Selection Sort

Merge Sort

Quick Sort

325. What are the packages used in R for data science?

Basic R

1. Dplyr

2. Ggplot2

3. Shiny

4. Lubridate

5. Knitr

6. Mlr

7. Caret

8. Text2Vec

9. Prophet

10. SnowballC

326. Explain the functions of dplyr package

Basic R

dplyr is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most
common data manipulation challenges:

Below are the functions used in this package:

mutate() adds new variables that are functions of existing variables

select() picks variables based on their names.

filter() picks cases based on their values.

summarise() reduces multiple values down to a single summary.

arrange() changes the ordering of the rows.


327. Explain the syntax of rbind and cbind in R

Intermediate R

cbind() and rbind() both create matrices by combining several vectors of the same length. cbind()
combines vectors as columns, while rbind() combines them as rows

328. What is interaction function?

Basic R

interaction computes a factor which represents the interaction of the given factors. The result of the
interaction is always unordered.

Syntax: interaction(…, drop = FALSE, sep = ".", lex.order = FALSE)

329. What are the different data types/objects in R?

Basic R

Hint?

330. What is a factor variable, and why would you use one?

Basic R

Hint?

331. How do you concatenate strings in R

Basic R

Hint?

332. What is the difference between lapply and sapply?


Basic R

Hint?

333. How many sorting algorithms are available

Basic R

Hint?

334. What is the use of lattice package

Basic R

Hint?

335. What is the use of MANOVA

Basic R

Hint?

336. What is the difference between data frame and a matrix in R?

Basic R

Matrix in R –

It’s a homogeneous collection of data sets which is arranged in a two-dimensional rectangular


organisation. It's an m*n array with a similar data type. It is created using a vector input. It has a fixed
number of rows and columns. You can perform many arithmetic operations on R matrix like – addition,
subtraction, multiplication, and divisions.

DataFrames in R –

It is used for storing data tables. It can contain multiple data types in multiple columns called fields. It is a
list of a vector of equal length. It is a generalized form of a matrix. It is like a table in excel sheets. It has
column and row names. The name of rows is unique with no empty columns. The data stored must be
numeric, character or factor type. DataFrames are heterogeneous.

337. How missing values and impossible values are represented in R language?

Intermediate R

In R, missing values are represented by the symbol NA (not available). Impossible values (e.g., dividing by
zero) are represented by the symbol NaN (not a number)

338. What is the process to create a table in R language without using external files?

Intermediate R

Hint?

339. What is the difference between data frame and a matrix in R

Basic R

Matrix in R –

It’s a homogeneous collection of data sets which is arranged in a two-dimensional rectangular


organisation. It’s an m*n array with a similar data type. It is created using a vector input. It has a fixed
number of rows and columns. You can perform many arithmetic operations on R matrix like – addition,
subtraction, multiplication, and divisions.

DataFrames in R –

It is used for storing data tables. It can contain multiple data types in multiple columns called fields. It is a
list of the vector of equal length. It is a generalized form of a matrix. It is like a table in excel sheets. It has
column and row names. The name of rows is unique with no empty columns. The data stored must be
numeric, character or factor type. DataFrames are heterogeneous.
340. How can you verify if a given object “X” is a matrix data object

Basic R

Hint?

341. What is Rshiny

Basic R

Shiny is an open-source R package that provides an elegant and powerful web framework for building
web applications using R. Shiny helps you turn your analyses into interactive web applications without
requiring HTML, CSS, or JavaScript knowledge.

342. Advantages of R/Python visualization over tableau

Basic R

Few advantages of R/Python are as follows:

R/Python is an open-source portable language supported by a huge standard library.

With R/Python, you can visualise data in a similar way to Tableau, and build interactive visualisations
with many libraries but you have a lot more flexibility.

343. Explain the various benefits of R language?

Basic R

344. What are the differences between the sum function and using “+” operator

Basic SAS

SUM function returns the sum of non-missing arguments whereas “+” operator returns a missing value if
any of the arguments are missing
345. How does PROC SQL work

Intermediate SAS

The SQL query structure does not change even if we use PROC SQL command. For example -

PROC SQL;

SELECT column(s)

FROM table(s) | view(s)

WHERE expression

GROUP BY column(s)

HAVING expression

ORDER BY column(s);

QUIT;

In the above query, the SELECT statement is nothing but a standard SQL SELECT query, but you always
end the PROC SQL block with QUIT;
346. If you are given an unsorted data set, how will you read the last observation to a new
dataset

Intermediate SAS

We can read the last observation to a new data set using end= data set option.

For example:

data work.calculus;

set work.comp end=last;

If last;

run;

Here in the above query, a new dataset calculus is getting created from comp (within work directory).
last is the temporary variable (initialized to 0) which is set to 1 when the set statement reads the last
observation

347. Can you tell the difference between VAR X1 - X3 and VAR X1 -- X3?

Intermediate SAS
348. What is the purpose of trailing @ and @@? How do you use them

Intermediate SAS

The trailing @ is also known as a column pointer. By using the trailing @, in the Input statement gives you
the ability to read a part of your raw data line, test it and then decide how to read additional data from
the same record.

The single trailing @ tells the SAS system to “hold the line”.

The double trailing @@ tells the SAS system to “hold the line more strongly”.

An Input statement ending with @@ instructs the program to release the current raw data line only when
there are no data values left to be read from that line. The @@, therefore, holds the input record even
across multiple iterations of the data step.

349. What is the difference between the Do Index, Do While and the Do Until loop

Intermediate SAS

350. What is the ANYDIGIT function in SAS

Basic SAS

Searches a character string for a digit and returns the first position at which it is found

351. What is interleaving in SAS

Basic SAS

352. What is the difference between RDDs and Dataframe in Spark?

Intermediate Spark

A Data Frame is the tabular representation of data and is equivalent to a table in a relational database
but with better optimization.
RDD is the representation of a set of records, logically partitioned across multiple nodes for parallel
processing.

353. How to do Spark Tuning ( Optimization)?

Intermediate Spark

Spark performance tuning is the process of efficiently utilizing the spark resources such as memory,
cores, instances as per the input data records.

354. What is a stage in Spark and What are the types of stages?

Basic Spark

A Spark stage is nothing but an individual unit of work (a set of tasks) within the entire execution plan.

There are two types of stages:

1. ShuffleMapStage: It is an intermediate stage and produces data for the next stage.
2. ResultStage: Final stage of spark and helps in the computation of result from the action plan.

355. What are shared variables in Spark and what is the use of it?

Basic Spark

Shared variables are nothing but globally referenced variables used across multiple functions and
methods running in parallel.

Spark provides two special types of shared variables

1. Broadcast Variables(Used to cache a value in memory on all nodes)


2. Accumulators (used to implement counters and sums).

356. What is the difference between Batch processing and real time streaming?

Intermediate Spark
Batch processing is the processing of blocks of data that have already been stored over a period of time.
It is used in the scenarios where it is required to process large volumes of data to get more detailed
insights than it is to get fast analytics results. On the other hand, real-time processing as the name
suggests is used for real-time analytics. It is used to process the data as it arrives and gets instant
analytics result.

357. Can you connect SparkSQL to RDBMS? If yes, How?

Intermediate Spark

SparkSQL itself is built of two main components: Dataframe and SQLContext. SQLContext encapsulates
all the relative functionality of spark and provides extended functionality to be able to 'talk' to different
databases which could be SQL or NoSQL DBs. Every DB has its own respective connectors to be
integrated with spark and with the help of such dedicated connectors SQLContext talks to DBs.

358. What are Accumulators in Spark?

Basic Spark

Accumulators are one of the types of shared variables used in spark. It is meant for numeric data
aggregation where the data is stored in the cache and can be accessed throughout the model
functionalities.

359. What is the difference between SQLContext and HiveContext in Spark?

Intermediate Spark

SQLContext is nothing but the gateway to SparkSQL from where the spark can interact with the
databases. HiveContext is the superset of SQLContext which inherits all the property of SQLContext for
DB interactions with addition of HiveContext properties to connect with Hive and HBase.

360. Explain the project you did using Spark?

Intermediate Spark
Spark is basically used where plain Python is not capable of solving the problem. I used Spark
functionality from Python for telecom-domain use cases where the data size was huge (> 20 GB), used
RDD concepts for parallel and fast data preprocessing, and used the shared variables concept for data
storage and loading from cache.

361. How does Kafka work?

Intermediate Spark

Kafka is a distributed system consisting of servers and clients that communicate via a high-performance
TCP. Applications (producers) send messages (records) to a Kafka node (broker) and said messages are
processed by other applications called consumers. Said messages get stored in a topic and consumers
subscribe to the topic to receive new messages.

362. Explain Spark Architecture

Intermediate Spark

Apache Spark follows a master/slave architecture where the master drives the process and the slave
daemons are the worker nodes which do the actual processing.

The Spark driver contains various components - DAGScheduler, TaskScheduler, BackendScheduler and
BlockManager - responsible for the translation of Spark user code into actual Spark jobs executed on the
cluster.

363. What is Lazy evaluation in Spark?

Basic Spark

Lazy evaluation as the name suggests means the execution will not start until an action is triggered.
Whenever there is some operation on RDD, it does not get executed immediately. Spark adds them to a
DAG of computation and only when the driver requests some data, this DAG actually gets executed

Advantages of lazy evaluation.

1. It is an optimization technique i.e. it provides optimization by reducing the number of queries.


2. It saves the round trips between driver and cluster, thus speeds up the process.
364. How can you use Apache Spark with Hadoop?

Intermediate Spark

We need to understand that Spark is not intended to replace the Hadoop stack but rather to enhance its
functionality. Spark can enrich the processing capabilities in terms of reading and writing data from HDFS
by combining Spark with Hadoop MapReduce and HBase.

There are two different ways in which the deployment happens.

1. Standalone deployment: Spark runs on the Hadoop cluster side by side with Hadoop MR, and users
can run Spark jobs directly on data in HDFS.
2. Hadoop YARN deployment: users can deploy Spark on Hadoop YARN and run Spark on YARN without
any pre-installation or administrative access required.

365. What are different types of cluster managers in Spark?

Basic Spark

Apache has 3 types of cluster managers.

Standalone: Simplest way to run spark in a clustered environment. It is a cluster which spark itself
manages. It has masters and number of workers with the configured amount of memory and CPU
cores.
Mesos: Mesos handles the workload in a distributed environment by dynamic resource sharing and
isolation. It is used for large scale cluster deployments and it decreases an overhead of allocating a
specific machine for different workloads.
Hadoop Yarn: YARN data computation framework is a combination of the ResourceManager, the
NodeManager. In resource manager, The Scheduler allocates a resource to the various running
application and Application Manager manages applications across all the nodes.

366. What is the use of broadcast variables in Apache Spark?

Intermediate Spark

Broadcast variables are useful when large datasets need to be cached in executors. Without this, these
need to be shipped to each executor before the actual process call. It is meant to be a read-only and is a
mechanism for sharing variables across executors
367. What is the role of Dstream in Spark?

Basic Spark

Discretized Stream or DStream is the basic abstraction provided by Spark Streaming. It represents a
continuous stream of data, either the input data stream received from the source or the processed data
stream generated by transforming the input stream. Internally, a DStream is represented by a continuous
series of RDDs, which is Spark’s abstraction of an immutable, distributed dataset. Each RDD in a DStream
contains data from a certain interval.

368. What are the different levels of persistence available in Spark?

Intermediate Spark

Different Persistence levels in Apache Spark are as follows:

1. MEMORY_ONLY: In this level, RDD object is stored as a de-serialized Java object in JVM. If an
RDD doesn’t fit in the memory, it will be recomputed.
2. MEMORY_AND_DISK: In this level, RDD object is stored as a de-serialized Java object in JVM. If
an RDD doesn’t fit in the memory, it will be stored on the Disk.
3. MEMORY_ONLY_SER: In this level, RDD object is stored as a serialized Java object in JVM. It is
more efficient than a de-serialized object.
4. MEMORY_AND_DISK_SER: In this level, RDD object is stored as a serialized Java object in JVM.
If an RDD doesn’t fit in the memory, it will be stored on the Disk.
5. DISK_ONLY: In this level, RDD object is stored only on Disk.

369. What do you understand by partitions in Spark?

Intermediate Spark

Spark's distributed datasets are basically big datasets which need to be partitioned across various nodes
in order to facilitate processing. Efficient execution on a single node for such huge datasets is not
possible. Hence partitioning is required, where each partitioned block is evaluated lazily and the
operations on it are stored as a DAG.

370. What is Pyspark?

Basic Spark
Pyspark is nothing but the python API for Spark. Its sole purpose is to support the collaboration of
Apache Spark and Python. It provides an interface to interact with RDD in Apache Spark through python
programming language.

371. What are actions and transformations in Spark?

Basic Spark

Transformations and actions are nothing but the two types of operations which can be performed on
RDDs. Transformations are operations which, when applied to an RDD, return a new transformed RDD.

Ex: map(), filter(), flatMap()

Actions are methods to access the actual data available in an RDD; when an action is called, the result is
brought into the program flow and all the pending transformations are executed.

Ex: collect(),reduce(),first(),take(),count()
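
A minimal PySpark sketch (assumes a local Spark installation and the pyspark package; the names used are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("demo").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize([1, 2, 3, 4, 5])
doubled = rdd.map(lambda x: x * 2)        # transformation - evaluated lazily
large = doubled.filter(lambda x: x > 4)   # another transformation

print(large.collect())                    # action - triggers execution, returns [6, 8, 10]
print(doubled.count())                    # action - returns 5

spark.stop()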

372. What is GraphX and what are its applications?

Intermediate Spark

GraphX is Apache Spark’s API for graphs and graph-parallel computation. This includes the collection of
graph algorithms and processes to do graph analytics. GraphX extends the Spark RDD with a Resilient
Distributed Property Graph.

The property graph is a directed multigraph which can have multiple edges in parallel. Every edge and
vertex have user-defined properties associated with it. The parallel edges allow multiple relationships
between the same vertices. It is flexible, fast and open source.

373. What is Sliding Window in Spark streaming?

Intermediate Spark

Spark streaming has an advantageous feature of windowed operation. It can do the transformation
operation over a sliding window of data. Generally, the sliding window operation requires two specific
parameters.

Window length which defines the duration of the window & Sliding Interval which defines the interval at
which the operation is performed.
374. What is the difference between Spark Session and Spark Context?

Intermediate Spark

Spark SparkContext is an entry point to Spark and used to programmatically create RDDs and other
variables. It's object "sc" is a default variable and can be created by using SparkContext class.

However, SparkSession is a superset of SparkContext which includes all the functional class of different
APIs, Spark Context, SQLContext, HiveContext etc. It's an entry point to underlying spark functionality
itself.

375. Why is Spark faster than Hadoop?

Basic Spark

Theoretically, Spark performs 100 times faster than Hadoop and this is possible only because it processes
data in random access memory (RAM), while Hadoop MapReduce persists data back to the disk after a
map or reduce action. Nonetheless, Spark needs lots of memory and keeps the data there until a further
call for caching.

376. What is the use of SQLContext in Spark?

Basic Spark

SQLContext is nothing but the gateway to SparkSQL from where the spark can interact with the
databases. Here the DB can be both SQL and NoSQL. Respective drivers are available for different DBs
which can be initiated along with the SparkSession builder process itself.

377. What are the limitations of Spark?

Basic Spark

1. Need manual optimization whenever required. No automated process is available.


2. No own file management system. Dependency on HDFS or something else.
3. Spark ML is very limited. MLlib does not support all extensive algorithms as of now. Not good for
advance analytics.
4. Not good for a multi-user environment. not capable of handling users concurrency.
378. Explain Spark Streaming with Kafka?

Intermediate Spark

Spark Streaming is nothing but a continuous stream that is processed using algorithms as it is. The output
is also retrieved in the form of a continuous data stream. Kafka streaming works on state transitions
unlike batches as that in Spark Streaming.

It stores the states within its topics, which is used by the stream processing applications for storing and
querying of the data. Thereby, all its operations are state-controlled. These states are further used to
connect topics to form an event task

379. How can you connect with Hive in Spark?

Intermediate Spark

Hive is connected through HiveContext in spark. HiveContext is the superset of SQLContext which
inherits all the property of SQLContext for DB interactions with addition of HiveContext properties to
connect with Hive and HBase.

380. What is the difference between Cache vs Broadcast in Spark?

Intermediate Spark

Cache stores each node or any partitions of it that it computes, in memory and reuses them in other
actions on the dataset. It helps in faster execution in future processes. Whereas, Broadcast variables
allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy
of it with tasks.

381. Explain DAG in Spark?

Basic Spark

DAG is the abbreviation of the Directed Acyclic Graph. In Spark, this is used for the visual representation
of RDDs and the operations being performed on them. The RDDs are represented by vertices, while the
operations are represented by edges. Every edge is directed from an 'earlier state' to a 'later state'.
382. Given a table(cars) with 4 columns(model_id, model_name,color, price) , perform
groupby using model_name and color, order by highest price, get 3rd highest.

Basic SQL

Hint?

383. What is the difference between the WHERE and HAVING clauses?

Basic SQL

Hint?

384. Given a table(employee). Find the Second highest salary. Find the 10th highest
salary. Find the 25-30th highest salary.

Intermediate SQL

Hint?

385. Fetch department-wise salary from an employee table

Basic SQL

Hint?

386. Given a table with order-id , order item-id and quantity Find the quantity for distinct
order-id

Basic SQL

Hint?

387. What are the different type of Joins in Sql and explain them? (Mainly focused on full
outer join )
Basic SQL

Hint?

388. Given 2 tables and the following query. What will be the output (select * from table 1
full outer join table 2) where values not in (select * from table 1 inner join table 2)

Intermediate SQL

Hint?

389. Given an assumption: There are 2 tables, first table has 10 records and second table
has 15 records. There are 5 records common in both the tables. Number of records that
would be fetched when you perform left join/right join/inner join/cross-join.

Basic SQL

Hint?

390. Given a word "JOE", find the word in a given string irrespective of the word being uppercase,
lowercase, or capitalized.

Intermediate SQL

Hint?

391. Find out if the database has any duplicate record names.

Basic SQL

Hint?
392. Differentiate between Implicit vs Explicit Join

Intermediate SQL

Hint?

393. With respect to SQL, which one is more preferable - Subqueries or Joins? Why?

Intermediate SQL

Hint?

394. Does SQL have User Defined Functions?

Basic SQL

Hint?

395. Query to find the employees in the office given check in and check out as fields.

Intermediate SQL

Hint?

396. Given a table of an event having columns date-ts/ event id. Find the event that
happened 3rd on every month

Basic SQL

Hint?

397. Split a full name into 2. First and last.

Basic SQL

Hint?
398. Find the Salary greater than Average salary without using Joins or Sub-Queries

Advanced SQL

Hint?

399. What is difference between rownum and dense rank ?

Basic SQL

Hint?

400. How will you use partition by?

Basic SQL

Hint?

401. Types of Joins

Basic SQL

Hint?

402. What is inner join?

Basic SQL

The INNER JOIN creates a new result table by combining column values of two tables (table1 and table2)
based upon the join-predicate. The query compares each row of table1 with each row of table2 to find all
pairs of rows which satisfy the join-predicate.

403. What is left / outer/inner join?

Basic SQL

(INNER) JOIN: Returns records that have matching values in both tables
LEFT (OUTER) JOIN: Returns all records from the left table, and the matched records from the right
table

RIGHT (OUTER) JOIN: Returns all records from the right table, and the matched records from the left
table

FULL (OUTER) JOIN: Returns all records when there is a match in either left or right table
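
As an illustrative sketch, assuming two hypothetical tables, employees and departments, linked by a dept_id column (note that FULL OUTER JOIN is not available in MySQL):

SELECT e.name, d.dept_name
FROM employees e
INNER JOIN departments d ON e.dept_id = d.dept_id;       -- only rows that match on both sides

SELECT e.name, d.dept_name
FROM employees e
LEFT JOIN departments d ON e.dept_id = d.dept_id;        -- all employees, matched departments or NULL

SELECT e.name, d.dept_name
FROM employees e
RIGHT JOIN departments d ON e.dept_id = d.dept_id;       -- all departments, matched employees or NULL

SELECT e.name, d.dept_name
FROM employees e
FULL OUTER JOIN departments d ON e.dept_id = d.dept_id;  -- all rows from both tables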

404. What is Normalization

Basic SQL

In a database context, normalization is the process of organizing tables and columns to minimize redundancy and avoid insert, update and delete anomalies, typically by moving through normal forms (1NF, 2NF, 3NF, BCNF). In data preprocessing, normalization also refers to Min-Max scaling, a technique in which values are shifted and rescaled so that they end up ranging between 0 and 1.

405. Use count function in a query

Basic SQL

SELECT COUNT(*) FROM dataset;

406. Difference between count(column_name) and count(*)

Intermediate SQL

COUNT(*) will count the number of records.

COUNT(column_name) will count the number of records where column_name is not null.
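
For example, on a hypothetical emp table where some rows have a NULL phone value:

SELECT COUNT(*) FROM emp;        -- counts every row, including rows where phone is NULL
SELECT COUNT(phone) FROM emp;    -- counts only rows where phone is NOT NULL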

407. Write SQL query to find the cumulative price of each customer in a table?

Intermediate SQL

Step 1: Compute a per-customer total with SUM() and an OVER clause:

SELECT CustomerID,
       TransactionDate,
       TransactionAmount,
       SUM(TransactionAmount) OVER (PARTITION BY CustomerID) AS RunningTotal
FROM Sales.CustomerTransactions
WHERE TransactionTypeID = 1;

Step 2: Add ORDER BY inside the OVER clause so the sum accumulates row by row, which gives the cumulative amount for each customer:

SELECT CustomerID,
       TransactionDate,
       TransactionAmount,
       SUM(TransactionAmount) OVER (PARTITION BY CustomerID
                                    ORDER BY TransactionDate) AS RunningTotal
FROM Sales.CustomerTransactions
WHERE TransactionTypeID = 1
ORDER BY CustomerID, TransactionDate;

408. Write a query to delete duplicate records in a table

Intermediate SQL

SELECT [FirstName],

[LastName],

[Country],

COUNT(*) AS CNT

FROM [SampleDB].[dbo].[Employee]

GROUP BY [FirstName],

[LastName],
[Country]

HAVING COUNT(*) > 1;
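
The query above only lists the duplicated name/country combinations. One common way to actually delete the extra copies, sketched here for SQL Server and reusing the table and column names from the example above, is a CTE with ROW_NUMBER():

WITH cte AS (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY FirstName, LastName, Country   -- one group per duplicate combination
               ORDER BY (SELECT NULL)
           ) AS rn
    FROM [SampleDB].[dbo].[Employee]
)
DELETE FROM cte WHERE rn > 1;   -- keeps the first row of each group, removes the rest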

409. What is a constraint in SQL?

Intermediate SQL

Hint?

410. What are the constraints type available in SQL

Intermediate SQL

Hint?

411. What is a Primary Key

Basic SQL

The PRIMARY KEY constraint uniquely identifies each record in a table.

Primary keys must contain UNIQUE values, and cannot contain NULL values.

A table can have only ONE primary key; and in the table, this primary key can consist of single or
multiple columns (fields).
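
A minimal sketch (the persons table and its columns here are hypothetical):

CREATE TABLE persons (
    person_id INT NOT NULL,
    name      VARCHAR(100),
    PRIMARY KEY (person_id)   -- unique, non-NULL identifier for each row
);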

412. Can we have multiple keys for primary key

Basic SQL

A table can have only one primary key, which may consist of single or multiple fields. When multiple
fields are used as a primary key, they are called a composite key. If a table has a primary key defined on
any field(s), then you cannot have two records having the same value of that field(s).
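
For example, a hypothetical order_items table where no single column is unique on its own, but the pair of columns is:

CREATE TABLE order_items (
    order_id INT NOT NULL,
    item_id  INT NOT NULL,
    quantity INT,
    PRIMARY KEY (order_id, item_id)   -- composite key: the combination of the two columns must be unique
);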

413. What is a Unique Key ?

Basic SQL
Unique key constraints identify an individual tuple uniquely in a relation or table. A table can have more
than one unique key, unlike the primary key. A unique key constraint can accept only one NULL value for
the column (exact NULL handling varies by database). Unique constraints can also be referenced by the
foreign key of another table. They can be used when someone wants to enforce uniqueness on a column,
or a group of columns, that is not the primary key.
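
A small sketch with a hypothetical users table:

CREATE TABLE users (
    user_id INT PRIMARY KEY,
    email   VARCHAR(255) UNIQUE,   -- unique key: duplicate values rejected (NULL handling varies by database)
    phone   VARCHAR(20)  UNIQUE    -- a table can have several unique keys
);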

414. What is a Foreign Key ?

Basic SQL

A FOREIGN KEY is a key used to link two tables together.

A FOREIGN KEY is a field (or collection of fields) in one table that refers to the PRIMARY KEY in
another table.

The table containing the foreign key is called the child table, and the table containing the candidate key
is called the referenced or parent table
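
A minimal sketch, assuming hypothetical customers (parent) and orders (child) tables:

CREATE TABLE customers (
    customer_id INT PRIMARY KEY,
    name        VARCHAR(100)
);

CREATE TABLE orders (
    order_id    INT PRIMARY KEY,
    customer_id INT,
    FOREIGN KEY (customer_id) REFERENCES customers (customer_id)  -- child column points to the parent's primary key
);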

415. Can a table contain multiple FOREIGN KEY’s?

Basic SQL

A table may have multiple foreign keys, and each foreign key can have a different parent table.

416. What is SQL NOT NULL constraint?

Intermediate SQL

A NOT NULL constraint in SQL prevents NULL values from being inserted into the specified column, treating
NULL as a value that is not accepted for that column. This means you must supply a non-NULL value for
that column in INSERT or UPDATE statements, so the column will always contain data.
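
For example (hypothetical table and columns):

CREATE TABLE employees (
    emp_id INT NOT NULL,            -- INSERT/UPDATE must always supply a value here
    email  VARCHAR(255) NOT NULL,
    bonus  DECIMAL(10, 2)           -- nullable column: NULL is accepted
);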

417. What is a CHECK constraint?

Intermediate SQL
The CHECK constraint is used to limit the value range that can be placed in a column. If you define a
CHECK constraint on a single column it allows only certain values for this column. If you define a CHECK
constraint on a table it can limit the values in certain columns based on values in other columns in the
row.
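
A short sketch with hypothetical column names, showing both forms:

CREATE TABLE staff (
    staff_id INT PRIMARY KEY,
    age      INT CHECK (age >= 18),   -- single-column check
    salary   DECIMAL(10, 2),
    bonus    DECIMAL(10, 2),
    CHECK (bonus <= salary)           -- table-level check across columns in the same row
);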

418. What is a DEFAULT constraint?

Intermediate SQL

The DEFAULT constraint is used to provide a default value for a column. The default value will be added
to all new records IF no other value is specified.
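
For example, with a hypothetical orders table:

CREATE TABLE orders (
    order_id INT PRIMARY KEY,
    status   VARCHAR(20) DEFAULT 'pending'   -- used whenever an INSERT does not supply a status
);

INSERT INTO orders (order_id) VALUES (1);    -- the row is stored with status = 'pending'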

419. What is the difference between NULL value, Zero, and Blank space?

Intermediate SQL

A NULL value is not the same as zero or a blank space. A NULL value is a value which is 'unavailable,
unassigned, unknown or not applicable', whereas zero is a number and a blank space is a character.

420. What is a Composite key ?

Intermediate SQL

A composite key is a combination of two or more columns in a table that can be used to uniquely identify
each row in the table. Uniqueness is guaranteed only when the columns are combined; taken individually,
the columns do not guarantee uniqueness.

421. How do you restrict the data at columns level ?

Intermediate SQL

Data can be restricted at the column level by naming only the required columns in the SELECT list (or by
exposing a view, or granting column-level privileges) instead of using SELECT *. To restrict the number of
rows returned, a LIMIT (or TOP) clause can be used to cap the number of records retrieved.

422. Write a query to select the f_name and l_name fields from table emp with a space in
between the 2 columns
Intermediate SQL

select f_name + ' ' + l_name as full_name from emp

(In MySQL, use concat(f_name, ' ', l_name) instead of the + operator.)

423. Write a query to rename the column name id as emp_id, name as emp_name for the
table emp;

Basic SQL

select id as emp_id, name as emp_name from emp;

424. select * from dual, what does the dual mean and what is the default data types ?

Intermediate SQL

DUAL is a special one-row, one-column table present by default in all Oracle databases. The owner of
DUAL is SYS (SYS owns the data dictionary, therefore DUAL is part of the data dictionary), but DUAL can
be accessed by every user.

The table has a single VARCHAR2(1) column called DUMMY that has a value of 'X'. MySQL allows DUAL
to be specified as a table in queries that do not need data from any tables. In SQL Server DUAL table
does not exist, but you could create one.

425. How do you get the current system date using dual table?

Intermediate SQL

SELECT sysdate FROM DUAL ;

426. Write a query to get the number of records from a table emp

Basic SQL

SELECT COUNT(*) FROM emp;

427. Explain DDL with examples.


Basic SQL

DDL is Data Definition Language and is used to define structures like schemas, databases, tables and
constraints. Examples of DDL are CREATE, ALTER, DROP and TRUNCATE statements.

428. Explain DML with examples.

Basic SQL

DML is Data Manipulation Language and is used to manipulate data. Examples of DML are insert,
update and delete statements.

429. Explain DCL and TCL with examples

Intermediate SQL

DCL is Data Control Language

TCL is Transaction Control Language

Examples under DCL: GRANT, REVOKE

Examples under TCL: START TRANSACTION, COMMIT, ROLLBACK

430. How to get only the delhi records from emp table, and handle all types of case
sensitive issues: Delhi, delhi, DELHI, DELhi

Intermediate SQL

select * from emp where upper(city)='DELHI'

431. Write a query to change the format of the date to (YYYY-MON-DD) in dual table ?

Intermediate SQL

select to_char(sysdate, 'YYYY-MON-DD') from dual;


432. How to remove duplicate in the col from a table ?

Basic SQL

SELECT DISTINCT column FROM table1;

433. Write a query to find count of unique id in the retail_shopping table

Basic SQL

SELECT count(DISTINCT ID) FROM retail_shopping;

434. Write a query to select only the id, name, city,country and phone from the table
customer and restrict the record only to india

Basic SQL

select id, name, city, country, phone from customer where country='india'

435. Write a query to update the table emp, where the city name is Madras to chennai

Basic SQL

UPDATE emp

SET city= 'chennai'

WHERE city= 'Madras';

436. Write a query to remove records whose salary is greater than or equal to 50000 and
city is chennai

Basic SQL

DELETE FROM emp WHERE salary>=50000 and city='chennai'

437. Write a query to select all the students from table stud whose name begins with 'S'
Intermediate SQL

select * from stud where name like 'S%'

438. Write a query to display all the records from table emp where the age is between 18
and 58

Basic SQL

select * from emp where age between 18 and 58

439. Select all the record for emp, in which gender is female or age > 18

Basic SQL

select * from emp where gender='female' or age>18

440. Write a query to extract all the records for which the payment_detail column is null in
the table payment_detail

Basic SQL

select * from payment_detail where payment_detail is null

441. Write a query to get the top 5 salary from the table emp

Basic SQL

select top 5 * from emp order by salary desc

(In MySQL/PostgreSQL: select * from emp order by salary desc limit 5)

442. Query the records from emp where order is descending for name and ascending for
salary

Intermediate SQL
select * from emp order by name desc, salary asc

443. What is the difference between union and union all in SQL

Intermediate SQL

UNION removes duplicate records (where all columns in the results are the same), UNION ALL does not.

There is a performance hit when using UNION instead of UNION ALL, since the database server must do
additional work to remove the duplicate rows, but usually, you do not want the duplicates (especially
when developing reports).
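
For illustration, with two hypothetical tables old_customers and new_customers that share some rows:

SELECT name FROM old_customers
UNION
SELECT name FROM new_customers;      -- duplicate rows removed

SELECT name FROM old_customers
UNION ALL
SELECT name FROM new_customers;      -- duplicates kept, usually faster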

444. What is an execution plan? When would you use it? How would you view the
execution plan

Intermediate SQL

An execution plan is a view in SQL Server Management Studio that shows how SQL Server breaks down
and executes a query, and it helps identify where issues might exist within the plan. By identifying the
statements that take a long time to complete, you can then look at the execution plan to determine
tuning needs.

When do you use it?

You can use it any time you write a query. Most developers use an execution plan when a database
query consumes a lot of resources or takes a long time.

How do you view it in SQL Server?

SQL Server can create execution plans in two ways:

Actual Execution Plan - (CTRL + M) - is created after execution of the query and contains the steps that
were performed

Estimated Execution Plan - (CTRL + L) - is created without executing the query and contains an
approximate execution plan

Execution plans can be presented in these three ways.


Text Plans

Graphical Plans

XML Plans

445. How can you select all the even number records from a table? All the odd number
records?

Intermediate SQL

Select * from table where id % 2 = 0

Select * from table where id % 2 != 0

446. What is the difference between the RANK() and DENSE_RANK() functions? Provide
an example.

Intermediate SQL

The only difference between the DENSE_RANK() and RANK() functions is how they handle ties: RANK()
assigns non-consecutive ranks to the values in a set in the case of a tie, which means there will be gaps
between the integer ranks after a tie, whereas DENSE_RANK() assigns consecutive ranks in the case of a
tie, so there are no gaps between the integer ranks.
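
A sketch on a hypothetical emp table with name and salary columns:

SELECT name,
       salary,
       RANK()       OVER (ORDER BY salary DESC) AS rnk,        -- 1, 2, 2, 4, ... (gap after a tie)
       DENSE_RANK() OVER (ORDER BY salary DESC) AS dense_rnk   -- 1, 2, 2, 3, ... (no gap)
FROM emp;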

447. What is the difference between char and varchar2?

Intermediate SQL

CHAR is used for storing fixed-length character strings. It will waste a lot of disk space if this type is used
to store variable-length strings, because shorter values are padded to the declared length.

VARCHAR2 is used to store variable-length character strings.
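
For example, in Oracle (hypothetical table and columns):

CREATE TABLE codes (
    country_code CHAR(2),        -- always stored as 2 characters, padded with spaces if shorter
    city_name    VARCHAR2(50)    -- stores only the characters actually supplied, up to 50
);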

448. How do you detect outliers

Basic Statistics
Hint?

449. Difference between pause and continue

Basic Statistics

Hint?

450. Why you used T-test in the project that you have mentioned in your resume.

Basic Statistics

Hint?

451. Given two populations, to perform a test of effectiveness of a drug, which statistical
test will you perform?

Intermediate Statistics

Hint?

452. If a height is co - related to weight & weight is co -related height are the both the
statements same?

Basic Statistics

Yes, the two statements are equivalent: correlation is symmetric, so the correlation of height with weight is the same as the correlation of weight with height, given that both are continuous variables measured on the same observations.

453. Given a data / statement, calculate the Z score

Basic Statistics

A z-score measures exactly how many standard deviations above or below the mean a data point is.

The formula for calculating a z-score is:

z = (data point - mean) / standard deviation

A positive z-score says the data point is above average.

A negative z-score says the data point is below average.

A z-score close to 0 says the data point is close to average.

A data point can be considered unusual if its z-score is above 3 or below -3.
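
A quick worked example with made-up numbers: if the mean is 100 and the standard deviation is 15, then a data point of 130 has

z = (130 - 100) / 15 = 2

so it lies two standard deviations above the mean.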

454. What is p-value?

Basic Statistics

Hint?

455. Explain Chi-squared test, Z-test, Anova.

Intermediate Statistics

Hint?

456. Difference between precision/ recall/ f1 score.

Intermediate Statistics

Hint?

457. What are independent variables and categorical variables. Highlight the key
differences.

Basic Statistics

An independent variable, sometimes called an experimental or predictor variable, is a variable that is
being manipulated in an experiment in order to observe the effect on a dependent variable, sometimes
called an outcome variable.

Categorical variables contain a finite number of categories or distinct groups. Categorical data might not
have a logical order. For example, categorical predictors include gender, material type, and payment
method.

An independent variable can be categorical or numerical. A categorical variable can be an independent
variable or a dependent variable.

458. What is Chi Square ?

Basic Statistics

The Chi-Square statistic is commonly used for testing relationships between categorical variables. The
null hypothesis of the Chi-Square test is that no relationship exists on the categorical variables in the
population; they are independent.
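
The test statistic itself is a standard formula comparing observed and expected counts in a contingency table (written here in plain notation):

chi-square = sum over all cells of (O - E)^2 / E

where O is the observed frequency, E is the expected frequency under independence, and the statistic is compared against a chi-squared distribution with (rows - 1) * (columns - 1) degrees of freedom.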

459. How to prove a sample is the true representation of population?

Intermediate Statistics

Properties of representative samples:

- Estimates calculated from sample data are often used to make inferences about populations.

- If a sample is representative of a population, the sample reflects the characteristics of the population,
so the sample findings can be generalized to the population.

- The most effective way to achieve representativeness is through randomization: random selection or
random assignment.

460. What is Hypothesis Testing?

Basic Statistics

A statistical hypothesis is an assumption about a population parameter. This assumption may or may not
be true. Hypothesis testing refers to the formal procedures used by statisticians to accept or reject
statistical hypotheses
461. A scenario was given and was asked to write Null and Alternate Hypothesis

Intermediate Statistics

Null hypothesis. The null hypothesis, denoted by Ho, is usually the hypothesis that sample observations
result purely from chance.

Alternative hypothesis. The alternative hypothesis, denoted by H1 or Ha, is the hypothesis that sample
observations are influenced by some non-random cause.
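
For example, in a hypothetical drug trial: H0: the mean outcome is the same for the treatment and control groups (the drug has no effect), versus H1: the mean outcomes differ (the drug has an effect).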

462. How you handle the skewness

Intermediate Statistics

We can handle skewness using a log transformation. A log transformation can help to fit a very skewed
(especially right-skewed) distribution into an approximately Gaussian one.

463. Explain in detail about distributions in statistics

Basic Statistics

Gaussian Distribution: Data from many fields of study can, surprisingly, be described using a Gaussian
distribution, so much so that it is often called the "normal" distribution because it is so common. A
Gaussian distribution can be described using two parameters:

mean: denoted with the Greek lowercase letter mu, it is the expected value of the distribution.

variance: denoted with the Greek lowercase letter sigma raised to the second power (because the units
of the variable are squared), it describes the spread of observations from the mean. Its square root, the
standard deviation (sigma), describes the spread of observations from the mean in the original units.

Student’s t-Distribution: It is a distribution that arises when attempting to estimate the mean of a normal
distribution with different sized samples.

The distribution can be described using a single parameter:

number of degrees of freedom: denoted with the lowercase Greek letter nu (v).
Chi-Squared Distribution: Like the Student’s t-distribution, the chi-squared distribution is also used in
statistical methods on data drawn from a Gaussian distribution to quantify the uncertainty.

The chi-squared distribution has one parameter:

degrees of freedom, denoted k.

etc

464. Difference between Binomial and Poisson Distribution

Basic Statistics

The binomial distribution describes the number of successes in a fixed number of trials, where each trial
has only two possible outcomes (success or failure), so the count is bounded by the number of trials n.
The Poisson distribution describes the number of events in a fixed interval and places no upper limit on
that count.

Binomial is biparametric in nature (parameters n and p), while Poisson is uniparametric (parameter lambda).

Mean > Variance for binomial, Mean = Variance for Poisson
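
For reference, the two probability mass functions (standard results, written in plain notation) are:

Binomial: P(X = k) = C(n, k) * p^k * (1 - p)^(n - k), with mean = n*p and variance = n*p*(1 - p)

Poisson:  P(X = k) = e^(-lambda) * lambda^k / k!, with mean = variance = lambda

Since 0 < p < 1, n*p*(1 - p) is smaller than n*p, which is why the binomial mean exceeds its variance while the two are equal for the Poisson.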

465. What are the conditions for performing two sample hypothesis testing?

Basic Statistics

When comparing two population proportions, we start with two assumptions:

The two independent samples are simple random samples that are independent.

The number of successes is at least five and the number of failures is at least five for each of the
samples.

466. What is sigmoid function, conditional probability and probability difference

Basic Statistics

- A Sigmoid function is a mathematical function which has a characteristic S-shaped curve. There are a
number of common sigmoid functions, such as the logistic function, the hyperbolic tangent, and the
arctangent
- All sigmoid functions have the property that they map the entire number line into a small range such as
between 0 and 1, or -1 and 1, so one use of a sigmoid function is to convert a real value into one that can
be interpreted as a probability.

- The "odds ratio" p / (1 - p) describes the ratio between the probability that a certain, positive, event
occurs and the probability that it does not occur, where positive refers to the "event that we want to
predict", i.e., p(y=1 | x).

- The sigmoid function outputs the conditional probabilities of the prediction, i.e., the class probabilities.
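
Written out (standard definitions, plain notation):

sigmoid: sigma(z) = 1 / (1 + e^(-z)), which maps any real number z into the range (0, 1)

odds: p / (1 - p); log-odds (logit): z = log(p / (1 - p)), which is the inverse of the sigmoid

conditional probability: p(y = 1 | x), which logistic regression models as sigma(w*x + b)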

467. You are given a data set. The data set has missing values which spread along 1
standard deviation from the median. What percentage of data would remain unaffected?
Why?

Intermediate Statistics

468. What are different types of Hypothesis Testing

Intermediate Statistics

There are basically two types, namely, null hypothesis and alternative hypothesis

The null hypothesis is generally denoted as H0. It states the exact opposite of what an investigator or an
experimenter predicts or expects. It basically defines the statement which states that there is no exact or
actual relationship between the variables.

The alternative hypothesis is generally denoted as H1. It makes a statement that suggests or advises a
potential result or an outcome that an investigator or the researcher may expect. It has been categorized
into two categories: directional alternative hypothesis and non-directional alternative hypothesis.

469. What is the difference between variance and covariance

Basic Statistics

Variance is a one-dimensional measure and covariance is a two-dimensional measure: variance measures
the volatility (spread) of a single random variable, while covariance measures the relationship between
two random variables. The higher the volatility in a stock, the riskier the stock, and buying stocks with
negative covariance is a good way to minimize risk: a positive covariance means the assets move in the
same direction, whereas a negative covariance means the assets generally move in opposite directions.
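
The standard sample formulas make the contrast explicit (plain notation):

Var(X)    = sum over i of (x_i - x_bar)^2 / (n - 1)

Cov(X, Y) = sum over i of (x_i - x_bar) * (y_i - y_bar) / (n - 1)

Variance is the covariance of a variable with itself, Cov(X, X) = Var(X), and the sign of Cov(X, Y) indicates whether the two variables tend to move in the same or opposite directions.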

470. What a data contains? (Information + Noise) Explain

Basic Statistics

Data = true signal + noise

Noisy data are data with a large amount of additional meaningless information in it called noise. This
includes data corruption and the term is often used as a synonym for corrupt data. It also includes any
data that a user system cannot understand and interpret correctly.

Sources of noise:

- Random noise(white noise) is often a large component of the noise in data

- Outlier data are data that appears to not belong in the data set. It can be caused by human error such as
transposing numerals, mislabeling, programming bugs, etc

- Fraud: Individuals may deliberately skew data to influence the results toward a desired conclusion

471. How to create dashboards?

Basic Tableau

1. At the bottom of the workbook, click the New Dashboard icon:

2. From the Sheets list at left, drag views to your dashboard at the right

3. To replace a sheet, select it in the dashboard at right. In the Sheets list at left, hover over the
replacement sheet, and click the Swap Sheets button.

472. What filters should be applied to rows for specific ops?

Basic Tableau

The different types of filters used in Tableau are given below. The name of filter types is sorted based
on the order of execution in Tableau.
Extract Filters

Data Source Filters

Context Filters

Dimension Filters

Measure Filters

473. Difference between Dimensions and Measures

Basic Tableau

Dimensions contain qualitative values (such as names, dates, or geographical data). You can use
dimensions to categorize, segment, and reveal the details in your data. Dimensions affect the level of
detail in the view.

Measures contain numeric, quantitative values that you can measure. Measures can be aggregated.
When you drag a measure into the view, Tableau applies an aggregation to that measure (by default).

474. What Are the Different Joins in Tableau

Intermediate Tableau

There are four types of joins which are used to combine data in Tableau: inner, left, right and full outer.
Let’s look into it one by one:

Inner:

Inner join results in a table that contains values that have matches in both tables.

Left:

The left join results in a table that contains the values from the left table and corresponding matches
from the right table. And in case, if a value in the left table doesn’t have a corresponding match in the
right table, a null value in the data grid is reflected.
Right:

Right join results in a table which contains all the values from the right table and corresponding matches
from the left table. And in case, if a value in the right table doesn’t have a corresponding match in the left
table, a null value in the data grid is reflected.

Full Outer:

Full outer join results in a table that contains all values from both tables. And a null value is reflected in
data grid when a value from either table doesn’t have a match with the other table.

475. What is a Calculated Field, and How Will You Create One

Basic Tableau

Sometimes your data source does not contain a field (or column) that you need for your analysis. For
example, your data source might contain fields with values for Sales and Profit, but not for Profit Ratio. If
this is the case, you can create a calculated field for Profit Ratio using data from the Sales and Profit
fields.

How to create a simple calculated field using an example.

Step 1: Create the calculated field

In a worksheet in Tableau, select Analysis > Create Calculated Field.

In the Calculation Editor that opens, give the calculated field a name.

In this example, the calculated field is called Profit Ratio.

Step 2: Enter a formula

In the Calculation Editor, enter a formula.

This example uses the following formula:


SUM([Profit])/SUM([Sales])

476. What Is a Parameter in Tableau

Intermediate Tableau

A parameter is a global placeholder value such as a number, date, or string that can replace a constant
value in a calculation, filter, or reference line.

For example, you may create a calculated field that returns True if Sales is greater than $500,000 and
otherwise returns False. You can replace the constant value of “500000” in the formula with a parameter.
Then, using the parameter control, you can dynamically change the threshold in your calculation

477. What is the Use of Dual-axis

Intermediate Tableau

Dual axes are two independent axes that are layered on top of each other. According to Tableau, dual
axes allow you to compare multiple measures. Dual axes are useful when you have two measures that
have different scales.

478. What is the Difference Between Treemaps and Heat Maps

Basic Tableau

A heat map is a two-dimensional representation of information with the help of colours. Heat maps can
help the user visualize simple or complex information.

Treemaps are ideal for displaying large amounts of hierarchically structured (tree-structured) data. The
space in the visualization is split up into rectangles that are sized and ordered by a quantitative variable.
The levels in the hierarchy of the treemap are visualized as rectangles containing other rectangles. Each
set of rectangles on the same level in the hierarchy represents a column or an expression in a data table.
Each individual rectangle on a level in the hierarchy represents a category in a column.

479. What is the Difference Between .twbx And .twb


Basic Tableau

Tableau Workbook File (TWB) is an XML document. It contains the information about your sheets,
dashboards and stories. The TWB file references a data source file such as Excel or TDE, and when you
save the TWB file, it is linked to the source.

The most important thing to remember about TWB files is that they don’t contain any data – if you want
to share your workbook, therefore, you will need to send both the Tableau Workbook File and the data
source file.

Tableau Packaged Workbook (TWBX) is a package of files “compressed” together. It includes a data
source file, TWB, and any other file used to produce the workbook (including images).

TWBX is intended for sharing. It does not link to the original file source; instead, it contains a copy of the
data that was obtained when the file was created. TWBX files are usually used as reports and can be
viewed using Tableau Viewer.

TWBX isn’t designed for auto-updating. If you refresh/update the source file, TWBX will stay unchanged.
If you want your workbook to update when the source file is updated, you need to use the TWB file
format.

480. Explain the Difference Between Tableau Worksheet, Dashboard, Story, and
Workbook

Basic Tableau

Tableau uses a workbook and sheet file structure, much like Microsoft Excel.

A workbook contains sheets, which can be a worksheet, dashboard, or a story.

A worksheet contains a single view along with shelves, legends, and the Data pane.

A dashboard is a collection of views from multiple worksheets.

A story contains a sequence of worksheets or dashboards that work together to convey information.

481. How many maximum tables can you join in Tableau


Basic Tableau

Hint?

482. Explain Pareto chart and how is it created in Tableau

Intermediate Tableau

A Pareto chart is a type of chart that contains both bars and a line graph, where individual values are
represented in descending order by bars, and the ascending cumulative total is represented by the line.

To create a Pareto chart in Tableau, first create a bar chart.

Create a bar graph that shows Sales by Sub-Category in descending order:

i. Connect to the Sample - Superstore data source.

ii. From the Dimensions area of the Data pane, drag Sub-Category to Columns.

iii. From the Measures area of the Data pane, drag Sales to Rows.

iv. Click Sub-Category on Columns and select Sort.

In the Sort panel, do the following:

i. Under Sort order, select Descending.

ii. Under Sort by, select Field.

iii. Leave all other values unchanged, with Sales as the chosen field and Sum as the chosen aggregation.

iv. Click OK to exit the Sort panel.

Products are now sorted from highest sales to lowest.

Add a Line Chart

Add a line chart that also shows Sales by Sub-Category:

i. From the Measures area of the Data pane, drag Sales to the far right of the view, until a dashed line
appears.

ii. Drop Sales to create a dual-axis view. It is a bit hard to see that there are two instances of the Sales
bars at this point, because they are arranged identically.

iii. Select SUM(Sales) (2) on the Marks card, and change the mark type to Line.

Add a table calculation to the line chart to show sales by Sub-Category as a running total and as a percent
of total:

i. Click the second copy of SUM(Sales) on Rows and select Add Table Calculation.

ii. Add a primary table calculation to SUM(Sales) to present sales as a running total: choose Running Total
as the Calculation Type, and do not close the Table Calculation panel yet.

iii. Add a secondary table calculation to present the data as a percent of total: click Add Secondary
Calculation and choose Percent of Total as the Secondary Calculation Type.

iv. The Table Calculation panel should now show both calculations; click the X in its upper-right corner to
close it.

v. Click Color on the Marks card to change the color of the line.

483. Introduce yourself/Tell us about yourself

Basic HR

Hint?

484. Why do you want to leave your current organization?

Basic HR

Hint?

485. What are your strengths?


Basic HR

Hint?

486. What are your weaknesses?

Basic HR

Hint?

487. Why do you want to join our company?

Basic HR

Hint?

488. How do you typically respond to problems?

Intermediate HR

Hint?

489. What significant goals have you set for yourself in the past? Have you achieved
those?

Intermediate HR

Hint?

490. You have worked in the IT sector for so long, why is there a sudden interest in
analytics?

Intermediate HR

Hint?
491. Have you worked on any analytics projects or assignments?

Basic HR

Hint?

492. Please describe your future career goals

Intermediate HR

Hint?

493. Do you have any idols? In what way do they inspire you?

Basic HR

Hint?

494. What are your interests and hobbies? What do you do in your free time?

Basic HR

Hint?

495. What has been your biggest achievement at work?

Intermediate HR

Hint?

496. Do you prefer working as an individual contributor or managing a team?

Intermediate HR

Hint?
497. Do you have any questions for us?

Basic HR

Hint?

Copyright © 2024 Great Learning. All Rights Reserved.
