Data Science Interview Questions

The document provides a comprehensive set of interview questions and answers for data science roles, focusing on key Python concepts and libraries. It covers a variety of topics including data manipulation, analysis, and programming techniques relevant to data science. The questions range from basic to intermediate levels, helping candidates prepare effectively for their interviews.


Ace the upcoming Data Science Interview

You can't anticipate every question an interviewer will ask. However, there are many critical
questions that you can prepare before the interview.

Our hiring partners have helped us curate a set of interview questions on key skills, which will help
you prepare better for the data science job roles.


1. Name a function that is most useful for converting a multidimensional array into a one-dimensional array. For this function, will changing the output array affect the original array?

Basic Python

The flatten() method can be used to convert a multidimensional array into a 1D array. Modifying the output array returned by flatten() will not affect the original array, because this method returns a copy of the original array.
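A minimal sketch of this copy behaviour (the array values are illustrative):

import numpy as np

arr = np.array([[1, 2], [3, 4]])
flat = arr.flatten()   # flatten() returns a copy of the data
flat[0] = 99
print(flat)            # [99  2  3  4]
print(arr[0, 0])       # still 1 - the original array is unchanged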

2. If there are two variables defined as 'a = 3' and 'b = 4', will the id() function return the same value for a and b?
Basic Python

The id() function in Python returns the identity of an object, which in CPython is its memory address. Since this identity is unique for every distinct object during its lifetime, id() will not return the same value for a and b.
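A quick illustration; the exact id values printed will vary between runs:

a = 3
b = 4
print(id(a))            # e.g. 140711693620000
print(id(b))            # a different value, since 3 and 4 are distinct objects
print(id(a) == id(b))   # False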

3. What is the Beautiful Soup library used for?

Basic Python

4. In Python, if we create two variables 'mean = 7' and 'Mean = 7', will both of them be considered equivalent?

Basic Python

Python is a case-sensitive language. It distinguishes between uppercase and lowercase letters, and hence the variables 'mean = 7' and 'Mean = 7' refer to two different variables and will not be considered equivalent.

5. What is the use of 'inplace' in pandas functions?

Basic Python

'inplace' is a parameter available in a number of pandas functions, and it affects how the function executes. With 'inplace=True', the original dataframe is modified and the function returns None. The default behaviour is 'inplace=False', which returns a modified copy of the dataframe without affecting the original dataframe.
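A small sketch of the difference, using an illustrative dataframe with a missing value:

import pandas as pd
import numpy as np

df = pd.DataFrame({"a": [1, np.nan, 3]})

clean = df.dropna()              # inplace=False (default): df itself is untouched
print(len(df), len(clean))       # 3 2

df.dropna(inplace=True)          # modifies df itself and returns None
print(len(df))                   # 2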

6. How can you change the index of a dataframe in python?

Basic Python

DataFrame.set_index(keys, drop=True, append=False, inplace=False, verify_integrity=False)


keys: label, array-like, or list of labels/arrays. This parameter can be either a single column key, a single array of the same length as the calling DataFrame, or a list containing an arbitrary combination of column keys and arrays. Here, "array" encompasses Series, Index, np.ndarray, and instances of Iterator.
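A brief usage sketch with a hypothetical dataframe:

import pandas as pd

df = pd.DataFrame({"id": [101, 102, 103], "score": [0.5, 0.7, 0.9]})

# Use the 'id' column as the new index; drop=True removes it as a regular column
df = df.set_index("id", drop=True)
print(df.loc[102, "score"])   # 0.7
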
7. How would you check whether a number is prime or not using Python?

Basic Python

# taking input from the user
number = int(input("Enter any number: "))

# a prime number is always greater than 1
if number > 1:
    for i in range(2, number):
        if (number % i) == 0:
            print(number, "is not a prime number")
            break
    else:
        # the loop completed without finding a divisor
        print(number, "is a prime number")
# if the entered number is less than or equal to 1, it is not a prime number
else:
    print(number, "is not a prime number")

8. What is the difference between univariate and bivariate analysis? What different functions can be used in Python?

Basic Python

Univariate analysis summarizes only one variable at a time, while bivariate analysis compares two variables. Below are a few functions which can be used in univariate and bivariate analysis:
1. To find the population proportions of different types of blood disorders: df.Thal.value_counts()
2. To plot the distribution of a variable: sns.distplot(df.Variable.dropna())
3. To find the minimum, maximum, average, and standard deviation of the data: the describe() function returns the minimum, maximum, mean, etc. of the numerical variables of the dataframe.
4. To find the mean of a variable: df.Variable.dropna().mean()
5. Boxplot to observe outliers: sns.boxplot(x=' ', y=' ', hue=' ', data=df)
6. Correlation matrix: df.corr()

9. What is the difference between 'for' loop and 'while' loop?

Basic Python

- A 'for' loop is used to iterate over a sequence; the number of iterations to be performed is known in advance.
- In a 'while' loop, the number of iterations is not known in advance. The body keeps executing until the loop condition becomes false.

10. Differentiate between Call by value and Call by reference.

Basic Python

11. How will you import multiple excel sheets in a data frame?
Basic Python

The Excel sheets can be read into dataframes using the 'pd.read_excel()' function with sheet_name=None (which returns all sheets), and the results can then be concatenated using 'pd.concat()'. Syntax: df = pd.concat(pd.read_excel('file.xlsx', sheet_name=None), ignore_index=True)

12. What is the difference between 'Append' and 'Extend' function?

Basic Python

The append() method adds a single item to the end of the list. Syntax: list.append(item). The extend() method, on the other hand, extends the list by adding each element from an iterable. Syntax: list.extend(iterable).
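A short illustration of the difference:

a = [1, 2]
a.append([3, 4])      # the whole list is added as a single element
print(a)              # [1, 2, [3, 4]]

b = [1, 2]
b.extend([3, 4])      # each element of the iterable is added individually
print(b)              # [1, 2, 3, 4]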

13. What are the data types available in Python?

Basic Python

Python has the following standard data types:
- Boolean
- Set
- Mapping type: dictionary
- Sequence type: list, tuple, string
- Numeric type: int, float, complex

14. Can you write a function using python to impute outliers?

Basic Python

import numpy as np

def removeOutliers(x, outlierConstant):
    a = np.array(x)
    upper_quartile = np.percentile(a, 75)
    lower_quartile = np.percentile(a, 25)
    IQR = (upper_quartile - lower_quartile) * outlierConstant
    quartileSet = (lower_quartile - IQR, upper_quartile + IQR)
    resultList = []
    for y in a.tolist():
        # keep only the values that fall inside the IQR-based fences
        if y >= quartileSet[0] and y <= quartileSet[1]:
            resultList.append(y)
    return resultList

15. Can any type of string be converted into an int, in Python?

Basic Python

Not every string can be converted. Python offers the int() function, which takes a string as an argument and returns an integer, but the string must represent a whole number (optionally with a sign); otherwise a ValueError is raised. Keep these special cases in mind: passing a float object (a number with a fractional part) directly to int() returns the number truncated towards zero, and a string such as '3.7' must first be converted with float() before it can be passed to int().

16. How would you check whether a number is an Armstrong number using Python?

Basic Python

# Python program to check if the number is an Armstrong number or not
# take input from the user
num = int(input("Enter a number: "))

# initialize sum
sum = 0

# find the sum of the cube of each digit
temp = num
while temp > 0:
    digit = temp % 10
    sum += digit ** 3
    temp //= 10

# display the result
if num == sum:
    print(num, "is an Armstrong number")
else:
    print(num, "is not an Armstrong number")

17. What is the difference between list, array and tuple in Python?

Basic Python

A list is an ordered, mutable collection. Lists are dynamic and can contain objects of different data types, and list elements can be accessed by index number. An array is an ordered, mutable collection of elements of the same data type, and its elements can also be accessed by index number. A tuple is immutable and can store any type of data; it is defined using parentheses () and its elements cannot be changed or replaced once created.

18. What is the difference between iloc and loc?

Basic Python

loc gets rows (or columns) with particular labels from the index. iloc gets rows (or columns)
at particular positions in the index and it only takes integers.
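A minimal sketch of the difference, using a hypothetical dataframe with a non-default index:

import pandas as pd

df = pd.DataFrame({"score": [10, 20, 30]}, index=[100, 101, 102])

print(df.loc[101, "score"])    # 20 - selection by index label
print(df.iloc[1]["score"])     # 20 - selection by integer position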

19. How does the reverse function work in Python?

Basic Python

The built-in reverse() method reverses the contents of a list object in place. That means it does not return a new list; instead, it directly modifies the original list object. Syntax: list.reverse()
20. What is the apply function in Python? How does it work?

Basic Python

pandas apply() allows users to pass a function and apply it to every single value of a pandas Series. Syntax: s.apply(func, convert_dtype=True, args=())
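A brief usage sketch:

import pandas as pd

s = pd.Series([1, 2, 3])
squared = s.apply(lambda x: x ** 2)   # applies the function to every value
print(squared.tolist())               # [1, 4, 9]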

21. How do you get the frequency of a categorical column of a dataframe using python?

Basic Python

Using df['column'].value_counts(), where df is the dataframe. The value_counts() function returns the counts of the distinct elements in a dataframe column, sorted in descending order by default.

22. Will range(5) include '5' in its output?

Basic Python

The range() function in Python always excludes the stop value from the result. Here it will generate a numeric series from 0 to (5-1)=4, so it will not include 5.

23. How can you drop a column in python?

Basic Python

Pandas 'drop()' method is used to remove specific rows and columns. To drop a column, the
parameter 'axis' should be set as 'axis = 1'. This parameter determines whether to drop labels
from the columns or rows (index). The default behaviour is axis=0. Syntax:
df.drop('column_name', axis=1)

24. How do NaN values behave when compared with themselves?

Basic Python

A NaN value never compares equal to itself (NaN == NaN evaluates to False). That's why checking whether a variable is equal to itself is a popular way to look for NaN values: if it isn't, it's most likely a NaN value.
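A quick illustration of this behaviour:

x = float("nan")
print(x == x)        # False - NaN is never equal to itself

def is_nan(value):
    return value != value   # True only for NaN

print(is_nan(x))     # True
print(is_nan(3.0))   # False
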
25. How can we convert a python series object into a dataframe?

Basic Python

to_frame() is a function that converts a Series object into a DataFrame. Syntax: Series.to_frame(name=None), where name substitutes the existing Series name when creating the DataFrame column.

26. How do you read a file without using Pandas?

Basic Python

27. Can you plot 3D plots using matplotlib? Describe the function.

Intermediate Python

Yes. For example:

import numpy as np
import matplotlib.pyplot as plt

fig = plt.figure()
ax = plt.axes(projection='3d')

28. How is get_dummies() different from OneHotEncoder?

Intermediate Python

In older versions of scikit-learn, OneHotEncoder could not process string values directly; if your nominal features were strings, you first needed to map them to integers. pandas.get_dummies is roughly the opposite: by default, it only converts string (object/categorical) columns into a one-hot representation, unless specific columns are passed.

29. Name a tool that can be used to convert categorical columns into a numeric column.

Intermediate Python

Two of the most popular tools are LabelEncoder and OneHotEncoder, both provided as part of the sklearn library. LabelEncoder can be used to transform categorical data into integers:

from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
x = ['Apple', 'Orange', 'Apple', 'Pear']
y = label_encoder.fit_transform(x)
print(y)   # [0 1 0 2]

OneHotEncoder can be used to transform categorical data into a one-hot encoded array:

from sklearn.preprocessing import OneHotEncoder

onehot_encoder = OneHotEncoder(sparse=False)
y = y.reshape(len(y), 1)
onehot_encoded = onehot_encoder.fit_transform(y)
print(onehot_encoded)

30. How will you remove duplicate data from a dataframe?

Intermediate Python

The drop_duplicates() function in pandas eliminates redundant rows from the DataFrame and returns the result. Syntax: DataFrame.drop_duplicates(subset=None, keep='first', inplace=False). subset: takes a column or list of column labels; the default value is None. When columns are passed, only they are considered when identifying duplicates. keep: controls which duplicates to keep. It has three possible values ('first', 'last', False) and the default is 'first'.

31. How do you select a sample of dataframe?

Intermediate Python

Depending on the situation, there are a few possible ways to select a sample from the dataframe:
1. Randomly select a single row: df = df.sample()
2. Randomly select a specified number of rows, e.g. n=3: df = df.sample(n=3)
3. Allow the same row to be selected more than once: df = df.sample(n=3, replace=True)
4. Randomly select a specified fraction of the total number of rows: df = df.sample(frac=0.50)

32. How does the groupby function work in Python?

Intermediate Python

Pandas DataFrame.groupby() is used to split the data into groups based on some criteria. pandas objects can be split on any of their axes.
Syntax: DataFrame.groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, squeeze=False, **kwargs)
by: mapping, function, str, or iterable
axis: int, default 0
level: if the axis is a MultiIndex (hierarchical), group by a particular level or levels
as_index: for aggregated output, return an object with the group labels as the index; only relevant for DataFrame input
sort: sort group keys. Turning this off can give better performance. Note that this does not influence the order of observations within each group; groupby preserves the order of rows within each group
group_keys: when calling apply, add group keys to the index to identify pieces
squeeze: reduce the dimensionality of the return type if possible, otherwise return a consistent type
Returns: a GroupBy object
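A short usage sketch with a hypothetical dataframe:

import pandas as pd

df = pd.DataFrame({
    "team": ["A", "A", "B", "B"],
    "points": [10, 12, 7, 9],
})

# Split the rows by team, then aggregate each group
print(df.groupby("team")["points"].mean())
# team
# A    11.0
# B     8.0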

33. How do you check the distribution of data in python?

Intermediate Python

A simple and commonly used plot to quickly check the distribution of a sample of data is the histogram:

from matplotlib import pyplot
pyplot.hist(data)

34. Which libraries in SciPy have you worked with in your project?

Intermediate Python

SciPy contains modules for optimization, linear algebra, integration, interpolation, special functions, FFT, signal and image processing, ODE solvers, etc. Subpackages include: scipy.cluster, scipy.constants, scipy.fftpack, scipy.integrate, scipy.interpolate, scipy.linalg, scipy.io, scipy.ndimage, scipy.odr, scipy.optimize, scipy.signal, scipy.sparse, scipy.spatial, scipy.special, and scipy.stats.

35. How is the Python series different from a single column dataframe?

Intermediate Python

A pandas Series is the data structure for a single column of a DataFrame, not only conceptually but literally: the data in a DataFrame is actually stored in memory as a collection of Series. A Series is a one-dimensional object that can hold any data type such as integers, floats, and strings, and it has at most a single name rather than column headers, whereas a single-column DataFrame has a column name.

36. What does the function zip() do?

Intermediate Python

The zip() function takes iterables (zero or more), aggregates their elements into tuples, and returns an iterator of those tuples. Syntax: zip(*iterables)
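A short illustration:

names = ["a", "b", "c"]
scores = [1, 2, 3]

pairs = list(zip(names, scores))
print(pairs)   # [('a', 1), ('b', 2), ('c', 3)]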

37. Can a lambda function be used within a user-defined function?


Intermediate Python

Yes. A lambda function evaluates an expression for a given argument. It can be used as an
anonymous function within another function.

38. What does [::-1] do in python?

Intermediate Python

[::] produces a copy of all the elements in order, while [::-1] produces a copy of all the elements in reverse order.

39. How do you check missing values in a dataframe using python?

Intermediate Python

The pandas isnull() function detects missing values in the given object. It returns a boolean object of the same size indicating whether the values are NA: missing values are mapped to True and non-missing values are mapped to False.

40. Explain a scenario where negative indices are used in python

Intermediate Python

Python supports negative indexing of sequences, something that is not available in most other programming languages. Negative indexing starts from the end of the sequence: index -1 gives the last element and -2 gives the second-to-last element. This is handy, for example, for grabbing the last few elements of a list (lst[-3:]) or reversing a sequence (lst[::-1]) without knowing its length.

41. Python or R, which one would you prefer for text analytics?

Intermediate Python

42. What different methods can be used to standardize data using Python?
Intermediate Python

- Min-Max Scaler
- Standard Scaler
- Max-Abs Scaler
- Robust Scaler
- Quantile Transformer Scaler
- Power Transformer Scaler
- Unit Vector Scaler
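A minimal sketch of two of these scalers using scikit-learn; the data is illustrative:

import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

X = np.array([[1.0], [2.0], [3.0], [10.0]])

standardized = StandardScaler().fit_transform(X)   # zero mean, unit variance
min_maxed = MinMaxScaler().fit_transform(X)        # scaled to the [0, 1] range

print(standardized.ravel())
print(min_maxed.ravel())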

43. How would you define a block in Python?

Intermediate Python

A block is a group of statements in a program or script. Usually it consists of at least one statement, plus any declarations the block needs, depending on the programming or scripting language. A language that allows grouping statements into blocks is called a block-structured language. In Python, blocks are defined by indentation rather than braces.

44. How do you do Up-sampling of data? Name a python function or explain the code.

Intermediate Python

Up-sampling is the process of randomly duplicating observations from the minority class in
order to reinforce its signal. There are several heuristics for doing so, but the most common
way is to simply resample with replacement. Module for resampling in Python: from
sklearn.utils import resample
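A brief sketch of up-sampling with sklearn.utils.resample, using a small illustrative dataframe with an imbalanced binary 'target' column:

import pandas as pd
from sklearn.utils import resample

# Small illustrative dataframe with an imbalanced binary target
df = pd.DataFrame({
    "feature": range(10),
    "target":  [0] * 8 + [1] * 2,
})

majority = df[df["target"] == 0]
minority = df[df["target"] == 1]

# Randomly duplicate minority rows (sampling with replacement) up to the majority size
minority_upsampled = resample(
    minority, replace=True, n_samples=len(majority), random_state=42
)

df_balanced = pd.concat([majority, minority_upsampled])
print(df_balanced["target"].value_counts())   # 8 of each class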

45. What is machine learning?

Basic Machine Learning

Machine learning is a branch of artificial intelligence (AI) that focuses on the use of data and
algorithms to mimic the way that humans learn. It aims to gradually improve by learning from
the events that happened in the past (data captured in past), assuming that the past data is a
good representation of the future. There are various machine learning algorithms available to
build a model that can learn the hidden patterns from the past data, known as training data,
in order to make predictions for the future data or the unseen data, based on which
decisions can be taken. For example: Predicting the prices of a house based on attributes of
the property.

46. Machine learning helps in summarising the patterns in the data in a mathematically
precise way. What exactly is the mathematical outcome of any (machine learning) model
building exercise?

Intermediate Machine Learning

Machine learning models take data as input to find the hidden patterns in it and try to
summarize the patterns that exist in the data by establishing a relationship between the
predictors and the predicted values in a mathematically precise way. The mathematical
outcome of a model can be as simple as an equation that relates the predictors to the target
variable. For example, the relationship between salary and years of experience of an
individual.

47. Machine learning automates the process of building mathematical models out of data.
Explain/elaborate on this statement in the light of the linear regression algorithm.

Advanced Machine Learning

Linear regression is a linear model which tries to fit the best fit line through the data and
establish the relationship between the independent variables and the dependent variable in
a form of a linear equation. The equation of the best fit line can be given as: Y = ax1 + bx2 +
c Where a and b are the coefficients of x1 and x2 variables respectively and c is the
constant. The linear regression tries to fit the line in such a way that the errors are
minimized, that is, the predicted values are closer to the observed values. The machine-
learning algorithm of linear regression automates the process of model building i.e it
automatically finds the best fit line which has the minimum error or predicts the values that
are closest to the observed values. This means that the process of finding the relationship
between independent variables and the dependent variable is automated.

48. If your model performs very well on the data it was trained on but not on data it has not seen before, how will you address that performance gap? Why is it important to address that gap?

Intermediate Machine Learning

Data generally contains information as well as noise. When we fit a model on the training
data, it learns both the information and noise. If the model learns too much noise and fails to
capture the required information then we see that there is a performance gap between the
training performance and the performance on the unseen data (test set). This performance
gap indicates that the model is overfitting, i.e. failing to replicate the performance of the
training set on the test set. To address this performance gap between the training and the
test set various regularization techniques can be applied. In linear models like linear
regression, regularization techniques like ridge regression and lasso regression can be used.
In non-linear models like decision trees, pruning techniques such as pre-pruning and post-pruning can be used to deal with the performance gap. Also, the technique of
cross-validation can be implemented to determine the performance of the model on the
unseen dataset.

49. When a model gets to production, it will have to make predictions on data that it has not seen before. How can we ensure that the model performs well on this data?

Advanced Machine Learning

Before sending the model to production, we can check the performance and validity of the model using the following methods.
Train-validation split: In this method, we divide the training set into two parts; one part is kept for training and the other is kept for validating model performance. We train the model on the training set and test it against the validation set. Based on the performance of the model on the validation set, we tune the hyperparameters of the model to get a generalized model with good performance.
K-fold cross-validation: In this method, we divide the training set into k folds, where k can be any number ranging from 2 up to the number of records in the dataset (generally 10 folds are preferred). If we set k to 5, then in each iteration 4 folds are used for training the model and the left-out fold is used as a test set. The same procedure is repeated for all the folds, i.e. each fold is used once as the test set. To determine the model performance, the average of the metric across all folds is taken. With this method, we can be more confident about the model's performance because it has been tested across several different subsets of the data.
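A brief sketch of k-fold cross-validation with scikit-learn; the model and dataset here are only illustrative:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(random_state=42)

# 5-fold cross-validation: each fold is used once as the held-out test set
scores = cross_val_score(model, X, y, cv=5)
print(scores)          # accuracy per fold
print(scores.mean())   # average performance across folds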

50. What is supervised learning?

Basic Machine Learning

Supervised learning is a type of machine learning method in which algorithms are trained
using well "labeled" training data, that is independent variables are already tagged against a
defined target variable. With this technique, we can make predictions and compare them
against the ground truth. For example, Determining if a client might default on a loan or not.

51. What is unsupervised learning?

Basic Machine Learning


Unsupervised learning is a type of machine learning method in which models are trained
using an unlabeled dataset i.e there is no defined target variable against the independent
variables. Since there is no defined target, there is no specific way to compare model
performance in most unsupervised learning methods. Hence, unsupervised learning
algorithms generally perform the task by clustering the dataset into groups according to
certain measures of similarities. For example, Advertising companies segment the population
into smaller groups with similar demographics and purchasing habits to reach their target
market with relevant ads.

52. How do we measure performance of a supervised learning model?

Intermediate Machine Learning

There are several performance metrics available to measure the performance of a supervised learning model. In the case of a regression problem, some of the available metrics are R2, adjusted R2, RMSE, MAE, etc. In the case of a classification problem, some of the available metrics are accuracy, precision, recall, F1-score, etc.
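A small sketch computing a few of these metrics with scikit-learn; the values are illustrative:

from sklearn.metrics import accuracy_score, f1_score, mean_absolute_error, r2_score

# Regression example
y_true_reg = [3.0, 5.0, 7.5]
y_pred_reg = [2.8, 5.4, 7.0]
print(r2_score(y_true_reg, y_pred_reg))
print(mean_absolute_error(y_true_reg, y_pred_reg))

# Classification example
y_true_clf = [0, 1, 1, 0, 1]
y_pred_clf = [0, 1, 0, 0, 1]
print(accuracy_score(y_true_clf, y_pred_clf))   # 0.8
print(f1_score(y_true_clf, y_pred_clf))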

53. How do we measure performance of an unsupervised learning model?

Intermediate Machine Learning

There are several performance metrics available to measure the performance of an unsupervised learning model, such as the silhouette score, the cophenetic correlation coefficient, etc.

54. What is the difference between correlation and multicollinearity?

Advanced Machine Learning

Correlation is a statistical measure that expresses the strength of a linear relationship between two quantitative variables. A correlation can be positive or negative. In a positive
correlation, the two variables move in the same direction i.e. when one variable increases,
the other variable also increases, and vice versa. Whereas in a negative correlation, the two
variables move in the opposite direction i.e. when one variable increases, the other variable
will decrease, and vice versa. Correlation gives a sense of the relationship between two
variables, known as pair-wise correlation. When two or more variables have a strong linear
relationship they are said to be multicollinear. Multicollinearity is a challenge in linear models
because when two or more independent variables display high correlation the model is not
able to distinguish between the individual effects of the independent variables on the
dependent variable. Multicollinearity can be detected using the Variance Inflation Factor
(VIF).

55. How does multicollinearity affect the performance of a linear regression model?

Intermediate Machine Learning

Multicollinearity doesn't affect the predictive performance of a linear regression model; it only affects the interpretation of the model. Multicollinearity is a challenge in linear
regression because when two or more independent variables display high correlation the
model is not able to distinguish between the individual effects of the independent variables
on the dependent variable.

56. Which evaluation metric should you use to evaluate a linear regression model built on
a dataset that has a lot of outliers in it?

Intermediate Machine Learning

MAE would be a good metric in that case because it is most robust to outliers. MSE or
RMSE is extremely sensitive to outliers and penalizes the outliers more.

57. What is the difference between r-squared and adjusted r-squared?

Intermediate Machine Learning

R-squared (R2) is a statistical measure that represents the proportion of the variance that is
explained in the dependent variable by the independent variables. For example, if the R2 of a
model is 0.70, then 70% of the variation can be explained by the model's inputs. Adjusted R-
squared is a modified version of R-squared that has been adjusted for the number of
independent variables in the model and penalizes the model performance for adding
variables that do not improve the existing model. If we add a new independent variable in
the model, the R2 of the model will always increase. However, the adjusted R-squared
increases only when the new independent variable improves the model more than expected
by chance. It decreases when the independent variable improves the model by less than
expected.

58. How will you explain Decision Tree to a non-tech person?


Advanced Machine Learning

A decision tree can be considered as an inverted tree representation that grows from top to
bottom instead of bottom to top. It tries to mimic the human decision-making process and
tries to represent all the possible solutions to a decision based on certain conditions. For
example, if you have to decide whether or not to go out for a coffee at a nearby place, a simple decision tree could look like this: start with the main question, "To go out for coffee?" The decision to go out depends on the location of the place, so the second question becomes "Is the place nearby?" If 'yes', then go for coffee; otherwise don't.

59. Why are decision trees prone to overfitting?

Intermediate Machine Learning

The main aim of the decision tree is to achieve homogeneity among the leaf nodes i.e any
split made by the decision tree should result in pure leaves which contain one type of
decision only. For example, If we are trying to predict whether a person will default on a loan
or not and we use the decision tree to make this prediction then the result from the decision
tree split must result in all the defaulters in one leaf and all the non-defaulters on another
leaf node. If the composition of the leaf node is 50% defaulters and 50% non-defaulters then
the leaf is considered completely impure. If a decision tree is built without any restrictions
the tree will grow to its full length and will try to achieve homogeneity by capturing complex
patterns as well as noise present in the data during this process. Due to this, it ends up
learning all the patterns that are present in the training data but fails to replicate the
performance on unseen data, i.e. it leads to overfitting.

60. How can you improve the performance of an overfitting Decision Tree model?

Advanced Machine Learning

To avoid overfitting in decision trees and get a generalized model which performs well on
training as well as the test set we can use Pruning techniques. There are two ways to prune a
decision tree: a) Pre-Pruning: In this method, the decision tree is restricted before it can grow
to its full length by bounding the depth of the tree. There are several other hyperparameters
that are available in the SKlearn implementation of the Decision tree which help in restricting
the growth of the tree. This method is also known as the early stopping of tree. b) Post-
Pruning: In this method, the tree is allowed to grow to its full length and then the sub-trees
of the decision tree are pruned. The sub-trees that are pruned in this process are the ones
that do not provide any significant information to the model. The significance of the sub-tree
is calculated by removing it and checking the error between the full-grown tree and the tree
from which the sub-tree was removed. If the error is large, it signifies that the removed sub-tree is important for prediction; if the error is small, it signifies that the sub-tree does not contribute much to the prediction.
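A short sketch of both approaches using scikit-learn's DecisionTreeClassifier; the dataset and hyperparameter values are illustrative (ccp_alpha enables scikit-learn's cost-complexity post-pruning):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Pre-pruning: restrict the tree's growth before training
pre_pruned = DecisionTreeClassifier(max_depth=4, min_samples_leaf=10, random_state=42)
pre_pruned.fit(X_train, y_train)

# Post-pruning: grow the tree fully, then prune via cost-complexity pruning (ccp_alpha)
post_pruned = DecisionTreeClassifier(ccp_alpha=0.01, random_state=42)
post_pruned.fit(X_train, y_train)

print(pre_pruned.score(X_test, y_test), post_pruned.score(X_test, y_test))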

61. How is a random forest model different from just using 'n' decision trees?

Advanced Machine Learning

Let’s say we build ‘n’ decision trees and a Random Forest model with ‘n’ decision tree
estimators. The Random Forest model will be different from the ‘n’ decision trees because it
will employ bootstrapping of rows along with random sampling of columns. Each decision tree in the random forest model will be built on a different dataset because the rows are sampled with replacement and a random subset of the features is considered. The final output of the random forest will be decided on
the basis of voting or averaging of the results from ‘n’ decision tree estimators built in the
random forest thereby making the prediction more robust. Whereas if we train ‘n’ decision
the outcome will be the same because the underlying training data for each of the decision
trees is the same.

62. How is AUC different from ROC?

Intermediate Machine Learning

The ROC curve (receiver operating characteristic curve) is a curve showing the performance
of a classification model at different thresholds. This curve plots two parameters, False
Positive Rate (FPR) on the x-axis and True Positive Rate (TPR) on the y-axis. AUC stands for
"Area under the ROC Curve" i.e, AUC measures the entire area under the ROC curve. These
two metrics are typically used together to check the performance of a binary classification
problem.

63. How are k-means and hierarchical clustering different?

Intermediate Machine Learning

K-Means is a centroid-based clustering algorithm, whereas hierarchical clustering is a connectivity-based clustering algorithm. In centroid-based clustering methods, the idea of
similarity is defined as the closeness of data point from the center of the cluster whereas, in
connectivity clustering methods, the idea of similarity is defined as the closeness of data
points with each other. There are several differences between K-Means and Hierarchical
Clustering like: 1. K-Means uses a pre-defined number of clusters that is before starting to
cluster the data points we have to mention the number of clusters. Whereas in Hierarchical
clustering all the data points are considered as separate clusters and there is no requirement
for mentioning the number of clusters beforehand. 2. K-Means clustering uses the mean (or median, in some variants) of the points to find the centroid of a cluster, whereas different linkage methods like ward, single, etc. can be used to measure the similarity between two or more clusters in
hierarchical clustering. 3. The computation complexity is higher for hierarchical clustering for
larger datasets whereas K-Means clustering is computationally less expensive for larger
datasets. 4. K Means clustering starts with a random choice of clusters, the results produced
by running the algorithm many times may differ. Whereas the results of hierarchical
clustering are reproducible.

64. How would you identify the optimal number of clusters in your dataset?

Advanced Machine Learning

The most common method to identify the optimal number of clusters in K-Means clustering
is the elbow method. In the elbow method, we iterate over a range of K values i.e number of
clusters, and for each value of K within-cluster sum of squares (WCSS) is calculated that is
the distance between each point and the centroid in a cluster. When we plot the WCSS with
the number of clusters or K value, the plot looks like an Elbow because as the number of
clusters increases, the WCSS value will start to decrease. The K value is chosen where a
rapid decrease in the WCSS is observed or the point, where the line in the plot starts to
move almost parallel to the X-axis. The K value corresponding to this point is the optimal
number of clusters.
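A compact sketch of the elbow method using scikit-learn's KMeans on synthetic data:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

wcss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wcss.append(km.inertia_)   # within-cluster sum of squares (WCSS) for this k

# Plotting WCSS against k gives the elbow plot; here the elbow should appear near k = 4
for k, value in zip(range(1, 11), wcss):
    print(k, round(value, 1))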

65. Why is it important to understand the bias variance trade-off when applying data
science?

Advanced Machine Learning

It is important to understand the bias-variance trade-off because a model high on the bias
fails to identify the underlying patterns on the training data which leads to the creation of a
simple model that fails to perform well on the training set as well as the test set leading to
high errors on training and test sets or underfitting. Whereas a model high on the variance
will be too complex and learn all the patterns as well the noise on the training set perfectly
but will fail to replicate the same performance on the test set leading to high errors on the
test set or overfitting. To avoid such issues, it is important to understand the trade-off
between bias and variance while working on a business problem and come up with an
optimal solution that maintains a balance between bias and variance so that model is neither
underfitting nor overfitting but is a good fit.

66. What is an activation function, and why does a neural network need one?

Basic Deep Learning


Activation Functions are mathematical functions that apply a transformation on the output
of a layer in a neural network, which generally tends to be a linear combination of the nodes
of the previous layer with weights and biases. Activation Functions are crucial because they
introduce non-linearity into the neural network - without this, a neural net is simply a large
linear combination of its nodes, and hence, no more powerful than a linear regressor or
classifier. Neural networks are needed to find patterns and draw decision boundaries in
problems that can be highly complex and non-linear, and this makes Activation Functions
extremely important to their functioning. Some examples of Activation Functions are the
Sigmoid function, the Tanh function, and the ReLU function.

67. Why is the Sigmoid activation function not preferred in hidden layers of deep neural
networks?

Basic Deep Learning


The Sigmoid function takes in any real number and outputs a continuous numeric value
between 0 and 1, which can then be discretized using a threshold (Ex: 0.5) and converted
into either 0 or 1 - hence its use as a binary classifier. Therefore, the Sigmoid function is
generally preferred in the output layer of a binary classification neural network. It is not
recommended to use it in the hidden layers because of the vanishing gradient problem i.e, if
your input is on the higher side in terms of magnitude (where the sigmoid function goes flat),
then the gradient will be close to zero. Due to the calculus of the chain rule of derivatives
used in backpropagation, this would result in multiple small values being multiplied with each
other to determine the final step size in gradient descent, and that would be an extremely
small step, meaning the neural network's learning speed would be negligible. Hence, we do
not prefer using the Sigmoid function in the hidden layers of deep neural networks.

68. Why is it not a good idea to use the Sigmoid function in the output layer of a neural
network meant for multi-class classification problems?
Basic Deep Learning


The Sigmoid function merely outputs the probability / likelihood of that option being correct,
without taking into account the other options in a multi-class problem, and the fact that the
probabilities of all the multiple classes should add up to 1. This is actually done by the
Softmax activation function, which is a generalized version of the Sigmoid for multi-class
problems. Hence, we usually use the Softmax function in the output layer of a neural
network when dealing with multi-class classification, so that we can get the output in a
probabilistic shape taking all the options into account, and not just one.

69. What are the potential pitfalls of using neural networks for supervised learning?

Basic Deep Learning


The first problem with traditional fully connected neural networks is that they are very
computationally intensive, so they may take significantly longer to train and come up with
predictions than a more traditional machine learning algorithm, due to their vast number of
parameters and their hierarchical non-linear complexity - especially in deep neural networks.
This drawback means that naturally, neural networks would need to significantly outperform
a competing ML model in terms of the evaluation metrics for us to even consider using them
for supervised learning - and this tends to happen only once we cross a certain threshold in
terms of the volume of training data, usually in the order of millions of training examples.
Hence neural networks should not be used on smaller or intermediate sized training datasets
in supervised learning problems, because an ML model would likely perform as well or better
at a fraction of the compute cost with that size of data. Another problem with neural
networks is their black-box nature - we often don't know how or why the NN came up with a
certain output. Since its internal working is often not interpretable, it is often out of the
question to consider using neural networks in sensitive use cases where the explainability of
a model is paramount, such as healthcare or criminal justice. These are the potential pitfalls
of using neural networks that one should keep in mind before applying them to supervised
learning problems.

70. What hyperparameters can you tune inside neural networks?

Basic Deep Learning

The architecture of a neural network, in terms of the number of neurons, the number of
layers and the activation function at various layers, is the first obvious set of
hyperparameters that can be tuned. The learning characteristics of the network, such as its
learning rate, the number of epochs and the batch size, are also an important set of
hyperparameters which can be tuned to improve the network's performance. There are smaller
and more nuanced hyperparameters that can also help in fine-tuning the neural net, such as
momentum parameters, a decay in the learning rate, the dropout ratio, the weight
initialization scheme and the batch normalization hyperparameters.

71. What are the pros and cons of using Batch Gradient Descent vs Stochastic Gradient
Descent?

Intermediate Deep Learning


Batch Gradient Descent suffers from computational cost, especially for larger datasets,
because it accepts the entire training dataset as one batch. This means each epoch will take a
long time to complete. So in case of a large training dataset, Stochastic Gradient Descent
may be preferred. However, the convergence characteristics of Batch Gradient Descent are
better - it converges directly to a minima, whereas Stochastic Gradient Descent will oscillate
in the near vicinity of the minima without properly reaching it, although Stochastic Gradient
Descent does converge and reach that point faster. Stochastic Gradient Descent also shows
very noisy learning characteristics, due to the variability between each training example
used. Another drawback of Stochastic Gradient Descent is that since we use only one
example at a time, we lose the compute advantage of vectorized implementation on it. So
Batch Gradient Descent is generally preferred for smaller datasets, while Stochastic Gradient
Descent is used for larger datasets. However due to the significant drawbacks of each
approach, a compromise called Mini-Batch Gradient Descent is often preferred among vanilla
optimization algorithms that don't use momentum or adaptive gradient, albeit with the cost
of an additional hyperparameter to tune, which is the mini-batch size.

72. Is the bias-variance tradeoff in Machine Learning applicable to Deep Neural Networks?
Why do you say so?

Advanced Deep Learning


The biggest advantage of neural networks is that unlike traditional machine learning
algorithms, they appear to have no limit to the sheer complexity of the decision boundaries
they can create. This means that although they are data hungry, when they are actually
provided with larger and larger volumes of data, their performance tends to continually
improve when the number of nodes and layers in the network is increased, as opposed to
machine learning algorithms, whose performance tends to stagnate beyond a point even
after access to larger amounts of data. All of this means that the traditional bias-variance
tradeoff seen in machine learning may not strictly be applicable in deep learning; neural
networks merely appear to move to a new stage of the tradeoff when the volume of data
and the complexity of the neural network are correspondingly increased.

73. Let's say you have two neural networks. One of them has one hidden layer with sixteen
nodes, while the other has four hidden layers with four nodes each, so they both have
sixteen neurons, just in different configurations. Which of these is likely to perform better
on a complex supervised learning task and why?

Intermediate Deep Learning


Although the width (number of neurons in a layer) and depth (number of layers) of neural
networks are both important factors in determining its performance, complex supervised
learning tasks such as classifying a picture as a dog / cat appear to be best solved by
introducing a hierarchy in the neural network, that can progressively learn more and more
complex patterns in the data. In such an example, the second network, with four layers of
four nodes each, would be likely to perform better on the task than the first network, since it
has multiple layers and hence provides the network with a hierarchical mode of learning,
where the deeper layers may be able to understand more complex shapes and patterns in
the data. The depth of the neural network seems to increase its ability to learn complex
representations of the data more than its width - Ex: Some of the most famous neural
networks like GPT-3 have nearly a hundred layers.

74. How different is the decision boundary created by a neural network in comparison to
other non-linear ML algorithms such as Decision Trees and Random Forests? Which of
these techniques can create the most flexible non-linear decision boundary and why?

Advanced Deep Learning


Neural networks can create the most complex decision boundaries out of all the alternatives
listed, due to their hierarchical nature of complexity and the fact that each node or layer
added in the network increases the flexibility of the model. Although Decision Trees,
Random Forests and Neural Networks are all non-linear approaches, the nature of the non-
linearity in the decision boundary differs among them. Decision Trees create "piecewise"
non-linearity - they create orthogonal / linear splits on every individual feature and create
rectangular boundaries based on that. This approach is more flexible than linear, but perhaps
not as flexible as a curved non-linear boundary. Random Forests attempt to aggregate
multiple trees and hence approach a curved boundary by combining multiple linear splits, but
they still only approximate curved non-linearity and don't actually accomplish it. Neural
networks do, however, create curved non-linear decision boundaries because they combine
multiple linear nodes and apply non-linear transformations in the form of activation
functions at each layer, and that level of flexibility in creating curved non-linearity is
unrivalled by any other machine learning algorithm.

75. What would be a good use case for implementing fully connected or other kinds of
neural networks for supervised learning over other ML models and why?

Intermediate Deep Learning


The use case for neural networks in supervised learning should ideally be in those scenarios
where traditional machine learning algorithms are known to fail or be inadequate for solving
the problem. This could be for highly unstructured kinds of data such as images, text or
audio, where the algorithm itself has to extract the features relevant to the prediction from
the dataset, and hence a traditional machine learning approach wouldn't work. Another use
case for neural networks is when the size of the dataset is quite large, and we would like that
increased dataset size to translate to improved pattern detection by the model. So when we
have an extremely large dataset (in the order of millions of examples) or unstructured data,
neural networks may be preferred over ML models.

76. Would applying a neural network make sense in a healthcare setting where we need to
predict the diagnosis and medication to offer a patient based on the symptoms displayed?
Why do you think so?

Intermediate Deep Learning


No, neural networks should ideally not be applied for any use case where the interpretability
or explainability of a model's decision making is paramount. Healthcare is a highly sensitive
domain, where decision making around diagnosis and medication for symptoms can make a
huge difference to the health condition of the patient, and medical practitioners cannot
afford to make mistakes in that process. Hence, the model used needs absolute transparency
rather than top performance which is not explainable, and neural networks would not be as
preferred for a healthcare use-case as decision trees or random forests.

77. What is the role of the Convolution operation in helping a neural network understand
images?

Basic Deep Learning


Convolution is a mathematical operation which takes two inputs such as image matrix and a
filter or kernel. It is the first layer to extract features from an input image in a CNN.
Convolution helps to retain the relationship between pixels by learning image features using
small squares of input data. The way the convolution operation mathematically works is by
using the dot product of the filter vector and pixel vector to replace the image pixels with
new values (modified image), and these dot product values are higher when the pattern of
the filter matches the pattern of the pixels. Hence, convolution excels at detecting patterns
and features in the image that match the patterns of the filters, and this is how feature
extraction is performed on the image.

78. Why do we mostly use the ReLU activation function in the feature extraction stage of
convolutional neural networks (CNNs)?

Intermediate Deep Learning


ReLU has the advantage of being simple to compute and also avoiding the vanishing
gradient problem, due to its constant derivative of 1. This is useful in CNNs which are deep
networks, as the error from backpropagation is easily propagated for the neural network's
learning.

79. What is the role of fully connected layers in a CNN?

Intermediate Deep Learning


The output from the convolutional layers represents high-level features in the data. While
that output could be flattened and connected to the output layer, adding a fully-connected
layer is a (usually) cheap way of learning non-linear combinations of these features.
Essentially the convolutional layers are providing a meaningful, low-dimensional, and
somewhat invariant feature space, and the fully-connected layer is learning a (possibly non-
linear) function in that space.

80. What are some drawbacks of using Convolutional Neural Networks on image datasets,
and how can they be addressed?

Intermediate Deep Learning


Although CNNs are optimized to work on image data and perform better and more
efficiently on images than fully connected neural networks, they still suffer from some
drawbacks which should be kept in mind. CNNs require quite a lot of labelled image data in
order to reach near-human levels of performance in image related tasks, and such data may
not readily be available. In that case, it may be better to use Transfer Learning to import the
weights and architecture of a pre-trained model and only fine tune its last few layers to apply
it to the problem at hand. CNNs may also be susceptible to spurious patterns in the data
(such as the sky always being present in car images - so it wrongly learns that having a sky is
important to classify something as a car), and this susceptibility can be resolved by
diversifying the training dataset to ensure nothing else about the images is consistent other
than the exact pattern we want the CNN to learn. CNNs can also be susceptible to small
perturbations in the dataset, for example: not being rotationally invariant, and this problem
should be addressed through the technique of data augmentation through various image
modification techniques such as flipping, rotation, cropping, mirroring, color modification etc.

81. Why is text pre-processing an essential part of NLP? What happens if we fail to pre-
process text data?

Basic Deep Learning


Text preprocessing helps us get rid of the unhelpful parts of the data, or noise, by converting all characters to lowercase, removing punctuation marks, and removing stop words and typos. Removing noise comes in handy when you want to do text analysis on pieces of data like comments or tweets, as it strips out the text that interferes with the analysis. If the text is not pre-processed, you may receive errors or your model will not perform as expected.
82. In case you're working on an NLP application such as sentiment analysis of Twitter
posts, describe the text pre-processing steps that would most likely be required?

Intermediate Deep Learning


1] Lowercasing for consistency.
2] Stemming to reduce words to their root form.
3] Lemmatization to map words to their dictionary root form.
4] Stop-word removal, because stop words carry low information and don't contribute to the sentiment.
5] Noise removal, including digits, hashtags (since these are Twitter posts), and special characters.
6] Removing emoticons (again, since these are Twitter posts), because they are noise.

83. Which evaluation metric is suitable to measure the performance of sentiment analysis
and why?

Intermediate Deep Learning


Sentiment analysis is a classification problem, thus, it uses the metrics of Precision, Recall, F-
score, and Accuracy. Also, average measures like macro, micro, and weighted F1 scores are
useful for multi-class problems. Accuracy is used when the True Positives and True negatives
are more important while F1-score is used when the False Negatives and False Positives are
crucial. F1 scores also are helpful when there is a lot of class imbalance. As sentiment
analysis is a real-time problem, we can expect a lot of class imbalance. Thus, F1 scores are
mostly used.

84. What is the difference between stemming and lemmatization? Could you provide an
example?

Basic Deep Learning


Stemming and Lemmatization both generate the foundation of the inflected words. The
difference is that the stem may not be an actual word, whereas the lemma is an actual
language word. For eg: beautiful and beautifully will be stemmed to beauti which has no
meaning in the English dictionary. The same words are, however, lemmatised to beautiful and beautifully respectively, without changing the meaning of the words.
85. Would you consider Logistic Regression to be a special case of using Neural Networks?
If so, how?

Basic Deep Learning


Yes, logistic regression is a specialized case of a one-node neural network, where we use
the Sigmoid activation function and the cost function being minimized is the Binary Cross-
Entropy function.

86. How do you compare categorical values? How would you know whether a categorical value is related to the target variable?

Basic Advanced Stats

Comparing categorical values: when the predictor has three or more levels/categories and the target variable is nominal, the degree of association between the predictor and the target variable can be measured with statistics such as the chi-squared test.

Checking whether a categorical value is related to the target variable:

- When there is one continuous target variable, one or more categorical independent variables, and no control variable, you can go for ANOVA.

- Similarly, when there is one continuous target variable, only one categorical independent variable (dichotomous, e.g. pass/fail), and no control variable, go for a t-test.

87. What is Linear regression? Explain the assumptions.

Basic Advanced Stats

Linear regression is an analysis that assesses whether one or more predictor variables explain the
dependent (criterion) variable. The regression has five key assumptions:

1) Linear relationship: Linear regression needs the relationship between the independent and dependent
variables to be linear. The linearity assumption can best be tested with scatter plots.

2) Normality: The error terms must be normally distributed. To check normality, one can look at a Q-Q plot, or perform statistical tests of normality such as the Kolmogorov-Smirnov test or the Shapiro-Wilk test.

3) Multicollinearity: Linear regression assumes that there is little or no multicollinearity in the data.
Multicollinearity occurs when the independent variables are too highly correlated with each other.
Multicollinearity may be tested with three central criteria: Correlation matrix, Tolerance, VIF

4) No auto-correlation: Linear regression analysis requires that there is little or no autocorrelation in the
data. Autocorrelation occurs when the residuals are not independent of each other. For instance, this
typically occurs in stock prices, where the price is not independent of the previous price.

5) Homoscedasticity: The error terms must have constant variance. This phenomenon is known as
homoskedasticity. The presence of non-constant variance is referred to as heteroskedasticity.

88. Explain mathematically how Linear Regression works?

Basic Advanced Stats

The idea behind simple linear regression is to "fit" the observations of two variables into a linear
relationship between them. Graphically, the task is to draw the line that is "best-fitting" or "closest" to the
points (x_i,y_i), where x_i and y_i are observations of the two variables which are expected to depend
linearly on each other.

Although many measures of best fit are possible, for most applications the best-fitting line is found using
the method of least squares. The method finds the linear function L which minimizes the sum of the
squares of the errors in the approximations of the y_i by L(x_i)

For example, to find the line y = mx + b of best fit through N points (x_i, y_i), the goal is to choose m and b so as to minimize the sum of the squared differences between the observed y-coordinates and the y-coordinates predicted by the line from the corresponding x-coordinates.
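A minimal numerical sketch of fitting such a line by least squares; the data points are synthetic:

import numpy as np

# Noisy observations around the line y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

# np.polyfit minimizes the sum of squared errors sum((y_i - (m*x_i + b))**2)
m, b = np.polyfit(x, y, deg=1)
print(m, b)   # slope and intercept close to 2 and 1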

89. In your project, why was classification chosen over regression?

Basic Advanced Stats

Classification is used when the output variable is a category such as “red” or “blue”, “spam” or “not spam”.
It is used to draw a conclusion from observed values. Differently from regression which is used when the
output variable is a real or continuous value like “age”, “salary”, etc.

When we must identify the class that the data belongs to, we use classification over regression, for example predicting whether a person is male or female from their name, rather than predicting a continuous quantity about the person.

90. Explain the working of logistic regression?


Basic Advanced Stats


91. Evaluation metrics of regression/classification model?

Basic Advanced Stats


92. Build a credit card fraud detection model

Advanced Advanced Stats

93. Evaluation Metrics (Difference between R-Square and Adjusted R-Square)

Basic Advanced Stats

R-squared (coefficient of determination) measures the proportion of the variation in your dependent
variable (Y) explained by your independent variables (X) for a linear regression model.

R² = Explained variation / Total Variation

Adjusted R-squared adjusts the statistic based on the number of independent variables in the model.

It is possible that R Square has improved significantly yet Adjusted R Square is decreased with the
addition of a new predictor when the newly added variable brings in more complexity than the power to
predict the target variables.

Adj. R² = 1 - ((1 - R²) * (n - 1) / (n - p - 1)), where p is the number of predictors and n is the number of observations.
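
As a small sketch, both quantities can be computed in Python (r2_score comes from scikit-learn; the example values and n_predictors=2 are hypothetical):

from sklearn.metrics import r2_score

def adjusted_r2(y_true, y_pred, n_predictors):
    # Adj. R² = 1 - (1 - R²) * (n - 1) / (n - p - 1)
    r2 = r2_score(y_true, y_pred)
    n = len(y_true)
    return 1 - (1 - r2) * (n - 1) / (n - n_predictors - 1)

print(adjusted_r2([3.0, 5.0, 7.5, 9.0], [2.8, 5.4, 7.0, 9.3], n_predictors=2))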

94. Difference between logistic regression and CART?

Basic Advanced Stats

1. CART works best locally, while logistic regression works best globally.

2. CART is useful for identifying interactions between variables.

3. CART can predict both categorical and quantitative targets, while logistic regression can only predict categorical/ordinal targets.

4. CART is easy to run and interpret.

5. CART can lead to overfitting if splitting is not stopped (i.e. if the tree is not pruned).

6. CART works best with a larger dataset, while logistic regression works well on a smaller dataset.

7. CART is non-parametric, while logistic regression is parametric.

95. What are the limitations of Logistic Regression

Basic Advanced Stats

1. A major limitation of Logistic Regression is the assumption of linearity between the log-odds of the dependent variable and the independent variables.

2. It can only be used to predict discrete outcomes. Hence, the dependent variable of Logistic Regression is bound to a discrete set of classes.

3. Non-linear problems can’t be solved with logistic regression because it has a linear decision surface, and linearly separable data is rarely found in real-world scenarios.

4. Logistic Regression requires little or no multicollinearity between the independent variables.

5. If the number of observations is lesser than the number of features, Logistic Regression should not be
used, otherwise, it may lead to overfitting.

96. Name the library used to implement logistic Regression

Basic Advanced Stats

Python:

from sklearn.linear_model import LogisticRegression

R:

glm(Target ~.,family=binomial(link='logit'),data=train)

97. What is confusion matrix?

Basic Advanced Stats


A Confusion matrix is an N x N matrix used for evaluating the performance of a classification model,
where N is the number of target classes. The matrix compares the actual target values with those
predicted by the machine learning model. This gives us a holistic view of how well our classification model
is performing and what kinds of errors it is making.

True Positive (TP): The actual value was positive and the model predicted a positive value

True Negative (TN): The actual value was negative and the model predicted a negative value

False Positive (FP) – Type 1 error: The actual value was negative but the model predicted a positive value

False Negative (FN) – Type 2 error: The actual value was positive but the model predicted a negative
value
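
A minimal sketch of extracting the four cells with scikit-learn (the label vectors are made up):

from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # actual labels (hypothetical)
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # model predictions (hypothetical)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP:", tp, "TN:", tn, "FP:", fp, "FN:", fn)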

98. What is vif? What is the precision of Vif ?

Basic Advanced Stats

VIF, the Variance Inflation Factor, is used during regression analysis to assess whether certain
independent variables are correlated to each other and the severity of this correlation. If your VIF number
is greater than 10, the included variables are highly correlated to each other. Since the ability to make
precise estimates is important to many companies, generally people aim for a VIF within the range of 1-5.
A cutoff number of 5 is commonly used.
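
A rough sketch of computing a VIF per predictor with statsmodels; the small DataFrame below is hypothetical illustration data:

import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

X = pd.DataFrame({
    "income": [40, 55, 60, 75, 90, 100],
    "spend":  [20, 28, 30, 40, 45, 52],   # deliberately correlated with income
    "age":    [25, 32, 41, 38, 50, 47],
})

X_const = add_constant(X)   # VIF is usually computed with an intercept term included
vif = pd.Series(
    [variance_inflation_factor(X_const.values, i) for i in range(X_const.shape[1])],
    index=X_const.columns,
)
print(vif)   # values above roughly 5-10 indicate problematic collinearity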

99. How do you deal with multi-colinearity and conditional probability?

Intermediate Advanced Stats

Potential solutions to deal with multicollinearity:

- Remove some of the highly correlated independent variables.

- Linearly combine the independent variables, such as adding them together.

- Perform an analysis designed for highly correlated variables, such as principal components analysis or
partial least squares regression.

100. Is logistic regression a part of Linear regression?

Basic Advanced Stats


Logistic regression is not a special case of linear regression, but both are generalized linear models: in logistic regression, the outcome depends on a linear combination (weighted sum) of the inputs and parameters, passed through the logit link function.


101. Write the equation of the linear Regression? Explain residuals?

Basic Advanced Stats

The actual value of the dependent variable is yi.

The predicted value of yi is defined to be y^i = a xi + b, where y = a x + b is the regression equation.

The residual is the error that is not explained by the regression equation:

ei = yi - y^i.

A residual plot plots the residuals on the y-axis vs. the predicted values of the dependent variable on the
x-axis. We would like the residuals to be unbiased: have an average value of zero in any thin vertical strip,
and homoscedastic, which means "same stretch": the spread of the residuals should be the same in any
thin vertical strip.

The residuals are heteroscedastic if they are not homoscedastic.

102. Explain homoscedasticity ?

Intermediate Advanced Stats

The assumption of homoscedasticity (meaning “same variance”) is central to linear regression models.
Homoscedasticity describes a situation in which the error term (that is, the “noise” or random disturbance
in the relationship between the independent variables and the dependent variable) is the same across all
values of the independent variables. Heteroscedasticity (the violation of homoscedasticity) is present when the size of the error term differs across values of an independent variable. The impact of violating the assumption of homoscedasticity is a matter of degree, increasing as the heteroscedasticity increases.

103. Performance measures of linear Regression?

Basic Advanced Stats

Most commonly known evaluation metrics include:

R-squared (R2), which is the proportion of variation in the outcome that is explained by the predictor
variables. In multiple regression models, R2 corresponds to the squared correlation between the observed
outcome values and the predicted values by the model. The Higher the R-squared, the better the model.

Root Mean Squared Error (RMSE), which measures the average error performed by the model in
predicting the outcome for an observation. Mathematically, the RMSE is the square root of the mean
squared error (MSE), which is the average squared difference between the observed actual outcome
values and the values predicted by the model. So, MSE = mean((observeds - predicteds)^2) and RMSE =
sqrt(MSE). The lower the RMSE, the better the model.

Residual Standard Error (RSE), also known as the model sigma, is a variant of the RMSE adjusted for the
number of predictors in the model. The lower the RSE, the better the model. In practice, the difference
between RMSE and RSE is very small, particularly for large multivariate data.

Mean Absolute Error (MAE), like the RMSE, the MAE measures the prediction error. Mathematically, it is
the average absolute difference between observed and predicted outcomes, MAE = mean(abs(observeds
- predicteds)). MAE is less sensitive to outliers compared to RMSE.

Additionally, there are four other important metrics - AIC, AICc, BIC and Mallows Cp

The lower these metrics, the better the model.

AIC stands for Akaike’s Information Criteria. The basic idea of AIC is to penalize the inclusion of additional variables in a model: it adds a penalty that increases the error when additional terms are included. The lower the AIC, the better the model.

AICc is a version of AIC corrected for small sample sizes.

BIC (or Bayesian information criteria) is a variant of AIC with a stronger penalty for including additional
variables to the model.

Mallows Cp: A variant of AIC developed by Colin Mallows.
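
A short sketch of the distance-based metrics using scikit-learn (the observed and predicted values are made up):

import numpy as np
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

y_true = np.array([3.0, 5.0, 7.5, 9.0])   # observed values (hypothetical)
y_pred = np.array([2.8, 5.4, 7.0, 9.3])   # model predictions (hypothetical)

print(r2_score(y_true, y_pred))                      # R-squared
print(np.sqrt(mean_squared_error(y_true, y_pred)))   # RMSE = sqrt(MSE)
print(mean_absolute_error(y_true, y_pred))           # MAE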


104. Explain prior probability, likelihood and marginal likelihood in context of naiveBayes
algorithm?

Basic Advanced Stats

105. Derive logistic regression equation.

Intermediate Advanced Stats

In logistic regression, the output must be a probability between 0 and 1, and, based on a cut-off, the final prediction is 0 or 1. A plain linear equation does not work because its output ranges from -infinity to +infinity, which is why the linear equation is transformed through the sigmoid (logistic) function.

Transformation of the linear regression equation into the logistic regression equation:

1. The linear regression equation is Y = b0 + b1*X, where Y can take any value from -infinity to +infinity.

2. To keep the output positive (eliminating -infinity), take the exponential: e^Y.

3. To keep the output below 1 (eliminating +infinity), divide by (e^Y + 1):

P = e^Y / (e^Y + 1)

This is the sigmoid function, so P always lies between 0 and 1.

4. Define the odds, where P is the probability of success and 1 - P the probability of failure:

odds = P / (1 - P)

5. Substituting P = e^Y / (e^Y + 1):

P / (1 - P) = (e^Y / (e^Y + 1)) / (1 / (e^Y + 1))
            = (e^Y / (e^Y + 1)) * ((e^Y + 1) / 1)
            = e^Y

6. So the odds can be written as P / (1 - P) = e^Y.

7. Taking the natural log of both sides removes the exponential:

log(P / (1 - P)) = Y

log(P / (1 - P)) = b0 + b1*X

i.e. the log-odds (logit) of the probability is a linear function of the predictors.

106. Explain how SVM works.

Intermediate Advanced Stats

A simple linear SVM classifier works by making a straight line between two classes.

That means all of the data points on one side of the line will represent a category and the data points on
the other side of the line will be put into a different category. This means there can be an infinite number
of lines to choose from.

What makes the linear SVM algorithm better than some of the other algorithms, like k-nearest neighbors, is that it chooses the best line to classify your data points: the line that separates the data and is as far away from the closest data points as possible.

A 2-D example helps to make sense of all the machine learning jargon. Basically, you have some data
points on a grid. You're trying to separate these data points by the category they should fit in, but you
don't want to have any data in the wrong category. That means you're trying to find the line between the
two closest points that keeps the other data points separated.

So the two closest data points give you the support vectors you'll use to find that line. That line is called
the decision boundary.

The decision boundary doesn't have to be a line. It's also referred to as a hyperplane because you can find
the decision boundary with any number of features, not just two.

Types of SVMs:

Simple SVM: Typically used for linear regression and classification problems.

Kernel SVM: Has more flexibility for non-linear data because you can add more features to fit a
hyperplane instead of a two-dimensional space.
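
A minimal sketch of both variants with scikit-learn, on a synthetic dataset used purely for illustration:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

linear_svm = SVC(kernel="linear").fit(X_train, y_train)   # simple (linear) SVM
kernel_svm = SVC(kernel="rbf").fit(X_train, y_train)      # kernel SVM for non-linear data

print(linear_svm.score(X_test, y_test), kernel_svm.score(X_test, y_test))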

107. How will you handle class imbalance problem? What are the various approaches?
Intermediate Advanced Stats

Imbalanced data typically refers to a problem with classification problems where the classes are not
represented equally.

Few tactics To Combat Imbalanced Training Data:

- Collect More Data

- Try Changing Your Performance Metric

- Try Resampling Your Dataset

- Try Generating Synthetic Samples (the most popular such algorithm is SMOTE, the Synthetic Minority Over-sampling Technique; see the sketch after this list)

- Try Different Algorithms

- Try Penalized Models
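
A minimal sketch of two of these tactics on a toy imbalanced dataset; the penalized model only needs scikit-learn, while SMOTE assumes the separate imbalanced-learn package is installed:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import SMOTE   # from the imbalanced-learn package

# a roughly 95/5 imbalanced toy dataset
X, y = make_classification(n_samples=1000, weights=[0.95], random_state=0)

# penalized model: class_weight="balanced" up-weights the minority class
clf_weighted = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)

# resampling: generate synthetic minority samples, then fit on the balanced data
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
clf_smote = LogisticRegression(max_iter=1000).fit(X_res, y_res)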

108. Why do we use sigmoid and not any increasing function from 0 to 1?

Intermediate Advanced Stats

The main reason why we use the sigmoid function is that it exists between (0 to 1). Therefore, it is
especially used for models where we have to predict the probability as an output. Since the probability of
anything exists only between the range of 0 and 1, sigmoid is the right choice.

109. What are various evaluation parameters of regression and classification to evaluate
the model?

Intermediate Advanced Stats

Regression evaluation metrics: R-squared (R2), Root Mean Squared Error (RMSE), Residual Standard Error (RSE) and Mean Absolute Error (MAE), plus the information criteria AIC, AICc, BIC and Mallows Cp. These are described in detail under question 103 above; for all of the error and information-criterion metrics, lower is better, while a higher R-squared is better.

Classification evaluation metrics:

- Average classification accuracy, representing the proportion of correctly classified observations.

- Confusion matrix, which is 2x2 table showing four parameters, including the number of true positives,
true negatives, false negatives and false positives.

- Precision, Recall and Specificity, which are three major performance metrics describing a predictive
classification model

- ROC curve, which is a graphical summary of the overall performance of the model, showing the
proportion of true positives and false positives at all possible values of probability cutoff. The Area Under
the Curve (AUC) summarizes the overall performance of the classifier.
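
A short sketch of these classification metrics with scikit-learn (the labels and scores are made up):

from sklearn.metrics import (accuracy_score, confusion_matrix,
                             classification_report, roc_auc_score)

y_true  = [1, 0, 1, 1, 0, 0, 1, 0]                     # actual labels (hypothetical)
y_pred  = [1, 0, 0, 1, 0, 1, 1, 0]                     # predicted labels
y_score = [0.9, 0.2, 0.4, 0.8, 0.1, 0.7, 0.6, 0.3]     # predicted probabilities

print(accuracy_score(y_true, y_pred))
print(confusion_matrix(y_true, y_pred))
print(classification_report(y_true, y_pred))   # precision, recall, f1 per class
print(roc_auc_score(y_true, y_score))          # area under the ROC curve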

110. In your project, If we use regression model, what would be the outcome?

Intermediate Advanced Stats

Regression analysis generates an equation to describe the statistical relationship between one or more
predictor variables and the response variable (continuous in nature). Where the response variable is the
target variable.
111. List out some common problems faced while analyzing the data.

Basic Advanced Stats

112. OLS is to linear regression. Maximum likelihood is to logistic regression. Explain the
statement.

Intermediate Advanced Stats

113. Is rotation necessary in PCA? If yes, Why? What will happen if you don’t rotate the
components?

Advanced Advanced Stats

114. What are the metrics chosen to evaluate model performance

Intermediate Data Mining


115. How will you treat missing values?

Basic Data Mining

116. Explain Random Forest algorithm

Basic Data Mining


117. Explain Decision Tree algorithm


Basic Data Mining

The Decision Tree algorithm belongs to the family of supervised learning algorithms. Unlike many other supervised learning algorithms, a decision tree can be used for solving both regression and classification problems.

The goal of using a decision tree is to create a model that can be used to predict the class or value of the target variable by learning simple decision rules inferred from prior (training) data.

In Decision Trees, for predicting a class label for a record we start from the root of the tree. We compare
the values of the root attribute with the record’s attribute. On the basis of comparison, we follow the
branch corresponding to that value and jump to the next node.

Types of decision trees: ID3, CART, C4.5, CHAID, MARS
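
A minimal sketch with scikit-learn's CART implementation, using the built-in iris dataset purely for illustration:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
print(tree.score(X_test, y_test))   # accuracy on the held-out data
print(export_text(tree))            # the learned decision rules as plain text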

118. Differentiate between random forest and decision trees?

Basic Data Mining


119. Why did you choose Random forest or Decision trees model ?

Basic Data Mining

Random forests consist of multiple single trees, each based on a random sample of the training data. They are typically more accurate than single decision trees: the decision boundary becomes more accurate and stable as more trees are added.

Two reasons why random forests outperform single decision trees:

- Trees are unpruned. While a single decision tree like CART is often pruned, a random forest tree is fully
grown and unpruned, and so, naturally, the feature space is split into more and smaller regions.

- Trees are diverse. Each random forest tree is learned on a random sample, and at each node, a random
set of features are considered for splitting. Both mechanisms create diversity among the trees.

120. Discuss Customer segmentation by Clustering

Intermediate Data Mining


The objective of any clustering algorithm is to ensure that the distance between data points within a cluster is very low compared to the distance between two clusters, i.e. members of the same group are very similar, and members of different groups are very dissimilar.

For e.g., k-means clustering can be used for creating customer segments based on their income and
spend data
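
A minimal sketch of such a segmentation with scikit-learn; the income/spend values are randomly generated stand-ins for real customer data:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
income = rng.normal(60, 15, 200)    # hypothetical annual income (in thousands)
spend = rng.normal(30, 10, 200)     # hypothetical annual spend (in thousands)
X = np.column_stack([income, spend])

X_scaled = StandardScaler().fit_transform(X)    # scale before distance-based clustering
segments = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_scaled)
print(np.bincount(segments))                    # number of customers in each segment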

121. List the drawbacks and advantages of decision trees

Basic Data Mining

Advantages:

- Compared to other algorithms decision trees requires less effort for data preparation during pre-
processing.

- A decision tree does not require normalization of data.

- A decision tree does not require scaling of data as well.

- Missing values in the data also do NOT affect the process of building a decision tree to any
considerable extent.

- A Decision tree model is very intuitive and easy to explain to technical teams as well as stakeholders.

Disadvantages:

- A small change in the data can cause a large change in the structure of the decision tree causing
instability.

- For a decision tree, the calculations can sometimes become far more complex than for other algorithms.

- Decision trees often take longer to train.

- Decision tree training is relatively expensive because of the higher complexity and time taken.

- A single decision tree is often inadequate for regression, i.e. for predicting continuous values precisely, because its predictions are piecewise constant.

122. How to reduce number of variables in Logistic regression and random forest?

Basic Data Mining

Seven techniques for dimensionality reduction:

- Missing Values Ratio: Data columns with a ratio of missing values greater than a given threshold can be
removed. The higher the threshold, the more aggressive the reduction.
- Low Variance Filter: Data columns with a variance lower than a given threshold can be removed. Notice
that the variance depends on the column range, and therefore normalization is required before applying
this technique.

- High Correlation Filter: Calculate the Pearson product-moment correlation coefficient between numeric
columns and Pearson’s chi-square value between nominal columns. For the final classification, we only
retain one column of each pair of columns whose pairwise correlation exceeds a given threshold. Notice
that correlation depends on the column range, and therefore, normalization is required before applying
this technique.

- Principal Component Analysis (PCA): First principal component has the largest possible variance; each
succeeding principal component has the highest possible variance under the constraint that it is
orthogonal to (i.e., uncorrelated with) the preceding principal components. Keeping only the first m < n
principal components reduces the data dimensionality while retaining most of the data information, i.e.,
variation in the data.

- Backward Feature Elimination: We remove one input column (from training model on n columns) at a
time and train the same model on n-1 columns. The input column whose removal has produced the
smallest increase in the error rate is removed, leaving us with n-1 input columns. The classification is then
repeated using n-2 columns, and so on. Each iteration k produces a model trained on n-k columns and an
error rate e(k). By selecting the maximum tolerable error rate, we define the smallest number of columns
necessary to reach that classification performance with the selected machine learning algorithm.

- Forward Feature Construction. This is the inverse process to backward feature elimination. We start
with one column only, progressively adding one column at a time, i.e., the column that produces the
highest increase in performance.

- Multicollinearity check using the Variance Inflation Factor (VIF), typically used for logistic regression:

The VIF provides information on how large the standard error is compared with what it would be if the
variables were uncorrelated with the other predictor variables in the model. It is calculated for each
explanatory variable and those with high values are removed. Common thumb-rule classifies a VIF value
of >=5 significantly high implying high multicollinearity. A cut-off VIF value of <=2 is used by most
businesses since it offers a more stringent and clear rule.

123. How will you decide the number of clusters in K-Means?

Basic Data Mining

The optimal number of clusters can be found with the elbow method:

1. Compute the clustering algorithm (e.g. k-means) for different values of k, for instance varying k from 1 to 10 clusters.

2. For each k, calculate the total within-cluster sum of squares (WSS).

3. Plot the curve of WSS against the number of clusters k.

4. The location of a bend (knee/elbow) in the plot is generally considered an indicator of the appropriate number of clusters.
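
A small sketch of this elbow procedure with scikit-learn, on synthetic data used only for illustration:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)   # toy data

wss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
       for k in range(1, 11)]   # inertia_ is the within-cluster sum of squares

plt.plot(range(1, 11), wss, marker="o")
plt.xlabel("k")
plt.ylabel("WSS")
plt.show()   # the bend ("elbow") in the curve suggests the number of clusters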

124. List out some of the best practices for data cleaning

Basic Data Mining

125. What are assumptions of clustering algorithm

Intermediate Data Mining

K-Means clustering method considers two assumptions regarding the clusters –

first that the clusters are spherical and second that the clusters are of similar size.

Spherical assumption helps in separating the clusters when the algorithm works on the data and forms
clusters. If this assumption is violated, the clusters formed may not be what one expects. On the other
hand, assumption over the size of clusters helps in deciding the boundaries of the cluster. This assumption
helps in calculating the number of data points each cluster should have. This assumption also gives an
advantage. Clusters in K-means are defined by taking the mean of all the data points in the cluster. With
this assumption, one can start with the centers of clusters anywhere. Keeping the starting points of the
clusters anywhere will still make the algorithm converge with the same final clusters as keeping the
centers as far apart as possible.

126. What is a waterfall chart and when do we use it?

Intermediate Excel

A waterfall chart is used to represent the changes in a given value over a period of time. The changes are usually tracked as positives (rises in the value) and negatives (dips in the value). The beginning and ending values are represented as solid columns and the intermediate changes are shown as floating columns. For example, waterfall charts can be used to represent a company's financial performance (profit, loss) over a period, or to display the changes in a product's value over time.

127. What Is a one- and two-variable data table?


Intermediate Excel

This again follows the what-if analysis principle.

A two-variable data table lets us check two input values at the same time for the same formula in a data table. It is primarily used when the formula depends on several values, two of which are varied as the inputs.

A one-variable data table is similar to a two-variable data table, but it varies only one input at a time.

128. What are the different sections of a Pivot Table

Intermediate Excel

There are 4 different sections in a Pivot Table:

1. Rows - if a field is to be viewed in the rows of the Pivot Table, drag the field to the Rows section.

2. Columns - if a field is to be viewed in the columns of the Pivot Table, drag the field to the Columns section.

3. Values - while the rows and columns of the table are fixed using the Rows and Columns sections, the values summarised in the table (e.g. sums, counts, averages) are set using the Values section.

4. Filters - the Filters area is used to place filters on the Pivot Table.

129. What are the most common questions you should ask a client before creating a
dashboard?

Advanced Excel

1. What inference is to be made from the dashboard?

2. The audience - who will view the dashboard, and how much detail do they need?

3. Does the data carry present or past (historical) information?

130. How can we select all blank cells in Excel?

Basic Excel

1. Select the whole data set.

2. Press F5 to open the Go To dialogue box.

3. Click the Special... button to open the Go To Special dialogue box.

4. Select Blanks and click OK.

This selects all the blank cells in your dataset.

131. Can we sort multiple columns at one time?

Basic Excel

Yes, using sort dialog box.

132. What is the difference between absolute and relative cell references?

Intermediate Excel

Absolute: An absolute reference in Excel is a reference that is "locked" so that the rows and columns won't change when the formula is copied.

Relative: A relative address will change when copied to another location in a worksheet because it describes the "offset" to another cell, rather than a fixed address.

133. What formula would you use to find the length of a text string in a cell?

Basic Excel

"=LEN(cell)"

The above formula can be used to find the length of the text string in the specified cell.
134. What are slicers in Excel

Intermediate Excel

Slicers are visual filters. The objective of slicers is the same as that of filters, but with slicers the filter values are visible on the sheet. They are mainly used in Pivot Tables.

135. How can you Combine Data from Multiple tables into 1 pivot table

Intermediate Excel


136. Explain Goal Seek and Solver.

Intermediate Excel

Goal Seek adjusts the value of an input cell until a formula reaches the goal (target) value; it works backwards from the desired result to the required input, much like a consultant figuring out what is needed to meet a target.

Solver uses a trial-and-error principle: it iterates through a series of candidate solutions for a specific problem statement and shows how the output changes for different inputs.

137. What are named ranges in excel

Intermediate Excel

Named ranges are used to give a group of cells (or a single cell) a common name. The name can then be used inside formulas instead of typing the cell range, which is easier to read.

138. Explain wildcard characters in Excel

Basic Excel

Wildcards are used to find strings in cells that are not exact matches but are similar to the search text. There are three wildcard characters:

1. * (asterisk) - matches any number of characters. For example, sh* would match shirt, short, shell, shall, shore, etc.

2. ? (question mark) - matches exactly one character. For example, ra? would match rap, ran, rat, raw, etc.

3. ~ (tilde) - used when the search string itself contains a wildcard character. For example, if you need to search for ki* in your data, * would normally be treated as a wildcard, so the formula may not fetch the desired output. Searching for ki~* returns ki*.

139. Explain the functions (VLOOKUP, COUNTIF, SUMIF, IFERROR, INDEX / MATCH)

Intermediate Excel

VLOOKUP - stands for vertical lookup. It is used to look up data that is organised vertically (in columns).

COUNTIF - conditional counting. It counts all the values in a range that meet a given criterion.

SUMIF - conditional summing. Like COUNTIF, SUMIF sums all the values in a range that meet a given condition.

IFERROR - catches errors in a formula. It takes two arguments: the value/formula to evaluate, and the value to return if that formula results in an error.

MATCH - returns the position of a value within a given range.

INDEX - returns the value at a given position in a range. INDEX takes up to three arguments: the range (array), the row number, and optionally the column number of the value to be returned.

140. What are the 5 V’s in Big Data ?

Basic Hadoop

Volume, Velocity, Variety, Veracity and Value

141. List the different daemons in Hadoop cluster

Basic Hadoop

Core Hadoop = HDFS + YARN

- HDFS Daemons --> NameNode, DataNode, StandbyNameNode


- YARN Daemons --> ResourceManager, NodeManager

142. What is HDFS ?

Basic Hadoop

Hadoop's Distributed File System which is fault-tolerant, reliable and scalable. Designed to store big
files efficiently in a distributed manner

143. What are the functions of the daemons in the Hadoop cluster ?

Basic Hadoop

Core Hadoop = HDFS + YARN

HDFS Daemons

HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server
that manages the file system namespace and regulates access to files by clients. In addition, there are a
number of DataNodes, usually one per node in the cluster, which manage storage attached to the nodes
that they run on. HDFS exposes a file system namespace and allows user data to be stored in files.
Internally, a file is split into one or more blocks and these blocks are stored in a set of DataNodes. The
NameNode executes file system namespace operations like opening, closing, and renaming files and
directories

The SecondaryNameNode is a checkpoint node where the metadata is periodically backed up at a specific interval. The SecondaryNameNode approach is now outdated: NameNode High Availability is achieved by replacing the SecondaryNameNode with a Standby NameNode.

YARN Daemons - ResourceManager, NodeManager

The ResourceManager and the NodeManager form the data-computation framework. The
ResourceManager is the ultimate authority that arbitrates resources among all the applications in the
system. The NodeManager is the per-machine framework agent who is responsible for containers,
monitoring their resource usage and reporting the same to the ResourceManager/Scheduler.

144. What is Yarn ?

Basic Hadoop
Cluster Resource Management System responsible for allocation of compute resources to all the jobs
submitted to the Hadoop cluster

145. What is new in Hadoop 3 when compared to Hadoop 2?

Intermediate Hadoop

146. What are active and passive NameNodes ?

Basic Hadoop

In a typical High Availability cluster, two separate machines are configured as NameNodes. At any point in
time, exactly one of the NameNodes is in an Active state, and the other is in a Standby state. The Active
NameNode is responsible for all client operations in the cluster, while the Standby is simply acting as a
slave, maintaining enough state to provide a fast failover if necessary.

147. What happens when two clients try to access the same file in HDFS ?

Basic Hadoop

Concurrent writes are not allowed to HDFS at the same time, concurrent reads are fine

148. What is a checkpoint ?

Basic Hadoop

The HDFS namespace is stored by the NameNode. The NameNode uses a transaction log called the
EditLog to persistently record every change that occurs to file system metadata. The entire file system
namespace is stored in another file called the FsImage. Both EditLogs and FSImage files are stored as a file
in the NameNode’s local file system. The NameNode keeps an image of the entire file system namespace
and file Blockmap in memory. When the NameNode starts up, or a checkpoint is triggered by a
configurable threshold, it reads the FsImage and EditLog from disk, applies all the transactions from the
EditLog to the in-memory representation of the FsImage, and flushes out this new version into a new
FsImage on disk. In a cluster with no high-availability, the checkpointing is taken care of by the
SecondaryNameNode
149. How does NameNode handle DataNode failure ?

Basic Hadoop

As soon as a DataNode is declared dead/non-functional, all the data blocks it hosted are re-replicated onto other DataNodes, using the remaining replicas of those blocks as the source. This is how the NameNode handles DataNode failures.

150. What are the steps of action when NameNode is down ?

Intermediate Hadoop

1. Use the file system metadata replica (FsImage) to start a new NameNode.
2. Then, configure the DataNodes and clients so that they can acknowledge this new NameNode,
that is started.
3. Now the new NameNode will start serving the client after it has completed loading the last
checkpoint FsImage (for metadata information) and received enough block reports from the
DataNodes.

151. How is HDFS fault tolerant ?

Intermediate Hadoop

HDFS also maintains the replication factor by creating a replica of data on other available machines in
the cluster if suddenly one machine fails.

152. What is the reason we use HDFS for large datasets instead of a lot of small files ?

Basic Hadoop

As the NameNode performs storage of metadata for the file system in RAM, the amount of memory
limits the number of files in HDFS file system. In simple words, more files will generate more metadata,
that will, in turn, require more memory (RAM).

153. What is a ‘block’ in HDFS ?


Basic Hadoop

In Hadoop, HDFS splits huge files into small chunks called blocks. A block is the smallest unit of data in the file system.

154. What are the default sizes of a Hadoop block in Hadoop 3 and Hadoop 1 ?

Basic Hadoop

The default block size is 128 MB in Hadoop 3 and 64 MB in Hadoop 1.

155. How do we change the block size in Hadoop ?

Basic Hadoop

The block size can be changed by setting the dfs.blocksize property to the required value (default 128 MB, 64 MB in older versions) in the hdfs-site.xml file, or by specifying dfs.blocksize when writing a file.

156. What does the ‘jps’ command do ?

Basic Hadoop

The jps command lists the Java processes running on a node, which in a Hadoop cluster shows the running daemons (NameNode, DataNode, ResourceManager, NodeManager, etc.). Internally, it uses the Java launcher to find the class name and arguments passed to the main method.

157. What is Rack Awareness in Hadoop ?

Intermediate Hadoop

A rack is a collection of nodes (usually tens of nodes) that are physically stored close together and connected to the same network switch. When a user requests a read/write in a large Hadoop cluster, the NameNode chooses a DataNode that is closer to the client (on the same or a nearby rack) in order to reduce network traffic. This is called rack awareness.

158. What is speculative execution in Hadoop ?


Intermediate Hadoop

Hadoop doesn't try to diagnose and fix slow running tasks, instead, it tries to detect them and runs
backup tasks for them. This is called speculative execution in Hadoop. These backup tasks are called
Speculative tasks in Hadoop

159. How do you restart all the daemons ?

Intermediate Hadoop

You can stop the NameNode individually using the /sbin/hadoop-daemon.sh stop namenode command, and then start it again using /sbin/hadoop-daemon.sh start namenode.

To restart all the daemons at once, use /sbin/stop-all.sh followed by /sbin/start-all.sh; the first command stops all the daemons and the second starts them again.

160. What are the different modes Hadoop can run in ?

Intermediate Hadoop

Standalone Mode
Pseudo-distributed Mode
Fully-Distributed Mode.

161. What is MapReduce ?

Basic Hadoop

MapReduce is a programming model or pattern within the Hadoop framework that is used to access big
data stored in the Hadoop File System (HDFS). It is a core component, integral to the functioning of the
Hadoop framework.

162. What is the syntax to run a MapReduce program ?

Basic Hadoop

hadoop jar jar_name package_name.class_name input_path_in_hdfs output_path_in_hdfs


163. What are the main configuration parameters in a MapReduce program ?

Intermediate Hadoop

Input location of Jobs in the distributed file system.

Output location of Jobs in the distributed file system.

The input format of data.

The output format of data.

The class which contains the map function.

The class which contains the reduce function.

The JAR file containing the mapper, reducer and driver classes.

164. What does “RecordReader” do in Hadoop ?

Intermediate Hadoop

RecordReader, typically, converts the byte-oriented view of the input, provided by the InputSplit, and
presents a record-oriented view for the Mapper and Reducer tasks for processing. It thus assumes the
responsibility of processing record boundaries and presenting the tasks with keys and values.

165. How do reducers communicate with each other ?

Intermediate Hadoop

Reducers always run in isolation and they can never communicate with each other as per the Hadoop
MapReduce programming paradigm

166. What does a MapReducer Partitioner do ?

Intermediate Hadoop

A partitioner partitions the key-value pairs of intermediate Map-outputs. It partitions the data using a
user-defined condition, which works like a hash function. The total number of partitions is the same as
the number of Reducer tasks for the job.
167. What does a combiner do ?

Basic Hadoop

A Combiner, also known as a semi-reducer, is an optional class that operates by accepting the inputs
from the Map class and thereafter passing the output key-value pairs to the Reducer class. The main
function of a Combiner is to summarize the map output records with the same key.

168. Explain Distributed cache in a MapReduce framework

Advanced Hadoop

A distributed cache is a mechanism supported by the Hadoop MapReduce framework through which we can broadcast small or moderate-sized (read-only) files to all the worker nodes where the map/reduce tasks for a given job are running.

169. What is the reason we can’t perform aggregation in mapper ? Why do we need the
reducer for this ?

Advanced Hadoop

The aggregation can not be done at Mapper phase because aggregation requires sorting of data, and
mapper executes per input split ( a Data Blocks ), so it is not possible in a mapper because it loses
previous input split every time a new instance is taken as input. The data processed by mapper is then
stored in local disk through shuffling and sorting process before the reducer phase. The latency of writing
this data directly to disk and then transferring data across the network is an expensive operation in the
processing of a MapReduce job. Hence there is a necessity to reduce the amount of data that needs to be
sent across the network to reducer whenever possible.

170. What is XGBoost?

Basic ML


171. How do you deploy a model to cloud


Intermediate ML

The workflow can be broken down into following basic steps:

- Training a machine learning model on a local system

- Wrapping the inference logic into a flask application

- Using Docker to containerize the flask application

- Hosting the docker container on an AWS ec2 instance and consuming the web-service
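
A minimal sketch of the Flask wrapping step, assuming a model trained locally and pickled to a hypothetical file called model.pkl:

import pickle
from flask import Flask, request, jsonify

app = Flask(__name__)
with open("model.pkl", "rb") as f:   # hypothetical serialized model
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    # expects JSON like {"features": [[5.1, 3.5, 1.4, 0.2]]}
    features = request.get_json()["features"]
    prediction = model.predict(features).tolist()
    return jsonify({"prediction": prediction})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)

The Docker and EC2 steps then package and host this application as a web service.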

172. How will you make models out of the tweets for the pharma company

Advanced ML


173. Make 4 segments (product category, competitors etc) and identify which medicine a
doctor is likely to recommend

Intermediate ML


174. Working of ensemble methods such as bagging, boosting, random forest.

Intermediate ML


175. What is clustering and KNN?

Basic ML

k-Means Clustering is an unsupervised learning algorithm that is used for clustering whereas KNN is a
supervised learning algorithm used for classification.
The “k” in k-means denotes the number of clusters you want to have in the end. If k = 5, you will have 5
clusters on the data set. “k” in K-Nearest Neighbors is the number of neighbours it checks. It is
supervised because you are trying to classify a point based on the known classification of other points.

176. What is bagging and boosting?

Basic ML

Bagging and Boosting decrease the variance of your single estimate as they combine several estimates
from different models. So the result may be a model with higher stability.

Bagging is used when the goal is to reduce the variance of a decision tree classifier. Here the objective is
to create several subsets of data from the training sample chosen randomly with replacement. Each
collection of subset data is used to train their decision trees. As a result, we get an ensemble of different
models. The average of all the predictions from the different trees is used, which is more robust than a single decision tree classifier.

Boosting is used to create a collection of predictors. In this technique, learners are learned sequentially
with early learners fitting simple models to the data and then analysing data for errors. Consecutive trees
(random sample) are fit and at every step, the goal is to improve the accuracy from the prior tree. When
an input is misclassified by a hypothesis, its weight is increased so that next hypothesis is more likely to
classify it correctly. This process converts weak learners into a better performing model

177. What is ADA boosting?

Basic ML

AdaBoost is an ensemble classifier: it combines multiple weak classifiers to form a strong classifier. A single weak learner may classify the objects poorly, but if we combine multiple classifiers, re-selecting the training set at every iteration and assigning the right amount of weight in the final voting, we can achieve a good accuracy score for the overall classifier.

178. Explain Gradient boosting and Extreme Gradient Boosting?

Basic ML

XGBoost stands for Extreme Gradient Boosting; it is a specific implementation of the Gradient Boosting
method which uses more accurate approximations to find the best tree model. It employs a number of
nifty tricks that make it exceptionally successful, particularly with structured data.
The most important are:

1) computing second-order gradients, i.e. second partial derivatives of the loss function (similar to
Newton’s method), which provides more information about the direction of gradients and how to get to
the minimum of our loss function. While regular gradient boosting uses the loss function of our base
model (e.g. decision tree) as a proxy for minimizing the error of the overall model, XGBoost uses the 2nd
order derivative as an approximation.

2) And advanced regularization (L1 & L2), which improves model generalization.

XGBoost has additional advantages: training is very fast and can be parallelized/distributed across
clusters.
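
A brief sketch using the xgboost package's scikit-learn-style wrapper on synthetic data (the hyperparameter values here are arbitrary):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier   # requires the xgboost package

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# reg_alpha / reg_lambda are the L1 / L2 regularization terms mentioned above
model = XGBClassifier(n_estimators=200, learning_rate=0.1, max_depth=3,
                      reg_alpha=0.1, reg_lambda=1.0)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))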

179. What is Bootstrap sampling?

Basic ML

Bootstrap Sampling is a method that involves drawing of sample data repeatedly with replacement from
a data source to estimate a population parameter.

180. What to be done on the dataset if the assumptions are not met?

Intermediate ML

1. If you create a scatter plot of values for x and y and see that there is not a linear relationship between
the two variables, then one can do the following:

- Apply a nonlinear transformation to the independent and/or dependent variable. e.g. log, square root,
or reciprocal of the independent and/or dependent variable

- Add another independent variable to the model.

2. If residuals are not independent then one can do the following:

- For positive serial correlation, consider adding lags of the dependent and/or independent variable to
the model.

- For negative serial correlation, check to make sure that none of your variables is overdifferenced.

- For seasonal correlation, consider adding seasonal dummy variables to the model

3. If Residuals do not have constant variance, then one can do the following:

- Transform the dependent variable

- Use weighted regression


4. If Residuals are not normally distributed, then one can do the following:

- First, verify that any outliers aren’t having a huge impact on the distribution. If there are outliers
present, make sure that they are real values and that they aren’t data entry errors

- Next, you can apply a nonlinear transformation to the independent and/or dependent variable. e.g. log,
square root, or the reciprocal of the independent and/or dependent variable

181. How to apply ML Algorithms in Mfg/Production Environment ?

Intermediate ML

1. Specify Performance Requirements (This may be as accurate or false positives or whatever metrics are
important to the business)

2. Separate Prediction Algorithm From Model Coefficients

2a. Select or Implement The Prediction Algorithm

2b. Serialize Your Model Coefficients

3. Develop Automated Tests For Your Model

4. Develop Back-Testing and Now-Testing Infrastructure

5. Challenge Then Trial Model Updates (For example, perhaps you set up a grid or random search of
model hyperparameters that runs every night and spits out new candidate models)

182. Difference between Classification and Linear Regression?

Basic ML

1. Fundamentally, classification is about predicting a label and regression is about predicting a quantity.

i.e. Classification is the task of predicting a discrete class label while Regression is the task of predicting
a continuous quantity

2. Classification predictions can be evaluated using accuracy, whereas regression predictions cannot.

Regression predictions can be evaluated using root mean squared error, whereas classification
predictions cannot.

3. A regression algorithm can predict a discrete value which is in the form of an integer quantity

A classification algorithm can predict a continuous value if it is in the form of a class label probability
183. Which model to use to check whether a patient is diabetic or not?

Basic ML

Classification algorithm such as Logistic regression, Random forest etc

184. Explain missing values and outlier treatment

Basic ML

185. What is logistic regression? The output for logistic regression?

Basic ML

a. Logistic regression models the probabilities for classification problems with two possible outcomes. It's
an extension of the linear regression model for classification problems.

b. Log likelihood – This is the log likelihood of the final model

c. Number of obs – This is the number of observations that were used in the analysis

d. LR chi2(3) – This is the likelihood ratio (LR) chi-square test. The number in the parenthesis indicates
the number of degrees of freedom

e. Prob > chi2 – This is the probability of obtaining the chi-square statistic given that the null hypothesis is true. In this example, the model is statistically significant because the p-value (reported as 0.000) is below the usual 0.05 threshold.

f. Pseudo R2 – This is the pseudo R-squared.

186. What is Ensemble techniques and it's working? some models?

Basic ML

A group of weak learners coming together to form a strong learner, thus increasing the accuracy of any
Machine Learning model is called an ensemble model

Simple ensemble techniques: hard voting classifier, averaging, weighted averaging.

Advanced ensemble techniques: stacking, bagging and pasting (e.g. Random Forest), and boosting (AdaBoost, XGBoost, etc.).
187. What is Decision tree and Random forest?

Basic ML

- A decision tree is a supervised machine learning algorithm that can be used for both classification and
regression problems. A decision tree is simply a series of sequential decisions made to reach a specific
result

- Random Forest is a tree-based machine learning algorithm that leverages the power of multiple
(randomly created) decision trees for making decisions. i.e. The Random Forest Algorithm combines the
output of multiple (randomly created) Decision Trees to generate the final output.

- Random Forest is suitable for situations when we have a large dataset, and interpretability is not a major
concern. Decision trees are much easier to interpret and understand. Since a random forest combines
multiple decision trees, it becomes more difficult to interpret.

- The decision tree model gives high importance to a particular set of features. But the random forest
chooses features randomly during the training process.

188. How to deal with underfitting and overfitting

Basic ML

Handling Overfitting:

Cross-validation

This is done by splitting your dataset into ‘test’ data and ‘train’ data. Build the model using the ‘train’ set.
The ‘test’ set is used for in-time validation.

Regularization

This is a form of regression, that regularizes or shrinks the coefficient estimates towards zero. This
technique discourages learning a more complex model

Early stopping

When training a learner with an iterative method, you stop the training process before the final
iteration. This prevents the model from memorizing the dataset.

Pruning

This technique applies to decision trees.

Pre-pruning: Stop ‘growing’ the tree earlier before it perfectly classifies the training set.

Post-pruning: Allows the tree to ‘grow’ and perfectly classify the training set, and then prunes the tree afterwards.

Dropout

This is a technique where randomly selected neurons are ignored during training.

Regularize the weights

Handling Underfitting:

Get more training data

Increase the size or number of parameters in the model

Increase the complexity of the model

Increasing the training time, until cost function is minimised

189. What is bias variance tradeoff

Basic ML

The goal of any supervised machine learning algorithm is to achieve low bias(the difference between the
average prediction of our model and the correct value which we are trying to predict) and low
variance(variability of model prediction for a given data point or a value which tells us spread of our data).

If our model is too simple and has very few parameters then it may have high bias and low variance. On
the other hand, if our model has a large number of parameters then it’s going to have high variance and
low bias.

Increasing the bias will decrease the variance. Increasing the variance will decrease bias.

So we need to find the right/good balance without overfitting and underfitting the data.

This tradeoff in complexity is why there is a tradeoff between bias and variance.

190. How will you explain machine learning to a 5 year old.

Intermediate ML

Just like a human, a computer can learn from three sources.

One is Observing what others did in similar situations. The other is observing a situation and trying to
come up with the best possible logic on the spot to decide/conclude. The third is learning from previous
mistakes/success. These three methods correspond to three branches of Machine learning, Supervised,
Unsupervised and Reinforcement learning respectively.
- In Supervised Learning, a computer can tell what word in a sentence is the name of a city, given it is
shown example sentences which may or may not contain names of cities and every occurrence of a city
name is tagged in these examples.

- Unsupervised is where we ask the computer to make decisions based on raw data attributes and a set of
measurable quantities. Some examples would include asking a computer to come up with localities in a
dataset where Lat-Long of the house is given. It would use Lat Long to find distances and form localities
of house.

- The third type of learning is Reinforcement Learning. This is a method in which computer starts with
making random decisions, and then learns based on errors it makes and successes it encounters as it
goes. A recent discovery was an algorithm which could play many different arcade games after learning
the correct/wrong moves. These algorithms would start by making a lot of failures in the beginning and
then get better as they go.

191. What do you do in data exploration?

Basic ML

192. You are given a data set on cancer detection. You’ve built a classification model and achieved an accuracy of 96%. Why shouldn’t you be happy with your model performance? What can you do about it?

Intermediate ML

193. You are working on a time series data set. Your manager has asked you to build a high accuracy model. You start with the decision tree algorithm, since you know it works fairly well on all kinds of data. Later, you tried a time series regression model and got higher accuracy than the decision tree model. Can this happen? Why?

Advanced ML
194. You came to know that your model is suffering from low bias and high variance.
Which algorithm should you use to tackle it? Why?

Intermediate ML

195. How is kNN different from kmeans clustering?

Basic ML

196. After analyzing the model, your manager has informed that your regression model is
suffering from multicollinearity. How would you check if he’s true? Without losing any
information, can you still build a better model?

Intermediate ML

197. When is Ridge regression favorable over Lasso regression?

Basic ML

198. While working on a data set, how do you select important variables? Explain your
methods.

Basic ML

199. What is the difference between covariance and correlation?

Intermediate ML

200. Both being tree based algorithm, how is random forest different from Gradient
boosting algorithm (GBM)?
Basic ML

201. You’ve got a data set to work with having p (no. of variables) > n (no. of observations). Why is Ordinary Least Squares (OLS) a bad option to work with? Which techniques would be best to use? Why?

Advanced ML

202. We know that one-hot encoding increases the dimensionality of a data set, but label encoding doesn’t. How?

Intermediate ML

203. You are given a data set consisting of variables having more than 30% missing values?
Let’s say, out of 50 variables, 8 variables have missing values higher than 30%. How will
you deal with them?

Basic ML

204. ‘People who bought this, also bought…’ recommendations seen on Amazon are a result of which algorithm?

Intermediate ML

205. What do you understand by Type I vs Type II error ?

Basic ML

206. You have been asked to evaluate a regression model based on R², adjusted R² and
tolerance. What will be your criteria?
Basic ML

207. Considering the long list of machine learning algorithm, given a data set, how do you
decide which one to use?

Basic ML

208. When does regularization becomes necessary in Machine Learning?

Basic ML

209. What do you understand by Bias Variance trade off?

Basic ML

210. How can you prove that one improvement you've brought to an algorithm is really an
improvement over not doing anything?

Basic ML

211. Explain what resampling methods are and why they are useful. Also explain their
limitations.

Basic ML

- Repeatedly drawing samples from a training set and refitting a model of interest on each sample in
order to obtain additional information about the fitted model

- Example: repeatedly draw different samples from training data, fit a linear regression to each new
sample, and then examine the extent to which the resulting fit differ

- The most common methods are cross-validation and the bootstrap. Cross-validation uses random sampling with no replacement; the bootstrap uses random sampling with replacement.

- Cross-validation is used for evaluating model performance and for model selection (selecting the appropriate level of flexibility).

- The bootstrap is mostly used to quantify the uncertainty associated with a given estimator or statistical learning method.

212. Is it better to have too many false positives, or too many false negatives? Explain.

Basic ML

False positives and false negatives are two problems we have to deal with while evaluating a model.

In medical, a false positive can lead to unnecessary treatment and a false negative can lead to a false
diagnostic, which is very serious since the disease has been ignored.

However, we can minimize the errors by collecting more information, considering other variables,
adjusting the sensitivity (true positive rate) and specificity (true negative rate) of the test, or conducting
the test multiple times.

Even so, it is still hard, since reducing one type of error means increasing the other type of error.
Sometimes one type of error is preferable to the other, so data scientists have to evaluate the
consequences of each error and make a decision.

213. What is selection bias, why is it important and how can you avoid it

Basic ML

Selection bias occurs if a data set's examples are chosen in a way that is not reflective of their real-world
distribution.

How to avoid selection biases

Mechanisms for avoiding selection biases include:

- Using random methods when selecting subgroups from populations.

- Ensuring that the subgroups selected are equivalent to the population at large in terms of their key
characteristics (this method is less of a protection than the first since typically the key characteristics are
not known).
214. Differentiate between univariate, bivariate and multivariate analysis.

Basic ML

Univariate statistics summarize only one variable at a time.

Bivariate statistics compare two variables.

Multivariate statistics compare more than two variables.

215. What is the difference between Cluster and Systematic Sampling?

Basic ML

Systematic sampling and cluster sampling are both statistical measures used by researchers, analysts,
and marketers to study samples of a population.

Systematic sampling involves selecting members from the larger population at a fixed interval to create the sample.

Cluster sampling divides the population into groups (clusters) and then randomly selects entire clusters to form the sample.

216. Can you cite some examples where both false positive and false negatives are equally
important?

Intermediate ML

Let us take an example of a medical field where:

A false positive = person is considered as sick but actually is healthy

A false negative = person is considered as healthy but is actually sick

What does it mean?

False-positive cases lead to overspending due to unnecessary care and damaging the health of an
otherwise healthy person due to unnecessary side effects of the therapy.

A false negative case means that your patients get sicker or die.

In this case, both false positive and false negatives are equally important since it concerns a person’s life

217. Explain Lasso regression


Basic ML

Lasso regression is a type of linear regression that uses shrinkage. Shrinkage is where data values are
shrunk towards a central point, like the mean. The lasso procedure encourages simple, sparse models (i.e.
models with fewer parameters)

Lasso regression performs L1 regularization, which adds a penalty equal to the absolute value of the
magnitude of the coefficients. This type of regularization can result in sparse models with few coefficients;
some coefficients can become zero and be eliminated from the model. Larger penalties result in coefficient
values closer to zero, which is ideal for producing simpler models.
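
As a quick, hedged illustration with scikit-learn (the synthetic data set and the alpha value below are arbitrary choices for demonstration only):

from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# only 3 of the 10 features are actually informative
X, y = make_regression(n_samples=100, n_features=10, n_informative=3, noise=0.1, random_state=0)

model = Lasso(alpha=1.0).fit(X, y)   # alpha controls the strength of the L1 penalty
print(model.coef_)                   # coefficients of uninformative features are typically shrunk to exactly 0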

218. Explain Gradient Descent Algorithm

Intermediate ML

Gradient descent is an optimization algorithm that's used when training a machine learning model.

It's based on a convex function and tweaks its parameters iteratively to minimize a given cost function to
its local minimum.

You start by defining the initial parameter values, and from there gradient descent uses calculus to
iteratively adjust those values so that they minimize the given cost function (a gradient measures how
much the output of a function changes if you change the inputs a little bit).
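
A tiny illustrative sketch, minimizing the toy cost function f(w) = (w - 3)^2 (chosen only for demonstration):

def gradient_descent(lr=0.1, n_iters=100):
    w = 0.0                      # initial parameter value
    for _ in range(n_iters):
        grad = 2 * (w - 3)       # derivative of (w - 3)**2 with respect to w
        w -= lr * grad           # step against the gradient
    return w

print(gradient_descent())        # converges towards 3.0, the minimum of the cost function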

219. How is machine learning deployed in real-world scenarios?

Advanced ML

Typically, models run as Python jobs on AWS or Azure instances, either on manual schedules or triggered
automatically, for example when new data arrives. A suite of services usually constitutes the deployment
environment for such models.

Storage - the model needs to be stored somewhere (as a pickle, joblib, or framework-specific model object),
e.g. S3 on AWS or Blob Storage on Azure.

Computing instance - a compute environment that contains Python and can communicate with every
platform that is relevant to the deployment context.

Job scheduler - DevOps is the norm now: automated pipelines that procure data, process it, and
load/retrain/predict with the packaged model.

Final layer - either BI tools like Tableau or QlikView, SQL/NoSQL databases, or Excel reports.
220. What is cosine similarity?

Intermediate ML

Cosine similarity is a metric used to measure how similar the documents are irrespective of their size.
Mathematically, it measures the cosine of the angle between two vectors projected in a multi-dimensional
space. The cosine similarity is advantageous because even if the two similar documents are far apart by
the Euclidean distance (due to the size of the document), chances are they may still be oriented closer
together. The smaller the angle, the higher the cosine similarity.
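
A minimal sketch with NumPy (the example vectors are arbitrary):

import numpy as np

def cosine_similarity(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity([1, 2, 3], [2, 4, 6]))   # 1.0 - same orientation, different magnitude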

221. How do you implement a program in TensorFlow?

Intermediate ML

The usual workflow of running a program in TensorFlow is as follows:

Build a computational graph, this can be any mathematical operation TensorFlow supports.

Initialize variables, to compile the variables defined previously

Create a session, this is where the magic starts!

Run graph in session, the compiled graph is passed to the session, which starts its execution.

Close session, shut down the session.
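
A minimal sketch of this graph/session workflow, assuming TensorFlow 2.x with the tf.compat.v1 compatibility module (in TensorFlow 1.x the same calls live directly under tf):

import tensorflow as tf

tf.compat.v1.disable_eager_execution()

# 1. build a computational graph
a = tf.constant(2.0)
b = tf.constant(3.0)
c = a * b

# 2./3. create a session (variables, if any, would be initialized here)
with tf.compat.v1.Session() as sess:
    # 4. run the graph in the session
    print(sess.run(c))           # 6.0
# 5. the session is closed automatically by the context manager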

222. What is part of speech (POS) tagging? What is the simplest approach to building a
POS tagger that you can imagine?

Basic NLP

POS tagging is the process of marking up each word in a corpus with a corresponding part-of-speech tag,
based on its context and definition. The most common approach is the lexicon-based approach, which
uses a lexicon to assign a tag to each word. The lexicon is constructed from a gold-standard annotated
corpus, where each word type is coupled with its most frequent associated tag in the gold-standard
corpus.

223. How would you build a part of speech (POS) tagger from scratch given a corpus of
annotated sentences? How would you deal with unknown words?
Basic NLP

First, we will create features from words (like last 2,3 letters, the previous word, next word, etc.). Then we
will train a classifier to find the POS tag. HMM, CRF and RNNs can be used to train the model. Unknown
words can also be predicted by generating the features (position of the word, suffix, etc) from them.

224. How would you train a model that identifies whether the word “Apple” in a sentence
belongs to the fruit or the company?

Basic NLP

This particular task is known as NER (Named Entity Recognition) tagging. HMM, CRF and RNNs can be
used to train a model for NER

225. How would you find all the occurrences of quoted text in a news article?

Basic NLP

Train a classifier model to look at the constituent parts of a news article and assign a probability that,
taken together, they compose valid quoted text. (A simpler baseline is a regular expression that captures
text between quotation marks.)

226. How would you build a system that auto-corrects text that has been generated by a
speech recognition system?

Basic NLP

It can be done in multiple ways, but the simplest way would be to take the unknown words and compare
them with similar words from our dictionary. Distances can be calculated using algorithms like
Levenshtein and if the result is satisfactory, the words can be exchanged

227. Which are some popular models other than word2vec?

Basic NLP

Some popular models other than word2vec are GloVe, Adagram, FastText, etc
228. What is latent semantic indexing and where can it be applied?

Basic NLP

Latent semantic indexing (LSI) is a concept used by search engines to discover how a term and content
work together to mean the same thing, even if they do not share keywords or synonyms. Search engines
use LSI to judge the quality of the content on a page by checking for words that should appear alongside
a given search term or keyword

229. Explain some metrics to test out a Named Entity recognition model.

Basic NLP

When you train an NER system, the most typical evaluation method is to measure precision, recall, and
F1-score, and to inspect the confusion matrix, at the token level.

230. List out some popular Python libraries that are used for NLP.

Basic NLP

Some popular libraries for NLP are, NLTK, Gensim, spaCy, TextBlob, etc.

231. What are some popular applications of NLP?

Basic NLP

Some popular applications are Text summarization, Machine translation, Sentiment Analysis, chatbots,
etc.

232. What is the difference between search function and match function?

Basic NLP

The re.search() method finds a pattern anywhere in the string and returns a match object, whereas the
re.match() method finds a pattern only at the beginning of the string and returns a match object.
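
For example:

import re

text = "data science"
print(re.match(r"science", text))    # None - "science" is not at the start of the string
print(re.search(r"science", text))   # match object, found at position 5
print(re.match(r"data", text))       # match object - "data" is at the beginning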
233. What is tokenization, chinking, chunking?

Basic NLP

Tokenization is a way of separating a piece of text into smaller units called tokens. Here, tokens can be
either word, characters, or subwords. Chunking means a grouping of words/tokens into chunks. Chunking
can break sentences into phrases that are more useful than individual words and yield meaningful results.
Chinking is a lot like chunking, it is basically a way for you to remove a chunk from a chunk.

234. What is the skip-gram model?

Basic NLP

Skip-gram is an unsupervised algorithm to find word embeddings. It tries to predict the source context
words (surrounding words) given a target word (the center word)

235. What is a CBOW model?

Basic NLP

CBOW is an unsupervised algorithm to find word embeddings. It tries to predict the target word (the
center word) given the source context words (surrounding words).

236. How can you create your own word embeddings?

Basic NLP

You can use gensim library to implement word2vec model, you can train the word2vec model on your
text corpus and then generate word embeddings.
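
A minimal sketch, assuming gensim 4.x (older gensim versions use size instead of vector_size); the toy sentences are made up for illustration:

from gensim.models import Word2Vec

sentences = [["data", "science", "is", "fun"],
             ["machine", "learning", "is", "fun"]]

model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)  # sg=1 -> skip-gram
vector = model.wv["data"]                 # the learned embedding for the word "data"
print(model.wv.most_similar("fun"))       # words closest to "fun" in the embedding space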

237. What is the difference between stemming and lemmatization?

Basic NLP

Stemming and lemmatization, both are used to derive root (base) word from their inflected form. A stem
might not be an actual word whereas a lemma will be an actual word.
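
A small illustration with NLTK (assumes the wordnet corpus has been downloaded via nltk.download):

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("studies"))          # studi - a stem, not an actual word
print(lemmatizer.lemmatize("studies"))  # study - a lemma, an actual word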
238. How would you build a system to translate English text to Greek and vice-versa?

Basic NLP

One can use Neural Machine Translation to translate English text to Greek and vice-versa. A sequence
to sequence model can be created using RNNs.

239. How would you build a system that automatically groups news articles by subject?

Basic NLP

There can be different ways to do this task, if you have annotated data, you can train a classifier model
to classify different articles

240. What are stop words? Describe an application in which stop words should be
removed.

Basic NLP

Stop words are frequently used words that do not add much meaning to a sentence or do not help in
prediction. We typically need to remove stop words while performing sentiment analysis.

241. How would you design a model to predict whether a movie review was positive or
negative?

Basic NLP

We will need to perform sentiment analysis on the reviews, It can be done in multiple ways, one simple
way to do this is by training a classifier using ML algorithms or RNNs (LSTM or GRU).

242. What is entropy? How would you estimate the entropy of the English language?

Basic NLP

Entropy is a measure of randomness in the information. One possible way of calculating the entropy of
English uses N-grams. One can statistically calculate the entropy of the next letter when the previous N -
1 letters are known.
243. What is the TF-IDF score of a word and in what context is this useful?

Basic NLP

TF-IDF is a statistical measure that evaluates how relevant a word is to a document in a collection of
documents. This is done by multiplying two metrics: how many times a word appears in a document, and
the inverse document frequency of the word across a set of documents. TF-IDF is used to convert text
corpus into a matrix on which Machine learning algorithms can be implemented
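
A minimal sketch with scikit-learn's TfidfVectorizer (the toy documents are made up; get_feature_names_out assumes scikit-learn 1.0+):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat", "the dog sat on the log"]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)          # sparse document-term matrix of TF-IDF scores

print(vectorizer.get_feature_names_out())   # vocabulary learned from the corpus
print(X.toarray())                          # each row is a document, each column a term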

244. What is dependency parsing?

Basic NLP

Dependency parsing is the process of analyzing the grammatical structure of a sentence based on the
dependencies between the words in that sentence.

245. What are the difficulties in building and using an annotated corpus of text such as
the Brown Corpus and what can be done to mitigate them?

Basic NLP

246. What tools for training NLP models (NLTK, Apache OpenNLP, GATE, MALLET etc…)
have you used?

Basic NLP

To train NLP models, I have used NLTK, Gensim, Spacy and a few others

247. Are you familiar with WordNet or other related linguistic resources?

Basic NLP

WordNet is the lexical database i.e. dictionary for the English language, specifically designed for NLP.
Synset is a special kind of a simple interface that is present in NLTK to look up words in WordNet.
248. Problems faced in NLP and how you tackled them?

Basic NLP

Most of the challenges I faced in NLP are due to data complexity, characteristics such as sparsity,
diversity, dimensionality, etc., and the dynamic nature of the datasets. With a special focus on
addressing NLP challenges, one can build accelerators and robust, scalable, domain-specific knowledge
bases and dictionaries that bridge the gap between user vocabulary and domain nomenclature.

249. What are some of the common problems using fixed window neural models?

Advanced NLP

The main problem faced while using a fixed window neural model is that the window size can be too small
for long sentences, making the model unable to capture the complete context.

250. What are some common examples of sequential data?

Advanced NLP

Some common examples of sequential data are text corpus, DNA sequence, and time-series data

251. What are some problems with N-gram language models?

Advanced NLP

An issue when using n-gram language models is out-of-vocabulary (OOV) words. They are encountered in
computational linguistics and natural language processing when the input includes words which were not
present in a system's dictionary or database during its preparation.

252. What are some limitations of RNNs?

Advanced NLP

RNNs are prone to the exploding and vanishing gradient problems. RNNs also fail to keep track of long-term
dependencies.
253. What are Vanishing gradient problems?

Advanced NLP

As more layers using certain activation functions are added to neural networks, the gradients of the loss
function approach zero, making the network hard to train.

254. What is exploding gradients in RNN?

Advanced NLP

Exploding gradients are a problem where large error gradients accumulate and result in very large
updates to neural network model weights during training.

255. Can you give me an example of many-to-one architecture in sequence models?

Advanced NLP

An example of a many-to-one architecture in sequence models would be sentiment analysis, where the
inputs are words and the output is a sentiment.

256. What activation layer is used in the hidden units of an RNN?

Advanced NLP

The tanh activation function is used in the hidden units of an RNN.

257. What is the use of the Forget Gate in LSTMs?

Advanced NLP

In LSTM, the forget gate controls the extent to which a value remains in the cell

258. Why is there a specific need for an architecture like GRU or LSTM?
Advanced NLP

RNNs suffer from short-term memory. If a sequence is long enough, they will have a hard time carrying
the information from the earlier timesteps to the later ones. This is called the vanishing gradient problem.
To solve this issue, GRUs and LSTMs are used.

259. What problems of RNNs do LSTMs address?

Advanced NLP

RNNs suffer from short-term memory. If a sequence is long enough, they will have a hard time carrying
the information from the earlier timesteps to the later ones. This is called the vanishing gradient problem.
To solve this issue, GRUs and LSTMs are used.

260. What is the primary difference between an LSTM and GRU?

Advanced NLP

The main difference between GRU and LSTM is that a GRU has 2 gates whereas an LSTM has 3 gates, so a
GRU is faster than an LSTM. However, LSTMs generally perform better at remembering longer sequences
than GRUs.

261. What kind of datasets are RNNs known best to work on?

Advanced NLP

RNNs are good at making predictions when the data is sequential.


262. What are the different possible architectures in RNNs and give examples of the
same?

Advanced NLP

Different possible architectures for RNN are the following:

1. One-to-Many: ex. Auto-Image captioning


2. Many-to-Many: ex. Neural Machine Translation
3. Many-to-one: ex. Sentiment Analysis

263. What are some of the ways to address the exploding gradients problem in RNNs?

Advanced NLP

Some of the ways to address the exploding gradient problem are:

1. Gradient clipping: limit the size of gradients during the training of your network.
2. Weight regularization: apply a penalty to the network's loss function for large weight values.
3. Using LSTM or GRU units.

264. Explain encoder-decoder architecture?

Advanced NLP

An encoder-decoder architecture was developed in which an input sequence is read in its entirety and
encoded into a fixed-length internal representation. A decoder network then uses this internal
representation to output words. This architecture is generally used in machine translation.

265. What are the drawbacks of attention mechanisms?

Advanced NLP

The main disadvantage of the attention mechanism is that it adds more weights to train, thus increasing
the training time of the model.

266. What is BERT? What are the applications of it?


Advanced NLP

BERT stands for Bidirectional Encoder Representations from Transformers. BERT is pre-trained on a large
corpus of unlabelled text. It is bidirectional meaning it learns information from both the left and the right
side of a token’s context during the training phase. BERT is used for text summarization, knowledge
extraction, chatbots etc.

267. What is XLNet?

Advanced NLP

XLNet is an auto-regressive language model which outputs the joint probability of a sequence of tokens
based on the transformer architecture with recurrence.

268. What are the Transformers?

Advanced NLP

The Transformer in NLP is a novel architecture that aims to solve sequence-to-sequence tasks while
handling long-range dependencies with ease.

269. What is the time complexity of LSTM?

Advanced NLP

270. Why do we need attention mechanisms?

Advanced NLP

The standard seq2seq model is generally unable to accurately process long input sequences, the
attention mechanism allows the model to focus and place more “Attention” on the relevant parts of the
input sequence as needed.

271. What are the different types of attention mechanisms?


Advanced NLP

There are 2 different types of attention mechanism

1. Bahdanau Attention
2. Luong Attention

272. What are the advantages of BERT?

Advanced NLP

Since the BERT model is deeply bidirectional, it is able to generate more accurate word representations.
Since BERT uses transformers, it allows parallelization and is thus faster to train on large datasets.

273. What information is stored in the hidden and cell state of an LSTM?

Advanced NLP

The cell state ( also called long-term memory) contains the information from the past. Hidden State (also
called working memory) contains the information from the current state that needs to be taken to the
next state

274. Why is the transformer better than LSTMs?

Advanced NLP

Transformers are better than the other architectures because they avoid recurrence entirely, processing
sentences as a whole and learning relationships between words using multi-head attention mechanisms
and positional embeddings.
275. What are the differences between BERT and ALBERT v2?

Advanced NLP

BERT is an expensive model in terms of memory and time consumed on computations, even with GPU.
ALBERT v2 is lighter and faster than BERT. Cross-layer parameter sharing is the most significant change
in BERT architecture that created ALBERT.

276. What are the different variants of BERT?

Advanced NLP

There are 2 different variants of BERT

1. BERT Base: 12 layers (transformer blocks), 12 attention heads, and 110 million parameters
2. BERT Large: 24 layers (transformer blocks), 16 attention heads and, 340 million parameters

277. What is the state of the art model currently in NLP?

Advanced NLP

Following are the state of the art model currently in NLP

1. BERT
2. GPT-3
3. XLNet

278. What are the most challenging NLP problems that researchers/industries are
working on currently?

Advanced NLP

Following are the challenges faced currently in NLP

1. Extraction of meaning from a variety of complex, multi-format documents.


2. Support for multiple languages
3. Integration of pre-existing, text-based knowledge
279. What are built-in functions in Python?

Basic Python

Hint?

280. Differentiate between Call by value and Call by reference

Basic Python

Hint?

281. How do you read a file (without using Pandas)?

Intermediate Python

Hint?

282. What is NaN in python?

Basic Python

Hint?

283. What is the use of ID() function in python?

Basic Python

Hint?

284. How will you import multiple excel sheets in a data frame?

Basic Python

Hint?
285. What are the different types of data types?

Basic Python

Hint?

286. Difference between lists/ tuples/ dictionaries?

Basic Python

Hint?

287. How would you check whether a number is prime or not using Python?

Basic Python

# taking input from user
number = int(input("Enter any number: "))

# prime number is always greater than 1
if number > 1:
    for i in range(2, number):
        if (number % i) == 0:
            print(number, "is not a prime number")
            break
    else:
        print(number, "is a prime number")
# if the entered number is less than or equal to 1
# then it is not a prime number
else:
    print(number, "is not a prime number")

288. How would you check whether a number is an Armstrong number using Python?


Basic Python

# Python program to check if the number is an Armstrong number or not
# (this classic version cubes each digit, so it is valid for 3-digit numbers)

# take input from the user
num = int(input("Enter a number: "))

# initialize sum
sum = 0

# find the sum of the cube of each digit
temp = num
while temp > 0:
    digit = temp % 10
    sum += digit ** 3
    temp //= 10

# display the result
if num == sum:
    print(num, "is an Armstrong number")
else:
    print(num, "is not an Armstrong number")

289. What is an Append Function?

Basic Python

The append() method adds an item to the end of the list.

The syntax of the append() method is:

list.append(item)

290. What is the Beautiful Soup library used for?

Basic Python

Hint?
291. Which function is most useful to convert a multidimensional array into a one-dimensional
array?

Basic Python

Hint?

292. Python or R – Which one would you prefer for text analytics?

Intermediate Python

293. What is the lambda function in Python?

Intermediate Python

In Python, anonymous functions are defined using the lambda keyword

Syntax of Lambda Function in python

lambda arguments: expression
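
For example:

square = lambda x: x ** 2
print(square(4))                                               # 16

# lambdas are often used inline, e.g. as a sort key
print(sorted([(1, "b"), (2, "a")], key=lambda pair: pair[1]))  # [(2, 'a'), (1, 'b')]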

294. How are negative indices used in Python?

Intermediate Python

Python programming language supports negative indexing of arrays, something which is not available in
arrays in most other programming languages. This means that the index value of -1 gives the last element,
and -2 gives the second last element of an array. The negative indexing starts from where the array ends.
This means that the last element of the array is the first element in the negative indexing which is -1.
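
For example:

nums = [10, 20, 30, 40]
print(nums[-1])   # 40 - the last element
print(nums[-2])   # 30 - the second last element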

295. How is the Python series different from a single column dataframe?

Intermediate Python

Python series is the data structure for a single column of a DataFrame, not only conceptually, but
literally, i.e. the data in a DataFrame is actually stored in memory as a collection of Series
Series is a one-dimensional object that can hold any data type such as integers, floats and strings and it
does not have any name/header whereas the dataframe has column names.

296. Which libraries in SciPy have you worked with in your project?

Intermediate Python

SciPy contains modules for optimization, linear algebra, integration, interpolation, special functions, FFT,
signal and image processing, ODE solvers etc

Subpackages include:

scipy.cluster

scipy.constants

scipy.fftpack

scipy.integrate

scipy.interpolate

scipy.linalg

scipy.io

scipy.ndimage

scipy.odr

scipy.optimize

scipy.signal

scipy.sparse

scipy.spatial

scipy.special

scipy.stats

scipy.weave

297. How does the groupby function work in Python?

Intermediate Python
Pandas dataframe.groupby() function is used to split the data into groups based on some criteria. pandas
objects can be split on any of their axes.

Syntax: DataFrame.groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True,


squeeze=False, **kwargs)

Parameters :

by: mapping, function, str, or iterable

axis: int, default 0

level: If the axis is a MultiIndex (hierarchical), group by a particular level or levels

as_index: For aggregated output, return object with group labels as the index. Only relevant for
DataFrame input. as_index=False is effectively “SQL-style” grouped output

sort: Sort group keys. Get better performance by turning this off. Note this does not influence the order
of observations within each group. groupby preserves the order of rows within each group.

group_keys: When calling apply, add group keys to index to identify pieces

squeeze: Reduce the dimensionality of the return type if possible, otherwise return a consistent type

Returns: GroupBy object
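
A small illustrative example (the column names are made up):

import pandas as pd

df = pd.DataFrame({"team": ["A", "A", "B"], "score": [10, 20, 30]})
print(df.groupby("team")["score"].mean())   # mean score per team: A -> 15.0, B -> 30.0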

298. What does [::-1] do in python?

Intermediate Python

[::] just produces a copy of all the elements in order

[::-1] produces a copy of all the elements in reverse order

299. What are python packages?

Basic Python

Packages are namespaces which contain multiple packages and modules themselves. They are simply
directories.

Each package in Python is a directory which MUST contain a special file called __init__.py. This file can be
empty, and it indicates that the directory it contains is a Python package, so it can be imported the same
way a module can be imported.
If we create a directory called foo, which marks the package name, we can then create a module inside
that package called bar. We also must not forget to add the __init__.py file inside the foo directory.

300. How do you check missing values in a dataframe using python?

Intermediate Python

The pandas isnull() function detects missing values in the given object. It returns a boolean object of the
same size indicating whether the values are NA. Missing values get mapped to True and non-missing values
get mapped to False.
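
For example:

import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, np.nan, 3], "b": [4, 5, np.nan]})
print(df.isnull())          # boolean mask of missing values
print(df.isnull().sum())    # number of missing values per column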

301. How do you get the frequency of a categorical column of a dataframe using python?

Basic Python

Using Series.value_counts()

302. Can you write a function using python to impute outliers?

Basic Python

import numpy as np

def removeOutliers(x, outlierConstant):
    a = np.array(x)
    upper_quartile = np.percentile(a, 75)
    lower_quartile = np.percentile(a, 25)
    IQR = (upper_quartile - lower_quartile) * outlierConstant
    quartileSet = (lower_quartile - IQR, upper_quartile + IQR)
    resultList = []
    for y in a.tolist():
        if y >= quartileSet[0] and y <= quartileSet[1]:
            resultList.append(y)
    return resultList
303. How can we convert a python series object into a dataframe?

Basic Python

Series.to_frame(name=None)

304. How can you change the index of a dataframe in python?

Basic Python

DataFrame.set_index(keys, drop=True, append=False, inplace=False, verify_integrity=False)

keys: label or array-like or list of labels/arrays

This parameter can be either a single column key, a single array of the same length as the calling
DataFrame, or a list containing an arbitrary combination of column keys and arrays. Here, “array”
encompasses Series, Index, np.ndarray, and instances of Iterator.

305. Is Python case sensitive?

Basic Python

Yes

306. What all ways have you used to convert categorical columns into numerical data
using python?

Intermediate Python

One of the most used and popular ones are LabelEncoder and OneHotEncoder.

Both are provided as parts of sklearn library.

LabelEncoder can be used to transform categorical data into integers:

from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()

x = ['Apple', 'Orange', 'Apple', 'Pear']

y = label_encoder.fit_transform(x)
print(y)

array([0, 1, 0, 2])

OneHotEncoder can be used to transform categorical data into one hot encoded array:

from sklearn.preprocessing import OneHotEncoder

onehot_encoder = OneHotEncoder(sparse=False)

y = y.reshape(len(y), 1)

onehot_encoded = onehot_encoder.fit_transform(y)

print(onehot_encoded)

307. How is get_dummies() different from OneHotEncoder?

Intermediate Python

OneHotEncoder cannot process string values directly. If your nominal features are strings, then you
need to first map them into integers.

pandas.get_dummies is kind of the opposite. By default, it only converts string columns into one-hot
representation, unless columns are specified.
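
A small illustration (the column names are made up):

import pandas as pd

df = pd.DataFrame({"fruit": ["Apple", "Orange", "Apple"], "price": [10, 20, 15]})
print(pd.get_dummies(df))   # only the string column 'fruit' is one-hot encoded by default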

308. How do you check the distribution of data in python?

Intermediate Python

A simple and commonly used plot to quickly check the distribution of a sample of data is the histogram.

from matplotlib import pyplot

pyplot.hist(data)

309. What is the difference between iloc and loc?

Basic Python

loc gets rows (or columns) with particular labels from the index.

iloc gets rows (or columns) at particular positions in the index (so it only takes integers).
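
For example:

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]}, index=["x", "y", "z"])
print(df.loc["y"])    # select by index label
print(df.iloc[1])     # select by integer position (the same row in this case)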
310. Difference between univariate and bivariate analysis? What all different functions
can be used in python?

Basic Python

Univariate statistics summarize only one variable at a time.

Bivariate statistics compare two variables.

Below are a few functions which can be used in the univariate and bivariate analysis:

1. To find the population proportions with different types of blood disorders.

df.Thal.value_counts()

2. To make a plot of the distribution :

sns.distplot(df.Variable.dropna())

3. Find the minimum, maximum, average, and standard deviation of data.

There is a function called ‘describe’

4. Find the mean of the Variable

df.Variable.dropna().mean()

5. Boxplot to observe outliers

sns.boxplot(x = "", y = "", hue = "", data=df)

6. Correlation plot:

data.corr()

311. What all different methods can be used to standardize the data using python?

Intermediate Python

Min Max Scaler.

Standard Scaler.

Max Abs Scaler.

Robust Scaler.

Quantile Transformer Scaler.


Power Transformer Scaler.

Unit Vector Scaler.

312. What is the apply function in Python? How does it work?

Basic Python

Pandas.apply allows users to pass a function and apply it on every single value of a Pandas series.

Syntax:

s.apply(func, convert_dtype=True, args=())

313. How do you do upsampling of data? Name a python function or explain the code.

Intermediate Python

Up-sampling is the process of randomly duplicating observations from the minority class in order to
reinforce its signal.

There are several heuristics for doing so, but the most common way is to simply resample with
replacement.

Module for resampling in Python:

from sklearn.utils import resample
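
A minimal sketch, assuming a dataframe df with a binary 'label' column where class 1 is the minority (all names here are illustrative):

import pandas as pd
from sklearn.utils import resample

df = pd.DataFrame({"feature": range(10), "label": [0] * 8 + [1] * 2})

majority = df[df.label == 0]
minority = df[df.label == 1]

minority_upsampled = resample(minority,
                              replace=True,              # sample with replacement
                              n_samples=len(majority),   # match the majority class size
                              random_state=42)

balanced = pd.concat([majority, minority_upsampled])
print(balanced.label.value_counts())                     # both classes now have 8 rows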

314. Can you plot 3D plots using matplotlib? Name the function.

Intermediate Python

Yes

Function:

import numpy as np

import matplotlib.pyplot as plt

fig = plt.figure()

ax = plt.axes(projection ='3d')
315. How can you drop a column in python?

Basic Python

DataFrame.drop(labels=None, axis=0, index=None, columns=None, level=None, inplace=False,


errors='raise')

316. What is the use of ‘inplace’ in python functions?

Basic Python

In-place operation is an operation that changes directly the content of a given linear algebra, vector,
matrices(Tensor) with/without making a copy

When inplace = True is used, it performs an operation on data and nothing is returned.

When inplace=False is used, it performs an operation on data and returns a new copy of data.

317. How do you select a sample of dataframe?

Intermediate Python

1. Randomly select a single row: df = df.sample()

2. Randomly select a specified n number of rows: df = df.sample(n=3)

3. Allow a random selection of the same row more than once: df = df.sample(n=3,replace=True)

4. Randomly select a specified fraction of the total number of rows: df = df.sample(frac=0.50)

318. How would you define a block in Python?

Intermediate Python

A block is a group of statements in a program or script. Usually, it consists of at least one statement and
declarations for the block, depending on the programming or scripting language. A language which allows
grouping with blocks is called a block-structured language

319. How will you remove duplicate data from a dataframe?


Intermediate Python

DataFrame.drop_duplicates(subset=None, keep=’first’, inplace=False)

subset: takes a column or a list of column labels. Its default value is None. After passing columns, only
those columns are considered when identifying duplicates.

keep: controls how duplicate values are treated. It has only three distinct values and the default is
'first'.

320. Can you convert a string into an int? When and how?

Basic Python

Python offers the int() function, which takes a string (or number) as an argument and returns an integer.
A string can be converted when it represents a whole number; a string such as "3.7" cannot be converted
directly and raises a ValueError.

But keep this special case in mind:

Passing a float (a number with a fractional part) as an argument returns the integer part, i.e. the float
truncated towards zero (for positive numbers this is the same as rounding down).

321. What does the function zip() do?

Intermediate Python

The zip() function takes iterables (can be zero or more), aggregates them in a tuple, and return it.

The syntax of the zip() function is:

zip(*iterables)
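
For example:

names = ["a", "b", "c"]
scores = [1, 2, 3]
print(list(zip(names, scores)))   # [('a', 1), ('b', 2), ('c', 3)]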

322. How many arguments can the range() function take?

Basic Python

It can take mainly three arguments.

start: integer starting from which the sequence of integers is to be returned

stop: integer before which the sequence of integers is to be returned.

The range of integers ends at stop – 1.


step: integer value which determines the increment between each integer in the sequence

323. What is the difference between list, array and tuple in Python?

Basic Python

List:

A list is an ordered collection of items of any data type.

A list is mutable.

Lists are dynamic and can contain objects of different data types.

List elements can be accessed by index number.

Array:

An array is an ordered collection of elements of the same data type.

An array is mutable.

Array elements can be accessed by index number.

Tuple:

Tuples are immutable and can store any data type.

A tuple is defined using ().

It cannot be changed or replaced since it is an immutable data type.

324. Write a Sorting Algorithm in R?

Intermediate R

There are multiple algorithms for sorting data in the R programming language. The different types of
sorting functions are listed below.

Bubble Sort

Insertion Sort
Selection Sort

Merge Sort

Quick Sort

325. What are the packages used in R for data science?

Basic R

1. Dplyr

2. Ggplot2

3. Shiny

4. Lubridate

5. Knitr

6. Mlr

7. Caret

8. Text2Vec

9. Prophet

10. SnowballC

326. Explain the functions of dplyr package

Basic R

dplyr is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most
common data manipulation challenges:

Below are the functions used in this package:

mutate() adds new variables that are functions of existing variables

select() picks variables based on their names.

filter() picks cases based on their values.

summarise() reduces multiple values down to a single summary.

arrange() changes the ordering of the rows.


327. Explain the syntax of rbind and cbind in R

Intermediate R

cbind() and rbind() both create matrices by combining several vectors of the same length. cbind()
combines vectors as columns, while rbind() combines them as rows

328. What is interaction function?

Basic R

interaction computes a factor which represents the interaction of the given factors. The result of the
interaction is always unordered.

Syntax: interaction(…, drop = FALSE, sep = ".", lex.order = FALSE)

329. What are the different data types/objects in R?

Basic R

Hint?

330. What is a factor variable, and why would you use one?

Basic R

Hint?

331. How do you concatenate strings in R

Basic R

Hint?

332. What is the difference between lapply and sapply?


Basic R

Hint?

333. How many sorting algorithms are available

Basic R

Hint?

334. What is the use of lattice package

Basic R

Hint?

335. What is the use of MANOVA

Basic R

Hint?

336. What is the difference between data frame and a matrix in R?

Basic R

Matrix in R –

It’s a homogeneous collection of data sets which is arranged in a two-dimensional rectangular


organisation. It's an m*n array with a similar data type. It is created using a vector input. It has a fixed
number of rows and columns. You can perform many arithmetic operations on R matrix like – addition,
subtraction, multiplication, and divisions.

DataFrames in R –

It is used for storing data tables. It can contain multiple data types in multiple columns called fields. It is a
list of a vector of equal length. It is a generalized form of a matrix. It is like a table in excel sheets. It has
column and row names. The name of rows is unique with no empty columns. The data stored must be
numeric, character or factor type. DataFrames are heterogeneous.

337. How missing values and impossible values are represented in R language?

Intermediate R

In R, missing values are represented by the symbol NA (not available). Impossible values (e.g., dividing by
zero) are represented by the symbol NaN (not a number)

338. What is the process to create a table in R language without using external files?

Intermediate R

Hint?

339. What is the difference between data frame and a matrix in R

Basic R

Matrix in R –

It’s a homogeneous collection of data sets which is arranged in a two-dimensional rectangular


organisation. It’s an m*n array with a similar data type. It is created using a vector input. It has a fixed
number of rows and columns. You can perform many arithmetic operations on R matrix like – addition,
subtraction, multiplication, and divisions.

DataFrames in R –

It is used for storing data tables. It can contain multiple data types in multiple columns called fields. It is a
list of the vector of equal length. It is a generalized form of a matrix. It is like a table in excel sheets. It has
column and row names. The name of rows is unique with no empty columns. The data stored must be
numeric, character or factor type. DataFrames are heterogeneous.
340. How can you verify if a given object “X” is a matrix data object

Basic R

Hint?

341. What is Rshiny

Basic R

Shiny is an open-source R package that provides an elegant and powerful web framework for building
web applications using R. Shiny helps you turn your analyses into interactive web applications without
requiring HTML, CSS, or JavaScript knowledge.

342. Advantages of R/Python visualization over tableau

Basic R

Few advantages of R/Python are as follows:

R/Python is an open-source portable language supported by a huge standard library.

With R/Python, you can visualise data in a similar way to Tableau, and build interactive visualisations
with many libraries but you have a lot more flexibility.

343. Explain the various benefits of R language?

Basic R

344. What are the differences between the sum function and using “+” operator

Basic SAS

SUM function returns the sum of non-missing arguments whereas “+” operator returns a missing value if
any of the arguments are missing
345. How does PROC SQL work

Intermediate SAS

The SQL query structure does not change even if we use PROC SQL command. For example -

PROC SQL;

SELECT column(s)

FROM table(s) | view(s)

WHERE expression

GROUP BY column(s)

HAVING expression

ORDER BY column(s);

QUIT;

In the above query, the SELECT statement is nothing but a standard SQL SELECT query, but you always
end the PROC SQL block with QUIT;
346. If you are given an unsorted data set, how will you read the last observation to a new
dataset

Intermediate SAS

We can read the last observation to a new data set using end= data set option.

For example:

data work.calculus;

set work.comp end=last;

If last;

run;

Here in the above query, a new dataset calculus is getting created from comp (within work directory).
last is the temporary variable (initialized to 0) which is set to 1 when the set statement reads the last
observation

347. Can you tell the difference between VAR X1 - X3 and VAR X1 -- X3?

Intermediate SAS
348. What is the purpose of trailing @ and @@? How do you use them

Intermediate SAS

The trailing @ is also known as a column pointer. By using the trailing @, in the Input statement gives you
the ability to read a part of your raw data line, test it and then decide how to read additional data from
the same record.

The single trailing @ tells the SAS system to “hold the line”.

The double trailing @@ tells the SAS system to “hold the line more strongly”.

An Input statement ending with @@ instructs the program to release the current raw data line only when
there are no data values left to be read from that line. The @@, therefore, holds the input record even
across multiple iterations of the data step.

349. What is the difference between the Do Index, Do While and the Do Until loop

Intermediate SAS

350. What is the ANYDIGIT function in SAS

Basic SAS

Searches a character string for a digit and returns the first position at which it is found

351. What is interleaving in SAS

Basic SAS

352. What is the difference between RDDs and Dataframe in Spark?

Intermediate Spark

A Data Frame is the tabular representation of data and is equivalent to a table in a relational database
but with better optimization.
RDD is the representation of a set of records, logically partitioned across multiple nodes for parallel
processing.

353. How to do Spark Tuning ( Optimization)?

Intermediate Spark

Spark performance tuning is the process of efficiently utilizing the spark resources such as memory,
cores, instances as per the input data records.

354. What is a stage in Spark and What are the types of stages?

Basic Spark

A Spark stage is nothing but an individual unit of work (a set of tasks) within the entire execution plan.

There are two types of stages:

1. ShuffleMapStage: It is an intermediate stage and produces data for the next stage.
2. ResultStage: Final stage of spark and helps in the computation of result from the action plan.

355. What are shared variables in Spark and what is the use of it?

Basic Spark

Shared variables are nothing but globally referenced variables used across multiple functions and
methods running in parallel.

Spark provides two special types of shared variables

1. Broadcast Variables(Used to cache a value in memory on all nodes)


2. Accumulators (used to implement counters and sums).

356. What is the difference between Batch processing and real time streaming?

Intermediate Spark
Batch processing is the processing of blocks of data that have already been stored over a period of time.
It is used in the scenarios where it is required to process large volumes of data to get more detailed
insights than it is to get fast analytics results. On the other hand, real-time processing as the name
suggests is used for real-time analytics. It is used to process the data as it arrives and gets instant
analytics result.

357. Can you connect SparkSQL to RDBMS? If yes, How?

Intermediate Spark

SparkSQL itself is built of two main components: Dataframe and SQLContext. SQLContext encapsulates
all the relative functionality of spark and provides extended functionality to be able to 'talk' to different
databases which could be SQL or NoSQL DBs. Every DB has its own respective connectors to be
integrated with spark and with the help of such dedicated connectors SQLContext talks to DBs.

358. What are Accumulators in Spark?

Basic Spark

Accumulators are one of the types of shared variables used in spark. It is meant for numeric data
aggregation where the data is stored in the cache and can be accessed throughout the model
functionalities.

359. What is the difference between SQLContext and HiveContext in Spark?

Intermediate Spark

SQLContext is nothing but the gateway to SparkSQL from where the spark can interact with the
databases. HiveContext is the superset of SQLContext which inherits all the property of SQLContext for
DB interactions with addition of HiveContext properties to connect with Hive and HBase.

360. Explain the project you did using Spark?

Intermediate Spark
Spark is basically used where plain Python is not capable of solving the problem. I used Spark
functionality from Python for telecom-domain use cases where the data size was huge (> 20 GB), used
RDD concepts for parallel and fast data preprocessing, and used the shared variables concept for data
storage and loading from cache.

361. How does Kafka work?

Intermediate Spark

Kafka is a distributed system consisting of servers and clients that communicate via a high-performance
TCP. Applications (producers) send messages (records) to a Kafka node (broker) and said messages are
processed by other applications called consumers. Said messages get stored in a topic and consumers
subscribe to the topic to receive new messages.

362. Explain Spark Architecture

Intermediate Spark

Apache Spark follows a master/slave architecture where the master drives the process and the slave
daemons are the worker nodes which do the actual processing.

The Spark driver contains various components - DAGScheduler, TaskScheduler, BackendScheduler and
BlockManager - responsible for the translation of Spark user code into actual Spark jobs executed on the
cluster.

363. What is Lazy evaluation in Spark?

Basic Spark

Lazy evaluation as the name suggests means the execution will not start until an action is triggered.
Whenever there is some operation on RDD, it does not get executed immediately. Spark adds them to a
DAG of computation and only when the driver requests some data, this DAG actually gets executed

Advantages of lazy evaluation.

1. It is an optimization technique i.e. it provides optimization by reducing the number of queries.


2. It saves the round trips between driver and cluster, thus speeds up the process.
364. How can you use Apache Spark with Hadoop?

Intermediate Spark

We need to understand that Spark is not intended to replace the Hadoop stack but rather to enhance its
functionality. Spark can enrich the processing capabilities in terms of reading and writing data from HDFS
by combining Spark with Hadoop MapReduce and HBase.

There are two different ways in which the deployment happens.

1. Standalone deployment: Spark runs on the Hadoop cluster side by side with Hadoop MR, and users
can run Spark jobs directly on data in HDFS.
2. Hadoop YARN deployment: users can deploy Spark on Hadoop YARN and run Spark on YARN without
any pre-installation or administrative access required.

365. What are different types of cluster managers in Spark?

Basic Spark

Apache has 3 types of cluster managers.

Standalone: Simplest way to run spark in a clustered environment. It is a cluster which spark itself
manages. It has masters and number of workers with the configured amount of memory and CPU
cores.
Mesos: Mesos handles the workload in a distributed environment by dynamic resource sharing and
isolation. It is used for large scale cluster deployments and it decreases an overhead of allocating a
specific machine for different workloads.
Hadoop Yarn: YARN data computation framework is a combination of the ResourceManager, the
NodeManager. In resource manager, The Scheduler allocates a resource to the various running
application and Application Manager manages applications across all the nodes.

366. What is the use of broadcast variables in Apache Spark?

Intermediate Spark

Broadcast variables are useful when large datasets need to be cached in executors. Without this, these
need to be shipped to each executor before the actual process call. It is meant to be a read-only and is a
mechanism for sharing variables across executors
367. What is the role of Dstream in Spark?

Basic Spark

Discretized Stream or DStream is the basic abstraction provided by Spark Streaming. It represents a
continuous stream of data, either the input data stream received from the source or the processed data
stream generated by transforming the input stream. Internally, a DStream is represented by a continuous
series of RDDs, which is Spark’s abstraction of an immutable, distributed dataset. Each RDD in a DStream
contains data from a certain interval.

368. What are the different levels of persistence available in Spark?

Intermediate Spark

Different Persistence levels in Apache Spark are as follows:

1. MEMORY_ONLY: In this level, RDD object is stored as a de-serialized Java object in JVM. If an
RDD doesn’t fit in the memory, it will be recomputed.
2. MEMORY_AND_DISK: In this level, RDD object is stored as a de-serialized Java object in JVM. If
an RDD doesn’t fit in the memory, it will be stored on the Disk.
3. MEMORY_ONLY_SER: In this level, RDD object is stored as a serialized Java object in JVM. It is
more efficient than a de-serialized object.
4. MEMORY_AND_DISK_SER: In this level, RDD object is stored as a serialized Java object in JVM.
If an RDD doesn’t fit in the memory, it will be stored on the Disk.
5. DISK_ONLY: In this level, RDD object is stored only on Disk.

369. What do you understand by partitions in Spark?

Intermediate Spark

Spark's distributed datasets are basically big datasets which need to be partitioned across various nodes
in order to facilitate processing. Efficient execution on a single node for such huge datasets is not
possible. Hence partitioning is required, where each partitioned block is evaluated lazily and the
operations on it are stored as a DAG.

370. What is Pyspark?

Basic Spark
Pyspark is nothing but the python API for Spark. Its sole purpose is to support the collaboration of
Apache Spark and Python. It provides an interface to interact with RDD in Apache Spark through python
programming language.

371. What are actions and transformations in Spark?

Basic Spark

Transformations and actions are nothing but the two types of operations which can be performed on
RDDs. Transformations are operations which, when applied to an RDD, return a new transformed RDD.

Ex: map(), filter(), flatMap()

Actions are methods to access the actual data available in an RDD; when an action is called, the result is
brought into the program flow and all the pending transformations are executed.

Ex: collect(),reduce(),first(),take(),count()
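
A minimal PySpark sketch (assumes a local Spark installation and the pyspark package; the names used are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("demo").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize([1, 2, 3, 4, 5])
doubled = rdd.map(lambda x: x * 2)        # transformation - evaluated lazily
large = doubled.filter(lambda x: x > 4)   # another transformation

print(large.collect())                    # action - triggers execution, returns [6, 8, 10]
print(doubled.count())                    # action - returns 5

spark.stop()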

372. What is GraphX and what are its applications?

Intermediate Spark

GraphX is Apache Spark’s API for graphs and graph-parallel computation. This includes the collection of
graph algorithms and processes to do graph analytics. GraphX extends the Spark RDD with a Resilient
Distributed Property Graph.

The property graph is a directed multigraph which can have multiple edges in parallel. Every edge and
vertex have user-defined properties associated with it. The parallel edges allow multiple relationships
between the same vertices. It is flexible, fast and open source.

373. What is Sliding Window in Spark streaming?

Intermediate Spark

Spark streaming has an advantageous feature of windowed operation. It can do the transformation
operation over a sliding window of data. Generally, the sliding window operation requires two specific
parameters.

Window length which defines the duration of the window & Sliding Interval which defines the interval at
which the operation is performed.
374. What is the difference between Spark Session and Spark Context?

Intermediate Spark

Spark SparkContext is an entry point to Spark and used to programmatically create RDDs and other
variables. It's object "sc" is a default variable and can be created by using SparkContext class.

However, SparkSession is a superset of SparkContext which includes all the functional class of different
APIs, Spark Context, SQLContext, HiveContext etc. It's an entry point to underlying spark functionality
itself.

375. Why is Spark faster than Hadoop?

Basic Spark

Theoretically, Spark performs 100 times faster than Hadoop and this is possible only because it processes
data in random access memory (RAM), while Hadoop MapReduce persists data back to the disk after a
map or reduce action. Nonetheless, Spark needs lots of memory and keeps the data there until a further
call for caching.

376. What is the use of SQLContext in Spark?

Basic Spark

SQLContext is nothing but the gateway to SparkSQL from where the spark can interact with the
databases. Here the DB can be both SQL and NoSQL. Respective drivers are available for different DBs
which can be initiated along with the SparkSession builder process itself.

377. What are the limitations of Spark?

Basic Spark

1. Need manual optimization whenever required. No automated process is available.


2. No own file management system. Dependency on HDFS or something else.
3. Spark ML is very limited. MLlib does not support all extensive algorithms as of now. Not good for
advance analytics.
4. Not good for a multi-user environment. not capable of handling users concurrency.
378. Explain Spark Streaming with Kafka?

Intermediate Spark

Spark Streaming is nothing but a continuous stream that is processed using algorithms as it is. The output
is also retrieved in the form of a continuous data stream. Kafka streaming works on state transitions
unlike batches as that in Spark Streaming.

It stores the states within its topics, which is used by the stream processing applications for storing and
querying of the data. Thereby, all its operations are state-controlled. These states are further used to
connect topics to form an event task

379. How can you connect with Hive in Spark?

Intermediate Spark

Hive is connected through HiveContext in spark. HiveContext is the superset of SQLContext which
inherits all the property of SQLContext for DB interactions with addition of HiveContext properties to
connect with Hive and HBase.

380. What is the difference between Cache vs Broadcast in Spark?

Intermediate Spark

Cache stores each node or any partitions of it that it computes, in memory and reuses them in other
actions on the dataset. It helps in faster execution in future processes. Whereas, Broadcast variables
allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy
of it with tasks.

381. Explain DAG in Spark?

Basic Spark

DAG is the abbreviation of the Directed Acyclic Graph. In Spark, this is used for the visual representation
of RDDs and the operations being performed on them. The RDDs are represented by vertices, while the
operations are represented by edges. Every edge is directed from an 'earlier state' to a 'later state'.
382. Given a table(cars) with 4 columns(model_id, model_name,color, price) , perform
groupby using model_name and color, order by highest price, get 3rd highest.

Basic SQL

Hint?

383. What is the difference between the WHERE and HAVING clauses?

Basic SQL

Hint?

384. Given a table(employee). Find the Second highest salary. Find the 10th highest
salary. Find the 25-30th highest salary.

Intermediate SQL

Hint?

385. Fetch department-wise salary from an employee table

Basic SQL

Hint?

386. Given a table with order-id , order item-id and quantity Find the quantity for distinct
order-id

Basic SQL

Hint?

387. What are the different type of Joins in Sql and explain them? (Mainly focused on full
outer join )
Basic SQL

Hint?

388. Given 2 tables and the following query. What will be the output (select * from table 1
full outer join table 2) where values not in (select * from table 1 inner join table 2)

Intermediate SQL

Hint?

389. Given an assumption: There are 2 tables, first table has 10 records and second table
has 15 records. There are 5 records common in both the tables. Number of records that
would be fetched when you perform left join/right join/inner join/cross-join.

Basic SQL

Hint?

390. Given a word "JOE", find the word in a given string irrespective of the word being uppercase,
lowercase, or capitalized.

Intermediate SQL

Hint?

391. Find out if the database has any duplicate record names.

Basic SQL

Hint?
392. Differentiate between Implicit vs Explicit Join

Intermediate SQL

Hint?

393. With respect to SQL, which one is more preferable - Subqueries or Joins? Why?

Intermediate SQL

Hint?

394. Does SQL have User Defined Functions?

Basic SQL

Hint?

395. Query to find the employees in the office given check in and check out as fields.

Intermediate SQL

Hint?

396. Given a table of an event having columns date-ts/ event id. Find the event that
happened 3rd on every month

Basic SQL

Hint?

397. Split a full name into 2. First and last.

Basic SQL

Hint?
398. Find the Salary greater than Average salary without using Joins or Sub-Queries

Advanced SQL

Hint?

399. What is difference between rownum and dense rank ?

Basic SQL

Hint?

400. How will you use partition by?

Basic SQL

Hint?

401. Types of Joins

Basic SQL

Hint?

402. What is inner join?

Basic SQL

The INNER JOIN creates a new result table by combining column values of two tables (table1 and table2)
based upon the join-predicate. The query compares each row of table1 with each row of table2 to find all
pairs of rows which satisfy the join-predicate.

403. What is left / outer/inner join?

Basic SQL

(INNER) JOIN: Returns records that have matching values in both tables
LEFT (OUTER) JOIN: Returns all records from the left table, and the matched records from the right
table

RIGHT (OUTER) JOIN: Returns all records from the right table, and the matched records from the left
table

FULL (OUTER) JOIN: Returns all records when there is a match in either left or right table
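
As an illustrative sketch, assuming two hypothetical tables, employees and departments, linked by a dept_id column (note that FULL OUTER JOIN is not available in MySQL):

SELECT e.name, d.dept_name
FROM employees e
INNER JOIN departments d ON e.dept_id = d.dept_id;       -- only rows that match on both sides

SELECT e.name, d.dept_name
FROM employees e
LEFT JOIN departments d ON e.dept_id = d.dept_id;        -- all employees, matched departments or NULL

SELECT e.name, d.dept_name
FROM employees e
RIGHT JOIN departments d ON e.dept_id = d.dept_id;       -- all departments, matched employees or NULL

SELECT e.name, d.dept_name
FROM employees e
FULL OUTER JOIN departments d ON e.dept_id = d.dept_id;  -- all rows from both tables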

404. What is Normalization

Basic SQL

In a database context, normalization is the process of organizing tables and columns to minimize redundancy and avoid insert, update and delete anomalies, typically by moving through normal forms (1NF, 2NF, 3NF, BCNF). In data preprocessing, normalization also refers to Min-Max scaling, a technique in which values are shifted and rescaled so that they end up ranging between 0 and 1.

405. Use count function in a query

Basic SQL

SELECT COUNT(*) FROM dataset;

406. Difference between count(column_name) and count(*)

Intermediate SQL

COUNT(*) will count the number of records.

COUNT(column_name) will count the number of records where column_name is not null.
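
For example, on a hypothetical emp table where some rows have a NULL phone value:

SELECT COUNT(*) FROM emp;        -- counts every row, including rows where phone is NULL
SELECT COUNT(phone) FROM emp;    -- counts only rows where phone is NOT NULL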

407. Write SQL query to find the cumulative price of each customer in a table?

Intermediate SQL

Step 1: Compute a per-customer total with SUM() and an OVER clause:

SELECT CustomerID,
       TransactionDate,
       TransactionAmount,
       SUM(TransactionAmount) OVER (PARTITION BY CustomerID) AS RunningTotal
FROM Sales.CustomerTransactions
WHERE TransactionTypeID = 1;

Step 2: Add ORDER BY inside the OVER clause so the sum accumulates row by row, which gives the cumulative amount for each customer:

SELECT CustomerID,
       TransactionDate,
       TransactionAmount,
       SUM(TransactionAmount) OVER (PARTITION BY CustomerID
                                    ORDER BY TransactionDate) AS RunningTotal
FROM Sales.CustomerTransactions
WHERE TransactionTypeID = 1
ORDER BY CustomerID, TransactionDate;

408. Write a query to delete duplicate records in a table

Intermediate SQL

SELECT [FirstName],

[LastName],

[Country],

COUNT(*) AS CNT

FROM [SampleDB].[dbo].[Employee]

GROUP BY [FirstName],

[LastName],
[Country]

HAVING COUNT(*) > 1;
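
The query above only lists the duplicated name/country combinations. One common way to actually delete the extra copies, sketched here for SQL Server and reusing the table and column names from the example above, is a CTE with ROW_NUMBER():

WITH cte AS (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY FirstName, LastName, Country   -- one group per duplicate combination
               ORDER BY (SELECT NULL)
           ) AS rn
    FROM [SampleDB].[dbo].[Employee]
)
DELETE FROM cte WHERE rn > 1;   -- keeps the first row of each group, removes the rest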

409. What is a constraint in SQL?

Intermediate SQL

Hint?

410. What are the constraints type available in SQL

Intermediate SQL

Hint?

411. What is a Primary Key

Basic SQL

The PRIMARY KEY constraint uniquely identifies each record in a table.

Primary keys must contain UNIQUE values, and cannot contain NULL values.

A table can have only ONE primary key; and in the table, this primary key can consist of single or
multiple columns (fields).
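
A minimal sketch (the persons table and its columns here are hypothetical):

CREATE TABLE persons (
    person_id INT NOT NULL,
    name      VARCHAR(100),
    PRIMARY KEY (person_id)   -- unique, non-NULL identifier for each row
);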

412. Can we have multiple keys for primary key

Basic SQL

A table can have only one primary key, which may consist of single or multiple fields. When multiple
fields are used as a primary key, they are called a composite key. If a table has a primary key defined on
any field(s), then you cannot have two records having the same value of that field(s).
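
For example, a hypothetical order_items table where no single column is unique on its own, but the pair of columns is:

CREATE TABLE order_items (
    order_id INT NOT NULL,
    item_id  INT NOT NULL,
    quantity INT,
    PRIMARY KEY (order_id, item_id)   -- composite key: the combination of the two columns must be unique
);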

413. What is a Unique Key ?

Basic SQL
Unique key constraints identify an individual tuple uniquely in a relation or table. A table can have more
than one unique key, unlike the primary key. A unique key constraint can accept only one NULL value for
the column (exact NULL handling varies by database). Unique constraints can also be referenced by the
foreign key of another table. They can be used when someone wants to enforce uniqueness on a column,
or a group of columns, that is not the primary key.
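
A small sketch with a hypothetical users table:

CREATE TABLE users (
    user_id INT PRIMARY KEY,
    email   VARCHAR(255) UNIQUE,   -- unique key: duplicate values rejected (NULL handling varies by database)
    phone   VARCHAR(20)  UNIQUE    -- a table can have several unique keys
);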

414. What is a Foreign Key ?

Basic SQL

A FOREIGN KEY is a key used to link two tables together.

A FOREIGN KEY is a field (or collection of fields) in one table that refers to the PRIMARY KEY in
another table.

The table containing the foreign key is called the child table, and the table containing the candidate key
is called the referenced or parent table
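
A minimal sketch, assuming hypothetical customers (parent) and orders (child) tables:

CREATE TABLE customers (
    customer_id INT PRIMARY KEY,
    name        VARCHAR(100)
);

CREATE TABLE orders (
    order_id    INT PRIMARY KEY,
    customer_id INT,
    FOREIGN KEY (customer_id) REFERENCES customers (customer_id)  -- child column points to the parent's primary key
);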

415. Can a table contain multiple FOREIGN KEY’s?

Basic SQL

A table may have multiple foreign keys, and each foreign key can have a different parent table.

416. What is SQL NOT NULL constraint?

Intermediate SQL

A NOT NULL constraint in SQL prevents NULL values from being inserted into the specified column, treating
NULL as a value that is not accepted for that column. This means you must supply a non-NULL value for
that column in INSERT or UPDATE statements, so the column will always contain data.
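
For example (hypothetical table and columns):

CREATE TABLE employees (
    emp_id INT NOT NULL,            -- INSERT/UPDATE must always supply a value here
    email  VARCHAR(255) NOT NULL,
    bonus  DECIMAL(10, 2)           -- nullable column: NULL is accepted
);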

417. What is a CHECK constraint?

Intermediate SQL
The CHECK constraint is used to limit the value range that can be placed in a column. If you define a
CHECK constraint on a single column it allows only certain values for this column. If you define a CHECK
constraint on a table it can limit the values in certain columns based on values in other columns in the
row.
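
A short sketch with hypothetical column names, showing both forms:

CREATE TABLE staff (
    staff_id INT PRIMARY KEY,
    age      INT CHECK (age >= 18),   -- single-column check
    salary   DECIMAL(10, 2),
    bonus    DECIMAL(10, 2),
    CHECK (bonus <= salary)           -- table-level check across columns in the same row
);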

418. What is a DEFAULT constraint?

Intermediate SQL

The DEFAULT constraint is used to provide a default value for a column. The default value will be added
to all new records IF no other value is specified.
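
For example, with a hypothetical orders table:

CREATE TABLE orders (
    order_id INT PRIMARY KEY,
    status   VARCHAR(20) DEFAULT 'pending'   -- used whenever an INSERT does not supply a status
);

INSERT INTO orders (order_id) VALUES (1);    -- the row is stored with status = 'pending'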

419. What is the difference between NULL value, Zero, and Blank space?

Intermediate SQL

A NULL value is not the same as zero or a blank space. A NULL value is a value which is 'unavailable,
unassigned, unknown or not applicable', whereas zero is a number and a blank space is a character.

420. What is a Composite key ?

Intermediate SQL

A composite key is a combination of two or more columns in a table that can be used to uniquely identify
each row in the table. Uniqueness is guaranteed only when the columns are combined; taken individually,
the columns do not guarantee uniqueness.

421. How do you restrict the data at columns level ?

Intermediate SQL

Data can be restricted at the column level by naming only the required columns in the SELECT list (or by
exposing a view, or granting column-level privileges) instead of using SELECT *. To restrict the number of
rows returned, a LIMIT (or TOP) clause can be used to cap the number of records retrieved.

422. Write a query to select the f_name and l_name fields from table emp with a space in
between the 2 columns
Intermediate SQL

select f_name + ' ' + l_name as full_name from emp

(In MySQL, use concat(f_name, ' ', l_name) instead of the + operator.)

423. Write a query to rename the column name id as emp_id, name as emp_name for the
table emp;

Basic SQL

select id as emp_id, name as emp_name from emp;

424. select * from dual, what does the dual mean and what is the default data types ?

Intermediate SQL

DUAL is a special one-row, one-column table present by default in all Oracle databases. The owner of
DUAL is SYS (SYS owns the data dictionary, therefore DUAL is part of the data dictionary), but DUAL can
be accessed by every user.

The table has a single VARCHAR2(1) column called DUMMY that has a value of 'X'. MySQL allows DUAL
to be specified as a table in queries that do not need data from any tables. In SQL Server DUAL table
does not exist, but you could create one.

425. How do you get the current system date using dual table?

Intermediate SQL

SELECT sysdate FROM DUAL ;

426. Write a query to get the number of records from a table emp

Basic SQL

SELECT COUNT(*) FROM emp;

427. Explain DDL with examples.


Basic SQL

DDL is Data Definition Language and is used to define structures like schemas, databases, tables and
constraints. Examples of DDL are CREATE, ALTER, DROP and TRUNCATE statements.

428. Explain DML with examples.

Basic SQL

DML is Data Manipulation Language and is used to manipulate data. Examples of DML are insert,
update and delete statements.

429. Explain DCL and TCL with examples

Intermediate SQL

DCL is Data Control Language

TCL is Transaction Control Language

Examples under DCL: GRANT, REVOKE

Examples under TCL: START TRANSACTION, COMMIT, ROLLBACK

430. How to get only the delhi records from emp table, and handle all types of case
sensitive issues: Delhi, delhi, DELHI, DELhi

Intermediate SQL

select * from emp where upper(city)='DELHI'

431. Write a query to change the format of the date to (YYYY-MON-DD) in dual table ?

Intermediate SQL

select to_char(sysdate, 'YYYY-MON-DD') from dual;


432. How to remove duplicate in the col from a table ?

Basic SQL

SELECT DISTINCT column FROM table1;

433. Write a query to find count of unique id in the retail_shopping table

Basic SQL

SELECT count(DISTINCT ID) FROM retail_shopping;

434. Write a query to select only the id, name, city,country and phone from the table
customer and restrict the record only to india

Basic SQL

select id, name, city, country, phone from customer where country='india'

435. Write a query to update the table emp, where the city name is Madras to chennai

Basic SQL

UPDATE emp

SET city= 'chennai'

WHERE city= 'Madras';

436. Write a query to remove records whose salary is greater than or equal to 50000 and
city is chennai

Basic SQL

DELETE FROM emp WHERE salary>=50000 and city='chennai'

437. Write a query to select all the students from table stud whose name begins with 'S'
Intermediate SQL

select * from stud where name like 'S%'

438. Write a query to display all the records from table emp where the age is between 18
and 58

Basic SQL

select * from emp where age between 18 and 58

439. Select all the record for emp, in which gender is female or age > 18

Basic SQL

select * from emp where gender='female' or age>18

440. Write a query to extract all the records for which the payment_detail column is null in
the table payment_detail

Basic SQL

select * from payment_detail where payment_detail is null

441. Write a query to get the top 5 salary from the table emp

Basic SQL

select top 5 * from emp order by salary desc

(In MySQL/PostgreSQL: select * from emp order by salary desc limit 5)

442. Query the records from emp where order is descending for name and ascending for
salary

Intermediate SQL
select * from emp order by name desc, salary asc

443. What is the difference between union and union all in SQL

Intermediate SQL

UNION removes duplicate records (where all columns in the results are the same), UNION ALL does not.

There is a performance hit when using UNION instead of UNION ALL, since the database server must do
additional work to remove the duplicate rows, but usually, you do not want the duplicates (especially
when developing reports).
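
For illustration, with two hypothetical tables old_customers and new_customers that share some rows:

SELECT name FROM old_customers
UNION
SELECT name FROM new_customers;      -- duplicate rows removed

SELECT name FROM old_customers
UNION ALL
SELECT name FROM new_customers;      -- duplicates kept, usually faster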

444. What is an execution plan? When would you use it? How would you view the
execution plan

Intermediate SQL

An execution plan is a view in SQL Server Management Studio that shows how SQL Server breaks down
and executes a query, and it helps identify where issues might exist within the plan. By identifying the
statements that take a long time to complete, you can then look at the execution plan to determine
tuning needs.

When do you use it?

You can use it any time you write a query. Most developers use an execution plan when a database
query consumes a lot of resources or takes a long time.

How do you view it in SQL Server?

SQL Server can create execution plans in two ways:

Actual Execution Plan - (CTRL + M) - is created after execution of the query and contains the steps that
were performed

Estimated Execution Plan - (CTRL + L) - is created without executing the query and contains an
approximate execution plan

Execution plans can be presented in these three ways.


Text Plans

Graphical Plans

XML Plans

445. How can you select all the even number records from a table? All the odd number
records?

Intermediate SQL

Select * from table where id % 2 = 0

Select * from table where id % 2 != 0

446. What is the difference between the RANK() and DENSE_RANK() functions? Provide
an example.

Intermediate SQL

The only difference between the DENSE_RANK() and RANK() functions is how they handle ties: RANK()
assigns non-consecutive ranks to the values in a set in the case of a tie, which means there will be gaps
between the integer ranks after a tie, whereas DENSE_RANK() assigns consecutive ranks in the case of a
tie, so there are no gaps between the integer ranks.
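
A sketch on a hypothetical emp table with name and salary columns:

SELECT name,
       salary,
       RANK()       OVER (ORDER BY salary DESC) AS rnk,        -- 1, 2, 2, 4, ... (gap after a tie)
       DENSE_RANK() OVER (ORDER BY salary DESC) AS dense_rnk   -- 1, 2, 2, 3, ... (no gap)
FROM emp;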

447. What is the difference between char and varchar2?

Intermediate SQL

CHAR is used for storing fixed-length character strings. It will waste a lot of disk space if this type is used
to store variable-length strings, because shorter values are padded to the declared length.

VARCHAR2 is used to store variable-length character strings.
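
For example, in Oracle (hypothetical table and columns):

CREATE TABLE codes (
    country_code CHAR(2),        -- always stored as 2 characters, padded with spaces if shorter
    city_name    VARCHAR2(50)    -- stores only the characters actually supplied, up to 50
);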

448. How do you detect outliers

Basic Statistics
Hint?

449. Difference between pause and continue

Basic Statistics

Hint?

450. Why you used T-test in the project that you have mentioned in your resume.

Basic Statistics

Hint?

451. Given two populations, to perform a test of effectiveness of a drug, which statistical
test will you perform?

Intermediate Statistics

Hint?

452. If a height is co - related to weight & weight is co -related height are the both the
statements same?

Basic Statistics

Yes, the two statements are equivalent: correlation is symmetric, so the correlation of height with weight is the same as the correlation of weight with height, given that both are continuous variables measured on the same observations.

453. Given a data / statement, calculate the Z score

Basic Statistics

A z-score measures exactly how many standard deviations above or below the mean a data point is.

The formula for calculating a z-score is:

z = (data point - mean) / standard deviation

A positive z-score says the data point is above average.

A negative z-score says the data point is below average.

A z-score close to 0 says the data point is close to average.

A data point can be considered unusual if its z-score is above 3 or below -3.
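
A quick worked example with made-up numbers: if the mean is 100 and the standard deviation is 15, then a data point of 130 has

z = (130 - 100) / 15 = 2

so it lies two standard deviations above the mean.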

454. What is p-value?

Basic Statistics

Hint?

455. Explain Chi-squared test, Z-test, Anova.

Intermediate Statistics

Hint?

456. Difference between precision/ recall/ f1 score.

Intermediate Statistics

Hint?

457. What are independent variables and categorical variables. Highlight the key
differences.

Basic Statistics

An independent variable, sometimes called an experimental or predictor variable, is a variable that is
being manipulated in an experiment in order to observe the effect on a dependent variable, sometimes
called an outcome variable.

Categorical variables contain a finite number of categories or distinct groups. Categorical data might not
have a logical order. For example, categorical predictors include gender, material type, and payment
method.

An independent variable can be categorical or numerical. A categorical variable can be an independent
variable or a dependent variable.

458. What is Chi Square ?

Basic Statistics

The Chi-Square statistic is commonly used for testing relationships between categorical variables. The
null hypothesis of the Chi-Square test is that no relationship exists on the categorical variables in the
population; they are independent.
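
The test statistic itself is a standard formula comparing observed and expected counts in a contingency table (written here in plain notation):

chi-square = sum over all cells of (O - E)^2 / E

where O is the observed frequency, E is the expected frequency under independence, and the statistic is compared against a chi-squared distribution with (rows - 1) * (columns - 1) degrees of freedom.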

459. How to prove a sample is the true representation of population?

Intermediate Statistics

Properties of representative samples:

- Estimates calculated from sample data are often used to make inferences about populations.

- If a sample is representative of a population, the sample reflects the characteristics of the population,
so the sample findings can be generalized to the population.

- The most effective way to achieve representativeness is through randomization: random selection or
random assignment.

460. What is Hypothesis Testing?

Basic Statistics

A statistical hypothesis is an assumption about a population parameter. This assumption may or may not
be true. Hypothesis testing refers to the formal procedures used by statisticians to accept or reject
statistical hypotheses
461. A scenario was given and was asked to write Null and Alternate Hypothesis

Intermediate Statistics

Null hypothesis. The null hypothesis, denoted by Ho, is usually the hypothesis that sample observations
result purely from chance.

Alternative hypothesis. The alternative hypothesis, denoted by H1 or Ha, is the hypothesis that sample
observations are influenced by some non-random cause.
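
For example, in a hypothetical drug trial: H0: the mean outcome is the same for the treatment and control groups (the drug has no effect), versus H1: the mean outcomes differ (the drug has an effect).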

462. How you handle the skewness

Intermediate Statistics

We can handle skewness using a log transformation. A log transformation can help to fit a very skewed
(especially right-skewed) distribution into an approximately Gaussian one.

463. Explain in detail about distributions in statistics

Basic Statistics

Gaussian Distribution: Data from many fields of study can, surprisingly, be described using a Gaussian
distribution, so much so that it is often called the "normal" distribution because it is so common. A
Gaussian distribution can be described using two parameters:

mean: denoted with the Greek lowercase letter mu, it is the expected value of the distribution.

variance: denoted with the Greek lowercase letter sigma raised to the second power (because the units
of the variable are squared), it describes the spread of observations from the mean. Its square root, the
standard deviation (sigma), describes the spread of observations from the mean in the original units.

Student’s t-Distribution: It is a distribution that arises when attempting to estimate the mean of a normal
distribution with different sized samples.

The distribution can be described using a single parameter:

number of degrees of freedom: denoted with the lowercase Greek letter nu (v).
Chi-Squared Distribution: Like the Student’s t-distribution, the chi-squared distribution is also used in
statistical methods on data drawn from a Gaussian distribution to quantify the uncertainty.

The chi-squared distribution has one parameter:

degrees of freedom, denoted k.

etc

464. Difference between Binomial and Poisson Distribution

Basic Statistics

The binomial distribution describes the number of successes in a fixed number of trials, where each trial
has only two possible outcomes (success or failure), so the count is bounded by the number of trials n.
The Poisson distribution describes the number of events in a fixed interval and places no upper limit on
that count.

Binomial is biparametric in nature (parameters n and p), while Poisson is uniparametric (parameter lambda).

Mean > Variance for binomial, Mean = Variance for Poisson
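
For reference, the two probability mass functions (standard results, written in plain notation) are:

Binomial: P(X = k) = C(n, k) * p^k * (1 - p)^(n - k), with mean = n*p and variance = n*p*(1 - p)

Poisson:  P(X = k) = e^(-lambda) * lambda^k / k!, with mean = variance = lambda

Since 0 < p < 1, n*p*(1 - p) is smaller than n*p, which is why the binomial mean exceeds its variance while the two are equal for the Poisson.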

465. What are the conditions for performing two sample hypothesis testing?

Basic Statistics

When comparing two population proportions, we start with two assumptions:

The two independent samples are simple random samples that are independent.

The number of successes is at least five and the number of failures is at least five for each of the
samples.

466. What is sigmoid function, conditional probability and probability difference

Basic Statistics

- A Sigmoid function is a mathematical function which has a characteristic S-shaped curve. There are a
number of common sigmoid functions, such as the logistic function, the hyperbolic tangent, and the
arctangent
- All sigmoid functions have the property that they map the entire number line into a small range such as
between 0 and 1, or -1 and 1, so one use of a sigmoid function is to convert a real value into one that can
be interpreted as a probability.

- The "odds ratio" p / (1 - p) describes the ratio between the probability that a certain, positive, event
occurs and the probability that it does not occur, where positive refers to the "event that we want to
predict", i.e., p(y=1 | x).

- The sigmoid function outputs the conditional probabilities of the prediction, i.e., the class probabilities.
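
Written out (standard definitions, plain notation):

sigmoid: sigma(z) = 1 / (1 + e^(-z)), which maps any real number z into the range (0, 1)

odds: p / (1 - p); log-odds (logit): z = log(p / (1 - p)), which is the inverse of the sigmoid

conditional probability: p(y = 1 | x), which logistic regression models as sigma(w*x + b)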

467. You are given a data set. The data set has missing values which spread along 1
standard deviation from the median. What percentage of data would remain unaffected?
Why?

Intermediate Statistics

468. What are different types of Hypothesis Testing

Intermediate Statistics

There are basically two types, namely, null hypothesis and alternative hypothesis

The null hypothesis is generally denoted as H0. It states the exact opposite of what an investigator or an
experimenter predicts or expects. It basically defines the statement which states that there is no exact or
actual relationship between the variables.

The alternative hypothesis is generally denoted as H1. It makes a statement that suggests or advises a
potential result or an outcome that an investigator or the researcher may expect. It has been categorized
into two categories: directional alternative hypothesis and non-directional alternative hypothesis.

469. What is the difference between variance and covariance

Basic Statistics

Variance is a one-dimensional measure and covariance is a two-dimensional measure: variance measures
the volatility (spread) of a single random variable, while covariance measures the relationship between
two random variables. The higher the volatility in a stock, the riskier the stock, and buying stocks with
negative covariance is a good way to minimize risk: a positive covariance means the assets move in the
same direction, whereas a negative covariance means the assets generally move in opposite directions.
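
The standard sample formulas make the contrast explicit (plain notation):

Var(X)    = sum over i of (x_i - x_bar)^2 / (n - 1)

Cov(X, Y) = sum over i of (x_i - x_bar) * (y_i - y_bar) / (n - 1)

Variance is the covariance of a variable with itself, Cov(X, X) = Var(X), and the sign of Cov(X, Y) indicates whether the two variables tend to move in the same or opposite directions.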

470. What a data contains? (Information + Noise) Explain

Basic Statistics

Data = true signal + noise

Noisy data are data with a large amount of additional meaningless information in it called noise. This
includes data corruption and the term is often used as a synonym for corrupt data. It also includes any
data that a user system cannot understand and interpret correctly.

Sources of noise:

- Random noise(white noise) is often a large component of the noise in data

- Outlier data are data that appears to not belong in the data set. It can be caused by human error such as
transposing numerals, mislabeling, programming bugs, etc

- Fraud: Individuals may deliberately skew data to influence the results toward a desired conclusion

471. How to create dashboards?

Basic Tableau

1. At the bottom of the workbook, click the New Dashboard icon:

2. From the Sheets list at left, drag views to your dashboard at the right

3. To replace a sheet, select it in the dashboard at right. In the Sheets list at left, hover over the
replacement sheet, and click the Swap Sheets button.

472. What filters should be applied to rows for specific ops?

Basic Tableau

The different types of filters used in Tableau are given below. The name of filter types is sorted based
on the order of execution in Tableau.
Extract Filters

Data Source Filters

Context Filters

Dimension Filters

Measure Filters

473. Difference between Dimensions and Measures

Basic Tableau

Dimensions contain qualitative values (such as names, dates, or geographical data). You can use
dimensions to categorize, segment, and reveal the details in your data. Dimensions affect the level of
detail in the view.

Measures contain numeric, quantitative values that you can measure. Measures can be aggregated.
When you drag a measure into the view, Tableau applies an aggregation to that measure (by default).

474. What Are the Different Joins in Tableau

Intermediate Tableau

There are four types of joins which are used to combine data in Tableau: inner, left, right and full outer.
Let’s look into it one by one:

Inner:

Inner join results in a table that contains values that have matches in both tables.

Left:

The left join results in a table that contains the values from the left table and corresponding matches
from the right table. And in case, if a value in the left table doesn’t have a corresponding match in the
right table, a null value in the data grid is reflected.
Right:

Right join results in a table which contains all the values from the right table and corresponding matches
from the left table. And in case, if a value in the right table doesn’t have a corresponding match in the left
table, a null value in the data grid is reflected.

Full Outer:

Full outer join results in a table that contains all values from both tables. And a null value is reflected in
data grid when a value from either table doesn’t have a match with the other table.

475. What is a Calculated Field, and How Will You Create One

Basic Tableau

Sometimes your data source does not contain a field (or column) that you need for your analysis. For
example, your data source might contain fields with values for Sales and Profit, but not for Profit Ratio. If
this is the case, you can create a calculated field for Profit Ratio using data from the Sales and Profit
fields.

How to create a simple calculated field using an example.

Step 1: Create the calculated field

In a worksheet in Tableau, select Analysis > Create Calculated Field.

In the Calculation Editor that opens, give the calculated field a name.

In this example, the calculated field is called Profit Ratio.

Step 2: Enter a formula

In the Calculation Editor, enter a formula.

This example uses the following formula:


SUM([Profit])/SUM([Sales])

476. What Is a Parameter in Tableau

Intermediate Tableau

A parameter is a global placeholder value such as a number, date, or string that can replace a constant
value in a calculation, filter, or reference line.

For example, you may create a calculated field that returns True if Sales is greater than $500,000 and
otherwise returns False. You can replace the constant value of “500000” in the formula with a parameter.
Then, using the parameter control, you can dynamically change the threshold in your calculation

477. What is the Use of Dual-axis

Intermediate Tableau

Dual axes are two independent axes that are layered on top of each other. According to Tableau, dual
axes allow you to compare multiple measures. Dual axes are useful when you have two measures that
have different scales.

478. What is the Difference Between Treemaps and Heat Maps

Basic Tableau

A heat map is a two-dimensional representation of information with the help of colours. Heat maps can
help the user visualize simple or complex information.

Treemaps are ideal for displaying large amounts of hierarchically structured (tree-structured) data. The
space in the visualization is split up into rectangles that are sized and ordered by a quantitative variable.
The levels in the hierarchy of the treemap are visualized as rectangles containing other rectangles. Each
set of rectangles on the same level in the hierarchy represents a column or an expression in a data table.
Each individual rectangle on a level in the hierarchy represents a category in a column.

479. What is the Difference Between .twbx And .twb


Basic Tableau

Tableau Workbook File (TWB) is an XML document. It contains the information about your sheets,
dashboards and stories. The TWB file references a data source file such as Excel or TDE, and when you
save the TWB file, it is linked to the source.

The most important thing to remember about TWB files is that they don’t contain any data – if you want
to share your workbook, therefore, you will need to send both the Tableau Workbook File and the data
source file.

Tableau Packaged Workbook (TWBX) is a package of files “compressed” together. It includes a data
source file, TWB, and any other file used to produce the workbook (including images).

TWBX is intended for sharing. It does not link to the original file source; instead, it contains a copy of the
data that was obtained when the file was created. TWBX files are usually used as reports and can be
viewed using Tableau Viewer.

TWBX isn’t designed for auto-updating. If you refresh/update the source file, TWBX will stay unchanged.
If you want your workbook to update when the source file is updated, you need to use the TWB file
format.

480. Explain the Difference Between Tableau Worksheet, Dashboard, Story, and
Workbook

Basic Tableau

Tableau uses a workbook and sheet file structure, much like Microsoft Excel.

A workbook contains sheets, which can be a worksheet, dashboard, or a story.

A worksheet contains a single view along with shelves, legends, and the Data pane.

A dashboard is a collection of views from multiple worksheets.

A story contains a sequence of worksheets or dashboards that work together to convey information.

481. How many maximum tables can you join in Tableau


Basic Tableau

Hint?

482. Explain Pareto chart and how is it created in Tableau

Intermediate Tableau

A Pareto chart is a type of chart that contains both bars and a line graph, where individual values are
represented in descending order by bars, and the ascending cumulative total is represented by the line.

To create a Pareto chart in Tableau, first create a bar chart.

Create a bar graph that shows Sales by Sub-Category in descending order:

i. Connect to the Sample - Superstore data source.

ii. From the Dimensions area of the Data pane, drag Sub-Category to Columns.

iii. From the Measures area of the Data pane, drag Sales to Rows.

iv. Click Sub-Category on Columns and select Sort.

In the Sort panel, do the following:

i. Under Sort order, select Descending.

ii. Under Sort by, select Field.

iii. Leave all other values unchanged, with Sales as the chosen field and Sum as the chosen aggregation.

iv. Click OK to exit the Sort panel.

Products are now sorted from highest sales to lowest.

Add a Line Chart

Add a line chart that also shows Sales by Sub-Category:

i. From the Measures area of the Data pane, drag Sales to the far right of the view, until a dashed line
appears.

ii. Drop Sales to create a dual-axis view. It is a bit hard to see that there are two instances of the Sales
bars at this point, because they are arranged identically.

iii. Select SUM(Sales) (2) on the Marks card, and change the mark type to Line.

Add a table calculation to the line chart to show sales by Sub-Category as a running total and as a percent
of total:

i. Click the second copy of SUM(Sales) on Rows and select Add Table Calculation.

ii. Add a primary table calculation to SUM(Sales) to present sales as a running total: choose Running Total
as the Calculation Type, and do not close the Table Calculation panel yet.

iii. Add a secondary table calculation to present the data as a percent of total: click Add Secondary
Calculation and choose Percent of Total as the Secondary Calculation Type.

iv. The Table Calculation panel should now show both calculations; click the X in its upper-right corner to
close it.

v. Click Color on the Marks card to change the color of the line.

483. Introduce yourself/Tell us about yourself

Basic HR

Hint?

484. Why do you want to leave your current organization?

Basic HR

Hint?

485. What are your strengths?


Basic HR

Hint?

486. What are your weaknesses?

Basic HR

Hint?

487. Why do you want to join our company?

Basic HR

Hint?

488. How do you typically respond to problems?

Intermediate HR

Hint?

489. What significant goals have you set for yourself in the past? Have you achieved
those?

Intermediate HR

Hint?

490. You have worked in the IT sector for so long, why is there a sudden interest in
analytics?

Intermediate HR

Hint?
491. Have you worked on any analytics projects or assignments?

Basic HR

Hint?

492. Please describe your future career goals

Intermediate HR

Hint?

493. Do you have any idols? In what way do they inspire you?

Basic HR

Hint?

494. What are your interests and hobbies? What do you do in your free time?

Basic HR

Hint?

495. What has been your biggest achievement at work?

Intermediate HR

Hint?

496. Do you prefer working as an individual contributor or managing a team?

Intermediate HR

Hint?
497. Do you have any questions for us?

Basic HR

Hint?

Copyright © 2024 Great Learning. All Rights Reserved.
