Data Science Interview Questions
You can't anticipate every question an interviewer will ask. However, there are many critical
questions that you can prepare before the interview.
Our hiring partners have helped us curate a set of interview questions on key skills, which will help
you prepare better for data science job roles.
1. Name a function which is most useful to convert a multidimensional array into a one-
dimensional array. For this function, will changing the output array affect the original array?
Basic Python
The flatten() method can be used to convert a multidimensional NumPy array into a 1D array. If we
modify the output array returned by flatten(), it will not affect the original array, because this
method returns a copy of the original array's data.
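A minimal sketch of this behaviour (assuming NumPy; the array values are illustrative):
import numpy as np

arr = np.array([[1, 2], [3, 4]])
flat = arr.flatten()          # returns a 1D copy of the data
flat[0] = 99                  # modify the flattened copy
print(arr[0, 0])              # still 1: the original array is unchanged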
2. If there are two variables defined as 'a = 3' and 'b = 4', will the id() function return the same
values for a and b?
Basic Python
The id() function in Python returns the identity of an object, which in CPython corresponds to its
memory address. Since this identity is unique and constant for an object during its lifetime, id() will
not return the same value for a and b, because 3 and 4 are different objects.
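A quick illustration of the point above (the exact id values will differ from run to run):
a = 3
b = 4
print(id(a), id(b))       # two different identities, since 3 and 4 are distinct objects
print(id(a) == id(b))     # False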
Basic Python
4. In python, if we create two variables 'mean = 7' and 'Mean = 7' , will both of them be
considered as equivalent?
Basic Python
Basic Python
'inplace' is a parameter available in a number of pandas functions, and it affects how the
function executes. With 'inplace=True', the original DataFrame is modified and the function
returns None. The default behaviour is 'inplace=False', which returns a modified copy of the
DataFrame without affecting the original DataFrame.
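A short sketch of the difference, using an illustrative DataFrame and column names:
import pandas as pd

df = pd.DataFrame({'col': [1, 2], 'other': [3, 4]})

out = df.drop('col', axis=1)                # inplace=False (default): returns a new DataFrame, df unchanged
ret = df.drop('other', axis=1, inplace=True)  # inplace=True: modifies df itself and returns None
print(ret)                                  # None
print(df.columns.tolist())                  # ['col']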
Basic Python
Basic Python
# taking input from user
number = int(input("Enter any number: "))
# prime number is always greater than 1
if number > 1:
    for i in range(2, number):
        if (number % i) == 0:
            print(number, "is not a prime number")
            break
    else:
        print(number, "is a prime number")
# if the entered number is less than or equal to 1
# then it is not a prime number
else:
    print(number, "is not a prime number")
8. What is the difference between univariate and bivariate analysis? What all different
functions can be used in python?
Basic Python
Univariate analysis summarizes only one variable at a time, while bivariate analysis compares
two variables. Below are a few functions which can be used in univariate and bivariate analysis:
1. To find the population proportions with different types of blood disorders: df.Thal.value_counts()
2. To make a plot of the distribution: sns.distplot(df.Variable.dropna())
3. To find the minimum, maximum, average, and standard deviation of the data: the describe() function returns the minimum, maximum, mean etc. of the numerical variables of the data frame.
4. To find the mean of a variable: df.Variable.dropna().mean()
5. Boxplot to observe outliers: sns.boxplot(x=' ', y=' ', hue=' ', data=df)
6. Correlation plot: data.corr()
Basic Python
- A 'for' loop iterates over a sequence or other iterable, so the number of iterations to be
performed is known in advance.
- In a 'while' loop, the number of iterations is not known in advance. The statements in the loop
body keep running as long as a specific condition holds true, and the loop stops once that
condition becomes false.
Basic Python
11. How will you import multiple excel sheets in a data frame?
Basic Python
The Excel sheets can be read into a DataFrame using the 'pd.read_excel()' function and then
concatenated using 'pd.concat()'. Syntax: df =
pd.concat(pd.read_excel('file_name.xlsx', sheet_name=None), ignore_index=True). Passing
sheet_name=None reads all the sheets of the workbook.
Basic Python
The append() method adds a single item to the end of the list. The syntax of the append() method
is: list.append(item). On the other hand, the extend() method extends the list by adding each
element from an iterable. The syntax of the extend() method is: list.extend(iterable)
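A small example of the difference:
nums = [1, 2, 3]
nums.append([4, 5])    # the list [4, 5] is added as a single element
print(nums)            # [1, 2, 3, [4, 5]]

nums = [1, 2, 3]
nums.extend([4, 5])    # each element of the iterable is added individually
print(nums)            # [1, 2, 3, 4, 5]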
Basic Python
Python has the following standard data types:
- Boolean
- Set
- Mapping type: dictionary
- Sequence types: list, tuple, string
- Numeric types: int, float, complex
Basic Python
Basic Python
Python offers the int() function, which takes a string or a number as an argument and returns an
integer. A string can be converted only if it represents a valid whole number. Keep this special
case in mind: passing a floating-point number (a number with a fractional part) returns the float
truncated toward zero, i.e. the fractional part is simply discarded.
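A few illustrative conversions (note the truncation toward zero for floats):
print(int("42"))     # 42  - a string representing a whole number
print(int(3.9))      # 3   - fractional part is discarded
print(int(-3.9))     # -3  - truncation toward zero, not rounding down
# int("3.9") would raise a ValueError, since the string is not a whole number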
Basic Python
# Python program to check if the number is an Armstrong number or not
# take input from the user
num = int(input("Enter a number: "))
# initialize sum
sum = 0
# find the sum of the cube of each digit
temp = num
while temp > 0:
    digit = temp % 10
    sum += digit ** 3
    temp //= 10
# display the result
if num == sum:
    print(num, "is an Armstrong number")
else:
    print(num, "is not an Armstrong number")
17. What is the difference between list, array and tuple in Python?
Basic Python
A list is an ordered collection of items. Lists are mutable and dynamic, can contain objects of
different data types, and list elements can be accessed by index number. An array is an ordered
collection of items of the same data type. Arrays are mutable and can also be accessed by index
number. Tuples are immutable and can store any type of data; a tuple is defined using parentheses ()
and cannot be changed or replaced after creation since it is an immutable data type.
Basic Python
loc gets rows (or columns) with particular labels from the index. iloc gets rows (or columns)
at particular positions in the index and it only takes integers.
Basic Python
The built-in reverse() method reverses the contents of a list object in place. That means it does
not return a new copy of the original list; instead it directly modifies the original list object.
Syntax: list.reverse()
20. What is the apply function in Python? How does it work?
Basic Python
The pandas apply() method allows users to pass a function and apply it to every single value of a
pandas Series (or along an axis of a DataFrame). Syntax: s.apply(func, convert_dtype=True, args=())
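A small example of apply() on a Series (the function used here is illustrative):
import pandas as pd

s = pd.Series([1, 2, 3])
squared = s.apply(lambda x: x ** 2)   # applies the lambda to every value of the Series
print(squared.tolist())               # [1, 4, 9]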
21. How do you get the frequency of a categorical column of a dataframe using python?
Basic Python
Using df['column_name'].value_counts(), where df is the DataFrame. The value_counts() function
returns the counts of the distinct values in the column, sorted in descending order by
default.
Basic Python
The range() function in Python always excludes the stop value from the result. Here it will
generate a numeric sequence from 0 to (5-1) = 4, and it will not include 5.
Basic Python
Pandas' drop() method is used to remove specific rows or columns. To drop a column, the
parameter 'axis' should be set to 1. This parameter determines whether to drop labels from the
columns (axis=1) or from the rows/index (axis=0, the default). Syntax:
df.drop('column_name', axis=1)
Basic Python
A NaN value does not compare equal to itself. That is why checking whether a variable is equal to
itself is a popular way to look for NaN values: if it is not equal to itself, it is most likely a NaN value.
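A quick sketch of this check (standard-library and NumPy alternatives shown for completeness):
import math
import numpy as np

x = float('nan')
print(x != x)            # True - a NaN value is not equal to itself
print(math.isnan(x))     # True - explicit check from the standard library
print(np.isnan(x))       # True - NumPy equivalent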
25. How can we convert a python series object into a dataframe?
Basic Python
to_frame() is a function that helps us convert a Series object into a DataFrame.
Syntax: Series.to_frame(name=None), where name, if given, substitutes the existing Series name
and is used as the column name of the resulting DataFrame.
Basic Python
27. Can you plot 3D plots using matplotlib? Describe the function.
Intermediate Python
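Yes: matplotlib supports 3D plotting by creating an axes with a 3D projection (provided by the mplot3d toolkit). A minimal sketch with illustrative data; older matplotlib versions may additionally need "from mpl_toolkits.mplot3d import Axes3D":
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 50)
y = np.sin(x)
z = np.cos(x)

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')   # create a 3D axes
ax.plot(x, y, z)                             # line plot in 3D; ax.scatter(x, y, z) also works
ax.set_xlabel('x'); ax.set_ylabel('y'); ax.set_zlabel('z')
plt.show()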
Intermediate Python
In older versions of scikit-learn, OneHotEncoder could not process string values directly, so
nominal features stored as strings first had to be mapped to integers (recent versions can encode
strings directly). pandas.get_dummies is kind of the opposite: by default it only converts string
(object) columns into a one-hot representation, unless specific columns are passed.
29. Name a tool that can be used to convert categorical columns into a numeric column.
Intermediate Python
Two of the most used and popular tools are LabelEncoder and OneHotEncoder, both provided as
part of the sklearn library. LabelEncoder can be used to transform categorical data into integers:
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
x = ['Apple', 'Orange', 'Apple', 'Pear']
y = label_encoder.fit_transform(x)
print(y)  # array([0, 1, 0, 2])
OneHotEncoder can be used to transform categorical data into a one-hot encoded array:
from sklearn.preprocessing import OneHotEncoder
onehot_encoder = OneHotEncoder(sparse=False)  # in newer scikit-learn versions this parameter is called sparse_output
y = y.reshape(len(y), 1)
onehot_encoded = onehot_encoder.fit_transform(y)
print(onehot_encoded)
Intermediate Python
The 'drop_duplicates()' function in pandas eliminates redundant (duplicate) rows from the
DataFrame and returns the result. Syntax: DataFrame.drop_duplicates(subset=None, keep='first',
inplace=False). subset: takes a column or list of column labels; the default value is None. If
columns are passed, only those columns are considered when identifying duplicates. keep: controls
how duplicate values are treated. It accepts three distinct values ('first', 'last', False) and the
default is 'first'.
Intermediate Python
Depending on the situation, there are a few possible ways to select a sample from the dataframe:
1. Randomly select a single row: df = df.sample()
2. Randomly select a specified number of rows, e.g. n=3: df = df.sample(n=3)
3. Allow a random selection of the same row more than once: df = df.sample(n=3, replace=True)
4. Randomly select a specified fraction of the total number of rows: df = df.sample(frac=0.50)
Intermediate Python
Pandas dataframe.groupby() function is used to split the data into groups based on some
criteria. Pandas objects can be split on any of their axes.
Syntax: DataFrame.groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, squeeze=False, **kwargs)
by: mapping, function, str, or iterable
axis: int, default 0
level: if the axis is a MultiIndex (hierarchical), group by a particular level or levels
as_index: for aggregated output, return an object with group labels as the index. Only relevant for DataFrame input.
sort: sort group keys. Better performance can be obtained by turning this off. Note this does not influence the order of observations within each group; groupby preserves the order of rows within each group.
group_keys: when calling apply, add group keys to the index to identify pieces
squeeze: reduce the dimensionality of the return type if possible, otherwise return a consistent type
Returns: a GroupBy object
Intermediate Python
A simple and commonly used plot to quickly check the distribution of a sample of data is
the histogram.
from matplotlib import pyplot
pyplot.hist(data)
34. Which libraries in SciPy have you worked with in your project?
Intermediate Python
SciPy contains modules for optimization, linear algebra, integration, interpolation, special
functions, FFT, signal and image processing, ODE solvers, etc. Subpackages include: scipy.cluster,
scipy.constants, scipy.fftpack, scipy.integrate, scipy.interpolate, scipy.linalg, scipy.io,
scipy.ndimage, scipy.odr, scipy.optimize, scipy.signal, scipy.sparse, scipy.spatial, scipy.special
and scipy.stats (older versions also shipped scipy.weave).
35. How is the Python series different from a single column dataframe?
Intermediate Python
A pandas Series is the data structure for a single column of a DataFrame, not only
conceptually but literally: the data in a DataFrame is actually stored in memory as a
collection of Series. A Series is a one-dimensional object that can hold any data type such as
integers, floats and strings; it has at most a single name attribute rather than column headers,
whereas a DataFrame has column names.
Intermediate Python
The zip() function takes iterables (zero or more), aggregates their elements into tuples, and
returns an iterator of those tuples. The syntax of the zip() function is: zip(*iterables)
Yes. A lambda function evaluates an expression for a given argument. It can be used as an
anonymous function within another function.
Intermediate Python
[::] just produces a copy of all the elements in order, while [::-1] produces a copy of all the
elements in reverse order.
Intermediate Python
Pandas' isnull() function detects missing values in the given object. It returns a boolean same-
sized object indicating whether the values are NA: missing values get mapped to True and non-
missing values get mapped to False.
Intermediate Python
Python supports negative indexing of sequences, something which is not available for arrays in
most other programming languages. Negative indexing starts from where the array ends: an index
of -1 gives the last element and -2 gives the second-to-last element of an array.
41. Python or R, which one would you prefer for text analytics?
Intermediate Python
42. What all different methods can be used to standardize the data using python?
Intermediate Python
Min-Max Scaler, Standard Scaler, Max-Abs Scaler, Robust Scaler, Quantile Transformer Scaler,
Power Transformer Scaler and Unit Vector Scaler.
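A brief sketch of two of these scalers from scikit-learn, applied to illustrative data:
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

X = np.array([[1.0], [2.0], [3.0], [10.0]])

print(StandardScaler().fit_transform(X).ravel())   # zero mean, unit variance
print(MinMaxScaler().fit_transform(X).ravel())     # rescaled to the [0, 1] range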
Intermediate Python
44. How do you do Up-sampling of data? Name a python function or explain the code.
Intermediate Python
Up-sampling is the process of randomly duplicating observations from the minority class in
order to reinforce its signal. There are several heuristics for doing so, but the most common
way is to simply resample with replacement. Module for resampling in Python: from
sklearn.utils import resample
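A minimal sketch of up-sampling the minority class with sklearn.utils.resample, assuming an illustrative DataFrame df with a binary 'target' column:
import pandas as pd
from sklearn.utils import resample

df = pd.DataFrame({'feature': range(10),
                   'target':  [0]*8 + [1]*2})        # 8 majority vs 2 minority rows

majority = df[df.target == 0]
minority = df[df.target == 1]

minority_upsampled = resample(minority,
                              replace=True,              # sample with replacement
                              n_samples=len(majority),   # match the majority class size
                              random_state=42)

balanced = pd.concat([majority, minority_upsampled])
print(balanced.target.value_counts())                    # both classes now have 8 rows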
Machine learning is a branch of artificial intelligence (AI) that focuses on the use of data and
algorithms to mimic the way that humans learn. It aims to gradually improve by learning from
the events that happened in the past (data captured in the past), assuming that the past data is a
good representation of the future. There are various machine learning algorithms available to
build a model that can learn the hidden patterns from the past data, known as training data,
in order to make predictions for the future data or the unseen data, based on which
decisions can be taken. For example: Predicting the prices of a house based on attributes of
the property.
46. Machine learning helps in summarising the patterns in the data in a mathematically
precise way. What exactly is the mathematical outcome of any (machine learning) model
building exercise?
Machine learning models take data as input to find the hidden patterns in it and try to
summarize the patterns that exist in the data by establishing a relationship between the
predictors and the predicted values in a mathematically precise way. The mathematical
outcome of a model can be as simple as an equation that relates the predictors to the target
variable. For example, the relationship between salary and years of experience of an
individual.
47. Machine learning automates the process of building mathematical models out of data.
Explain/elaborate on this statement in the light of the linear regression algorithm.
Linear regression is a linear model which tries to fit the best fit line through the data and
establish the relationship between the independent variables and the dependent variable in
the form of a linear equation. The equation of the best fit line can be given as: Y = a*x1 + b*x2 + c,
where a and b are the coefficients of the x1 and x2 variables respectively and c is the constant
(intercept). Linear regression tries to fit the line in such a way that the errors are minimized, that
is, the predicted values are close to the observed values. The machine-learning algorithm of linear
regression automates the process of model building, i.e. it automatically finds the best fit line
which has the minimum error, or predicts the values that are closest to the observed values. This
means that the process of finding the relationship between the independent variables and the
dependent variable is automated.
48. If your model performs very well on the data that it was trained on but not on the data
that it has not seen so far, how will you address that performance gap? Why is it important
to address that gap?
Data generally contains information as well as noise. When we fit a model on the training
data, it learns both the information and noise. If the model learns too much noise and fails to
capture the required information then we see that there is a performance gap between the
training performance and the performance on the unseen data (test set). This performance
gap indicates that the model is overfitting, i.e. failing to replicate the performance of the
training set on the test set. To address this performance gap between the training and the
test set various regularization techniques can be applied. In linear models like linear
regression, regularization techniques like ridge regression and lasso regression can be used.
In non-linear models like decision trees, pruning techniques like pre-pruning and post-pruning
can be used to deal with the performance gap. Also, the technique of
cross-validation can be implemented to determine the performance of the model on the
unseen dataset.
49. When a model gets to production, it will have to make predictions on data that it has not
seen so far. How can we ensure that the model performs well on this data?
Before sending the model to production we can check its performance and validity using the following methods.
Train-validation split: In this method, we divide the training set into two parts: one part is kept for training and the other is kept for validating the model performance. We train the model on the training set and test it against the validation set. Based on the performance of the model on the validation set we tune the hyperparameters of the model to get a generalized model with good performance.
K-fold cross-validation: In this method, we divide the training set into k folds, where k can be any number ranging from 2 to the number of records in the dataset minus 1 (generally 10 folds are preferred). Let's assume that we set the value of k to 5; then, in this case, 4 folds will be used for training the model and the left-out fold will be used as a test set. The same procedure is repeated for all the folds, i.e. each fold is used once as the test set and otherwise for training. To determine the model performance, the average of the metrics across all the folds is taken. With this method we can be more confident of the model's performance because the model has been tested across several different subsets of the data.
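As an illustration of k-fold cross-validation in practice, a minimal sketch using scikit-learn (the model and dataset are illustrative):
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

scores = cross_val_score(model, X, y, cv=5)   # 5-fold cross-validation
print(scores)                                 # accuracy on each of the 5 folds
print(scores.mean())                          # average performance across folds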
Supervised learning is a type of machine learning method in which algorithms are trained
using well "labeled" training data, that is independent variables are already tagged against a
defined target variable. With this technique, we can make predictions and compare them
against the ground truth. For example, Determining if a client might default on a loan or not.
55. How does multicollinearity affect the performance of a linear regression model?
56. Which evaluation metric should you use to evaluate a linear regression model built on
a dataset that has a lot of outliers in it?
MAE would be a good metric in that case because it is more robust to outliers. MSE or
RMSE is extremely sensitive to outliers and penalizes them more heavily.
R-squared (R2) is a statistical measure that represents the proportion of the variance that is
explained in the dependent variable by the independent variables. For example, if the R2 of a
model is 0.70, then 70% of the variation can be explained by the model's inputs. Adjusted R-
squared is a modified version of R-squared that has been adjusted for the number of
independent variables in the model and penalizes the model performance for adding
variables that do not improve the existing model. If we add a new independent variable in
the model, the R2 of the model will always increase. However, the adjusted R-squared
increases only when the new independent variable improves the model more than expected
by chance. It decreases when the independent variable improves the model by less than
expected.
A decision tree can be considered as an inverted tree representation that grows from top to
bottom instead of bottom to top. It tries to mimic the human decision-making process and
tries to represent all the possible solutions to a decision based on certain conditions. For
example, if you have to decide whether or not to go out for a coffee at a nearby place, a
simple decision tree could look like this: start with the main question, “To go out for coffee?”
The decision to go out depends on the location of the place, so the second question
becomes “Is the place nearby?” If ‘yes’, then go for coffee, else do not go.
The main aim of the decision tree is to achieve homogeneity among the leaf nodes i.e any
split made by the decision tree should result in pure leaves which contain one type of
decision only. For example, If we are trying to predict whether a person will default on a loan
or not and we use the decision tree to make this prediction then the result from the decision
tree split must result in all the defaulters in one leaf and all the non-defaulters on another
leaf node. If the composition of the leaf node is 50% defaulters and 50% non-defaulters then
the leaf is considered completely impure. If a decision tree is built without any restrictions
the tree will grow to its full length and will try to achieve homogeneity by capturing complex
patterns as well as noise present in the data during this process. Due to this, it ends up
learning all the patterns that are present in the training data but fails to replicate the
performance on unseen data, i.e. it leads to overfitting.
60. How can you improve the performance of an overfitting Decision Tree model?
To avoid overfitting in decision trees and get a generalized model which performs well on
training as well as the test set we can use Pruning techniques. There are two ways to prune a
decision tree: a) Pre-Pruning: In this method, the decision tree is restricted before it can grow
to its full length by bounding the depth of the tree. There are several other hyperparameters
that are available in the SKlearn implementation of the Decision tree which help in restricting
the growth of the tree. This method is also known as the early stopping of tree. b) Post-
Pruning: In this method, the tree is allowed to grow to its full length and then the sub-trees
of the decision tree are pruned. The sub-trees that are pruned in this process are the ones
that do not provide any significant information to the model. The significance of the sub-tree
is calculated by removing it and checking the error between the full-grown tree and the tree
from which the sub-tree was removed. If the error is large that signifies the removed sub-
tree is important in prediction, if the error is small it signifies that the sub-tree is not much
important in the prediction.
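A brief sketch of both ideas using the scikit-learn decision tree (the dataset and parameter values are illustrative; ccp_alpha here stands in for post-pruning via cost-complexity pruning):
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Pre-pruning: restrict the tree while it grows (early stopping)
pre_pruned = DecisionTreeClassifier(max_depth=4, min_samples_leaf=10, random_state=42)
pre_pruned.fit(X_train, y_train)

# Post-pruning: grow the tree, then prune weak sub-trees via cost-complexity pruning
post_pruned = DecisionTreeClassifier(ccp_alpha=0.01, random_state=42)
post_pruned.fit(X_train, y_train)

print(pre_pruned.score(X_test, y_test), post_pruned.score(X_test, y_test))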
61. How is a random forest model different from just using 'n' decision trees?
Let’s say we build ‘n’ decision trees and a Random Forest model with ‘n’ decision tree
estimators. The Random Forest model will be different from the ‘n’ individual decision trees
because it employs bootstrap sampling of the rows (sampling with replacement) together with
random sampling of the features considered at each split, so each decision tree in the random
forest is built on a different view of the data. The final output of the random forest is decided on
the basis of voting (for classification) or averaging (for regression) of the results from the ‘n’
decision tree estimators, thereby making the prediction more robust. Whereas if we simply train
‘n’ separate decision trees on the same data, the outcome of each will be the same because the
underlying training data for each of the decision trees is the same.
The ROC curve (receiver operating characteristic curve) is a curve showing the performance
of a classification model at different thresholds. This curve plots two parameters, False
Positive Rate (FPR) on the x-axis and True Positive Rate (TPR) on the y-axis. AUC stands for
"Area under the ROC Curve" i.e, AUC measures the entire area under the ROC curve. These
two metrics are typically used together to check the performance of a binary classification
problem.
64. How would you identify the optimal number of clusters in your dataset?
The most common method to identify the optimal number of clusters in K-Means clustering
is the elbow method. In the elbow method, we iterate over a range of K values i.e number of
clusters, and for each value of K within-cluster sum of squares (WCSS) is calculated that is
the distance between each point and the centroid in a cluster. When we plot the WCSS with
the number of clusters or K value, the plot looks like an Elbow because as the number of
clusters increases, the WCSS value will start to decrease. The K value is chosen where a
rapid decrease in the WCSS is observed or the point, where the line in the plot starts to
move almost parallel to the X-axis. The K value corresponding to this point is the optimal
number of clusters.
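A minimal sketch of the elbow method with scikit-learn (the data generation is illustrative):
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

wcss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wcss.append(km.inertia_)          # within-cluster sum of squares for this k

plt.plot(range(1, 11), wcss, marker='o')
plt.xlabel('Number of clusters (k)')
plt.ylabel('WCSS')
plt.show()                            # look for the "elbow" in the curve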
65. Why is it important to understand the bias variance trade-off when applying data
science?
It is important to understand the bias-variance trade-off because a model high on the bias
fails to identify the underlying patterns on the training data which leads to the creation of a
simple model that fails to perform well on the training set as well as the test set leading to
high errors on training and test sets or underfitting. Whereas a model high on the variance
will be too complex and learn all the patterns as well the noise on the training set perfectly
but will fail to replicate the same performance on the test set leading to high errors on the
test set or overfitting. To avoid such issues, it is important to understand the trade-off
between bias and variance while working on a business problem and come up with an
optimal solution that maintains a balance between bias and variance so that model is neither
underfitting nor overfitting but is a good fit.
66. What is an activation function, and why does a neural network need one?
Activation Functions are mathematical functions that apply a transformation on the output
of a layer in a neural network, which generally tends to be a linear combination of the nodes
of the previous layer with weights and biases. Activation Functions are crucial because they
introduce non-linearity into the neural network - without this, a neural net is simply a large
linear combination of its nodes, and hence, no more powerful than a linear regressor or
classifier. Neural networks are needed to find patterns and draw decision boundaries in
problems that can be highly complex and non-linear, and this makes Activation Functions
extremely important to their functioning. Some examples of Activation Functions are the
Sigmoid function, the Tanh function, and the ReLU function.
67. Why is the Sigmoid activation function not preferred in hidden layers of deep neural
networks?
The Sigmoid function takes in any real number and outputs a continuous numeric value
between 0 and 1, which can then be discretized using a threshold (Ex: 0.5) and converted
into either 0 or 1 - hence its use as a binary classifier. Therefore, the Sigmoid function is
generally preferred in the output layer of a binary classification neural network. It is not
recommended to use it in the hidden layers because of the vanishing gradient problem i.e, if
your input is on the higher side in terms of magnitude (where the sigmoid function goes flat),
then the gradient will be close to zero. Due to the calculus of the chain rule of derivatives
used in backpropagation, this would result in multiple small values being multiplied with each
other to determine the final step size in gradient descent, and that would be an extremely
small step, meaning the neural network's learning speed would be negligible. Hence, we do
not prefer using the Sigmoid function in the hidden layers of deep neural networks.
68. Why is it not a good idea to use the Sigmoid function in the output layer of a neural
network meant for multi-class classification problems?
Basic Deep Learning
The Sigmoid function merely outputs the probability / likelihood of that option being correct,
without taking into account the other options in a multi-class problem, and the fact that the
probabilities of all the multiple classes should add up to 1. This is actually done by the
Softmax activation function, which is a generalized version of the Sigmoid for multi-class
problems. Hence, we usually use the Softmax function in the output layer of a neural
network when dealing with multi-class classification, so that we can get the output in a
probabilistic shape taking all the options into account, and not just one.
69. What are the potential pitfalls of using neural networks for supervised learning?
The first problem with traditional fully connected neural networks is that they are very
computationally intensive, so they may take significantly longer to train and come up with
predictions than a more traditional machine learning algorithm, due to their vast number of
parameters and their hierarchical non-linear complexity - especially in deep neural networks.
This drawback means that naturally, neural networks would need to significantly outperform
a competing ML model in terms of the evaluation metrics for us to even consider using them
for supervised learning - and this tends to happen only once we cross a certain threshold in
terms of the volume of training data, usually in the order of millions of training examples.
Hence neural networks should not be used on smaller or intermediate sized training datasets
in supervised learning problems, because an ML model would likely perform as well or better
at a fraction of the compute cost with that size of data. Another problem with neural
networks is their black-box nature - we often don't know how or why the NN came up with a
certain output. Since its internal working is often not interpretable, it is often out of the
question to consider using neural networks in sensitive use cases where the explainability of
a model is paramount, such as healthcare or criminal justice. These are the potential pitfalls
of using neural networks that one should keep in mind before applying them to supervised
learning problems.
The architecture of a neural network, in terms of the number of neurons, the number of
layers and the activation function at various layers, is the first obvious set of
hyperparameters that can be tuned. The learning characteristics of the network, such as its
learning rate, the number of epochs and the batch size, are also an important set of
hyperparameters which can be tuned to improve the network's performance. There are smaller
and more nuanced hyperparameters that can also help in fine-tuning the neural net, such as
momentum parameters, a decay in the learning rate, the dropout ratio, the weight
initialization scheme and the batch normalization hyperparameters.
71. What are the pros and cons of using Batch Gradient Descent vs Stochastic Gradient
Descent?
Batch Gradient Descent suffers from computational cost, especially for larger datasets,
because it accepts the entire training dataset as one batch. This means each epoch will take a
long time to complete. So in case of a large training dataset, Stochastic Gradient Descent
may be preferred. However, the convergence characteristics of Batch Gradient Descent are
better - it converges directly to a minimum, whereas Stochastic Gradient Descent will oscillate
in the near vicinity of the minimum without properly reaching it, although Stochastic Gradient
Descent does converge and reach that point faster. Stochastic Gradient Descent also shows
very noisy learning characteristics, due to the variability between each training example
used. Another drawback of Stochastic Gradient Descent is that since we use only one
example at a time, we lose the compute advantage of vectorized implementation on it. So
Batch Gradient Descent is generally preferred for smaller datasets, while Stochastic Gradient
Descent is used for larger datasets. However due to the significant drawbacks of each
approach, a compromise called Mini-Batch Gradient Descent is often preferred among vanilla
optimization algorithms that don't use momentum or adaptive gradient, albeit with the cost
of an additional hyperparameter to tune, which is the mini-batch size.
72. Is the bias-variance tradeoff in Machine Learning applicable to Deep Neural Networks?
Why do you say so?
The biggest advantage of neural networks is that unlike traditional machine learning
algorithms, they appear to have no limit to the sheer complexity of the decision boundaries
they can create. This means that although they are data hungry, when they are actually
provided with larger and larger volumes of data, their performance tends to continually
improve when the number of nodes and layers in the network is increased, as opposed to
machine learning algorithms, whose performance tends to stagnate beyond a point even
after access to larger amounts of data. All of this means that the traditional bias-variance
tradeoff seen in machine learning may not strictly be applicable in deep learning; neural
networks merely appear to move to a new stage of the tradeoff when the volume of data
and the complexity of the neural network are correspondingly increased.
73. Let's say you have two neural networks. One of them has one hidden layer with sixteen
nodes, while the other has four hidden layers with four nodes each, so they both have
sixteen neurons, just in different configurations. Which of these is likely to perform better
on a complex supervised learning task and why?
Although the width (number of neurons in a layer) and depth (number of layers) of neural
networks are both important factors in determining its performance, complex supervised
learning tasks such as classifying a picture as a dog / cat appear to be best solved by
introducing a hierarchy in the neural network, that can progressively learn more and more
complex patterns in the data. In such an example, the second network, with four layers of
four nodes each, would be likely to perform better on the task than the first network, since it
has multiple layers and hence provides the network with a hierarchical mode of learning,
where the deeper layers may be able to understand more complex shapes and patterns in
the data. The depth of the neural network seems to increase its ability to learn complex
representations of the data more than its width - Ex: Some of the most famous neural
networks like GPT-3 have nearly a hundred layers.
74. How different is the decision boundary created by a neural network in comparison to
other non-linear ML algorithms such as Decision Trees and Random Forests? Which of
these techniques can create the most flexible non-linear decision boundary and why?
Neural networks can create the most complex decision boundaries out of all the alternatives
listed, due to their hierarchical nature of complexity and the fact that each node or layer
added in the network increases the flexibility of the model. Although Decision Trees,
Random Forests and Neural Networks are all non-linear approaches, the nature of the non-
linearity in the decision boundary differs among them. Decision Trees create "piecewise"
non-linearity - they create orthogonal / linear splits on every individual feature and create
rectangular boundaries based on that. This approach is more flexible than linear, but perhaps
not as flexible as a curved non-linear boundary. Random Forests attempt to aggregate
multiple trees and hence approach a curved boundary by combining multiple linear splits, but
they still only approximate curved non-linearity and don't actually accomplish it. Neural
networks do, however, create curved non-linear decision boundaries because they combine
multiple linear nodes and apply non-linear transformations in the form of activation
functions at each layer, and that level of flexibility in creating curved non-linearity is
unrivalled by any other machine learning algorithm.
75. What would be a good use case for implementing fully connected or other kinds of
neural networks for supervised learning over other ML models and why?
The use case for neural networks in supervised learning should ideally be in those scenarios
where traditional machine learning algorithms are known to fail or be inadequate for solving
the problem. This could be for highly unstructured kinds of data such as images, text or
audio, where the algorithm itself has to extract the features relevant to the prediction from
the dataset, and hence a traditional machine learning approach wouldn't work. Another use
case for neural networks is when the size of the dataset is quite large, and we would like that
increased dataset size to translate to improved pattern detection by the model. So when we
have an extremely large dataset (in the order of millions of examples) or unstructured data,
neural networks may be preferred over ML models.
76. Would applying a neural network make sense in a healthcare setting where we need to
predict the diagnosis and medication to offer a patient based on the symptoms displayed?
Why do you think so?
No, neural networks should ideally not be applied for any use case where the interpretability
or explainability of a model's decision making is paramount. Healthcare is a highly sensitive
domain, where decision making around diagnosis and medication for symptoms can make a
huge difference to the health condition of the patient, and medical practitioners cannot
afford to make mistakes in that process. Hence, the model used needs absolute transparency
rather than top performance which is not explainable, and neural networks would not be as
preferred for a healthcare use-case as decision trees or random forests.
77. What is the role of the Convolution operation in helping a neural network understand
images?
Convolution is a mathematical operation which takes two inputs, such as an image matrix and a
filter or kernel. It is the first layer to extract features from an input image in a CNN.
Convolution helps to retain the relationship between pixels by learning image features using
small squares of input data. The way the convolution operation mathematically works is by
using the dot product of the filter vector and pixel vector to replace the image pixels with
new values (modified image), and these dot product values are higher when the pattern of
the filter matches the pattern of the pixels. Hence, convolution excels at detecting patterns
and features in the image that match the patterns of the filters, and this is how feature
extraction is performed on the image.
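A small sketch of this dot-product view of convolution on a toy image, using SciPy's 2D convolution (the filter shown is a simple vertical-edge detector and all values are illustrative):
import numpy as np
from scipy.signal import convolve2d

image = np.array([[0, 0, 255, 255],
                  [0, 0, 255, 255],
                  [0, 0, 255, 255],
                  [0, 0, 255, 255]])

kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]])            # responds strongly at vertical edges

feature_map = convolve2d(image, kernel, mode='valid')
print(feature_map)                          # large magnitudes where the edge pattern matches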
78. Why do we mostly use the ReLU activation function in the feature extraction stage of
convolutional neural networks (CNNs)?
ReLU has the advantage of being simple to compute and also avoiding the vanishing gradient
problem, due to its constant derivative of 1 for positive inputs. This is useful in CNNs, which are
deep networks, as the error from backpropagation is easily propagated for the neural network's
learning.
The output from the convolutional layers represents high-level features in the data. While
that output could be flattened and connected to the output layer, adding a fully-connected
layer is a (usually) cheap way of learning non-linear combinations of these features.
Essentially the convolutional layers are providing a meaningful, low-dimensional, and
somewhat invariant feature space, and the fully-connected layer is learning a (possibly non-
linear) function in that space.
80. What are some drawbacks of using Convolutional Neural Networks on image datasets,
and how can they be addressed?
Although CNNs are optimized to work on image data and perform better and more
efficiently on images than fully connected neural networks, they still suffer from some
drawbacks which should be kept in mind. CNNs require quite a lot of labelled image data in
order to reach near-human levels of performance in image related tasks, and such data may
not readily be available. In that case, it may be better to use Transfer Learning to import the
weights and architecture of a pre-trained model and only fine tune its last few layers to apply
it to the problem at hand. CNNs may also be susceptible to spurious patterns in the data
(such as the sky always being present in car images - so it wrongly learns that having a sky is
important to classify something as a car), and this susceptibility can be resolved by
diversifying the training dataset to ensure nothing else about the images is consistent other
than the exact pattern we want the CNN to learn. CNNs can also be susceptible to small
perturbations in the dataset, for example: not being rotationally invariant, and this problem
should be addressed through the technique of data augmentation through various image
modification techniques such as flipping, rotation, cropping, mirroring, color modification etc.
81. Why is text pre-processing an essential part of NLP? What happens if we fail to pre-
process text data?
Text preprocessing helps us to get rid of unhelpful parts of the data, or noise, by converting
all characters to lowercase, removing punctuation marks, and removing stop words and typos.
Removing noise comes in handy when you want to do text analysis on pieces of data like comments
or tweets, as it gets rid of text that interferes with the analysis. If the text is not pre-processed,
you may receive an error or your model will not perform as expected.
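A minimal sketch of basic text cleaning with the standard library (the stop-word list here is a tiny illustrative subset; libraries such as NLTK or spaCy provide full lists):
import re
import string

stop_words = {'the', 'is', 'a', 'and', 'to'}   # illustrative subset only

def preprocess(text):
    text = text.lower()                                                # lowercase
    text = text.translate(str.maketrans('', '', string.punctuation))  # strip punctuation
    text = re.sub(r'\s+', ' ', text).strip()                           # normalize whitespace
    tokens = [t for t in text.split() if t not in stop_words]          # remove stop words
    return tokens

print(preprocess("The movie is GREAT, and the plot is a delight!!"))
# ['movie', 'great', 'plot', 'delight']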
82. In case you're working on an NLP application such as sentiment analysis of Twitter
posts, describe the text pre-processing steps that would most likely be required?
83. Which evaluation metric is suitable to measure the performance of sentiment analysis
and why?
Sentiment analysis is a classification problem, thus, it uses the metrics of Precision, Recall, F-
score, and Accuracy. Also, average measures like macro, micro, and weighted F1 scores are
useful for multi-class problems. Accuracy is used when the True Positives and True negatives
are more important while F1-score is used when the False Negatives and False Positives are
crucial. F1 scores are also helpful when there is a lot of class imbalance. As sentiment
analysis is a real-world problem, we can expect a lot of class imbalance. Thus, F1 scores are
mostly used.
84. What is the difference between stemming and lemmatization? Could you provide an
example?
Stemming and lemmatization both reduce inflected words to a base form. The difference is that
the stem may not be an actual word, whereas the lemma is an actual word of the language. For
example, 'beautiful' and 'beautifully' are stemmed to 'beauti', which has no meaning in the English
dictionary. The same words are, however, lemmatised to 'beautiful' and 'beautifully' respectively,
without changing the meaning of the words.
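A small sketch of the example above using NLTK (assuming the WordNet data has been downloaded, e.g. via nltk.download('wordnet')):
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ['beautiful', 'beautifully']:
    print(word, '->', stemmer.stem(word), '|', lemmatizer.lemmatize(word))
# Stems reduce to 'beauti', which is not a dictionary word;
# lemmas remain valid English words.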
85. Would you consider Logistic Regression to be a special case of using Neural Networks?
If so, how?
Yes, logistic regression is a specialized case of a one-node neural network, where we use
the Sigmoid activation function and the cost function being minimized is the Binary Cross-
Entropy function.
86. How do you compare categorical values, how would you know that a categorical value
is related to target variable?
Comparing categorical values: when the predictor has three or more levels/categories and the target
variable is nominal, the degree of association between the predictor and the target variable can be
measured with statistics such as the chi-squared test.
- When there is only one continuous target variable, there are one or more categorical independent
variables, and there is no control variable at all, then you can go for ANOVA.
- Similarly, when there is only one continuous target variable, there is only one categorical independent
variable (i.e. dichotomous, e.g. pass/fail), and no control variable, then go for t-Test
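A brief sketch of the three tests mentioned above using scipy.stats (all data shown is illustrative):
import numpy as np
from scipy.stats import chi2_contingency, f_oneway, ttest_ind

# Chi-squared test: categorical predictor vs categorical (nominal) target
contingency = np.array([[30, 10],
                        [20, 40]])                  # illustrative 2x2 frequency table
chi2, p_chi, dof, expected = chi2_contingency(contingency)

# One-way ANOVA: continuous target across three (or more) categorical groups
p_anova = f_oneway([5.1, 4.9, 5.3], [6.0, 6.2, 5.9], [7.1, 6.8, 7.0]).pvalue

# t-test: continuous target across a dichotomous (two-level) group
p_ttest = ttest_ind([5.1, 4.9, 5.3], [6.0, 6.2, 5.9]).pvalue

print(p_chi, p_anova, p_ttest)                      # small p-values suggest an association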
Linear regression is an analysis that assesses whether one or more predictor variables explain the
dependent (criterion) variable. The regression has five key assumptions:
1) Linear relationship: Linear regression needs the relationship between the independent and dependent
variables to be linear. The linearity assumption can best be tested with scatter plots.
2) Normality: The error terms must be normally distributed. (To check normality, one can look at a
Q-Q plot, or perform statistical tests of normality such as the Kolmogorov-Smirnov test or the Shapiro-Wilk test.)
3) Multicollinearity: Linear regression assumes that there is little or no multicollinearity in the data.
Multicollinearity occurs when the independent variables are too highly correlated with each other.
Multicollinearity may be tested with three central criteria: Correlation matrix, Tolerance, VIF
4) No auto-correlation: Linear regression analysis requires that there is little or no autocorrelation in the
data. Autocorrelation occurs when the residuals are not independent of each other. For instance, this
typically occurs in stock prices, where the price is not independent of the previous price.
5) Homoscedasticity: The error terms must have constant variance. This phenomenon is known as
homoskedasticity. The presence of non-constant variance is referred to as heteroskedasticity.
The idea behind simple linear regression is to "fit" the observations of two variables into a linear
relationship between them. Graphically, the task is to draw the line that is "best-fitting" or "closest" to the
points (x_i,y_i), where x_i and y_i are observations of the two variables which are expected to depend
linearly on each other.
Although many measures of best fit are possible, for most applications the best-fitting line is found using
the method of least squares. The method finds the linear function L which minimizes the sum of the
squares of the errors in the approximations of the y_i by L(x_i)
For example: to find the line y = mx + b of best fit through N points, the goal is to minimize the sum of the
squares of the differences between the observed y-coordinates and the predicted y-coordinates based on the line
and the x-coordinates.
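As a concrete sketch, the least-squares line can be computed either from the closed-form formulas or with numpy.polyfit (the data is illustrative):
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.9])

# Closed-form least-squares estimates
m = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b = y.mean() - m * x.mean()

# Same result via numpy's polynomial fit of degree 1
m_np, b_np = np.polyfit(x, y, 1)

print(m, b)        # slope and intercept minimizing the sum of squared errors
print(m_np, b_np)  # matches the closed-form values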
Classification is used when the output variable is a category such as “red” or “blue”, “spam” or “not spam”.
It is used to draw a conclusion from observed values. This differs from regression, which is used when the
output variable is a real or continuous value like “age”, “salary”, etc.
When we must identify the class the data belongs to, we use classification over regression. Like when
you must identify whether a name is male or female instead of finding out how they are correlated with
the person.
R-squared (coefficient of determination) measures the proportion of the variation in your dependent
variable (Y) explained by your independent variables (X) for a linear regression model.
Adjusted R-squared adjusts the statistic based on the number of independent variables in the model.
It is possible that R Square has improved significantly yet Adjusted R Square is decreased with the
addition of a new predictor when the newly added variable brings in more complexity than the power to
predict the target variables.
6. CART works best with a larger dataset, while Logistic Regression works better on a smaller dataset
1. The major limitation of Logistic Regression is the assumption of linearity between the dependent
variable and the independent variables.
2. It can only be used to predict discrete functions. Hence, the dependent variable of Logistic Regression
is bound to the discrete number set.
3. Non-linear problems can’t be solved with logistic regression because it has a linear decision surface.
Linearly separable data is rarely found in real-world scenarios
5. If the number of observations is less than the number of features, Logistic Regression should not be
used, otherwise it may lead to overfitting.
Python:
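A minimal Python sketch assuming scikit-learn (the dataset here is illustrative; it mirrors the R glm call below):
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = LogisticRegression(max_iter=5000)   # logistic regression classifier
model.fit(X_train, y_train)                 # counterpart of glm(..., family=binomial) in R
print(model.predict(X_test)[:5])            # predicted class labels
print(model.predict_proba(X_test)[:5])      # predicted class probabilities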
R:
glm(Target ~.,family=binomial(link='logit'),data=train)
True Positive (TP): The actual value was positive and the model predicted a positive value
True Negative (TN): The actual value was negative and the model predicted a negative value
False Positive (FP) – Type 1 error: The actual value was negative but the model predicted a positive value
False Negative (FN) – Type 2 error: The actual value was positive but the model predicted a negative
value
VIF, the Variance Inflation Factor, is used during regression analysis to assess whether certain
independent variables are correlated to each other and the severity of this correlation. If your VIF number
is greater than 10, the included variables are highly correlated to each other. Since the ability to make
precise estimates is important to many companies, generally people aim for a VIF within the range of 1-5.
A cutoff number of 5 is commonly used.
- Perform an analysis designed for highly correlated variables, such as principal components analysis or
partial least squares regression.
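A short sketch of computing VIF values with statsmodels (the DataFrame and column names are illustrative):
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
df = pd.DataFrame({'x1': x1,
                   'x2': x1 * 0.9 + rng.normal(scale=0.1, size=100),  # highly correlated with x1
                   'x3': rng.normal(size=100)})

X = add_constant(df)                       # VIF is computed with an intercept included
vifs = {col: variance_inflation_factor(X.values, i) for i, col in enumerate(X.columns)}
print(vifs)                                # x1 and x2 show large VIFs; x3 stays near 1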
The residual is the error that is not explained by the regression equation:
e_i = y_i - ŷ_i
A residual plot plots the residuals on the y-axis vs. the predicted values of the dependent variable on the
x-axis. We would like the residuals to be unbiased: have an average value of zero in any thin vertical strip,
and homoscedastic, which means "same stretch": the spread of the residuals should be the same in any
thin vertical strip.
The assumption of homoscedasticity (meaning “same variance”) is central to linear regression models.
Homoscedasticity describes a situation in which the error term (that is, the “noise” or random disturbance
in the relationship between the independent variables and the dependent variable) is the same across all
values of the independent variables. Heteroscedasticity (the violation of homoscedasticity) is present
when the size of the error term differs across values of an independent variable. The impact of violating
the assumption of homoscedasticity is a matter of degree, increasing as heteroscedasticity increases.
R-squared (R2), which is the proportion of variation in the outcome that is explained by the predictor
variables. In multiple regression models, R2 corresponds to the squared correlation between the observed
outcome values and the predicted values by the model. The Higher the R-squared, the better the model.
Root Mean Squared Error (RMSE), which measures the average error performed by the model in
predicting the outcome for an observation. Mathematically, the RMSE is the square root of the mean
squared error (MSE), which is the average squared difference between the observed actual outcome
values and the values predicted by the model. So, MSE = mean((observeds - predicteds)^2) and RMSE =
sqrt(MSE). The lower the RMSE, the better the model.
Residual Standard Error (RSE), also known as the model sigma, is a variant of the RMSE adjusted for the
number of predictors in the model. The lower the RSE, the better the model. In practice, the difference
between RMSE and RSE is very small, particularly for large multivariate data.
Mean Absolute Error (MAE), like the RMSE, the MAE measures the prediction error. Mathematically, it is
the average absolute difference between observed and predicted outcomes, MAE = mean(abs(observeds
- predicteds)). MAE is less sensitive to outliers compared to RMSE.
Additionally, there are four other important metrics - AIC, AICc, BIC and Mallows Cp
AIC stands for (Akaike’s Information Criteria): Basic idea of AIC is to penalize the inclusion of additional
variables to a model. It adds a penalty that increases
the error when including additional terms. The lower the AIC, the better the model.
BIC (or Bayesian information criteria) is a variant of AIC with a stronger penalty for including additional
variables to the model.
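A compact sketch computing several of these regression metrics with scikit-learn (the observed and predicted values are illustrative):
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

observed  = np.array([3.0, 5.0, 7.5, 10.0])
predicted = np.array([2.8, 5.4, 7.0, 10.5])

mse  = mean_squared_error(observed, predicted)
rmse = np.sqrt(mse)                              # RMSE = sqrt(MSE)
mae  = mean_absolute_error(observed, predicted)  # less sensitive to outliers than RMSE
r2   = r2_score(observed, predicted)             # proportion of variance explained

print(mse, rmse, mae, r2)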
In logistic regression, the output probability should lie between 0 and 1, and based on a cut-off rate the final
output comes out as 0 or 1. A plain linear equation does not work here because its value can range from minus
infinity to plus infinity, which is why we convert the linear equation into the sigmoid equation.
Sigmoid: P = e^Y / (e^Y + 1)
Odds ratio:
P / (1 - P)
= [e^Y / (e^Y + 1)] / [1 / (e^Y + 1)]
= [e^Y / (e^Y + 1)] x [(e^Y + 1) / 1]
= e^Y
Log transformation:
P / (1 - P) = e^Y
log(P / (1 - P)) = Y = b0 + b1*X
A simple linear SVM classifier works by making a straight line between two classes.
That means all of the data points on one side of the line will represent a category and the data points on
the other side of the line will be put into a different category. This means there can be an infinite number
of lines to choose from.
What makes the linear SVM algorithm better than some of the other algorithms, like k-nearest neighbors,
is that it chooses the best line to classify your data points. It chooses the line that separates the data and
is as far away from the closest data points as possible.
A 2-D example helps to make sense of all the machine learning jargon. Basically, you have some data
points on a grid. You're trying to separate these data points by the category they should fit in, but you
don't want to have any data in the wrong category. That means you're trying to find the line between the
two closest points that keeps the other data points separated.
So the two closest data points give you the support vectors you'll use to find that line. That line is called
the decision boundary.
The decision boundary doesn't have to be a line. It's also referred to as a hyperplane because you can find
the decision boundary with any number of features, not just two.
Types of SVMs:
Simple SVM: Typically used for linear regression and classification problems.
Kernel SVM: Has more flexibility for non-linear data because you can add more features to fit a
hyperplane instead of a two-dimensional space.
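A brief sketch contrasting a linear SVM with a kernel SVM in scikit-learn (the dataset is illustrative):
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=42)   # non-linearly separable data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

linear_svm = SVC(kernel='linear').fit(X_train, y_train)   # straight-line decision boundary
kernel_svm = SVC(kernel='rbf').fit(X_train, y_train)      # curved boundary via the kernel trick

print(linear_svm.score(X_test, y_test), kernel_svm.score(X_test, y_test))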
107. How will you handle class imbalance problem? What are the various approaches?
Intermediate Advanced Stats
Imbalanced data typically refers to a problem with classification problems where the classes are not
represented equally.
- Try generating synthetic samples (the most popular such algorithm is SMOTE, the Synthetic
Minority Over-sampling Technique)
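A minimal sketch of SMOTE, assuming the imbalanced-learn (imblearn) package is available (the data is illustrative):
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
print(Counter(y))                       # heavily imbalanced classes

X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print(Counter(y_res))                   # synthetic minority samples balance the classes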
108. Why do we use sigmoid and not any increasing function from 0 to 1?
The main reason why we use the sigmoid function is that it exists between (0 to 1). Therefore, it is
especially used for models where we have to predict the probability as an output. Since the probability of
anything exists only between the range of 0 and 1, sigmoid is the right choice.
109. What are various evaluation parameters of regression and classification to evaluate
the model?
R-squared (R2), which is the proportion of variation in the outcome that is explained by the predictor
variables. In multiple regression models, R2 corresponds to the squared correlation between the observed
outcome values and the predicted values by the model. The Higher the R-squared, the better the model.
Root Mean Squared Error (RMSE), which measures the average error performed by the model in
predicting the outcome for an observation. Mathematically, the RMSE is the square root of the mean
squared error (MSE), which is the average squared difference between the observed actual outcome
values and the values predicted by the model. So, MSE = mean((observeds - predicteds)^2) and RMSE =
sqrt(MSE). The lower the RMSE, the better the model.
Residual Standard Error (RSE), also known as the model sigma, is a variant of the RMSE adjusted for the
number of predictors in the model. The lower the RSE, the better the model. In practice, the difference
between RMSE and RSE is very small, particularly for large multivariate data.
Mean Absolute Error (MAE), like the RMSE, the MAE measures the prediction error. Mathematically, it is
the average absolute difference between observed and predicted outcomes, MAE = mean(abs(observeds
- predicteds)). MAE is less sensitive to outliers compared to RMSE.
Additionally, there are four other important metrics - AIC, AICc, BIC and Mallows Cp
AIC stands for (Akaike’s Information Criteria): Basic idea of AIC is to penalize the inclusion of additional
variables to a model. It adds a penalty that increases the error when including additional terms. The lower
the AIC, the better the model.
BIC (or Bayesian information criteria) is a variant of AIC with a stronger penalty for including additional
variables to the model.
- Confusion matrix, which is 2x2 table showing four parameters, including the number of true positives,
true negatives, false negatives and false positives.
- Precision, Recall and Specificity, which are three major performance metrics describing a predictive
classification model
- ROC curve, which is a graphical summary of the overall performance of the model, showing the
proportion of true positives and false positives at all possible values of probability cutoff. The Area Under
the Curve (AUC) summarizes the overall performance of the classifier.
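A minimal sketch (assuming scikit-learn is installed; the arrays below are hypothetical) showing how a few of these metrics can be computed:
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score, confusion_matrix, roc_auc_score

# hypothetical regression outputs
observed = np.array([3.0, 5.0, 7.5, 10.0])
predicted = np.array([2.8, 5.3, 7.0, 9.5])
mse = mean_squared_error(observed, predicted)
rmse = np.sqrt(mse)                              # root mean squared error
mae = mean_absolute_error(observed, predicted)   # mean absolute error
r2 = r2_score(observed, predicted)               # proportion of variance explained

# hypothetical classification outputs
y_true = [1, 0, 1, 1, 0]
y_prob = [0.9, 0.2, 0.7, 0.4, 0.1]
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]
cm = confusion_matrix(y_true, y_pred)            # 2x2 table of TP, FP, FN, TN
auc = roc_auc_score(y_true, y_prob)              # area under the ROC curve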
110. In your project, if we use a regression model, what would be the outcome?
Regression analysis generates an equation that describes the statistical relationship between one or more
predictor variables and the response variable, which is continuous in nature. The response variable is the
target variable.
111. List out some common problems faced while analyzing the data.
112. OLS is to linear regression. Maximum likelihood is to logistic regression. Explain the
statement.
113. Is rotation necessary in PCA? If yes, Why? What will happen if you don’t rotate the
components?
Decision Tree algorithm belongs to the family of supervised learning algorithms. Unlike other supervised
learning algorithms, the decision tree algorithm can be used for solving regression and classification
problems too.
The goal of using a Decision Tree is to create a training model that can be used to predict the class or value of
the target variable by learning simple decision rules inferred from prior data (training data).
In Decision Trees, for predicting a class label for a record we start from the root of the tree. We compare
the values of the root attribute with the record’s attribute. On the basis of comparison, we follow the
branch corresponding to that value and jump to the next node.
119. Why did you choose Random forest or Decision trees model ?
Random forests consist of multiple single trees, each based on a random sample of the training data. They
are typically more accurate than single decision trees, and the decision boundary generally becomes more
accurate and stable as more trees are added.
- Trees are unpruned. While a single decision tree like CART is often pruned, a random forest tree is fully
grown and unpruned, and so, naturally, the feature space is split into more and smaller regions.
- Trees are diverse. Each random forest tree is learned on a random sample, and at each node, a random
set of features are considered for splitting. Both mechanisms create diversity among the trees.
For example, k-means clustering can be used for creating customer segments based on their income and
spend data.
Advantages:
- Compared to other algorithms decision trees requires less effort for data preparation during pre-
processing.
- Missing values in the data also do NOT affect the process of building a decision tree to any
considerable extent.
- A Decision tree model is very intuitive and easy to explain to technical teams as well as stakeholders.
Disadvantages:
- A small change in the data can cause a large change in the structure of the decision tree causing
instability.
- For a decision tree, the calculations can sometimes become far more complex compared to other algorithms.
- Decision tree training is relatively expensive because of the extra complexity and time it takes.
- The decision tree algorithm is often inadequate for regression, i.e. for predicting continuous values.
122. How to reduce number of variables in Logistic regression and random forest?
- Missing Values Ratio: Data columns with a ratio of missing values greater than a given threshold can be
removed. The higher the threshold, the more aggressive the reduction.
- Low Variance Filter: Data columns with a variance lower than a given threshold can be removed. Notice
that the variance depends on the column range, and therefore normalization is required before applying
this technique.
- High Correlation Filter: Calculate the Pearson product-moment correlation coefficient between numeric
columns and Pearson’s chi-square value between nominal columns. For the final classification, we only
retain one column of each pair of columns whose pairwise correlation exceeds a given threshold. Notice
that correlation depends on the column range, and therefore, normalization is required before applying
this technique.
- Principal Component Analysis (PCA): First principal component has the largest possible variance; each
succeeding principal component has the highest possible variance under the constraint that it is
orthogonal to (i.e., uncorrelated with) the preceding principal components. Keeping only the first m < n
principal components reduces the data dimensionality while retaining most of the data information, i.e.,
variation in the data.
- Backward Feature Elimination: We remove one input column (from training model on n columns) at a
time and train the same model on n-1 columns. The input column whose removal has produced the
smallest increase in the error rate is removed, leaving us with n-1 input columns. The classification is then
repeated using n-2 columns, and so on. Each iteration k produces a model trained on n-k columns and an
error rate e(k). By selecting the maximum tolerable error rate, we define the smallest number of columns
necessary to reach that classification performance with the selected machine learning algorithm.
- Forward Feature Construction. This is the inverse process to backward feature elimination. We start
with one column only, progressively adding one column at a time, i.e., the column that produces the
highest increase in performance.
- Multicollinearity check using the VARIANCE INFLATION FACTOR (VIF), typically used for logistic regression:
The VIF provides information on how large the standard error is compared with what it would be if the
variables were uncorrelated with the other predictor variables in the model. It is calculated for each
explanatory variable, and those with high values are removed. A common rule of thumb treats a VIF value
of 5 or more as significantly high, implying strong multicollinearity. Many businesses use a stricter cut-off of
VIF <= 2, since it offers a clearer rule.
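For the VIF check, a minimal sketch (assuming pandas and statsmodels are installed; df and the column names 'age', 'income', 'spend' are hypothetical):
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

X = sm.add_constant(df[['age', 'income', 'spend']])   # df and the columns are hypothetical
vif = pd.Series([variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
                index=X.columns)
print(vif)   # drop predictors whose VIF exceeds the chosen threshold (e.g. 5)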
Compute clustering algorithm (e.g., k-means clustering) for different values of k. For instance, by varying
k from 1 to 10 clusters.
For each k, calculate the total within-cluster sum of squares (WSS).
Plot the WSS against the number of clusters k.
The location of a bend (knee) in the plot is generally considered an indicator of the appropriate number of clusters.
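A minimal sketch of the elbow method (assuming scikit-learn and matplotlib are installed; X is a hypothetical feature matrix):
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

wss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, random_state=0).fit(X)   # X: hypothetical feature matrix
    wss.append(km.inertia_)                            # total within-cluster sum of squares

plt.plot(range(1, 11), wss, marker='o')
plt.xlabel('Number of clusters k')
plt.ylabel('Within-cluster sum of squares')
plt.show()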
124. List out some of the best practices for data cleaning
K-means clustering makes two key assumptions: first that the clusters are spherical and second that the clusters are of similar size.
Spherical assumption helps in separating the clusters when the algorithm works on the data and forms
clusters. If this assumption is violated, the clusters formed may not be what one expects. On the other
hand, assumption over the size of clusters helps in deciding the boundaries of the cluster. This assumption
helps in calculating the number of data points each cluster should have. This assumption also gives an
advantage. Clusters in K-means are defined by taking the mean of all the data points in the cluster. With
this assumption, one can start with the centers of clusters anywhere. Keeping the starting points of the
clusters anywhere will still make the algorithm converge with the same final clusters as keeping the
centers as far apart as possible.
Intermediate Excel
A waterfall chart is used to represent the changes in a given value over a period of time. The changes are
usually tracked as positives (a rise in the value) and negatives (a dip in the value). The beginning and ending
values are represented as solid columns and the changes are tracked using floating columns. For example,
waterfall charts can be used to represent a company's financial performance (profit, loss) over a period of
time or to display the changes in a product value over a period of time.
A two-variable data table lets us check two values at the same time for the same formula in a data
table. It is primarily used when the formula depends on several values which can be used for the two
variables.
One Variable data table is similar to two variable data table, but it would check one variable at a time.
Intermediate Excel
1. Rows - If a field is to be viewed in the rows of the Pivot table, the field needs to be dragged to the
Rows section.
2. Columns - If a field is to be viewed in the columns of the Pivot table, the field needs to be dragged
to the Columns section.
3. Values - While the rows and columns of the table are fixed using the Rows and Columns sections, the
values to be aggregated in the table are fixed using the Values section.
129. What are the most common questions you should ask a client before creating a
dashboard?
Advanced Excel
Basic Excel
Select the whole data set
Basic Excel
132. What is the difference between absolute and relative cell references?
Intermediate Excel
Absolute: An absolute reference in Excel refers to a reference that is "locked" so that rows and columns
won't change when copied.
Relative: A relative address will change when copied to another location in a worksheet because it
describes the "offset" to another cell, rather than a fixed address.
133. What formula would you use to find the length of a text string in a cell?
Basic Excel
"=LEN(cell)"
The above formula can be used to find the length of the text string in the specified cell.
134. What are slicers in Excel
Intermediate Excel
Slicers are visible filters. The objective of slicers is the same as that of filters, but with slicers the filter
values are visible. They are mainly used in Pivot tables.
135. How can you Combine Data from Multiple tables into 1 pivot table
Intermediate Excel
Intermediate Excel
Goal Seek helps us adjust the value in a specific cell so that a formula reaches the goal (target). It acts like a
business consultant in figuring out how to meet the target.
Solver uses a trial-and-error principle: it iterates through a series of candidate solutions for a specific problem
statement and shows how the output changes for different inputs.
Intermediate Excel
Named ranges - A named range gives a group of cells (or a single cell) a common name, which is easier to
use inside a formula than specifying the cell range.
Basic Excel
Wildcards are used to find strings in cells that are not exact matches but are similar to the search text. There
are three wildcard characters:
1. * (asterisk) - If more than one character is to be matched with the given string, we use the asterisk. For
example sh* would filter shirt, short, shell, shall, shore, etc.
2. ? (question mark) - If exactly one character is to be matched with the given string, ? is used. For example:
ra? would filter rap, ran, rat, raw, etc.
3. ~(tilde) - If the search string contains a wildcard character, then tilde can be used to find the string. For
example, if you need to search for ki* in your data. But since * is a wildcard character, the formula may
not fetch the desired output. In such case, ki~* would return ki*.
139. Explain the functions (VLOOKUP, COUNTIF, SUMIF, IFERROR, INDEX / MATCH)
Intermediate Excel
VLOOKUP - Stands for vertical look up. It is used to look up the data that is organised vertically
COUNTIF - Conditional counting. It is used to count all the values that would meet certain criteria
SUMIF - Conditional summing. Like COUNTIF, SUMIF sums all the values in a range that meet the condition.
IFERROR - It catches errors in a formula. It carries two arguments: the value or formula to evaluate, and the
value to return if that formula results in an error.
MATCH - It is used to fetch the position of a given value within a given range.
INDEX - It returns a specific value from the given range. The INDEX function carries three arguments: the
first takes the range (array), the second the row number of the value to be returned, and the third (optional)
the column number.
Basic Hadoop
Basic Hadoop
Basic Hadoop
Hadoop's Distributed File System which is fault-tolerant, reliable and scalable. Designed to store big
files efficiently in a distributed manner
143. What are the functions of the daemons in the Hadoop cluster ?
Basic Hadoop
HDFS Daemons
HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server
that manages the file system namespace and regulates access to files by clients. In addition, there are a
number of DataNodes, usually one per node in the cluster, which manage storage attached to the nodes
that they run on. HDFS exposes a file system namespace and allows user data to be stored in files.
Internally, a file is split into one or more blocks and these blocks are stored in a set of DataNodes. The
NameNode executes file system namespace operations like opening, closing, and renaming files and
directories
The ResourceManager and the NodeManager form the data-computation framework. The
ResourceManager is the ultimate authority that arbitrates resources among all the applications in the
system. The NodeManager is the per-machine framework agent who is responsible for containers,
monitoring their resource usage and reporting the same to the ResourceManager/Scheduler.
Basic Hadoop
Cluster Resource Management System responsible for allocation of compute resources to all the jobs
submitted to the Hadoop cluster
Intermediate Hadoop
Basic Hadoop
In a typical High Availability cluster, two separate machines are configured as NameNodes. At any point in
time, exactly one of the NameNodes is in an Active state, and the other is in a Standby state. The Active
NameNode is responsible for all client operations in the cluster, while the Standby is simply acting as a
slave, maintaining enough state to provide a fast failover if necessary.
147. What happens when two clients try to access the same file in HDFS ?
Basic Hadoop
HDFS does not allow concurrent writes to the same file; concurrent reads are fine.
Basic Hadoop
The HDFS namespace is stored by the NameNode. The NameNode uses a transaction log called the
EditLog to persistently record every change that occurs to file system metadata. The entire file system
namespace is stored in another file called the FsImage. Both EditLogs and FSImage files are stored as a file
in the NameNode’s local file system. The NameNode keeps an image of the entire file system namespace
and file Blockmap in memory. When the NameNode starts up, or a checkpoint is triggered by a
configurable threshold, it reads the FsImage and EditLog from disk, applies all the transactions from the
EditLog to the in-memory representation of the FsImage, and flushes out this new version into a new
FsImage on disk. In a cluster with no high-availability, the checkpointing is taken care of by the
SecondaryNameNode
149. How does NameNode handle DataNode failure ?
Basic Hadoop
As soon as a DataNode is declared dead/non-functional, all the data blocks it hosted are re-replicated to
other DataNodes from the replicas that were created initially. This is how the NameNode handles
DataNode failures.
Intermediate Hadoop
1. Use the file system metadata replica (FsImage) to start a new NameNode.
2. Then, configure the DataNodes and clients so that they can acknowledge this new NameNode,
that is started.
3. Now the new NameNode will start serving the client after it has completed loading the last
checkpoint FsImage (for metadata information) and received enough block reports from the
DataNodes.
Intermediate Hadoop
HDFS also maintains the replication factor by creating a replica of data on other available machines in
the cluster if suddenly one machine fails.
152. What is the reason we use HDFS for large datasets instead of a lot of small files ?
Basic Hadoop
As the NameNode performs storage of metadata for the file system in RAM, the amount of memory
limits the number of files in HDFS file system. In simple words, more files will generate more metadata,
that will, in turn, require more memory (RAM).
In Hadoop, HDFS splits a huge file into small chunks called blocks. These are the smallest units of data in the
file system.
154. What are the default sizes of a Hadoop block in Hadoop 3 and Hadoop 1 ?
Basic Hadoop
Basic Hadoop
The default block size is 64 MB in Hadoop 1 and 128 MB in Hadoop 2 and 3. The block size can be changed
to a required value by setting dfs.blocksize in the hdfs-site.xml file, or by passing dfs.blocksize on the
command line when writing a file.
Basic Hadoop
The jps command uses the java launcher to find the class name and arguments passed to the main
method.
Intermediate Hadoop
A rack is a collection of nodes (usually a few tens of nodes) that are physically stored close together and
connected to the same switch. When a user requests a read/write in a large Hadoop cluster, the NameNode
chooses a DataNode that is closer in order to reduce network traffic; this is called rack awareness.
Hadoop doesn't try to diagnose and fix slow running tasks, instead, it tries to detect them and runs
backup tasks for them. This is called speculative execution in Hadoop. These backup tasks are called
Speculative tasks in Hadoop
Intermediate Hadoop
You can stop the NameNode individually using /sbin/hadoop-daemon.sh stop namenode command.
Then start the NameNode using /sbin/hadoop-daemon.sh start namenode.
Alternatively, use /sbin/stop-all.sh and then /sbin/start-all.sh, commands which stop all the daemons first and then start them again.
Intermediate Hadoop
Standalone Mode
Pseudo-distributed Mode
Fully-Distributed Mode.
Basic Hadoop
MapReduce is a programming model or pattern within the Hadoop framework that is used to access big
data stored in the Hadoop File System (HDFS). It is a core component, integral to the functioning of the
Hadoop framework.
Basic Hadoop
Intermediate Hadoop
The class which contains the reduce function. JAR file containing the mapper, reducer and driver
classes
Intermediate Hadoop
RecordReader, typically, converts the byte-oriented view of the input, provided by the InputSplit, and
presents a record-oriented view for the Mapper and Reducer tasks for processing. It thus assumes the
responsibility of processing record boundaries and presenting the tasks with keys and values.
Intermediate Hadoop
Reducers always run in isolation and they can never communicate with each other as per the Hadoop
MapReduce programming paradigm
Intermediate Hadoop
A partitioner partitions the key-value pairs of intermediate Map-outputs. It partitions the data using a
user-defined condition, which works like a hash function. The total number of partitions is the same as
the number of Reducer tasks for the job.
167. What does a combiner do ?
Basic Hadoop
A Combiner, also known as a semi-reducer, is an optional class that operates by accepting the inputs
from the Map class and thereafter passing the output key-value pairs to the Reducer class. The main
function of a Combiner is to summarize the map output records with the same key.
Advanced Hadoop
169. What is the reason we can’t perform aggregation in mapper ? Why do we need the
reducer for this ?
Advanced Hadoop
Aggregation cannot be done in the Mapper phase because aggregation requires sorting of data, and a
mapper executes per input split (a data block), so it loses the previous input split every time a new one is
taken as input. The data processed by the mapper is then stored on local disk through the shuffle and sort
process before the reducer phase. The latency of writing
this data directly to disk and then transferring data across the network is an expensive operation in the
processing of a MapReduce job. Hence there is a necessity to reduce the amount of data that needs to be
sent across the network to reducer whenever possible.
Basic ML
- Hosting the docker container on an AWS ec2 instance and consuming the web-service
172. How will you make models out of the tweets for the pharma company
Advanced ML
173. Make 4 segments (product category, competitors etc) and identify which medicine a
doctor is likely to recommend
Intermediate ML
Intermediate ML
Basic ML
k-Means Clustering is an unsupervised learning algorithm that is used for clustering whereas KNN is a
supervised learning algorithm used for classification.
The “k” in k-means denotes the number of clusters you want to have in the end. If k = 5, you will have 5
clusters on the data set. “k” in K-Nearest Neighbors is the number of neighbours it checks. It is
supervised because you are trying to classify a point based on the known classification of other points.
Basic ML
Bagging and Boosting decrease the variance of your single estimate as they combine several estimates
from different models. So the result may be a model with higher stability.
Bagging is used when the goal is to reduce the variance of a decision tree classifier. Here the objective is
to create several subsets of data from the training sample chosen randomly with replacement. Each
collection of subset data is used to train their decision trees. As a result, we get an ensemble of different
models. Average of all the predictions from different trees are used which is more robust than a single
decision tree classifier.
Boosting is used to create a collection of predictors. In this technique, learners are trained sequentially,
with early learners fitting simple models to the data and then analysing the data for errors. Consecutive trees
(random sample) are fit and at every step, the goal is to improve the accuracy from the prior tree. When
an input is misclassified by a hypothesis, its weight is increased so that next hypothesis is more likely to
classify it correctly. This process converts weak learners into a better performing model
Basic ML
AdaBoost is an ensemble classifier. It combines multiple weak classifiers to form a strong classifier. A
single algorithm may classify the objects poorly. But if we combine multiple classifiers with a selection of the
training set at every iteration and assigning the right amount of weight in the final voting, we can have
good accuracy score for the overall classifier.
Basic ML
XGBoost stands for Extreme Gradient Boosting; it is a specific implementation of the Gradient Boosting
method which uses more accurate approximations to find the best tree model. It employs a number of
nifty tricks that make it exceptionally successful, particularly with structured data.
The most important are:
1) computing second-order gradients, i.e. second partial derivatives of the loss function (similar to
Newton’s method), which provides more information about the direction of gradients and how to get to
the minimum of our loss function. While regular gradient boosting uses the loss function of our base
model (e.g. decision tree) as a proxy for minimizing the error of the overall model, XGBoost uses the 2nd
order derivative as an approximation.
2) And advanced regularization (L1 & L2), which improves model generalization.
XGBoost has additional advantages: training is very fast and can be parallelized/distributed across
clusters.
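A minimal sketch (assuming the xgboost package is installed; X_train, y_train and X_test are hypothetical):
from xgboost import XGBClassifier

model = XGBClassifier(
    n_estimators=200,    # number of boosted trees
    learning_rate=0.1,
    max_depth=4,
    reg_lambda=1.0,      # L2 regularization
)
model.fit(X_train, y_train)     # X_train, y_train: hypothetical training data
preds = model.predict(X_test)   # X_test: hypothetical test data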
Basic ML
Bootstrap Sampling is a method that involves drawing of sample data repeatedly with replacement from
a data source to estimate a population parameter.
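A minimal NumPy sketch of bootstrap sampling (the data array is hypothetical), estimating a mean and its uncertainty:
import numpy as np

data = np.array([4.2, 5.1, 3.9, 6.3, 5.8, 4.7])   # hypothetical sample
rng = np.random.default_rng(0)
boot_means = []
for _ in range(1000):
    sample = rng.choice(data, size=len(data), replace=True)   # draw with replacement
    boot_means.append(sample.mean())

print(np.mean(boot_means), np.std(boot_means))   # bootstrap estimate and its standard error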
180. What to be done on the dataset if the assumptions are not met?
Intermediate ML
1. If you create a scatter plot of values for x and y and see that there is not a linear relationship between
the two variables, then one can do the following:
- Apply a nonlinear transformation to the independent and/or dependent variable. e.g. log, square root,
or reciprocal of the independent and/or dependent variable
2. If the residuals are autocorrelated (serial correlation), then one can do the following:
- For positive serial correlation, consider adding lags of the dependent and/or independent variable to
the model.
- For negative serial correlation, check to make sure that none of your variables is overdifferenced.
- For seasonal correlation, consider adding seasonal dummy variables to the model
3. If Residuals do not have constant variance, then one can do the following:
- First, verify that any outliers aren’t having a huge impact on the distribution. If there are outliers
present, make sure that they are real values and that they aren’t data entry errors
- Next, you can apply a nonlinear transformation to the independent and/or dependent variable. e.g. log,
square root, or the reciprocal of the independent and/or dependent variable
Intermediate ML
1. Specify Performance Requirements (these may be accuracy, false positives, or whatever metrics are
important to the business)
5. Challenge Then Trial Model Updates (For example, perhaps you set up a grid or random search of
model hyperparameters that runs every night and spits out new candidate models)
Basic ML
1. Fundamentally, classification is about predicting a label and regression is about predicting a quantity.
i.e. Classification is the task of predicting a discrete class label while Regression is the task of predicting
a continuous quantity
2. Classification predictions can be evaluated using accuracy, whereas regression predictions cannot.
Regression predictions can be evaluated using root mean squared error, whereas classification
predictions cannot.
3. A regression algorithm can predict a discrete value which is in the form of an integer quantity
A classification algorithm can predict a continuous value if it is in the form of a class label probability
183. Which model to use to check whether a patient is diabetic or not?
Basic ML
Basic ML
Basic ML
a. Logistic regression models the probabilities for classification problems with two possible outcomes. It's
an extension of the linear regression model for classification problems.
c. Number of obs – This is the number of observations that were used in the analysis
d. LR chi2(3) – This is the likelihood ratio (LR) chi-square test. The number in the parenthesis indicates
the number of degrees of freedom
e. Prob > chi2 – This is the probability of obtaining the chi-square statistic given that the null hypothesis
is true. In this case, the model is statistically significant because the p-value is less than 0.05.
Basic ML
A group of weak learners coming together to form a strong learner, thus increasing the accuracy of any
Machine Learning model is called an ensemble model
Advanced ensemble techniques: Stacking, Bagging (e.g. Random Forest), Pasting, and Boosting (e.g. AdaBoost,
XGBoost, etc.)
187. What is Decision tree and Random forest?
Basic ML
- A decision tree is a supervised machine learning algorithm that can be used for both classification and
regression problems. A decision tree is simply a series of sequential decisions made to reach a specific
result
- Random Forest is a tree-based machine learning algorithm that leverages the power of multiple
(randomly created) decision trees for making decisions. i.e. The Random Forest Algorithm combines the
output of multiple (randomly created) Decision Trees to generate the final output.
- Random Forest is suitable for situations when we have a large dataset, and interpretability is not a major
concern. Decision trees are much easier to interpret and understand. Since a random forest combines
multiple decision trees, it becomes more difficult to interpret.
- The decision tree model gives high importance to a particular set of features. But the random forest
chooses features randomly during the training process.
Basic ML
Handling Overfitting:
Cross-validation
This is done by splitting your dataset into ‘test’ data and ‘train’ data. Build the model using the ‘train’ set.
The ‘test’ set is used for in-time validation.
Regularization
This is a form of regression, that regularizes or shrinks the coefficient estimates towards zero. This
technique discourages learning a more complex model
Early stopping
When training a learner with an iterative method, you stop the training process before the final
iteration. This prevents the model from memorizing the dataset.
Pruning
Pre-pruning: Stop ‘growing’ the tree earlier before it perfectly classifies the training set.
Post-pruning: Allows the tree to ‘grow’, perfectly classify the training set and then post prune the tree.
Dropout
This is a technique where randomly selected neurons are ignored during training.
Handling Underfitting: increase model complexity (use a more expressive model or add relevant features), reduce regularization, and train for longer.
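A minimal sketch (assuming scikit-learn is installed; X and y are hypothetical) showing how cross-validation and regularization strength can be used together to diagnose over- and underfitting:
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

for alpha in [0.01, 0.1, 1, 10, 100]:             # larger alpha = stronger regularization
    model = Ridge(alpha=alpha)
    scores = cross_val_score(model, X, y, cv=5)   # X, y: hypothetical data
    print(alpha, scores.mean())
# a very small alpha with a large train/validation gap suggests overfitting;
# uniformly poor scores at every alpha suggest underfitting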
Basic ML
The goal of any supervised machine learning algorithm is to achieve low bias(the difference between the
average prediction of our model and the correct value which we are trying to predict) and low
variance(variability of model prediction for a given data point or a value which tells us spread of our data).
If our model is too simple and has very few parameters then it may have high bias and low variance. On
the other hand, if our model has a large number of parameters then it’s going to have high variance and
low bias.
Increasing the bias will decrease the variance. Increasing the variance will decrease bias.
So we need to find the right/good balance without overfitting and underfitting the data.
This tradeoff in complexity is why there is a tradeoff between bias and variance.
Intermediate ML
One is Observing what others did in similar situations. The other is observing a situation and trying to
come up with the best possible logic on the spot to decide/conclude. The third is learning from previous
mistakes/success. These three methods correspond to three branches of Machine learning, Supervised,
Unsupervised and Reinforcement learning respectively.
- In Supervised Learning, a computer can tell what word in a sentence is the name of a city, given it is
shown example sentences which may or may not contain names of cities and every occurrence of a city
name is tagged in these examples.
- Unsupervised is where we ask the computer to make decisions based on raw data attributes and a set of
measurable quantities. Some examples would include asking a computer to come up with localities in a
dataset where the Lat-Long of each house is given. It would use the Lat-Long values to find distances and
form localities of houses.
- The third type of learning is Reinforcement Learning. This is a method in which computer starts with
making random decisions, and then learns based on errors it makes and successes it encounters as it
goes. A recent discovery was an algorithm which could play many different arcade games after learning
the correct/wrong moves. These algorithms would start by making a lot of failures in the beginning and
then get better as they go.
Basic ML
192. You are given a data set on cancer detection. You’ve build a classification model and
achieved an accuracy of 96%. Why shouldn’t you be happy with your model performance?
What can you do about it?
Intermediate ML
193. You are working on a time series data set. You manager has asked you to build a high
accuracy model. You start with the decision tree algorithm, since you know it works fairly
well on all kinds of data. Later, you tried a time series regression model and got higher
accuracy than decision tree model. Can this happen? Why?
Advanced ML
194. You came to know that your model is suffering from low bias and high variance.
Which algorithm should you use to tackle it? Why?
Intermediate ML
Basic ML
196. After analyzing the model, your manager has informed that your regression model is
suffering from multicollinearity. How would you check if he’s true? Without losing any
information, can you still build a better model?
Intermediate ML
Basic ML
198. While working on a data set, how do you select important variables? Explain your
methods.
Basic ML
Intermediate ML
200. Both being tree based algorithm, how is random forest different from Gradient
boosting algorithm (GBM)?
Basic ML
201. You’ve got a data set to work having p (no. of variable) > n (no. of observation). Why is
(Ordinary Least Squares) OLS is bad option to work with? Which techniques would be best
to use? Why?
Advanced ML
202. We know that one hot encoding increasing the dimensionality of a data set. But,
label encoding doesn’t. How ?
Intermediate ML
203. You are given a data set consisting of variables having more than 30% missing values?
Let’s say, out of 50 variables, 8 variables have missing values higher than 30%. How will
you deal with them?
Basic ML
204. People who bought this, also bought…’ recommendations seen on amazon is a result
of which algorithm?
Intermediate ML
Basic ML
206. You have been asked to evaluate a regression model based on R², adjusted R² and
tolerance. What will be your criteria?
Basic ML
207. Considering the long list of machine learning algorithm, given a data set, how do you
decide which one to use?
Basic ML
Basic ML
Basic ML
210. How can you prove that one improvement you've brought to an algorithm is really an
improvement over not doing anything?
Basic ML
211. Explain what resampling methods are and why they are useful. Also explain their
limitations.
Basic ML
- Repeatedly drawing samples from a training set and refitting a model of interest on each sample in
order to obtain additional information about the fitted model
- Example: repeatedly draw different samples from training data, fit a linear regression to each new
sample, and then examine the extent to which the resulting fits differ
cross-validation: random sampling with no replacement, bootstrap: random sampling with replacement
- cross-validation: evaluating model performance, model selection (select the appropriate level of
flexibility)
- bootstrap: mostly used to quantify the uncertainty associated with a given estimator or statistical
learning method
212. Is it better to have too many false positives, or too many false negatives? Explain.
Basic ML
False positives and false negatives are two types of errors we have to deal with while evaluating a model.
In medicine, a false positive can lead to unnecessary treatment, while a false negative can lead to a missed
diagnosis, which is very serious since the disease has been ignored.
However, we can minimize the errors by collecting more information, considering other variables,
adjusting the sensitivity (true positive rate) and specificity (true negative rate) of the test, or conducting
the test multiple times.
Even so, it is still hard since reducing one type of error means increasing the other type of error.
Sometimes, one type of error is more preferable than the other one, so data scientists will have to
evaluate the consequences of the errors and make a decision
213. What is selection bias, why is it important and how can you avoid it
Basic ML
Selection bias occurs if a data set's examples are chosen in a way that is not reflective of their real-world
distribution.
- Using random methods when selecting subgroups from the population.
- Ensuring that the subgroups selected are equivalent to the population at large in terms of their key
characteristics (this method is less of a protection than the first, since typically the key characteristics are
not known).
214. Differentiate between univariate, bivariate and multivariate analysis.
Basic ML
Basic ML
Systematic sampling and cluster sampling are both statistical measures used by researchers, analysts,
and marketers to study samples of a population.
Systematic sampling involves selecting fixed intervals from the larger population to create the sample.
Cluster sampling divides the population into clusters and then randomly selects entire clusters (or samples within the selected clusters) to form the sample.
216. Can you cite some examples where both false positive and false negatives are equally
important?
Intermediate ML
False-positive cases lead to overspending due to unnecessary care and damaging the health of an
otherwise healthy person due to unnecessary side effects of the therapy.
A false negative case means that your patients get sicker or die.
In this case, both false positive and false negatives are equally important since it concerns a person’s life
Lasso regression is a type of linear regression that uses shrinkage. Shrinkage is where data values are
shrunk towards a central point, like the mean. The lasso procedure encourages simple, sparse models (i.e.
models with fewer parameters)
Lasso regression performs L1 regularization, which adds a penalty equal to the absolute value of the
magnitude of coefficients. This type of regularization can result in sparse models with few coefficients;
Some coefficients can become exactly zero and be eliminated from the model. Larger penalties result in coefficient
values closer to zero, which is ideal for producing simpler models.
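A minimal sketch of lasso regression (assuming scikit-learn is installed; X_train and y_train are hypothetical):
from sklearn.linear_model import Lasso

lasso = Lasso(alpha=0.1)      # alpha controls the strength of the L1 penalty
lasso.fit(X_train, y_train)   # X_train, y_train: hypothetical training data
print(lasso.coef_)            # some coefficients are shrunk exactly to zero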
Intermediate ML
Gradient descent is an optimization algorithm that's used when training a machine learning model.
It's based on a convex function and tweaks its parameters iteratively to minimize a given cost function to
its local minimum.
You start by defining the initial parameter's values and from there gradient descent uses calculus to
iteratively adjust the values so they minimize the given cost-function (where a gradient measures how
much the output of a function changes if you change the inputs a little bit.)
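A minimal, purely illustrative sketch of gradient descent on the one-parameter cost function f(w) = (w - 3)^2:
def grad(w):
    # derivative of f(w) = (w - 3)**2
    return 2 * (w - 3)

w = 0.0                 # initial parameter value
learning_rate = 0.1
for _ in range(100):
    w = w - learning_rate * grad(w)   # step in the direction opposite to the gradient
print(w)                # converges towards the minimum at w = 3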
Advanced ML
AWS or Azure instances with python jobs that run with either manual schedules, or automated to trigger
on receiving say new data. These are usually a suite of services that constitute a deployment
environment of such models.
Storage - model needs to be stored somewhere (pickle or joblib or specific model object). Either s3 on
aws or blob in azure.
Computing instance - Computing environment that contains python and is enabled to communicate to
every platform that is relevant to the deployment context.
Job scheduler - Devops is the norm now. Automated pipelines that procure data, process,
load/retrain/predict with the packaged model.
Final layer - either BI tools like Tableau, QlikView, etc., or SQL/NoSQL databases, or Excel reports
220. What is cosine similarity?
Intermediate ML
Cosine similarity is a metric used to measure how similar the documents are irrespective of their size.
Mathematically, it measures the cosine of the angle between two vectors projected in a multi-dimensional
space. The cosine similarity is advantageous because even if the two similar documents are far apart by
the Euclidean distance (due to the size of the document), chances are they may still be oriented closer
together. The smaller the angle, the higher the cosine similarity.
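A minimal NumPy sketch (the term-count vectors below are hypothetical):
import numpy as np

def cosine_similarity(a, b):
    # cosine of the angle between vectors a and b
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

doc1 = np.array([1, 3, 0, 2])   # hypothetical term-count vectors
doc2 = np.array([2, 6, 0, 4])
print(cosine_similarity(doc1, doc2))   # 1.0 -- same orientation despite different magnitudes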
Intermediate ML
Build a computational graph, this can be any mathematical operation TensorFlow supports.
Run graph in session, the compiled graph is passed to the session, which starts its execution.
222. What is part of speech (POS) tagging? What is the simplest approach to building a
POS tagger that you can imagine?
Basic NLP
POS tagging is the process of marking up a word in a corpus to a corresponding part of a speech tag,
based on its context and definition. The most common approach is to use the lexicon-based approach,
using a lexicon to assign a tag for each word. The lexicon is constructed from a gold standard annotated
corpus, where each word type is coupled with its most frequent associated tag in the gold standard
corpus.
223. How would you build a part of speech (POS) tagger from scratch given a corpus of
annotated sentences? How would you deal with unknown words?
Basic NLP
First, we will create features from words (like last 2,3 letters, the previous word, next word, etc.). Then we
will train a classifier to find the POS tag. HMM, CRF and RNNs can be used to train the model. Unknown
words can also be predicted by generating the features (position of the word, suffix, etc) from them.
224. How would you train a model that identifies whether the word “Apple” in a sentence
belongs to the fruit or the company?
Basic NLP
This particular task is known as NER (Named Entity Recognition) tagging. HMM, CRF and RNNs can be
used to train a model for NER
225. How would you find all the occurrences of quoted text in a news article?
Basic NLP
Train a classifier model to look at the constituent parts of a news article and assign a probability that,
taken together, composes a valid quoted text.
226. How would you build a system that auto-corrects text that has been generated by a
speech recognition system?
Basic NLP
It can be done in multiple ways, but the simplest way would be to take the unknown words and compare
them with similar words from our dictionary. Distances can be calculated using algorithms like
Levenshtein and if the result is satisfactory, the words can be exchanged
Basic NLP
Some popular models other than word2vec are GloVe, Adagram, FastText, etc
228. What is latent semantic indexing and where can it be applied?
Basic NLP
Latent semantic indexing (LSI) is a concept used by search engines to discover how a term and content
work together to mean the same thing, even if they do not share keywords or synonyms. Search engines
use LSI to judge the quality of the content on a page by checking for words that should appear alongside
a given search term or keyword
229. Explain some metrics to test out a Named Entity recognition model.
Basic NLP
When you train a NER system the most typical evaluation method is to measure precision, recall, f1-
score, and confusion matrix at a token level.
230. List out some popular Python libraries that are used for NLP.
Basic NLP
Some popular libraries for NLP are, NLTK, Gensim, spaCy, TextBlob, etc.
Basic NLP
Some popular applications are Text summarization, Machine translation, Sentiment Analysis, chatbots,
etc.
232. What is the difference between search function and match function?
Basic NLP
re.search() method finds something anywhere present in the string and return a match object, whereas
re.match() method finds something only at the beginning of the string and returns a match object
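A quick illustration (the example string is hypothetical):
import re

text = "data science interview"
print(re.match(r"science", text))    # None -- the pattern is not at the beginning
print(re.search(r"science", text))   # a Match object -- found anywhere in the string
print(re.match(r"data", text))       # a Match object, because 'data' is at the start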
233. What is tokenization, chinking, chunking?
Basic NLP
Tokenization is a way of separating a piece of text into smaller units called tokens. Here, tokens can be
either word, characters, or subwords. Chunking means a grouping of words/tokens into chunks. Chunking
can break sentences into phrases that are more useful than individual words and yield meaningful results.
Chinking is a lot like chunking, it is basically a way for you to remove a chunk from a chunk.
Basic NLP
Skip-gram is an unsupervised algorithm to find word embeddings. It tries to predict the source context
words (surrounding words) given a target word (the center word)
Basic NLP
CBOW is an unsupervised algorithm to find word embeddings. It tries to predict the target word (the
center word) given the source context words (surrounding words).
Basic NLP
You can use gensim library to implement word2vec model, you can train the word2vec model on your
text corpus and then generate word embeddings.
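A minimal sketch (assuming gensim 4.x is installed; the tokenized corpus below is hypothetical):
from gensim.models import Word2Vec

sentences = [["machine", "learning", "is", "fun"],
             ["deep", "learning", "uses", "neural", "networks"]]   # hypothetical corpus

model = Word2Vec(sentences=sentences, vector_size=50, window=2, min_count=1)
vector = model.wv["learning"]                         # 50-dimensional embedding for the word
similar = model.wv.most_similar("learning", topn=2)   # nearest words by cosine similarity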
Basic NLP
Stemming and lemmatization, both are used to derive root (base) word from their inflected form. A stem
might not be an actual word whereas a lemma will be an actual word.
238. How would you build a system to translate English text to Greek and vice-versa?
Basic NLP
One can use Neural Machine Translation to translate English text to Greek and vice-versa. A sequence
to sequence model can be created using RNNs.
239. How would you build a system that automatically groups news articles by subject?
Basic NLP
There can be different ways to do this task, if you have annotated data, you can train a classifier model
to classify different articles
240. What are stop words? Describe an application in which stop words should be
removed.
Basic NLP
Stop words are frequently used words which do not add much meaning to a sentence or do not help
in prediction. For example, we typically remove stop words while performing sentiment analysis.
241. How would you design a model to predict whether a movie review was positive or
negative?
Basic NLP
We will need to perform sentiment analysis on the reviews, It can be done in multiple ways, one simple
way to do this is by training a classifier using ML algorithms or RNNs (LSTM or GRU).
242. What is entropy? How would you estimate the entropy of the English language?
Basic NLP
Entropy is a measure of randomness in the information. One possible way of calculating the entropy of
English uses N-grams. One can statistically calculate the entropy of the next letter when the previous N -
1 letters are known.
243. What is the TF-IDF score of a word and in what context is this useful?
Basic NLP
TF-IDF is a statistical measure that evaluates how relevant a word is to a document in a collection of
documents. This is done by multiplying two metrics: how many times a word appears in a document, and
the inverse document frequency of the word across a set of documents. TF-IDF is used to convert text
corpus into a matrix on which Machine learning algorithms can be implemented
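A minimal sketch (assuming a recent version of scikit-learn; the two documents are hypothetical):
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["the cat sat on the mat",       # hypothetical documents
          "the dog chased the cat"]
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(corpus)   # sparse document-term matrix
print(vectorizer.get_feature_names_out())
print(tfidf_matrix.toarray())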
Basic NLP
Dependency parsing is the process of analyzing the grammatical structure of a sentence based on the
dependencies between the words in that sentence.
245. What are the difficulties in building and using an annotated corpus of text such as
the Brown Corpus and what can be done to mitigate them?
Basic NLP
246. What tools for training NLP models (NLTK, Apache OpenNLP, GATE, MALLET etc…)
have you used?
Basic NLP
To train NLP models, I have used NLTK, Gensim, Spacy and a few others
247. Are you familiar with WordNet or other related linguistic resources?
Basic NLP
WordNet is the lexical database i.e. dictionary for the English language, specifically designed for NLP.
Synset is a special kind of a simple interface that is present in NLTK to look up words in WordNet.
248. Problems faced in NLP and how you tackled them?
Basic NLP
Most of the challenges I faced in NLP are due to data complexity (characteristics such as sparsity,
diversity, and dimensionality) and the dynamic nature of the datasets. With a special focus on
addressing NLP challenges, one can build accelerators and robust, scalable, domain-specific knowledge bases
and dictionaries that bridge the gap between user vocabulary and domain nomenclature.
249. What are some of the common problems using fixed window neural models?
Advanced NLP
The main problem faced while using a fixed window neural model, is the window size can be small for
large sentences, making it unable to process the complete information
Advanced NLP
Some common examples of sequential data are text corpus, DNA sequence, and time-series data
Advanced NLP
An issue when using n-gram language models is out-of-vocabulary (OOV) words. They are encountered in
computational linguistics and natural language processing when the input includes words which were not
present in a system's dictionary or database during its preparation.
Advanced NLP
RNNs are prone to exploding and vanishing gradient problem. RNN also fails to keep track of long term
dependencies.
253. What are Vanishing gradient problems?
Advanced NLP
As more layers using certain activation functions are added to neural networks, the gradients of the loss
function approach zero, making the network hard to train.
Advanced NLP
Exploding gradients are a problem where large error gradients accumulate and result in very large
updates to neural network model weights during training.
Advanced NLP
An example of Many to one architecture in sequence model, would be sentiment analysis, where the
inputs are words and the output is sentiment
Advanced NLP
Advanced NLP
In LSTM, the forget gate controls the extent to which a value remains in the cell
258. Why is there a specific need for an architecture like GRU or LSTM?
Advanced NLP
RNN’s suffer from short-term memory. If a sequence is long enough, they will have a hard time carrying
the information from the earlier timesteps to later ones. This is called the Vanishing Gradient Problem. To
solve this issue, GRU and LSTMs are used
Advanced NLP
RNN’s suffer from short-term memory. If a sequence is long enough, they will have a hard time carrying
the information from the earlier timesteps to later ones. This is called the Vanishing Gradient Problem. To
solve this issue, GRU and LSTMs are used
Advanced NLP
The main difference between GRU and LSTM is that a GRU has 2 gates whereas an LSTM has 3 gates, so a GRU
is faster than an LSTM. But LSTMs generally perform better at remembering longer sequences than GRUs.
261. What kind of datasets are RNNs known best to work on?
Advanced NLP
Advanced NLP
263. What are some of the ways to address the exploding gradients problem in RNNs?
Advanced NLP
1. Gradient Clipping: Limit the size of gradients during the training of your network.
2. Weight Regularization: apply a penalty to the network's loss function for large weight values.
3. Using LSTM or GRU
Advanced NLP
An Encoder-Decoder architecture was developed where an input sequence was read in entirety and
encoded to a fixed-length internal representation. A decoder network then used this internal
representation to output words. This is generally used in machine translation
Advanced NLP
The main disadvantage of the attention mechanism is that it adds more weights to train, which increases the
training time of the model.
BERT stands for Bidirectional Encoder Representations from Transformers. BERT is pre-trained on a large
corpus of unlabelled text. It is bidirectional meaning it learns information from both the left and the right
side of a token’s context during the training phase. BERT is used for text summarization, knowledge
extraction, chatbots etc.
Advanced NLP
XLNet is an auto-regressive language model which outputs the joint probability of a sequence of tokens
based on the transformer architecture with recurrence.
Advanced NLP
The Transformer in NLP is a novel architecture that aims to solve sequence-to-sequence tasks while
handling long-range dependencies with ease.
Advanced NLP
Advanced NLP
The standard seq2seq model is generally unable to accurately process long input sequences, the
attention mechanism allows the model to focus and place more “Attention” on the relevant parts of the
input sequence as needed.
1. Bahdanau Attention
2. Luong Attention
Advanced NLP
Since the BERT model is deeply bidirectional, it is able to generate more accurate word representations. Since
BERT uses transformers, it supports parallelization and is thus faster to train on large datasets.
273. What information is stored in the hidden and cell state of an LSTM?
Advanced NLP
The cell state ( also called long-term memory) contains the information from the past. Hidden State (also
called working memory) contains the information from the current state that needs to be taken to the
next state
Advanced NLP
Transformers are better than all the other architectures because they totally avoid recursion, by
processing sentences as a whole and by learning relationships between words, using multi-head
attention mechanisms and positional embeddings
275. What are the differences between BERT and ALBERT v2?
Advanced NLP
BERT is an expensive model in terms of memory and time consumed on computations, even with GPU.
ALBERT v2 is lighter and faster than BERT. Cross-layer parameter sharing is the most significant change
in BERT architecture that created ALBERT.
Advanced NLP
1. BERT Base: 12 layers (transformer blocks), 12 attention heads, and 110 million parameters
2. BERT Large: 24 layers (transformer blocks), 16 attention heads and, 340 million parameters
Advanced NLP
1. BERT
2. GPT-3
3. XLNet
278. What are the most challenging NLP problems that researchers/industries are
working on currently?
Advanced NLP
Basic Python
Basic Python
Intermediate Python
Basic Python
Basic Python
284. How will you import multiple excel sheets in a data frame?
Basic Python
285. What are the different types of data types?
Basic Python
Basic Python
Basic Python
# check whether a number is prime
number = int(input("Enter any number: "))
if number > 1:
    for i in range(2, number):
        if (number % i) == 0:
            print(number, "is not a prime number")
            break
    else:
        print(number, "is a prime number")
else:
    print(number, "is not a prime number")

# check whether a number is an Armstrong number
num = int(input("Enter any number: "))
# initialize sum
sum = 0
temp = num
while temp > 0:
    digit = temp % 10
    sum += digit ** 3
    temp //= 10
if num == sum:
    print(num, "is an Armstrong number")
else:
    print(num, "is not an Armstrong number")
Basic Python
list.append(item)
Basic Python
291. Which function is most useful to convert a multidimensional array into a one-
dimensional
Basic Python
292. Python or R – Which one would you prefer for text analytics?
Intermediate Python
Intermediate Python
Intermediate Python
Python programming language supports negative indexing of arrays, something which is not available in
arrays in most other programming languages. This means that the index value of -1 gives the last element,
and -2 gives the second last element of an array. The negative indexing starts from where the array ends.
This means that the last element of the array is the first element in the negative indexing which is -1.
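A quick illustration (the list is just an example):
arr = [10, 20, 30, 40]
print(arr[-1])   # 40 -- last element
print(arr[-2])   # 30 -- second to last element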
295. How is the Python series different from a single column dataframe?
Intermediate Python
Python series is the data structure for a single column of a DataFrame, not only conceptually, but
literally, i.e. the data in a DataFrame is actually stored in memory as a collection of Series
Series is a one-dimensional object that can hold any data type such as integers, floats and strings and it
does not have any name/header whereas the dataframe has column names.
296. Which libraries in SciPy have you worked with in your project?
Intermediate Python
SciPy contains modules for optimization, linear algebra, integration, interpolation, special functions, FFT,
signal and image processing, ODE solvers etc
Subpackages include:
scipy.cluster
scipy.constants
scipy.fftpack
scipy.integrate
scipy.interpolate
scipy.linalg
scipy.io
scipy.ndimage
scipy.odr
scipy.optimize
scipy.signal
scipy.sparse
scipy.spatial
scipy.special
scipy.stats
scipy.weave
Intermediate Python
Pandas dataframe.groupby() function is used to split the data into groups based on some criteria. pandas
objects can be split on any of their axes.
Parameters :
as_index: For aggregated output, return object with group labels as the index. Only relevant for
DataFrame input. as_index=False is effectively “SQL-style” grouped output
sort: Sort group keys. Get better performance by turning this off. Note this does not influence the order
of observations within each group. groupby preserves the order of rows within each group.
group_keys: When calling apply, add group keys to index to identify pieces
squeeze: Reduce the dimensionality of the return type if possible, otherwise return a consistent type
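A minimal sketch of groupby (the dataframe below is hypothetical):
import pandas as pd

df = pd.DataFrame({"team": ["A", "A", "B", "B"],   # hypothetical data
                   "score": [10, 15, 7, 13]})
grouped = df.groupby("team", as_index=False)["score"].mean()
print(grouped)   # mean score per team, with 'team' kept as a regular column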
Intermediate Python
Basic Python
Packages are namespaces which contain multiple packages and modules themselves. They are simply
directories.
Each package in Python is a directory which MUST contain a special file called __init__.py. This file can be
empty, and it indicates that the directory containing it is a Python package, so it can be imported the same
way a module can be imported.
If we create a directory called foo, which marks the package name, we can then create a module inside
that package called bar. We also must not forget to add the __init__.py file inside the foo directory.
Intermediate Python
Pandas isnull() function detect missing values in the given object. It returns a boolean same-sized object
indicating if the values are NA. Missing values get mapped to True and non-missing value gets mapped to
False.
301. How do you get the frequency of a categorical column of a dataframe using python?
Basic Python
Using Series.value_counts()
Basic Python
import numpy as np

def ndarray_to_list(x):
    # convert a NumPy array to a plain Python list
    a = np.array(x)
    resultList = []
    for y in a.tolist():
        resultList.append(y)
    return resultList
303. How can we convert a python series object into a dataframe?
Basic Python
Series.to_frame(name=None)
Basic Python
This parameter can be either a single column key, a single array of the same length as the calling
DataFrame, or a list containing an arbitrary combination of column keys and arrays. Here, “array”
encompasses Series, Index, np.ndarray, and instances of Iterator.
Basic Python
Yes
306. What all ways have you used to convert categorical columns into numerical data
using python?
Intermediate Python
One of the most used and popular ones are LabelEncoder and OneHotEncoder from sklearn.preprocessing (the sample data x below is just an illustration):
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
x = ['apple', 'banana', 'apple', 'cherry']   # hypothetical categorical data
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(x)
print(y)   # [0 1 0 2]
OneHotEncoder can be used to transform categorical data into a one-hot encoded array:
onehot_encoder = OneHotEncoder(sparse=False)
y = y.reshape(len(y), 1)
onehot_encoded = onehot_encoder.fit_transform(y)
print(onehot_encoded)
Intermediate Python
OneHotEncoder cannot process string values directly. If your nominal features are strings, then you
need to first map them into integers.
pandas.get_dummies is kind of the opposite. By default, it only converts string columns into one-hot
representation, unless columns are specified.
Intermediate Python
A simple and commonly used plot to quickly check the distribution of a sample of data is the histogram.
pyplot.hist(data)
Basic Python
loc gets rows (or columns) with particular labels from the index.
iloc gets rows (or columns) at particular positions in the index (so it only takes integers).
310. Difference between univariate and bivariate analysis? What all different functions
can be used in python?
Basic Python
Below are a few functions which can be used in univariate and bivariate analysis:
1. Frequency counts of a categorical variable: df.Thal.value_counts()
2. Plot of the distribution: sns.distplot(df.Variable.dropna())
3. Mean of a variable: df.Variable.dropna().mean()
4. Correlation plot: data.corr()
311. What all different methods can be used to standardize the data using python?
Intermediate Python
Standard Scaler.
Robust Scaler.
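A minimal sketch of both scalers (assuming scikit-learn is installed; the column X is hypothetical and includes an outlier):
from sklearn.preprocessing import StandardScaler, RobustScaler
import numpy as np

X = np.array([[1.0], [2.0], [3.0], [100.0]])   # hypothetical column with an outlier

print(StandardScaler().fit_transform(X))   # zero mean, unit variance
print(RobustScaler().fit_transform(X))     # uses median and IQR, less sensitive to the outlier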
Basic Python
Pandas.apply allow the users to pass a function and apply it on every single value of the Pandas series.
Syntax: Series.apply(func, convert_dtype=True, args=())
313. How do you do upsampling of data? Name a python function or explain the code.
Intermediate Python
Up-sampling is the process of randomly duplicating observations from the minority class in order to
reinforce its signal.
There are several heuristics for doing so, but the most common way is to simply resample with
replacement (for example, using sklearn.utils.resample with replace=True).
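A minimal sketch (assuming scikit-learn is installed; df_minority and the sample size are hypothetical):
from sklearn.utils import resample

# df_minority: hypothetical dataframe holding only the minority-class rows
df_minority_upsampled = resample(df_minority,
                                 replace=True,      # sample with replacement
                                 n_samples=1000,    # match the majority class size (hypothetical)
                                 random_state=42)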
314. Can you plot 3D plots using matplotlib? Name the function.
Intermediate Python
Yes
Function:
import numpy as np
import matplotlib.pyplot as plt

fig = plt.figure()
ax = plt.axes(projection='3d')   # creates a 3D Axes
plt.show()
315. How can you drop a column in python?
Basic Python
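One common approach, shown here as a small sketch with made-up column names, is pandas' DataFrame.drop():

import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

df2 = df.drop(columns=["b"])          # returns a new frame without column 'b'
df.drop("c", axis=1, inplace=True)    # or drop in place; axis=1 is the older spelling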
Basic Python
An in-place operation is an operation that directly changes the contents of a given data structure (a DataFrame, vector, matrix or tensor) without making a copy.
When inplace=True is used, the operation modifies the data itself and nothing is returned.
When inplace=False is used (the default), the operation leaves the original data unchanged and returns a new, modified copy.
Intermediate Python
3. Allow a random selection of the same row more than once: df = df.sample(n=3,replace=True)
Intermediate Python
A block is a group of statements in a program or script. Usually, it consists of at least one statement and
declarations for the block, depending on the programming or scripting language. A language which allows
grouping with blocks is called a block-structured language.
subset: takes a column label or a list of column labels; its default value is None. If columns are passed, only those columns are considered when identifying duplicates.
keep: controls how duplicate values are treated. It has only three distinct values ('first', 'last', False) and the default is 'first'.
320. Can you convert a string into an int? When and how?
Basic Python
Python offers the int() function, which takes a string (or a number) as an argument and returns an integer. A purely numeric string such as '42' can be converted directly; a string with a fractional part must first be converted with float().
Passing a floating-point number to int() truncates the fractional part and returns the whole-number part of the float.
Intermediate Python
The zip() function takes iterables (zero or more), aggregates their elements into tuples, and returns an iterator of those tuples.
zip(*iterables)
Basic Python
323. What is the difference between list, array and tuple in Python?
Basic Python
List:
Lists are dynamic, mutable, and can contain objects of different data types.
Array:
An array is mutable, but it stores elements of a single data type.
Tuple:
Tuples are immutable and can store any type of data.
Intermediate R
There are multiple algorithms for sorting data in the R programming language. The different types of sorting functions are discussed below.
Bubble Sort
Insertion Sort
Selection Sort
Merge Sort
Quick Sort
Basic R
1. Dplyr
2. Ggplot2
3. Shiny
4. Lubridate
5. Knitr
6. Mlr
7. Caret
8. Text2Vec
9. Prophet
10. SnowballC
Basic R
dplyr is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most
common data manipulation challenges.
Intermediate R
cbind() and rbind() both create matrices by combining several vectors of the same length. cbind()
combines vectors as columns, while rbind() combines them as rows
Basic R
interaction computes a factor which represents the interaction of the given factors. The result of the
interaction is always unordered.
Basic R
Hint?
330. What is a factor variable, and why would you use one?
Basic R
Hint?
Basic R
Hint?
Hint?
Basic R
Hint?
Basic R
Hint?
Basic R
Hint?
Basic R
Matrix in R –
A matrix is a two-dimensional data structure in which every element must be of the same type, so matrices are homogeneous.
DataFrames in R –
A data frame is used for storing data tables. It can contain multiple data types in multiple columns, called fields; it is a list of vectors of equal length and a generalized form of a matrix, much like a table in an Excel sheet. It has column and row names, and the row names are unique, with no empty columns. The data stored can be numeric, character or factor type, so data frames are heterogeneous.
337. How missing values and impossible values are represented in R language?
Intermediate R
In R, missing values are represented by the symbol NA (not available). Impossible values (e.g., dividing by
zero) are represented by the symbol NaN (not a number)
338. What is the process to create a table in R language without using external files?
Intermediate R
Hint?
Basic R
Matrix in R –
A matrix is a two-dimensional data structure in which every element must be of the same type, so matrices are homogeneous.
DataFrames in R –
A data frame is used for storing data tables. It can contain multiple data types in multiple columns, called fields; it is a list of vectors of equal length and a generalized form of a matrix, much like a table in an Excel sheet. It has column and row names, and the row names are unique, with no empty columns. The data stored can be numeric, character or factor type, so data frames are heterogeneous.
340. How can you verify if a given object “X” is a matrix data object
Basic R
Hint?
Basic R
Shiny is an open-source R package that provides an elegant and powerful web framework for building
web applications using R. Shiny helps you turn your analyses into interactive web applications without
requiring HTML, CSS, or JavaScript knowledge.
Basic R
With R/Python, you can visualise data in a similar way to Tableau and build interactive visualisations with many libraries, but you have a lot more flexibility.
Basic R
344. What are the differences between the sum function and using “+” operator
Basic SAS
The SUM function returns the sum of the non-missing arguments, whereas the “+” operator returns a missing value if any of the arguments are missing.
345. How does PROC SQL work
Intermediate SAS
The SQL query structure does not change when we use PROC SQL. For example:
PROC SQL;
SELECT column(s)
FROM table-name
WHERE expression
GROUP BY column(s)
HAVING expression
ORDER BY column(s);
QUIT;
In the above block, the SELECT statement works just like a standard SQL SELECT query; the only difference is that the block starts with PROC SQL; and always ends with QUIT;.
346. If you are given an unsorted data set, how will you read the last observation to a new
dataset
Intermediate SAS
We can read the last observation to a new data set using end= data set option.
For example:
data work.calculus;
set work.comp end=last;
if last;
run;
Here, a new dataset calculus is created from comp (within the work library). last is the temporary variable (initialized to 0) which is set to 1 when the SET statement reads the last observation.
347. Can you tell the difference between VAR X1 – X3 and VAR X1 — X3
Intermediate SAS
348. What is the purpose of trailing @ and @@? How do you use them
Intermediate SAS
The trailing @ is a line-hold specifier. Using the trailing @ in the INPUT statement gives you the ability to read a part of your raw data line, test it, and then decide how to read additional data from the same record.
The single trailing @ tells the SAS system to “hold the line”.
The double trailing @@ tells the SAS system to “hold the line more strongly”.
An Input statement ending with @@ instructs the program to release the current raw data line only when
there are no data values left to be read from that line. The @@, therefore, holds the input record even
across multiple iterations of the data step.
349. What is the difference between the Do Index, Do While and the Do Until loop
Intermediate SAS
Basic SAS
Searches a character string for a digit and returns the first position at which it is found
Basic SAS
Intermediate Spark
A Data Frame is the tabular representation of data and is equivalent to a table in a relational database
but with better optimization.
RDD is the representation of a set of records, logically partitioned across multiple nodes for parallel
processing.
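A small PySpark sketch contrasting the two (assuming a local Spark session; the data is illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-vs-df").getOrCreate()

# RDD: a low-level, partitioned collection of records
rdd = spark.sparkContext.parallelize([("alice", 34), ("bob", 29)])

# DataFrame: tabular, carries a schema, and is optimized by the Catalyst engine
df = spark.createDataFrame(rdd, ["name", "age"])
df.show()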
Intermediate Spark
Spark performance tuning is the process of efficiently utilizing the spark resources such as memory,
cores, instances as per the input data records.
354. What is a stage in Spark and What are the types of stages?
Basic Spark
Spark stage is nothing but each individual job work/tasks from the entire execution plan.
1. ShuffleMapStage: It is an intermediate stage and produces data for the next stage.
2. ResultStage: Final stage of spark and helps in the computation of result from the action plan.
355. What are shared variables in Spark and what is the use of it?
Basic Spark
Shared variables are variables that can be used by multiple functions and methods running in parallel across tasks. Spark supports two types of shared variables: broadcast variables and accumulators.
356. What is the difference between Batch processing and real time streaming?
Intermediate Spark
Batch processing is the processing of blocks of data that have already been stored over a period of time. It is used in scenarios where large volumes of data must be processed to get detailed insights, rather than fast results. Real-time (stream) processing, on the other hand, is used for real-time analytics: data is processed as it arrives and analytics results are available almost instantly.
Intermediate Spark
SparkSQL itself is built of two main components: Dataframe and SQLContext. SQLContext encapsulates
all the relative functionality of spark and provides extended functionality to be able to 'talk' to different
databases which could be SQL or NoSQL DBs. Every DB has its own respective connectors to be
integrated with spark and with the help of such dedicated connectors SQLContext talks to DBs.
Basic Spark
Accumulators are one of the types of shared variables used in spark. It is meant for numeric data
aggregation where the data is stored in the cache and can be accessed throughout the model
functionalities.
Intermediate Spark
SQLContext is nothing but the gateway to SparkSQL from where the spark can interact with the
databases. HiveContext is the superset of SQLContext which inherits all the property of SQLContext for
DB interactions with addition of HiveContext properties to connect with Hive and HBase.
Intermediate Spark
Spark is basically used where basic python is not capable of solving the issue. Used spark functionalities
on Python for telecom domain use cases where data size was huge > 20 GB. Used RDD concepts for
parallel and fast data preprocessing. used shared variables concept for data storage and loading from
cache.
Intermediate Spark
Kafka is a distributed system consisting of servers and clients that communicate via a high-performance TCP network protocol. Applications called producers send messages (records) to a Kafka node (broker), and these messages are processed by other applications called consumers. The messages are stored in a topic, and consumers subscribe to the topic to receive new messages.
Intermediate Spark
Apache Spark follows a master/slave architecture where the master drives the process and the slave daemons are the worker nodes which do the actual processing.
Basic Spark
Lazy evaluation, as the name suggests, means that execution does not start until an action is triggered. Whenever an operation is applied to an RDD, it does not get executed immediately; Spark adds it to a DAG of computation, and only when the driver requests some data does this DAG actually get executed.
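A short PySpark sketch of this behaviour (assuming a local session):

from pyspark.sql import SparkSession

sc = SparkSession.builder.appName("lazy-eval").getOrCreate().sparkContext

rdd = sc.parallelize(range(10))
squared = rdd.map(lambda x: x * x)            # transformation: nothing runs yet
evens = squared.filter(lambda x: x % 2 == 0)  # still only extends the DAG

print(evens.collect())                        # action: the DAG is executed now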
Intermediate Spark
We need to understand that Spark is not intended to replace the Hadoop stack but rather to enhance its functionality. Spark can enrich processing capabilities, in terms of reading and writing data from HDFS, by being combined with Hadoop MapReduce and HBase.
1. Standalone deployment: Spark runs side by side with Hadoop MapReduce on the cluster, and users can run Spark jobs directly on data in HDFS.
2. Hadoop YARN deployment: users can deploy Spark on YARN and run it without any pre-installation or administrative access required.
Basic Spark
Standalone: the simplest way to run Spark in a clustered environment. It is a cluster that Spark itself manages; it has a master and a number of workers, each with a configured amount of memory and CPU cores.
Mesos: Mesos handles the workload in a distributed environment through dynamic resource sharing and isolation. It is used for large-scale cluster deployments and reduces the overhead of allocating a specific machine to different workloads.
Hadoop YARN: the YARN data-computation framework is a combination of the ResourceManager and the NodeManager. Within the ResourceManager, the Scheduler allocates resources to the various running applications and the ApplicationManager manages applications across all the nodes.
Intermediate Spark
Broadcast variables are useful when a large, read-only dataset needs to be cached on the executors. Without them, the data would have to be shipped to each executor before the actual processing call. A broadcast variable is a read-only mechanism for sharing a variable across executors.
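For instance, broadcasting a small lookup table so every executor holds one read-only copy (names and data are illustrative):

from pyspark.sql import SparkSession

sc = SparkSession.builder.appName("broadcast-demo").getOrCreate().sparkContext

country_names = sc.broadcast({"IN": "India", "US": "United States"})

codes = sc.parallelize(["IN", "US", "IN"])
print(codes.map(lambda c: country_names.value.get(c, "unknown")).collect())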
367. What is the role of Dstream in Spark?
Basic Spark
Discretized Stream or DStream is the basic abstraction provided by Spark Streaming. It represents a
continuous stream of data, either the input data stream received from the source or the processed data
stream generated by transforming the input stream. Internally, a DStream is represented by a continuous
series of RDDs, which is Spark’s abstraction of an immutable, distributed dataset. Each RDD in a DStream
contains data from a certain interval.
Intermediate Spark
1. MEMORY_ONLY: In this level, RDD object is stored as a de-serialized Java object in JVM. If an
RDD doesn’t fit in the memory, it will be recomputed.
2. MEMORY_AND_DISK: In this level, RDD object is stored as a de-serialized Java object in JVM. If
an RDD doesn’t fit in the memory, it will be stored on the Disk.
3. MEMORY_ONLY_SER: In this level, RDD object is stored as a serialized Java object in JVM. It is
more efficient than a de-serialized object.
4. MEMORY_AND_DISK_SER: In this level, RDD object is stored as a serialized Java object in JVM.
If an RDD doesn’t fit in the memory, it will be stored on the Disk.
5. DISK_ONLY: In this level, RDD object is stored only on Disk.
Intermediate Spark
Distributed datasets in Spark are typically very large datasets that need to be partitioned across multiple nodes so they can be processed in parallel; processing such huge datasets efficiently on a single node is not possible. Hence partitioning is required, where each partitioned block is evaluated lazily and the computation is tracked as a DAG.
Basic Spark
Pyspark is nothing but the python API for Spark. Its sole purpose is to support the collaboration of
Apache Spark and Python. It provides an interface to interact with RDD in Apache Spark through python
programming language.
Basic Spark
Transformations and actions are the two types of operations that can be performed on RDDs.
Transformations are operations which, when applied to an RDD, return a new transformed RDD.
Ex: map(), filter(), flatMap(). Actions are methods that access the actual data in an RDD; the result of an action is returned to the program flow where the action is called, and calling an action triggers all the pending transformations.
Ex: collect(), reduce(), first(), take(), count()
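A compact PySpark illustration of both kinds of operations (local session assumed):

from pyspark.sql import SparkSession

sc = SparkSession.builder.appName("transform-vs-action").getOrCreate().sparkContext

lines = sc.parallelize(["spark is fast", "spark is lazy"])

words = lines.flatMap(lambda line: line.split())      # transformation
spark_words = words.filter(lambda w: w == "spark")    # transformation

print(words.count())           # action -> 6
print(spark_words.collect())   # action -> ['spark', 'spark']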
Intermediate Spark
GraphX is Apache Spark’s API for graphs and graph-parallel computation. This includes the collection of
graph algorithms and processes to do graph analytics. GraphX extends the Spark RDD with a Resilient
Distributed Property Graph.
The property graph is a directed multigraph which can have multiple edges in parallel. Every edge and
vertex have user-defined properties associated with it. The parallel edges allow multiple relationships
between the same vertices. It is flexible, fast and open source.
Intermediate Spark
Spark streaming has an advantageous feature of windowed operation. It can do the transformation
operation over a sliding window of data. Generally, the sliding window operation requires two specific
parameters.
Window length which defines the duration of the window & Sliding Interval which defines the interval at
which the operation is performed.
374. What is the difference between Spark Session and Spark Context?
Intermediate Spark
SparkContext is the entry point to core Spark functionality and is used to programmatically create RDDs, accumulators, and broadcast variables. Its object, conventionally named "sc", is the default variable and is created using the SparkContext class.
SparkSession, however, is a superset of SparkContext: it bundles the functionality of the different entry points (SparkContext, SQLContext, HiveContext, etc.) and is itself the entry point to the underlying Spark functionality.
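A minimal sketch showing that the older entry points remain reachable from a SparkSession:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("session-demo").getOrCreate()

sc = spark.sparkContext           # the SparkContext behind the session
rdd = sc.parallelize([1, 2, 3])   # RDD API via the context
spark.range(3).show()             # DataFrame API via the session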
Basic Spark
Theoretically, Spark can perform up to 100 times faster than Hadoop MapReduce, because it processes data in random-access memory (RAM), while Hadoop MapReduce persists data back to disk after each map or reduce action. Spark, however, needs a lot of memory, since it keeps data cached in memory until it is explicitly released.
Basic Spark
SQLContext is nothing but the gateway to SparkSQL from where the spark can interact with the
databases. Here the DB can be both SQL and NoSQL. Respective drivers are available for different DBs
which can be initiated along with the SparkSession builder process itself.
Basic Spark
Intermediate Spark
Spark Streaming processes a continuous stream of data by applying algorithms to it as it arrives, and the output is also produced as a continuous data stream. Kafka Streams, unlike the micro-batch model of Spark Streaming, works on state transitions: it stores state within its topics, which stream-processing applications use for storing and querying data. All of its operations are therefore state-controlled, and these states are further used to connect topics to form an event task.
Intermediate Spark
Hive is connected through HiveContext in spark. HiveContext is the superset of SQLContext which
inherits all the property of SQLContext for DB interactions with addition of HiveContext properties to
connect with Hive and HBase.
Intermediate Spark
Caching stores the partitions of a dataset that each node computes in memory and reuses them in other actions on that dataset, which makes future executions faster. Broadcast variables, on the other hand, allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with every task.
Basic Spark
DAG is the abbreviation of the Directed Acyclic Graph. In Spark, this is used for the visual representation
of RDDs and the operations being performed on them. The RDDs are represented by vertices, while the
operations are represented by edges. Every edge is directed from an 'earlier state' to a 'later state'.
382. Given a table(cars) with 4 columns(model_id, model_name,color, price) , perform
groupby using model_name and color, order by highest price, get 3rd highest.
Basic SQL
Hint?
383. What is the difference between the WHERE and HAVING clauses?
Basic SQL
Hint?
384. Given a table(employee). Find the Second highest salary. Find the 10th highest
salary. Find the 25-30th highest salary.
Intermediate SQL
Hint?
Basic SQL
Hint?
386. Given a table with order-id , order item-id and quantity Find the quantity for distinct
order-id
Basic SQL
Hint?
387. What are the different type of Joins in Sql and explain them? (Mainly focused on full
outer join )
Basic SQL
Hint?
388. Given 2 tables and the following query. What will be the output (select * from table 1
full outer join table 2) where values not in (select * from table 1 inner join table 2)
Intermediate SQL
Hint?
389. Given an assumption: There are 2 tables, first table has 10 records and second table
has 15 records. There are 5 records common in both the tables. Number of records that
would be fetched when you perform left join/right join/inner join/cross-join.
Basic SQL
Hint?
390. Given a word "JOE", find the word in a given string irrespective of word being upper
case or lower case or capitalize?
Intermediate SQL
Hint?
391. Find out if the database has any duplicate record names.
Basic SQL
Hint?
392. Differentiate between Implicit vs Explicit Join
Intermediate SQL
Hint?
393. With respect to SQL, which one is more preferable - Subqueries or Joins? Why?
Intermediate SQL
Hint?
Basic SQL
Hint?
395. Query to find the employees in the office given check in and check out as fields.
Intermediate SQL
Hint?
396. Given a table of an event having columns date-ts/ event id. Find the event that
happened 3rd on every month
Basic SQL
Hint?
Basic SQL
Hint?
398. Find the Salary greater than Average salary without using Joins or Sub-Queries
Advanced SQL
Hint?
Basic SQL
Hint?
Basic SQL
Hint?
Basic SQL
Hint?
Basic SQL
The INNER JOIN creates a new result table by combining column values of two tables (table1 and table2)
based upon the join-predicate. The query compares each row of table1 with each row of table2 to find all
pairs of rows which satisfy the join-predicate.
Basic SQL
(INNER) JOIN: Returns records that have matching values in both tables
LEFT (OUTER) JOIN: Returns all records from the left table, and the matched records from the right
table
RIGHT (OUTER) JOIN: Returns all records from the right table, and the matched records from the left
table
FULL (OUTER) JOIN: Returns all records when there is a match in either left or right table
Basic SQL
Normalization is a scaling technique in which values are shifted and rescaled so that they end up ranging
between 0 and 1. It is also known as Min-Max scaling.
Basic SQL
Intermediate SQL
COUNT(column_name) will count the number of records where column_name is not null.
407. Write SQL query to find the cumulative price of each customer in a table?
Intermediate SQL
SELECT CustomerID
      ,TransactionDate
      ,TransactionAmount
      ,SUM(TransactionAmount) OVER (PARTITION BY CustomerID
                                    ORDER BY TransactionDate) AS CumulativeAmount
FROM Sales.CustomerTransactions
WHERE TransactionTypeID = 1
ORDER BY CustomerID
        ,TransactionDate
Intermediate SQL
SELECT [FirstName],
[LastName],
[Country],
COUNT(*) AS CNT
FROM [SampleDB].[dbo].[Employee]
GROUP BY [FirstName],
[LastName],
[Country]
HAVING COUNT(*) > 1
Intermediate SQL
Hint?
Intermediate SQL
Hint?
Basic SQL
Primary keys must contain UNIQUE values, and cannot contain NULL values.
A table can have only ONE primary key; and in the table, this primary key can consist of single or
multiple columns (fields).
Basic SQL
A table can have only one primary key, which may consist of single or multiple fields. When multiple
fields are used as a primary key, they are called a composite key. If a table has a primary key defined on
any field(s), then you cannot have two records having the same value of that field(s).
Basic SQL
A unique key constraint identifies an individual tuple uniquely in a relation or table. Unlike the primary key, a table can have more than one unique key. A unique key constraint can accept only one NULL value for the column. Unique constraints can also be referenced by the foreign key of another table. They are used when you want to enforce uniqueness on a column, or a group of columns, that is not the primary key.
Basic SQL
A FOREIGN KEY is a field (or collection of fields) in one table that refers to the PRIMARY KEY in
another table.
The table containing the foreign key is called the child table, and the table containing the candidate key
is called the referenced or parent table
Basic SQL
A table may have multiple foreign keys, and each foreign key can have a different parent table.
Intermediate SQL
A NOT NULL constraint in SQL is used to prevent inserting NULL values into the specified column,
considering it as a not accepted value for that column. This means that you should provide a valid SQL
NOT NULL value to that column in the INSERT or UPDATE statements, as the column will always contain
data
Intermediate SQL
The CHECK constraint is used to limit the value range that can be placed in a column. If you define a
CHECK constraint on a single column it allows only certain values for this column. If you define a CHECK
constraint on a table it can limit the values in certain columns based on values in other columns in the
row.
Intermediate SQL
The DEFAULT constraint is used to provide a default value for a column. The default value will be added
to all new records IF no other value is specified.
419. What is the difference between NULL value, Zero, and Blank space?
Intermediate SQL
A NULL value is not the same as zero or a blank space. A NULL value is a value which is 'unavailable, unassigned, unknown or not applicable', whereas zero is a number and a blank space is a character.
Intermediate SQL
A composite key is a combination of two or more columns in a table that can be used to uniquely identify each row. Uniqueness is guaranteed only when the columns are combined; taken individually, they do not guarantee uniqueness.
Intermediate SQL
The SQL SELECT LIMIT statement is used to retrieve records from one or more tables in a database and
limit the number of records returned based on a limit value
422. write a query f_name and l_name fields from table emp and allow a space in
between the 2 columns
Intermediate SQL
423. Write a query to rename the column name id as emp_id, name as emp_name for the
table emp;
Basic SQL
424. select * from dual, what does the dual mean and what is the default data types ?
Intermediate SQL
The DUAL is special one row, one column table present by default in all Oracle databases. The owner of
DUAL is SYS (SYS owns the data dictionary, therefore DUAL is part of the data dictionary.) but DUAL can
be accessed by every user.
The table has a single VARCHAR2(1) column called DUMMY that has a value of 'X'. MySQL allows DUAL
to be specified as a table in queries that do not need data from any tables. In SQL Server DUAL table
does not exist, but you could create one.
425. How do you get the current system date using dual table?
Intermediate SQL
426. Write a query to get the number of records from a table emp
Basic SQL
DDL is Data Definition Language and is used to define the structures like schema, database, tables,
constraints etc. Examples of DDL are create and alter statements.
Basic SQL
DML is Data Manipulation Language and is used to manipulate data. Examples of DML are insert,
update and delete statements.
Intermediate SQL
430. How to get only the delhi records from emp table, and handle all types of case
sensitive issues: Delhi, delhi, DELHI, DELhi
Intermediate SQL
431. Write a query to change the format of the date to (YYYY-MON-DD) in dual table ?
Intermediate SQL
Basic SQL
Basic SQL
434. Write a query to select only the id, name, city,country and phone from the table
customer and restrict the record only to india
Basic SQL
select id, name, city, country, phone from customer where country='india'
435. Write a query to update the table emp, where the city name is Madras to chennai
Basic SQL
UPDATE emp
SET city = 'Chennai'
WHERE city = 'Madras';
436. Write a query to remove record whose salary is great than or equal to 50000 and
city is chennai
Basic SQL
437. Write a query to select all the students from table stud whose name begins with 'S'
Intermediate SQL
438. Write a query to display all the records from table emp where the age is between 18
and 58
Basic SQL
439. Select all the record for emp, in which gender is female or age > 18
Basic SQL
440. Write a query to exact all the record for which payment_detail col is null for the
table payment_detail
Basic SQL
441. Write a query to get the top 5 salary from the table emp
Basic SQL
442. Query the records from emp where order is descending for name and ascending for
salary
Intermediate SQL
select * from emp order by name desc, salary asc
443. What is the difference between union and union all in SQL
Intermediate SQL
UNION removes duplicate records (where all columns in the results are the same), UNION ALL does not.
There is a performance hit when using UNION instead of UNION ALL, since the database server must do
additional work to remove the duplicate rows, but usually, you do not want the duplicates (especially
when developing reports).
444. What is an execution plan? When would you use it? How would you view the
execution plan
Intermediate SQL
An execution plan is a window in SQL Server Management Studio that shows how SQL Server breaks down a query and identifies where issues might exist within the plan. By identifying the statements that take a long time to complete, you can then look at the execution plan to determine tuning needs.
You can use it any time you write a query. Most developers use the execution plan when their database queries consume a lot of resources and take a long time.
Actual Execution Plan - (CTRL + M) - is created after execution of the query and contains the steps that
were performed
Estimated Execution Plan - (CTRL + L) - is created without executing the query and contains an
approximate execution plan
Graphical Plans
XML Plans
445. How can you select all the even number records from a table? All the odd number
records?
Intermediate SQL
446. What is the difference between the RANK() and DENSE_RANK() functions? Provide
an example.
Intermediate SQL
The one and only difference between the DENSE_RANK() and RANK() functions is that RANK() assigns non-consecutive ranks to the values in a set in the case of a tie, which means that with RANK() there will be gaps between the integer rank values when there is a tie. DENSE_RANK(), by contrast, assigns consecutive ranks in the case of a tie, so there are no gaps between the rank values.
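Since the code examples in this document use Python, the same gap-vs-no-gap behaviour can be illustrated with pandas' rank() (a stand-in for the SQL functions, not their actual implementation):

import pandas as pd

salaries = pd.Series([300, 200, 200, 100])

print(salaries.rank(method="min", ascending=False).tolist())    # RANK()-like:       [1.0, 2.0, 2.0, 4.0]
print(salaries.rank(method="dense", ascending=False).tolist())  # DENSE_RANK()-like: [1.0, 2.0, 2.0, 3.0]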
Intermediate SQL
CHAR is used for storing fixed-length character strings. It can waste a lot of disk space if this type is used to store variable-length strings.
Basic Statistics
Hint?
Basic Statistics
Hint?
450. Why you used T-test in the project that you have mentioned in your resume.
Basic Statistics
Hint?
451. Given two populations, to perform a test of effectiveness of a drug, which statistical
test will you perform?
Intermediate Statistics
Hint?
452. If height is correlated with weight and weight is correlated with height, are both the statements the same?
Basic Statistics
Yes. Correlation is symmetric, so both statements describe the same relationship between the two continuous variables.
Basic Statistics
A z-score measures exactly how many standard deviations above or below the mean a data point is.
A data point can be considered unusual if its z-score is above 3 or below -3.
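A quick numeric sketch of that rule of thumb (the data values are made up; the last one is the unusual point):

import numpy as np

data = np.array([10, 11, 12, 9, 10, 11, 12, 10, 9, 11,
                 10, 12, 11, 9, 10, 11, 10, 12, 9, 11, 60])

z = (data - data.mean()) / data.std()
print(data[np.abs(z) > 3])   # -> [60]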
Basic Statistics
Hint?
Intermediate Statistics
Hint?
Intermediate Statistics
Hint?
457. What are independent variables and categorical variables. Highlight the key
differences.
Basic Statistics
Basic Statistics
The Chi-Square statistic is commonly used for testing relationships between categorical variables. The null hypothesis of the Chi-Square test is that no relationship exists between the categorical variables in the population, i.e. that they are independent.
Intermediate Statistics
- If a sample is representative of a population, then the sample reflects the characteristics of the population, so findings from the sample can be generalized to the population.
Basic Statistics
A statistical hypothesis is an assumption about a population parameter. This assumption may or may not
be true. Hypothesis testing refers to the formal procedures used by statisticians to accept or reject
statistical hypotheses
461. A scenario was given and was asked to write Null and Alternate Hypothesis
Intermediate Statistics
Null hypothesis. The null hypothesis, denoted by Ho, is usually the hypothesis that sample observations
result purely from chance.
Alternative hypothesis. The alternative hypothesis, denoted by H1 or Ha, is the hypothesis that sample
observations are influenced by some non-random cause.
Intermediate Statistics
We can handle skewness using log transformation. A log transformation can help to fit a very skewed
distribution into a Gaussian one.
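A small check of that effect (assuming numpy and scipy are available; np.log1p can be used instead of np.log if zeros are possible):

import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)   # strongly right-skewed sample

print(round(skew(x), 2))          # large positive skewness
print(round(skew(np.log(x)), 2))  # roughly 0 after the log transform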
Basic Statistics
Data from many fields of study surprisingly can be described using a Gaussian distribution, so much so
that the distribution is often called the “normal” distribution because it is so common.
mean: Denoted with the Greek lowercase letter mu, is the expected value of the distribution.
variance: Denoted with the Greek lowercase letter sigma raised to the second power (because the units
of the variable are squared), describes the spread of observation from the mean.
standard deviation: Denoted with the Greek lowercase letter sigma, describes the normalized spread of
observations from the mean.
Student’s t-Distribution: a distribution that arises when estimating the mean of a normally distributed population from small samples, when the population standard deviation is unknown.
number of degrees of freedom: denoted with the lowercase Greek letter nu (ν); it gives the number of independent pieces of information available for the estimate.
Chi-Squared Distribution: Like the Student’s t-distribution, the chi-squared distribution is also used in
statistical methods on data drawn from a Gaussian distribution to quantify the uncertainty.
etc
Basic Statistics
The binomial distribution is one whose number of possible outcomes per trial is two, i.e. success or failure, with a fixed number of trials. The Poisson distribution, on the other hand, counts events and has no upper limit on the number of possible outcomes.
465. What are the conditions for performing two sample hypothesis testing?
Basic Statistics
The two independent samples are simple random samples that are independent.
The number of successes is at least five and the number of failures is at least five for each of the
samples.
Basic Statistics
- A sigmoid function is a mathematical function with a characteristic S-shaped curve. There are a number of common sigmoid functions, such as the logistic function, the hyperbolic tangent, and the arctangent.
- All sigmoid functions map the entire number line into a small range, such as between 0 and 1 or between -1 and 1, so one use of a sigmoid function is to convert a real value into one that can be interpreted as a probability. This is tied to the “odds ratio” p / (1 - p), which describes the ratio between the probability that a certain positive event occurs and the probability that it doesn’t occur, where “positive” refers to the event we want to predict, i.e. p(y=1 | x).
- In logistic regression, the sigmoid function outputs the conditional probabilities of the prediction, i.e. the class probabilities.
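A minimal sketch of the logistic sigmoid and its link to the log-odds:

import numpy as np

def sigmoid(z):
    # logistic function: maps any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
p = sigmoid(z)
print(np.round(p, 3))                    # [0.007 0.269 0.5   0.731 0.993]
print(np.round(np.log(p / (1 - p)), 3))  # the log-odds recover z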
467. You are given a data set. The data set has missing values which spread along 1
standard deviation from the median. What percentage of data would remain unaffected?
Why?
Intermediate Statistics
Intermediate Statistics
There are basically two types, namely, null hypothesis and alternative hypothesis
The null hypothesis is generally denoted as H0. It states the exact opposite of what an investigator or an
experimenter predicts or expects. It basically defines the statement which states that there is no exact or
actual relationship between the variables.
The alternative hypothesis is generally denoted as H1. It makes a statement that suggests or advises a
potential result or an outcome that an investigator or the researcher may expect. It has been categorized
into two categories: directional alternative hypothesis and non-directional alternative hypothesis.
Basic Statistics
Variance is a one-variable measure and covariance a two-variable measure: they quantify the volatility of a random variable and the relationship between two random variables, respectively. The higher the volatility of a stock, the riskier it is, and buying stocks with negative covariance is a good way to minimize risk. A positive covariance means the assets move in the same direction, whereas a negative covariance means they generally move in opposite directions.
Basic Statistics
Noisy data are data with a large amount of additional meaningless information in it called noise. This
includes data corruption and the term is often used as a synonym for corrupt data. It also includes any
data that a user system cannot understand and interpret correctly.
Sources of noise:
- Outlier data are data that appears to not belong in the data set. It can be caused by human error such as
transposing numerals, mislabeling, programming bugs, etc
- Fraud: Individuals may deliberately skew data to influence the results toward a desired conclusion
Basic Tableau
2. From the Sheets list at left, drag views to your dashboard at the right
3. To replace a sheet, select it in the dashboard at right. In the Sheets list at left, hover over the
replacement sheet, and click the Swap Sheets button.
Basic Tableau
The different types of filters used in Tableau are given below. The filter types are listed in their order of execution in Tableau.
Extract Filters
Context Filters
Dimension Filters
Measure Filters
Basic Tableau
Dimensions contain qualitative values (such as names, dates, or geographical data). You can use
dimensions to categorize, segment, and reveal the details in your data. Dimensions affect the level of
detail in the view.
Measures contain numeric, quantitative values that you can measure. Measures can be aggregated.
When you drag a measure into the view, Tableau applies an aggregation to that measure (by default).
Intermediate Tableau
There are four types of joins which are used to combine data in Tableau: inner, left, right and full outer.
Let’s look into it one by one:
Inner:
Inner join results in a table that contains values that have matches in both tables.
Left:
The left join results in a table that contains all the values from the left table and the corresponding matches from the right table. If a value in the left table doesn’t have a corresponding match in the right table, a null value is reflected in the data grid.
Right:
The right join results in a table that contains all the values from the right table and the corresponding matches from the left table. If a value in the right table doesn’t have a corresponding match in the left table, a null value is reflected in the data grid.
Full Outer:
Full outer join results in a table that contains all values from both tables. And a null value is reflected in
data grid when a value from either table doesn’t have a match with the other table.
475. What is a Calculated Field, and How Will You Create One
Basic Tableau
Sometimes your data source does not contain a field (or column) that you need for your analysis. For
example, your data source might contain fields with values for Sales and Profit, but not for Profit Ratio. If
this is the case, you can create a calculated field for Profit Ratio using data from the Sales and Profit
fields.
To create one, select Analysis > Create Calculated Field; in the Calculation Editor that opens, give the calculated field a name and enter a formula.
Intermediate Tableau
A parameter is a global placeholder value such as a number, date, or string that can replace a constant
value in a calculation, filter, or reference line.
For example, you may create a calculated field that returns True if Sales is greater than $500,000 and
otherwise returns False. You can replace the constant value of “500000” in the formula with a parameter.
Then, using the parameter control, you can dynamically change the threshold in your calculation
Intermediate Tableau
Dual axes are two independent axes that are layered on top of each other. According to Tableau, dual
axes allow you to compare multiple measures. Dual axes are useful when you have two measures that
have different scales.
Basic Tableau
A heat map is a two-dimensional representation of information with the help of colours. Heat maps can
help the user visualize simple or complex information.
Treemaps are ideal for displaying large amounts of hierarchically structured (tree-structured) data. The
space in the visualization is split up into rectangles that are sized and ordered by a quantitative variable.
The levels in the hierarchy of the treemap are visualized as rectangles containing other rectangles. Each
set of rectangles on the same level in the hierarchy represents a column or an expression in a data table.
Each individual rectangle on a level in the hierarchy represents a category in a column.
Tableau Workbook File (TWB) is an XML document. It contains the information about your sheets,
dashboards and stories. The TWB file references a data source file such as Excel or TDE, and when you
save the TWB file, it is linked to the source.
The most important thing to remember about TWB files is that they don’t contain any data – if you want
to share your workbook, therefore, you will need to send both the Tableau Workbook File and the data
source file.
Tableau Packaged Workbook (TWBX) is a package of files “compressed” together. It includes a data
source file, TWB, and any other file used to produce the workbook (including images).
TWBX is intended for sharing. It does not link to the original file source; instead, it contains a copy of the
data that was obtained when the file was created. TWBX files are usually used as reports and can be
viewed using Tableau Viewer.
TWBX isn’t designed for auto-updating. If you refresh/update the source file, TWBX will stay unchanged.
If you want your workbook to update when the source file is updated, you need to use the TWB file
format.
480. Explain the Difference Between Tableau Worksheet, Dashboard, Story, and
Workbook
Basic Tableau
Tableau uses a workbook and sheet file structure, much like Microsoft Excel.
A worksheet contains a single view along with shelves, legends, and the Data pane.
A dashboard is a collection of views from multiple worksheets arranged on a single page.
A story contains a sequence of worksheets or dashboards that work together to convey information.
A workbook contains one or more worksheets, along with any dashboards and stories built from them.
Hint?
Intermediate Tableau
A Pareto chart is a type of chart that contains both bars and a line graph, where individual values are
represented in descending order by bars, and the ascending cumulative total is represented by the line.
To create a Pareto chart in Tableau, first build a bar chart:
1. From the Dimensions area of the Data pane, drag Sub-Category to Columns.
2. From the Measures area of the Data pane, drag Sales to Rows.
3. Sort the Sub-Category bars in descending order, leaving all other values unchanged, with Sales as the chosen field and Sum as the chosen aggregation.
Next, add a dual axis:
4. Drag Sales to Rows again and drop it to create a dual-axis view. It is a bit hard to see that there are two instances of the Sales bars at this point, because they are arranged identically.
5. Select SUM(Sales) (2) on the Marks card and change the mark type to Line.
Then add a table calculation to the line chart to show sales by Sub-Category as a running total and as a percent of total:
6. Click the second copy of SUM(Sales) on Rows and select Add Table Calculation.
7. Add a primary table calculation to SUM(Sales) to present sales as a running total.
8. Click Add Secondary Calculation and choose Percent of Total as the secondary calculation type, so the data is presented as a percent of the overall total.
9. Click the X in the upper-right corner of the Table Calculations panel to close it.
10. Click Color on the Marks card to change the colour of the line.
Basic HR
Hint?
Basic HR
Hint?
Hint?
Basic HR
Hint?
Basic HR
Hint?
Intermediate HR
Hint?
489. What significant goals have you set for yourself in the past? Have you achieved
those?
Intermediate HR
Hint?
490. You have worked in the IT sector for so long, why is there a sudden interest in
analytics?
Intermediate HR
Hint?
491. Have you worked on any analytics projects or assignments?
Basic HR
Hint?
Intermediate HR
Hint?
493. Do you have any idols? In what way do they inspire you?
Basic HR
Hint?
494. What are your interests and hobbies? What do you do in your free time?
Basic HR
Hint?
Intermediate HR
Hint?
Intermediate HR
Hint?
497. Do you have any questions for us?
Basic HR
Hint?