Interview Questions

The document is a comprehensive list of interview questions and answers related to statistics and data science, covering topics such as causal inference, sampling methods, and model evaluation techniques. It includes explanations of key concepts like the Law of Large Numbers, confidence intervals, and the importance of train/test splits. Additionally, it discusses practical applications of statistical methods in communicating insights and handling data issues like missing values and outliers.


Table of Contents

INTERVIEW QUESTIONS
What is causal inference?
Why do we do a train/test split?
When is it best to remove missing values?
How do you communicate statistical insights to non-technical stakeholders?
Give an example where the median is a better measure than the mean
How do you calculate the needed sample size?
What are the types of sampling in Statistics?
What is Bessel's correction?
What is the assumption of normality?
What types of biases can you encounter while sampling?
What is the meaning of an inlier?
What is the difference between Point Estimate and Confidence Interval Estimate?
Mention the relationship between standard error and margin of error?
What is the proportion of confidence interval that will not contain the population parameter?
What is the Law of Large Numbers in statistics?
What is the goal of A/B testing?
What do you understand by sensitivity and specificity?
What is Resampling and what are the common methods of resampling?
What is Cost Function?
What is the Law of Large Numbers in statistics and how it can be used in data science?
What is the difference between a confidence interval and a prediction interval, and how do you calculate them?
How to make your model robust to outliers
Explain the linear regression model and discuss its assumptions
Describe the motivation behind random forests and mention two reasons why they are better than individual decision trees
What are the differences and similarities between gradient boosting and random forest? And what are the advantages and disadvantages of each when compared to each other?
What are the differences between a model that minimizes squared error and the one that minimizes the absolute error? And in which cases each error metric would be more appropriate?
Define and compare parametric and non-parametric models and give two examples for each of them?
Explain the kernel trick in SVM. Why do we use it and how to choose what kernel to use?
Define the cross-validation process and the motivation behind using it
You are building a binary classifier and you found that the data is imbalanced, what should you do to handle this situation?
You are working on a clustering problem, what are different evaluation metrics that can be used, and how to choose between them?
What is the ROC curve and when should you use it?
What is the difference between hard and soft voting classifiers in the context of ensemble learners?
What is boosting in the context of ensemble learners? Discuss two famous boosting methods
How can you evaluate the performance of a dimensionality reduction algorithm on your dataset?
Define the curse of dimensionality and how to solve it
In what cases would you use vanilla PCA, Incremental PCA, Randomized PCA, or Kernel PCA?
Discuss two clustering algorithms that can scale to large datasets
Do you need to scale your data if you will be using the SVM classifier and discuss your answer
What are Loss Functions and Cost Functions? Explain the key Difference Between them
What is the importance of batch in machine learning and explain some batch-dependent gradient descent algorithms?
What are the different methods to split a tree in a decision tree algorithm?
Why boosting is a more stable algorithm as compared to other ensemble algorithms?
What is active learning and discuss one strategy of it?
What are the different approaches to implementing recommendation systems?
What are the evaluation metrics that can be used for multi-label classification?
What is the difference between concept and data drift and how to overcome each of them?
INTERVIEW QUESTIONS
What is causal inference?

Causal inference is the process of determining whether and how one variable (the cause)
directly affects another variable (the effect); establishing this directional effect, rather
than mere correlation, is what distinguishes it. In practice, causal inference is a group of
methods that aim to establish a cause-and-effect relationship in order to understand whether
something is working, such as an intervention or treatment of some kind. If a researcher
needs to understand whether a drug is working, for example, causal inference can help
answer the question.
Why do we do a train/test split?

Splitting data into training and test sets helps us evaluate the model's performance on
unseen data. The training set is used to train the model, while the test set is used to
assess how well the model generalizes to new data. This practice helps in detecting
overfitting and ensures that the model can perform well on real-world data.
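For illustration, here is a minimal scikit-learn sketch of a hold-out split (the bundled iris dataset is used as stand-in data):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
# Hold out 20% of the data for evaluation on unseen examples
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("train accuracy:", model.score(X_train, y_train))
print("test accuracy: ", model.score(X_test, y_test))  # estimate of generalization
```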
When is it best to remove missing values?

Removing missing values can be appropriate when the proportion of missing data is
very small, typically less than 5%. Also, it’s a good idea when the missing data are
MCAR (Missing Completely at Random), meaning the missingness does not introduce
bias. Finally, I would consider removing missing values if the dataset is large enough
that deleting a small number of rows does not significantly impact the analysis.
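As a small illustrative sketch (the columns and the 5% threshold are example values), the decision might look like this in pandas:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 31, 40], "income": [50_000, 62_000, np.nan, 58_000]})

missing_share = df.isna().mean().max()           # largest fraction of missing values in any column
if missing_share < 0.05:                         # small, assumed-MCAR gap: dropping rows is acceptable
    df_clean = df.dropna()
else:                                            # otherwise prefer imputation over deletion
    df_clean = df.fillna(df.median(numeric_only=True))
print(df_clean)
```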

How do you communicate statistical insights to non-technical


stakeholders?

To effectively communicate with different stakeholders, I adjust my style based on


their backgrounds and interests. For example, for executives, I prioritize business
impact, using business language and visuals to facilitate quick decision-making. On
the other hand, for developers, I provide technical details. In both cases, I make sure
that the concepts involved are relevantly clear and accessible, and I encourage
questions and feedback. This approach ensures that each stakeholder group receives
the information they need in a format that resonates with them.

Give an example where the median is a better measure than the mean

The median is a better measure of central tendency than the mean when the
distribution of data values is skewed or when there are clear outliers.
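A quick numeric illustration (hypothetical household incomes with one extreme value):

```python
import numpy as np

# Household incomes with one extreme outlier (made-up numbers)
incomes = np.array([32_000, 35_000, 38_000, 41_000, 45_000, 1_500_000])

print("mean:  ", incomes.mean())      # ~281,833 -- dragged upward by the outlier
print("median:", np.median(incomes))  # 39,500   -- a better summary of a "typical" household
```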

How do you calculate the needed sample size?

To calculate the sample size needed for a survey or experiment:

📍 Define the population size: The first step is to determine the total number of people in your
target demographic. If you are dealing with a very large population, an educated estimate of
the total population is sufficient.
📍 Decide on a margin of error: Closely tied to the confidence interval (it is half the interval's
width), the margin of error indicates how much of a difference you are willing to allow between
your sample mean and the population mean.

📍 Choose a confidence level: Your confidence level indicates how assured you are that
the actual mean will fall within your chosen margin of error. The most common
confidence levels are 90%, 95%, and 99%. Your specified confidence level corresponds
with a z-score.

Z-scores for the three most common confidence levels are:

 90% = 1.645

 95% = 1.96

 99% = 2.576

📍 Pick a standard deviation: Next, you will need to determine the standard deviation, or the
level of variance you expect to see in the information gathered. If you don't know how much
variance to expect, a standard deviation of 0.5 is typically a safe choice that will ensure your
sample size is large enough.

📍 Calculate your sample size: Finally, plug these values into the sample size formula, or use an
online sample size calculator (a small sketch of the calculation follows).
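As a rough sketch, the calculation below uses Cochran's formula with a finite-population correction; the population size, margin of error, and z-score are example values:

```python
import math

def sample_size(population, margin_of_error=0.05, z=1.96, p=0.5):
    """Cochran's formula with a finite-population correction (a common textbook approach)."""
    n0 = (z ** 2) * p * (1 - p) / margin_of_error ** 2   # infinite-population sample size
    return math.ceil(n0 / (1 + (n0 - 1) / population))   # adjust for the finite population

print(sample_size(population=10_000, margin_of_error=0.05, z=1.96))  # ~370 respondents
```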

What are the types of sampling in Statistics?

The four main types of data sampling in Statistics are:

📍 Simple random sampling: This method involves pure random division. Each
individual has the same probability of being chosen to be a part of the sample.
📍 Cluster sampling: This method involves dividing the entire population into clusters, often
based on parameters such as geography or location. A random subset of clusters is then
selected, and the members of those clusters are included in the sample.

📍 Stratified sampling: This method involves dividing the population into unique groups (strata)
that together represent the entire population. A sample is then drawn from each group
separately.

📍 Systematic sampling: This method involves choosing sample members from a larger population
according to a random starting point and a fixed, periodic interval called the sampling interval.
The sampling interval is calculated by dividing the population size by the desired sample size.
Because the selection pattern is predefined, this is the least time-consuming method.
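Below is a minimal sketch of three of these schemes using NumPy and pandas on a made-up population table (the column names and sizes are example values):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
population = pd.DataFrame({"id": np.arange(1000), "group": rng.choice(["A", "B"], size=1000)})

# Simple random sampling: every individual has the same chance of selection
simple = population.sample(n=50, random_state=0)

# Systematic sampling: random start, then every k-th member (k = population size / sample size)
k = len(population) // 50
start = rng.integers(k)
systematic = population.iloc[start::k]

# Stratified sampling: draw separately from each group ("stratum")
stratified = population.groupby("group", group_keys=False).sample(frac=0.05, random_state=0)
```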
What is Bessel’s correction?

In statistics, Bessel’s correction is the use of n-1 instead of n in several formulas,


including the sample variance and standard deviation, where n is the number of
observations in a sample. This method corrects the bias in the estimation of the
population variance. It also partially corrects the bias in the estimation of the
population standard deviation, thereby, providing more accurate results.
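For example, NumPy exposes the correction through the ddof argument:

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

biased   = np.var(x, ddof=0)  # divides by n   (population formula)
unbiased = np.var(x, ddof=1)  # divides by n-1 (Bessel's correction)
print(biased, unbiased)       # the corrected estimate is slightly larger
```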

What is the assumption of normality?

This assumption of normality dictates that if many independent random samples are
collected from a population and some value of interest (like the sample mean) is
calculated, and then a histogram is created to visualize the distribution of sample
means, a normal distribution should be observed.

What types of biases can you encounter while sampling?

Sampling bias occurs when a sample is not representative of a target population


during an investigation or a survey. The three main biases one can encounter while
sampling are:

📍Selection bias: It involves the selection of individual or grouped data in a way that is
not random.

📍Undercoverage bias: This type of bias occurs when some population members are
inadequately represented in the sample.

📍Survivorship bias occurs when a sample concentrates on the ‘surviving’ or existing


observations and ignores those that have already ceased to exist. This can lead to
wrong conclusions in numerous different ways.
What is the meaning of an inlier?

An inlier is a data value that lies within the general distribution of other observed
values but is an error. Inliers are difficult to distinguish from good data values,
therefore, they are sometimes difficult to find and correct.

An example of an inlier might be a value recorded in the wrong units.

What is the difference between Point Estimate and Confidence Interval


Estimate?

📍 A point estimate gives a single value as an estimate of a population parameter. For


example, a sample standard deviation is a point estimate of a population’s standard
deviation.

📍 A confidence interval estimate gives a range of values likely to contain the


population parameter. It is the most common type of interval estimate because it tells
us how likely this interval is to contain the population parameter.

Mention the relationship between standard error and margin of error?

As the standard error increases, the margin of error also increases.

The margin of error can be calculated using the standard error with this formula:

Margin of error = Critical value * Standard error of the sample

What is the proportion of confidence interval that will not contain the
population parameter?

Alpha (α) is the portion of the confidence interval that will not contain the population
parameter.

α = 1 – CL = the probability a confidence interval will not include the population


parameter.

1 – α = CL = the probability the population parameter will be in the interval

For example, if the confidence level (CL) is 95%, then, α = 1 – 0.95, or α = 0.05.
What is the Law of Large Numbers in statistics?

According to the law of large numbers in statistics, as the number of trials performed
increases, the average of the results converges to the expected value.

For example, the observed proportion of heads when flipping a fair coin is closer to 0.5
when the coin is flipped 100,000 times than when it is flipped 50 times.
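A quick simulation illustrates this convergence:

```python
import numpy as np

rng = np.random.default_rng(42)
flips = rng.integers(0, 2, size=100_000)             # 1 = heads, 0 = tails on a fair coin

print("mean after 50 flips:     ", flips[:50].mean())
print("mean after 100,000 flips:", flips.mean())      # much closer to the expected value 0.5
```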

What is the goal of A/B testing?

A/B testing is a form of statistical hypothesis testing: an analytical method for making
decisions that estimates population parameters based on sample statistics.

The goal is usually to identify which changes to a web page maximize or increase the
outcome of interest. A/B testing is a fantastic method for figuring out the best online
promotional and marketing strategies for your business.
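One common way to analyze an A/B test on conversion rates is a two-proportion z-test; below is a minimal sketch using statsmodels with made-up conversion counts:

```python
from statsmodels.stats.proportion import proportions_ztest

conversions = [420, 480]       # converted users in variants A and B (example numbers)
visitors    = [10_000, 10_000] # visitors shown each variant

stat, p_value = proportions_ztest(conversions, visitors)
print(f"z = {stat:.2f}, p = {p_value:.4f}")  # reject H0 at alpha = 0.05 if p < 0.05
```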

What do you understand by sensitivity and specificity?

📍Sensitivity is a measure of the proportion of actual positive cases that got predicted
as positive (or true positive).

📍Specificity is a measure of the proportion of actual negative cases that got predicted
as negative (or true negative).

The calculation of sensitivity and specificity is straightforward: Sensitivity = TP / (TP + FN) and Specificity = TN / (TN + FP), where TP, FN, TN, and FP are the counts of true positives, false negatives, true negatives, and false positives, respectively.
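A small sketch computing both from a confusion matrix (toy labels):

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)   # true positive rate (recall)
specificity = tn / (tn + fp)   # true negative rate
print(sensitivity, specificity)
```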

What is Resampling and what are the common methods of resampling?

Resampling involves repeatedly drawing randomized cases (often with replacement) from the
original data sample, so that each drawn sample has a composition similar to the original
data sample.

Two common methods of resampling are:

1. Bootstrapping (resampling with replacement)

2. Cross Validation
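As an illustrative sketch, a bootstrap confidence interval for the mean might be computed like this (the data here is simulated):

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(loc=10, scale=2, size=200)            # observed data (simulated for the example)

boot_means = [rng.choice(sample, size=sample.size, replace=True).mean()
              for _ in range(5_000)]                      # resample with replacement many times
low, high = np.percentile(boot_means, [2.5, 97.5])        # 95% bootstrap CI for the mean
print(f"bootstrap 95% CI for the mean: ({low:.2f}, {high:.2f})")
```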
What is Cost Function?

The cost function is an important quantity that measures the performance of a
machine learning model for a given dataset.

It measures how wrong the model is in estimating the relationship between input
and output parameters.

What is the Law of Large Numbers in statistics and how it can be used
in data science?

The law of large numbers states that as the number of trials in a random experiment
increases, the average of the results obtained from the experiment approaches the
expected value. In statistics, it's used to describe the relationship between sample size
and the accuracy of statistical estimates.

In data science, the law of large numbers is used to understand the behavior of
random variables over many trials. It's often applied in areas such as predictive
modeling, risk assessment, and quality control to ensure that data-driven decisions are
based on a robust and accurate representation of the underlying patterns in the data.

The law of large numbers helps to guarantee that the average of the results from a
large number of independent and identically distributed trials will converge to the
expected value, providing a foundation for statistical inference and hypothesis testing.

What is the difference between a confidence interval and a prediction


interval, and how do you calculate them?

A confidence interval is a range of values that is likely to contain the true value of a
population parameter with a certain level of confidence. It is used to estimate the
precision or accuracy of a sample statistic, such as a mean or a proportion, based on a
sample from a larger population.

For example, if we want to estimate the average height of all adults in a certain
region, we can take a random sample of individuals from that region and calculate the
sample mean height. Then we can construct a confidence interval for the true
population mean height, based on the sample mean and the sample size, with a certain
level of confidence, such as 95%. This means that if we repeat the sampling process
many times, 95% of the resulting intervals will contain the true population mean
height.

The formula for a confidence interval is: confidence interval = sample statistic +/-
margin of error

The margin of error depends on the sample size, the standard deviation of the
population (or the sample, if the population standard deviation is unknown), and the
desired level of confidence. For example, if the sample size is larger or the standard
deviation is smaller, the margin of error will be smaller, resulting in a narrower
confidence interval.
A prediction interval is a range of values that is likely to contain a future observation
or outcome with a certain level of confidence. It is used to estimate the uncertainty or
variability of a future value based on a statistical model and the observed data.

For example, if we have a regression model that predicts the sales of a product based
on its price and advertising budget, we can use a prediction interval to estimate the
range of possible sales for a new product with a certain price and advertising budget,
with a certain level of confidence, such as 95%. This means that if we repeat the
prediction process many times, 95% of the resulting intervals will contain the true
sales value.

The formula for a prediction interval is: prediction interval = point estimate +/-
margin of error

The point estimate is the predicted value of the outcome variable based on the model
and the input variables. The margin of error depends on the residual standard
deviation of the model, which measures the variability of the observed data around the
predicted values, and the desired level of confidence. For example, if the residual
standard deviation is larger or the level of confidence is higher, the margin of error
will be larger, resulting in a wider prediction interval.
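A minimal sketch using statsmodels on simulated price/sales data shows both intervals side by side (the obs_ci columns are the prediction interval, which is wider than the mean_ci confidence interval):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
price = rng.uniform(5, 20, size=100)
sales = 200 - 6 * price + rng.normal(0, 10, size=100)     # simulated data for the example

X = sm.add_constant(price)
model = sm.OLS(sales, X).fit()

new_X = sm.add_constant(np.array([10.0, 15.0]), has_constant="add")
pred = model.get_prediction(new_X).summary_frame(alpha=0.05)
# mean_ci_lower/upper -> 95% confidence interval for the mean response
# obs_ci_lower/upper  -> 95% prediction interval for a single new observation
print(pred[["mean", "mean_ci_lower", "mean_ci_upper", "obs_ci_lower", "obs_ci_upper"]])
```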

How to make your model robust to outliers.

There are several options when it comes to strengthening your model in terms of
outliers. Investigating these outliers is always the first step in understanding how to
treat them. After you recognize the nature of why they occurred, you can apply one of
the several methods below:

 Add regularization that will reduce variance, for example, L1 or L2


regularization.

 Use tree-based models (random forest, gradient boosting) that are generally
less affected by outliers.
 Winsorize the data. Winsorizing or winsorization is the transformation of
the data by limiting extreme values to reduce the effect of possibly spurious outliers.
For numerical data, if the distribution is approximately normal, we can detect outliers
using the Z-score and treat them by either removing or capping them with some value;
if the distribution is skewed, we can detect and treat them in the same way using the
IQR. For categorical data, check the value counts as percentages; if a category has very
few records, we can either remove it or group it under a catch-all value such as "others".

 Transform the data. For example, you can do a log transformation when the
response variable follows an exponential distribution, or when it is right-
skewed.

 Use more robust error metrics such as MAE or Huber loss instead of MSE.

 Remove the outliers. However, do this if you are certain that the outliers are
true anomalies not worth adding to your model. This should be your last
consideration since dropping them means losing information.
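As a small sketch, winsorizing with SciPy caps the most extreme values (the data and the limits here are arbitrary example choices):

```python
import numpy as np
from scipy.stats.mstats import winsorize

x = np.array([3, 4, 5, 5, 6, 6, 7, 7, 8, 150])      # 150 is a suspicious outlier
capped = winsorize(x, limits=[0.1, 0.1])             # cap the bottom and top 10% of values
print(np.asarray(capped))                            # the extreme values are pulled back in
```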

Explain the linear regression model and discuss its assumptions.

Linear regression is a form of supervised learning, where the model is trained on


labeled input data. In linear regression, the goal is to estimate a function f(x) so that
the features have a linear relationship to the target variable y, i.e. y = X*beta, where X
is a matrix of predictor variables and beta is a vector of parameters that determines the
weight of each variable when predicting the target variable.

Since linear regression is one of the most commonly used models, it has the honor of
also being one of the most misapplied ones. So before running it, you must validate its
four main assumptions to prevent false results:

 Linearity: The relation between the feature set and the target variable is linear.

 Homoscedasticity: The variance of the residuals is constant.

 Independence: All observations are independent of one another.

 Normality: The distribution of Y is assumed to be normal.

The widespread empirical application of linear regression means that the interview
questions will assure you have more knowledge than just blindly importing it from
scikit-learn and using it. Interviewers will try to determine whether you have a deep
understanding of how the model works, its assumption, and the different evaluation
metrics. They will be addressing edge cases that come up in real-life scenarios and
challenge your ability to put theory into practice.
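As a brief illustration of fitting the model and sanity-checking its assumptions, here is a minimal sketch on simulated data (the residual checks below are only rough indicators, not a full diagnostic):

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                       # two predictor variables
y = 3 + 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=1.0, size=200)

results = sm.OLS(y, sm.add_constant(X)).fit()
print(results.params)                               # estimated beta vector (intercept first)

residuals = results.resid
print("normality p-value:", stats.shapiro(residuals).pvalue)  # rough check of the normality assumption
# Homoscedasticity and linearity are usually checked visually: plot residuals vs. fitted values.
```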

Describe the motivation behind random forests and mention two


reasons why they are better than individual decision trees.

The motivation behind random forest or ensemble models can be explained easily by
using the following example: Let’s say we have a question to solve. We gather 100
people, ask each of them this question, and record their answers. After we combine all
the replies we have received, we will discover that the aggregated collective opinion
will be close to the actual solution to the problem. This is known as the “Wisdom of the
crowd” which is, in fact, the motivation behind random forests. We take weak learners
(ML models) specifically, decision trees in the case of random forest, and aggregate
their results to get good predictions by removing dependency on a particular set of
features. In regression, we take the mean and for classification, we take the majority
vote of the classifiers.

Generally, you should note that no algorithm is better than the other. It always
depends on the case and the dataset used (Check the No Free Lunch Theorem). Still,
there are reasons why random forests often allow for stronger prediction than
individual decision trees:

 Decision trees are prone to overfit whereas random forest generalizes better on
unseen data as it uses randomness in feature selection and during sampling of
the data. Therefore, random forests have lower variance compared to that of
the decision tree without substantially increasing the error due to bias.

 Generally, ensemble models like random forests perform better as they are
aggregations of various models (decision trees in the case of a random forest),
using the concept of the “Wisdom of the crowd.”
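A quick sketch comparing the two on a built-in dataset (scores will vary with the data and hyperparameters):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

tree = DecisionTreeClassifier(random_state=0)
forest = RandomForestClassifier(n_estimators=200, random_state=0)

print("single tree:  ", cross_val_score(tree, X, y, cv=5).mean())
print("random forest:", cross_val_score(forest, X, y, cv=5).mean())  # typically higher, lower variance
```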

What are the differences and similarities between gradient boosting


and random forest? And what are the advantages and disadvantages of
each when compared to each other?

The similarities between gradient boosting and random forest can be summed up like
this:

 Both these algorithms are decision-tree based.

 Both are also ensemble algorithms - they are flexible models and do not need
much data preprocessing.

There are two main differences we can mention here:

 Random forest uses Bagging. This means that trees are arranged in a parallel
fashion, and the results from all of them are aggregated at the end through
averaging or majority vote. On the other hand, gradient boosting uses
Boosting, where trees are arranged sequentially and every tree tries to
minimize the error of the previous one.

 In random forests, every tree is constructed independently of the others,


whereas, in gradient boosting, every tree is dependent on the previous one.

When we discuss the advantages and disadvantages of the two, it is only fair to weigh each
one's strengths against its weaknesses. We need to keep in mind that each of them is more
applicable in certain instances than the other, depending on the outcome we want to reach
and the task we need to solve.

So, the advantages of gradient boosting over random forests include:


 Gradient boosting can be more accurate than random forests because we train
them to minimize the previous tree’s error.

 It can also capture complex patterns in the data.

 Gradient boosting is better than random forest when used on unbalanced data
sets.

On the other hand, we have the advantages of random forest over gradient boosting as
well:

 Random forest is less prone to overfitting compared to gradient boosting.

 It has faster training as trees are created in parallel and independent of each
other.

Moreover, gradient boosting also exhibits the following weaknesses:

 Due to the focus on mistakes during training iterations and the lack of
independence in tree building, gradient boosting is indeed more susceptible to
overfitting. If the data is noisy, the boosted trees might overfit and start
modeling the noise.

 In gradient boosting, training might take longer because every tree is created
sequentially.

 Additionally, tunning the hyperparameters of gradient boosting is more complex


than those of random forest.

What are the differences between a model that minimizes squared error
and the one that minimizes the absolute error? and in which cases each
error metric would be more appropriate?

 Both mean square error (MSE) and mean absolute error (MAE) measures the
distances between vectors and express average model prediction in units of the
target variable. Both can range from 0 to infinity, the lower they are the better
the model.
 The main difference between them is that in MSE the errors are squared before
being averaged, while in MAE they are not. This means that large errors receive much
more weight in MSE, so MSE is useful when large errors must be avoided. It also means
that outliers affect MSE more than MAE, which is why MAE is more robust to outliers.
Computation-wise, MSE is easier to optimize because its gradient is straightforward,
whereas fitting a model that minimizes the absolute error (least absolute deviations)
can require linear programming.
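A tiny numeric example makes the difference concrete (the last prediction is a deliberate outlier):

```python
import numpy as np

y_true = np.array([10.0, 12.0, 11.0, 13.0, 12.0])
y_pred = np.array([10.5, 11.5, 11.0, 13.5, 30.0])   # last prediction is badly off

errors = y_pred - y_true
print("MAE:", np.mean(np.abs(errors)))   # 3.9   -- grows linearly with the large error
print("MSE:", np.mean(errors ** 2))      # 64.95 -- dominated by the squared large error
```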

Define and compare parametric and non-parametric models and give


two examples for each of them?

 Parametric models assume that the dataset comes from a certain function
with some set of parameters that should be tuned to reach the optimal
performance. For such models, the number of parameters is determined prior to
training, thus the degree of freedom is limited and reduces the chances of
overfitting.
 Ex. Linear Regression, Logistic Regression, LDA
 Nonparametric models don't assume anything about the function from which
the dataset was sampled. For these models, the number of parameters is not
determined prior to training, thus they are free to generalize the model based
on the data. However, these models are more prone to overfitting, they need more data
to generalize well compared with parametric models, and they are relatively more
difficult to interpret.
 Ex. Decision Tree, Random Forest.

Explain the kernel trick in SVM. Why do we use it and how to choose
what kernel to use?

Answer: Kernels are used in SVM to map the original input data into a particular
higher dimensional space where it will be easier to find patterns in the data and train
the model with better performance.

For eg.: If we have binary class data which form a ring-like pattern (inner and outer
rings representing two different class instances) when plotted in 2D space, a linear
SVM kernel will not be able to differentiate the two classes well when compared to an
RBF (radial basis function) kernel, mapping the data into a particular higher
dimensional space where the two classes are clearly separable.

Typically without the kernel trick, in order to calculate support vectors and support
vector classifiers, we need first to transform data points one by one to the higher
dimensional space, do the calculations based on SVM equations in the higher
dimensional space, and then return the results. The ‘trick’ in the kernel trick is that
we design the kernels based on some conditions as mathematical functions that are
equivalent to a dot product in the higher dimensional space without even having to
transform data points to the higher dimensional space. i.e. we can calculate support
vectors and support vector classifiers in the same space where the data is provided
which saves a lot of time and calculations.

Having domain knowledge can be very helpful in choosing the optimal kernel for your
problem. In the absence of such knowledge, the following default rule can help: for
linearly separable problems, try a linear kernel, and for nonlinear problems, use the
RBF (Gaussian) kernel.
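A small sketch on scikit-learn's ring-shaped make_circles data illustrates the point:

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_circles(n_samples=500, factor=0.3, noise=0.08, random_state=0)  # two ring-like classes
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for kernel in ("linear", "rbf"):
    clf = SVC(kernel=kernel).fit(X_train, y_train)
    print(kernel, clf.score(X_test, y_test))   # the RBF kernel separates the rings far better
```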
Define the cross-validation process and the motivation behind using it.

Cross-validation is a technique used to assess the performance of a learning model in


several subsamples of the training data. In general, we split the data into train and test
sets, using the training data to train the model, the test data to evaluate its performance
on unseen data, and a validation set to choose the best hyperparameters. A single random
split is usually fine for large datasets. For smaller datasets, however, it risks leaving
important information out of the portion the model is trained on. Cross-validation, though
computationally more expensive, combats this issue.

The process of cross-validation is as follows:

1. Define k or the number of folds

2. Randomly shuffle the data into K equally-sized blocks (folds)

3. For each i in fold 1 to k train the data using all the folds except for fold i and
test on the fold i.

4. Average the K validation/test errors from the previous step to get an estimate of
the model's error.

This process aims to accomplish the following:

1. Prevent overfitting during training by avoiding training and testing on the same
subset of the data points.

2. Avoid information loss by holding out a certain subset of the data for validation only.
This is important for small datasets.

Cross-validation is always a good choice for small datasets; if used for large datasets,
the computational cost increases with the number of folds.
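A minimal sketch of the k-fold procedure described above (here with k = 5 on the iris data):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=0)   # steps 1 & 2: choose k and shuffle into folds

scores = []
for train_idx, test_idx in kf.split(X):                # step 3: train on k-1 folds, test on the held-out fold
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))

print("estimated accuracy:", np.mean(scores))          # step 4: average the k fold scores
```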
You are building a binary classifier and you found that the data is
imbalanced, what should you do to handle this situation?

Answer: If there is a data imbalance there are several measures we can take to train a
fairer binary classifier:

1. Pre-Processing:

 Check whether you can get more data or not.

 Use sampling techniques (Sample minority class, Downsample majority class,


can take the hybrid approach as well). We can also use data augmentation to
add more data points for the minority class but with little deviations/changes
leading to new data points that are similar to the ones they are derived from.
The most common/popular technique is SMOTE (Synthetic Minority
Oversampling technique)

 Suppression: Though not recommended, we can drop off some features directly
responsible for the imbalance.

 Learning Fair Representation: Projecting the training examples to a subspace or


plane minimizes the data imbalance.

 Re-Weighting: We can assign some weights to each training example to reduce


the imbalance in the data.

2. In-Processing:

 Regularisation: We can add score terms that measure the data imbalance in the
loss function and therefore minimizing the loss function will also minimize the
degree of imbalance concerning the score chosen which also indirectly
minimizes other metrics that measure the degree of data imbalance.

 Adversarial Debiasing: Here we use the adversarial notion to train the model
where the discriminator tries to detect if there are signs of data imbalance in
the predicted data by the generator and hence the generator learns to generate
data that is less prone to imbalance.

3. Post-Processing:

 Odds-Equalization: Here we try to equalize the odds for the classes with respect
to which the data is imbalanced, in order to correct for the imbalance in the trained model.

Finally, choose appropriate performance metrics: accuracy is not a correct metric to use
when classes are imbalanced. Instead, use precision, recall, the F1 score (usually a good
choice when both precision and recall are important), and the ROC curve.
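As one illustration of the re-weighting idea, scikit-learn's class_weight option penalizes minority-class mistakes more heavily; below is a minimal sketch on synthetic imbalanced data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5_000, weights=[0.95, 0.05], random_state=0)  # ~5% positives
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Re-weighting: penalize mistakes on the minority class more heavily
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))  # judge with precision/recall/F1, not accuracy
```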
You are working on a clustering problem, what are different evaluation
metrics that can be used, and how to choose between them?

Answer:

Clusters are evaluated based on some similarity or dissimilarity measure, such as the
distance between cluster points. A clustering algorithm has performed well if it separates
dissimilar observations and keeps similar observations together. The two most popular
evaluation metrics for clustering algorithms are the Silhouette coefficient and Dunn's Index.

Silhouette coefficient

The Silhouette coefficient is defined for each sample and is composed of two scores:
a, the mean distance between a sample and all other points in the same cluster, and
b, the mean distance between a sample and all other points in the next nearest cluster.

S = (b - a) / max(a, b)

The Silhouette coefficient for a set of samples is given as the mean of the Silhouette
coefficient of each sample. The score is bounded between -1 for incorrect clustering and
+1 for highly dense clustering. Scores around zero indicate overlapping clusters. The score
is higher when clusters are dense and well separated, which relates to a standard concept
of a cluster.

Dunn’s Index

Dunn’s Index (DI) is another metric for evaluating a clustering algorithm. Dunn’s
Index is equal to the minimum inter-cluster distance divided by the maximum cluster
size. Note that large inter-cluster distances (better separation) and smaller cluster
sizes (more compact clusters) lead to a higher DI value. A higher DI implies better
clustering. It assumes that better clustering means that clusters are compact and well-
separated from other clusters.
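A small sketch computing the Silhouette coefficient for different numbers of clusters (synthetic blob data):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=1_000, centers=4, random_state=0)

for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, silhouette_score(X, labels))   # the highest score should be near the true number of clusters
```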
What is the ROC curve and when should you use it?

ROC curve, Receiver Operating Characteristic curve, is a graphical representation of


the model's performance where we plot the True Positive Rate (TPR) against the False
Positive Rate (FPR) for different threshold values, for hard classification, between 0 to
1 based on model output.

The ROC curve is mainly used to compare two or more models. For a reasonable model, the
FPR (since it is an error) will always be lower than the TPR, so the curve hugs the
upper-left corner of the unit square spanned by the FPR axis (x) and the TPR axis (y).

The larger the AUC (area under the curve) of a model's ROC curve, the better the model is
at trading off TPR against FPR.

Here are some benefits of using the ROC Curve :

 Can help prioritize either true positives or true negatives depending on your
case study (Helps you visually choose the best hyperparameters for your case)

 Can be very insightful when we have unbalanced datasets

 Can be used to compare different ML models by calculating the area under the
ROC curve (AUC)
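A minimal sketch computing the ROC curve and AUC with scikit-learn:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)).fit(X_train, y_train)
scores = clf.predict_proba(X_test)[:, 1]            # predicted probability of the positive class

fpr, tpr, thresholds = roc_curve(y_test, scores)    # one (FPR, TPR) point per threshold
print("AUC:", roc_auc_score(y_test, scores))        # area under the curve, higher is better
```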
What is the difference between hard and soft voting classifiers in the
context of ensemble learners?

 Hard Voting: We take into account the class predictions for each classifier and
then classify an input based on the maximum votes to a particular class.

 Soft Voting: We take into account the probability predictions for each class by
each classifier and then classify an input to the class with maximum probability
based on the average probability (averaged over the classifier's probabilities)
for that class.

What is boosting in the context of ensemble learners? Discuss two famous boosting methods

Boosting refers to any Ensemble method that can combine several weak learners into
a strong learner. The general idea of most boosting methods is to train predictors
sequentially, each trying to correct its predecessor.

There are many boosting methods available, but by far the most popular are:

 Adaptive Boosting: One way for a new predictor to correct its predecessor is to
pay a bit more attention to the training instances that the predecessor under-
fitted. This results in new predictors focusing more and more on the hard cases.

 Gradient Boosting: Another very popular Boosting algorithm is Gradient


Boosting. Just like AdaBoost, Gradient Boosting works by sequentially adding
predictors to an ensemble, each one correcting its predecessor. However,
instead of tweaking the instance weights at every iteration as AdaBoost does,
this method tries to fit the new predictor to the residual errors made by the
previous predictor.

How can you evaluate the performance of a dimensionality reduction


algorithm on your dataset?

Intuitively, a dimensionality reduction algorithm performs well if it eliminates a lot of


dimensions from the dataset without losing too much information. One way to
measure this is to apply the reverse transformation and measure the reconstruction
error. However, not all dimensionality reduction algorithms provide a reverse
transformation.

Alternatively, if you are using dimensionality reduction as a preprocessing step before


another Machine Learning algorithm (e.g., a Random Forest classifier), then you can
simply measure the performance of that second algorithm; if dimensionality reduction
did not lose too much information, then the algorithm should perform just as well as
when using the original dataset.
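A small sketch of the reconstruction-error idea using PCA on the digits dataset:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)

pca = PCA(n_components=20).fit(X)                   # 64 pixel features -> 20 components
X_reduced = pca.transform(X)
X_recovered = pca.inverse_transform(X_reduced)      # reverse transformation

reconstruction_error = np.mean((X - X_recovered) ** 2)
print("explained variance:", pca.explained_variance_ratio_.sum())
print("reconstruction MSE:", reconstruction_error)  # lower means less information was lost
```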

Define the curse of dimensionality and how to solve it.

Answer: Curse of dimensionality represents the situation when the amount of data is
too few to be represented in a high-dimensional space, as it will be highly scattered in
that high-dimensional space and it becomes more probable that we overfit this data. If
we increase the number of features, we are implicitly increasing model complexity and
if we increase model complexity we need more data.
A possible solution is to remove irrelevant features (features that do not discriminate
between classes, are highly correlated with others, or do not add much improvement).
To do this, we can use:

 Feature selection(select the most important ones).

 Feature extraction(transform current feature dimensionality into a lower


dimension preserving the most possible amount of information like PCA ).

In what cases would you use vanilla PCA, Incremental PCA, Randomized
PCA, or Kernel PCA?

Answer:

Regular PCA is the default, but it works only if the dataset fits in memory. Incremental
PCA is useful for large datasets that don't fit in memory, but it is slower than regular
PCA, so if the dataset fits in memory you should prefer regular PCA. Incremental PCA
is also useful for online tasks when you need to apply PCA on the fly, every time a new
instance arrives. Randomized PCA is useful when you want to considerably reduce
dimensionality and the dataset fits in memory; in this case, it is much faster than
regular PCA. Finally, Kernel PCA is useful for nonlinear datasets.

Discuss two clustering algorithms that can scale to large datasets

Answer:

Minibatch Kmeans: Instead of using the full dataset at each iteration, the algorithm
is capable of using mini-batches, moving the centroids just slightly at each iteration.
This speeds up the algorithm typically by a factor of 3 or 4 and makes it possible to
cluster huge datasets that do not fit in memory. Scikit-Learn implements this
algorithm in the MiniBatchKMeans class.

Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH) is a


clustering algorithm that can cluster large datasets by first generating a small and
compact summary of the large dataset that retains as much information as possible.
This smaller summary is then clustered instead of clustering the larger dataset.
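A minimal sketch of both scalable algorithms in scikit-learn (synthetic data; the parameter values are arbitrary example choices):

```python
from sklearn.cluster import Birch, MiniBatchKMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=100_000, centers=8, random_state=0)

mbk = MiniBatchKMeans(n_clusters=8, batch_size=1_024, n_init=3, random_state=0).fit(X)
birch = Birch(n_clusters=8).fit(X)                  # builds a compact summary (CF tree) first
print(mbk.cluster_centers_.shape, birch.subcluster_centers_.shape)
```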
Do you need to scale your data if you will be using the SVM classifier
and discuss your answer

Answer: Yes, feature scaling is required for SVM and all margin-based classifiers since
the optimal hyperplane (the decision boundary) is dependent on the scale of the input
features. In other words, the distance between two observations will differ for scaled
and non-scaled cases, leading to different models being generated.

For example, when the features have very different scales, the decision boundary and the
support vectors may end up being determined almost entirely by the feature with the larger
scale (effectively classifying on X1 without taking the X0 feature into consideration).
After scaling the data to the same scale, the decision boundary and support vectors look
much better and the model takes both features into account.

To scale the data, normalization, and standardization are the most popular
approaches.
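A quick sketch shows the effect of scaling by wrapping the scaler and the SVM in a pipeline (the wine dataset is used here because its features sit on very different scales):

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)   # features on very different scales

print("unscaled:", cross_val_score(SVC(), X, y, cv=5).mean())
print("scaled:  ", cross_val_score(make_pipeline(StandardScaler(), SVC()), X, y, cv=5).mean())
# Scaling typically improves the margin-based fit substantially on data like this.
```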

What are Loss Functions and Cost Functions? Explain the key
Difference Between them.

Answer: The loss function is the measure of the performance of the model on a single
training example, whereas the cost function is the average loss function over all
training examples or across the batch in the case of mini-batch gradient descent.

Some examples of loss functions are Mean Squared Error, Binary Cross Entropy, etc.

Whereas, the cost function is the average of the above loss functions over training
examples.

What is the importance of batch in machine learning and explain some


batch-dependent gradient descent algorithms?

Answer: The dataset can be loaded into memory either all at once or in smaller sets. If the
dataset is huge, loading all of the data into memory at once slows training down (or is not
possible at all), hence the term batch is introduced.

Example: if an image dataset contains 100,000 images, we can load it as 3,125 batches of
32 images each. Instead of holding all 100,000 images in memory, we load 32 images at a
time, 3,125 times, which requires far less memory.

In summary, a batch is important in two ways: (1) Efficient memory consumption. (2)
Improve training speed.
There are 3 types of gradient descent algorithms based on batch size: (1) Stochastic
gradient descent (2) Batch gradient descent (3) Mini Batch gradient descent

If the whole dataset forms a single batch, it is called batch gradient descent. If each batch
contains a single data point (i.e., the number of batches equals the number of data instances),
it is called stochastic gradient descent. If the batch size is greater than 1 but smaller than
the total number of data points, it is known as mini-batch gradient descent, as sketched below.
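Below is a minimal sketch of mini-batch gradient descent for linear regression on simulated data (the learning rate, batch size, and number of epochs are arbitrary example choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=10_000)

w, lr, batch_size = np.zeros(3), 0.1, 32                # mini-batch gradient descent on squared error
for epoch in range(20):
    idx = rng.permutation(len(X))                       # reshuffle the data every epoch
    for start in range(0, len(X), batch_size):          # one parameter update per batch of 32 rows
        b = idx[start:start + batch_size]
        grad = 2 * X[b].T @ (X[b] @ w - y[b]) / len(b)
        w -= lr * grad

print(w)   # close to the true coefficients [2.0, -1.0, 0.5]
```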

What are the different methods to split a tree in a decision tree


algorithm?

Decision trees can be of two types: regression and classification. For classification, using
classification accuracy directly as a splitting criterion creates a lot of instability, so the
following loss functions are used instead:

 Gini's Index Gini impurity is used to predict the likelihood of a randomly chosen
example being incorrectly classified at a particular node. It's referred to as an
"impurity" measure because it quantifies how far the node is from being a pure
split (i.e., from containing a single class).

 Cross-entropy or Information Gain Information gain refers to the process of


identifying the most important features/attributes that convey the most
information about a class. The entropy principle is followed with the goal of
reducing entropy from the root node to the leaf nodes. Information gain is the
difference in entropy before and after splitting, which describes the impurity of
in-class items.

For regression, the good old mean squared error serves as a good loss function which
is minimized by splits of the input features and predicting the mean value of the target
feature on the subspaces resulting from the split. But finding the split that results in
the minimum possible residual sum of squares is computationally infeasible, so a
greedy top-down approach is taken i.e. the splits are made at a level from top to down
which results in maximum reduction of RSS. We continue this until some maximum
depth or number of leaves is attained.
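For reference, a small sketch computing Gini impurity and entropy for a toy node:

```python
import numpy as np

def gini(labels):
    p = np.bincount(labels) / len(labels)
    return 1 - np.sum(p ** 2)                 # 0 for a pure node, larger for mixed nodes

def entropy(labels):
    p = np.bincount(labels) / len(labels)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))            # information gain = entropy(parent) - weighted child entropy

node = np.array([0, 0, 0, 1, 1, 1, 1, 1])     # a node holding 3 vs. 5 examples of two classes
print(gini(node), entropy(node))
```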

Why boosting is a more stable algorithm as compared to other


ensemble algorithms?

Boosting algorithms keep focusing on the errors made in previous iterations and correcting
them until those errors are minimized, whereas in bagging there is no such corrective loop.
That is why boosting is considered a more stable algorithm compared to other ensemble algorithms.

What is active learning and discuss one strategy of it?

Active learning is a special case of machine learning in which a learning algorithm can
interactively query a user (or some other information source) to label new data points
with the desired outputs. In statistics literature, it is sometimes referred to as optimal
experimental design.

1. Stream-based sampling: In stream-based selective sampling, unlabelled data is
continuously fed to an active learning system, and the learner decides whether or not to
send each instance to a human oracle based on a predefined learning strategy. This method
is apt in scenarios where the model is in production and the data sources/distributions
vary over time.

2. Pool-based sampling: In this case, data samples are chosen from a pool of unlabelled
data based on their informativeness scores and sent for manual labeling. Unlike
stream-based sampling, the entire unlabelled dataset is often scrutinized to select the
best instances.

What are the different approaches to implementing recommendation


systems?

1. Content-Based Filtering: Content-Based Filtering depends on similarities of


items and users' past activities on the website to recommend any product or
service.

This filter helps avoid a cold start for new products because it doesn't rely on other
users' feedback; it can recommend products based on similarity factors alone. However,
content-based filtering needs a lot of domain knowledge for the recommendations to be
highly accurate.

2. Collaborative-Based Filtering: The primary job of a collaborative filtering


system is to overcome the shortcomings of content-based filtering.

So, instead of focusing on just one user, the collaborative filtering system focuses on
all the users and clusters them according to their interests.

Basically, it recommends a product 'x' to user 'a' based on the interest of user 'b';
users 'a' and 'b' must have had similar interests in the past, which is why they are
clustered together.

Collaborative filtering requires less domain knowledge, its recommendations are more
accurate, and it can adapt to the changing tastes of users over time. However,
collaborative filtering faces the cold-start problem, as it relies heavily on feedback
or activity from other users.
3. Hybrid filtering: A mixture of content and collaborative methods. Uses
descriptors and interactions.

More modern approaches typically fall into the hybrid filtering category and tend to
work in two stages:

1) A candidate generation phase where we coarsely generate candidates from a


corpus of hundreds of thousands, millions, or billions of items down to a few
hundred or thousand
2) A ranking phase where we re-rank the candidates into a final top-n set to be shown
to the user. Some systems employ multiple candidate generation methods and
rankers.

What are the evaluation metrics that can be used for multi-label
classification?

Multi-label classification is a type of classification problem where each instance can be


assigned to multiple classes or labels simultaneously.

The evaluation metrics for multi-label classification are designed to measure the
performance of a multi-label classifier in predicting the correct set of labels for each
instance. Some commonly used evaluation metrics for multi-label classification are:

1. Hamming Loss: Hamming Loss is the fraction of labels that are incorrectly
predicted. It is defined as the average number of labels that are predicted
incorrectly per instance.

2. Accuracy: Accuracy is the fraction of instances that are correctly predicted. In


multi-label classification, accuracy is calculated as the percentage of instances
for which all labels are predicted correctly.

3. Precision, Recall, F1-Score: These metrics can be applied to each label


separately, treating the classification of each label as a separate binary
classification problem. Precision measures the proportion of predicted positive
labels that are correct, recall measures the proportion of actual positive labels
that are correctly predicted, and F1-score is the harmonic mean of precision
and recall.

4. Macro-F1, Micro-F1: Macro-F1 and Micro-F1 are two types of F1-score metrics
that take into account the label imbalance in the dataset. Macro-F1 calculates
the F1-score for each label and then averages them, while Micro-F1 calculates
the overall F1-score by aggregating the true positive, false positive, and false
negative counts across all labels.
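A small sketch of these metrics on toy multi-label indicator matrices:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, hamming_loss

# Binary indicator matrices: each row is an instance, each column a label
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 1, 1]])
y_pred = np.array([[1, 0, 0], [0, 1, 0], [1, 1, 1], [0, 0, 1]])

print("Hamming loss:   ", hamming_loss(y_true, y_pred))   # fraction of individual labels that are wrong
print("Subset accuracy:", accuracy_score(y_true, y_pred)) # all labels of an instance must match exactly
print("Micro-F1:       ", f1_score(y_true, y_pred, average="micro"))
print("Macro-F1:       ", f1_score(y_true, y_pred, average="macro"))
```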

There are other metrics that can be used such as:

 Precision at k (P@k)

 Average precision at k (AP@k)

 Mean average precision at k (MAP@k)


What is the difference between concept and data drift and how to
overcome each of them?

Concept drift and data drift are two different types of problems that can occur in
machine learning systems.

Concept drift refers to changes in the underlying relationships between the input data
and the target variable over time. This means that the distribution of the data that the
model was trained on no longer matches the distribution of the data it is being tested
on. For example, a spam filter model that was trained on emails from several years
ago may not be as effective at identifying spam emails from today because the
language and tactics used in spam emails may have changed.

Data drift, on the other hand, refers to changes in the input data itself over time. This
means that the values of the input feature that the model was trained on no longer
match the values of the input features in the data it is being tested on. For example, a
model that was trained on data from a particular geographical region may not be as
effective at predicting outcomes for data from a different region.

To overcome concept drift, one approach is to use online learning methods that allow
the model to adapt to new data as it arrives. This involves continually training the
model on the most recent data while using historical data to maintain context. Another
approach is to periodically retrain the model using a representative sample of the
most recent data.

To overcome data drift, one approach is to monitor the input data for changes and
retrain the model when significant changes are detected. This may involve setting up a
monitoring system that alerts the user when the data distribution changes beyond a
certain threshold.

Another approach is to preprocess the input data to remove or mitigate the effects of
the features changing over time so that the model can continue learning from the
remaining features.
