Interview Questions
INTERVIEW QUESTIONS
What is causal inference?
Why do we do a train/test split?
When is it best to remove missing values?
How do you communicate statistical insights to non-technical stakeholders?
Give an example where the median is a better measure than the mean
How do you calculate the needed sample size?
What are the types of sampling in Statistics?
What is Bessel’s correction?
What is the assumption of normality?
What types of biases can you encounter while sampling?
What is the meaning of an inlier?
What is the difference between Point Estimate and Confidence Interval Estimate?
Mention the relationship between standard error and margin of error?
What is the proportion of confidence interval that will not contain the population parameter?
What is the Law of Large Numbers in statistics?
What is the goal of A/B testing?
What do you understand by sensitivity and specificity?
What is Resampling and what are the common methods of resampling?
What is Cost Function?
What is the Law of Large Numbers in statistics and how it can be used in data science?
What is the difference between a confidence interval and a prediction interval, and how do you calculate them?
How to make your model robust to outliers
Describe the motivation behind random forests and mention two reasons why they are better than individual decision trees
What are the differences and similarities between gradient boosting and random forest? And what are the advantages and disadvantages of each when compared to each other?
What are the differences between a model that minimizes squared error and the one that minimizes the absolute error? And in which cases would each error metric be more appropriate?
Define and compare parametric and non-parametric models and give two examples for each of them?
Explain the kernel trick in SVM. Why do we use it and how to choose what kernel to use?
Define the cross-validation process and the motivation behind using it
You are building a binary classifier and you found that the data is imbalanced, what should you do to handle this situation?
You are working on a clustering problem, what are different evaluation metrics that can be used, and how to choose between them?
What is the ROC curve and when should you use it?
What is the difference between hard and soft voting classifiers in the context of ensemble learners?
What is boosting in the context of ensemble learners? Discuss two famous boosting methods
How can you evaluate the performance of a dimensionality reduction algorithm on your dataset?
Define the curse of dimensionality and how to solve it
In what cases would you use vanilla PCA, Incremental PCA, Randomized PCA, or Kernel PCA?
Discuss two clustering algorithms that can scale to large datasets
Do you need to scale your data if you will be using the SVM classifier? Discuss your answer
What are Loss Functions and Cost Functions? Explain the key Difference Between them
What is the importance of batch in machine learning and explain some batch-dependent gradient descent algorithms?
What are the different methods to split a tree in a decision tree algorithm?
Why boosting is a more stable algorithm as compared to other ensemble algorithms?
What is active learning and discuss one strategy of it?
What are the different approaches to implementing recommendation systems?
What are the evaluation metrics that can be used for multi-label classification?
What is the difference between concept and data drift and how to overcome each of them?
INTERVIEW QUESTIONS
What is causal inference?
Causal inference is an important idea that has been getting a lot of attention. It is the
process of determining whether and how one variable (the cause) directly affects
another variable (the effect), which is what distinguishes it from mere correlation. In
practice, causal inference is a group of methods that aim to establish a cause-and-effect
relationship in order to understand whether something, such as an intervention or
treatment, is working. If a researcher needs to understand whether a drug is working,
for example, causal inference can help answer that question.
Why do we do a train/test split?
Splitting data into training and test sets helps us evaluate the model's performance on
unseen data. The training set is used to train the model, while the test set is used to
assess how well the model generalizes to new data. This practice helps in detecting
overfitting and ensures that the model can perform well on real-world data.
When is it best to remove missing values?
Removing missing values can be appropriate when the proportion of missing data is
very small, typically less than 5%. Also, it’s a good idea when the missing data are
MCAR (Missing Completely at Random), meaning the missingness does not introduce
bias. Finally, I would consider removing missing values if the dataset is large enough
that deleting a small number of rows does not significantly impact the analysis.
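As a quick illustration of this rule of thumb, here is a minimal pandas sketch (the DataFrame and the 5% threshold are assumptions for the example) that drops rows only when the share of incomplete rows is small:

```python
# Minimal sketch: drop rows with missing values only when the missing
# fraction is small (assumed MCAR); otherwise prefer imputation.
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, 32, np.nan, 41, 29],
                   "income": [48000, 54000, 61000, np.nan, 52000]})

missing_fraction = df.isna().any(axis=1).mean()  # share of rows with any NaN
if missing_fraction < 0.05:          # rule-of-thumb threshold from the text
    df_clean = df.dropna()           # listwise deletion
else:
    df_clean = df                    # keep rows and consider imputation instead
print(missing_fraction, len(df_clean))
```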
Give an example where the median is a better measure than the mean
The median is a better measure of central tendency than the mean when the
distribution of data values is skewed or when there are clear outliers. A classic example
is household income: a handful of very high earners pulls the mean upward, while the
median still reflects a typical household.
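A tiny numerical illustration of the point above, using made-up salary data:

```python
# One extreme salary drags the mean upward while the median stays typical.
import numpy as np

salaries = np.array([40_000, 42_000, 45_000, 47_000, 1_000_000])
print(np.mean(salaries))    # 234,800 -- pulled up by the outlier
print(np.median(salaries))  # 45,000  -- robust to the outlier
```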
How do you calculate the needed sample size?
📍 Define the population size: The first thing is to determine the total number of people
in your target demographic. If you are dealing with a larger population, you can
approximate the total population with an educated guess or a reasonable range.
📍 Decide on a margin of error: The margin of error (which defines the half-width of the
confidence interval) indicates how much of a difference you are willing to allow between
your sample mean and the population mean.
📍 Choose a confidence level: Your confidence level indicates how assured you are that
the actual mean will fall within your chosen margin of error. The most common
confidence levels are 90%, 95%, and 99%. Your specified confidence level corresponds
with a z-score.
90% = 1.645
95% = 1.96
99% = 2.576
📍 Pick a standard deviation: Next, you will need to determine your standard deviation,
or the level of variance you expect to see in the information gathered. If you don’t know
how much variance to expect, a standard deviation of 0.5 is typically a safe choice that
will ensure your sample size is large enough.
📍 Calculate your sample size: Finally, you can use these values to calculate the sample
size, either by plugging them into the sample-size formula (a worked sketch follows) or
by using an online sample size calculator.
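A worked sketch of the calculation described above, assuming you are estimating a proportion with the usual formula n = z² · p · (1 − p) / E²:

```python
# Sample-size calculation for a proportion: n = z^2 * p * (1 - p) / E^2
import math

z = 1.96   # z-score for a 95% confidence level
p = 0.5    # worst-case proportion (the "0.5 standard deviation" rule of thumb)
E = 0.05   # margin of error (5%)

n = (z ** 2) * p * (1 - p) / (E ** 2)
print(math.ceil(n))  # 385 respondents needed
```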
What are the types of sampling in Statistics?
📍 Simple random sampling: This method involves pure random selection. Each
individual has the same probability of being chosen to be a part of the sample.
📍 Cluster sampling: This method involves dividing the entire population into clusters.
Clusters are identified and included in the sample based on demographic parameters
like sex, age, and location.
📍 Stratified sampling: This method involves dividing the population into unique groups
(strata) that together represent the entire population. A sample is then drawn from
each group separately (see the sketch after this list).
📍 Systematic sampling: This method involves choosing sample members from a larger
population according to a random starting point and a fixed, periodic interval called the
sampling interval. The sampling interval is calculated by dividing the population size by
the desired sample size. Because the selection pattern is predefined, this is the least
time-consuming method.
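For stratified sampling in practice, a common shortcut is scikit-learn's stratify argument; a minimal sketch on a toy dataset:

```python
# Stratified sampling sketch: the "stratify" argument keeps the class
# proportions of y identical in the drawn sample.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_sample, _, y_sample, _ = train_test_split(
    X, y, train_size=0.2, stratify=y, random_state=42
)
print(len(y_sample))  # a 20% sample with the same class balance as the full data
```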
What is Bessel’s correction?
Bessel’s correction is the use of n − 1 instead of n when calculating the sample variance
(and sample standard deviation). Dividing by n − 1 compensates for the fact that the
sample mean is itself estimated from the same data, which makes the sample variance
an unbiased estimator of the population variance.
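A quick numerical check of Bessel's correction with NumPy (the sample values are arbitrary):

```python
# np.var divides by n by default (ddof=0); ddof=1 applies the n - 1 correction.
import numpy as np

sample = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
print(np.var(sample))            # population formula, divides by n
print(np.var(sample, ddof=1))    # Bessel-corrected sample variance, divides by n - 1
```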
What is the assumption of normality?
The assumption of normality dictates that if many independent random samples are
collected from a population and some value of interest (like the sample mean) is
calculated, and then a histogram is created to visualize the distribution of sample
means, a normal distribution should be observed.
What types of biases can you encounter while sampling?
📍Selection bias: It involves the selection of individual or grouped data in a way that is
not random.
📍Undercoverage bias: This type of bias occurs when some population members are
inadequately represented in the sample.
What is the meaning of an inlier?
An inlier is a data value that lies within the general distribution of other observed
values but is an error. Inliers are difficult to distinguish from good data values,
therefore, they are sometimes difficult to find and correct.
Mention the relationship between standard error and margin of error?
The margin of error can be calculated from the standard error with this formula:
margin of error = critical value (z or t score for the chosen confidence level) × standard error
What is the proportion of confidence interval that will not contain the
population parameter?
Alpha (α) is the portion of the confidence interval that will not contain the population
parameter.
For example, if the confidence level (CL) is 95%, then, α = 1 – 0.95, or α = 0.05.
What is the Law of Large Numbers in statistics?
According to the law of large numbers in statistics, as the number of trials performed
increases, the average of the results gets closer and closer to the expected value.
For example, the probability of flipping a fair coin and landing heads is closer to 0.5
when flipped 100, 000 times compared to when flipped 50 times.
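A small simulation of this idea (toy coin-flip data, fixed seed):

```python
# The running proportion of heads converges to the expected value 0.5
# as the number of flips grows.
import numpy as np

rng = np.random.default_rng(0)
flips = rng.integers(0, 2, size=100_000)   # 0 = tails, 1 = heads
for n in (50, 1_000, 100_000):
    print(n, flips[:n].mean())             # gets closer to 0.5 as n grows
```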
What is the goal of A/B testing?
The goal is usually to identify which changes to a web page maximize or increase an
outcome of interest (for example, a conversion rate). A/B testing is a fantastic method
for figuring out the best online promotional and marketing strategies for your business.
What do you understand by sensitivity and specificity?
📍Sensitivity is a measure of the proportion of actual positive cases that get predicted
as positive (the true positive rate).
📍Specificity is a measure of the proportion of actual negative cases that get predicted
as negative (the true negative rate).
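A minimal sketch of computing both quantities from a confusion matrix (toy labels; scikit-learn assumed available):

```python
# Sensitivity = recall of the positive class; specificity = recall of the
# negative class, both read off the binary confusion matrix.
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)   # true positive rate
specificity = tn / (tn + fp)   # true negative rate
print(sensitivity, specificity)
```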
What is Resampling and what are the common methods of resampling?
Resampling involves repeatedly drawing samples from the original data, for example by
selecting randomized cases with replacement so that each drawn sample resembles the
original data sample. The two most common resampling methods are:
1. Bootstrapping
2. Cross Validation
What is Cost Function?
It measures how wrong the model is in estimating the relationship between input
and output parameters.
What is the Law of Large Numbers in statistics and how it can be used
in data science?
The law of large numbers states that as the number of trials in a random experiment
increases, the average of the results obtained from the experiment approaches the
expected value. In statistics, it's used to describe the relationship between sample size
and the accuracy of statistical estimates.
In data science, the law of large numbers is used to understand the behavior of
random variables over many trials. It's often applied in areas such as predictive
modeling, risk assessment, and quality control to ensure that data-driven decisions are
based on a robust and accurate representation of the underlying patterns in the data.
The law of large numbers helps to guarantee that the average of the results from a
large number of independent and identically distributed trials will converge to the
expected value, providing a foundation for statistical inference and hypothesis testing.
What is the difference between a confidence interval and a prediction interval, and how
do you calculate them?
A confidence interval is a range of values that is likely to contain the true value of a
population parameter with a certain level of confidence. It is used to estimate the
precision or accuracy of a sample statistic, such as a mean or a proportion, based on a
sample from a larger population.
For example, if we want to estimate the average height of all adults in a certain
region, we can take a random sample of individuals from that region and calculate the
sample mean height. Then we can construct a confidence interval for the true
population mean height, based on the sample mean and the sample size, with a certain
level of confidence, such as 95%. This means that if we repeat the sampling process
many times, 95% of the resulting intervals will contain the true population mean
height.
The formula for a confidence interval is: confidence interval = sample statistic +/-
margin of error
The margin of error depends on the sample size, the standard deviation of the
population (or the sample, if the population standard deviation is unknown), and the
desired level of confidence. For example, if the sample size is larger or the standard
deviation is smaller, the margin of error will be smaller, resulting in a narrower
confidence interval.
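A minimal sketch of computing such an interval for a mean, using the t-distribution because the population standard deviation is unknown (the height values are made up):

```python
# 95% confidence interval for a population mean from a small sample.
import numpy as np
from scipy import stats

heights = np.array([168, 172, 181, 165, 177, 170, 174, 169, 179, 173])
mean = heights.mean()
sem = stats.sem(heights)                            # standard error of the mean
margin = stats.t.ppf(0.975, df=len(heights) - 1) * sem
print(mean - margin, mean + margin)                 # lower and upper CI bounds
```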
A prediction interval is a range of values that is likely to contain a future observation
or outcome with a certain level of confidence. It is used to estimate the uncertainty or
variability of a future value based on a statistical model and the observed data.
For example, if we have a regression model that predicts the sales of a product based
on its price and advertising budget, we can use a prediction interval to estimate the
range of possible sales for a new product with a certain price and advertising budget,
with a certain level of confidence, such as 95%. This means that if we repeat the
prediction process many times, 95% of the resulting intervals will contain the true
sales value.
The formula for a prediction interval is: prediction interval = point estimate +/-
margin of error
The point estimate is the predicted value of the outcome variable based on the model
and the input variables. The margin of error depends on the residual standard
deviation of the model, which measures the variability of the observed data around the
predicted values, and the desired level of confidence. For example, if the residual
standard deviation is larger or the level of confidence is higher, the margin of error
will be larger, resulting in a wider prediction interval.
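As a hedged sketch of the sales example above, statsmodels' OLS results expose both intervals through get_prediction(); the data, column names, and coefficients here are invented for illustration:

```python
# Confidence interval for the mean response vs. (wider) prediction interval
# for a single new observation, from the same fitted regression.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
df = pd.DataFrame({"price": rng.uniform(5, 20, 100),
                   "ad_budget": rng.uniform(1, 10, 100)})
df["sales"] = 100 - 3 * df["price"] + 8 * df["ad_budget"] + rng.normal(0, 5, 100)

X = sm.add_constant(df[["price", "ad_budget"]])
model = sm.OLS(df["sales"], X).fit()

new_X = sm.add_constant(pd.DataFrame({"price": [12.0], "ad_budget": [6.0]}),
                        has_constant="add")
pred = model.get_prediction(new_X).summary_frame(alpha=0.05)
print(pred[["mean", "mean_ci_lower", "mean_ci_upper",
            "obs_ci_lower", "obs_ci_upper"]])
```

The obs_ci columns (the prediction interval) come out wider than the mean_ci columns because they also account for the irreducible noise around the regression line.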
How to make your model robust to outliers
There are several options for making your model more robust to outliers. Investigating
these outliers is always the first step in understanding how to treat them. After you
recognize why they occurred, you can apply one of the methods below (a short code
sketch follows the list):
Use tree-based models (random forest, gradient boosting) that are generally
less affected by outliers.
Winsorize the data. Winsorizing or winsorization is the transformation of
statistics by limiting extreme values in the statistical data to reduce the effect of
possibly spurious outliers. In numerical data, if the distribution is almost normal
using the Z-score, we can detect the outliers and treat them by either removing
or capping them with some value.
If the distribution is skewed, we can use the IQR rule to detect outliers and treat
them, again either by removing or capping them with some value. For categorical
data, check the value counts as percentages; if we have very few records in some
category, we can either remove it or group it under a catch-all category such as
“others”.
Transform the data. For example, you can do a log transformation when the
response variable follows an exponential distribution, or when it is right-
skewed.
Use more robust error metrics such as MAE or Huber loss instead of MSE.
Remove the outliers. However, do this if you are certain that the outliers are
true anomalies not worth adding to your model. This should be your last
consideration since dropping them means losing information.
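Here is the promised sketch of two of the options above: IQR-based capping and a Huber-loss regressor (toy data with injected outliers):

```python
# Capping target outliers with the IQR rule, then fitting a Huber regressor,
# which down-weights large residuals compared with plain MSE.
import numpy as np
from sklearn.linear_model import HuberRegressor

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 3 * x + rng.normal(scale=0.5, size=200)
y[:5] += 30                                   # inject a few outliers

q1, q3 = np.percentile(y, [25, 75])
iqr = q3 - q1
y_capped = np.clip(y, q1 - 1.5 * iqr, q3 + 1.5 * iqr)   # winsorization-style capping

model = HuberRegressor().fit(x.reshape(-1, 1), y)       # robust loss on the raw data
print(model.coef_)                                      # close to the true slope of 3
```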
Since linear regression is one of the most commonly used models, it also has the honor
of being one of the most misapplied ones. So before running it, you must validate its
four main assumptions to prevent false results:
Linearity: The relation between the feature set and the target variable is linear.
Independence: The residuals (errors) are independent of one another.
Homoscedasticity: The residuals have constant variance across all levels of the features.
Normality: The residuals are approximately normally distributed.
The widespread empirical application of linear regression means that interview
questions will check that you have more knowledge than just blindly importing it from
scikit-learn and using it. Interviewers will try to determine whether you have a deep
understanding of how the model works, its assumptions, and the different evaluation
metrics. They will bring up edge cases that come up in real-life scenarios and challenge
your ability to put theory into practice.
Describe the motivation behind random forests and mention two reasons why they are
better than individual decision trees
The motivation behind random forest or ensemble models can be explained easily by
using the following example: Let’s say we have a question to solve. We gather 100
people, ask each of them this question, and record their answers. After we combine all
the replies we have received, we will discover that the aggregated collective opinion
will be close to the actual solution to the problem. This is known as the “Wisdom of the
crowd” which is, in fact, the motivation behind random forests. We take weak learners
(ML models) specifically, decision trees in the case of random forest, and aggregate
their results to get good predictions by removing dependency on a particular set of
features. In regression, we take the mean and for classification, we take the majority
vote of the classifiers.
Generally, you should note that no algorithm is better than the other. It always
depends on the case and the dataset used (Check the No Free Lunch Theorem). Still,
there are reasons why random forests often allow for stronger prediction than
individual decision trees:
Decision trees are prone to overfit whereas random forest generalizes better on
unseen data as it uses randomness in feature selection and during sampling of
the data. Therefore, random forests have lower variance compared to that of
the decision tree without substantially increasing the error due to bias.
Generally, ensemble models like random forests perform better as they are
aggregations of various models (decision trees in the case of a random forest),
using the concept of the “Wisdom of the crowd.”
What are the differences and similarities between gradient boosting and random
forest? And what are the advantages and disadvantages of each when compared to
each other?
The similarities between gradient boosting and random forest can be summed up like
this:
Both are ensemble algorithms, they are flexible models, and they do not need
much data preprocessing.
The main difference between them is:
Random forest uses Bagging: trees are built in parallel, and the results from all
of them are aggregated at the end through averaging or a majority vote. Gradient
boosting, on the other hand, uses Boosting: trees are built in a sequential
fashion, where every tree tries to minimize the error of the previous one.
When we discuss the advantages and disadvantages of the two, it is only fair to
juxtapose both their strengths and their weaknesses. We need to keep in mind that
each of them is more applicable in certain instances than the other and vice versa; it
depends on the outcome we want to reach and the task we need to solve.
Gradient boosting is better than random forest when used on unbalanced data
sets.
On the other hand, we have the advantages of random forest over gradient boosting as
well:
It has faster training as trees are created in parallel and independent of each
other.
Due to the focus on mistakes during training iterations and the lack of
independence in tree building, gradient boosting is indeed more susceptible to
overfitting. If the data is noisy, the boosted trees might overfit and start
modeling the noise.
In gradient boosting, training might take longer because every tree is created
sequentially.
What are the differences between a model that minimizes squared error
and the one that minimizes the absolute error? and in which cases each
error metric would be more appropriate?
Both mean squared error (MSE) and mean absolute error (MAE) measure the
distance between vectors and express the average model prediction error in units
of the target variable. Both can range from 0 to infinity, and the lower they are,
the better the model.
The main difference between them is that in MSE the errors are squared before
being averaged, while in MAE they are not. This means that a large weight is
given to large errors, so MSE is useful when large model errors must be avoided.
It also means that outliers affect MSE more than MAE, which is why MAE is more
robust to outliers. Computation-wise, MSE is easier to work with because its
gradient calculation is more straightforward than that of MAE, which requires
techniques such as linear programming to optimize.
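A toy comparison that makes the difference concrete (one deliberately bad prediction):

```python
# A single large error inflates MSE far more than MAE, which is why MAE is
# considered more robust to outliers.
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = [3.0, 5.0, 7.0, 9.0, 100.0]
y_pred = [3.1, 5.2, 6.8, 9.1, 20.0]          # last prediction misses by 80

print(mean_squared_error(y_true, y_pred))    # dominated by the squared 80-unit miss
print(mean_absolute_error(y_true, y_pred))   # grows only linearly with that miss
```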
Define and compare parametric and non-parametric models and give two examples for
each of them?
Parametric models assume that the dataset comes from a certain function
with some set of parameters that should be tuned to reach the optimal
performance. For such models, the number of parameters is determined prior to
training, thus the degree of freedom is limited and reduces the chances of
overfitting.
Ex. Linear Regression, Logistic Regression, LDA
Nonparametric models don't assume anything about the function from which
the dataset was sampled. For these models, the number of parameters is not
determined prior to training, thus they are free to generalize the model based
on the data. This flexibility makes them more prone to overfitting, they need more
data to generalize well in comparison with parametric models, and they are
relatively more difficult to interpret.
Ex. Decision Tree, Random Forest.
Explain the kernel trick in SVM. Why do we use it and how to choose
what kernel to use?
Answer: Kernels are used in SVM to map the original input data into a particular
higher dimensional space where it will be easier to find patterns in the data and train
the model with better performance.
For example: If we have binary-class data that forms a ring-like pattern (inner and outer
rings representing two different class instances) when plotted in 2D space, a linear
SVM kernel will not be able to differentiate the two classes well when compared to an
RBF (radial basis function) kernel, mapping the data into a particular higher
dimensional space where the two classes are clearly separable.
Typically without the kernel trick, in order to calculate support vectors and support
vector classifiers, we need first to transform data points one by one to the higher
dimensional space, do the calculations based on SVM equations in the higher
dimensional space, and then return the results. The ‘trick’ in the kernel trick is that
we design the kernels based on some conditions as mathematical functions that are
equivalent to a dot product in the higher dimensional space without even having to
transform data points to the higher dimensional space. i.e. we can calculate support
vectors and support vector classifiers in the same space where the data is provided
which saves a lot of time and calculations.
Having domain knowledge can be very helpful in choosing the optimal kernel for your
problem. However, in the absence of such knowledge, the following default rule can be
helpful: for linear problems, try a linear kernel (or a simple linear model such as logistic
regression); for nonlinear problems, use the RBF (Gaussian) kernel.
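A minimal sketch of the ring-shaped example above, comparing a linear and an RBF kernel on scikit-learn's make_circles data:

```python
# A linear kernel cannot separate concentric circles; the RBF kernel can.
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_circles(n_samples=500, factor=0.3, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for kernel in ("linear", "rbf"):
    clf = SVC(kernel=kernel).fit(X_train, y_train)
    print(kernel, clf.score(X_test, y_test))   # RBF accuracy is near 1.0
```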
Define the cross-validation process and the motivation behind using it.
1. Decide on the number of folds K.
2. Shuffle the dataset and split it into K equal-sized folds.
3. For each i from 1 to K, train the model on all the folds except fold i and test on
fold i.
4. Average the K validation/test errors from the previous step to get an estimate of
the error.
This process aims to accomplish the following:
1- Prevent overfitting during training by avoiding training and testing on the same
subset of the data points.
2- Avoid the information loss that comes from holding out a fixed subset of the data for
validation only. This is especially important for small datasets.
Cross-validation is therefore particularly valuable for small datasets; for large datasets
the computational cost increases with the number of folds.
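A minimal k-fold cross-validation sketch with scikit-learn (toy dataset; the reported number is the average over the held-out folds, as described above):

```python
# 5-fold cross-validation: each score comes from a fold the model never saw.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores.mean(), scores.std())   # estimate of generalization accuracy
```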
You are building a binary classifier and you found that the data is
imbalanced, what should you do to handle this situation?
Answer: If there is a data imbalance there are several measures we can take to train a
fairer binary classifier:
1. Pre-Processing:
Suppression: Though not recommended, we can drop off some features directly
responsible for the imbalance.
2. In-Processing:
Regularisation: We can add to the loss function penalty terms that measure the
degree of data imbalance, so that minimizing the loss also minimizes the
imbalance with respect to the chosen score (and indirectly other metrics that
measure the degree of data imbalance).
Adversarial Debiasing: Here we use the adversarial notion to train the model
where the discriminator tries to detect if there are signs of data imbalance in
the predicted data by the generator and hence the generator learns to generate
data that is less prone to imbalance.
3. Post-Processing:
Odds-Equalization: Here we try to equalize the odds for the classes with respect
to which the data is imbalanced, in order to correct for the imbalance in the
trained model.
When evaluating such a classifier, remember that plain accuracy is misleading on
imbalanced data; usually, the F1 score is a good choice if both precision and recall
are important.
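One common in-processing option in practice, closely related to the regularisation idea above, is to re-weight the loss so that minority-class mistakes cost more; scikit-learn exposes this through class_weight. A hedged sketch on synthetic imbalanced data:

```python
# Cost-sensitive training: class_weight="balanced" re-weights the loss
# inversely to class frequencies, typically improving minority-class recall/F1.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

for cw in (None, "balanced"):
    clf = LogisticRegression(class_weight=cw, max_iter=1000).fit(X_tr, y_tr)
    print(cw, f1_score(y_te, clf.predict(X_te)))   # minority-class F1
```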
You are working on a clustering problem, what are different evaluation metrics that can
be used, and how to choose between them?
Answer:
Clusters are evaluated based on some similarity or dissimilarity measure, such as the
distance between cluster points. A clustering algorithm has performed well if it
separates dissimilar observations and groups similar observations together. Two
commonly used metrics are:
Silhouette coefficient
The Silhouette Coefficient is defined for each sample and is composed of two scores:
a, the mean distance between a sample and all other points in the same cluster, and
b, the mean distance between a sample and all other points in the next nearest cluster.
S = (b-a) / max(a,b)
The Silhouette Coefficient for a set of samples is given as the mean of the
Silhouette Coefficient for each sample. The score is bounded between -1 for incorrect
clustering and +1 for highly dense clustering. Scores around zero indicate
overlapping clusters. The score is higher when clusters are dense and well separated,
which relates to a standard concept of a cluster.
Dunn’s Index
Dunn’s Index (DI) is another metric for evaluating a clustering algorithm. Dunn’s
Index is equal to the minimum inter-cluster distance divided by the maximum cluster
size. Note that large inter-cluster distances (better separation) and smaller cluster
sizes (more compact clusters) lead to a higher DI value. A higher DI implies better
clustering. It assumes that better clustering means that clusters are compact and well-
separated from other clusters.
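A minimal sketch of using the Silhouette Coefficient to pick the number of clusters (toy blobs, scikit-learn's silhouette_score):

```python
# The silhouette score typically peaks at the true number of clusters.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)
for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, silhouette_score(X, labels))
```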
What is the ROC curve and when should you use it?
The ROC (Receiver Operating Characteristic) curve plots the true positive rate (TPR)
against the false positive rate (FPR) at different classification thresholds. It is mainly
used to compare two or more models. A reasonable model will always have an FPR
lower than its TPR (since FPR is an error rate), so its curve hugs the upper-left corner
of the unit square spanned by the TPR and FPR axes.
The larger the AUC (area under the curve) of a model's ROC curve, the better the
model is at trading off TPR against FPR.
Can help prioritize either true positives or true negatives depending on your
case study (it helps you visually choose the best decision threshold for your case)
Can be used to compare different ML models by calculating the area under the
ROC curve (AUC)
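A short sketch of computing the ROC curve and its AUC for a probabilistic classifier on synthetic data:

```python
# roc_curve gives the (FPR, TPR) points; roc_auc_score summarizes them.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

proba = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
fpr, tpr, thresholds = roc_curve(y_te, proba)
print(roc_auc_score(y_te, proba))   # area under the ROC curve
```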
What is the difference between hard and soft voting classifiers in the
context of ensemble learners?
Hard Voting: We take into account the class predictions for each classifier and
then classify an input based on the maximum votes to a particular class.
Soft Voting: We take into account the probability predictions for each class by
each classifier and then classify an input to the class with maximum probability
based on the average probability (averaged over the classifier's probabilities)
for that class.
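A minimal sketch contrasting the two modes with scikit-learn's VotingClassifier (note that soft voting requires each base model to expose predicted probabilities):

```python
# Hard voting counts class votes; soft voting averages predicted probabilities.
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=1000, noise=0.3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

estimators = [("lr", LogisticRegression()),
              ("rf", RandomForestClassifier(random_state=0)),
              ("svc", SVC(probability=True, random_state=0))]  # probabilities needed for soft voting

for voting in ("hard", "soft"):
    clf = VotingClassifier(estimators, voting=voting).fit(X_tr, y_tr)
    print(voting, clf.score(X_te, y_te))
```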
What is boosting in the context of ensemble learners? Discuss two famous boosting
methods.
Boosting refers to any ensemble method that can combine several weak learners into
a strong learner. The general idea of most boosting methods is to train predictors
sequentially, each trying to correct its predecessor.
There are many boosting methods available, but by far the most popular are:
Adaptive Boosting (AdaBoost): One way for a new predictor to correct its
predecessor is to pay a bit more attention to the training instances that the
predecessor under-fitted. This results in new predictors focusing more and more
on the hard cases.
Gradient Boosting: Like AdaBoost, it adds predictors to the ensemble sequentially,
each one correcting its predecessor; however, instead of tweaking the instance
weights at every iteration, it fits each new predictor to the residual errors made
by the previous one.
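A minimal sketch of both methods with their scikit-learn implementations (synthetic data, default hyperparameters):

```python
# AdaBoost re-weights hard instances; Gradient Boosting fits each new tree
# to the residual errors of the current ensemble.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, random_state=0)
for model in (AdaBoostClassifier(random_state=0),
              GradientBoostingClassifier(random_state=0)):
    score = cross_val_score(model, X, y, cv=5).mean()
    print(type(model).__name__, score)
```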
Define the curse of dimensionality and how to solve it
Answer: The curse of dimensionality describes the situation where the amount of data is
too small to be well represented in a high-dimensional space: the data becomes highly
scattered in that space, and it becomes more probable that we overfit it. If we increase
the number of features, we are implicitly increasing model complexity, and if we
increase model complexity we need more data.
Possible solutions are: remove irrelevant features (features that do not discriminate
between the classes, are highly correlated with other features, or do not result in much
improvement), or apply dimensionality reduction techniques such as PCA.
In what cases would you use vanilla PCA, Incremental PCA, Randomized
PCA, or Kernel PCA?
Answer:
Regular PCA is the default, but it works only if the dataset fits in memory. Incremental
PCA is useful for large datasets that don't fit in memory, but it is slower than regular
PCA, so if the dataset fits in memory you should prefer regular PCA. Incremental PCA
is also useful for online tasks when you need to apply PCA on the fly, every time a new
instance arrives. Randomized PCA is useful when you want to considerably reduce
dimensionality and the dataset fits in memory; in this case, it is much faster than
regular PCA. Finally, Kernel PCA is useful for nonlinear datasets.
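A hedged sketch of the four variants on random toy data (the batch count and component numbers are arbitrary):

```python
# Vanilla, randomized, incremental (out-of-core), and kernel PCA side by side.
import numpy as np
from sklearn.decomposition import PCA, IncrementalPCA, KernelPCA

X = np.random.default_rng(0).normal(size=(2_000, 50))

pca = PCA(n_components=10).fit(X)                        # vanilla PCA
rand_pca = PCA(n_components=10, svd_solver="randomized",
               random_state=0).fit(X)                    # randomized PCA

inc_pca = IncrementalPCA(n_components=10)                # for data that won't fit in memory
for batch in np.array_split(X, 20):                      # feed mini-batches
    inc_pca.partial_fit(batch)

kpca = KernelPCA(n_components=10, kernel="rbf").fit(X)   # nonlinear datasets
print(pca.explained_variance_ratio_[:3])
```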
Discuss two clustering algorithms that can scale to large datasets
Answer:
Minibatch Kmeans: Instead of using the full dataset at each iteration, the algorithm
is capable of using mini-batches, moving the centroids just slightly at each iteration.
This speeds up the algorithm typically by a factor of 3 or 4 and makes it possible to
cluster huge datasets that do not fit in memory. Scikit-Learn implements this
algorithm in the MiniBatchKMeans class.
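A minimal usage sketch of the class mentioned above, on synthetic blobs:

```python
# MiniBatchKMeans fits on small random batches, which scales to datasets
# that do not fit in memory (partial_fit can also be used for streaming data).
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=100_000, centers=8, random_state=0)
mbk = MiniBatchKMeans(n_clusters=8, batch_size=1024, n_init=3,
                      random_state=0).fit(X)
print(mbk.cluster_centers_.shape)   # (8, 2)
```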
Do you need to scale your data if you will be using the SVM classifier? Discuss your
answer.
Answer: Yes, feature scaling is required for SVM and all margin-based classifiers since
the optimal hyperplane (the decision boundary) is dependent on the scale of the input
features. In other words, the distance between two observations will differ for scaled
and non-scaled cases, leading to different models being generated.
When the features have very different scales, the decision boundary and the support
vectors end up being determined almost entirely by the feature with the larger scale
(effectively classifying on the X1 feature while ignoring the X0 feature). After scaling
the data so that both features are on the same scale, the decision boundary and support
vectors look much better and the model takes both features into account.
To scale the data, normalization, and standardization are the most popular
approaches.
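A short sketch of the effect, putting the scaler inside a Pipeline so the scaling learned on the training folds is reused at prediction time (toy dataset):

```python
# SVC accuracy with and without standardization on the same data.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
print(cross_val_score(SVC(), X, y, cv=5).mean())        # unscaled features
print(cross_val_score(make_pipeline(StandardScaler(), SVC()),
                      X, y, cv=5).mean())               # scaled features, noticeably higher
```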
What are Loss Functions and Cost Functions? Explain the key
Difference Between them.
Answer: The loss function is the measure of the performance of the model on a single
training example, whereas the cost function is the average loss function over all
training examples or across the batch in the case of mini-batch gradient descent.
Some examples of loss functions are Mean Squared Error, Binary Cross Entropy, etc.
Whereas, the cost function is the average of the above loss functions over training
examples.
What is the importance of batch in machine learning and explain some batch-dependent
gradient descent algorithms?
Answer: A dataset can be loaded into memory either all at once or in smaller sets
(batches). If the dataset is huge, loading the whole of it into memory will reduce the
training speed (or not fit at all), hence the batch term is introduced.
Example: if the image data contains 100,000 images, we can load them in 3,125 batches
where 1 batch = 32 images. So instead of loading all 100,000 images into memory at
once, we can load 32 images 3,125 times, which requires less memory.
In summary, a batch is important in two ways: (1) Efficient memory consumption. (2)
Improve training speed.
There are 3 types of gradient descent algorithms based on batch size: (1) Stochastic
gradient descent (2) Batch gradient descent (3) Mini Batch gradient descent
If the whole dataset forms a single batch, it is called batch gradient descent. If each
single data point is its own batch, i.e. the number of batches equals the number of data
instances, it is called stochastic gradient descent. If the batch size is greater than 1 but
smaller than the total number of data points, it is known as mini-batch gradient descent.
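A minimal NumPy sketch of mini-batch gradient descent for linear regression; setting batch_size equal to the dataset size gives batch GD, and setting it to 1 gives stochastic GD (all values here are toy choices):

```python
# Mini-batch gradient descent on y = 4x + 3 with Gaussian noise.
import numpy as np

rng = np.random.default_rng(0)
m, batch_size, lr, epochs = 1000, 32, 0.1, 50
X = rng.normal(size=(m, 1))
y = 4 * X[:, 0] + 3 + rng.normal(scale=0.5, size=m)
Xb = np.c_[np.ones(m), X]                     # add a bias column
theta = np.zeros(2)

for _ in range(epochs):
    idx = rng.permutation(m)                  # reshuffle every epoch
    for start in range(0, m, batch_size):
        batch = idx[start:start + batch_size]
        grad = 2 / len(batch) * Xb[batch].T @ (Xb[batch] @ theta - y[batch])
        theta -= lr * grad
print(theta)                                  # close to [3, 4]
```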
What are the different methods to split a tree in a decision tree algorithm?
Decision trees can be of two types: regression and classification. For classification,
using classification accuracy directly as a splitting criterion causes a lot of instability,
so the following impurity measures are used instead:
Gini's Index: Gini impurity measures the likelihood that a randomly chosen
example would be incorrectly classified by a particular node. It is referred to as
an “impurity” measure because it shows how far the node departs from a simple,
pure division.
For regression, the good old mean squared error serves as a good loss function which
is minimized by splits of the input features and predicting the mean value of the target
feature on the subspaces resulting from the split. But finding the sequence of splits that
results in the minimum possible residual sum of squares over the whole tree is
computationally infeasible, so a greedy top-down approach is taken: at each step, the
split that yields the maximum reduction in RSS is chosen. We continue this until some
maximum depth or number of leaves is reached.
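A small sketch of the classification criterion: Gini impurity computed by hand, plus where the criterion is chosen in scikit-learn (which also offers "entropy" as an alternative; "squared_error" is the RSS-style criterion for regression trees in recent versions):

```python
# Gini impurity of a node and the criterion arguments in scikit-learn trees.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini([0, 0, 1, 1]))   # 0.5 -> maximally impure binary node
print(gini([0, 0, 0, 0]))   # 0.0 -> pure node

clf = DecisionTreeClassifier(criterion="gini")          # or criterion="entropy"
reg = DecisionTreeRegressor(criterion="squared_error")  # minimizes RSS at each split
```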
Why boosting is a more stable algorithm as compared to other ensemble algorithms?
Boosting algorithms keep focusing on the errors made in previous iterations, correcting
them iteration after iteration until they become negligible, whereas in bagging there is
no such corrective loop. That is why boosting is considered a more stable algorithm
compared to other ensemble algorithms.
What is active learning and discuss one strategy of it?
Active learning is a special case of machine learning in which a learning algorithm can
interactively query a user (or some other information source) to label new data points
with the desired outputs. In statistics literature, it is sometimes referred to as optimal
experimental design.
1. Stream-based selective sampling: Unlabelled instances are presented one at a
time, and for each one the learner decides on the spot, based on its informative
value, whether to query its label or discard it.
2. Pool-based sampling: In this case, the data samples are chosen from a pool of
unlabelled data based on their informative value scores and sent for manual
labeling. Unlike stream-based sampling, oftentimes the entire unlabelled dataset
is scrutinized to select the best instances.
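A hedged sketch of pool-based active learning with uncertainty sampling; the labeling "oracle" is simulated with the already-known labels, and all sizes are arbitrary:

```python
# Repeatedly label the pool instances the current model is least sure about.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, random_state=0)
labeled = list(np.where(y == 0)[0][:5]) + list(np.where(y == 1)[0][:5])  # seed set
pool = [i for i in range(len(X)) if i not in labeled]

for _ in range(20):                                    # 20 labeling rounds
    clf = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])
    proba = clf.predict_proba(X[pool])[:, 1]
    most_uncertain = pool[int(np.argmin(np.abs(proba - 0.5)))]
    labeled.append(most_uncertain)                     # "query the oracle" for its label
    pool.remove(most_uncertain)
print(len(labeled), clf.score(X, y))
```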
What are the different approaches to implementing recommendation systems?
Content-based filtering recommends products that are similar to the ones a user has
liked before, based on the attributes of the items themselves. This approach helps avoid
a cold start for new products because it does not rely on other users' feedback; it can
recommend products based on similarity factors alone. However, content-based filtering
needs a lot of domain knowledge, since the quality of its recommendations depends on
how well the item features are engineered.
Collaborative filtering, on the other hand, instead of focusing on just one user, focuses
on all the users and clusters them according to their interests.
Basically, it recommends a product 'x' to user 'a' based on the interest of user 'b';
users 'a' and 'b' must have had similar interests in the past, which is why they are
clustered together.
More modern approaches typically fall into the hybrid filtering category and tend to
work in two stages: a candidate-generation (retrieval) stage that quickly narrows the
item catalogue down to a manageable set, followed by a ranking stage that scores those
candidates and orders the final recommendations.
What are the evaluation metrics that can be used for multi-label
classification?
The evaluation metrics for multi-label classification are designed to measure the
performance of a multi-label classifier in predicting the correct set of labels for each
instance. Some commonly used evaluation metrics for multi-label classification are:
1. Hamming Loss: Hamming Loss is the fraction of labels that are incorrectly
predicted. It is defined as the average number of labels that are predicted
incorrectly per instance.
4. Macro-F1, Micro-F1: Macro-F1 and Micro-F1 are two types of F1-score metrics
that take into account the label imbalance in the dataset. Macro-F1 calculates
the F1-score for each label and then averages them, while Micro-F1 calculates
the overall F1-score by aggregating the true positive, false positive, and false
negative counts across all labels.
Precision at k (P@k): the proportion of correctly predicted labels among the
model's top k predicted labels for each instance, averaged over instances.
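A short sketch of the first and fourth metrics on toy multi-label indicator data:

```python
# Hamming loss and macro/micro F1 on a 4-instance, 3-label toy example.
import numpy as np
from sklearn.metrics import f1_score, hamming_loss

Y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]])
Y_pred = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1]])

print(hamming_loss(Y_true, Y_pred))               # fraction of wrongly predicted labels
print(f1_score(Y_true, Y_pred, average="macro"))  # per-label F1, then averaged
print(f1_score(Y_true, Y_pred, average="micro"))  # global TP/FP/FN counts
```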
What is the difference between concept and data drift and how to overcome each of
them?
Concept drift and data drift are two different types of problems that can occur in
machine learning systems.
Concept drift refers to changes in the underlying relationships between the input data
and the target variable over time. This means that the distribution of the data that the
model was trained on no longer matches the distribution of the data it is being tested
on. For example, a spam filter model that was trained on emails from several years
ago may not be as effective at identifying spam emails from today because the
language and tactics used in spam emails may have changed.
Data drift, on the other hand, refers to changes in the input data itself over time. This
means that the values of the input features that the model was trained on no longer
match the values of the input features in the data it is being tested on. For example, a
model that was trained on data from a particular geographical region may not be as
effective at predicting outcomes for data from a different region.
To overcome concept drift, one approach is to use online learning methods that allow
the model to adapt to new data as it arrives. This involves continually training the
model on the most recent data while using historical data to maintain context. Another
approach is to periodically retrain the model using a representative sample of the
most recent data.
To overcome data drift, one approach is to monitor the input data for changes and
retrain the model when significant changes are detected. This may involve setting up a
monitoring system that alerts the user when the data distribution changes beyond a
certain threshold.
Another approach is to preprocess the input data to remove or mitigate the effects of
the features changing over time so that the model can continue learning from the
remaining features.
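A hedged sketch of one way to monitor for data drift: compare the training distribution of each feature against newly arriving data with a two-sample Kolmogorov–Smirnov test and flag features whose distribution has shifted (the data, threshold, and drifting feature are all simulated):

```python
# Per-feature drift check with a two-sample KS test.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
X_train = rng.normal(loc=0.0, scale=1.0, size=(5000, 3))
X_new = X_train.copy()
X_new[:, 2] = rng.normal(loc=0.8, scale=1.0, size=5000)   # feature 2 has drifted

for j in range(X_train.shape[1]):
    stat, p_value = ks_2samp(X_train[:, j], X_new[:, j])
    if p_value < 0.01:                                     # alert threshold (assumed)
        print(f"feature {j}: drift detected (p={p_value:.2e})")
```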