Data Science Interview Questions
What is ensemble learning and how does it improve model performance?
Answer:
Ensemble Learning:
- Combines multiple models (weak learners) to create a stronger model.
- Techniques:
  - Bagging (Bootstrap Aggregating): Reduces variance by training multiple models on different subsets of the data and averaging their predictions (e.g., Random Forest).
  - Boosting: Reduces bias by sequentially training models, each correcting the errors of its predecessor (e.g., AdaBoost, Gradient Boosting).
  - Stacking: Combines multiple models by training a meta-model on their predictions.
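The bagging idea above can be sketched in pure Python: train several decision stumps on bootstrap samples of the data and take a majority vote. The names `train_stump` and `bagging_predict`, and the toy dataset, are illustrative assumptions, not part of any library.

```python
import random
from collections import Counter

def train_stump(data):
    # Weak learner: pick the threshold t (from the sample's x values)
    # that minimizes classification errors for the rule "predict x >= t".
    best_t, best_err = None, float("inf")
    for t, _ in data:
        err = sum((x >= t) != y for x, y in data)
        if err < best_err:
            best_t, best_err = t, err
    return best_t

def bagging_predict(stumps, x):
    # Majority vote across all stumps in the ensemble.
    votes = Counter(x >= t for t in stumps)
    return votes.most_common(1)[0][0]

random.seed(0)
data = [(x, x >= 5) for x in range(10)]  # toy labels: True when x >= 5

# Bagging: each stump sees a different bootstrap sample of the data.
stumps = [train_stump(random.choices(data, k=len(data))) for _ in range(25)]
```

Any single stump can be thrown off by an unlucky bootstrap sample, but averaging 25 of them smooths those errors out, which is exactly the variance reduction bagging promises.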
Answer:
Handling Categorical Data:
- Label Encoding: Converts categories to numeric labels.
- One-Hot Encoding: Converts categories to binary vectors.
- Target Encoding: Replaces each category with the mean of the target variable for that category.
- Frequency Encoding: Replaces categories with their frequency counts.
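All four encodings can be shown side by side on a small made-up column (the `colors` and `target` data below are invented for illustration; real projects would typically use pandas or scikit-learn):

```python
from collections import Counter, defaultdict

colors = ["red", "blue", "red", "green", "blue", "red"]
target = [1, 0, 1, 0, 1, 0]  # hypothetical binary target

# Label encoding: each category -> an integer id
labels = {c: i for i, c in enumerate(sorted(set(colors)))}
label_encoded = [labels[c] for c in colors]

# One-hot encoding: each category -> a binary vector
cats = sorted(set(colors))
one_hot = [[int(c == cat) for cat in cats] for c in colors]

# Frequency encoding: each category -> its count in the column
freq = Counter(colors)
freq_encoded = [freq[c] for c in colors]

# Target encoding: each category -> mean of the target for that category
sums, counts = defaultdict(float), Counter()
for c, y in zip(colors, target):
    sums[c] += y
    counts[c] += 1
target_encoded = [sums[c] / counts[c] for c in colors]
```

Note that target encoding computed on the full dataset like this can leak the target into the features; in practice it is fit on training folds only.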
Answer:
Parametric Models:
- Assume a specific form for the function that models the data.
- Example: Linear Regression.
Non-Parametric Models:
- Do not assume a specific form and can adapt to the data more flexibly.
- Examples: Decision Trees, k-Nearest Neighbors (k-NN).
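The contrast can be made concrete: a linear regression compresses the whole dataset into two learned numbers (slope and intercept), while a k-NN regressor keeps every training point and answers queries from them directly. This is a minimal sketch with invented data; `fit_linear` and `knn_predict` are illustrative names.

```python
def fit_linear(xs, ys):
    # Parametric: assume y = a*x + b and learn just two parameters.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

def knn_predict(xs, ys, q, k=3):
    # Non-parametric: keep all training points, average the k closest.
    nearest = sorted(zip(xs, ys), key=lambda p: abs(p[0] - q))[:k]
    return sum(y for _, y in nearest) / k

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]  # toy data following y = 2x
a, b = fit_linear(xs, ys)
```

The parametric model's size is fixed no matter how much data arrives; the non-parametric model grows with the training set.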
Answer:
Curse of Dimensionality:
- Refers to the various phenomena that arise when analyzing and organizing data in high-dimensional spaces.
- Challenges: increased sparsity, overfitting, and increased computational cost.
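One such phenomenon is distance concentration: in high dimensions, the nearest and farthest neighbors of a point end up at almost the same distance, which undermines distance-based methods like k-NN. A small sketch (the `ratio` helper and point counts are assumptions for illustration):

```python
import math
import random

def ratio(dim, n=200, seed=0):
    # Ratio of farthest to nearest distance from the cube's center
    # to n uniformly random points in [0, 1]^dim.
    rng = random.Random(seed)
    pts = [[rng.random() for _ in range(dim)] for _ in range(n)]
    center = [0.5] * dim
    dists = [math.dist(center, p) for p in pts]
    return max(dists) / min(dists)
```

In 2 dimensions the ratio is large (some points land very close to the center, others far away); in 100 dimensions it collapses toward 1, so "nearest" barely means anything.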
Answer:
k-Nearest Neighbors (k-NN) Algorithm:
- A simple, non-parametric algorithm for classification and regression.
- Steps:
  - Choose the number of neighbors (k).
  - Calculate the distance between the query point and all training points.
  - Select the k nearest neighbors based on the smallest distances.
  - For classification, assign the most frequent class among the neighbors.
  - For regression, average the values of the neighbors.
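The steps above map almost line for line onto a minimal from-scratch classifier (the `knn_classify` name and the toy two-cluster dataset are illustrative, not from any library):

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    # train: list of (point, label) pairs; point is a tuple of floats.
    # Step 2: compute distances; steps 3-4: take k nearest, majority vote.
    nearest = sorted(train, key=lambda pl: math.dist(pl[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

train = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
         ((6, 6), "B"), ((6, 7), "B"), ((7, 6), "B")]
```

For regression the final vote would simply be replaced by the mean of the k neighbors' values.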
Answer:
ROC Curve (Receiver Operating Characteristic):
- Plots the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings.
- Shows the trade-off between sensitivity (recall) and specificity.
AUC (Area Under the Curve):
- Measures the area under the ROC curve.
- A higher AUC indicates better model performance: 1.0 is a perfect model and 0.5 is a random model.
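AUC has a useful equivalent interpretation: it is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one (ties counted as half). That gives a compact way to compute it without tracing the curve; the `roc_auc` helper below is a sketch of this rank-based formulation, not a library function.

```python
def roc_auc(scores, labels):
    # AUC as the probability a random positive outscores a random negative,
    # counting ties as 0.5 (equivalent to the area under the ROC curve).
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A perfectly ranked score list yields 1.0, a perfectly inverted one 0.0, and scores carrying no information about the label sit near 0.5.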