Interview Questions
A. Can linear regression be used for binary classification problems?
1. Unbounded Predictions: Linear regression does not restrict output to the [0,1]
interval. This means it can predict values outside this range, which don’t make sense for
probabilities or class labels.
2. Inappropriate Loss Function: Linear regression minimizes the mean squared error
(MSE), which is not optimal for classification. Logistic regression, in contrast, uses a
log-loss (cross-entropy) function that directly optimizes for classification performance.
So, while linear regression can technically be applied to binary classification, logistic
regression is usually the more suitable choice for this task due to its better alignment
with classification needs.
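To make the first point concrete, here is a minimal sketch, assuming scikit-learn and a made-up synthetic 1-D dataset, showing that a fitted LinearRegression can return values outside [0, 1] while LogisticRegression always returns valid probabilities:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
# Binary labels loosely driven by the single feature.
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

lin = LinearRegression().fit(X, y)
log = LogisticRegression().fit(X, y)

X_new = np.array([[-4.0], [0.0], [4.0]])
print(lin.predict(X_new))               # unbounded; can fall outside [0, 1]
print(log.predict_proba(X_new)[:, 1])   # always a valid probability in (0, 1)
```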
B. Why is logistic regression called regression but still used for classification
problems?
The name "logistic regression" can be a bit misleading because, despite the term
"regression," it’s actually designed for classification problems. Here’s why it’s called
logistic regression and why it works for classification:
Why "Regression"?
2. The Sigmoid Function: Logistic regression applies a logistic (sigmoid) function to the
linear combination of features. This function maps any real-valued number to a range
between 0 and 1, making the output interpretable as a probability, which is essential for
binary classification.
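A minimal NumPy illustration of the sigmoid mapping described above:

```python
import numpy as np

def sigmoid(z):
    # sigma(z) = 1 / (1 + e^-z) maps any real number into (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(sigmoid(z))  # approx. [0.00005, 0.269, 0.5, 0.731, 0.99995]
```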
C. What are the strengths and limitations of logistic regression?
Logistic regression is a popular and effective choice for many classification problems,
but like any model, it has its strengths and limitations. Here’s a breakdown:
Strengths
1. Fast and Easy to Implement:
- It can be quickly trained and is easy to implement in most machine learning libraries.
2. Works Well with Linearly Separable Data:
- Logistic regression performs well when the classes are linearly separable (can be
divided with a linear boundary).
- It’s often a good starting model for binary classification tasks due to its simplicity and
effectiveness.
3. Probability Interpretation:
- Logistic regression provides probabilistic outputs (between 0 and 1), making it easy
to understand the model’s confidence in each prediction.
Limitations
1. Assumes a Linear Decision Boundary:
- Logistic regression assumes a linear relationship between the features and the log-odds of the outcome.
- When this assumption does not hold (e.g., in complex, nonlinear data), logistic
regression may perform poorly.
2. Limited to Binary or Multiclass Classification:
- Standard logistic regression is typically used for binary classification. Extensions like
multinomial or ordinal logistic regression allow for multiclass classification, but the
model can still struggle with very complex problems or problems with many classes.
3. Sensitive to Outliers:
- Outliers can have a significant impact on logistic regression, as they can distort the
estimated coefficients and reduce model accuracy.
4. May Miss Complex Patterns:
- For high-dimensional or intricate data, logistic regression may not capture all
relevant patterns and relationships.
5. Sensitive to Multicollinearity:
- Regularization can mitigate some issues with correlated features, but other models
may be better suited if multicollinearity is high.
Summary: Logistic regression is often a great first choice for binary classification,
thanks to its simplicity, interpretability, and efficiency. However, it may struggle with
complex, nonlinear data and is sensitive to feature scaling, outliers, and
multicollinearity. In cases where logistic regression falls short, more complex models
may be needed to capture intricate relationships in the data.
D. How do outliers affect logistic regression?
1. Distortion of Coefficients:
- Logistic regression fits a linear decision boundary that best separates the classes by adjusting the coefficients for each feature.
- Outliers—extreme values that are far from the majority of the data—can
disproportionately influence the coefficients, skewing the model's decision boundary
and making it less representative of the typical data.
2. Distorted Probability Estimates:
- Logistic regression outputs probabilities for each class. When outliers exist, they can
distort these probabilities, causing the model to be overly confident or under-confident
in its predictions.
- This is particularly problematic when outliers are incorrectly classified, as they can
drag the probability estimates for the rest of the data, resulting in less reliable
predictions.
3. Reduced Accuracy:
- This can decrease the model's accuracy, as the coefficients adjust to fit the outliers
instead of the overall data distribution.
4. Higher Variance and Overfitting:
- Outliers can introduce high variance, making the model more sensitive to small
changes in the data. This can lead to overfitting, where the model fits the noise rather
than the underlying pattern.
- Logistic regression with large or extreme outliers may become overly complex in an
attempt to accommodate these points, resulting in poor generalization to new data.
Unlike more robust models, such as decision trees, which are relatively unaffected by outliers,
logistic regression uses a linear combination of features, which can be sensitive to
extreme values. This sensitivity is due to the way logistic regression fits the log-odds of
the classes to a line that minimizes the log-loss (or cross-entropy) function. Extreme
outliers can disproportionately impact this minimization process, leading to biased
coefficients and a skewed decision boundary.
How to Handle Outliers
1. Remove Outliers:
- Identify and remove outliers before training. Techniques such as Z-scores, the IQR
(interquartile range), or more complex approaches like DBSCAN can be used to detect
outliers (see the sketch after this list).
- This approach works best when there are only a few outliers, and when they can be
clearly defined as unrepresentative of the population.
2. Apply Regularization:
- Regularization does not eliminate the impact of outliers but can mitigate their
influence by limiting coefficient size.
3. Use Robust Techniques:
- Robust logistic regression techniques, such as those based on the Huber loss or quantile
regression, can handle outliers more effectively than standard logistic regression by
reducing the influence of outliers in the loss function.
4. Transform Features:
- Applying transformations (like log, square root, or Winsorization) can reduce the
influence of outliers by compressing extreme values.
- This approach can make the distribution of each feature more uniform, making the
logistic regression less sensitive to outliers.
5. Choose a More Robust Model:
- For datasets with many outliers or where the data is noisy, models less sensitive to
outliers, such as decision trees, random forests, or support vector machines (SVMs),
may be more suitable.
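As referenced in strategy 1, here is a rough NumPy sketch of two of these approaches on a made-up one-dimensional sample: IQR-based filtering and winsorization via percentile clipping.

```python
import numpy as np

rng = np.random.default_rng(1)
# 500 ordinary points plus two extreme outliers.
x = np.concatenate([rng.normal(size=500), [15.0, -12.0]])

# Strategy 1: drop points outside the 1.5 * IQR fences.
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
mask = (x >= q1 - 1.5 * iqr) & (x <= q3 + 1.5 * iqr)
x_filtered = x[mask]

# Strategy 4 (Winsorization): clip extremes to the 1st/99th percentiles
# instead of dropping them.
lo, hi = np.percentile(x, [1, 99])
x_winsorized = np.clip(x, lo, hi)

print(len(x), "->", len(x_filtered))      # a few points removed
print(x.max(), "->", x_winsorized.max())  # extreme values compressed
```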
Summary
Outliers can distort logistic regression's coefficients and probability estimates. Detect
them before training, and mitigate their influence through removal, regularization,
robust loss functions, feature transformations, or a more outlier-tolerant model.
E. Do residuals exist in logistic regression?
No, residuals as they are traditionally defined in linear regression don’t exist in logistic
regression, primarily due to the nature of the model and the type of predictions it
generates.
Why Residuals Don’t Exist in Logistic Regression
1. Prediction Type:
- In linear regression, the model predicts a continuous outcome, and residuals are the
differences between the observed and predicted values (i.e., \( \text{residual} = y -
\hat{y} \)).
- Logistic regression, on the other hand, predicts probabilities for each class (e.g., the
probability of an instance belonging to class 1). The output isn’t a continuous outcome
to be compared directly to observed binary class labels (0 or 1).
2. Binary Outcomes:
- Logistic regression models binary (or categorical) outcomes, where the observed
values are binary labels (0 or 1) rather than continuous values.
- Because the model outputs probabilities, there's no direct residual (as in linear
regression) that represents the difference between observed and predicted values.
3. Probability-Based Loss:
- Logistic regression uses a log-loss or cross-entropy function to evaluate how well the
model fits the data, which measures the difference between the predicted probability
and the actual class label.
- This loss is based on the probability output of the model rather than a residual-based
metric like mean squared error (MSE), which is used in linear regression.
While traditional residuals aren’t applicable in logistic regression, there are several
alternative ways to evaluate and understand model performance:
1. Deviance or Log-Loss:
- Deviance (minus two times the model’s log-likelihood) and log-loss measure how far
the predicted probabilities fall from the observed labels; lower values indicate a better fit.
2. Error Rate or Accuracy:
- The error rate (or accuracy) compares the predicted class labels to the true labels
and calculates the percentage of correct predictions. This metric gives an indication of
how well the model classifies examples.
3. Pseudo R-Squared:
- Various pseudo \( R^2 \) metrics, like McFadden’s \( R^2 \), can give a sense of how
well the model explains the variability in the outcome, similar to \( R^2 \) in linear
regression, but adapted for logistic regression.
4. Classification Metrics:
- Metrics like precision, recall, F1-score, and the area under the ROC curve (AUC-ROC)
are commonly used to assess the quality of classification models. These metrics
provide insights into the model's performance in distinguishing between classes.
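As an illustration of the first alternative, here is a rough sketch, assuming scikit-learn and a synthetic dataset, of computing McFadden's pseudo \( R^2 \) by comparing the fitted model's log-likelihood against an intercept-only null model:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 2))
y = (X @ np.array([2.0, -1.0]) + rng.normal(size=300) > 0).astype(int)

model = LogisticRegression().fit(X, y)
p = model.predict_proba(X)[:, 1]

# Total log-likelihoods of the fitted model and an intercept-only null model.
ll_model = -log_loss(y, p, normalize=False)
ll_null = -log_loss(y, np.full_like(p, y.mean()), normalize=False)

# McFadden's pseudo R^2: 1 - LL(model) / LL(null).
print("McFadden's R^2:", 1.0 - ll_model / ll_null)
```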
Summary
In logistic regression, traditional residuals don’t exist because the model predicts
probabilities rather than continuous outcomes. Instead, model evaluation relies on
metrics that measure the difference between predicted probabilities and observed
binary outcomes, such as log-loss, deviance, or classification accuracy. These metrics
provide insights into the model’s performance without requiring traditional residuals.
F. What metrics would you use to evaluate a logistic regression model?
1. Accuracy
- Definition: The percentage of instances whose predicted class matches the true label.
- When to Use: Useful when classes are balanced (i.e., the number of instances in
each class is roughly equal).
- Limitation: Accuracy can be misleading when classes are imbalanced, as the model
might simply predict the majority class to achieve high accuracy.
2. Confusion Matrix and Related Metrics
- Confusion Matrix: This matrix displays counts of true positives (TP), true negatives
(TN), false positives (FP), and false negatives (FN).
- Precision: The proportion of positive predictions that are correct, useful for
applications where false positives are costly.
\( \text{Precision} = \frac{TP}{TP + FP} \)
- Recall (Sensitivity): The proportion of actual positives that are correctly identified,
important in applications where false negatives are costly.
\( \text{Recall} = \frac{TP}{TP + FN} \)
- F1-Score: The harmonic mean of precision and recall, balancing the two metrics.
Useful when there’s an uneven class distribution or when precision and recall are both
important.
3. ROC Curve and AUC
- ROC Curve: Plots the true positive rate (recall) against the false positive rate at
various probability thresholds. It illustrates the model's ability to distinguish between
classes across thresholds.
- AUC (Area Under the ROC Curve): Measures the overall performance across all
classification thresholds, with values closer to 1 indicating better model performance.
- When to Use: AUC is especially useful for imbalanced datasets because it evaluates
how well the model separates the positive and negative classes across different
thresholds.
4. Log-Loss (Cross-Entropy)
- Definition: Measures the distance between predicted probabilities and the actual
labels, penalizing confident wrong predictions heavily.
- When to Use: Use log-loss when you need a measure of probability calibration—i.e.,
how well the predicted probabilities reflect actual class likelihoods.
5. Calibration Curve
- Definition: A calibration curve (or reliability plot) shows how well the predicted
probabilities match observed probabilities. If a model is well-calibrated, instances
predicted with a 70% probability should be correct about 70% of the time.
- When to Use: Useful when probability estimates are important, such as in risk
assessment models.
6. Precision-Recall Curve and Average Precision Score
- Precision-Recall Curve: Shows the trade-off between precision and recall at different
probability thresholds.
- Average Precision Score: Summarizes the precision-recall curve as the average
precision achieved across all recall levels.
- When to Use: Particularly useful for imbalanced datasets where the positive class is
rare, as it focuses on the model’s ability to detect the minority class.
7. Brier Score
- Definition: The Brier score measures the mean squared difference between the
predicted probability and the actual outcome (0 or 1).
- When to Use: The Brier score is similar to log-loss but simpler to interpret. It’s
especially useful when you want to evaluate the accuracy of probability predictions
without focusing on specific thresholds.
Choosing a Metric
- For imbalanced classes, use precision, recall, F1-score, or AUC-PR to evaluate model
performance.
- When probabilistic predictions are needed, log-loss, calibration curves, and the Brier
score can assess the accuracy of probability estimates.
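A compact sketch, assuming scikit-learn and a synthetic imbalanced dataset, that computes most of the metrics above for a single fitted model:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, log_loss,
                             brier_score_loss, confusion_matrix)

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
y_pred = model.predict(X_te)
y_prob = model.predict_proba(X_te)[:, 1]

print(confusion_matrix(y_te, y_pred))            # TN/FP/FN/TP counts
print("accuracy :", accuracy_score(y_te, y_pred))
print("precision:", precision_score(y_te, y_pred))
print("recall   :", recall_score(y_te, y_pred))
print("f1       :", f1_score(y_te, y_pred))
print("auc      :", roc_auc_score(y_te, y_prob))  # threshold-independent
print("log-loss :", log_loss(y_te, y_prob))       # probability calibration
print("brier    :", brier_score_loss(y_te, y_prob))
```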
G. What is the cost function in logistic regression, and why is it used?
The cost function in logistic regression, often called the log-loss or cross-entropy loss,
measures how well the model's predicted probabilities match the actual class labels.
Logistic regression uses this cost function to guide the optimization of model
parameters, ensuring that it accurately predicts probabilities for each class.
- The goal is to minimize the average cross-entropy across all observations in the
training set. So, the cost function for \( m \) training examples is:
\( J(w) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log \hat{y}^{(i)} + (1 - y^{(i)}) \log\left(1 - \hat{y}^{(i)}\right) \right] \)
where \( \hat{y}^{(i)} = \sigma(w^T x^{(i)}) \) is the predicted probability for example \( i \).
- The cost function heavily penalizes predictions that are both wrong and confident
(e.g., a high probability of class 1 when the actual label is 0).
- This penalty encourages the model to output well-calibrated probabilities, especially
for instances where the model is less certain.
- The derivatives with respect to the weights w are computed to find the gradient of
the cost function, allowing the model to iteratively update w during training in order to
minimize the cost function using optimization methods like gradient descent.
3. Probabilistic Interpretation:
- Minimizing this cost is equivalent to maximum likelihood estimation: the model treats
each label as a Bernoulli outcome and chooses the weights that make the observed
labels most probable.
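A minimal NumPy sketch of this cost function and its gradient, trained with plain gradient descent on synthetic data; the learning rate and iteration count are illustrative choices:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_and_gradient(w, X, y):
    """Binary cross-entropy J(w) and its gradient for logistic regression."""
    m = len(y)
    y_hat = sigmoid(X @ w)                 # predicted probabilities
    eps = 1e-12                            # avoid log(0)
    cost = -np.mean(y * np.log(y_hat + eps) + (1 - y) * np.log(1 - y_hat + eps))
    grad = X.T @ (y_hat - y) / m           # dJ/dw
    return cost, grad

# Tiny gradient-descent loop on synthetic data.
rng = np.random.default_rng(3)
X = np.c_[np.ones(200), rng.normal(size=(200, 2))]  # intercept column + 2 features
true_w = np.array([0.5, 2.0, -1.0])
y = (sigmoid(X @ true_w) > rng.uniform(size=200)).astype(float)

w = np.zeros(3)
for _ in range(2000):
    cost, grad = cost_and_gradient(w, X, y)
    w -= 0.1 * grad                        # gradient-descent update
print(cost, w)                             # learned weights approach true_w
```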
Summary
The logistic regression cost function, based on binary cross-entropy, measures how well
predicted probabilities align with actual labels. It penalizes incorrect, confident
predictions more heavily and is optimized to guide the model in generating accurate,
well-calibrated probabilities for each class. This probabilistic framework is essential to
logistic regression’s effectiveness in binary classification tasks.
H. What is the One-vs-All approach in logistic regression?
The One-vs-All (OvA) method, also known as One-vs-Rest (OvR), is a strategy used in multiclass
classification with models that are inherently binary, like logistic regression. Since
logistic regression is designed for binary classification (i.e., distinguishing between two
classes), it needs a modified approach to handle problems with more than two classes.
The One-vs-All method enables this by breaking down the multiclass problem into
multiple binary classification problems.
How One-vs-All Works
1. One Binary Classifier per Class:
- For each class in a multiclass problem, the One-vs-All method creates a separate
binary logistic regression classifier.
- If there are k classes, k binary classifiers are created. Each classifier is trained to
distinguish between one class (positive) and the rest of the classes (negative).
2. Positive vs. Negative Class Assignment:
- For each binary classifier, one class is designated as the "positive" class, while all
other classes are treated as the "negative" class.
- For example, if you have three classes: Class A, Class B, and Class C:
- The first classifier is trained to distinguish Class A (positive) from Classes B and C
(negative).
- The second classifier is trained to distinguish Class B (positive) from Classes A and
C (negative).
- The third classifier is trained to distinguish Class C (positive) from Classes A and B
(negative).
3. Making Predictions:
- To classify a new instance, every classifier outputs the probability that the instance
belongs to its positive class.
- The final prediction is made by selecting the class with the highest probability among
all classifiers.
Example of One-vs-All in Action
Suppose we have a dataset with three classes: Cats, Dogs, and Birds.
- Three binary classifiers are trained: Cats vs. rest, Dogs vs. rest, and Birds vs. rest.
For a new data point, each classifier outputs a probability for its positive class.
- The model assigns the new data point to the class with the highest probability; if the
Cats-vs-rest classifier returns the highest probability, the prediction is Cats.
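A short sketch of One-vs-All using scikit-learn's OneVsRestClassifier wrapper around LogisticRegression; the three-class synthetic dataset is a stand-in for the Cats/Dogs/Birds example:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Hypothetical 3-class problem standing in for Cats / Dogs / Birds.
X, y = make_classification(n_samples=300, n_classes=3, n_informative=4,
                           random_state=0)

ova = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
print(len(ova.estimators_))         # k = 3 binary classifiers, one per class
probs = ova.predict_proba(X[:1])    # one probability per class
print(probs, probs.argmax())        # prediction = class with highest probability
```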
Advantages of One-vs-All
- Efficient for Many Algorithms: OvA can be applied to many binary classifiers and still
give good results.
- Interpretability: Since each binary classifier focuses on one class, it’s easy to analyze
and understand how each class is distinguished from others.
Disadvantages of One-vs-All
- Class Overlap: Sometimes, multiple classifiers may predict the same instance with
high probabilities for different classes. In such cases, selecting the class with the
highest probability may not fully reflect the true confidence.
- Imbalanced Class Problem: When each classifier treats one class as positive and the
rest as negative, some classes might have more examples than others, leading to
potential imbalances in the training process.
Summary
The One-vs-All method is a commonly used strategy in logistic regression for multiclass
classification problems. It converts a multiclass problem into multiple binary
classification problems, where each classifier focuses on identifying one specific class
against all others. This approach is widely used because of its simplicity and
effectiveness, though it may face challenges in terms of training complexity and class
overlap.
I. How do you compare the performance of multiple logistic regression models?
To compare the performance of multiple logistic regression models, we can use several
evaluation metrics and techniques. Here’s a comprehensive approach to comparing
these models:
1. Consistent Evaluation Setup
- Train-Test Split: Ensure that each model is trained and tested on the same dataset
split. This eliminates variation due to different training data, allowing you to attribute
performance differences directly to the models.
2. Evaluation Metrics
Each model’s performance should be evaluated with multiple metrics, especially in
cases of imbalanced classes or di^erent types of predictive goals. Here are key metrics
to consider:
- Accuracy: The percentage of correctly classified instances, but useful mainly if the
classes are balanced.
- Precision measures the proportion of positive predictions that are correct, which is
helpful in cases where false positives are costly.
- Recall measures the proportion of actual positives that are correctly identified,
important when false negatives are costly.
- F1-Score is the harmonic mean of precision and recall, balancing both metrics,
especially useful for imbalanced datasets.
- AUC-ROC measures the model’s ability to distinguish between classes across all
probability thresholds. It’s a reliable metric, especially for imbalanced data, as it
evaluates the model’s discriminatory power independently of the threshold.
- Log-Loss (Cross-Entropy Loss): Measures how close the predicted probabilities are
to the actual labels. It penalizes high-confidence errors more than smaller ones, and a
lower log-loss value indicates better probability calibration.
- Brier Score: This measures the mean squared difference between the predicted
probabilities and the actual outcomes. It’s particularly useful for assessing how
accurately predicted probabilities reflect reality.
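A minimal sketch, assuming scikit-learn, that compares two logistic regression variants on an identical train-test split with several of these metrics; the regularization settings are illustrative choices:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, roc_auc_score, log_loss, brier_score_loss

X, y = make_classification(n_samples=1000, weights=[0.85, 0.15], random_state=0)
# Identical split for every candidate, so differences come from the models alone.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

candidates = {
    "l2, C=1.0": LogisticRegression(max_iter=1000),
    "l1, C=0.1": LogisticRegression(penalty="l1", C=0.1, solver="liblinear"),
}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    prob = model.predict_proba(X_te)[:, 1]
    pred = model.predict(X_te)
    print(f"{name}: f1={f1_score(y_te, pred):.3f} "
          f"auc={roc_auc_score(y_te, prob):.3f} "
          f"log-loss={log_loss(y_te, prob):.3f} "
          f"brier={brier_score_loss(y_te, prob):.3f}")
```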
3. Confusion Matrix
- Error Analysis: By examining false positives and false negatives, you can identify
patterns in misclassification that may suggest ways to improve each model.
4. Calibration Curve
- Why: A calibration curve shows how well a model’s predicted probabilities match
actual outcomes. If your application relies on accurate probabilities rather than just
class predictions, calibration is crucial.
- Analysis: Compare calibration curves to see which model better reflects real-world
probabilities, with curves closer to the diagonal indicating better calibration.
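A short sketch using scikit-learn's calibration_curve to inspect how predicted probabilities track observed outcomes; the dataset and bin count are illustrative:

```python
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
prob = model.predict_proba(X_te)[:, 1]

# Fraction of actual positives in each predicted-probability bin; a
# well-calibrated model tracks the diagonal (frac_pos close to mean_pred).
frac_pos, mean_pred = calibration_curve(y_te, prob, n_bins=10)
for mp, fp in zip(mean_pred, frac_pos):
    print(f"predicted ~{mp:.2f} -> observed {fp:.2f}")
```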
5. Model Agreement
- Cohen’s Kappa: This measure provides insight into inter-model agreement, assessing
if the classification outcomes agree beyond what would be expected by chance.
6. Visual Comparisons
- ROC and Precision-Recall Curves: Plot ROC and precision-recall curves for each
model on the same graph to visually compare their performance across different
thresholds. The model with the curve closer to the top-left (ROC) or top-right (PR)
generally performs better.
- Calibration Plots: For comparing how well probabilities align with actual outcomes,
calibration plots are very effective.
- Lift and Gain Charts: These charts show how well a model captures positive
instances relative to a random model, providing a visual tool to evaluate effectiveness,
especially for marketing or risk assessment contexts.
7. Practical Considerations
- Training and Inference Time: If you’re comparing models for deployment, consider
training time and inference speed, as some logistic regression models may use different
regularization or optimization techniques that impact speed.
- Model Complexity and Interpretability: Simpler models are easier to interpret, which
can be an important factor in contexts requiring model transparency, such as
healthcare or finance.
8. Alignment with Business Goals
- Different models might have similar performance metrics, so choose one that aligns
with your business goals. For example, in a fraud detection system, a model with higher
recall might be preferred, while in marketing, precision might be prioritized.
In practice, overlaying the ROC and precision-recall curves of all candidate models is
one of the quickest ways to compare their discriminatory performance, as in the sketch
below.
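A plotting sketch, assuming scikit-learn and matplotlib; the two regularization variants are illustrative stand-ins for your candidate models:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import RocCurveDisplay, PrecisionRecallDisplay

X, y = make_classification(n_samples=1000, weights=[0.85, 0.15], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

models = {
    "l2": LogisticRegression(max_iter=1000).fit(X_tr, y_tr),
    "l1": LogisticRegression(penalty="l1", C=0.5,
                             solver="liblinear").fit(X_tr, y_tr),
}

# Overlay ROC curves (left) and precision-recall curves (right).
fig, (ax_roc, ax_pr) = plt.subplots(1, 2, figsize=(10, 4))
for name, model in models.items():
    RocCurveDisplay.from_estimator(model, X_te, y_te, name=name, ax=ax_roc)
    PrecisionRecallDisplay.from_estimator(model, X_te, y_te, name=name, ax=ax_pr)
plt.show()
```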