ML Interview Questions
- Overfitting: Model performs well on training data but poorly on unseen data.
- Underfitting: Model performs poorly on both training and test data, failing to capture the underlying patterns.
# Example metrics
training_error = 0.1 # Example training error
validation_error = 0.3 # Example validation error
1. Overfitting:
Overfitting happens when the model learns not only the underlying pattern but
also the noise and details of the training data. This leads to poor generalization
on new, unseen data.
# Signs of Overfitting:
- High accuracy on the training data but low accuracy on the test data.
- A large gap between training and validation errors.
# Causes:
- A model that is too complex (e.g., too many features or too many
parameters).
- Insufficient training data relative to the model complexity.
# Code Example:
Here’s an example using Ridge regression to address overfitting.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.datasets import make_regression
# Synthetic regression data (illustrative parameters)
X, y = make_regression(n_samples=100, n_features=20, noise=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Ridge adds L2 regularization to penalize large coefficients and reduce overfitting
ridge_model = Ridge(alpha=1.0).fit(X_train, y_train)
# Predictions
train_pred = ridge_model.predict(X_train)
test_pred = ridge_model.predict(X_test)
# Compare errors: a large train/test gap signals overfitting
print("Train MSE:", mean_squared_error(y_train, train_pred))
print("Test MSE:", mean_squared_error(y_test, test_pred))
2. Underfitting:
Underfitting occurs when the model is too simple to capture the underlying
patterns in the data. It results in poor performance on both the training and test
datasets.
# Signs of Underfitting:
- High error on both the training and validation/test datasets.
- The model fails to capture the complexity of the data.
# Causes:
- A model that is too simple (e.g., too few features or low model
complexity).
- Insufficient training time or iterations (in neural networks).
# Code Example:
Here’s an example using a polynomial regression to solve underfitting.
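(A minimal sketch; the synthetic dataset and polynomial degree below are illustrative assumptions.)
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error
# Non-linear data that a plain straight-line model would underfit
X = np.linspace(-3, 3, 100).reshape(-1, 1)
y = X.ravel() ** 2 + np.random.normal(0, 0.5, 100)
# A degree-2 polynomial transformation lets the linear model capture the curve
poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly_model.fit(X, y)
print("Train MSE:", mean_squared_error(y, poly_model.predict(X)))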
In this case, using a polynomial transformation helps the model capture more
complex patterns and reduce underfitting.
Summary:
- Overfitting: Model performs well on training data but poorly on test data.
- Fixes: Regularization, simpler models, more data, cross-validation.
- Underfitting: Model performs poorly on both training and test data.
- Fixes: Increase model complexity, train longer, better feature
engineering.
- Loss Function: Measures how well a model's predictions match the actual
outcomes (error for one training example).
- Cost Function: The average of loss functions across all training examples.
- Use in Project: You used loss and cost functions to evaluate the performance
of your machine learning models during training.
- Code Example:
import numpy as np
def mean_squared_error(y_true, y_pred):
    # Cost function: average of the squared errors across all examples
    return np.mean((y_true - y_pred) ** 2)
# Example usage
y_true = np.array([1, 2, 3])
y_pred = np.array([1, 2, 4])
print(mean_squared_error(y_true, y_pred))  # Output: 0.333...
3) Regression Models:
- Models used to predict a continuous outcome based on independent variables (e.g., linear regression, ridge regression). Note that logistic regression, despite its name, is a supervised classification model.
1. Label Encoding
- Description: Each category is assigned a unique integer label. This
method is suitable for ordinal data where the categories have an inherent order.
- Example:
- Colors: Red = 0, Green = 1, Blue = 2
2. One-Hot Encoding
- Description: Converts each category into a new binary column (0s and
1s). This method is useful for nominal data, where categories do not have an order.
- Example:
- Colors: Red = [1, 0, 0], Green = [0, 1, 0], Blue = [0, 0, 1]
3. Binary Encoding
- Description: Each category is first converted into an integer, then
that integer is converted into binary code. This method is efficient for high
cardinality categorical variables.
- Example:
- Colors: Red = 0 (00), Green = 1 (01), Blue = 2 (10) →
- Red = [0, 0], Green = [0, 1], Blue = [1, 0]
4. Frequency Encoding
- Description: Each category is replaced with the frequency of its
occurrence in the dataset. This can help retain some information about the
distribution of categories.
- Example:
- If Red appears 10 times, Green 5 times, and Blue 15 times:
- Red = 10, Green = 5, Blue = 15
6. Ordinal Encoding
- Description: Similar to label encoding, but it specifically considers
the ordinal nature of the categorical data. Each category is assigned a rank based
on its order.
- Example:
- Sizes: Small = 1, Medium = 2, Large = 3
7. Hash Encoding
- Description: Categories are hashed into a fixed number of dimensions,
which helps handle high cardinality without creating too many columns. However, it
may lead to collisions.
- Example:
- Colors: Using a hash function that generates a fixed number of binary
columns.
8. Count Encoding
- Description: Similar to frequency encoding, but the categories are
replaced with their count in the dataset.
- Example:
- Red = 10, Green = 5, Blue = 15
9. Custom Encoding
- Description: Users can define their encoding scheme based on domain
knowledge, often combining several of the above methods.
- Example: Assigning specific numerical values based on business logic.
These encoding techniques play a crucial role in preparing data for machine
learning models, helping ensure that algorithms can effectively learn from the data
provided.
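As a rough illustration (the color data and column names are made up), here is how label, one-hot, and frequency encoding can be done with pandas and scikit-learn:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
colors = pd.DataFrame({"color": ["Red", "Green", "Blue", "Red"]})
# Label encoding: each category becomes an integer
colors["color_label"] = LabelEncoder().fit_transform(colors["color"])
# One-hot encoding: each category becomes its own 0/1 column
one_hot = pd.get_dummies(colors["color"], prefix="color")
# Frequency encoding: replace each category with how often it appears
colors["color_freq"] = colors["color"].map(colors["color"].value_counts())
print(colors.join(one_hot))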
Boosting:
- Goal: Reduce bias (error) and create a strong model by focusing on mistakes.
- Method: Models are trained sequentially, with each new model focusing
on correcting the errors made by the previous ones. The final prediction is a
weighted combination of all models.
- Popular Algorithms: AdaBoost, Gradient Boosting, XGBoost.
- Key Benefit: Increases accuracy by focusing on the hardest-to-predict
examples.
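A minimal sketch of boosting with scikit-learn's GradientBoostingClassifier (the synthetic dataset and hyperparameters are illustrative assumptions):
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
X, y = make_classification(n_samples=500, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
# Trees are added sequentially; each new tree fits the errors of the current ensemble
gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
gb.fit(X_train, y_train)
print("Test accuracy:", gb.score(X_test, y_test))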
---
8) percentage-->
The percentage is a way of expressing a number as a fraction of 100. It’s
used to compare relative proportions or to quantify parts of a whole in a
standardized way. For example, 40% means 40 out of every 100.
---
9) variance-->
Variance measures the spread of data points around the mean in a dataset.
It’s calculated as the average of the squared deviations from the mean and gives
insight into data variability. A higher variance indicates more spread-out data.
---
10) standard deviation-->
Standard deviation is the square root of the variance. It quantifies the
amount of variation or dispersion in a dataset, showing how much individual data
points typically deviate from the mean.
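A quick NumPy sketch of both quantities (the data values are illustrative):
import numpy as np
data = np.array([2, 4, 4, 4, 5, 5, 7, 9])
# Variance: average squared deviation from the mean
variance = np.mean((data - data.mean()) ** 2)
# Standard deviation: square root of the variance
std_dev = np.sqrt(variance)
print(variance, std_dev)  # 4.0 2.0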
---
11) Hypothesis testing-->
Hypothesis testing is a statistical method used to determine if there is
enough evidence in a sample to infer a condition about the larger population. It
involves making an initial assumption (null hypothesis), collecting data, and then
determining whether the data provides enough reason to reject this assumption.
---
12) Types of hypothesis testing-->
Common types include:
- T-tests: Compare means between groups (e.g., two-sample t-test).
- ANOVA (Analysis of Variance): Compare means among three or more groups.
- Chi-Square Test: Test relationships between categorical variables.
- Z-tests: Often used for large samples to compare means.
- Non-parametric Tests: Tests like the Mann-Whitney or Wilcoxon for non-
normal data.
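A minimal sketch of a two-sample t-test with SciPy (the two samples are synthetic assumptions):
import numpy as np
from scipy import stats
rng = np.random.default_rng(42)
group_a = rng.normal(loc=50, scale=5, size=30)
group_b = rng.normal(loc=53, scale=5, size=30)
# Null hypothesis: the two groups have equal means
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
# If p < 0.05, reject the null hypothesis at the 5% significance level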
---
13) When do we do hypothesis testing and why?
Hypothesis testing is used to make informed decisions based on sample data.
It helps to confirm or refute assumptions about population parameters and determine
statistical significance. For example, testing if a new treatment is effective or
if there's a relationship between two variables.
---
14) difference between SQL and pandas: which is used for which purposes in data scientist roles/projects?
- SQL: Used for querying and managing structured data in relational
databases. It's essential for extracting, filtering, and aggregating large datasets
directly from databases.
- Pandas: A Python library for data manipulation and analysis, ideal for in-
memory data processing. It’s commonly used for complex data transformations,
analysis, and feature engineering in data science workflows.
Data scientists use SQL for database operations and Pandas for more detailed
data analysis and processing.
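As a rough illustration (the table, column names, and data are made up), here is the same aggregation expressed in pandas, with the equivalent SQL shown as a comment:
import pandas as pd
# Equivalent SQL, run directly against a database table called sales:
#   SELECT region, SUM(amount) AS total_sales FROM sales GROUP BY region;
sales = pd.DataFrame({
    "region": ["North", "South", "North", "South"],
    "amount": [100, 200, 150, 50],
})
total_sales = sales.groupby("region")["amount"].sum()
print(total_sales)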
---
15) what is time series
A time series is a sequence of data points collected or recorded at regular
time intervals. Examples include stock prices, weather data, and sales figures over
time. Time series analysis helps identify trends, seasonality, and patterns in
data.
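A small pandas sketch (the dates and values are synthetic assumptions):
import numpy as np
import pandas as pd
# Daily observations recorded at regular intervals
dates = pd.date_range("2024-01-01", periods=90, freq="D")
sales = pd.Series(np.random.default_rng(0).normal(100, 10, size=90), index=dates)
# A 7-day rolling mean smooths out noise and exposes the underlying trend
trend = sales.rolling(window=7).mean()
print(trend.tail())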
---
16) random oversampling and undersampling?
- Oversampling: Increases the representation of minority classes by
duplicating instances to balance the class distribution in datasets.
- Undersampling: Reduces the representation of majority classes by removing
instances to balance class distribution.
Both techniques are used to address class imbalance in classification
problems.
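A minimal sketch of both techniques using sklearn.utils.resample (the imbalanced toy dataset is an assumption):
import pandas as pd
from sklearn.utils import resample
# Imbalanced toy dataset: 90 majority-class rows (0) vs 10 minority-class rows (1)
df = pd.DataFrame({"feature": range(100), "label": [0] * 90 + [1] * 10})
majority = df[df.label == 0]
minority = df[df.label == 1]
# Random oversampling: duplicate minority rows until the classes are balanced
minority_up = resample(minority, replace=True, n_samples=len(majority), random_state=42)
balanced_over = pd.concat([majority, minority_up])
# Random undersampling: drop majority rows until the classes are balanced
majority_down = resample(majority, replace=False, n_samples=len(minority), random_state=42)
balanced_under = pd.concat([majority_down, minority])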
---
17) de-duplication in random forest?
Random Forest is inherently resilient to duplicates since it’s an ensemble
method that averages results from many decision trees, each using random samples of
data. However, pre-processing steps like removing duplicates can help improve model
performance and efficiency in some cases.
18) differences?
1) difference between Random Forest and XGBoost?
   1) Model building: Random Forest uses ensemble learning with independently built decision trees; XGBoost uses sequential ensemble learning, with each tree correcting the errors of the previous ones.
   2) Optimization approach: Random Forest makes predictions by averaging individual tree outputs; XGBoost employs gradient boosting to minimize a loss function and improve accuracy iteratively.
   3) Handling unbalanced datasets: Random Forest can struggle a bit; XGBoost handles it like a pro.
   4) Ease of tuning: Random Forest is simple and straightforward; XGBoost requires more practice but offers higher accuracy.
   5) Adaptability to distributed computing: Random Forest works well with multiple machines; XGBoost needs more coordination but can handle large datasets efficiently.
   6) Handling large datasets: Random Forest can handle them but may slow down with very large data; XGBoost is built for speed, perfect for big datasets.
   7) Predictive accuracy: Random Forest is good, but not always the most precise; XGBoost offers superior accuracy, especially in tough situations.
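A minimal sketch contrasting the two models (assumes the xgboost package is installed; the dataset and hyperparameters are illustrative):
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
X, y = make_classification(n_samples=1000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
# Random Forest: independently built trees, predictions averaged (bagging)
rf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_train, y_train)
# XGBoost: trees built sequentially, each correcting the previous ones (boosting)
xgb = XGBClassifier(n_estimators=200, learning_rate=0.1, random_state=42).fit(X_train, y_train)
print("Random Forest accuracy:", rf.score(X_test, y_test))
print("XGBoost accuracy:", xgb.score(X_test, y_test))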
4) difference between linear and logistic regression?
   1) Linear Regression (Regression) is a supervised regression model that predicts a continuous value; Logistic Regression (Classification) is a supervised classification model that predicts a class (e.g., whether a tumour is benign or malignant).
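A minimal sketch of both (the tiny datasets are made-up assumptions):
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
X = np.array([[1], [2], [3], [4], [5], [6]])
# Linear regression predicts a continuous value
y_cont = np.array([1.1, 1.9, 3.2, 3.9, 5.1, 6.0])
lin = LinearRegression().fit(X, y_cont)
print(lin.predict([[7]]))  # continuous output
# Logistic regression predicts a class label (e.g., benign = 0 vs malignant = 1)
y_cls = np.array([0, 0, 0, 1, 1, 1])
log_reg = LogisticRegression().fit(X, y_cls)
print(log_reg.predict([[7]]), log_reg.predict_proba([[7]]))  # class and probability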
19) Assumptions in linear regression?
   1. There is a linear relationship between the dependent and independent variables.
   2. The bias (average residual error) is very low.
   3. There is little or no multicollinearity among the independent variables.
22) Ensemble techniques?
   1) Bagging --- Random Forest
   2) Boosting --- XGBoost
For a binary classification, the Gini index for a node is calculated as:
   Gini = 1 - (p^2 + q^2)
where:
   (p) is the probability of one class,
   (q) is the probability of the other class (for binary classes, q = 1 - p).
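A small sketch computing the Gini index of a binary label column with pandas (the labels are a made-up example):
import pandas as pd
labels = pd.Series([1, 0, 1, 1, 0, 1, 1, 0])
p = labels.mean()  # probability of class 1
q = 1 - p          # probability of class 0
gini = 1 - (p ** 2 + q ** 2)
print(gini)  # 0 = pure node, 0.5 = maximally mixed (binary case)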
Q) What is meant by entropy? How will you find the entropy of a column of a dataset?------->Code
   Entropy measures the degree of disorder (impurity) in the data; a code sketch is shown below.
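A small sketch (the column values are a made-up example) computing the entropy of a dataset column:
import numpy as np
import pandas as pd
column = pd.Series(["yes", "no", "yes", "yes", "no", "yes"])
# Entropy = -sum(p_i * log2(p_i)) over the class probabilities in the column
probs = column.value_counts(normalize=True)
entropy = -np.sum(probs * np.log2(probs))
print(entropy)  # 0 = perfectly pure; 1 = maximally disordered for two classes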
Q) CART Algorithm?
Q)