ML Interview Questions

1) Overfitting & Underfitting:

- Overfitting: Model performs well on training data but poorly on unseen data.
- Underfitting: Model performs poorly on both training and test data, failing
to capture the underlying patterns.

- Use in Project: You monitored model performance metrics to prevent overfitting
and underfitting during model training in your machine learning projects.
- Code Example:

# Example metrics
training_error = 0.1    # Example training error
validation_error = 0.3  # Example validation error

# A large gap between training and validation error suggests overfitting
if training_error <= 0.1 and validation_error > 0.2:
    print("Model is overfitting")

Overfitting and Underfitting are common problems in machine learning (ML)
models. Understanding these issues and how to address them is crucial for building
effective models.

1. Overfitting:
Overfitting happens when the model learns not only the underlying pattern but
also the noise and details of the training data. This leads to poor generalization
on new, unseen data.

# Signs of Overfitting:
- High accuracy on the training data but low accuracy on the test data.
- A large gap between training and validation errors.

# Causes:
- A model that is too complex (e.g., too many features or too many
parameters).
- Insufficient training data relative to the model complexity.

# How to address Overfitting:
- Simplify the model: Use a less complex model with fewer parameters.
- Regularization: Add regularization terms like L1 (Lasso) or L2 (Ridge) to
penalize large coefficients.
- More data: If possible, increasing the size of the training data can help.
- Cross-validation: Use k-fold cross-validation to ensure the model
generalizes well.
- Early stopping: When training neural networks, stop training when
performance on validation data starts to degrade.

# Code Example:
Here’s an example using Ridge regression to address overfitting.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.datasets import make_regression

# Generating synthetic data
X, y = make_regression(n_samples=1000, n_features=20, noise=0.1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Ridge regression (L2 regularization)
ridge_model = Ridge(alpha=1.0)
ridge_model.fit(X_train, y_train)

# Predictions
train_pred = ridge_model.predict(X_train)
test_pred = ridge_model.predict(X_test)

# Mean Squared Error (MSE)
print(f'Train MSE: {mean_squared_error(y_train, train_pred)}')
print(f'Test MSE: {mean_squared_error(y_test, test_pred)}')

In this case, Ridge regression helps control overfitting by adding a penalty
to large coefficients.

2. Underfitting:
Underfitting occurs when the model is too simple to capture the underlying
patterns in the data. It results in poor performance on both the training and test
datasets.

# Signs of Underfitting:
- High error on both the training and validation/test datasets.
- The model fails to capture the complexity of the data.

# Causes:
- A model that is too simple (e.g., too few features or low model
complexity).
- Insufficient training time or iterations (in neural networks).

# How to address Underfitting:
- Increase model complexity: Use a more complex model with more features or
layers (in neural networks).
- Train for longer: If you're using neural networks, train for more epochs.
- Feature engineering: Add more relevant features or use more powerful
transformations.
- Remove regularization: If you're using regularization, consider reducing it
or turning it off.

# Code Example:
Here’s an example using a polynomial regression to solve underfitting.

from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.datasets import make_regression

# Generating synthetic data
X, y = make_regression(n_samples=1000, n_features=2, noise=30)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Polynomial transformation (increase model complexity)
poly = PolynomialFeatures(degree=3)  # Trying a polynomial of degree 3
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)

# Linear regression on the transformed data
model = LinearRegression()
model.fit(X_train_poly, y_train)

# Predictions
train_pred = model.predict(X_train_poly)
test_pred = model.predict(X_test_poly)

# Mean Squared Error (MSE)
print(f'Train MSE: {mean_squared_error(y_train, train_pred)}')
print(f'Test MSE: {mean_squared_error(y_test, test_pred)}')

In this case, using a polynomial transformation helps the model capture more
complex patterns and reduce underfitting.

Summary:
- Overfitting: Model performs well on training data but poorly on test data.
- Fixes: Regularization, simpler models, more data, cross-validation.
- Underfitting: Model performs poorly on both training and test data.
- Fixes: Increase model complexity, train longer, better feature
engineering.

These adjustments can significantly improve model performance and ensure
better generalization to unseen data.


2) Loss & Cost Functions:

In machine learning, loss is the error or difference between predicted
and actual values for a single data point, while cost is the average of these
losses across all data points in a dataset.

- Loss Function: Measures how well a model's predictions match the actual
outcomes (error for one training example).
- Cost Function: The average of loss functions across all training examples.
- Use in Project: You used loss and cost functions to evaluate the performance
of your machine learning models during training.
- Code Example:

import numpy as np

# Mean Squared Error as a loss function
def mean_squared_error(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

# Example usage
y_true = np.array([1, 2, 3])
y_pred = np.array([1, 2, 4])
print(mean_squared_error(y_true, y_pred))  # Output: 0.333...

3) Regression Models:
- Models used to predict a continuous outcome based on independent variables
(e.g., linear regression, ridge regression, polynomial regression).

4) Encoding types in Machine Learning?

In machine learning, encoding techniques are essential for converting
categorical data into numerical formats that algorithms can work with. Here are
some common types of encoding methods:

1. Label Encoding
- Description: Each category is assigned a unique integer label. This
method is suitable for ordinal data where the categories have an inherent order.
- Example:
- Colors: Red = 0, Green = 1, Blue = 2

2. One-Hot Encoding
- Description: Converts each category into a new binary column (0s and
1s). This method is useful for nominal data, where categories do not have an order.
- Example:
- Colors: Red = [1, 0, 0], Green = [0, 1, 0], Blue = [0, 0, 1]

3. Binary Encoding
- Description: Each category is first converted into an integer, then
that integer is converted into binary code. This method is efficient for high
cardinality categorical variables.
- Example:
- Colors: Red = 0 (00), Green = 1 (01), Blue = 2 (10) →
- Red = [0, 0], Green = [0, 1], Blue = [1, 0]

4. Frequency Encoding
- Description: Each category is replaced with the frequency of its
occurrence in the dataset. This can help retain some information about the
distribution of categories.
- Example:
- If Red appears 10 times, Green 5 times, and Blue 15 times:
- Red = 10, Green = 5, Blue = 15

5. Target Encoding (Mean Encoding)
- Description: Categories are replaced with the mean of the target
variable for each category. This method can be powerful but may lead to overfitting
if not handled properly.
- Example:
- If Red has an average target value of 2, Green 3, and Blue 1:
- Red = 2, Green = 3, Blue = 1

6. Ordinal Encoding
- Description: Similar to label encoding, but it specifically considers
the ordinal nature of the categorical data. Each category is assigned a rank based
on its order.
- Example:
- Sizes: Small = 1, Medium = 2, Large = 3

7. Hash Encoding
- Description: Categories are hashed into a fixed number of dimensions,
which helps handle high cardinality without creating too many columns. However, it
may lead to collisions.
- Example:
- Colors: Using a hash function that generates a fixed number of binary
columns.

8. Count Encoding
- Description: Similar to frequency encoding, but the categories are
replaced with their count in the dataset.
- Example:
- Red = 10, Green = 5, Blue = 15

9. Custom Encoding
- Description: Users can define their encoding scheme based on domain
knowledge, often combining several of the above methods.
- Example: Assigning specific numerical values based on business logic.

When to Use Each Encoding
- Label Encoding: Use for ordinal variables where the order matters.
- One-Hot Encoding: Use for nominal variables with no intrinsic
ordering.
- Binary Encoding: Use for high cardinality categorical variables to
save space.
- Frequency and Target Encoding: Use when you want to retain some
information about the distribution of the categorical variable, but be cautious of
overfitting.

These encoding techniques play a crucial role in preparing data for machine
learning models, helping ensure that algorithms can effectively learn from the data
provided.
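
For illustration, here is a minimal sketch (assuming pandas and scikit-learn are
available; the colour column is made up) showing label, one-hot, and frequency
encoding on the same data:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Small illustrative dataset (hypothetical values)
df = pd.DataFrame({'color': ['Red', 'Green', 'Blue', 'Green', 'Red']})

# Label encoding: each category gets a unique integer code
le = LabelEncoder()
df['color_label'] = le.fit_transform(df['color'])

# One-hot encoding: one binary column per category
one_hot = pd.get_dummies(df['color'], prefix='color')

# Frequency encoding: replace each category with its count in the column
df['color_freq'] = df['color'].map(df['color'].value_counts())

print(df)
print(one_hot)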

5) What is bagging and boosting in Machine Learning?

Bagging and Boosting are both ensemble learning techniques used to improve
the performance of machine learning models by combining multiple weaker models
(often called base learners) into a stronger one. Here's a quick overview:

Bagging (Bootstrap Aggregating):
- Goal: Reduce variance and avoid overfitting.
- Method: Multiple models (often decision trees) are trained
independently on different random subsets of the dataset (with replacement). The
final prediction is made by averaging (for regression) or majority voting (for
classification).
- Popular Algorithm: Random Forest.
- Key Benefit: Stabilizes model by reducing variance.

Boosting:
- Goal: Reduce bias (error) and create a strong model by focusing on
mistakes.
- Method: Models are trained sequentially, with each new model focusing
on correcting the errors made by the previous ones. The final prediction is a
weighted combination of all models.
- Popular Algorithms: AdaBoost, Gradient Boosting, XGBoost.
- Key Benefit: Increases accuracy by focusing on the hardest-to-predict
examples.

In short, bagging builds independent models in parallel to reduce variance,
while boosting builds models sequentially to reduce bias and improve accuracy.
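
As a rough sketch of the two approaches (using scikit-learn on a synthetic
dataset; the models and settings below are illustrative assumptions, not the only
choices):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score

# Synthetic classification data (illustrative only)
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Bagging: independent trees trained on bootstrap samples (Random Forest)
bagging = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

# Boosting: trees trained sequentially, each correcting the previous ones' errors
boosting = GradientBoostingClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

print('Bagging accuracy :', accuracy_score(y_test, bagging.predict(X_test)))
print('Boosting accuracy:', accuracy_score(y_test, boosting.predict(X_test)))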

6) What is hyperparameter tuning?

Hyperparameter tuning is the process of optimizing the hyperparameters of a
machine learning model to improve its performance. Hyperparameters are parameters
that govern the training process or the structure of the model itself, and unlike
model parameters, they are not learned directly from the data. Instead, they are
set before training begins. Examples include the learning rate, number of hidden
layers in a neural network, tree depth in decision trees, and the number of
neighbors in a K-Nearest Neighbors model.

The goal of hyperparameter tuning is to find the optimal combination of these
hyperparameters that minimizes a model's error on validation data, helping achieve
the best generalization on new, unseen data. Tuning methods often include:

1. Grid Search: Tests all possible combinations of hyperparameters within a
specified range.
2. Random Search: Randomly samples hyperparameter combinations within a
specified range, which can be more efficient than grid search.
3. Bayesian Optimization: Uses probabilistic models to predict the
performance of different hyperparameter combinations, focusing on promising areas
in the search space.
4. Automated Methods: Techniques like Hyperband or Optuna that adaptively
adjust hyperparameters as the model trains.

Effective hyperparameter tuning can significantly improve model accuracy and
reduce overfitting or underfitting.
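
For example, a minimal grid-search sketch with scikit-learn (the dataset and the
parameter ranges below are illustrative assumptions):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data (illustrative only)
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# Hyperparameter grid to search over (example values)
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 5, None],
}

# 5-fold cross-validated grid search
grid = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
grid.fit(X, y)

print('Best hyperparameters:', grid.best_params_)
print('Best CV score:', grid.best_score_)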

7) Mean, median, mode:
Mean is the average of the values (numerical data); median is the middle value
when the data is sorted (numerical data); mode is the most frequently occurring
value (typically used for categorical data).
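
A minimal sketch with pandas (the sample values are made up):

import pandas as pd

values = pd.Series([2, 4, 4, 5, 7, 9])

print('Mean  :', values.mean())     # average of the values
print('Median:', values.median())   # middle value of the sorted data
print('Mode  :', values.mode()[0])  # most frequently occurring value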

---
8) percentage-->
The percentage is a way of expressing a number as a fraction of 100. It’s
used to compare relative proportions or to quantify parts of a whole in a
standardized way. For example, 40% means 40 out of every 100.

---
9) variance-->
Variance measures the spread of data points around the mean in a dataset.
It’s calculated as the average of the squared deviations from the mean and gives
insight into data variability. A higher variance indicates more spread-out data.

---
10) standard deviation-->
Standard deviation is the square root of the variance. It quantifies the
amount of variation or dispersion in a dataset, showing how much individual data
points typically deviate from the mean.
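
A quick NumPy sketch of both measures (the data is made up; ddof=1 gives the
sample versions that divide by N - 1):

import numpy as np

data = np.array([2, 4, 4, 4, 5, 5, 7, 9])

# Population variance and standard deviation (divide by N)
print('Variance          :', np.var(data))
print('Standard deviation:', np.std(data))

# Sample variance and standard deviation (divide by N - 1)
print('Sample variance   :', np.var(data, ddof=1))
print('Sample std dev    :', np.std(data, ddof=1))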

---
11) Hypothesis testing-->
Hypothesis testing is a statistical method used to determine if there is
enough evidence in a sample to infer a condition about the larger population. It
involves making an initial assumption (null hypothesis), collecting data, and then
determining whether the data provides enough reason to reject this assumption.

---
12) Types of hypothesis testing-->
Common types include:
- T-tests: Compare means between groups (e.g., two-sample t-test).
- ANOVA (Analysis of Variance): Compare means among three or more groups.
- Chi-Square Test: Test relationships between categorical variables.
- Z-tests: Often used for large samples to compare means.
- Non-parametric Tests: Tests like the Mann-Whitney or Wilcoxon for non-
normal data.

---
13) When do we do hypothesis testing and why do we do it?
Hypothesis testing is used to make informed decisions based on sample data.
It helps to confirm or refute assumptions about population parameters and determine
statistical significance. For example, testing if a new treatment is effective or
if there's a relationship between two variables.
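
As a minimal sketch, a two-sample t-test with SciPy on synthetic data (the two
groups and the 0.05 significance level are illustrative assumptions):

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Two synthetic samples, e.g. outcomes under treatment vs. control
group_a = rng.normal(loc=50, scale=5, size=100)
group_b = rng.normal(loc=52, scale=5, size=100)

# Null hypothesis: the two groups have the same mean
t_stat, p_value = stats.ttest_ind(group_a, group_b)

alpha = 0.05
print(f't = {t_stat:.3f}, p = {p_value:.4f}')
if p_value < alpha:
    print('Reject the null hypothesis: the means differ significantly.')
else:
    print('Fail to reject the null hypothesis.')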

---
14) Difference between SQL and pandas: which one is used for which purposes in
data science roles/projects?
- SQL: Used for querying and managing structured data in relational
databases. It's essential for extracting, filtering, and aggregating large datasets
directly from databases.
- Pandas: A Python library for data manipulation and analysis, ideal for in-
memory data processing. It’s commonly used for complex data transformations,
analysis, and feature engineering in data science workflows.
Data scientists use SQL for database operations and Pandas for more detailed
data analysis and processing.
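
For instance, the same aggregation can be expressed in SQL or in pandas (the
table and column names below are hypothetical):

import pandas as pd

# SQL (runs inside the database):
#   SELECT region, AVG(sales) AS avg_sales
#   FROM orders
#   GROUP BY region;

# Pandas (runs in memory on a DataFrame):
orders = pd.DataFrame({
    'region': ['North', 'South', 'North', 'South'],
    'sales': [100, 150, 200, 50],
})
avg_sales = orders.groupby('region')['sales'].mean().rename('avg_sales').reset_index()
print(avg_sales)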

---
15) What is a time series?
A time series is a sequence of data points collected or recorded at regular
time intervals. Examples include stock prices, weather data, and sales figures over
time. Time series analysis helps identify trends, seasonality, and patterns in
data.
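
A small pandas sketch (synthetic daily data, purely illustrative) of a time
series with a monthly resample and a rolling mean:

import numpy as np
import pandas as pd

# Synthetic daily sales series for one year
dates = pd.date_range('2023-01-01', periods=365, freq='D')
sales = pd.Series(100 + np.random.randn(365).cumsum(), index=dates)

# Monthly averages and a 7-day rolling mean to smooth out noise
monthly_avg = sales.resample('M').mean()
rolling_7d = sales.rolling(window=7).mean()

print(monthly_avg.head())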

---
16) Random oversampling and undersampling?
- Oversampling: Increases the representation of minority classes by
duplicating instances to balance the class distribution in datasets.
- Undersampling: Reduces the representation of majority classes by removing
instances to balance class distribution.
Both techniques are used to address class imbalance in classification
problems.
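
A minimal sketch using sklearn.utils.resample on a hypothetical imbalanced
DataFrame (libraries such as imbalanced-learn also provide ready-made
RandomOverSampler / RandomUnderSampler classes):

import pandas as pd
from sklearn.utils import resample

# Hypothetical imbalanced dataset: 95 majority (0) rows vs 5 minority (1) rows
df = pd.DataFrame({'feature': range(100), 'label': [0] * 95 + [1] * 5})
majority = df[df['label'] == 0]
minority = df[df['label'] == 1]

# Random oversampling: duplicate minority rows (sampling with replacement)
minority_up = resample(minority, replace=True, n_samples=len(majority), random_state=42)
oversampled = pd.concat([majority, minority_up])

# Random undersampling: drop majority rows (sampling without replacement)
majority_down = resample(majority, replace=False, n_samples=len(minority), random_state=42)
undersampled = pd.concat([majority_down, minority])

print(oversampled['label'].value_counts())
print(undersampled['label'].value_counts())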

---
17) De-duplication in random forest?
Random Forest is inherently resilient to duplicates since it’s an ensemble
method that averages results from many decision trees, each using random samples of
data. However, pre-processing steps like removing duplicates can help improve model
performance and efficiency in some cases.
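
If duplicates do need to be removed first, a simple pandas step (assuming the
training data sits in a DataFrame) is usually enough:

import pandas as pd

# Hypothetical training DataFrame with repeated rows
df = pd.DataFrame({'x1': [1, 1, 2, 3], 'x2': [4, 4, 5, 6], 'y': [0, 0, 1, 1]})

# Drop exact duplicate rows before fitting the model
df_dedup = df.drop_duplicates()
print(df_dedup)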

18) Differences?
1) Difference between Random Forest and Support Vector Machine (SVM)?
   1) Random Forest works well on large datasets; SVM works well on small or medium datasets.
   2) Random Forest captures complex non-linear patterns; SVM is effective for linearly separable data.
   3) Random Forest trains its decision trees in parallel for efficiency; SVM training may be slow on large datasets.
   4) Random Forest is an ensemble of multiple decision trees; SVM is a single model.

2) Difference between Random Forest and Logistic Regression?
   1) Random Forest is suitable for both classification and regression; logistic regression is suitable only for (binary) classification.
   2) Random Forest makes predictions based on an ensemble of decision trees; logistic regression makes predictions based on the logistic function.
   3) Random Forest can handle missing values, outliers, and non-linear relationships; logistic regression assumes a linear relationship between the independent and dependent variables.
   4) Random Forest is more accurate and robust compared to an individual decision tree; logistic regression is less accurate and robust compared to Random Forest.
   5) Random Forest can handle large amounts of data efficiently; logistic regression cannot handle large amounts of data as efficiently.
   6) Random Forest can be time-consuming to train; logistic regression is quick to train compared to Random Forest.

3) Difference between Random Forest and XGBoost?
   1) Model building: Random Forest uses ensemble learning with independently built decision trees; XGBoost uses sequential ensemble learning, with each tree correcting the errors of the previous ones.
   2) Optimization approach: Random Forest makes predictions by averaging individual tree outputs; XGBoost employs gradient boosting to minimize a loss function and improve accuracy iteratively.
   3) Handling unbalanced datasets: Random Forest can struggle a bit; XGBoost handles them well.
   4) Ease of tuning: Random Forest is simple and straightforward; XGBoost requires more practice but offers higher accuracy.
   5) Adaptability to distributed computing: Random Forest works well with multiple machines; XGBoost needs more coordination but can handle large datasets efficiently.
   6) Handling large datasets: Random Forest can handle them but may slow down with very large data; XGBoost is built for speed, perfect for big datasets.
   7) Predictive accuracy: Random Forest is good, but not always the most precise; XGBoost offers superior accuracy, especially in tough situations.
4) Difference between linear and logistic regression?
   1) Linear regression is a supervised regression model; logistic regression is a supervised classification model.
   2) In linear regression, we predict a continuous numeric value; in logistic regression, we predict the value as 1 or 0.
   3) In linear regression, no threshold value is needed; in logistic regression, a threshold value is applied.
   4) In linear regression, we calculate Root Mean Square Error (RMSE) to evaluate the predictions; in logistic regression, we use metrics such as precision.
   5) Linear regression is based on least squares estimation; logistic regression is based on maximum likelihood estimation.
   6) Linear regression is used to estimate the dependent variable when the independent variables change (for example, predicting house prices); logistic regression is used to calculate the probability of an event (for example, classifying whether tissue is benign or malignant).
19) Assumptions in linear regression?
1. There is a linear relationship between the dependent and independent variables.
2. Bias is very low.
3. There is little or no multicollinearity among the independent variables.

20) What activation function is used in logistic regression to decide the
threshold value (sigmoid or ReLU)?
The sigmoid function is used, because its output can be interpreted as a probability.

21) How to create a logistic function?
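
One minimal way to write the logistic (sigmoid) function and apply a 0.5
threshold, as a sketch with NumPy:

import numpy as np

def sigmoid(z):
    # Logistic (sigmoid) function: maps any real value to a probability in (0, 1)
    return 1 / (1 + np.exp(-z))

# Example: convert raw scores to probabilities, then apply a 0.5 threshold
scores = np.array([-2.0, 0.0, 1.5])
probs = sigmoid(scores)
preds = (probs >= 0.5).astype(int)

print(probs)  # probabilities between 0 and 1
print(preds)  # class labels 0 or 1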

22) Ensemble techniques?
1) Bagging  --- Random Forest
2) Boosting --- XGBoost

23) Gini index in random forest?

The Gini index (or Gini impurity) is a measure of the “impurity” or diversity
of data in decision trees, which are used in Random Forest models. It quantifies
how often a randomly chosen element from the set would be incorrectly labeled if it
were randomly labeled according to the distribution of labels in the set.

For a binary classification, the Gini index for a node is calculated as:

    Gini = 1 - (p^2 + q^2)

where:
    p is the probability of one class,
    q is the probability of the other class (for binary classes, q = 1 - p).

# How the Gini Index is Used in Random Forest
In a Random Forest, Gini impurity helps select the best feature and split at
each node of each decision tree. During training, the algorithm evaluates potential
splits by calculating the Gini index for each feature, and the split that best
reduces impurity (lowest Gini index) is chosen. Lower Gini values mean purer nodes,
where each node ideally has samples from only one class, contributing to the
overall accuracy and stability of the Random Forest model.
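
As an illustration, a small sketch that computes the Gini impurity of a label
column with pandas (the example labels are made up):

import pandas as pd

def gini_impurity(labels):
    # Gini impurity: 1 minus the sum of squared class probabilities
    probs = pd.Series(labels).value_counts(normalize=True)
    return 1 - (probs ** 2).sum()

print(gini_impurity([0, 0, 1, 1]))  # 0.5, the maximum impurity for two classes
print(gini_impurity([0, 0, 0, 0]))  # 0.0, a pure node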

Q) What is meant by entropy? How will you find the entropy of a column of a
dataset? -------> Code
Entropy is the degree of disorder (impurity) in the data.
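
A minimal sketch that computes the entropy of a column of a dataset with pandas
and NumPy (the 'target' column name is hypothetical):

import numpy as np
import pandas as pd

def entropy(column):
    # Shannon entropy in bits: -sum(p * log2(p)) over the class probabilities
    probs = column.value_counts(normalize=True)
    return -(probs * np.log2(probs)).sum()

# Example with a hypothetical 'target' column
df = pd.DataFrame({'target': ['yes', 'yes', 'no', 'no', 'yes', 'no']})
print(entropy(df['target']))  # 1.0 bit for a perfectly balanced binary column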

Q) CART Algorithm?

