Viva Questions and Answers on Machine Learning

1. What is Bagging (Bootstrap Aggregating)?

● Answer: Bagging involves training multiple models (usually of the same type) on
different subsets of the training data, created by bootstrapping (sampling with
replacement). The final prediction is made by averaging (for regression) or voting (for
classification) across all the models.
● Advantages:
○ Reduces overfitting.
○ Improves model stability and accuracy.
● Disadvantages:
○ Can be computationally expensive.
○ Mainly reduces variance, so it helps little with high-bias (underfitting) models.
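
A minimal sketch of bagging with scikit-learn (assumed installed); the dataset and hyperparameters are illustrative, not prescriptive:

```python
# Bagging decision trees: each tree sees a different bootstrap sample.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=42)  # toy dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Each of the 50 trees is fit on a sample drawn with replacement;
# the ensemble predicts by majority vote across the trees.
bag = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # "base_estimator" in older sklearn
    n_estimators=50,
    bootstrap=True,
    random_state=42,
)
bag.fit(X_train, y_train)
print(bag.score(X_test, y_test))
```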

2. What is Random Forest?

● Answer: Random Forest is a bagging-based ensemble of decision trees: each tree is
trained on a bootstrap sample, and each split considers only a random subset of the
features. Predictions are averaged (regression) or decided by majority vote
(classification).
● Advantages:
○ Handles large datasets well.
○ Robust to overfitting and noise.
● Disadvantages:
○ Can be computationally intensive.
○ Less interpretable compared to a single decision tree.
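
A minimal random forest sketch with scikit-learn; data and settings are illustrative:

```python
# Random forest: bagged trees plus random feature subsets at each split.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# max_features="sqrt" makes each split consider only a random subset of
# features, which decorrelates the trees.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                random_state=0)
forest.fit(X, y)
print(forest.feature_importances_[:5])  # importance scores, one per feature
```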

3. What is Boosting?

● Answer: Boosting is an ensemble technique that trains models sequentially. Each new
model corrects the errors made by the previous one, focusing on difficult-to-classify
instances.
● Advantages:
○ Often improves performance and accuracy.
○ Can turn weak learners into strong learners.
● Disadvantages:
○ Prone to overfitting, especially with noisy data.
○ Can be computationally expensive due to sequential training.
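
A sketch of AdaBoost, one classic boosting algorithm, using scikit-learn (data and settings are illustrative):

```python
# AdaBoost: shallow trees trained sequentially on re-weighted samples.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)

# Depth-1 trees ("stumps") are trained one after another; after each round
# the misclassified samples get larger weights, so the next stump focuses
# on the hard cases.
boost = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # "base_estimator" in older sklearn
    n_estimators=100,
    random_state=0,
)
boost.fit(X, y)
```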
4. What is an Ensemble of Machine Learning Algorithms (Ensemble
Classifier)?

● Answer: Ensemble methods combine the predictions of multiple models (either the
same type or different types) to improve overall accuracy. Common methods include
bagging, boosting, and stacking.
● Advantages:
○ Can outperform individual models by reducing bias and variance.
○ More robust to overfitting and can handle complex data.
● Disadvantages:
○ More computationally demanding.
○ Difficult to interpret due to multiple models being combined.
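
A minimal stacking sketch with scikit-learn, combining models of different types (choices of base models are illustrative):

```python
# Stacking: heterogeneous base models plus a meta-learner.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, random_state=0)

# A logistic regression combines the cross-validated predictions of the
# two base models into the final prediction.
stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=0)),
                ("svm", SVC(random_state=0))],
    final_estimator=LogisticRegression(),
)
stack.fit(X, y)
```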

5. What is Feature Engineering?

● Answer: Feature engineering is the process of creating, selecting, and transforming
features from raw data to improve model performance. It involves handling missing
values, encoding categorical data, scaling features, and creating new features.
● Advantages:
○ Can significantly improve model accuracy.
○ Provides deeper insights into the data.
● Disadvantages:
○ Time-consuming and requires domain knowledge.
○ Poor feature engineering can lead to worse model performance.
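
As a compact illustration, a minimal scikit-learn pipeline covering imputation, scaling, and encoding; the toy columns are hypothetical (the individual steps are broken out in the checklist later in this document):

```python
# Chain imputation, scaling, and one-hot encoding in a single pipeline.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({"age": [25.0, None, 40.0], "city": ["A", "B", "A"]})

prep = ColumnTransformer([
    # numeric column: fill missing values, then scale
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), ["age"]),
    # categorical column: one-hot encode
    ("cat", OneHotEncoder(), ["city"]),
])
features = prep.fit_transform(df)
```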

6. What is Exploratory Data Analysis (EDA)?

● Answer: EDA is the process of analyzing data sets to summarize their main
characteristics, often using visual methods (such as histograms, box plots) and statistical
methods (mean, median, variance).
● Advantages:
○ Helps in understanding the underlying data patterns and structures.
○ Identifies potential issues such as outliers, missing values, or correlations.
● Disadvantages:
○ Can be time-consuming.
○ Requires domain knowledge to interpret the results accurately.
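
A sketch of a first EDA pass with pandas and matplotlib (assumed installed); "data.csv" is a hypothetical file name:

```python
# Quick numeric summaries, missing-value counts, and histograms.
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("data.csv")
print(df.describe())               # mean, std, quartiles per numeric column
print(df.isna().sum())             # missing values per column
print(df.corr(numeric_only=True))  # pairwise correlations
df.hist(bins=30)                   # histograms of all numeric columns
plt.show()
```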

7. What is a Transaction Data Model?

● Answer: A transaction data model represents data involving individual transactions or
events. It tracks activities like purchases, sales, or user interactions.
● Advantages:
○ Useful in analyzing customer behavior and transactional data.
○ Helps in applications like market basket analysis.
● Disadvantages:
○ May become large and difficult to manage with many transactions.
○ Can require complex systems to store and analyze.
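
A sketch of what such data often looks like in "long" form, one row per item in a transaction; the table and column names are hypothetical:

```python
# Transactional data: one row per (transaction, item) pair.
import pandas as pd

txns = pd.DataFrame({
    "transaction_id": [1, 1, 2, 2, 2, 3],
    "item": ["bread", "milk", "bread", "butter", "milk", "butter"],
})

# Regroup into one itemset per transaction for market basket analysis.
baskets = txns.groupby("transaction_id")["item"].apply(list)
print(baskets.tolist())  # [['bread', 'milk'], ['bread', 'butter', 'milk'], ['butter']]
```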

8. What are Association Rules?

● Answer: Association rules are used to identify relationships between items in transaction
data. The goal is to discover items that frequently occur together.
● Advantages:
○ Useful for market basket analysis and recommendation systems.
○ Helps in identifying patterns and relationships in data.
● Disadvantages:
○ May lead to a large number of trivial rules.
○ Rules can be computationally expensive to generate.
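
A small worked sketch of the two core rule metrics, support and confidence, for the rule {bread} -> {milk} on four toy transactions:

```python
# Support = fraction of baskets containing both items;
# confidence = P(milk | bread) among baskets containing bread.
transactions = [["bread", "milk"], ["bread", "butter", "milk"],
                ["butter", "milk"], ["bread", "butter"]]

n = len(transactions)
both = sum(1 for t in transactions if {"bread", "milk"} <= set(t))   # 2
bread = sum(1 for t in transactions if "bread" in t)                 # 3

support = both / n         # 0.5  -> bread and milk co-occur in half the baskets
confidence = both / bread  # ~0.67 -> buying bread implies milk 2 times in 3
```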

9. What is the Apriori Algorithm?

● Answer: The Apriori algorithm is used to find frequent itemsets in a dataset and
generate association rules based on minimum support and confidence thresholds.
● Advantages:
○ Simple and easy to understand.
○ Prunes the search space effectively: any superset of an infrequent itemset can be
skipped, which keeps sparse transaction data tractable.
● Disadvantages:
○ Can be slow for very large datasets.
○ Needs a lot of memory to store candidate itemsets.
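
A minimal sketch using the third-party mlxtend library (assumed installed via "pip install mlxtend"; scikit-learn has no built-in Apriori):

```python
# Mine frequent itemsets and derive rules with mlxtend's Apriori.
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules
from mlxtend.preprocessing import TransactionEncoder

transactions = [["bread", "milk"], ["bread", "butter", "milk"],
                ["butter", "milk"], ["bread", "butter"]]

# One-hot encode the baskets, then mine itemsets above the support threshold.
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit_transform(transactions), columns=te.columns_)
frequent = apriori(onehot, min_support=0.5, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```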

10. What is Dimensionality Reduction?

● Answer: Dimensionality reduction is a technique used to reduce the number of input
variables (features) in a dataset while retaining important information. Common methods
include PCA (Principal Component Analysis) and LDA (Linear Discriminant Analysis).
● Advantages:
○ Reduces computational cost and overfitting.
○ Helps visualize high-dimensional data.
● Disadvantages:
○ Can result in loss of information.
○ May be difficult to interpret if the reduced dimensions don’t have clear meanings.
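
A minimal PCA sketch with scikit-learn, reducing the iris dataset's 4 features to 2:

```python
# Project standardized features onto the 2 top principal components.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = load_iris().data
X_scaled = StandardScaler().fit_transform(X)  # PCA is sensitive to scale

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)
print(pca.explained_variance_ratio_)  # share of variance kept per component
```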
Here’s a concise list of the steps in Feature Engineering for Machine Learning (ML):

1. Data Collection

● Gather raw data from various sources.

2. Data Cleaning

● Handle missing values (imputation or removal).
● Remove duplicates and fix errors.
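
A sketch of basic cleaning with pandas; the file and column names are hypothetical:

```python
# Impute, drop, and de-duplicate in a few lines.
import pandas as pd

df = pd.read_csv("raw_data.csv")
df["age"] = df["age"].fillna(df["age"].median())  # impute a numeric column
df = df.dropna(subset=["target"])                 # or drop incomplete rows
df = df.drop_duplicates()                         # remove exact duplicate rows
```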

3. Feature Transformation

● Scaling/Normalization: Standardize or normalize features to bring them to the same
scale.
● Log Transformation: Apply transformations (e.g., log or square root) to handle skewed
data.
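
A sketch of both transformations, with a toy matrix whose second column is heavily skewed:

```python
# Standardize, min-max scale, and log-transform the same data.
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0, 100.0], [2.0, 10_000.0], [3.0, 1_000_000.0]])

X_std = StandardScaler().fit_transform(X)    # zero mean, unit variance
X_minmax = MinMaxScaler().fit_transform(X)   # rescaled to [0, 1]
X_log = np.log1p(X)                          # log(1 + x) tames the skewed column
```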

4. Handling Categorical Data

● One-Hot Encoding: Convert categorical variables into binary vectors.
● Label Encoding: Convert categories to numeric labels.
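
A sketch showing both encodings on a toy column:

```python
# One-hot vs. label encoding of a categorical column.
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"color": ["red", "green", "red", "blue"]})

onehot = pd.get_dummies(df["color"], prefix="color")  # 3 binary columns
labels = LabelEncoder().fit_transform(df["color"])    # [2, 1, 2, 0] (alphabetical)
```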

5. Feature Creation

● Generate new features from existing ones (e.g., interaction terms, date components,
aggregations).
● Example: create a price-per-unit feature from total price and quantity.
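
A sketch of that exact example with pandas; the column names are hypothetical:

```python
# Derive price per unit from two existing columns.
import pandas as pd

df = pd.DataFrame({"total_price": [10.0, 24.0], "quantity": [2, 8]})
df["price_per_unit"] = df["total_price"] / df["quantity"]  # 5.0 and 3.0
```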

6. Feature Selection

● Select the most relevant features using techniques like correlation analysis, chi-square
tests, or model-based methods.
● Eliminate irrelevant or redundant features.
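
A minimal sketch of chi-square selection with scikit-learn on the iris dataset:

```python
# Keep the 2 features most associated with the target.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)
X_best = SelectKBest(chi2, k=2).fit_transform(X, y)
```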

7. Outlier Detection and Removal

● Identify and handle outliers using methods like Z-scores or IQR (Interquartile Range).
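
A sketch flagging outliers with both methods on a toy array:

```python
# Z-score rule vs. 1.5*IQR rule for outlier detection.
import numpy as np

x = np.array([10, 12, 11, 13, 12, 95])  # 95 is an obvious outlier

z = (x - x.mean()) / x.std()
z_outliers = np.abs(z) > 2              # common cutoffs: 2 or 3

q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
iqr_outliers = (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)
```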

8. Dimensionality Reduction

● Use techniques like PCA (Principal Component Analysis) to reduce the number of
features while retaining most of the information.
9. Encoding for High Cardinality Features

● Handle features with many unique categories using target encoding or frequency
encoding.
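
A sketch of frequency encoding with pandas; the column is hypothetical:

```python
# Replace each category with its share of the rows.
import pandas as pd

df = pd.DataFrame({"zip_code": ["10001", "10001", "94105", "60601"]})
freq = df["zip_code"].value_counts(normalize=True)  # category -> frequency
df["zip_freq"] = df["zip_code"].map(freq)           # 0.5, 0.5, 0.25, 0.25
```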

10. Feature Engineering for Time Series Data

● Create time-based features like day of the week, lag features, or moving averages for
time-dependent data.
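
A sketch of calendar, lag, and rolling-window features with pandas on a toy daily series:

```python
# Day-of-week, 1-day lag, and 3-day moving average features.
import pandas as pd

df = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=10, freq="D"),
    "sales": [5, 7, 6, 9, 8, 10, 12, 11, 13, 14],
})
df["day_of_week"] = df["date"].dt.dayofweek       # 0 = Monday
df["sales_lag_1"] = df["sales"].shift(1)          # yesterday's value
df["sales_ma_3"] = df["sales"].rolling(3).mean()  # 3-day moving average
```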

11. Data Splitting

● Split the data into training, validation, and test sets to evaluate model performance.
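
A sketch of a 70/15/15 split done with two calls to scikit-learn's train_test_split:

```python
# First split off 30%, then halve it into validation and test sets.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.3, random_state=0)          # 70% train
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=0)  # 15% val, 15% test
```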

These steps help improve the quality of the data and enhance model performance.
