VIVA
1. What is Bagging?
● Answer: Bagging (bootstrap aggregating) involves training multiple models (usually of
the same type) on different subsets of the training data, created by bootstrapping
(sampling with replacement). The final prediction is made by averaging (for regression)
or voting (for classification) across all the models.
● Advantages:
○ Reduces overfitting.
○ Improves model stability and accuracy.
● Disadvantages:
○ Can be computationally expensive.
○ Does little to reduce bias, so it may not help when the base model underfits
(i.e., has high bias).
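A minimal sketch with scikit-learn (the dataset and variable names are illustrative, not part of the original notes):

```python
# Bagging: 50 decision trees, each fit on a bootstrap sample of the training rows.
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=42)
bag.fit(X_train, y_train)
print("Test accuracy:", bag.score(X_test, y_test))  # majority vote across the 50 trees
```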
2. What is Random Forest?
● Answer: Random Forest is an ensemble method built on bagging: many decision trees
are trained on bootstrap samples, each split considers only a random subset of the
features (which decorrelates the trees), and the trees' predictions are averaged or
voted upon.
● Advantages:
○ Handles large datasets well.
○ Robust to overfitting and noise.
● Disadvantages:
○ Can be computationally intensive.
○ Less interpretable compared to a single decision tree.
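A minimal scikit-learn sketch (dataset and names are illustrative):

```python
# Random Forest: each tree sees a bootstrap sample of rows and a random
# subset of features at every split.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
print("Test accuracy:", rf.score(X_test, y_test))
```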
3. What is Boosting?
● Answer: Boosting is an ensemble technique that trains models sequentially. Each new
model corrects the errors made by the previous one, focusing on difficult-to-classify
instances.
● Advantages:
○ Often improves performance and accuracy.
○ Can turn weak learners into strong learners.
● Disadvantages:
○ Prone to overfitting, especially with noisy data.
○ Can be computationally expensive due to sequential training.
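A minimal sketch using AdaBoost, one common boosting algorithm (dataset and names are illustrative):

```python
# AdaBoost: each successive weak learner (a shallow tree by default) upweights
# the training instances the previous learners misclassified.
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

boost = AdaBoostClassifier(n_estimators=50, random_state=42)
boost.fit(X_train, y_train)
print("Test accuracy:", boost.score(X_test, y_test))
```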
4. What is an Ensemble of Machine Learning Algorithms (Ensemble
Classifier)?
● Answer: Ensemble methods combine the predictions of multiple models (either the
same type or different types) to improve overall accuracy. Common methods include
bagging, boosting, and stacking.
● Advantages:
○ Can outperform individual models by reducing bias and variance.
○ More robust to overfitting and can handle complex data.
● Disadvantages:
○ More computationally demanding.
○ Difficult to interpret due to multiple models being combined.
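A minimal sketch of a voting ensemble that combines different model types (dataset and names are illustrative):

```python
# Voting ensemble: three different model types, combined by majority (hard) vote.
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

ensemble = VotingClassifier(estimators=[
    ("lr", LogisticRegression(max_iter=1000)),
    ("dt", DecisionTreeClassifier()),
    ("knn", KNeighborsClassifier()),
], voting="hard")
ensemble.fit(X_train, y_train)
print("Test accuracy:", ensemble.score(X_test, y_test))
```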
5. What is Exploratory Data Analysis (EDA)?
● Answer: EDA is the process of analyzing data sets to summarize their main
characteristics, often using visual methods (such as histograms and box plots) and
statistical methods (mean, median, variance).
● Advantages:
○ Helps in understanding the underlying data patterns and structures.
○ Identifies potential issues such as outliers, missing values, or correlations.
● Disadvantages:
○ Can be time-consuming.
○ Requires domain knowledge to interpret the results accurately.
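A minimal pandas sketch (the toy dataset is hypothetical):

```python
import pandas as pd

# Hypothetical toy dataset, just for illustration.
df = pd.DataFrame({
    "age": [25, 32, 47, 51, 38, None, 29],
    "income": [40000, 52000, 88000, 95000, 61000, 58000, 430000],  # note the outlier
})

print(df.describe())    # count, mean, std, min, quartiles (median = 50%), max
print(df.isna().sum())  # missing values per column
print(df.corr())        # pairwise correlations
df.hist()               # histograms per column (requires matplotlib)
```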
6. What are Association Rules?
● Answer: Association rules are used to identify relationships between items in
transaction data, typically measured by support (how often an itemset appears) and
confidence (how often the rule holds when its antecedent appears). The goal is to
discover items that frequently occur together.
● Advantages:
○ Useful for market basket analysis and recommendation systems.
○ Helps in identifying patterns and relationships in data.
● Disadvantages:
○ May lead to a large number of trivial rules.
○ Rules can be computationally expensive to generate.
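A hand-computed sketch of support and confidence for one candidate rule (the transactions are made up for illustration):

```python
# Toy transactions; evaluate the rule {bread} -> {milk}.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

n = len(transactions)
support_bread_milk = sum({"bread", "milk"} <= t for t in transactions) / n  # 2/4
support_bread = sum("bread" in t for t in transactions) / n                 # 3/4

confidence = support_bread_milk / support_bread  # P(milk | bread) = 0.67
print(f"support={support_bread_milk:.2f}, confidence={confidence:.2f}")
```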
7. What is the Apriori Algorithm?
● Answer: The Apriori algorithm is used to find frequent itemsets in a dataset and
generate association rules based on minimum support and confidence thresholds. It
prunes the search using the Apriori property: every subset of a frequent itemset must
itself be frequent.
● Advantages:
○ Simple and easy to understand.
○ Can be applied to large datasets with sparse transactions.
● Disadvantages:
○ Can be slow for very large datasets.
○ Needs a lot of memory to store candidate itemsets.
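A minimal sketch assuming the third-party mlxtend library is installed (the one-hot transaction table is made up for illustration):

```python
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# One-hot encoded transactions: rows = baskets, columns = items.
df = pd.DataFrame({
    "bread":  [1, 1, 1, 0],
    "milk":   [1, 0, 1, 1],
    "butter": [0, 1, 1, 0],
}).astype(bool)

frequent = apriori(df, min_support=0.5, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```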
Steps in Feature Engineering
1. Data Collection
● Gather raw data from sources such as databases, APIs, or flat files.
2. Data Cleaning
● Handle missing values, duplicates, and inconsistent entries.
3. Feature Transformation
● Rescale or transform features (e.g., standardization, normalization, log transforms)
so they are suitable for modeling.
5. Feature Creation
● Generate new features from existing ones (e.g., interaction terms, date components,
aggregations).
● Example: Create new features like price per unit from total price and
quantity.
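A minimal pandas sketch of that example (the order data is hypothetical):

```python
import pandas as pd

# Hypothetical order data.
orders = pd.DataFrame({"total_price": [10.0, 24.0, 7.5], "quantity": [2, 6, 3]})

# Derive a new feature from two existing ones.
orders["price_per_unit"] = orders["total_price"] / orders["quantity"]
print(orders)
```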
6. Feature Selection
● Select the most relevant features using techniques like correlation analysis, chi-square
tests, or model-based methods.
● Eliminate irrelevant or redundant features.
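A minimal sketch of model-free selection with a chi-square test (dataset and names are illustrative):

```python
# Keep the two features with the highest chi-square scores against the target.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)

selector = SelectKBest(score_func=chi2, k=2)
X_selected = selector.fit_transform(X, y)
print("Kept feature indices:", selector.get_support(indices=True))
```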
7. Outlier Detection and Handling
● Identify and handle outliers using methods like Z-scores or the IQR (Interquartile
Range) rule.
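A minimal sketch of the IQR rule (the values are made up for illustration):

```python
import pandas as pd

s = pd.Series([40000, 52000, 88000, 95000, 61000, 58000, 430000])  # hypothetical incomes

# IQR rule: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]
print(outliers)  # only the 430000 entry is flagged
```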
8. Dimensionality Reduction
● Use techniques like PCA (Principal Component Analysis) to reduce the number of
features while retaining most of the information.
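A minimal PCA sketch with scikit-learn (dataset and names are illustrative):

```python
# Project the 4 iris features onto 2 principal components.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print("Explained variance ratio:", pca.explained_variance_ratio_)
```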
9. Encoding for High Cardinality Features
● Handle features with many unique categories using target encoding or frequency
encoding.
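A minimal frequency-encoding sketch in pandas (the column is hypothetical):

```python
import pandas as pd

# Hypothetical high-cardinality categorical column.
df = pd.DataFrame({"city": ["Pune", "Mumbai", "Pune", "Delhi", "Pune", "Mumbai"]})

# Frequency encoding: replace each category with its relative frequency.
freq = df["city"].value_counts(normalize=True)
df["city_freq"] = df["city"].map(freq)
print(df)
```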
10. Time-Based Features
● Create time-based features like day of the week, lag features, or moving averages for
time-dependent data.
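A minimal pandas sketch covering all three (the sales series is hypothetical):

```python
import pandas as pd

# Hypothetical daily sales series.
df = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=7, freq="D"),
    "sales": [100, 120, 90, 130, 110, 95, 140],
})

df["day_of_week"] = df["date"].dt.dayofweek        # date component (Mon = 0)
df["sales_lag_1"] = df["sales"].shift(1)           # lag feature
df["sales_ma_3"] = df["sales"].rolling(3).mean()   # 3-day moving average
print(df)
```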
11. Data Splitting
● Split the data into training, validation, and test sets to evaluate model performance.
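A minimal two-stage split with scikit-learn (the split sizes are illustrative):

```python
# First carve out a test set, then split the remainder into train and validation.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=42)
print(len(X_train), len(X_val), len(X_test))  # 90 / 30 / 30
```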
These steps help improve the quality of the data and enhance model performance.