DS 1
Here are concise and well-structured 5-mark answers (in 5–6 points each) for your questions:
3. Forward/Backward Fill: Use previous or next values to fill in the missing ones.
Best when: Data is time-series.
4. Predictive Imputation: Use models (like regression or KNN) to predict missing values.
Best when: Strong relationships exist between features.
5. Constant Imputation: Fill missing values with a specific constant (e.g., "Unknown").
Best when: Handling categorical variables.
6. Use Algorithms That Handle Missing Values: Some models like XGBoost can handle
missing data internally.
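A minimal sketch of these strategies using pandas and scikit-learn (the column names 'age', 'income', and 'city' are hypothetical, chosen only for illustration):

# Sketch: forward/backward fill, predictive (KNN) imputation, and constant imputation
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({"age": [25, None, 40, None, 31],
                   "income": [50000, 60000, None, 52000, 58000],
                   "city": ["Pune", None, "Delhi", "Pune", None]})

# Forward/backward fill - suited to time-series data
df["age_filled"] = df["age"].ffill().bfill()

# Predictive imputation - KNN estimates missing values from related numeric features
df[["age", "income"]] = KNNImputer(n_neighbors=2).fit_transform(df[["age", "income"]])

# Constant imputation - common for categorical variables
df["city"] = df["city"].fillna("Unknown")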
2. Label Encoding vs One-Hot Encoding
1. Label Encoding: Assigns a unique number to each category (e.g., Red = 0, Blue = 1).
Best for: Ordinal data where order matters.
2. One-Hot Encoding: Creates a separate binary column for each category.
Best for: Nominal data with no inherent order.
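A short, hedged comparison of the two encodings in pandas / scikit-learn (the 'color' column and its values are made up for this example):

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"color": ["Red", "Blue", "Red", "Green"]})

# Label encoding: a single integer column; implies an order, so best for ordinal data
df["color_label"] = LabelEncoder().fit_transform(df["color"])

# One-hot encoding: one binary column per category; no implied order, so best for nominal data
df_onehot = pd.get_dummies(df["color"], prefix="color")
print(df.join(df_onehot))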
3. Min-Max Normalization
1. Definition: Rescales features to a fixed range [0, 1] using: X_norm = (X - X_min) / (X_max - X_min)
4. Useful for: Algorithms sensitive to scale (e.g., KNN, SVM, neural networks).
5. Not Ideal When: Outliers are present, since they compress the remaining values into a narrow range.
6. Example: Rescaling age from a range of 20–70 to 0–1 helps improve gradient descent
convergence.
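A small illustrative sketch with scikit-learn's MinMaxScaler (the age values are invented for the example):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

ages = np.array([[20.0], [35.0], [50.0], [70.0]])   # ages in the 20-70 range

scaler = MinMaxScaler()                  # default feature_range is (0, 1)
ages_scaled = scaler.fit_transform(ages)
print(ages_scaled.ravel())               # [0.  0.3  0.6  1. ]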
2. Reduces Overfitting: Less noise leads to better generalization on test data.
3. Improves Accuracy: Removes irrelevant data that may confuse the model.
5. Example: In a spam filter, selecting features like "contains 'free'" or "click here" instead of
all email text improves accuracy.
6. Methods: Filter (e.g., correlation), Wrapper (e.g., RFE), Embedded (e.g., Lasso).
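A hedged sketch of a filter method (correlation threshold) and an embedded method (Lasso); the synthetic features f0-f4 and the 0.3 threshold are assumptions made for illustration:

import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 5)), columns=[f"f{i}" for i in range(5)])
y = 3 * X["f0"] - 2 * X["f1"] + rng.normal(scale=0.1, size=200)

# Filter: keep features whose absolute correlation with the target exceeds a threshold
corr = X.apply(lambda col: col.corr(y))
filter_selected = corr[corr.abs() > 0.3].index.tolist()

# Embedded: Lasso shrinks the coefficients of unhelpful features to exactly zero
lasso = Lasso(alpha=0.1).fit(X, y)
embedded_selected = X.columns[lasso.coef_ != 0].tolist()
print(filter_selected, embedded_selected)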
2. Feature Selection: Choosing the most relevant existing features for training.
6. Key Difference: Feature selection chooses from existing features, while feature engineering creates new features from raw data.
Here are clear, point-wise answers (5–6 bullet points each) for your 5-mark questions:
6. Example: Filling in all missing ages with the average age might hide age-related trends.
2. Increases memory usage and computation time with many categories.
5. Example: For a feature like "Product ID" with 5000 unique values, use frequency or
target encoding.
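One commonly suggested mitigation is frequency encoding; a minimal sketch (the 'product_id' values are hypothetical):

import pandas as pd

df = pd.DataFrame({"product_id": ["P1", "P2", "P1", "P3", "P1", "P2"]})

# Replace each category with its relative frequency: one numeric column
# instead of thousands of one-hot columns
freq = df["product_id"].value_counts(normalize=True)
df["product_id_freq"] = df["product_id"].map(freq)
print(df)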
4. Transforms raw data into suitable input for machine learning algorithms.
3. Min-Max Normalization rescales all features to [0, 1], giving equal weight.
5. Example: In KNN, if age is scaled to 0–1 but income is left on its raw 0–100K scale, income dominates the distance calculation.
6. Conclusion: Proper scaling is essential for fair and effective distance calculations.
10. Automating Feature Selection with RFE
1. Recursive Feature Elimination (RFE) is a wrapper method that selects features by recursively eliminating the weakest ones.
2. Trains the model multiple times, removing the least important feature at each step.
3. Ranks features based on importance weights from the model (e.g., coefficients or
feature_importances).
4. Works well with models like logistic regression, SVM, and tree-based models.
6. Example: RFE with a random forest can select top 10 features out of 100 based on
performance.
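A hedged example of the idea in scikit-learn (synthetic data; in practice X and y come from your own dataset):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

X, y = make_classification(n_samples=300, n_features=100, n_informative=10, random_state=0)

# Repeatedly fit the model and drop the least important features until 10 remain
rfe = RFE(estimator=RandomForestClassifier(n_estimators=100, random_state=0),
          n_features_to_select=10, step=5)
rfe.fit(X, y)
print("Selected feature indices:", [i for i, kept in enumerate(rfe.support_) if kept])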
20. Explain how categorical data can be transformed using both label encoding and one-hot encoding, including any potential pitfalls in each method.
Answer me with subpoints.
Here are concise answers with subpoints for your 5-mark questions (Q11 to Q20), designed
for clarity and exam-readiness:
Median Imputation: Replace missing values with the median; robust to outliers and skewed data.
Dimensionality explosion: Large number of new columns.
Best used for ordinal categorical data (e.g., low, medium, high).
Formula: X_scaled = (X - X_min) / (X_max - X_min)
For the sample values [20, 40, 60, 80], min = 20 and max = 80.
Scaled values:
(20–20)/(80–20) = 0
(40–20)/(80–20) = 0.33
(60–20)/(80–20) = 0.67
(80–20)/(80–20) = 1
Median: fills with the middle value; robust to outliers and skewed data.
Mode: fills with the most frequent value; suited to categorical features.
Choice depends on:
the distribution of the feature
the presence of outliers
19. Trade-Offs in Dimensionality Reduction
Pros: fewer features, less overfitting, faster training.
Cons: possible loss of information and reduced interpretability.
One-Hot Encoding: one binary column per category; avoids implying order but can greatly increase dimensionality.
10-mark questions (1 to 15):
1. Explain in detail the common methods for handling missing values (mean, median, and mode imputation). Include in your answer the conditions under which each method is most effective, their drawbacks, and the impact of imputation on subsequent analyses.
2. Discuss in detail the encoding techniques for categorical variables: label encoding vs. one-hot encoding. Provide examples, describe scenarios where one method is preferred over the other, and explain any issues that may arise when using these methods with machine learning models.
3. Describe the entire process of data normalization using min-max scaling. Include mathematical formulation, step-by-step transformation of a sample dataset, and discuss the benefits and limitations of this method compared to other scaling techniques (e.g., Z-score normalization).
4. Explain feature selection and feature engineering in depth. Discuss how each contributes to model building, include examples of techniques (e.g., filter methods, wrapper methods, and embedded methods for feature selection), and highlight potential challenges in applying these techniques.
5. You are given a dataset with missing values, categorical variables, and numerical features with different scales. Design a complete data preprocessing pipeline that includes missing value imputation, encoding, normalization, and feature selection. Justify your choices at each step.
6. Discuss the impact of imputation methods on the statistical properties of data. How do mean, median, and mode imputation alter the distribution of a feature, and what are the potential consequences for model training and inference?
7. Explore the challenges and potential solutions when applying one-hot encoding to high-cardinality categorical variables. In your discussion, include strategies to mitigate issues like the curse of dimensionality and model overfitting.
8. Analyze the role of data normalization in machine learning algorithms, especially in algorithms that rely on distance metrics. Provide theoretical explanations as well as practical examples to support your analysis.
9. Examine the effects of feature selection on model interpretability and performance. Discuss different feature selection techniques, and provide a case study or example to illustrate how reducing the number of features can benefit or harm the model.
10. Consider a scenario where you must preprocess data for a model that is highly sensitive to input scale and outliers. Propose and justify a comprehensive preprocessing strategy, including methods for imputation, scaling, encoding, and feature engineering, discussing how each step addresses these sensitivities.
11. Discuss how missing data mechanisms (Missing Completely at Random, Missing at Random, and Missing Not at Random) influence the choice of imputation techniques. Provide examples and explain the potential biases that can arise from improper imputation.
12. Critically analyze the advantages and limitations of using min-max normalization in contrast to other scaling methods. Provide a detailed example where min-max scaling might fail, and suggest alternative solutions.
13. Design an experiment to compare the effects of label encoding and one-hot encoding on the performance of a classification algorithm. Outline the experimental setup, metrics for evaluation, and discuss how you would interpret the results.
14. Provide a detailed discussion on the importance of feature engineering in transforming raw data into features suitable for modeling. Include examples of techniques used for creating new features, handling interactions, and dealing with non-linear relationships.
15. Discuss the integration of data preprocessing techniques (imputation, encoding, normalization, feature selection) into modern machine learning workflows. Evaluate how automated machine learning (AutoML) systems handle these tasks and the potential benefits and pitfalls of relying on such automation.
Median Imputation: Use the median. Best for skewed data or presence of
outliers. Drawback: ignores other variable relationships.
Impact: Imputation can bias distributions, reduce variance, and affect model
accuracy if not chosen correctly.
One-Hot Encoding: Creates binary columns for each category. Ideal for
nominal data. Drawback: high dimensionality.
Example: For color = [Red, Blue], label encoding = [0, 1]; one-hot = [[1,0], [0,1]].
3. Min-Max Scaling
Formula: X_scaled = (X - X_min) / (X_max - X_min)
Steps: For [10, 20, 30], min=10, max=30; scaled = [0, 0.5, 1]
Feature Selection: Removes irrelevant features. Techniques: filter (correlation),
wrapper (RFE), embedded (Lasso).
Feature Engineering: Creates new features from existing data (e.g., age group
from age).
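A tiny feature-engineering sketch for the age-group example (the bin edges and labels are illustrative assumptions):

import pandas as pd

df = pd.DataFrame({"age": [18, 25, 42, 67, 80]})
df["age_group"] = pd.cut(df["age"],
                         bins=[0, 25, 45, 65, 120],
                         labels=["young", "adult", "middle_aged", "senior"])
print(df)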
8. Data Normalization in ML
Role: Ensures fair feature contribution in distance-based models (e.g., KNN,
SVM).
9. Feature Selection Effects
Interpretability: Simpler models, easier to explain.
Example: Using top 10 features improved test accuracy vs using all 100.
14. Importance of Feature Engineering
Definition: Transform raw data into informative features.
15. What is the main difference between label encoding and one-hot encoding?
16. Why might one-hot encoding increase the dimensionality of data?
17. What is the effect of min-max scaling on the range of features?
18. Why is feature scaling necessary in many machine learning algorithms?
19. Give one benefit of normalizing data before model training.
20. How can feature selection improve model performance?
21. What does it mean to ‘engineer’ a feature?
22. Name one method for feature selection.
23. What is a potential risk when imputing missing values?
24. How does one handle missing values in a dataset with mixed data types?
25. When would you use mean imputation instead of median imputation?
26. Explain why one-hot encoding might be necessary for non-ordinal categorical variables.
27. What is the primary goal of data preprocessing?
28. How does label encoding affect the ordering of categorical variables?
29. What is the potential issue with using label encoding on nominal data?
30. Name one scenario where feature engineering can be crucial.
31. How does handling missing values contribute to model accuracy?
Here are simple 2-mark answers for your questions, one by one:
12. Give one example of a scenario where median imputation is preferred over mean imputation.
In income data with extreme values (outliers), median is better because it's not affected by outliers.
15. What is the main difference between label encoding and one-hot encoding?
Label encoding gives one column with numbers; one-hot makes multiple binary columns.
17. What is the effect of min-max scaling on the range of features?
It rescales every feature so that its values lie between 0 and 1.
24. How does one handle missing values in a dataset with mixed data types?
Use mean/median for numeric and mode for categorical columns.
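A minimal sketch of this split treatment with scikit-learn (the column names 'age' and 'city' are hypothetical):

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"age": [25, None, 40], "city": ["Pune", None, "Delhi"]})

# Median for the numeric column, most frequent value (mode) for the categorical column
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["age"]),
    ("cat", SimpleImputer(strategy="most_frequent"), ["city"]),
])
print(preprocess.fit_transform(df))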
25. When would you use mean imputation instead of median imputation?
When the data is normally distributed without outliers.
26. Explain why one-hot encoding might be necessary for non-ordinal categorical variables.
Because it avoids implying an order where none exists.
28. How does label encoding affect the ordering of categorical variables?
It assigns numbers that may wrongly suggest an order or ranking.
29. What is the potential issue with using label encoding on nominal data?
Models may think higher numbers mean higher value, which is false.