Case Studies ML

Case Study: Predictive Maintenance in Manufacturing


Background: A leading manufacturing company, "AutoMech", specializing in
automotive parts, decided to implement a machine learning solution to predict
equipment failures, thereby reducing downtime and maintenance costs. They chose
Amazon SageMaker for its comprehensive machine learning capabilities.
Goal: To develop a predictive maintenance system that could accurately forecast
machine failures, allowing proactive maintenance and avoiding costly unplanned
downtime.
The Project:
1. Data Collection: AutoMech collected vast amounts of data from their machinery
– temperature readings, vibration levels, operational hours, maintenance logs,
and failure history.
2. Data Preparation: Using SageMaker, the data science team preprocessed the
data – normalizing sensor readings, encoding categorical data, and handling
missing values.
3. Model Selection: They chose a Gradient Boosting algorithm for its effectiveness
in handling diverse data and its robustness in predicting binary outcomes (failure
or no failure).
4. Training and Evaluation: The team used SageMaker's built-in XGBoost
algorithm to train the model. Initial results on the validation set were promising.
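As a rough illustration of the kind of training job described in step 4, the sketch below launches SageMaker's built-in XGBoost container via the Python SDK. The S3 bucket, IAM role, instance type, and hyperparameter values are illustrative assumptions, not AutoMech's actual configuration.

```python
# Minimal sketch of training SageMaker's built-in XGBoost (illustrative values only).
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"  # hypothetical role

# Resolve the built-in XGBoost container image for the current region.
container = image_uris.retrieve("xgboost", session.boto_region_name, version="1.5-1")

estimator = Estimator(
    image_uri=container,
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://automech-ml/models/",  # hypothetical bucket
    sagemaker_session=session,
)
estimator.set_hyperparameters(
    objective="binary:logistic",  # failure / no failure
    num_round=200,
    max_depth=6,
    eval_metric="auc",
)

train_input = TrainingInput("s3://automech-ml/train/", content_type="text/csv")
val_input = TrainingInput("s3://automech-ml/validation/", content_type="text/csv")
estimator.fit({"train": train_input, "validation": val_input})
```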
Incident: Upon deploying the model into production, the system began to produce a
high number of false positives – predicting failures that weren't occurring. This led to
unnecessary maintenance actions, disrupting the workflow and causing frustration
among the maintenance staff.
Investigation:
1. Reviewing Data Pipeline: The team first reviewed the data pipeline and
preprocessing steps. They found no issues – the data fed into the model was
clean and well-prepared.
2. Model Reevaluation: Reassessing the model, they discovered the issue wasn't
with the algorithm but with how the training and validation were set up.
Root Cause – The 'Setting Gotcha': The team had overlooked a crucial detail in the
SageMaker setting – the handling of data shuffling in the training process. SageMaker's
default setting for the XGBoost algorithm includes data shuffling before splitting into
training and validation sets. However, in the context of time-series data (like machine
sensor readings), this default setting breaks the temporal sequence, leading to overly
optimistic performance during validation.
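To make the fix concrete, here is a minimal sketch of a chronological split for the sensor data, assuming a pandas DataFrame with a timestamp column (the file and column names are hypothetical).

```python
# Sketch: a chronological train/validation split for time-series sensor data.
import pandas as pd

df = pd.read_csv("sensor_readings.csv", parse_dates=["timestamp"])  # hypothetical file
df = df.sort_values("timestamp")

# Train on the earliest 80% of records, validate on the most recent 20%.
cutoff = int(len(df) * 0.8)
train_df = df.iloc[:cutoff]
valid_df = df.iloc[cutoff:]

# A shuffled split would leak future readings into training and inflate validation scores.
```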
Resolution:
1. Correcting Data Shuffling: The team turned off data shuffling and restructured
the dataset to maintain the chronological order. This change provided a more
realistic scenario where the model trained on past data to predict future events.
2. Hyperparameter Tuning: They conducted further hyperparameter tuning,
considering the new data setup. Key parameters like max_depth and
min_child_weight were adjusted to prevent overfitting.
3. Incremental Learning Approach: To adapt to the ever-changing nature of
machine wear and tear, they implemented an incremental learning strategy
where the model would be periodically updated with new data.
Outcome: The revised model showed a significant reduction in false positives, aligning
the predictions more closely with actual equipment failures. This led to a more efficient
maintenance schedule, reduced costs, and increased trust in the ML system among the
staff.
Lessons Learned:
 Understand Default Settings: Knowledge of the platform's default behaviors
(like data shuffling in SageMaker) is crucial, especially when dealing with time-
series data.
 Temporal Integrity in Time-Series Data: For predictive maintenance,
maintaining the temporal sequence of data is key to model accuracy.
 Ongoing Monitoring and Updates: Machine learning models in production
require continuous monitoring and updates to remain effective.
Conclusion: This case study highlights the importance of understanding every detail of
the machine learning pipeline, especially when using managed services like
SageMaker. Assumptions about default settings can lead to significant issues in model
performance. The incident reinforced the need for meticulous planning and a thorough
understanding of both the machine learning algorithms and the platform being used.
Case from 2021.
Case Study 2

Case Study: Energy Consumption Forecasting in a Smart Grid System


Background: An emerging energy company, "EcoSmart Energy" (name changed), ventured into smart grid technology to optimize energy distribution. They aimed to
use machine learning to forecast energy consumption across various sectors, including
residential, commercial, and industrial.
Goal: To create an accurate forecasting model to predict energy demand, enabling
efficient energy distribution, reducing wastage, and optimizing energy production.
The Project:
1. Data Collection: EcoSmart gathered historical data on energy consumption,
weather patterns, population density, and special events across different regions.
2. Data Preprocessing: The data was cleaned and preprocessed, normalizing
various features like temperature, humidity, and demographic factors.
3. Model Selection: They chose a Time Series Forecasting model, specifically a
Long Short-Term Memory (LSTM) network, suitable for handling sequential data
with temporal dependencies.
4. Training and Evaluation: Using Amazon SageMaker, the LSTM model was
trained. The initial evaluations showed promising results in predicting energy
consumption patterns.
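For illustration, the sketch below shows the general shape of such an LSTM forecaster in plain Keras (outside SageMaker). The window length, feature set, layer sizes, and the placeholder training data are all assumptions.

```python
# Sketch: an LSTM demand forecaster in Keras (window, features, sizes are assumptions).
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

window = 24       # hours of history per sample (assumed)
n_features = 5    # e.g. past load, temperature, humidity, day-of-week, holiday flag

model = keras.Sequential([
    layers.Input(shape=(window, n_features)),
    layers.LSTM(64),
    layers.Dense(32, activation="relu"),
    layers.Dense(1),   # next-period energy demand
])
model.compile(optimizer="adam", loss="mse")

# X: (samples, window, n_features), y: (samples,) built from the historical series.
X = np.random.rand(1000, window, n_features).astype("float32")  # placeholder data
y = np.random.rand(1000).astype("float32")
model.fit(X, y, validation_split=0.2, epochs=5, batch_size=32)
```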
Incident: Shortly after deploying the model, EcoSmart noticed anomalies in energy
forecasting in certain regions. The model predicted significantly higher energy demands
than actual consumption during specific periods, leading to overproduction of energy.
Investigation:
1. Analyzing Forecast Results: The team analyzed the periods and regions with
inaccurate forecasts. They noticed the issue primarily occurred in regions with
large industrial sectors.
2. Data Review: Upon reviewing the training data, they found that energy
consumption in these regions had significant fluctuations, often correlating with
irregular industrial activities.
Root Cause – The 'Outlier' Problem: The team identified that the problem was with
outliers in the energy consumption data, particularly in industrial areas. These outliers
represented periods of unusually high or low energy consumption due to factors like
unscheduled industrial operations, shutdowns, or maintenance activities, which weren’t
accounted for in the model.
Resolution:
1. Outlier Detection and Treatment: The team implemented an outlier detection system to identify and treat these anomalies in the data. They used techniques like the IQR (Interquartile Range) rule to spot outliers (a minimal sketch of this approach follows this list).
2. Data Segmentation: They segmented the data based on the type of region
(residential, commercial, industrial) and retrained separate models for each
segment, considering the unique consumption patterns.
3. Incorporating External Factors: Additional data, such as industrial activity
schedules and planned outages, were integrated to improve the accuracy of
predictions in industrial regions.
4. Continuous Monitoring and Adjustment: The model was set up for continuous
monitoring, with periodic adjustments to accommodate new consumption
patterns and data.
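As referenced in step 1 above, here is a minimal sketch of IQR-based outlier flagging, assuming a pandas DataFrame of consumption records; the file and column names are illustrative.

```python
# Sketch of IQR-based outlier flagging on a consumption series (names are assumptions).
import pandas as pd

def flag_iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask for values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

df = pd.read_csv("industrial_consumption.csv")  # hypothetical file
outliers = flag_iqr_outliers(df["kwh"])          # hypothetical column name
cleaned = df.loc[~outliers]                      # or cap/winsorize instead of dropping
```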
Outcome: Post-adjustment, the forecasting model became more reliable, particularly in
industrial regions. This led to more efficient energy distribution, reduced wastage, and
cost savings for both EcoSmart and its consumers.
Lessons Learned:
 Importance of Data Quality: Outliers in training data can significantly impact
model performance, especially in complex systems like energy distribution.
 Segmented Modeling Approach: Different sectors may require separate
models due to their unique consumption behaviors.
 Inclusion of Contextual Data: Including relevant external data can enhance
model accuracy in predicting demand.
Conclusion: This case study demonstrates the importance of detailed data analysis
and the need to consider sector-specific characteristics in predictive modeling,
especially in diverse systems like energy distribution. It highlights how addressing
outliers and incorporating contextual information can lead to more effective and efficient
solutions in the realm of smart grid technology.
Case from 2022.

Case Study 3

Case Study: Job Applicant Screening Tool in the Tech Industry


Background: A tech company based in Singapore, "TechTalent Inc." (again, the name has been changed), decided to develop a machine learning-based tool to screen job
applicants. The goal was to streamline the hiring process by automatically evaluating
resumes and shortlisting candidates.
Goal: To create an unbiased, efficient, and automated screening system that could
evaluate job applications fairly and select candidates based on merit and relevant
qualifications.
The Project:
1. Data Collection: TechTalent Inc. compiled a dataset of past job applications,
which included resumes, application forms, and the outcomes of these
applications (whether the applicant was shortlisted, interviewed, and hired).
2. Feature Extraction: The tool was designed to extract features from resumes,
such as education, work experience, skills, and extracurricular activities.
3. Model Selection: They chose a supervised machine learning classification
model to predict the likelihood of a candidate’s success in the hiring process.
4. Initial Training: Using Amazon SageMaker, they trained the model on their
historical hiring data.
Incident Avoided: Before deploying the model, the team conducted a thorough review
and discovered a significant issue.
Investigation:
1. Bias Audit: A routine bias audit was conducted, where the model’s predictions
were analyzed for fairness across different demographic groups.
2. Discovery of Bias: The audit revealed that the model was biased against certain
groups. Applicants from certain universities and specific demographic
backgrounds were consistently being favored.
Root Cause – Bias in Training Data: The root cause was traced back to the training
data, which reflected historical biases in the company’s hiring practices. Because the
model was trained on this data, it learned and replicated these biases.
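A bias audit of this kind can start with something as simple as comparing shortlist rates across groups. The sketch below, using placeholder data and a hypothetical group label, computes per-group selection rates and a disparate impact ratio (the "four-fifths rule" is one common heuristic).

```python
# Sketch: per-group shortlist rates and a disparate impact ratio (placeholder data).
import pandas as pd

audit = pd.DataFrame({
    "group": ["A", "A", "A", "B", "B", "B"],      # hypothetical demographic label
    "predicted_shortlist": [1, 1, 0, 1, 0, 0],    # model's shortlist decision
})

rates = audit.groupby("group")["predicted_shortlist"].mean()
print(rates)

# Four-fifths rule heuristic: flag the model if the lowest group's selection
# rate falls below 80% of the highest group's rate.
disparate_impact = rates.min() / rates.max()
print("Disparate impact ratio:", round(disparate_impact, 2))
```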
Resolution:
1. Debiasing the Data: The team took steps to remove biased features from the
training data and restructured the dataset to ensure a more balanced
representation of successful candidates across all demographic groups.
2. Feature Reassessment: They reassessed which features were being used to
make predictions, removing or adjusting features that were contributing to biased
outcomes.
3. Algorithmic Fairness Techniques: The team applied algorithmic fairness
techniques, like fairness constraints and regularization, to actively mitigate bias in
the model’s predictions.
4. Validation for Fairness: After retraining, the model was rigorously tested again
for bias, using a diverse and independent validation dataset.
5. Ongoing Monitoring Plan: They established an ongoing monitoring plan to
regularly check for and address any biases that might emerge as new data is
collected and the model is updated.
Outcome: The revised model showed significantly reduced bias across demographic
groups. By catching the issue before deployment, TechTalent Inc. avoided potential
reputational damage and legal issues, and more importantly, moved towards a fairer
and more ethical hiring process.
Lessons Learned:
 Proactive Bias Checks: It’s crucial to proactively check for biases in machine
learning models, especially those used in sensitive areas like hiring.
 Historical Data Isn’t Always Right: Historical data can reflect past biases and
doesn’t always represent ideal or ethical decision-making practices.
 Continuous Bias Monitoring: Regular monitoring for bias is essential as
models can develop biases over time with new data.
Conclusion: This case study underscores the importance of addressing and mitigating
bias in machine learning, especially in applications with significant social implications
like job screening. TechTalent Inc.’s proactive approach in detecting and correcting bias
before deployment exemplifies responsible AI practices, emphasizing fairness and
ethical considerations in machine learning applications.

In the TechTalent Inc. case study, the bias identified in the job applicant screening tool,
which originated from the company's historical hiring data, included the following
examples:
1. Educational Background Bias: The model showed a preference for candidates
from certain prestigious universities or colleges. This meant that equally qualified
candidates from less renowned institutions were less likely to be shortlisted. This
bias could have originated from a pattern in the historical data where past hiring
decisions favored candidates from specific universities.
2. Demographic Bias: The model exhibited biases against certain demographic
groups. For instance, it might have been less likely to shortlist candidates based
on factors like age, gender, or ethnicity. Such bias could reflect past hiring
practices where certain demographic groups were either consciously or
unconsciously favored over others.
3. Experience and Skill Set Bias: The model might have shown a tendency to
favor candidates with certain types of work experience or specific skill sets that
mirrored the profiles of previously successful candidates. This could overlook
candidates with diverse or unconventional career paths who could bring valuable
perspectives and skills to the company.
4. Extracurricular Activities Bias: There could have been a bias towards
candidates with certain types of extracurricular activities, possibly those
historically common among past successful applicants. This type of bias might
inadvertently disadvantage candidates who did not have the opportunity or
inclination to engage in these activities.
In this scenario, the biases in the company's hiring process were essentially a reflection
of historical patterns in their recruitment data. If the company had previously, even
unintentionally, favored certain universities, demographics, or career profiles, these
preferences would have been embedded in the historical data used to train the machine
learning model. The model, therefore, learned to replicate these preferences,
perpetuating the existing biases in its predictions.
Detecting and addressing these biases was crucial for ensuring a fair and equitable
hiring process. By retraining the model with debiased data and applying fairness
techniques, TechTalent Inc. aimed to create a more inclusive screening tool that
evaluated candidates based on their merits and relevant qualifications, free from
historical prejudices. You just need to be careful when dealing with data like this.
Case from 2022.

Case Study 4
This one is my favourite.
Case Study: Agricultural Crop Yield Prediction System
Background: An agri-tech university in Malaysia, referred to here as "AgriFutureTech" (the real name is withheld), decided to develop a machine learning-based crop yield prediction
system. Their goal was to provide farmers with accurate predictions of crop yields to
optimize farming practices and maximize outputs.
Goal: To create a precise and reliable system that could predict crop yields based on
various factors like weather data, soil quality, and farming practices.
The Project:
1. Data Collection: AgriFutureTech collected data from various farms, including soil
composition, weather patterns, crop types, irrigation schedules, and historical
yield data.
2. Model Selection: They chose a Random Forest model for its robustness and
ability to handle diverse data types.
3. Training and Evaluation: The model was trained using Amazon SageMaker,
and initial testing showed promising results in yield prediction accuracy.
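As a rough sketch of steps 2 and 3, the example below trains a scikit-learn RandomForestRegressor on tabular farm data; the file name, feature columns, and target column are assumptions.

```python
# Sketch: a Random Forest yield model on tabular farm data (names are assumptions).
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

df = pd.read_csv("farm_records.csv")  # hypothetical dataset
features = ["soil_ph", "rainfall_mm", "avg_temp_c", "irrigation_hours", "crop_type_code"]
X, y = df[features], df["yield_tonnes_per_ha"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = RandomForestRegressor(n_estimators=300, random_state=42)
model.fit(X_train, y_train)
print("MAE:", mean_absolute_error(y_test, model.predict(X_test)))
```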
Incident: Upon piloting the system with a group of farmers, the predictions were found
to be significantly off in certain regions, leading to mistrust among the farmers and
reluctance to adopt the system.
Investigation:
1. Analyzing Prediction Errors: The team analyzed regions with high prediction
errors and compared them against regions where predictions were more
accurate.
2. Feedback from Farmers: Direct feedback from farmers revealed that certain
local farming practices and microclimatic conditions weren't adequately
represented in the data.
Root Cause – Lack of Domain Expertise: The primary issue was identified as a lack
of domain expertise in agriculture. The AgriFutureTech team, primarily composed of
data scientists and technologists, had not involved agricultural experts in the
development process. This led to key factors being overlooked:
 Local Farming Practices: Specific cultivation techniques used in certain regions
were not captured in the data.
 Microclimatic Variations: Small-scale climatic conditions affecting certain areas
were not considered.
 Soil Variation Complexity: The complexity of soil variations and their impact on
different crops were underestimated.
Resolution:
1. Incorporating Agricultural Expertise: AgriFutureTech onboarded agricultural
scientists and experienced local farmers to gain insights into critical factors
influencing crop yields.
2. Data Enrichment: The data was enriched with detailed local farming practices,
microclimatic data, and more nuanced soil health parameters.
3. Model Reconfiguration: The model was reconfigured to account for these
additional factors, with input from agricultural experts on feature importance and
data interpretation.
4. Pilot Program Redesign: The pilot program was redesigned to include
continuous feedback loops with the participating farmers, ensuring real-time
validation of the predictions.
Outcome: The redeveloped system showed significant improvements in accuracy,
gaining trust among the farming community. The collaboration with agricultural experts
led to a more nuanced and practical solution.
Lessons Learned:
 Importance of Domain Expertise: In-depth domain knowledge is crucial,
especially in fields like agriculture, where local knowledge can be as vital as data.
 Collaborative Development: Collaboration between data scientists and domain
experts is essential for developing practical and effective solutions.
 Continuous Feedback Loop: Ongoing feedback from end-users is vital for fine-
tuning and validating machine learning models in real-world scenarios.
Conclusion: This case study highlights the importance of domain expertise in
developing machine learning applications. The initial oversight by AgriFutureTech
underscores the necessity of understanding the specific domain nuances and
integrating this knowledge throughout the development process, particularly in fields
where local and expert knowledge plays a critical role.

The above is a generalized account. Let's explore the specific agricultural techniques and regional characteristics of Malaysia that were initially overlooked, and how their inclusion could have improved the model's accuracy. Having a map of Malaysia on hand makes the regional references easier to follow.
1. Local Farming Practices in Malaysia
 Paddy Field Water Management: In Malaysia, especially in regions like Kedah
and Perak, paddy fields are predominant. The traditional water management
practices in these areas, which can significantly affect yield, were not captured.
Techniques such as intermittent irrigation or specific drainage practices during
different growth stages are crucial for yield prediction.
 Mixed Cropping Systems: Small-scale farmers in Malaysia often practice mixed
cropping, growing multiple types of crops on the same land. This diversification
affects soil nutrients and pest dynamics, impacting crop yields differently than
monoculture systems.
 Organic Farming Practices: The rise of organic farming, especially in areas like
Cameron Highlands, involves unique practices like natural pest control and
organic fertilization, which significantly influence crop health and yield.
2. Microclimatic Variations
 Regional Microclimates: Malaysia has diverse microclimates due to its
topography and proximity to water bodies. The model failed to account for
microclimates like the highland areas (Cameron Highlands), which have different
temperature and humidity profiles compared to lowland areas.
 Rainfall Patterns: Malaysia experiences monsoon seasons with varying intensity
across regions. The model did not adequately account for the localized impact of
these seasonal changes on crop growth cycles.
3. Soil Variation Complexity
 Peat Soils in Sarawak: In parts of Sarawak, peat soils are common, which have
unique properties affecting nutrient availability and water retention. These
aspects are critical for certain crops like oil palm.
 Acid Sulfate Soils: In coastal areas, acid sulfate soils pose a challenge for
agriculture due to their low pH and high sulfate content. Crops grown in these
soils have different yield patterns.
Techniques to Capture These Factors:
 Geospatial Data Analysis: Incorporating satellite imagery and GIS (Geographic
Information System) data to identify and account for regional farming practices
and microclimates.
 Soil Health Monitoring: Using IoT sensors for real-time soil monitoring to
capture variations in soil pH, moisture, and nutrient levels across different
regions.
 Seasonal Weather Data Integration: Including granular weather data such as
localized rainfall patterns, temperature variations, and humidity levels specific to
the Malaysian climate.
 Collaborative Data Gathering: Partnering with local agricultural bodies and
farmers to gather qualitative data on farming practices, crop rotation patterns,
and regional agricultural events (like local festivals or market demands that
influence farming practices).
In conclusion, the initial failure to incorporate these specific regional agricultural
practices and environmental factors into AgriFutureTech's model highlights the
importance of localized knowledge in the development of accurate and reliable
predictive systems, particularly in a diverse and climatically varied country like Malaysia.
Special case study: IRB

Note: the services and names have been generalized. This does not mean AWS provides the actual services; AWS services are used here purely as a reference point.
This is by no means an exact representation of the IRB system. Do not take it as a blueprint for any unauthorized activities.
Case Study: Implementing ML for Fraudulent Income Tax Return Detection
Background
A national tax authority aimed to modernize its fraud detection systems by leveraging
machine learning (ML) to identify and investigate potentially fraudulent income tax
returns. The initiative's goal was to minimize losses due to fraud and increase taxpayer
confidence in the system's integrity.
Objectives
 Automate the detection of anomalous behavior in the tax filing process.
 Process and analyze clickstream data in real-time.
 Establish a scalable, secure, and compliant ML solution.
Solution Design
The solution involved several AWS services and machine learning techniques, designed
to process large volumes of user interaction data and identify patterns indicative of
fraudulent activity.
Data Collection
 Clickstream Data Capture: As users navigated the tax return website,
clickstream data, including mouse movements, clicks, and typing behavior, were
captured.
 AWS Kinesis Data Streams: This service was utilized to stream interaction data
in real-time, enabling immediate processing and analysis.
Data Processing and Enrichment
 AWS Lambda: Triggered by Kinesis, these functions preprocessed the data by
cleaning, normalizing, and transforming raw clickstream data into a structured
format.
 Data Enrichment: Additional contextual data, such as login times and previous
filing history, was appended to each event to provide a richer dataset for
analysis.
Feature Engineering
 Behavioral Analysis: Features were engineered to quantify user behavior, such
as time spent on pages, navigation paths, and error rates in form completion.
 Historical Comparison: Features were also designed to compare current
behavior against historical patterns at an individual and aggregate level.
Model Training and Selection
 Algorithm Choice: Given the nature of fraud detection, unsupervised learning
algorithms such as Isolation Forest and Autoencoders were chosen due to their
effectiveness in anomaly detection.
 Model Training with AWS SageMaker: Amazon SageMaker facilitated the
training, tuning, and validation of the selected models, offering a managed
environment that streamlined these processes.
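To illustrate the anomaly-detection idea independently of the SageMaker setup, here is a minimal scikit-learn Isolation Forest sketch over engineered behaviour features; the file name, feature names, and contamination rate are assumptions.

```python
# Sketch: Isolation Forest anomaly scoring on engineered session features
# (file name, feature names, and contamination rate are assumptions).
import pandas as pd
from sklearn.ensemble import IsolationForest

sessions = pd.read_csv("filing_sessions.csv")  # hypothetical engineered-feature table
features = [
    "time_on_income_page_s",
    "form_error_rate",
    "navigation_path_length",
    "deviation_from_prior_years",
]

iso = IsolationForest(n_estimators=200, contamination=0.01, random_state=0)
iso.fit(sessions[features])

# Lower scores are more anomalous; route the most anomalous sessions to auditors.
sessions["anomaly_score"] = iso.score_samples(sessions[features])
flagged = sessions.nsmallest(100, "anomaly_score")
```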
Real-time Prediction and Monitoring
 SageMaker Inference: The trained models were deployed using SageMaker's
real-time endpoints for immediate inference on streaming data.
 Threshold Determination: Anomalies were flagged based on a sensitivity
threshold derived from historical fraud patterns and expert input.
Human-in-the-Loop
 Review Process: Flagged cases were reviewed by human auditors for
confirmation, ensuring that the ML system worked in tandem with experienced
personnel.
 Feedback Mechanism: Auditors' findings were fed back into the model to refine
its predictive capabilities continually.
Security and Compliance
 Data Protection: All data handling was designed to be compliant with privacy
laws, ensuring taxpayer information remained confidential and secure.
 AWS IAM: Roles and permissions were strictly managed, with access granted
only to necessary personnel and systems.
Challenges and Best Practices
 Challenge - Data Sensitivity: Handling personal taxpayer data required strict
adherence to data privacy regulations.
 Best Practice: Employ encryption in transit and at rest, alongside access
control measures.
 Challenge - Model Bias: Ensuring the model did not unfairly target specific user
demographics was crucial.
 Best Practice: Regular bias assessments and updates to training data
were performed to mitigate this risk.
 Challenge - Dynamic Fraud Tactics: Fraudulent tactics evolve, so the model
needed to adapt to changing patterns.
 Best Practice: Implement a robust feedback loop and conduct periodic
model retraining.
Outcome
The implementation of the ML-driven fraud detection system resulted in a significant
reduction in fraudulent filings and improved the efficiency of the audit process. With
real-time data analysis and machine learning, the tax authority could proactively prevent
fraud, saving millions in potential losses and reinforcing the integrity of the tax filing
system.
Quizzes

These quizzes were composed by some of our instructors; they are not official AWS material.

1. What is the best method to handle missing values in a dataset?


 A) Always delete rows with missing values
 B) Replace missing values with the mean or median
 C) Ignore missing values
 D) Convert them to zero
 Answer: B) Replace missing values with the mean or median
 Explanation: Deleting rows can lead to loss of valuable data, especially if
missing values are common. Imputing missing values with the mean or
median (depending on the data distribution) is a standard practice to retain
data integrity.
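A quick illustration of median imputation in pandas (toy values):

```python
# Toy example: fill missing income values with the column median (robust to skew).
import pandas as pd

df = pd.DataFrame({"income": [52000, None, 61000, 1200000, None]})
df["income"] = df["income"].fillna(df["income"].median())
print(df)
```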
2. If you discover two features in your dataset that are highly correlated, what
should you do?
 A) Keep both features
 B) Delete both features
 C) Delete one of the features
 D) Transform one of the features
 Answer: C) Delete one of the features
 Explanation: Keeping both might lead to multicollinearity, which can skew
the results of linear models. Deleting one reduces redundancy without
losing much information.
3. Which library is best suited for creating statistical data visualizations?
 A) Pandas
 B) NumPy
 C) Scikit-learn
 D) Seaborn
 Answer: D) Seaborn
 Explanation: Seaborn is built on top of Matplotlib and provides a high-
level interface for drawing attractive and informative statistical graphics.
4. How would you display the 25th percentile, median, and other statistical
details of a numerical dataset?
 A) Use the .info() method in Pandas
 B) Use the .describe() method in Pandas
 C) Use the .groupby() method in Pandas
 D) Use the .plot() method in Pandas
 Answer: B) Use the .describe() method in Pandas
 Explanation: The .describe() method provides a summary of the central
tendency, dispersion, and shape of a dataset’s distribution.
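A minimal example of .describe() on a toy DataFrame:

```python
# Toy example: summary statistics including the 25th percentile and median.
import pandas as pd

df = pd.DataFrame({"kwh": [10, 12, 9, 14, 30, 11]})
print(df.describe())  # count, mean, std, min, 25%, 50% (median), 75%, max
```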
5. What is a good practice when dealing with outliers in your data?
 A) Always delete outliers
 B) Assess the impact of outliers on your model before deciding
 C) Convert outliers to the mean value
 D) Ignore outliers
 Answer: B) Assess the impact of outliers on your model before deciding
 Explanation: Outliers can sometimes be informative. It’s important to
understand their impact on your analysis or model before deciding to
remove or adjust them.
6. For splitting a dataset into training and testing sets, which function is
typically used in Python?
 A) pandas.split()
 B) numpy.split()
 C) scikit-learn's train_test_split()
 D) matplotlib.split()
 Answer: C) scikit-learn's train_test_split()
 Explanation: The train_test_split() function from Scikit-learn is
commonly used for splitting datasets into training and testing sets.
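For instance, using a built-in toy dataset:

```python
# Toy example: hold out 20% of the data for testing, with a fixed random seed.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```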
7. When dealing with a time-series dataset, what is the best way to split it into
training and testing sets?
 A) Randomly split the dataset
 B) Split based on a specific time point
 C) Use k-fold cross-validation
 D) Always use the first half for training
 Answer: B) Split based on a specific time point
 Explanation: For time-series data, it's crucial to maintain the temporal
order of observations. Therefore, splitting the dataset based on time (e.g.,
training on earlier data and testing on later data) is appropriate.
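A small illustration of a time-based split on a toy daily series:

```python
# Toy example: split an ordered daily series at a time point instead of randomly.
import pandas as pd

ts = pd.DataFrame(
    {"value": range(100)},
    index=pd.date_range("2021-01-01", periods=100, freq="D"),
)
train = ts.loc[:"2021-03-15"]   # earlier data for training
test = ts.loc["2021-03-16":]    # later data for testing
```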
8. Which method can be used to encode categorical variables into a format
suitable for machine learning models?
 A) pandas.get_dummies()
 B) numpy.get_dummies()
 C) matplotlib.encode()
 D) seaborn.encode()
 Answer: A) pandas.get_dummies()
 Explanation: pandas.get_dummies() is a useful method for converting
categorical variable(s) into dummy/indicator variables.
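A minimal example of pd.get_dummies() on a toy column:

```python
# Toy example: one-hot encode a categorical column.
import pandas as pd

df = pd.DataFrame({"region": ["residential", "industrial", "commercial"]})
print(pd.get_dummies(df, columns=["region"]))
```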
9. What should you do if your linear regression model shows high variance?
 A) Add more features
 B) Increase the model complexity
 C) Apply regularization techniques
 D) Remove all features
 Answer: C) Apply regularization techniques
 Explanation: Regularization techniques like Lasso or Ridge can help
reduce overfitting (high variance) in linear models by penalizing large
coefficients.
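A small illustration of Ridge regularization on synthetic data (the alpha value is arbitrary):

```python
# Toy example: Ridge regression penalizes large coefficients to curb overfitting.
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=200, n_features=20, noise=10, random_state=0)
model = Ridge(alpha=1.0)  # larger alpha = stronger penalty
model.fit(X, y)
```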
10. If you have a small dataset, what cross-validation technique should you
consider using?
 A) Hold-out validation
 B) Leave-One-Out Cross-Validation (LOOCV)
 C) Time-series split
 D) Stratified k-fold with a large k
 Answer: B) Leave-One-Out Cross-Validation (LOOCV)
 Explanation: LOOCV is particularly useful for small datasets. It
maximizes both the training and testing data, ensuring that each data
point is used for validation exactly once.

Quiz set 2

1. What is the purpose of K-Fold cross-validation in machine learning?


 A) To maximize the model's accuracy on the training data
 B) To estimate the model's performance on an independent dataset
 C) To reduce the computational cost of training a model
 D) To increase the size of the training dataset
2. What does the "K" in K-Fold cross-validation represent?
 A) The number of iterations in the cross-validation process
 B) The number of models to be trained
 C) The number of times the model is tested
 D) The number of subsets the data is divided into
3. How many times is the model trained in a standard K-Fold cross-validation setup
with K=5?
 A) Once
 B) Five times
 C) Ten times
 D) Twenty times
4. In the context of K-Fold cross-validation, what does "shuffling" the data do?
 A) Increases the number of folds
 B) Randomizes the order of data points to prevent order bias
 C) Reduces the total number of data points
 D) Organizes the data by class labels
5. What is the main benefit of using iterated K-Fold validation with shuffling?
 A) It simplifies the model training process
 B) It provides a more robust estimate of model performance
 C) It decreases the training time of the model
 D) It eliminates the need for a test dataset
6. In iterated K-Fold cross-validation, if the iteration is performed 3 times with K=5,
how many total model trainings will occur?
 A) 3
 B) 5
 C) 15
 D) 8
7. Which of the following is a drawback of iterated K-Fold validation with shuffling?
 A) Decreased model robustness
 B) More biased performance estimates
 C) Increased computational cost
 D) Less thorough evaluation of the model
8. When might you prefer to use a single round of K-Fold cross-validation over
iterated K-Fold with shuffling?
 A) When you have a very large dataset
 B) When your data is in a completely random order
 C) When computational resources are limited
 D) All of the above
9. What is the usual range for the number of folds (K) in K-Fold cross-validation
when starting out?
 A) 2-4
 B) 5-10
 C) 10-15
 D) 20-30
10. What does a "fold" in K-Fold cross-validation correspond to?
 A) A single data point used for testing
 B) A single iteration of training and testing
 C) One of the subsets the dataset is split into for cross-validation
 D) The final model generated after cross-validation
Answers: 1-B, 2-D, 3-B, 4-B, 5-B, 6-C, 7-C, 8-C, 9-B, 10-C.

Quiz 3

1. What is the primary purpose of feature engineering in machine learning?


 A) To increase the computational complexity of models
 B) To transform raw data into a format suitable for modeling
 C) To decrease the accuracy of models
 D) To make models more overfitted
2. The 'Curse of Dimensionality' refers to:
 A) A spell cast by a data scientist
 B) The phenomena that arise when dealing with low-dimensional data
 C) The benefits of having more data features
 D) The problems that occur when analyzing high-dimensional data
3. Which of the following is a common technique used in feature extraction?
 A) Subtraction of the mean from each feature
 B) Principal Component Analysis (PCA)
 C) Assigning '0' to missing categorical data
 D) Increasing the number of features in a dataset
4. When would you most likely use feature selection?
 A) When you want to add more features to the model
 B) When you need to reduce overfitting and improve model accuracy
 C) When you have a small amount of data
 D) When you want to increase the model's variance
5. Feature creation in the context of a bank's risk assessment would involve:
 A) Removing features like Age and Income
 B) Creating a new feature such as 'Debt-to-Income Ratio'
 C) Using only raw data without any transformations
 D) Applying a cube root transformation to the number of dependents
6. In e-commerce product recommendation, what might be a result of feature
selection?
 A) Choosing 'browsing history' as the least relevant feature
 B) Identifying 'search terms' and 'past purchases' as the most relevant
features
 C) Selecting 'color of the product' as the only feature
 D) Eliminating all features to simplify the model
7. Applying Principal Component Analysis (PCA) to medical images would
help to:
 A) Increase the resolution of the images
 B) Reduce the dimensionality of the image data
 C) Assign binary values to each pixel
 D) Calculate the exact disease probability
8. What is a potential caveat when doing feature engineering, especially in
time-series applications?
 A) Overly simplistic model evaluations
 B) Introducing bias or leaking future information into the training process
 C) Using too few features
 D) Applying transformations is unnecessary
9. In the context of numerical data, binning is used to:
 A) Increase the range of continuous variables
 B) Divide a feature into several categories based on value range
 C) Apply a logarithmic transformation to the data
 D) Transform categorical data into numerical values
10. Which scaling technique is particularly useful when you have outliers in
your data?
 A) MinMax Scaling
 B) Mean/Variance Standardization
 C) MaxAbs Scaling
 D) Robust Scaling
The correct answers are:
1. B
2. D
3. B
4. B
5. B
6. B
7. B
8. B
9. B
10. D
