Case Studies ML
Case Study 3
In the TechTalent Inc. case study, the bias identified in the job applicant screening tool,
which originated from the company's historical hiring data, included the following
examples:
1. Educational Background Bias: The model showed a preference for candidates
from certain prestigious universities or colleges. This meant that equally qualified
candidates from less renowned institutions were less likely to be shortlisted. This
bias could have originated from a pattern in the historical data where past hiring
decisions favored candidates from specific universities.
2. Demographic Bias: The model exhibited biases against certain demographic
groups. For instance, it might have been less likely to shortlist candidates based
on factors like age, gender, or ethnicity. Such bias could reflect past hiring
practices where certain demographic groups were either consciously or
unconsciously favored over others.
3. Experience and Skill Set Bias: The model might have shown a tendency to
favor candidates with certain types of work experience or specific skill sets that
mirrored the profiles of previously successful candidates. This could overlook
candidates with diverse or unconventional career paths who could bring valuable
perspectives and skills to the company.
4. Extracurricular Activities Bias: There could have been a bias towards
candidates with certain types of extracurricular activities, possibly those
historically common among past successful applicants. This type of bias might
inadvertently disadvantage candidates who did not have the opportunity or
inclination to engage in these activities.
In this scenario, the biases in the company's hiring process were essentially a reflection
of historical patterns in their recruitment data. If the company had previously, even
unintentionally, favored certain universities, demographics, or career profiles, these
preferences would have been embedded in the historical data used to train the machine
learning model. The model, therefore, learned to replicate these preferences,
perpetuating the existing biases in its predictions.
Detecting and addressing these biases was crucial for ensuring a fair and equitable
hiring process. By retraining the model with debiased data and applying fairness
techniques, TechTalent Inc. aimed to create a more inclusive screening tool that
evaluated candidates based on their merits and relevant qualifications, free from
historical prejudices. We just need to be careful when dealing with this…
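One simple way to detect the kind of bias described above is to compare shortlisting rates across groups. The sketch below is only an illustration: the field names (`university_tier`, `shortlisted`) and the sample records are made up and are not from the TechTalent Inc. data. The 0.8 cut-off often used with this ratio is the "four-fifths" rule of thumb, a screening heuristic rather than proof of fairness.

```python
from collections import defaultdict

def selection_rates(records, group_key):
    """Return the fraction of shortlisted candidates for each group."""
    totals = defaultdict(int)
    shortlisted = defaultdict(int)
    for rec in records:
        group = rec[group_key]
        totals[group] += 1
        if rec["shortlisted"]:
            shortlisted[group] += 1
    return {g: shortlisted[g] / totals[g] for g in totals}

def disparate_impact_ratio(rates):
    """Lowest selection rate divided by the highest (1.0 means parity)."""
    highest = max(rates.values())
    if highest == 0:
        return 1.0  # nobody was shortlisted in any group, so nothing to compare
    return min(rates.values()) / highest

# Hypothetical screening output, for illustration only.
sample = [
    {"university_tier": "prestigious", "shortlisted": True},
    {"university_tier": "prestigious", "shortlisted": True},
    {"university_tier": "prestigious", "shortlisted": True},
    {"university_tier": "prestigious", "shortlisted": False},
    {"university_tier": "other", "shortlisted": True},
    {"university_tier": "other", "shortlisted": False},
    {"university_tier": "other", "shortlisted": False},
    {"university_tier": "other", "shortlisted": False},
]

rates = selection_rates(sample, "university_tier")
ratio = disparate_impact_ratio(rates)
print(rates, round(ratio, 2))
```

A ratio well below 0.8, as in this made-up sample, would be a signal to look more closely at the training data and features before trusting the tool.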
Case in 2022
Case Study 4
This one is my favourite…
Case Study: Agricultural Crop Yield Prediction System
Background: An agri-tech university, "AgriFutureTech" (it's a university in Malaysia, but I'm
not putting the real name), decided to develop a machine learning-based crop yield prediction
system. Their goal was to provide farmers with accurate predictions of crop yields to
optimize farming practices and maximize outputs.
Goal: To create a precise and reliable system that could predict crop yields based on
various factors like weather data, soil quality, and farming practices.
The Project:
1. Data Collection: AgriFutureTech collected data from various farms, including soil
composition, weather patterns, crop types, irrigation schedules, and historical
yield data.
2. Model Selection: They chose a Random Forest model for its robustness and
ability to handle diverse data types.
3. Training and Evaluation: The model was trained using Amazon SageMaker,
and initial testing showed promising results in yield prediction accuracy.
Incident: Upon piloting the system with a group of farmers, the predictions were found
to be significantly off in certain regions, leading to mistrust among the farmers and
reluctance to adopt the system.
Investigation:
1. Analyzing Prediction Errors: The team analyzed regions with high prediction
errors and compared them against regions where predictions were more
accurate.
2. Feedback from Farmers: Direct feedback from farmers revealed that certain
local farming practices and microclimatic conditions weren't adequately
represented in the data.
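The first investigation step, comparing errors region by region, can be sketched roughly as follows. This is an assumption about how such an analysis might look, not the team's actual code: the region names, the yield numbers, and the "twice the median error" rule are all placeholders.

```python
from collections import defaultdict
from statistics import median

def mae_by_region(rows):
    """Mean absolute error of yield predictions, grouped by region."""
    errors = defaultdict(list)
    for row in rows:
        errors[row["region"]].append(abs(row["predicted"] - row["actual"]))
    return {region: sum(errs) / len(errs) for region, errs in errors.items()}

def high_error_regions(region_mae, factor=2.0):
    """Regions whose error exceeds `factor` times the median regional error."""
    cutoff = factor * median(region_mae.values())
    return sorted(r for r, err in region_mae.items() if err > cutoff)

# Hypothetical pilot results (yields in tonnes per hectare), illustration only.
pilot = [
    {"region": "region_a", "predicted": 4.0, "actual": 4.2},
    {"region": "region_a", "predicted": 3.8, "actual": 3.6},
    {"region": "region_b", "predicted": 5.0, "actual": 5.3},
    {"region": "region_b", "predicted": 4.5, "actual": 4.4},
    {"region": "region_c", "predicted": 4.0, "actual": 2.5},
    {"region": "region_c", "predicted": 4.2, "actual": 2.9},
]

region_mae = mae_by_region(pilot)
flagged = high_error_regions(region_mae)
print(flagged)
```

The flagged regions are where it pays to sit down with the farmers first, since that is where the data is most likely missing something local.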
Root Cause – Lack of Domain Expertise: The primary issue was identified as a lack
of domain expertise in agriculture. The AgriFutureTech team, primarily composed of
data scientists and technologists, had not involved agricultural experts in the
development process. This led to key factors being overlooked:
Local Farming Practices: Specific cultivation techniques used in certain regions
were not captured in the data.
Microclimatic Variations: Small-scale climatic conditions affecting certain areas
were not considered.
Soil Variation Complexity: The complexity of soil variations and their impact on
different crops were underestimated.
Resolution:
1. Incorporating Agricultural Expertise: AgriFutureTech onboarded agricultural
scientists and experienced local farmers to gain insights into critical factors
influencing crop yields.
2. Data Enrichment: The data was enriched with detailed local farming practices,
microclimatic data, and more nuanced soil health parameters.
3. Model Reconfiguration: The model was reconfigured to account for these
additional factors, with input from agricultural experts on feature importance and
data interpretation.
4. Pilot Program Redesign: The pilot program was redesigned to include
continuous feedback loops with the participating farmers, ensuring real-time
validation of the predictions.
Outcome: The redeveloped system showed significant improvements in accuracy,
gaining trust among the farming community. The collaboration with agricultural experts
led to a more nuanced and practical solution.
Lessons Learned:
Importance of Domain Expertise: In-depth domain knowledge is crucial,
especially in fields like agriculture, where local knowledge can be as vital as data.
Collaborative Development: Collaboration between data scientists and domain
experts is essential for developing practical and effective solutions.
Continuous Feedback Loop: Ongoing feedback from end-users is vital for fine-
tuning and validating machine learning models in real-world scenarios.
Conclusion: This case study highlights the importance of domain expertise in
developing machine learning applications. The initial oversight by AgriFutureTech
underscores the necessity of understanding the specific domain nuances and
integrating this knowledge throughout the development process, particularly in fields
where local and expert knowledge plays a critical role.
The above is generalizing it… let's explore specific agricultural techniques and regional
characteristics of Malaysia that were initially overlooked and how their inclusion could
have improved the model's accuracy. It's better if you can get a map of Malaysia to see
what I mean, and the travelling was fantastic.
1. Local Farming Practices in Malaysia
Paddy Field Water Management: In Malaysia, especially in regions like Kedah
and Perak, paddy fields are predominant. The traditional water management
practices in these areas, which can significantly affect yield, were not captured.
Techniques such as intermittent irrigation or specific drainage practices during
different growth stages are crucial for yield prediction.
Mixed Cropping Systems: Small-scale farmers in Malaysia often practice mixed
cropping, growing multiple types of crops on the same land. This diversification
affects soil nutrients and pest dynamics, impacting crop yields differently than
monoculture systems.
Organic Farming Practices: The rise of organic farming, especially in areas like
Cameron Highlands, involves unique practices like natural pest control and
organic fertilization, which significantly influence crop health and yield.
2. Microclimatic Variations
Regional Microclimates: Malaysia has diverse microclimates due to its
topography and proximity to water bodies. The model failed to account for
microclimates like the highland areas (Cameron Highlands), which have different
temperature and humidity profiles compared to lowland areas.
Rainfall Patterns: Malaysia experiences monsoon seasons with varying intensity
across regions. The model did not adequately account for the localized impact of
these seasonal changes on crop growth cycles.
3. Soil Variation Complexity
Peat Soils in Sarawak: In parts of Sarawak, peat soils are common, which have
unique properties affecting nutrient availability and water retention. These
aspects are critical for certain crops like oil palm.
Acid Sulfate Soils: In coastal areas, acid sulfate soils pose a challenge for
agriculture due to their low pH and high sulfate content. Crops grown in these
soils have different yield patterns.
Techniques to Capture These Factors:
Geospatial Data Analysis: Incorporating satellite imagery and GIS (Geographic
Information System) data to identify and account for regional farming practices
and microclimates.
Soil Health Monitoring: Using IoT sensors for real-time soil monitoring to
capture variations in soil pH, moisture, and nutrient levels across different
regions.
Seasonal Weather Data Integration: Including granular weather data such as
localized rainfall patterns, temperature variations, and humidity levels specific to
the Malaysian climate.
Collaborative Data Gathering: Partnering with local agricultural bodies and
farmers to gather qualitative data on farming practices, crop rotation patterns,
and regional agricultural events (like local festivals or market demands that
influence farming practices).
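Once this local knowledge has been gathered, it still has to be attached to each farm record before the model can use it. Below is a minimal sketch of that enrichment step. The region keys, attribute names, and values are hypothetical placeholders, not real measurements for any Malaysian region.

```python
# Hypothetical region-level attributes, as they might be compiled together
# with agronomists and local farmers. All values are placeholders.
REGION_PROFILES = {
    "highland_site": {
        "soil_type": "mineral",
        "water_management": "rain_fed",
        "cropping_system": "mixed",
    },
    "coastal_site": {
        "soil_type": "acid_sulfate",
        "water_management": "intermittent_irrigation",
        "cropping_system": "monoculture",
    },
}

def enrich(farm, profiles):
    """Return a copy of the farm record with its region's attributes merged in."""
    profile = profiles.get(farm["region"])
    if profile is None:
        # Keep the record but mark it, so gaps in local knowledge stay visible
        # instead of being silently filled with defaults.
        return {**farm, "profile_missing": True}
    return {**farm, **profile, "profile_missing": False}

farm = {"farm_id": "f-001", "region": "coastal_site", "crop": "paddy"}
print(enrich(farm, REGION_PROFILES))
```

Marking records with no profile, rather than dropping them, makes it easy to see which regions still need input from the local experts.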
In conclusion, the initial failure to incorporate these specific regional agricultural
practices and environmental factors into AgriFutureTech's model highlights the
importance of localized knowledge in the development of accurate and reliable
predictive systems, particularly in a diverse and climatically varied country like Malaysia.
Special case study: IRB
Note: the services and names have been generalized. This doesn't mean that AWS is
providing the services; I am using AWS services as a reference just to make things
clear.
This is by no means the most accurate representation of the IRB system. Do not take
this as a blueprint for any unauthorized activities.
Case Study: Implementing ML for Fraudulent Income Tax Return Detection
Background
A national tax authority aimed to modernize its fraud detection systems by leveraging
machine learning (ML) to identify and investigate potentially fraudulent income tax
returns. The initiative's goal was to minimize losses due to fraud and increase taxpayer
confidence in the system's integrity.
Objectives
Automate the detection of anomalous behavior in the tax filing process.
Process and analyze clickstream data in real-time.
Establish a scalable, secure, and compliant ML solution.
Solution Design
The solution involved several AWS services and machine learning techniques, designed
to process large volumes of user interaction data and identify patterns indicative of
fraudulent activity.
Data Collection
Clickstream Data Capture: As users navigated the tax return website,
clickstream data, including mouse movements, clicks, and typing behavior, were
captured.
Amazon Kinesis Data Streams: This service was utilized to stream interaction data
in real-time, enabling immediate processing and analysis.
Data Processing and Enrichment
AWS Lambda: Triggered by Kinesis, these functions preprocessed the data by
cleaning, normalizing, and transforming raw clickstream data into a structured
format.
Data Enrichment: Additional contextual data, such as login times and previous
filing history, was appended to each event to provide a richer dataset for
analysis.
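To make the preprocessing step concrete, here is a rough sketch of a Kinesis-triggered Lambda handler. Kinesis delivers each record's payload base64-encoded under `Records[].kinesis.data`. Everything about the payload itself (the `user_id`, `event_type`, `page`, and `timestamp_ms` fields) is an assumption for illustration, and the enrichment lookup is left out.

```python
import base64
import json

def normalize(raw):
    """Clean one raw clickstream event into a structured record."""
    return {
        "user_id": str(raw["user_id"]).strip(),
        "event_type": str(raw.get("event_type", "unknown")).strip().lower(),
        "page": str(raw.get("page", "")).strip().lower(),
        "timestamp_ms": int(raw["timestamp_ms"]),
    }

def handler(event, context):
    """Entry point in the shape Lambda uses for Kinesis-triggered functions."""
    cleaned = []
    for record in event["Records"]:
        payload = base64.b64decode(record["kinesis"]["data"])
        try:
            cleaned.append(normalize(json.loads(payload)))
        except (KeyError, TypeError, ValueError):
            # Skip malformed events rather than failing the whole batch.
            continue
    return cleaned
```

In a real pipeline the cleaned records would be written on to storage or another stream instead of being returned, and skipped events would normally be logged or sent to a dead-letter destination.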
Feature Engineering
Behavioral Analysis: Features were engineered to quantify user behavior, such
as time spent on pages, navigation paths, and error rates in form completion.
Historical Comparison: Features were also designed to compare current
behavior against historical patterns at an individual and aggregate level.
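The two kinds of features above can be sketched as follows. The event types (`form_submit`, `form_error`) and field names are hypothetical, and the z-score is just one simple way to express "how unusual is this compared with the user's own history".

```python
from statistics import mean, pstdev

def session_features(events):
    """Summarize one session's clickstream events into behavioral features."""
    if not events:
        return {"duration_s": 0.0, "pages_visited": 0, "form_error_rate": 0.0}
    ordered = sorted(events, key=lambda e: e["timestamp_ms"])
    submits = sum(1 for e in ordered if e["event_type"] == "form_submit")
    errors = sum(1 for e in ordered if e["event_type"] == "form_error")
    return {
        "duration_s": (ordered[-1]["timestamp_ms"] - ordered[0]["timestamp_ms"]) / 1000,
        "pages_visited": len({e["page"] for e in ordered}),
        # Validation errors per submission attempt; 0.0 if nothing was submitted.
        "form_error_rate": errors / submits if submits else 0.0,
    }

def deviation_from_history(current, history):
    """Z-score of the current value against the user's own past values."""
    if len(history) < 2:
        return 0.0  # not enough history to say anything
    spread = pstdev(history)
    if spread == 0:
        return 0.0
    return (current - mean(history)) / spread

# Hypothetical session, for illustration only.
session = [
    {"timestamp_ms": 0, "event_type": "page_view", "page": "/income"},
    {"timestamp_ms": 30000, "event_type": "form_error", "page": "/income"},
    {"timestamp_ms": 90000, "event_type": "form_submit", "page": "/review"},
]
features = session_features(session)
z = deviation_from_history(features["duration_s"], [300.0, 320.0, 280.0])
print(features, round(z, 2))
```

Here the session is far shorter than the user's usual filing time, so the z-score is strongly negative. That alone does not mean fraud, but it is the kind of signal an anomaly model can pick up.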
Model Training and Selection
Algorithm Choice: Given the nature of fraud detection, unsupervised learning
algorithms such as Isolation Forest and Autoencoders were chosen due to their
effectiveness in anomaly detection.
Model Training with Amazon SageMaker: Amazon SageMaker facilitated the
training, tuning, and validation of the selected models, offering a managed
environment that streamlined these processes.
Real-time Prediction and Monitoring
SageMaker Inference: The trained models were deployed using SageMaker's
real-time endpoints for immediate inference on streaming data.
Threshold Determination: Anomalies were flagged based on a sensitivity
threshold derived from historical fraud patterns and expert input.
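One way to turn "historical patterns and expert input" into a number is to let the experts pick the share of sessions they want reviewed, and then read the matching cut-off from historical anomaly scores. The sketch below assumes higher scores mean more anomalous. The scores and the 10% flag rate are made up for illustration.

```python
def threshold_for_flag_rate(historical_scores, flag_rate):
    """Score at or above which roughly `flag_rate` of past sessions get flagged."""
    if not historical_scores:
        raise ValueError("need at least one historical score")
    if not 0 < flag_rate < 1:
        raise ValueError("flag_rate must be between 0 and 1")
    ordered = sorted(historical_scores)
    # Number of past sessions that would have been flagged (at least one).
    k = max(1, round(len(ordered) * flag_rate))
    return ordered[-k]

def is_flagged(score, threshold):
    """True when a session's anomaly score reaches the review threshold."""
    return score >= threshold

# Hypothetical historical anomaly scores: 0.05, 0.10, ..., 1.00.
history = [i / 20 for i in range(1, 21)]
threshold = threshold_for_flag_rate(history, flag_rate=0.10)
print(threshold, is_flagged(0.97, threshold), is_flagged(0.50, threshold))
```

Tying the threshold to a flag rate keeps the auditors' workload predictable. The rate itself is a judgment call that should be revisited as confirmed fraud cases come back from review.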
Human-in-the-Loop
Review Process: Flagged cases were reviewed by human auditors for
confirmation, ensuring that the ML system worked in tandem with experienced
personnel.
Feedback Mechanism: Auditors' findings were fed back into the model to refine
its predictive capabilities continually.
Security and Compliance
Data Protection: All data handling was designed to be compliant with privacy
laws, ensuring taxpayer information remained confidential and secure.
AWS IAM: Roles and permissions were strictly managed, with access granted
only to necessary personnel and systems.
Challenges and Best Practices
Challenge - Data Sensitivity: Handling personal taxpayer data required strict
adherence to data privacy regulations.
Best Practice: Employ encryption in transit and at rest, alongside access
control measures.
Challenge - Model Bias: Ensuring the model did not unfairly target specific user
demographics was crucial.
Best Practice: Regular bias assessments and updates to training data
were performed to mitigate this risk.
Challenge - Dynamic Fraud Tactics: Fraudulent tactics evolve, so the model
needed to adapt to changing patterns.
Best Practice: Implement a robust feedback loop and conduct periodic
model retraining.
Outcome
The implementation of the ML-driven fraud detection system resulted in a significant
reduction in fraudulent filings and improved the efficiency of the audit process. With
real-time data analysis and machine learning, the tax authority could proactively prevent
fraud, saving millions in potential losses and reinforcing the integrity of the tax filing
system.
Quizzes
These quizzes were composed by some of our instructors; they are not official AWS material.
Quiz set 2
Quiz 3