Ba CH-2
Data Acquisition
Definition:
Data acquisition is the process of collecting raw data from various sources for analysis.
Web Scraping – Extracting data from websites (e.g., Amazon price tracking).
APIs – Accessing data from platforms like Twitter or Google Analytics.
IoT Devices – Collecting sensor data (e.g., Fitbit for health tracking).
Transaction Databases – Recording sales and customer purchases (e.g., Walmart POS
systems).
Example:
A bank acquiring customer transaction data faces integration issues when merging data from
different branches that use varying formats. Solution: Use data transformation tools to
standardize the format.
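The standardization step can be sketched in a few lines of Python. This is only an illustration: the branch names, field names, and date formats below are assumptions made up for the example, not details of any real banking system.

```python
from datetime import datetime

# Hypothetical raw records: each branch uses its own date and amount format.
branch_a = [{"date": "2024-03-15", "amount": "1,250.50"}]
branch_b = [{"date": "15/03/2024", "amount": "980"}]

# Assumed date format per branch (illustrative only).
DATE_FORMATS = {"A": "%Y-%m-%d", "B": "%d/%m/%Y"}


def standardize(record, branch):
    """Convert one branch record to an ISO date string and a float amount."""
    parsed = datetime.strptime(record["date"], DATE_FORMATS[branch])
    return {
        "branch": branch,
        "date": parsed.date().isoformat(),
        "amount": float(record["amount"].replace(",", "")),
    }


# After standardization, records from both branches share one format.
merged = [standardize(r, "A") for r in branch_a] + [
    standardize(r, "B") for r in branch_b
]
print(merged)
```

Once every record is in a single format, the merged list can be loaded into one table without per-branch special cases.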
Data Processing
Definition:
Data processing involves cleaning, transforming, and preparing data for analysis.
Handling Outliers
Definition:
Outliers are extreme values that deviate significantly from the rest of the dataset and can skew
analysis.
Causes of Outliers:
Data Entry Errors (e.g., a typo in an employee's salary recorded as $1,000,000 instead
of $100,000).
Genuine Extreme Values (e.g., an NBA player's height in a dataset of average humans).
1. Detection Methods:
o Box Plot – Flags values that fall more than 1.5 × IQR below the first quartile (Q1) or
above the third quartile (Q3), where the interquartile range (IQR) is Q3 − Q1.
o Z-Score Method – Flags values more than 3 standard deviations from the mean.
2. Treatment Methods:
o Remove Outliers – If caused by errors.
o Transform Data – Log transformations can reduce skewness.
o Cap Values – Replace extreme values with the nearest acceptable limit, such as a chosen
percentile or the box plot fence (also called winsorizing).
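The two detection methods listed above can be sketched with Python's standard library. The salary figures are made-up illustrative data, and the function names are hypothetical; the 1.5 × IQR fence and the threshold of 3 are the conventional defaults, which can be tuned.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return values below Q1 - k*IQR or above Q3 + k*IQR (box plot rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


def zscore_outliers(values, threshold=3.0):
    """Return values more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    return [v for v in values if abs(v - mean) / stdev > threshold]


# Illustrative data: 19 ordinary salaries plus one data entry error.
salaries = [95_000 + 1_000 * i for i in range(19)] + [1_000_000]

print(iqr_outliers(salaries))
print(zscore_outliers(salaries))
```

On very small samples the z-score method can miss an outlier, because the extreme value itself inflates the standard deviation; the IQR rule is less affected by this.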
Example:
A credit card company notices that some customers have unusually high transactions. These
could either be fraud or a genuine case (high-spending customers). Solution: Investigate before
deciding to remove or keep the outliers.
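Once flagged values have been investigated, two of the treatment methods listed above (capping and log transformation) can be sketched as follows. The transaction amounts are invented for illustration, the function names are hypothetical, and capping at the box plot fences is only one possible choice of limit.

```python
import math
import statistics


def cap_to_fences(values, k=1.5):
    """Clamp each value to the range [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [min(max(v, low), high) for v in values]


def log_transform(values):
    """Apply log(1 + x) to compress large values; requires x >= 0."""
    return [math.log1p(v) for v in values]


# Illustrative transaction amounts with one very large purchase.
transactions = [20, 35, 40, 42, 50, 55, 60, 75, 90, 12_000]

capped = cap_to_fences(transactions)
logged = log_transform(transactions)
print(capped)
print(logged)
```

Capping keeps the row in the dataset while limiting its influence, which is useful when the value may be genuine (a high-spending customer) rather than an error.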
Handling Missing Values
Deletion:
o Listwise Deletion – Remove entire rows with missing values.
o Pairwise Deletion – For each calculation, use every row that has values for the variables
involved, instead of dropping any row that has a missing value anywhere.
Imputation:
o Mean/Median/Mode Imputation – Replacing missing values with statistical
measures.
o Predictive Modeling (KNN, Regression) – Predict missing values using other
features.
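The deletion strategies and simple statistical imputation can be sketched as below. The rows and helper names are hypothetical, `None` stands for a missing value, and pairwise deletion is shown in its simplest single-column form. Predictive imputation (KNN, regression) is not covered in this sketch.

```python
import statistics

# Hypothetical rows; None marks a missing value.
rows = [
    {"age": 25, "income": 40_000},
    {"age": None, "income": 52_000},
    {"age": 41, "income": None},
    {"age": 33, "income": 61_000},
]


def listwise_delete(rows):
    """Keep only rows that have no missing values at all."""
    return [r for r in rows if all(v is not None for v in r.values())]


def available_values(rows, column):
    """Pairwise-style: use every observed value of one column,
    even if other columns in the same row are missing."""
    return [r[column] for r in rows if r[column] is not None]


def impute(rows, column, strategy=statistics.median):
    """Fill gaps in a column with a statistic of its observed values.
    Pass statistics.mean or statistics.mode to change the strategy."""
    fill = strategy(available_values(rows, column))
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]


complete = listwise_delete(rows)                          # only fully observed rows
mean_age = statistics.mean(available_values(rows, "age"))  # uses all observed ages
filled = impute(rows, "income")                           # median of observed incomes
print(len(complete), mean_age, filled)
```

Listwise deletion is the simplest option but discards half of this small dataset, which is why the available-value and imputation approaches are often preferred when data is scarce.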
Example:
A hospital's patient dataset has missing blood pressure values. Solution: Use median imputation,
filling each missing value with the median blood pressure of patients of similar age and weight.
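A minimal sketch of group-based median imputation for this example follows. The patient records, the field name `systolic_bp`, and the 10-year age bands are assumptions for illustration; for brevity the grouping uses age only, and a weight band could be added to the grouping key in the same way.

```python
import statistics
from collections import defaultdict

# Hypothetical patient records; None marks a missing blood pressure reading.
patients = [
    {"age": 34, "systolic_bp": 118},
    {"age": 38, "systolic_bp": None},
    {"age": 36, "systolic_bp": 122},
    {"age": 61, "systolic_bp": 141},
    {"age": 68, "systolic_bp": None},
    {"age": 65, "systolic_bp": 135},
]


def age_band(age, width=10):
    """Map an age to the start of its band, e.g. 34 -> 30."""
    return age // width * width


def impute_bp_by_age_band(patients):
    """Fill missing readings with the median of the same age band,
    falling back to the overall median if the band has no readings."""
    observed = defaultdict(list)
    for p in patients:
        if p["systolic_bp"] is not None:
            observed[age_band(p["age"])].append(p["systolic_bp"])
    all_observed = [bp for values in observed.values() for bp in values]
    overall_median = statistics.median(all_observed)

    filled = []
    for p in patients:
        bp = p["systolic_bp"]
        if bp is None:
            group = observed.get(age_band(p["age"]))
            bp = statistics.median(group) if group else overall_median
        filled.append({**p, "systolic_bp": bp})
    return filled


filled_patients = impute_bp_by_age_band(patients)
print(filled_patients)
```

Using the median within a group, rather than one global median, keeps the imputed value close to what is typical for comparable patients.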
Data Sampling
Definition:
Data sampling is the process of selecting a subset of data for analysis to save time and computing
resources.
| Sampling Type | Description | Example |
|---|---|---|
| Random Sampling | Every data point has an equal chance of selection. | Selecting 1,000 customers randomly from a database of 1 million. |
| Stratified Sampling | Data is divided into groups (strata), and samples are taken proportionally. | Selecting equal male and female respondents for a gender study. |
| Cluster Sampling | Data is divided into clusters, and a few clusters are randomly selected. | Surveying students from 5 randomly selected schools instead of all schools. |
| Systematic Sampling | Every nth record is selected from a dataset. | Selecting every 10th visitor on a website for feedback. |
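Random, systematic, and cluster sampling can be sketched with Python's `random` module. The population of 1,000 record IDs and the ten school clusters are hypothetical, and the seed is fixed only so the sketch is repeatable.

```python
import random

rng = random.Random(42)            # fixed seed so the sketch is repeatable
population = list(range(1, 1001))  # hypothetical record IDs 1..1000

# Random sampling: every record has the same chance of being selected.
random_sample = rng.sample(population, k=10)

# Systematic sampling: every nth record, starting at a random offset
# within the first interval.
step = 10
start = rng.randrange(step)
systematic_sample = population[start::step]

# Cluster sampling: split the data into clusters, randomly choose a few
# clusters, and keep every record inside the chosen clusters.
clusters = {
    school: population[i * 100:(i + 1) * 100]
    for i, school in enumerate("ABCDEFGHIJ")
}
chosen_schools = rng.sample(sorted(clusters), k=5)
cluster_sample = [rec for school in chosen_schools for rec in clusters[school]]

print(len(random_sample), len(systematic_sample), len(cluster_sample))
```

Systematic sampling is simple to apply to a stream of records, but it can be biased if the data has a repeating pattern with the same period as the step.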
Example:
A marketing firm wants to analyze customer preferences in a large city. Instead of surveying
everyone, they use stratified sampling to ensure responses from different income groups.
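Proportional stratified sampling for this example can be sketched as follows. The customer records, the three income labels, and the 10% sampling fraction are assumptions for illustration, and `stratified_sample` is a hypothetical helper rather than a library function.

```python
import random
from collections import defaultdict


def stratified_sample(records, key, fraction, rng):
    """Sample the same fraction from every stratum defined by `key`,
    taking at least one record from each stratum."""
    strata = defaultdict(list)
    for rec in records:
        strata[rec[key]].append(rec)
    sample = []
    for group in strata.values():
        k = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, k))
    return sample


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical city population: 500 low, 400 middle, 100 high income.
customers = (
    [{"id": i, "income": "low"} for i in range(500)]
    + [{"id": i, "income": "middle"} for i in range(500, 900)]
    + [{"id": i, "income": "high"} for i in range(900, 1000)]
)

survey = stratified_sample(customers, "income", 0.1, rng)
print(len(survey))
```

Because each income group is sampled separately, the small high-income group is guaranteed to appear in the survey, which simple random sampling does not ensure.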
Case Studies
Case Study 1: Google’s Search Engine Data Processing
Problem:
Google needs to process a massive volume of search queries efficiently while handling missing and
inconsistent data.
Solution:
1. Data Acquisition: Collecting web page content through crawling, along with user behavior data.
2. Handling Outliers & Missing Values: Machine learning detects and removes irrelevant
or low-quality data.
3. Data Sampling: Instead of analyzing all searches, Google samples a fraction of queries
to refine its algorithms.
Results:
Case Study 2: Zomato’s Restaurant Review Analysis
Problem:
Zomato wants to analyze restaurant reviews but faces missing data and outliers.
Solution:
1. Data Acquisition: Collected reviews from customers, including ratings and comments.
2. Handling Missing Values: Used sentiment analysis to predict missing ratings based on
review text.
3. Outlier Detection: Identified fake reviews by detecting extreme rating patterns.
Results:
Case Study 3: Tesla’s Self-Driving Sensor Data
Problem:
Tesla collects real-time data from sensors in self-driving cars but faces challenges with missing
and noisy data.
Solution:
1. Data Acquisition: Collected sensor data from cameras, LiDAR, and GPS.
2. Data Cleaning: Removed inaccurate GPS readings and outliers.
3. Data Sampling: Used systematic sampling to analyze specific driving patterns instead of
all data.
Results: