Ba CH-2

Chapter 2 covers data acquisition, processing, and challenges, detailing primary and secondary data sources, methods like web scraping and APIs, and issues such as data quality and privacy. It also discusses handling outliers and missing values, including methods for detection and imputation. Case studies from Google, Zomato, and Tesla illustrate practical applications and solutions in data processing.

Chapter 2

Data
Data Acquisition
Definition:

Data acquisition is the process of gathering and collecting raw data from various sources for
analysis.

Types of Data Sources:

1. Primary Data – Collected firsthand through surveys, experiments, or direct observations.
   o Example: A company collects customer feedback through online surveys.
2. Secondary Data – Pre-existing data obtained from external sources.
   o Example: A business uses government census data for market research.

Methods of Data Acquisition:

 Web Scraping – Extracting data from websites (e.g., Amazon price tracking).
 APIs – Accessing data from platforms like Twitter or Google Analytics.
 IoT Devices – Collecting sensor data (e.g., Fitbit for health tracking).
 Transaction Databases – Recording sales and customer purchases (e.g., Walmart POS
systems).
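The parsing half of web scraping can be sketched with Python's built-in html.parser (a real scraper would first fetch the page over HTTP, e.g. with the requests library; the HTML snippet and class names here are hypothetical):

```python
from html.parser import HTMLParser

# Inline HTML standing in for a fetched product page (hypothetical data).
HTML = """
<ul>
  <li class="product"><span class="name">Widget</span> <span class="price">$19.99</span></li>
  <li class="product"><span class="name">Gadget</span> <span class="price">$4.50</span></li>
</ul>
"""

class PriceScraper(HTMLParser):
    """Collects the text inside <span class="price"> elements."""
    def __init__(self):
        super().__init__()
        self.in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        if tag == "span" and ("class", "price") in attrs:
            self.in_price = True

    def handle_data(self, data):
        if self.in_price:
            self.prices.append(data.strip())
            self.in_price = False

scraper = PriceScraper()
scraper.feed(HTML)
print(scraper.prices)  # ['$19.99', '$4.50']
```

In practice, scraping must also respect a site's robots.txt and terms of service.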

Challenges in Data Acquisition


Common Challenges & Solutions:

 Data Quality Issues – Inconsistent, duplicate, or incomplete data. Solution: use data cleaning techniques.
 High Volume of Data – Large datasets slow down processing. Solution: use cloud storage and big data tools.
 Data Privacy & Security – Risk of data breaches and regulatory non-compliance. Solution: implement encryption and adhere to GDPR/CCPA.
 Data Integration Issues – Combining data from multiple sources. Solution: use ETL (Extract, Transform, Load) processes.
Example:

A bank acquiring customer transaction data faces integration issues when merging data from
different branches with varying formats. Solution: Using data transformation tools to standardize
the format.

Data Processing
Definition:

Data processing involves cleaning, transforming, and preparing data for analysis.

Key Steps in Data Processing:

1. Data Cleaning – Removing inconsistencies, duplicates, and errors.
2. Data Transformation – Converting raw data into a structured format.
3. Feature Engineering – Creating new meaningful features from existing data.
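The three steps above can be sketched in plain Python (the order records and field names are made up for illustration):

```python
from datetime import date

# Raw transaction records (hypothetical data).
orders = [
    {"customer": "A", "amount": 250.0, "date": date(2024, 1, 5)},
    {"customer": "A", "amount": 250.0, "date": date(2024, 1, 5)},  # exact duplicate
    {"customer": "B", "amount": 90.0,  "date": date(2024, 2, 10)},
]

# 1. Data cleaning: drop exact duplicate records.
seen, cleaned = set(), []
for o in orders:
    key = (o["customer"], o["amount"], o["date"])
    if key not in seen:
        seen.add(key)
        cleaned.append(o)

# 2. Data transformation: restructure into per-customer totals.
totals = {}
for o in cleaned:
    totals[o["customer"]] = totals.get(o["customer"], 0.0) + o["amount"]

# 3. Feature engineering: derive a new feature (order month) per record.
for o in cleaned:
    o["order_month"] = o["date"].month

print(totals)  # {'A': 250.0, 'B': 90.0}
```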

Handling Outliers
Definition:

Outliers are extreme values that deviate significantly from the rest of the dataset and can skew
analysis.

Causes of Outliers:

 Data Entry Errors (e.g., a typo in an employee's salary recorded as $1,000,000 instead
of $100,000).
 Genuine Extreme Values (e.g., an NBA player's height in a dataset of average humans).

Methods to Handle Outliers:

1. Detection Methods:
o Box Plot – Identifies values outside the interquartile range (IQR).
o Z-Score Method – Flags values more than 3 standard deviations from the mean.
2. Treatment Methods:
o Remove Outliers – If caused by errors.
o Transform Data – Log transformations can reduce skewness.
o Cap Values – Replace extreme values with the nearest reasonable value.
Example:

A credit card company notices that some customers have unusually high transactions. These
could either be fraud or a genuine case (high-spending customers). Solution: Investigate before
deciding to remove or keep the outliers.
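The two detection methods above can be sketched with Python's standard statistics module (the transaction amounts are made-up illustrative data):

```python
import statistics

# Daily transaction amounts (hypothetical); 9,800 is the suspected outlier.
amounts = [120, 95, 130, 110, 105, 9800, 115, 125, 100, 90]

# IQR method: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, _, q3 = statistics.quantiles(amounts, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = [x for x in amounts if x < low or x > high]

# Z-score method: flag values more than 3 standard deviations from the mean.
mean = statistics.mean(amounts)
std = statistics.stdev(amounts)
z_outliers = [x for x in amounts if abs(x - mean) / std > 3]

print(iqr_outliers)  # [9800]
print(z_outliers)    # [] -- the outlier itself inflates the standard
                     # deviation ("masking"), so the 3-sigma rule misses it
                     # in this small sample; the IQR method is more robust.
```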

Missing Value Treatment


Definition:

Missing values occur when data is unavailable for certain fields.

Types of Missing Data:

1. MCAR (Missing Completely at Random): No pattern in the missing data.
2. MAR (Missing at Random): Missingness depends on other observed data.
3. MNAR (Missing Not at Random): Missingness depends on the unobserved value itself (e.g., high-income respondents skipping salary-related questions).

Methods to Handle Missing Data:

 Deletion:
o Listwise Deletion – Remove entire rows with missing values.
o Pairwise Deletion – Use available values without removing entire rows.
 Imputation:
o Mean/Median/Mode Imputation – Replacing missing values with statistical
measures.
o Predictive Modeling (KNN, Regression) – Predict missing values using other
features.

Example:

A hospital's patient dataset has missing blood pressure values. Solution: Use group-wise median imputation, filling each gap with the median blood pressure of patients in the same age and weight group.
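A minimal sketch of median imputation in plain Python (the readings are hypothetical; None marks a missing value):

```python
import statistics

# Blood-pressure readings (hypothetical); None marks a missing value.
bp = [120, 135, None, 128, None, 142, 130]

# Compute the median over the observed values only.
observed = [x for x in bp if x is not None]
median_bp = statistics.median(observed)  # 130

# Replace each missing value with the median.
imputed = [median_bp if x is None else x for x in bp]
print(imputed)  # [120, 135, 130, 128, 130, 142, 130]
```

Listwise deletion would instead simply drop the two records with None; imputation keeps the full sample size at the cost of reduced variance.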
Data Sampling
Definition:

Data sampling is the process of selecting a subset of data for analysis to save time and computing
resources.

Types of Data Sampling:

 Random Sampling – Every data point has an equal chance of selection. Example: selecting 1,000 customers randomly from a database of 1 million.
 Stratified Sampling – Data is divided into groups (strata), and samples are taken proportionally from each. Example: selecting equal numbers of male and female respondents for a gender study.
 Cluster Sampling – Data is divided into clusters, and a few whole clusters are randomly selected. Example: surveying students from 5 randomly selected schools instead of all schools.
 Systematic Sampling – Every nth record is selected from the dataset. Example: selecting every 10th visitor on a website for feedback.

Example:

A marketing firm wants to analyze customer preferences in a large city. Instead of surveying
everyone, they use stratified sampling to ensure responses from different income groups.
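Three of the sampling types above can be sketched with Python's random module (the customer IDs and the 70/30 income split are invented for illustration):

```python
import random

random.seed(42)  # fixed seed so the example is reproducible

customers = list(range(1, 101))  # 100 hypothetical customer IDs

# Simple random sampling: 10 customers, each equally likely.
random_sample = random.sample(customers, k=10)

# Systematic sampling: every 10th customer from a random starting offset.
start = random.randrange(10)
systematic_sample = customers[start::10]

# Stratified sampling: sample proportionally from two income strata.
low_income = customers[:70]   # 70% of the population
high_income = customers[70:]  # 30% of the population
stratified_sample = random.sample(low_income, k=7) + random.sample(high_income, k=3)

print(len(systematic_sample))  # 10
```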
Case Studies
Case Study 1: Google’s Search Engine Data Processing
Problem:

Google needs to process massive search queries efficiently while handling missing and
inconsistent data.

Solution:

1. Data Acquisition: Web scraping search results & user behavior data.
2. Handling Outliers & Missing Values: Machine learning detects and removes irrelevant
or low-quality data.
3. Data Sampling: Instead of analyzing all searches, Google samples a fraction of queries
to refine its algorithms.

Results:

 Improved search ranking relevance.
 Faster response times for search queries.

Case Study 2: Zomato’s Customer Review Analysis


Problem:

Zomato wants to analyze restaurant reviews but faces missing data and outliers.

Solution:

1. Data Acquisition: Collected reviews from customers, including ratings and comments.
2. Handling Missing Values: Used sentiment analysis to predict missing ratings based on
review text.
3. Outlier Detection: Identified fake reviews by detecting extreme rating patterns.

Results:

 Better recommendation system for users.
 Reduced impact of fake or biased reviews.
Case Study 3: Tesla’s Sensor Data Analysis for Autonomous
Cars
Problem:

Tesla collects real-time data from sensors in self-driving cars but faces challenges with missing
and noisy data.

Solution:

1. Data Acquisition: Collected sensor data from cameras, LiDAR, and GPS.
2. Data Cleaning: Removed inaccurate GPS readings and outliers.
3. Data Sampling: Used systematic sampling to analyze specific driving patterns instead of
all data.

Results:

 Improved self-driving algorithms.
 Enhanced safety by filtering out inaccurate sensor readings.
