Ba CH-2

Chapter 2 covers data acquisition, processing, and challenges, detailing primary and secondary data sources, methods like web scraping and APIs, and issues such as data quality and privacy. It also discusses handling outliers and missing values, including methods for detection and imputation. Case studies from Google, Zomato, and Tesla illustrate practical applications and solutions in data processing.

Uploaded by

sk24msg1r43


Chapter 2

Data
Data Acquisition
Definition:

Data acquisition is the process of gathering and collecting raw data from various sources for
analysis.

Types of Data Sources:

1. Primary Data – Collected firsthand through surveys, experiments, or direct observations.
o Example: A company collects customer feedback through online surveys.
2. Secondary Data – Pre-existing data obtained from external sources.
o Example: A business uses government census data for market research.

Methods of Data Acquisition:

• Web Scraping – Extracting data from websites (e.g., Amazon price tracking).
• APIs – Accessing data from platforms like Twitter or Google Analytics.
• IoT Devices – Collecting sensor data (e.g., Fitbit for health tracking).
• Transaction Databases – Recording sales and customer purchases (e.g., Walmart POS systems).
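
As a minimal sketch of the web scraping method, the following pulls prices out of an HTML fragment with Python's standard-library html.parser. The HTML snippet, the "price" class name, and the PriceParser class are hypothetical; a real scraper would download pages over HTTP and should respect the site's terms of use.

```python
from html.parser import HTMLParser

# Hypothetical page fragment; a real scraper would download this over HTTP.
SAMPLE_HTML = """
<ul>
  <li><span class="name">Kettle</span> <span class="price">$24.99</span></li>
  <li><span class="name">Toaster</span> <span class="price">$31.50</span></li>
</ul>
"""


class PriceParser(HTMLParser):
    """Collects the text of every <span class="price"> element as a float."""

    def __init__(self):
        super().__init__()
        self.in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag.
        if tag == "span" and ("class", "price") in attrs:
            self.in_price = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_price = False

    def handle_data(self, data):
        if self.in_price:
            self.prices.append(float(data.strip().lstrip("$")))


parser = PriceParser()
parser.feed(SAMPLE_HTML)
prices = parser.prices
print(prices)
```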

Challenges in Data Acquisition

Common Challenges & Solutions:

Challenge | Explanation | Solution
Data Quality Issues | Inconsistent, duplicate, or incomplete data | Use data cleaning techniques
High Volume of Data | Large datasets slow down processing | Use cloud storage & big data tools
Data Privacy & Security | Risk of data breaches & regulatory non-compliance | Implement encryption & adhere to GDPR/CCPA
Data Integration Issues | Combining data from multiple sources | Use ETL (Extract, Transform, Load) processes

Example:

A bank acquiring customer transaction data faces integration issues when merging data from
different branches with varying formats. Solution: Using data transformation tools to standardize
the format.
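
A minimal sketch of that standardization step, assuming two branches that export dates and amounts in different formats. The branch names, field names, and date formats are hypothetical and would need to match the real exports.

```python
from datetime import datetime

# Hypothetical records: each branch exports dates and amounts differently.
branch_a = [{"date": "03/15/2024", "amount": "1,250.00"}]
branch_b = [{"date": "2024-03-16", "amount": "980.5"}]

# Assumed date format per branch; adjust to the real export formats.
DATE_FORMATS = {"A": "%m/%d/%Y", "B": "%Y-%m-%d"}


def standardize(record, branch):
    """Return the record with an ISO date, a float amount, and its branch."""
    parsed = datetime.strptime(record["date"], DATE_FORMATS[branch])
    return {
        "branch": branch,
        "date": parsed.date().isoformat(),
        "amount": float(record["amount"].replace(",", "")),
    }


merged = [standardize(r, "A") for r in branch_a]
merged += [standardize(r, "B") for r in branch_b]
print(merged)
```

Once every record shares one schema, the merged list can be loaded into a single table, which is the "Load" step of an ETL process.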

Data Processing
Definition:

Data processing involves cleaning, transforming, and preparing data for analysis.

Key Steps in Data Processing:

1. Data Cleaning – Removing inconsistencies, duplicates, and errors.
2. Data Transformation – Converting raw data into a structured format.
3. Feature Engineering – Creating new meaningful features from existing data.
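
The three steps above can be sketched on a small set of order records. The records and field names are made up for illustration, and the sketch uses only plain Python rather than any particular data library.

```python
# Hypothetical raw order records with a duplicate row and untidy text.
raw_orders = [
    {"order_id": 1, "city": " delhi ", "quantity": "2", "unit_price": "150.0"},
    {"order_id": 2, "city": "Mumbai", "quantity": "1", "unit_price": "999.0"},
    {"order_id": 1, "city": " delhi ", "quantity": "2", "unit_price": "150.0"},
]

# 1. Data cleaning: drop duplicate order IDs, keeping the first occurrence.
seen = set()
cleaned = []
for row in raw_orders:
    if row["order_id"] not in seen:
        seen.add(row["order_id"])
        cleaned.append(row)

# 2. Data transformation: tidy text and convert strings to numeric types.
transformed = [
    {
        "order_id": row["order_id"],
        "city": row["city"].strip().title(),
        "quantity": int(row["quantity"]),
        "unit_price": float(row["unit_price"]),
    }
    for row in cleaned
]

# 3. Feature engineering: derive the order total from existing fields.
for row in transformed:
    row["order_total"] = row["quantity"] * row["unit_price"]

print(transformed)
```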

Handling Outliers
Definition:

Outliers are extreme values that deviate significantly from the rest of the dataset and can skew
analysis.

Causes of Outliers:

• Data Entry Errors (e.g., a typo in an employee's salary recorded as $1,000,000 instead of $100,000).
• Genuine Extreme Values (e.g., an NBA player's height in a dataset of average humans).

Methods to Handle Outliers:

1. Detection Methods:
o Box Plot – Flags values more than 1.5 × IQR (interquartile range) below the first quartile or above the third quartile.
o Z-Score Method – Flags values more than 3 standard deviations from the mean.
2. Treatment Methods:
o Remove Outliers – If caused by errors.
o Transform Data – Log transformations can reduce skewness.
o Cap Values – Replace extreme values with a chosen limit, such as a percentile or an IQR fence (also called winsorizing).
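
A minimal sketch of both detection rules and of capping, using Python's statistics module. The salary figures are invented, and the 1.5 × IQR fences and the threshold of 3 are the conventional defaults rather than values taken from these notes.

```python
import statistics

# Hypothetical salaries in thousands of dollars; 1000 mimics a data entry error.
salaries = [48, 52, 50, 51, 49, 53, 47, 50, 52, 48,
            51, 49, 50, 52, 48, 51, 50, 49, 53, 1000]

# Box plot (IQR) rule: flag values beyond 1.5 * IQR from the quartiles.
q1, _, q3 = statistics.quantiles(salaries, n=4)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = [x for x in salaries if x < lower or x > upper]

# Z-score rule: flag values more than 3 standard deviations from the mean.
mean = statistics.mean(salaries)
stdev = statistics.stdev(salaries)
z_outliers = [x for x in salaries if abs(x - mean) / stdev > 3]

# Capping: pull extreme values back to the nearest IQR fence.
capped = [min(max(x, lower), upper) for x in salaries]

print(iqr_outliers, z_outliers, max(capped))
```

A single extreme value inflates the standard deviation, so in very small datasets the z-score rule can fail to flag anything; the IQR rule is less sensitive to this.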
Example:

A credit card company notices that some customers have unusually high transactions. These
could either be fraud or a genuine case (high-spending customers). Solution: Investigate before
deciding to remove or keep the outliers.

Missing Value Treatment

Definition:

Missing values occur when data is unavailable for certain fields.

Types of Missing Data:

1. MCAR (Missing Completely at Random): Missingness is unrelated to any data, observed or unobserved.
2. MAR (Missing at Random): Missingness depends on other observed data, not on the missing value itself.
3. MNAR (Missing Not at Random): Missingness depends on the missing value itself (e.g., high-income respondents skipping salary-related questions).

Methods to Handle Missing Data:

• Deletion:
o Listwise Deletion – Remove entire rows with missing values.
o Pairwise Deletion – Use all available values for each calculation, leaving a row out only where the field it needs is missing.
• Imputation:
o Mean/Median/Mode Imputation – Replacing missing values with statistical measures.
o Predictive Modeling (KNN, Regression) – Predict missing values using other features.
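
A minimal sketch of listwise deletion and of median and mode imputation on a tiny survey table. The rows and column names are hypothetical, and None stands in for a missing value.

```python
import statistics

# Hypothetical survey rows; None marks a missing value.
rows = [
    {"age": 25, "city": "Pune"},
    {"age": None, "city": "Delhi"},
    {"age": 40, "city": None},
    {"age": 31, "city": "Delhi"},
]

# Listwise deletion: keep only rows with no missing fields.
complete_rows = [r for r in rows if None not in r.values()]

# Median imputation for a numeric column.
observed_ages = [r["age"] for r in rows if r["age"] is not None]
age_median = statistics.median(observed_ages)

# Mode imputation for a categorical column.
observed_cities = [r["city"] for r in rows if r["city"] is not None]
city_mode = statistics.mode(observed_cities)

imputed = [
    {
        "age": r["age"] if r["age"] is not None else age_median,
        "city": r["city"] if r["city"] is not None else city_mode,
    }
    for r in rows
]

print(complete_rows)
print(imputed)
```

Listwise deletion halves this dataset, while imputation keeps every row at the cost of filling in estimated values.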

Example:

A hospital's patient dataset has missing blood pressure values. Solution: Use median imputation
within groups of patients of similar age and weight to fill in the missing values.
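
One way to sketch group-wise median imputation for this example. For brevity it groups by age only; the two age bands, the field names, and the readings are assumptions, and weight could be added to the grouping key in the same way.

```python
import statistics
from collections import defaultdict

# Hypothetical patients; None marks a missing systolic blood pressure reading.
patients = [
    {"age": 28, "systolic": 116},
    {"age": 34, "systolic": 122},
    {"age": 31, "systolic": None},
    {"age": 67, "systolic": 142},
    {"age": 72, "systolic": 150},
    {"age": 70, "systolic": None},
]


def age_band(age):
    """Assumed grouping: under 50 versus 50 and over."""
    return "under_50" if age < 50 else "50_plus"


# Collect the observed readings for each age band.
readings = defaultdict(list)
for p in patients:
    if p["systolic"] is not None:
        readings[age_band(p["age"])].append(p["systolic"])

band_median = {band: statistics.median(vals) for band, vals in readings.items()}

# Fill each missing reading with the median of the patient's own age band.
for p in patients:
    if p["systolic"] is None:
        p["systolic"] = band_median[age_band(p["age"])]

print(patients)
```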
Data Sampling
Definition:

Data sampling is the process of selecting a subset of data for analysis to save time and computing
resources.

Types of Data Sampling:

Sampling Type | Description | Example
Random Sampling | Every data point has an equal chance of selection. | Selecting 1,000 customers randomly from a database of 1 million.
Stratified Sampling | Data is divided into groups (strata), and samples are taken proportionally. | Selecting male and female respondents in proportion to their share of the population for a gender study.
Cluster Sampling | Data is divided into clusters, and a few clusters are randomly selected. | Surveying students from 5 randomly selected schools instead of all schools.
Systematic Sampling | Every nth record is selected from a dataset. | Selecting every 10th visitor on a website for feedback.
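
A minimal sketch of random, systematic, and cluster sampling with Python's random module. The customer list, the store field used as the cluster, and the sample sizes are all hypothetical.

```python
import random

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: 1,000 customers, 100 per store (10 stores).
customers = [{"id": i, "store": i // 100} for i in range(1000)]

# Random sampling: every customer has an equal chance of selection.
random_sample = random.sample(customers, k=100)

# Systematic sampling: every 10th record from a random starting point.
step = 10
start = random.randrange(step)
systematic_sample = customers[start::step]

# Cluster sampling: pick a few stores at random and keep all their customers.
chosen_stores = set(random.sample(range(10), k=3))
cluster_sample = [c for c in customers if c["store"] in chosen_stores]

print(len(random_sample), len(systematic_sample), len(cluster_sample))
```

Systematic sampling assumes the record order has no pattern that repeats with the same period as the step; otherwise the sample can be biased.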

Example:

A marketing firm wants to analyze customer preferences in a large city. Instead of surveying
everyone, they use stratified sampling to ensure responses from different income groups.
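
A minimal sketch of proportional stratified sampling for this example. The income groups, their sizes, the 10% sampling fraction, and the stratified_sample helper are assumptions made for illustration.

```python
import random

random.seed(7)  # fixed seed so the sketch is repeatable

# Hypothetical city population with an income group for each resident.
residents = (
    [{"id": i, "income": "low"} for i in range(600)]
    + [{"id": i, "income": "middle"} for i in range(600, 900)]
    + [{"id": i, "income": "high"} for i in range(900, 1000)]
)


def stratified_sample(population, key, fraction):
    """Sample the same fraction from every stratum defined by `key`."""
    strata = {}
    for person in population:
        strata.setdefault(person[key], []).append(person)
    sample = []
    for members in strata.values():
        k = round(len(members) * fraction)
        sample.extend(random.sample(members, k))
    return sample


survey = stratified_sample(residents, "income", fraction=0.1)
counts = {
    group: sum(1 for p in survey if p["income"] == group)
    for group in ("low", "middle", "high")
}
print(counts)
```

Because each group is sampled at the same rate, the survey keeps the population's 60/30/10 income split, which a simple random sample of 100 would only approximate.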
Case Studies
Case Study 1: Google’s Search Engine Data Processing
Problem:

Google needs to process massive search queries efficiently while handling missing and
inconsistent data.

Solution:

1. Data Acquisition: Crawling web pages & collecting user behavior data.
2. Handling Outliers & Missing Values: Machine learning detects and removes irrelevant
or low-quality data.
3. Data Sampling: Instead of analyzing all searches, Google samples a fraction of queries
to refine its algorithms.

Results:

• Improved search ranking relevance.
• Faster response times for search queries.

Case Study 2: Zomato’s Customer Review Analysis

Problem:

Zomato wants to analyze restaurant reviews but faces missing data and outliers.

Solution:

1. Data Acquisition: Collected reviews from customers, including ratings and comments.
2. Handling Missing Values: Used sentiment analysis to predict missing ratings based on
review text.
3. Outlier Detection: Identified fake reviews by detecting extreme rating patterns.

Results:

• Better recommendation system for users.
• Reduced impact of fake or biased reviews.

Case Study 3: Tesla’s Sensor Data Analysis for Autonomous Cars

Problem:

Tesla collects real-time data from sensors in self-driving cars but faces challenges with missing
and noisy data.

Solution:

1. Data Acquisition: Collected sensor data from cameras, radar, and GPS.
2. Data Cleaning: Removed inaccurate GPS readings and outliers.
3. Data Sampling: Used systematic sampling to analyze specific driving patterns instead of
all data.

Results:

• Improved self-driving algorithms.
• Enhanced safety by filtering out inaccurate sensor readings.
