BA UNIT-3 - Part 1
Data preparation is about constructing a dataset from one or more data sources to be used for
exploration and modeling. It is a solid practice to start with an initial dataset to get familiar with
the data, to discover initial insights, and to develop a good understanding of any possible
data quality issues.
Data preparation is often a time-consuming process and heavily prone to errors. The old saying
"garbage in, garbage out" is particularly applicable to data science projects where the
gathered data contains many invalid, out-of-range, and missing values.
Analyzing data that has not been carefully screened for such problems can produce highly
misleading results. Thus, the success of data science projects heavily depends on the quality of
the prepared data.
Dataset
A dataset is a collection of data, usually presented in tabular form. Each column represents a
particular variable, and each row corresponds to a given member of the dataset.
A numerical or continuous variable is one that can accept any value within a finite or infinite
interval (e.g., height, weight, temperature, blood glucose). There are two types of numerical
data, interval and ratio. Data on an interval scale can be added and subtracted but cannot be
meaningfully multiplied or divided because there is no true zero.
For example, we cannot say that a 30 °C day is twice as hot as a 15 °C day, because 0 °C is not
a true zero. On the other hand, data on a ratio scale has a true zero and can be added,
subtracted, multiplied, or divided (e.g., weight).
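As a rough illustration of the interval/ratio distinction, the short Python sketch below (with made-up temperature readings) shows why a ratio computed on the Celsius scale is not meaningful, while the same readings on the Kelvin scale, which has a true zero, do support ratios:

```python
# Interval vs. ratio: whether "twice as hot" makes sense depends on a true zero.
c1, c2 = 15.0, 30.0                  # Celsius readings (interval scale: 0 degrees C is arbitrary)
print(c2 / c1)                       # 2.0 -- looks like "twice as hot", but...

k1, k2 = c1 + 273.15, c2 + 273.15    # Kelvin readings (ratio scale: 0 K is a true zero)
print(k2 / k1)                       # ~1.05 -- only about 5% hotter in absolute terms
```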
A categorical or discrete variable is one that can accept two or more values (categories).
There are two types of categorical data, nominal and ordinal. Nominal data does not have an
intrinsic ordering in the categories.
For example, "gender" with two categories, male and female. In contrast, ordinal data does have
an intrinsic ordering in the categories. For example, "level of energy" with three ordered
categories (low, medium, and high).
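A minimal pandas sketch of the two categorical types (variable names and values are made up): nominal categories carry no order, while passing ordered=True declares an ordinal variable whose categories can be compared:

```python
import pandas as pd

# Nominal: categories with no intrinsic order (e.g., gender).
gender = pd.Categorical(["male", "female", "female", "male"])

# Ordinal: categories with an intrinsic order, declared via ordered=True.
energy = pd.Categorical(
    ["low", "high", "medium", "low"],
    categories=["low", "medium", "high"],
    ordered=True,
)
print(energy.min(), energy.max())   # low high -- comparisons are meaningful
print(energy < "high")              # elementwise ordering works for ordinal data
```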
MISSING VALUES
Missing values can arise from information loss as well as from dropouts and nonresponses of
study participants. The presence of missing values leads to a smaller sample size than intended,
which compromises the reliability of the study results. Missing data can also bias inferences
drawn about a population from such a sample, further undermining the reliability of the results.
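A minimal pandas sketch (with made-up values) of how missing values are typically detected and handled; dropping incomplete rows shrinks the sample, while simple imputation preserves it at the cost of some distortion:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":    [25, np.nan, 40, 31],
    "income": [50000, 62000, np.nan, np.nan],
})

print(df.isna().sum())          # count of missing values per column
print(df.dropna())              # listwise deletion: smaller sample size
print(df.fillna(df.median()))   # simple median imputation preserves all rows
```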
Outliers: Data points that deviate significantly from the rest of the observations in a dataset.
Identification of Outliers:
● Visual Inspection: Use scatter plots, histograms, or box plots to visually identify data
points that appear significantly different from the majority.
● Statistical Methods (sketched in code after this list):
a. Z-Score: Calculate the z-score for each data point and identify those with z-scores
exceeding a certain threshold (e.g., |z| > 2).
b. IQR (Interquartile Range): Define outliers as data points located outside the range of
Q1 - 1.5 * IQR and Q3 + 1.5 * IQR.
● Machine Learning Techniques: Utilize machine learning algorithms to detect outliers,
such as isolation forests, one-class SVM, or DBSCAN.
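A minimal Python sketch of the z-score, IQR, and isolation-forest approaches (values are made up; assumes pandas, NumPy, and scikit-learn are installed):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

data = pd.Series([10, 12, 11, 13, 12, 95, 11, 10])   # 95 is the suspect point

# (a) Z-score: flag points more than 2 standard deviations from the mean.
z = (data - data.mean()) / data.std()
print(data[np.abs(z) > 2])

# (b) IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = data.quantile(0.25), data.quantile(0.75)
iqr = q3 - q1
print(data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)])

# (c) Machine learning: IsolationForest labels predicted outliers with -1.
labels = IsolationForest(random_state=0).fit_predict(data.to_frame())
print(data[labels == -1])
```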
Management of Outliers:
● Data Transformation: Apply data transformations like log, square root, or Box-Cox to
make the data more normally distributed, reducing the impact of outliers.
● Data Truncation: Remove extreme outliers from the dataset if they are believed to be
erroneous or have no valid explanation.
● Winsorization: Replace extreme values with less extreme values (e.g., replace outliers
with the 5th or 95th percentile values); see the sketch after this list.
● Robust Statistical Methods: Use statistical methods that are less sensitive to outliers,
such as the median instead of the mean.
● Data Cleansing: Identify and correct errors in data, such as misspellings, duplicates,
missing values, or data entry mistakes.
● Validation Rules: Implement validation rules to prevent erroneous data entry, like range
checks, data type checks, and uniqueness constraints.
● Data Auditing: Regularly audit data for anomalies and inconsistencies to detect and
rectify erroneous data.
● Data Quality Framework: Develop a data quality framework that includes data profiling,
cleansing, enrichment, and monitoring processes.
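A minimal pandas sketch of two of the management techniques above, winsorization and a log transformation, reusing the made-up series from the detection example:

```python
import numpy as np
import pandas as pd

data = pd.Series([10, 12, 11, 13, 12, 95, 11, 10])

# Winsorization: clip values to the 5th and 95th percentiles.
lo, hi = data.quantile(0.05), data.quantile(0.95)
print(data.clip(lower=lo, upper=hi))

# Log transformation: compresses the right tail, shrinking the outlier's pull.
print(np.log(data))

# Robust statistics: the median barely reflects the outlier; the mean is dragged up.
print("mean:", data.mean(), "median:", data.median())
```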