
UNIT 3

PART I - DATA PREPARATION


PART II - DATA WAREHOUSING

PART I - DATA PREPARATION

Data preparation is the process of constructing a dataset from one or more data sources to be
used for exploration and modeling. It is good practice to start with an initial dataset to get
familiar with the data, discover first insights, and develop a good understanding of any possible
data quality issues.

Data preparation is often a time-consuming process and is heavily prone to errors. The old saying
"garbage in, garbage out" is particularly applicable to data science projects where the gathered
data contain many invalid, out-of-range, or missing values.

Analyzing data that has not been carefully screened for such problems can produce highly
misleading results. The success of data science projects therefore depends heavily on the quality
of the prepared data.

Dataset
A dataset is a collection of data, usually presented in tabular form. Each column represents a
particular variable, and each row corresponds to a given member (record) of the data.

A numerical or continuous variable is one that can take any value within a finite or infinite
interval (e.g., height, weight, temperature, blood glucose). There are two types of numerical
data: interval and ratio. Data on an interval scale can be added and subtracted but cannot be
meaningfully multiplied or divided because there is no true zero.
For example, we cannot say that one day is twice as hot as another day. On the other hand,
data on a ratio scale have a true zero and can be added, subtracted, multiplied, or divided
(e.g., weight).
A categorical or discrete variable is one that can take two or more values (categories).
There are two types of categorical data: nominal and ordinal. Nominal data have no intrinsic
ordering among the categories.

For example, "gender" with two categories, male and female. In contrast, ordinal data do have
an intrinsic ordering among the categories. For example, "level of energy" with three ordered
categories (low, medium and high).
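
To make the distinction concrete, here is a minimal pandas sketch (the column names and values
are made up for illustration) showing a ratio-scale numerical variable, a nominal variable, and
an ordinal variable stored with an explicit category ordering:

import pandas as pd

# Hypothetical example data: one numerical (ratio-scale) and two categorical variables.
df = pd.DataFrame({
    "weight_kg": [61.5, 72.0, 58.3],            # numerical, ratio scale (true zero)
    "gender": ["male", "female", "female"],     # categorical, nominal (no ordering)
    "energy_level": ["low", "high", "medium"],  # categorical, ordinal (ordered)
})

# Store the nominal variable as an unordered category.
df["gender"] = df["gender"].astype("category")

# Store the ordinal variable with an explicit ordering of its categories.
df["energy_level"] = pd.Categorical(
    df["energy_level"], categories=["low", "medium", "high"], ordered=True
)

print(df.dtypes)
print(df["energy_level"].min())  # the ordering makes comparisons meaningful; prints "low"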

MISSING VALUES

Missing values can arise from information loss as well as from dropouts and nonresponses of
study participants. The presence of missing values leads to a smaller sample size than intended
and compromises the reliability of the study results. It can also produce biased results when
inferences about a population are drawn from such a sample, further undermining the reliability
of the data.

Types of Missing Values

● Missing completely at random (MCAR): Missing data occur completely at random, without
being influenced by other data. Possible causes: consent withdrawal, omission of major exams,
death, discontinued follow-up and serious adverse reactions.
● Missing at random (MAR): Missing data occur at a specific time point in conjunction with
participant dissatisfaction with study outcomes and ongoing participation. Possible cause:
refusal to continue measurements.
● Not missing at random (NMAR): Missing data occur when a patient who is not satisfied with
study outcomes performs the required measurements on his own, before the scheduled
measurement. Possible cause: if a patient finds the results of self-measurement dissatisfactory,
in addition to dissatisfaction related to the study, the patient may refuse further measurements.
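
Before deciding how to treat missing values, it helps to quantify how many there are per
variable. A minimal pandas sketch (the DataFrame and its columns are hypothetical):

import numpy as np
import pandas as pd

# Hypothetical dataset containing missing values.
df = pd.DataFrame({
    "blood_glucose": [5.4, np.nan, 6.1, np.nan, 5.9],
    "gender": ["male", "female", None, "female", "male"],
})

# Count and percentage of missing values per column.
missing_count = df.isna().sum()
missing_pct = df.isna().mean() * 100

print(pd.DataFrame({"missing": missing_count, "percent": missing_pct.round(1)}))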

TREATMENT OF MISSING VALUES


● Deleting rows with missing values
● Imputing missing values for continuous variables (e.g., with the mean or median)
● Imputing missing values for categorical variables (e.g., with the mode); the first three
approaches are illustrated in the sketch after this list
● Other imputation methods
● Using algorithms that support missing values
● Prediction of missing values
● Imputation using a deep learning library (Datawig)
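
As a small illustration of the first three approaches (row deletion, median imputation for a
continuous variable, and mode imputation for a categorical variable), assuming a hypothetical
DataFrame:

import numpy as np
import pandas as pd

# Hypothetical dataset with missing values in one continuous and one categorical column.
df = pd.DataFrame({
    "weight_kg": [61.5, np.nan, 72.0, 58.3, np.nan],
    "energy_level": ["low", "high", None, "medium", "high"],
})

# Option 1: delete rows that contain any missing value (reduces the sample size).
df_dropped = df.dropna()

# Option 2: impute the continuous variable with its median (robust to outliers).
df_imputed = df.copy()
df_imputed["weight_kg"] = df_imputed["weight_kg"].fillna(df_imputed["weight_kg"].median())

# Option 3: impute the categorical variable with its mode (most frequent category).
df_imputed["energy_level"] = df_imputed["energy_level"].fillna(
    df_imputed["energy_level"].mode()[0]
)

print(df_dropped)
print(df_imputed)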

Identification & Management of Outliers & Erroneous data

Outliers: Data points that deviate significantly from the rest of the data in a dataset.

Identification of Outliers:

● Visual Inspection: Use scatter plots, histograms, or box plots to visually identify data
points that appear significantly different from the majority.
● Statistical Methods (a short sketch of both follows this list):
a. Z-Score: Calculate the z-score for each data point and flag those whose z-scores
exceed a chosen threshold (e.g., |z| > 2).
b. IQR (Interquartile Range): Define outliers as data points falling outside the range
Q1 - 1.5 * IQR to Q3 + 1.5 * IQR.
● Machine Learning Techniques: Use machine learning algorithms to detect outliers,
such as isolation forests, one-class SVM, or DBSCAN.
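
The two statistical rules above can be expressed in a few lines of pandas. This is only a sketch
on made-up numbers; the |z| > 2 threshold and the 1.5 * IQR multiplier follow the text:

import pandas as pd

# Hypothetical numeric sample with one obvious outlier (95).
s = pd.Series([10, 12, 11, 13, 12, 11, 95])

# Z-score method: flag points whose absolute z-score exceeds the chosen threshold.
z = (s - s.mean()) / s.std()
z_outliers = s[z.abs() > 2]

# IQR method: flag points outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR].
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
iqr_outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]

print(z_outliers)   # both methods flag the value 95 in this sample
print(iqr_outliers)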

Management of Outliers:

● Data Transformation: Apply transformations such as log, square root, or Box-Cox to
make the data more normally distributed, reducing the impact of outliers.
● Data Truncation: Remove extreme outliers from the dataset if they are believed to be
erroneous or have no valid explanation.
● Winsorization: Replace extreme values with less extreme values (e.g., replace outliers
with the 5th or 95th percentile values); transformation and winsorization are illustrated
in the sketch after this list.
● Robust Statistical Methods: Use statistical methods that are less sensitive to outliers,
such as the median instead of the mean.
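
A short sketch of the transformation and winsorization options, using made-up right-skewed data
(the 5th/95th percentile caps follow the text; log1p is one possible transform):

import numpy as np
import pandas as pd

# Hypothetical right-skewed data with two extreme values.
s = pd.Series([3.0, 4.5, 5.0, 6.2, 7.1, 150.0, 300.0])

# Data transformation: a log transform compresses large values, reducing outlier impact.
log_transformed = np.log1p(s)

# Winsorization: cap values at the 5th and 95th percentiles instead of removing them.
lower, upper = s.quantile(0.05), s.quantile(0.95)
winsorized = s.clip(lower=lower, upper=upper)

# Robust statistics: the median is far less affected by the extremes than the mean.
print(s.mean(), s.median())
print(winsorized.tolist())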

Erroneous data: Data that contains errors, inaccuracies, or inconsistencies.

Identification of Erroneous Data:


● Data Validation Rules: Define and apply data validation rules to detect inconsistencies
and errors in the data. Common rules include checking for data type mismatches, range
violations, and missing values (see the sketch after this list).
● Cross-Validation: Cross-reference data across different sources or databases to
identify discrepancies and inconsistencies.
● Data Profiling: Perform data profiling to identify irregular patterns, such as non-standard
formats or unexpected values.
● Domain Knowledge: Leverage domain expertise to identify data errors that may not be
apparent through automated methods. For example, recognizing implausible values or
inconsistencies.
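
For instance, simple validation rules (range checks, allowed-value checks, and data type checks)
can be written directly in pandas. The records, column names, and rule thresholds below are
hypothetical:

import pandas as pd

# Hypothetical patient records with deliberate errors.
df = pd.DataFrame({
    "age": [34, 29, 212, 41],                            # 212 violates a plausible range
    "gender": ["male", "female", "female", "unknown"],   # "unknown" is not an allowed value
    "weight_kg": [61.5, "seventy", 72.0, 58.3],          # type mismatch in one cell
})

# Rule 1: range check on age.
bad_age = df[(df["age"] < 0) | (df["age"] > 120)]

# Rule 2: allowed-values check on gender.
bad_gender = df[~df["gender"].isin(["male", "female"])]

# Rule 3: data type check on weight (values that cannot be parsed as numbers).
bad_weight = df[pd.to_numeric(df["weight_kg"], errors="coerce").isna()]

print(bad_age)
print(bad_gender)
print(bad_weight)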

Erroneous Data Management:

● Data Cleansing: Identify and correct errors in the data, such as misspellings, duplicates,
missing values, or data entry mistakes (a short cleansing sketch follows this list).
● Validation Rules: Implement validation rules to prevent erroneous data entry, like range
checks, data type checks, and uniqueness constraints.
● Data Auditing: Regularly audit data for anomalies and inconsistencies to detect and
rectify erroneous data.
● Data Quality Framework: Develop a data quality framework that includes data profiling,
cleansing, enrichment, and monitoring processes.
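
As a small data cleansing sketch (hypothetical customer records; the standardisation and
deduplication steps are only examples of the kinds of corrections a cleansing pass might apply):

import pandas as pd

# Hypothetical records with duplicates, inconsistent casing, and stray whitespace.
df = pd.DataFrame({
    "customer_id": [101, 102, 102, 103],
    "city": ["Delhi ", "mumbai", "Mumbai", "DELHI"],
})

cleaned = df.copy()

# Standardise text: strip whitespace and normalise case so "mumbai" and "Mumbai" match.
cleaned["city"] = cleaned["city"].str.strip().str.title()

# Remove duplicate records that remain after standardisation (the repeated id 102).
cleaned = cleaned.drop_duplicates(subset=["customer_id", "city"])

print(cleaned)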
