BA UNIT-3 - Part 1
Data preparation is about constructing a dataset from one or more data sources to be used for
exploration and modeling. It is a solid practice to start with an initial dataset to get familiar with
the data, to discover initial insights, and to develop a good understanding of any possible
data quality issues.
Data preparation is often a time-consuming process and heavily prone to errors. The old saying
"garbage in, garbage out" is particularly applicable to data science projects where the
gathered data contains many invalid, out-of-range, and missing values.
Analyzing data that has not been carefully screened for such problems can produce highly
misleading results. Thus, the success of data science projects heavily depends on the quality of
the prepared data.
Dataset
A dataset is a collection of data, usually presented in tabular form. Each column represents a
particular variable, and each row corresponds to a given member of the dataset.
A numerical or continuous variable is one that can accept any value within a finite or infinite
interval (e.g., height, weight, temperature, blood glucose). There are two types of numerical
data, interval and ratio. Data on an interval scale can be added and subtracted but cannot be
meaningfully multiplied or divided because there is no true zero.
For example, we cannot say that a 30 °C day is twice as hot as a 15 °C day, because 0 °C is not
a true zero. On the other hand, data on a ratio scale has a true zero and can be added,
subtracted, multiplied, or divided (e.g., weight).
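As a rough illustration of the interval/ratio distinction, the short Python sketch below (with made-up temperature readings) shows why a ratio computed on the Celsius scale is not meaningful, while the same readings on the Kelvin scale, which has a true zero, do support ratios:

```python
# Interval vs. ratio: whether "twice as hot" makes sense depends on a true zero.
c1, c2 = 15.0, 30.0                  # Celsius readings (interval scale: 0 degrees C is arbitrary)
print(c2 / c1)                       # 2.0 -- looks like "twice as hot", but...

k1, k2 = c1 + 273.15, c2 + 273.15    # Kelvin readings (ratio scale: 0 K is a true zero)
print(k2 / k1)                       # ~1.05 -- only about 5% hotter in absolute terms
```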
A categorical or discrete variable is one that can accept two or more values (categories).
There are two types of categorical data, nominal and ordinal. Nominal data does not have an
intrinsic ordering in the categories.
For example, "gender" with two categories, male and female. In contrast, ordinal data does have
an intrinsic ordering in the categories. For example, "level of energy" with three ordered
categories (low, medium, and high).
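A minimal pandas sketch of the two categorical types (variable names and values are made up): nominal categories carry no order, while passing ordered=True declares an ordinal variable whose categories can be compared:

```python
import pandas as pd

# Nominal: categories with no intrinsic order (e.g., gender).
gender = pd.Categorical(["male", "female", "female", "male"])

# Ordinal: categories with an intrinsic order, declared via ordered=True.
energy = pd.Categorical(
    ["low", "high", "medium", "low"],
    categories=["low", "medium", "high"],
    ordered=True,
)
print(energy.min(), energy.max())   # low high -- comparisons are meaningful
print(energy < "high")              # elementwise ordering works for ordinal data
```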
MISSING VALUES
Missing values can arise from information loss as well as from dropouts and nonresponses of
study participants. The presence of missing values leads to a smaller sample size than intended,
which compromises the reliability of the study results. Missing data can also bias inferences
drawn about a population from such a sample, further undermining the reliability of the results.
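A minimal pandas sketch (with made-up values) of how missing values are typically detected and handled; dropping incomplete rows shrinks the sample, while simple imputation preserves it at the cost of some distortion:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":    [25, np.nan, 40, 31],
    "income": [50000, 62000, np.nan, np.nan],
})

print(df.isna().sum())          # count of missing values per column
print(df.dropna())              # listwise deletion: smaller sample size
print(df.fillna(df.median()))   # simple median imputation preserves all rows
```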
Outliers: Data points that deviate significantly from the rest of the observations in a dataset.
Identification of Outliers:
● Visual Inspection: Use scatter plots, histograms, or box plots to visually identify data
points that appear significantly different from the majority.
● Statistical Methods (sketched in code after this list):
a. Z-Score: Calculate the z-score for each data point and identify those with z-scores
exceeding a certain threshold (e.g., |z| > 2).
b. IQR (Interquartile Range): Define outliers as data points located outside the range of
Q1 - 1.5 * IQR and Q3 + 1.5 * IQR.
● Machine Learning Techniques: Utilize machine learning algorithms to detect outliers,
such as isolation forests, one-class SVM, or DBSCAN.
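A minimal Python sketch of the z-score, IQR, and isolation-forest approaches (values are made up; assumes pandas, NumPy, and scikit-learn are installed):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

data = pd.Series([10, 12, 11, 13, 12, 95, 11, 10])   # 95 is the suspect point

# (a) Z-score: flag points more than 2 standard deviations from the mean.
z = (data - data.mean()) / data.std()
print(data[np.abs(z) > 2])

# (b) IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = data.quantile(0.25), data.quantile(0.75)
iqr = q3 - q1
print(data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)])

# (c) Machine learning: IsolationForest labels predicted outliers with -1.
labels = IsolationForest(random_state=0).fit_predict(data.to_frame())
print(data[labels == -1])
```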
Management of Outliers:
● Data Transformation: Apply data transformations like log, square root, or Box-Cox to
make the data more normally distributed, reducing the impact of outliers.
● Data Truncation: Remove extreme outliers from the dataset if they are believed to be
erroneous or have no valid explanation.
● Winsorization: Replace extreme values with less extreme values (e.g., replace outliers
with the 5th or 95th percentile values); see the sketch after this list.
● Robust Statistical Methods: Use statistical methods that are less sensitive to outliers,
such as the median instead of the mean.
● Data Cleansing: Identify and correct errors in data, such as misspellings, duplicates,
missing values, or data entry mistakes.
● Validation Rules: Implement validation rules to prevent erroneous data entry, like range
checks, data type checks, and uniqueness constraints.
● Data Auditing: Regularly audit data for anomalies and inconsistencies to detect and
rectify erroneous data.
● Data Quality Framework: Develop a data quality framework that includes data profiling,
cleansing, enrichment, and monitoring processes.
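A minimal pandas sketch of two of the management techniques above, winsorization and a log transformation, reusing the made-up series from the detection example:

```python
import numpy as np
import pandas as pd

data = pd.Series([10, 12, 11, 13, 12, 95, 11, 10])

# Winsorization: clip values to the 5th and 95th percentiles.
lo, hi = data.quantile(0.05), data.quantile(0.95)
print(data.clip(lower=lo, upper=hi))

# Log transformation: compresses the right tail, shrinking the outlier's pull.
print(np.log(data))

# Robust statistics: the median barely reflects the outlier; the mean is dragged up.
print("mean:", data.mean(), "median:", data.median())
```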