
INTRODUCTION TO DATA PREPARATION

2nd Sem, MCA

SSS Shameem Jan, 2025 Data Analytics, DSCA, MIT


CONTENT
❑ Introduction and overview
• Steps in Data Analytics projects
– Problem definition stage
– Data Preparation
o Data integration
o Data cleaning
o Missing values, noisy data
o Data transformations
o Data partitioning





STEPS IN DATA ANALYTICS
SEMMA Methodology:
• Sample − data sampling: select a dataset large enough to contain sufficient information, yet small enough to be processed efficiently.
• Explore − understand the data by discovering anticipated and unanticipated relationships between variables, as well as abnormalities, with the help of data visualization.
• Modify − select, create and transform variables in preparation for data modeling.
• Model − apply various modeling techniques to the prepared variables to create models that may provide the desired outcome.
• Assess − evaluate the modeling results to determine the reliability and usefulness of the created models.



STEPS IN DATA ANALYTICS
• Population: the group of items whose properties are to be analyzed.
• Sample: a (suitable) subset of the population.
• Sampling: the process of picking a sample.
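A minimal sampling sketch with pandas (the population DataFrame and column names are hypothetical, chosen only to illustrate the idea):

```python
import pandas as pd

# Hypothetical population: transaction records for a cafe chain
population = pd.DataFrame({
    "customer_id": range(1, 10001),
    "bill_amount": [100 + (i % 50) for i in range(10000)],
})

# Draw a simple random sample of 5% of the population;
# random_state makes the sample reproducible
sample = population.sample(frac=0.05, random_state=42)
print(len(sample))  # 500 rows
```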





STEPS IN DATA ANALYTICS
Understand Business Goals & Expectations
• Begin by understanding the business's vision.
• What are the pain points that they are facing?
• What resources are available?
• Infrastructure, prerequisite features, transactions, business results.
• What are the potential benefits?
• What risks are there in pursuing the project?
• Determine whether the expected benefits are realistic and attainable from a data point of view.
• Determine the duration of the project.
• Perspective: see the problem from the business's point of view and from their clients' point of view.
• Ensure that the domain knowledge required for the particular problem is obtained.



STEPS IN DATA ANALYTICS
Translate Business Goals to Data Analysis Goals
• Example: A cafe franchise has outlets A & B.
Cafe A wants to increase its profits to match those of Cafe B.

o Which products are more/less popular? (popularity categorization)
o How do the prices of items at Cafe A compare to those of their competitors at Cafe B? (price chart analysis)
o How many customers does Cafe A have as compared to Cafe B?
o What is the footfall at each cafe on a timely basis (hourly/daily/weekly)?
o What are the peak hours at Cafe A and Cafe B; is there any overlap?
o What is the average age of the customers at each cafe?
o What is the number of repeat customers at each cafe?

• Such detailed understanding may lead to the conclusion that Cafe A sells less coffee than Cafe B during peak hours.
o This changes the problem statement from “How do we increase profits?” to “How do we sell more coffee?”



STEPS IN DATA ANALYTICS
Frame the Problem Statement
• Write a statement that describes the problem, why solving the problem is important, and a starting point for solving it.
“The problem P. . .”: the problem as defined by the company.
“. . . has the impact I.”: the negative impacts/pain points of the problem.
“. . . which affects B. . .”: the parties that are affected (the business, customers or a third party).
“…, so a good starting point would be S.”: the benefits of solving the problem.

• “The problem of low coffee sales has the impact of decreased profits, which affects Cafe A, so a good starting point would be to compare their coffee price with that of their competitors.”



STEPS IN DATA ANALYTICS
Success Metric:
• The objective of the problem statement should be to generate business insights and drive actionable plans.
• The success of the problem statement needs to be evaluated at the end.
• Achievement should be measurable.
• Common metrics are:
o Model assessment: accuracy, performance, etc.
o Benchmarks:
• Increase coffee sales by at least 10% in the first month of solution implementation.



STEPS IN DA - DATA PREPARATION
• Data processing is collecting raw data and translating it into usable information.
• Raw data is collected, filtered, sorted, processed, analyzed, stored, and presented in a readable format.
• Data processing can be: manual, mechanical, electronic.
• Types of data processing:
o Batch processing: data is collected and processed in batches (for large amounts of data).
o Single-user programming processing: done by a single person for personal use (for smaller data).
o Multiprogramming processing: storing and executing multiple programs simultaneously; data is processed using two or more CPUs (parallel processing).
o Real-time processing: processing that always remains under execution.
o Online processing: entry and execution of data directly (no need to store; reduces data entry errors).
o Time-sharing processing: a form of online data processing that allows several users to share resources on a time basis.
o Distributed processing: remote systems remain interconnected, forming a network for processing.



STEPS IN DA - DATA PREPARATION
• Data preparation: the process of gathering, combining, structuring and organizing data so it can be used in business intelligence.

• Steps in data preparation:
• Collect
• Discover
• Clean
• Transform
• Validate



STEPS IN DA - DATA PREPARATION
• Data discovery and profiling: explore the collected data to better understand what it contains and what needs to be done to prepare it for its intended uses (see the profiling sketch after this list).
• Identifies patterns, relationships and other attributes in the data, as well as inconsistencies, anomalies, missing values and other issues that can then be addressed.
• Data inspection: detect unexpected, incorrect, and inconsistent data.
• Data cleansing: correct the identified data errors and issues to create complete and accurate data sets.
• faulty data is removed or fixed,
• missing values are filled in,
• inconsistent entries are harmonized.
• Data structuring: data is modeled and organized to meet analytics requirements.
• Data transformation and enrichment: data is transformed into a unified and usable format.
• Data validation and publishing: data is validated for consistency, completeness and accuracy.
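A minimal data profiling sketch with pandas (the DataFrame contents are hypothetical, seeded with typical quality issues):

```python
import pandas as pd

# Hypothetical extract from a cafe's point-of-sale system
df = pd.DataFrame({
    "product": ["latte", "Latte", "espresso", None, "mocha"],
    "price":   [180, 180, 120, 150, -10],   # -10 looks erroneous
    "qty":     [2, 2, 1, 3, 1],
})

print(df.describe())                 # summary statistics for numeric columns
print(df.isna().sum())               # missing values per column
print(df.duplicated().sum())         # fully duplicated records
print(df["product"].value_counts())  # inconsistent labels ("latte" vs "Latte")
```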



DIMENSIONALITY
• Dimensionality: the number of input variables or features in a dataset.
• Dimensionality reduction: techniques that reduce the number of input variables in a dataset.

Benefits of applying dimensionality reduction

• The space required to store the dataset is reduced.
• Less computation/training time is required.
• Helps in visualizing the data quickly.
• Removes redundant features (if present).

Disadvantages of dimensionality reduction

• Some data may be lost due to dimensionality reduction.
• Sometimes the number of principal components to retain is unknown.



CURSE OF DIMENSIONALITY (COD)

• Handling high-dimensional data is very difficult in practice.

• CoD: the difficulties in training machine learning models caused by high-dimensional data.
• As the dimensionality of the input dataset increases:
• the data size also increases proportionally,
• the chance of overfitting increases,
• any machine learning model becomes more complex.
• If a model is trained on high-dimensional data, it becomes overfitted and performs poorly.

• Dimensionality reduction is therefore very important.



DIMENSIONALITY REDUCTION
• Some features can be quite redundant, adding noise to the dataset, and it makes no sense to keep them.
• Dimensionality reduction essentially transforms data from a high-dimensional feature space to a low-dimensional feature space.
• It is also important that meaningful properties present in the data are not lost during the transformation.
• Dimensionality reduction is commonly used in data visualization to understand and interpret data.
Two components of dimensionality reduction:
• Feature selection: find a subset of the original set of variables. It involves three approaches:
• Filter
• Wrapper
• Embedded
• Feature extraction: reduces data in a high-dimensional space to a lower-dimensional space (a smaller number of dimensions).



DIMENSIONALITY REDUCTION
Two types of dimensionality reduction methods (see the sketch after this list):
1. Keep only the most important features in the dataset and remove the redundant features.
• No transformation is applied to the set of features.
2. Find a combination of new features.
• An appropriate transformation is applied to the set of features.
• The new set of features contains different values instead of the original values.
• Linear methods
• Non-linear methods (manifold learning)
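A minimal feature-extraction sketch using PCA from scikit-learn, one common linear method (the synthetic data and the choice of 2 components are assumptions for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic high-dimensional data: 100 samples, 10 features
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))

# Project onto the 2 principal components that capture the most variance
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                # (100, 2)
print(pca.explained_variance_ratio_)  # variance retained per component
```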



STEPS IN DA - DATA PREPARATION
Challenges in data (the need for data preparation):

• Inadequate or nonexistent data profiling: errors, anomalies and other problems might not be identified, which can result in flawed analytics.
• Missing or incomplete data: must be fixed to ensure analytics accuracy.
• Invalid data values: misspellings, other typos and wrong numbers.
• Name & address standardization: may be inconsistent, with variations that can affect accuracy in analysis.
• Inconsistent data across enterprise systems.
• Data enrichment.
• Maintaining and expanding data prep processes: data preparation work often becomes a recurring process that needs to be sustained and enhanced on an ongoing basis.



STEPS IN DA - DATA PREPARATION
• Data integration/data consolidation: the process of combining data from different sources into a single, unified view.
• Data consolidation techniques (see the ETL sketch after this list):
o Hand-coding: using a manual process for small, uncomplicated data collection (time-consuming for exploding volumes of data).
o ETL software: ETL applications can pull data from multiple sources, transform it into the necessary format, and then transfer it to the final data storage location.
o ELT tools: data from the cloud may not support ETL well. With ELT, data is first extracted from the sources and loaded into the data warehouse, and then transformed (much faster, scalable, and more cost-effective).
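A minimal ETL sketch with pandas (the source tables are inlined to keep it runnable; in practice they would come from files, databases or APIs, and SQLite stands in here for the target warehouse):

```python
import sqlite3
import pandas as pd

# Extract: hypothetical source datasets
sales = pd.DataFrame({"date": ["2025-01-02", "2025-01-03"],
                      "product": ["latte", "mocha"],
                      "qty": [3, 2]})
prices = pd.DataFrame({"product": ["latte", "mocha"],
                       "unit_price": [180, 200]})

# Transform: harmonize formats and join into a unified view
sales["date"] = pd.to_datetime(sales["date"])
unified = sales.merge(prices, on="product", how="left")
unified["amount"] = unified["qty"] * unified["unit_price"]

# Load: write the unified dataset into the target store
with sqlite3.connect("warehouse.db") as conn:
    unified.to_sql("sales_unified", conn, if_exists="replace", index=False)
```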



STEPS IN DA - DATA PREPARATION
• Data integration approaches:

o Extract, Transform and Load: copies of datasets from disparate sources are gathered together, harmonized, and loaded into a data warehouse or database.
o Extract, Load and Transform: data is loaded into a big data system and transformed at a later time for particular analytics uses.
o Change Data Capture: identifies data changes in databases in real time and applies them to a data warehouse or other repositories.
o Data Replication: data in one database is replicated to other databases to keep the information synchronized for operational uses and for backup.
o Data Virtualization: data from different systems is virtually combined to create a unified view rather than loading the data into a new repository.
o Streaming Data Integration: a real-time data integration method.



STEPS IN DA - DATA PREPARATION
Two types of data integration (tight & loose coupling):
• Tight coupling:
o data is combined from different sources into a single physical location through the process of ETL.
• Loose coupling:
o data remains only in the actual source databases.
o an interface is provided that takes a query from the user/system, transforms it in a way the source database can understand, and then sends the query directly to the source databases to obtain ONLY the result.
Issues in data integration:
o Schema integration
o Redundancy
o Data value conflicts



STEPS IN DA - DATA PREPARATION
• Data profiling: explore the collected data to better understand what it contains and what needs to be done to prepare it for its intended uses.
• Summary statistics about the data help to give a general idea of the quality of the data.
• Data inspection: detect unexpected, incorrect, and inconsistent data.
o The quality of the data is critical for the final analysis.
o Data that is incomplete, noisy or inconsistent can affect the results.
o Data cleaning is the process of detecting and removing corrupt/inaccurate records from a record set, table or database.
• By analyzing and visualizing data using statistical measures such as the mean, standard deviation, range, quantiles, etc., one can find values that are unexpected and thus erroneous.



STEPS IN DA - DATA PREPARATION
• Data cleaning: the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset.
• Check and maintain data quality:
o Validity: the degree to which data conforms to defined business rules or constraints.
o Accuracy: the degree to which data is close to the true values (error margin).
o Completeness: the degree to which all required data is known.
o Consistency: the degree to which data is consistent.
o Uniformity: the degree to which data is specified using the same unit of measure.
• Problems may arise from wrong data, or during data integration.
• If the data is incorrect, outcomes and algorithms are unreliable.



STEPS IN DA - DATA PREPARATION
• Data cleaning: the process of removing/correcting data that does not belong in the dataset.
• Data transformation: the process of converting data from one structure into another (data wrangling/munging).
• Incorrect data is either ignored, removed, corrected, or imputed.

• Removal of unwanted observations
o deleting duplicate/redundant/irrelevant values from the dataset.
• Fixing structural errors
o errors due to measurement, data transfer, type conversion, etc.
o include typos in feature names, the same attribute under a different name, mislabeled classes, separate classes that should really be the same, or inconsistent capitalization (e.g. “N/A” and “Not Applicable”); see the sketch after this list.
• Managing unwanted outliers & missing values
o Outliers & missing values can cause problems with certain types of models.
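A minimal sketch of fixing structural errors with pandas (the DataFrame and its label variants are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "status": ["N/A", "Not Applicable", "active", "Active", "ACTIVE"],
    "city":   ["Pune", "pune ", "Pune", "Mumbai", "mumbai"],
})

# Harmonize separate labels that should really be the same class
df["status"] = df["status"].replace({"N/A": "not_applicable",
                                     "Not Applicable": "not_applicable"})

# Fix inconsistent capitalization and stray whitespace
df["status"] = df["status"].str.lower()
df["city"] = df["city"].str.strip().str.title()

# Remove duplicate observations
df = df.drop_duplicates()
print(df)
```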



STEPS IN DA - DATA PREPARATION
• Handling missing data (drop / fill / flag); see the sketch after this list.
o Missing data is like a missing puzzle piece. If you drop it, that’s like pretending the puzzle slot isn’t there. If you impute it, that’s like trying to squeeze in a piece from somewhere else in the puzzle.
1. Dropping observations with missing values.
2. Imputing the missing values with a global constant/statistical observation/popular value.
• Using statistical values like the mean, median, mode, etc.
• The mean is most useful when the original data is not skewed, while the median is more robust and not sensitive to outliers (used when the data is skewed).
• Using a linear regression (with a best-fit line between two variables).
• Manual or automated (depending on data size).
3. Hot-deck: copying values from other similar records.
4. Flag: filling in missing values leads to a loss of information, so mark the missingness instead.
• Ignoring the tuple/record is not very effective, unless the tuple contains several attributes with missing values.
• Missing data is informative in itself, and the algorithm should know about it.
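A minimal drop / fill / flag sketch with pandas (the DataFrame is hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 40, 35, np.nan],
                   "income": [50000, 62000, np.nan, 58000, 61000]})

# Drop: remove observations with any missing value
dropped = df.dropna()

# Fill: impute with a statistical value (median is robust to outliers)
filled = df.fillna(df.median(numeric_only=True))

# Flag: add an indicator column so the algorithm knows the value was missing
flagged = df.copy()
flagged["age_missing"] = flagged["age"].isna().astype(int)
flagged["age"] = flagged["age"].fillna(flagged["age"].median())
```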
STEPS IN DA - DATA PREPARATION

Missing Value Ratio

• If a variable in the dataset has too many missing values, drop that variable.
• Such variables do not carry much useful information.
Steps:
1. Set a threshold level.
2. If a variable has a higher proportion of missing values than the threshold, drop that variable.
• The higher the threshold value, the more aggressive the reduction.
• Dropping a feature may result in information loss.
• Alternatively, replace the missing data with a substitute value to retain most of the data/information of the dataset (see the sketch below):
• Mean/Median/Mode imputation
• Fixed-value imputation
• KNN imputation
• Null imputation
• Regression imputation
• Mean imputation by category
• Forward/backward fill
• Hot-deck imputation
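A minimal missing-value-ratio sketch with pandas (the 40% threshold and the DataFrame are assumptions for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4, 5],
                   "b": [1, np.nan, np.nan, np.nan, 5],   # 60% missing
                   "c": [1, 2, np.nan, 4, 5]})            # 20% missing

threshold = 0.4                    # drop variables with > 40% missing values
missing_ratio = df.isna().mean()   # fraction of missing values per column
df_reduced = df.loc[:, missing_ratio <= threshold]
print(df_reduced.columns.tolist()) # ['a', 'c']
```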





STEPS IN DA - DATA PREPARATION
• Noisy data is meaningless data (corrupt data).
o Data that cannot be understood or interpreted correctly by machines (e.g. unstructured text).
o Spelling errors, industry abbreviations and slang can also impede machine reading.
• Noisy data unnecessarily increases the data storage space required and can also adversely affect analysis results.
o Noisy data can be caused by faulty data collection instruments, data entry problems, technology limitations, hardware failures, programming errors, gibberish input from speech or optical character recognition (OCR) programs, etc.
• Examples:
o Unknown encoding: Marital Status — Q
o Out-of-range values: Age — -10
o Inconsistent data: DoB — 4th Oct 1999, Age — 50
o Inconsistent formats: DoJ — 13th Jan 2000, DoL — 10/10/2016



STEPS IN DA - DATA PREPARATION
• Regression: data can be smoothed by fitting the data to a regression function.
• Clustering: outliers may be detected by clustering (similar values are organized into groups/clusters).
o Values that fall outside the set of clusters may be considered outliers.
o They may be smoothed or removed.

• Noise can be handled using binning.
o Smoothing of sorted data is done using the values around it.
1. Sorted data is placed into bins or buckets.
2. Bins can be created by equal-width (distance) or equal-depth (frequency) partitioning.
3. On these bins, smoothing can be applied (by bin mean, bin median or bin boundaries).
4. Outliers can be treated by binning and then smoothing.



STEPS IN DA - DATA PREPARATION
• Binning smooths the data by grouping similar values together.
• Noise is reduced because small fluctuations within bins are averaged out.
• The choice of binning method depends on the nature of the data and the level of smoothing required.

Types of binning
• Equal-width binning: divides the range of the data into equal-sized intervals (bins).
• Equal-frequency binning: divides the data so that each bin contains approximately the same number of data points.
• Custom binning: user-defined binning intervals, based on domain knowledge or specific needs.



STEPS IN DA - BINNING
Example

Data = 10, 12, 13, 15, 18, 21, 22, 30, 50, 100, 105, 110, 150, 200
Choose number of bins = 4

Calculate the range and bin width:
Range = max − min = 200 − 10 = 190
Bin width = Range / Number of bins = 190 / 4 = 47.5

Define bins:
Bin 1 (10 to 57.5): 10, 12, 13, 15, 18, 21, 22, 30
Bin 2 (57.5 to 105): 50, 100, 105
Bin 3 (105 to 152.5): 110
Bin 4 (152.5 to 200): 150, 200

Apply binning (replace with mean):
• Bin 1 (10 to 57.5): mean = (10+12+13+15+18+21+22+30)/8 = 17.625
• Bin 2 (57.5 to 105): mean = (50+100+105)/3 = 85
• Bin 3 (105 to 152.5): mean = 110
• Bin 4 (152.5 to 200): mean = (150+200)/2 = 175

Binned dataset:
17.625, 17.625, 17.625, 17.625, 17.625, 17.625, 17.625, 17.625, 85, 85, 85, 110, 175, 175
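A minimal sketch reproducing this equal-width example with pandas (the bin edges pd.cut computes match the slide's worked values):

```python
import pandas as pd

data = pd.Series([10, 12, 13, 15, 18, 21, 22, 30, 50,
                  100, 105, 110, 150, 200])

# Four equal-width bins over [10, 200]; width = 190 / 4 = 47.5
bins = pd.cut(data, bins=4)

# Smoothing by bin means: replace each value with its bin's mean
smoothed = data.groupby(bins, observed=True).transform("mean")
print(smoothed.tolist())  # 17.625 x8, 85 x3, 110, 175 x2
```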



STEPS IN DA - BINNING
Example

Data = 4, 8, 15, 21, 21, 24, 25, 28, 34

Partition into (equal-frequency) bins:
Bin a: 4, 8, 15
Bin b: 21, 21, 24
Bin c: 25, 28, 34

Smoothing by bin means: each value in a bin is replaced by the mean value of the bin.
Bin a: 9, 9, 9
Bin b: 22, 22, 22
Bin c: 29, 29, 29

Smoothing by bin boundaries: each bin value is replaced by the closest boundary value.
Bin a: 4, 4, 15
Bin b: 21, 21, 24
Bin c: 25, 25, 34
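A minimal sketch of equal-frequency binning with both smoothing variants (numpy only; the three-way split mirrors the slide's example):

```python
import numpy as np

data = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34])

# Equal-frequency (equal-depth) partitioning: 3 bins of 3 values each
bins = np.array_split(np.sort(data), 3)

# Smoothing by bin means
by_means = [np.full(len(b), b.mean()) for b in bins]
print(by_means)   # [9, 9, 9], [22, 22, 22], [29, 29, 29]

# Smoothing by bin boundaries: snap each value to the nearer of min/max
by_bounds = [np.where(b - b.min() <= b.max() - b, b.min(), b.max())
             for b in bins]
print(by_bounds)  # [4, 4, 15], [21, 21, 24], [25, 25, 34]
```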



STEPS IN DA - DATA PREPARATION
• Outliers: extreme values that deviate from other observations in the data; they indicate variability in measurement.

• Common causes of outliers in a data set:

o Data entry errors (human errors)
o Measurement errors (instrument errors)
o Experimental errors (data extraction or experiment planning/execution errors)
o Intentional (dummy outliers made to test specific methods)
o Data processing errors (data manipulation or unintended data set mutations)
o Sampling errors (extracting or mixing data from wrong or various sources)
o Natural (not an error; novelties in data)



STEPS IN DA - DATA PREPARATION
Types of outlier:
o Type 1 – Global/point outlier
o Type 2 – Contextual outlier
o Type 3 – Collective outlier

o Univariate outliers are found when looking at the distribution of values in a single feature space.
o Multivariate outliers are found in an n-dimensional space (n features).

• Outlier detection methods:

o Numeric outlier (IQR)
o Z-score analysis
o DBSCAN
o Isolation forest



STEPS IN DA - DATA PREPARATION
Numeric outlier detection method:

• A simple, nonparametric outlier detection technique in a one-dimensional feature space.

• Outliers are detected using the IQR (interquartile range).
• Upper and lower bounds are computed using the interquartile multiplier k = 1.5:
o lower bound/fence = Q1 - 1.5(IQR)
o outlier < lower bound/fence
o upper bound/fence = Q3 + 1.5(IQR)
o outlier > upper bound/fence
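A minimal IQR-fence sketch with numpy (the sample data is hypothetical):

```python
import numpy as np

data = np.array([10, 12, 13, 15, 18, 21, 22, 30, 110])

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = data[(data < lower_fence) | (data > upper_fence)]
print(outliers)  # 110 is flagged as an outlier
```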



STEPS IN DA - DATA PREPARATION
Z-score outlier detection method:
• The z-score tells how many standard deviations away a given observation is from the mean.
o 68% of the data points lie within +/- 1 standard deviation.
o 95% of the data points lie within +/- 2 standard deviations.
o 99.7% of the data points lie within +/- 3 standard deviations.



STEPS IN DA - DATA PREPARATION
Z-score outlier detection method:

• The z-score technique assumes a Gaussian distribution of the data.

• Outliers are data points in the tails of the distribution and are therefore far from the average.
• The z-score is a parametric measure and takes two parameters, mean and standard deviation:
z = (x - mean) / standard deviation
• The z-score tells how many standard deviations away a given observation is from the mean.
• A limit must be specified for the data set → good 'rule of thumb' limits are fixed at 2.5, 3, 3.5 or more standard deviations.

o For example, a z-score of 2.5 means the data point is 2.5 standard deviations from the mean.
o Since it is very far (2.5 standard deviations) from the center, it is flagged as an outlier/anomaly.
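A minimal z-score outlier sketch with numpy (the data and the threshold of 2.5 are assumptions for illustration):

```python
import numpy as np

data = np.array([20.0, 21, 22, 19, 20, 23, 21, 20, 55])

z_scores = (data - data.mean()) / data.std()

threshold = 2.5   # rule-of-thumb limit
outliers = data[np.abs(z_scores) > threshold]
print(outliers)   # 55 is flagged as an outlier
```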



STEPS IN DA - DATA PREPARATION
Transformation:
• Processed data is transformed from one format to another format that is more appropriate for analysis.
• Data transformation may be:
o Constructive: adds, copies, or replicates data.
o Destructive: deletes fields or records.
o Aesthetic: standardizes the data to meet requirements or parameters.
o Structural: reorganizes the data by renaming, moving, or combining columns.
• Benefits of data transformation:
o Better organization
o Improved data quality
o Faster queries
o Better data management
o More use out of the data



STEPS IN DA - DATA PREPARATION
Data Reduction
• The process of reducing the volume of the original data to represent it in a much smaller volume while maintaining the integrity of the original data.
o Reducing the number of attributes/columns/dimensions and/or the number of records/tuples/rows.
o The efficiency of the data mining process is improved, while producing the same analytical results.
o Necessary when processing the entire data set is expensive and time consuming.
• Data cube aggregation: aggregation at various levels of the data in a simpler form.
• Dimensionality reduction: not all attributes are required for data analysis.
o Keep the most suitable subset of attributes from a large number of attributes,
o using techniques like forward selection, backward elimination, decision tree induction or a combination of these.
• Data compression: large volumes of data are compressed (the number of bits used to store the data is reduced).
o In lossy compression, the quality of the data is compromised for more compression.
o In lossless compression, the quality of the data is not compromised in exchange for a higher compression level.
• Numerosity reduction: reduces the volume of data by choosing smaller forms of data representation,
o using clustering or sampling of data.
STEPS IN DA - DATA PREPARATION
Data transformation methods (see the sketch after this list):
o Smoothing: the process of removing noise from data (binning, regression, and clustering).
o Aggregation: summary or aggregation operations are applied to the data.
• Daily sales data may be aggregated to compute monthly and annual totals (sum, min, max, group, etc.).
o Generalization: low-level (primitive/raw) data is replaced with higher-level data of categorical value.
• street_name → city or country;
• converting text to numbers: color (black, red, white) → 0, 1, 2;
• converting continuous data to categories: age (15-30, 30-50, 50-70) → youth, middle-aged, and senior.
o Normalization: normalize & scale attribute data so that it falls within a small specified range (e.g. 0.0 to 1.0).
o Attribute construction: new attributes are constructed from the given set of attributes.
• New year and month columns from an original date column.
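A minimal sketch of generalization and attribute construction with pandas (the DataFrame is hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"color": ["black", "red", "white", "red"],
                   "age": [18, 45, 62, 27],
                   "date": pd.to_datetime(["2000-01-13", "2016-10-10",
                                           "2021-05-02", "2019-12-25"])})

# Generalization: text to numeric codes
df["color_code"] = df["color"].map({"black": 0, "red": 1, "white": 2})

# Generalization: continuous age to categories
df["age_group"] = pd.cut(df["age"], bins=[15, 30, 50, 70],
                         labels=["youth", "middle-aged", "senior"])

# Attribute construction: new year and month columns from the date column
df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month
print(df)
```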



STEPS IN DA - DATA PREPARATION
Data transformation process (ETL → Extract, Transform, and Load)

• Data discovery: understand and identify the data in its source format; decide what needs to be done to get the data into its desired format.
• Data mapping: determine how individual fields are to be modified, mapped, filtered, joined, and aggregated.
• Data extraction: extract data from the original source.
• Code generation and execution: create code to complete the transformation.
• Review: after transforming the data, check it to ensure everything has been formatted correctly.
• Sending: send the data to its target destination.



STEPS IN DA - DATA PREPARATION
• When attributes are on different ranges or scales, data modelling and mining can be difficult.
• When there are multiple attributes with values on different scales, this may lead to poor data models when performing data mining operations.
o They are normalized to bring all the attributes onto the same scale.
• Normalization transforms the data to fall within a given range; this helps in applying data mining algorithms and extracting data faster.
o Min-max normalization
o Decimal scaling
o Z-score normalization



STEPS IN DA - DATA PREPARATION
• Min-max normalization − implements a linear transformation on the original data.
o minA and maxA are the minimum and maximum values of an attribute, A.
o Min-max normalization maps a value, v, of A to v' in the range [new_minA, new_maxA]:

v' = ((v - minA) / (maxA - minA)) × (new_maxA - new_minA) + new_minA

o Example: $12,000 and $98,000 are the minimum and maximum values for the attribute income,
o and [0.0, 1.0] is the range to which we need to map a value of $73,600.
o The data point $73,600 is transformed using min-max normalization as follows:

v' = ((73,600 - 12,000) / (98,000 - 12,000)) × (1.0 - 0.0) + 0.0 = 0.716


STEPS IN DA - DATA PREPARATION
• Z-score (zero-mean) normalization − values of attribute A are normalized based on the mean and standard deviation of A.
• A data point, v, of A is normalized to v' by computing:

v' = (v - Ā) / σA

o where Ā and σA are the mean and standard deviation of attribute A.
• Useful when the actual minimum and maximum of attribute A are unknown, or when there are outliers.

o Example: the mean and standard deviation for attribute A are $65,000 and $18,000.
o The normalized value of $85,800 using z-score normalization is:

v' = (85,800 - 65,000) / 18,000 ≈ 1.156


STEPS IN DA - DATA PREPARATION
• Decimal scaling − normalizes by moving the decimal point of the values of attribute A. The movement of the decimal point depends on the maximum absolute value of A.
• A data point, v, of A is normalized to v' by computing:

v' = v / 10^j

o where j is the smallest integer such that max(|v'|) < 1.

o Example: the observed values for attribute A range from -986 to 917.

o The maximum absolute value of attribute A is 986.
o To normalize each value of attribute A using decimal scaling, divide each value of attribute A by 1,000, i.e., j = 3 (the number of digits in the largest absolute value).
o So the value -986 would be normalized to -0.986, and 917 to 0.917.
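A minimal decimal-scaling sketch (plain Python):

```python
import math

def decimal_scale(values):
    """Divide by the smallest power of 10 that brings every |v'| below 1."""
    j = math.ceil(math.log10(max(abs(v) for v in values) + 1))
    return [v / 10 ** j for v in values]

print(decimal_scale([-986, 917]))  # [-0.986, 0.917]
```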



STEPS IN DA - DATA PREPARATION
• Data partitioning: a technique for physically dividing data during the loading of master data.
• For easy management:
o The data volume in a data warehouse can grow to hundreds of gigabytes.
o This huge size is very hard to manage as a single entity.
• To assist backup/recovery:
o Partitioning allows loading only as much data as required on a regular basis (instead of all the data every time).
o Reduces load time and also enhances system performance.
o Also beneficial for backup purposes.
• To enhance performance:
o Query performance improves, since only the relevant partitions need to be processed rather than the entire data.



STEPS IN DA - DATA PREPARATION
Horizontal Partitioning

• Partitioning by time into equal segments: data is partitioned on the basis of time periods of equal size.
• Partitioning by time into different-sized segments: implemented as a set of small partitions for relatively current data and larger partitions for inactive data (used when aged data is accessed infrequently).
• Partitioning on a different dimension: partitioning on the basis of dimensions other than time (product group, region, supplier, etc.).
• Partitioning by size of table: partitioning on the basis of size (used when there is no clear basis/dimension for partitioning).
o Set a predetermined size/critical point; when the data exceeds the predetermined size, a partition is created.



STEPS IN DA - DATA PREPARATION
Vertical Partitioning
• Splits the data vertically.
• Each partition holds a subset of the fields.
• Normalization
o the standard relational method of database organization.
o removes redundancies from the database by splitting tables and linking them with foreign keys.

Functional Partitioning
• Data is aggregated according to how it is used by each bounded context in the system.
• Example: an e-commerce system might store invoice data in one partition and product inventory data in another.
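A minimal illustration of horizontal vs. vertical partitioning with pandas (the sales DataFrame is hypothetical; real warehouses partition at the storage level, so this only mirrors the logical split):

```python
import pandas as pd

sales = pd.DataFrame({"order_id": [1, 2, 3, 4],
                      "date": pd.to_datetime(["2024-01-05", "2024-06-20",
                                              "2025-02-11", "2025-03-30"]),
                      "product": ["latte", "mocha", "latte", "espresso"],
                      "amount": [540, 400, 180, 120]})

# Horizontal partitioning: split rows by time into equal segments (one per year)
by_year = {year: part for year, part in sales.groupby(sales["date"].dt.year)}

# Vertical partitioning: each partition holds a subset of the fields
billing_fields = sales[["order_id", "amount"]]
catalog_fields = sales[["order_id", "product"]]
```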
