Data Preparation
• Such detailed understanding may lead to the conclusion that Cafe A sells less coffee than Cafe B during peak hours.
o This changes the problem statement from “How do we increase profits?” to “How do we sell more coffee?”
• “The problem of low coffee sales has the impact of decreased profits, which affects Cafe A, so a good starting point would be to compare their coffee prices with those of their competitors.”
• Inadequate or nonexistent data profiling. Errors, anomalies, and other problems might not be identified, which can result in flawed analytics.
• Missing or incomplete data. Must be fixed to ensure analytics accuracy.
• Invalid data values. Misspellings, other typos, and wrong numbers.
• Name & address standardization. Names and addresses may be recorded inconsistently, with variations that can affect accuracy in analysis.
• Inconsistent data across enterprise systems.
• Data enrichment.
• Maintaining and expanding data prep processes. Data preparation work often becomes a recurring
process that needs to be sustained and enhanced on an ongoing basis.
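Several of these issues can be surfaced early with a quick profiling pass. A minimal sketch using pandas (assumed to be available; the columns and values are illustrative):

```python
import pandas as pd

# A tiny dataset exhibiting typical quality problems: a missing name,
# a missing income, and an inconsistently spelled company name.
df = pd.DataFrame({
    "name":   ["Acme Corp", "ACME Corporation", None, "Globex"],
    "income": [52000, 52000, 61000, None],
})

print(df.isna().sum())          # missing values per column
print(df.duplicated().sum())    # count of exact duplicate rows
print(df["income"].describe())  # basic stats to spot anomalous values
```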
o Extract, Transform and Load: copies of datasets from disparate sources are gathered together, harmonized, and
loaded into a data warehouse or database.
o Extract, Load and Transform: data is loaded into a big data system and transformed at a later time for particular analytics uses.
o Change Data Capture: identifies data changes in databases in real time and applies them to a data warehouse or other repositories.
o Data Replication: data in one database is replicated to other databases to keep the information synchronized, both for operational uses and for backup.
o Data Virtualization: data from different systems is virtually combined to create a unified view rather than being loaded into a new repository.
o Streaming Data Integration: a real-time method that integrates data continuously as it is generated.
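As an illustration of the first pattern, here is a minimal ETL-style sketch in Python. It assumes a hypothetical customers.csv source with name and country columns and a local SQLite target; the file names and the code mapping are illustrative, not a prescribed implementation.

```python
import csv
import sqlite3

def extract(path):
    # Extract: read raw rows from a CSV source
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    # Transform: harmonize inconsistent country codes across source systems
    mapping = {"USA": "US", "U.S.": "US", "UK": "GB"}
    for row in rows:
        row["country"] = mapping.get(row["country"], row["country"])
    return rows

def load(rows, db_path="warehouse.db"):
    # Load: write the harmonized rows into a warehouse table
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS customers (name TEXT, country TEXT)")
    con.executemany("INSERT INTO customers VALUES (:name, :country)", rows)
    con.commit()
    con.close()

load(transform(extract("customers.csv")))
```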
Types of Binning
• Equal-width Binning: Divides the range of the data into equal-sized intervals (bins).
• Equal-frequency Binning: Divides the data so that each bin contains approximately the same number of data points.
• Custom Binning: User-defined binning intervals, based on domain knowledge or specific needs.
Data = 10, 12, 13, 15, 18, 21, 22, 30, 50, 100, 105, 110, 150, 200
Choose Number of Bins = 4

Calculate the Range and Bin Width:
Range = 200 − 10 = 190 (max − min value)
Bin width = Range / Number of bins = 190 / 4 = 47.5

Define Bins:
• Bin 1 (10 to 57.5): 10, 12, 13, 15, 18, 21, 22, 30, 50
• Bin 2 (57.5 to 105): 100, 105
• Bin 3 (105 to 152.5): 110, 150
• Bin 4 (152.5 to 200): 200

Apply Binning (Replace with Mean):
• Bin 1 (10 to 57.5): Mean = (10+12+13+15+18+21+22+30+50)/9 ≈ 21.22
• Bin 2 (57.5 to 105): Mean = (100+105)/2 = 102.5
• Bin 3 (105 to 152.5): Mean = (110+150)/2 = 130
• Bin 4 (152.5 to 200): Mean = 200

After binning, the data looks like this:
• Bin 1 (10 to 57.5) → 21.22
• Bin 2 (57.5 to 105) → 102.5
• Bin 3 (105 to 152.5) → 130
• Bin 4 (152.5 to 200) → 200

Binned Dataset:
21.22, 21.22, 21.22, 21.22, 21.22, 21.22, 21.22, 21.22, 21.22, 102.5, 102.5, 130, 130, 200
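The same equal-width binning with mean smoothing takes only a few lines. A minimal sketch using NumPy (assumed available):

```python
import numpy as np

data = np.array([10, 12, 13, 15, 18, 21, 22, 30, 50, 100, 105, 110, 150, 200])
n_bins = 4

# Equal-width bin edges: [10, 57.5, 105, 152.5, 200]
edges = np.linspace(data.min(), data.max(), n_bins + 1)

# Assign each point to a bin (right-inclusive edges); the clip keeps
# the minimum value in the first bin
idx = np.clip(np.digitize(data, edges, right=True) - 1, 0, n_bins - 1)

# Smooth by replacing each value with the mean of its bin
means = np.array([data[idx == i].mean() for i in range(n_bins)])
print(means)       # ≈ [21.22, 102.5, 130., 200.]
print(means[idx])  # the binned dataset
```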
o Univariate outliers are found by looking at the distribution of values in a single feature space.
o Multivariate outliers are found in an n-dimensional space (of n features).
o Example: a Z-score of 2.5 means the data point is 2.5 standard deviations away from the mean.
o Since it is that far (2.5 S.D.) from the center, it is flagged as an outlier/anomaly.
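A minimal sketch of Z-score-based outlier flagging (the sample data and the 2.5 threshold are illustrative):

```python
import numpy as np

def zscore_outliers(values, threshold=2.5):
    """Return the points whose Z-score magnitude exceeds the threshold."""
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std()
    return values[np.abs(z) > threshold]

# 95 sits roughly 3.6 standard deviations from the mean, so it is flagged
print(zscore_outliers([10, 11, 12, 12, 13, 13, 14, 14, 15, 15, 16, 17, 18, 95]))
# [95.]
```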
• Data Discovery: Understand and identify data in its source format, and decide what needs to be done to get it into the desired format.
• Data Mapping: Determine how individual fields are to be modified, mapped, filtered, joined, and aggregated.
• Data Extraction: Extract data from its original source.
• Code Generation and Execution: Create code to complete the transformation.
• Review: After transforming the data, check it to ensure everything has been formatted correctly.
• Sending: Send the data to its target destination.
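A minimal sketch of the mapping step on plain Python dicts, covering modify, map, filter, and aggregate (the field names and threshold are illustrative):

```python
raw = [
    {"first": "Ada",  "last": "Lovelace", "amount": "120.50"},
    {"first": "Alan", "last": "Turing",   "amount": "80.00"},
    {"first": "Ada",  "last": "Lovelace", "amount": "30.25"},
]

# Map/modify: merge the name fields and cast the amount to a number
mapped = [{"name": f"{r['first']} {r['last']}",
           "amount": float(r["amount"])} for r in raw]

# Filter: keep only rows at or above a threshold
large = [r for r in mapped if r["amount"] >= 50]   # 2 rows remain

# Aggregate: total amount per name
totals = {}
for r in mapped:
    totals[r["name"]] = totals.get(r["name"], 0.0) + r["amount"]

print(totals)  # {'Ada Lovelace': 150.75, 'Alan Turing': 80.0}
```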
o Example: $12,000 and $98,000 are the minimum and maximum values for the attribute income.
o [0.0, 1.0] is the range to which we need to map a value of $73,600.
o The data point $73,600 is transformed using min-max normalization as follows:
o v′ = (73,600 − 12,000) / (98,000 − 12,000) × (1.0 − 0.0) + 0.0 ≈ 0.716
o Example: the mean and standard deviation for attribute A are $65,000 and $18,000.
o The normalized value of $85,800 using z-score normalization is:
o v′ = (85,800 − 65,000) / 18,000 ≈ 1.156
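Both transformations reduce to one-line formulas; the sketch below reproduces the two worked examples:

```python
def min_max_normalize(v, v_min, v_max, new_min=0.0, new_max=1.0):
    """Map v from [v_min, v_max] onto [new_min, new_max]."""
    return (v - v_min) / (v_max - v_min) * (new_max - new_min) + new_min

def z_score_normalize(v, mean, std):
    """Express v as a number of standard deviations from the mean."""
    return (v - mean) / std

print(min_max_normalize(73_600, 12_000, 98_000))   # ≈ 0.716
print(z_score_normalize(85_800, 65_000, 18_000))   # ≈ 1.156
```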
• Partitioning by Time into Equal Segments: Data is partitioned on the basis of time periods of equal size.
• Partitioning by Time into Different-sized Segments: Implemented as a set of small partitions for relatively current data and larger partitions for inactive data (used when older data is accessed infrequently).
• Partitioning on a Different Dimension: Partition on the basis of dimensions other than time (product group, region, supplier, etc.).
• Partitioning by Size of Table: Partition on the basis of size (used when there is no clear basis/dimension for partitioning).
o Set a predetermined size/critical point; when the data exceeds that size, a new partition is created.
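A minimal sketch of partitioning by time into equal (monthly) segments; the record fields are illustrative:

```python
from collections import defaultdict
from datetime import date

records = [
    {"id": 1, "ts": date(2024, 1, 5)},
    {"id": 2, "ts": date(2024, 1, 20)},
    {"id": 3, "ts": date(2024, 2, 2)},
]

# Group records into one partition per (year, month) segment
partitions = defaultdict(list)
for r in records:
    partitions[(r["ts"].year, r["ts"].month)].append(r)

print(sorted(partitions))  # [(2024, 1), (2024, 2)]
```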
Functional partitioning
• Data is aggregated according to how it is used by each bounded context in the system.
• Example: an e-commerce system might store invoice data in one partition and product inventory data in another.
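In the same spirit, a minimal sketch of functional partitioning that routes each record to the store owned by its bounded context (the store and field names are illustrative):

```python
# One store per bounded context, as in the e-commerce example above
stores = {"billing": [], "inventory": []}

def route(record, context):
    stores[context].append(record)

route({"invoice_id": 42, "total": 99.0}, "billing")
route({"sku": "A-100", "qty": 7}, "inventory")
print({name: len(rows) for name, rows in stores.items()})
# {'billing': 1, 'inventory': 1}
```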