
SCA - Module 3

This document discusses various techniques for data preparation, preprocessing, and transformation. It covers topics like data cleaning to handle missing values, noise and outliers. It also discusses data integration, reduction through sampling and feature selection, and transformation through scaling and standardization.

Uploaded by

mahnoor

Data preparation, preprocessing and transformation
Week 3
SC analytics

▪ Issues with Data


▪ Data Cleaning, dealing with missing values, noise and outliers
▪ Data Integration, removing inconsistencies, and deduplication
▪ Data Reduction - Sampling and Feature Selection
▪ Data Transformation - Scaling and Standardization, Numeric
Transformation

Data preprocessing

▪ Data preprocessing is a very important step


▪ It helps improve quality of data
▪ Makes the data ready and more suitable for analytics
▪ Should be followed and guided by a thorough EDA
▪ EDA helps identify quality issues in data that are dealt with in this
step
Issues with the data

Steps in preprocessing
▪ Steps and processes are performed when necessary

Data cleaning

▪ Also called data scrubbing, data munging, data wrangling


▪ Dealing with Missing values
▪ Noise Smoothing
▪ Correcting Inconsistencies
▪ Identifying Outliers

Data cleaning: Missing values

Data cleaning: Missing values
Knowing why and how data is missing could help in data imputation
Missing Completely at Random (MCAR)
▪ Missingness independent of any observed or unobserved variables
Missing at Random (MAR)
▪ Missingness is independent of the missing values and of unobserved variables
▪ Missingness depends on observed variables with complete information
Missing Not at Random (MNAR)
▪ Missingness depends on the missing values or on unobserved variables

Data cleaning: Missing values; MCAR
No systematic differences exist between participants with missing data and those with complete data

Data cleaning: Missing values; MAR
The fact that data are missing is systematically related to the observed data but not to the unobserved data

Data cleaning: Missing values; MNAR
The fact that data are missing is systematically related to the unobserved data

Data Cleaning: Dealing with missing values

Data Cleaning: Data imputation
▪ Manually fill in; works for small data with few missing values
▪ Use a global constant, e.g. Unknown, or ∞
▪ Substitute a measure of central tendency, e.g. mode, mean or median
  ▪ Missed quiz: student mean, class mean, class mean in this or all quizzes, the student's mean in the remaining quizzes
  ▪ Cricket DLS system
▪ Use class-wise mean or median
  ▪ For a player's missing score in a match, use the player's average, the average of Pak batsmen, the average of Pak batsmen against India, or the average of middle-order Pak batsmen against India in summer in Sharjah
▪ Use the average of the top k similar objects, with similarity based on non-missing attributes
  ▪ Can be a similarity-weighted average of all other data objects
Advanced techniques for imputing missing values
▪ Expectation Maximization Imputation
▪ Regression based Imputation
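The simpler imputation strategies on this slide can be sketched in a few lines of plain Python. This is a minimal illustration, not a prescribed method: the quiz table, the function names and the mean-absolute-difference similarity are all assumptions made for the example.

```python
from statistics import mean

# Hypothetical quiz scores (4 quizzes per student); None marks a missing value.
scores = {
    "A": [8, None, 6, 10],
    "B": [5, 7, None, 9],
    "C": [9, 9, 8, None],
}

def impute_with_student_mean(row):
    """Fill missing quizzes with the mean of the student's remaining quizzes."""
    fill = mean([v for v in row if v is not None])
    return [fill if v is None else v for v in row]

def impute_with_class_mean(table, quiz):
    """Fill one quiz column with the class mean of that quiz."""
    fill = mean([row[quiz] for row in table.values() if row[quiz] is not None])
    return {s: (fill if row[quiz] is None else row[quiz]) for s, row in table.items()}

def impute_with_top_k(table, student, quiz, k=2):
    """Average the quiz score of the k most similar students.

    Similarity (an assumption for this sketch) is the mean absolute
    difference over the other quizzes that both students attempted.
    """
    target = table[student]
    candidates = []
    for other, row in table.items():
        if other == student or row[quiz] is None:
            continue
        shared = [abs(a - b) for i, (a, b) in enumerate(zip(target, row))
                  if i != quiz and a is not None and b is not None]
        if shared:
            candidates.append((mean(shared), row[quiz]))
    candidates.sort()  # smallest distance first
    return mean([value for _, value in candidates[:k]])

student_fill = impute_with_student_mean(scores["A"])
class_fill = impute_with_class_mean(scores, quiz=1)
nearest_fill = impute_with_top_k(scores, "A", quiz=1, k=1)
```

Which fill is appropriate depends on why the value is missing (MCAR, MAR or MNAR); the three functions can give different answers for the same cell.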

Data Cleaning: Noise

Data Cleaning: Noise
Dealing with noise
▪ Smoothing by Binning
▪ Essentially replace each value by the average of values in the bin
▪ Could be mean, median, midrange etc. of values in the bin
▪ Could use equal width or equal depth (sized) bins
▪ Smoothing by local neighborhoods
▪ k-nearest neighbors, blurring, boundaries
▪ Smoothing is also used for data reduction and discretization
▪ Smoothing Time Series
▪ Moving Average
▪ Divide by variance of each period/cycle
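Smoothing by bin means and a moving average, as listed above, might look like the following. It is a sketch under stated assumptions: the function names, the equal-depth bin size and the sample values are illustrative and do not come from the slides.

```python
from statistics import mean

def smooth_by_bin_means(values, bin_size):
    """Equal-depth binning: sort, cut into bins of `bin_size` values,
    and replace every value by the mean of its bin.
    The result is returned in sorted order, not the original order."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        smoothed.extend([mean(bin_values)] * len(bin_values))
    return smoothed

def moving_average(series, window):
    """Simple moving average over a time series (one value per full window)."""
    return [mean(series[i:i + window]) for i in range(len(series) - window + 1)]

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
binned = smooth_by_bin_means(prices, bin_size=3)
trend = moving_average([1, 2, 3, 4, 5], window=3)
```

Replacing `mean` with `statistics.median` gives smoothing by bin medians; equal-width bins would need bin edges computed from the value range instead of from positions in the sorted list.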

Data Cleaning: Correction of inconsistencies

Data Cleaning: Identifying Outliers
Outliers are either
▪ Objects that have characteristics substantially different from most other data
>> the object is an outlier
▪ Value of a variable that is substantially different than the variable’s typical values
>> the feature value is an outlier
▪ Unlike noise, outliers can be legitimate data or values
▪ Outliers could be points of interest
▪ Consider a student's record in an LMS: what values of age would be
▪ noise
▪ inconsistency
▪ outlier
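The slides do not name a detection rule, so the sketch below assumes one common choice, the 1.5 × IQR fence, purely for illustration. The ages and the function name are hypothetical.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR].
    The IQR fence is an assumed rule, not one taken from the slides."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]

# Hypothetical student ages from an LMS export.
ages = [19, 20, 20, 21, 21, 22, 22, 23, 45]
flagged = iqr_outliers(ages)
```

A flagged value still needs a judgment call: a 45-year-old student can be legitimate data, which is what separates an outlier from noise or an inconsistency.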
Data Integration

Data Integration
Entity Identification Problem: Objects do not have same IDs in all
sources
▪ e.g. Sentiment analysis on cricket match tweets to assess player contribution
Network Reconciliation Project
▪ Schema Integration
▪ Object Matching
▪ Make sure that player ID in cricinfo dataset is the same as player code in PCB data
(source of domestic games)
▪ Check metadata, names of attributes, range, data types and formats
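Object matching across two sources can be sketched as a lookup on a normalized key. Everything here is hypothetical: the field names (`player_id`, `player_code`, `name`, `full_name`), the records and the name-based matching rule are assumptions for the example, not the schema of any real dataset.

```python
def normalize(name):
    """Lowercase, drop periods and collapse whitespace so spellings line up."""
    return " ".join(name.lower().replace(".", " ").split())

# Hypothetical rows from two sources with different schemas and IDs.
source_a = [
    {"player_id": 101, "name": "A. Example"},
    {"player_id": 102, "name": "Second Player"},
]
source_b = [
    {"player_code": "P-7", "full_name": "a example"},
    {"player_code": "P-9", "full_name": "Third  Player"},
]

def match_players(a_rows, b_rows):
    """Map each ID in source A to the code in source B with the same normalized name."""
    code_by_name = {normalize(r["full_name"]): r["player_code"] for r in b_rows}
    matches, unmatched = {}, []
    for row in a_rows:
        code = code_by_name.get(normalize(row["name"]))
        if code is None:
            unmatched.append(row["player_id"])
        else:
            matches[row["player_id"]] = code
    return matches, unmatched

matches, unmatched = match_players(source_a, source_b)
```

Name matching alone is fragile (two players can share a name), so in practice further attributes such as date of birth or team would be checked before accepting a match.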

Data Integration
Object Duplication: instance/object may be duplicated
▪ Occasionally two or more objects can have all feature values identical, yet
they could be different instances
▪ e.g. two students with the same grades in all courses

Data Integration

Data Integration

Data Value Conflict Detection and Resolution


▪ Sometimes there are two conflicting values in different sources
▪ e.g. name is spelled differently in educational and NADRA’s record
▪ This might require expert knowledge

Data reduction
▪ Apart from duplicates removal etc.
▪ Sometimes we do not need all the data
▪ We reduce the data in either direction
  ▪ Reduce instances
  ▪ Reduce dimensions
▪ Helps reduce computational complexity
▪ Make data visualization more effective
▪ Get a representative sample of data

Data Reduction: Sampling
Random sampling gives each person the same chance of being selected

A random sample is a subset of individuals chosen at random from a larger set, each with the same probability
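A minimal sketch of drawing such a sample with Python's standard library follows. The population of 1,000 IDs and the function name are assumptions for illustration; the seed is only there to make the draw repeatable.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n distinct items; every item has the same chance of selection."""
    rng = random.Random(seed)  # a private generator, so global state is untouched
    return rng.sample(population, n)

# Hypothetical population of record IDs.
population = list(range(1000))
sample = simple_random_sample(population, n=50, seed=42)
```

`random.Random.sample` draws without replacement, so no individual appears twice in the sample.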

Data Reduction: Sampling

Data Reduction: Sampling
Imbalanced Classes: Classes or groups have a huge difference in frequencies
and the target class is rare
▪ Medical diagnosis: 95% healthy, 5% diseased
▪ eCommerce: 99% do not buy, 1% buy
▪ Security: > 99.99% of people are not terrorists
▪ Similar situation with multiple classes
▪ Predictions can be 97% correct, but useless
▪ Requires special sampling methods
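The "correct but useless" point can be made concrete with the 95% / 5% medical example. The sketch below also shows random undersampling of the majority class; the slides do not say which special sampling method is meant, so that choice, like the function names, is an assumption.

```python
import random
from collections import Counter

# 95 healthy and 5 diseased cases, as in the medical diagnosis example.
labels = ["healthy"] * 95 + ["diseased"] * 5

# A "classifier" that ignores its input and always predicts the majority class.
predictions = ["healthy"] * len(labels)

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
found = sum(p == y == "diseased" for p, y in zip(predictions, labels))
recall_diseased = found / labels.count("diseased")

def undersample_majority(labels, seed=0):
    """Return indices of a balanced subset: each class is randomly cut
    down to the size of the rarest class (one assumed remedy among several)."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    target = min(len(indices) for indices in by_class.values())
    keep = []
    for indices in by_class.values():
        keep.extend(rng.sample(indices, target))
    return sorted(keep)

balanced = undersample_majority(labels)
balanced_counts = Counter(labels[i] for i in balanced)
```

The always-healthy predictor scores 95% accuracy while finding none of the diseased cases, which is why accuracy alone is a poor guide when the target class is rare.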

Data Reduction: Feature selection
▪ More importantly, one does dimensionality reduction
▪ Curse of Dimensionality (problems associated with high dimensions and
difficulties in dealing with higher dimensional vectors)
▪ We might discuss these techniques for dimensionality reduction (if time
permits)
▪ Locality Sensitive Hashing
▪ Johnson-Lindenstrauss Transform
▪ PCA and SVD

Data Reduction: Feature selection and extraction

Data Transformation

Data Transformation

Standardization and Scaling

▪ The goal is to make an entire set of values have a particular property


▪ e.g. variables to have the same range, same unit
▪ to shift the data to a manageable range e.g. shifting to positive
▪ Variety of possibilities for different applications

Standardization and Scaling

Scaling data so it falls in a smaller, comparable or manageable range


▪ Data could be in different units e.g. kilometers and miles
▪ Units might not be known
▪ Smaller units mean larger values and larger ranges
▪ All attributes will get the same weight
▪ Huge implications in distance values (see clustering & recommenders)

MIN-MAX Scaling

MIN-MAX Scaling
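The body of this slide did not survive extraction, so the following is the standard min-max formula, x' = new_min + (x - min) / (max - min) × (new_max - new_min), given as a sketch and not as a reconstruction of the slide. The function name, the default target range [0, 1] and the handling of a constant column are assumptions.

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Linearly map values from [min, max] onto [new_min, new_max]."""
    low, high = min(values), max(values)
    if high == low:
        # A constant column has no spread; map everything to new_min (assumed convention).
        return [new_min for _ in values]
    span = high - low
    return [new_min + (v - low) * (new_max - new_min) / span for v in values]

scaled = min_max_scale([10, 20, 30])
rescaled = min_max_scale([10, 20, 30], new_min=-1.0, new_max=1.0)
```

Because the minimum and maximum define the mapping, a single extreme value compresses all other values into a narrow band.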

z-Score Normalization
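The slide body is missing here as well; the sketch below uses the standard definition z = (x - mean) / standard deviation. Using the population standard deviation (`pstdev`) and returning zeros for a constant column are assumptions of this example.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Center on the mean and divide by the (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # No spread: every value equals the mean (assumed convention).
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

standardized = z_scores([2, 4, 4, 4, 5, 5, 7, 9])  # mean 5, standard deviation 2
```

After the transformation the values have mean 0 and standard deviation 1, so attributes measured in different units become comparable.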

Other families of Normalization

Reasons for Transformation

Reasons for Transformation

Reasons for Transformation

Reasons for Transformation

Common Transformation

Logarithms

Logarithms

Cube Root

Square Root

Reciprocal and Negative Reciprocal

Left Skewed Data: Squares and higher powers
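Only the titles of the preceding transformation slides survived extraction. As a rough sketch of the idea behind them, the code below measures skewness before and after the named transformations; the sample data, the skewness formula (third standardized moment) and all names are assumptions for illustration.

```python
import math
from statistics import mean, pstdev

def skewness(values):
    """Third standardized moment: positive for a long right tail, negative for a long left tail."""
    mu, sigma = mean(values), pstdev(values)
    return mean([((v - mu) / sigma) ** 3 for v in values])

# Hypothetical right-skewed data (all values positive, so log and reciprocal are defined).
right_skewed = [1, 2, 2, 3, 3, 4, 5, 8, 20, 100]

transforms = {
    "logarithm": math.log,
    "cube root": lambda v: v ** (1 / 3),
    "square root": math.sqrt,
    "negative reciprocal": lambda v: -1 / v,
}

raw_skew = skewness(right_skewed)
skew_after = {name: skewness([f(v) for v in right_skewed])
              for name, f in transforms.items()}

# For left-skewed data the direction is reversed: squaring stretches the upper end.
left_skewed = [1, 5, 8, 9, 9, 10, 10, 10]
left_raw_skew = skewness(left_skewed)
left_squared_skew = skewness([v ** 2 for v in left_skewed])
```

Comparing `raw_skew` with the entries of `skew_after` shows how strongly each transformation pulls in the right tail of this particular sample; other data can behave differently.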

Transformation to make linear relationship
