SCA - Module 3: Data preprocessing and transformation
Week 3
SC analytics
Data preprocessing
Steps in preprocessing
▪ Steps and processes are performed only when necessary
Data cleaning
Data cleaning: Missing values
Knowing why and how data are missing can guide imputation
Missing Completely at Random (MCAR)
▪ Missingness is independent of any observed or unobserved variables
Missing at Random (MAR)
▪ Missingness is independent of the missing values and unobserved variables
▪ Missingness depends on observed variables with complete information
Missing Not at Random (MNAR)
▪ Missingness depends on the missing values or unobserved variables
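To make the three mechanisms concrete, here is a minimal simulation sketch on entirely made-up data (the variables age and income, the rates, and the thresholds are illustrative assumptions, not from the slides):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "age": rng.integers(20, 60, size=n),    # fully observed variable
    "income": rng.normal(50, 10, size=n),   # variable that will go missing
})

# MCAR: every income value has the same 10% chance of being missing,
# regardless of any variable.
income_mcar = df["income"].mask(rng.random(n) < 0.10)

# MAR: missingness depends only on the observed variable (age),
# not on the income value itself.
income_mar = df["income"].mask((df["age"] > 50) & (rng.random(n) < 0.40))

# MNAR: missingness depends on the unobserved income value itself,
# e.g. high earners decline to report.
income_mnar = df["income"].mask(df["income"] > 65)
```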
Data cleaning: Missing values; MCAR
No systematic differences exist between participants with missing data and those with complete data
Data cleaning: Missing values; MAR
The missingness is systematically related to the observed data but not the unobserved data
Data cleaning: Missing values; MNAR
The missingness is systematically related to the unobserved data
Data Cleaning: Dealing with missing values
Data Cleaning: Data imputation
▪ Manually fill in; works for small data with few missing values
▪ Use a global constant, e.g. Unknown, or ∞
▪ Substitute a measure of central tendency, e.g. mode, mean or median
▪ Missed quiz: the student's mean, the class mean in this quiz or across all quizzes, or the student's mean in the remaining quizzes
▪ Cricket's DLS system
▪ Use class-wise mean or median
▪ For a player's missing score in a match, use the player's average, the average of Pak batsmen, the average of Pak batsmen against India, or the average of middle-order Pak batsmen against India in summer in Sharjah
▪ Use the average of the top k similar objects >> based on non-missing attributes
▪ Can be a similarity-weighted average of all other data objects
Advanced techniques for imputing missing values
▪ Expectation Maximization imputation
▪ Regression-based imputation
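A minimal pandas sketch of the simpler strategies above, on a hypothetical quiz-score table (the column names and values are made up for illustration):

```python
import pandas as pd

# Hypothetical quiz scores with missing entries.
scores = pd.DataFrame({
    "student": ["A", "B", "C", "D"],
    "quiz1": [8.0, 6.0, None, 9.0],
    "quiz2": [7.0, None, 5.0, 8.0],
})
quiz_cols = ["quiz1", "quiz2"]

# Global constant: mark missing scores with a sentinel value.
filled_const = scores.fillna(-1)

# Central tendency: fill each quiz with its class (column) mean.
filled_class_mean = scores.fillna(scores[quiz_cols].mean())

# Student mean: fill a missing quiz with that student's mean
# over the remaining quizzes (row mean).
row_means = scores[quiz_cols].mean(axis=1)
filled_student_mean = scores.copy()
for col in quiz_cols:
    filled_student_mean[col] = filled_student_mean[col].fillna(row_means)

print(filled_class_mean)
```

For the top-k-similar-objects strategy, scikit-learn's KNNImputer fills a missing value from the k nearest rows along the same lines.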
Data Cleaning: Noise
Dealing with noise
▪ Smoothing by binning
▪ Essentially replace each value by a summary of the values in its bin
▪ Could be the mean, median, midrange etc. of the values in the bin
▪ Could use equal-width or equal-depth (equal-sized) bins
▪ Smoothing by local neighborhoods
▪ k-nearest neighbors, blurring, boundaries
▪ Smoothing is also used for data reduction and discretization
▪ Smoothing time series
▪ Moving average
▪ Divide by the variance of each period/cycle
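A minimal sketch of binning-based smoothing and a moving average, on a small made-up series:

```python
import numpy as np
import pandas as pd

# Noisy made-up values, sorted for binning.
x = pd.Series([4, 8, 15, 21, 21, 24, 25, 28, 34])

# Smoothing by binning: equal-depth bins of size 3, each value
# replaced by the mean of its bin.
bin_size = 3
smoothed_bins = x.groupby(np.arange(len(x)) // bin_size).transform("mean")

# Smoothing a time series with a centered moving average (window 3).
moving_avg = x.rolling(window=3, center=True).mean()

print(smoothed_bins.tolist())  # [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```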
Data Cleaning: Correction of inconsistencies
Data Cleaning: Identifying Outliers
Outliers are either
▪ Objects whose characteristics are substantially different from most other data >> the object is an outlier
▪ Values of a variable that are substantially different from the variable's typical values >> the feature value is an outlier
▪ Unlike noise, outliers can be legitimate data or values
▪ Outliers could be points of interest
▪ Consider student records in an LMS: what values of age could be
▪ noise
▪ inconsistency
▪ outlier
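A minimal sketch of flagging outlying values with the z-score and IQR rules, using hypothetical LMS ages (the thresholds of 3 standard deviations and 1.5×IQR are the usual conventions, not from the slide):

```python
import numpy as np

# Hypothetical ages from student records in an LMS.
ages = np.array([19, 20, 21, 22, 20, 21, 23, 19, 55, -3])

# z-score rule: flag values far from the mean in standard-deviation units.
z = (ages - ages.mean()) / ages.std()
z_outliers = ages[np.abs(z) > 3]

# IQR rule: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = np.percentile(ages, [25, 75])
iqr = q3 - q1
iqr_outliers = ages[(ages < q1 - 1.5 * iqr) | (ages > q3 + 1.5 * iqr)]
print(iqr_outliers)  # [55 -3]: 55 may be a legitimate outlier, -3 is noise/inconsistency
```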
Data Integration
Entity Identification Problem: objects do not have the same IDs in all sources
▪ e.g. sentiment analysis of cricket-match tweets to assess player contribution
▪ e.g. a network reconciliation project
▪ Schema integration
▪ Object matching
▪ Make sure that the player ID in the cricinfo dataset is the same as the player code in the PCB data (the source of domestic games)
▪ Check metadata: names of attributes, ranges, data types and formats
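A tiny sketch of object matching across two sources; the tables, column names (player_id, PlayerCode) and values are hypothetical stand-ins for cricinfo and PCB data:

```python
import pandas as pd

# Hypothetical extracts from two sources with different IDs and schemas.
cricinfo = pd.DataFrame({
    "player_id": [101, 102],
    "player_name": ["Babar Azam", "Shaheen Afridi"],
})
pcb = pd.DataFrame({
    "PlayerCode": ["PK-7", "PK-9"],
    "PlayerName": ["babar azam", "shaheen afridi "],
})

# Normalize the matching key before joining: trim and lowercase.
cricinfo["key"] = cricinfo["player_name"].str.strip().str.lower()
pcb["key"] = pcb["PlayerName"].str.strip().str.lower()

# Build a cross-source ID mapping via the normalized name.
mapping = cricinfo.merge(pcb, on="key")[["player_id", "PlayerCode"]]
print(mapping)
```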
Object Duplication: an instance/object may be duplicated
▪ Occasionally two or more objects can have identical values for all features, yet they could be different instances
▪ e.g. two students with the same grades in all courses
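A minimal pandas sketch for spotting duplicated objects (made-up student rows):

```python
import pandas as pd

students = pd.DataFrame({
    "name": ["Ali", "Sara", "Ali"],
    "grade_cs": ["A", "B+", "A"],
    "grade_math": ["B", "A", "B"],
})

# Flag rows whose feature values all repeat an earlier row.
dupes = students[students.duplicated(keep="first")]

# Dropping duplicates is only safe if they are truly the same instance;
# identical feature values can still belong to different students.
deduped = students.drop_duplicates()
print(dupes)
```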
Data reduction
▪ Apart from duplicate removal etc., sometimes we do not need all the data
▪ We reduce the data in either direction
▪ Reduce the number of objects (sampling) or reduce dimensions
▪ Helps reduce computational complexity
▪ Makes data visualization more effective
Data Reduction: Sampling
▪ Simple random sampling: sampling that results in each person having the same chance of being selected
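A one-line sketch of simple random sampling without replacement, on a hypothetical frame of 10,000 rows:

```python
import numpy as np
import pandas as pd

population = pd.DataFrame({"person_id": np.arange(10_000)})

# Simple random sample without replacement: every row is equally likely.
sample = population.sample(n=500, random_state=42)
print(len(sample))  # 500
```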
Imbalanced classes: classes or groups have hugely different frequencies and the target class is rare
▪ Medical diagnosis: 95% healthy, 5% diseased
▪ eCommerce: 99% do not buy, 1% buy
▪ Security: > 99.99% of people are not terrorists
▪ A similar situation arises with multiple classes
▪ Predictions can be 97% correct, yet useless: always predicting the majority class already achieves that accuracy
▪ Requires special sampling methods (see the sketch below)
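A minimal sketch of two such methods, random undersampling and random oversampling, on made-up 95/5 labels (libraries such as imbalanced-learn offer more refined variants like SMOTE):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical imbalanced labels: 95% healthy (0), 5% diseased (1).
y = np.array([0] * 950 + [1] * 50)
X = np.arange(len(y)).reshape(-1, 1)  # stand-in feature matrix

minority = np.flatnonzero(y == 1)
majority = np.flatnonzero(y == 0)

# Random undersampling: keep all minority rows plus an equally sized
# random subset of the majority rows.
keep = np.concatenate([minority, rng.choice(majority, size=len(minority), replace=False)])
X_under, y_under = X[keep], y[keep]

# Random oversampling: duplicate minority rows until the classes balance.
extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
keep = np.concatenate([np.arange(len(y)), extra])
X_over, y_over = X[keep], y[keep]

print(np.bincount(y_under), np.bincount(y_over))  # [50 50] [950 950]
```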
Data Reduction: Feature selection
▪ More importantly, one can do dimensionality reduction
▪ Curse of dimensionality: problems associated with high dimensions and difficulties in dealing with higher-dimensional vectors
▪ We might discuss these techniques for dimensionality reduction (if time permits)
▪ Locality Sensitive Hashing
▪ Johnson-Lindenstrauss Transform
▪ PCA and SVD
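A minimal sketch of PCA computed via the SVD, on made-up data (200 objects, 50 features, projected to 2 dimensions):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 50))  # made-up high-dimensional data

# PCA via SVD: center the data, then project onto the top-k
# right singular vectors (the principal directions).
k = 2
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
X_reduced = Xc @ Vt[:k].T

print(X_reduced.shape)  # (200, 2)
```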
Data Reduction: Feature selection and extraction
Data Transformation
Standardization and Scaling
MIN-MAX Scaling
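As a standard reference for this slide: min-max scaling maps x into [0, 1] via

x' = (x − min(x)) / (max(x) − min(x))

and into a target range [a, b] via

x' = a + (x − min(x)) · (b − a) / (max(x) − min(x))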
z-Score Normalization
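Likewise, z-score normalization standardizes x via

z = (x − μ) / σ

where μ and σ are the variable's mean and standard deviation. A minimal sketch of both scalings on made-up values:

```python
import numpy as np

x = np.array([10.0, 20.0, 25.0, 40.0])  # made-up values

# Min-max scaling into [0, 1].
x_minmax = (x - x.min()) / (x.max() - x.min())

# z-score normalization: zero mean, unit standard deviation.
x_z = (x - x.mean()) / x.std()

print(x_minmax)                 # [0.  0.333...  0.5  1.]
print(x_z.mean(), x_z.std())    # ~0 and ~1
```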
Other families of Normalization
Reasons for Transformation
Common Transformation
Logarithms
Cube Root
Square Root
Reciprocal and Negative Reciprocal
Left Skewed Data: Squares and higher powers
Transformation to make linear relationship
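A minimal sketch of the transformations named above, applied to made-up positive, right-skewed values (logarithms and square roots require positive or non-negative data):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 5.0, 9.0, 40.0, 120.0])  # made-up right-skewed data

log_x = np.log(x)        # logarithm: strong pull on a long right tail
cbrt_x = np.cbrt(x)      # cube root: milder, also defined for negatives
sqrt_x = np.sqrt(x)      # square root: mild, needs non-negative values
neg_recip = -1.0 / x     # negative reciprocal: very strong, preserves order
squared = x ** 2         # squares/higher powers: for left-skewed data

print(np.round(log_x, 2))  # [0.   0.69 1.1  1.61 2.2  3.69 4.79]
```

The linearizing transformation on the last slide works the same way: apply, for example, a log to one or both variables until the scatter looks roughly linear.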