
Week 02 Data Wrangling


08-08-2024

TOD 533
Data Wrangling
Amit Das
TODS / AMSOM / AU
[email protected]

Fixed width and delimited files


• Fixed-width file format
• All rows have the same length
• Each column is allocated the same width in characters
• Unused positions are filled with padding characters (e.g. spaces or NULs)
• Application must “know” the layout
• Delimited file formats
• Columns separated by delimiters
• Spaces, commas, tabs, other …
• Rows may have unequal lengths
• A row might continue across multiple lines
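Both layouts can be read with pandas; a minimal sketch with a made-up two-record car file (the column names and widths are assumptions for illustration):

```python
import io
import pandas as pd

# The same two records, once as fixed-width text, once comma-delimited.
fixed = (
    "chevy  18 8\n"
    "buick  15 8\n"
)
# The fixed-width reader must "know" the layout:
# 7 chars for car, 3 for mpg, 2 for cylinders.
df_fwf = pd.read_fwf(io.StringIO(fixed), widths=[7, 3, 2],
                     names=["car", "mpg", "cylinders"])

# The delimiter itself carries the layout, so no widths are needed.
delimited = "car,mpg,cylinders\nchevy,18,8\nbuick,15,8\n"
df_csv = pd.read_csv(io.StringIO(delimited))
```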


Missing values
• If the value for an attribute (a column) is missing
• It will show up as a short row in fixed-width format
• It may be harder to detect in a delimited file
• Use a missing-value indicator such as NULL, NaN, or some other agreed string
• Failure to detect missing values can corrupt the reading of the entire file
• Missing values must be understood properly (why are they missing?)
• No response
• For survey data, is it “WILL NOT ANSWER” or “NOT APPLICABLE”?
• A response of zero: this should NOT be recorded as a missing value
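A minimal sketch of detecting missing-value indicators with pandas (the tiny CSV and the `na_values` list are illustrative assumptions):

```python
import io
import pandas as pd

csv = "id,score\n1,42\n2,NULL\n3,\n4,N/A\n"
# Tell the reader which strings mark missing values; an empty field
# is treated as missing by default.
df = pd.read_csv(io.StringIO(csv), na_values=["NULL", "N/A"])
n_missing = df["score"].isna().sum()   # 3 of the 4 rows lack a score
```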

Dealing with missing values


• SAFE OPTION: Exclude rows with (any) missing values (“LISTWISE” deletion)
• Downsides
• Loss of sample size
• Bias unless values are “missing at random” (systematic non-response biases the sample)
• OTHER OPTIONS
• Some analyses can use incomplete data (“PAIRWISE”)
• In a correlation matrix, pairwise correlations can have different sample sizes
• UNSAFE OPTION: IMPUTATION
• Replace missing values with the column means
• “Predict” missing values from other attributes in the same row
• “Predict” missing values by comparison with “similar” rows
• YOU ARE MAKING UP DATA, AT YOUR PERIL!
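The safe and unsafe options side by side, as a pandas sketch (the three-row frame is a made-up example):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"horsepower": [130.0, np.nan, 150.0],
                   "mpg": [18.0, 15.0, 18.0]})

# SAFE: listwise deletion drops any row with a missing value.
listwise = df.dropna()                      # 2 rows survive

# UNSAFE: mean imputation fills the gap with the column mean (140.0).
imputed = df.fillna({"horsepower": df["horsepower"].mean()})
```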


Regression with mean imputation


Only complete observations, n = 392

Model Fit Measures
Model   R       R²
1       0.841   0.708
Note. Model estimated using sample size of N = 392

Model Coefficients - mpg
Predictor      Estimate    SE        t          p
Intercept      46.26431    2.66941   17.33131   < .001
cylinders      -0.39793    0.41054   -0.96927   0.333
displacement   -8.31e-5    0.00907   -0.00916   0.993
horsepower     -0.04526    0.01666   -2.71620   0.007
weight         -0.00519    8.17e-4   -6.35149   < .001
acceleration   -0.02910    0.12576   -0.23143   0.817

Imputed horsepower, n = 398

Model Fit Measures
Model   R       R²
1       0.840   0.705
Note. Model estimated using sample size of N = 398

Model Coefficients - mpg
Predictor      Estimate    SE        t         p
Intercept      45.86496    2.63511   17.4053   < .001
cylinders      -0.35871    0.41001   -0.8749   0.382
displacement   -0.00139    0.00910   -0.1530   0.879
horsepower     -0.03903    0.01612   -2.4216   0.016
weight         -0.00537    8.06e-4   -6.6538   < .001
acceleration   -0.00700    0.12275   -0.0571   0.955

Outlier detection
• Single extreme values (univariate)
• Without assuming normal distribution – box plot


Outlier detection (2)


• Single extreme values (univariate)
• Assuming normal distribution
• Unlikely values (on the tails)
may be discarded / capped
• “x% Trimmed Mean”
• Unlikely ≠ Impossible

Density-based clustering
• Density-based spatial clustering of applications with noise (DBSCAN) is a data clustering algorithm proposed by Martin Ester, Hans-Peter Kriegel, Jörg Sander and Xiaowei Xu in 1996.
• Applicable to multi-dimensional data (where outliers are difficult to spot manually)
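DBSCAN is available ready-made in libraries such as scikit-learn; the sketch below re-implements the core idea in NumPy (eps-neighbourhoods, core points, cluster expansion, noise labelled -1) on a tiny synthetic dataset, with parameter values chosen purely for illustration:

```python
import numpy as np

def dbscan(X, eps=0.5, min_samples=3):
    """Minimal DBSCAN sketch: returns a cluster label per point, -1 = noise."""
    n = len(X)
    labels = np.full(n, -1)
    # Pairwise distances and eps-neighbourhoods (a point is its own neighbour).
    dist = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    neighbors = [np.flatnonzero(dist[i] <= eps) for i in range(n)]
    core = [len(nb) >= min_samples for nb in neighbors]
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or not core[i]:
            continue
        # Grow a new cluster outward from this unvisited core point.
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster
                if core[j]:                 # only core points keep expanding
                    queue.extend(neighbors[j])
        cluster += 1
    return labels

# Two tight blobs plus one isolated noise point.
X = np.array([[0, 0], [0.1, 0], [0, 0.1], [0.1, 0.1],
              [5, 5], [5.1, 5], [5, 5.1], [5.1, 5.1],
              [10, 0]], dtype=float)
labels = dbscan(X)
```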


Influential observations in regression


• Cook’s D is a measure of how much a regression model changes when
the ith observation is removed.
• A general rule of thumb for cutoff on Cook's D is to use 4/n.
• If your data had 40 data points, for example, a Cook's D > 0.1 would be
considered influential.
• Not all outliers can be detected
using the Cook’s D statistic
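Cook's D can be computed from the residuals and leverages of an OLS fit; statistical packages (jamovi, statsmodels) report it directly, but a NumPy sketch shows the mechanics (the synthetic data with one planted outlier is an assumption for illustration):

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's D for each observation of an OLS fit (intercept added)."""
    X1 = np.column_stack([np.ones(len(y)), X])
    p = X1.shape[1]                                    # number of coefficients
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    h = np.diag(X1 @ np.linalg.inv(X1.T @ X1) @ X1.T)  # leverages
    s2 = resid @ resid / (len(y) - p)                  # residual variance
    return resid**2 * h / (p * s2 * (1 - h)**2)

x = np.arange(20.0)
y = 2 * x + 1
y[19] += 30                 # one gross outlier at the high-leverage end
d = cooks_distance(x, y)    # d[19] far exceeds the 4/n = 0.2 cutoff
```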

Data Transformation: Min-Max scaling

• xscaled = (x – min(x)) / (max(x) – min(x))
• x = original value, xscaled = scaled value
• Scaled values of variables lie between 0 and 1
• Many machine learning methods require or prefer scaled variables
serial mpg cylinders displacement horsepower weight acceleration model_year origin car_name
1 18 8 307 130 3504 12 70 1 chevrolet chevelle malibu
2 15 8 350 165 3693 11.5 70 1 buick skylark 320
3 18 8 318 150 3436 11 70 1 plymouth satellite
4 16 8 304 150 3433 12 70 1 amc rebel sst
5 17 8 302 140 3449 10.5 70 1 ford torino

serial mpgMM cylindersMM displacementMM horsepowerMM weightMM accelerationMM model_year origin car_name
1 0.239 1.000 0.618 0.457 0.536 0.238 70 1 chevrolet chevelle malibu
2 0.160 1.000 0.729 0.647 0.590 0.208 70 1 buick skylark 320
3 0.239 1.000 0.646 0.565 0.517 0.179 70 1 plymouth satellite
4 0.186 1.000 0.610 0.565 0.516 0.238 70 1 amc rebel sst
5 0.213 1.000 0.605 0.511 0.521 0.149 70 1 ford torino
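The rescaling is one line per column; a NumPy sketch on the mpg values of the five rows shown (the full dataset's min and max differ, so these numbers will not match the table exactly):

```python
import numpy as np

mpg = np.array([18.0, 15.0, 18.0, 16.0, 17.0])
# Min-max scaling: min maps to 0, max maps to 1, the rest lie between.
mpg_scaled = (mpg - mpg.min()) / (mpg.max() - mpg.min())
```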


Data Transformation: Standardization


• zx = (x – mean(x))/stdev(x)
• x = original value, zx = standardized value
• Most standardized values lie between -3 and +3
serial mpg cylinders displacement horsepower weight acceleration model_year origin car_name
1 18 8 307 130 3504 12 70 1 chevrolet chevelle malibu
2 15 8 350 165 3693 11.5 70 1 buick skylark 320
3 18 8 318 150 3436 11 70 1 plymouth satellite
4 16 8 304 150 3433 12 70 1 amc rebel sst
5 17 8 302 140 3449 10.5 70 1 ford torino

serial mpgZ cylindersZ displacementZ horsepowerZ weightZ accelerationZ model_year origin car_name
1 -0.698 1.482 1.076 0.663 0.620 -1.284 70 1 chevrolet chevelle malibu
2 -1.082 1.482 1.487 1.573 0.842 -1.465 70 1 buick skylark 320
3 -0.698 1.482 1.181 1.183 0.540 -1.646 70 1 plymouth satellite
4 -0.954 1.482 1.047 1.183 0.536 -1.284 70 1 amc rebel sst
5 -0.826 1.482 1.028 0.923 0.555 -1.827 70 1 ford torino
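Standardization, sketched on the same five mpg values (again, the full dataset's mean and standard deviation differ, so these z-scores will not match the table exactly):

```python
import numpy as np

mpg = np.array([18.0, 15.0, 18.0, 16.0, 17.0])
# z-score: subtract the mean, divide by the sample standard deviation.
z = (mpg - mpg.mean()) / mpg.std(ddof=1)   # ddof=1: sample stdev
```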

Log transformation …1
• The proportion of words recalled with the passage of time is not linear,
but taking logarithm of time makes the relationship almost linear


Log transformation …2
• Taking logarithm of the dependent variable (gestation period) as a
function of birthweight stabilizes the variance of the DV

Log transformation …3
• Sometimes both the independent (diameter of pine trees) and
dependent variables (volume) must be transformed
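A sketch of this third case: if volume grows as a power of diameter, taking logs of both variables makes the relationship linear, and the slope of the log-log fit recovers the exponent (the power-law data below is synthetic, purely for illustration):

```python
import numpy as np

diameter = np.array([1.0, 2.0, 4.0, 8.0])
volume = 3.0 * diameter**2              # synthetic power law: v = 3 d^2

# On the log-log scale the relationship is exactly linear:
# log v = log 3 + 2 log d
slope, intercept = np.polyfit(np.log(diameter), np.log(volume), 1)
```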


One-hot encoding of categorical variables


• Consider colors: violet, indigo, blue, green, yellow, orange, red
• Decision trees might be able to process this data directly
• Except in special cases the ordering
violet > indigo > blue > green > yellow > orange > red
does not make sense and should not be used (“ordinal encoding”)
• Instead, create 7 (yes, seven) variables
“violet”, “indigo”, “blue”, “green”, “yellow”, “orange”, “red”
each taking a value of 0 or 1
• Actually six variables might have been sufficient: use “red” as the reference
• Recall: dummy variables in econometrics
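Both encodings described above in a pandas sketch (a shorter colour list is used here for brevity):

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "green", "blue"]})

# One-hot: one 0/1 column per category; exactly one is set per row.
onehot = pd.get_dummies(df["color"])

# Reference coding: drop one category (here the first alphabetically),
# as with dummy variables in econometrics.
ref = pd.get_dummies(df["color"], drop_first=True)
```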

Wide and long forms of data (e.g. time series)


• Long data sometimes called
“tidy” data
• Some analyses require
one or the other
• Tidy data is handled better
by machines (normalized)
• Wide data may be easier
for human readers
• Stats software (and Python, R) have routines for reshaping data
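The reshape in both directions, as a pandas sketch (the tiny wide table is a made-up example):

```python
import pandas as pd

wide = pd.DataFrame({"city": ["A", "B"],
                     "y2022": [10, 20],
                     "y2023": [11, 21]})

# Wide -> long ("tidy"): one row per (city, year) observation.
long = wide.melt(id_vars="city", var_name="year", value_name="value")

# Long -> wide again.
back = long.pivot(index="city", columns="year", values="value").reset_index()
```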


Syntactic vs semantic data cleaning


• Sometimes the meaning of
the data is clear, though it
does not match exactly.
• ISO 3166-1 alpha-3 codes
three-letter country codes
IDN Indonesia
IMN Isle of Man
IND India
IOT British Indian Ocean Territory
IRL Ireland
IRN Iran (Islamic Republic of)
IRQ Iraq
Write code to reconcile other
(variant) spellings.
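One simple way to reconcile variant spellings against the ISO codes: normalise the string, then look it up in an alias table (the alias entries below are illustrative assumptions, not an official list):

```python
# Map normalised variant spellings to ISO 3166-1 alpha-3 codes.
ALIASES = {
    "india": "IND",
    "iran": "IRN",
    "iran (islamic republic of)": "IRN",
    "ireland": "IRL",
    "republic of ireland": "IRL",
}

def to_iso3(name):
    """Return the alpha-3 code for a country name, or None if unknown."""
    return ALIASES.get(name.strip().lower())
```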

A modern view of data cleaning


• Data cleaning requires domain knowledge
• Number of cylinders: 3, 4, 6, 7, 8, 10, 12, 16 … in automobiles
which ones are meaningful?
• Orders of magnitude: GHz, ns, microns / nm … in electronics
• The “problem” lies in the “brittleness” of learning algorithms
• Some can run on incomplete data, others cannot
• Some are strongly affected by noise in data, others are more robust
• Mean vs Median
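The mean-versus-median point in a nutshell: a single extreme value drags the mean far more than the median (the numbers below are made up):

```python
import numpy as np

incomes = np.array([30.0, 35.0, 40.0, 45.0, 50.0, 10_000.0])  # one outlier
mean = incomes.mean()        # pulled all the way up to 1700.0
median = np.median(incomes)  # barely moved: 42.5
```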

• “Data Cleaning” is a legitimate part of the modeling process


Bill Gates, Warren Buffett, LeBron James, Lionel Messi: outliers?


Data Wrangling hands-on


• https://archive.ics.uci.edu/dataset/9/auto+mpg
• https://www.jamovi.org/download.html
• https://waikato.github.io/weka-wiki/downloading_weka/

