
Week 02 Data Wrangling


08-08-2024

TOD 533
Data Wrangling
Amit Das
TODS / AMSOM / AU
[email protected]

Fixed width and delimited files


• Fixed-width file format
• All rows have the same length
• Each column is allocated the same width in characters
• Unused positions are filled with padding characters (e.g. spaces or NULs)
• Application must “know” the layout
• Delimited file formats
• Columns separated by delimiters
• Spaces, commas, tabs, other …
• Rows may have unequal lengths
• A row might continue across multiple lines
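Both layouts can be read with pandas; a minimal sketch with a made-up two-record car file (the column names and widths are assumptions for illustration):

```python
import io
import pandas as pd

# The same two records, once as fixed-width text, once comma-delimited.
fixed = (
    "chevy  18 8\n"
    "buick  15 8\n"
)
# The fixed-width reader must "know" the layout:
# 7 chars for car, 3 for mpg, 2 for cylinders.
df_fwf = pd.read_fwf(io.StringIO(fixed), widths=[7, 3, 2],
                     names=["car", "mpg", "cylinders"])

# The delimiter itself carries the layout, so no widths are needed.
delimited = "car,mpg,cylinders\nchevy,18,8\nbuick,15,8\n"
df_csv = pd.read_csv(io.StringIO(delimited))
```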


Missing values
• If the value for an attribute (a column) is missing
• It will show up as a short row in fixed-width format
• It may be harder to detect in a delimited file
• Use a missing-value indicator such as NULL, NaN, or some other agreed string
• Failure to detect missing values can corrupt the reading of the entire file
• Missing values must be understood properly (why are they missing?)
• No response
• For survey data, is it “WILL NOT ANSWER” or “NOT APPLICABLE”?
• A response of zero: this should NOT be recorded as a missing value
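A minimal sketch of detecting missing-value indicators with pandas (the tiny CSV and the `na_values` list are illustrative assumptions):

```python
import io
import pandas as pd

csv = "id,score\n1,42\n2,NULL\n3,\n4,N/A\n"
# Tell the reader which strings mark missing values; an empty field
# is treated as missing by default.
df = pd.read_csv(io.StringIO(csv), na_values=["NULL", "N/A"])
n_missing = df["score"].isna().sum()   # 3 of the 4 rows lack a score
```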

Dealing with missing values


• SAFE OPTION: Exclude rows with (any) missing values (“LISTWISE” deletion)
• Downsides
• Loss of sample size
• Bias unless values are “missing at random” (systematic non-response biases the sample)
• OTHER OPTIONS
• Some analyses can use incomplete data (“PAIRWISE”)
• In a correlation matrix, pairwise correlations can have different sample sizes
• UNSAFE OPTION: IMPUTATION
• Replace missing values with the column means
• “Predict” missing values from other attributes in the same row
• “Predict” missing values by comparison with “similar” rows
• YOU ARE MAKING UP DATA, AT YOUR PERIL!
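The safe and unsafe options side by side, as a pandas sketch (the three-row frame is a made-up example):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"horsepower": [130.0, np.nan, 150.0],
                   "mpg": [18.0, 15.0, 18.0]})

# SAFE: listwise deletion drops any row with a missing value.
listwise = df.dropna()                      # 2 rows survive

# UNSAFE: mean imputation fills the gap with the column mean (140.0).
imputed = df.fillna({"horsepower": df["horsepower"].mean()})
```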


Regression with mean imputation


Only complete observations, n = 392

Model Fit Measures
Model   R       R²
1       0.841   0.708
Note. Model estimated using sample size of N = 392

Model Coefficients - mpg
Predictor      Estimate    SE        t          p
Intercept      46.26431    2.66941   17.33131   < .001
cylinders      -0.39793    0.41054   -0.96927   0.333
displacement   -8.31e-5    0.00907   -0.00916   0.993
horsepower     -0.04526    0.01666   -2.71620   0.007
weight         -0.00519    8.17e-4   -6.35149   < .001
acceleration   -0.02910    0.12576   -0.23143   0.817

Imputed horsepower, n = 398

Model Fit Measures
Model   R       R²
1       0.840   0.705
Note. Model estimated using sample size of N = 398

Model Coefficients - mpg
Predictor      Estimate    SE        t         p
Intercept      45.86496    2.63511   17.4053   < .001
cylinders      -0.35871    0.41001   -0.8749   0.382
displacement   -0.00139    0.00910   -0.1530   0.879
horsepower     -0.03903    0.01612   -2.4216   0.016
weight         -0.00537    8.06e-4   -6.6538   < .001
acceleration   -0.00700    0.12275   -0.0571   0.955

Outlier detection
• Single extreme values (univariate)
• Without assuming normal distribution – box plot


Outlier detection (2)


• Single extreme values (univariate)
• Assuming normal distribution
• Unlikely values (on the tails)
may be discarded / capped
• “x% Trimmed Mean”
• Unlikely ≠ Impossible

Density-based clustering
• Density-based spatial clustering of applications with noise (DBSCAN) is a data clustering algorithm proposed by Martin Ester, Hans-Peter Kriegel, Jörg Sander and Xiaowei Xu in 1996.
• Applicable to multi-dimensional data (where outliers are difficult to spot manually)
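DBSCAN is available ready-made in libraries such as scikit-learn; the sketch below re-implements the core idea in NumPy (eps-neighbourhoods, core points, cluster expansion, noise labelled -1) on a tiny synthetic dataset, with parameter values chosen purely for illustration:

```python
import numpy as np

def dbscan(X, eps=0.5, min_samples=3):
    """Minimal DBSCAN sketch: returns a cluster label per point, -1 = noise."""
    n = len(X)
    labels = np.full(n, -1)
    # Pairwise distances and eps-neighbourhoods (a point is its own neighbour).
    dist = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    neighbors = [np.flatnonzero(dist[i] <= eps) for i in range(n)]
    core = [len(nb) >= min_samples for nb in neighbors]
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or not core[i]:
            continue
        # Grow a new cluster outward from this unvisited core point.
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster
                if core[j]:                 # only core points keep expanding
                    queue.extend(neighbors[j])
        cluster += 1
    return labels

# Two tight blobs plus one isolated noise point.
X = np.array([[0, 0], [0.1, 0], [0, 0.1], [0.1, 0.1],
              [5, 5], [5.1, 5], [5, 5.1], [5.1, 5.1],
              [10, 0]], dtype=float)
labels = dbscan(X)
```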


Influential observations in regression


• Cook’s D is a measure of how much a regression model changes when
the ith observation is removed.
• A general rule of thumb for cutoff on Cook's D is to use 4/n.
• If your data had 40 data points, for example, a Cook's D > 0.1 would be
considered influential.
• Not all outliers can be detected
using the Cook’s D statistic
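Cook's D can be computed from the residuals and leverages of an OLS fit; statistical packages (jamovi, statsmodels) report it directly, but a NumPy sketch shows the mechanics (the synthetic data with one planted outlier is an assumption for illustration):

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's D for each observation of an OLS fit (intercept added)."""
    X1 = np.column_stack([np.ones(len(y)), X])
    p = X1.shape[1]                                    # number of coefficients
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    h = np.diag(X1 @ np.linalg.inv(X1.T @ X1) @ X1.T)  # leverages
    s2 = resid @ resid / (len(y) - p)                  # residual variance
    return resid**2 * h / (p * s2 * (1 - h)**2)

x = np.arange(20.0)
y = 2 * x + 1
y[19] += 30                 # one gross outlier at the high-leverage end
d = cooks_distance(x, y)    # d[19] far exceeds the 4/n = 0.2 cutoff
```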

Data Transformation: Min-Max scaling

• xscaled = (x – min(x)) / (max(x) – min(x))
• x = original value, xscaled = scaled value
• Scaled values of variables lie between 0 and 1
• Many machine learning methods require or prefer scaled variables
serial mpg cylinders displacement horsepower weight acceleration model_year origin car_name
1 18 8 307 130 3504 12 70 1 chevrolet chevelle malibu
2 15 8 350 165 3693 11.5 70 1 buick skylark 320
3 18 8 318 150 3436 11 70 1 plymouth satellite
4 16 8 304 150 3433 12 70 1 amc rebel sst
5 17 8 302 140 3449 10.5 70 1 ford torino

serial mpgMM cylindersMM displacementMM horsepowerMM weightMM accelerationMM model_year origin car_name
1 0.239 1.000 0.618 0.457 0.536 0.238 70 1 chevrolet chevelle malibu
2 0.160 1.000 0.729 0.647 0.590 0.208 70 1 buick skylark 320
3 0.239 1.000 0.646 0.565 0.517 0.179 70 1 plymouth satellite
4 0.186 1.000 0.610 0.565 0.516 0.238 70 1 amc rebel sst
5 0.213 1.000 0.605 0.511 0.521 0.149 70 1 ford torino
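The rescaling is one line per column; a NumPy sketch on the mpg values of the five rows shown (the full dataset's min and max differ, so these numbers will not match the table exactly):

```python
import numpy as np

mpg = np.array([18.0, 15.0, 18.0, 16.0, 17.0])
# Min-max scaling: min maps to 0, max maps to 1, the rest lie between.
mpg_scaled = (mpg - mpg.min()) / (mpg.max() - mpg.min())
```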


Data Transformation: Standardization


• zx = (x – mean(x))/stdev(x)
• x = original value, zx = standardized value
• Most standardized values lie between -3 and +3
serial mpg cylinders displacement horsepower weight acceleration model_year origin car_name
1 18 8 307 130 3504 12 70 1 chevrolet chevelle malibu
2 15 8 350 165 3693 11.5 70 1 buick skylark 320
3 18 8 318 150 3436 11 70 1 plymouth satellite
4 16 8 304 150 3433 12 70 1 amc rebel sst
5 17 8 302 140 3449 10.5 70 1 ford torino

serial mpgZ cylindersZ displacementZ horsepowerZ weightZ accelerationZ model_year origin car_name
1 -0.698 1.482 1.076 0.663 0.620 -1.284 70 1 chevrolet chevelle malibu
2 -1.082 1.482 1.487 1.573 0.842 -1.465 70 1 buick skylark 320
3 -0.698 1.482 1.181 1.183 0.540 -1.646 70 1 plymouth satellite
4 -0.954 1.482 1.047 1.183 0.536 -1.284 70 1 amc rebel sst
5 -0.826 1.482 1.028 0.923 0.555 -1.827 70 1 ford torino
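Standardization, sketched on the same five mpg values (again, the full dataset's mean and standard deviation differ, so these z-scores will not match the table exactly):

```python
import numpy as np

mpg = np.array([18.0, 15.0, 18.0, 16.0, 17.0])
# z-score: subtract the mean, divide by the sample standard deviation.
z = (mpg - mpg.mean()) / mpg.std(ddof=1)   # ddof=1: sample stdev
```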

Log transformation …1
• The proportion of words recalled with the passage of time is not linear,
but taking logarithm of time makes the relationship almost linear


Log transformation …2
• Taking logarithm of the dependent variable (gestation period) as a
function of birthweight stabilizes the variance of the DV

Log transformation …3
• Sometimes both the independent (diameter of pine trees) and
dependent variables (volume) must be transformed
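A sketch of this third case: if volume grows as a power of diameter, taking logs of both variables makes the relationship linear, and the slope of the log-log fit recovers the exponent (the power-law data below is synthetic, purely for illustration):

```python
import numpy as np

diameter = np.array([1.0, 2.0, 4.0, 8.0])
volume = 3.0 * diameter**2              # synthetic power law: v = 3 d^2

# On the log-log scale the relationship is exactly linear:
# log v = log 3 + 2 log d
slope, intercept = np.polyfit(np.log(diameter), np.log(volume), 1)
```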


One-hot encoding of categorical variables


• Consider colors: violet, indigo, blue, green, yellow, orange, red
• Decision trees might be able to process this data directly
• Except in special cases the ordering
violet > indigo > blue > green > yellow > orange > red
does not make sense and should not be used (“ordinal encoding”)
• Instead, create 7 (yes, seven) variables
“violet”, “indigo”, “blue”, “green”, “yellow”, “orange”, “red”
each taking a value of 0 or 1
• Actually six variables might have been sufficient: use “red” as the reference
• Recall: dummy variables in econometrics
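Both encodings described above in a pandas sketch (a shorter colour list is used here for brevity):

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "green", "blue"]})

# One-hot: one 0/1 column per category; exactly one is set per row.
onehot = pd.get_dummies(df["color"])

# Reference coding: drop one category (here the first alphabetically),
# as with dummy variables in econometrics.
ref = pd.get_dummies(df["color"], drop_first=True)
```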

Wide and long forms of data (e.g. time series)


• Long data sometimes called
“tidy” data
• Some analyses require
one or the other
• Tidy data is handled better
by machines (normalized)
• Wide data may be easier
for human readers
• Stats software (and Python, R) have routines for reshaping data
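The reshape in both directions, as a pandas sketch (the tiny wide table is a made-up example):

```python
import pandas as pd

wide = pd.DataFrame({"city": ["A", "B"],
                     "y2022": [10, 20],
                     "y2023": [11, 21]})

# Wide -> long ("tidy"): one row per (city, year) observation.
long = wide.melt(id_vars="city", var_name="year", value_name="value")

# Long -> wide again.
back = long.pivot(index="city", columns="year", values="value").reset_index()
```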


Syntactic vs semantic data cleaning


• Sometimes the meaning of
the data is clear, though it
does not match exactly.
• ISO 3166-1 alpha-3 codes
three-letter country codes
IDN Indonesia
IMN Isle of Man
IND India
IOT British Indian Ocean Territory
IRL Ireland
IRN Iran (Islamic Republic of)
IRQ Iraq
Write code to reconcile other
(variant) spellings.
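One simple way to reconcile variant spellings against the ISO codes: normalise the string, then look it up in an alias table (the alias entries below are illustrative assumptions, not an official list):

```python
# Map normalised variant spellings to ISO 3166-1 alpha-3 codes.
ALIASES = {
    "india": "IND",
    "iran": "IRN",
    "iran (islamic republic of)": "IRN",
    "ireland": "IRL",
    "republic of ireland": "IRL",
}

def to_iso3(name):
    """Return the alpha-3 code for a country name, or None if unknown."""
    return ALIASES.get(name.strip().lower())
```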

A modern view of data cleaning


• Data cleaning requires domain knowledge
• Number of cylinders: 3, 4, 6, 7, 8, 10, 12, 16 … in automobiles
which ones are meaningful?
• Orders of magnitude: GHz, ns, microns / nm … in electronics
• The “problem” lies in the “brittleness” of learning algorithms
• Some can run on incomplete data, others cannot
• Some are strongly affected by noise in data, others are more robust
• Mean vs Median
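The mean-versus-median point in a nutshell: a single extreme value drags the mean far more than the median (the numbers below are made up):

```python
import numpy as np

incomes = np.array([30.0, 35.0, 40.0, 45.0, 50.0, 10_000.0])  # one outlier
mean = incomes.mean()        # pulled all the way up to 1700.0
median = np.median(incomes)  # barely moved: 42.5
```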

• “Data Cleaning” is a legitimate part of the modeling process


Bill Gates, Warren Buffett, LeBron James, Lionel Messi: outliers?


Data Wrangling hands-on


• https://archive.ics.uci.edu/dataset/9/auto+mpg
• https://www.jamovi.org/download.html
• https://waikato.github.io/weka-wiki/downloading_weka/

