Week 02 Data Wrangling
TOD 533
Data Wrangling
Amit Das
TODS / AMSOM / AU
[email protected]
08-08-2024
Missing values
• If the value for an attribute (a column) is missing:
  • It will show up as a short row in fixed-file format
  • It may be harder to detect in a delimited file
  • Use a missing value indicator such as NULL, NaN, or some other string
• Failure to detect missing values can corrupt the entire file when it is read
• Missing values must be understood properly (why is the value missing?)
  • No response
  • For survey data, is it “WILL NOT ANSWER” or “NOT APPLICABLE”?
  • Zero response: this should NOT be recorded as a missing value
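The points above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the marker set, the column names, and the sample rows are all hypothetical. It flags short rows while reading a delimited file, maps missing-value indicators to None, and leaves a genuine zero response untouched.

```python
import csv
import io

# Hypothetical set of strings treated as "missing"; adjust to the file at hand.
MISSING_MARKERS = {"", "NULL", "NaN", "NA"}

def read_with_missing(text):
    """Read delimited text, convert missing-value markers to None,
    and fail loudly on rows whose length differs from the header."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    rows = []
    for line_no, fields in enumerate(reader, start=2):
        if len(fields) != len(header):
            raise ValueError(
                f"line {line_no}: expected {len(header)} fields, got {len(fields)}"
            )
        rows.append(
            [None if f.strip() in MISSING_MARKERS else f.strip() for f in fields]
        )
    return header, rows

# Hypothetical survey extract: "0" children is a real answer, not a missing value.
sample = "respondent,children,income\nr1,2,50000\nr2,0,NULL\nr3,,42000\n"
header, rows = read_with_missing(sample)
for row in rows:
    print(row)
```

Raising an error on a short row is one possible design choice; it stops a misaligned row from silently shifting values into the wrong columns.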
Model Coefficients - mpg
Note. Models estimated using sample size of N=392

Predictor      Estimate    SE         t           p
Intercept      46.26431    2.66941    17.33131    < .001
cylinders      -0.39793    0.41054    -0.96927    0.333
displacement   -8.31e-5    0.00907    -0.00916    0.993
horsepower     -0.04526    0.01666    -2.71620    0.007
weight         -0.00519    8.17e-4    -6.35149    < .001
acceleration   -0.02910    0.12576    -0.23143    0.817

Model Coefficients - mpg
Note. Models estimated using sample size of N=398

Predictor      Estimate    SE         t           p
Intercept      45.86496    2.63511    17.4053     < .001
cylinders      -0.35871    0.41001    -0.8749     0.382
displacement   -0.00139    0.00910    -0.1530     0.879
horsepower     -0.03903    0.01612    -2.4216     0.016
weight         -0.00537    8.06e-4    -6.6538     < .001
acceleration   -0.00700    0.12275    -0.0571     0.955
Outlier detection
• Single extreme values (univariate)
  • Without assuming a normal distribution: box plot
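A box plot conventionally flags points lying more than 1.5 times the interquartile range (IQR) beyond the quartiles. The sketch below applies that rule without any plotting. The data values are made up for illustration, and the quartile method ("inclusive") is one of several common conventions, so the fences may differ slightly from those drawn by a given plotting tool.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR] (box plot rule)."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical measurements with one obviously extreme value.
data = [10, 12, 12, 13, 14, 15, 16, 17, 18, 60]
outliers = iqr_outliers(data)
print(outliers)
```

No assumption about the shape of the distribution is needed, because the rule relies only on quartiles.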
Density-based clustering
• Density-based spatial clustering of applications with noise (DBSCAN) is a data clustering algorithm proposed by Martin Ester, Hans-Peter Kriegel, Jörg Sander and Xiaowei Xu in 1996.
• Applicable to multi-dimensional data, where outliers are difficult to spot manually
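To make the idea concrete, here is a compact pure-Python sketch of the DBSCAN procedure, used here to flag outliers as "noise". It is a teaching illustration under stated assumptions: the parameter names eps and min_samples, the sample points, and the brute-force neighbor search are choices made for this example, and a production analysis would normally use a library implementation.

```python
import math

NOISE = -1

def region_query(points, i, eps):
    """Indices of all points within distance eps of point i (including i)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_samples):
    """Label each point with a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_samples:
            labels[i] = NOISE          # may later be claimed as a border point
            continue
        cluster += 1                   # point i is a core point: start a cluster
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # border point: joins, but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_samples:
                queue.extend(j_neighbors)  # j is also a core point: keep growing
    return labels

# Hypothetical 2-D data: two tight groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 5)]
labels = dbscan(points, eps=1.5, min_samples=3)
print(labels)
```

Points labeled -1 belong to no dense region and are candidates for outlier review. The same code works unchanged for points with more than two coordinates.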
Min-max normalization (values rescaled to the range 0 to 1)

serial  mpgMM  cylindersMM  displacementMM  horsepowerMM  weightMM  accelerationMM  model_year  origin  car_name
1       0.239  1.000        0.618           0.457         0.536     0.238           70          1       chevrolet chevelle malibu
2       0.160  1.000        0.729           0.647         0.590     0.208           70          1       buick skylark 320
3       0.239  1.000        0.646           0.565         0.517     0.179           70          1       plymouth satellite
4       0.186  1.000        0.610           0.565         0.516     0.238           70          1       amc rebel sst
5       0.213  1.000        0.605           0.511         0.521     0.149           70          1       ford torino
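The rescaling behind a table like this can be sketched as follows. The function name and the three sample values are illustrative assumptions. The values were chosen so that the middle one lands near 0.239, on the assumption that mpg ranges from 9.0 to 46.6 in the underlying data.

```python
def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no spread; map everything to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Illustrative mpg values: assumed minimum, one mid-range car, assumed maximum.
mpg = [9.0, 18.0, 46.6]
scaled = min_max_scale(mpg)
print([round(s, 3) for s in scaled])
```

Because the result depends only on the minimum and maximum, a single extreme value compresses all the other scaled values toward one end of the range.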
Z-score standardization (same variables, rescaled to mean 0 and standard deviation 1)

serial  mpgMM   cylindersMM  displacementMM  horsepowerMM  weightMM  accelerationMM  model_year  origin  car_name
1       -0.698  1.482        1.076           0.663         0.620     -1.284          70          1       chevrolet chevelle malibu
2       -1.082  1.482        1.487           1.573         0.842     -1.465          70          1       buick skylark 320
3       -0.698  1.482        1.181           1.183         0.540     -1.646          70          1       plymouth satellite
4       -0.954  1.482        1.047           1.183         0.536     -1.284          70          1       amc rebel sst
5       -0.826  1.482        1.028           0.923         0.555     -1.827          70          1       ford torino
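Values of this kind (negative below the mean, positive above it) are what z-score standardization produces. A minimal sketch follows. The sample numbers are made up, and the use of the sample standard deviation (dividing by n - 1) is an assumption; some tools divide by n instead.

```python
import statistics

def z_scores(values):
    """Standardize values to mean 0 and (sample) standard deviation 1."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1)
    return [(v - mean) / sd for v in values]

# Hypothetical raw values for one variable.
raw = [2, 4, 4, 4, 5, 5, 7, 9]
z = z_scores(raw)
print([round(v, 3) for v in z])
```

Unlike min-max scaling, the result is not confined to a fixed range, so unusually large or small observations remain visible as large absolute z-scores.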
Log transformation …1
• The proportion of words recalled is not a linear function of elapsed time,
but taking the logarithm of time makes the relationship almost linear
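The effect of transforming the predictor can be checked with a correlation coefficient. The numbers below are synthetic, generated from an assumed logarithmic curve rather than taken from any recall study, so the example only demonstrates the mechanics: the correlation with raw time is weaker than the correlation with log time.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Synthetic data: elapsed minutes and a recall proportion that declines with
# log(time). The coefficients 0.85 and 0.08 are arbitrary illustration values.
minutes = [1, 5, 15, 30, 60, 120, 240, 480, 720, 1440, 2880, 5760, 10080]
recall = [0.85 - 0.08 * math.log(t) for t in minutes]

r_raw = pearson(minutes, recall)
r_log = pearson([math.log(t) for t in minutes], recall)
print(round(r_raw, 3), round(r_log, 3))
```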
Log transformation …2
• Taking the logarithm of the dependent variable (gestation period), modeled as a
function of birthweight, stabilizes the variance of the DV
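Variance stabilization is easiest to see when the spread of a variable grows in proportion to its level. The sketch below uses artificial numbers (three levels, each with observations at 0.8, 1.0 and 1.25 times the level), not gestation data. On the raw scale the standard deviation grows with the level. On the log scale it is the same at every level.

```python
import math
import statistics

# Artificial data: spread is proportional to the level (multiplicative noise).
levels = [10.0, 100.0, 1000.0]
factors = [0.8, 1.0, 1.25]

raw_sd = {}
log_sd = {}
for level in levels:
    observations = [level * f for f in factors]
    raw_sd[level] = statistics.stdev(observations)
    log_sd[level] = statistics.stdev([math.log(v) for v in observations])

for level in levels:
    print(level, round(raw_sd[level], 3), round(log_sd[level], 3))
```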
Log transformation …3
• Sometimes both the independent variable (diameter of pine trees) and the
dependent variable (volume) must be transformed
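When both variables are logged, a power-law relationship becomes a straight line whose slope is the exponent. The sketch below fits that line by ordinary least squares. The diameters, the constant 0.002 and the exponent 2.5 are assumptions chosen for illustration, not measurements of real trees.

```python
import math

def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

# Synthetic power law: volume = 0.002 * diameter ** 2.5 (illustrative constants).
diameter = [4.0, 6.0, 8.0, 10.0, 14.0, 18.0, 24.0]
volume = [0.002 * d ** 2.5 for d in diameter]

slope, intercept = fit_line([math.log(d) for d in diameter],
                            [math.log(v) for v in volume])
print(round(slope, 3), round(math.exp(intercept), 5))
```

The fitted slope estimates the exponent and the exponentiated intercept estimates the multiplicative constant.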