Data Wrangling and Munging

Big Data Analytics

IT4140
Section E
Lecture-18, 19
OUTLINE

▪ DATA WRANGLING

▪ DATA MUNGING
DATA WRANGLING
Working with raw data sucks.
• Data comes in all shapes and sizes
– CSV files, PDFs, stone tablets, .jpg…
• Different files have different formatting
– Spaces instead of NULLs, extra rows
• “Dirty” data
– Unwanted anomalies
– Duplicates
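
As a minimal sketch of the problems listed above, the Python snippet below reads a small CSV, treats blank or whitespace-only cells as missing values, and drops exact duplicate rows. The sample data and the function name `load_clean_rows` are made up for illustration; they do not come from the lecture.

```python
import csv
import io

# Hypothetical raw input: a space instead of a NULL, an empty cell,
# and one exact duplicate row.
RAW = """name,city,age
Ann,Paris,31
Bob, ,45
Ann,Paris,31
Cid,Rome,
"""

def load_clean_rows(text):
    """Parse CSV text, map blank cells to None, and drop exact duplicate rows."""
    seen = set()
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        # Strip whitespace; an empty result is treated as a missing value.
        cleaned = {k: ((v or "").strip() or None) for k, v in record.items()}
        key = tuple(cleaned.items())
        if key in seen:
            continue  # exact duplicate of an earlier row
        seen.add(key)
        rows.append(cleaned)
    return rows

rows = load_clean_rows(RAW)
```

Only exact duplicates are removed here. Near-duplicates (for example "Ann" vs "ann") would need the kind of entity resolution the later slides mention.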
DATA WRANGLING
Current Tools:
• Focus on specific problems
– Resolving entities
– Removing duplicates
– Schema matching
• Most systems are non-interactive
– Inaccessible to general audience
• A lot of people just use Excel or regular expressions…
DATA WRANGLING: Goal
• Goal: extract and standardize the raw data
– Combine multiple data sources
– Clean data anomalies
• Combine automation with interactive visualizations to
aid in cleaning
• Improve efficiency and scale of data importing
• Lower the threshold for broader audiences
DATA WRANGLING
Types of Data Problems:
• Missing data
• Incorrect / inaccurate / inconsistent data
• Inconsistent representations of the same data
• Duplicate data
• Ambiguous data
• Hidden data
• Too much data (data overfit)
• Too little data (data underfit)
• About 75% of data problems require human intervention
• Cleaning data vs overly-sanitizing data
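
One of the problems above, inconsistent representations of the same data, can be sketched in a few lines of Python. The helper `canonical` and the sample values are hypothetical. The normalization rule (lower-case, punctuation treated as spaces, whitespace collapsed) is only one possible choice and can over-sanitize real data.

```python
from collections import Counter

def canonical(value):
    """Lower-case the text, treat punctuation as spaces, collapse whitespace."""
    spaced = "".join(ch if ch.isalnum() else " " for ch in value.lower())
    return " ".join(spaced.split())

# Hypothetical column with several spellings of the same two cities.
values = ["New York", "new york ", "NEW-YORK", "Boston", "boston."]

# Count how many raw spellings collapse onto each canonical form.
counts = Counter(canonical(v) for v in values)
```

Here `counts` groups the five raw strings into two canonical forms. A person should still review the groups, since a rule this blunt can merge values that are genuinely different.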
DATA WRANGLING
Diagnosing data problems:
• Visualizations can convey “raw” data
• Different visual representations highlight different types of data
issues
– Outliers often stand out in a plot
– Missing data will show up as a gap or a zero value
• Becomes increasingly difficult as data gets larger
– Visual design coupled with interaction
– Sampling
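
The slide does not name a specific method. As one possible sketch, the Python snippet below flags outliers with the common 1.5 × IQR rule and draws a reproducible random sample for inspecting a larger data set. The function name `iqr_outliers`, the threshold `k`, and the readings are assumptions made for illustration.

```python
import random
import statistics

def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical sensor readings with one obvious anomaly.
readings = [10, 12, 11, 13, 12, 11, 95, 12, 10, 13]
outliers = iqr_outliers(readings)

# For large data, look at a reproducible random sample instead of every row.
rng = random.Random(0)
sample = rng.sample(readings, k=5)
```

A numeric rule like this complements the plot rather than replacing it: it finds the point that "stands out", but deciding whether it is an error or a real event remains a human judgement.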
DATA WRANGLING
Visualizing Missing Data:
• Set values to zero?
• Interpolate based on existing data?
• Omit missing data?
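
The three options above give different pictures of the same series. The following Python sketch applies each of them to a short list in which `None` marks a missing value. The function names and the sample series are illustrative, and the interpolation is plain linear interpolation between the nearest known neighbours.

```python
def fill_zero(series):
    """Option 1: replace every missing value with zero."""
    return [0 if v is None else v for v in series]

def interpolate_linear(series):
    """Option 2: fill interior gaps linearly between the nearest known values.

    Leading or trailing gaps have no neighbour on one side and stay None.
    """
    result = list(series)
    known = [i for i, v in enumerate(series) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (series[right] - series[left]) / (right - left)
        for i in range(left + 1, right):
            result[i] = series[left] + step * (i - left)
    return result

def omit_missing(series):
    """Option 3: drop the missing values entirely."""
    return [v for v in series if v is not None]

series = [2.0, None, 4.0, None, None, 10.0]
zeroed = fill_zero(series)
interpolated = interpolate_linear(series)
omitted = omit_missing(series)
```

Zero-filling introduces artificial dips, interpolation hides the fact that values were missing, and omission shortens the series, so whichever option is chosen should be made visible to the viewer.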
DATA WRANGLING
Visualizing Uncertain Data:
• Can arise from:
– Measurement errors
– Missing data
– Sampling
• Visualization must
– Consider all components of uncertainty
– Depict multiple kinds of uncertainty
– Interact with uncertainty depictions
DATA WRANGLING
Transforming Data:
• Splitting columns, converting into meaningful records
• Typical methods: regular expressions, programming by
demonstration
– Prone to errors, tedious
• Interactive tools simplify the process
– Guide user through setting automated constraints
– Generates scripts for the user
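
To illustrate the regular-expression approach mentioned above, here is a small Python sketch that splits one raw text column into named fields. The input layout ("Last, First (YYYY-MM-DD)"), the pattern, and the function name `parse_line` are assumptions chosen for the example.

```python
import re

# Hypothetical raw layout: "Last, First (YYYY-MM-DD)".
PATTERN = re.compile(
    r"^\s*(?P<last>[^,]+),\s*"            # everything up to the first comma
    r"(?P<first>[^(]+?)\s*"               # name text before the parenthesis
    r"\((?P<born>\d{4}-\d{2}-\d{2})\)\s*$"  # ISO-style date in parentheses
)

def parse_line(line):
    """Split one raw line into a record, or return None if it does not match."""
    match = PATTERN.match(line)
    return match.groupdict() if match else None

record = parse_line("Lovelace, Ada (1815-12-10)")
rejected = parse_line("Ada Lovelace 1815")
```

Lines that do not fit the pattern come back as `None` rather than being silently mangled, which is where the tedium noted on the slide arises: every new variant in the raw data needs another pattern or a manual fix.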
DATA WRANGLING
Transforming Data:
• Data formatting, extraction, and conversion
• Correcting erroneous values
• Integrating multiple data sets
DATA WRANGLING
Editing and Auditing Transformations:
• Data Provenance
– Maintaining the data history
– Track the lineage of a specific item’s origins
• Used for the modification, reuse, and understanding of a
transformation
• What transformation language should be used?
– Extend existing languages?
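
As a rough sketch of "maintaining the data history", the Python snippet below applies a list of named transformation steps and records what each one did. The function `apply_with_provenance`, the step names, and the log fields are hypothetical. The log describes the transformation sequence only and does not track the lineage of individual items.

```python
def apply_with_provenance(rows, steps):
    """Apply named steps in order and log row counts before and after each."""
    history = []
    for name, func in steps:
        before = len(rows)
        rows = func(rows)
        history.append({"step": name, "rows_in": before, "rows_out": len(rows)})
    return rows, history

# Each step is a (name, function) pair so the log can be replayed or audited.
steps = [
    ("strip_whitespace", lambda rs: [r.strip() for r in rs]),
    ("drop_empty", lambda rs: [r for r in rs if r]),
    ("drop_duplicates", lambda rs: list(dict.fromkeys(rs))),
]

cleaned, history = apply_with_provenance([" a", "b ", "", "a", "  "], steps)
```

Keeping the steps as data, rather than as one opaque script, is what makes a transformation easy to modify, reuse, or explain later.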
DATA WRANGLING
Wrangling in the Cloud:
• Allows the sharing of data transformations
• Mining records of wrangling
– Better automatic suggestions
• User-defined data types
• Feedback from downstream analysts
– Crowdsourcing the final result
– Allow users to annotate or correct the data
DATA WRANGLING
Seven command-line tools for data wrangling

https://datascienceworkshops.com/blog/seven-command-line-tools-for-data-science/
DATA WRANGLING: Type of Problem? Case Study
• When data are completely random, the fact that the data are lacking is
independent of the observed and unobserved data.
• Dirty data (also known as which data?) are inaccurate, incomplete or
irrelevant data, especially in a computer system or database.
• Data is any record that inadvertently shares data with another
record in a database.
• The data is easy to spot, and it mostly occurs when transferring data
between systems.
• A file refers to every type of data that is not visible at all when using
a standard viewer, or under certain settings.
DATA WRANGLING Vs DATA MUNGING
• Data wrangling is sometimes referred to as data munging.
• Data munging is the process of transforming and mapping data from one
"raw" form into another format, with the intent of making it more
appropriate and valuable for a variety of downstream purposes such as
analytics.
• It also means preparing your data for a dedicated purpose (e.g. pre-processing,
data splitting, data curation, data normalization).
• Data munging is the process of removing errors and combining complex
data sets to make them more accessible and easier to analyse.
DATA WRANGLING or DATA MUNGING? Case Study
▪ Calculate the range of the data set to set the pattern of big data.
▪ When data are completely random, the fact that the data are
lacking is independent of the observed and unobserved data.
▪ Subtract the minimum x value from the value of this data point to
decide the organisation of big data.
▪ Data is any record that inadvertently shares data with another record
in a database.
▪ Repeat with additional data points to fill the null values in the data
set.
▪ The process of creating, organizing and maintaining data sets so they
can be accessed and used by people looking for information.
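
Three of the statements above (calculate the range, subtract the minimum, repeat for every data point) resemble the steps of min-max normalization. The Python sketch below is written under that reading. Dividing by the range is an assumption, since the slide does not state that step, and the function name `min_max_normalize` is illustrative.

```python
def min_max_normalize(values):
    """Rescale values to [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    span = hi - lo  # step 1: the range of the data set
    if span == 0:
        # All values are identical; map them to 0.0 to avoid dividing by zero.
        return [0.0 for _ in values]
    # Step 2: subtract the minimum from each point, then divide by the range
    # (the division is assumed, not stated on the slide).
    # Step 3: the comprehension repeats this for every data point.
    return [(v - lo) / span for v in values]

scaled = min_max_normalize([10, 20, 30, 50])
```

For the sample input the minimum maps to 0.0 and the maximum to 1.0, with the other points spaced proportionally between them.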
