Data munging, or data wrangling, is essential for data scientists, involving the cleaning and formatting of data from various sources such as proprietary, government, academic, and crowdsourced datasets. Proper data cleaning is crucial to ensure usability, addressing issues like errors, artifacts, compatibility, and missing values. Techniques such as unit conversions, name unification, and outlier detection are vital for accurate data analysis.
Data Munging
• Good data scientists spend most of their time cleaning and formatting data.
• The rest spend most of their time complaining that there is no data available.
• Data munging, or data wrangling, is the art of acquiring data and preparing it for analysis.

Sources of Data
• Proprietary data sources
• Government data sets
• Academic data sets
• Web search
• Sensor data
• Crowdsourcing
• Digitization

Proprietary Data Sources
• Facebook, Google, Amazon, Blue Cross, etc. have exciting user/transaction/log data sets.
• Most organizations have, or should have, internal data sets of interest to their business.
• Getting outside access is usually impossible:
  – Business issues, and the fear of helping the competition.
  – Privacy issues, and the fear of offending customers.
• Companies sometimes release rate-limited APIs, including Twitter and Google:
  – Providing customers and third parties with data can increase sales.
  – It is generally better for the company to provide well-behaved APIs than to have third parties scrape its site.

Government Data Sources
• Governments have made many data sets open.
• Data.gov has over 100,000 open data sets!
• The Right To Information (RTI) act enables you to ask for data that is not open.
• Preserving privacy is often the big issue in whether a data set can be released.

Academic Data Sets
• Making data available is now a requirement for publication in many fields.
• Expect to be able to find economic, medical, demographic, and meteorological data if you look hard enough.
• Track data sets down from relevant papers, and ask the authors.

Web Search/Scraping
• Scraping is the fine art of stripping text/data from a webpage.
• Libraries exist in Python to help parse/scrape the web (a minimal sketch appears after the next slide), but first search:
  – Are APIs available from the source?
  – Did someone previously write a scraper?
• Terms of service limit what you can legally do.

A Few Available Data Sources
• Bulk downloads: e.g. Wikipedia, IMDB, the Million Song Dataset.
• API access: e.g. New York Times, Twitter, Facebook, Google.
• Be aware of rate limits and terms of use.
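To make the scraping workflow above concrete, here is a minimal sketch using two common Python libraries, requests and BeautifulSoup. The URL, table id, and field names are hypothetical placeholders; a real scraper should first check whether an API or bulk download exists and must respect the site's rate limits and terms of service.

```python
# Minimal scraping sketch: hypothetical URL and table structure.
# Requires: pip install requests beautifulsoup4
import time

import requests
from bs4 import BeautifulSoup

BASE_URL = "https://example.org/films"  # hypothetical source page


def scrape_film_table(url):
    """Fetch one page and pull rows out of a hypothetical <table id="films">."""
    response = requests.get(url, headers={"User-Agent": "course-demo-scraper"})
    response.raise_for_status()

    soup = BeautifulSoup(response.text, "html.parser")
    table = soup.find("table", id="films")
    if table is None:
        return []

    rows = []
    for tr in table.find_all("tr")[1:]:  # skip the header row
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if len(cells) >= 2:
            rows.append({"title": cells[0], "gross": cells[1]})
    return rows


if __name__ == "__main__":
    records = scrape_film_table(BASE_URL)
    print(f"scraped {len(records)} rows")
    time.sleep(1)  # be polite: pause between requests when scraping many pages
```

If the source already offers a bulk download or an API, prefer that over scraping.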
Sensor Data Logging
• The "Internet of Things" can do amazing things:
  – Image/video data can do many things, e.g. measuring the weather using Flickr images.
  – Measure earthquakes using accelerometers in cell phones.
  – Identify traffic flows through GPS on taxis.
• Build logging systems: storage is cheap! (A small logging sketch appears after the next slide.)

Crowdsourcing
• Many amazing open data resources have been built up by teams of contributors:
  – Wikipedia/Freebase
  – IMDB
• Crowdsourcing platforms like Amazon Mechanical Turk enable you to pay armies of people to help you gather data, e.g. for human annotation.
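As a minimal illustration of the "build logging systems: storage is cheap" point from the Sensor Data Logging slide, the sketch below appends timestamped readings to a CSV file. The read_sensor function is a hypothetical stand-in for whatever device or API is actually being sampled.

```python
# Minimal sensor-logging sketch: timestamp every reading and keep it all.
# read_sensor() is a hypothetical placeholder for a real device or API call.
import csv
import random
import time
from datetime import datetime, timezone


def read_sensor():
    """Stand-in for a real sensor read (returns a fake temperature)."""
    return 20.0 + random.random()


def log_readings(path="sensor_log.csv", n_samples=5, interval_sec=1.0):
    with open(path, "a", newline="") as f:
        writer = csv.writer(f)
        for _ in range(n_samples):
            ts = datetime.now(timezone.utc).isoformat()  # log in UTC
            writer.writerow([ts, read_sensor()])
            time.sleep(interval_sec)


if __name__ == "__main__":
    log_readings()
```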
Digitization
• But sometimes you must work for your data instead of stealing it.
• Much historical data still exists only on paper or PDF, requiring manual entry and curation.
• At one record per minute, you can enter 1,000 records in only about two work days.

Cleaning Data: Garbage In, Garbage Out
• Data collected in raw form may not be in a usable form for analysis.
• Before we start analysis, proper cleaning is required.
• Cleaning of data may include:
  – Distinguishing errors from artifacts
  – Data compatibility / unification
  – Imputation of missing values
  – Estimating unobserved (zero) counts
  – Outlier detection

Artifacts vs. Errors
• Data errors represent information that is fundamentally lost in acquisition:
  – The Gaussian noise blurring the resolution of our sensors represents error.
  – The two hours of missing logs because the server crashed represent error.
• Artifacts are generally systematic problems arising from processing done to the raw information.

First-time Scientific Authors by Year?
• In a bibliographic study, Skiena analyzed PubMed data to identify the year of first publication for the 100,000 most frequently cited authors.
• What should the distribution of new top authors by year look like?
• It is important to have a preconception of any result, to help detect anomalies.

Might this be Right?
• What artifacts do you see? What possible explanations could cause them?
• PubMed began recording author first names in 2002, so "SS Skiena" became "Steven S Skiena".
• Data cleaning gets rid of such artifacts.

Data Compatibility
• Data needs to be carefully handled to make "apples to apples" comparisons:
  – Unit conversions
  – Numerical representation conversions
  – Name unification
  – Time/date unification
  – Financial unification

Unit Conversions
• It makes no sense to compare weights of 123.5 against 78.9 when one is in pounds and the other is in kilograms.
• It makes no sense to directly compare the movie gross of Gone with the Wind against that of Avatar, because 1939 dollars are 15.43 times more valuable than 2009 dollars.
• It makes no sense to compare the price of gold at noon today in New York and London, because the time zones are five hours apart and the prices are affected by intervening events.
• It makes no sense to compare the stock price of Microsoft on February 17, 2003 to that of February 18, 2003, because the intervening 2-for-1 stock split cut the price in half while reflecting no change in real value.
• NASA lost the $125 million Mars Climate Orbiter on September 23, 1999 due to a metric conversion issue.

Numerical Representation Conversions
• Numerical features are the easiest to incorporate into mathematical models.
• But even turning numbers into numbers can have issues.
• Numerical fields might be represented in different ways: as integers (123), as decimals (123.5), or even as fractions (123 1/2). Numbers can even be represented as text, requiring the conversion from "ten million" to 10000000 for numerical processing.
• The Ariane 5 rocket exploded in 1996 due to a bad 64-bit float to 16-bit integer conversion.

Name Unification
• Databases show Skiena's publications as authored by the Cartesian product of his first (Steve, Steven, or S.), middle (Sol, S., or blank), and last (Skiena) names, allowing for nine different variations.
• Things get worse if we include misspellings (Skienna and Skeina).
• Use simple transformations to unify names, like lowercasing and removing middle names (see the sketch below).
• Unify records by other parameters, like co-authors, affiliation, nature of publication, and field of research.
• There is a tradeoff between false positives and false negatives.
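The simple transformations suggested on the Name Unification slide (lowercasing, stripping punctuation, dropping middle names) can be captured in a small normalization function. This is only an illustrative sketch of those rules, not a full record-linkage system; the example variants come from the slide above.

```python
# Sketch of simple name unification: lowercase, strip punctuation,
# and keep only the first and last name tokens.
import re


def unify_name(name):
    name = name.lower()
    name = re.sub(r"[^\w\s]", " ", name)  # drop periods, commas, etc.
    tokens = name.split()
    if len(tokens) > 2:
        tokens = [tokens[0], tokens[-1]]  # drop middle names/initials
    return " ".join(tokens)


variants = ["Steven S. Skiena", "Steven Sol Skiena", "STEVEN SKIENA", "S. Skiena"]
print({v: unify_name(v) for v in variants})
# The first three variants all map to "steven skiena", but "S. Skiena" still
# differs, so additional evidence (co-authors, affiliation, field) is needed,
# and stricter or looser rules trade false positives against false negatives.
```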
Date/Time Unification
• Align all time measurements to the same time zone.
• The Gregorian calendar is standard throughout the technology world.
• Financial time series are tricky because of weekends and holidays: how do you correlate stock prices with temperatures?

Financial Unification
• Currency conversion uses exchange rates.
• The other important correction is for inflation.
• A meaningful way to represent price changes over time is usually not differences but returns (in percent).

Dealing with Missing Data
• An important aspect of data cleaning is properly representing missing data:
  – What is the year of death of a living person?
  – What about a field left blank, or filled with an obviously outlandish value?
  – What is the frequency of events too rare to see?

Imputing Missing Values
• With enough training data, one might drop all records with missing values, but we may still want to apply the model to records with missing fields.
• Often it is better to estimate or impute missing values instead of leaving them blank:
  – Mean value imputation: leaves the mean of the field unchanged.
  – Random value imputation: repeatedly selecting random values permits statistical evaluation of the impact of imputation.
  – Imputation by interpolation: using linear regression to predict missing values works well if few fields are missing per record.
• A short pandas sketch of these strategies, together with a simple outlier check, follows the Detecting Outliers slide.

Outlier Detection
• The largest reported dinosaur vertebra is 50% larger than all others: presumably a data error.
• Look critically at the maximum and minimum values of all variables.
• Normally distributed data should not have large outliers.

Detecting Outliers
• Visually, it is easy to detect outliers, but only in low-dimensional spaces.
• Outlier detection can be thought of as an unsupervised learning problem, like clustering.
• Points which are far from their cluster center are good candidates for outliers.
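To make the imputation strategies and the outlier checks concrete, here is a minimal sketch assuming pandas and NumPy. The toy temperature and vertebra columns are invented purely for illustration, and the 3-times-median-deviation threshold is an assumed rule of thumb rather than a rule from the slides.

```python
# Sketch of the three imputation strategies plus a simple outlier check.
# The toy data below is invented purely for illustration.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "temperature": [21.5, np.nan, 23.1, np.nan, 24.0, 24.6],  # has missing values
    "vertebra_cm": [60.0, 62.5, 58.9, 61.2, 59.4, 95.0],      # last value looks too big
})

# Mean value imputation: fill blanks with the column mean (leaves the mean unchanged).
df["temp_mean"] = df["temperature"].fillna(df["temperature"].mean())

# Random value imputation: draw replacements from the observed values; repeating
# this with different seeds shows how sensitive later results are to imputation.
rng = np.random.default_rng(0)
observed = df["temperature"].dropna().to_numpy()
df["temp_random"] = df["temperature"].apply(
    lambda x: rng.choice(observed) if np.isnan(x) else x
)

# Imputation by interpolation: estimate each missing value from its neighbors.
df["temp_interp"] = df["temperature"].interpolate()

# Outlier check: inspect min/max, then flag points far from the median
# (a robust center, since one huge value inflates the ordinary standard deviation).
print(df["vertebra_cm"].describe())
deviation = (df["vertebra_cm"] - df["vertebra_cm"].median()).abs()
df["vertebra_outlier"] = deviation > 3 * deviation.median()
print(df)
```

On real data you would choose the strategy per field; interpolation, for example, only makes sense for ordered data such as time series.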