Data wrangling
Data wrangling, sometimes referred to as data munging, is the process of transforming and mapping data
from one "raw" data form into another format with the intent of making it more appropriate and valuable
for a variety of downstream purposes such as analytics. The goal of data wrangling is to ensure the data
is of high quality and usable. Data analysts typically spend the majority of their time wrangling data
rather than actually analyzing it.
The process of data wrangling may include further munging, data visualization, data aggregation, training a
statistical model, as well as many other potential uses. Data wrangling typically follows a set of general
steps which begin with extracting the data in a raw form from the data source, "munging" the raw data (e.g.
sorting) or parsing the data into predefined data structures, and finally depositing the resulting content into a
data sink for storage and future use.[1] It is closely aligned with the ETL process.
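As a minimal illustration of this extract-munge-deposit flow, consider the Python sketch below; the file, table, and column names are hypothetical, and the source is assumed to be a CSV file with "name" and "age" columns:

    import csv
    import sqlite3

    # Extract: pull raw rows from the data source (hypothetical "raw_data.csv").
    with open("raw_data.csv", newline="") as f:
        rows = list(csv.DictReader(f))

    # Munge: sort the raw data and parse it into a predefined structure.
    rows.sort(key=lambda r: r["name"])
    records = [(r["name"], int(r["age"])) for r in rows]

    # Deposit: write the result into a data sink for storage and future use.
    conn = sqlite3.connect("sink.db")
    conn.execute("CREATE TABLE IF NOT EXISTS people (name TEXT, age INTEGER)")
    conn.executemany("INSERT INTO people VALUES (?, ?)", records)
    conn.commit()
    conn.close()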
Background
The "wrangler" non-technical term is often said to derive from work done by the United States Library of
Congress's National Digital Information Infrastructure and Preservation Program (NDIIPP) and their
program partner the Emory University Libraries based MetaArchive Partnership. The term "mung" has
roots in munging as described in the Jargon File.[2] The term "data wrangler" was also suggested as the
best analogy to describe someone working with data.[3]
One of the first mentions of data wrangling in a scientific context was by Donald Cline during the
NASA/NOAA Cold Lands Processes Experiment.[4] Cline stated the data wranglers "coordinate the
acquisition of the entire collection of the experiment data." Cline also specifies duties typically handled by a
storage administrator for working with large amounts of data. This can occur in areas like major research
projects and the making of films with a large amount of complex computer-generated imagery. In research,
this involves both data transfer from research instrument to storage grid or storage facility as well as data
manipulation for re-analysis via high-performance computing instruments or access via cyberinfrastructure-
based digital libraries.
With the rise of artificial intelligence in data science, it has become increasingly important for
automated data wrangling to include very strict checks and balances, which is why the munging process of
data has not been automated by machine learning. Data munging requires more than an automated
solution: it requires knowledge of what information should be removed, and artificial intelligence has not
yet reached the point of understanding such things.[5]
Benefits
An increase in raw data brings an increase in data that is not inherently useful. This increases the time
spent cleaning and organizing the data before it can be analyzed, which is where data wrangling comes
into play. The results of data wrangling can provide important metadata statistics for further insights
about the data; it is important to ensure metadata is consistent, otherwise it can cause roadblocks. Data
wrangling allows analysts to analyze more complex data more quickly and achieve more accurate results,
and because of this, better decisions can be made. Many businesses have moved to data wrangling because
of the success it has brought.
Core ideas
1. Data discovery
This all-encompassing term describes the first step: familiarize yourself with the data to
understand what it contains and how it might be used.
2. Structuring
The next step is to organize the data. Raw data is typically unorganized and much of it
may not be useful for the end product. This step is important for easier computation
and analysis in the later steps.
3. Cleaning
Cleaning takes many forms: for example, catching dates formatted in inconsistent ways,
removing outliers that would skew results, and standardizing null values. This step is
important for assuring the overall quality of the data.
4. Enriching
At this step, determine whether additional data that could easily be added would benefit
the data set.
5. Validating
This step is similar to structuring and cleaning. Apply repeated sequences of validation
rules to assure data consistency as well as quality and security. An example of a
validation rule is confirming the accuracy of fields by cross-checking data.
6. Publishing
Prepare the data set for use downstream, whether by people or software. Be sure to
document the steps and logic used during wrangling.
These steps form an iterative process that should yield a clean and usable data set, which can then be used
for analysis. The process is tedious but rewarding, as it allows analysts to extract the information they
need from a large set of data that would otherwise be unreadable. A minimal code sketch of the structuring,
cleaning, and validation steps appears below.
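The following is a minimal sketch of the structuring, cleaning, and validating steps using the Python pandas library; the column names, values, and thresholds are hypothetical:

    import pandas as pd

    # Hypothetical raw records with mixed date formats, a missing name,
    # and an extreme outlier.
    df = pd.DataFrame({
        "name": ["Ann Lee", "Bo Park", None, "Cy Wu"],
        "birth_date": ["1989-08-12", "11/12/1965", "June 15, 1972", "2-6-1985"],
        "income": [52000, 48000, 51000, 9900000],
    })

    # Structuring: parse the mixed date strings into one datetime column
    # (format="mixed" requires pandas 2.0 or later).
    df["birth_date"] = pd.to_datetime(df["birth_date"], format="mixed")

    # Cleaning: drop rows with missing names and remove the income outlier.
    df = df.dropna(subset=["name"])
    df = df[df["income"] < df["income"].quantile(0.99)]

    # Validating: enforce a simple rule -- all birth dates lie in the past.
    assert (df["birth_date"] < pd.Timestamp.now()).all()

In practice these rules are applied iteratively: each pass over the data tends to expose new inconsistencies to structure, clean, or validate.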
Starting data

Name          Phone            Birth date       State
John Smith    445-881-4478     August 12, 1989  Maine
Jennifer Tal  +1-189-456-4513  11/12/1965       Tx
Gates, Bill   (876)546-8165    June 15, 72      Kansas
Alan Fitch    5493156648       2-6-1985         Oh
Jacob Alan    156-4896         January 3        Alabama

Result

Name          Phone         Birth date  State
John Smith    445-881-4478  1989-08-12  Maine
Jennifer Tal  189-456-4513  1965-11-12  Texas
Bill Gates    876-546-8165  1972-06-15  Kansas
Alan Fitch    549-315-6648  1985-02-06  Ohio
The result of applying the data wrangling process to this small data set is a data set that is significantly
easier to read. All names are now formatted the same way, {first name last name}; phone numbers are also
formatted the same way, {area code-XXX-XXXX}; dates are formatted numerically, {YYYY-mm-dd}; and
states are no longer abbreviated. The entry for Jacob Alan did not have fully formed data (the area code of
the phone number was missing and the birth date had no year), so it was discarded from the data set. Now
that the resulting data set is cleaned and readable, it is ready to be deployed or evaluated.
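One way such phone and date normalizations could be implemented is sketched below in Python; the format list is illustrative, and a fuller version would also handle cases such as two-digit years:

    import re
    from datetime import datetime

    def normalize_phone(raw):
        # Strip everything but digits; tolerate a leading country code,
        # then require a full 10-digit number, so entries with a missing
        # area code come back as None and can be discarded.
        digits = re.sub(r"\D", "", raw)
        if len(digits) == 11 and digits.startswith("1"):
            digits = digits[1:]
        if len(digits) != 10:
            return None
        return f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"

    def normalize_date(raw):
        # Try a few common layouts and emit ISO YYYY-mm-dd; dates with
        # no year fail every format and come back as None.
        for fmt in ("%B %d, %Y", "%m/%d/%Y", "%m-%d-%Y"):
            try:
                return datetime.strptime(raw, fmt).strftime("%Y-%m-%d")
            except ValueError:
                pass
        return None

    print(normalize_phone("(876)546-8165"))   # -> 876-546-8165
    print(normalize_date("August 12, 1989"))  # -> 1989-08-12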
Typical use
The data transformations are typically applied to distinct entities (e.g. fields, rows, columns, data values,
etc.) within a data set, and could include such actions as extractions, parsing, joining, standardizing,
augmenting, cleansing, consolidating, and filtering to create desired wrangling outputs that can be
leveraged downstream.
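For instance, a brief pandas sketch of three of these actions (standardizing, joining, and filtering) on hypothetical entities:

    import pandas as pd

    orders = pd.DataFrame({"customer_id": [1, 2, 2], "amount": [50, 20, 300]})
    customers = pd.DataFrame({"customer_id": [1, 2], "state": [" me", "TX "]})

    # Standardizing: trim whitespace and normalize case in a column.
    customers["state"] = customers["state"].str.strip().str.upper()

    # Joining: combine two entities on a shared key.
    merged = orders.merge(customers, on="customer_id")

    # Filtering: keep only the rows a downstream consumer needs.
    result = merged[merged["amount"] >= 50]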
The recipients could be individuals, such as data architects or data scientists who will investigate the data
further, business users who will consume the data directly in reports, or systems that will further process the
data and write it into targets such as data warehouses, data lakes, or downstream applications.
Modus operandi
Depending on the amount and format of the incoming data, data wrangling has traditionally been
performed manually (e.g. via spreadsheets such as Excel), with tools like KNIME, or via scripts in
languages such as Python or SQL. R, a language often used in data mining and statistical data analysis, is
now also sometimes used for data wrangling.[6] Data wranglers typically have skill sets in R or Python,
SQL, PHP, Scala, and other languages commonly used for analyzing data.
Visual data wrangling systems were developed to make data wrangling accessible for non-programmers,
and simpler for programmers. Some of these also include embedded AI recommenders and programming
by example facilities to provide user assistance, and program synthesis techniques to autogenerate scalable
dataflow code. Early prototypes of visual data wrangling tools include OpenRefine and the
Stanford/Berkeley Wrangler (http://vis.stanford.edu/wrangler/) research system;[7] the latter evolved into
Trifacta.
Other terms for these processes have included data franchising,[8] data preparation, and data munging.
Example
Given a set of data that contains information on medical patients, the goal is to find correlations with a
disease. Before iterating through the data, make sure you understand the desired result: are you looking for
patients who have the disease? Are there other diseases that could be the cause? Once an understanding of
the outcome is achieved, the data wrangling process can begin.
Start by determining the structure of the outcome: what is important for understanding the disease
diagnosis? Once a final structure is determined, clean the data by removing any data points that are not
helpful or are malformed; this could include patients who have not been diagnosed with any disease.
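A minimal pandas sketch of this cleaning step, using hypothetical patient records and column names:

    import pandas as pd

    patients = pd.DataFrame({
        "patient_id": [1, 2, 3, 4],
        "diagnosis": ["diabetes", None, "asthma", "diabetes"],
        "age": [54, 37, -2, 61],  # -2 is a malformed value
    })

    # Remove unhelpful or malformed data points: patients with no
    # diagnosis and rows with impossible ages.
    patients = patients.dropna(subset=["diagnosis"])
    patients = patients[patients["age"].between(0, 120)]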
After cleaning, look at the data again: is there any already-known information that could be added to the
data set to benefit it? An example could be the most common diseases in the area; America and India are
very different when it comes to their most common diseases.
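Such enrichment could be sketched as a pandas merge; the regional values below are placeholders, not real prevalence data:

    import pandas as pd

    patients = pd.DataFrame({"patient_id": [1, 2], "region": ["US", "IN"]})
    regional = pd.DataFrame({
        "region": ["US", "IN"],
        "most_common_disease": ["heart disease", "tuberculosis"],
    })

    # Enriching: attach already-known regional information to each record.
    patients = patients.merge(regional, on="region", how="left")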
Next comes the validation step: determine validation rules to check data points for validity. This could
include checking the date of birth or screening for specific diseases.
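A brief sketch of such validation rules in pandas, using hypothetical records and an illustrative vocabulary of diagnoses:

    import pandas as pd

    records = pd.DataFrame({
        "birth_date": pd.to_datetime(["1989-08-12", "2030-01-01"]),
        "diagnosis": ["asthma", "unknown_code"],
    })
    known_diagnoses = {"asthma", "diabetes", "heart disease"}

    # Validation rules: birth dates must lie in the past and diagnoses
    # must come from a known vocabulary; failing rows are flagged.
    ok_dates = records["birth_date"] < pd.Timestamp.now()
    ok_codes = records["diagnosis"].isin(known_diagnoses)
    flagged = records[~(ok_dates & ok_codes)]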
After the validation step, the data should be organized and prepared for either deployment or evaluation.
This process can be beneficial for determining correlations in disease diagnosis, as it reduces the vast
amount of data into something that can be easily analyzed for an accurate result.
See also
Alteryx
Data janitor
Data preparation
OpenRefine
Trifacta
References
1. "What Is Data Munging?" (http://eduunix.ccut.edu.cn/index2/html/oracle/O%27Reilly%20-%2
0Perl.For.Oracle.DBAs.eBook-LiB/oracleperl-APP-D-SECT-1.html). Archived (https://web.ar
chive.org/web/20130818111618/http://eduunix.ccut.edu.cn/index2/html/oracle/O%27Reilly%
20-%20Perl.For.Oracle.DBAs.eBook-LiB/oracleperl-APP-D-SECT-1.html) from the original
on 2013-08-18. Retrieved 2022-01-21.
2. "mung" (http://catb.org/jargon/html/index.html). Mung (http://catb.org/jargon/html/M/mung.htm
l). Jargon File. Archived (https://web.archive.org/web/20120918005339/http://www.catb.org/j
argon/html/M/mung.html) from the original on 2012-09-18. Retrieved 2012-10-10.
3. As coder is for code, X is for data (https://blog.okfn.org/2011/02/11/as-coder-is-for-code-x-is-f
or-data/) Archived (https://web.archive.org/web/20210415175407/https://blog.okfn.org/2011/
02/11/as-coder-is-for-code-x-is-for-data/) 2021-04-15 at the Wayback Machine, Open
Knowledge Foundation blog post
4. Parsons, M. A.; Brodzik, M. J.; Rutter, N. J. (2004). "Data management for the Cold Land
Processes Experiment: improving hydrological science". Hydrological Processes. 18 (18):
3637–3653. Bibcode:2004HyPr...18.3637P (https://ui.adsabs.harvard.edu/abs/2004HyPr...1
8.3637P). doi:10.1002/hyp.5801 (https://doi.org/10.1002%2Fhyp.5801). S2CID 129774847
(https://api.semanticscholar.org/CorpusID:129774847).
5. "What Is Data Wrangling? What are the steps in data wrangling?" (https://expressanalytics.c
om/blog/what-is-data-wrangling-what-are-the-steps-in-data-wrangling/). Express Analytics.
2020-04-22. Archived (https://web.archive.org/web/20201101035026/https://expressanalytic
s.com/blog/what-is-data-wrangling-what-are-the-steps-in-data-wrangling/) from the original
on 2020-11-01. Retrieved 2020-12-06.
6. Wickham, Hadley; Grolemund, Garrett (2016). "Chapter 9: Data Wrangling Introduction". R
for data science : import, tidy, transform, visualize, and model data (https://r4ds.had.co.nz/wr
angle-intro.html) (First ed.). Sebastopol, CA. ISBN 978-1491910399. Archived (https://web.ar
chive.org/web/20211011025448/https://r4ds.had.co.nz/wrangle-intro.html) from the original
on 2021-10-11. Retrieved 2022-01-12.
7. Kandel, Sean; Paepcke, Andreas (May 2011). "Wrangler: Interactive Visual Specification of
Data Transformation Scripts". SIGCHI. doi:10.1145/1978942.1979444 (https://doi.org/10.114
5%2F1978942.1979444). S2CID 11133756 (https://api.semanticscholar.org/CorpusID:1113
3756).
8. What is Data Franchising? (https://www.iri.com/blog/business-intelligence/data-franchising/)
(2003 and 2017 IRI) Archived (https://web.archive.org/web/20210415175408/https://www.iri.
com/blog/business-intelligence/data-franchising/) 2021-04-15 at the Wayback Machine
External links
"What is Data Wrangling? Benefits, tools, and skills?" (https://myinfluencerjourney.com/what
-is-data-wrangling-benefits-tools-and-skills/). My Influencer Journey. Retrieved 2022-01-26.