Step by Step Data Wrangling
Data wrangling, also known as data munging, is the process of transforming raw, often
unstructured data into a clean, structured, and usable format. Its purpose is to prepare
data so that it can be used effectively for decision-making, machine learning models, and
business intelligence applications.
1. Discovering Data
What is it?
Data discovery is the first step in data wrangling, where analysts identify relevant datasets
from various sources and explore their structure, quality, and completeness. This step helps
in understanding patterns, inconsistencies, and missing values in the data.
Steps:
• Collect data from sources such as databases, spreadsheets, and APIs, or through web
scraping.
Common Tools:
• R: summary(), str()
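As a rough illustration of this exploration step, the sketch below profiles a small in-memory CSV using only the Python standard library. The sample data, the column names, and the helper profile_rows are hypothetical and stand in for whatever source is being explored.

```python
import csv
import io

# Hypothetical sample data standing in for a real source (file, database, API).
RAW_CSV = """id,name,age,city
1,Asha,34,Pune
2,Ben,,Leeds
3,Chen,29,
"""

def profile_rows(rows):
    """Return the row count, column names, and missing-value count per column."""
    columns = list(rows[0].keys()) if rows else []
    missing = {col: 0 for col in columns}
    for row in rows:
        for col in columns:
            # Treat absent or blank cells as missing.
            if row[col] is None or row[col].strip() == "":
                missing[col] += 1
    return {"row_count": len(rows), "columns": columns, "missing": missing}

rows = list(csv.DictReader(io.StringIO(RAW_CSV)))
report = profile_rows(rows)
print(report)
```

A report like this gives a first view of structure and completeness before any cleaning decisions are made.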
2. Cleaning Data
What is it?
Data cleaning involves correcting or removing errors and inconsistencies and handling
missing values in datasets to improve accuracy and reliability. This step also ensures that
the data is free from irrelevant information.
Steps:
Common Tools:
• R: na.omit(), dplyr::mutate()
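One possible cleaning pass is sketched below in plain Python. The records, field names, and the helper clean_records are hypothetical, and dropping rows with a missing age is only one policy; filling in a default or an estimated value is an equally valid choice in many cases.

```python
# Hypothetical records with a duplicate, a missing value, and inconsistent text.
records = [
    {"id": 1, "name": " Asha ", "age": "34"},
    {"id": 2, "name": "Ben", "age": ""},
    {"id": 1, "name": " Asha ", "age": "34"},
    {"id": 3, "name": "chen", "age": "29"},
]

def clean_records(rows):
    """Trim and normalize text, drop rows with a missing age, remove duplicate ids."""
    seen_ids = set()
    cleaned = []
    for row in rows:
        name = row["name"].strip().title()
        age_text = row["age"].strip()
        if not age_text:           # missing value: drop the row
            continue
        if row["id"] in seen_ids:  # duplicate: keep only the first occurrence
            continue
        seen_ids.add(row["id"])
        cleaned.append({"id": row["id"], "name": name, "age": int(age_text)})
    return cleaned

cleaned = clean_records(records)
print(cleaned)
```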
3. Structuring Data
What is it?
Structuring involves organizing raw data into a well-defined format that is easy to analyze. It
ensures that data is properly categorized and formatted.
Steps:
• Convert unstructured data (text, JSON, XML) into structured formats (tables,
relational databases).
Common Tools:
• Python: Pandas
• R: tidyverse packages (e.g., tidyr, dplyr)
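To make the conversion from unstructured to structured data concrete, the sketch below flattens nested JSON into table-like rows using the standard library. The JSON payload, its field names, and the helper flatten_orders are assumptions made for illustration only.

```python
import json

# Hypothetical nested JSON, for example as returned by an API.
RAW_JSON = """
[
  {"order_id": 101, "customer": {"name": "Asha", "city": "Pune"},
   "items": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}]},
  {"order_id": 102, "customer": {"name": "Ben", "city": "Leeds"},
   "items": [{"sku": "A1", "qty": 5}]}
]
"""

def flatten_orders(orders):
    """Produce one flat row per order item, repeating the order-level fields."""
    rows = []
    for order in orders:
        for item in order["items"]:
            rows.append({
                "order_id": order["order_id"],
                "customer_name": order["customer"]["name"],
                "customer_city": order["customer"]["city"],
                "sku": item["sku"],
                "qty": item["qty"],
            })
    return rows

table = flatten_orders(json.loads(RAW_JSON))
for row in table:
    print(row)
```

Each resulting row has the same keys, so the list can be written to a CSV file or loaded into a relational table without further reshaping.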
4. Enriching Data
What is it?
Data enrichment involves adding relevant external or additional data to improve the
dataset’s value and completeness. It enhances insights by merging new sources.
Steps:
• Feature engineering: Create new useful columns from existing ones (e.g., extracting
the year from a date column).
• Adding external data: Merge datasets to include more information (e.g., adding
weather data to sales records).
• Categorizing values: Convert numeric ranges into categories (e.g., age groups: Child,
Adult, Senior).
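The three steps above can be sketched together in plain Python. The sales records, the weather lookup, and the age cut-offs (18 and 65) are all hypothetical values chosen for the example, not thresholds given by any standard.

```python
from datetime import date

# Hypothetical sales records and an external weather lookup keyed by date.
sales = [
    {"sale_date": date(2023, 7, 1), "amount": 120.0, "customer_age": 12},
    {"sale_date": date(2024, 1, 15), "amount": 80.5, "customer_age": 41},
    {"sale_date": date(2024, 1, 16), "amount": 45.0, "customer_age": 70},
]
weather_by_date = {
    date(2023, 7, 1): "sunny",
    date(2024, 1, 15): "snow",
}

def age_group(age):
    """Bin an age into a category; the cut-offs are assumptions."""
    if age < 18:
        return "Child"
    if age < 65:
        return "Adult"
    return "Senior"

def enrich(rows, weather):
    """Return copies of the rows with a year, a weather label, and an age group."""
    enriched = []
    for row in rows:
        new_row = dict(row)
        # Feature engineering: derive the year from the date column.
        new_row["sale_year"] = row["sale_date"].year
        # Adding external data: look up weather by date (None when no match).
        new_row["weather"] = weather.get(row["sale_date"])
        # Categorizing values: convert a numeric age into a group label.
        new_row["age_group"] = age_group(row["customer_age"])
        enriched.append(new_row)
    return enriched

enriched = enrich(sales, weather_by_date)
for row in enriched:
    print(row)
```

The lookup behaves like a left join: every sales record is kept, and records without matching weather data get an empty value rather than being dropped.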
Common Tools:
5. Validating Data
What is it?
Validation ensures data accuracy, consistency, and integrity by applying rules and constraints
to detect errors.
Steps:
Common Tools:
• R: validate package
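A minimal sketch of rule-based validation in plain Python follows. The rules, field names, and the helper validate are hypothetical; real projects usually define rules that reflect their own business constraints.

```python
# Hypothetical rules: each maps a rule name to a check on a single record.
RULES = {
    "id is a positive integer": lambda r: isinstance(r.get("id"), int) and r["id"] > 0,
    "age is between 0 and 120": lambda r: isinstance(r.get("age"), int) and 0 <= r["age"] <= 120,
    "email contains '@'": lambda r: "@" in str(r.get("email", "")),
}

def validate(rows, rules):
    """Return a (row_index, rule_name) pair for every failed check."""
    failures = []
    for index, row in enumerate(rows):
        for name, check in rules.items():
            if not check(row):
                failures.append((index, name))
    return failures

records = [
    {"id": 1, "age": 34, "email": "asha@example.com"},
    {"id": -5, "age": 150, "email": "ben@example.com"},
    {"id": 3, "age": 29, "email": "not-an-email"},
]
failures = validate(records, RULES)
for index, name in failures:
    print(f"row {index} failed: {name}")
```

Reporting failures instead of silently dropping rows keeps the decision about how to handle each error with the analyst.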
6. Storing Data
What is it?
Storing involves saving the cleaned and processed data in an organized and secure format
for further analysis.
Steps:
• Save data in commonly used formats such as CSV, JSON, or Excel.
• Use cloud storage (AWS S3, Google Drive) for easy access.
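Saving to CSV and JSON can be sketched with the standard library as shown below. The records, the file names, and the helper save_records are hypothetical, and a temporary directory is used so the example does not touch real files. Uploading to cloud storage would need a provider-specific client and is not shown.

```python
import csv
import json
import tempfile
from pathlib import Path

# Hypothetical cleaned records ready to be stored.
records = [
    {"id": 1, "name": "Asha", "age": 34},
    {"id": 3, "name": "Chen", "age": 29},
]

def save_records(rows, directory):
    """Write rows to records.csv and records.json inside the given directory."""
    directory = Path(directory)
    csv_path = directory / "records.csv"
    json_path = directory / "records.json"
    with csv_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    with json_path.open("w", encoding="utf-8") as handle:
        json.dump(rows, handle, indent=2)
    return csv_path, json_path

# Write both files, then read them back to confirm the round trip.
with tempfile.TemporaryDirectory() as tmp:
    csv_path, json_path = save_records(records, tmp)
    csv_text = csv_path.read_text(encoding="utf-8")
    reloaded = json.loads(json_path.read_text(encoding="utf-8"))

print(csv_text)
print(reloaded)
```

JSON preserves the numeric types of the values on reload, while CSV stores everything as text, which is worth keeping in mind when choosing a format.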
Common Tools:
7. Documenting Data
What is it?
Documentation ensures that the entire data wrangling process is recorded for future
reference, reproducibility, and collaboration.
Steps:
Common Tools: