Intro To Data Analytics - Cleanup & Transformation
DATAWEEK
2024
Welcome onboard 😃
Welcome to Day One of DATAWEEK – 2024! It’s great to have you on board 😃
Over the course of the next few days, you’ll learn the tools most used by analysts the world over,
take on the role of a data analyst, and work with a real dataset to solve a business
challenge.
Define the question
- “appropriate data”
- first party, second party, third party
Clean & Transform
Data Analysis
- Building hypothesis
- Proving hypothesis
Share results
- Visualisation
- Storytelling
What is Data Cleanup?
- Irrelevant Data
- Structural Errors
- Duplicates
- Missing Data
- Outliers
Irrelevant Data
Remove distraction and noise: make sure that the data you’re including really needs to be there.
For example, if you are collecting data on women between the ages of 18-35, there is no reason for a 60-year-old man to appear in your data set.
Things to consider removing:
- Personally identifiable information (PII)
- URLs
- HTML tags
- Boilerplate text (such as in emails)
- Tracking codes
- Excessive blank space between text
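To make the idea of removing irrelevant data concrete, here is a minimal Python sketch. It assumes the data is a list of records and a free-text field; the sample values and the names `strip_noise`, `respondents`, and `in_scope` are made up for illustration and are not part of the DATAWEEK dataset.

```python
import re

def strip_noise(text: str) -> str:
    """Remove HTML tags, URLs, and excess blank space from a text field."""
    text = re.sub(r"<[^>]+>", " ", text)       # HTML tags
    text = re.sub(r"https?://\S+", " ", text)  # URLs (including tracking parameters)
    text = re.sub(r"\s+", " ", text)           # runs of blank space
    return text.strip()

raw = "<p>Great   product!</p> See https://example.com/item?utm_source=mail   for more"
print(strip_noise(raw))

# Keep only the rows that belong in the study: women aged 18-35.
respondents = [
    {"gender": "F", "age": 24},
    {"gender": "M", "age": 60},
    {"gender": "F", "age": 35},
]
in_scope = [r for r in respondents if r["gender"] == "F" and 18 <= r["age"] <= 35]
print(in_scope)
```

The regular expressions are deliberately simple; real HTML or unusual URLs may need a proper parser.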
Structural Errors
Structural errors in your data include things like typos, inconsistent formatting, incorrect capitalization, and any spelling issues or formatting that might confuse a machine learning model.
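A small Python sketch of fixing structural errors in a text column. The city names and the `CORRECTIONS` lookup are hypothetical; in practice you would build the lookup from the typos you actually find in your data.

```python
def normalize_label(value: str) -> str:
    """Trim, collapse inner blank space, and apply one capitalization style."""
    return " ".join(value.split()).title()

# Hypothetical lookup of known typos and their intended spelling.
CORRECTIONS = {"Mumbay": "Mumbai"}

cities = ["mumbai", " MUMBAI ", "Mumbay", "new  delhi"]
cleaned = []
for city in cities:
    label = normalize_label(city)
    cleaned.append(CORRECTIONS.get(label, label))

print(cleaned)
```

After this step the four differently typed entries collapse to two consistent labels, which is what a grouping or a model needs.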
Duplicates
When you collect or scrape data from various sources, there’s a good chance you’ll end up with duplicate items.
These duplicates could result from human error, such as a mistake made by the person entering data or filling out a form.
They can also make data difficult to interpret when you try to visualize it, so it’s preferable to get rid of them as soon as possible.
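One possible way to drop duplicates in Python, keeping the first copy of each record. The `orders` rows and the choice of key fields are assumptions for illustration; which fields define "the same record" depends on your dataset.

```python
def drop_duplicates(rows, key_fields):
    """Keep the first row seen for each combination of key fields."""
    seen = set()
    unique = []
    for row in rows:
        # Compare keys case-insensitively and without stray spaces.
        key = tuple(str(row[f]).strip().lower() for f in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

orders = [
    {"email": "asha@example.com", "order_id": "1001"},
    {"email": "Asha@Example.com ", "order_id": "1001"},  # same order, typed differently
    {"email": "ravi@example.com", "order_id": "1002"},
]
deduped = drop_duplicates(orders, ["email", "order_id"])
print(len(orders), "rows before,", len(deduped), "rows after")
```

In Excel, the Remove Duplicates command on the Data tab does a similar job, although it will not match values that differ by a trailing space.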
Missing Data
Outliers
Outliers are not necessarily incorrect, but they may give an inaccurate representation of your data if you take them into account.
We discuss this more during Exploratory data analysis!
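A tiny Python illustration of how a single outlier can distort a summary. The `delivery_days` values are invented; the point is only to compare the average with and without the extreme value.

```python
from statistics import mean, median

# Hypothetical delivery times in days; 60 is a valid but extreme value.
delivery_days = [2, 3, 3, 4, 4, 5, 60]
typical = [d for d in delivery_days if d != 60]

print("mean with outlier:   ", round(mean(delivery_days), 2))
print("mean without outlier:", mean(typical))
print("median with outlier: ", median(delivery_days))
```

The mean jumps from 3.5 to more than 11 days because of one record, while the median barely moves. How to detect and handle outliers is left to the Exploratory data analysis session.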
Data Cleanup with Excel
1. Import the data from an external data source.
2. Create a backup copy of the original data.
3. Ensure that the data is in a tabular format of rows and columns
with: similar data in each column, all columns and rows visible,
and no blank rows within the range. For best results, use an
Excel table.
4. Do tasks that don't require column manipulation first, such as
spell-checking or using the Find and Replace dialog box.
5. Next, do tasks that do require column manipulation:
• Insert a new column (B) next to the original column (A) that
needs cleaning.
• Add a formula that will transform the data at the top of the
new column (B).
• Fill down the formula in the new column (B). In an Excel
table, a calculated column is automatically created with
values filled down.
• Select the new column (B), copy it, and then paste as values
into the new column (B).
• Remove the original column (A), which converts the new
column from B to A.
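For comparison, the same backup, helper-column, and replace workflow can be sketched in Python. This is only an analogy to the Excel steps above: the `name` column and its values are made up, and the transformation mirrors what a formula like =PROPER(TRIM(A2)) would do in the helper column.

```python
import copy

original = [{"name": "  asha rao "}, {"name": "RAVI  kumar"}]

# Step 2: keep an untouched backup of the original data.
backup = copy.deepcopy(original)

working = copy.deepcopy(original)
for row in working:
    # Helper column (B): trimmed, single-spaced, properly capitalized name.
    row["name_clean"] = " ".join(row["name"].split()).title()

for row in working:
    # Replace the original column (A) with the cleaned values from (B).
    row["name"] = row.pop("name_clean")

print(working)
```

Because the cleaned values are plain strings rather than formulas, there is no separate "paste as values" step here.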
Let’s clean this data
Data Transformation
- Smoothing
- Attribute Construction
- Generalization
- Aggregation
- Normalization
- Discretization
Real-life analysts spend over 60% of their time cleaning
and transforming data!
Too advanced for now!
• Clustering: You group similar values together to form clusters
and label any value that falls outside the clusters as an outlier.
• Binning: A binning algorithm splits the data into bins
and smooths the data values within each bin.
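If you are curious, here is a minimal Python sketch of one binning variant, smoothing by bin means: sort the values, split them into equal-sized bins, and replace each value with the average of its bin. The `prices` list and the bin size of 3 are assumptions chosen for illustration; other variants smooth by bin medians or bin boundaries.

```python
def smooth_by_bin_means(values, bin_size):
    """Sort values, split into bins of bin_size, replace each value by its bin mean."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        bin_mean = sum(bin_values) / len(bin_values)
        smoothed.extend([bin_mean] * len(bin_values))
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(smooth_by_bin_means(prices, 3))
```

Each group of three neighbouring prices is replaced by one shared value, which removes small fluctuations while keeping the overall trend.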
Now that you have clean data, let us see
how we can make it more useful.
You know what to do!
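As a rough illustration of three of the transformations named in this deck (normalization, discretization, and aggregation), here is a short Python sketch. The `sales` records, the age bands, and the choice of min-max scaling are all assumptions for the example, not the method used in the session.

```python
from collections import defaultdict

sales = [
    {"region": "North", "age": 22, "amount": 100.0},
    {"region": "South", "age": 41, "amount": 300.0},
    {"region": "North", "age": 67, "amount": 200.0},
]

# Normalization: rescale amounts to the 0-1 range (min-max scaling).
low = min(r["amount"] for r in sales)
high = max(r["amount"] for r in sales)
for r in sales:
    r["amount_scaled"] = (r["amount"] - low) / (high - low)

# Discretization: turn a continuous age into a labelled band.
def age_band(age):
    if age < 30:
        return "under 30"
    if age < 60:
        return "30-59"
    return "60+"

for r in sales:
    r["age_band"] = age_band(r["age"])

# Aggregation: total amount per region.
totals = defaultdict(float)
for r in sales:
    totals[r["region"]] += r["amount"]

print(sales)
print(dict(totals))
```

In Excel, the equivalent tools would be a formula column for scaling, IF or lookup formulas for the bands, and a PivotTable for the totals.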
CONNECT WITH US
+91 93217 48851