Intro To Data Analytics - Cleanup & Transformation

Uploaded by

choudhary.singhcs.aiml23


by bluetick.

DATAWEEK
2024
Welcome onboard 🙂

Welcome to Day one of DATAWEEK – 2024! It’s great to have you on board 😃

Over the course of the next few days, you’ll learn the tools most used by analysts the world over,
take on the role of a data analyst, and work with a real dataset to solve a business
challenge.

By the end of the week, you will:

• Be familiar with all the key steps in the data analysis process
• Understand, and be able to apply, some fundamental analysis techniques
• Have a first-hand glimpse of what it’s like to work as a data analyst
But why a career in data?

- Is it as big as everybody says it is?
- What kinds of industries and companies might you work for?
- Is this really a secure career choice with high demand?

The Jobs of Tomorrow report published by the World Economic Forum in 2020
identifies data and artificial intelligence (AI) as one of seven high-growth
emerging professions, showing the highest growth rate at 41% per year.

Let’s take a look 👀

If you research the most in-demand tech
skills for 2024 and beyond, you’ll find that
data analytics crops up time and time again.
Introduction to Data Analytics
Real-life analysts spend over 80% of their time cleaning
and transforming data!
What is Data Analytics ?

Data Analysis Process

1. Define the question
- The ‘problem statement’
- Data needed to solve this
- Sometimes there is back & forth
2. Collect the data
- “Appropriate data”
- First party, second party, third party
3. Clean & Transform
- High quality data
- Garbage in, garbage out
4. Data Analysis
- Building hypothesis
- Proving hypothesis
5. Share results
- Visualisation
- Storytelling
Help Cinemate compete with Netflix

Cinemate, an open source online streaming platform, boasts an extensive
collection of TV shows and movies spanning over a century, offering diverse
genres to its viewers. Anybody with a valid rating can upload content on
Cinemate.

You have been tasked with extracting valuable insights that will aid in
understanding and planning their operations and marketing activities for the
next few months.

The company is looking to understand its current library and plan operations
and events based on what has been uploaded on the product over the last few years.
Data Cleanup & Transformation
Real-life analysts spend over 60% of their time cleaning
and transforming data!
What is Data Cleanup ?

Irrelevant Data

Structural Errors

Duplicates

Missing Data

Outliers
Data Cleanup : Irrelevant Data

Remove distraction and noise → make sure that the data
you’re including really needs to be there.

For example, if you are collecting data on women between
the ages of 18-35, there is no reason for a 60-year-old man
to appear in your data set.

Other items that are often irrelevant to the analysis:

- Personally identifiable information (PII)
- URLs
- HTML tags
- Boilerplate text (such as in emails)
- Tracking codes
- Excessive blank space between text
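As a rough illustration of stripping a few of the items listed above (HTML tags, URLs and excess blank space) from a text field, here is a minimal Python sketch. The function name `strip_noise` and the sample string are made up for this example, and the regular expressions are deliberately simple; they are not a full HTML parser.

```python
import re

def strip_noise(text):
    """Remove HTML tags, URLs and repeated whitespace from a text field."""
    text = re.sub(r"<[^>]+>", " ", text)        # drop HTML tags
    text = re.sub(r"https?://\S+", " ", text)   # drop URLs
    return re.sub(r"\s+", " ", text).strip()    # collapse blank space

# Hypothetical review text scraped from a web page
print(strip_noise("<p>Great   film!</p> See https://example.com/x  now"))
# Great film! See now
```

In Excel, similar results are usually reached with Find and Replace or functions such as TRIM.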
Data Cleanup : Structural Errors

Structural errors in your data include things like typos,
inconsistent formatting, incorrect capitalization, and any
spelling issues or formatting that might confuse a machine
learning model.

- Typos, or a date spelled out rather than written as a number
- Date and time formats or units of measurement that need standardizing
- Capitalisation that needs standardizing
- Numbers stored as text
- Unnecessary punctuation in data such as email addresses
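The kinds of structural fixes listed above can be sketched in Python as follows. This is only an illustration under assumptions: the record layout (`name`, `amount`, `date`), the function name `clean_record` and the three accepted date layouts are invented for the example.

```python
from datetime import datetime

def clean_record(raw):
    """Standardize one hypothetical raw record (a dict of strings)."""
    # Trim stray spaces and standardize capitalization
    name = " ".join(raw["name"].split()).title()
    # A number stored as text becomes a real number
    amount = float(raw["amount"].replace(",", ""))
    # Try a few assumed date layouts and emit one standard format (ISO 8601)
    date = None  # stays None if no layout matches, so it can be reviewed by hand
    for fmt in ("%d/%m/%Y", "%Y-%m-%d", "%d %B %Y"):
        try:
            date = datetime.strptime(raw["date"].strip(), fmt).date().isoformat()
            break
        except ValueError:
            continue
    return {"name": name, "amount": amount, "date": date}

print(clean_record({"name": "  aSHa   rao ", "amount": "1,250.50", "date": "5 March 2024"}))
# {'name': 'Asha Rao', 'amount': 1250.5, 'date': '2024-03-05'}
```

Leaving unrecognized dates as `None` rather than guessing keeps the problem visible for manual review.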
Data Cleanup : Duplicates

When you collect or scrape data from various sources,
there’s a good chance you’ll end up with duplicate items.
These duplicates could result from human error, such as a
mistake made by the person entering data or filling out a form.

Duplicates can significantly alter your data and/or cause
confusion in your results.

They can also make data difficult to interpret when you try
to visualize it, so it’s preferable to get rid of them as soon
as possible.
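One possible way to drop duplicates is sketched below in Python. The helper `drop_duplicates`, the sample titles and the choice to compare values after trimming spaces and lowercasing are all assumptions made for illustration; what counts as "the same row" depends on your dataset.

```python
def drop_duplicates(rows, key_fields):
    """Keep the first row for each normalized key; return (unique, dropped)."""
    seen = set()
    unique, dropped = [], []
    for row in rows:
        # Normalize so that "Inception" and "inception " count as the same item
        key = tuple(str(row[f]).strip().lower() for f in key_fields)
        if key in seen:
            dropped.append(row)
        else:
            seen.add(key)
            unique.append(row)
    return unique, dropped

# Hypothetical library entries
titles = [
    {"title": "Inception", "year": 2010},
    {"title": "inception ", "year": 2010},
    {"title": "Heat", "year": 1995},
]
unique, dropped = drop_duplicates(titles, ["title", "year"])
print(len(unique), len(dropped))  # 2 1
```

Returning the dropped rows as well makes it easy to check what was removed before committing to the change.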
Data Cleanup : Missing Data

There are 3 possibilities when it comes to incomplete data:

- Remove all observations with missing values.
- Fill in the blanks.
- Leave blanks as-is.

What you do depends on your analytical aims and
what you want from the data!
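The three options for incomplete data can be sketched in Python like this. The `ratings` column is hypothetical, and filling with the median is just one assumed strategy among many (mean, a fixed value, a value from a neighbouring row, and so on).

```python
from statistics import median

ratings = [7.5, None, 8.0, 6.5, None, 9.0]  # hypothetical column; None marks a blank

# Option 1: remove all observations with missing values
removed = [r for r in ratings if r is not None]

# Option 2: fill in the blanks, here with the median of the known values
fill_value = median(removed)
filled = [fill_value if r is None else r for r in ratings]

# Option 3: leave blanks as-is, but count them so the gap stays visible
missing_count = sum(r is None for r in ratings)

print(removed)        # [7.5, 8.0, 6.5, 9.0]
print(filled)         # [7.5, 7.75, 8.0, 6.5, 7.75, 9.0]
print(missing_count)  # 2
```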
Data Cleanup : Outliers

An outlier is a data point that varies greatly
from the majority of the other data.

Outliers are not necessarily incorrect, but they may give an
inaccurate representation of your data if you take
them into account.

We discuss this more during Exploratory Data Analysis!
Data Cleanup with Excel
1. Import the data from an external data source.
2. Create a backup copy of the original data.
3. Ensure that the data is in a tabular format of rows and columns
with: similar data in each column, all columns and rows visible,
and no blank rows within the range. For best results, use an
Excel table.
4. Do tasks that don't require column manipulation first, such as
spell-checking or using the Find and Replace dialog box.
5. Next, do tasks that do require column manipulation:
• Insert a new column (B) next to the original column (A) that
needs cleaning.
• Add a formula that will transform the data at the top of the
new column (B).
• Fill down the formula in the new column (B). In an Excel
table, a calculated column is automatically created with
values filled down.
• Select the new column (B), copy it, and then paste as values
into the new column (B).
• Remove the original column (A), which converts the new
column from B to A.
Let’s clean this data

Check each column one by one and make
sure you understand what is happening.

What looks odd to you, and why?

Plan your changes before you start making
them.
Data Transformation : Organizing / Shaping data

• Smoothing
• Attribute Construction
• Generalization
• Aggregation
• Normalization
• Discretization
Data Transformation : Smoothing
Too advanced for now!

Smoothing is a technique where you apply an algorithm to remove
noise from your dataset when trying to identify a trend. Noise can distort
your data, and by eliminating or reducing it you can extract better
insights or identify patterns that you wouldn’t see otherwise.

There are 3 algorithm types that help with data smoothing:

• Clustering: Group similar values together to form clusters,
labeling any value that falls outside the clusters as an outlier.

• Binning: Split the data into bins
and smooth the data values within each bin.

• Regression: Identify the relationship between
two attributes so that you can predict one attribute based on the
value of the other.
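To make binning a little more concrete, here is a minimal Python sketch of one common variant, smoothing by bin means: sort the values, cut them into equal-size bins, and replace each value with the mean of its bin. The function name and the sample numbers are illustrative assumptions, not part of the course dataset.

```python
def smooth_by_bin_means(values, bin_size):
    """Sort values, split into equal-size bins, replace each value by its bin mean."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        bin_mean = sum(bin_values) / len(bin_values)
        smoothed.extend([bin_mean] * len(bin_values))
    return smoothed

print(smooth_by_bin_means([4, 8, 15, 21, 21, 24, 25, 28, 34], 3))
# [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```

Larger bins smooth more aggressively but hide more detail, so the bin size is a judgment call.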
Data Transformation : Attribute Construction
Attribute construction is one of the most common
techniques in data transformation pipelines.

Attribute construction, or feature construction, is the
process of creating new features from a set of existing
features/attributes in the dataset.

Imagine working in marketing and trying to analyze the
performance of a campaign. You have all the impressions
that your campaign generated and the total cost for the
given time frame.

Instead of trying to compare these two metrics across all of
your campaigns, you can construct another metric: the
cost per thousand impressions, or CPM (cost per mille).
This will make your data mining and analysis process a lot
easier, as you’ll be able to compare the campaign
performance on a single metric rather than two separate
metrics.
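The campaign example can be sketched in Python as below. CPM conventionally stands for cost per mille, that is, cost per thousand impressions, so the constructed attribute is `cost * 1000 / impressions`. The campaign names and figures are invented for illustration.

```python
def cpm(cost, impressions):
    """Cost per thousand impressions (cost per mille)."""
    return cost * 1000 / impressions

# Hypothetical campaigns with the two original attributes
campaigns = [
    {"name": "A", "cost": 500.0, "impressions": 250_000},
    {"name": "B", "cost": 300.0, "impressions": 100_000},
]

# Construct the new attribute from the existing ones
for campaign in campaigns:
    campaign["cpm"] = cpm(campaign["cost"], campaign["impressions"])

print([(c["name"], c["cpm"]) for c in campaigns])
# [('A', 2.0), ('B', 3.0)]
```

Campaign B spent less in total, yet each thousand impressions cost more; that is the comparison the single constructed metric makes visible.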
Data Transformation : Data Generalization
Data generalization refers to the process of transforming
low-level attributes into high-level ones by using the
concept of hierarchy.

Data generalization is applied to categorical data that
has a finite but large number of distinct values.

This is something that we, as people, already do
without noticing, and it helps us get a clearer picture of the
data.

For example, an address can be divided into 4 categorical attributes,
listed here from the lowest level of the hierarchy to the highest:

• Street
• City
• State/province
• Country
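A small Python sketch of rolling a low-level attribute up a hierarchy follows. The lookup tables, the function name `generalize` and the list of viewer cities are assumptions made for the example; in practice the hierarchy would come from a reference table.

```python
from collections import Counter

# Hypothetical lookup tables forming the hierarchy city -> state -> country
CITY_TO_STATE = {"Mumbai": "Maharashtra", "Pune": "Maharashtra", "Jaipur": "Rajasthan"}
STATE_TO_COUNTRY = {"Maharashtra": "India", "Rajasthan": "India"}

def generalize(city, level):
    """Roll a city up to the requested level: 'city', 'state' or 'country'."""
    if level == "city":
        return city
    state = CITY_TO_STATE[city]
    if level == "state":
        return state
    if level == "country":
        return STATE_TO_COUNTRY[state]
    raise ValueError(f"unknown level: {level}")

viewer_cities = ["Mumbai", "Pune", "Jaipur", "Mumbai", "Pune"]
by_state = Counter(generalize(city, "state") for city in viewer_cities)
print(by_state)  # Counter({'Maharashtra': 4, 'Rajasthan': 1})
```

Many distinct city values collapse into a handful of states, which is usually easier to chart and compare.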
Data Transformation : Aggregation
Data aggregation is possibly one of the most popular techniques in
data transformation. When you’re applying data aggregation to your
raw data you are essentially storing and presenting data in a summary
format.

This is ideal when you want to perform statistical analysis of your
data, as you might want to aggregate your data over a specific
time period and provide statistics such as average, sum, minimum,
and maximum.
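Aggregating over a time period might look like the following Python sketch, which groups rows by month and reports count, sum, minimum, maximum and average. The `uploads` rows (month, runtime in minutes) are invented for illustration.

```python
from collections import defaultdict

# Hypothetical raw rows: (upload month, runtime in minutes)
uploads = [
    ("2023-01", 120), ("2023-01", 95),
    ("2023-02", 88), ("2023-02", 132), ("2023-02", 101),
]

# Group the raw values by month
by_month = defaultdict(list)
for month, minutes in uploads:
    by_month[month].append(minutes)

# Summarize each group
summary = {
    month: {
        "count": len(values),
        "sum": sum(values),
        "min": min(values),
        "max": max(values),
        "avg": sum(values) / len(values),
    }
    for month, values in by_month.items()
}

print(summary["2023-02"])
# {'count': 3, 'sum': 321, 'min': 88, 'max': 132, 'avg': 107.0}
```

In Excel, a PivotTable produces the same kind of summary without any code.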
Data Transformation : Normalization
Too advanced for now!

Normalization is the process of scaling the data to a much smaller range
without losing information, which helps improve algorithm efficiency and
data extraction performance.

There are three methods to normalize an attribute:

• Min-max normalization: Where you perform a linear
transformation that maps the original data onto a new range, such as 0 to 1.
• Z-score normalization: In z-score normalization (or
zero-mean normalization) you are normalizing the
value for attribute A using the mean and standard
deviation of A.
• Decimal scaling: Where you can normalize the value
of attribute A by moving the decimal point in the
value.

Normalization methods are frequently used when you
have values that skew your dataset and you find it hard
to extract valuable insights.
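The three methods can be sketched in Python as follows. This is a minimal illustration with made-up sample values; the z-score version uses the population standard deviation, which is an assumption (some texts use the sample standard deviation), and both `min_max` and `z_score` assume the values are not all identical.

```python
from statistics import mean, pstdev

def min_max(values, new_min=0.0, new_max=1.0):
    """Linearly map values from [min, max] onto [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def decimal_scaling(values):
    """Divide by the smallest power of 10 that brings every |value| below 1."""
    largest = max(abs(v) for v in values)
    j = 0
    while largest / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]

data = [200, 400, 600, 800, 1000]  # hypothetical attribute values
print(min_max(data))          # [0.0, 0.25, 0.5, 0.75, 1.0]
print(decimal_scaling(data))  # [0.02, 0.04, 0.06, 0.08, 0.1]
print(z_score(data)[2])       # 0.0 (the middle value equals the mean)
```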
Data Transformation : Discretization
Data discretization refers to the process of transforming
continuous data into a set of data intervals. This is an
especially useful technique that can help you make the
data easier to study and analyze and improve the
efficiency of any applied algorithm.

Imagine having tens of thousands of rows representing
people in a survey providing their first name, last name,
age, and gender.

Age is a numerical attribute that can have a lot of
different values. To make our life easier we can divide
the range of this continuous attribute into intervals.

Mapping this attribute to a higher-level concept, like
youth, middle-aged, and senior, can help a lot with the
efficiency of the task and improve the speed of the
algorithms applied.
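The age example can be sketched in Python as below. The cut-off ages (30 and 60) and the sample ages are assumptions chosen for illustration; the slides do not define where one group ends and the next begins.

```python
from collections import Counter

def age_group(age):
    """Map a numeric age to a higher-level label (cut-offs are illustrative)."""
    if age < 30:
        return "youth"
    if age < 60:
        return "middle-aged"
    return "senior"

ages = [19, 23, 31, 45, 52, 61, 67, 29]  # hypothetical survey ages
groups = Counter(age_group(a) for a in ages)
print(groups["youth"], groups["middle-aged"], groups["senior"])  # 3 3 2
```

Eight distinct ages become three labels, which is what makes the attribute easier to summarize.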
Make this clean data useful now!

Now that you have clean data, let us see
how we can make it more useful.
You know what to do!

• Smoothing
• Attribute Construction
• Generalization
• Aggregation
• Normalization
• Discretization
CONNECT WITH US
+91 93217 48851

[email protected]

Please connect with us for detailed references

and learner feedback.
