Intro To Data Analytics - Cleanup & Transformation


by bluetick.ai

DATAWEEK 2024

Welcome onboard 🙂

Welcome to Day one of DATAWEEK – 2024! It’s great to have you on board 😃

Over the course of the next few days, you’ll learn the tools most used by analysts the world over,
take on the role of a data analyst and work with a real dataset to solve a business
challenge.

By the end of the week, you will:


• Be familiar with all the key steps in the data analysis process
• Understand, and be able to apply, some fundamental analysis techniques
• Have a first-hand glimpse of what it’s like to work as a data analyst
But why a Career in data?

- Is it as big as everybody says it is?
- What kinds of industries and companies might you work for?
- Is this really a secure career choice with high demand?

The Jobs of Tomorrow report published by the World Economic Forum in 2020 identifies data and artificial intelligence (AI) as one of seven high-growth emerging professions, showing the highest growth rate at 41% per year.

Let’s take a look 👀


If you research the most in-demand tech
skills for 2024 and beyond, you’ll find that
data analytics crops up time and time again.
Introduction to Data Analytics
Real-life analysts spend over 80% of their time cleaning
and transforming data!
What is Data Analytics ?

1. Define the question
2. Collect the data
3. Clean & Transform
4. Data Analysis
5. Share results


Data Analysis Process

1. Define the question
   - The "problem statement"
   - Data needed to solve this
   - Sometimes there is back & forth
2. Collect the data
   - "Appropriate data"
   - First party, second party, third party
3. Clean & Transform
   - High quality data
   - Garbage in, garbage out
4. Data Analysis
   - Building hypothesis
   - Proving hypothesis
5. Share results
   - Visualisation
   - Storytelling
Help Cinemate compete with Netflix

Cinemate, an open source online streaming platform, boasts an extensive collection of TV shows and movies spanning over a century, offering diverse genres to viewers. Anybody with a valid rating can upload content on Cinemate.

You have been tasked to extract valuable insights that will aid in understanding and planning their operations and marketing activities for the next few months.

The company is looking to understand their current library and plan operations and events based on what has been uploaded to the product over the last few years.
Data Cleanup & Transformation
Real-life analysts spend over 60% of their time cleaning
and transforming data!
What is Data Cleanup ?

Irrelevant Data

Structural Errors

Duplicates

Missing Data

Outliers
Data Cleanup : Irrelevant Data

Remove distraction and noise → Make sure that the data you’re including really needs to be there.

For example, if you are collecting data on women between the ages of 18-35, there is no reason for a 60-year-old man to appear in your data set.

Typical examples:
- Personally identifiable information (PII)
- URLs
- HTML tags
- Boilerplate text (such as in emails)
- Tracking codes
- Excessive blank space between text
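
The two ideas on this slide (dropping rows that are out of scope, and stripping noise from text) can be sketched in Python. This is only an illustration: the rows, the field names and the `in_scope` / `strip_noise` helpers are assumptions made up for this example, not part of the course dataset.

```python
import re

# Hypothetical survey rows; the field names are assumptions for illustration.
rows = [
    {"name": "Asha", "gender": "F", "age": 24, "comment": "<p>Great   service!</p>"},
    {"name": "Ravi", "gender": "M", "age": 60, "comment": "See https://example.com/promo?utm_source=mail"},
    {"name": "Meera", "gender": "F", "age": 31, "comment": "  Fast delivery  "},
]

def in_scope(row):
    """Keep only the population the question is about: women aged 18-35."""
    return row["gender"] == "F" and 18 <= row["age"] <= 35

def strip_noise(text):
    """Remove HTML tags, URLs and excess blank space from free text."""
    text = re.sub(r"<[^>]+>", " ", text)        # HTML tags
    text = re.sub(r"https?://\S+", " ", text)   # URLs, including tracking parameters
    return re.sub(r"\s+", " ", text).strip()    # collapse repeated whitespace

relevant = [
    {**row, "comment": strip_noise(row["comment"])}
    for row in rows
    if in_scope(row)
]
print(relevant)
```

The regular expressions are deliberately simple; real HTML or messy URLs may need a proper parser.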
Data Cleanup : Structural Errors

Structural errors in your data include things like typos, inconsistent formatting, incorrect capitalization, and any spelling issues or formatting that might confuse a machine learning model.

- Typos, like spelling out a date rather than using a number
- Standardizing date and time formats or units of measurement
- Standardizing capitalisation
- Numbers stored as text
- Unnecessary punctuation in data such as email addresses
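
A minimal Python sketch of fixing three of these structural errors: mixed date formats, inconsistent capitalisation, and numbers stored as text. The list of accepted date formats and the helper names are assumptions for this example; your own data may need different ones.

```python
from datetime import datetime

# Assumed input formats; extend this tuple to match your own data.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d %B %Y")

def standardize_date(text):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None

def standardize_title(text):
    """Trim stray spaces and apply one capitalisation rule to every value."""
    return " ".join(text.split()).title()

def to_number(text):
    """Convert a number stored as text, such as '1,200', into an int."""
    return int(text.replace(",", "").strip())

print(standardize_date("5 March 2021"), standardize_title("  the  GODFATHER "), to_number("1,200"))
```

Returning `None` for an unrecognised date keeps the bad value visible, so you can decide later how to handle it instead of silently guessing.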
Data Cleanup : Duplicates

When you collect or scrape data from various sources, there’s a good chance you’ll end up with duplicate items. These duplicates could result from human error, such as an error committed by the individual entering data or when filling out a form.

Duplicates will significantly alter your data and/or cause confusion in your results.

They can also make data difficult to interpret when you try to visualize it, so it’s preferable to get rid of them as soon as possible.
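
One possible way to remove duplicates in Python is to keep the first row seen for each key. The sample rows and the choice of `title` plus `year` as the key are assumptions for this sketch; which columns identify a duplicate depends on your data.

```python
def drop_duplicates(rows, key_fields):
    """Keep the first row for each key; comparison ignores case and outer spaces."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(str(row[field]).strip().lower() for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

# Hypothetical library entries, with the same film entered three times.
titles = [
    {"title": "Inception", "year": 2010},
    {"title": "inception ", "year": 2010},
    {"title": "Inception", "year": 2010},
    {"title": "Heat", "year": 1995},
]
unique_titles = drop_duplicates(titles, ["title", "year"])
print(len(titles), "rows before,", len(unique_titles), "rows after")
```

Normalising case and spaces before comparing matters here: without it, "inception " would be treated as a different film.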
Data Cleanup : Missing Data

3 possibilities when it comes to incomplete data:

- Remove all observations with missing values.
- Fill in the blanks.
- Leave blanks as-is.

What you do depends on your analytical aims and what you want from the data!
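
The three options can be compared side by side in a short Python sketch. The records and the `duration_min` field are made up for this example, and filling blanks with the mean is just one of several possible fill strategies.

```python
from statistics import mean

# Hypothetical records; None marks a missing value.
records = [
    {"title": "A", "duration_min": 90},
    {"title": "B", "duration_min": None},
    {"title": "C", "duration_min": 110},
]

# Option 1: remove all observations with missing values.
dropped = [r for r in records if r["duration_min"] is not None]

# Option 2: fill in the blanks, here with the mean of the known values.
known_mean = mean(r["duration_min"] for r in dropped)
filled = [
    {**r, "duration_min": known_mean if r["duration_min"] is None else r["duration_min"]}
    for r in records
]

# Option 3: leave blanks as-is and skip them only when a calculation needs a value.
average_ignoring_blanks = mean(
    r["duration_min"] for r in records if r["duration_min"] is not None
)
print(len(dropped), filled[1]["duration_min"], average_ignoring_blanks)
```

Option 1 loses a whole row, option 2 invents a value, and option 3 keeps the data honest but pushes the decision into every later calculation.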
Data Cleanup : Outliers

An outlier is a minority data point that varies greatly from the majority of the other data.

Outliers are not necessarily incorrect, but they may give an inaccurate representation of your data if you take them into account.

We discuss this more during Exploratory data analysis!
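
The slides do not name a detection method, so the sketch below uses one common rule of thumb as an assumption: flag anything more than 1.5 times the interquartile range (IQR) below the first quartile or above the third. The sample durations are made up.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)   # the three quartile cut points
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]

# Hypothetical running times in minutes; 480 looks suspicious.
durations = [88, 92, 95, 97, 101, 104, 110, 480]
print(iqr_outliers(durations))
```

Flagging is not the same as deleting: a flagged value still deserves a look before you decide what to do with it.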
Data Cleanup with Excel
1. Import the data from an external data source.
2. Create a backup copy of the original data.
3. Ensure that the data is in a tabular format of rows and columns
with: similar data in each column, all columns and rows visible,
and no blank rows within the range. For best results, use an
Excel table.
4. Do tasks that don't require column manipulation first, such as
spell-checking or using the Find and Replace dialog box.
5. Next, do tasks that do require column manipulation :
• Insert a new column (B) next to the original column (A) that
needs cleaning.
• Add a formula that will transform the data at the top of the
new column (B).
• Fill down the formula in the new column (B). In an Excel
table, a calculated column is automatically created with
values filled down.
• Select the new column (B), copy it, and then paste as values
into the new column (B).
• Remove the original column (A), which converts the new
column from B to A.
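
For readers who think in code, the column-manipulation steps in point 5 can be mirrored in Python. This is only an analogy to the Excel workflow: the table, the column names `A` and `B` and the `clean_name` formula are assumptions for this sketch. In Excel itself, a comparable formula could be written as `=PROPER(TRIM(A2))`.

```python
import copy

# Hypothetical table: each dict is one row; "A" is the column that needs cleaning.
table = [{"A": "  alice SMITH "}, {"A": "BOB  jones"}]

backup = copy.deepcopy(table)          # keep an untouched copy of the original

def clean_name(text):
    """The 'formula': trim extra spaces and use proper case."""
    return " ".join(text.split()).title()

for row in table:
    row["B"] = clean_name(row["A"])    # new column B, filled down for every row

for row in table:
    row["A"] = row.pop("B")            # B's values replace the original column A

print(table)
```

Because `B` holds plain values rather than formulas here, the "paste as values" step has no direct counterpart in the sketch.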
Let’s clean this data

Check each column one by one and make sure you understand what is happening.

What looks odd to you, and why?

Plan things out before you start making changes.
Data Transformation : Organizing / Shaping data

• Smoothing
• Attribute Construction
• Generalization
• Aggregation
• Normalization
• Discretization
Data Transformation
Real-life analysts spend over 60% of their time cleaning
and transforming data!
Too Advanced for now!

Data Transformation : Smoothing

Smoothing is a technique where you apply an algorithm in order to remove noise from your dataset when trying to identify a trend. Noise can have a bad effect on your data, and by eliminating or reducing it you can extract better insights or identify patterns that you wouldn’t see otherwise.

There are 3 algorithm types that help with data smoothing:

• Clustering: Where you can group similar values together to form a cluster while labeling any value out of the cluster as an outlier.

• Binning: Using an algorithm for binning will help you split the data into bins and smooth the data value within each bin.

• Regression: Regression algorithms are used to identify the relation between two related attributes and help you predict an attribute based on the value of the other.
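
Of the three, binning is the easiest to show in a few lines. The sketch below assumes one particular variant, "smoothing by bin means" with equal-size bins: sort the values, cut them into bins, and replace every value by the mean of its bin. The prices are made up.

```python
from statistics import mean

def smooth_by_bin_means(values, bin_size):
    """Sort the values, split them into equal-size bins, and replace each value by its bin mean."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        smoothed.extend([mean(bin_values)] * len(bin_values))
    return smoothed

# Hypothetical prices, smoothed in bins of three.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(smooth_by_bin_means(prices, 3))
```

Note that the result is in sorted order, not the original order; other variants replace values with the bin median or the nearest bin boundary instead of the mean.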
Data Transformation : Attribute Construction

Attribute construction is one of the most common techniques in data transformation pipelines.

Attribute construction or feature construction is the process of creating new features from a set of existing features/attributes in the dataset.

Imagine working in marketing and trying to analyze the performance of a campaign. You have all the impressions that your campaign generated and the total cost for the given time frame.

Instead of trying to compare these two metrics across all of your campaigns, you can construct another metric to calculate the cost per thousand impressions, or CPM.

This will make your data mining and analysis process a lot easier, as you’ll be able to compare the campaign performance on a single metric rather than two separate metrics.
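
As a sketch of the campaign example: CPM conventionally stands for "cost per mille", i.e. cost per thousand impressions, so the constructed attribute is cost × 1000 ÷ impressions. The campaign names and figures below are made up for illustration.

```python
def cpm(cost, impressions):
    """Cost per thousand impressions."""
    return cost * 1000 / impressions

# Hypothetical campaigns with the two original attributes.
campaigns = [
    {"name": "Spring", "cost": 500.0, "impressions": 200_000},
    {"name": "Summer", "cost": 900.0, "impressions": 450_000},
]

for campaign in campaigns:
    # The constructed attribute: one number that combines cost and impressions.
    campaign["cpm"] = cpm(campaign["cost"], campaign["impressions"])

cheapest = min(campaigns, key=lambda c: c["cpm"])
print(cheapest["name"], cheapest["cpm"])
```

Summer costs more in total but is cheaper per thousand impressions, which is exactly the comparison the single metric makes easy.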
Data Transformation : Data Generalization

Data generalization refers to the process of transforming low-level attributes into high-level ones by using the concept of hierarchy.

Data generalization is applied to categorical attributes that have a finite but large number of distinct values.

This is something that we, as people, are already doing without noticing, and it helps us get a clearer picture of the data.

For example, an address is divided into 4 categorical attributes, listed here from the lowest level to the highest:

• Street
• City
• State/province
• Country
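
A small Python sketch of rolling a low-level attribute up its hierarchy. The two lookup tables are a made-up concept hierarchy covering only three cities; a real one would come from reference data.

```python
from collections import Counter

# Hypothetical concept hierarchy: city -> state/province -> country.
CITY_TO_STATE = {"Mumbai": "Maharashtra", "Pune": "Maharashtra", "Toronto": "Ontario"}
STATE_TO_COUNTRY = {"Maharashtra": "India", "Ontario": "Canada"}

def generalize(city, level):
    """Map a city up the hierarchy to the requested level."""
    if level == "city":
        return city
    state = CITY_TO_STATE[city]
    if level == "state":
        return state
    if level == "country":
        return STATE_TO_COUNTRY[state]
    raise ValueError(f"unknown level: {level}")

# Many distinct cities collapse into a handful of countries.
uploads_by_city = ["Mumbai", "Pune", "Toronto", "Mumbai"]
uploads_by_country = Counter(generalize(city, "country") for city in uploads_by_city)
print(uploads_by_country)
```

The same data can be viewed at whichever level suits the question, without changing the underlying rows.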
Data Transformation : Aggregation

Data aggregation is possibly one of the most popular techniques in data transformation. When you’re applying data aggregation to your raw data you are essentially storing and presenting data in a summary format.

This is ideal when you want to perform statistical analysis of your data, as you might want to aggregate your data over a specific time period and provide statistics such as average, sum, minimum, and maximum.
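
A minimal sketch of aggregating over a time period in Python: daily counts are grouped by month and summarised with the four statistics named above. The upload log is invented for this example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical upload log: (ISO date, number of uploads that day).
daily_uploads = [
    ("2023-01-05", 12), ("2023-01-19", 8),
    ("2023-02-02", 20), ("2023-02-14", 10), ("2023-02-27", 6),
]

# Group the raw rows by month; "YYYY-MM" is the time period.
by_month = defaultdict(list)
for date, count in daily_uploads:
    by_month[date[:7]].append(count)

# Present each group in summary format.
summary = {
    month: {"sum": sum(v), "average": mean(v), "minimum": min(v), "maximum": max(v)}
    for month, v in by_month.items()
}
print(summary)
```

In Excel, a PivotTable gives a comparable month-by-month summary without any code.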
Too Advanced for now!

Data Transformation : Normalization

Normalization is the process of scaling the data to a much smaller range, without losing information, to help minimize or exclude duplicated data and improve algorithm efficiency and data extraction performance.

There are three methods to normalize an attribute:

• Min-max normalization: Where you perform a linear transformation on the original data to map it onto a fixed range, such as 0 to 1.
• Z-score normalization: In z-score normalization (or zero-mean normalization) you are normalizing the value for attribute A using the mean and standard deviation of A.
• Decimal scaling: Where you can normalize the value of attribute A by moving the decimal point in the value.

Normalization methods are frequently used when you have values that skew your dataset and you find it hard to extract valuable insights.
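
The three methods can be written as short Python functions. This is a sketch under assumptions: the z-score version uses the population standard deviation, and none of the functions guard against edge cases such as all values being equal.

```python
from statistics import mean, pstdev

def min_max(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values into [new_min, new_max]; assumes max(values) > min(values)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score(values):
    """Subtract the mean and divide by the population standard deviation (assumed non-zero)."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def decimal_scaling(values):
    """Divide by the smallest power of 10 that brings every absolute value below 1."""
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]

print(min_max([10, 20, 30]))
print(z_score([2, 4, 4, 4, 5, 5, 7, 9]))
print(decimal_scaling([-986, 917]))
```

Min-max is sensitive to a single extreme value, since that value sets the whole range; z-score is usually the safer choice when the data contains outliers.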
Data Transformation : Discretization

Data discretization refers to the process of transforming continuous data into a set of data intervals. This is an especially useful technique that can help you make the data easier to study and analyze and improve the efficiency of any applied algorithm.

Imagine having tens of thousands of rows representing people in a survey providing their first name, last name, age, and gender.

Age is a numerical attribute that can have a lot of different values. To make our life easier we can divide the range of this continuous attribute into intervals.

Mapping this attribute to a higher-level concept, like youth, middle-aged, and senior, can help a lot with the efficiency of the task and improve the speed of the algorithms applied.
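
The age example can be sketched in a few lines of Python. The cut-offs of 30 and 60 are assumptions chosen for illustration; the slides do not define where one group ends and the next begins.

```python
def age_group(age):
    """Map a numeric age to an interval label; the cut-offs are assumptions."""
    if age < 0:
        raise ValueError("age cannot be negative")
    if age < 30:
        return "youth"
    if age < 60:
        return "middle-aged"
    return "senior"

# Hypothetical survey ages mapped to the three higher-level labels.
ages = [19, 34, 61, 45, 72]
groups = [age_group(a) for a in ages]
print(groups)
```

Whatever boundaries you pick, write them down: two analysts using different cut-offs will get different group sizes from the same data.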
Make this clean data useful now!

Now that you have clean data, let us see how we can make it more useful. You know what to do!

• Smoothing
• Attribute Construction
• Generalization
• Aggregation
• Normalization
• Discretization
CONNECT WITH US
+91 93217 48851

[email protected]

Please connect with us for detailed references and learner feedback.
