Module 4: Process Data from Dirty to Clean

Google Data Analytics Professional Certificate course

Week - 1
Data integrity and analytics objectives

Why data integrity is important


Data integrity is the accuracy, completeness, consistency, and
trustworthiness of data throughout its lifecycle.
Data can be compromised through human error, viruses, malware,
hacking, and system failures.

More about data integrity and compliance


Scenario: calendar dates for a global company

Calendar dates are represented in many different short forms. Depending
on where you live, a different format might be used.

● In some countries, 12/10/20 (DD/MM/YY) stands for October 12, 2020.
● In other countries, the national standard is YYYY-MM-DD, so October
12, 2020 becomes 2020-10-12.
● In the United States, MM/DD/YY is the accepted format, so
October 12, 2020 is written as 10/12/20.
Data replication compromising data integrity: Continuing with the example,
imagine you ask your international counterparts to verify dates and stick to
one format. One analyst copies a large dataset to check the dates. But
because of memory issues, only part of the dataset is actually copied. The
analyst would be verifying and standardizing incomplete data. That partial
dataset would be certified as compliant but the full dataset would still
contain dates that weren't verified. Two versions of a dataset can introduce
inconsistent results. A final audit of results would be essential to reveal
what happened and correct all dates.

Data transfer compromising data integrity: Another analyst checks the
dates in a spreadsheet and chooses to import the validated and standardized
data back to the database. But suppose the date field from the spreadsheet
was incorrectly classified as a text field during the data import (transfer)
process. Now some of the dates in the database are stored as text strings.
At this point, the data needs to be cleaned to restore its integrity.

Data manipulation compromising data integrity: When checking dates,
another analyst notices what appears to be a duplicate record in the
database and removes it. But it turns out that the analyst removed a unique
record for a company’s subsidiary and not a duplicate record for the
company. Your dataset is now missing data and the data must be restored
for completeness.
Well-aligned objectives and data

Clean data + alignment to business objective = accurate conclusions


Alignment to business objective + additional data cleaning = accurate conclusions

Overcoming the challenges of insufficient data

Ways you can address insufficient data:

● Identify trends with the available data
● Wait for more data if time allows
● Talk with stakeholders and adjust your objective
● Look for a new dataset

What to do when you find an issue with your data

Data issue
No data:
● Gather the data on a small scale to perform a preliminary analysis and
then request additional time to complete the analysis after you have
collected more data.
● If there isn’t time to collect data, perform the analysis using proxy
data from other datasets. This is the most common workaround.
Too little data:
● Do the analysis using proxy data along with actual data.
● Adjust your analysis to align with the data you already have.

Wrong data, including data with errors:

● If you have the wrong data because requirements were misunderstood,
communicate the requirements again.
● Identify errors in the data and, if possible, correct them at the
source by looking for a pattern in the errors.
● If you can’t correct data errors yourself, you can ignore the wrong
data and go ahead with the analysis if your sample size is still large
enough and ignoring the data won’t cause systematic bias.
The importance of sample size
Population is all possible data values in a certain dataset.

Sample size: when you use a sample, you use a part of a population
that's representative of the population as a whole.

Sampling bias is when a sample isn't representative of the population as a
whole. This means some members of the population are being
overrepresented or underrepresented.

Random sampling is a way of selecting a sample from a population so
that every possible type of the sample has an equal chance of being chosen.
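
A minimal sketch of one way to draw an approximate random sample in BigQuery; the table name is hypothetical and the 10% rate is arbitrary:

-- Each row is kept with a probability of roughly 10%, so every row has an equal chance of selection
SELECT *
FROM `project.dataset.customer_data`
WHERE RAND() < 0.10;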

Things to remember when determining the size of your sample


● Don’t use a sample size less than 30.
● The confidence level most commonly used is 95%, but 90% can work in
some cases.

Increase the sample size to meet specific needs of your project:

● For a higher confidence level, use a larger sample size
● To decrease the margin of error, use a larger sample size
● For greater statistical significance, use a larger sample size

Why a minimum sample of 30?

The Central Limit Theorem (CLT): as a sample size grows, the distribution of
sample means approaches a normal distribution, and a sample of about 30 is
the conventional minimum for that approximation to hold.
Testing your data

Using statistical power
Statistical power is the probability of getting meaningful results from a
test; a power of 0.8 (80%) or higher is the usual threshold.

Hypothesis testing
If a test is statistically significant, it means the results of the test are
real and not an error caused by random chance.

What to do when there is no data

Open data is information that has been published on
government-sanctioned portals. In the best case, this data is structured,
machine-readable, open-licensed, and well maintained.

Public data is the data that exists everywhere else. This is information
that’s freely available (but not really accessible) on the web. It is frequently
unstructured and unruly, and its usage requirements are often vague.
Different types of datasets on Kaggle
● https://www.kaggle.com/datasnaek/youtube-new
● https://www.kaggle.com/sakshigoyal7/credit-card-customers
● https://www.kaggle.com/rtatman/188-million-us-wildfires
● https://www.kaggle.com/bigquery/google-analytics-sample

Sample size calculator


● https://www.surveymonkey.com/mp/sample-size-calculator/
● http://www.raosoft.com/samplesize.html

Consider the margin of error


Margin of error is the maximum amount that the sample results are
expected to differ from those of the actual population.
Eg: Imagine you are playing baseball and that you are up at bat. The crowd is
roaring, and you are getting ready to try to hit the ball. The pitcher delivers
a fastball traveling about 90-95mph, which takes about 400 milliseconds
(ms) to reach the catcher’s glove. You swing and miss the first pitch because
your timing was a little off. You wonder if you should have swung slightly
earlier or slightly later to hit a home run. That time difference can be
considered the margin of error, and it tells us how close or far your timing
was from the average home run swing.
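
As a rough numeric sketch (the 1,000-person sample here is a made-up assumption): with a 95% confidence level (z ≈ 1.96) and an assumed 50% proportion, the margin of error is about ±3 percentage points, which can be computed directly in BigQuery:

SELECT
1.96 * SQRT(0.5 * (1 - 0.5) / 1000) AS margin_of_error; -- ≈ 0.031, i.e. about ±3%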
Week-2
Data cleaning is a must

Clean it up!
Dirty data is data that's incomplete, incorrect, or irrelevant to the problem
you're trying to solve.
Clean data is data that's complete, correct, and relevant to the problem
you're trying to solve.

What is dirty data?


● Types of dirty data you may encounter
● What may have caused the data to become dirty
● How dirty data is harmful to businesses

Types of dirty data


Inconsistent data
Any data that uses different formats to represent the same thing

A field is a single piece of information from a row or column of a spreadsheet.

Data validation is a tool for checking the accuracy and quality of data
before adding or importing it.

Begin cleaning data

Common data-cleaning pitfalls


Top ten ways to clean your data
● https://support.microsoft.com/en-us/office/top-ten-ways-to-clean-your-data-2844b620-677c-47a7-ac3e-c2e157d1db19
● https://support.google.com/a/users/answer/9604139?hl=en#zippy=

Hands-On Activity: Cleaning data with spreadsheets


● Filter
● Transpose (while pasting)
● Data cleanup (option: cleanup suggestions)
● Change text case (add-on: upper case to lower case, etc.)

Cleaning data in spreadsheets

Data-cleaning features in spreadsheets

Conditional formatting
● Conditional formatting (to find empty cells)
● Remove duplicates
● Date formatting (Format -> Number -> Date)
● Split text at a specified separator, also called the delimiter
● Data validation
Optimize the data-cleaning process
A function is a set of instructions that performs a specific calculation using
the data in a spreadsheet.
Some basic spreadsheet functions for cleaning (BigQuery counterparts are sketched after this list):
● COUNTIF
● LEN
● LEFT
● RIGHT
● CONCATENATE
● TRIM
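
A minimal sketch of BigQuery counterparts to these spreadsheet functions; the table and column names are hypothetical:

SELECT
LENGTH(product_code) AS code_length, -- LEN
LEFT(product_code, 2) AS code_prefix, -- LEFT
RIGHT(product_code, 2) AS code_suffix, -- RIGHT
CONCAT(product, ' ', product_color) AS product_label, -- CONCATENATE
TRIM(product) AS product_clean -- TRIM
FROM
`project.dataset.products`;

-- The spreadsheet COUNTIF maps to the COUNTIF aggregate:
SELECT
COUNTIF(product_color IS NULL) AS missing_colors
FROM
`project.dataset.products`;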

Workflow automation
● https://towardsdatascience.com/automating-scientific-data-analysis-part-1-c9979cd0817e
● https://news.mit.edu/2016/automating-big-data-analysis-1021
● https://technologyadvice.com/blog/information-technology/top-10-workflow-automation-software/

Different data perspectives


● Pivot table
● VLOOKUP - vertical lookup
● Find
● Graph plotting
Even more data-cleaning techniques
Data mapping is the process of matching fields from one database to
another.
Compatibility describes how well two or more data sets are able to work
together.
● CONCATENATE

Hands-On Activity: Clean data with spreadsheet functions

● SPLIT
● COUNTIF
● Sort

Learning Log: Develop your approach to cleaning data

Step 1: Create your checklist


Some things you might include in your checklist:

● Size of the data set
● Number of categories or labels
● Missing data
● Unformatted data
● The different data types

Step 2: List your preferred cleaning methods


After you have compiled your personal checklist, you can create a list of
activities you like to perform when cleaning data. This list is a collection of
procedures you will follow when you encounter specific issues flagged by
your checklist, or every time you clean a new dataset.

For example, suppose that you have a dataset with missing data, how would
you handle it? Moreover, if the data set is very large, what would you do to
check for missing data? Outlining some of your preferred methods for
cleaning data can help save you time and energy.

Step 3: Choose a data cleaning motto


Now that you have a personal checklist and your preferred data cleaning
methods, you can create a data cleaning motto to help guide and explain your
process. The motto is a short one or two sentence summary of your
philosophy towards cleaning data. For example, here are a few data cleaning
mottos from other data analysts:

1. "Not all data is the same, so don't treat it all the same."
2. "Be prepared for things to not go as planned. Have a backup plan.”
3. "Avoid applying complicated solutions to simple problems."

My list

● Find empty cells
● Remove duplicates
● Fix date formats
● Split out wanted information
● Check conditions
Week - 3
Using SQL to clean data

Understanding SQL capabilities


Relational databases
A relational database is a database that contains a series of tables that can
be connected to form relationships.

Using SQL as a junior data analyst


SQL dialects and their uses
Links to SQL documentation and dialect comparisons
● https://learnsql.com/blog/what-sql-dialect-to-learn/
● https://www.softwaretestinghelp.com/sql-vs-mysql-vs-sql-server/
● https://www.datacamp.com/community/blog/sql-differences
● https://sqlite.org/windowfunctions.html
● https://www.sqltutorial.org/what-is-sql/

Hands-On Activity: Processing time with SQL
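-- The query below sums total page views by language and title for Wikipedia titles containing 'Google', sorted by views, against a very large public benchmark table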


SELECT
language,
title,
SUM(views) AS views
FROM
`bigquery-samples.wikipedia_benchmark.Wiki10B`
WHERE
title LIKE '%Google%'
GROUP BY
language,
title
ORDER BY
views DESC;
Learn basic SQL queries

Widely used SQL queries (see the combined sketch after these lists)


➢ INSERT INTO
➢ VALUES
➢ UPDATE
➢ SET
➢ SELECT COUNT SUM * DISTINCT
➢ FROM
➢ WHERE
➢ ORDER BY
➢ GROUP BY
➢ LIMIT
SELECT
● COUNT
● SUM
● *
● DISTINCT
● LENGTH()
WHERE
● SUBSTR()
● TRIM()
● LENGTH()
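
A minimal sketch pulling these clauses together; the store.sales table and its columns are hypothetical:

INSERT INTO store.sales (product, price, region)
VALUES ('lamp', 25.0, 'south');

UPDATE store.sales
SET region = 'south-east'
WHERE region = 'southeast';

SELECT
region,
COUNT(DISTINCT product) AS product_count,
SUM(price) AS total_revenue
FROM store.sales
WHERE price > 0
GROUP BY region
ORDER BY total_revenue DESC
LIMIT 10;
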
Hands-On Activity: Clean data using SQL
● MIN
● MAX
● UPDATE
● SET
● DISTINCT
--STEP 1
SELECT
DISTINCT(fuel_type)
FROM
`dulcet-velocity-294320.From_course.automobile_data`
-- Note: /* ... */ encloses a multi-line comment; the remaining steps below are kept commented out this way
--STEP 2
/*SELECT
MIN(length) as min_length,
MAX(length) as max_length
FROM
`dulcet-velocity-294320.From_course.automobile_data`*/

--STEP 3
/*SELECT
*
FROM
`dulcet-velocity-294320.From_course.automobile_data`
WHERE
num_of_doors is NULL*/

--STEP 4
/*UPDATE
`dulcet-velocity-294320.From_course.automobile_data`
SET
num_of_doors = "four"
WHERE
make = "dodge"
AND fuel_type = "gas"
AND body_style = "sedan";*/ --WORKS ONLY IF YOU PAY (UPDATE is not allowed in the free BigQuery sandbox; billing must be enabled)

--MY CODE
/*SELECT
*
FROM
`dulcet-velocity-294320.From_course.automobile_data`
WHERE
make = "dodge"
OR fuel_type = "gas"
AND body_style = "sedan"*/
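-- Note: AND binds tighter than OR, so the filter above matches (make = 'dodge') OR (fuel_type = 'gas' AND body_style = 'sedan'); add parentheses if 'dodge' should be required in both cases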

--STEP 5
/*SELECT
DISTINCT(num_of_cylinders)
FROM
`dulcet-velocity-294320.From_course.automobile_data`*/

--STEP 6
/*UPDATE
cars.car_info
SET
num_of_cylinders = "two"
WHERE
num_of_cylinders = "tow";*/

--STEP 7
/*SELECT
MIN(compression_ratio) AS min_compression_ratio,
MAX(compression_ratio) AS max_compression_ratio
FROM
`dulcet-velocity-294320.From_course.automobile_data`
WHERE
compression_ratio <> 70;*/ --omit 70

--STEP 8
/*SELECT
COUNT(*) AS num_of_rows_to_delete
FROM
`dulcet-velocity-294320.From_course.automobile_data`
WHERE
compression_ratio = 70;*/
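-- Counting first (STEP 8) confirms how many rows the DELETE in the next step will remove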

--STEP 9
/*DELETE
`dulcet-velocity-294320.From_course.automobile_data`
WHERE
compression_ratio = 70;*/

--STEP 9
/*SELECT
DISTINCT drive_wheels,
LENGTH(drive_wheels) AS string_length
FROM
`dulcet-velocity-294320.From_course.automobile_data`*/

--STEP 10
/*UPDATE
cars.car_info
SET
drive_wheels = TRIM(drive_wheels)
WHERE
TRUE;*/

--STEP 10
/*SELECT
TRIM(drive_wheels),
LENGTH(drive_wheels) AS string_length
FROM
`dulcet-velocity-294320.From_course.automobile_data`*/

--TEST
/*SELECT
MAX(price) as MAX_PRICE
FROM
`dulcet-velocity-294320.From_course.automobile_data`*/
Transforming data

Upload the store transactions dataset to BigQuery


[
{
"description": "date",
"mode": "NULLABLE",
"name": "date",
"type": "DATETIME"
},
{
"description": "transaction id",
"mode": "NULLABLE",
"name": "transaction_id",
"type": "INTEGER"
},
{
"description": "customer id",
"mode": "NULLABLE",
"name": "customer_id",
"type": "INTEGER"
},
{
"description": "product name",
"mode": "NULLABLE",
"name": "product",
"type": "STRING"
},
{
"description": "product_code",
"mode": "NULLABLE",
"name": "product_code",
"type": "STRING"
},
{
"description": "product color",
"mode": "NULLABLE",
"name": "product_color",
"type": "STRING"
},
{
"description": "product price",
"mode": "NULLABLE",
"name": "product_price",
"type": "FLOAT"
},
{
"description": "quantity purchased",
"mode": "NULLABLE",
"name": "purchase_size",
"type": "INTEGER"
},
{
"description": "purchase price",
"mode": "NULLABLE",
"name": "purchase_price",
"type": "STRING"
},
{
"description": "revenue",
"mode": "NULLABLE",
"name": "revenue",
"type": "FLOAT"
}
]
Three ways to upload a file:
● Direct CSV upload when the file already includes a header row
● TXT upload (or CSV without a header row) where the column headers and types are supplied in a schema
● Changing data types while uploading by editing that schema (one way to verify the result is sketched below)
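
One hedged way to confirm that the uploaded columns received the intended types (the project, dataset, and table names follow the notes above; otherwise this is an assumption):

SELECT
column_name,
data_type
FROM
`dulcet-velocity-294320.From_course.INFORMATION_SCHEMA.COLUMNS`
WHERE
table_name = 'customer_purchase';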

Type conversion
PART-1
SELECT
*
FROM
`dulcet-velocity-294320.From_course.customer_purchase`
ORDER BY
CAST(purchase_price AS FLOAT64 ) DESC

PART-2
--FILTERING BY DATE RANGE
/*
SELECT
date,
purchase_price
FROM
`dulcet-velocity-294320.From_course.customer_purchase`
WHERE
date BETWEEN '2020-12-01' AND '2020-12-31' */
--CAST Change data types
/*
SELECT
CAST(date as date) as DATE,
purchase_price
FROM
`dulcet-velocity-294320.From_course.customer_purchase`
ORDER BY
CAST(date as date) */

--CONCAT joins strings into a single string


/*
SELECT
CONCAT(product_code,product_color) as unic_clr_id
FROM
`dulcet-velocity-294320.From_course.customer_purchase`
WHERE
product = 'couch'*/

--COALESCE() returns the first non-NULL value in the list


SELECT
COALESCE(product,product_code) as product_info
FROM
`dulcet-velocity-294320.From_course.customer_purchase`
Part 1 & 2
● CAST() changes the data type
● CONCAT() joins two or more strings
● COALESCE() returns the first non-NULL value from a list

Week - 4
Manually cleaning data

Verifying and reporting results


Verification is a process to confirm that a data cleaning effort was
well-executed and the resulting data is accurate and reliable.
A changelog is a file containing a chronologically ordered list of
modifications made to a project.

Cleaning and your data expectations


● Use spreadsheets
● Use SQL
● Do a big-picture verification (including graphs)
The final step in data cleaning
● Spell check
● Spreadsheet => find and replace
● SQL => CASE
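-- The query below standardizes two inconsistent product strings and passes every other value through unchanged (the column alias is the author's own)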
SELECT
customer_id,
CASE
WHEN product = 'fan' THEN 'FAN'
WHEN product = 'lamps' THEN 'LAMP'
ELSE product
END AS Dhamu
FROM `dulcet-velocity-294320.From_course.customer_purchase`

Data-cleaning verification: A checklist


● Sources of errors: Did you use the right tools and functions to find
the source of the errors in your dataset?
● Null data: Did you search for NULLs using conditional formatting and
filters?
● Misspelled words: Did you locate all misspellings?
● Mistyped numbers: Did you double-check that your numeric data has
been entered correctly?
● Extra spaces and characters: Did you remove any extra spaces or
characters using the TRIM function?
● Duplicates: Did you remove duplicates in spreadsheets using the
Remove Duplicates function or DISTINCT in SQL?
● Mismatched data types: Did you check that numeric, date, and string
data are typecast correctly?
● Messy (inconsistent) strings: Did you make sure that all of your
strings are consistent and meaningful?
● Messy (inconsistent) date formats: Did you format the dates
consistently throughout your dataset?
● Misleading variable labels (columns): Did you name your columns
meaningfully?
● Truncated data: Did you check for truncated or missing data that
needs correction?
● Business Logic: Did you check that the data makes sense given your
knowledge of the business?

The goal of your project


● Confirm the business problem
● Confirm the goal of the project
● Verify that data can solve the problem and is aligned to the goal
Documenting results and the cleaning process

Capturing cleaning changes


Documentation is the process of tracking changes, additions, deletions, and
errors involved in your data cleaning effort.

Data errors are the crime, data cleaning is gathering evidence, and
documentation is detailing exactly what happened for peer review or
court.

Here is how a version control system affects a change to a query:

1. A company has official versions of important queries in their version
control system.
2. An analyst makes sure the most up-to-date version of the query is the
one they will change. This is called syncing.
3. The analyst makes a change to the query.
4. The analyst might ask someone to review this change. This is called a
code review and can be done formally or informally. An informal review
could be as simple as asking a senior analyst to take a look at the
change.
5. After a reviewer approves the change, the analyst submits the
updated version of the query to a repository in the company's version
control system. This is called a code commit. A best practice is to
document exactly what the change was and why it was made in a
comments area. Going back to our example of a query that pulls daily
revenue, a comment might be: Updated revenue to include revenue
coming from the new product, Calypso.
6. After the change is submitted, everyone else in the company will be
able to access and use this new query when they sync to the most
up-to-date queries stored in the version control system.
7. If the query has a problem or business needs change, the analyst can
undo the change to the query using the version control system. The
analyst can look at a chronological list of all changes made to the query
and who made each change. Then, after finding their own change, the
analyst can revert to the previous version.
8. The query is back to what it was before the analyst made the change.
And everyone at the company sees this reverted, original query, too.
Remember: changelogs are for humans, not machines, so write legibly.

Embrace changelogs

Changelogs can be kept in:

● Spreadsheets
● Excel
● BigQuery

Typically, a changelog records this type of information:

● Data, file, formula, query, or any other component that changed
● Description of what changed
● Date of the change
● Person who made the change
● Person who approved the change
● Version number
● Reason for the change
Changelog documentation

# Changelog

This file contains the notable changes to the project

Version 1.0.0 (02-23-2019)

## New

- Added column classifiers (Date, Time, PerUnitCost, TotalCost, etc. )

- Added Column “AveCost” to track average item cost

## Changes

- Changed date format to MM-DD-YYYY

- Removal of whitespace (cosmetic)

## Fixes

- Fixed misalignment in Column "TotalCost" where some rows did not
match with correct dates

- Fixed SUM to run over entire column instead of partial

Some of the most common errors involve:

● human mistakes like mistyping or misspelling,
● flawed processes like poor design of a survey form, and
● system issues where older systems integrate data incorrectly.
Advanced functions for speedy data cleaning

● QUERY
● IMPORTRANGE
● FILTER

QUERY

● https://support.google.com/docs/answer/3093343?hl=en

Filter

● https://support.google.com/docs/answer/3093197?hl=en

IMPORTRANGE

● https://support.google.com/docs/answer/3093340?hl=en#
Week - 5
Understand the elements of a data analyst resume

CareerCon resources on YouTube


YouTube links
● https://www.youtube.com/playlist?list=PLqFaTIg4myu-npFrYu6cO7h7AI6bkcOlL
● https://www.youtube.com/watch?v=cBbYhhH399c&list=PLqFaTIg4myu-npFrYu6cO7h7AI6bkcOlL&index=9

Adding professional skills to your resume


Highlighting experiences on resumes

Adding soft skills to your resume


Quick Review
Week -1
● Data integrity
● Manage insufficient data
● Statistics

Week - 2
● Spreadsheet

Week - 3
● SQL

Week - 4
● Verification and Cleaning
● Changelog and documentation
● Checklist

Week - 5
● Hiring process
● Resume building

Dhamodharan
14/10/2021
