Module 4 (Process Data from Dirty to Clean)
Week - 1
Data integrity and analytics objectives
Data issues
No data:
● Gather the data on a small scale to perform a preliminary analysis and
then request additional time to complete the analysis after you have
collected more data.
● If there isn’t time to collect data, perform the analysis using proxy
data from other datasets. This is the most common workaround.
Too little data:
● Do the analysis using proxy data along with actual data.
● Adjust your analysis to align with the data you already have.
Sample size!
When you work with a sample, you use a part of a population that is
representative of the population as a whole.
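If you need a quick sample in BigQuery, one common trick is filtering on RAND(). A minimal sketch, reusing the course table path from the exercises below (stand in your own population table); note that this gives roughly, not exactly, 10% of rows, and a truly representative sample may need stratification:
--Keep each row with ~10% probability
SELECT
*
FROM
`dulcet-velocity-294320.From_course.automobile_data`
WHERE
RAND() < 0.10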
Public data is the data that exists everywhere else. This is information
that is freely available on the web, though not always easy to access or use. It
is frequently unstructured and unruly, and its usage requirements are often vague.
Different types of datasets on Kaggle
● https://www.kaggle.com/datasnaek/youtube-new
● https://www.kaggle.com/sakshigoyal7/credit-card-customers
● https://www.kaggle.com/rtatman/188-million-us-wildfires
● https://www.kaggle.com/bigquery/google-analytics-sample
Clean it up!
Dirty data is data that's incomplete, incorrect, or irrelevant to the problem
you're trying to solve.
Clean data is data that's complete, correct, and relevant to the problem
you're trying to solve.
Data validation is a tool for checking the accuracy and quality of data
before adding or importing it.
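In SQL there is no built-in validation dialog, but the same kind of check can be written as a query before loading or updating data. A minimal sketch against the course's automobile_data table used later in these notes (the allowed set "two"/"four" is an assumption based on the Step 4 exercise):
--Flag rows whose num_of_doors is outside the expected set
SELECT
*
FROM
`dulcet-velocity-294320.From_course.automobile_data`
WHERE
num_of_doors NOT IN ("two", "four")
--note: NOT IN never matches NULL, so check missing values separately with IS NULL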
Spreadsheet data-cleaning tools
● Conditional formatting (to find empty cells)
● Remove duplicates
● Date formatting (Format -> Number -> Date)
● Split text to columns, using a specified separator, also called the delimiter
● Data validation
Optimize the data-cleaning process
A function is a set of instructions that performs a specific calculation using
the data in a spreadsheet.
Some basic functions for spreadsheet data cleaning (SQL equivalents are sketched after this list):
● COUNTIF
● LEN
● LEFT
● RIGHT
● CONCATENATE
● TRIM
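Most of these have close BigQuery equivalents: LEN becomes LENGTH, CONCATENATE becomes CONCAT, and COUNTIF is an aggregate. A minimal sketch against the course table, with hypothetical aliases:
SELECT
make,
LENGTH(make) AS name_length, --LEN
LEFT(make, 3) AS first_three, --LEFT
RIGHT(make, 3) AS last_three, --RIGHT
CONCAT(make, " / ", fuel_type) AS label, --CONCATENATE
TRIM(make) AS trimmed_make --TRIM
FROM
`dulcet-velocity-294320.From_course.automobile_data`
--COUNTIF is an aggregate in BigQuery:
SELECT
COUNTIF(fuel_type = "gas") AS gas_rows
FROM
`dulcet-velocity-294320.From_course.automobile_data`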
Workflow automation
● https://towardsdatascience.com/automating-scientific-data-analysis-p
art-1-c9979cd0817e
● https://news.mit.edu/2016/automating-big-data-analysis-1021
● https://technologyadvice.com/blog/information-technology/top-10-wo
rkflow-automation-software/
For example, suppose you have a dataset with missing data: how would
you handle it? And if the dataset is very large, how would you check for
missing data (one approach is sketched below)? Outlining some of your preferred
methods for cleaning data can help save you time and energy.
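For a very large table, scrolling for blanks is impractical; one approach is to count the missing values per column in a single pass. A minimal sketch, with column names taken from the exercises below:
SELECT
COUNT(*) AS total_rows,
COUNTIF(num_of_doors IS NULL) AS missing_num_of_doors,
COUNTIF(price IS NULL) AS missing_price
FROM
`dulcet-velocity-294320.From_course.automobile_data`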
1. "Not all data is the same, so don't treat it all the same."
2. "Be prepared for things to not go as planned. Have a backup plan.”
3. "Avoid applying complicated solutions to simple problems."
My list of practice queries (BigQuery data-cleaning activity)
--STEP 3
/*SELECT
*
FROM
`dulcet-velocity-294320.From_course.automobile_data`
WHERE
num_of_doors is NULL*/
--STEP 4
/*UPDATE
`dulcet-velocity-294320.From_course.automobile_data`
SET
num_of_doors = "four"
WHERE
make = "dodge"
AND fuel_type = "gas"
AND body_style = "sedan";*/ --UPDATE only works with billing enabled; it will not run in the free BigQuery sandbox
--MY CODE (view the rows that STEP 4 targets)
/*SELECT
*
FROM
`dulcet-velocity-294320.From_course.automobile_data`
WHERE
make = "dodge"
AND fuel_type = "gas"
AND body_style = "sedan"*/
--STEP 5
/*SELECT
DISTINCT(num_of_cylinders)
FROM
`dulcet-velocity-294320.From_course.automobile_data`*/
--STEP 6
/*UPDATE
cars.car_info --course table name; swap in your own dataset path to run this
SET
num_of_cylinders = "two"
WHERE
num_of_cylinders = "tow";*/ --fixes the "tow" misspelling found in STEP 5
--STEP 7
/*SELECT
MIN(compression_ratio) AS min_compression_ratio,
MAX(compression_ratio) AS max_compression_ratio
FROM
`dulcet-velocity-294320.From_course.automobile_data`
WHERE
compression_ratio <> 70;*/ --omit 70
--STEP 8
/*SELECT
COUNT(*) AS num_of_rows_to_delete
FROM
`dulcet-velocity-294320.From_course.automobile_data`
WHERE
compression_ratio = 70;*/
--STEP 9
/*DELETE FROM
`dulcet-velocity-294320.From_course.automobile_data`
WHERE
compression_ratio = 70;*/
--STEP 9 (check drive_wheels for extra spaces)
/*SELECT
DISTINCT drive_wheels,
LENGTH(drive_wheels) AS string_length
FROM
`dulcet-velocity-294320.From_course.automobile_data`*/
--STEP 10
/*UPDATE
cars.car_info
SET
drive_wheels = TRIM(drive_wheels)
WHERE
TRUE;*/
--STEP 10 (workaround: apply TRIM in a SELECT, since UPDATE needs billing)
/*SELECT
TRIM(drive_wheels) AS drive_wheels,
LENGTH(TRIM(drive_wheels)) AS string_length
FROM
`dulcet-velocity-294320.From_course.automobile_data`*/
--TEST
/*SELECT
MAX(price) AS max_price
FROM
`dulcet-velocity-294320.From_course.automobile_data`*/
Transforming data
Type conversion
PART-1
SELECT
*
FROM
`dulcet-velocity-294320.From_course.customer_purchase`
ORDER BY
CAST(purchase_price AS FLOAT64) DESC
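Without the CAST, the sort would be lexicographic rather than numeric (assuming purchase_price is stored as a string, which is what makes the CAST necessary here), so "89.85" would land above "799.99" because the comparison starts at the first character.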
PART-2
--FILTERING BY DATE RANGE
/*
SELECT
date,
purchase_price
FROM
`dulcet-velocity-294320.From_course.customer_purchase`
WHERE
date BETWEEN '2020-12-01' AND '2020-12-31' */
--CAST Change data types
/*
SELECT
CAST(date AS DATE) AS date,
purchase_price
FROM
`dulcet-velocity-294320.From_course.customer_purchase`
ORDER BY
CAST(date AS DATE) */
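One caveat with CAST: it fails the whole query if any value cannot be converted. BigQuery also provides SAFE_CAST, which returns NULL for unconvertible values instead; a minimal sketch:
/*
SELECT
SAFE_CAST(date AS DATE) AS purchase_date, --returns NULL instead of erroring on bad values
purchase_price
FROM
`dulcet-velocity-294320.From_course.customer_purchase`
*/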
Week - 4
Manually cleaning data
Data errors are the crime; data cleaning is gathering the evidence; and
documentation is the detailed record of exactly what happened, ready for
peer review or court.
Embrace changelogs
Changelogs can be kept alongside your work in:
● Spreadsheets
● Excel
● BigQuery
A simple changelog template:
# Changelog
## New
## Changes
## Fixes
Advanced functions for speedy data cleaning:
● QUERY
● IMPORTRANGE
● FILTER
QUERY
● https://support.google.com/docs/answer/3093343?hl=en
Filter
● https://support.google.com/docs/answer/3093197?hl=en
IMPORTRANGE
● https://support.google.com/docs/answer/3093340?hl=en#
Week - 5
Understand the elements of a data analyst resume
Module recap
Week - 2
● Spreadsheet
Week - 3
● SQL
Week - 4
● Verification and Cleaning
● Changelog and documentation
● Checklist
Week - 5
● Hiring process
● Resume building
Dhamodharan
14/10/2021