Advanced Data Analytics Unit 2


Noida Institute of Engineering and Technology, Greater Noida

Advanced Concepts of Analytics

Unit: 2

Advanced Data Cleaning


Archana Verma
NIET
Course Details MCA Department
(MCA 3rd Sem)

Faculty Name Subject code and abbreviation Unit Number


1
11/24/2022
Brief Introduction about Me

Designation: Assistant Professor, MCA Dept, NIET Greater Noida

Qualification:
B.Sc. Computer Science, Miranda House, Delhi University, Delhi, 1992
Master of Information Science, ADFA, University of New South Wales, Canberra, Australia, 1995
MCA, MDU, Rohtak, 2004
MTech, Amity University, Noida, 2013

Experience: 26 years (Industry: 6 years, TCS; Teaching: 20 years)
Publications:
Papers: International Journal 1, National Journal 3, National Conference 1
Books: National 4

Awards: Merit Scholarship, M Tech (2): First Year, Session 2010-2011; Second Year, Session 2011-2012
Evaluation Scheme

• MCA CURRICULUM STRUCTURE


   
Column groups: Periods (L, T, P), Evaluation Schemes (CT, TA, Total, PS), End Semester (TE, PE). A "-" marks a blank cell.

S. No.  Subject Code  Subject Name                      L  T  P  CT  TA  Total  PS   TE   PE   Total  Credit
1       AMCA0301      Software Engineering              3  0  0  30  20  50     -    100  -    150    3
2       NEW           Problem Solving using Python      3  0  0  30  20  50     -    100  -    150    3
3       -             Web Technology                    3  0  0  30  20  50     -    100  -    150    3
4       -             Elective-II                       2  0  0  30  20  50     -    50   -    100    2
5       NEW           Computer Networks                 3  0  0  30  20  50     -    100  -    150    3
6       -             Web Technology Lab                0  0  4  -   -   -      50   -    50   100    2
7       AMCA0351      Software Engineering Lab          0  0  4  -   -   -      50   -    50   100    2
-       -             Elective-II Lab                   0  0  2  -   -   -      50   -    -    50     1
8       NEW           Problem Solving using Python Lab  0  0  4  -   -   -      50   -    50   100    2
-       NEW           Mini Project                      0  0  4  -   -   -      50   -    50   100    2
-       -             GRAND TOTAL                       -  -  -  -   -   250    250  450  200  1150   23
Evaluation Scheme

• MCA CURRICULUM STRUCTURE

S. No.  Subject Code  Course Name                             University / Industry Partner Name  No of Hours  Credits
I       -             Process Data from Dirty to Clean        Offered by Google                   22 hrs.      -
II      -             Analyze Data to Answer Questions        Offered by Google                   84 hrs.      -
III     -             Share Data through Art of Visualization Offered by Google                   23 hrs.      -
IV      -             Introduction to Google SEO              USDAVIS University of California    14 hrs.      -
V       -             Google SEO Fundamentals                 USDAVIS University of California    29 hrs.      -
VI      -             Optimizing a website for Google Search  USDAVIS University of California    14 hrs.      -
Autonomous Syllabus

UNIT 1 - Process Data from Dirty to Clean


Introduction to focus on integrity, Why data integrity is important, Balancing objectives with data integrity, Dealing with insufficient
data, The importance of sample size, Using statistical power, Determine the best sample size. Clean it up!: Why data cleaning is
important, Recognize and remedy dirty data, Data-cleaning tools and techniques, Cleaning data from multiple sources, Data-cleaning
features in spreadsheets, Optimize the data-cleaning process.
 
UNIT 2 - Advanced Data Cleaning
Different data perspectives, Using SQL to clean data, Understanding SQL capabilities, Spreadsheets versus SQL, Widely used SQL
queries, Advanced data cleaning functions. Manually cleaning data: Verifying and reporting results, Cleaning and your data
expectations, The final step in data cleaning. Documenting results and the cleaning process: Capturing cleaning changes, Why
documentation is important, Feedback and cleaning.
UNIT 3- Analyze Data to Answer Questions
Data analysis basics: The analysis process, Organize data for analysis: Always a need to organize, More on sorting and filtering, Sort
data in spreadsheets: Sorting datasets, The SORT function, Sort data using SQL: Sorting queries in SQL, Convert and format data:
Getting started with data formatting, From one type to another, Data validation, Conditional formatting Combine multiple datasets:
Merging and multiple sources, Strings in spreadsheets. VLOOKUP for data aggregation, Aggregate data for analysis, Preparing for
VLOOKUP, VLOOKUP in action, Identifying common VLOOKUP errors.
 
UNIT 4 - Share Data through the Art of Visualization
Communicating your data insights, Introduction to communicating your data insights, Understand data visualization: Why data
visualization matters, Connecting images with data, A recipe for a powerful visualization, Dynamic visualizations, Design data
visualizations: Elements of art, Data visualization impact, Design thinking and visualizations.
 
UNIT 5 - Sharing data with Tableau
Get started with Tableau: Data visualizations with Tableau, Tableau Public and other online tools, Meet Tableau, Create a data
visualization in Tableau. Create visualizations in Tableau: The good, the bad, and the ugly. Use data to develop stories: Storytelling
with data, Bringing ideas to life. Use Tableau dashboards: Tableau dashboard basics, From filters to charts, Creating your first Tableau
dashboard, Compelling presentation tips, Sharing a narrative. The art and science of an effective presentation: Presenting with a
framework, Weaving data into your presentation, Brittany: Presentation skills for new data analysts, Proven presentation tips, Present
like a pro, Anticipate the question, Handling objections, Q&A best practice, Connor: Becoming an expert data translator.
 
 
CONTENT

• Objective
• Course Outcome
• CO-PO Mapping
• Prerequisite
• Recap
• Different data perspectives
• Using SQL to clean data
• Spreadsheets versus SQL
• Widely used SQL queries
• Cleaning String Variables using SQL
• Advanced data cleaning functions
• Verifying and reporting results
• Cleaning and your data expectations
• The final step in data cleaning
• Pivot Table
• CASE Statement SQL
• Capturing cleaning changes
• Why documentation is important
• Feedback and cleaning
• Daily Quiz
• Video Lectures Link
• Weekly Assignment
• MCQs
• Expected questions in university Exam
• Summary
Course Objective

•1
•To help students understand digital marketing practices, inclination of digital
consumers and role of content marketing.
•2
•To provide understanding of the concept of E-commerce and developing marketing
strategies in the virtual world
•3
•To impart learning on various digital channels and how to acquire and engage
consumers online.
•4
•To provide insights on building organizational competency by way of digital marketing
practices and cost considerations.
•5
•To develop understanding of the latest digital practices for marketing and promotion.
Course Outcome

•CO1
•It will develop proficiency in interpreting marketing strategies in the digital age and
provide fundamental knowledge for working in an online team.
•CO2
•It will enable them to develop various online marketing strategies for various
marketing-mix measures.
•CO3
•It will guide them to use various digital marketing channels for consumer acquisition
and engagement.
•CO4
•It will help in evaluating the productivity of digital marketing channels for business
success.
•CO5
•It will prepare candidates for global exposure of digital marketing practices to make
them employable in a high growth industry



CO-PO and PSO Mapping

Course outcomes  PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12
CO1  2 2 2 2 3 3 3 3 2 3 3
CO2  3 3 3 3 3 3 3 3 3 3 3
CO3  3 3 3 3 3 3 3 3 2 3 3
CO4  3 3 3 3 3 3 3 3 3 3 3
CO5  3 3 3 3 3 3 3 3 2 3 3

1-weak 2-Medium 3-strong



Prerequisite and Recap

• The students are required to have basic knowledge of computers and maths.



Different data perspectives

• Different projects require us to focus on different information in different ways.
• Data analysts use different methods to look at data differently, and this leads to more efficient and effective data cleaning.

• Some of these methods include
– sorting and filtering,
– pivot tables,
– VLOOKUP (a function),
– plotting,
– finding outliers.

• Sorting brings duplicate entries closer together for faster identification.
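
The point about sorting can be sketched in a few lines of Python. This is only an illustration: the customer rows are made up, and the approach (compare each row with its neighbour after sorting) is one simple way to exploit the fact that sorting places duplicates side by side.

```python
# Hypothetical customer rows (customer ID, state); values are made up.
rows = [
    ("9080", "Ohio"),
    ("1024", "Texas"),
    ("9080", "Ohio"),
    ("5731", "Ohio"),
    ("1024", "Texas"),
]

# Sorting puts identical rows next to each other.
sorted_rows = sorted(rows)

# A row is a duplicate if it equals the row just before it in sorted order.
duplicates = [
    current
    for previous, current in zip(sorted_rows, sorted_rows[1:])
    if previous == current
]

print(sorted_rows)
print(duplicates)
```

Once the data is sorted, a single pass over neighbouring rows is enough, which is the same reason duplicates are easier to spot by eye in a sorted spreadsheet.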



Different data perspectives

• Filters, on the other hand, are very useful in data cleaning when you want to find a particular piece of information.
• Another way to change the way you view data is by using pivot tables.
• A pivot table is a data summarization tool that is used in data processing.

• Pivot tables sort, reorganize, group, count, total, or average data stored in a database.

• In data cleaning, pivot tables are used to give you a quick, clutter-free view of your data.
Different data perspectives

• With a pivot table, you can choose to look at specific parts of the data set.

• VLOOKUP is a function that searches for a value in the first column of a range and returns a value from another column in the same row (a vertical lookup).



Using SQL to clean data
• Spreadsheets and SQL offer different data cleaning functions.
• SQL can be used to clean large data sets.
• Data analysts usually use SQL to deal with large datasets because it can handle huge amounts of data, on the order of trillions of rows.
Spreadsheet functions and formulas or SQL queries?

Features of Spreadsheets                               Features of SQL Databases
Smaller data sets                                      Larger datasets
Enter data manually                                    Access tables across a database
Create graphs and visualizations in the same program   Prepare data for further analysis in another software
Built-in spell check and other useful functions        Fast and powerful functionality
Best when working solo on a project                    Great for collaborative work and tracking queries run by all users
Spreadsheets versus SQL

 
• Suppose data is being stored in different places and in different formats, and each location might have millions of rows and hundreds of related tables.

• This is far too much data to input manually.

• SQL comes in handy with such data.

• Instead of having to look at each individual data source and record it in our spreadsheet, we can use SQL to pull all this information from different locations in our database.



Spreadsheets versus SQL

 
• Suppose we want to find something specific in all this data, like how many patients with a certain diagnosis came in today.

• In a spreadsheet, we can use the COUNTIF function to find that out.
• In SQL, we can combine COUNT with a WHERE clause to find out how many rows match our search criteria.
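
As a rough sketch of the COUNT plus WHERE idea, the following Python snippet runs the query against an in-memory SQLite database. SQLite, the `visits` table, its columns, and the sample rows are all assumptions made for illustration; they are not from the slides.

```python
import sqlite3

# In-memory SQLite database (an assumption; any SQL database would do).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE visits (patient_id INTEGER, diagnosis TEXT, visit_date TEXT)"
)
conn.executemany(
    "INSERT INTO visits VALUES (?, ?, ?)",
    [
        (1, "flu", "2022-11-24"),
        (2, "asthma", "2022-11-24"),
        (3, "flu", "2022-11-24"),
        (4, "flu", "2022-11-23"),
    ],
)

# COUNT counts the rows; WHERE keeps only rows matching the criteria.
flu_today = conn.execute(
    "SELECT COUNT(*) FROM visits WHERE diagnosis = ? AND visit_date = ?",
    ("flu", "2022-11-24"),
).fetchone()[0]

print(flu_today)
```

The spreadsheet counterpart would be a COUNTIF (or COUNTIFS for two conditions) over the same columns.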

• Spreadsheets are generated with a program like Excel or Google Sheets.

• These programs are designed to execute certain built-in functions.



Spreadsheets versus SQL

• SQL, on the other hand, is a language that can be used to interact with database programs like Oracle MySQL or Microsoft SQL Server.

• The differences between the two are mostly in how they're used.

• If a data analyst is given data in the form of a spreadsheet, they'll do their data cleaning and analysis within that spreadsheet.

• If they're working with a large data set with more than a million rows, or with multiple files within a database, they will use SQL.



Daily Quiz

Q1. Fill in the blank: To count the total number of spreadsheet values within a
specified range, a data analyst uses the _____ function.
1) SUM
2) WHOLE
3) COUNTA
4) TOTAL

Q2. A data analyst is cleaning a dataset with inconsistent formats and repeated cases.
They use the TRIM function to remove extra spaces from string variables. What other
tools can they use for data cleaning? Select all that apply.
1) Protect sheet
2) Find and replace
3) Import data
4) Remove duplicates



Widely used SQL queries

• We can use SELECT to specify exactly what data we want to interact with in a table.

• If we combine SELECT with FROM, we can pull data from any table in this database, as long as we know what the table and its columns are named.

• We can also insert new data into a database or update existing data.

• For example,
- INSERT INTO <table name> (<field names>)
  VALUES (<values>)
- UPDATE <table name>
  SET <field name> = <value>
  WHERE <cond>
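
A minimal sketch of INSERT INTO and UPDATE, run from Python against an in-memory SQLite database. The choice of SQLite, the `customer_address` columns, and the sample values are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer_address (customer_id INTEGER, state TEXT)")

# INSERT INTO adds a new row; here the state was typed in lower case.
conn.execute(
    "INSERT INTO customer_address (customer_id, state) VALUES (9080, 'oh')"
)

# UPDATE ... SET ... WHERE changes only the rows that match the condition.
conn.execute("UPDATE customer_address SET state = 'OH' WHERE customer_id = 9080")

# SELECT ... FROM reads the data back.
rows_after = conn.execute(
    "SELECT customer_id, state FROM customer_address"
).fetchall()

print(rows_after)
```

Leaving out the WHERE clause in an UPDATE would change every row in the table, so it is worth checking the condition with a SELECT first.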



Widely used SQL queries

• We can use the CREATE TABLE IF NOT EXISTS statement to create a table only if the table does not already exist.

• Running a SQL query doesn't create a table for the data we extract. It just stores the result in our local memory.
- To save it, we'll need to download it as a spreadsheet or save the result into a new table.
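
One way to save a query result into a new table is sketched below, using an in-memory SQLite database from Python. The `purchases` and `daily_customers` tables and their rows are hypothetical, and the `CREATE TABLE IF NOT EXISTS ... AS SELECT` form shown is the SQLite spelling; other databases may differ slightly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (customer_id INTEGER, purchase_date TEXT)")
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?)",
    [
        (1, "2022-11-19"),
        (2, "2022-11-19"),
        (1, "2022-11-19"),
        (1, "2022-11-20"),
    ],
)

# Save the query result as a table; IF NOT EXISTS makes a re-run harmless.
conn.execute(
    """
    CREATE TABLE IF NOT EXISTS daily_customers AS
    SELECT purchase_date, COUNT(DISTINCT customer_id) AS customers
    FROM purchases
    GROUP BY purchase_date
    """
)

saved = conn.execute(
    "SELECT purchase_date, customers FROM daily_customers ORDER BY purchase_date"
).fetchall()

print(saved)
```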

• Which tool should we use? It really depends on what kind of data you're pulling and how often.



Widely used SQL queries

• If you only need the total number of customers, you don't need a CSV file or a new table in your database.

• If you're using the total number of customers per day to, say, track a weekend promotion in a store, you might download that data as a CSV file so you can visualize it in a spreadsheet.

• But if you're being asked to pull this trend on a regular basis, you can create a table that will automatically refresh with the query you've written.

• That way, you can directly download the results whenever you need them for a report.



Cleaning String Variables using SQL

• Why is SQL such a widely used language? There are so many things that you can do with it, and it makes it very easy to return data from within a database or a data set.

• It's meant to be an interactive querying language; "query" means "asking a question."

• In SQL, to remove duplicates we can use DISTINCT in our SELECT statement.
Suppose
• We want to get the customer IDs of customers who live in Ohio.
• But customer information occurs multiple times.



Cleaning String Variables using SQL

• We can get these customer IDs by writing

    SELECT customer_id
    FROM customer_data.customer_address

- This returns duplicates: if customer ID 9080 occurs 3 times in our table, it will appear 3 times in the results.

• To get a list of unique customer IDs, we add DISTINCT to our SELECT statement:

    SELECT DISTINCT customer_id
    FROM customer_data.customer_address
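
The effect of DISTINCT can be checked with a small runnable sketch. It uses an in-memory SQLite database from Python; the table contents are made up, and the table is named `customer_address` without the `customer_data.` dataset prefix because SQLite has no such dataset here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer_address (customer_id INTEGER, state TEXT)")
conn.executemany(
    "INSERT INTO customer_address VALUES (?, ?)",
    [(9080, "OH"), (9080, "OH"), (9080, "OH"), (1024, "OH"), (5731, "TX")],
)

# Without DISTINCT, customer 9080 comes back once per stored row.
all_ids = [
    row[0]
    for row in conn.execute(
        "SELECT customer_id FROM customer_address WHERE state = 'OH'"
    )
]

# With DISTINCT, each customer ID appears once.
unique_ids = [
    row[0]
    for row in conn.execute(
        "SELECT DISTINCT customer_id FROM customer_address "
        "WHERE state = 'OH' ORDER BY customer_id"
    )
]

print(all_ids)
print(unique_ids)
```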



Cleaning String Variables using SQL

• Suppose we are working with the customer_address table.
- We want to make sure that all country codes have the same length.
- The LENGTH function is used:

    SELECT
      LENGTH(country) AS num_of_let
    FROM
      customer_data.customer_address

• Suppose we want to list the customer IDs of the rows where the country was entered incorrectly, that is, where the country is longer than 2 characters (like USA instead of US):

    SELECT
      customer_id
    FROM
      customer_data.customer_address
    WHERE
      LENGTH(country) > 2
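
A small sketch of the LENGTH check, run from Python on an in-memory SQLite database. The sample rows are invented, and the assumption is that a correct country code is exactly two letters.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer_address (customer_id INTEGER, country TEXT)")
conn.executemany(
    "INSERT INTO customer_address VALUES (?, ?)",
    [(1, "US"), (2, "USA"), (3, "US"), (4, "United States")],
)

# Rows whose country is longer than two characters need a closer look.
too_long_ids = [
    row[0]
    for row in conn.execute(
        "SELECT customer_id FROM customer_address "
        "WHERE LENGTH(country) > 2 ORDER BY customer_id"
    )
]

print(too_long_ids)
```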



Cleaning String Variables using SQL

• Suppose we want all the customers in the US by their IDs.
- We will filter for only American customers.
- We will use the SUBSTRING function to pull the first two letters of each country.
- SUBSTRING takes three parameters:
  1) the name of the column to read from
  2) the starting position (1 is the first character)
  3) how many characters to take from the starting position

    SELECT
      customer_id
    FROM
      customer_data.customer_address
    WHERE
      SUBSTRING(country, 1, 2) = 'US'
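
The same filter can be tried out with the sketch below, which uses an in-memory SQLite database from Python. Note one deliberate difference: the query uses `SUBSTR`, SQLite's long-standing spelling of the function, rather than `SUBSTRING`. The rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer_address (customer_id INTEGER, country TEXT)")
conn.executemany(
    "INSERT INTO customer_address VALUES (?, ?)",
    [(1, "US"), (2, "USA"), (3, "IN"), (4, "US")],
)

# SUBSTR(column, start, length): start at character 1, take 2 characters.
us_ids = [
    row[0]
    for row in conn.execute(
        "SELECT customer_id FROM customer_address "
        "WHERE SUBSTR(country, 1, 2) = 'US' ORDER BY customer_id"
    )
]

print(us_ids)
```

Customer 2 is matched even though the country was stored as USA, because only the first two characters are compared.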



Cleaning String Variables using SQL

• The TRIM function removes any extra spaces.

• Let's say we want a list of the IDs of all customers who live in "OH" (Ohio).

    SELECT
      customer_id
    FROM
      customer_data.customer_address
    WHERE
      TRIM(state) = 'OH'
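
To see why TRIM matters, the sketch below compares the filter with and without it. It assumes an in-memory SQLite database and made-up rows in which some state values carry stray leading or trailing spaces.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer_address (customer_id INTEGER, state TEXT)")
conn.executemany(
    "INSERT INTO customer_address VALUES (?, ?)",
    [(1, "OH"), (2, " OH"), (3, "OH  "), (4, "TX")],
)

# An exact comparison misses values with extra spaces.
without_trim = [
    row[0]
    for row in conn.execute(
        "SELECT customer_id FROM customer_address "
        "WHERE state = 'OH' ORDER BY customer_id"
    )
]

# TRIM strips leading and trailing spaces before comparing.
with_trim = [
    row[0]
    for row in conn.execute(
        "SELECT customer_id FROM customer_address "
        "WHERE TRIM(state) = 'OH' ORDER BY customer_id"
    )
]

print(without_trim)
print(with_trim)
```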



Advanced data cleaning functions

• The CAST function is used to correctly format data.
- When you import data, the data types might not be imported correctly.

• CAST can be used to convert anything from one data type to another.

• Suppose we have sorted the purchase_price column in descending order with an ORDER BY clause.
- 89.89 appears first and then 799.89.
- This is wrong: the database is not recognizing these values as numbers.
- The database thinks they are strings, but they are actually floats.
- It compared the values character by character, starting with the first character.
- It found that 8 is greater than 7, so it put 89.89 ahead of 799.89.



Advanced data cleaning functions

• We use the CAST function to replace purchase_price with a new purchase_price that the database recognizes as a float instead of a string.
• We start by replacing purchase_price with CAST.

    SELECT
      CAST(purchase_price AS FLOAT64)
    FROM
      customer_data.customer_purchase
    ORDER BY
      CAST(purchase_price AS FLOAT64) DESC

• With this query, our data will appear as

    799.89
    89.89
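
The difference between sorting prices as text and as numbers can be reproduced with the sketch below. It uses an in-memory SQLite database from Python, so the cast target is `REAL` (SQLite's floating-point type) rather than `FLOAT64`; the table holds just the two prices used as an example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# purchase_price is deliberately stored as TEXT to mimic a bad import.
conn.execute("CREATE TABLE customer_purchase (purchase_price TEXT)")
conn.executemany(
    "INSERT INTO customer_purchase VALUES (?)", [("799.89",), ("89.89",)]
)

# Sorted as strings: '8' sorts after '7', so 89.89 wrongly comes first.
as_text = [
    row[0]
    for row in conn.execute(
        "SELECT purchase_price FROM customer_purchase "
        "ORDER BY purchase_price DESC"
    )
]

# Sorted as numbers after CAST: the larger price comes first.
as_number = [
    row[0]
    for row in conn.execute(
        "SELECT CAST(purchase_price AS REAL) FROM customer_purchase "
        "ORDER BY CAST(purchase_price AS REAL) DESC"
    )
]

print(as_text)
print(as_number)
```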



Advanced data cleaning functions

• We can also use the CAST function to convert other data types, such as the date data type.



Advanced data cleaning functions

• We can use CAST with other data types too.
• Suppose we want all purchases that occurred between December 1st, 2021, and December 31st, 2021.
• Selecting the date column as it is returns every purchase between the given dates, but each date comes with a time as well.
• If we want only the date and no time, we need to cast it as follows:

    SELECT
      CAST(date AS date) AS date_only,
      purchase_price
    FROM
      customer_data.customer_purchase
    WHERE
      date BETWEEN '2021-12-01' AND '2021-12-31'
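
A rough equivalent can be run in SQLite from Python, with two stated differences from the query above: SQLite has no dedicated date type, so the sketch uses its `date()` function instead of `CAST(... AS date)`, and the timestamp column is called `purchased_at` (a hypothetical name). The rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer_purchase (purchased_at TEXT, purchase_price REAL)"
)
conn.executemany(
    "INSERT INTO customer_purchase VALUES (?, ?)",
    [
        ("2021-11-30 09:00:00", 10.0),
        ("2021-12-05 14:30:00", 89.89),
        ("2021-12-31 18:00:00", 799.89),
        ("2022-01-01 00:10:00", 5.0),
    ],
)

# date() drops the time part; filtering on the date-only value also keeps
# purchases made late in the day on December 31st.
december = conn.execute(
    """
    SELECT date(purchased_at) AS date_only, purchase_price
    FROM customer_purchase
    WHERE date(purchased_at) BETWEEN '2021-12-01' AND '2021-12-31'
    ORDER BY purchased_at
    """
).fetchall()

print(december)
```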



Advanced data cleaning functions

• CONCAT function
• CONCAT lets you add strings together to create new text strings.
• E.g.
- The product_code is the same, but the product color may be different.
- We need to separate products by color, so we combine two columns.
- The first column we want is product_code.
- The other column we want is product_color.
- We want this for the product chair.

    SELECT
      CONCAT(product_code, product_color) AS new_prod_code
    FROM
      customer_data.customer_purchase
    WHERE
      product = 'chair'
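
The sketch below shows the same idea on an in-memory SQLite database from Python. It joins the two columns with the `||` operator, which is SQLite's standard way to concatenate strings, in place of `CONCAT`. The product rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer_purchase "
    "(product TEXT, product_code TEXT, product_color TEXT)"
)
conn.executemany(
    "INSERT INTO customer_purchase VALUES (?, ?, ?)",
    [
        ("chair", "CH100", "red"),
        ("chair", "CH100", "blue"),
        ("bed", "BD200", "white"),
    ],
)

# Code plus color gives each color variant of the chair its own identifier.
new_codes = [
    row[0]
    for row in conn.execute(
        "SELECT product_code || product_color AS new_prod_code "
        "FROM customer_purchase WHERE product = 'chair' "
        "ORDER BY new_prod_code"
    )
]

print(new_codes)
```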



Advanced data cleaning functions

COALESCE function

- COALESCE returns the first non-null value in a list.
- If a field is optional in your table, it will hold null in the rows that have no value for it.
- We want product names, but if a name isn't available, then give us the product code.
- COALESCE first checks the product column.
- The second column it checks is product_code.
- If the first column is null, it returns the value of the second column.

    SELECT
      COALESCE(product, product_code) AS product_info
    FROM
      customer_data.customer_purchase

This will give

    bed
    chair
    SU1234
    SU2345
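
The fallback behaviour of COALESCE can be checked with this sketch, which runs on an in-memory SQLite database from Python. The rows are chosen to reproduce the four values listed above; the product codes for bed and chair are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer_purchase (product TEXT, product_code TEXT)")
conn.executemany(
    "INSERT INTO customer_purchase VALUES (?, ?)",
    [
        ("bed", "SU0001"),
        ("chair", "SU0002"),
        (None, "SU1234"),  # product name missing (NULL)
        (None, "SU2345"),  # product name missing (NULL)
    ],
)

# COALESCE returns product when it is not NULL, otherwise product_code.
product_info = [
    row[0]
    for row in conn.execute(
        "SELECT COALESCE(product, product_code) AS product_info "
        "FROM customer_purchase ORDER BY rowid"
    )
]

print(product_info)
```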



Daily Quiz

Q1. In which of the following situations would a data analyst use spreadsheets instead
of SQL? Select all that apply.
1) When using a language to interact with multiple database programs
2) When working with a small dataset
3) When visually inspecting data
4) When working with a dataset with more than 1,000,000 rows

Q2. A data analyst creates many new tables in their company’s database. When the
project is complete, the analyst wants to remove the tables so they don’t clutter the
database. What SQL commands can they use to delete the tables?

1) DROP TABLE IF EXISTS


2) INSERT INTO
3) UPDATE
4) CREATE TABLE IF NOT EXISTS



Verifying and reporting results

• Verifying and reporting on the integrity of your clean data.

• Verification is a process to confirm that a data cleaning effort was well executed and the resulting data is accurate and reliable.
- It involves rechecking your clean dataset and doing some manual cleanups if needed.

• Suppose I forgot to remove a semicolon.
- It sounds like a really tiny error, but if I hadn't caught it, it would have led to some big changes in my results.

• Reports are a very effective way to show your team that you're being 100 percent transparent about your data cleaning.
Cleaning and your data expectations

• Reporting helps to show stakeholders that you're accountable, build trust with your team, and make sure you're all on the same page about important project details.

• Verification is a critical part of any analysis project.

• Without it, you have no way of knowing that your insights can be relied on for data-driven decision-making.

• Verification is a stamp of approval: it confirms that your data is actually capable of solving the problem, achieving the goals, and meeting the project objective.



Cleaning and your data expectations

• Sometimes data analysts are so familiar with their own data that they miss something or make assumptions.

• Asking a teammate to review your data from a fresh perspective and getting feedback from others is very valuable at this stage.

• This is also the time to notice if anything is suspicious or potentially problematic in your data.

• Again, take a big-picture view and ask yourself: do the numbers make sense?



Cleaning and your data expectations

• Suppose an analyst is reviewing the cleaned-up data from a customer satisfaction survey.
- The survey was originally sent to 1,000 customers, but the analyst discovers more than 1,000 responses.
- This means either that a customer submitted more than one response, or that something went wrong and a field was duplicated.

• Either way, we need to go back to the data-cleaning process and correct the problem.
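
A check like this can be sketched in SQL. The example below is scaled down to a survey sent to 5 customers and runs on an in-memory SQLite database from Python; the `survey_responses` table, its columns, and the rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE survey_responses (customer_id INTEGER, score INTEGER)")
conn.executemany(
    "INSERT INTO survey_responses VALUES (?, ?)",
    [(1, 5), (2, 4), (3, 3), (3, 3), (4, 5), (5, 2)],
)

surveys_sent = 5  # hypothetical number of customers who received the survey

# Step 1: do the numbers make sense? More responses than surveys is a red flag.
total_responses = conn.execute(
    "SELECT COUNT(*) FROM survey_responses"
).fetchone()[0]

# Step 2: find the customers that appear more than once.
repeated = conn.execute(
    """
    SELECT customer_id, COUNT(*) AS responses
    FROM survey_responses
    GROUP BY customer_id
    HAVING COUNT(*) > 1
    """
).fetchall()

print(total_responses > surveys_sent)
print(repeated)
```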

• Verifying your data ensures that the insights you gain from analysis can be trusted.

• It's an essential part of data cleaning that helps companies avoid big mistakes.



The final step in data cleaning

• Your data should be verified so that it's 100 percent ready to go.

• You search for common problems. After that, you clean up the problems manually, for example by eliminating extra spaces or removing an unwanted quotation mark.

• There are also some great tools for fixing common errors automatically, such as TRIM and remove duplicates.

• Sometimes an error shows up repeatedly and can't be resolved manually or fixed automatically. In these cases, it's helpful to create a pivot table.



Pivot Table

• A pivot table is a data summarization tool that is used in data processing.

• Pivot tables sort, reorganize, group, count, total, or average data stored in a database.
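
A pivot table is a spreadsheet feature, but its group-and-count behaviour can be imitated in SQL with GROUP BY. The sketch below does this on an in-memory SQLite database from Python, using made-up first names, to show how a count per value makes rare (possibly misspelled) entries stand out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer_name (customer_id INTEGER, first_name TEXT)")
conn.executemany(
    "INSERT INTO customer_name VALUES (?, ?)",
    [
        (1, "Tony"), (2, "Tony"), (3, "Tnoy"), (4, "John"),
        (5, "Johb"), (6, "John"), (7, "Tony"),
    ],
)

# Group by value and count, like a pivot table with first_name as the row
# field and a count as the value field.
name_counts = conn.execute(
    """
    SELECT first_name, COUNT(*) AS occurrences
    FROM customer_name
    GROUP BY first_name
    ORDER BY occurrences DESC, first_name
    """
).fetchall()

print(name_counts)
```

Values that occur only once, next to near-identical values that occur often, are good candidates for a spelling check.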
 



CASE Statement SQL

• In SQL, misspellings can be corrected using a CASE statement.
- The CASE statement goes through one or more conditions and returns a value as soon as a condition is met.

    SELECT
      customer_id,
      CASE
        WHEN first_name = 'Tnoy' THEN 'Tony'
        WHEN first_name = 'Johb' THEN 'John'
        ELSE first_name
      END AS cleaned_name
    FROM
      customer_data.customer_name
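
The CASE statement above can be run as written on an in-memory SQLite database; the sketch below does so from Python. The three sample rows are hypothetical, and the table is named `customer_name` without the dataset prefix.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer_name (customer_id INTEGER, first_name TEXT)")
conn.executemany(
    "INSERT INTO customer_name VALUES (?, ?)",
    [(1, "Tnoy"), (2, "Johb"), (3, "Maria")],
)

# Each WHEN is tested in order; ELSE keeps names that need no correction.
cleaned = conn.execute(
    """
    SELECT
      customer_id,
      CASE
        WHEN first_name = 'Tnoy' THEN 'Tony'
        WHEN first_name = 'Johb' THEN 'John'
        ELSE first_name
      END AS cleaned_name
    FROM customer_name
    ORDER BY customer_id
    """
).fetchall()

print(cleaned)
```

The query only changes what is returned; the stored table still contains the original spellings, which keeps the raw data available.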



Capturing cleaning changes

• Keeping track of changes is important to a data project, and documenting all cleaning changes makes sure everyone stays informed.

• Documentation is the process of tracking changes, additions, deletions, and errors.

• Having a record of how a data set evolved does 3 very important things.



Capturing cleaning changes

• 1) It lets us recover from data-cleaning errors.
- We have a cheat sheet to rely on if we come across the same errors again later.
- We create a clean table instead of overwriting the existing table.
- So the original data still stays in place in case we need to redo the cleaning.

• 2) It informs other users of the changes you've made.
- If you are not available, the other analyst will have a reference sheet to check.

• 3) It helps determine the quality of the data to be used in analysis.



Capturing cleaning changes

• We use a changelog to access this information.

• A changelog is a file containing a chronologically ordered list of modifications made to a project.

• You can use and view a changelog in spreadsheets.
- We can find who edited the file and the changes they made in the column next to their name.
- In a cloud-based spreadsheet, the version history shows which user modified the file, the date, and the change.
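
As an illustration of what a changelog records, here is a minimal Python sketch. The entry fields (date, author, description), the `log_change` helper, and the sample entries are all hypothetical; real tools keep this history for you.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ChangelogEntry:
    changed_on: date
    author: str
    description: str


changelog = []


def log_change(changed_on, author, description):
    """Record a change and keep the list in chronological order."""
    changelog.append(ChangelogEntry(changed_on, author, description))
    changelog.sort(key=lambda entry: entry.changed_on)


log_change(date(2022, 11, 24), "analyst_b", "Trimmed extra spaces in the state column")
log_change(date(2022, 11, 23), "analyst_a", "Removed 1 duplicate row (33 rows to 32)")

for entry in changelog:
    print(entry.changed_on, entry.author, entry.description)
```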



Capturing cleaning changes

• You can also see the changelog in SQL, by clicking on the History tab in the SQL editor.
- It shows each query and the date and time it was run.
- Click on a query in the list and it will bring the query back into the editor.



Why documentation is important
 

• Data analysts are counted on to present their findings after a data cleaning effort.
• Changelogs store changes chronologically and provide a real-time account of every modification.

• Documenting is a time saver for future data analysts.

• We can do this by creating a doc listing the steps we took and the impact they had.
• For example, the first item on the list might say: removed the duplicate instance, which decreased the number of rows from 33 to 32.

• If we were working with SQL, we could include a comment in the statement describing the reason for a change, without affecting the execution of the statement.
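
The sketch below shows SQL comments in a query, run from Python on an in-memory SQLite database. Both `--` line comments and `/* ... */` block comments are ignored by the database when the statement runs. The table, rows, and the reason given in the comment are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE survey_responses (customer_id INTEGER)")
conn.executemany("INSERT INTO survey_responses VALUES (?)", [(1,), (1,), (2,)])

query = """
-- Keep one row per customer_id because the survey export
-- contained repeated submissions (hypothetical reason).
SELECT DISTINCT customer_id
FROM survey_responses
ORDER BY customer_id  /* sorted only to make the output easy to read */
"""

# The comments document the change but do not affect the result.
result = [row[0] for row in conn.execute(query)]

print(result)
```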



Feedback and cleaning

• The feedback we get when we report on our cleaning can transform data collection processes and, ultimately, business development.

• For example, one of the biggest challenges of working with data is dealing with errors.
- Some of the most common errors involve human mistakes like mistyping or misspelling,



Feedback and cleaning

• flawed processes like poor design of a survey form, and system issues where older systems integrate data incorrectly.

• Whatever the reason, data cleaning can shine a light on the nature and severity of error-generating processes.



Feedback and cleaning

• With consistent documentation and reporting, we can uncover error patterns in data collection and entry procedures and use the feedback we get to make sure common errors aren't repeated.

• Maybe we need to reprogram the way the data is collected or change specific questions on the survey form.
• In more extreme cases, the feedback can even send analysts back to the drawing board to rethink expectations and possibly update quality control procedures.



Daily Quiz

Q1. Why is it important for a data analyst to document the evolution of a dataset? Select all that
apply.

1) To inform other users of changes


2) To identify best practices in the collection of data
3) To determine the quality of the data
4) To recover data-cleaning errors

Q2. Fill in the blank: While cleaning data, documentation is used to track _____. Select all that apply.

1) Bias
2) Deletions
3) Errors
4) Changes



Faculty Video Links, Youtube & NPTEL Video Links and Online
Courses Details

• Youtube/other Video Links

• https://maung-sutikno.medium.com/process-data-from-dirty-to-clean-eb6758190d92
• https://www.youtube.com/watch?v=sNkvWJmucQs
• https://www.youtube.com/watch?v=kCP-H8VRDCw



Weekly Assignment

• Q1. What is the use of spreadsheets in cleaning data?
• Q2. What is the use of SQL in cleaning data?
• Q3. Discuss advanced cleaning tools.
• Q4. Why do we need to document and verify data?
• Q5. Discuss the use of pivot tables.



MCQs

Q1. What is involved in seeing the big picture when verifying data cleaning?
Select all that apply.
1) Consider the data
2) Consider the business problem
3) Consider the goal
4) Consider the reporting

Q2. Which of the following functions automatically removes extra spaces when cleaning data?
1) SNIP
2) CLEAR
3) REMOVE
4) TRIM



Old Question Papers

• 1st Time Subject Offered



Expected Questions for University Exam

• Discuss how string variables are cleaned using SQL.
• Discuss the widely used SQL queries.
• Differentiate between SQL and spreadsheets in data cleaning.
• What is the importance of feedback?
• What is the use of a changelog?
• What expectations do you have while cleaning data?



Summary

We have learnt how SQL and spreadsheets are used in data cleaning, which advanced
functions are used in cleaning, how string variables are cleaned, what expectations
to set for clean data, and the use of documentation, verification, and feedback.



References

Thank You

