
Data Manipulation

Lesson 2

IT Specialist: Data Analytics


Topics Covered
• Skill 2.1: Import, store, and export data
• Skill 2.2: Clean data
• Skill 2.3: Organize data
• Skill 2.4: Aggregate data

2
Skill 2.1: Import, store, and export data
• This skill covers how to:
• Describe ETL processing
• Perform ETL with relational data
• Perform ETL with data stored in delimited files
• Perform ETL with data stored in XML files
• Perform ETL with data stored in JSON files

3
Describe ETL processing
• Figure 2-1: The ETL process

4
Extract
• During the data extraction phase, raw data is extracted from one or more source systems to a staging area. The raw data can be structured, semi-structured, or unstructured. Possible sources include, but are not limited to, the following:
• Relational or non-relational databases
• Flat files like CSV, JSON, or XML
• Sensors, email, or web pages

5
Transform
• During the data transformation phase, the raw data extracted from the source system is processed and transformed for its intended analytical use case. Some of the tasks performed during this phase are as follows:
• Filtering, cleansing, and deduplicating data (i.e., eliminating duplicate or redundant records)
• Removing, encrypting, decrypting, or hashing critical data as per industry or government data regulations
• Performing necessary calculations or translations, such as converting currencies, converting measurement units, or standardizing text formats
• Changing the data format to match the schema of the target system
• Some methods of filtering, cleaning, and formatting data are discussed later in this lesson.
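
As a minimal sketch (hypothetical raw_sales staging table; the exchange rate is an assumed constant), several of these transformations can be combined in one SQL query:

SELECT
  UPPER(TRIM(customer_name)) AS customer_name,  -- standardize text format and remove extra spaces
  amount_usd * 0.92 AS amount_eur               -- currency conversion at an assumed rate
FROM raw_sales;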
6
Load
• During the data load phase, the transformed data is loaded into the target system.
• The data loading process can be one of the following types:
• Full data load, where all data is loaded into the target system
• Incremental data load, where incremental data changes are loaded periodically after an initial full load
• Full refresh data load, where the old data in the target system is fully replaced by the new data
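
A minimal SQL sketch of the first two load types (hypothetical staging_sales and target_sales tables; the cutoff date stands in for the last load time):

-- Full data load: copy everything into the target table
INSERT INTO target_sales SELECT * FROM staging_sales;

-- Incremental data load: copy only rows changed since the last load
INSERT INTO target_sales
SELECT * FROM staging_sales WHERE modified_date > '2024-01-01';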

7
Perform ETL with relational data
• Varieties of RDBMS
  • Microsoft SQL Server
  • MySQL
  • Oracle Database
• Key components
  • Primary key
  • Foreign key

8
Structured Query Language
• Categories of SQL statements
  • Data Definition Language
  • Data Query Language
  • Data Manipulation Language
  • Data Control Language
• SELECT statements
• SQL alias
• SQL joins
  • Inner
  • Left outer
  • Right outer
  • Full outer
  • Cross
  • Self

9
SQL

10
SELECT Statements
• SELECT * FROM table_name;
• SELECT col1, col2 FROM table_name;
• SELECT DISTINCT col1, col2, ... FROM table_name;
• SELECT col1, col2, ... FROM table_name WHERE condition;
• SELECT TOP N col1, col2, ... FROM table_name; (TOP is SQL Server syntax; many other databases use LIMIT instead)

11
SQL Alias Statement
• SELECT emp_id AS employee_id, emp_name AS employee_name FROM employee;
• SELECT e.emp_id, e.emp_name, d.dept_name
  FROM employee e INNER JOIN department d ON e.dept_id = d.dept_id;

12
Cross Join or Cartesian Join
• Generates the paired combination of each row of the first table with each row of the second table; for example, a table with 3 rows cross joined with a table with 4 rows produces 12 rows

• SELECT column1, column2, ... FROM table1 CROSS JOIN table2;

13
Perform ETL with data stored in delimited
files (Slide 1 of 3)
• Common delimiters
  • Comma (,)
  • Semicolon (;)
  • Tab
  • Space
• Types of delimited files
  • CSV file
  • TSV file
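
For illustration, the same hypothetical record in CSV and TSV form differs only in the delimiter (tab characters shown as →):

CSV:
name,department,salary
Asha,HR,52000

TSV:
name→department→salary
Asha→HR→52000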

14
Perform ETL with data stored in delimited
files (Slide 2 of 3)
• Importing delimited files to Excel
1. Open Excel and click on Data -> From Text.
2. A dialog box will open to allow you to select the file. Select your delimited file and click on Import.
3. Excel will open a preview of the data in the selected file.
• Click Next.
• Choose the delimiter (for example, comma for a CSV file) and click on Finish.
• A new Import Data dialog box will open. Choose either Existing worksheet or New worksheet and click OK.

15
Perform ETL with data stored in delimited
files (Slide 3 of 3)
• Reading and writing delimited files using Python (pandas)
  • read_csv()
  • to_csv()
• Reading and writing delimited files using R (readr)
  • read_csv()
  • write_csv()

16
Perform ETL with data stored in XML files
• XML follows a tree structure that must contain a root element
• The root element is the parent of all other elements
• Each XML element may contain text, attributes, or sub-child
elements
• All attribute values must be quoted with either single or double
quotes
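
A minimal illustrative XML document (hypothetical employee data): <employees> is the root element, id is a quoted attribute, and <name> and <department> are sub-child elements containing text.

<employees>
  <employee id="101">
    <name>Asha</name>
    <department>HR</department>
  </employee>
</employees>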

17
Perform ETL with data stored in JSON files
• JSON file values
• String
• Number
• JSON object
• Array
• Boolean
• Null
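
A minimal illustrative JSON document (hypothetical data) showing all six value types; the labels on the right are annotations, not part of the file:

{
  "name": "Asha",                  <- string
  "age": 30,                       <- number
  "address": { "city": "Pune" },   <- JSON object
  "skills": ["SQL", "Excel"],      <- array
  "active": true,                  <- Boolean
  "manager": null                  <- null
}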

18
Skill 2.2: Clean data
• This skill covers how to:
• Perform data cleaning common practices
• Perform truncation
• Describe data validation

19
Perform data cleaning common practices
• Common practices used in data cleaning process
1. Remove irrelevant data
2. Remove duplicate data
3. Remove unnecessary spaces
4. Handle inconsistent capitalization
5. Data type conversion
6. Handle missing or null values using imputation
7. Deal with outliers
8. Standardize data

20
Removing or filtering out irrelevant data
• In most cases, only part of the dataset is relevant to our data
analysis. In such cases, we either filter out or delete the irrelevant
data and select only the part of the data that is relevant to us.

• Filter out all the inactive employees during any calculation


• SELECT * FROM employee2 WHERE is_active != 0;

• Delete all inactive employees permanently from the table


• DELETE FROM employee2 WHERE is_active = 0;

21
Remove duplicate data
• Duplicate records are very common. Because data is often collected and gathered from multiple different sources, raw, unprocessed data frequently contains duplicates.

• SELECT DISTINCT * FROM raw_employee;

22
Remove unnecessary spaces
• The unnecessary leading and/or trailing space can cause the same
data to be considered different. For example, the values “male”, “ male”,
“male ” and “ male ” are the same, but are considered different by a
string comparison due to leading and/or trailing spaces. Extra spaces
can be handled using the following SQL functions.

• TRIM() removes both leading and trailing spaces.


• LTRIM() removes only leading spaces.
• RTRIM() removes only trailing spaces.

• SELECT TRIM(department) FROM employee;


23
Handling inconsistent capitalization
• Sometimes, the same data looks different to an algorithm or
function that uses string comparison due to inconsistent
capitalization.

• SELECT UPPER(department) FROM employee;

• SELECT LOWER(department) FROM employee;

24
Data type conversion
• Syntax: CAST(expression AS datatype(length))

Value        Description
expression   Required. The value to convert.
datatype     Required. The data type to convert expression to. Can be one of the following: bigint, int, smallint, tinyint, bit, decimal, numeric, money, smallmoney, float, real, datetime, smalldatetime, char, varchar, text, nchar, nvarchar, ntext, binary, varbinary, or image.
(length)     Optional. The length of the resulting data type (for char, varchar, nchar, nvarchar, binary, and varbinary).

• SELECT CAST(age AS INT) AS age FROM employee;

25
Handle missing or null values using
imputation
Real-world datasets often contain missing values, generally represented as NULL, N/A, blanks, etc. Missing values are one of the most common problems in data analysis. There are various ways to handle them:
• One way is to discard the records having missing values. But doing so
may result in the loss of valuable information.
• A better way is to replace missing data with some substituted value. This
technique is known as imputation.
• The substituted value that is used to replace missing data is known
as imputed data. The imputed data is derived from the existing part of
the data.

26
Handle missing or null values using
imputation
The following are some of the popular methods of data imputation:
• Imputation Using Mean or Median Values: In this technique, the
missing values in a column are replaced by the mean or median
value of non-missing values in the same column. This method can
be used only with numeric data.
Mean or average: It is the sum of all the numbers divided by the total
number of numbers. For example, the mean of 5 numbers [9, 12, 8, 14,
7] is (9+12+8+14+7)/5 , i.e., 10.
Median: It is the middle number in the sorted list of numbers in ascending
or descending order. For example, to find the median of 5 numbers [9, 12,
8, 14, 7], first sort the numbers in ascending order [7, 8, 9, 12, 14] and find
the middle number, i.e., 9. Therefore 9 is the median value.
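As a sketch, mean imputation can be written directly in SQL (SQL Server-style syntax; assuming a numeric salary column containing some NULL values):

UPDATE employee
SET salary = (SELECT AVG(salary) FROM employee WHERE salary IS NOT NULL)
WHERE salary IS NULL;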
27
Handle missing or null values using
imputation
• Imputation Using Most Frequent Values: In this technique,
the missing values in a column are replaced by the most
frequent value of non-missing values in the same column. This
method can be used for both numeric and non-numeric data.
• Imputation Using Zero or Constant Values: In this technique,
the missing values of a column are replaced by zero or any
other constant value. This method can be used for both numeric
and non-numeric data.
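
For example, constant-value imputation can be performed at query time with the standard COALESCE() function, which returns its first non-null argument ('Unknown' is an arbitrary substitute value here):

SELECT COALESCE(department, 'Unknown') AS department FROM employee;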

28
Deal with outliers
• An outlier is an extremely high or extremely low data value
compared to the other data values in the dataset.
The outliers are considered abnormal data values but they
should be investigated before eliminating them because they
may be valuable to the data and the analysis. To investigate
them, ask questions such as:
• Why did such data values appear?
• Is it a rare case, or is it likely to appear again?
Based on the investigation, a data analyst may either eliminate
those data points or perform data imputation.
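
A minimal sketch for flagging candidate outliers with a fixed threshold (the salary bounds here are arbitrary, illustrative values):

SELECT * FROM employee2 WHERE salary < 10000 OR salary > 500000;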
29
Standardize data
• Standardizing data is a process of changing data into a
consistent format. This is required in various scenarios like
these:
• Changing temperature to either Fahrenheit or Celsius to have a
consistent unit.
• Ensuring that all instances of a length measurement are given
in the same unit (meters or kilometers).

• SELECT (length_km * 1000) AS length_meter FROM lengths;

30
Describe data validation
• Data validation is performed after the data cleaning process to
validate the data. During data validation, you take steps to
ensure that the data is accurate, complete, consistent, and
uniform.
• Common examples of data validation rules
• Data completeness check to ensure that the required records are not missing
• Data type validation to verify that each field has the correct data type (for example, integer, float, string)
• Range validation to ensure the values are in the correct range (e.g., a number between 1 and 100)
31
Describe data validation
• Uniqueness check (For example, in a relational database, the
uniqueness can be ensured at the time of table creation by creating the
primary key constraints for fields like employee_id in
the EMPLOYEE table and department_id in DEPARTMENT table.)
• Consistent expressions (For example, the same department name
should not have different values like “HR”, “H.R”, and “Hr”)
• No null values (For example, the field name in the EMPLOYEE table
should not have null values).
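
As a sketch, several of these rules can be checked with simple SQL queries against the tables used earlier in this lesson; each query should return nothing if the data is valid:

SELECT * FROM employee WHERE age NOT BETWEEN 1 AND 100;  -- range validation
SELECT * FROM employee WHERE name IS NULL;               -- no-null check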

32
Skill 2.3: Organize data
• This skill covers how to:
• Describe data organization
• Perform sorting
• Perform filtering
• Perform appending and slicing
• Perform pivoting
• Perform transposition

33
Describe data organization
• Data organization plays a vital role in managing and accessing data.
When data is well organized in an Excel worksheet or in a database
table, it allows users to access and process data easily and efficiently.
It is very difficult to access and process data that is not well
organized.
• Data organization helps in categorizing and classifying data to make it more usable. The following processes are used to organize data:
• Sorting data
• Filtering data
• Appending data
• Slicing data
34
Perform sorting (Slide 1 of 2)
• Figure 2-5 Sorting Data in Excel

35
Perform sorting (Slide 2 of 2)
• SQL provides the ORDER BY clause to sort the data selected from the database.
• ORDER BY sorts the records in ascending order by default; the optional ASC keyword makes this explicit.
• The DESC keyword is used to return the result in descending order.
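
For example, using the employee2 table from this lesson's other examples:

SELECT * FROM employee2 ORDER BY name;         -- ascending (ASC is the default)
SELECT * FROM employee2 ORDER BY salary DESC;  -- descending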

36
Perform filtering
• Follow these steps to filter data in Excel.
1. Select any cell within the range of your dataset.
2. Select Data and then click on Filter.
3. After clicking on Filter, the column headers will show arrow icons. Select any of the column header arrows to filter the data on that column.
4. Select Text Filters or Number Filters, and then select a
comparison, like Between.
5. Enter the filter criteria and click on OK.

37
Perform filtering
Operator   Meaning
=          Equal to
!= or <>   Not equal to
>          Greater than
<          Less than
<=         Less than or equal to
>=         Greater than or equal to
BETWEEN    To select items within a specified range, where the start and end items are inclusive
LIKE       To search for a pattern
IN         To specify multiple possible values for a column to include
NOT IN     To specify multiple possible values for a column to exclude


38
Perform filtering
• The following query will pull all records from the employee2 table where the name begins with the character 'R'. Here, the wildcard character % is being used.

• SELECT * FROM employee2 WHERE name LIKE 'R%';

39
Perform filtering
• The following query will return all records from the employee2 table where the id is between 103 and 105, inclusive of the beginning and end values.

• SELECT * FROM employee2 WHERE id BETWEEN 103 AND 105;

40
Perform filtering
• The following query will return all records from the employee2 table where the id is either 103, 104, or 105.

• SELECT * FROM employee2 WHERE id IN (103,104,105);

41
Perform filtering
• The following query will return all records from the employee2 table where the id is not 103, 104, or 105.

• SELECT * FROM employee2 WHERE id NOT IN (103,104,105);

42
AND and OR operators:
• The AND and OR operators are used to filter records based on more than one condition in the WHERE clause
• The AND operator returns TRUE if all the conditions separated by AND are TRUE
• The OR operator returns TRUE if any of the conditions separated by OR is TRUE
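
Illustrative queries (columns drawn from this lesson's employee2 examples; the specific values are assumptions):

SELECT * FROM employee2 WHERE department = 'HR' AND is_active = 1;
SELECT * FROM employee2 WHERE id = 103 OR id = 105;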

43
Perform appending and slicing (Slide 1 of 3)
• Appending is used to combine two or more strings together. In
SQL, the + operator as well as the CONCAT() function are used
to combine strings.
• The following queries combine the two strings 'Data' and ' Analyst' together. (The + operator is SQL Server syntax; other databases use || or CONCAT().)

Query                               Output
SELECT CONCAT('Data', ' Analyst');  Data Analyst
SELECT 'Data' + ' Analyst';         Data Analyst

44
Perform appending and slicing (Slide 2 of 3)
• Slicing is used to extract a subset of elements from a string. In SQL, the SUBSTRING() function is used for slicing: SUBSTRING(string, start, length)
• string - The string to extract from.
• start - The start position. The first position in the string is 1.
• length - The number of characters to extract.

• The following query extracts the first four characters from the string 'Data Analyst'.
Query                                    Output
SELECT SUBSTRING('Data Analyst', 1, 4);  Data
45
Perform appending and slicing (Slide 3 of 3)
• Slicing Data in Excel
1. Select all the data in Excel and format it as a table (Insert -> Table).
Check My Table has Headers and click OK.
2. Click anywhere in the table and select Insert -> Slicer
3. A dialog box for Insert Slicers will open in which you need to select
the fields that you want to use to slice data and then select OK.
4. For each of the selected fields, a slicer will be created and each
slicer will have buttons corresponding to the distinct values in the
selected field.
5. When any of the slicer buttons is clicked, then only the matching
rows in the linked table will be shown.

46
Perform pivoting
1. Click anywhere in the table and select Insert -> PivotTable
2. The Create PivotTable dialog will be opened. Click OK.
3. A new worksheet will be opened that allows you to select the
pivot fields.
4. By default, the calculation is the sum, but it can be changed to count, min, max, and others.
• Select Value Field Settings from the VALUES dropdown.
• Choose your calculation (Sum, Count, Average, Max, Min, Product) from the Value Field Settings dialog box and click OK.

47
Perform transposition
• Steps to create a transposition of a table
1. Select blank cells where you would like the transposed table to be created. (For the legacy array formula, the selected range should have as many rows as the source range has columns, and vice versa.)
2. After selecting the blank cells, type the transpose formula =TRANSPOSE(A1:E4)
3. After writing the transpose formula, press ENTER (CTRL+SHIFT+ENTER in Excel versions without dynamic arrays), which will generate the transposed table.

48
Skill 2.4: Aggregate data
• This skill covers how to:
• Describe the aggregation function
• Use aggregation functions like COUNT, SUM, MIN, MAX, and AVG in
SQL
• Use GROUP BY and HAVING in SQL

49
Describe the aggregation function
• The aggregation of data is one of the most important aspects of
data analytics. It is useful in knowing the summary of the data.
Consider that you want to know the total number of employees
as well as the maximum, minimum, and average salary of
employees working in your organization. In order to find these
details, you need to apply the appropriate aggregation functions
on the employee records stored in the database.

50
Describe the aggregation function
• Table 2-43 Common and frequently used aggregation functions
COUNT Returns the number of records.

SUM Returns the total sum of values in a numeric column.

MIN Returns the smallest value in a column.

MAX Returns the largest value in a column.

AVG Returns the average of all the values in a column.

51
Use aggregation functions like COUNT,
SUM, MIN, MAX, and AVG in SQL
• SELECT
count(*) as total_employee,
sum(salary) as total_salary,
min(salary) as min_salary,
max(salary) as max_salary,
avg(salary) as average_salary
FROM employee2;

52
Use GROUP BY and HAVING in SQL
• The GROUP BY statement groups rows into categories so that aggregation functions can be applied to the rows of each category independently.
• The HAVING clause is used to filter grouped data using conditions calculated with the aggregate functions.

53
Use GROUP BY and HAVING in SQL
• SELECT department, COUNT(*) AS total FROM employee2 GROUP BY department;

• SELECT department, COUNT(*) AS total FROM employee2 GROUP BY department HAVING COUNT(*) > 1;

54
Use GROUP BY and HAVING in SQL
• SELECT department,
COUNT(*) as total_employee,
MIN(salary) as min_salary,
MAX(salary) as max_salary,
AVG(salary) as average_salary
FROM Employee2 GROUP BY department;

55
Summary
• This lesson covered importing, storing, and exporting data;
cleaning data; organizing data; and aggregating data.

56
