DAA - Chapter 02
CHAPTER 02
Mastering Data
Prepared by Nguyen Huu [email protected]
Objectives
Contents
24/09/2024
identifier.
• Descriptive attributes include everything else.
The ETL (extract, transform, load) process begins with identifying which data you need and is complete when the clean data are loaded, in the appropriate format, into the tool to be used for analysis. Requesting data is an iterative practice involving five steps.
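The flow described above can be sketched end to end in a few lines. This is a minimal illustration, not a prescribed method: the sample CSV text, the column names, the two date formats, and the `invoices` table are all assumptions made up for the example, and an in-memory SQLite database stands in for the analysis tool.

```python
import csv
import io
import sqlite3
from datetime import datetime

# Hypothetical raw extract: mixed date formats, padded text, one blank amount.
RAW_CSV = """invoice_id,invoice_date,customer,amount
1001,2024-01-15, Acme Co ,250.00
1002,01/20/2024,Beta LLC,
1003,2024-02-03,Acme Co,99.50
"""


def extract(text):
    """Extract: read the raw rows as dictionaries keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def parse_date(value):
    """Normalize the two date formats assumed in this sample to YYYY-MM-DD."""
    for fmt in ("%Y-%m-%d", "%m/%d/%Y"):
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {value!r}")


def transform(rows):
    """Transform: trim text, standardize dates, set aside rows with no amount."""
    clean, rejected = [], []
    for row in rows:
        if not row["amount"].strip():
            rejected.append(row)  # keep for follow-up instead of guessing a value
            continue
        clean.append((
            int(row["invoice_id"]),
            parse_date(row["invoice_date"]),
            row["customer"].strip(),
            float(row["amount"]),
        ))
    return clean, rejected


def load(rows, conn):
    """Load: write the cleaned rows into the (assumed) analysis table."""
    conn.execute(
        "CREATE TABLE invoices ("
        "invoice_id INTEGER PRIMARY KEY, invoice_date TEXT, "
        "customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO invoices VALUES (?, ?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
clean_rows, rejected_rows = transform(extract(RAW_CSV))
load(clean_rows, conn)
loaded_count = conn.execute("SELECT COUNT(*) FROM invoices").fetchone()[0]
print(f"loaded {loaded_count} rows, rejected {len(rejected_rows)}")
```

Rows that fail a cleaning rule are set aside rather than silently dropped, so they can be raised with whoever supplied the data.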
Extract
Data request form fields:
• Request Number
• Date Received
• Date Completed
• Date Provided
• Received by
• Assigned to
• Initial review comments (discussion with client: revisions required? agreement to proceed? etc.)
• Revisions Required
• If you have direct access to a data warehouse, you can use SQL and
other tools to pull the data yourself.
• Identify the tables that contain the information you need. You can do
this by looking through the data dictionary or the relationship model.
• Identify which attributes, specifically, hold the information you need in
each table.
• Identify how those tables are related to each other.
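The steps above can be sketched with a small query. Everything here is hypothetical: the `customer` and `sales_order` tables, their columns, and the sample rows are invented for illustration, and an in-memory SQLite database stands in for the data warehouse. The point is only to show how identifying tables, attributes, and relationships turns into a `SELECT` with a `JOIN` on the primary key/foreign key pair.

```python
import sqlite3

# In-memory database standing in for a data warehouse (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (
    customer_id   INTEGER PRIMARY KEY,
    customer_name TEXT NOT NULL
);
CREATE TABLE sales_order (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
    order_date  TEXT NOT NULL,
    order_total REAL NOT NULL
);
INSERT INTO customer VALUES (1, 'Acme Co'), (2, 'Beta LLC');
INSERT INTO sales_order VALUES
    (10, 1, '2024-01-15', 250.0),
    (11, 1, '2024-02-03', 99.5),
    (12, 2, '2024-02-10', 400.0);
""")

# 1. Tables needed: customer, sales_order.
# 2. Attributes needed: customer_name, order_id, order_date, order_total.
# 3. Relationship: sales_order.customer_id (foreign key) refers to
#    customer.customer_id (primary key).
query = """
SELECT c.customer_name, o.order_id, o.order_date, o.order_total
FROM sales_order AS o
JOIN customer AS c ON c.customer_id = o.customer_id
WHERE o.order_date >= ?
ORDER BY o.order_id
"""
rows = conn.execute(query, ("2024-02-01",)).fetchall()
for row in rows:
    print(row)
```

Passing the cutoff date as a parameter (`?`) rather than pasting it into the SQL text keeps the query reusable for other periods.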
Transform
Load
Chapter 2 Summary
• The first step in the IMPACT cycle is to identify the questions that you intend to answer through your data analysis project. Once a data analysis problem or question has been identified, the next step in the IMPACT cycle is mastering the data, which can be broken down to mean obtaining the data needed and preparing it for analysis.
• In order to obtain the right data, it is important to have a firm grasp of what data are available to you and how that information is stored. Data are often stored in a relational database, which helps to ensure that an organization’s data are complete and to avoid redundancy. Relational databases are made up of tables with uniquely identified records (this is done through primary keys) and are related through the usage of foreign keys.
• To obtain the data, you will either have access to extract the data yourself or you will need to request the data from a database administrator or the information systems team. If the latter is the case, you will complete a data request form, indicating exactly which data you need and why.
• Once you have the data, they will need to be validated for completeness and integrity; that is, you will need to ensure that all of the data you need were extracted, and that all data are correct. Sometimes when data are extracted, some formatting or sometimes even entire records will get lost, resulting in inaccuracies. Correcting the errors and cleaning the data is an integral step in mastering the data.
• Finally, after the data have been cleaned, there may be one last step of mastering the data, which is to load them into the tool that will be used for analysis. Often, the cleaning and correcting of data occur in Excel and the analysis will also be done in Excel. In this case, there is no need to load the data elsewhere. However, if you intend to do more rigorous statistical analysis than Excel provides, or if you intend to do more robust data visualization than can be done in Excel, it may be necessary to load the data into another tool following the transformation process.
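The validation for completeness and integrity mentioned in the summary can be sketched as a handful of checks. This is an illustrative sketch only: the control figures, the field names, the sample rows, and the `validate` helper are all hypothetical, and it assumes the data owner supplied a record count and a control total along with the extract.

```python
from collections import Counter

# Hypothetical control figures supplied by the data owner with the extract.
EXPECTED_ROW_COUNT = 3
EXPECTED_TOTAL = 1000.00

# Hypothetical extract containing one duplicated record and one missing value.
extracted = [
    {"order_id": 10, "customer_id": 1, "order_total": 250.00},
    {"order_id": 11, "customer_id": 1, "order_total": 350.00},
    {"order_id": 11, "customer_id": 1, "order_total": 350.00},
    {"order_id": 12, "customer_id": None, "order_total": 400.00},
]


def validate(rows, expected_count, expected_total, key="order_id"):
    """Return a list of human-readable issues found in the extract.

    Assumes every row has a numeric order_total.
    """
    issues = []

    # Completeness: did we receive the number of records we expected?
    if len(rows) != expected_count:
        issues.append(f"row count: expected {expected_count}, got {len(rows)}")

    # Completeness: does the amount column agree with the control total?
    total = round(sum(r["order_total"] for r in rows), 2)
    if total != round(expected_total, 2):
        issues.append(
            f"control total: expected {expected_total:.2f}, got {total:.2f}"
        )

    # Integrity: the primary key should identify each record uniquely.
    key_counts = Counter(r[key] for r in rows)
    duplicates = sorted(k for k, n in key_counts.items() if n > 1)
    if duplicates:
        issues.append(f"duplicate {key} values: {duplicates}")

    # Integrity: flag blank or missing attribute values.
    for r in rows:
        missing = sorted(f for f, v in r.items() if v is None or v == "")
        if missing:
            issues.append(f"{key} {r[key]}: missing {missing}")

    return issues


issues = validate(extracted, EXPECTED_ROW_COUNT, EXPECTED_TOTAL)
for issue in issues:
    print(issue)
```

Collecting every issue in one pass, instead of stopping at the first failure, gives a complete list to work through during cleaning or to send back with a revised data request.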
Key words