DATA STRUCTURES AND CAATTs

FOR DATA EXTRACTION

Maria Keyceelyn Jane L. Maron
DATA STRUCTURES
• Data structures have two fundamental components:
organization and access method.
• Organization refers to the way records are physically
arranged on the secondary storage device. This may be
either sequential or random.
• The records in sequential files are stored in contiguous
locations that occupy a specified area of disk space.
• Records in random files are stored without regard for
their physical relationship to other records of the same
file. Random files may have records distributed
throughout a disk.
• The access method is the technique used to
locate records and to navigate through the
database or file. Access methods are classified as
either direct access or sequential access.
Flat-File Structures
• End users in this environment own their data
files rather than share them with other users
• Data files are structured, formatted, and
arranged to suit the specific needs of the
owner or primary user
• See Fig. 8.1: sequential storage and access
method.
• Sequential files are simple and easy to process.
• An indexed structure is so named because, in
addition to the actual data file, there exists a
separate index that is itself a file of record
addresses.
• The data file itself may be organized either
sequentially or randomly.
• The virtual storage access method (VSAM)
structure is used for very large files that
require routine batch processing and a
moderate degree of individual record
processing.
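A minimal Python sketch of the indexed structure described above. It assumes an in-memory list stands in for the data file and list positions stand in for record addresses; the names `data_file`, `build_index`, and `fetch` are hypothetical and do not come from the text.

```python
# Data file: records stored in arrival (random) order.
# The list position plays the role of the record address.
data_file = [
    {"key": 1050, "name": "bolt"},
    {"key": 1010, "name": "nut"},
    {"key": 1030, "name": "washer"},
]

def build_index(records):
    """Build the separate index: primary key -> record address."""
    return {rec["key"]: address for address, rec in enumerate(records)}

def fetch(records, index, key):
    """Direct access: find the address in the index, then read that record."""
    address = index.get(key)
    return None if address is None else records[address]

index = build_index(data_file)
print(fetch(data_file, index, 1030))
```

Without the index, locating key 1030 would require reading the records one after another until a match is found.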
Hashing Structure
• Employs an algorithm that converts the
primary key of a record directly into a storage
address.
• Advantage:
– Fast access
• Disadvantage:
– Does not use storage space
efficiently
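The hashing idea can be sketched as below. This assumes the division-remainder method, which is only one possible hashing algorithm; the slot count and the keys are invented for illustration.

```python
NUM_SLOTS = 11  # hypothetical number of storage locations (a prime number)

def hash_address(primary_key: int) -> int:
    """Convert a primary key directly into a storage address."""
    return primary_key % NUM_SLOTS

storage = [None] * NUM_SLOTS
collisions = []  # keys whose computed address was already occupied

for key in (15943, 22001, 30417, 15954):
    address = hash_address(key)
    if storage[address] is None:
        storage[address] = key
    else:
        collisions.append(key)

print("unused slots:", storage.count(None))
print("colliding keys:", collisions)
```

In this sketch most slots stay empty while two keys still compete for the same address, which illustrates why the technique is fast but wasteful of storage space.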
Pointer Structures
• See Fig. 8.6.
• Used to link records between files.
• Types of pointers (see Fig. 8.8):
– Physical address
– Relative address
– Logical key pointer
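A rough sketch of how the three pointer types could be resolved to a storage location. It assumes fixed-length records stored contiguously from a known byte offset and a small lookup table for logical keys; all constants and function names are hypothetical.

```python
RECORD_LENGTH = 100   # assumed fixed record length in bytes
FILE_START = 4_000    # assumed byte offset where the file begins

def from_physical(address: int) -> int:
    """Physical address pointer: already holds the storage location."""
    return address

def from_relative(position: int) -> int:
    """Relative address pointer: holds the record's position in the file
    (1 = first record), which must be converted to a storage location."""
    return FILE_START + (position - 1) * RECORD_LENGTH

# Logical key pointer: holds the primary key of the related record.
# A lookup table stands in for the conversion step in this sketch.
key_to_position = {"C-9": 135}

def from_logical_key(key: str) -> int:
    return from_relative(key_to_position[key])

print(from_relative(135), from_logical_key("C-9"))
```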
Relational Database
Structure, Concepts and
Terminology
Diane U. Bautista

Relational Database Structure,
Concepts and Terminology
• Relational databases are based on the
indexed sequential file structure. This structure
uses an index in conjunction with a sequential
file organization.
• Multiple indexes can be used to create a
cross-reference, called an inverted list, which
allows even more flexible access to data.

Accordingly, a system is relational if it:
• represents data in the form of
two-dimensional tables, and
• supports the relational algebra functions of
restrict, project, and join.

The Relational Algebra Functions
Restrict, Project, and Join
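The three functions can be illustrated with a short Python sketch. It assumes each table is a list of rows held as dictionaries; the table contents and column names are invented, and unlike a strict relational project this version does not remove duplicate rows.

```python
# Hypothetical tables, each a list of rows.
customers = [
    {"cust_num": 1, "name": "Acme", "city": "Cebu"},
    {"cust_num": 2, "name": "Bolt", "city": "Davao"},
]
invoices = [
    {"invoice_num": 100, "cust_num": 1, "amount": 500},
    {"invoice_num": 101, "cust_num": 2, "amount": 900},
    {"invoice_num": 102, "cust_num": 1, "amount": 250},
]

def restrict(table, condition):
    """Keep only the rows that satisfy the condition."""
    return [row for row in table if condition(row)]

def project(table, columns):
    """Keep only the named columns of every row."""
    return [{col: row[col] for col in columns} for row in table]

def join(left, right, key):
    """Combine rows from two tables that share the same value of `key`."""
    return [{**lrow, **rrow}
            for lrow in left for rrow in right
            if lrow[key] == rrow[key]]

large = restrict(invoices, lambda row: row["amount"] > 400)
names = project(customers, ["name"])
combined = join(customers, invoices, "cust_num")
```

Here `large` holds invoices 100 and 101, `names` holds only the name column, and `combined` holds one row per invoice with the matching customer data attached.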

FIGURE 8-12. Data model using an ERD
Entity, Occurrence, and Attributes
• An entity is anything about which the organization
wishes to capture data. Figure 8-12 shows examples of
entities.
• The data model is the blueprint for ultimately creating
the physical database. The graphical representation
used to depict the model is called an entity
relationship (ER) diagram.
• The term occurrence is used to describe the number
of instances or records that pertain to a specific
entity.
• Attributes are the data elements that define an entity.

• Association
◦ Represented by a line connecting two entities
◦ Described by a verb, such as ships, requests, or
receives
• Cardinality – the degree of association
between two entities
◦ The number of possible occurrences in one table
that are associated with a single occurrence in a
related table
◦ Used to determine primary keys and foreign keys

Four basic forms of cardinality are possible:

• zero or one (0,1)
• one and only one (1,1)
• zero or many (0,M)
• one or many (1,M)
Examples of Entity Associations

Properly designed tables possess the following four
characteristics:

1. The value of at least one attribute in each occurrence
(row) must be unique. This attribute is the primary key.
The values of the other (nonkey) attributes in the row need
not be unique.
2. All attribute values in any column must be of the same
class.
3. Each column in a given table must be uniquely named.
However, different tables may contain columns with the
same name.
4. Tables must conform to the rules of normalization. This
means they must be free from structural dependencies
including repeating groups, partial dependencies, and
transitive dependencies.
Logically related tables need to be physically
connected to achieve the associations described in
the data model. Using foreign keys accomplishes
this, as illustrated in Figure 8-14. In this example
the foreign keys are embedded in the related table.
• The primary key of the Customer table (CUST NUM)
is embedded as a foreign key in both the Sales
Invoice and Cash Receipts tables.
• The primary key of the Sales Invoice table (INVOICE
NUM) is embedded as a foreign key in the Line Item
table.
• No explicit pointers are present. The data are viewed as
a collection of independent tables.
• Relations are formed by an attribute that is common to
both tables in the relation.
• Assignment of foreign keys:
◦ In a 1:1 association, either table’s primary key
may be the foreign key.
◦ In a 1:M association, the primary key on the 1 side
is embedded as the foreign key on the many
side.
◦ In an M:M association, may embed foreign keys
or create a separate linking table.
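The embedding of foreign keys can be sketched with Python's built-in `sqlite3` module. The table names follow the Customer, Sales Invoice, and Line Item example above; the remaining columns and the data values are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE customer (
    cust_num INTEGER PRIMARY KEY,
    name     TEXT
);
CREATE TABLE sales_invoice (
    invoice_num INTEGER PRIMARY KEY,
    cust_num    INTEGER NOT NULL REFERENCES customer(cust_num)
);
CREATE TABLE line_item (
    invoice_num INTEGER NOT NULL REFERENCES sales_invoice(invoice_num),
    item_num    INTEGER NOT NULL,
    quantity    INTEGER,
    PRIMARY KEY (invoice_num, item_num)
);
""")

conn.execute("INSERT INTO customer VALUES (1, 'Acme')")
conn.execute("INSERT INTO sales_invoice VALUES (100, 1)")

# An invoice that points at a customer who does not exist is rejected.
try:
    conn.execute("INSERT INTO sales_invoice VALUES (101, 99)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

invoice_count = conn.execute("SELECT COUNT(*) FROM sales_invoice").fetchone()[0]
```

The customer sits on the 1 side of each association, so its primary key appears as a foreign key in the invoice table on the many side.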
User Views
A user view is the set of data that a particular
user sees. Examples of user views are
computer screens for entering or viewing
data, management reports, or source
documents such as an invoice.
Database Anomalies
Improperly normalized tables exhibit negative operational
symptoms called anomalies. Specifically these
are the update anomaly, the insertion
anomaly, and the deletion anomaly.
• Insertion Anomaly: A new item cannot be
added to the table until at least one entity
uses a particular attribute item.
• Deletion Anomaly: If an attribute item
used by only one entity is deleted, all
information about that attribute item is
lost.
• Update Anomaly: A modification on an
attribute must be made in each of the
rows in which the attribute appears.
• Anomalies can be corrected by creating
additional relational tables.
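A small Python sketch of the update and deletion anomalies. It assumes a single inventory table that repeats supplier data in every row; the column names and values are invented.

```python
# Hypothetical unnormalized table: supplier data repeat in every row.
inventory = [
    {"part_num": 1, "supplier_num": 22, "supplier_phone": "555-0101"},
    {"part_num": 2, "supplier_num": 22, "supplier_phone": "555-0101"},
    {"part_num": 3, "supplier_num": 27, "supplier_phone": "555-0199"},
]

# Update anomaly: the phone number is changed in only one of the two
# rows for supplier 22, so the table now holds conflicting values.
inventory[0]["supplier_phone"] = "555-0123"
phones_for_22 = {row["supplier_phone"]
                 for row in inventory if row["supplier_num"] == 22}

# Deletion anomaly: removing part 3 also erases everything known
# about supplier 27, because that was its only row.
inventory = [row for row in inventory if row["part_num"] != 3]
known_suppliers = {row["supplier_num"] for row in inventory}

# Insertion anomaly (not executed here): a new supplier that does not
# yet supply a part has no part_num, so there is no row to hold it.
```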
• Each row in the table must be unique in
at least one attribute, which is the
primary key.
◦ Tables are linked by embedding the primary
key into the related table as a foreign key.
• The attribute values in any column must
all be of the same class or data type.
• Each column in a given table must be
uniquely named.
• Tables must conform to the rules of
normalization, i.e., free from structural
dependencies or anomalies.

• Normalization is a process that systematically splits
unnormalized complex tables into smaller
tables that meet two conditions:
◦ all nonkey (secondary) attributes in the table are
dependent on the primary key
◦ all nonkey attributes are independent of the other
nonkey attributes
• When unnormalized tables are split and
reduced to third normal form, they must
then be linked together by foreign keys.
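The splitting step can be sketched in Python as follows. It assumes an inventory table in which the supplier name depends on the supplier number rather than on the primary key (the part number); the column names and data are hypothetical.

```python
unnormalized = [
    {"part_num": 1, "description": "bracket", "supplier_num": 22, "supplier_name": "Supplier A"},
    {"part_num": 2, "description": "gasket", "supplier_num": 22, "supplier_name": "Supplier A"},
    {"part_num": 3, "description": "brace", "supplier_num": 27, "supplier_name": "Supplier B"},
]

def split_tables(rows):
    """Move supplier attributes into their own table and keep
    supplier_num in the inventory table as the foreign key."""
    inventory = [
        {"part_num": r["part_num"],
         "description": r["description"],
         "supplier_num": r["supplier_num"]}
        for r in rows
    ]
    suppliers = {}
    for r in rows:
        suppliers[r["supplier_num"]] = {
            "supplier_num": r["supplier_num"],
            "supplier_name": r["supplier_name"],
        }
    return inventory, list(suppliers.values())

inventory, suppliers = split_tables(unnormalized)
```

After the split each supplier name is stored once, and the inventory rows reach it through the `supplier_num` foreign key.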

• Business Rule 1. Each vendor supplies the
firm with three (or fewer) different items of
inventory, but each item is supplied by only
one vendor.
• Business Rule 2. Each vendor supplies the
firm with any number of inventory items, but
each item is supplied by only one vendor.
• Business Rule 3. Each vendor supplies the
firm with any number of inventory items, and
each item may be supplied by any number of
vendors.
• Update anomalies can generate conflicting
and obsolete database values.
• Insertion anomalies can result in
unrecorded transactions and incomplete
audit trails.
• Deletion anomalies can cause the loss of
accounting records and the destruction of
audit trails.
• Accountants should understand the data
normalization process and be able to
determine whether a database is properly
normalized.

Designing
Relational
Databases

Kresylene R. Torres
Six Phases of database design
(view modeling):

1. Identify entities.
2. Construct a data model showing entity
associations.
3. Add primary keys and attributes to the
model.
4. Normalize the data model and add foreign
keys.
5. Construct the physical database.
6. Prepare the user views.
Step 1. Identify Entities
– Primary entities (physical, conceptual)
– Things about which the organization wishes to capture data.
– Entities are represented as nouns in a system description.
– To pass as valid entities, two conditions need to be met:
Condition 1. An entity must consist of two or more occurrences.
Condition 2. An entity must contribute at least one attribute that is
not provided through other entities.
We need to test these conditions for each candidate to eliminate any
false entities.
1. The PURCHASING AGENT reviews the INVENTORY
STATUS REPORT for items that need to be reordered.
2. The agent selects a SUPPLIER and prepares an online
PURCHASE ORDER.
3. The agent prints a copy of the purchase order.
4. The supplier ships INVENTORY to the company. Upon its
arrival, the RECEIVING CLERK inspects the inventory and
prepares an online RECEIVING REPORT.
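The two-condition test can be expressed as a small Python check. The occurrence counts and attribute sets below are assumptions chosen only to show how the test behaves; they are not facts stated in the narrative.

```python
def is_valid_entity(occurrences: int, unique_attributes: set) -> bool:
    """Condition 1: two or more occurrences.
    Condition 2: at least one attribute not provided by other entities."""
    return occurrences >= 2 and len(unique_attributes) >= 1

# Candidate -> (assumed number of occurrences, assumed unique attributes)
candidates = {
    "SUPPLIER": (50, {"supplier_name"}),
    "PURCHASING AGENT": (1, {"agent_name"}),   # assumes a single agent
    "INVENTORY STATUS REPORT": (12, set()),    # assumes all data come from INVENTORY
}

valid = [name for name, (count, attrs) in candidates.items()
         if is_valid_entity(count, attrs)]
print(valid)
```

Under these assumptions only SUPPLIER passes; changing an assumption, for example to several purchasing agents, changes the result.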
Step 2. Construct a Data Model
Showing Entity Associations

– Determine the associations between the entities and
document them with an ER diagram.
– Associations represent business rules.
– The organization’s business rules directly impact the
structure of the database tables. If the database is to
function properly, its designers need to understand
the organization’s business rules as well as the specific
needs of the individual users.
Step 3. Add primary keys and
attributes to the model.

• Assign primary keys to the entities in the model. The analyst should
select a primary key that logically defines the nonkey attributes and
uniquely identifies each occurrence in the entity. Sometimes this can be
accomplished using a sequential code such as an invoice number, check
number, or purchase order number.
• Every attribute in an entity should appear directly or indirectly in one
or more user views. Entity attributes are originally derived and modeled
from user views.
• If stored data are not used in a document, report, or calculation that
is reported in some way, then they serve no purpose and should not be
part of the database.
Step 4. Normalize Data Model
and add foreign keys

1. Removal of repeating groups (the existence of multiple
values for a particular attribute in a specific record)
2. Removal of transitive dependencies
- These occur in a table where nonkey attributes are
dependent on another nonkey attribute and independent of the
table’s primary key.
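Removing a repeating group can be sketched in Python as below. It assumes a purchase order record that carries several item entries; the field names and values are hypothetical.

```python
# One purchase order record with a repeating group of (part_num, quantity).
purchase_order = {
    "po_num": 501,
    "supplier_num": 22,
    "items": [(1, 10), (2, 5), (7, 40)],
}

def remove_repeating_group(po):
    """Split the record into a header row and one line-item row per item.
    Each line item carries po_num as the foreign key back to the header."""
    header = {"po_num": po["po_num"], "supplier_num": po["supplier_num"]}
    lines = [
        {"po_num": po["po_num"], "part_num": part, "quantity": qty}
        for part, qty in po["items"]
    ]
    return header, lines

header, lines = remove_repeating_group(purchase_order)
```

Each resulting row now holds a single value per attribute, and the line items are linked to their order through `po_num`.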
Step 5. Construct the Physical
Database.

• Create the physical tables and populate them with data.
This is an involved step that must be carefully planned
and executed and may take many months in a large
installation.
• Programs will need to be written to transfer
organization data currently stored in flat files or legacy
databases to the new relational tables. Data currently
stored on paper documents may need to be entered
into the database tables manually.
Step 6. Prepare the User Views

• The query function of a relational DBMS allows the
system designer to easily create user views from tables.
The designer simply tells the DBMS which tables to use,
their primary and foreign keys, and the attributes to
select from each table. Older DBMSs require the designer
to specify view parameters directly in SQL. Newer systems
do this visually: the designer simply points and clicks at
the tables and attributes.
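Specifying a view directly in SQL might look like the sketch below, which uses Python's built-in `sqlite3` module. The tables, columns, and data are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE supplier (
    supplier_num INTEGER PRIMARY KEY,
    name         TEXT
);
CREATE TABLE purchase_order (
    po_num       INTEGER PRIMARY KEY,
    supplier_num INTEGER REFERENCES supplier(supplier_num),
    order_date   TEXT
);
INSERT INTO supplier VALUES (22, 'Supplier A');
INSERT INTO purchase_order VALUES (501, 22, '2024-01-15');

-- The view names the tables, the key that links them, and the
-- attributes to select from each table.
CREATE VIEW po_report AS
SELECT po.po_num, po.order_date, s.name AS supplier_name
FROM purchase_order AS po
JOIN supplier AS s ON s.supplier_num = po.supplier_num;
""")

rows = conn.execute("SELECT * FROM po_report").fetchall()
print(rows)
```

The view stores no data of its own; it is rebuilt from the underlying tables each time it is queried.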
EMBEDDED AUDIT MODULE

– The objective of the embedded audit module (EAM),
also known as continuous auditing, is to identify
important transactions while they are being processed
and extract copies of them in real time.
– An EAM is a specially programmed module embedded
in a host application to capture predetermined
transaction types for subsequent analysis.
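The capture mechanism can be sketched in Python as follows. It assumes a materiality threshold as the selection criterion and a plain list as the audit file; `process_sale` is a hypothetical host application routine, not part of any real product.

```python
MATERIALITY_THRESHOLD = 50_000  # assumed selection criterion set by the auditor
audit_file = []                 # copies of captured transactions

def eam_capture(transaction):
    """Embedded audit module: copy transactions that meet the
    predetermined criterion while they are being processed."""
    if transaction["amount"] >= MATERIALITY_THRESHOLD:
        audit_file.append(dict(transaction))

def process_sale(transaction):
    """Hypothetical host application routine with the EAM embedded in it."""
    eam_capture(transaction)
    return {"posted": True, "invoice": transaction["invoice"]}

process_sale({"invoice": 1, "amount": 1_200})
process_sale({"invoice": 2, "amount": 75_000})
print(audit_file)
```

Normal processing continues for every transaction; only the second one is copied to the audit file for subsequent analysis.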
Disadvantages of EAMs
– Operational Efficiency
◦ The presence of an audit module within the host application
may create significant overhead, especially when the amount of
testing is extensive. One approach for relieving this burden from
the system is to design modules that may be turned on and off
by the auditor. Doing so will, of course, reduce the effectiveness
of the EAM as an ongoing audit tool.
– Verifying EAM Integrity
◦ An EAM may not be a viable audit technique in environments with a high
level of program maintenance. When host applications undergo
frequent changes, the EAMs embedded within the host will also
require frequent modifications.
◦ The integrity of the EAM directly affects the quality of the audit
process. Auditors must evaluate the EAM integrity. This
evaluation is accomplished in the same way as testing the host
application controls.
GENERALIZED AUDIT
SOFTWARE

- Enables auditors to review computer files
without reviewing processing programs
- Specifically tailored to auditor tasks
- Has been developed in-house in large
firms
- Available from various software suppliers
- Example: Audit Command Language
(ACL)
Popular because

– Easy to use and requires little computer background
– Many products are platform independent and work on mainframes
and PCs
– Auditors can perform their tests independently of IT staff
– GAS can be used to audit the data stored in most
file structures and formats
• Using GAS to access simple structures
• Using GAS to access complex structures
ACL Software
Joyreene Anne C. Lopez
ACL Software
• ACL stands for Audit Command Language. It should not be
confused with an access control list, the table that tells a
computer operating system which access rights each user has
to a particular system object, such as a file directory or
individual file.
• ACL was designed as a meta-language for auditors to access
data stored in various digital formats and to test them
comprehensively.
• Many of the problems associated with accessing complex
data structures have been solved by ACL’s Open Database
Connectivity (ODBC) interface.
ACL Software
• One of the advantages of ACL is its ability to read data
stored in most formats. For this purpose ACL uses its
data definition feature.
• To create a data definition, the auditor needs to know both
where the source file physically resides and its field
structure layout.
• The data definition allows the auditor to define important
characteristics of the source file, including the overall record
length, the name given to each field, the type of data contained
in each field, and the starting point and length of each field
in the file.
• The definition is stored in a table under a name assigned
by the auditor. Once the data definition is complete,
future access to the table is accomplished simply by
selecting the name assigned.
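The idea behind a data definition can be sketched in plain Python. This is not ACL syntax; it assumes a fixed-width source file, and the layout, field names, and sample record are invented for illustration.

```python
RECORD_LENGTH = 23  # assumed overall record length in characters

# (field name, data type, starting point (1-based), length)
record_layout = [
    ("cust_num", int,   1,  5),
    ("name",     str,   6, 10),
    ("balance",  float, 16, 8),
]

def parse_record(line: str) -> dict:
    """Cut one fixed-width record into typed fields using the layout."""
    if len(line) != RECORD_LENGTH:
        raise ValueError("record length does not match the definition")
    record = {}
    for name, data_type, start, length in record_layout:
        raw = line[start - 1:start - 1 + length]
        record[name] = data_type(raw.strip())
    return record

print(parse_record("00042Acme Corp 00150.75"))
```

Once the layout has been defined, every record in the file can be read the same way without the auditor touching the source data.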
ACL Software
Customizing a View
• Allows the auditor to customize the original view created
during data definition to one that better meets his or her
audit needs.
• Using ACL, the auditor can create and reformat
new views without changing or deleting the data in the
underlying file. Only the presentation of the data is
affected.
• The auditor can easily delete and/or rearrange the data
to facilitate effective usage.
ACL Software
Filtering Data
• ACL provides powerful options for filtering data that
support various audit tests.
• ACL’s expression builder allows the auditor to use logical
operators such as AND, OR, NOT, and others to define and
test conditions of any complexity and to process only
those records that match specific conditions.
• When the auditor executes this filter procedure, ACL
produces a new view.

Note: Filters are expressions that search for records that
meet the filter criteria.
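A filter built from AND, OR, and NOT can be sketched in plain Python. This is not ACL's expression builder; the records and the condition are invented for illustration.

```python
records = [
    {"invoice": 1, "amount": 1500, "region": "East", "approved": True},
    {"invoice": 2, "amount": 800,  "region": "East", "approved": False},
    {"invoice": 3, "amount": 2500, "region": "West", "approved": False},
    {"invoice": 4, "amount": 3000, "region": "West", "approved": True},
]

def apply_filter(rows, condition):
    """Return the matching rows as a new list; the source rows are unchanged."""
    return [row for row in rows if condition(row)]

# amount > 1000 AND (region = "East" OR NOT approved)
def condition(row):
    return row["amount"] > 1000 and (row["region"] == "East" or not row["approved"])

filtered = apply_filter(records, condition)
print([row["invoice"] for row in filtered])
```

Only invoices 1 and 3 satisfy the combined condition, and the original list still holds all four records.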
ACL Software
Stratifying Data
• ACL’s stratification feature allows the auditor to view the
distribution of records that fall into specified strata.
• Data can be stratified on any numeric field such as sales
price, unit cost, quantity sold, and so on.
• Data are summarized and classified by strata, which can
be equal in size (called interval) or vary in size (called free).
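Stratifying on equal-size intervals can be sketched in plain Python. This is an illustration rather than ACL's own command; the function name, the bounds, and the values are assumptions.

```python
def stratify(values, minimum, maximum, intervals):
    """Summarize values into equal-width strata between minimum and maximum.
    Values outside the range are ignored in this sketch."""
    width = (maximum - minimum) / intervals
    strata = [
        {"lower": minimum + i * width,
         "upper": minimum + (i + 1) * width,
         "count": 0,
         "total": 0}
        for i in range(intervals)
    ]
    for value in values:
        if minimum <= value <= maximum:
            # A value equal to the maximum belongs to the last stratum.
            i = min(int((value - minimum) / width), intervals - 1)
            strata[i]["count"] += 1
            strata[i]["total"] += value
    return strata

unit_costs = [5, 12, 25, 40, 100]
result = stratify(unit_costs, minimum=0, maximum=100, intervals=4)
print([(s["lower"], s["upper"], s["count"]) for s in result])
```

The count and total per stratum show at a glance where the records are concentrated.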
ACL Software
Statistical Analysis
• ACL offers many sampling methods for statistical analysis.
• Two of the most frequently used are record sampling
and monetary unit sampling (MUS).
– Record sampling: when records in a file are fairly
evenly distributed across strata, the auditor may
want an unbiased sample and will choose record sampling.
– Monetary unit sampling (MUS): if the file is heavily
skewed with large-value items, the auditor may select MUS,
which will produce a sample that includes all the larger
dollar amounts.
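The difference between the two methods can be sketched in plain Python. This is a simplified illustration, not ACL's implementation; the systematic selection rule, the interval, and the amounts are assumptions.

```python
import random

def record_sample(records, n, seed=0):
    """Record sampling: every record has the same chance of selection."""
    return random.Random(seed).sample(records, n)

def monetary_unit_sample(amounts, interval, start):
    """Monetary unit sampling: select the item that contains every
    `interval`-th monetary unit, beginning at unit `start`.
    Returns the positions of the selected items."""
    selected, next_hit, cumulative = [], start, 0
    for position, amount in enumerate(amounts):
        cumulative += amount
        if cumulative >= next_hit:
            selected.append(position)
            # Skip any further hits that fall inside the same item.
            while next_hit <= cumulative:
                next_hit += interval
    return selected

amounts = [100, 50, 5000, 20, 300, 8000, 10]
print(monetary_unit_sample(amounts, interval=2000, start=1000))
print(record_sample(amounts, 3))
```

With a start no larger than the interval, any item at least as large as the interval is always selected, which is why the two large amounts at positions 2 and 5 are picked.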
Normalizing Tables in a
Relational Database
The normalization process involves systematically identifying
and removing structural dependencies from the table(s) under
review.
Tables in third normal form (3NF) will be free of anomalies
and will meet two conditions:
1. All nonkey attributes will be wholly and uniquely
dependent on (defined by) the primary key.
2. None of the nonkey attributes will be dependent
on (defined by) other nonkey attributes.
User View