Basic Audit Data Analytics
with R
Preface xii
Authors xiv
Acknowledgments xvi
2 Intro to SQL 19
2.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.2 Simple SQL Queries . . . . . . . . . . . . . . . . . . . . . . . 20
2.2.1 SELECT ... FROM . . . . . . . . . . . . . . . . . . . . 20
2.2.2 WHERE . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.2.3 DISTINCT . . . . . . . . . . . . . . . . . . . . . . . . 22
2.2.4 Practice Problem: ORDER BY . . . . . . . . . . . . . 22
2.3 Creating Aggregate data . . . . . . . . . . . . . . . . . . . . . 23
2.3.1 GROUP BY . . . . . . . . . . . . . . . . . . . . . . . . 23
2.3.2 WHERE v. HAVING . . . . . . . . . . . . . . . . . . . 24
2.3.3 Practice problems . . . . . . . . . . . . . . . . . . . . . 24
2.3.4 Create Aggregate Sales Data by Store . . . . . . . . . . 25
Load and Review Sales Data . . . . . . . . . . . . . . . 25
GROUP BY Store . . . . . . . . . . . . . . . . . . . . 26
2.4 Create Sales Data by Month . . . . . . . . . . . . . . . . . . . 26
2.4.1 Format Date Variable . . . . . . . . . . . . . . . . . . . 26
GROUP BY Month . . . . . . . . . . . . . . . . . . . . 27
2.5 Create Monthly Sales Data per Store . . . . . . . . . . . . . . 28
2.6 Merging Tables . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.6.1 Practice Problems . . . . . . . . . . . . . . . . . . . . . 30
2.6.2 INNER JOIN . . . . . . . . . . . . . . . . . . . . . . . 31
2.6.3 LEFT JOIN . . . . . . . . . . . . . . . . . . . . . . . . 32
2.7 Create Categorical Variables . . . . . . . . . . . . . . . . . . . 33
2.8 Solutions to Selected Practice Problems . . . . . . . . . . . . . 35
II Basic ADA 55
4 Inventory & Cost of Sales 57
4.1 Information Regarding Audited Company . . . . . . . . . . . 57
4.2 ADA Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.3 Data Understanding . . . . . . . . . . . . . . . . . . . . . . . 59
4.3.1 Obtaining an Understanding of Sales Prices . . . . . . 59
Import and Review Sales Data . . . . . . . . . . . . . . 59
Create Aggregate Sales Data . . . . . . . . . . . . . . . 60
Review Aggregated Sales Data . . . . . . . . . . . . . . 60
Summary Statistics for Aggregated Sales Data . . . . . 61
Outliers in Price Range . . . . . . . . . . . . . . . . . . 62
4.3.2 Obtaining an Understanding of Inventory Costs . . . . 62
Import and Review Inventory Data . . . . . . . . . . . 62
4.3.3 Create Aggregate Inventory Data . . . . . . . . . . . . 63
4.3.4 Review Summary Statistics - Aggregate Data . . . . . 66
4.4 Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.4.1 Create Comparison Data . . . . . . . . . . . . . . . . . 67
4.4.2 Review Comparison Data . . . . . . . . . . . . . . . . 68
4.4.3 Summary Statistics . . . . . . . . . . . . . . . . . . . . 69
4.4.4 Focus on Relatively Large Discrepancies . . . . . . . . 69
4.5 Evaluation and Communication . . . . . . . . . . . . . . . . . 70
4.6 Documentation . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.7 Practice Problems . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.7.1 Most Expensive Products . . . . . . . . . . . . . . . . 72
5 Payroll 79
5.1 Information Regarding Bibitor’s Payroll . . . . . . . . . . . . 79
5.2 ADA Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . 80
5.3 Data Understanding . . . . . . . . . . . . . . . . . . . . . . . 81
5.3.1 Load and Review Payroll Data . . . . . . . . . . . . . 82
Practice Problems: Verify Data . . . . . . . . . . . . . 85
5.3.2 Load and Review Data from Employee Table . . . . . . 85
Headcount: HQ vs. stores . . . . . . . . . . . . . . . . 86
Practice Problem: List of job titles . . . . . . . . . . . 87
5.3.3 Load and Review Store Data . . . . . . . . . . . . . . . 87
5.3.4 Create Combined Data for ADA . . . . . . . . . . . . . 87
5.4 ADA Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5.4.1 Salaried and Hourly Employees . . . . . . . . . . . . . 89
Practice Problem: Deductions . . . . . . . . . . . . . . 92
5.4.2 Bi-Weekly Wages . . . . . . . . . . . . . . . . . . . . . 92
Abstract
Accounting professors, students and many auditors need readily available,
non-proprietary training material on audit data analytics (ADA). This need
was identified at the panel discussion on data analytics at the AAA meeting
in San Diego (August 2017). Several attendees stated that they need ma-
terials (instructions, cases, data, and code) to first train themselves in the
basics of ADA and then use the same materials to teach the topic in their
classes. In addition, feedback from a survey contracted by the CPA Canada
Audit Data Analytics Committee (Hampton and Stratopoulos 2016)1 indi-
cated that, outside of the big four public accounting firms, there are relatively
limited opportunities for training existing auditors in the use of data analyt-
ics.
“Basic Audit Data Analytics with R” is intended to meet the need noted
above. The training uses the software R because it is open-source (free) and
it provides virtually endless possibilities to those who learn it. The cases,
including practice problems, use comprehensive large data sets for an entire
company accessed from the HUB of Analytics Education. Millions of data
points are updated regularly.
The primary learning objective is to provide trainees with capabilities
required to perform entry level ADA. This means that - given a set of well-
defined objectives and a reasonably clean data set - those who have suc-
cessfully taken the training should have an understanding of how basic data
analytics can be effectively applied in various aspects of a financial statement audit.
These basics include:
1. Setting ADA objectives: Identify aspects of the audit where the audit
team can use data analytics tools to obtain audit evidence.
2. Data Understanding: Identify sources of data, collect and extract data,
become familiar with the data structure, and identify data quality issues.
3. Data Preparation: Be able to clean and transform data to enable effective and
1
Hampton, C., and T. C. Stratopoulos. 2016. Audit Data Analytics Use: An Ex-
ploratory Analysis. https://ssrn.com/abstract=2877358
2. Provide trainees with a basis for pursuing additional training in the use
of more advanced ADA, such as predictive analytics involving the use
of machine learning.
2
The Wikipedia article provides a very good introduction to CRISP-DM:
https://en.wikipedia.org/wiki/Cross_Industry_Standard_Process_for_Data_Mining.
However, if you want to get a more in-depth understanding or hope to leverage
data mining to advance your career, you should read some of the original articles:
ftp://ftp.software.ibm.com/software/analytics/spss/support/Modeler/Documentation/14/UserManual/CRISP-DM.pdf and
ftp://ftp.software.ibm.com/software/analytics/spss/documentation/modeler/14.2/en/CRISP_DM.pdf.
Also, it is assumed that the financial statements being audited are meant
to be prepared in accordance with generally accepted accounting principles
(GAAP), such as the International Financial Reporting Standards (IFRS) or,
in the USA, standards promulgated by the Financial Accounting Standards
Board (FASB).
When appropriate, this training material makes brief references to as-
pects of GAAP and GAAS to help illustrate objectives and conclusions for
the example ADA. However, this is not a training course in auditing or ac-
counting. Auditors using ADA would be expected to understand GAAP
applicable to the particular engagement, and understand and comply with
all relevant requirements in the applicable GAAS.
In complying with GAAS, auditors performing ADA would consider, for
example, matters related to setting objectives for an ADA, communications
among members of the engagement team regarding the ADA as well as mat-
ters that would be communicated to management or those charged with
governance of the audited entity and audit documentation.
In using an ADA for one or more of the overall purposes noted above, the
auditor will determine specific objectives which use of the ADA is intended
to achieve. Those specific objectives are used in determining, for example,
the nature and extent of the data to be used in performing the ADA, and
how that data will be analyzed.
Communications
Communications Among Engagement Team Members
Communication objectives: GAAS include requirements regarding com-
munications among engagement team members. These requirements help
ensure that the planning, performance and evaluation of an ADA are of high
quality. Two-way communications enable more senior/experienced engage-
ment team members to provide appropriate direction and supervision to more
junior members of the team on the performance of the ADA. Those perform-
ing the ADA will provide timely feedback on any problems encountered in
performing the ADA, and on the results obtained. Examples of matters to
be communicated include:
1. The objectives of the ADA, including specific risks (at the assertion
level) that the ADA is intended to address, in the context of whether
the ADA is a risk assessment procedure, a test of controls, a substantive
procedure, or a combination thereof.
2. The nature, sources, format, timing, extent, level of disaggregation and
reliability of the data to be used.
3. The tools to be used to obtain the data and perform the ADA.
4. The expected output from the ADA including graphics.
5. Types of matters that may be identified in planning, performing and
evaluating the results of the ADA, including how and when such mat-
ters are to be communicated.
(a) ADA used as risk assessment procedures
• Risk of material misstatement not previously identified.
• Risk of material misstatement that is higher or lower than
previously identified.
• Matters that are relevant to planning the nature, timing and
extent of other audit procedures.
(b) ADA used to test controls over financial reporting
• Deficiencies, significant deficiencies (material weaknesses) in
internal control.
Documentation of an ADA
GAAS require that the documentation of an ADA should be sufficient to
enable an experienced auditor, having no previous connection with the audit,
to understand the nature, timing and extent of the ADA, significant matters
arising in performing the ADA, and the results of the ADA, including the
audit evidence obtained.
The documentation of the example ADA described in this training would
therefore include the key aspects of the use of R in performing each stage of
the ADA noted, including for example:
The authors greatly appreciate the financial support for this education project
provided by the University of Waterloo Centre for Information Integrity and
Information Systems Assurance (UW-CISA). In addition, the project would
not be successful without the cooperation and support of professor Charles
Bame-Aldred of Northeastern University.
Data Understanding
Learning Objectives
By the end of this chapter, trainees should have learned how information is
organized in files in a database and how these files are interconnected. The
trainees should also have learned how to:
the sale of liquor. For example, Bibitor is not permitted to ship inven-
tory out of state.
4. The company has been selling spirits and wine products for over 50
years. The business has been successful to date. In the audit of the
preceding year, the auditor did not identify any material uncertainty
regarding Bibitor’s ability to continue as a going concern.
5. During the fiscal year under audit, the company had over 10,000 brands
available for sale. Most of the products are low to medium-priced and
intended to appeal to a wide range of consumers (all of whom are
required to be over the legal drinking age). Bibitor also stocks limited
quantities of high-priced spirits and wines. Some of these brands are
individually priced at thousands of dollars.
1. tSales.csv
2. tPurchases.csv
3. tStores.csv
4. tVendors.csv
5. tProducts.csv
6. tBegInv.csv
7. tEndInv.csv
8. tEmployees.csv
9. tPayroll.csv
10. tExpenses.csv
11. tInvoices.csv
The reason that the client has provided us with .csv files, rather than
Excel files, has to do with the size of these data sets. Unlike Excel files,
which are currently limited to about a million rows, .csv files do not have
an upper limit. This difference matters since some of the client’s files have
more than a million observations (e.g., the sales file has over 12 million lines).
The firm’s ERD (Figure 1.1) shows that the firm has stores in different loca-
tions and of different sizes. More specifically, information about the Bibitor
stores, such as the store identification number (Store), address (City and
Location), as well as the size of each store in square feet (SqFt), is listed in the
table tStores. The first line of the stores table is shown below:
On each business day these stores generate many sales. Information about
each individual sales transaction, such as the sales identification number
(salesID), the store that made the sale (Store), the identification number
of the product sold (Brand), the salesPrice, salesQuantity, salesDate, and
exciseTax is captured in the table tSales. The first line from this table is
shown below:
Notice that in Figure 1.2 there is a line that connects the field Store from
the table tStores to the field Store from the table tSales. The line has the
number one (1) on the tStores side and the infinity sign (∞) on the tSales
side. This means that one observation (one store) on the tStores side can be
matched to a large number (many) sales transactions on the table tSales.1
1
The field Store in the table tStores is known as the primary key. The same field, when it
repeats in the table tSales, is known as the foreign key.
The third table (tProducts) has information about each product that
the firm sells. This includes the product identification number (Brand),
product Description, product Size and Classification 2 , and the identification
of the vendor (VendorNo) who supplies this product. The first line from the
tProducts table is shown below.
The line connecting the table tProducts to tSales starts and ends at the
product identification number (Brand). The line has 1 on the tProducts side
and ∞ on the tSales side. This means that Brand = 1, i.e., the product
number for Gekkeikan Black & Gold Sake appears only once in the table
tProducts, but the Brand = 1 appears many times, once for each sale of
the product, on the tSales table.
The last table in Figure 1.2 is the table tVendors. It captures the unique
identification number for each vendor (VendorNo), the vendor’s company
name (VendorName), and the type of vendor.3 The line connecting the table
tProducts to tVendors starts and ends at the vendor identification number
(VendorNo). The line has ∞ on the tProducts side and 1 on the tVendors
side. This means that VendorNo = 4425, i.e., the identification number
for MARTIGNETTI COMPANIES appears only once in the table tVendors
and many times - once for each product supplied by this vendor - on the
tProducts table.
Note: The complete R Script used for the creation of this chapter is avail-
able from the following URL: https://goo.gl/H2PHRW.
Import Data To import store data we are going to use the library data.table.
If this is the first time you are using this package, follow the directions in Appendix
A (p. 114) on how to install packages in R. Remember, we install the pack-
age once, but we have to load it using the function library() every time we
want to use it.
After the package has been loaded (library(data.table)), we use the
function fread() and specify the name of the file we want to import in
quotation marks, as follows:
> library(data.table)
> tStores <- fread("tStores.csv")
The data set has four variables: Store is the unique identifier for each
store/observation (primary key), City, Location, and SqFt.
Data Structure: We can get detailed information about the entire data
set (e.g., number of observations, variables, and format for each variable)
with the function str(...). The function str refers to data structure and
takes one argument, the name of the data set.
> str(tStores)
The output starts with a statement indicating the classification of the data
set as data.table and data.frame. You can think of a data.frame as the
equivalent of a spreadsheet in Excel. It allows you to store data in rows and
columns. The data.table is a data.frame that is designed to work with
relatively large data sets.4 The data set has 79 observations and 4 variables.
The variables Store and SqFt are integers (int), while the variables City
and Location are formatted as text (chr). Knowing the format of variables
matters when doing statistical analysis. For example, we can calculate the
average of numeric data, but we would instead create a frequency table (or pie
chart) when working with categorical data. Similarly, when performing data
mining analysis, certain methods require only numbers, while others
can work with a combination of numbers and text.
4
A more detailed discussion of the differences between the two classes and the advan-
tages of using data.table are beyond the scope of this introductory primer.
> head(tStores)
The visual inspection of data does not provide significant incremental infor-
mation above what has been provided by the str() function. As we will see
below (p. 12), the functions head(...) and tail(...) are useful when we
want to review data that has been ordered.
Both functions can take a second argument that specifies the number of
observations to be shown. With the script below, we limit the number of
observations to three.
> head(tStores,3)
> tStores[1:3,]
The main message from these examples is that when we work with data
sets we use the following format: data[rows, columns]. Using this format
we can generate the following combinations:
1. All rows and all columns by including just a comma inside the bracket:
data[,]
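For example, the following minimal sketch illustrates each combination of the
data[rows, columns] pattern using the tStores data set:

> tStores[1:3, ]      # specific rows, all columns
> tStores[, 2:4]      # all rows, specific columns
> tStores[1:3, 2:4]   # specific rows and specific columns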
Combine head() with subsetting Recall that the argument within the
function head is the name of the data set. This means that we can use any
subset that we have created as the argument within the function head. The
following example shows how we do this to achieve the same output as the
one shown in Example 3 above. Notice that the argument in the brackets
([,2:4]) is the data subset.
> head(tStores[,2:4],3)
> head(tStores[order(tStores$SqFt),2:4],3)
City Location SqFt
1: HORNSEY HORNSEY #3 1100
2: FURNESS FURNESS #18 2700
3: CESTERFIELD CESTERFIELD #64 2800
From this we can see that the three smallest Bibitor stores are 1100, 2700,
and 2800 square feet respectively.
Average We can use the function mean() to calculate the average store
size as follows:
> mean(tStores$SqFt)
[1] 7893.671
The average Bibitor store is 7893.671 square feet in size.
> summary(tStores$SqFt)
1.4.4 Graphs
The main objective of looking at summary statistics is to get a better un-
derstanding of the shape of the distribution. Alternatively, we can create a
histogram. The ggplot2 package allows for the creation of some very ad-
vanced graphs.5 The following example shows how to leverage this package
for creating a histogram. The resulting histogram is shown in Figure 1.3.
> library(ggplot2)
> storeSizeHistogram <- qplot(tStores$SqFt, geom="histogram")
> storeSizeHistogram
Refining the Graph The ggplot2 package gives us the ability to re-
fine the graph incrementally. Changing the graph incrementally or in layers
means that we can add more to an existing graph. In the following examples,
we make some very basic changes. The default label for the x-axis in Figure
1.3 is just the name of the variable (tStores$SqFt). With the following script,
we change the label for the x-axis.
> storeSizeHistogram <- storeSizeHistogram
+ xlab("Store Size (Sq.Ft)")
> storeSizeHistogram
As you can see, we can do this by saying that our updated graph is equal
to the existing graph plus a new piece of information. The piece that has
been added is the instruction that specifies the x-label as xlab("Store Size
(Sq.Ft)"). Please note that we include the label in quotation marks. The
updated version of the histogram is shown in Figure 1.4.
With the following script, we repeat this one more time to update the
label for the y-axis. The resulting new histogram is shown in Figure 1.5.
> storeSizeHistogram <- storeSizeHistogram
+ ylab("Count (Frequency)")
> storeSizeHistogram
In the above example, we have changed the labels in two separate stages.
However, this is not necessary. We could have made the two statements back-
to-back as follows:
> storeSizeHistogram <- storeSizeHistogram
+ xlab("Store Size (Sq.Ft)")
+ ylab("Count (Frequency)")
> storeSizeHistogram
The resulting new histogram is the same one shown in Figure 1.5.
1.5 Communication
The preface sets out types of matters that the auditor would consider in
evaluating, communicating and documenting ADA. The subsequent chapters
in this training material provide brief illustrations of how the auditor might
address these matters in the context of specific ADA examples.
Intro to SQL
Learning Objectives
Leveraging SQL (Structured Query Language) to extract appropriate data
from a large relational database and transform these data is one of the most
basic steps of data analytics. Our objective in this chapter is to learn how to
run queries with SQL.1 More specifically, by the end of this chapter trainees
should have learned how to:
1. Use basic SQL commands to write simple queries to create and review
subsets.
2. Use SQL code to create aggregated data by grouping based on at-
tributes of one or more variables.
3. Combine (merge) two tables based on a common field and create subsets
or aggregate data.
2.1 Background
SQL is a standardized programming language commonly used to manage
relational databases. The auditor uses SQL commands (queries and other
operations written as statements) in performing an ADA. These commands
enable the auditor to obtain and analyze particular subsets of data relevant
to achieving the objectives of the ADA being performed.
1
This chapter will provide a very high level introduction to SQL. Those interested in
getting more thorough training in SQL may want to consider one of the online training
courses, such as the one at https://www.w3schools.com/sql/default.asp.
> library(data.table)
> tStores <- fread("tStores.csv")
> str(tStores)
The data set has 79 observations (i.e., one for each of Bibitor’s stores) of the
4 variables noted in the left column of the str() output shown above.
Once the data set has been loaded, we can start using SQL commands to
extract (view) subsets of the data obtained. The advantage of SQL is that we
can generate most of the queries that we want with a handful of commands.
> library(sqldf)
> dt1_a <- sqldf("SELECT Store, City, Location, SqFt
FROM tStores")
> str(dt1_a)
table.* We can use table.* or simply the * sign to view all variables
from the table tStores as follows:
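A minimal sketch of such a query is shown below; the object name dt1_all is
hypothetical and used only for illustration.

> dt1_all <- sqldf("SELECT tStores.* FROM tStores")
> str(dt1_all)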
2.2.2 WHERE
If we want to limit our records to a subset that meets certain conditions we
can use the command WHERE. For example, Bibitor’s largest stores may have
operational characteristics relevant to the audit that are different from those
of its smaller stores. We may want to limit our records to stores larger than
19500 square feet. As we have seen on p. 13, this will return the two largest
stores.
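A minimal sketch of such a query, using the 19500 square-foot threshold noted
above (the list of selected fields is an assumption):

> sqldf("SELECT Store, City, Location, SqFt
FROM tStores WHERE SqFt > 19500")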
2.2.3 DISTINCT
Bibitor has stores in various cities, and in some cases, more than one store
in a particular city. Each city may have its own demographic, economic
or other characteristics relevant to the audit. Additionally, aspects of how
Bibitor operates may vary for operations in cities in which it has more than
one store. It may therefore be useful, for example, to identify the cities in
which Bibitor operates. We can do this using the DISTINCT command as
follows:
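A minimal sketch of such a query:

> sqldf("SELECT DISTINCT City FROM tStores")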
1. Create a query that lists the Bibitor stores in descending order of store
size.
2. Create a query that lists the Bibitor stores in ascending order of store
size.
2.3.1 GROUP BY
In our next query, we are going to explore the ability to aggregate data by
specific group. For example, in an audit of a company such as Bibitor, it
may be useful to count how many products we buy from each supplier or
how many stores are in each city. We create an aggregate query that will
provide us with store count in each city by using the command GROUP BY
as follows:
> dt1_f <- sqldf("SELECT DISTINCT(City),
count(Store) AS storeCount
FROM tStores GROUP BY City")
> str(dt1_f)
The results show that Bibitor has stores in 67 cities. Running a query that
returns DISTINCT values for the variable storeCount, we can see (below)
that the store count ranges from 1 to 4.
> sqldf("SELECT DISTINCT dt1_f.storeCount FROM dt1_f")
storeCount
1 1
2 2
3 3
4 4
City storeCount
1 DONCASTER 2
2 EANVERNESS 3
3 GOULCREST 2
4 HARDERSFIELD 2
5 HORNSEY 4
6 LARNWICK 2
7 MOUNTMEND 4
that have more than one store. Notice that the first constraint is a
WHERE while the second is a HAVING.
GROUP BY Store
With the following aggregate query, we specify that we want to create store
level variables for total (sum) of units/bottles of wine or alcohol sold
(salesQ_Store), average price per unit of products sold (avgPrice_Store), and
total (sum) revenues (revenue_Store).
> dt2_a <- sqldf("
SELECT Store, sum(SalesQuantity) AS salesQ_Store,
avg(SalesPrice) AS avgPrice_Store,
sum(SalesPrice*SalesQuantity) AS revenue_Store
FROM tSales GROUP BY Store")
> str(dt2_a)
'data.frame': 79 obs. of 4 variables:
$ Store : int 1 2 3 4 5 6 7 8 9 10 ...
$ salesQ_Store : int 576092 403023 33962 279623 ...
$ avgPrice_Store: num 15 16.1 17.3 13.9 12.9 ...
$ revenue_Store : num 6912704 6091406 436062 3201746 1322488 ...
Question: How can you verify that the above command has converted the
variable SalesDate from text(chr) to date?
The next step is to create a new variable month from the DATE formatted
variable tSales$SalesDate. The approach, as you can see below, is layered
like an onion and is best understood from the inside out. First, we use the
function format(...) to format/extract the month portion ("%m") of the
variable tSales$SalesDate as a date (as.Date). Second, we use the function
as.numeric(...) to convert the extracted values to numbers.7
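A minimal sketch of the nested call described above (the column name month is
the one used in the GROUP BY query later in this section):

> tSales$month <- as.numeric(format(as.Date(tSales$SalesDate), "%m"))
> str(tSales$month)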
The output above shows that the new variable that we created to capture
the month is numeric (num). For the first handful of observations it takes the
value of 7 (July).
GROUP BY Month
With the aggregate query below, we specify that we want to create the fol-
lowing monthly level variables:
• salesQ_Month, the total (sum) of units sold across all Bibitor stores.
• salesQ_Month_avgStore, the store average of total (sum) of units sold
across all Bibitor stores. In essence, this is the same as the variable
salesQ_Month divided by number of stores.
• avgPrice_Month, the average price per unit.
• revenue_Month, the total sales (revenues) across all Bibitor stores.
• revenue_Month_avgStore, the store average of total sales (revenues)
across all Bibitor stores. This is the same as revenue_Month divided
by number of stores.
> dt2_b <- sqldf("
SELECT month, sum(SalesQuantity) AS salesQ_Month,
sum(SalesQuantity)/count(DISTINCT(Store))
AS salesQ_Month_avgStore,
avg(SalesPrice) AS avgPrice_Month,
sum(SalesPrice*SalesQuantity) AS revenue_Month,
sum(SalesPrice*SalesQuantity)/count(DISTINCT(Store))
AS revenue_Month_avgStore
FROM tSales GROUP BY month")
> str(dt2_b)
3 10 3 47725 15.40777
4 10 4 45800 15.73450
5 10 5 51742 15.71040
6 10 6 50958 15.48979
7 10 7 52846 14.97397
8 10 8 48871 14.82037
9 10 9 45936 15.33874
10 10 10 55961 15.99462
11 10 11 58761 15.43290
12 10 12 73609 16.02871
store_revenue_Month
1 581298.0
2 573628.8
3 571602.2
4 561714.6
5 645376.5
6 630687.2
7 642664.0
8 565636.5
9 549645.3
10 691374.2
11 798268.3
12 962990.4
With the INNER JOIN we create a new table (InJn) that has only the
records where the b.Brand = a.Brand. As we can see from table 2.1, the
table InJn has only two observations corresponding to Product 2 and Product
5 which appear in both tables.
b a InJn
Brand Price Brand Price Brand b.Price a.Price
Product 1 10
Product 2 10 Product 2 12 Product 2 10 12
Product 3 11
Product 4 20
Product 5 25 Product 5 39 Product 5 25 39
When we perform a LEFT JOIN (table 2.2), the position of the table mat-
ters. The table listed first is left, and the table listed second is right. A LEFT
JOIN will return all records from the left table and the records from the right
table where the b.Brand = a.Brand.
As we can see from table 2.2 the file LfJn has all three records from the
left table (b), but only the two matching records (Product 2 and Product
5 ) from the right table (a). Notice, that in the table LfJn the a.Price for
Product 3 is missing (NA) because the product is not in table a.
b a LftJn
Brand Price Brand Price Brand b.Price a.Price
Product 1 10
Product 2 10 Product 2 12 Product 2 10 12
Product 3 11 Product 3 11 NA
Product 4 20
Product 5 25 Product 5 39 Product 5 25 39
> nrow(dt2_a)
[1] 79
> nrow(tStores)
[1] 79
Both tables have the same number of observations (79 stores) and a common
field (Store). Therefore, we should expect that creating a new table (dt3_a)
as an INNER JOIN between the two tables (dt2_a and tStores) would generate
the same number of observations.
Creating an INNER JOIN means that we need to specify the two tables that
will be linked (dt2_a INNER JOIN tStores), and the common field based on
which the two tables will be matched (ON dt2_a.Store=tStores.Store).8
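A minimal sketch of the corresponding query, based on the description above and
on the output discussed next:

> dt3_a <- sqldf("SELECT dt2_a.*, City, SqFt
FROM dt2_a INNER JOIN tStores ON dt2_a.Store=tStores.Store")
> str(dt3_a)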
The above output shows that the new table has 79 observations (stores), all
variables from dt2_a (SELECT dt2_a.*), as well as the variables City and
SqFt from the tStores table.
8
In R if we want to specify that we want to use a specific variable x from a data set
dt, we do this by using the $ as follows: dt$x. If we want to do the same in SQL we
use the period (.) to separate the data set from the variable (i.e., dt.x). To avoid error
messages, we need to make sure that our variables have names that are compatible with
SQL notation. If a table has a column/field with a name that is not a single word, i.e., it
contains a period in its name, we need to replace it.
[1] 79
> names(tStores)
The table dt1_g (see p. 24) has 7 observations (cities that have more
than one store), and two variables (City and storeCount).
> nrow(dt1_g)
[1] 7
> names(dt1_g)
Start with the INNER JOIN Creating the inner join of these two tables
shows that the new data set has 19 observations. These are the 19 stores
which are in cities that have more than one store.9
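A minimal sketch of such an inner join (the object name dt3_b1 is an assumption):

> dt3_b1 <- sqldf("SELECT tStores.*, storeCount
FROM tStores INNER JOIN dt1_g ON tStores.City=dt1_g.City")
> nrow(dt3_b1)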
In a LEFT JOIN the position of the table to the left or right matters. In
the following example, we want all records from the table tStores and only
the matching records from the table dt1_g. This means that the tStores will
have to be on the left and dt1_g on the right (tStores LEFT JOIN dt1_g).
The rest of the query is the same as the one for the INNER JOIN.
> dt3_b2 <- sqldf("SELECT tStores.*, storeCount
FROM tStores LEFT JOIN dt1_g ON tStores.City=dt1_g.City")
> str(dt3_b2)
The new data set has the same number of observations as the table tStores.
Please notice that some of the entries for storeCount are missing (NA). These
are the entries corresponding to cities that have only one store. Remember
that dt1_g contains only the cities with more than one store.
CASE
WHEN ... THEN ...
WHEN ... THEN ...
ELSE ...
END
In the following example, we use this approach to add a new variable mar-
ketSize to the data set dt3_b2.
10
The advantage of the CASE ... END versus the typical ifelse statement is that with
the CASE ... END we can specify as many alternatives as we want. We can do this by
simply adding more WHEN ... THEN ... lines in our query.
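A minimal sketch of such a query is shown below; the category labels and the
storeCount thresholds are assumptions chosen only for illustration.

> dt3_b2 <- sqldf("SELECT dt3_b2.*,
CASE
WHEN storeCount >= 3 THEN 'large market'
WHEN storeCount = 2 THEN 'medium market'
ELSE 'single-store market'
END AS marketSize
FROM dt3_b2")

Note that cities with only one store have a missing (NULL) storeCount after the
LEFT JOIN, so they fall into the ELSE branch.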
City storeCount
1 HORNSEY 4
2 MOUNTMEND 4
Learning Objectives
By the end of this chapter trainees should have learned how to:
1. Detect statistical outliers using the interquartile range (IQR) approach
and/or boxplot.
2. Create a subset that captures detected outliers for further analysis.
3. Analyze aggregate data for outliers within groups (i.e., create side-by-
side boxplots).
4. Create time-series graphs and recognize seasonal patterns.
very small are called outliers. In this section, we will learn how to leverage
R in order to detect outliers in a data set using the interquartile range and
visualizing them with the boxplot.
Interquartile range (IQR) is the difference between the third quartile (Q3) and
the first quartile (Q1). That is, the IQR measures the range of data values
in the middle of the population. Based on Table 3.1,1 we have that the
IQR = Q3 − Q1 = 10200 − 4000 = 6200.
> library(data.table)
> tStores <- fread("tStores.csv")
> names(tStores)
> summary(tStores$SqFt)
1
The table is based on summary statistics generated in Chapter 1 (p. 13)
2
The complete R Script used for the creation of this chapter is available from the
following URL: https://goo.gl/QfEXDa
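The whisker bounds used later in this section (lw and uw) can be computed from
the quartiles. A minimal sketch, assuming the conventional 1.5 × IQR rule:

> q <- quantile(tStores$SqFt, probs = c(0.25, 0.75))
> iqr <- q[2] - q[1]       # interquartile range (Q3 - Q1)
> lw <- q[1] - 1.5 * iqr   # lower whisker bound
> uw <- q[2] + 1.5 * iqr   # upper whisker bound
> c(lw, uw)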
3.1.2 Boxplot
A boxplot is a way of graphically showing data in their quartiles, including the
"whiskers" as noted above (i.e., indicators of the variability of data beyond
the upper quartile and below the first quartile). Relevant commands for
creating a boxplot leveraging the package ggplot2 are set out below.
First, the function ggplot specifies the data set to be used (tStores) and
the aesthetic mappings (aes) used to describe how variables in the data are
mapped to visual properties in the graphic. Second, we specify the type of
graph to be made (i.e., geom_boxplot).
> library(ggplot2)
> storeSizeBPlot <- ggplot(tStores, aes(x=Store, y=SqFt)) +
geom_boxplot()
We can refine our graph, by using the argument outlier.color within
the function geom_boxplot to specify the color of observations in the graph,
which are outliers. Figure 3.1 shows the resulting boxplot for Bibitor store
size.
> storeSizeBPlot <- storeSizeBPlot +
geom_boxplot(outlier.color = "red")
> storeSizeBPlot
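The ifelse step referenced in the next paragraph is not reproduced above; a
minimal sketch, assuming lw and uw hold the whisker bounds, is:

> tStores$sqftOutlier <- ifelse(tStores$SqFt < lw | tStores$SqFt > uw, 1, 0)
> tStores[tStores$sqftOutlier == 1, ]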
We use the new variable to create and view the subset that returns only the
outliers (tStores[tStores$sqftOutlier == 1 , ]) above. Alternatively,
we can simply use the condition from the function ifelse to generate the
subset.
3
In R the vertical line (|) indicates the OR and the ampersand (&) the AND.
> tStores[tStores$SqFt<lw|tStores$SqFt>uw,]
The first method (creating a new variable) is useful if there are a lot of
outliers/exceptions and we need to perform further statistical analysis to
understand patterns or common themes across the entire data set of outliers.
The second is more useful when dealing with just a handful of observations,
and a simple visual review would be enough to see what is going on.
> library(sqldf)
> dt1_a <- sqldf("
SELECT Store, sum(SalesQuantity) AS salesQ_Store,
avg(SalesPrice) AS avgPrice_Store,
sum(SalesPrice*SalesQuantity) AS revenue_Store
FROM tSales GROUP BY Store")
> names(dt1_a)
> summary(dt1_a[,2:4])
Interestingly, the largest volume of sales (units sold) was not generated
by one of the largest stores in size. The largest volume (1623158 bottles) came
from a store in the second quartile (Q2) in terms of store size.
Consistent with what we have seen in Figure 3.2, there are four obser-
vations (one red dot from Q2 and three from Q4) that are higher than the
max volume generated by one of the two largest stores. More specifically,
the max volume of store #76 from Q2, and the max volumes of stores #73,
#34, and #38 from Q4, are higher than the max volume of store #66 which
is the second largest store. The sales volume of the largest store (#49) is not
in the top ten list.
> head(dt1_b[order(-dt1_b$salesQ_Store),],10)
storeSize
76 Q2
73 Q4
34 Q4
38 Q4
66 sqftOutlier
67 Q3
69 Q2
50 Q2
60 Q3
15 Q4
> str(tSales$SalesDate)
Using the function head() we review the aggregate sales data for the first
six months.
> head(dt2_a)
revenue_Month revenue_Month_avgStore
1 29854028 377899.1
2 28876607 365526.7
3 28988412 366941.9
4 30723735 388908.0
5 36041211 456217.9
6 39290701 497350.7
> summary(dt2_a[,2:6])
revenue_Month revenue_Month_avgStore
Min. :28876607 Min. :365527
1st Qu.:30506308 1st Qu.:386156
Median :37234433 Median :474440
Mean :36755959 Mean :467950
3rd Qu.:40376265 3rd Qu.:517644
Max. :48769674 Max. :617338
Mental Math Problem A data analyst glanced through these data and
said that: There are no extreme outliers in these data. Can you validate
whether this statement is correct? Are there extreme outliers in these vari-
ables? Remember, the objective of mental math is to establish extreme
outliers. This means that you should round generously.
1. Within the ggplot we specify the data set (dt2_a) and in the aes
argument we specify just the time (x=month).
2. We add (+) the function geom_point, which means that we want the
see the data points. Within the function and using aes we specify the
y-variable (y=salesQ_Month).
3. We add (+) the function geom_line, which means that we want a line
to connect the data points. Within the function and using aes we
specify the y-variable (y=salesQ_Month).
4. The x-axis (month) is numeric. To avoid breaks that may include
decimal points, we add (+) the function scale_x_continuous, and we
specify that we want the values to go from 1 to 12, in increments of 1.
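A minimal sketch of a script following the four steps above (the object name
g3_salesQ_Month is an assumption):

> g3_salesQ_Month <- ggplot(dt2_a, aes(x=month)) +
geom_point(aes(y=salesQ_Month)) +
geom_line(aes(y=salesQ_Month)) +
scale_x_continuous(breaks = seq(1, 12, by = 1))
> g3_salesQ_Month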
The resulting graph is shown in Figure 3.3. From this we can see that
Bibitor’s volume of units sold has a spike in July (7) and in December (12).
The volume follows an upward trend from January through July (1-7) and
seems to remain relatively stable at an average level from August to Novem-
ber (8-11).
store_revenue_Month
1 496657.2
2 478458.7
3 549330.8
4 488541.6
5 523794.3
6 570393.5
> summary(dt2_b[,3:5])
store_salesQ_Month store_avgPrice_Month store_revenue_Month
Min. : 2245 Min. :12.36 Min. : 29427
1st Qu.: 15983 1st Qu.:14.04 1st Qu.: 190772
Median : 26191 Median :14.66 Median : 324252
Mean : 34256 Mean :14.89 Mean : 467732
3rd Qu.: 40543 3rd Qu.:15.42 3rd Qu.: 531786
Max. :202865 Max. :18.91 Max. :3214605
Mental Math Problem A data analyst glanced through these data and
said that: There seem to be extreme outliers in volume and revenue but not
in price. Can you validate whether this statement is correct?
store_revenue_Month
589 908139.8
590 871620.8
591 947189.6
592 974966.6
593 1057386.3
594 1119092.8
595 1459538.6
596 1119384.4
597 934588.4
598 1184220.2
599 1601722.9
600 1781116.5
The resulting Figure 3.4 shows that store #76 (blue) is experiencing a much
bigger spike in the summer months (around July) than both other stores.
As the final step, we would like to create a graph that lets us compare
the time series for these stores with monthly volume for the average Bibitor
store. These data have been captured in variable salesQ_Month_avgStore
in the data set dt2_a (See section 3.4.1).
Taking advantage of the incremental approach used to create ggplot graphs, we
can do this by adding to the existing graph g4_store_salesQ_Month. More
specifically, we add the functions geom_point and geom_line and, within them,
we specify the new data set dt2_a. This way R knows the source of the
y-variable. In addition, since we are adding only one series, we
explicitly specify that ggplot should assign a color/label for these data with
the following statement: color="avgStore".
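A minimal sketch of the corresponding script, based on the description above:

> g4_store_salesQ_Month <- g4_store_salesQ_Month +
geom_point(data=dt2_a, aes(y=salesQ_Month_avgStore, color="avgStore")) +
geom_line(data=dt2_a, aes(y=salesQ_Month_avgStore, color="avgStore"))
> g4_store_salesQ_Month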
Figure 3.5: Monthly Units Sold - Average Per Q2 Store vs. Average of All
Stores
ggplot(dt1_b,aes(storeSize, revenue_Store)) +
geom_boxplot(outlier.color = "green")
Basic ADA
Learning Objectives
By the end of this chapter trainees should have achieved the following objec-
tives:
• All sales are cash or credit card sales. The company does not have
trade accounts receivable.
• The company’s headquarters are in the State of Lincoln (a fictional
state used for illustrative purposes) and it has 79 stores located in var-
ious cities throughout the state. In common with other states, Lincoln
has extensive and rigorously enforced laws and regulations regarding
the sale of liquor. For example, Bibitor is not permitted to ship inven-
tory out of state.
• The company has been selling spirits and wine products for over 50
years. The business has been successful to date. In the audit of the
preceding year, the auditor did not identify any material uncertainty
regarding Bibitor’s ability to continue as a going concern.
• During the fiscal year under audit, the company had over 10,000 brands
available for sale. Most of the products are low to medium-priced and
intended to appeal to a wide range of consumers (all of whom are
required to be over the legal drinking age). Bibitor also stocks limited
quantities of high-priced spirits and wines. Some of these brands are
individually priced at thousands of dollars.
We load and review the sales data. For a review of the commands used and
the interpretation of results see section 2.3.4 (p. 25).1
> library(data.table)
> tSales <- fread("tSales.csv")
> names(tSales)
> names(tSales)[1] <- "salesID"
> str(tSales)
> library(sqldf)
> tSalesByProduct <- sqldf("
SELECT Brand, sum(SalesQuantity) AS sumQSales,
avg(SalesPrice) AS avgPrice, max(SalesPrice) AS maxPrice,
min(SalesPrice) AS minPrice,
(max(SalesPrice)-min(SalesPrice)) AS rangePrice
FROM tSales GROUP BY Brand")
> names(tSalesByProduct)
> head(tSalesByProduct,5)
> summary(tSalesByProduct$avgPrice)
> summary(tSalesByProduct[,2:6])
minPrice rangePrice
Min. : 0.00 Min. : 0.000
1st Qu.: 9.99 1st Qu.: 0.000
Median : 14.99 Median : 0.000
Mean : 31.78 Mean : 3.537
3rd Qu.: 27.95 3rd Qu.: 3.000
Max. :4999.99 Max. :13967.910
The above statistics would be reviewed in the context of the auditor’s under-
standing of the company’s operations obtained in previous years’ audits. For
example, in this case, the auditor likely would not be surprised at the wide
range between maximum and minimum prices. This is because Bibitor’s
products have traditionally included very small sample bottles, costing pen-
nies as well as individual bottles of very expensive whiskies and wines. If
the statistics did not show this wide variation, then this would not be con-
sistent with the auditor’s understanding of the company’s operations and
would warrant further attention. In this case, the existence of “0” sales
prices would warrant investigation to determine the reason why zero prices
exist.
> tSalesByProduct[tSalesByProduct$rangePrice>100,]
What might be the implications of wide price ranges? A wide price range,
particularly one as wide as that of brand 2696, may be an indication of
a pricing error. If the error is the result of a control deviation, there could be a
significant number of other pricing errors.
• sum(onHand) AS totalQoH
With the following R script (SQL query) we create the new variables and
save them in a new data set named tEndInvByProduct.
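A minimal sketch of such a query is shown below; only avgPPrice and totalQoH are
confirmed by the surrounding text, and the other aggregates are assumptions that
mirror the earlier sales query.

> tEndInvByProduct <- sqldf("
SELECT Brand, avg(PPrice) AS avgPPrice, max(PPrice) AS maxPPrice,
min(PPrice) AS minPPrice, sum(onHand) AS totalQoH
FROM tEndInv GROUP BY Brand")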
> str(tEndInvByProduct)
Review the top (head) and bottom (tail) observations of the tEndInvByProduct
data set.
totalQoH
1 369
2 31
3 12
4 466
5 401
6 5
totalQoH
8408 49
8409 169
8410 45
8411 47
8412 296
8413 372
> summary(tEndInvByProduct[,2:7])
4.4 Modeling
Given our 3rd objective for our ADA stated in section 4.2, we need to focus
on products with average cost (acquisition cost) higher than average selling
price.3 This means that we need to compare average selling price with average
acquisition cost for each product available in the end of year inventory. To
make this comparison we will need to combine information from two separate
data sets (tables).
First: we need the average selling price from the table tSalesByProduct.
The following command shows the name of the variable for selling price in
tSalesByProduct is avgPrice.
> names(tSalesByProduct)
Second: we need the average acquisition cost from the table tEndInvByProduct.
The name of the variable for acquisition cost is avgPPrice.
> names(tEndInvByProduct)
Third: we need to create a new data set that has all observations from the
inventory table (tEndInvByProduct) but only the matching records from
the sales table (tSalesByProduct). In SQL this type of join is called a LEFT
JOIN.
CASE
WHEN avgPPrice<=avgPrice THEN 0
ELSE avgPPrice-avgPrice
END
The query below can be divided into the following steps:
1. We perform a LEFT JOIN. The table tEndInvByProduct, end of year
inventory data, is positioned on left with an alias of b. The table
tSalesByProduct, sales data, is positioned on right with an alias of a.
2. The two tables are linked based ON the common field a.Brand=b.Brand.
3. We select the fields b.Brand, avgPPrice, avgPrice, sumQSales, and
totalQoH.
4. We use the CASE ... END to create the target variable deltaCostPrice.
5. Since the summary statistics have shown that there are some products
that have an average selling price of zero, we limit our data to products
with a positive price by specifying: WHERE avgPrice>0.
SQL Query
> dt_comp <- sqldf("SELECT b.Brand, avgPPrice, avgPrice,
sumQSales, totalQoH,
CASE
WHEN avgPPrice<=avgPrice THEN 0
ELSE avgPPrice-avgPrice
END AS deltaCostPrice
FROM tEndInvByProduct AS b LEFT JOIN tSalesByProduct AS a
ON a.Brand=b.Brand
WHERE avgPrice>0")
> nrow(dt_comp)
[1] 8181
To explore if any of the 8181 observations of the data set are associated
with products where the average inventory cost (acquisition cost) is higher
than the average selling price, we look at the descriptive statistics.
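A minimal sketch of the review and flagging steps, assuming products are flagged
when deltaCostPrice is positive (i.e., average cost exceeds average selling price):

> summary(dt_comp$deltaCostPrice)
> dt_compFlag <- dt_comp[dt_comp$deltaCostPrice > 0, ]
> nrow(dt_compFlag)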
[1] 10
> dt_compFlag
4.6 Documentation
An example audit working paper showing documentation for this ADA can
be accessed from the following url:
https://docs.google.com/document/d/1Xeq6ignMMxMgVT6Dg-6IINL8aLOHaWHMKvZtOAHaKpQ/
edit?usp=sharing.
Order data by purchase price (descending) and review the set of 15 observa-
tions.
> head(tPurchases[order(-tPurchases$PurchasePrice),2:7],15)
Notice that the most expensive product sold (Brand=2696) is not included
in the list of purchased products!
Order data in terms of inventory cost (Price descending) and review the
first 15 observations.
> head(tBegInv[order(-tBegInv$PPrice),2:6],15)
Notice that the most expensive product sold (Brand=2696) is not included
in the list of beginning inventory products!
Order data in terms of inventory cost (Price descending) and review the
first 15 observations.
> head(tEndInv[order(-tEndInv$PPrice),2:6],15)
Notice that the most expensive product sold (Brand=2696) is not included in
the list of ending inventory products!
ExciseTax
1: 1.84
2: 1.84
3: 1.84
4: 1.84
5: 1.84
6: 1.84
Notice that the same product was sold on the same date in six different stores.
> tPurchases[tPurchases$Brand==2696,2:7]
The data set shows that there were 650 purchase line items and they seem
to be priced at just $25.19. To get a better feel for the distribution
of the purchase prices, we run summary statistics for just these observations.
Notice that the variable is specified using the $ sign after the closing
square bracket.
> summary(tPurchases[tPurchases$Brand==2696,]$PurchasePrice)
Summary statistics show that the purchase price is constant at $25.19. All
descriptive statistics are the same.
> tBegInv[tBegInv$Brand==2696,2:6]
> summary(tBegInv[tBegInv$Brand==2696,]$Price)
Summary statistics show that the inventory cost for product 2696 is constant
at $25.19 across all stores.
> tEndInv[tEndInv$Brand==2696,]
> summary(tEndInv[tEndInv$Brand==2696,2:6]$Price)
Summary statistics show that the inventory cost for product 2696 is constant
at $25.19 across all stores.
Payroll
Learning Objectives
By the end of this chapter trainees should have learned how to design, per-
form, evaluate, and communicate the results of ADA to obtain an under-
standing of an entity’s payroll expenses and assess risks of material misstate-
ment of those expenses. This chapter builds on matters noted in chapters 1
– 3, by discussing how to use R including SQL queries, to:
> library(data.table)
> dtPayroll <- fread("tPayroll.csv")
> names(dtPayroll)
Format Dates
The review of payroll data has shown that dates are formatted as characters
(chr). This is going to be problematic if we want to use these dates for the
creation of graphs and data aggregation. Following directions from section
2.4, we convert these two variables to as.Date format and review the new
structure as follows:
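A minimal sketch of the conversion, assuming the raw dates are stored in
YYYY-MM-DD form (otherwise a format argument would be needed):

> dtPayroll$BegDate <- as.Date(dtPayroll$BegDate)
> dtPayroll$EndDate <- as.Date(dtPayroll$EndDate)
> str(dtPayroll)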
...
$ BegDate : Date, format: "2015-07-01" ...
$ EndDate : Date, format: "2015-07-15" ...
...
Verify that beginning payroll dates are within the fiscal year
We can verify this by ordering our data in ascending and descending order of
payroll beginning dates (BegDate) and viewing the top three observations,
as follows:
3
FICA stands for Federal Insurance Contributions Act.
> head(dtPayroll[order(dtPayroll$BegDate),3:9],3)
> head(dtPayroll[order(-dtPayroll$BegDate),3:9],3)
The 3 rows in the first table above show data for the earliest payroll
period included in the dataset. As we would expect, the first pay period is
for the first two weeks in July 2015. The employee numbers for the first 3
employees paid in the period (by employee ID number) are noted. The Salary
column shows the total salary for the respective employees for the year. The
amount paid (again as we would expect) is the annual salary divided by 24
(the number of pay periods during the year).
Since these are salaried employees, the rate per hour and hours worked
columns are blank. The 3 rows in the second table above show similar data
for the last payroll period included in the dataset. The observations in the
two tables show that the first payroll period starts on July 1, 2015 and the
last payroll period ends on June 30, 2016. Therefore, the data being used for
the ADA all fall within Bibitor’s 2016 fiscal year.
Are there any employees with more than one paycheck in one pay
period?
Early in performing the ADA, it may be useful to determine if there is any em-
ployee who has obtained more than one paycheck in a bi-weekly pay period.
> library(sqldf)
> sqldf("SELECT PayPeriod, empID, count(DISTINCT empID)
FROM dtPayroll GROUP BY PayPeriod, empID
HAVING count(empID)>1")
The query generates zero rows, which means that in our data set there are no
employees who received more than one paycheck in one period.
> str(dtEmployees)
Based on the above, Bibitor has 1120 employees and for each employee
we have the following information: unique identification number, first and
last name, their store, and title. Please note that the firm uses store zero (0)
to indicate the firm’s headquarters.
Similar to the approach taken above regarding payroll dates, we can de-
termine that our data cover all locations (0 to 79) by ordering the
data. The resulting tables are set out below.
> head(dtEmployees[order(dtEmployees$Store),],3)
> head(dtEmployees[order(-dtEmployees$Store),],3)
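The HQ-versus-stores headcount shown next might be produced with a query along
the following lines (a sketch; store 0 denotes headquarters, as noted above):

> sqldf("SELECT CASE WHEN Store = 0 THEN 'HQ' ELSE 'stores' END AS location,
count(*) AS emplCount
FROM dtEmployees GROUP BY location")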
location emplCount
1 HQ 61
2 stores 1059
> nrow(dtStores)
[1] 79
> nrow(dt1)
[1] 13557
Second, we will add the store size (SqFt) from the store file into the combined
file (dt1 ) created in the previous step.
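A minimal sketch of this step, assuming the store data are held in dtStores and
share the Store field with dt1:

> dt1 <- sqldf("SELECT dt1.*, SqFt
FROM dt1 LEFT JOIN dtStores ON dt1.Store=dtStores.Store")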
> nrow(dt1)
[1] 13557
Salary SalaryWagesPerPP
Min. : 10346 Min. : 431.1
1st Qu.: 39780 1st Qu.:1657.5
Median : 45950 Median :1914.6
Mean : 47687 Mean :1986.9
3rd Qu.: 52290 3rd Qu.:2178.8
Max. :107823 Max. :4492.6
> summary(dt1[dt1$status=="hourlyEmpl",c(6,8,9)])
The results shown in the above tables provide information that may be
useful in identifying outliers. For example, based on the results of audits of
previous years, and updated information obtained by inquiries of management
in the current year, the auditor would have expectations regarding what the
summary statistics should look like. If the statistics obtained were to differ
significantly from what the auditor expected, the auditor would perform
further procedures to determine why there are significant variances from
expectations.
This output also provides information on the accuracy of the calculation
of payroll. For example, for salaried employees, the total salary in each row in
the first table divided by the number of pay periods (24) equals the amount
paid in each pay period. For wage employees, in row 1 of the second table,
the minimum hourly rate ($11.67) times the minimum hours (6.0) equals the
minimum amount paid in a pay period ($70.02). Similarly, in the last row,
the maximum hourly rate ($17.98) times the maximum hours (88.0) equals
the maximum amount paid in a pay period ($1,582.24).
The new data set has 963 observations. An analysis of staff by position
would show that these represent the 884 retail store clerks and 79 receivers
who are paid hourly (see Practice Problem 5.3.2). There are 96 retail store
managers who are salaried employees. Therefore, the total number of staff
located in stores is 1059 (963 + 96), consistent with our earlier analysis.
> summary(dt1_a[,2:4])
The results show that the distribution of hours worked is widely spread
and reflect two groups. Employees in the top quartile seem to work full time
(around 40 hours/week times 50 weeks produces around 2,000 hours per
year). The remaining hourly employees seem to work one or two weeks per
year. This most likely occurs during the weeks of peak sales (e.g., Christmas).
This same pattern is reflected in total wages paid.
emplCount totalSalaries
1 157 7466950
The total of wages and salaries paid per our analysis is $17,601,381
($10,134,431 + $7,466,950). If we were to perform similar calculations for the
various payroll deductions, we might find that they total $6,164,722. In that
case, our total calculated payroll expense would be $23,766,103 which would
agree with the amount shown as “personnel services expense” in Bibitor’s
2016 income statement. If the amounts did not agree, the auditor would
determine why that was the case. However, this ADA would provide only
some limited evidence of the accuracy of the process to sum payroll amounts
and record them in the GL. Evidence from other procedures, including ADA,
would be needed to address the various assertions related to payroll. For ex-
ample, as noted earlier, procedures would need to be performed to verify the
reliability of the data used in the payroll process.
the rest of the year because of longer store hours to accommodate increasing
demand for products in the holiday season. If this or another anticipated
fluctuation is not reflected in the results, the auditor would perform further
procedures to obtain information on why this occurred.
We create a new data set that aggregates (GROUP BY) payroll information,
such as total payments and employee count, at the period and status level,
as follows:
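A minimal sketch of such an aggregation; the column names (BegDate, status,
totalPmts, emplCount) match those used in the summaries and graphs below, while
the exact grouping fields are an assumption.

> dt1_b <- sqldf("SELECT BegDate, status,
sum(SalaryWagesPerPP) AS totalPmts,
count(*) AS emplCount
FROM dt1 GROUP BY BegDate, status")
> # if sqldf returns BegDate as a number, restore the Date class:
> dt1_b$BegDate <- as.Date(dt1_b$BegDate, origin = "1970-01-01")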
> summary(dt1_b[dt1_b$status=="hourlyEmpl",3:4])
totalPmts emplCount
Min. :404182 Min. :368.0
1st Qu.:410950 1st Qu.:372.0
Median :412980 Median :380.0
Mean :422268 Mean :408.3
3rd Qu.:420146 3rd Qu.:404.2
Max. :535426 Max. :713.0
> summary(dt1_b[dt1_b$status=="salariedEmpl",3:4])
totalPmts emplCount
Min. :310087 Min. :156.0
1st Qu.:310087 1st Qu.:156.0
Median :311863 Median :157.0
Mean :311123 Mean :156.6
3rd Qu.:311863 3rd Qu.:157.0
Max. :311863 Max. :157.0
> library(ggplot2)
> g4_emplCount <-
ggplot(dt1_b[dt1_b$status=="hourlyEmpl",], aes(x=BegDate)) +
geom_point(aes(y=emplCount))+
geom_line(aes(y=emplCount))+
xlab("Beginning Date")+ylab("Employee Count")+
scale_x_date(date_breaks = "1 month")
> g4_emplCount
The graph (Figure 5.2) shows that the number of hourly employees spikes
in July and December. This pattern is consistent with the prior analysis of
sales data (see Figure 3.3).
The resulting graphs can be accessed from the following link: https://goo.
gl/yw1UjG. The auditor would look at the resulting graphs (and similar
graphs generated for the other stores) to identify those stores, if any, having
unusual employee counts. For example, these might be employee counts
12
See section 1.4.1 to review the instructions on how to create subsets.
that are significantly higher or lower than those of other stores. Also, some
stores might have employee counts that vary significantly from what might
be considered a normal pattern during the pay period cycle. For example,
looking at the graph generated for stores 1-20, we might expect that virtually
all stores would hire more staff (i.e., increase the employee count) in the peak
of the summer season and near the end of the calendar year (that for many
people is the “Holiday Season”). It would be reasonable to expect increased
consumption of alcoholic beverages in those periods compared to other times
during the year, with a resulting need to hire more hourly staff to help
maintain a high quality of customer service.
Looking at the graphs generated for each store, we see, for example,
that virtually all the stores hire more staff at the start of the year and for
the holiday season. However, there are exceptions. For example, store 3
only increases its staff complement in autumn and store 12 does not hire
more staff for the holiday season. There are legitimate reasons why such
exceptions may occur. For example, the community served by these two
stores may consist mainly of people who do not participate in the “holiday
season.” On the other hand, there may, for example, be a risk that these stores
do, in fact, increase their staff complement in the holiday season, but that this
is done by means of cash payments made outside the normal payroll system.
This may indicate matters that are qualitatively material (e.g., violations of
employment, tax and other laws). The auditor may therefore make inquiries
about the reasons for these (and other unusual trends identified) and obtain
evidence to support, or contradict, management’s responses to the inquiries.
A similar analysis can be performed on total payments per store. Total
payments will not move exactly in line with employee counts, because staff
have different pay grades (e.g., some stores may, on average, have more
senior employees than other stores).
With the following script we generate the graphs for stores 1 through
20. The resulting graphs can be accessed from the following link: https:
//goo.gl/o7bB2F.
> g4_totalPmtsStore <-
ggplot(dt1_c[dt1_c$Store<=20,], aes(x=BegDate)) +
geom_point(aes(y=totalPmts))+
geom_line(aes(y=totalPmts))+
xlab("Beginning Date")+ylab("Total Payments")
> g4_totalPmtsStore+facet_wrap(~Store, nrow = 5)
As might be expected, the patterns for stores 3 and 12 are similar to those
noted in the analysis of bi-weekly hours per store. In addition, it becomes
apparent from looking at both bi-weekly hours graphics and wages graphics
that the number of hourly staff, and consequently wages paid, differs signifi-
cantly among stores. For example, the wages paid, and the number of employees,
for store 15 are significantly greater than for store 18. This may relate to store
size (as we explore in another ADA below). But such variations may be
indicative of other issues. For example, employees at a store may be paid
at rates above those authorized. So again, the auditor may make inquiries
about the reasons for these (and other unusual trends identified) and obtain
evidence to support, or contradict, management’s responses to the inquiries.
Relating payroll cost to store size may be useful to account, at least in part,
for the effect that store size has on
payroll. For example, the payroll expense per square foot for one or more
stores may be significantly different from what the auditor anticipates, based
on knowledge of Bibitor’s business. This might warrant more audit work on
payroll expense for these stores.
Leveraging prior knowledge from section 3.3.2, we classify stores in terms
of store size (SqFt). We create a new data set (dt1_d) that has all information
from dt1_c (i.e., store-level payroll data for each pay period), as well as the
variable that captures store size.
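A minimal sketch of how dt1_d might be created is shown below; the store-size
table dtStores and its columns are assumed names based on section 3.3.2, not
necessarily the exact objects used there.
> # join store size (SqFt) onto the store-level payroll data
> dt1_d <- merge(dt1_c, dtStores[, c("Store", "SqFt")], by = "Store")
> # payroll expense per square foot, by store and pay period
> dt1_d$pmtsPerSqFt <- dt1_d$totalPmts / dt1_d$SqFt
The variable pmtsPerSqFt is added here only for illustration; it allows payroll
expense to be compared across stores of different sizes.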
5.6 Documentation
An example audit working paper showing documentation for this ADA is located at
https://docs.google.com/document/d/1g0j6rNluvhvulwXczbs9SVfeUdk7D-4Omqe1Ul6P9/edit?usp=sharing.
The following commands sort the payroll records so that the smallest and
largest values of key fields (EndDate, Salary, RatePerHour, Hours) can be
reviewed as a check on the data:
> head(dtPayroll[order(dtPayroll$EndDate),3:9],3)
> head(dtPayroll[order(-dtPayroll$EndDate),3:9],3)
> head(dtPayroll[order(dtPayroll$Salary),3:9],3)
> head(dtPayroll[order(-dtPayroll$Salary),3:9],3)
> head(dtPayroll[order(-dtPayroll$RatePerHour),3:9],3)
(partial output)
1: 70.02 6
2: 81.69 7
3: 105.03 9
> head(dtPayroll[order(-dtPayroll$Hours),3:9],3)
A frequency count by job title (partial output):
2 1
IT SPECIALIST LEGAL ASSISTANT
2 1
LEGAL SECRETARY PAYROLL OFFICER
1 1
REGIONAL STORE SUPERVISOR Retail Store Clerk I
7 629
Retail Store Clerk II Retail Store Clerk III
121 134
Retail Store Manager I Retail Store Manager II
14 13
Retail Store Manager III Retail Store Manager IV
30 39
SENIOR ACCOUNTANT Store Receiver
4 79
TRAINING AND DEVELOPMENT TRANSPORTATION SPECIALIST
4 1
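The job-title counts above could have been produced with R's table() function.
A minimal sketch, assuming the titles are stored in a column named JobTitle of
an employee table dtEmp (both names are assumptions, not necessarily those used
in the Bibitor data):
> # tabulate how many records fall under each job title
> table(dtEmp$JobTitle)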
Scripts analogous to the one shown for stores 1 through 20 generate the
graphs for the remaining store groups. The resulting graphs can be accessed
from the following links:
Stores 41 through 60: https://goo.gl/zanEjm
Stores 61 through 79: https://goo.gl/YpyThd (see also https://goo.gl/531ruu)
Further graphs can be accessed from the following links: https://goo.gl/mhwP7M,
https://goo.gl/DE5nRW, https://goo.gl/B7MW8J, https://goo.gl/neBN4R,
https://goo.gl/gnoYiz, https://goo.gl/TT9Lqm, https://goo.gl/QVGqqK, and
https://goo.gl/A8qwGF.
Appendix A
Introduction to R
A.1 Introduction
R is a very powerful and versatile software package. In very general terms,
R can be used to extract, transform, and analyze structured (e.g., sales and
inventory data from a company’s enterprise system) and unstructured data
(e.g., Tweets, emails, blogs).
While the list of R applications is very extensive, here are some examples
of what you will learn in this introduction to ADA: running SQL queries in
order to extract and transform data; organizing data (e.g., pivot tables); and
running simple statistical analyses, such as generating descriptive statistics,
identifying outliers, and performing hypothesis tests.
Chances are that you have heard that the learning curve of R is relatively
steep, and that some of the things that you can do with R you can also do
with other packages such as Excel or Tableau. You may wonder why you
should bother with R. There are at least a couple of reasons why it is worth
your time to make this investment.
• First, the spectrum of applications that you can work on is very large.
Figure A.1¹ shows the trade-off between the difficulty and the complexity
of Excel and R. As you can see, the difficulty (learning curve) of Excel is
very low: the red Excel line grows very slowly along the horizontal axis.
However, the Excel line eventually becomes almost vertical, which means
there is a limit to its complexity (i.e., to what you can do with Excel).
Looking at the R line, we can see that it becomes steep very quickly, which
means that, initially, R is more difficult to learn. However, the list of what
you can do with R keeps growing long after Excel has reached its limit.
1 The figure was prepared by Gordon Shotwell and is available at
http://blog.yhat.com/posts/R-for-excel-users.html.
A.2 R Installation
R comes pre-installed on some computers. Check the Applications folder on
your Mac or All Programs on your Windows machine. If it is not pre-installed,
you can download and install it from the appropriate URL (shown below). For
the basic installation you can simply follow the directions on your screen and
accept the default settings.
A.3 RStudio
While we can run our analysis directly from R, the RStudio interface is more
user-friendly. RStudio is an integrated development environment (IDE) that
lets us see all components of our R project in one integrated interface.
Download RStudio from http://www.rstudio.com/. Please remember that
we install R first, and then RStudio.
The interface of RStudio (shown in Figure A.2) is divided into four panes
(areas):2
2 If the interface looks different from the one shown in Figure A.2, we can change it
by selecting Preferences on a Mac or Tools > Global Options on Windows and then
selecting Pane Layout. To match Figure A.2, use the drop-down menu in each pane
and select Source for the upper left pane, Console for the upper right, and so on.
1. Source: this is where we write the scripts that we want to save. This is
helpful when we work on projects that we may want to revisit at a later
point. We can create a new R script file by selecting File > New File > R
Script. The new file will show as Untitled1 in the source pane.
2. The Console plays a dual role. This is where we execute script interactively,
and where we see the output/results of our script. Executing interactively
means that when we type an instruction in the console and hit return, the
line is immediately executed and the output is shown in the console. If we
execute a line or an entire script from the source area, the output will also be
shown in the console.
3. The Files, Plots, Packages, Help, Viewer area serves multiple needs.
For example, if our script contains instructions for the creation of a graph,
the output will show up in the Plots tab. If we need to install specific packages
that allow us to execute more sophisticated functions, we do this from the
Packages tab. We can also view the files in our working directory or access
the built-in Help functionality.
4. The Environment area shows the data set(s) currently open and the
variables in each of these data sets. The History tab, as the name indicates,
keeps track of every single line of script that we have entered through the
console.
A.4 R Packages
R has a collection of ready-to-use programs, called packages. For example,
we can make some very powerful graphs with the package ggplot2, generate
detailed descriptive statistics with the package psych, run SQL queries
from within R using sqldf, and perform sentiment analysis on Twitter data
using twitteR and stringr.
There are a couple of ways that we can install these packages. First,
we can use the console to type the command install.packages(). For
example, the following line would install the psych package:
install.packages("psych")
Alternatively, we can click Packages > Install and select the package from
the new window/drop down menu (see Figure A.3).
Once the packages have been installed, we can load them (i.e., include
them in our script) using the library() function. For example, the following
line would load the psych package:
library(psych)
Please keep in mind that we install a package only once. However, we need
to load the appropriate package using the function library() every time we
need to leverage the functionality of the specific package in our script.
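For example, a script that relies on several of the packages mentioned above
might begin with the following lines (assuming the packages have already been
installed):
> library(sqldf)    # run SQL queries from within R
> library(ggplot2)  # graphics
> library(psych)    # detailed descriptive statistics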
A.5 R Basics
A.5.1 Operations
In its simplest form, we can use R as a calculator to perform basic operations.
For example, in the console area we can write (See also Figure A.4):
> 50+20
It will return
[1] 70
> 80-30
It will return
[1] 50
> 20*3
It will return
[1] 60
> 54/9
It will return
[1] 6
> 2^3
[1] 8
or
> 2**3
[1] 8
> (2^3)+(80-50)
[1] 38
> 1:10
[1] 1 2 3 4 5 6 7 8 9 10
> x<- 5
> y<- 9
> x*y
[1] 45
> x-y
[1] -4
> x+y
[1] 14
We can combine existing variables to create new variables using the c()
function. For example, we can create a new variable z that combines the
values of the variables x and y as follows:
> z <- c(x, y)
> z
[1] 5 9
As you can see, the new variable z has two data points.
Scripts are written and edited in the source pane (see Figure A.7). Once we
are done, we can select Save or Save As and save the file in our working
directory on our computer.
R also provides many built-in functions for working with vectors, such as
seq(), length(), sum(), min(), max(), and mean(). For example:
> x=seq(2,20,2)
> x
[1] 2 4 6 8 10 12 14 16 18 20
> length(x)
[1] 10
> sum(x)
[1] 110
> min(x)
[1] 2
> max(x)
[1] 20
> mean(x)
[1] 11