Chapters 1-5 Notes: Data Analytics


Chapter 1

Using Data Analytics to Ask and Answer Accounting Questions

I. The Abundance of Data


 The increasing amount of data may hinder the work of the accountant through
information overload, where too much information may not be properly synthesized
or interpreted.
 The abundance of data can be helpful in addressing company questions, problems,
and challenges to the extent the accountant can harness and analyze the available
data.
 With computers doing so much of the basic accounting work from data collection to
reporting, the focus of the accountant is changing from data collection and
summarization to data analytics.

II. Bloom's Taxonomy


 Bloom's Taxonomy provides a hierarchical view of critical thinking skills.

 The basic introductory accounting classes, covering topics of financial and
managerial accounting, primarily address the remember, understand, and apply
skills.
 Accountants must master all levels of Bloom's Taxonomy because they cannot
Analyze, Evaluate, and Create if they have not mastered the lower-level skills.
 The challenge is then to move to higher-order thinking skills because much of
the lower-level work can be performed by machines.

III. The Analytic Mindset


 The components of the analytic mindset are as follows:
 Ask the right questions.
 Extract, transform, and load relevant data.
 Apply appropriate data analytic techniques.
 Interpret and share results with stakeholders.
 Data Analytics is the process of evaluating data with the purpose of drawing
conclusions to address all types of questions (including accounting questions).
 AMPS Model
1. Ask the Question (Chapter 1)
2. Master the Data (Chapters 2-4)
3. Perform the Analysis (Chapters 5-9)
4. Share the Story (Chapter 10)

1. Ask the Question
 One should ask carefully constructed questions that can potentially be solved using
data and Data Analytics.
 Narrowing the scope of the question makes it easier to find the necessary data,
perform the analytics, and potentially be able to address the question.
 Four common types of questions are:
1. What happened? What is happening? (descriptive analytics)
2. Why did it happen? What are the root causes? (diagnostic
analytics)
3. Will it happen in the future? Is it forecastable? (predictive analytics)
4. What should we do, based on what we expect will happen?
(prescriptive analytics)

2. Master the Data


 Is the data relevant to the question to be addressed?
-The data must provide information that can be used to appropriately address the
question.
 Does the data exhibit high levels of integrity?
-Characteristics affecting the integrity of data include:
 Accurate – free from error
 Valid – data is complete
 Consistent – based on facts rather than just opinions
 Are the benefits of using the data > the cost of acquiring the data?
-The cost of the data could be affected by who owns the data and how difficult it is
to access.
 What types of analysis can be performed on the data?
-The accountant must consider traits of the data such as:
 Is the data alphanumeric (text)?
 Is the data categorical (for example male or female)?
 Is the data numerical?

3. Perform the Analysis


 Different questions require different types of analysis, including:
 Descriptive analysis summarizes past data to describe what happened.
 Diagnostic analysis takes the insights from descriptive analysis and drills down to
find the causes of those outcomes.
 Predictive analysis uses statistical modeling from previous data to make logical
predictions about future outcomes.
 Prescriptive analysis combines insights from all previous analyses to determine
the course of action to take on a current problem or decision.
 A histogram or scatterplot might be used to evaluate journal entries that are
excessively big or excessively small (or negative) when testing internal controls.
 Regression analysis might be used to evaluate the ability of advertising cost to
generate sales.
 A pivot table might be used to consider characteristics of past due customer
accounts.

 What-if/goal seek analysis might be used to perform an analysis of how changing
costs and other factors affect the break-even point for a new product.
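The what-if/break-even idea above can be sketched in Python. All of the cost and price figures below are invented for illustration, not taken from the text.

```python
# What-if / goal-seek sketch: break-even units for a hypothetical new product.
# All figures are illustrative assumptions.
def break_even_units(fixed_costs, price, variable_cost):
    """Units where total contribution margin exactly covers fixed costs."""
    return fixed_costs / (price - variable_cost)

base = break_even_units(120_000, 50.0, 30.0)       # 6000.0 units
# What if variable cost rises by $5? The break-even point climbs.
what_if = break_even_units(120_000, 50.0, 35.0)    # 8000.0 units
print(base, what_if)
```

Changing one input at a time, as above, is the manual equivalent of Excel's What-If Analysis and Goal Seek tools.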

4. Share the Story


 It is important to share the story by communicating the results to decision makers.
Updating
 Static – a one-time analytic performed on a periodic basis (e.g., annually or
quarterly).
 Dynamic – frequent updates on a continuous, daily, or weekly basis.
Form of Presentation
 Written report – summarizing the results in paragraph format.
 Visualization – such as a chart or graph.
 Dashboard – a graphical display of various measures that a company tracks.
 The AMPS model must often be performed multiple times, refining the question
(Ask the Question), considering different types of data (Master the Data),
performing additional analytics (Perform the Analysis), and retelling the story in
each iteration (Share the Story) before the issue/problem/challenge can be
addressed with some confidence.

IV. What Tools Do Accountants Need?


 Spreadsheets such as Excel are usually the first tool for accountants to perform
analytics:
1. Basic formulas and functions.
2. References to connect data in the spreadsheet.
3. Macros to automate repetitive processes.
4. PivotTables to allow for the reorganization and summarization of certain
data using crosstabulations without changing the underlying spreadsheet or
data.
 Queries are used to access data needed for analytics:
 VLOOKUP in Excel retrieves the data from a table or worksheet.
 SQL might be used to access specific data from large datasets.
 Visualizations can be performed in both Excel and Tableau.
 Scripting is the use of programs to perform analytics and automate repetitive
tasks. Such programs can be written in R or Python.
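A minimal sketch of querying data with SQL, here through Python's built-in sqlite3 module. The journal table and its rows are invented for illustration.

```python
# SQL query sketch using Python's built-in sqlite3; the journal table and
# its rows are invented for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE journal (entry_id INTEGER, account TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO journal VALUES (?, ?, ?)",
    [(1, "Sales", 500.0), (2, "Sales", -50.0), (3, "Rent", 200.0)],
)

# Total posted to each account, like a simple general ledger summary.
rows = conn.execute(
    "SELECT account, SUM(amount) FROM journal GROUP BY account ORDER BY account"
).fetchall()
print(rows)   # [('Rent', 200.0), ('Sales', 450.0)]
```

The same SELECT ... GROUP BY pattern is how SQL "accesses specific data from large datasets," whatever database engine sits underneath.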

Chapter 2
Master the Data: An Introduction to Accounting Data

Presentation Outline
I. Accounting Data Analytics and Data Sources
II. The Meaning of Big Data
III. Sources of Data for Accounting Analysis
IV. The PIVOTTABLE

I. Accounting Data Analytics and Data Sources


 Data Analytics is defined as the process of evaluating data with the purpose of
drawing conclusions to address all types of questions.

A. Data Analytics and Accounting
B. Master the Data

A. Data Analytics and Accounting


 More data has been created in recent years than in the entire previous history of
the human race. It should be kept in mind that it is not necessarily the volume or
availability that is important. What is important is how it is analyzed and used to
create value.
 Accounting data analytics is encapsulated as the technologies, systems, practices,
methodologies, databases, statistics, and applications used to analyze diverse
accounting and nonaccounting data to give organizations the information they need
to make sound and timely business decisions.
 Examples of Applying Accounting Data Analytics:
o Evaluate journal entries to find errors or fraud.
o Use product reviews to help identify obsolete inventory that may need to
be marked down to properly value the inventory.
o Classifying customer ability to pay in determining the allowance for
doubtful accounts.
o Estimate fixed and variable costs needed for breakeven analysis.
 Effective data analytics provides a way to search through large data sets to discover
unknown patterns or relationships.
B. Master the Data
A. Ask the Question (Chapter 1)
B. Master the Data (Chapters 2-4)
C. Perform the Analysis (Chapters 5-9)
D. Share the Story (Chapter 10)
 A major part of the Master the Data step of the AMPS model is to learn and become
aware of the variety of data sources available to answer our accounting questions.

II. The Meaning of Big Data


 Big Data can be defined as data sets that are too large or complex for businesses’
existing systems (i.e., traditional capabilities) to capture, store, manage, and
analyze. The following four V’s are often used to describe Big Data:
A. Volume
B. Variety
C. Velocity
D. Veracity
A. Volume
 Volume is the sheer amount of data, regardless of its source.
 It might come from corporate systems, clickstream data from social media (i.e.,
Facebook, Instagram, Blogs, etc.), the government (e.g., census records), speeches
by the CEO, press releases from the company, word searches from Internet search
engines (i.e., Google, Yahoo), or just the Internet in general.

 One of the Biggest Challenges:
-Finding which of the mounds of data is relevant to the decision maker.

B. Variety
 Variety is the form of the data.
 Structured data is highly organized and fits neatly into a table or in a database. The
best accounting example of structured data is a financial statement in a tabular
format.
 Unstructured data is data without internal organization. Examples include blogs,
social media, and pictures posted on Instagram.
 Semi-structured data does not have labeled columns, but its data may come with
tags or markers explaining what the data represents. An accounting example is
XBRL data which puts tags on financial statement data so that computers can easily
read and evaluate financial statement data.
C. Velocity
 Velocity is the speed that the data is being generated.
 Stock prices might be generated and analyzed on a second-by-second basis.
 A company’s financial statements might be generated and analyzed on a monthly or
quarterly basis.

D. Veracity
 Veracity is the quality of the data. It is defined in terms of whether the data is truthful,
accurate (and clean), and worthy of trust. Some suggest that veracity is the
cornerstone of Big Data and Data Analytics.
 Fact vs. estimate – in accounting, some data is generally considered factual (e.g.,
the cash balance in a bank account), and other data is considered estimated (e.g., the
balance in Allowance for Doubtful Accounts or the amount of Goodwill).
 Accurate – Some data contains errors (e.g., an incorrect check posting) or is missing
data (e.g., an accountant forgets to include the date of a transaction). Still another
possibility is fraud concealed through manipulation of records (e.g., lapping of
accounts receivable).

III. Sources of Data for Accounting Analysis


A. Financial Accounting Data
B. Financial Accounting-Related Data
C. Managerial Accounting Data
D. Tax Data
E. Nonaccounting Data
A. Financial Accounting Data
 Financial accounting data refers to materials prepared by a company to help decision
makers external to the company.
1. Financial Statement Data
2. Journals and the General Ledger
3. Subsidiary Fixed Asset Ledger
4. Subsidiary Accounts Receivable Ledger
5. Subsidiary Inventory Ledger

1. Financial Statement Data


 Some of the most commonly used financial statements include:
 Income Statement
 Statement of Stockholders’ Equity
 Balance Sheet
 Statement of Cash Flows
 Sources of financial statements include:
 Investor relations section of a company’s website.
 The Securities and Exchange Commission (SEC) maintains a repository of
financial statements for public companies at its EDGAR website.
 Compustat maintains a commercial database of financial statement items. Free
sources include Yahoo! Finance (https://finance.yahoo.com).

2. Journals and the General Ledger


 The general ledger summarizes the current balance of all asset, liability, equity,
revenue, and expense accounts.
 In each special journal, all transactions are totaled at the end of the month, and
these totals are posted to the general ledger.
 Note: General journal transactions are posted into the general ledger individually
by account.
 The balances in the general ledger serve as the basis for the financial statements.
Summary accounts known as control accounts in the general ledger are often
supported by detail contained in subsidiary ledgers.

3. Subsidiary Fixed Asset Ledger

 Fixed assets include property, plant (i.e., factories, office buildings, stores, etc.),
equipment (vehicles, forklifts, computers, tools, etc.), and furniture.
 This ledger also keeps details regarding the purchase date, depreciation
method, and accumulated depreciation for each fixed asset.
 The detailed balances in each category of fixed asset account in the fixed asset
subsidiary ledger support the control account balance in the general ledger.
 Many simply refer to it as a depreciation schedule. See partial ledger on the
right.

4. Subsidiary Accounts Receivable Ledger
 The subsidiary accounts receivable ledger details information regarding charges and
payments on customer accounts for each customer.
 The total of the subsidiary accounts receivable ledger supports the accounts receivable
control account in the general ledger.

5. Subsidiary Inventory Ledger


 The subsidiary inventory ledger details information on the inventory held by the
company. The detailed balances of each type of inventory item, or SKU (stock-keeping
unit), support the inventory control account in the general ledger.

B. Financial Accounting-Related Data


1. Corporate Securities and Exchange Commission (SEC) Filings
2. Conference Call Transcripts – Website Links
3. XBRL (eXtensible Business Reporting Language)
4. Press Releases

1. Corporate Securities and Exchange Commission (SEC) Filings


 Publicly traded companies make required filings to the SEC each year:
 Form 10-K – Annual filing with the following required sections:
 Business – an overview of the company’s main operations.
 Risk Factors – any significant risks the company faces or could likely face in the future.
 Selected Financial Data – financial highlights over the last five years.
 Management’s Discussion and Analysis of Financial Condition and Results of
Operations – unlike most of the 10-K filings, most of this is unstructured data requiring
additional work to extract and analyze.
 Financial Statements and Supplemental Data – includes audited financial statements,
notes to the financial statements, and auditor report.
 Form 10-Q – Quarterly submission like the 10-K but more abbreviated in its disclosures.
 Form 8-K – used to notify investors of important events such as changes in senior officers,
substantial asset acquisitions or sales, change in auditors, and restatement of financial
statements, etc.
 SEC submissions are stored at EDGAR (https://www.sec.gov/edgar.shtml), the Electronic Data
Gathering, Analysis, and Retrieval system.

2. Conference Call Transcripts – Website Links


 An earnings call is a conference call between senior company management, analysts, investors,
and the media to discuss earnings and other financial results released immediately before the call.
 Analysts and investors have an opportunity to ask management questions during the call, in
order to better understand past reports and predict future performance.
 Transcripts are available for download and analysis.
 Considered to be a form of unstructured data.
 Websites such as Seeking Alpha (https://seekingalpha.com/) are dedicated to curating content
of interest to investors.

3. XBRL (eXtensible Business Reporting Language)
 XBRL is the computer-based standard used to define and exchange financial information
between disclosing companies and various financial statement users.
 The SEC requires each publicly traded company to submit its financial statements using XBRL.
 An example of the use of XBRL requesting data for IBM’s Total Assets (XBRL tag: Assets) and
Liabilities (XBRL tag: Liabilities) from 2014 to 2017 is shown below:

4. Press Releases
 Companies often issue press releases to make an official statement to the media. Over time,
press releases can be aggregated into a fairly comprehensive view of:
 What is happening at a company.
 The tone of management toward business opportunities, and
 Future profitability.
 Represents unstructured data available for analysis.

C. Managerial Accounting Data


 Managerial accounting data is used for decision making by managers and other personnel within
an organization. It is subject to fewer rules than GAAP-based financial reporting, but does have the
following two essential criteria:
 Create, maintain, and store information that is useful to internal stakeholders in making
decisions.
 The cost of capturing, measuring, and storing the data must be less than the value of the
data.
1. Budget Data
2. Standard Cost Data
3. Point-of-Sale Transaction Data
4. Data on Potential Cost Drivers for Allocating Overhead
5. Supply Chain Data
6. Customer Relationship Management Data
7. Human Resource Data

1. Budget Data
 Budgets generally start with a prediction of the level of sales. The company then predicts the
level of expenses and capital expenditures that will be needed to support those sales.
 Comparing the actual to the budgeted amounts helps the company learn what occurred as
anticipated as well as what was unanticipated.
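The budget-to-actual comparison described above can be sketched as follows; the accounts and amounts are made-up illustrations.

```python
# Budget-to-actual comparison; accounts and amounts are invented.
budget = {"Sales": 100_000, "Wages": 40_000, "Advertising": 10_000}
actual = {"Sales": 95_000, "Wages": 42_000, "Advertising": 9_000}

# Positive variance = actual came in above budget; negative = below budget.
variances = {acct: actual[acct] - budget[acct] for acct in budget}
print(variances)   # {'Sales': -5000, 'Wages': 2000, 'Advertising': -1000}
```

Here sales fell short while wages ran over, flagging both an anticipated line (advertising under budget) and unanticipated ones worth investigating.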

2. Standard Cost Data


 A system of standard costing allows an accountant to mathematically determine where
deviations from the budget have occurred. Further analysis can then determine reasons for
deviations.

 Note that the overhead volume variance calculated here assumes that hours are used to
allocate the fixed manufacturing overhead. Alternatively, the overhead volume variance can be
calculated as follows: Flexible budget level of overhead for the actual level of production –
Overhead applied to production using standard overhead rate.
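A worked example of the volume variance, using assumed figures and hours as the allocation base for fixed manufacturing overhead:

```python
# Worked example of the volume variance; all figures are assumed.
budgeted_fixed_oh = 60_000          # flexible-budget fixed overhead
normal_capacity_hours = 10_000      # hours used to set the standard rate
fixed_oh_rate = budgeted_fixed_oh / normal_capacity_hours   # $6.00 per hour
std_hours_allowed = 9_000           # standard hours for actual production

# Budgeted fixed overhead minus fixed overhead applied at the standard rate.
applied = std_hours_allowed * fixed_oh_rate                 # 54,000.0
volume_variance = budgeted_fixed_oh - applied
print(volume_variance)   # 6000.0 (unfavorable: production below capacity)
```

Because the flexible-budget level of fixed overhead does not change with volume, the alternative computation described in the text reduces to the same subtraction.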

3. Point-of-Sale Transaction Data


 Point-of-sale (POS) transaction data creates a stream of data from every transaction that occurs
regarding sales and inventory.
 Example: Walmart has a system called Retail Link, an Internet-based tool allowing its
suppliers to access sales and inventory data by item, store, and day. The data is used to help
buyers predict buying patterns, trends, and inventory management.

4. Data on Potential Cost Drivers for Allocating Overhead


 Estimated Manufacturing Overhead ÷ Estimated Level of Allocation Base = MOH Rate
MOH Rate x Actual Usage of Allocation Base = MOH Applied to Product
 The above shows how a single manufacturing overhead rate is calculated and used to apply the
overhead to product. The problem with this approach is that there are often many different
causes of the overhead and using a single allocation base can sometimes inaccurately allocate
costs to products. Activity-based costing attempts to overcome this by having multiple
overhead allocations with a variety of allocation bases that better reflect what causes the
overhead to be incurred. In essence, it is as simple as using multiple overhead rates, each with
an allocation base that more accurately reflects what causes the overhead in its numerator to
be incurred.
 Past data regarding overhead can be used in regression analysis to determine whether various
allocation bases are good predictors of the level of manufacturing overhead for various cost
pools.
 y = a + bx + e
Where:
y = total overhead costs
a = fixed overhead costs
b = variable overhead rate
x = level of the allocation base (cost driver)
e = error term
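The regression above can be estimated with ordinary least squares. The sketch below uses invented, perfectly linear machine-hour and cost data so the fitted a and b come out exact.

```python
# Estimating fixed overhead (a) and the variable rate (b) by ordinary least
# squares, per y = a + bx + e. The figures are invented and perfectly linear.
hours = [100, 120, 140, 160, 180]              # x: candidate allocation base
overhead = [2050, 2450, 2850, 3250, 3650]      # y: observed total overhead

n = len(hours)
mean_x = sum(hours) / n
mean_y = sum(overhead) / n
# Slope: covariance of x and y over variance of x.
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(hours, overhead)) \
    / sum((x - mean_x) ** 2 for x in hours)
a = mean_y - b * mean_x
print(a, b)   # 50.0 20.0 -> $50 fixed, $20 variable per hour
```

Running the same fit against several candidate allocation bases (and comparing fit quality) is how an accountant would judge which driver best predicts each overhead cost pool.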

5. Supply Chain Data


 The supply chain represents the process of getting products from raw materials to production to
distribution to the ultimate delivery of the final product to the customer.
 Information on active vendors such as their contact information, orders made to date, payment
locations, payment amounts, etc. can be collected by the supply chain system.
 One form of internal control over expenditures is the maintenance of an approved vendor list
that purchasing agents are restricted to when making purchases. The vendor information from
the supply chain can be used to help select and monitor vendors.

6. Customer Relationship Management Data


 A customer relationship management (CRM) system is an information system for overseeing all
interactions with current and potential customers with the goal of improving relationships.
Examples of CRM data include the following:

 Customer contact history.
 Customer credit score.
 Customer credit limit.
 Customer payment history.
 Having such data on each customer could help in predicting the allowance for doubtful
accounts.
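A hypothetical sketch of how such CRM data might feed an allowance estimate. The credit-score bands and loss rates below are assumptions for illustration, not standard figures.

```python
# Hypothetical sketch: estimating the allowance for doubtful accounts from
# CRM credit scores. Score bands and loss rates are assumed, not standard.
customers = [
    {"name": "A", "balance": 10_000, "credit_score": 780},
    {"name": "B", "balance": 5_000,  "credit_score": 640},
    {"name": "C", "balance": 2_000,  "credit_score": 560},
]

def loss_rate(score):
    """Assumed expected-loss rate for a credit-score band."""
    if score >= 700:
        return 0.01
    if score >= 600:
        return 0.05
    return 0.20

allowance = sum(c["balance"] * loss_rate(c["credit_score"]) for c in customers)
print(allowance)   # 100 + 250 + 400 = 750.0
```

In practice the loss rates would be calibrated from the company's own payment history rather than assumed.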

7. Human Resource Data


 A human resource management system tracks information regarding employees (pay, hire date,
benefits, applicants, retirees, etc.)
 It could be possible that an analysis of salary and wage payments could uncover a fraud scheme
where a paycheck is being generated for someone who no longer works for the company. The
check might then be cashed by a dishonest supervisor.

D. Tax Data
 Accountants are deeply interested in the impact of transactions and events on the amount of
tax that is due and payable.
 The tax information comes from transactional data gathered in the financial reporting system.
 Two examples of tax data that need to be stored:
 Depreciation used for tax and financial reporting purposes. Sometimes the depreciation
used for tax is different from that used for financial reporting.
 Certain information is used to claim a research and development tax credit, including
linking an employee’s time directly to a research activity or the use of specific
equipment.

E. Nonaccounting Data
 Some non-accounting data helps accountants better understand accounting data.
1. Economic Data
2. Current and Historical Stock Prices
3. Social Media
4. Analyst Research Reports and Earnings Forecasts

1. Economic Data
 Accountants sometimes use macroeconomic data, such as:
 Gross domestic product (GDP) – as a measure of economy-wide performance.
 Unemployment numbers – as a measure of labor availability.
 Consumer price index (CPI) – as a measure of inflation.
 Housing market starts and price levels – generally regarded as a key measure of
economic status.
 Macroeconomic data is generally useful for diagnostic and predictive analytics.

2. Current and Historical Stock Prices


 Current and historical daily stock price data is readily available on websites like Yahoo! Finance
(https://finance.yahoo.com/)
 By clicking on “Historical Data” and inputting the appropriate time period, it is easy to access
daily stock prices for any publicly traded firm.

3. Social Media
 Social media and the Internet offer several ways for potential investors to communicate with
each other.
 Examples include blog sites and chat boards (like those on Yahoo! Finance), Twitter, and
StockTwits. StockTwits organizes its discussion using a cashtag, which is $ plus the ticker
symbol. So for Netflix, the cashtag would be $NFLX. Any discussion that includes that cashtag
would be summarized on the NFLX StockTwits page.
 Sometimes data analysts employ computer programs (sometimes called machine learning) to
assess the sentiment on chat sites to see how it is related to stock price. The machine learning
techniques may count how many positive words are said by those on StockTwits as compared to
the number of negative words to assess the overall sentiment reflected in the posts.
 Product reviews can give insight as to whether a product is pleasing to customers and help
predict demand.
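A toy sketch of the word-counting approach to sentiment described above; the word lists and posts are invented for illustration.

```python
# Toy lexicon-based sentiment scoring on chat posts; the word lists and
# posts are invented for illustration.
positive = {"beat", "growth", "strong", "buy"}
negative = {"miss", "weak", "sell", "decline"}

posts = [
    "$NFLX strong quarter, subscriber growth beat estimates",
    "$NFLX weak guidance, time to sell",
]

def sentiment(text):
    """Positive-word count minus negative-word count."""
    words = text.lower().split()
    return sum(w in positive for w in words) - sum(w in negative for w in words)

scores = [sentiment(p) for p in posts]
print(scores)   # [3, -2]
```

Real systems use much larger financial lexicons and handle punctuation and negation, but the count-and-compare idea is the same.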

4. Analyst Research Reports and Earnings Forecasts


 Financial analysts often prepare research reports discussing company prospects. Such
reports synthesize financial statements, conference calls, and discussions with company
managers and company competitors.
 The exhibit on the right provides an example of Value Line’s Financial Analysts Report for
Johnson and Johnson.

IV. The PIVOTTABLE


 PivotTables are a tool that allows reorganization and summarization of certain data using
crosstabulations without changing the underlying spreadsheet (or data).

Chapter 3
Accounting Data: Data Types and How They Are Used

Presentation Outline
I. Structured Data Types
II. Categorizing Data Based on Tools
III. Analyzing Categorical and Numerical Values in a PIVOTTABLE
IV. Data Dictionaries and Data Catalogs
 With a focus on the analysis of structured data, this chapter will delve into ways to extract,
transform, and load data to avoid information overload.

I. Structured Data Types


 The two broadest types of structured data are categorical and numerical:
A. Categorical Data
B. Numerical Data
C. Structured Summary

A. Categorical Data
 Categorical data tend to “categorize” items represented by words—such as classifying a group
of people by gender (i.e., male or female), or labeling transaction types (i.e., FIFO, LIFO, average
cost, etc.). There are two subsets within the categorical data type:
1. Nominal Data

2. Ordinal Data

1. Nominal Data
 Nominal data is categorical data that cannot be ranked, such as gender and type of transaction.
The primary methods to summarize categorical data that is nominal are:
 Counting and grouping
 Proportion

 Is the above dataset all categorical?


- There is a mix of categorical and numerical data in the above dataset. The first
three columns are all categorical, while the Amount column is numerical.

 Let’s begin by analyzing Transaction Type, which is a categorical, nominal data type. By
highlighting the transactions associated with returns, we can count how many transactions are
Returns, 9, which means that 11 transactions are Sales.
 This analysis of categorical data can go further by calculating the proportion.
 Proportion is calculated by taking the number of observations in a category and dividing that
number by the grand total of the number of observations in the sample.
 Return transactions are 9 of the 20 in the sample, meaning that 45 percent of the transactions
are Returns. We can infer that the remaining 55 percent of the transactions were Sales.
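The counting, grouping, and proportion steps can be sketched as follows, mirroring the 9-Return/11-Sale mix described above:

```python
# Counting, grouping, and proportions for nominal data. The list recreates
# the sample above: 9 Return and 11 Sale transactions (20 total).
from collections import Counter

transaction_types = ["Return"] * 9 + ["Sale"] * 11
counts = Counter(transaction_types)                       # count and group
total = sum(counts.values())
proportions = {t: n / total for t, n in counts.items()}   # proportion
print(counts["Return"], proportions["Return"])            # 9 0.45
```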

2. Ordinal Data
 Categorical, ordinal data has the same characteristics as nominal data, but it goes a step further
—there is a natural “order” to ordinal data that allows you to rank and sort it. This means that
there are three primary methods to summarize categorical, ordinal data:
 Counting and grouping
 Proportion
 Ranking
 Examples of ordinal data are letter grades (i.e., A, B, C, D, and F) and Olympic medals (i.e., gold,
silver, and bronze).
 In considering the Date variable, we can count the number of transactions associated with each
date, but there is also a natural ordering of the dates through time.

 The above summary table shows the number of transactions on each date, as well as the
proportions of each date in the sample.
 Ranking refers to a position on a scale. While ordinal data results could be ranked by the count
or in some cases alphabetically, a ranking of the records on the basis of the natural order of date
is more informative for this dataset.

B. Numerical Data
 Numerical data, as the name implies, are meaningful numbers, such as transaction amount,
net income, age, or the score on an exam. There are four primary methods to summarize
numerical data:
 Counting and grouping
 Proportion
 Summing
 Averaging
 The two types of numerical data are:
1. Interval Data
2. Ratio Data

1. Interval Data
 Interval data is so named because there is an equal “interval” or distance between each
observation. However, interval data does not have a meaningful point of zero.
 A good example of interval data is temperature. Even though a 1-degree difference
on a Fahrenheit scale is the same from 30 to 31 as it is from 77 to 78, which makes
the data type numerical, when the temperature is 0, that does not mean “the
absence of temperature”; it is simply 1 degree below 1, and 1 degree above -1.
 Another example of interval data is SAT scores. A student cannot even earn a 1 on
the SAT—the range of possible scores is from 400 to 1,600 for the total score.
 Interval data is uncommon in the type of data that accountants work with.

2. Ratio Data
 Ratio data is defined as numerical data with an equal and definitive interval between each
data point and an absolute “zero” point of origin. It is also important to note that
negative values take on meaning with ratio data. For example, the sum of net sales
considers the negatives for sales discounts and sales returns and allowances.
 Data measuring money—such as transaction amounts, expenses, revenues, assets,
salary, taxes, etc.—are all examples of ratio data.
 The majority of data relating to accounting and other business decisions are ratio
data.
 Depending on the way the data is set up in the system or database, transactions are
sometimes listed at their absolute value (i.e., the positive value that corresponds
with the number). For example, revenues and expenses are expressed as positive
numbers on an income statement, but expenses are subtracted from revenues to
calculate net income.
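A small sketch of signed ratio data, where summing the signed amounts yields a meaningful net figure; the line items are illustrative.

```python
# Ratio data: signed amounts sum to a meaningful net figure.
# The line items and amounts are illustrative.
line_items = {
    "Revenue": 100_000,
    "Sales returns and allowances": -3_000,
    "Expenses": -65_000,
}
net_income = sum(line_items.values())
print(net_income)   # 32000
```

If the same amounts were stored as absolute values, the analyst would first have to restore the signs (e.g., negate expenses) before a sum like this would be meaningful.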

C. Structured Summary

II. Categorizing Data Based on Tools


 Depending on the tool you are using, there are a variety of ways to address the different
data types:
A. Database Data
B. Tableau Data
C. Geographic Data

A. Database Data
 Databases and some tools define data using the following:
 String, text, short text, or alphanumeric – a string of characters is a collection of
one or more characters that are stored as categorical data. The characters can
be letters, numbers, or a combination, but even if they are stored as numbers,
the numbers are not interpreted as meaningful values that can be used in
calculations. Short text would be a brief name. Long text would be a paragraph.
 Date – the date data type represents a string of characters that are formatted in
a traditional date format, such as mm/dd/yyyy or mm/dd/yy.
 Number – the number data type is reserved for numeric data, typically ratio
data. Any characters that are stored as a number can be used in calculations.
 Y/N flag – used to indicate yes or no.
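A sketch of these data types in SQLite (the table and field names are invented); note how the date is kept as formatted text and the Y/N flag as a short string, while only the number column is meant for calculations.

```python
# SQLite sketch of the data types above; table and field names are invented.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customer (
        customer_id  TEXT,   -- alphanumeric: digits here are not math values
        signup_date  TEXT,   -- date kept as 'mm/dd/yyyy' formatted text
        credit_limit REAL,   -- number: usable in calculations
        is_active    TEXT    -- Y/N flag
    )
""")
conn.execute("INSERT INTO customer VALUES ('C001', '01/15/2024', 5000.0, 'Y')")
row = conn.execute("SELECT * FROM customer").fetchone()
print(row)   # ('C001', '01/15/2024', 5000.0, 'Y')
```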

B. Tableau Data
 Tableau and some other tools make use of the following data types:
1. Dimension – any attribute that is considered to be categorical.
2. Measure – any attribute that is considered to be numerical.

C. Geographic Data
 Geographic data is any data that can be linked to a map, such as an attribute for state, city, or
country.
 Both Excel and Tableau interact with geographic data.

 The sales-by-country illustration on the left was completed in Tableau.

III. Analyzing Categorical and Numerical Values in a PIVOTTABLE


 PivotTables are one of the best methods for summarizing raw data in Excel.
 Recognizing categorical data helps to identify different ways to subtotal, or slice, the
numerical data.
 Recognizing numerical data will help identify the different measures that can be
calculated.
 The methods of data analysis that can be performed using a variable in a dataset depend on
the type of data you are working with and whether it is numerical or categorical.
A. Insert a PivotTable
B. Create PivotTable Values
C. Create PivotTable Rows

A. Insert a PivotTable
To create a PivotTable from the transaction table example, download and open the file “Exhibit 3-
1 – TransactionsTable.xlsx.”

B. Create PivotTable Values


 From the PivotTable Fields list, drag both the Transaction_Type and Amount to the Values
section of the field list.
 Excel will automatically interpret the data type of each variable and will provide a COUNT of the
Transaction_Type and a SUM of Amount.

C. Create PivotTable Rows


 Drag Transaction_Type into the Rows section.
 Excel provides rows and grand totals for each transaction type.
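The COUNT and SUM that the PivotTable produces for each transaction type, plus the grand total, can be sketched in plain Python; the transaction rows below are invented.

```python
# COUNT and SUM of Amount by Transaction_Type, as the PivotTable would show.
# The transaction rows are invented.
from collections import defaultdict

transactions = [
    ("Sale", 500.0), ("Return", -50.0), ("Sale", 120.0), ("Return", -30.0),
]

pivot = defaultdict(lambda: {"count": 0, "sum": 0.0})
for t_type, amount in transactions:
    pivot[t_type]["count"] += 1     # Excel's COUNT of Transaction_Type
    pivot[t_type]["sum"] += amount  # Excel's SUM of Amount

grand_total = sum(v["sum"] for v in pivot.values())
print(dict(pivot), grand_total)
```

Dragging Transaction_Type to Rows corresponds to the grouping key here; dragging fields to Values corresponds to the count and sum accumulators.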

IV. Data Dictionaries and Data Catalogs


 Data dictionaries are useful because some variables can be ambiguously named and interpreted
in a variety of ways if a specific definition for the variable is not provided.
 The data dictionary contains a separate record for each field (or variable) in tables of data.
- Although the terms data dictionary and data catalog are sometimes used
interchangeably, data catalogs are more robust and technical than data
dictionaries. In addition to defining variables, data catalogs will often include
database schemas, such as ER diagrams (covered in chapter 4).

Chapter 4
Master the Data: Preparing Data for Analysis

I. Differences Between Database, Excel, and Tableau


 A database is a structured data set that can be accessed by many potential authorized users via
a computer system or network.

 Although there are a variety of types of databases, the relational model is the most popular and
common.
 It is possible to create and store data in Excel, but it is far preferable to store data in a database
and simply connect it to Excel when you wish to perform data analysis.
 Tableau will almost always default to showing data in a visual format instead of a numerical
format.
 Tableau’s biggest advantage over Excel is data visualization.
 Unlike Excel, it is not possible to create raw data in Tableau. Tableau must create a
connection to an existing data source.

II. Relational Databases


A. Relational Database Defined
B. Primary and Foreign Keys

A. Relational Database Defined


 Instead of storing all the data required for analysis in one massive table, relational databases
break the data into separate tables, each containing a unique list of the items stored. Each
table organizes data into sets of:
 Fields (or variables): Columns that contain descriptive characteristics about the
observations in the table.
 Records: Rows with a set of observations that make up a record, such as customer
information.

B. Primary Keys and Foreign Keys


 Primary Key: The unique identifier in each table (e.g., Transaction_ID in the Transaction
Table; CustomerID in the Customers Table)
 Foreign Key: Exists to create relationships, or links, between two tables
 To identify which customer was involved in each transaction, there must be a common
field between the two tables that relates the Transaction Table to the Customers Table.
 In the Transaction Table, CustomerID is not uniquely identifying anything, but is creating
a look-up relationship so that individual transactions can be related to certain
customers.
 Unlike primary keys, foreign keys are not required in every table. However, when a
foreign key exists, it must contain data that matches the primary key in the related table.
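A minimal sketch of the primary key/foreign key relationship using SQLite (the table and column names follow the notes; the customer data is made up):

```python
import sqlite3

# Hypothetical two-table sketch; table and column names follow the notes.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled

conn.execute("CREATE TABLE Customers (CustomerID INTEGER PRIMARY KEY, Name TEXT)")
conn.execute("""CREATE TABLE Transactions (
    Transaction_ID INTEGER PRIMARY KEY,
    CustomerID INTEGER REFERENCES Customers(CustomerID),
    Amount REAL)""")

conn.execute("INSERT INTO Customers VALUES (1, 'Acme Co')")
conn.execute("INSERT INTO Transactions VALUES (10, 1, 99.50)")  # valid foreign key

# The foreign key must match an existing primary key in Customers,
# so a transaction for an unknown CustomerID is rejected.
try:
    conn.execute("INSERT INTO Transactions VALUES (11, 999, 5.00)")
    fk_enforced = False
except sqlite3.IntegrityError:
    fk_enforced = True
```

The rejected insert is exactly the internal control discussed later in this chapter: a check cannot be written to a supplier (or a transaction recorded for a customer) that does not already exist in the related table.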

III. Defining Relational Database Attributes and Relationships


A. The Data Dictionary
B. Relational Database Diagrams

A. Relational Database Data Dictionary


- Because it supports a multi-table database, a relational database data dictionary has several additional
fields compared to the data dictionary for a regular dataset.

B. Entity Relationship Diagram
 In Microsoft Access, the relationship would appear as shown below:

IV. Advantages of Using Relational Database for Data Storage


 While it is possible to store data in Excel, its main usage should be for data analysis rather than
storage.
A. Data Integrity
B. Internal Control Benefits Data Storage in a Relational Database

A. Data Integrity
 Data integrity essentially means truth in data. Accounting information must be both relevant
and be a faithful representation of what occurs. Accounting information that exhibits a faithful
representation has the following three characteristics:
 Free from error (contains no mistakes or inaccuracies)
 Complete (includes all monetary transactions)
 Neutral (information is not biased)
 Integrity of data can be damaged if different versions of the data are stored on users’
desktops or laptops rather than being analyzed through a live connection to the database.
 When that happens, the data being used for analysis and decisions becomes out-of-date as soon
as the database accumulates new data; thus, “multiple versions of the truth” end up being
stored on computers across the company.

B. Internal Control Benefits Data Storage in a Relational Database


 Preventive internal controls are easier to enforce:
 Controls are implemented in the database to ensure the suppliers receiving checks are
verified in the company’s system.
- Because a foreign key cannot contain data that does not first exist in the related
table’s primary key, one could not write a check to a supplier that is not already
included in the Supplier’s table
 Security around data entry and table access can aid in creating and enforcing data entry internal
controls:
 The Database Administration team can set up table-level security to indicate which
employees have read/edit/write permission.
 Ensures that only employees with appropriate security clearance can create new data
entries, such as new suppliers in the Suppliers table
 Reduced redundancy cuts down on errors:
 The nature of the relational database table structure ensures that there is a unique listing
of each observation, stored in only one place.
 This maintains one version of the truth across all reporting mechanisms.

 Version control reduces the possibility of having more than one version of the data:
 Data integrity is maintained when data is stored in one centralized database that business
users can connect to directly using Excel, Tableau, or other tools, rather than having multiple
desktop databases or Excel files where data is stored.

V. Accessing a Database Using Extract, Transform, and Load (ETL)


 The following three steps are normally done in order and are pervasive in Data Analytics:
 Extract – accessing or connecting to the data.
 Transform – a step needed when data is not formatted in the manner required for
analysis. Sometimes this requires cleaning the data – for example, removing rows or
columns that have blank or erroneous values. It may also involve sorting or filtering the
data so it is easier to analyze.
 Load – ensuring that, after the data has been transformed, it is loaded into the
appropriate analysis tool.
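The three ETL steps can be sketched with Python's standard library; the raw CSV text below is a hypothetical extract (row 2 has a blank Amount and must be cleaned):

```python
import csv
import io

# Hypothetical raw extract; row 2 has a blank Amount and must be cleaned.
raw = """Transaction_ID,Amount
1,100.00
2,
3,250.00
"""

# Extract: access or connect to the data (here, read the CSV text).
rows = list(csv.DictReader(io.StringIO(raw)))

# Transform: remove records with blank values and convert Amount to a number.
clean = [
    {"Transaction_ID": int(r["Transaction_ID"]), "Amount": float(r["Amount"])}
    for r in rows
    if r["Amount"].strip()
]

# Load: move the transformed data into the analysis tool (here, a sorted list).
loaded = sorted(clean, key=lambda r: r["Amount"])
```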
A. Extract: Connecting to Data in Excel
B. Extract: Connecting to Data in Tableau
C. Extract and Transform: Connecting to a Subset of Data from a Database Using SQL

A. Extract: Connecting to Data in Excel


 In Excel, the easiest way to extract data is by using a query. Excel makes this easy to do through
a point-and-click file path, so you don’t necessarily have to write code or know a programming
language to connect to a database.
 Once a connection is made to the database, save your Excel workbook so you can perform
analysis on the data; in other words, you are not at risk of overwriting or changing any existing
data.
 If you wish to see a current view of the data from the database in your Excel file, click the
Refresh All button. This refresh will automatically pull in any new or changed data from the
database connection into your existing Excel workbook.
 Once the initial connection to external data is made, you can manage the settings to ensure that
the data in your workbook automatically refreshes. Click on the bottom half of the Refresh All
button.
 Within Connection Properties, select from a variety of options, including setting an interval for
how frequently Excel should refresh the data connection or for Excel to refresh the data
connection each time the workbook is opened.

B. Extract: Connecting to Data in Tableau


 Tableau’s biggest advantage over Excel is data visualization.
 Unlike Excel, it is not possible to create raw data in Tableau; instead a connection to an existing
data source (Excel spreadsheet/workbook, a database etc.) must be created.
 Although the labs and activities in this text always select Microsoft Excel, note that there are a
variety of different data sources, including servers (or databases), that business users can
connect to in the professional world.
 Once connected to the data, Tableau will present a screen called Data Source, where you can
transform the data if necessary, before beginning to work with the data, as shown in Exhibit 4.9.

 Data imported as a categorical value will have an Abc icon.
 Data imported as a numerical value will have a number sign (#) icon.
 It is always a good idea to double check the format in which the data imported before beginning
to work with the data and to change it if there was a mistake.
 For example, if you wanted to work with the order number as a numerical value, click the Abc
icon and change the data type, as shown in Exhibit 4.10.

C. Extract and Transform: Connecting to a Subset of Data from a Database Using SQL
 If the data in a database is too large to load directly into Excel, Structured Query Language (SQL)
can be used to select only a subset of the data.
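A sketch of the idea using SQLite's in-memory engine (the table contents and the filter threshold are hypothetical); the filtering and sorting happen in the query itself, so only the needed subset is loaded into the analysis tool:

```python
import sqlite3

# Hypothetical table; in practice the connection would point at a real database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Transactions (Transaction_ID INTEGER, Amount REAL)")
conn.executemany("INSERT INTO Transactions VALUES (?, ?)",
                 [(1, 50.0), (2, 500.0), (3, 1200.0)])

# Filter and sort inside the query so only the needed subset is loaded.
subset = conn.execute(
    "SELECT Transaction_ID, Amount FROM Transactions "
    "WHERE Amount > 100 ORDER BY Amount DESC"
).fetchall()
```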

Chapter 5
Perform the Analysis: Types of Data Analytics

Four Types of Questions


 Descriptive Analytics What happened? What is happening? (Chapter 6)
- How long have existing accounts receivable been outstanding?
 Diagnostic Analytics Why did it happen? What are the reasons for past results? Can we explain
why it happened? (Chapter 7)
- Why did selling and administrative expenses increase as compared to the
industry?
 Predictive Analytics Will it happen in the future?
- What is the probability something will happen? Is it forecastable? (Chapter 8)
 Prescriptive Analytics What should we do, based on what we expect will happen?
- How do we optimize our performance based on potential constraints? (Chapter
9)

Presentation Outline
I. Descriptive Analytics
II. Diagnostic Analytics
III. Predictive Analytics
IV. Prescriptive Analytics
V. Review of Basic Statistics
VI. The Excel Data Analysis Toolpak

I. Descriptive Analytics
1. Counts: Show how frequently an event occurs
2. Totals, sums, averages, subtotals: Summarize measures of performance.
3. Minimums, maximums, medians, standard deviations: Summarize measures
showing extreme values to help explain what happened
4. Graphs (bar charts), histograms.
5. Percentage change from one period to the next using vertical analytics, horizontal
analytics, or common-size financial statements.

6. Ratio analytics like return on assets, return on sales (profit margin), asset turnover
ratios, debt-to-equity ratios: Calculate important financial ratios for comparison.
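Several of the descriptive measures listed above can be computed directly with Python's statistics module; the daily sales figures below are hypothetical:

```python
import statistics

# Hypothetical daily sales figures.
sales = [120.0, 95.0, 130.0, 87.0, 150.0, 110.0]

count = len(sales)                    # 1. counts
total = sum(sales)                    # 2. totals/sums
average = statistics.mean(sales)      # 2. averages
low, high = min(sales), max(sales)    # 3. minimums and maximums
middle = statistics.median(sales)     # 3. medians
spread = statistics.stdev(sales)      # 3. sample standard deviation
```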

II. Diagnostic Analytics


A. Identify Anomalies/Outliers
B. Finding Previously Unknown Linkages, Patterns, or Relationships Between and Among
Variables
1. Perform Drill-Down Analytics
2. Determine Relations/Patterns/Linkages Between Variables
A. Identifying Anomalies/Outliers
 Often a first step in diagnostic analytics is to look for and identify unusual, unexpected results or
transactions.
 Sequence checks and sequence analytics
 Why are some check numbers missing documentation? Does it signify errors or fraud or can
they be explained?
 Duplicate Transactions
 Why are there duplicates of some transactions in the financial reporting records? Are they fraud
or just errors?
 Benford’s Law – used to identify fraud or irregular transactions. Based on his analysis of
thousands of data sets, Dr. Benford was able to show that the probability of the first digit
being one was always about 30%, while the probability of the first digit being nine was less
than 5%.
 Why do some refunds from Verizon offered by customer service representatives depart from
the distribution expected by Benford’s Law? Are they associated with fraud?
 Variance analytics – (typically performed in management accounting), used to identify
differences from expectations.
 Why is the labor rate and labor use variance for direct labor at the manufacturing plant
unfavorable?
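Benford's expected first-digit distribution follows P(d) = log10(1 + 1/d), which can be verified in a few lines:

```python
import math

# Benford's Law: expected probability that the leading digit of a number is d.
benford = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

p_first_digit_one = benford[1]   # about 30%
p_first_digit_nine = benford[9]  # under 5%
```

Comparing the actual first-digit frequencies of a set of transactions against these expected probabilities is how irregular or fraudulent amounts are flagged.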

B1. Perform Drill-Down Analytics


 Look for patterns in the underlying data set by summarizing data at different levels and
uncovering additional details to understand why something happened.
 Uncover the details by summarizing the data at different levels.
- Example of Type of Question: Do some accounts need to be written off due to
being uncollectable?
- Type of Test: Prepare an aged analysis of accounts receivable by customer.
 Use of cross-tabulations to view transactions from different perspectives (a method of
quantitatively analyzing relationships among two or more variables).
- Example of Type of Question: Why were some transactions approved and
recorded on the weekend?
- Type of Test: Cross-tabulate, using a PivotTable, who approved transactions and
the day of the week the transactions were approved.
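A cross-tabulation like the weekend-approvals test can be sketched with a Counter; the approval log below is invented for illustration:

```python
from collections import Counter

# Invented approval log: (approver, day of week) for each transaction.
approvals = [
    ("Pat", "Saturday"), ("Pat", "Monday"),
    ("Lee", "Saturday"), ("Pat", "Saturday"),
]

# Cross-tabulate approver by day, as the PivotTable test would.
crosstab = Counter(approvals)
pat_saturday = crosstab[("Pat", "Saturday")]  # flag-worthy weekend activity
```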

B2. Determine Relations/Patterns/Linkages between Variables

 Used to find the extent to which there are patterns in the data between and among variables.

III. Predictive Analytics

IV. Prescriptive Analytics

V. Review of Basic Statistics
A. Population Versus Sample
B. Parameters Versus Statistics
C. Probability Distributions
D. Normal Distribution
E. Hypothesis Testing
F. Alpha, p-values, and Confidence Intervals
G. Review of Hypothesis Testing
H. Regression
I. Sample t-Test of a Difference of Means of Two Groups

A. Population vs. Sample


 Population - a group of phenomena having something in common
 Sample - a subset of members of a population selected to represent that population

B. Parameters vs. Statistics


 Parameter – a characteristic of a population [e.g., the population mean (μ)]
 Statistic – a characteristic of a sample [e.g., the sample mean, x̄]

 The most common measures of spread or variability are the standard deviation and the
variance, where each observation in the sample is xᵢ, the sample mean is x̄, and the total
number of observations is n. The standard deviation of a sample, s, and the sample variance,
s², are computed as follows:

s² = Σ(xᵢ − x̄)² / (n − 1)        s = √s²

 The greater the sample standard deviation or variance, the greater the variability.
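A sketch computing s² = Σ(xᵢ − x̄)²/(n − 1) and s = √s² for a small hypothetical sample:

```python
import math

# Small hypothetical sample of n = 5 observations.
x = [4.0, 8.0, 6.0, 5.0, 7.0]
n = len(x)
x_bar = sum(x) / n  # sample mean

# Sample variance and standard deviation (n - 1 in the denominator).
s2 = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)
s = math.sqrt(s2)
```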

C. Probability Distributions
 Normal Distribution – a bell-shaped curve symmetric about its mean, with data points
closer to the mean more frequent than data points further from the mean. It is
arguably the most important probability distribution because it fits so many naturally
occurring phenomena.
 Uniform Distribution – every outcome equally likely.

 Exhibit 5.13A shows an example of the uniform distribution where each of the ten
possibilities is equally likely.
 Poisson Distribution – typically has a low mean and is highly skewed to the right; it models
the mean number of events per interval of space or time. In business, it might be helpful in
predicting customer sales on a particular day of the year, or the number of diners in a
restaurant on a particular day. Exhibit 5.13B shows an example.

D. Normal Distributions
 Data within +/- one standard deviation includes 68% of the data points.
 Data within +/- two standard deviations includes 95% of the data points.
 Data within +/- three standard deviations includes 99.7% of the data points.
 A z-score tells us how many standard deviations (σ) a data point, xᵢ, is from the
population mean, µ, using the formula z = (xᵢ − µ)/σ. A z-score of 1 suggests that the
observation is one standard deviation above its mean. A z-score of −2 suggests that the
observation is two standard deviations below its mean.
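The z-score formula as a one-line function (the mean and standard deviation used below are illustrative):

```python
def z_score(x_i, mu, sigma):
    """How many standard deviations x_i lies from the population mean mu."""
    return (x_i - mu) / sigma

# Illustrative values: population mean 100, standard deviation 10.
z_above = z_score(110.0, 100.0, 10.0)  # one standard deviation above the mean
z_below = z_score(80.0, 100.0, 10.0)   # two standard deviations below the mean
```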

E. Hypothesis Testing
 Null Hypothesis: assumes the hypothesized relationship does not exist, that there is no
significant difference between two samples or populations
- H0: We expect that there is no difference in sales returns between holiday and
non-holiday season
 Alternative Hypothesis: a hypothesis used in hypothesis testing that is opposite of the null
hypothesis, or a potential result that the analyst may expect
- HA: We expect that there are greater sales returns during the holiday season as
compared to the non-holiday season.
F. Alpha, p-values and Confidence Intervals
 There are two types of results from a statistical test of hypotheses that may occur or may be
interpreted in different ways: the p-value and/or confidence intervals.
 The p-value is compared to a threshold value, called the significance level (or alpha). A common
value used for alpha is 5% or 0.05 (as is 1% or 0.01).
- If p-value > alpha: Fail to reject the null hypothesis (i.e., not a significant result).
- If p-value <= alpha: Reject the null hypothesis (i.e., a significant result).
 For example, if alpha (α) is 5%, then the confidence level is 95%.
 Therefore, statements such as the following can also be made:
- With a p-value of 0.09, the test found that Saturday sales are not different
from Sunday sales, failing to reject the null hypothesis at a 95 percent
confidence level.
 The results of the statistical test should then be reported to management.

G. Review of Hypothesis Testing


 Note that failing to reject the null hypothesis (Ho) does not imply that Ho has been accepted. It
only indicates that it has not been proven wrong.

H. Regression
 We can think about this like an algebraic equation, y = f(x), where y is the dependent
variable and x is the independent variable.
 Let’s imagine we are considering the relationship between SAT scores and the college
completion rate for first-time, full-time students at four-year institutions.
 In this example y (college completion rate) = f (factors potentially predicting college completion
rate), including the independent variable SAT score (SAT_AVG).
 The R Square represents the percent of variation in the dependent variable that can be
explained by changes in the independent variable.
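A from-first-principles sketch of simple regression and R Square; the (SAT average, completion rate) pairs below are fabricated for illustration, not real institution data:

```python
# Fabricated (SAT average, completion rate) pairs for illustration only.
xs = [1000.0, 1100.0, 1200.0, 1300.0]
ys = [0.40, 0.50, 0.55, 0.70]

n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n

# Least-squares slope and intercept for y = a + b*x.
sxx = sum((x - mean_x) ** 2 for x in xs)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
b = sxy / sxx
a = mean_y - b * mean_x

# R Square: share of the variation in y explained by x.
ss_tot = sum((y - mean_y) ** 2 for y in ys)
ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
r_squared = 1 - ss_res / ss_tot
```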

I. Sample t-Test of a Difference of Means of Two Groups


 Let’s suppose that a company is trying to understand if its rate of sales returns is higher around
the end-of-year holidays than at other times (nonholidays) during the year.
 The mean holiday sales returns over 1,167 days is 13.0% of sales and the mean
nonholiday sales returns over 5,839 days is 11.9%.
 The question is whether these two means are statistically different from one another.
The t-statistic of 7.86 and the p-value for the one-tail test is 3.59E-15 (well below .01
percent), suggesting that the two-sample means are significantly different from each
other.
 A one-tailed t-test is used if we hypothesize that holiday returns are significantly greater
(or significantly smaller) than nonholiday returns. A two-tailed t-test is used if we don’t
hypothesize holiday or nonholiday returns are greater or smaller than the other, only
that we expect the two sample means to be different from each other.
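A pooled (equal-variance) two-sample t-statistic computed by hand on tiny made-up samples; the chapter's actual test used 1,167 holiday and 5,839 nonholiday days:

```python
import math

# Tiny made-up samples; the chapter's actual test used far more observations.
holiday = [4.0, 5.0, 6.0]
nonholiday = [1.0, 2.0, 3.0]

def mean(v):
    return sum(v) / len(v)

def svar(v):
    """Sample variance with n - 1 in the denominator."""
    m = mean(v)
    return sum((x - m) ** 2 for x in v) / (len(v) - 1)

# Pooled (equal-variance) two-sample t-statistic.
n1, n2 = len(holiday), len(nonholiday)
sp2 = ((n1 - 1) * svar(holiday) + (n2 - 1) * svar(nonholiday)) / (n1 + n2 - 2)
t_stat = (mean(holiday) - mean(nonholiday)) / math.sqrt(sp2 * (1 / n1 + 1 / n2))
```

A larger t-statistic (compared against the t-distribution's critical value for the chosen tail and alpha) corresponds to a smaller p-value and stronger evidence that the two means differ.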

VI. The Excel Data Analysis Toolpak


A. Performing Data Analytics
B. Loading the Toolpak
