Chapter 1-5 Notes Data Analytics
1. Ask the Question
One should ask carefully constructed questions that can potentially be answered using
data and Data Analytics.
Narrowing the scope of the question makes it easier to find the necessary data,
perform the analytics, and address the question.
Four common types of questions are:
1. What happened? What is happening? (descriptive analytics)
2. Why did it happen? What are the root causes? (diagnostic analytics)
3. Will it happen in the future? Is it forecastable? (predictive analytics)
4. What should we do, based on what we expect will happen? (prescriptive analytics)
What-if/goal-seek analysis might be used to analyze how changing costs and other
factors affect the break-even point for a new product.
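A minimal sketch of such a what-if/goal-seek analysis, with hypothetical cost figures (the break-even formula is fixed costs divided by contribution margin per unit):

```python
# What-if/goal-seek sketch for a break-even analysis.
# All figures are hypothetical, for illustration only.

def break_even_units(fixed_costs, price, variable_cost):
    """Units needed to cover fixed costs at a given price and variable cost."""
    return fixed_costs / (price - variable_cost)

fixed_costs = 50_000.0
price = 25.0

# What-if: vary the variable cost per unit and watch the break-even point move.
for variable_cost in (10.0, 12.0, 15.0):
    units = break_even_units(fixed_costs, price, variable_cost)
    print(f"variable cost ${variable_cost:.2f} -> break-even at {units:,.0f} units")

# Goal seek: find the price that yields a 4,000-unit break-even target,
# by bisection over a plausible price range.
target_units = 4_000.0
lo, hi = 15.0, 100.0
for _ in range(60):
    mid = (lo + hi) / 2
    if break_even_units(fixed_costs, mid, 12.0) > target_units:
        lo = mid  # price too low: break-even still above the target
    else:
        hi = mid
print(f"price to break even at {target_units:,.0f} units: ${hi:.2f}")
```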
Chapter 2
Master the Data: An Introduction to Accounting Data
Presentation Outline
I. Accounting Data Analytics and Data Sources
II. The Meaning of Big Data
III. Sources of Data for Accounting Analysis
IV. The PIVOTTABLE
A. Data Analytics and Accounting
B. Master the Data
One of the Biggest Challenges:
- Finding which of all the mounds of data is relevant to the decision maker.
B. Variety
Variety is the form of the data.
Structured data is highly organized and fits neatly into a table or in a database. The
best accounting example of structured data is a financial statement in a tabular
format.
Unstructured data is data without internal organization. Examples include blogs,
social media, and pictures posted on Instagram.
Semi-structured data does not have labeled columns, but its data may come with
tags or markers explaining what the data represents. An accounting example is
XBRL data, which puts tags on financial statement data so that computers can easily
read and evaluate it.
C. Velocity
Velocity is the speed at which the data is being generated.
Stock prices might be generated and analyzed on a second-by-second basis.
A company’s financial statements might be generated and analyzed on a monthly or
quarterly basis.
D. Veracity
Veracity is the quality of the data. It is defined in terms of whether the data is truthful,
accurate (and clean), and worthy of trust. Some suggest that veracity is the
cornerstone of Big Data and Data Analytics.
Fact vs. estimate – in accounting, some data is generally considered factual (e.g., the
cash balance in a bank account), and other data is considered estimated (e.g., the
balance in Allowance for Doubtful Accounts or the amount of Goodwill).
Accurate – some data contains errors (e.g., an incorrect check posting) or has missing
values (e.g., an accountant forgets to include the date of a transaction). Still another
possibility is fraud concealed through manipulation of records (e.g., lapping of
accounts receivable).
2. Subsidiary Fixed Asset Ledger
3. Subsidiary Accounts Receivable Ledger
4. Subsidiary Inventory Ledger
Fixed assets include property, plant (i.e., factories, office buildings, stores, etc.),
equipment (vehicles, forklifts, computers, tools, etc.), and furniture.
This ledger also keeps details regarding the purchase date, depreciation
method, and accumulated depreciation for each fixed asset.
The detailed balances in each category of fixed asset account in the fixed asset
subsidiary ledger support the control account balance in the general ledger.
Many simply refer to this ledger as a depreciation schedule.
3. Subsidiary Accounts Receivable Ledger
The subsidiary accounts receivable ledger details information regarding charges and
payments on customer accounts for each customer.
The total of the subsidiary accounts receivable ledger supports the accounts receivable
control account in the general ledger.
3. XBRL (eXtensible Business Reporting Language)
XBRL is the computer-based standard used to define and exchange financial information
between disclosing companies and various financial statement users.
The SEC requires each publicly traded company to submit its financial statements using XBRL.
An example of the use of XBRL is requesting data for IBM’s Total Assets (XBRL tag: Assets) and
Liabilities (XBRL tag: Liabilities) from 2014 to 2017.
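The original exhibit is not reproduced here, but a hedged sketch of pulling such tagged facts out of an XBRL instance document might look like the following (the file name and taxonomy namespace URI are assumptions):

```python
# Hedged sketch: extracting tagged facts (us-gaap:Assets, us-gaap:Liabilities)
# from an XBRL instance document with Python's standard library.
# "ibm-20171231.xml" is a hypothetical local file name, and the us-gaap
# namespace URI varies by taxonomy year.
import xml.etree.ElementTree as ET

US_GAAP = "http://fasb.org/us-gaap/2017-01-31"  # assumed taxonomy version

tree = ET.parse("ibm-20171231.xml")
for tag in ("Assets", "Liabilities"):
    for fact in tree.getroot().iter(f"{{{US_GAAP}}}{tag}"):
        # contextRef links each fact to a reporting period defined elsewhere
        # in the instance document.
        print(tag, fact.get("contextRef"), fact.text)
```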
4. Press Releases
Companies often issue press releases to make an official statement to the media. Over time,
press releases can be aggregated into a fairly comprehensive view of:
What is happening at a company,
The tone of management toward business opportunities, and
Future profitability.
Press releases represent unstructured data available for analysis.
1. Budget Data
Budgets generally start with a prediction of the level of sales. The company then predicts the
level of expenses and capital expenditures that will be needed to support those sales.
Comparing the actual to the budgeted amounts helps the company learn both what occurred
as anticipated and what was unanticipated.
Note that the overhead volume variance calculated here assumes that hours are used to
allocate the fixed manufacturing overhead. Alternatively, the overhead volume variance can be
calculated as follows: Flexible budget level of overhead for the actual level of production –
Overhead applied to production using standard overhead rate.
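A small worked example of this alternative calculation, using hypothetical figures:

```python
# Worked example of the alternative overhead volume variance calculation:
# flexible-budget fixed overhead minus overhead applied at the standard rate.
# All figures are hypothetical.
budgeted_fixed_overhead = 120_000.0  # flexible budget fixed overhead (assumed)
standard_hours_per_unit = 2.0
standard_rate_per_hour = 6.0         # standard fixed overhead rate (assumed)
actual_units_produced = 9_500

applied_overhead = (actual_units_produced
                    * standard_hours_per_unit
                    * standard_rate_per_hour)

# Fixed overhead in the flexible budget does not change with production,
# so the volume variance is the budgeted amount minus the applied amount.
volume_variance = budgeted_fixed_overhead - applied_overhead
direction = "unfavorable" if volume_variance > 0 else "favorable"
print(f"applied: ${applied_overhead:,.0f}; "
      f"volume variance: ${abs(volume_variance):,.0f} {direction}")
```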
Customer contact history
Customer credit score
Customer credit limit
Customer payment history
Having such data on each customer could help in predicting the allowance for doubtful
accounts.
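A hedged sketch of how such customer data might feed an allowance estimate; the balances, risk buckets, and loss rates below are invented, not an authoritative model:

```python
# Hedged sketch: estimating the allowance for doubtful accounts from
# customer data. Balances, risk buckets, and loss rates are invented.
customers = [
    # (receivable balance, credit score, days past due)
    (12_000.0, 720, 0),
    (8_500.0, 640, 35),
    (3_200.0, 580, 95),
]

def estimated_loss_rate(credit_score, days_past_due):
    """Assumed loss rates by risk bucket -- not an authoritative model."""
    if days_past_due > 90 or credit_score < 600:
        return 0.40
    if days_past_due > 30 or credit_score < 680:
        return 0.10
    return 0.02

allowance = sum(balance * estimated_loss_rate(score, dpd)
                for balance, score, dpd in customers)
print(f"Estimated allowance for doubtful accounts: ${allowance:,.2f}")
```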
D. Tax Data
Accountants are deeply interested in the impact of transactions and events on the amount of
tax that is due and payable.
The tax information comes from transactional data gathered in the financial reporting system.
Two examples of tax data that need to be stored:
Depreciation used for tax and financial reporting purposes. Sometimes the depreciation
used for tax is different from that used for financial reporting.
Certain information is used to claim a research and development tax credit, including
linking an employee’s time directly to a research activity or the use of specific
equipment.
E. Nonaccounting Data
Some nonaccounting data helps accountants better understand accounting data.
1. Economic Data
2. Current and Historical Stock Prices
3. Social Media
4. Analyst Research Reports and Earnings Forecasts
1. Economic Data
Accountants sometimes use macroeconomic data, such as:
Gross domestic product (GDP) – as a measure of economy-wide performance.
Unemployment numbers – as a measure of labor availability.
Consumer price index (CPI) – as a measure of inflation.
Housing market starts and price levels – generally regarded as a key measure of
economic status.
Macroeconomic data is generally useful for diagnostic and predictive analytics.
3. Social Media
Social media and the Internet give potential investors several ways to communicate with each
other.
Discussion occurs on blog sites and chat boards (like those on Yahoo! Finance) and on Twitter
or StockTwits. StockTwits organizes its discussion using a cashtag, which is $ plus the ticker
symbol. So for Netflix, the cashtag would be $NFLX. Any discussion that includes that cashtag
is summarized on the NFLX StockTwits page.
Sometimes data analysts employ computer programs (sometimes called machine learning) to
assess the sentiment on chat sites and see how it is related to stock price. The machine learning
techniques may count how many positive words are posted on StockTwits compared to the
number of negative words to assess the overall sentiment reflected in the posts.
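A toy sketch of this word counting, with made-up word lists and posts:

```python
# Toy sentiment scoring: count positive vs. negative words in posts.
# The word lists and the example posts are made up for illustration.
positive = {"bullish", "beat", "growth", "strong", "upgrade"}
negative = {"bearish", "miss", "weak", "downgrade", "selloff"}

posts = [
    "$NFLX strong subscriber growth, analysts upgrade",
    "$NFLX bearish here, content costs look weak",
]

pos = neg = 0
for post in posts:
    for word in post.lower().split():
        word = word.strip(".,!?$")  # drop punctuation and the cashtag symbol
        if word in positive:
            pos += 1
        elif word in negative:
            neg += 1

print(f"positive words: {pos}, negative words: {neg}")
print("overall sentiment:", "positive" if pos > neg else "negative")
```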
Product reviews can give insight as to whether a product is pleasing to customers and help
predict demand.
Chapter 3
Accounting Data: Data Types and How They Are Used
Presentation Outline
I. Structured Data Types
II. Categorizing Data Based on Tools
III. Analyzing Categorical and Numerical Values in a PIVOTTABLE
IV. Data Dictionaries and Data Catalogs
With a focus on the analysis of structured data, this chapter will delve into ways to extract,
transform, and load data to avoid information overload.
A. Categorical Data
Categorical data tend to “categorize” items represented by words, such as classifying a group
of people by gender (e.g., male or female) or labeling inventory cost-flow methods (e.g., FIFO,
LIFO, average cost). There are two subsets within the categorical data type:
1. Nominal Data
2. Ordinal Data
1. Nominal Data
Nominal data is categorical data that cannot be ranked, such as gender and type of transaction.
The primary methods to summarize categorical data that is nominal are:
Counting and grouping
Proportion
Let’s begin by analyzing Transaction Type, which is a categorical, nominal data type. By
highlighting the transactions associated with returns, we can count how many transactions are
Returns (9), which means that 11 transactions are Sales.
This analysis of categorical data can go further by calculating the proportion.
Proportion is calculated by taking the number of observations in a category and dividing that
number by the grand total of the number of observations in the sample.
Return transactions are 9 of the 20 in the sample, meaning that 45 percent of the transactions
are Returns. We can infer that the remaining 55 percent of the transactions were Sales.
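These counts and proportions can also be computed in code. A minimal pandas sketch that reproduces the 11 Sales / 9 Returns example:

```python
# Counting, grouping, and proportions for a nominal variable, assuming a
# small transactions table like the 20-transaction sample in the text.
import pandas as pd

transactions = pd.DataFrame({
    "Transaction Type": ["Sale"] * 11 + ["Return"] * 9,
})

counts = transactions["Transaction Type"].value_counts()
proportions = transactions["Transaction Type"].value_counts(normalize=True)
print(counts)       # Sale: 11, Return: 9
print(proportions)  # Sale: 0.55, Return: 0.45
```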
2. Ordinal Data
Categorical, ordinal data has the same characteristics as nominal data, but it goes a step
further: there is a natural “order” to ordinal data that allows you to rank and sort it. This means
that there are three primary methods to summarize categorical, ordinal data:
Counting and grouping
Proportion
Ranking
Examples of ordinal data are letter grades (e.g., A, B, C, D, and F) and Olympic medals (e.g., gold,
silver, and bronze).
In considering the Date variable, we can count the number of transactions associated with each
date, but there is also a natural ordering of the dates through time.
A summary table can show the number of transactions on each date, as well as the
proportion of transactions on each date in the sample.
Ranking refers to a position on a scale. While ordinal data results could be ranked by the count
or in some cases alphabetically, a ranking of the records on the basis of the natural order of date
is more informative for this dataset.
B. Numerical Data
Numerical data, as the name implies, are meaningful numbers, such as transaction amount,
net income, age, or the score on an exam. There are four primary methods to summarize
numerical data:
Counting and grouping
Proportion
Summing
Averaging
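A minimal pandas sketch of the counting, summing, and averaging methods, using hypothetical transaction amounts:

```python
# Counting, summing, and averaging a numerical variable, using
# hypothetical transaction amounts.
import pandas as pd

amounts = pd.Series([120.50, 89.99, 240.00, 35.75], name="Amount")
print(f"Count:   {amounts.count()}")
print(f"Sum:     ${amounts.sum():,.2f}")
print(f"Average: ${amounts.mean():,.2f}")
```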
The two types of numerical data are:
1. Interval Data
2. Ratio Data
1. Interval Data
Interval data is so named because there is an equal “interval” or distance between each
observation. However, interval data does not have a meaningful point of zero.
A good example of interval data is temperature. The difference of 1 degree on the
Fahrenheit scale is the same from 30 to 31 as it is from 77 to 78, which makes the
data type numerical. However, when the temperature is 0, that does not mean “the
absence of temperature”; it is simply 1 degree below 1, and 1 degree above -1.
Another example of interval data is SAT scores. A student cannot even earn a 1 on
the SAT; the range of possible scores is from 400 to 1,600 for the total score.
Interval data is uncommon in the type of data that accountants work with.
2. Ratio Data
Ratio data is defined as numerical data with an equal and definitive interval between each
data point and an absolute “zero” at the point of origin. It is also important to note that
negative values take on meaning with ratio data. For example, the sum of net sales
considers the negatives for sales discounts and sales returns and allowances.
Data measuring money—such as transaction amounts, expenses, revenues, assets,
salary, taxes, etc.—are all examples of ratio data.
The majority of data relating to accounting and other business decisions are ratio
data.
Depending on the way the data is set up in the system or database, sometimes
transactions are listed at the transaction amount’s absolute value (i.e., the positive
value that corresponds with the number). For example, revenues and expenses are
expressed as positive numbers on an income statement, but expenses are subtracted
from revenues to calculate net income.
C. Structured Summary
A. Database Data
Databases and some tools define data using the following types (a short sketch follows the list):
String, text, short text, or alphanumeric – a string of characters is a collection of
one or more characters that are stored as categorical data. The characters can
be letters, numbers, or a combination, but even if they are stored as numbers,
the numbers are not interpreted as meaningful values that can be used in
calculations. Short text would be a brief name. Long text would be a paragraph.
Date – the date data type represents a string of characters that are formatted in
a traditional date format, such as mm/dd/yyyy or mm/dd/yy.
Number – the number data type is reserved for numeric data, typically ratio
data. Any characters that are stored as a number can be used in calculations.
Y/N flag – used to indicate yes or no.
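As a sketch of these types in practice, here is a small SQLite example from Python; SQLite’s type names (TEXT, REAL, INTEGER) differ slightly from the generic names above, and the table is hypothetical:

```python
# Sketch of the database data types above, using SQLite from Python.
# The transactions table is hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE transactions (
        customer_name TEXT,     -- string / short text
        txn_date      TEXT,     -- dates stored as formatted strings
        amount        REAL,     -- number (usable in calculations)
        is_return     INTEGER   -- Y/N flag stored as 0 or 1
    )
""")
conn.execute("INSERT INTO transactions VALUES ('Acme Co', '07/15/2024', 250.00, 0)")
total = conn.execute("SELECT SUM(amount) FROM transactions").fetchone()[0]
print(f"Total amount: ${total:,.2f}")  # numbers can be used in calculations
```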
B. Tableau Data
Tableau and some other tools make use of the following data types:
1. Dimension – any attribute that is considered to be categorical.
2. Measure – any attribute that is considered to be numerical.
C. Geographic Data
Geographic data is any data that can be linked to a map, such as an attribute for state, city, or
country.
Both Excel and Tableau interact with geographic data.
For example, a sales-by-country map illustration can be completed in Tableau.
A. Insert a PivotTable
To create a PivotTable from the transaction table example, download and open the file “Exhibit 3-
1 – TransactionsTable.xlsx.”
Chapter 4
Master the Data: Preparing Data for Analysis
Although there are a variety of types of databases, the relational model is the most popular and
common.
It is possible to create and store data in Excel, but it is far preferable to store data in a database
and simply connect it to Excel when you wish to perform data analysis.
Tableau will almost always default to showing data in a visual format instead of a numerical
format.
Tableau’s biggest advantage over Excel is data visualization.
Unlike Excel, it is not possible to create raw data in Tableau. Tableau must create a
connection to an existing data source.
B. Entity Relationship Diagram
In Microsoft Access, the relationship would appear in the Relationships window as a join
line connecting the related tables.
A. Data Integrity
Data integrity essentially means truth in data. Accounting information must be both relevant
and a faithful representation of what occurs. Accounting information that exhibits a faithful
representation has the following three characteristics:
Free from error (contains no mistakes or inaccuracies)
Complete (includes all monetary transactions)
Neutral (information is not biased)
Integrity of data can be damaged if different versions of the data are stored on users’
desktops or laptops rather than analyzed through a live connection to the database.
When that happens, the data being used for analysis and decisions becomes out-of-date as soon
as the database accumulates new data; thus, “multiple versions of the truth” end up being
stored on computers across the company.
Version control reduces the possibility of having more than one version of the data:
Data integrity is maintained when data is stored in one centralized database that business
users can connect to directly using Excel, Tableau, or other tools, rather than having multiple
desktop databases or Excel files where data is stored.
Data imported as a categorical value will have an Abc icon.
Data imported as a numerical value will have a number sign (#) icon.
It is always a good idea to double-check the format in which the data was imported before
beginning to work with the data, and to change the format if there was a mistake.
For example, if you wanted to work with the order number as a numerical value, click the Abc
icon and change the data type, as shown in Exhibit 4.10.
C. Extract and Transform: Connecting to a Subset of Data from a Database Using SQL
If the data in a database is too large to load directly into Excel, Structured Query Language (SQL)
can be used to select only a subset of the data.
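A minimal sketch of this idea from Python, using SQLite; the database file, table, and column names are hypothetical:

```python
# Minimal sketch of selecting a subset of rows with SQL before loading them
# into an analysis tool; the database file, table, and columns are hypothetical.
import sqlite3

conn = sqlite3.connect("company.db")  # assumed database file
query = """
    SELECT invoice_id, customer_id, invoice_date, amount
    FROM sales_invoices
    WHERE invoice_date >= '2024-01-01'  -- pull only the current year
"""
rows = conn.execute(query).fetchall()
print(f"Loaded {len(rows)} rows instead of the full table.")
```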
Chapter 5
Perform the Analysis: Types of Data Analytics
Presentation Outline
I. Descriptive Analytics
II. Diagnostic Analytics
III. Predictive Analytics
IV. Prescriptive Analytics
V. Review of Basic Statistics
VI. The Excel Data Analysis ToolPak
I. Descriptive Analytics
1. Counts: Show how frequently an event occurs.
2. Totals, sums, averages, subtotals: Summarize measures of performance.
3. Minimums, maximums, medians, standard deviations: Summarize measures
showing extreme values to help explain what happened
4. Graphs (e.g., bar charts) and histograms: Visualize the distribution of the data.
5. Percentage change from one period to the next using vertical analytics, horizontal
analytics, or common-size financial statements.
6. Ratio analytics like return on assets, return on sales (profit margin), asset turnover
ratios, debt-to-equity ratios: Calculate important financial ratios for comparison.
II. Diagnostic Analytics
Diagnostic analytics are used to find the extent to which there are patterns in the data between
and among variables.
V. Review of Basic Statistics
A. Population Versus Sample
B. Parameters Versus Statistics
C. Probability Distributions
D. Normal Distribution
E. Hypothesis Testing
F. Alpha, p-values, and Confidence Intervals
G. Review of Hypothesis Testing
H. Regression
I. Sample t-Test of a Difference of Means of Two Groups
The most common measures of spread or variability are the standard deviation and the
variance, where each observation in the sample is x, the sample mean is x̄, and the total
number of observations is n. The standard deviation of a sample, s, and the sample
variance, s², are computed as follows:
s² = Σ(xᵢ − x̄)² / (n − 1)
s = √(s²)
The greater the sample standard deviation or variance, the greater the variability.
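A quick numerical check of these formulas using Python’s statistics module (the sample values are made up):

```python
# Quick numerical check of the sample standard deviation and variance,
# using made-up sample values.
import statistics

sample = [10.0, 12.0, 9.0, 11.0, 13.0]
s = statistics.stdev(sample)      # sample standard deviation (divides by n - 1)
s2 = statistics.variance(sample)  # sample variance
print(f"s = {s:.3f}, s^2 = {s2:.3f}")
```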
C. Probability Distributions
Normal Distribution – a bell-shaped curve symmetric about its mean, with data points
closer to the mean occurring more frequently than data points further from the mean. It is
arguably the most important probability distribution because it fits so many naturally
occurring phenomena.
Uniform Distribution – every outcome equally likely.
Exhibit 5.13A shows an example of the uniform distribution where each of the ten
possibilities is equally likely.
Poisson Distribution – a distribution with a low mean that is highly skewed to the right; it
describes the number of events per interval of space or time. In business, it might be helpful
in predicting customer sales on a particular day of the year, or the number of diners in a
restaurant on a particular day. Exhibit 5.13B shows an example.
D. Normal Distribution
Data within +/- one standard deviation includes 68% of the data points.
Data within +/- two standard deviations includes 95% of the data points.
Data within +/- three standard deviations includes 99.7% of the data points.
A z-score is computed to tell us how many standard deviations (σ) a data point, xᵢ, is from
the population mean, µ, using the formula z = (xᵢ − µ)/σ. A z-score of 1 suggests that the
observation is one standard deviation above its mean. A z-score of -2 suggests that the
observation is two standard deviations below its mean.
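A quick check of the formula in code, with an assumed population mean and standard deviation:

```python
# Computing z-scores with the formula above; the population mean and
# standard deviation here are assumed values.
mu = 500.0     # population mean
sigma = 40.0   # population standard deviation

for x in (540.0, 420.0):
    z = (x - mu) / sigma
    print(f"x = {x}: z = {z:+.1f}")
# 540 -> z = +1.0 (one standard deviation above the mean)
# 420 -> z = -2.0 (two standard deviations below the mean)
```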
E. Hypothesis Testing
Null Hypothesis: assumes the hypothesized relationship does not exist; that is, there is no
significant difference between the two samples or populations.
- H0: We expect that there is no difference in sales returns between the holiday and
non-holiday seasons.
Alternative Hypothesis: a hypothesis used in hypothesis testing that is the opposite of the null
hypothesis, or a potential result that the analyst may expect.
- HA: We expect that there are greater sales returns during the holiday season as
compared to the non-holiday season.
F. Alpha, p-values and Confidence Intervals
The results of a statistical test of hypotheses may be reported and interpreted in two different
ways: using the p-value and/or confidence intervals.
The p-value is compared to a threshold value, called the significance level (or alpha). A common
value used for alpha is 5% or 0.05 (as is 1% or 0.01).
- If p-value > alpha: Fail to reject the null hypothesis (i.e., the result is not significant).
- If p-value <= alpha: Reject the null hypothesis (i.e., the result is significant).
For example, if alpha (α) is 5%, then the confidence level is 95%.
Therefore, statements such as the following can also be made:
- With a p-value of 0.09, the test found that Saturday sales are not different from
Sunday sales, failing to reject the null hypothesis at a 95 percent confidence level.
The results of the statistical test should then be reported to management.
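A hedged sketch of how the holiday vs. non-holiday test described above might be run, using a two-sample t-test on made-up sales-returns data (scipy is assumed to be available):

```python
# Hedged sketch of the holiday vs. non-holiday sales-returns test using a
# two-sample t-test; the return counts below are made up.
from scipy import stats

holiday_returns = [48, 52, 61, 55, 67, 59, 63, 58]
non_holiday_returns = [41, 44, 39, 47, 42, 45, 40, 43]

t_stat, p_value = stats.ttest_ind(holiday_returns, non_holiday_returns)

alpha = 0.05
print(f"t = {t_stat:.2f}, p-value = {p_value:.4f}")
if p_value <= alpha:
    print("Reject H0: returns differ between holiday and non-holiday seasons.")
else:
    print("Fail to reject H0: no significant difference detected.")
```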
H. Regression
We can think about this like an algebraic equation, y = f(x), where y is the dependent variable
and x represents the independent variables.
Let’s imagine we are considering the relationship between SAT scores and the college
completion rate for first-time, full-time students at four-year institutions.
In this example y (college completion rate) = f (factors potentially predicting college completion
rate), including the independent variable SAT score (SAT_AVG).
The R Square represents the percent of variation in the dependent variable that can be
explained by changes in the independent variable.
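A sketch of how such a regression could be fit in code; the SAT_AVG and completion-rate data points below are invented for illustration:

```python
# Sketch of a simple linear regression of completion rate on SAT_AVG,
# fit with ordinary least squares; the data points are invented.
from scipy import stats

sat_avg = [1010, 1080, 1150, 1220, 1300, 1380]          # SAT_AVG
completion_rate = [0.42, 0.48, 0.55, 0.61, 0.70, 0.78]  # dependent variable

result = stats.linregress(sat_avg, completion_rate)
print(f"completion_rate = {result.intercept:.3f} + {result.slope:.5f} * SAT_AVG")
print(f"R Square = {result.rvalue ** 2:.3f}")  # share of variation explained
```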