Using R For Financial Data
for
Introduction to Statistics with R

Assessment
1. Students have (need to decide the time) hours to work on an online quiz (available on Learn), which is structured like a homework assignment.
2. The typical quiz is a combination of multiple choice and numeric questions.
(a) The focus of the multiple choice questions will be on such topics as:
i. Statistical concepts and techniques.
ii. A&F applications of statistical concepts and techniques.
iii. Practical sources of A&F data and properties/structure of
these data sets.
iv. Understanding and transforming A&F data.
v. Interpreting statistical results by leveraging A&F knowledge.
vi. Commonly used R commands.
(b) To answer the numeric questions, students will have to use R to perform statistical analysis on a given A&F data set. The focus of the numeric questions will be on such topics as:
i. Load, understand, transform A&F data.
ii. Perform statistical analysis.
iii. Interpret statistical results.
3. The weekly quizzes are open-book/open-notes.
4. Students can take each quiz twice and the highest score counts.
5. Quizzes are available from Friday afternoon till Monday morning. Stu-
dents can take the quizzes at any time during this window.
6. The first quiz will be in the second week of classes. There is no quiz for the first week.
7. Students can drop one (lowest) quiz score. There are no make-ups for
any reason.
1 Introduction to R 1
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 R Installation . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 RStudio . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 R packages . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.5 R Basics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.5.1 Operations . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.5.2 Defining Variables . . . . . . . . . . . . . . . . . . . . 7
1.5.3 How to Clean the Content of Console . . . . . . . . . . 9
1.5.4 How to Clean the Content of Environment . . . . . . . 9
1.6 Working with R files . . . . . . . . . . . . . . . . . . . . . . . 9
1.7 Using Functions . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.7.1 Generating basic descriptive statistics . . . . . . . . . . 11
3 Summary Statistics 39
3.1 Management Accounting Data . . . . . . . . . . . . . . . . . . 39
3.1.1 Load and Review Sales by Store Data . . . . . . . . . . 39
3.1.2 Summary Statistics . . . . . . . . . . . . . . . . . . . . 40
3.1.3 Detecting Outliers . . . . . . . . . . . . . . . . . . . . 41
3.1.4 Boxplot . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.1.5 Subset of Outliers . . . . . . . . . . . . . . . . . . . . . 43
3.1.6 Practice Problems: Sales by Store . . . . . . . . . . . . 44
3.1.7 Frequency Tables for Categorical Data . . . . . . . . . 44
3.1.8 Practice Problems . . . . . . . . . . . . . . . . . . . . . 45
3.2 Stock Market Data . . . . . . . . . . . . . . . . . . . . . . . . 46
3.2.1 Stock Returns . . . . . . . . . . . . . . . . . . . . . . . 47
3.2.2 Summary Statistics: Stock Returns . . . . . . . . . . . 49
3.2.3 Detecting Outliers: Stock Returns . . . . . . . . . . . . 50
3.2.4 Identifying Negative Outliers in Stock Returns . . . . . 51
3.2.5 Categorical Variable for Stock Returns . . . . . . . . . 52
3.2.6 Frequency Tables for Categorical Stock Returns . . . . 53
3.2.7 Practice Problems: Do Stock Returns Have Memory? . 54
3.3 Financial Accounting Data . . . . . . . . . . . . . . . . . . . . 55
3.3.1 Summary Statistics & Outliers: Profits . . . . . . . . . 55
3.3.2 Summary Statistics & Outliers: Financial Ratios . . . . 57
3.3.3 Practice Problems: Financial Ratios . . . . . . . . . . 59
3.4 Solutions to Selected Practice Problems . . . . . . . . . . . . . 61
3.4.1 Sales by Store (p. 44) . . . . . . . . . . . . . . . . . . 61
3.4.2 Do Stock Returns Have Memory? (p. 54) . . . . . . . . 62
3.4.3 Financial Ratios (p. 59) . . . . . . . . . . . . . . . . . 63
4 Hypothesis Testing 65
4.1 Management Accounting Data . . . . . . . . . . . . . . . . . . 65
4.1.1 Load and Review Data . . . . . . . . . . . . . . . . . . 66
4.1.2 Null and Alternative Hypothesis . . . . . . . . . . . . . 67
4.1.3 Create and Review Random Samples . . . . . . . . . . 67
Random Sample: Liquor . . . . . . . . . . . . . . . . . 69
Random Sample: Wine . . . . . . . . . . . . . . . . . . 70
4.1.4 Practice Problems: Random Samples . . . . . . . . . . 71
4.1.5 Hypothesis Testing (two sided t-test) . . . . . . . . . . 71
I Appendix 97
A R Script 99
A.1 R Script: Introduction to R . . . . . . . . . . . . . . . . . . . 99
A.2 R Script: Load and Review Data . . . . . . . . . . . . . . . . 99
A.2.1 Section 2.1: Management Accounting Data . . . . . . . 99
A.2.2 Section 2.2: Stock Market Data . . . . . . . . . . . . . 100
A.2.3 Section 2.3: Financial Accounting Data . . . . . . . . . 101
A.3 R Script: Summary Statistics . . . . . . . . . . . . . . . . . . 102
A.3.1 Section 3.1: Management Accounting Data . . . . . . . 102
A.3.2 Section 3.2: Stock Market Data . . . . . . . . . . . . . 102
A.3.3 Section 3.3: Financial Accounting Data . . . . . . . . . 104
A.4 R Script: Hypothesis Testing . . . . . . . . . . . . . . . . . . . 106
A.4.1 Section 4.1: Management Accounting Data . . . . . . . 106
A.4.2 Section 4.2: Stock Market Data . . . . . . . . . . . . . 107
A.4.3 Section 4.3: Financial Accounting Data . . . . . . . . . 108
B Compustat 111
B.1 Compustat . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
B.1.1 Compustat: Single Firm . . . . . . . . . . . . . . . . . 112
B.1.2 Compustat: Entire Industry . . . . . . . . . . . . . . . 117
Introduction to R
Learning Objectives
By the end of this chapter students should have achieved the following learning objectives (know how to do the following):
1. Learn how to install R and RStudio.
2. Become familiar with the RStudio interface (environment) and the role
of packages in R.
3. Perform some basic tasks in R. For example, use R as a calculator,
define (create) variables, work with R files, and generate descriptive
statistics.
1.1 Introduction
R is a very powerful and versatile software package that is used in several required and elective courses in the School of Accounting and Finance (SAF). In very general terms, R can be used to extract, transform, and analyze structured data (e.g., financial-statement-based data or sales data) and unstructured data (e.g., tweets, emails, blogs).
While the list of R applications is very extensive, here are some examples of what you will learn as an SAF student: run SQL queries in order to extract and transform data; organize data (e.g., pivot tables); run simple statistical analyses, such as generating descriptive statistics, identifying outliers, and performing hypothesis testing; run regression analysis and classification analysis (logistic regression); perform sentiment analytics; and leverage neural network applications to make predictions.
Chances are that you may have heard from your fellow students that the learning curve of R is relatively steep, and you may have heard that some of the things that you can do with R you can also do with Excel. You may wonder why you should bother with R. There are several reasons why it is worth your time to make this investment.
• First, the spectrum of applications that you can work on is very large. It ranges from basic statistical analysis to very advanced data mining techniques and text mining for sentiment analysis.
• Second, once you have learned R, it will be relatively easy to work with any other package (e.g., SAS, STATA, SPSS, MATLAB, Octave).
• Third, R and Python are the most powerful and most widely used tools for data analytics.
• Fourth, R is open source software, which means that it is free.
Figure 1.1 shows the trade-off between the difficulty and complexity of Excel and R. (The figure was prepared by Gordon Shotwell and is available from the following URL: http://blog.yhat.com/posts/R-for-excel-users.html.)
As you can see, the difficulty (learning curve) of Excel is very low. The red Excel line grows very slowly along the horizontal axis. The complexity of Excel (i.e., what you can do with Excel) becomes almost vertical, which means there is a limit. Alternatively, you can think of this as saying that it becomes much more complex to do advanced analysis with Excel.
Looking at the R line, we can see that it rises very quickly, which means that the learning curve is steeper in the beginning. This makes sense because there is a lot of new terminology to cover. However, once you learn to work with R, it is much easier to go from basic statistical analysis to advanced data mining techniques.
1.2 R Installation
R is pre-installed in most computers. Check the applications in your Mac
or All Programs in your Windows machine. If it is not pre-installed, you
can download and install from the appropriate URL (shown below). For the
basic installation your can simply follow the directions on your screen and
accept the default settings.
1.3 RStudio
While we can run our analysis from R, the interface of RStudio is more user friendly. RStudio is an integrated development environment (IDE) that lets us see all components of our R project in an integrated interface. Download RStudio from http://www.rstudio.com/. Please remember that we first install R, and then RStudio.
The interface of RStudio (shown in Figure 1.2) is divided into four panels (areas). (If the interface looks different than the one shown in Figure 1.2, we can change it by selecting Preferences on a Mac or Tools > Global Options on Windows and then selecting Pane Layout. If we want our interface to match Figure 1.2, we can use the drop down menu in each pane and select Source for the upper left pane, Console for the upper right, etc.)
1. Source: this is where we write script that we want to save. This is helpful when we work on projects that we may want to revisit at a later point. We can create a new R script file by selecting File > New File > R Script. The new file will show as untitled1 in the source pane.
2. The Console plays a dual role. This is where we execute script interactively, and where we see the output/results of our script. Executing interactively means that when we type an instruction in the console and hit Return, the line is immediately executed and the output is shown in the console. If we execute a line or an entire script in the source area, the output will also be shown in the console.
3. The Files, Plots, Packages, Help, Viewer area serves multiple needs. For example, if our script contains instructions for the creation of a graph, the output will show up in the Plots tab. If we need to install specific packages that would allow us to execute some sophisticated functions, we do this from the Packages tab. We can also view the files in our working directory or access the built-in Help functionality.
4. The Environment area shows the data set(s) currently open and the
variables in each one of these data sets. The History keeps track of every
single line of script that we have entered through the console.
1.4 R packages
R has a collection of ready-to-use programs, called packages. For example, we can make some very powerful graphs with the package ggplot2, we can generate detailed descriptive statistics with the package psych, run SQL queries from within R using sqldf, and perform sentiment analysis based on Twitter data using twitteR and stringr.
There are a couple of ways that we can install these packages. First,
we can use the console to type the command install.packages(). For
example, the following line would install the psych package:
install.packages("psych")
Alternatively, we can click Packages > Install and select the package from the new window/drop down menu (see Figure 1.3). In the new window, make sure that Install Dependencies is selected.
Once the packages have been installed, we can load them (i.e., include
them in our script), using the library function. For example, the following
line would load the psych package.
library(psych)
1.5 R Basics
1.5.1 Operations
In its simplest form, we can use R as a calculator to perform basic operations.
For example, in the console area we can write (See also Figure 1.4):
> 50+20
It will return
[1] 70
> 80-30
It will return
[1] 50
> 20*3
It will return
[1] 60
> 54/9
It will return
[1] 6
> 2^3
[1] 8
or
> 2**3
[1] 8
> (2^3)+(80-50)
[1] 38
> 1:10
[1] 1 2 3 4 5 6 7 8 9 10
> x<- 5
> y<- 9
> x*y
[1] 45
> x-y
[1] -4
> x+y
[1] 14
We can also combine values into a vector using the function c():
> c(x, y)
[1] 5 9
> c(7, x, y, 7, x, 7, y)
[1] 7 5 9 7 5 7 9
We can clear the contents of the console with the keyboard shortcut:
• CTRL+L.
Please notice that when we hit Enter, it does not show any results. It simply moves the cursor to the next line. If we want to execute one line at a time, we can do this by clicking on the Run icon or using the shortcut Ctrl+Enter (see Figure 1.6).
As shown in Figure 1.7, if we want to run all lines in our R script, we can do this by clicking on the Source icon and selecting Source with Echo from the drop down menu. Once we are done, we can select Save or Save As and save the file in our working directory on our computer.
When we create R files, it is a good idea to add comments that either
explain what we are trying to do or simply provide a reminder related to
the function that we are using. In R, we can add a comment by using the
# sign. Everything that follows the # is just a comment and does not affect
the functionality of our R script. Please keep in mind that we can add a
comment either at the beginning of a new line or after a command.
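For example, a short commented script might look like this (the variable names and values below are just for illustration):
# Calculate total revenue for three stores
unitsSold <- c(120, 85, 60)     # units sold by each store
price <- 25                     # price per unit
revenue <- unitsSold * price    # revenue per store
sum(revenue)                    # total revenue across all stores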
> x=seq(2,20,2)
> x
[1] 2 4 6 8 10 12 14 16 18 20
> length(x)
[1] 10
> sum(x)
[1] 110
> min(x)
[1] 2
> max(x)
[1] 20
> mean(x)
[1] 11
Learning Objectives
By the end of this chapter students should have achieved the following learn-
ing objectives (know how to do the following):
Bibitor collects detailed data related to its daily operations. There are millions of lines of sales, purchases, inventory, invoices, expenses, payroll, and store data in its database. In this section, we will load and review data related to the size of the Bibitor stores from the file named tStores.csv.
Unlike, for example, Excel files, which are currently limited to about a million lines, .csv files do not have an upper limit on the number of lines. This difference matters since some of Bibitor's files have more than a million observations (e.g., the sales file has over 12 million lines).
> library(data.table)
> tStores <- fread("tStores.csv")
(The complete R script used for the creation of this section is available in Appendix A.2.1, p. 99.)
> names(tStores)
The data set has four variables: Store (a unique identifier for each store/observation), City, Location, and the size of each store in square feet (SqFt).
Data Structure: We can get detailed information about the entire data
set (e.g., number of observations, variables, and format for each variable)
with the function str(...). The function takes one argument, the name of
the data set:
> str(tStores)
The data set has 79 observations (one row in the table for each store) and 4
variables (columns in the table). The variables Store and SqFt are integers
(int), while the variables City and Location are formatted as text (chr).
Knowing the format of variables matters when doing statistical analysis. For example, we can calculate the average of numeric data, but for categorical data we would instead create a frequency table (or pie chart).
> head(tStores)
Because in this example we have specified that the output should be the first 3 observations for all the variables, the result is the same as that above (head(tStores, 3)).
Subsetting with data[rows, columns] The main message from these examples is that when we work with data sets we use the following format: data[rows, columns]. Using this format we can generate the following combinations:
1. All rows and all columns by including just a comma inside the bracket:
data[,]
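A few of these combinations, as they appear in the R script in Appendix A.2.1, look like this:
> tStores[1:3, ]     # first three rows, all columns
> tStores[, 2:4]     # all rows, columns 2 through 4 (City, Location, SqFt)
> tStores[1:3, 2:4]  # first three rows, columns 2 through 4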
Combine head() with subsetting Recall that the argument within the function head is the name of the data set. This means that we can use any subset that we have created as the argument within the function head. The following example shows how we do this to achieve the same output as the one shown in Example 3 above. Notice that the argument before the comma is the subset.
> head(tStores[,2:4],3)
> head(tStores[order(tStores$SqFt),2:4],3)
Average We can use the function mean() to calculate the average store
size as follows:
> mean(tStores$SqFt)
[1] 7893.671
The average Bibitor store is 7893.671 square feet in size.
If this is the first time you are using the quantmod package, follow the directions in Chapter 1 (p. 5) on how to install packages in R. Clear the RStudio environment and create a new R file. We load the package using library(quantmod) and use the command Sys.setenv(TZ = "UTC") to avoid annoying warnings/error messages related to timezones.
> library(quantmod)
> Sys.setenv(TZ = "UTC")
1. The ticker symbol of the stock, index, or ETF in quotation marks. The
ticker of Amazon is "AMZN".4
2. The source engine that supplies the financial data. In this example, we
use yahoo Finance (src="yahoo").
3. The frequency for extracting data. This can be daily, weekly, monthly,
quarterly, or annually. For weekly data (periodicity="weekly").
4. The beginning (from=) and the end (to=) of the time period. Notice
that we specify the date using the format "yyyy-mm-dd" and tell R to
read this format as a date (as.Date).
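Putting these four arguments together, the call for the Amazon data is along the following lines (this mirrors the script in Appendix A.3.2; the output below shows that the extraction covers 2006 through 2016):
> getSymbols("AMZN", src="yahoo", periodicity="weekly",
             from=as.Date("2006-01-01"), to=as.Date("2016-12-31"))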
[1] "AMZN"
When run, the function creates a new data set called AMZN. We are not going to use the function str(...) to review it, because the type of data set generated by quantmod is beyond the scope of these notes. Instead, we use the function names as well as head and tail.
> names(AMZN)
The data set has six variables (columns). More specifically, for each observation/week in the data set, we have the following variables: the opening price of AMZN (AMZN.Open), the highest price (AMZN.High), the lowest price (AMZN.Low), the closing price (AMZN.Close), the trading volume (AMZN.Volume), and the adjusted closing price (AMZN.Adjusted).
4 To extract both data sets at the same time, we can specify getSymbols(c("AMZN", "^GSPC"), src="yahoo", ...).
> nrow(AMZN)
[1] 574
Using the functions head and tail respectively, we can verify that the start-
ing week corresponds to January 1st, 2006 and the ending week corresponds
to December 25th, 2016.
> head(AMZN,3)
> tail(AMZN,3)
> plot(AMZN[,6])
The graph (Figure 2.1) shows that after the introduction of cloud computing,
the price of Amazon went from around 100 to around 800. This means a
return of around 700%.
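We extract the S&P 500 index (ticker ^GSPC) in the same way; the call below mirrors the script in Appendix A.3.2:
> getSymbols("^GSPC", src="yahoo", periodicity="weekly",
             from=as.Date("2006-01-01"), to=as.Date("2016-12-31"))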
[1] "GSPC"
> names(GSPC)
> nrow(GSPC)
[1] 574
> head(GSPC,1)
> tail(GSPC,1)
As we can see, the SP500 data set (GSPC) has the same number of variables (6), the same number of observations (574), and the same starting week (2006-01-01) and ending week (2016-12-25) as the AMZN data set.
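To compare the two series, we keep only the adjusted closing prices (column 6 of each data set) and combine them with cbind, as in the script in Appendix A.2.2; the first and last three observations are shown below:
> dt1 <- cbind(AMZN[,6], GSPC[,6])
> head(dt1,3); tail(dt1,3)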
AMZN.Adjusted GSPC.Adjusted
2006-01-01 47.87 1285.45
2006-01-08 44.40 1287.61
2006-01-15 43.92 1261.49
AMZN.Adjusted GSPC.Adjusted
2016-12-11 757.77 2258.07
2016-12-18 760.59 2263.79
2016-12-25 749.87 2238.83
> nrow(dt1)
[1] 574
> names(dt1)
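The default column names are AMZN.Adjusted and GSPC.Adjusted. Before plotting, we rename them, as in the script in Appendix A.2.2:
> names(dt1)[1:2] <- c("AMZN","SP500")
> names(dt1)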
> par(mfcol=c(1,2))
Once the format has been changed, we develop the two graphs.
> plot(dt1$AMZN)
> plot(dt1$SP500)
• When comparing graphs which are presented side-by-side, pay very close attention to the scales used.
We can return the plotting layout to its original format (one row and one column) as follows:
> par(mfcol=c(1,1))
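To focus on the financial crisis period, we create a new data set (dt2) that keeps only the weeks from 2007 through 2010, as in the script in Appendix A.2.2:
> dt2 <- dt1['2007-01-01::2010-12-31']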
Using the new data set, we can print the two graphs as follows:
> par(mfcol=c(1,2))
> plot(dt2$AMZN); plot(dt2$SP500)
> par(mfcol=c(1,1))
The new graph (Figure 2.3) shows that Amazon was able to recover and achieve a price higher than the pre-crisis level much faster than the SP500. More specifically, by 2010 Amazon had reached a price level that was almost twice its pre-crisis level, while the SP500 had not yet reached its pre-crisis level.
Gross margin (GM) is defined as sales (Compustat code=SALE) minus cost of goods sold (COGS), scaled by sales:
\[ GM_{it} = \frac{SALE_{it} - COGS_{it}}{SALE_{it}} \tag{2.1} \]
Operating margin (OM) is defined as operating income before depreciation (Compustat code=OIBDP) to Sales (SALE).
\[ OM_{it} = \frac{OIBDP_{it}}{SALE_{it}} \tag{2.2} \]
Profit margin (PM) is defined as income before extraordinary items (Compustat code=IB) to Sales (SALE). (You may want to review your notes from the introduction to financial accounting, AFM101. In these formulas, the subscript i refers to the specific firm and the subscript t refers to the time period.)
\[ PM_{it} = \frac{IB_{it}}{SALE_{it}} \tag{2.3} \]
Return on assets (ROA) measures a firm's ability to generate profits per dollar of assets. Since we have three versions of profits, we will create three versions of ROA. The first version of ROA is based on gross profits, i.e., the same numerator as the gross margin (GM):
\[ \text{ROA (with GM)}_{it} = \frac{SALE_{it} - COGS_{it}}{AT_{it}} \tag{2.4} \]
The second specification of ROA is based on operating income before depreciation (OIBDP). This means that we use the same numerator as the operating margin (OM):
\[ \text{ROA (with OM)}_{it} = \frac{OIBDP_{it}}{AT_{it}} \tag{2.5} \]
The third version of ROA is based on income before extraordinary items (IB), i.e., the same numerator as the profit margin (PM):
\[ \text{ROA (with PM)}_{it} = \frac{IB_{it}}{AT_{it}} \tag{2.6} \]
One of the major benefits of using financial ratios to measure profitability, rather than actual profits, is the removal of the size factor. Since all values are expressed as a percentage of the firm's sales or assets, it becomes feasible to compare companies of different sizes. Please keep in mind that the profitability ratios will not make sense unless the denominator is greater than zero.
environment, and create a new R script file. Using the package data.table and the function fread, we import the data (industryAAPL_2010_2016.csv) into R and review the names of the variables in the data set using the function names.
> library(data.table)
> dt <- fread("industryAAPL_2010_2016.csv")
> names(dt)
> str(dt)
Reviewing the structure of the data set, we see that it has 441 observations
and there is a combination of numeric (num or int) and text variables (chr).
Subset - Apple Data Using the notation data[,] we can create and view a subset that has just the financial data for Apple. In section 2.1.3 (p. 16) we saw that we can create a subset by specifying the row numbers. An alternative approach is to specify that we want to keep only rows that meet a certain condition (i.e., have the ticker symbol AAPL). We can do this by setting the term before the comma as follows: dt$tic=="AAPL".
With the following command, we can view the Apple data for variables
3 through 10 (fyear, tic, conm, at, cogs, ib, oibdp, and sale).
> dt[dt$tic=="AAPL",c(3:10)]
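We also remove observations with essentially no sales by keeping only rows with sales greater than 1 (this is the step shown in the R scripts in the Appendix); this filtered data is the "new" data set referred to below:
> dt <- dt[dt$sale>1,]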
We can see by running the function nrow, that the new data set has 396
observations.
> nrow(dt)
[1] 396
Using formulas 2.1-2.3, we can specify the creation of the three profitability ratios as follows:
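The corresponding commands, as they appear in the R scripts in the Appendix, are:
> dt$gm <- (dt$sale-dt$cogs)/dt$sale
> dt$om <- dt$oibdp/dt$sale
> dt$pm <- dt$ib/dt$sale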
We can see that the new variables (gm, om, and pm) have been added in the
data set.
> names(dt)
om pm
1: 0.2961594 0.2148409
2: 0.3289823 0.2394664
3: 0.3734378 0.2666509
4: 0.3262302 0.2167047
5: 0.3306929 0.2161438
6: 0.3496994 0.2284577
7: 0.3220776 0.2124078
Using a similar approach we can create the ROA ratios and observe them by
creating a subset.12
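The commands for the ROA ratios are not reproduced in this excerpt; a sketch consistent with formulas 2.4-2.6 and with the variable names ROA_om and ROA_pm in the output below would be (the name ROA_gm for the first ratio is an assumption):
> dt$ROA_gm <- (dt$sale-dt$cogs)/dt$at
> dt$ROA_om <- dt$oibdp/dt$at
> dt$ROA_pm <- dt$ib/dt$at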
ROA_om ROA_pm
1: 0.2569331 0.1863852
2: 0.3060213 0.2227531
3: 0.3319588 0.2370331
4: 0.2693527 0.1789227
5: 0.2607370 0.1704200
6: 0.2813629 0.1838136
7: 0.2153529 0.1420236
> min(dt[dt$tic=="AAPL",]$gm)
[1] 0.4080644
> min(dt$gm)
[1] -3.583661
At this point, we may want to find out which company has such a low performance. We can do this by creating a subset that specifies that we want to see the observation associated with the minimum gross margin.
> max(dt[dt$tic=="AAPL",]$gm)
[1] 0.4591906
> max(dt$gm)
[1] 0.9974311
• Q: Can you determine which firm achieved such a high gross margin?
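The command that produced Figure 2.5 is not shown in this excerpt; a minimal sketch that plots Apple's sales by fiscal year (the use of a base R line plot is an assumption) would be:
> plot(dt[dt$tic=="AAPL",]$fyear, dt[dt$tic=="AAPL",]$sale,
       type="l", xlab="Fiscal year", ylab="Sales")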
The resulting graph is shown in Figure 2.5. From this, we can see that
Apple’s sales more than doubled from 2010 to 2016.
Summary Statistics
Learning Objectives
By the end of this chapter students should have achieved the following learning objectives (know how to do the following):
1. Generate summary descriptive statistics.
2. Identify outliers using the interquartile range approach.
3. Create a boxplot.
4. Create a subset that indicates which observations are outliers.
5. Create a frequency table for categorical variables.
The data set has 79 observations (one row in the table for each store) and
5 variables (columns in the table). The variable Store provides the unique
identification of each store; SqFt provides the size of each store in square feet;
unitSold captures the units sold (number of bottles); averagePrice captures
the average price per unit (bottle) sold; and revenue captures the revenues
(sales) generated by each store.
> summary(dt1$SqFt)
From this we can see that the size of the smallest store is 1100 square feet, while the largest is 33000. Twenty-five percent of stores have a size below 4000 square feet; in other words, the 1st quartile (Q1) is 4000. The median is 6400, so 50% of stores are smaller than 6400 square feet and 50% are larger. The average store is 7894 square feet. Seventy-five percent of stores are below the 3rd quartile (Q3 = 10200).
We can generate the first quartile, third quartile, or any other percentile we want using the function quantile. The function takes two arguments: the variable and the proportion of observations below. For example, we can generate the 1st quartile of store size by specifying the variable (SqFt) and .25 (25% of observations below), as follows:
> quantile(dt1$SqFt, .25)
 25%
4000
If we change the proportion to .75 it will return the 3rd quartile (Q3).
> quantile(dt1$SqFt, .75)
  75%
10200
We can take the difference between the two of them to calculate the interquartile range.
> quantile(dt1$SqFt, .75) - quantile(dt1$SqFt, .25)
 75%
6200
Alternatively, we can generate the interquartile range (IQR) using the func-
tion IQR as follows:
> IQR(dt1$SqFt)
[1] 6200
The lower whisker is defined as Q1 − 1.5 × IQR and the upper whisker as Q3 + 1.5 × IQR. Following the script in Appendix A.3.1, we compute them as follows:
> lWhisker4sQFt <- quantile(dt1$SqFt,.25) - 1.5*IQR(dt1$SqFt)
> lWhisker4sQFt
  25%
-5300
> uWhisker4sQFt <- quantile(dt1$SqFt,.75) + 1.5*IQR(dt1$SqFt)
> uWhisker4sQFt
  75%
19500
The size of the smallest store (1100) is not below the lower whisker (-5300); therefore there are no outliers on the left side of the distribution. However, the largest store (33000) is above the upper whisker (19500), hence at least this store is an outlier.
3.1.4 Boxplot
A boxplot is a way of graphically showing data in their quartiles, including the
"whiskers" as noted above (i.e., indicators of the variability of data beyond
the upper quartile and below the first quartile). In R, we can create the
boxplot for the size of the stores (shown in Figure 3.1) using the function
boxplot and specifying the variable that we would like to graph as follows:
> boxplot(dt1$SqFt)
The boxplot shows us that there are 2 outliers on the upper end. This
means that there are two values above the upper whisker (19500). In the
previous section we have seen that the minimum value for the store size is
1100 and the lower whisker is −5300. Note that given the fact that there are
no stores with negative size or stores with size less than 1100, the boxplot
shows the minimum value instead of showing the lower whisker.
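Following the script in Appendix A.3.1, we first create an indicator variable (sqftOutlier) that equals 1 when a store's size falls below the lower whisker or above the upper whisker, and 0 otherwise:
> dt1$sqftOutlier <-
    ifelse(dt1$SqFt<lWhisker4sQFt|dt1$SqFt>uWhisker4sQFt,1,0)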
We use the new variable to create and view the subset that returns only the outliers (dt1[dt1$sqftOutlier == 1, ]) below.
Therefore, stores 49 and 66 are the two very large stores (outliers) shown in Figure 3.1.
Alternatively, we can simply use the condition from the function ifelse
to generate the subset.
> dt1[dt1$SqFt<lWhisker4sQFt|dt1$SqFt>uWhisker4sQFt,]
The first method (creating a new dataset) is useful if there are a lot of
outliers/exceptions and we need to perform further statistical analysis to
understand patterns or common themes across the entire data set of outliers.
For example, we can use the new data set to analyze (generate summary
statistics) for the group of outliers. The second is more useful when dealing
with just a handful of observations, and a simple visual review would be
enough to see what is going on.
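We load the product-level sales data into a new data set (dt2) and review its structure, as in the script in Appendix A.3.1:
> dt2 <- fread("salesByProduct.csv")
> str(dt2)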
The data set has 10473 observations (one line/observation for each unique
product that Bibitor sells) and 6 variables. The variable Brand is a unique
identifier for each product. Description is the name of the product. Classifi-
cation captures the two categories of products sold (1=liquor, 2=wine). The
remaining variables are units sold (unitSold), average price (averagePrice),
and sales (revenue) for each product.
We can generate the product count (frequency of products) based on their
classification as liquor or wine, using the function table() and specifying the
variable we want to analyze (Classification).
> table(dt2$Classification)
1 2
3182 7291
The results show that there are 3182 liquors and 7291 wines. If we include
the function table within another function prop.table the output will be
expressed as percentage of total observations.
> prop.table(table(dt2$Classification))
1 2
0.3038289 0.6961711
1. Create a data set that has only liquors and name it dt_liquor.3
(a) Are there outliers in units sold (unitSold)?
(b) How many?
(c) If there are more than ten, show the top ten.
(d) Are there outliers in the average price (averagePrice)?
(e) How many?
(f) If there are more than ten, show the top ten.
(g) Are there outliers in sales (revenue)?
(h) How many?
(i) If there are more than ten, show the top ten.
2. Create a data set that has only wines and name it dt_wine. Perform
the same analysis as above for outliers in units sold, average price, and
revenue.
3. Summarize the main points of your analysis and their implications for the management of Bibitor.
> options(scipen=999)
> library(quantmod)
> Sys.setenv(TZ = "UTC")
> getSymbols(c("AMZN", "^GSPC"), src="yahoo",
periodicity="weekly",
from=as.Date("2006-01-01"), to=as.Date("2016-12-31"))
> names(AMZN)
> names(GSPC)
AMZN.Adjusted GSPC.Adjusted
2006-01-01 47.87 1285.45
2006-01-08 44.40 1287.61
2006-01-15 43.92 1261.49
AMZN.Adjusted GSPC.Adjusted
2016-12-11 757.77 2258.07
2016-12-18 760.59 2263.79
2016-12-25 749.87 2238.83
If you have any questions related to the above commands and/or output,
please review section 2.2.
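As in section 2.2.3, we combine the two adjusted price series into one data set and rename the columns (see the script in Appendix A.3.2):
> dt1 <- cbind(AMZN[,6], GSPC[,6])
> names(dt1)[1:2] <- c("AMZN","SP500")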
> head(dt1)
AMZN SP500
2006-01-01 47.87 1285.45
2006-01-08 44.40 1287.61
2006-01-15 43.92 1261.49
2006-01-22 45.22 1283.72
2006-01-29 38.33 1264.03
2006-02-05 38.52 1266.99
We can use these values to calculate the rate of return using the following formula:
\[ return_t = \frac{Price_t - Price_{t-1}}{Price_{t-1}} \tag{3.1} \]
For example, if the current week is 2006-01-08, the current week's Price_t = 44.40, last week's Price_{t-1} = 47.87, and return_t = (44.40 − 47.87)/47.87 = −0.072. This means the Amazon stock had a negative return of 7.2%, or the Amazon price dropped by 7.2%.
In order to calculate returns, we need to have the current price (Price_t) as well as the previous (last week's) price (Price_{t-1}). In R, we can create a new variable that shows the previous price using the function lag(). The function takes two arguments: the variable and the number of lags. Typically, we don't specify the number of periods/weeks we want to go back (the number of lags); R assumes the default value, which is one period/week.
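The corresponding commands from the script in Appendix A.3.2 are:
> dt1$lagAMZN <- lag(dt1$AMZN)
> dt1$lagSP500 <- lag(dt1$SP500)
> head(dt1)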
Based on the above output, we can see that during the first week of 2006
(2006-01-01 ), the current price (P ricet ) of Amazon was 47.87 (AMZN=47.87)
and the previous price (P ricet−1 ) was NA (i.e., not available). During the
second week of 2006 (2006-01-08 ), the current AMZN price was 44.40, and the
previous week’s price (lagAMZN) was 47.87.
Using formula 3.1, we calculate the rate of return for Amazon (rtrnAMZN)
and SP500 (rtrnSP500) as follows:
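Again following the script in Appendix A.3.2:
> dt1$rtrnAMZN <- (dt1$AMZN-dt1$lagAMZN)/dt1$lagAMZN
> dt1$rtrnSP500 <- (dt1$SP500-dt1$lagSP500)/dt1$lagSP500
> head(dt1)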
rtrnSP500
2006-01-01 NA
2006-01-08 0.001680372
2006-01-15 -0.020285642
2006-01-22 0.017622003
2006-01-29 -0.015338191
2006-02-05 0.002341686
Notice that as a result of using the function lag, our data set has missing values in the first observation (week). This happens because our data set does not contain the last week of December 2005. If we are currently in the first week of January 2006 and we need to generate the one period/week lag, we need the data from the last week of December. Since this piece of information is not available, R returns a missing value (NA). We can eliminate the missing values from our data set (dt1) using the function na.omit as follows:
> dt1 <- na.omit(dt1)
> head(dt1,2)
rtrnSP500
2006-01-08 0.001680372
2006-01-15 -0.020285642
Practice Problems Review the above results and answer the following
questions:
Detecting outliers: following the script in Appendix A.3.2, we compute the whiskers for the Amazon returns.
> lwrWhiskerAMZN <- quantile(dt1$rtrnAMZN,.25) - 1.5*IQR(dt1$rtrnAMZN)
> lwrWhiskerAMZN
       25%
-0.1101005
> uprWhiskerAMZN <- quantile(dt1$rtrnAMZN,.75) + 1.5*IQR(dt1$rtrnAMZN)
> uprWhiskerAMZN
      75%
0.1191161
Based on the above, we see that weekly Amazon stock returns which are
below -11% or above 11.9% are outliers.
Using the same approach, we find that SP500 returns which are below
-4.6% or above 5% are outliers.
> lwrWhiskerSP500 <- quantile(dt1$rtrnSP500,.25) - 1.5*IQR(dt1$rtrnSP500)
> lwrWhiskerSP500
        25%
-0.04602874
> uprWhiskerSP500 <- quantile(dt1$rtrnSP500,.75) + 1.5*IQR(dt1$rtrnSP500)
> uprWhiskerSP500
      75%
0.0502261
(Hint: The saying is supported if the stock/index that has the widest range generates on average higher returns.)
> dt1[dt1$rtrnAMZN==min(dt1$rtrnAMZN),]
> dt1[dt1$rtrnSP500==min(dt1$rtrnSP500),]
From the above results, we can see that within the period 2006-2016, the SP500 had its worst performance on 2008-10-05 and Amazon had its worst performance on 2006-07-23.
(Hint: compare the min and max returns to the lower and upper whisker, respectively.)
Practice Question Search the web and try to find information about these
two periods in order to answer the following questions.
1. What happened around the week of 2008-10-05? Does it make sense
that both SP500 and Amazon had such a steep drop on the same week?
Explain why.
2. Try to find information related to Amazon around the period of 2006-
07-23. Does it make sense that Amazon experienced such a steep drop
but not SP500? Explain why.6
If the answer is no, we have two more conditions to explore: was the return zero, or was it negative? We test these by introducing a second ifelse() statement, which tests whether the return was zero. If the answer is yes, we assign the value AMZN_unchanged. If the answer is no, and given that we already know that the return is not positive, we assign the value AMZN_down. We repeat the same approach for the SP500 index (see the commands below).
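The full set of commands, from the script in Appendix A.3.2, is:
> dt2 <- as.data.frame(dt1[,5:6])
> dt2$directionAMZN <- ifelse(dt2$rtrnAMZN>0,"AMZN_up",
    ifelse(dt2$rtrnAMZN==0,"AMZN_unchanged","AMZN_down"))
> dt2$directionSP500 <- ifelse(dt2$rtrnSP500>0,"SP500_up",
    ifelse(dt2$rtrnSP500==0,"SP500_unchanged","SP500_down"))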
We observe the top observations using the function head() below.
> head(dt2)
> table(dt2$directionSP500)
SP500_down SP500_up
252 321
> table(dt2$directionAMZN)
AMZN_down AMZN_up
270 303
Practice Problem Interpret and contrast the results from these two ta-
bles.
> nrow(dt)
[1] 441
> dt <- dt[dt$sale>1,]
> nrow(dt)
[1] 396
If you have any questions related to the above commands and/or output
please review sections 2.3.2 - 2.3.3.
For the rest of the analysis we will focus on a single year (2010).
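We create the 2010 subset as in the script in Appendix A.3.3:
> dt_2010 <- dt[dt$fyear==2010,]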
> summary(dt_2010[,8:9])
ib oibdp
Min. : -874.000 Min. : -56.546
1st Qu.: -4.081 1st Qu.: 0.222
Median : 1.008 Median : 4.555
Mean : 262.069 Mean : 490.497
3rd Qu.: 14.468 3rd Qu.: 34.996
Max. :14013.000 Max. :19317.000
As expected, the above results show that there is a wide spectrum of values
for both profitability measures. Clearly, the max ib and oibdp, which belong
to Apple (see p. 31), are outliers.
Mental Math We can use mental math to validate that the max ib is an
outlier as follows: First, round Q1 and Q3 very generously. So Q1 is around
-5 and Q3 is around 15. This means that IQR is around 20 (15 − (−5) = 20)
and half of it is around 10. Therefore 1.5 ∗ IQR is around 30. Based on
these values, the upper whisker is 15 + 30 = 45, which is well below the max
(14013). Thus we can conclude that the max value is an outlier in this data
set.
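The classification of firms into profits and losses based on ib, and the relative frequencies shown below, come from the script in Appendix A.3.3:
> dt_2010$status_IB <- ifelse(dt_2010$ib>=0,"profits_IB","losses_IB")
> prop.table(table(dt_2010$status_IB))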
losses_IB profits_IB
0.4366197 0.5633803
Profit margin We can calculate the three different versions of profit margin (formulas 2.1-2.3) for all firms in the industry as follows:
> dt_2010$gm <- (dt_2010$sale-dt_2010$cogs)/dt_2010$sale
> dt_2010$om <- dt_2010$oibdp/dt_2010$sale
> dt_2010$pm <- dt_2010$ib/dt_2010$sale
> names(dt_2010)
[1] "gvkey" "datadate" "fyear" "tic"
[5] "conm" "at" "cogs" "ib"
[9] "oibdp" "sale" "loc" "naics"
[13] "sic" "status_OIBDP" "status_IB" "gm"
[17] "om" "pm"
Summary statistics Since the three versions of profit margin are variables 16-18, we can generate summary statistics as follows:
> summary(dt_2010[,16:18])
gm om pm
Min. :-0.01508 Min. :-1.341778 Min. :-2.17918
1st Qu.: 0.27485 1st Qu.: 0.007278 1st Qu.:-0.09114
Median : 0.40806 Median : 0.075490 Median : 0.01317
Mean : 0.40877 Mean : 0.015356 Mean :-0.10142
3rd Qu.: 0.50625 3rd Qu.: 0.134599 3rd Qu.: 0.05769
Max. : 0.78430 Max. : 0.506473 Max. : 0.80502
Using mental math, we can quickly come to the conclusion that the third
version of profit margin (pm) is the one that seems to have the most extreme
outliers. We are interested in these outliers because these are the firms that
do extremely well or extremely poorly.
> lwrWhisker_pm <- quantile(dt_2010$pm,.25) - 1.5*IQR(dt_2010$pm)
> lwrWhisker_pm
       25%
-0.3143745
> uprWhisker_pm <- quantile(dt_2010$pm,.75) + 1.5*IQR(dt_2010$pm)
> uprWhisker_pm
      75%
0.2809268
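We then classify each firm relative to the upper whisker and the industry median, using the nested ifelse() from the script in Appendix A.3.3:
> dt_2010$relativePosition_pm <-
    ifelse(dt_2010$pm>uprWhisker_pm,"topOutlier_pm",
    ifelse(dt_2010$pm>median(dt_2010$pm),
    "aboveMedian_pm","belowMedian_pm"))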
> table(dt_2010$relativePosition_pm)
> prop.table(table(dt_2010$relativePosition_pm))
Therefore, there are 2 firms (around 3%) that perform well above the rest
of the industry. There are 33 firms (around 47%) that perform above the
industry median, but below the top performers. The remaining 36 firms
(around 50%) perform below the industry median.
Practice Problem Can you find the two top performing firms?
9. Using the dt2 data, calculate the upper whisker for ROA_om. Interpret
your finding.
10. Create a new categorical variable (relativePosition_ROA_om) that
takes the value topOutlier_ROA_om if the firm’s ROA_om is above
the upper whisker; aboveMedian_ROA_om if the firm’s ROA_om is
above the median, and belowMedian_ROA_om if the firm is below the
median.
11. How many firms are in the top group (topOutlier_ROA_om)? If there are fewer than ten, list them.
3.4 Solutions to Selected Practice Problems
3.4.1 Sales by Store (p. 44)
Outliers in units sold: following the approach of section 3.1.3, we compute the whiskers for unitSold.
> lWhisker4UnitSold <- quantile(dt1$unitSold,.25) - 1.5*IQR(dt1$unitSold)
> lWhisker4UnitSold
      25%
-221843.2
> upperWhisker <- quantile(dt1$unitSold,.75) + 1.5*IQR(dt1$unitSold)
> upperWhisker
     75%
884590.8
> boxplot(dt1$unitSold)
> dt1$units_Outlier <-
ifelse(dt1$unitSold<lWhisker4UnitSold|dt1$unitSold>upperWhisker,1,0)
> nrow(dt1[dt1$units_Outlier==1,])
[1] 8
> dt1[dt1$unitSold<lWhisker4UnitSold|dt1$unitSold>upperWhisker,]
(The printed subset lists the eight outlier stores; for each of them the indicator units_Outlier equals 1.)
[1] 43
losses_OIBDP profits_OIBDP
0.2325581 0.7674419
25%
-0.1693562
[1] 40
> summary(dt2$ROA_om)
> names(dt2)
75%
0.281306
> table(dt2$relativePosition_ROA_om)
> prop.table(table(dt2$relativePosition_ROA_om))
> dt2[dt2$relativePosition_ROA_om=="topOutlier_ROA_om",]
Hypothesis Testing
Learning Objectives
By the end of this chapter students should have achieved the following learn-
ing objectives (know how to do the following):
1. Generate random samples using sample().
2. Run t-tests using t.test() and interpret output.
3. Select appropriate options/arguments when using t.test().
4. Append data sets using rbind and perform t-test on grouped data.
5. Generate detailed descriptive statistics using describe().
Important Note The management of Bibitor has access to the entire population of products sold. Therefore, we can answer these questions by simply comparing the results of summary statistics. This means that there is no need to take random samples and do hypothesis testing. However, learning how to perform hypothesis testing is a critical stepping stone for understanding how to evaluate more advanced statistical techniques, such as regression analysis.
> library(data.table)
> dt1 <- fread("salesByProduct.csv")
Please recall from the discussion in section 3.1.7, that the file contains
sales data, as well as classification, for each one of the 10473 products sold
in Bibitor stores in fiscal year 2016.
1. Create the table that shows frequency distribution of liquor and wine
brands.
2. Create the table that shows the relative frequency distribution (per-
centage) of liquor and wine brands.
Using the notation [,], we create a subset (dt1_L) for liquor products by imposing the constraint that Classification equals 1, and a subset (dt1_W) for wine products with the constraint that Classification equals 2.
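The subsetting commands are not reproduced in this excerpt; a minimal sketch consistent with the names dt1_L and dt1_W used below would be:
> dt1_L <- dt1[dt1$Classification==1,]
> dt1_W <- dt1[dt1$Classification==2,]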
Summary statistics for units sold, average price, and revenue for each subset are shown below. As we can see, the average units sold is much higher for liquor (6065.9) than for wine (1783). Similarly, the average price and average revenue for liquor (36.47 and 88364, respectively) are much higher than the corresponding values for wine (30.97 and 21930.6, respectively).
Suppose that we want to draw a random sample of 25 observations from the liquor data and another 25 from the wine data. A manual way of creating a random sample for liquor would be to write on small pieces of paper the row numbers for each row (observation) in the liquor data set. The data set has 3182 rows. Put all these 3182 pieces of paper in a box, shake it well, and draw/remove 25 pieces. The numbers on these 25 pieces are the row numbers of the 25 observations that would make up our random sample.
The function sample() in R achieves the same effect. It generates a random sample. In the way we use it here, the function takes two arguments: 1) the set of values to sample from (here, the row numbers 1 through nrow(dt1_L)), and 2) the number of observations to draw. For example, if we want to take a random sample of 25 row numbers from the data set of liquor products (dt1_L), we can write this as follows:
> sample(1:nrow(dt1_L),25)
[1] 988 246 175 831 1991 214 1157 2394 752 546 619 2753
[13] 2999 695 2680 2819 326 1145 2459 1549 3113 18 1580 1027
[25] 1069
This means that the random sample is made from observations (row numbers)
988, 246, 175, etc.
If we were to re-run the same command, it would produce a different
random sample.
> sample(1:nrow(dt1_L),25)
[1] 1826 2498 595 1429 2980 2800 2399 468 603 1575 1289 2094
[13] 1992 2573 1124 1172 1913 2321 2148 2932 3020 1864 2791 160
[25] 2199
Our second random sample is made of row numbers 1826, 2498, 595, etc. If you repeat exactly the same code on your computer, you will get a different set of numbers. This is similar to drawing from the box that contains the pieces of paper: every time we draw, we get different results.
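To make the draw reproducible, the script in Appendix A.4.1 sets the random seed before sampling and then uses the sampled row numbers to create the random sample of liquor products (dt1_L_rs):
> set.seed(123)
> dt1_L_rs <- dt1_L[sample(1:nrow(dt1_L),25),]
> head(dt1_L_rs,3)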
> tail(dt1_L_rs, 3)
Brand Description Classification unitSold
1: 4054 Patron XO Cafe Liqueur 1 3020
2: 8985 J Roget Spumante 1 8939
3: 4167 Shellback Silver Rum 1 23
averagePrice revenue
1: 13.24544 39981.80
2: 5.99000 53544.61
3: 14.99000 344.77
Practice Problem: Verify that the row number 916 corresponds to brand
number 2199 shown as the first observation in the random sample.
This means that within the function t.test(), we have specified the following arguments:
high as $11.04. Since this range contains the value of zero, we can again conclude that there is not enough evidence to reject the null hypothesis (i.e., that the true difference in means is equal to zero).
6. The average price of liquor (variable x) in the sample was $19.84 and
the sample average price of wine (y variable) was $20.08.
(a) Use the function rbind to combine the random sample from liquor
product (dt1_L_rs) and the random sample from wine products
(dt1_W_rs) and save it as as new data set (dt1_rs).
(b) How many variables are in the new data set (dt1_rs)?
(c) How many observations are in the new data set (dt1_rs)? Ob-
serve, that the new data set has one variable that captures the
average price and another variable that captures the product clas-
sification.
(d) When our data set has observations that can be divided into groups (i.e., liquor and wine), we can run a t-test that compares the means of the two groups as follows: t.test(dt1_rs$averagePrice ~ dt1_rs$Classification).
(e) Use the above formula to run the t-test and compare the new
results with the results in Figure 4.1. If you followed the directions
above, the test results should be identical.6
6. The function describe() from the package psych provides more de-
tailed summary statistics. The output among others, includes the stan-
dard error.7 The function in its simplest form takes just one argument,
the variable or variables for which we would like to see descriptive
statistics. For example, the following statement will generate descrip-
tive statistics for the 4th variable (units sold) from the random sample
of wine products: describe( dt1_W_rs[,4]).
(a) Use the function describe() to generate descriptive statistics for
units sold, price, and revenue from the random sample of liquor
products.
(b) What is the average value, standard deviation, number of obser-
vations, and standard error (se) for units sold.
(c) A quick and dirty way to create an approximately 95% confidence
interval is to multiply the standard error (se) times two, and then
add and subtract this from the mean. Use this approach to create
confidence intervals for units sold, price, and revenue.
(d) Use the function describe() to generate descriptive statistics for
units sold, price, and revenue from the random sample of wine
products.
(e) What is the average value, standard deviation, number of obser-
vations, and standard error (se) for units sold.
(f) Use the quick and dirty way to generate confidence intervals for
units sold, price, and revenue of wine.
6 The solution to this practice problem is on p. 91.
7 The solution to this practice problem is on p. 91.
(g) Use your confidence intervals to compare average units sold, price,
and revenue between liquor and wine.
> names(AMZN)
> names(GSPC)
As we have seen in section 2.2.3 (p. 24), we can use the function cbind to combine the two data sets, as follows. (Random samples are being used to demonstrate hypothesis testing.)
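The cbind step mirrors the script in Appendix A.3.2 (the subsequent steps that compute returns and the difference variable delta are not reproduced in this excerpt):
> dt1 <- cbind(AMZN[,6], GSPC[,6])
> names(dt1)[1:2] <- c("AMZN","SP500")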
> dt1[1:3,5:7]
> set.seed(999)
> startDate <- sample(1:nrow(dtBear)-15,1)
> startDate
[1] 26
Using the [,] notation we can see that the trading day that corresponds to
the 26th observation is the first week of July 2007.
> dtBear[startDate,]
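The 15-week sample itself (dtBear_15w) starts at the selected observation; a sketch consistent with the window described would be:
> dtBear_15w <- dtBear[startDate:(startDate+14),]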
> nrow(dtBear_15w)
[1] 15
> mean(dtBear_15w$delta)
[1] 0.01911319
> sd(dtBear_15w$delta)
[1] 0.06805197
While the average difference is greater than zero, we don't know if the difference is statistically significant. To test this, we use the function t.test(). In the script below, we have specified that our target variable is delta, the level of significance is 10%, the value of the population delta under the null is zero (µ = 0), the alternative is one sided (µ > 0), and we want to see the 90% confidence interval.
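The call is not reproduced in this excerpt; based on the identical specification used below for the bull market, it would be:
> t.test(dtBear_15w$delta, level = 0.90, mu=0,
         alternative= "greater", conf.level = 0.90)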
Based on the results below, the p-value (0.1475) is greater than the chosen level of significance (α = 10%). Therefore, there is not enough statistical evidence to reject the null hypothesis.
data: dtBear_15w$delta
t = 1.0878, df = 14, p-value = 0.1475
alternative hypothesis: true mean is greater than 0
90 percent confidence interval:
-0.004520262 Inf
sample estimates:
mean of x
0.01911319
This means, that based on our sample, there is not enough statistical evidence
to conclude that on average Amazon returns are higher than S&P500 returns
during a bear market.
> set.seed(999)
> startDate <- sample(1:nrow(dtBull)-15,1)
> startDate
[1] 26
> dtBull[startDate,]
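As for the bear market, the 15-week bull market sample (dtBull_15w) starts at the selected observation; a sketch would be:
> dtBull_15w <- dtBull[startDate:(startDate+14),]
> nrow(dtBull_15w)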
[1] 15
> mean(dtBull_15w$delta)
[1] 0.008670262
> sd(dtBull_15w$delta)
[1] 0.01647467
The fact that the mean delta is positive means that the average returns of Amazon were higher than the returns of the SP500. However, we don't know if this difference is statistically significant. To test this, we use t.test() with the same specifications as in the bear market. The script and results are shown below.
> t.test(dtBull_15w$delta, level = 0.90, mu=0,
alternative= "greater", conf.level = 0.90)
One Sample t-test
data: dtBull_15w$delta
t = 2.0383, df = 14, p-value = 0.03044
alternative hypothesis: true mean is greater than 0
90 percent confidence interval:
0.002948849 Inf
sample estimates:
mean of x
0.008670262
Based on the above results; the p-value (0.03044) is less than the chosen level
of significance (α = 10%). Therefore, there is enough statistical evidence to
reject the null hypothesis. It seems that based on evidence from the period
2016-17 the average return on Amazon was higher than S&P500 during a
bull market.
(a) Select a random start date that would allow you to take a sample
of 25 observations. What is the startDate observation?
(b) What is the first trading date in the sample?
(c) What is the average delta in the sample?
(d) What is the standard deviation?
(e) Run a one sided t-test with the alternative greater than zero and
a 10% level of significance.
(f) What is the p-value?
(g) State the conclusion based on these results.
2. Set the seed to 123 and work with the bull market subset (dtBull) to
compare returns between Amazon and S&P500, using a sample of 25
observations.11
(a) Select a random start date that would allow you to take a sample
of 25 observations. What is the startDate observation?
(b) What is the first trading date in the sample?
(c) What is the average delta in the sample?
(d) What is the standard deviation?
(e) Run a one sided t-test with the alternative greater than zero and
a 10% level of significance.
(f) What is the p-value?
(g) State the conclusion based on these results.
> library(data.table)
> dt <- fread("industryAAPL_2010_2016.csv")
> names(dt)
[1] "gvkey" "datadate" "fyear" "tic" "conm"
[6] "at" "cogs" "ib" "oibdp" "sale"
[11] "loc" "naics" "sic"
> nrow(dt)
[1] 441
> dt <- dt[dt$sale>1,]
> nrow(dt)
[1] 396
> dt$gm <- (dt$sale-dt$cogs)/dt$sale
> dt$om <- dt$oibdp/dt$sale
> dt$pm <- dt$ib/dt$sale
The rest of this section is organized as follows:
1. Identify the subset of companies that had an operating margin (om)
above the 3rd quartile (top quartile) in 2010. We will save these com-
panies in a data set named dt_2010_Q4.
2. Based on the 2012 data, create a sub-set that contains only firms that
were in the 4th Quartile in 2010. Name this new set dt_2012_oldQ4.
3. Remove the firms that were in the 4th Quartile in 2010 from the 2012
data set, and name the new sub-set dt_2012_minusOldQ4.
4. Generate a random sample based on the data set dt_2012_minusOldQ4.
Name the random sample dt_2012_rs. The random sample should
have the same number of observations as the firms in dt_2012_oldQ4.
> summary(dt_2010$om)
As we can see, firms that had an operating margin (om) above 13.46% are in the top quartile (above the 3rd quartile). Remember that, as we saw in Chapter 3 (p. 40), we can calculate the 3rd quartile using quantile(x, .75).
With the following script, we generate the sub-set dt_2010_Q4 and print
the ticker symbol of all firms that were in the 4th quartile in 2010.
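The script itself is not reproduced in this excerpt; a sketch consistent with the description (the exact form of the subsetting is an assumption) would be:
> dt_2010_Q4 <- dt_2010[dt_2010$om > quantile(dt_2010$om, .75),]
> print(dt_2010_Q4$tic)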
As we can see there were 18 firms that were in the 4th quartile in terms of
their operating margin in 2010.
Stage 2 (dt_2012_oldQ4 )
Our objective in this section is to use 2012 data in order to create a sub-set that contains only firms that were in the 4th Quartile in 2010. We are going to name this new set dt_2012_oldQ4.
We start by focusing on the financial data from fiscal year 2012.
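A sketch of these two steps, consistent with the %in% statement described in Stage 3 below (the exact commands are not reproduced in this excerpt), would be:
> dt_2012 <- dt[dt$fyear==2012,]
> dt_2012_oldQ4 <- dt_2012[dt_2012$tic %in% c(dt_2010_Q4$tic),]
> nrow(dt_2012_oldQ4)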
[1] 13
> print(dt_2012_oldQ4$tic)
Stage 3 (dt_2012_minusOldQ4 )
In order to see how the firms which were at the top quartile in 2010 compare
to their competitors in 2012, we need to remove them from the list of 2012
firms. This means that we need to remove firms with ticker symbol "ANEN",
"AAPL", "CMTL" etc., from the list of 2012 firms.
In the previous step, we used the following statement dt_2012$tic %in%
c(dt_2010_Q4$tic) to extract the firms that belonged in the list of top
performing 2010 firms. In order to create the dt_2012_minusOldQ4 we
need the opposite of this statement. We need firms that do not belong to
the top performing 2010 firms. To do this we enclose the statement in a
parenthesis and add an exclamation point (!) before it. In other words, the
statement becomes as follows: !(dt_2012$tic %in% c(dt_2010_Q4$tic)).
Using the [,] notation and the above statement, we create the sub-set
dt_2012_minusOldQ4 :
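Based on the statement just described, the command would be:
> dt_2012_minusOldQ4 <- dt_2012[!(dt_2012$tic %in% c(dt_2010_Q4$tic)),]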
Practice Problem:
1. What is the number of rows in dt_2012_minusOldQ4 ?
2. Verify that the number of rows in the data set dt_2012_minusOldQ4
plus the number of rows in dt_2012_oldQ4 is equal to the number of
rows in dt_2012.
Stage 4 (dt_2012_rs)
Since the data set dt_2012_minusOldQ4 has more observations than the
data set dt_2012_oldQ4, we will need to take a random sample of 13 obser-
vations from the dt_2012_minusOldQ4.
> set.seed(123)
> dt_2012_rs <-
dt_2012_minusOldQ4[sample(1:nrow(dt_2012_minusOldQ4),13),]
> mean(dt_2012_oldQ4$om)
[1] 0.1223479
> sd(dt_2012_oldQ4$om)
[1] 0.2022083
> mean(dt_2012_rs$om)
[1] -0.1332882
> sd(dt_2012_rs$om)
[1] 0.4909646
Based on these values, it seems that the 2010 top performers were able to sustain their superiority two years later. However, we don't know if this difference is statistically significant.
Results based on a one sided t-test are shown below.
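The test, taken from the R script in the Appendix, is:
> t.test(dt_2012_oldQ4$om, dt_2012_rs$om,
         level=.90, var.equal = FALSE,
         alternative= "greater", conf.level = 0.90)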
The p-value (0.05093) is less than the chosen level of significance (10%).
Therefore, we can conclude that based on evidence from our sample, the
2012 average om of top performing firms in 2010 seem to be higher than the
2012 average om of their competitors. Therefore; two years later, these firms
seem to be able to maintain their advantage.
12 The R script for this problem is on p. 94.
13 The R script for this problem is on p. 95.
When to use cbind Suppose that we have two tables that share the variable X. The first table (XY) has variables X and Y, and the second table (XZ) has variables X and Z:
Table XY:
X  Y
1  10
2  20
3  30
Table XZ:
X  Z
1  100
2  200
3  300
In this case, we can place them side-by-side and have them share the common variable X using the function cbind. The function takes two arguments: we need to specify the left table (XY) and the right table (XZ).
X  Y   Z
1  10  100
2  20  200
3  30  300
When to use rbind Suppose that we have two tables. The first one (4.4)
has two variables Year and Y and three observations for years 2014 through
2016. The second table (4.5) has the same two variables Year and Y and
observations for years 2017 and 2018. As we can see both tables have the
same variables Year and Y.
Table 4.4: dt1
Year  Y
2014  40
2015  50
2016  70

Table 4.5: dt2
Year  Y
2017  90
2018  65
In this case, it makes sense to append the second table (dt2 ) at the
bottom of the first table (dt1 ) using the function rbind. The function takes
two arguments: we need to specify the top table (dt1) and the bottom table
(dt2).
Year Y
2014 40
2015 50
2016 70
2017 90
2018 65
Table 4.6: dt
Use rbind to combine random samples and compare prices (p. 73)
> dt1_rs <- rbind(dt1_L_rs, dt1_W_rs)
> t.test(dt1_rs$averagePrice~dt1_rs$Classification,
level = 0.95, var.equal = FALSE,
alternative= "two.sided", conf.level = 0.95)
> describe(dt1_W_rs[,4:6])
[1] 5
> dtBear[startDate,]
[1] 25
> mean(dtBear_25w$delta)
[1] 0.03577445
> sd(dtBear_25w$delta)
[1] 0.09114545
data: dtBear_25w$delta
t = 1.9625, df = 24, p-value = 0.0307
alternative hypothesis: true mean is greater than 0
90 percent confidence interval:
0.0117515 Inf
sample estimates:
mean of x
0.03577445
[1] 5
> dtBull[startDate,]
[1] 25
> mean(dtBull_25w$delta)
[1] 0.005769844
> sd(dtBull_25w$delta)
[1] 0.03722621
data: dtBull_25w$delta
t = 0.77497, df = 24, p-value = 0.223
alternative hypothesis: true mean is greater than 0
90 percent confidence interval:
-0.004041763 Inf
sample estimates:
mean of x
0.005769844
[1] 11
> print(dt_2014_oldQ4$tic)
> mean(dt_2014_oldQ4$om)
[1] 0.1390558
> sd(dt_2014_oldQ4$om)
[1] 0.1754235
> mean(dt_2014_rs$om)
[1] -0.2775218
> sd(dt_2014_rs$om)
[1] 0.7319373
[1] 11
> print(dt_2016_oldQ4$tic)
> mean(dt_2016_oldQ4$om)
[1] 0.07157916
> sd(dt_2016_oldQ4$om)
[1] 0.4081531
> mean(dt_2016_rs$om)
[1] 0.04039443
> sd(dt_2016_rs$om)
[1] 0.2157376
Appendix
tStores[1:3,]
tStores[1:3, 2:4]
tStores[, 2:4]
head(tStores[,2:4],3)
tail(tStores[,2:4],3)
head(tStores[order(tStores$SqFt),2:4],3)
head(tStores[order(-tStores$SqFt),2:4],3)
min(tStores$SqFt)
max(tStores$SqFt)
mean(tStores$SqFt)
library(quantmod)
Sys.setenv(TZ = "UTC")
getSymbols(c("AMZN"), src="yahoo", periodicity="weekly",
from=as.Date("2006-01-01"), to=as.Date("2014-12-31"))
names(AMZN)
head(AMZN,3)
tail(AMZN,3)
nrow(AMZN)
plot(AMZN[,6])
getSymbols(c("^GSPC"),src="yahoo",periodicity="weekly",
from=as.Date("2006-01-01"),to=as.Date("2014-12-31"))
names(GSPC)
nrow(GSPC)
head(AMZN,1);head(GSPC,1)
tail(AMZN,1);tail(GSPC,1)
dt1<-cbind(AMZN[,6],GSPC[,6])
head(dt1,3);tail(dt1,3)
nrow(dt1)
names(dt1)
names(dt1)[1:2]<-c("AMZN","SP500")
names(dt1)
par(mfcol=c(1,2))
plot(dt1$AMZN);plot(dt1$SP500)
par(mfcol=c(1,1))
dt2<-dt1['2007-01-01::2010-12-31']
par(mfcol=c(1,2))
plot(dt2$AMZN);plot(dt2$SP500)
par(mfcol=c(1,1))
The following R script was used to prepare the section on how to load and
review Financial Accounting data.
quantile(dt1$SqFt,.25)
quantile(dt1$SqFt,.75)
quantile(dt1$SqFt,.75)-quantile(dt1$SqFt,.25)
IQR(dt1$SqFt)
lWhisker4sQFt<-quantile(dt1$SqFt,.25)-1.5*IQR(dt1$SqFt)
lWhisker4sQFt
uWhisker4sQFt<-quantile(dt1$SqFt,.75)+1.5*IQR(dt1$SqFt)
uWhisker4sQFt
boxplot(dt1$SqFt)
dt1$sqftOutlier<-
ifelse(dt1$SqFt<lWhisker4sQFt|dt1$SqFt>uWhisker4sQFt,1,0)
dt4outlier <- dt1[dt1$sqftOutlier==1,]
dt4outlier
dt1[dt1$SqFt<lWhisker4sQFt|dt1$SqFt>uWhisker4sQFt,]
dt2<-fread("salesByProduct.csv")
str(dt2)
table(dt2$Classification)
prop.table(table(dt2$Classification))
options(scipen=999)
library(quantmod)
Sys.setenv(TZ="UTC")
getSymbols(c("AMZN","^GSPC"),src="yahoo",
periodicity="weekly",
from=as.Date("2006-01-01"),to=as.Date("2016-12-31"))
names(AMZN)
names(GSPC)
dt1<-cbind(AMZN[,6],GSPC[,6])
head(dt1,3);tail(dt1,3)
names(dt1)[1:2]<-c("AMZN","SP500")
names(dt1)
head(dt1)
dt1$lagAMZN<-lag(dt1$AMZN)
dt1$lagSP500<-lag(dt1$SP500)
head(dt1)
dt1$rtrnAMZN<-(dt1$AMZN-dt1$lagAMZN)/dt1$lagAMZN
dt1$rtrnSP500<-(dt1$SP500-dt1$lagSP500)/dt1$lagSP500
head(dt1)
dt1<-na.omit(dt1)
head(dt1)
summary(dt1[,5:6])
lwrWhiskerAMZN<-
quantile(dt1$rtrnAMZN,.25)-1.5*IQR(dt1$rtrnAMZN)
lwrWhiskerAMZN
uprWhiskerAMZN<-
quantile(dt1$rtrnAMZN,.75)+1.5*IQR(dt1$rtrnAMZN)
uprWhiskerAMZN
lwrWhiskerSP500<-
quantile(dt1$rtrnSP500,.25)-1.5*IQR(dt1$rtrnSP500)
lwrWhiskerSP500
uprWhiskerSP500<-
quantile(dt1$rtrnSP500,.75)+1.5*IQR(dt1$rtrnSP500)
uprWhiskerSP500
dt1[dt1$rtrnAMZN==min(dt1$rtrnAMZN),]
dt1[dt1$rtrnSP500==min(dt1$rtrnSP500),]
dt2<-as.data.frame(dt1[,5:6])
dt2$directionAMZN<-ifelse(dt2$rtrnAMZN>0,"AMZN_up",
ifelse(dt2$rtrnAMZN==0,"AMZN_unchanged","AMZN_down"))
dt2$directionSP500<-ifelse(dt2$rtrnSP500>0,"SP500_up",
ifelse(dt2$rtrnSP500==0,"SP500_unchanged","SP500_down"))
head(dt2)
table(dt2$directionSP500)
table(dt2$directionAMZN)
table(dt2$directionSP500,dt2$directionAMZN)
dt<-dt[dt$sale>1,]
nrow(dt)
dt_2010<-dt[dt$fyear==2010,]
summary(dt_2010[,8:9])
dt_2010$status_OIBDP<-ifelse(dt_2010$oibdp>=0,
"profits_OIBDP","losses_OIBDP")
prop.table(table(dt_2010$status_OIBDP))
dt_2010$status_IB<-ifelse(dt_2010$ib>=0,"profits_IB",
"losses_IB")
prop.table(table(dt_2010$status_IB))
dt_2010$gm<-(dt_2010$sale-dt_2010$cogs)/dt_2010$sale
dt_2010$om<-dt_2010$oibdp/dt_2010$sale
dt_2010$pm<-dt_2010$ib/dt_2010$sale
names(dt_2010)
summary(dt_2010[,16:18])
lwrWhisker_pm<-
quantile(dt_2010$pm,.25)-1.5*IQR(dt_2010$pm)
lwrWhisker_pm
uprWhisker_pm<-
quantile(dt_2010$pm,.75)+1.5*IQR(dt_2010$pm)
uprWhisker_pm
dt_2010$relativePosition_pm<-
ifelse(dt_2010$pm>uprWhisker_pm,"topOutlier_pm",
ifelse(dt_2010$pm>median(dt_2010$pm),
"aboveMedian_pm","belowMedian_pm"))
table(dt_2010$relativePosition_pm)
prop.table(table(dt_2010$relativePosition_pm))
The following R script was used to prepare the section on hypothesis testing
for management accounting data.
library(data.table)
dt1 <- fread("salesByProduct.csv")
str(dt1)
summary(dt1_L[, 4:6])
summary(dt1_W[, 4:6])
sample(1:nrow(dt1_L),25)
sample(1:nrow(dt1_L),25)
set.seed(123)
sample(1:nrow(dt1_L),25)
set.seed(123)
sample(1:nrow(dt1_L),25)
set.seed(123)
dt1_L_rs<-dt1_L[sample(1:nrow(dt1_L),25),]
head(dt1_L_rs,3)
tail(dt1_L_rs,3)
set.seed(123)
dt1_W_rs<-dt1_W[sample(1:nrow(dt1_W),25),]
head(dt1_W_rs,3)
tail(dt1_W_rs,3)
t.test(dt1_L_rs$averagePrice, dt1_W_rs$averagePrice,
level = 0.95, var.equal = FALSE,
alternative= "two.sided", conf.level = 0.95)
mean(dt_2012_rs$om)
sd(dt_2012_rs$om)
t.test(dt_2012_oldQ4$om, dt_2012_rs$om,
level=.90, var.equal = FALSE,
alternative= "greater", conf.level = 0.90)
B.1 Compustat
Financial statement analysis requires a point of reference. This point can be
the company itself or other firms/competitors. When the point of reference
is the company itself, we compare the firm’s performance in current period
to its performance in prior periods (this means we use time-series data). In
section (B.1.1), we will learn how to extract data from Compustat for a single
company.
As you can see in Figure B.4, the output shows the date range that we
have selected (Jan 2010 to Dec 2016), the input code (AAPL), the fact that
there are no constraints, and the seven variables that we have chosen: com-
pany name=CONM, ticker symbol=TIC, total assets (AT), cost of good sold
(COGS), income before extraordinary items (IB), operating income before
depreciation (OIBDP), and sales (SALE). Notice that industry classifications
(NAICS and SIC) and headquarter location (LOC) do not show as variables
selected.
Once the query has been generated, we click on the link and we can
see the results on the browser. The output (a partial screen shot shown in
Figure B.5) shows the financial data for Apple, starting with fiscal year 2000,
as well as the industry classification (NAICS = 334220 and SIC = 3663).
From the web site of Statistics Canada (http://www23.statcan.gc.ca/imdb/p3VD.pl?Function=getVD&TVD=307532) we can find that NAICS=334220 covers firms in radio and television broadcasting and wireless communications equipment manufacturing.
1. Date range: select Fiscal Year and confine the date range to January
1, 2010 to December 31, 2016.
2. Apply your company codes: instead of searching for a single firm, select 'Search the entire database'.
3. Screening Variables: de-select all output options. Given the nature of our analysis, we would not be using this output.
4. We leave the same variables selected as we had them in the Apple query.
Our list of variables includes the following:
(a) company name=CONM,
(b) ticker symbol=TIC,
(c) headquarter location (LOC),
(d) industry classifications (NAICS and SIC)
(e) total assets (AT),
(f) cost of good sold (COGS),
(g) income before extraordinary items (IB),
(h) operating income before depreciation (OIBDP), and
(i) sales (SALE)
5. Under Conditional Statements we select NAICS from the first drop
down menu and set it equal to 334220 (See Figure B.6).
Please visit Wikipedia to learn about NAICS and SIC, as well as other commonly used industry classification systems: https://en.wikipedia.org/wiki/Industry_classification.