
2 INTRODUCTION TO R

In this chapter, we discuss some of the basic concepts of R. R is an open-source programming language supported by the R Foundation for Statistical Computing, and it has been growing in popularity in the data science community for years. Many users work with R through the RStudio environment.

In contrast to SAS, where columns are defined as variables and rows as observations, R embraces the matrix concept with rows and columns and leaves the interpretation up to the user. To avoid doubt, the terms "columns"/"variables", "rows"/"observations" and "matrix"/"data" are used interchangeably throughout this book to align it with the main textbook "Credit Risk Analytics: Measurement Techniques, Applications and Examples in SAS". Furthermore, we use the terms "fitting" (predominantly used in R) and "estimation" (predominantly used in SAS) interchangeably.

Software such as SAS readily provides the user with additional output when the relevant options are specified in a procedure, for example model diagnostic plots for a linear regression model. In contrast, an analysis in R is frequently built up as a series of steps. It is therefore often necessary to feed the results produced in one step into the next, and so on, until the desired information is obtained.
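As a minimal illustration of this step-by-step style (a sketch using R's built-in mtcars data, not the mortgage data introduced below), each step stores an object that the next step consumes:

fit <- lm(mpg ~ wt, data = mtcars)  # step 1: fit a linear model
fit.summary <- summary(fit)         # step 2: compute the summary object
fit.summary$r.squared               # step 3: extract a single statistic
plot(fit, which = 1)                # step 4: a residuals-vs-fitted diagnostic plot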

R can be used to carry out the same arithmetic operations as a pocket calculator. To use results in further calculations, they can be stored in an object. This is done with the assignment symbol "<-" (or, alternatively, the operator "="). Five data structures are used most frequently: vectors, matrices, arrays, data frames and lists. While the first three must contain elements of a single type (e.g., all entries numeric), this is not required for data frames and lists. Some functions (such as the lm() command) only work if the model variables are stored in a data frame.
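A minimal sketch of assignment and the five structures (toy values, not part of the mortgage example):

x <- 2 + 3                                      # store an arithmetic result in an object
v <- c(1, 2, 3)                                 # vector: all entries of one type
m <- matrix(1:6, nrow = 2)                      # matrix: all entries of one type
a <- array(1:8, dim = c(2, 2, 2))               # array: all entries of one type
df <- data.frame(id = 1:2, name = c("a", "b"))  # data frame: mixed types allowed
l <- list(v, m, df)                             # list: any collection of objects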

In a first step, let’s do some housekeeping and clear the workspace.


remove(list = ls())

Then, import the external CSV file for our mortgage data set into R.
mortgage <- read.csv("mortgage.csv")

Data Manipulation

One convenient first step is to convert the object to a data frame. Data frames provide easy access to and manipulation of columns, which generally represent variables. This step is not strictly necessary here, as the function read.csv() returns a data frame by default; however, it is useful if the data were originally in a different format, such as a matrix.
mortgage <- as.data.frame(mortgage)
The function attach() adds a data set to the R search path, creating a shortcut that allows the variables in a data frame to be used directly. For example, without attach(mortgage), you must type mortgage$LTV_time to use the column LTV_time; after attach(mortgage), LTV_time can be used directly without any additional reference. The function detach() removes a data set from the search path again.
attach(mortgage)

The advantage of attach() is that it reduces typing when variables are manipulated repeatedly. However, be careful when working with multiple data sets: if the same name exists in more than one attached data set, the duplicated names on the search path mask each other and R may not pick the copy you intend.
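A minimal illustration of this masking with two toy data frames (hypothetical, not part of the mortgage example):

d1 <- data.frame(x = 1:3)
d2 <- data.frame(x = 4:6)
attach(d1)
attach(d2)            # R warns that x from d1 is masked by x from d2
x                     # returns 4 5 6: the copy attached last wins
detach(d2); detach(d1)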

Basic R
This section shows some useful functions for manipulating data in R. First, subsamples can be generated from a data set with the function subset(), for example a subsample containing only observations with FICO scores of at least 500.
mortgage.temp <- subset(mortgage, FICO_orig_time >= 500)
There is an alternative to subset() for retrieving the rows of a data set that satisfy certain conditions. data[x, y] refers to the element of a matrix or data frame in row x and column y, so data[condition, ] returns the rows that satisfy the condition and, because the column argument is left blank, all columns.
mortgage.temp2 <- mortgage[default_time==1, ]
New variables can be derived from existing ones in three steps. First, create a new empty vector named FICO_cat with the function vector(); its length is set equal to that of the column FICO_orig_time in the mortgage data set.
FICO_cat <- vector(length = length(FICO_orig_time))
Second, assign values to FICO_cat using a for loop that runs over all elements. In each iteration, the value of FICO_cat is determined by the value of FICO_orig_time.
for (i in 1:length(FICO_orig_time)) {
  FICO_cat[i] <- if ((FICO_orig_time[i] > 500) & (FICO_orig_time[i] <= 700)) {
    1
  } else if (FICO_orig_time[i] > 700) {
    2
  } else {
    0
  }
}
Third, add the new vector FICO_cat as a new column to the mortgage data set.
mortgage[, "FICO_cat"] = FICO_cat
Variables can also be deleted from a data set. For example, delete the column status_time from mortgage.temp by setting it to NULL.
mortgage.temp$status_time <- NULL
Combine variables of interest by columns into a new matrix select.cols.
select.cols <- cbind(default_time, FICO_orig_time, LTV_orig_time, gdp_time)
Calculate summary statistics for the selected columns with the function apply(); apply(x, 2, statistic) applies the chosen function to the matrix x by columns (i.e., along the second dimension).
n.mortgage <- apply(select.cols, 2, length)
mean.mortgage <- apply(select.cols, 2, mean)
st.dev.mortgage <- apply(select.cols, 2, sd)
min.mortgage <- apply(select.cols, 2, min)
max.mortgage <- apply(select.cols, 2, max)
Then, combine the calculated statistics into a matrix, rename its columns and print the resulting table.
select.summary <- cbind(n.mortgage,mean.mortgage, st.dev.mortgage,
min.mortgage, max.mortgage)
colnames(select.summary) <- c("N", "Mean", "Std Dev", "Min", "Max")
print(select.summary)
## N Mean Std Dev Min Max
## default_time 622489 0.024 0.154 0.000 1.000
## FICO_orig_time 622489 673.617 71.725 400.000 840.000
## LTV_orig_time 622489 78.975 10.127 50.100 218.500
## gdp_time 622489 1.381 1.965 -4.147 5.132
Exhibit 2.2
The function lm() fits a linear regression model; the function summary() retrieves the key statistics of the fitted model.
mortgage.lm <- lm(default_time ~ FICO_orig_time + LTV_orig_time +
gdp_time)
summary(mortgage.lm)

##
## Call:
## lm(formula = default_time ~ FICO_orig_time + LTV_orig_time +
## gdp_time)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.10031 -0.03174 -0.02205 -0.01354 1.00863
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.957e-02 2.591e-03 30.71 <2e-16 ***
## FICO_orig_time -1.154e-04 2.747e-06 -42.02 <2e-16 ***
## LTV_orig_time 3.808e-04 1.944e-05 19.59 <2e-16 ***
## gdp_time -5.454e-03 9.914e-05 -55.02 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1535 on 622485 degrees of freedom
## Multiple R-squared: 0.008343, Adjusted R-squared: 0.008338
## F-statistic: 1746 on 3 and 622485 DF, p-value: < 2.2e-16
Exhibit 2.3

Self-defined functions in R

The following example shows how to create your own (self-defined) function that fits a linear regression model to the supplied data. lhs and rhs are the input arguments; d combines them into a data frame. example.lm is the fitted linear regression, with lhs as the dependent variable and the columns of rhs as the independent variable(s). The output shows the summary statistics of the fitted linear model.
example <- function(lhs, rhs){
  d = as.data.frame(cbind(lhs, rhs))
  example.lm = lm(lhs ~ ., data = d)
  return(summary(example.lm))
}
The merit of writing code as functions is that a function can be reused many times with only a call and its arguments, often saving a substantial amount of duplicated code.
The following example output is from the self-defined function using the control variable
FICO_orig_time.
example(lhs = default_time, rhs = FICO_orig_time)
##
## Call:
## lm(formula = lhs ~ ., data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.05625 -0.02967 -0.02349 -0.01731 0.99248
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.029e-01 1.842e-03 55.85 <2e-16 ***
## rhs -1.166e-04 2.720e-06 -42.87 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1539 on 622487 degrees of freedom
## Multiple R-squared: 0.002944, Adjusted R-squared: 0.002942
## F-statistic: 1838 on 1 and 622487 DF, p-value: < 2.2e-16
Exhibit 2.4

The following example output is from the self-defined function using the control variables FICO_orig_time and LTV_orig_time.
example(lhs = default_time, rhs = cbind(FICO_orig_time, LTV_orig_time))
##
## Call:
## lm(formula = lhs ~ ., data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.07745 -0.03045 -0.02360 -0.01688 1.00027
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.835e-02 2.590e-03 26.39 <2e-16 ***
## FICO_orig_time -1.087e-04 2.751e-06 -39.50 <2e-16 ***
## LTV_orig_time 3.698e-04 1.948e-05 18.98 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1539 on 622486 degrees of freedom

## Multiple R-squared: 0.003521, Adjusted R-squared: 0.003518
## F-statistic: 1100 on 2 and 622486 DF, p-value: < 2.2e-16
Exhibit 2.5

The following example output is from the self-defined function using the control variables FICO_orig_time, LTV_orig_time and gdp_time.
example(lhs = default_time, rhs = cbind(FICO_orig_time, LTV_orig_time,
gdp_time))
##
## Call:
## lm(formula = lhs ~ ., data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.10031 -0.03174 -0.02205 -0.01354 1.00863
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.957e-02 2.591e-03 30.71 <2e-16 ***
## FICO_orig_time -1.154e-04 2.747e-06 -42.02 <2e-16 ***
## LTV_orig_time 3.808e-04 1.944e-05 19.59 <2e-16 ***
## gdp_time -5.454e-03 9.914e-05 -55.02 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1535 on 622485 degrees of freedom
## Multiple R-squared: 0.008343, Adjusted R-squared: 0.008338
## F-statistic: 1746 on 3 and 622485 DF, p-value: < 2.2e-16
Exhibit 2.6
3 EXPLORATORY DATA ANALYSIS

This chapter provides the R code for exploratory data analysis. The following code clears the workspace, installs the required packages and reads in the data.
Clear workspace.
remove(list=ls())
Install the required packages, load them and read in the data. The install.packages() lines can be left commented out if the packages are already installed:
#install.packages("moments")
#install.packages("gmodels")
#install.packages("vcd")
#install.packages("mixtools")
library(moments)
library(gmodels)
library(vcd)
library(mixtools)
mortgage <- read.csv("mortgage.csv")

One-Dimensional Analysis

Observed Frequencies and Empirical Distributions


We first compute observed frequencies for the defaults and empirical distributions for the FICO score
and LTV.
Initialize empty vectors which will be used to calculate statistics of default_time.
Frequency <- numeric()
Percent <- numeric()
Cumulative.Frequency <- numeric()
Cumulative.Percent <- numeric()
Extract unique values of default_time.
default.indicator <- unique(mortgage$default_time)
For the two cases where default_time is 0 and 1, calculate the statistics. First, extract the subset of mortgage with the given value of default_time and calculate its frequency and percentage. Second, calculate the cumulative frequency and percentage, using a different formula for the first and subsequent categories.

for (i in 1:2){
  temp <- subset(mortgage, default_time == default.indicator[i])
  Frequency[i] <- length(temp$default_time)
  Percent[i] <- round(Frequency[i]/nrow(mortgage), 4)*100
  if (i == 1){
    Cumulative.Frequency[i] <- Frequency[i]
    Cumulative.Percent[i] <- Percent[i]
  } else {
    Cumulative.Frequency[i] <- Cumulative.Frequency[i-1] + Frequency[i]
    Cumulative.Percent[i] <- Cumulative.Percent[i-1] + Percent[i]
  }
}
Store the results in a data frame and show it.
results <- cbind.data.frame(default.indicator, Frequency, Percent,
Cumulative.Frequency, Cumulative.Percent)
colnames(results) <- c("default_time", "Frequency", "Percent",
"Cumulative Frequency", "Cumulative Percent")
print(results)
##   default_time Frequency Percent Cumulative Frequency Cumulative Percent
## 1            0    607331   97.56               607331               97.56
## 2            1     15158    2.44               622489              100.00
Exhibit 3.1
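The same table can also be produced more compactly with the built-in functions table() and cumsum(); a sketch:

freq <- table(mortgage$default_time)     # frequencies of 0 and 1
pct <- round(100 * freq / sum(freq), 2)  # percentages
cbind(Frequency = freq, Percent = pct,
      Cumulative.Frequency = cumsum(freq),
      Cumulative.Percent = cumsum(pct))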
Plot the histogram of FICO_orig_time with the function hist(), then add a frame to the histogram with the function box().
hist(mortgage$FICO_orig_time, freq = FALSE, breaks = 100,
     main = "Distribution of FICO_orig_time", xlab = "FICO_orig_time")
box()
Exhibit 3.2a

Plot the cumulative distribution function for FICO_orig_time with the function plot.ecdf().
plot.ecdf(mortgage$FICO_orig_time, ylab = "Cumulative Percent",
          xlab = "FICO_orig_time",
          main = "Cumulative Distribution Function for FICO_orig_time",
          pch = ".")

Exhibit 3.2b

Similar to above, plot the histogram and cumulative distribution function for LTV_orig_time.
hist(mortgage$LTV_orig_time, freq = FALSE, breaks = 100,
     main = "Distribution of LTV_orig_time", xlab = "LTV_orig_time")
box()
Exhibit 3.2c
plot.ecdf(mortgage$LTV_orig_time, ylab = "Cumulative Percent",
          xlab = "LTV_orig_time",
          main = "Cumulative Distribution Function for LTV_orig_time",
          verticals = TRUE, pch = ".")

Exhibit 3.2d

Location Measures
Next, we compute location measures (mean, median and mode) for the three variables default, FICO and LTV, as well as some percentiles (quantiles), and create Q-Q plots. R does not have a built-in function for the mode; the following self-defined function is commonly used to find it.
get.mode <- function(x){
  unique.x <- unique(x)
  unique.x[which.max(tabulate(match(x, unique.x)))]
}
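For example, get.mode(c(1, 2, 2, 3)) returns 2; in the case of a tie, the value that appears first in the data is returned.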
Create a function to calculate the required descriptive statistics, including count, mean, median, mode, 1% quantile and 99% quantile, and combine these statistics into one vector.
proc.means <- function(x){
  N <- length(x)
  Mean <- mean(x, na.rm = TRUE)
  Median <- median(x, na.rm = TRUE)
  Mode <- get.mode(x)
  Pctl_1st <- quantile(x, 0.01)
  Pctl_99th <- quantile(x, 0.99)
  proc.means.results <- as.vector(round(cbind(N, Mean, Median, Mode,
                                              Pctl_1st, Pctl_99th), 4))
}
Generate a vector var.names to store the names of requested variables.
var.names <- c("default_time", "FICO_orig_time", "LTV_orig_time")
Generate an empty data frame. Then use var.names, the names of the variables of interest, to extract each variable from the mortgage data set and calculate the required statistics with the self-defined function proc.means().
loc.measures <- as.data.frame(matrix(NA, nrow = 6, ncol = 3))
for (i in 1:3){
  loc.measures[, i] <- lapply(mortgage[var.names[i]], proc.means)
}
Transpose the data frame, label its rows and columns, and present the results.
loc.measures <- as.data.frame(t(loc.measures), row.names = var.names)
colnames(loc.measures) <- c("N", "Mean", "Median", "Mode", "1st Pctl",
"99th Pctl")
print(loc.measures)
## N Mean Median Mode 1st Pctl 99th Pctl
## default_time 622489 0.0244 0 0 0.0 1
## FICO_orig_time 622489 673.6169 678 660 506.0 801
## LTV_orig_time 622489 78.9755 80 80 52.2 100
Exhibit 3.3
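As a design note, the loop and transpose can be condensed into a single sapply() call, which applies proc.means() to each selected column and binds the results; a sketch (loc.measures2 is a hypothetical name):

loc.measures2 <- t(sapply(mortgage[var.names], proc.means))
colnames(loc.measures2) <- c("N", "Mean", "Median", "Mode",
                             "1st Pctl", "99th Pctl")
print(loc.measures2)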
Generate a Q-Q plot for FICO_orig_time with the function qqnorm() and add a theoretical line with the function qqline().
qqnorm(mortgage$FICO_orig_time, xlim = c(-6, 6), ylim = c(200, 1200),
       ylab = "FICO_orig_time", xlab = "Normal Quantiles",
       main = "Q-Q Plot for FICO_orig_time")
qqline(mortgage$FICO_orig_time)

Exhibit 3.4a

Generate a Q-Q plot for LTV_orig_time with the function qqnorm() and add a theoretical line with the function qqline().
qqnorm(mortgage$LTV_orig_time, xlim = c(-6, 6), ylim = c(0, 250),
       ylab = "LTV_orig_time", xlab = "Normal Quantiles",
       main = "Q-Q Plot for LTV_orig_time")
qqline(mortgage$LTV_orig_time)
Exhibit 3.4b

Next, dispersion measures, skewness and kurtosis are computed.

Similar to Exhibit 3.3, firstly define a function to calculate the required descriptive statistics.
proc.means.ext <- function(x){
  N <- length(x)
  Minimum <- min(x, na.rm = TRUE)
  Maximum <- max(x, na.rm = TRUE)
  Range <- range(x)[2] - range(x)[1]
  Quartile.Range <- quantile(x, 0.75) - quantile(x, 0.25)
  Variance <- var(x, na.rm = TRUE)
  Std.Dev <- sqrt(Variance)
  Coeff.Variation <- (Std.Dev/mean(x, na.rm = TRUE))*100
  proc.means.ext.results <- as.vector(round(cbind(N, Minimum, Maximum, Range,
                                                  Quartile.Range, Variance,
                                                  Std.Dev, Coeff.Variation), 4))
}

Then, use a vector of variable names to extract these variables from the mortgage data set and calculate the statistics by using the self-defined function.
var.names <- c("default_time", "FICO_orig_time", "LTV_orig_time")
disp.measures <- as.data.frame(matrix(NA, nrow = 8, ncol = 3))
for (i in 1:3){
  disp.measures[, i] <- lapply(mortgage[var.names[i]], proc.means.ext)
}

Finally, combine these calculated statistics into a data frame, adjust its layout and present it.
disp.measures <- as.data.frame(t(disp.measures), row.names = var.names)
colnames(disp.measures) <- c("N", "Minimum", "Maximum", "Range",
"Quartile Range", "Variance", "Std. Dev.", "Coeff of Variation")
print(disp.measures)
## N Minimum Maximum Range Quartile Range Variance
## default_time 622489 0.0 1.0 1.0 0 0.0238
## FICO_orig_time 622489 400.0 840.0 440.0 103 5144.4122
## LTV_orig_time 622489 50.1 218.5 168.4 5 102.5572
## Std. Dev. Coeff of Variation
## default_time 0.1541 632.9831
## FICO_orig_time 71.7246 10.6477
## LTV_orig_time 10.1271 12.8230
Exhibit 3.5

Similar to Exhibit 3.3, first define a function to calculate the required descriptive statistics. The functions skewness() and kurtosis() from the library moments are used to calculate skewness and Pearson's measure of kurtosis; subtracting 3 from Pearson's kurtosis yields the excess kurtosis.
proc.means.skewKurt <- function(x){
  N <- length(x)
  Skewness <- skewness(x, na.rm = TRUE)
  Kurtosis <- kurtosis(x, na.rm = TRUE) - 3  # excess kurtosis
  proc.means.skewKurt.results <- as.vector(round(cbind(N, Skewness,
                                                       Kurtosis), 4))
}
Then, use a vector of variable names to extract these variables from the mortgage data set and calculate
the statistics by using the self-defined function.
var.names <- c("default_time", "FICO_orig_time", "LTV_orig_time")
skewKurt.measures <- as.data.frame(matrix(NA, nrow = 3, ncol = 3))
for (i in 1:3){
  skewKurt.measures[, i] <- lapply(mortgage[var.names[i]], proc.means.skewKurt)
}
Finally, combine these calculated statistics into a data frame, adjust its layout and present it.
skewKurt.measures <- as.data.frame(t(skewKurt.measures), row.names = var.names)
colnames(skewKurt.measures) <- c("N", "Skewness", "Kurtosis")
print(skewKurt.measures)
## N Skewness Kurtosis
## default_time 622489 6.1718 36.0917
## FICO_orig_time 622489 -0.3213 -0.4684
## LTV_orig_time 622489 -0.1964 1.4364
Exhibit 3.6

Two-Dimensional Analysis

Joint Empirical Distributions


Having explored the empirical data one dimension at a time, we may also be interested in interrelations between variables. We therefore first create two-dimensional (or two-way) frequency tables, e.g., for default and FICO classes.
Create a new vector FICO_orig_time_factor. Based on the value of FICO_orig_time, assign values 0
to 4 to FICO_orig_time_factor.
FICO_orig_time_factor <- mortgage$FICO_orig_time
FICO_orig_time_factor[FICO_orig_time_factor < quantile(mortgage$FICO_orig_time, 0.2)] <- 0
FICO_orig_time_factor[(FICO_orig_time_factor >= quantile(mortgage$FICO_orig_time, 0.2)) &
                        (FICO_orig_time_factor < quantile(mortgage$FICO_orig_time, 0.4))] <- 1
FICO_orig_time_factor[(FICO_orig_time_factor >= quantile(mortgage$FICO_orig_time, 0.4)) &
                        (FICO_orig_time_factor < quantile(mortgage$FICO_orig_time, 0.6))] <- 2
FICO_orig_time_factor[(FICO_orig_time_factor >= quantile(mortgage$FICO_orig_time, 0.6)) &
                        (FICO_orig_time_factor < quantile(mortgage$FICO_orig_time, 0.8))] <- 3
FICO_orig_time_factor[FICO_orig_time_factor >= quantile(mortgage$FICO_orig_time, 0.8)] <- 4
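An equivalent, more compact construction uses cut() with quantile breakpoints (a sketch, assuming the quintile breakpoints are distinct; FICO_factor2 is a hypothetical name). Note it returns a factor with levels 0 to 4 rather than a numeric vector:

breaks <- quantile(mortgage$FICO_orig_time, probs = seq(0, 1, by = 0.2))
FICO_factor2 <- cut(mortgage$FICO_orig_time, breaks = breaks, labels = 0:4,
                    right = FALSE, include.lowest = TRUE)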

Generate the two-dimensional contingency table with the function CrossTable() from the library gmodels.
CrossTable(mortgage$default_time, FICO_orig_time_factor, prop.t=TRUE,
prop.r=TRUE, prop.c=TRUE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 622489
##
##
## | FICO_orig_time_factor
## mortgage$default_time | 0 | 1 | 2 | 3 | 4 | Row Total |
## ----------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 0 | 119046 | 118890 | 124047 | 121876 | 123472 | 607331 |
## | 14.558 | 5.383 | 0.229 | 3.387 | 22.468 | |
## | 0.196 | 0.196 | 0.204 | 0.201 | 0.203 | 0.976 |
## | 0.965 | 0.969 | 0.974 | 0.981 | 0.989 | |
## | 0.191 | 0.191 | 0.199 | 0.196 | 0.198 | |
## ----------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 1 | 4328 | 3790 | 3269 | 2385 | 1386 | 15158 |
## | 583.295 | 215.667 | 9.188 | 135.721 | 900.201 | |
## | 0.286 | 0.250 | 0.216 | 0.157 | 0.091 | 0.024 |
## | 0.035 | 0.031 | 0.026 | 0.019 | 0.011 | |
## | 0.007 | 0.006 | 0.005 | 0.004 | 0.002 | |
## ----------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## Column Total | 123374 | 122680 | 127316 | 124261 | 124858 | 622489 |
## | 0.198 | 0.197 | 0.205 | 0.200 | 0.201 | |
## ----------------------|-----------|-----------|-----------|-----------|-----------|-----------|
##
##

Exhibit 3.7
Another way of inferring the relation between the two variables (without grouping FICO first) is to look at a box plot.
Generate a box plot of FICO_orig_time by the values of default_time with the function boxplot(). Then compute the means of FICO_orig_time by default_time and add them to the plot as points with the function points().
boxplot(FICO_orig_time ~ default_time, data = mortgage, range = 0,
        xlab = "default_time", ylab = "FICO_orig_time",
        main = "Distribution of FICO_orig_time by default_time")
means <- tapply(mortgage$FICO_orig_time, mortgage$default_time, mean)
points(means, pch = 18)

Exhibit 3.8

Similar to Exhibit 3.8, generate a boxplot for LTV_orig_time.


boxplot(LTV_orig_time ~ default_time, data = mortgage, range = 0,
        xlab = "default_time", ylab = "LTV_orig_time",
        main = "Distribution of LTV_orig_time by default_time")
means <- tapply(mortgage$LTV_orig_time, mortgage$default_time, mean)
points(means, col = "blue", pch = 18)

Exhibit 3.9

Correlation Measures

We now compute measures for association and correlation, namely the chi-square statistic, the phi coefficient, the contingency coefficient and Cramer's V.

Create a simple cross table of default_time and FICO_orig_time_factor with the function xtabs().
tab <- xtabs( ~ FICO_orig_time_factor + mortgage$default_time)

Use the function assocstats() from the library vcd to compute the chi-square based measures.
assocstats(tab)
## X^2 df P(> X^2)
## Likelihood Ratio 2043.9 4 0
## Pearson 1890.1 4 0
##
## Phi-Coefficient : NA
## Contingency Coeff.: 0.055
## Cramer's V : 0.055
Exhibit 3.10
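For reference, the chi-square based measures reported by assocstats() can be cross-checked by hand from their standard definitions (a sketch):

chi2 <- as.numeric(chisq.test(tab)$statistic)  # Pearson chi-square statistic
n <- sum(tab)                                  # number of observations
k <- min(dim(tab))                             # smaller table dimension
phi <- sqrt(chi2 / n)                          # phi (only meaningful for 2x2 tables)
contingency <- sqrt(chi2 / (chi2 + n))         # contingency coefficient
cramers.v <- sqrt(chi2 / (n * (k - 1)))        # Cramer's V
round(c(Contingency = contingency, Cramers.V = cramers.v), 3)  # both approx. 0.055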

Draw a random sample, setting a seed value so that the random draw can be identified and the same experiment repeated, of 1% of the observations without replacement from FICO_orig_time and from LTV_orig_time.
set.seed(12345)
smpl.FICO <- sample(mortgage$FICO_orig_time, size = 0.01*nrow(mortgage),
                    replace = FALSE)
smpl.LTV <- sample(mortgage$LTV_orig_time, size = 0.01*nrow(mortgage),
                   replace = FALSE)
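Note that these two sample() calls draw the FICO and LTV values independently, so the pairing of values from the same loan is not preserved. If matched pairs are desired, one can instead sample row indices once and subset both columns (a sketch; idx and smpl.paired are hypothetical names):

set.seed(12345)
idx <- sample(nrow(mortgage), size = floor(0.01 * nrow(mortgage)),
              replace = FALSE)                         # one draw of row indices
smpl.paired <- mortgage[idx, c("FICO_orig_time", "LTV_orig_time")]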

We can also compute the Pearson and Spearman correlation coefficients as well as Kendall's tau-b, and produce a scatter plot.

Compute Pearson's correlation and perform a test with the function cor.test().


cor.test(smpl.FICO, smpl.LTV, method = "pearson")
##
## Pearson's product-moment correlation
##
## data: smpl.FICO and smpl.LTV
## t = -1.1771, df = 6222, p-value = 0.2392
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.03975041 0.00992735
## sample estimates:
## cor
## -0.01492074
Exhibit 3.11a

Compute Spearman's correlation and perform a test.


cor.test(smpl.FICO, smpl.LTV, method = "spearman", exact = FALSE)
##
## Spearman's rank correlation rho
##
## data: smpl.FICO and smpl.LTV
## S = 4.0792e+10, p-value = 0.233

## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.01511976
Exhibit 3.11b

Compute Kendall's tau-b correlation and perform a test.


cor.test(smpl.FICO, smpl.LTV, method = "kendall")
##
## Kendall's rank correlation tau
##
## data: smpl.FICO and smpl.LTV
## z = -1.1883, p-value = 0.2347
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
## tau
## -0.01067879
Exhibit 3.11c

Combine the two samples into one data set and generate a scatter plot from it. Then add ellipses with the function ellipse() from the library mixtools, and finally add a legend.
smpl.data <- cbind(smpl.FICO, smpl.LTV)
plot(smpl.data, xlab = "FICO_orig_time", ylab = "LTV_orig_time",
     main = "Scatter Plot")
ellipse(mu = colMeans(smpl.data), sigma = cov(smpl.data), alpha = 0.20,
        npoints = 250, lwd = 2)
ellipse(mu = colMeans(smpl.data), sigma = cov(smpl.data), alpha = 0.30,
        npoints = 250, lty = 2, lwd = 2)
legend(x = "topleft", y.intersp = 0.80, cex = 0.80,
       title = "Prediction Ellipses", legend = c("80%", "70%"),
       bty = "n", lty = c(1, 2), lwd = 1)
Exhibit 3.12

Highlights of Inductive Statistics

Confidence Intervals
Once parameters are estimated, the estimates will not match the true (i.e., unknown data-generating) parameters of the population exactly but instead deviate randomly from them. Hence, we may want to compute confidence intervals and conduct hypothesis tests.
Generate a self-defined function that computes confidence limits of the form mean ± z × (std. deviation)/√n, assuming a normal distribution; note that using qnorm(0.99) yields a 98% two-sided confidence interval.
proc.univariate <- function(x){
  n <- length(x)
  Mean <- mean(x, na.rm = TRUE)
  Std.Deviation <- sqrt(var(x, na.rm = TRUE))
  Variance <- var(x, na.rm = TRUE)
  Lower.Confidence.Limit <- Mean - qnorm(0.99)*Std.Deviation/sqrt(n)
  Upper.Confidence.Limit <- Mean + qnorm(0.99)*Std.Deviation/sqrt(n)
  results <- matrix(round(c(Mean, Std.Deviation, Variance,
                            Lower.Confidence.Limit,
                            Upper.Confidence.Limit), 5),
                    nrow = 1, ncol = 5)
  colnames(results) <- c("Mean", "Std. Deviation", "Variance",
                         "Lower Confidence Limit", "Upper Confidence Limit")
  print(results)
}
Apply the above self-defined function to LTV_orig_time.
proc.univariate(mortgage$LTV_orig_time)
## Mean Std. Deviation Variance Lower Confidence Limit
## [1,] 78.97546 10.12705 102.5572 78.9456
## Upper Confidence Limit
## [1,] 79.00532
Exhibit 3.13
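A quick sanity check of the reported limits, plugging the printed mean and standard deviation into the interval formula:

# mean +/- qnorm(0.99) * sd / sqrt(n), with n = 622489:
78.97546 - qnorm(0.99) * 10.12705 / sqrt(622489)  # approx. 78.9456
78.97546 + qnorm(0.99) * 10.12705 / sqrt(622489)  # approx. 79.0053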

Hypothesis Testing
Perform a one-sample t-test with the function t.test().
t.test(mortgage$LTV_orig_time, mu = 60, alternative = "two.sided")
##
## One Sample t-test
##
## data: mortgage$LTV_orig_time
## t = 1478.3, df = 622490, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 60
## 95 percent confidence interval:
## 78.95030 79.00062
## sample estimates:
## mean of x
## 78.97546
Exhibit 3.14
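The reported t statistic can be reproduced by hand from t = (mean - mu0)/(s/sqrt(n)), using the printed summary statistics:

# (78.97546 - 60) / (10.12705 / sqrt(622489)) is approx. 1478.3,
# matching the value reported by t.test().
(78.97546 - 60) / (10.12705 / sqrt(622489))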
