
2 INTRODUCTION TO R

In this chapter, we discuss some of the basic concepts of R. R is an open-source programming language supported by the R Foundation for Statistical Computing, and it has been growing in popularity in the data science community for years. Many users work with R through the RStudio environment.

In contrast to SAS, where columns are defined as variables and rows as observations, R embraces the matrix concept with rows and columns and leaves the interpretation up to the user. To avoid doubt, the terms "columns"/"variables", "rows"/"observations" and "matrix"/"data" are used interchangeably throughout this book to align it with the main textbook "Credit Risk Analytics: Measurement Techniques, Applications and Examples in SAS". Furthermore, we use the terms "fitting" (predominantly used in R) and "estimation" (predominantly used in SAS) interchangeably.

Software such as SAS readily provides the user with additional output when the relevant options are specified in a procedure, for example model diagnostic plots for a linear regression model. In contrast, an analysis in R is frequently built up as a series of steps. It is therefore often necessary to feed the results produced in one step into the next, and so on, until the desired information is obtained.
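As a minimal illustration of this step-by-step style (a sketch using R's built-in mtcars data, not the mortgage data introduced below), each step stores an object that the next step consumes:

fit <- lm(mpg ~ wt, data = mtcars)  # step 1: fit a linear model
fit.summary <- summary(fit)         # step 2: compute the summary object
fit.summary$r.squared               # step 3: extract a single statistic
plot(fit, which = 1)                # step 4: a residuals-vs-fitted diagnostic plot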

R can be used to carry out the same arithmetic operations as a pocket calculator. To use results in further calculations, they can be stored in an object. This is done with the assignment symbol "<-" (or, alternatively, the operator "="). Five data structures are used most frequently: vectors, matrices, arrays, data frames and lists. While the first three must contain elements of a single type (e.g., all entries numeric), this is not required for data frames and lists. Some functions (such as the lm() command) only work if the model variables are stored in a data frame.
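A minimal sketch of assignment and the five structures (toy values, not part of the mortgage example):

x <- 2 + 3                                      # store an arithmetic result in an object
v <- c(1, 2, 3)                                 # vector: all entries of one type
m <- matrix(1:6, nrow = 2)                      # matrix: all entries of one type
a <- array(1:8, dim = c(2, 2, 2))               # array: all entries of one type
df <- data.frame(id = 1:2, name = c("a", "b"))  # data frame: mixed types allowed
l <- list(v, m, df)                             # list: any collection of objects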

In a first step, let’s do some housekeeping and clear the workspace.


remove(list = ls())

Then, import the external CSV file for our mortgage data set into R.
mortgage <- read.csv("mortgage.csv")

Data Manipulation

One convenient first step is to convert the object to a data frame. Data frames provide easy access to and manipulation of columns, which generally represent variables. This step is not strictly necessary here, as the function read.csv() returns a data frame by default; however, it is useful if the data were originally in a different format, such as a matrix.
mortgage <- as.data.frame(mortgage)
The function attach() adds a data set to the R search path, creating a shortcut that allows the variables in a data frame to be used directly. For example, without attach(mortgage), you must type mortgage$LTV_time to use the column LTV_time; after attach(mortgage), LTV_time can be used directly without any additional reference. The function detach() removes a data set from the search path again.
attach(mortgage)

The advantage of attach() is that it reduces typing when variables are manipulated repeatedly. However, be careful when working with multiple data sets: if the same name exists in more than one attached data set, the duplicated names on the search path mask each other and R may not pick the copy you intend.
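A minimal illustration of this masking with two toy data frames (hypothetical, not part of the mortgage example):

d1 <- data.frame(x = 1:3)
d2 <- data.frame(x = 4:6)
attach(d1)
attach(d2)            # R warns that x from d1 is masked by x from d2
x                     # returns 4 5 6: the copy attached last wins
detach(d2); detach(d1)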

Basic R
This section shows some useful functions for manipulating data in R. First, subsamples can be generated from a data set with the function subset(), for example a subsample containing only observations with FICO scores of at least 500.
mortgage.temp <- subset(mortgage, FICO_orig_time >= 500)
There is an alternative to subset() for retrieving the rows of a data set that satisfy certain conditions. data[x, y] refers to the element of a matrix or data frame in row x and column y, so data[condition, ] returns the rows that satisfy the condition and, because the column argument is left blank, all columns.
mortgage.temp2 <- mortgage[default_time==1, ]
New variables can be derived from existing ones in three steps. First, create a new empty vector named FICO_cat with the function vector(); its length is set equal to that of the column FICO_orig_time in the mortgage data set.
FICO_cat <- vector(length = length(FICO_orig_time))
Second, assign values to FICO_cat using a for loop that runs over all elements. In each iteration, the value of FICO_cat is determined by the value of FICO_orig_time.
for (i in 1:length(FICO_orig_time)) {
  FICO_cat[i] <- if ((FICO_orig_time[i] > 500) & (FICO_orig_time[i] <= 700)) {
    1
  } else if (FICO_orig_time[i] > 700) {
    2
  } else {
    0
  }
}
Third, add the new vector FICO_cat as a new column to the mortgage data set.
mortgage[, "FICO_cat"] = FICO_cat
Variables can also be deleted from a data set. For example, delete the column status_time from mortgage.temp by setting it to NULL.
mortgage.temp$status_time <- NULL
Combine variables of interest by columns into a new matrix select.cols.
select.cols <- cbind(default_time, FICO_orig_time, LTV_orig_time, gdp_time)
Calculate summary statistics for the selected columns with the function apply(); apply(x, 2, statistic) applies the chosen function to the matrix x by columns (i.e., along the second dimension).
n.mortgage <- apply(select.cols, 2, length)
mean.mortgage <- apply(select.cols, 2, mean)
st.dev.mortgage <- apply(select.cols, 2, sd)
min.mortgage <- apply(select.cols, 2, min)
max.mortgage <- apply(select.cols, 2, max)
Then, combine the calculated statistics into a matrix, rename its columns and print the resulting table.
select.summary <- cbind(n.mortgage,mean.mortgage, st.dev.mortgage,
min.mortgage, max.mortgage)
colnames(select.summary) <- c("N", "Mean", "Std Dev", "Min", "Max")
print(select.summary)
## N Mean Std Dev Min Max
## default_time 622489 0.024 0.154 0.000 1.000
## FICO_orig_time 622489 673.617 71.725 400.000 840.000
## LTV_orig_time 622489 78.975 10.127 50.100 218.500
## gdp_time 622489 1.381 1.965 -4.147 5.132
Exhibit 2.2
The function lm() fits a linear regression model; the function summary() retrieves the key statistics of the fitted model.
mortgage.lm <- lm(default_time ~ FICO_orig_time + LTV_orig_time +
gdp_time)
summary(mortgage.lm)

##
## Call:
## lm(formula = default_time ~ FICO_orig_time + LTV_orig_time +
## gdp_time)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.10031 -0.03174 -0.02205 -0.01354 1.00863
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.957e-02 2.591e-03 30.71 <2e-16 ***
## FICO_orig_time -1.154e-04 2.747e-06 -42.02 <2e-16 ***
## LTV_orig_time 3.808e-04 1.944e-05 19.59 <2e-16 ***
## gdp_time -5.454e-03 9.914e-05 -55.02 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1535 on 622485 degrees of freedom
## Multiple R-squared: 0.008343, Adjusted R-squared: 0.008338
## F-statistic: 1746 on 3 and 622485 DF, p-value: < 2.2e-16
Exhibit 2.3

Self-defined functions in R

The following example shows how to create your own (self-defined) function that fits a linear regression model to the supplied data. lhs and rhs are the input arguments; d combines them into a data frame. example.lm is the fitted linear regression, with lhs as the dependent variable and the columns of rhs as the independent variable(s). The output shows the summary statistics of the fitted linear model.
example <- function(lhs, rhs){
  d = as.data.frame(cbind(lhs, rhs))
  example.lm = lm(lhs ~ ., data = d)
  return(summary(example.lm))
}
The merit of writing code as functions is that a function can be reused many times with only a call and its arguments, often saving a substantial amount of duplicated code.
The following example output is from the self-defined function using the control variable
FICO_orig_time.
example(lhs = default_time, rhs = FICO_orig_time)
##
## Call:
## lm(formula = lhs ~ ., data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.05625 -0.02967 -0.02349 -0.01731 0.99248
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.029e-01 1.842e-03 55.85 <2e-16 ***
## rhs -1.166e-04 2.720e-06 -42.87 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1539 on 622487 degrees of freedom
## Multiple R-squared: 0.002944, Adjusted R-squared: 0.002942
## F-statistic: 1838 on 1 and 622487 DF, p-value: < 2.2e-16
Exhibit 2.4

The following example output is from the self-defined function using the control variables FICO_orig_time and LTV_orig_time.
example(lhs = default_time, rhs = cbind(FICO_orig_time, LTV_orig_time))
##
## Call:
## lm(formula = lhs ~ ., data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.07745 -0.03045 -0.02360 -0.01688 1.00027
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.835e-02 2.590e-03 26.39 <2e-16 ***
## FICO_orig_time -1.087e-04 2.751e-06 -39.50 <2e-16 ***
## LTV_orig_time 3.698e-04 1.948e-05 18.98 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1539 on 622486 degrees of freedom

## Multiple R-squared: 0.003521, Adjusted R-squared: 0.003518
## F-statistic: 1100 on 2 and 622486 DF, p-value: < 2.2e-16
Exhibit 2.5

The following example output is from the self-defined function using the control variables FICO_orig_time, LTV_orig_time and gdp_time.
example(lhs = default_time, rhs = cbind(FICO_orig_time, LTV_orig_time,
gdp_time))
##
## Call:
## lm(formula = lhs ~ ., data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.10031 -0.03174 -0.02205 -0.01354 1.00863
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.957e-02 2.591e-03 30.71 <2e-16 ***
## FICO_orig_time -1.154e-04 2.747e-06 -42.02 <2e-16 ***
## LTV_orig_time 3.808e-04 1.944e-05 19.59 <2e-16 ***
## gdp_time -5.454e-03 9.914e-05 -55.02 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1535 on 622485 degrees of freedom
## Multiple R-squared: 0.008343, Adjusted R-squared: 0.008338
## F-statistic: 1746 on 3 and 622485 DF, p-value: < 2.2e-16
Exhibit 2.6
3 EXPLORATORY DATA ANALYSIS

This chapter provides the R code for exploratory data analysis. The following code clears the workspace, installs the required packages and reads in the data.
Clear workspace.
remove(list=ls())
Install the required packages, load them and read in the data. The install.packages() lines can be left commented out if the packages are already installed:
#install.packages("moments")
#install.packages("gmodels")
#install.packages("vcd")
#install.packages("mixtools")
library(moments)
library(gmodels)
library(vcd)
library(mixtools)
mortgage <- read.csv("mortgage.csv")

One-Dimensional Analysis

Observed Frequencies and Empirical Distributions


We first compute observed frequencies for the defaults and empirical distributions for the FICO score
and LTV.
Initialize empty vectors which will be used to calculate statistics of default_time.
Frequency <- numeric()
Percent <- numeric()
Cumulative.Frequency <- numeric()
Cumulative.Percent <- numeric()
Extract unique values of default_time.
default.indicator <- unique(mortgage$default_time)
For the two cases where default_time is 0 and 1, calculate the statistics. First, extract the subset of mortgage with the given value of default_time and calculate its frequency and percentage. Second, calculate the cumulative frequency and percentage, using a different formula for the first and subsequent categories.

for (i in 1:2){
  temp <- subset(mortgage, default_time == default.indicator[i])
  Frequency[i] <- length(temp$default_time)
  Percent[i] <- round(Frequency[i]/nrow(mortgage), 4)*100
  if (i == 1){
    Cumulative.Frequency[i] <- Frequency[i]
    Cumulative.Percent[i] <- Percent[i]
  } else {
    Cumulative.Frequency[i] <- Cumulative.Frequency[i-1] + Frequency[i]
    Cumulative.Percent[i] <- Cumulative.Percent[i-1] + Percent[i]
  }
}
Store the results in a data frame and show it.
results <- cbind.data.frame(default.indicator, Frequency, Percent,
Cumulative.Frequency, Cumulative.Percent)
colnames(results) <- c("default_time", "Frequency", "Percent",
"Cumulative Frequency", "Cumulative Percent")
print(results)
##   default_time Frequency Percent Cumulative Frequency Cumulative Percent
## 1            0    607331   97.56               607331               97.56
## 2            1     15158    2.44               622489              100.00
Exhibit 3.1
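The same table can also be produced more compactly with the built-in functions table() and cumsum(); a sketch:

freq <- table(mortgage$default_time)     # frequencies of 0 and 1
pct <- round(100 * freq / sum(freq), 2)  # percentages
cbind(Frequency = freq, Percent = pct,
      Cumulative.Frequency = cumsum(freq),
      Cumulative.Percent = cumsum(pct))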
Plot the histogram of FICO_orig_time with the function hist(), then add a frame to the histogram with the function box().
hist(mortgage$FICO_orig_time, freq = FALSE, breaks = 100,
     main = "Distribution of FICO_orig_time", xlab = "FICO_orig_time")
box()
Exhibit 3.2a

Plot the cumulative distribution function for FICO_orig_time with the function plot.ecdf().
plot.ecdf(mortgage$FICO_orig_time, ylab = "Cumulative Percent",
          xlab = "FICO_orig_time",
          main = "Cumulative Distribution Function for FICO_orig_time",
          pch = ".")

Exhibit 3.2b

Similar to above, plot the histogram and cumulative distribution function for LTV_orig_time.
hist(mortgage$LTV_orig_time, freq = FALSE, breaks = 100,
     main = "Distribution of LTV_orig_time", xlab = "LTV_orig_time")
box()
Exhibit 3.2c
plot.ecdf(mortgage$LTV_orig_time, ylab = "Cumulative Percent",
          xlab = "LTV_orig_time",
          main = "Cumulative Distribution Function for LTV_orig_time",
          verticals = TRUE, pch = ".")

Exhibit 3.2d

Location Measures
Next, we compute location measures (mean, median and mode) for the three variables default, FICO and LTV, as well as some percentiles (quantiles), and create Q-Q plots. R does not have a built-in function for the mode; the following self-defined function is commonly used to find it.
get.mode <- function(x){
  unique.x <- unique(x)
  unique.x[which.max(tabulate(match(x, unique.x)))]
}
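For example, get.mode(c(1, 2, 2, 3)) returns 2; in the case of a tie, the value that appears first in the data is returned.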
Create a function to calculate the required descriptive statistics, including count, mean, median, mode, 1% quantile and 99% quantile, and combine these statistics into one vector.
proc.means <- function(x){
  N <- length(x)
  Mean <- mean(x, na.rm = TRUE)
  Median <- median(x, na.rm = TRUE)
  Mode <- get.mode(x)
  Pctl_1st <- quantile(x, 0.01)
  Pctl_99th <- quantile(x, 0.99)
  proc.means.results <- as.vector(round(cbind(N, Mean, Median, Mode,
                                              Pctl_1st, Pctl_99th), 4))
}
Generate a vector var.names to store the names of requested variables.
var.names <- c("default_time", "FICO_orig_time", "LTV_orig_time")
Generate an empty data frame. Then use var.names, the names of the variables of interest, to extract each variable from the mortgage data set and calculate the required statistics with the self-defined function proc.means().
loc.measures <- as.data.frame(matrix(NA, nrow = 6, ncol = 3))
for (i in 1:3){
  loc.measures[, i] <- lapply(mortgage[var.names[i]], proc.means)
}
Transpose the data frame, label its rows and columns, and present the results.
loc.measures <- as.data.frame(t(loc.measures), row.names = var.names)
colnames(loc.measures) <- c("N", "Mean", "Median", "Mode", "1st Pctl",
"99th Pctl")
print(loc.measures)
## N Mean Median Mode 1st Pctl 99th Pctl
## default_time 622489 0.0244 0 0 0.0 1
## FICO_orig_time 622489 673.6169 678 660 506.0 801
## LTV_orig_time 622489 78.9755 80 80 52.2 100
Exhibit 3.3
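As a design note, the loop and transpose can be condensed into a single sapply() call, which applies proc.means() to each selected column and binds the results; a sketch (loc.measures2 is a hypothetical name):

loc.measures2 <- t(sapply(mortgage[var.names], proc.means))
colnames(loc.measures2) <- c("N", "Mean", "Median", "Mode",
                             "1st Pctl", "99th Pctl")
print(loc.measures2)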
Generate a Q-Q plot for FICO_orig_time with the function qqnorm() and add a theoretical line with the function qqline().
qqnorm(mortgage$FICO_orig_time, xlim = c(-6, 6), ylim = c(200, 1200),
       ylab = "FICO_orig_time", xlab = "Normal Quantiles",
       main = "Q-Q Plot for FICO_orig_time")
qqline(mortgage$FICO_orig_time)

Exhibit 3.4a

Generate a Q-Q plot for LTV_orig_time with the function qqnorm() and add a theoretical line with the function qqline().
qqnorm(mortgage$LTV_orig_time, xlim = c(-6, 6), ylim = c(0, 250),
       ylab = "LTV_orig_time", xlab = "Normal Quantiles",
       main = "Q-Q Plot for LTV_orig_time")
qqline(mortgage$LTV_orig_time)
Exhibit 3.4b

Next, dispersion measures, skewness and kurtosis are computed.

Similar to Exhibit 3.3, firstly define a function to calculate the required descriptive statistics.
proc.means.ext <- function(x){
  N <- length(x)
  Minimum <- min(x, na.rm = TRUE)
  Maximum <- max(x, na.rm = TRUE)
  Range <- range(x)[2] - range(x)[1]
  Quartile.Range <- quantile(x, 0.75) - quantile(x, 0.25)
  Variance <- var(x, na.rm = TRUE)
  Std.Dev <- sqrt(Variance)
  Coeff.Variation <- (Std.Dev/mean(x, na.rm = TRUE))*100
  proc.means.ext.results <- as.vector(round(cbind(N, Minimum, Maximum, Range,
                                                  Quartile.Range, Variance,
                                                  Std.Dev, Coeff.Variation), 4))
}

Then, use a vector of variable names to extract these variables from the mortgage data set and calculate the statistics by using the self-defined function.
var.names <- c("default_time", "FICO_orig_time", "LTV_orig_time")
disp.measures <- as.data.frame(matrix(NA, nrow = 8, ncol = 3))
for (i in 1:3){
  disp.measures[, i] <- lapply(mortgage[var.names[i]], proc.means.ext)
}

Finally, combine these calculated statistics into a data frame, adjust its layout and present it.
disp.measures <- as.data.frame(t(disp.measures), row.names = var.names)
colnames(disp.measures) <- c("N", "Minimum", "Maximum", "Range",
"Quartile Range", "Variance", "Std. Dev.", "Coeff of Variation")
print(disp.measures)
## N Minimum Maximum Range Quartile Range Variance
## default_time 622489 0.0 1.0 1.0 0 0.0238
## FICO_orig_time 622489 400.0 840.0 440.0 103 5144.4122
## LTV_orig_time 622489 50.1 218.5 168.4 5 102.5572
## Std. Dev. Coeff of Variation
## default_time 0.1541 632.9831
## FICO_orig_time 71.7246 10.6477
## LTV_orig_time 10.1271 12.8230
Exhibit 3.5

Similar to Exhibit 3.3, first define a function to calculate the required descriptive statistics. The functions skewness() and kurtosis() from the library moments are used to calculate skewness and Pearson's measure of kurtosis; subtracting 3 from Pearson's kurtosis yields the excess kurtosis.
proc.means.skewKurt <- function(x){
  N <- length(x)
  Skewness <- skewness(x, na.rm = TRUE)
  Kurtosis <- kurtosis(x, na.rm = TRUE) - 3  # excess kurtosis
  proc.means.skewKurt.results <- as.vector(round(cbind(N, Skewness,
                                                       Kurtosis), 4))
}
Then, use a vector of variable names to extract these variables from the mortgage data set and calculate
the statistics by using the self-defined function.
var.names <- c("default_time", "FICO_orig_time", "LTV_orig_time")
skewKurt.measures <- as.data.frame(matrix(NA, nrow = 3, ncol = 3))
for (i in 1:3){
  skewKurt.measures[, i] <- lapply(mortgage[var.names[i]], proc.means.skewKurt)
}
Finally, combine these calculated statistics into a data frame, adjust its layout and present it.
skewKurt.measures <- as.data.frame(t(skewKurt.measures), row.names = var.names)
colnames(skewKurt.measures) <- c("N", "Skewness", "Kurtosis")
print(skewKurt.measures)
## N Skewness Kurtosis
## default_time 622489 6.1718 36.0917
## FICO_orig_time 622489 -0.3213 -0.4684
## LTV_orig_time 622489 -0.1964 1.4364
Exhibit 3.6

Two-Dimensional Analysis

Joint Empirical Distributions


Having explored the empirical data one dimension at a time, we may also be interested in interrelations between variables. We therefore first create two-dimensional (or two-way) frequency tables, e.g., for default and FICO classes.
Create a new vector FICO_orig_time_factor. Based on the value of FICO_orig_time, assign values 0
to 4 to FICO_orig_time_factor.
FICO_orig_time_factor <- mortgage$FICO_orig_time
FICO_orig_time_factor[FICO_orig_time_factor < quantile(mortgage$FICO_orig_time, 0.2)] <- 0
FICO_orig_time_factor[(FICO_orig_time_factor >= quantile(mortgage$FICO_orig_time, 0.2)) &
                        (FICO_orig_time_factor < quantile(mortgage$FICO_orig_time, 0.4))] <- 1
FICO_orig_time_factor[(FICO_orig_time_factor >= quantile(mortgage$FICO_orig_time, 0.4)) &
                        (FICO_orig_time_factor < quantile(mortgage$FICO_orig_time, 0.6))] <- 2
FICO_orig_time_factor[(FICO_orig_time_factor >= quantile(mortgage$FICO_orig_time, 0.6)) &
                        (FICO_orig_time_factor < quantile(mortgage$FICO_orig_time, 0.8))] <- 3
FICO_orig_time_factor[FICO_orig_time_factor >= quantile(mortgage$FICO_orig_time, 0.8)] <- 4
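An equivalent, more compact construction uses cut() with quantile breakpoints (a sketch, assuming the quintile breakpoints are distinct; FICO_factor2 is a hypothetical name). Note it returns a factor with levels 0 to 4 rather than a numeric vector:

breaks <- quantile(mortgage$FICO_orig_time, probs = seq(0, 1, by = 0.2))
FICO_factor2 <- cut(mortgage$FICO_orig_time, breaks = breaks, labels = 0:4,
                    right = FALSE, include.lowest = TRUE)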

Generate the two-dimensional contingency table with the function CrossTable() from the library gmodels.
CrossTable(mortgage$default_time, FICO_orig_time_factor, prop.t=TRUE,
prop.r=TRUE, prop.c=TRUE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 622489
##
##
## | FICO_orig_time_factor
## mortgage$default_time | 0 | 1 | 2 | 3 | 4 | Row Total |
## ----------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 0 | 119046 | 118890 | 124047 | 121876 | 123472 | 607331 |
## | 14.558 | 5.383 | 0.229 | 3.387 | 22.468 | |
## | 0.196 | 0.196 | 0.204 | 0.201 | 0.203 | 0.976 |
## | 0.965 | 0.969 | 0.974 | 0.981 | 0.989 | |
## | 0.191 | 0.191 | 0.199 | 0.196 | 0.198 | |
## ----------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 1 | 4328 | 3790 | 3269 | 2385 | 1386 | 15158 |
## | 583.295 | 215.667 | 9.188 | 135.721 | 900.201 | |
## | 0.286 | 0.250 | 0.216 | 0.157 | 0.091 | 0.024 |
## | 0.035 | 0.031 | 0.026 | 0.019 | 0.011 | |
## | 0.007 | 0.006 | 0.005 | 0.004 | 0.002 | |
## ----------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## Column Total | 123374 | 122680 | 127316 | 124261 | 124858 | 622489 |
## | 0.198 | 0.197 | 0.205 | 0.200 | 0.201 | |
## ----------------------|-----------|-----------|-----------|-----------|-----------|-----------|
##
##

Exhibit 3.7
Another way of inferring the relation between the two variables (without grouping FICO first) is to look at a box plot.
Generate a box plot of FICO_orig_time by the values of default_time with the function boxplot(). Then compute the means of FICO_orig_time by default_time and add them to the plot as points with the function points().
boxplot(FICO_orig_time ~ default_time, data = mortgage, range = 0,
        xlab = "default_time", ylab = "FICO_orig_time",
        main = "Distribution of FICO_orig_time by default_time")
means <- tapply(mortgage$FICO_orig_time, mortgage$default_time, mean)
points(means, pch = 18)

Exhibit 3.8

Similar to Exhibit 3.8, generate a boxplot for LTV_orig_time.


boxplot(LTV_orig_time ~ default_time, data = mortgage, range = 0,
        xlab = "default_time", ylab = "LTV_orig_time",
        main = "Distribution of LTV_orig_time by default_time")
means <- tapply(mortgage$LTV_orig_time, mortgage$default_time, mean)
points(means, col = "blue", pch = 18)

Exhibit 3.9

Correlation Measures

We now compute measures for association and correlation, namely the chi-square statistic, the phi coefficient, the contingency coefficient and Cramer's V.

Create a simple cross table of default_time and FICO_orig_time_factor with the function xtabs().
tab <- xtabs( ~ FICO_orig_time_factor + mortgage$default_time)

Use the function assocstats() from the library vcd to compute the chi-square based measures.
assocstats(tab)
## X^2 df P(> X^2)
## Likelihood Ratio 2043.9 4 0
## Pearson 1890.1 4 0
##
## Phi-Coefficient : NA
## Contingency Coeff.: 0.055
## Cramer's V : 0.055
Exhibit 3.10
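For reference, the chi-square based measures reported by assocstats() can be cross-checked by hand from their standard definitions (a sketch):

chi2 <- as.numeric(chisq.test(tab)$statistic)  # Pearson chi-square statistic
n <- sum(tab)                                  # number of observations
k <- min(dim(tab))                             # smaller table dimension
phi <- sqrt(chi2 / n)                          # phi (only meaningful for 2x2 tables)
contingency <- sqrt(chi2 / (chi2 + n))         # contingency coefficient
cramers.v <- sqrt(chi2 / (n * (k - 1)))        # Cramer's V
round(c(Contingency = contingency, Cramers.V = cramers.v), 3)  # both approx. 0.055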

Draw a random sample, setting a seed value so that the random draw can be identified and the same experiment repeated, of 1% of the observations without replacement from FICO_orig_time and from LTV_orig_time.
set.seed(12345)
smpl.FICO <- sample(mortgage$FICO_orig_time, size = 0.01*nrow(mortgage),
                    replace = FALSE)
smpl.LTV <- sample(mortgage$LTV_orig_time, size = 0.01*nrow(mortgage),
                   replace = FALSE)
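Note that these two sample() calls draw the FICO and LTV values independently, so the pairing of values from the same loan is not preserved. If matched pairs are desired, one can instead sample row indices once and subset both columns (a sketch; idx and smpl.paired are hypothetical names):

set.seed(12345)
idx <- sample(nrow(mortgage), size = floor(0.01 * nrow(mortgage)),
              replace = FALSE)                         # one draw of row indices
smpl.paired <- mortgage[idx, c("FICO_orig_time", "LTV_orig_time")]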

We can also compute the Pearson and Spearman correlation coefficients as well as Kendall's tau-b, and produce a scatter plot.

Compute Pearson's correlation and perform a test with the function cor.test().


cor.test(smpl.FICO, smpl.LTV, method = "pearson")
##
## Pearson's product-moment correlation
##
## data: smpl.FICO and smpl.LTV
## t = -1.1771, df = 6222, p-value = 0.2392
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.03975041 0.00992735
## sample estimates:
## cor
## -0.01492074
Exhibit 3.11a

Compute Spearman's correlation and perform a test.


cor.test(smpl.FICO, smpl.LTV, method = "spearman", exact = FALSE)
##
## Spearman's rank correlation rho
##
## data: smpl.FICO and smpl.LTV
## S = 4.0792e+10, p-value = 0.233

## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.01511976
Exhibit 3.11b

Compute Kendall's tau-b correlation and perform a test.


cor.test(smpl.FICO, smpl.LTV, method = "kendall")
##
## Kendall's rank correlation tau
##
## data: smpl.FICO and smpl.LTV
## z = -1.1883, p-value = 0.2347
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
## tau
## -0.01067879
Exhibit 3.11c

Combine the two samples into one data set and generate a scatter plot from it. Then add ellipses with the function ellipse() from the library mixtools, and finally add a legend.
smpl.data <- cbind(smpl.FICO, smpl.LTV)
plot(smpl.data, xlab = "FICO_orig_time", ylab = "LTV_orig_time",
     main = "Scatter Plot")
ellipse(mu = colMeans(smpl.data), sigma = cov(smpl.data), alpha = 0.20,
        npoints = 250, lwd = 2)
ellipse(mu = colMeans(smpl.data), sigma = cov(smpl.data), alpha = 0.30,
        npoints = 250, lty = 2, lwd = 2)
legend(x = "topleft", y.intersp = 0.80, cex = 0.80,
       title = "Prediction Ellipses", legend = c("80%", "70%"),
       bty = "n", lty = c(1, 2), lwd = 1)
Exhibit 3.12

Highlights of Inductive Statistics

Confidence Intervals
Once parameters are estimated, the estimates will not match the true (i.e., unknown data-generating) parameters of the population exactly but instead deviate randomly from them. Hence, we may want to compute confidence intervals and conduct hypothesis tests.
Generate a self-defined function that computes confidence limits of the form mean ± z × (std. deviation)/√n, assuming a normal distribution; note that using qnorm(0.99) yields a 98% two-sided confidence interval.
proc.univariate <- function(x){
  n <- length(x)
  Mean <- mean(x, na.rm = TRUE)
  Std.Deviation <- sqrt(var(x, na.rm = TRUE))
  Variance <- var(x, na.rm = TRUE)
  Lower.Confidence.Limit <- Mean - qnorm(0.99)*Std.Deviation/sqrt(n)
  Upper.Confidence.Limit <- Mean + qnorm(0.99)*Std.Deviation/sqrt(n)
  results <- matrix(round(c(Mean, Std.Deviation, Variance,
                            Lower.Confidence.Limit,
                            Upper.Confidence.Limit), 5),
                    nrow = 1, ncol = 5)
  colnames(results) <- c("Mean", "Std. Deviation", "Variance",
                         "Lower Confidence Limit", "Upper Confidence Limit")
  print(results)
}
Apply the above self-defined function to LTV_orig_time.
proc.univariate(mortgage$LTV_orig_time)
## Mean Std. Deviation Variance Lower Confidence Limit
## [1,] 78.97546 10.12705 102.5572 78.9456
## Upper Confidence Limit
## [1,] 79.00532
Exhibit 3.13
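A quick sanity check of the reported limits, plugging the printed mean and standard deviation into the interval formula:

# mean +/- qnorm(0.99) * sd / sqrt(n), with n = 622489:
78.97546 - qnorm(0.99) * 10.12705 / sqrt(622489)  # approx. 78.9456
78.97546 + qnorm(0.99) * 10.12705 / sqrt(622489)  # approx. 79.0053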

Hypothesis Testing
Perform a one-sample t-test with the function t.test().
t.test(mortgage$LTV_orig_time, mu = 60, alternative = "two.sided")
##
## One Sample t-test
##
## data: mortgage$LTV_orig_time
## t = 1478.3, df = 622490, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 60
## 95 percent confidence interval:
## 78.95030 79.00062
## sample estimates:
## mean of x
## 78.97546
Exhibit 3.14
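The reported t statistic can be reproduced by hand from t = (mean - mu0)/(s/sqrt(n)), using the printed summary statistics:

# (78.97546 - 60) / (10.12705 / sqrt(622489)) is approx. 1478.3,
# matching the value reported by t.test().
(78.97546 - 60) / (10.12705 / sqrt(622489))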
