
Statistical Computing with R:

Masters in Data Sciences 503 (S29)


Third Batch, SMS, TU, 2024
Shital Bhandary
Associate Professor
Statistics/Bio-statistics, Demography and Public Health Informatics
Patan Academy of Health Sciences, Lalitpur, Nepal
Faculty, Data Analysis and Decision Modeling, MBA, Pokhara University, Nepal
Faculty, FAIMER Fellowship in Health Professions Education, India/USA.
Review Preview:
• Monte Carlo simulations
• Randomness
• Random deviates
• Resampling
• Use of Monte Carlo methods in Machine Learning
• Class imbalance problem
  • Statistical approach
  • Data science approach
• Missing data
• Supervised learning
• Unsupervised learning
Monte Carlo Simulations:
https://bstaton1.github.io/au-r-workshop/ch4.html
• Simulation modeling is one of the primary reasons to move away from spreadsheet-type programs (like Microsoft Excel) and into a program like R.
• R allows us to replicate the same (possibly complex and detailed) calculations over and over with different random values.
• We can then summarize and plot the results of these replicated calculations all within the same program.
• Analyses of this type are called Monte Carlo methods: they randomly sample from a set of quantities for the purpose of generating and summarizing a distribution of some statistic related to the sampled quantities.
Randomness:
• A critical part of simulation modeling is the use of random processes.
• A random process is one that generates a different outcome according to some rules each time it is executed.
• Random processes are tightly linked to the concept of uncertainty: you are unsure about the outcome the next time the process is executed.
• There are two basic ways to introduce randomness in R:
  • Random deviates
  • Resampling
Random deviates:
• At the end of each year, each individual alive at the start can either live or die. There are two outcomes here, and suppose each individual has an 80% chance of surviving.
• The number of individuals that survive is the result of a binomial random process in which there were n individuals alive at the start of this year and p is the probability that any one individual survives to the next year.
• We can execute a binomial random process with p = 0.8 and n = 100 like this in R:

rbinom(n = 1, size = 100, prob = 0.8)

• I got:

[1] 83

• But you will almost certainly get a different number than this one!
We can also plot it with a bit of tweaking:
# Histogram of 1000 replicated binomial draws
survivors = rbinom(1000, 100, 0.8)
hist(survivors, col = "skyblue")
We could also use other processes, like the lognormal:
• Another random process is the lognormal process.
• It generates random numbers such that the log of the values is normally distributed with mean equal to logmean and standard deviation equal to logsd.

hist(rlnorm(1000, 0, 0.1), col = "skyblue")
Need for sampling:
https://machinelearningmastery.com/monte-carlo-sampling-for-probability/

• There are many problems in probability, and more broadly in machine learning, where we cannot calculate an analytical solution directly.
• The class imbalance problem is one such situation in Machine Learning!
• In fact, there may be an argument that exact inference is intractable for most practical probabilistic models.
• The desired calculation is typically a sum over a discrete distribution or an integral over a continuous distribution, and is intractable to calculate.
• The calculation may be intractable for many reasons, such as the large number of random variables, the stochastic nature of the domain, noise in the observations, the lack of observations, and more.
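A hedged aside (not from the slides): a minimal sketch of what Monte Carlo sampling buys us, approximating an integral with no elementary closed form by a sample average over random draws.

# A minimal sketch: Monte Carlo integration of exp(-x^2) over [0, 1].
# The integral has no elementary closed form; we approximate it by
# averaging the integrand at uniform random draws.
set.seed(1234)
x <- runif(100000)    # 100,000 draws from Uniform(0, 1)
mean(exp(-x^2))       # should be close to the true value, ~0.7468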
Resampling:
• Using random deviates works great for creating new random numbers, but what if we already have a set of numbers that we wish to introduce randomness to?
• For this, we can use resampling techniques.
• In R, the sample() function is used to sample size elements from the vector x.

# Resampling of 1 to 10:
sample(x = 1:10, size = 5)

# Sample with replacement
sample(x = c("a", "b", "c"), size = 10, replace = T)

# Sample with set probabilities
sample(x = c("live", "die"), size = 10, replace = T, prob = c(0.8, 0.2))
We have used it:
• roll() function defining the roll of a fair die twice
• Training and testing set definition, cross-validation
A sketch of both uses follows.
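# A minimal sketch (assumed reconstruction, not the exact class code):
# one possible roll() giving two rolls of a fair die
roll <- function() sample(x = 1:6, size = 2, replace = TRUE)
roll()

# A common 70/30 train/test partition via resampling
# (iris is a stand-in dataset here)
ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.7, 0.3))
train <- iris[ind == 1, ]
test <- iris[ind == 2, ]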
Reproducing randomness:
• For reproducibility purposes, we may wish to get the same exact random numbers each time we run our script.
• To do this, we need to set the random seed, which is the starting point of the random number generator our computer uses.

# Example:
set.seed(1234)
rnorm(1)
[1] -1.207066

# Try again without resetting the seed
rnorm(1)
[1] 0.2774292
Replication:
• To use Monte Carlo methods, we need to be able to replicate some random process many times.
• There are two main ways this is commonly done: either with replicate() or with for() loops.
• The replicate() function executes some expression many times and returns the output from each execution.
• Say we have a vector x, which represents 30 observations of animal length (mm):

x = rnorm(30, 500, 30)
Replication in R:
• We wish to build the sampling distribution of the mean length “by hand”.
• We can sample randomly from x, calculate the mean, then repeat this process many times.
• This can be done in R with:

# Code after x is defined:
means = replicate(n = 1000, expr = {
  x_i = sample(x, length(x), replace = T)
  mean(x_i)
})
Mean and SE same in x and 1000 replicated means of x? Unbiased estimate of x!
• If we take mean(means) and sd(means), they should be very similar to mean(x) and se(x).
• Create the se() function and prove this using R!

se = function(x) sd(x)/sqrt(length(x))

# Check means first
mean(means); mean(x)
[1] 492.5897
[1] 492.6636

# Standard error of the mean
sd(means); se(x)
[1] 5.130683
[1] 5.023584

Monte Carlo simulation is based on the Law of Large Numbers. It can also be used to demonstrate Regression to the Mean and the Central Limit Theorem. More on the Law of Large Numbers here: https://machinelearningmastery.com/a-gentle-introduction-to-the-law-of-large-numbers-in-machine-learning/
Replication with “for” loop:
• In programming, a loop is a command that does something over and over until it reaches some point that you specify.
• R has a few types of loops: repeat(), while(), and for(), to name a few.
• for() loops are among the most common in simulation modeling.
• A for() loop repeats some action for however many times you tell it, once for each value in some vector.

# For loop syntax:
for (var in seq) {
  expression(var)
}
Examples:
# 1
for (i in 1:5) {
  print(i^2)
}

# Output 1
[1] 1
[1] 4
[1] 9
[1] 16
[1] 25

# 2
results = numeric(5)
for (i in 1:5) {
  results[i] = i^2
}
results

# Output 2
[1] 1 4 9 16 25
More:
nt = 100      # number of years
N = NULL      # container for (fish) abundance
N[1] = 1000   # first end-of-year abundance

# Loop for replication
for (t in 2:nt) {
  # N this year is N last year * growth * randomness * fraction that survives harvest
  N[t] = (N[t-1] * 1.1 * rlnorm(1, 0, 0.1)) * (1 - 0.08)
}
Let’s plot it:
plot(N, type = "l", pch = 15, xlab = "Year", ylab = "Abundance")
Function writing for Monte Carlo simulation:
# In Monte Carlo analyses, it is often useful to wrap code into functions.
• This allows for easy replication and setting adjustment (e.g., if you wanted to compare the growth trajectories of two populations with differing growth rates).

# Let’s use five parameters to do so now:
• nt: the number of years
• grow: the population growth rate
• sd_grow: the amount of annual variability in the growth rate
• U: the annual exploitation rate
• plot: whether you wish to have a plot created

pop_sim = function(nt, grow, sd_grow, U, plot = F) {
  N = NULL
  N[1] = 1000
  for (t in 2:nt) {
    N[t] = (N[t-1] * grow * rlnorm(1, 0, sd_grow)) * (1 - U)
  }
  if (plot) {
    plot(N, type = "l", pch = 15, xlab = "Year", ylab = "Abundance")
  }
  N
}
Run: pop_sim(100, 1.1, 0.1, 0.08, T) to get (your numbers will differ):

[1] 1000.0000 982.3888 802.9221 930.8944 942.8799 1147.2425 1343.0696
[8] 1547.2829 1679.2181 1514.6867 1513.1179 1560.9256 1736.7056 2135.8081
[15] 2106.6725 1775.4615 1665.7489 1623.7020 1589.0171 1889.1755 2029.1288
[22] 2170.6199 2058.1873 2038.3532 2347.5983 2290.1806 2671.5877 2598.8134
[29] 2738.5065 2669.0003 2617.4264 2859.6799 2764.8132 2694.8130 2388.6001
[36] 2057.0187 2041.2244 2351.3923 2395.7745 2151.1563 2509.3455 2943.5983
[43] 2599.5925 2706.5242 2710.5283 2587.0943 2696.7068 2573.2741 2267.4747
[50] 2676.7501 2638.4771 2306.5914 2464.6563 2126.1586 2090.3945 2131.9059
[57] 2676.4949 2435.6190 2128.2608 2225.5276 2179.7877 2706.6805 2989.4001
[64] 3277.0129 3609.7139 3843.7520 4117.0917 4546.7481 4706.4806 5077.8774
[71] 6248.7845 5797.8300 5824.2902 5400.6019 4948.3756 4747.5507 5046.0663
[78] 5894.9432 6207.3198 6030.4074 6706.5260 6884.1739 6946.1890 7204.8305
[85] 7895.0993 7563.0521 8655.1318 8783.0285 7210.3333 8300.1920 10254.6761
[92] 9983.9319 10467.9362 9487.7283 9186.1128 10096.7386 8892.1724 11403.9986
[99] 11699.4072 12772.6927
Replicating the simulation:
# Replicate the simulation 1000 times
out = replicate(n = 1000, expr = pop_sim(100, 1.1, 0.1, 0.08, F))

# out is a large matrix (100,000 elements, 800.2 kB): 100 years × 1000 replicates

# View this matrix in RStudio:
View(out)
Summarization of simulation:
• After replicating a calculation many times, we will need to summarize the results.
• We must show the central tendency and variability.
• We can also show frequencies and cross-tabulations.

# Central tendency: the mean across the 1000 replicates for each year
N_mean = apply(out, 1, mean)
N_mean[1:10]

# Variability: the standard deviation
N_sd = apply(out, 1, sd)
N_sd[1:10]
Summarization of simulation:
# Frequencies 1
out10 = ifelse(out[10, ] < 1000, "less10", "greater10")
table(out10)

# Frequencies 2
out20 = ifelse(out[20, ] < 1100, "less20", "greater20")
table(out20)

# Cross-tabulations
table(out10, out20)

# Cross-tabulations with proportions
round(table(out10, out20)/1000, 2)
Simulation Based Learning: Example 1
mu = 500; sig = 30
random = rnorm(100, mu, sig)
p = seq(0.01, 0.99, 0.01)
random_q = quantile(random, p)
normal_q = qnorm(p, mu, sig)

# Compare empirical and theoretical quantiles (a Q-Q plot)
plot(normal_q ~ random_q); abline(c(0, 1))
Simulation Based Learning: Example 2
q = seq(400, 600, 10)
random_cdf = ecdf(random)
random_p = random_cdf(q)
normal_p = pnorm(q, mu, sig)

# Compare empirical and theoretical CDFs
plot(normal_p ~ q, type = "l", col = "blue")
points(random_p ~ q, col = "red")
Use in Machine learning:
https://machinelearningmastery.com/monte-carlo-sampling-for-probability/

• In machine learning, Monte Carlo methods provide the basis for resampling techniques like the bootstrap method for estimating a quantity, such as the accuracy of a model on a limited dataset.
• Random sampling of model hyperparameters when tuning a model is a Monte Carlo method.
• Ensemble models used to overcome challenges such as the limited size and noise in a small data sample and the stochastic variance in a learning algorithm are all examples of Monte Carlo methods.
• We have seen its use in:
  • Resampling algorithms
  • Random hyperparameter tuning (caret package)
  • Ensemble learning algorithms
A bootstrap sketch follows this list.
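# A minimal sketch (assumed illustration): bootstrap the accuracy of a
# classifier from its per-observation correctness on a limited test set.
set.seed(1234)
correct <- rbinom(50, 1, 0.75)   # stand-in: 1 = correct prediction, 0 = wrong
boot_acc <- replicate(2000, mean(sample(correct, length(correct), replace = TRUE)))
mean(boot_acc)                       # bootstrap estimate of accuracy
quantile(boot_acc, c(0.025, 0.975))  # 95% percentile interval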
Question/queries so far?
Class imbalance problem: Binary dep. var. (y)
• It happens in classification problems.
• When we have a categorical binary dependent variable, the distribution of 1s and 0s may not be equal (it may be very skewed).
• When it is very skewed, it is known as “class imbalance”.
• We can deal with it using statistics or data science.
• Statistical approach: instead of binary logistic regression,
  • Use exact logistic regression
  • Use Poisson regression
  • Use zero-inflated Poisson regression
  • Use negative binomial regression
• Data science approach:
  • Generate new data using simulations, make the classes balanced and get accuracy measures
A hedged sketch of the Poisson-regression option follows.
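# A minimal sketch (assumed illustration) of one statistical approach:
# Poisson regression on a 0/1 outcome estimates relative risks rather
# than odds ratios. Uses the binary.csv data introduced on a later
# slide, with admit kept numeric.
data <- read.csv("binary.csv", header = TRUE)
fit_pois <- glm(admit ~ gre + gpa + rank, family = poisson, data = data)
exp(coef(fit_pois))   # coefficients on the relative-risk scale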
Class imbalance problem: Categorical “y”
• It happens in classification problems.
• When we have a categorical dependent variable, the distribution of 0, 1 and 2 may not be equal (it may be very skewed).
• When it is very skewed, it is known as “class imbalance”.
• We can deal with it using statistics or data science.
• Statistical approach: instead of multinomial or ordinal logistic regression,
  • Use exact multinomial/ordinal logistic regression
  • Use Poisson regression
  • Use zero-inflated Poisson regression
  • Use negative binomial regression
• Data science approach:
  • Generate new data using simulations, make the classes balanced and get accuracy measures
In statistics, we are more concerned with “Simpson’s Paradox” than with the class imbalance problem!
UCLA admission “paradox”: overall, few females were admitted, but more females were admitted when the same data were analyzed by department! The same can happen with all supervised models! A sketch with R’s built-in Berkeley admissions table follows.
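# A minimal sketch of this paradox using UCBAdmissions, the closely
# related UC Berkeley table shipped with R (assumed illustration; the
# UCLA data itself is not bundled with R).
data("UCBAdmissions")
# Overall admission rate by gender, aggregated over departments
overall <- apply(UCBAdmissions, c(1, 2), sum)
round(prop.table(overall, 2), 2)
# Admission rate by gender within each department: the gap shrinks or
# reverses once department is taken into account
round(prop.table(UCBAdmissions, c(2, 3))["Admitted", , ], 2)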
Example: binary.csv data
# Admission to UCLA
• Four variables in the data:
  • admit = admitted or not
  • gre = GRE score
  • gpa = GPA score
  • rank = rank of the institute where they got their GPA

# Class imbalance problem data
data <- read.csv("binary.csv", header = T)
str(data)
summary(data)

# Change admit to a factor variable
data$admit <- as.factor(data$admit)
summary(data)
Outputs:
# summary(data) before converting admit to a factor:
 admit            gre             gpa             rank
 Min.   :0.0000   Min.   :220.0   Min.   :2.260   Min.   :1.000
 1st Qu.:0.0000   1st Qu.:520.0   1st Qu.:3.130   1st Qu.:2.000
 Median :0.0000   Median :580.0   Median :3.395   Median :2.000
 Mean   :0.3175   Mean   :587.7   Mean   :3.390   Mean   :2.485
 3rd Qu.:1.0000   3rd Qu.:660.0   3rd Qu.:3.670   3rd Qu.:3.000
 Max.   :1.0000   Max.   :800.0   Max.   :4.000   Max.   :4.000

# summary(data) after converting admit to a factor:
 admit    gre             gpa             rank
 0:273    Min.   :220.0   Min.   :2.260   Min.   :1.000
 1:127    1st Qu.:520.0   1st Qu.:3.130   1st Qu.:2.000
          Median :580.0   Median :3.395   Median :2.000
          Mean   :587.7   Mean   :3.390   Mean   :2.485
          3rd Qu.:660.0   3rd Qu.:3.670   3rd Qu.:3.000
          Max.   :800.0   Max.   :4.000   Max.   :4.000

prop.table(table(data$admit))
     0      1
0.6825 0.3175

Class imbalance: the dependent variable “admit” has 273 (68.25%) cases in the 0 (not admitted) category and 127 (31.75%) in the 1 (admitted) category.

In statistics we deal with this using different methods, but in data science we deal with it by making these classes “balanced”.
Let’s predict without correcting imbalance:
# Data partition
#set.seed(1234)
ind <- sample(2, nrow(data), replace = T, prob = c(0.7, 0.3))
train <- data[ind == 1, ]
test <- data[ind == 2, ]

# Check the imbalance in the train data
table(train$admit)
  0   1
196  83
prop.table(table(train$admit))
       0        1
0.702509 0.297491
(Is this really imbalance?)
Let’s predict without correcting imbalance:
# Prediction model: random forest
library(randomForest)
rfm.train <- randomForest(admit ~ ., data = train)

# Model evaluation with test data using the caret package
library(caret)
confusionMatrix(predict(rfm.train, test), test$admit, positive = '1')

# Outputs
          Reference
Prediction  0  1
         0 73 32
         1  4 12

Accuracy : 0.7025 (misleading!)
95% CI : (0.6126, 0.7821)
Sensitivity : 0.27273 (not good for 1)
Specificity : 0.94805 (good for 0)

This is due to the “class imbalance” problem!
Let’s predict with correction: Oversampling
# Correcting class imbalance by oversampling, using the ROSE
# (Random Over-Sampling Examples) package
library(ROSE)

over.samp <- ovun.sample(admit ~ ., data = train, method = "over", N = 196*2)$data

# N = 196*2 is used here because the majority class had 196 cases in the
# imbalanced train data:
table(train$admit)
  0   1
196  83

# We will get equal class sizes now:
table(over.samp$admit)
  0   1
196 196

# Resampling of observed values of category 1 is used to get more 1s!
Let’s predict with correction: Oversampling
# Check the summary for changes in the other variables too!
summary(over.samp)

# Random forest model with the oversampled data
rfm.os <- randomForest(admit ~ ., data = over.samp)
confusionMatrix(predict(rfm.os, test), test$admit, positive = '1')

          Reference
Prediction  0  1
         0 59 22
         1 18 22

Accuracy : 0.6694
95% CI : (0.5781, 0.7522)
Sensitivity : 0.5000
Specificity : 0.7662

Sensitivity improved (good if we wanted to improve prediction for 1), but overall accuracy decreased!
What else can be done with the ROSE package?
• We can do undersampling and check the model accuracy, sensitivity and specificity again.
• We can do both, i.e., oversampling and undersampling together, and check the model accuracy, sensitivity and specificity again.
• We can create synthetic data, fit the model and predict with it to check the model accuracy, sensitivity, specificity, etc.
• While creating synthetic data, we must also pass a random seed to the function to get replicable results!
A hedged sketch of these options follows.

More on Synthetic Minority Oversampling (SMOTE) here:
https://www.youtube.com/watch?v=dkXB8HH_4-k
Missing values:
https://towardsdatascience.com/7-ways-to-handle-missing-values-in-machine-learning-1a6326adf79e

• Real-world data often have a lot of missing values. The cause of missing values can be data corruption or failure to record data.
• The handling of missing data is very important during the preprocessing of a dataset, as many machine learning algorithms do not support missing values.
• Visit the link to learn more about handling missing values.

The 7 ways to handle missing values in a dataset:
  • Deleting rows with missing values
  • Imputing missing values for continuous variables (mean, median, etc.)
  • Imputing missing values for categorical variables (predict the categories)
  • Other imputation methods
  • Using algorithms that support missing values
  • Prediction of missing values
  • Imputation using the deep learning library Datawig
Missing values checking and handling in R:
# Check missing values in R
colSums(is.na(df))
sum(is.na(df$column_name))

# Strategies
• List-wise deletion
• Pair-wise deletion
• Mean/mode/median imputation
• Generalized imputation
• Similar case imputation
• Prediction model
• KNN imputation

# List of R packages
• MICE
• Amelia
• missForest
• Hmisc
• mi
• etc.

https://medium.com/coinmonks/dealing-with-missing-data-using-r-3ae428da2d17
https://www.analyticsvidhya.com/blog/2016/03/tutorial-powerful-packages-imputing-missing-values/
MICE package:
• MICE (Multivariate Imputation via Chained Equations) is one of the packages commonly used by R users. Creating multiple imputations, as compared to a single imputation (such as the mean), takes care of uncertainty in missing values.
• MICE assumes that the missing data are Missing at Random (MAR), which means that the probability that a value is missing depends only on observed values and can be predicted using them.
• It imputes data on a variable-by-variable basis by specifying an imputation model per variable.
• The methods used by this package are:
  • PMM (Predictive Mean Matching) — for numeric variables
  • logreg (logistic regression) — for binary variables (2 levels)
  • polyreg (Bayesian polytomous regression) — for factor variables (>= 2 levels)
  • Proportional odds model — for ordered and censored variables (>= 2 levels)

More here: https://medium.com/coinmonks/dealing-with-missing-data-using-r-3ae428da2d17
Use of MICE with an example dataset is here: https://www.youtube.com/watch?v=An7nPLJ0fsg
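# A minimal sketch (assumed illustration) using the nhanes example data
# shipped with the mice package.
library(mice)
data(nhanes)                                  # 25 rows with missing values
imp <- mice(nhanes, m = 5, method = "pmm", seed = 1234)  # 5 imputations
fit <- with(imp, lm(bmi ~ age + chl))         # fit the model on each imputed set
summary(pool(fit))                            # pool results across imputations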
Question/queries?
• Next classes:
  • Communicating the results of data science projects
  • Defining projects in RStudio
    • Local file/folder
    • GitHub repository
  • R notebook
Thank you!
@shitalbhandary
