
2021/1/9 Kickstarter Success Prediction

Kickstarter Success Prediction


Tianhui Xu
09/27/2020
Can we predict what projects end up being successful on Kickstarter?

We have data from the Kickstarter (https://www.kickstarter.com/) company.

From Wikipedia: Kickstarter is an American public-benefit corporation based in Brooklyn, New York, that
maintains a global crowdfunding platform focused on creativity and merchandising. The company’s stated
mission is to “help bring creative projects to life”. As of May 2019, Kickstarter has received more than $4 billion in
pledges from 16.3 million backers to fund 445,000 projects, such as films, music, stage shows, comics,
journalism, video games, technology, publishing, and food-related projects.

The data was collected by Mickaël Mouillé (https://www.kaggle.com/kemical) and was last updated in 2018. Columns
are self-explanatory. Note that usd_pledged is the pledged column converted to US dollars (conversion done by
Kickstarter) and usd_pledged_real is the pledged column in real US dollars. Finally,
usd_goal_real is the goal column in real US dollars. You should use the real columns.

So what makes a project successful? Undoubtedly, there are many factors, but perhaps we could set up a
prediction problem here, similar to the one from the bonus part of the last assignment where we used GDP to
predict personnel contributions.

We have columns representing the number of backers, project length, the main category, and the real project goal
in USD for each project.

Let’s explore the relationship between those predictors and the dependent variable of interest — the success of a
project.

Instead of running a simple linear regression and calling it a day, let’s use cross-validation to make our prediction
a little more sophisticated.

Our general plan is the following:

1. Build the model on a training data set


2. Apply the model on a new test data set to make predictions based on the inferred model parameters.
3. Compute and track the prediction errors to check performance. Because the outcome here is binary
(success or failure), we use the misclassification rate between the observed and the predicted outcomes in the test set.
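The error metric in step 3 can be sketched as a one-line helper (the function name `misclass_rate` is ours, for illustration):

```r
# fraction of cases where the predicted label disagrees with the observed label
misclass_rate <- function(observed, predicted) {
  mean(observed != predicted)
}

# example: two wrong labels out of five -> 0.4
misclass_rate(c(0, 1, 1, 0, 1), c(0, 1, 0, 1, 1))
```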

STEP 1: Import & Clean the Data

# load the packages used throughout: lubridate for dates, boot for cv.glm(),
# arm for sim(), and Matching for the lalonde data set
library(lubridate)
library(boot)
library(arm)
library(Matching)

# import the dataset
ks_df <- read.csv("ks-projects-201801.csv")

ks_df[ks_df == ""] <- NA # set all the blanks to NAs
ks_df <- na.omit(ks_df)  # omit the NA values

# see the head
head(ks_df)

STEP 2: Codify outcome variable

file:///Users/tianhui/CS112/Assignment2/cs112_assignment_02_fall_2020的副本.html 1/17

# A project is successful if the real pledged value is larger than or equal to the goal value
# create a column of dummy variables to indicate the status
ks_df$success <- ifelse(ks_df$usd_pledged_real - ks_df$usd_goal_real >= 0, 1, 0)
head(ks_df)

STEP 3: Getting the project length variable

# extract the dates using the lubridate package
ks_df$deadline <- date(ks_df$deadline)
ks_df$launched <- date(ks_df$launched)
# create a new column to store the length of the project in days
ks_df$length <- as.numeric(ks_df$deadline - ks_df$launched)
# remove any project whose length is longer than 60 days
ks_df <- ks_df[!(ks_df$length > 60), ]

STEP 4: Splitting the data into a training and a testing set

# randomly select 80% of the data (a fixed seed, chosen arbitrarily, makes the split reproducible)
set.seed(112)
tr_idx <- sample(nrow(ks_df), as.integer(nrow(ks_df) * 0.80))
ks_tr <- ks_df[tr_idx, ]  # training set
ks_te <- ks_df[-tr_idx, ] # rest to testing set

STEP 5: Fitting a model

# convert main_category to a factor on the split data; converting ks_df here
# would not affect ks_tr and ks_te, which were created in the previous step
ks_tr$main_category <- as.factor(ks_tr$main_category)
ks_te$main_category <- as.factor(ks_te$main_category)

# fit a logistic regression model to the dataset


glm_ks <- glm(success ~ backers + length + main_category + usd_goal_real,
              data = ks_tr, family = binomial)
summary(glm_ks)


##
## Call:
## glm(formula = success ~ backers + length + main_category + usd_goal_real,
## family = binomial, data = ks_tr)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -8.4904 -0.5478 -0.0617 0.2219 8.4904
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.326e-01 2.333e-02 -9.967 <2e-16 ***
## backers 4.997e-02 2.340e-04 213.506 <2e-16 ***
## length -1.534e-02 5.106e-04 -30.052 <2e-16 ***
## main_categoryComics -4.692e-01 3.789e-02 -12.383 <2e-16 ***
## main_categoryCrafts -6.811e-01 3.882e-02 -17.543 <2e-16 ***
## main_categoryDance 7.520e-01 5.043e-02 14.911 <2e-16 ***
## main_categoryDesign -6.777e-01 3.022e-02 -22.421 <2e-16 ***
## main_categoryFashion -5.825e-01 2.984e-02 -19.519 <2e-16 ***
## main_categoryFilm & Video 2.517e-01 2.188e-02 11.501 <2e-16 ***
## main_categoryFood -3.790e-01 3.055e-02 -12.404 <2e-16 ***
## main_categoryGames -1.710e+00 3.241e-02 -52.772 <2e-16 ***
## main_categoryJournalism -6.697e-01 5.653e-02 -11.847 <2e-16 ***
## main_categoryMusic 1.899e-01 2.224e-02 8.536 <2e-16 ***
## main_categoryPhotography -3.911e-01 3.571e-02 -10.952 <2e-16 ***
## main_categoryPublishing -4.764e-01 2.459e-02 -19.372 <2e-16 ***
## main_categoryTechnology -5.930e-01 3.328e-02 -17.820 <2e-16 ***
## main_categoryTheater 7.762e-01 3.382e-02 22.950 <2e-16 ***
## usd_goal_real -2.257e-04 1.340e-06 -168.432 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 386741 on 295507 degrees of freedom
## Residual deviance: 182413 on 295490 degrees of freedom
## AIC: 182449
##
## Number of Fisher Scoring iterations: 13

STEP 6: Predictions
Use the model you’ve inferred from the previous step to predict the success outcomes in the test set.

# predict the probability of success for the test set using the fitted model
ks_p <- predict(glm_ks, ks_te, type = "response")
# convert the probabilities into class labels
ks.pred <- rep("No", nrow(ks_te))
ks.pred[ks_p > .5] <- "Yes"
ks_te$ks.pred <- ks.pred
# produce a confusion matrix
acc_table <- table(ks.pred, ks_te$success)
acc_table


##
## ks.pred 0 1
## No 45585 5169
## Yes 1758 21366

STEP 7: How well did it do?


Report the misclassification rate of the predictions for the training and the test sets.

# use the same process to predict for the training set
ks_p_tr <- predict(glm_ks, ks_tr, type = "response")
ks.pred_tr <- rep("No", nrow(ks_tr))
ks.pred_tr[ks_p_tr > .5] <- "Yes"
ks_tr$ks.pred_tr <- ks.pred_tr
acc_table_tr <- table(ks.pred_tr, ks_tr$success)
acc_table_tr

##
## ks.pred_tr 0 1
## No 181631 20866
## Yes 7004 86007

# report misclassification rates: off-diagonal counts over the total
misclass_tr <- (acc_table_tr[1,2] + acc_table_tr[2,1]) / sum(acc_table_tr)
sprintf("The misclassification rate of the predictions for the training set is: %f", misclass_tr)

## [1] "The misclassification rate of the predictions for the training set is: 0.094312"

misclass_te <- (acc_table[1,2] + acc_table[2,1]) / sum(acc_table)
sprintf("The misclassification rate of the predictions for the test set is: %f", misclass_te)

## [1] "The misclassification rate of the predictions for the test set is: 0.093763"

Step 8: LOOCV method


Apply the leave-one-out cross validation (LOOCV) method to the training set.


# sample 5% of the data set to apply LOOCV in order to reduce running time
ks_df_sam <- ks_df[sample(nrow(ks_df), as.integer(nrow(ks_df) * 0.05)), ]

# randomly select 80% of the sample
sam_tr_idx <- sample(nrow(ks_df_sam), as.integer(nrow(ks_df_sam) * 0.80))
ks_sam_tr <- ks_df_sam[sam_tr_idx, ]  # training set from 5% sample
ks_sam_te <- ks_df_sam[-sam_tr_idx, ] # rest to testing set from 5% sample

# fit a logistic model on the training data set
glm_ks_sam <- glm(success ~ backers + length + main_category + usd_goal_real,
                  data = ks_sam_tr, family = binomial)

# apply the LOOCV method to the training set; note that with its default cost
# function, cv.glm() reports the average squared error (the Brier score for a
# binary outcome), which we use here as the cross-validation error estimate
cv.err <- cv.glm(ks_sam_tr, glm_ks_sam)

# apply the LOOCV method to the test set
cv.err.test <- cv.glm(ks_sam_te, glm_ks_sam)

sprintf("The cross-validation error estimate for the training set is: raw: %f, adjusted: %f.",
        cv.err$delta[1], cv.err$delta[2])

## [1] "The cross-validation error estimate for the training set is: raw: 0.076014, adjusted: 0.076014."

sprintf("The cross-validation error estimate for the test set is: raw: %f, adjusted: %f.",
        cv.err.test$delta[1], cv.err.test$delta[2])

## [1] "The cross-validation error estimate for the test set is: raw: 0.362208, adjusted: 0.081246."

The adjusted error estimates for the training and test sets from LOOCV are quite similar. The raw cross-validation
estimate for the test set is much higher because the test set contains few observations, so each leave-one-out
model is fit on very little data and is likely underfit. And because each model is trained on nearly identical
data sets, the fitted models are highly correlated with one another, which further inflates the variance of the estimate.
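If we want cv.glm() to report a misclassification rate directly rather than its default squared-error cost, we can pass an explicit cost function (a sketch; the 0.5 threshold matches the one used in Step 6):

```r
# cost function: fraction of cases where the predicted probability pi falls on
# the wrong side of 0.5 relative to the observed 0/1 outcome r
cost <- function(r, pi = 0) mean(abs(r - pi) > 0.5)

# LOOCV misclassification rate on the 5% training sample
cv.err.mis <- cv.glm(ks_sam_tr, glm_ks_sam, cost = cost)
cv.err.mis$delta[1]
```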

Step 9: Explanations
Compare the misclassification rates from the simple method to the LOOCV method.

In this project, we applied cross-validation to the training and test sets separately and compared the resulting
error estimates. In real-world applications, however, cross-validation is typically used when we do not have
enough data points to hold out a separate test set: it lets us estimate the out-of-sample fit and select the
best model without dividing the data into training and test sets.
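A common middle ground between a single train/test split and LOOCV is K-fold cross-validation, which cv.glm() also supports through its K argument (a sketch; K = 10 is a conventional choice):

```r
# 10-fold cross-validation on the 5% training sample: much cheaper than LOOCV
# (10 model fits instead of one fit per observation), usually with a similar estimate
cv.err.k10 <- cv.glm(ks_sam_tr, glm_ks_sam, K = 10)
cv.err.k10$delta[1] # raw K-fold estimate of prediction error
```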

Linear Regression
STEP 1
Using the lalonde data set, run a linear regression that models re78 as a function of age, education,
re74, re75, hisp, and black.

data(lalonde)
# fit a linear regression model
lm1 <- lm(re78~age+educ+re74+re75+hisp+black, data=lalonde)
summary(lm1)

##
## Call:
## lm(formula = re78 ~ age + educ + re74 + re75 + hisp + black,
## data = lalonde)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9065 -4577 -1775 3186 55037
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.126e+03 2.434e+03 0.463 0.6437
## age 5.928e+01 4.409e+01 1.345 0.1794
## educ 4.293e+02 1.758e+02 2.442 0.0150 *
## re74 7.462e-02 7.699e-02 0.969 0.3330
## re75 6.676e-02 1.314e-01 0.508 0.6116
## hisp -2.125e+02 1.546e+03 -0.137 0.8907
## black -2.323e+03 1.162e+03 -1.999 0.0462 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6544 on 438 degrees of freedom
## Multiple R-squared: 0.03942, Adjusted R-squared: 0.02627
## F-statistic: 2.996 on 6 and 438 DF, p-value: 0.007035

STEP 2
Report coefficients and R-squared.

# report coefficients and R-squared


print("The coefficients estimated from the model are:")

## [1] "The coefficients estimated from the model are:"

print(coef(lm1))

## (Intercept) age educ re74 re75


## 1.126492e+03 5.928155e+01 4.293074e+02 7.461872e-02 6.676314e-02
## hisp black
## -2.125053e+02 -2.323282e+03

print("The reported R-squared is: 0.039")

## [1] "The reported R-squared is: 0.039"


# predict using the fitted model
predicted_re78 <- predict.lm(lm1, lalonde)

# hand-calculation of R-squared (mean_re78 avoids shadowing base::mean)
mean_re78 <- mean(lalonde$re78)
# SSR (regression sum of squares): how far the estimated regression line is
# from the horizontal 'no relationship' line, the sample mean
SSR <- sum((predicted_re78 - mean_re78) ^ 2)
# SSTO (total sum of squares): how much the data points vary around their mean
SSTO <- sum((lalonde$re78 - mean_re78) ^ 2)
# SSE (error sum of squares): how much the data points vary around the fitted line
SSE <- sum((lalonde$re78 - predicted_re78) ^ 2)
R_squared <- SSR / SSTO
sprintf("The calculated R-squared by hand is: %s, the same as the R-squared from the summary statistics above.", round(R_squared, 3))

## [1] "The calculated R-squared by hand is: 0.039, the same as the R-squared from the summary statistics above."
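As a sanity check on the decomposition, R-squared can equivalently be computed as 1 - SSE/SSTO, since SSTO = SSR + SSE for a least-squares fit with an intercept (a quick check using the quantities above):

```r
# the two formulations should agree up to floating-point error
all.equal(R_squared, 1 - SSE / SSTO)
all.equal(SSTO, SSR + SSE)
```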

STEP 3
Then, setting all the predictors at their means EXCEPT education, create a data visualization that shows the
95% confidence interval of the expected values of re78 as education varies from 3 to 16. Be sure to include
axes labels and figure titles.

set.seed(254)
# setting all predictors at their means
age <- mean(lalonde$age)
re74 <- mean(lalonde$re74)
re75 <- mean(lalonde$re75)
hisp <- mean(lalonde$hisp)
black <- mean(lalonde$black)

# create a function to calculate the dependent variable given coefficients and
# independent variables
calc_turnout <- function(coef, person) {
  turnout <- coef[1] +
    coef[2] * person[1] +
    coef[3] * person[2] +
    coef[4] * person[3] +
    coef[5] * person[4] +
    coef[6] * person[5] +
    coef[7] * person[6]
  return(turnout)
}


# use three nested loops to get the expected values of 're78'
storage.matrix_exp <- matrix(NA, nrow = 1000, ncol = 14)

for (j in c(1:1000)) {
  sim_exp <- sim(lm1, 1000)
  sim_coef_exp <- sim_exp@coef
  for (educ in c(3:16)) {
    person <- c(age, educ, re74, re75, hisp, black)
    store <- rep(0, 1000)
    for (i in c(1:1000)) {
      store[i] <- calc_turnout(sim_coef_exp[i, ], person)
    }
    storage.matrix_exp[j, educ - 2] <- mean(store)
  }
}

# 95% confidence interval of the expected values
conf.intervals_exp <- apply(storage.matrix_exp, 2, quantile, probs = c(0.025, 0.975))

plot(x = c(1:100), y = c(1:100), type = "n", xlim = c(3, 16), ylim = c(1, 10000),
     main = "Expected Earnings in 1978 by Years of Schooling (95% CI)",
     xlab = "Years of Schooling",
     ylab = "Real Earnings in 1978")

# draw each interval using the quantiles computed above (plotting rows 1 and 2
# of storage.matrix_exp would show two simulation runs, not the 95% interval)
for (educ in 3:16) {
  segments(
    x0 = educ,
    y0 = conf.intervals_exp[1, educ - 2],
    x1 = educ,
    y1 = conf.intervals_exp[2, educ - 2],
    lwd = 2)
}


STEP 4
Then, do the same thing, but this time for the predicted values of re78. Be sure to include axes labels and figure
titles.


set.seed(25)
# simulate 1000 sets of coefficients
sim_pred <- sim(lm1, 1000)
sim_coef_pred <- sim_pred@coef

# store 1000 predicted values for each education value (a fuller prediction
# interval would also add residual noise via sim_pred@sigma; here we use the
# individual coefficient draws alone, without averaging)
storage.matrix_pred <- matrix(NA, nrow = 1000, ncol = 14)
for (educ in c(3:16)) {
  for (i in c(1:1000)) {
    person <- c(age, educ, re74, re75, hisp, black)
    storage.matrix_pred[i, educ - 2] <- calc_turnout(sim_coef_pred[i, ], person)
  }
}
conf.intervals_pred <- apply(storage.matrix_pred, 2, quantile, probs = c(0.025, 0.975))

plot(x = c(1:100), y = c(1:100), type = "n", xlim = c(3, 16), ylim = c(1, 10000),
     main = "Predicted Earnings in 1978 by Years of Schooling (95% CI)",
     xlab = "Years of Schooling",
     ylab = "Real Earnings in 1978")

# draw the 95% interval for each education value using the quantiles above
for (educ in 3:16) {
  segments(
    x0 = educ,
    y0 = conf.intervals_pred[1, educ - 2],
    x1 = educ,
    y1 = conf.intervals_pred[2, educ - 2],
    lwd = 2)
}


Findings
The confidence intervals of the expected values are narrower than those of the predicted values. For the expected
values, the spread is smaller because averaging the 1000 simulated predictions reduces the variance, pulling the
values toward the mean.

The results show that years of schooling (3-16) has a positive, roughly linear association with real earnings in
1978 when the other predictors are held at their means. The confidence intervals of the expected values narrow as
years of schooling approaches 9 and widen after that, meaning the estimates are most certain around 9 years of
schooling and least certain at the two extremes. The expected earnings at 3 years of schooling are around $2,000
and increase roughly linearly to around $8,000 at 16 years of schooling.
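The variance-reduction effect of averaging can be seen in a toy simulation (self-contained; the numbers are illustrative, not from the model above):

```r
# averaging reduces spread: compare the spread of raw draws with the spread
# of means of 1000 draws each
set.seed(1)
draws <- rnorm(1000)                           # sd about 1
means <- replicate(1000, mean(rnorm(1000)))    # sd about 1/sqrt(1000), i.e. ~0.03
c(sd(draws), sd(means))
```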

Logistic Regression
STEP 1
Using the lalonde data set, run a logistic regression, modeling treatment status as a function of age,
education, hisp, re74 and re75. Report and interpret the regression coefficients and 95% confidence
intervals for age and education.

glm.treat <- glm(treat~age+educ+hisp+re74+re75,data=lalonde,family=binomial)


print("The regression coefficients are:")

## [1] "The regression coefficients are:"


print(coef(glm.treat))

## (Intercept) age educ hisp re74


## -1.320183e+00 1.164838e-02 6.922206e-02 -6.024002e-01 -2.277944e-05
## re75
## 5.221126e-05

conf_int <- confint(glm.treat)

sprintf("The 95%% confidence interval for 'age' is [%f,%f].", conf_int[2,1], conf_int[2,2])

## [1] "The 95% confidence interval for 'age' is [-0.015382,0.038522]."

sprintf("The 95%% confidence interval for 'education' is [%f,%f].", conf_int[3,1], conf_int[3,2])

## [1] "The 95% confidence interval for 'education' is [-0.038539,0.179589]."

The regression coefficients show that a one-unit increase in age is associated with an increase in the log odds of
treatment of 1.165*(10^-2), and an additional year of schooling with an increase of 6.922*(10^-2). A one-unit
increase in hisp (0-1) is associated with a decrease in the log odds of treatment of 6.024*(10^-1), and a
one-dollar increase in 1974 earnings with a decrease of 2.278*(10^-5). A one-dollar increase in 1975 earnings is
associated with an increase in the log odds of treatment of 5.221*(10^-5).

The 95% confidence intervals for 'age' and 'education' mean that if we repeatedly drew samples and computed an
interval in this way, about 95% of those intervals would contain the true coefficient. Since both intervals
include zero, neither coefficient is statistically significant at the 5% level.
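Because log-odds coefficients are hard to read directly, it can help to exponentiate them into odds ratios (a sketch using the fitted model above):

```r
# exponentiating a logistic coefficient gives a multiplicative odds ratio:
# e.g. each extra year of schooling multiplies the odds of treatment by
# exp(0.0692), roughly a 7% increase in the odds
exp(coef(glm.treat))
exp(confint(glm.treat)) # the same transformation applies to the CI endpoints
```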

STEP 2
Use a simple bootstrap to estimate (and report) bootstrapped confidence intervals for age and education
given the logistic regression above. Code the bootstrap algorithm yourself.

set.seed(42)
age.fn <- function(data, index) {
  d <- data[index, ]
  return(coef(glm(treat ~ age + educ + hisp + re74 + re75, data = d, family = binomial))[2])
}

boot.age <- boot(lalonde, age.fn, 1000)
age_conf <- quantile(boot.age$t, probs = c(0.025, 0.975))

educ.fn <- function(data, index) {
  d <- data[index, ]
  return(coef(glm(treat ~ age + educ + hisp + re74 + re75, data = d, family = binomial))[3])
}

boot.educ <- boot(lalonde, educ.fn, 1000)
educ_conf <- quantile(boot.educ$t, probs = c(0.025, 0.975))


Report bootstrapped confidence intervals for age and education here.

sprintf("The bootstrapped confidence interval for 'age' is [%f, %f].", age_conf[1], age_conf[2])

## [1] "The bootstrapped confidence interval for 'age' is [-0.015177, 0.040006]."

sprintf("The bootstrapped confidence interval for 'education' is [%f, %f].", educ_conf[1], educ_conf[2])

## [1] "The bootstrapped confidence interval for 'education' is [-0.037055, 0.199727]."

STEP 3
Then, using the simulation-based approach and the arm library, set all the predictors at their means EXCEPT
education, and create a data visualization that shows the 95% confidence interval of the expected values of the
probability of receiving treatment as education varies from 3 to 16. Be sure to include axes labels and figure titles.


set.seed(43)
glm.treat <- glm(treat ~ age + educ + hisp + re74 + re75, data = lalonde, family = binomial)

# create a function to calculate the probability of treatment given coefficients
# and independent variables (inverse logit of the linear predictor)
calc_turnout_2 <- function(coef, person) {
  turnout <- coef[1] +
    coef[2] * person[1] +
    coef[3] * person[2] +
    coef[4] * person[3] +
    coef[5] * person[4] +
    coef[6] * person[5]
  return(exp(turnout) / (1 + exp(turnout)))
}

# store 1000 expected values for each education value
storage.matrix_exp_2 <- matrix(NA, nrow = 1000, ncol = 14)
for (j in c(1:1000)) {
  sim_exp <- sim(glm.treat, 1000)
  sim_coef_exp <- sim_exp@coef
  for (educ in c(3:16)) {
    person <- c(age, educ, hisp, re74, re75)
    store <- rep(0, 1000)
    for (i in c(1:1000)) {
      store[i] <- calc_turnout_2(sim_coef_exp[i, ], person)
    }
    storage.matrix_exp_2[j, educ - 2] <- mean(store)
  }
}

conf.intervals_exp_2 <- apply(storage.matrix_exp_2, 2, quantile, probs = c(0.025, 0.975))

plot(x = c(1:100), y = c(1:100), type = "n", xlim = c(3, 16), ylim = c(0, 1),
     main = "Expected Probability of Receiving Treatment by Years of Schooling (95% CI)",
     xlab = "Years of Schooling",
     ylab = "Probability of Receiving Treatment")

# draw the 95% interval for each education value using the quantiles above
for (educ in 3:16) {
  segments(
    x0 = educ,
    y0 = conf.intervals_exp_2[1, educ - 2],
    x1 = educ,
    y1 = conf.intervals_exp_2[2, educ - 2],
    lwd = 2)
}


STEP 4
Then, do the same thing, but this time for the predicted values of the probability of receiving treatment as
education varies from 3 to 16. Be sure to include axes labels and figure titles.


set.seed(44)
# simulate 1000 sets of coefficients
sim_pred_2 <- sim(glm.treat, 1000)
sim_coef_pred_2 <- sim_pred_2@coef

# the same function as in the previous step: calculate the probability of
# treatment given coefficients and independent variables
calc_turnout_2 <- function(coef, person) {
  turnout <- coef[1] +
    coef[2] * person[1] +
    coef[3] * person[2] +
    coef[4] * person[3] +
    coef[5] * person[4] +
    coef[6] * person[5]
  return(exp(turnout) / (1 + exp(turnout)))
}

# store 1000 predicted values for each education value
storage.matrix_pred_2 <- matrix(NA, nrow = 1000, ncol = 14)
for (educ in c(3:16)) {
  for (i in c(1:1000)) {
    person <- c(age, educ, hisp, re74, re75)
    storage.matrix_pred_2[i, educ - 2] <- calc_turnout_2(sim_coef_pred_2[i, ], person)
  }
}

conf.intervals_pred_2 <- apply(storage.matrix_pred_2, 2, quantile, probs = c(0.025, 0.975))

plot(x = c(1:100), y = c(1:100), type = "n", xlim = c(3, 16), ylim = c(0, 1),
     main = "Predicted Probability of Receiving Treatment by Years of Schooling (95% CI)",
     xlab = "Years of Schooling",
     ylab = "Probability of Receiving Treatment")

# draw the 95% interval for each education value using the quantiles above
for (educ in 3:16) {
  segments(
    x0 = educ,
    y0 = conf.intervals_pred_2[1, educ - 2],
    x1 = educ,
    y1 = conf.intervals_pred_2[2, educ - 2],
    lwd = 2)
}


The graph is similar to the one in the previous exercise, with the confidence intervals of the expected values
narrower than those of the predicted values. The expected probability of receiving treatment for people with 3
years of schooling is around 0.3, and it increases roughly linearly with years of schooling, reaching about 0.5
at 16 years. The interval widths are similar across education values, meaning the uncertainty is similar
throughout the range.
