2021/1/9 Kickstarter Success Prediction

Kickstarter Success Prediction

Tianhui Xu
09/27/2020
Can we predict what projects end up being successful on Kickstarter?

We have data from the Kickstarter (https://www.kickstarter.com/) company.

From Wikipedia: Kickstarter is an American public-benefit corporation based in Brooklyn, New York, that
maintains a global crowdfunding platform focused on creativity and merchandising. The company’s stated
mission is to “help bring creative projects to life”. As of May 2019, Kickstarter has received more than $4 billion in
pledges from 16.3 million backers to fund 445,000 projects, such as films, music, stage shows, comics,
journalism, video games, technology, publishing, and food-related projects.

The data was collected by Mickaël Mouillé (https://www.kaggle.com/kemical) and was last updated in 2018. The columns
are largely self-explanatory. Note that usd_pledged is the pledged column converted to US dollars (conversion done by
Kickstarter), while usd_pledged_real is the pledged column in real US dollars. Finally,
usd_goal_real is the goal column in real US dollars. You should use the real columns.

So what makes a project successful? Undoubtedly, there are many factors, but perhaps we could set up a
prediction problem here, similar to the one from the bonus part of the last assignment where we used GDP to
predict personnel contributions.

We have columns representing the number of backers, project length, the main category, and the real project goal
in USD for each project.

Let’s explore the relationship between those predictors and the dependent variable of interest — the success of a
project.

Instead of running a simple linear regression and calling it a day, let’s use cross-validation to make our prediction
a little more sophisticated.

Our general plan is the following:

1. Build the model on a training data set.
2. Apply the model on a new test data set to make predictions based on the inferred model parameters.
3. Compute and track the prediction errors to check performance using the mean squared difference between
the observed and the predicted outcome values in the test set.

STEP 1: Import & Clean the Data

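# load the packages used throughout this report
# (assumption: the lalonde data set used later comes from the Matching package)
library(lubridate) # date()
library(boot)      # cv.glm(), boot()
library(arm)       # sim()
library(Matching)  # lalonde

# import the dataset

ks_df <- read.csv("ks-projects-201801.csv")

ks_df[ks_df==""] <- NA # set all the blanks to NAs

ks_df <- na.omit(ks_df) # omit the NA values

# see the head

head(ks_df)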

STEP 2: Codify outcome variable


# A project is successful if the real pledged value is larger than or equal to the goal value
# create a column of dummy variables to indicate the status
ks_df$success <- ifelse(ks_df$usd_pledged_real - ks_df$usd_goal_real >= 0, 1, 0)
head(ks_df)

STEP 3: Getting the project length variable

# extract the dates using the lubridate package

ks_df$deadline <- date(ks_df$deadline)
ks_df$launched <- date(ks_df$launched)
# create a new column to store the length of the project in days (as a plain number)
ks_df$length <- as.numeric(ks_df$deadline - ks_df$launched)
# remove any project whose length is more than 60 days
ks_df <- ks_df[!(ks_df$length > 60),]
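As a quick check of the date arithmetic used for the project length, the sketch below subtracts two made-up dates (they are not taken from the data set) using base R only. It is meant to illustrate that subtracting Date objects yields a difference in days, which as.numeric() turns into a plain number.

```r
# Hypothetical launch and deadline values, for illustration only
launched <- as.Date("2018-01-01 09:30:00")  # as.Date() keeps only the date part
deadline <- as.Date("2018-01-31")

# Subtracting two Date objects gives a difftime measured in days
project_length <- as.numeric(deadline - launched)
print(project_length)
```

Storing the length as a number rather than a difftime keeps later comparisons and model formulas unambiguous.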

STEP 4: Splitting the data into a training and a testing set

# randomly select 80% of the data

tr_idx <- sample(nrow(ks_df),as.integer(nrow(ks_df) * 0.80))
ks_tr <- ks_df[tr_idx,] # training set
ks_te <- ks_df[-tr_idx,] # rest to testing set
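The split above draws a fresh random sample on every run, so the reported numbers cannot be reproduced exactly. A minimal sketch of a reproducible 80/20 split on a small made-up data frame follows; the seed value and the toy data are assumptions chosen only for illustration.

```r
set.seed(123)  # arbitrary seed, chosen for this example only

# Toy data frame standing in for ks_df
toy <- data.frame(id = 1:10, y = rep(c(0, 1), 5))

# Same indexing pattern as in the split above
tr_idx <- sample(nrow(toy), as.integer(nrow(toy) * 0.80))
toy_tr <- toy[tr_idx, ]   # training rows
toy_te <- toy[-tr_idx, ]  # remaining rows

# The two sets are disjoint and together cover every row
print(sort(c(toy_tr$id, toy_te$id)))
```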

STEP 5: Fitting a model

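# convert main_category to a factor in the sets that are actually modelled;
# converting ks_df at this point would not change ks_tr or ks_te, which were split off in STEP 4
cat_levels <- sort(unique(as.character(ks_df$main_category)))
ks_tr$main_category <- factor(ks_tr$main_category, levels = cat_levels)
ks_te$main_category <- factor(ks_te$main_category, levels = cat_levels)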

# fit a logistic regression model to the dataset

glm_ks <- glm(success ~ backers + length + main_category + usd_goal_real, data = ks_tr,
family=binomial)
summary(glm_ks)


##
## Call:
## glm(formula = success ~ backers + length + main_category + usd_goal_real,
## family = binomial, data = ks_tr)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -8.4904 -0.5478 -0.0617 0.2219 8.4904
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.326e-01 2.333e-02 -9.967 <2e-16 ***
## backers 4.997e-02 2.340e-04 213.506 <2e-16 ***
## length -1.534e-02 5.106e-04 -30.052 <2e-16 ***
## main_categoryComics -4.692e-01 3.789e-02 -12.383 <2e-16 ***
## main_categoryCrafts -6.811e-01 3.882e-02 -17.543 <2e-16 ***
## main_categoryDance 7.520e-01 5.043e-02 14.911 <2e-16 ***
## main_categoryDesign -6.777e-01 3.022e-02 -22.421 <2e-16 ***
## main_categoryFashion -5.825e-01 2.984e-02 -19.519 <2e-16 ***
## main_categoryFilm & Video 2.517e-01 2.188e-02 11.501 <2e-16 ***
## main_categoryFood -3.790e-01 3.055e-02 -12.404 <2e-16 ***
## main_categoryGames -1.710e+00 3.241e-02 -52.772 <2e-16 ***
## main_categoryJournalism -6.697e-01 5.653e-02 -11.847 <2e-16 ***
## main_categoryMusic 1.899e-01 2.224e-02 8.536 <2e-16 ***
## main_categoryPhotography -3.911e-01 3.571e-02 -10.952 <2e-16 ***
## main_categoryPublishing -4.764e-01 2.459e-02 -19.372 <2e-16 ***
## main_categoryTechnology -5.930e-01 3.328e-02 -17.820 <2e-16 ***
## main_categoryTheater 7.762e-01 3.382e-02 22.950 <2e-16 ***
## usd_goal_real -2.257e-04 1.340e-06 -168.432 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 386741 on 295507 degrees of freedom
## Residual deviance: 182413 on 295490 degrees of freedom
## AIC: 182449
##
## Number of Fisher Scoring iterations: 13
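
The estimates in the summary are on the log-odds scale, which is hard to read directly. One common way to interpret them is to exponentiate them into odds ratios. The sketch below does this for three estimates copied by hand from the output above; it is an illustration, not part of the original analysis.

```r
# Estimates copied from the summary output above (log-odds scale)
log_odds <- c(backers = 4.997e-02, length = -1.534e-02, usd_goal_real = -2.257e-04)

# exp() turns a change in log odds into a multiplicative change in the odds
odds_ratios <- exp(log_odds)
print(round(odds_ratios, 4))
```

Under this reading, each additional backer multiplies the odds of success by roughly 1.05, holding the other predictors fixed, while longer campaigns and larger goals are associated with lower odds.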

STEP 6: Predictions
Use the model you’ve inferred from the previous step to predict the success outcomes in the test set.

# predict the response of probability of the test set using the fitted model
ks_p <- predict(glm_ks, ks_te, type = "response")
# convert into class labels
ks.pred=rep("No",nrow(ks_te))
ks.pred[ks_p > .5] = "Yes"
ks_te$ks.pred = ks.pred
# produce a confusion matrix
acc_table <- table(ks.pred,ks_te$success)
acc_table


##
## ks.pred 0 1
## No 45585 5169
## Yes 1758 21366

STEP 7: How well did it do?

Report the misclassification rate of the predictions for the training and the test sets.

# use the same process to predict for the training set

ks_p_tr <- predict(glm_ks, ks_tr, type = "response")
ks.pred_tr <- rep("No",nrow(ks_tr))
ks.pred_tr[ks_p_tr>.5] = "Yes"
ks_tr$ks.pred_tr = ks.pred_tr
acc_table_tr <- table(ks.pred_tr,ks_tr$success)
acc_table_tr

##
## ks.pred_tr 0 1
## No 181631 20866
## Yes 7004 86007

# report misclassification rates

misclass_tr <- (acc_table_tr[1,2]+acc_table_tr[2,1])/sum(acc_table_tr)
sprintf("The misclassification rate of the predictions for the training sets is: %f", misclass_tr)

## [1] "The misclassification rate of the predictions for the training sets is: 0.094312"

misclass_te <- (acc_table[1,2]+acc_table[2,1])/sum(acc_table)

sprintf("The misclassification rate of the predictions for the test sets is: %f", misclass_te)

## [1] "The misclassification rate of the predictions for the test sets is: 0.093763"
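
A single misclassification rate hides how the errors are distributed between the two classes. The sketch below rebuilds the test-set confusion matrix from the counts printed in STEP 6 and derives the misclassification rate together with sensitivity and specificity. The matrix is typed in by hand here, so the object is a stand-in for the original acc_table.

```r
# Test-set confusion matrix, typed in from the output in STEP 6
# rows: predicted label, columns: observed outcome
acc_table <- matrix(c(45585, 1758, 5169, 21366), nrow = 2,
                    dimnames = list(pred = c("No", "Yes"), actual = c("0", "1")))

# share of projects whose predicted label differs from the outcome
misclass <- 1 - sum(diag(acc_table)) / sum(acc_table)

# share of successful projects that were predicted as successful
sensitivity <- acc_table["Yes", "1"] / sum(acc_table[, "1"])

# share of unsuccessful projects that were predicted as unsuccessful
specificity <- acc_table["No", "0"] / sum(acc_table[, "0"])

print(round(c(misclass = misclass, sensitivity = sensitivity, specificity = specificity), 4))
```

The model misses successful projects more often than it flags unsuccessful ones as successful, which the overall rate alone does not show.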

STEP 8: LOOCV method

Apply the leave-one-out cross validation (LOOCV) method to the training set.


# sample 5% of the data set to apply LOOCV in order to reduce running time
ks_df_sam <- ks_df[sample(nrow(ks_df),as.integer(nrow(ks_df)*0.05)),]

# randomly select 80% of the data

sam_tr_idx <- sample(nrow(ks_df_sam),as.integer(nrow(ks_df_sam) * 0.80))
ks_sam_tr <- ks_df_sam[sam_tr_idx,] # training set from 5% sample
ks_sam_te <- ks_df_sam[-sam_tr_idx,] # rest to testing set from 5% sample

# fit a logistic model on the training data set

glm_ks_sam <- glm(success ~ backers + length + main_category + usd_goal_real, data = ks_sam_tr, family=binomial)

# apply LOOCV method to the training set
# (cv.glm() is called without a cost argument, so delta holds the average squared error
# between the 0/1 outcome and the fitted probability, not a share of misclassified cases)

cv.err=cv.glm(ks_sam_tr,glm_ks_sam)

# apply LOOCV method to the test set
# (cv.glm() refits the model on the data it is given, so this is LOOCV within the test sample)

cv.err.test =cv.glm(ks_sam_te,glm_ks_sam)

sprintf("The misclassification rate of the training set is: raw: %f, adjusted: %f.", cv.err$delta[1], cv.err$delta[2])

## [1] "The misclassification rate of the training set is: raw: 0.076014, adjusted: 0.076014."

sprintf("The misclassification rate of the test set is: raw: %f, adjusted: %f.", cv.err.test$delta[1], cv.err.test$delta[2])

## [1] "The misclassification rate of the test set is: raw: 0.362208, adjusted: 0.081246."

The adjusted error estimates for the training and test sets resulting from LOOCV are quite similar. The raw cross-
validation estimate for the test set is much higher, probably because the test sample is small, so the model
refitted in each fold may fit poorly. In addition, because each fold trains on an almost identical data set, the
fitted models are highly correlated with one another, which adds to the variance of the estimate. Note that
cv.glm() was called with its default cost function, so these figures are average squared errors of the predicted
probabilities rather than misclassification rates in the strict sense.
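
The cv.glm() calls above use the default cost, which is the average squared error. To obtain a cross-validated misclassification rate instead, a cost function can be passed explicitly. The sketch below shows this on simulated data (the data, seed, and coefficient are assumptions for illustration) and also shows K-fold cross-validation as a cheaper alternative to LOOCV on larger samples.

```r
library(boot)

set.seed(1)  # arbitrary seed for the simulated example

# Simulated binary outcome driven by a single predictor
n <- 200
toy <- data.frame(x = rnorm(n))
toy$y <- rbinom(n, 1, plogis(1.5 * toy$x))

fit <- glm(y ~ x, data = toy, family = binomial)

# cost: share of cases whose predicted class (0.5 threshold) differs from the outcome
misclass_cost <- function(obs, pred_prob) mean(abs(obs - pred_prob) > 0.5)

cv_mse <- cv.glm(toy, fit)                                # LOOCV, default squared-error cost
cv_mis <- cv.glm(toy, fit, cost = misclass_cost)          # LOOCV, misclassification cost
cv_10  <- cv.glm(toy, fit, cost = misclass_cost, K = 10)  # 10-fold, misclassification cost

print(c(squared_error = cv_mse$delta[1],
        loocv_misclass = cv_mis$delta[1],
        tenfold_misclass = cv_10$delta[1]))
```

The first element of delta is the raw estimate and the second is the bias-adjusted one; for LOOCV the two are usually very close.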

STEP 9: Explanations
Compare the misclassification rates from the simple method to the LOOCV method.

In this project, we applied cross-validation to the training set and the test set separately in order to compare
the resulting error rates. In real-world cases, however, cross-validation is typically used when there are too few
data points to hold out a separate test set but we still want to estimate the out-of-sample fit to select the best
model; the data then does not need to be divided into a training set and a test set.

Linear Regression
STEP 1
Using the lalonde data set, run a linear regression that models re78 as a function of age , education ,
re74 , re75 , hisp , and black .

data(lalonde)
# fit a linear regression model
lm1 <- lm(re78~age+educ+re74+re75+hisp+black, data=lalonde)
summary(lm1)

##
## Call:
## lm(formula = re78 ~ age + educ + re74 + re75 + hisp + black,
## data = lalonde)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9065 -4577 -1775 3186 55037
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.126e+03 2.434e+03 0.463 0.6437
## age 5.928e+01 4.409e+01 1.345 0.1794
## educ 4.293e+02 1.758e+02 2.442 0.0150 *
## re74 7.462e-02 7.699e-02 0.969 0.3330
## re75 6.676e-02 1.314e-01 0.508 0.6116
## hisp -2.125e+02 1.546e+03 -0.137 0.8907
## black -2.323e+03 1.162e+03 -1.999 0.0462 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6544 on 438 degrees of freedom
## Multiple R-squared: 0.03942, Adjusted R-squared: 0.02627
## F-statistic: 2.996 on 6 and 438 DF, p-value: 0.007035

STEP 2
Report coefficients and R-squared.

# report coefficients and R-squared

print("The coefficients estimated from the model are:")

## [1] "The coefficients estimated from the model are:"

print(coef(lm1))

## (Intercept) age educ re74 re75

## 1.126492e+03 5.928155e+01 4.293074e+02 7.461872e-02 6.676314e-02
## hisp black
## -2.125053e+02 -2.323282e+03

sprintf("The reported R-squared is: %.3f", summary(lm1)$r.squared) # read the value from the model instead of typing it in

## [1] "The reported R-squared is: 0.039"


# predict using the fitted model

predicted_re78 <- predict(lm1, lalonde)

# hand-calculation of R-squared
re78_mean <- mean(lalonde$re78) # named re78_mean so the mean() function is not shadowed
# SSR (regression sum of squares): how far the fitted regression line is from
# the horizontal 'no relationship' line at the sample mean
SSR <- sum((predicted_re78 - re78_mean)^2)
# SSTO (total sum of squares): how much the data points vary around their mean
SSTO <- sum((lalonde$re78 - re78_mean)^2)
# SSE (error sum of squares): how much the data points vary around the fitted line
SSE <- sum((lalonde$re78 - predicted_re78)^2)
R_squared <- SSR/SSTO
sprintf("The calculated R-squared by hand is: %s, the answer is the same with the R-squared from the summary statistics above.", round(R_squared,3))

## [1] "The calculated R-squared by hand is: 0.039, the answer is the same with the R-squared from the summary statistics above."
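
The hand calculation relies on the decomposition SSTO = SSR + SSE, which holds for a least-squares fit with an intercept, so R-squared can be written either as SSR/SSTO or as 1 - SSE/SSTO. The sketch below checks this on R's built-in mtcars data, which is used here only so that the example runs without extra packages.

```r
# Built-in data, used only to keep the example self-contained
fit <- lm(mpg ~ wt + hp, data = mtcars)

y     <- mtcars$mpg
y_hat <- fitted(fit)
y_bar <- mean(y)

SSR  <- sum((y_hat - y_bar)^2)  # variation explained by the regression
SSE  <- sum((y - y_hat)^2)      # variation left in the residuals
SSTO <- sum((y - y_bar)^2)      # total variation around the mean

r2_from_ssr <- SSR / SSTO
r2_from_sse <- 1 - SSE / SSTO

print(c(r2_from_ssr, r2_from_sse, summary(fit)$r.squared))
```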

STEP 3
Then, setting all the predictors at their means EXCEPT education , create a data visualization that shows the
95% confidence interval of the expected values of re78 as education varies from 3 to 16. Be sure to include
axes labels and figure titles.

set.seed(254)
# setting all predictors at their mean
age = mean(lalonde$age)
re74 = mean(lalonde$re74)
re75 = mean(lalonde$re75)
hisp = mean(lalonde$hisp)
black = mean(lalonde$black)

# create a function to calculate the dependent variable given coefficients and independent variables
calc_turnout <- function(coef, person) {
  turnout <- coef[1] +
    coef[2] * person[1] +
    coef[3] * person[2] +
    coef[4] * person[3] +
    coef[5] * person[4] +
    coef[6] * person[5] +
    coef[7] * person[6]
  return(turnout)
}


# use three nested loops to get the expected values of 're78'

storage.matrix_exp <- matrix(NA, nrow = 1000, ncol = 14)

for (j in c(1:1000)) {
  sim_exp <- sim(lm1, 1000)
  sim_coef_exp <- sim_exp@coef
  for (educ in c(3:16)) {
    person <- c(age, educ, re74, re75, hisp, black)
    store <- rep(0,1000)
    for (i in c(1:1000)) {
      store[i] <- calc_turnout(sim_coef_exp[i,],person)
    }
    storage.matrix_exp[j,educ-2] <- mean(store)
  }
}

# 95% confidence interval of the expected values

conf.intervals_exp <- apply(storage.matrix_exp, 2, quantile, probs = c(0.025, 0.975))

plot(x = c(1:100), y = c(1:100), type = "n", xlim = c(3,16), ylim = c(1,10000),
     main = "Expected Earnings in 1978 by Years of Schooling (95% CI)",
     xlab = "Years of Schooling",
     ylab = "Real Earnings in 1978")

# draw one segment per education level from the 2.5% to the 97.5% quantile
# (the quantiles are in conf.intervals_exp, not in the first two rows of storage.matrix_exp)
for (educ in 3:16) {
  segments(
    x0 = educ,
    y0 = conf.intervals_exp[1, educ - 2],
    x1 = educ,
    y1 = conf.intervals_exp[2, educ - 2],
    lwd = 2)
}


STEP 4
Then, do the same thing, but this time for the predicted values of re78 . Be sure to include axes labels and figure
titles.


set.seed(25)
# simulate 1000 sets of coefficients
sim_pred <- sim(lm1, 1000)
sim_coef_pred <- sim_pred@coef

# store 1000 predicted values for each education value

storage.matrix_pred <- matrix(NA, nrow = 1000, ncol = 14)
for (educ in c(3:16)) {
  for (i in c(1:1000)) {
    person <- c(age, educ, re74, re75, hisp, black)
    storage.matrix_pred[i,educ-2] <- calc_turnout(sim_coef_pred[i,],person)
  }
}
conf.intervals_pred <- apply(storage.matrix_pred, 2, quantile, probs = c(0.025, 0.975))

plot(x = c(1:100), y = c(1:100), type = "n", xlim = c(3,16), ylim = c(1,10000),
     main = "Predicted Earnings in 1978 by Years of Schooling (95% CI)",
     xlab = "Years of Schooling",
     ylab = "Real Earnings in 1978")

# draw one segment per education level from the 2.5% to the 97.5% quantile
# (the quantiles are in conf.intervals_pred, not in the first two rows of storage.matrix_pred)
for (educ in 3:16) {
  segments(
    x0 = educ,
    y0 = conf.intervals_pred[1, educ - 2],
    x1 = educ,
    y1 = conf.intervals_pred[2, educ - 2],
    lwd = 2)
}
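
The simulation above varies only the coefficients. A predicted value for an individual is usually understood to also include residual noise, which makes its interval considerably wider than the interval for the expected value. As a cross-check that needs no simulation, predict() can return both kinds of interval analytically. The sketch below uses R's built-in cars data as a stand-in, so the model and numbers are illustrative assumptions rather than the lalonde results.

```r
# Built-in data, used only to keep the example self-contained
fit <- lm(dist ~ speed, data = cars)
new_x <- data.frame(speed = c(10, 15, 20))

# interval for the expected value (uncertainty in the coefficients only)
conf_int_exp <- predict(fit, new_x, interval = "confidence", level = 0.95)

# interval for a new observation (coefficient uncertainty plus residual noise)
pred_int_new <- predict(fit, new_x, interval = "prediction", level = 0.95)

width_conf <- conf_int_exp[, "upr"] - conf_int_exp[, "lwr"]
width_pred <- pred_int_new[, "upr"] - pred_int_new[, "lwr"]
print(cbind(speed = new_x$speed, width_conf, width_pred))
```

In a simulation-based version, the same effect would come from adding a draw of residual noise to each simulated linear predictor.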


Findings
The confidence intervals of the expected values are narrower than those of the predicted values. For the expected
values, the spread is smaller because taking the mean of the 1000 predictions reduces the variance, so
the values are concentrated around the mean.

The result is that years of schooling (3-16) has a positive influence on predicted real earnings in 1978 when
the other predictors are held at their means, and the relationship is linear. The confidence intervals
narrow as years of schooling approaches 9 and widen after that, meaning that the estimate is
most certain when years of schooling is around 9 and less certain towards the two extremes. The expected earnings
for 3 years of schooling are around $2000 and increase linearly to around $8000 for 16 years of schooling.

Logistic Regression
STEP 1
Using the lalonde data set, run a logistic regression, modeling treatment status as a function of age ,
education , hisp , re74 and re75 . Report and interpret the regression coefficient and 95% confidence
intervals for age and education .

glm.treat <- glm(treat~age+educ+hisp+re74+re75,data=lalonde,family=binomial)

print("The regression coefficients are:")

## [1] "The regression coefficients are:"


print(coef(glm.treat))

## (Intercept) age educ hisp re74

## -1.320183e+00 1.164838e-02 6.922206e-02 -6.024002e-01 -2.277944e-05
## re75
## 5.221126e-05

conf_int <- confint(glm.treat)

sprintf("The 95%% confidence interval for 'age' is [%f,%f].", conf_int[2,1], conf_int[2,2])

## [1] "The 95% confidence interval for 'age' is [-0.015382,0.038522]."

sprintf("The 95%% confidence interval for 'education' is [%f,%f].", conf_int[3,1], conf_int[3,2])

## [1] "The 95% confidence interval for 'education' is [-0.038539,0.179589]."

The regression coefficients show that a one-unit increase in age is associated with an increase in the log odds of
treatment of 1.165 * 10^(-2), and a one-unit increase in years of schooling with an increase of 6.922 * 10^(-2).
However, a one-unit increase in hispanic (0-1) is associated with a decrease in the log odds of treatment of
6.024 * 10^(-1), and a one-unit increase in earnings in 1974 with a decrease of 2.278 * 10^(-5). A one-unit increase
in earnings in 1975 is associated with an increase in the log odds of treatment of 5.221 * 10^(-5). These values
are taken from the coefficients printed above.

The 95% confidence intervals for ‘age’ and ‘education’ are constructed so that, over repeated samples, about 95% of
intervals built this way would contain the true coefficient. Both intervals include zero, so neither coefficient
can be distinguished from zero at the 5% level.
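
For a glm, confint() computes profile-likelihood intervals. A simpler approximation is the Wald interval, the estimate plus or minus about 1.96 standard errors, which is what confint.default() returns. The sketch below computes a Wald interval by hand on simulated data (the data, seed, and true coefficients are assumptions for illustration) and compares it with confint.default().

```r
set.seed(2)  # arbitrary seed for the simulated example

# Simulated binary outcome with a known slope of 0.8 on the log-odds scale
n <- 300
toy <- data.frame(x = rnorm(n))
toy$y <- rbinom(n, 1, plogis(-0.5 + 0.8 * toy$x))

fit <- glm(y ~ x, data = toy, family = binomial)

# Wald interval by hand: estimate +/- z * standard error
coef_table <- coef(summary(fit))
est <- coef_table["x", "Estimate"]
se  <- coef_table["x", "Std. Error"]
wald_ci <- est + c(-1, 1) * qnorm(0.975) * se

print(wald_ci)
print(confint.default(fit)["x", ])  # same interval from the built-in Wald method
```

The two kinds of interval tend to agree closely in large samples and can differ more when the sample is small or the estimate is near a boundary.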

STEP 2
Use a simple bootstrap to estimate (and report) bootstrapped confidence intervals for age and education
given the logistic regression above. Code the bootstrap algorithm yourself.

set.seed(42)
age.fn <- function(data,index) {
  d <- data[index,]
  return(coef(glm(treat~age+educ+hisp+re74+re75,data=d,family=binomial))[2])
}

boot.age <- boot(lalonde, age.fn, 1000)

age_conf <- quantile(boot.age$t, probs = c(0.025, 0.975))

educ.fn <- function(data,index) {
  d <- data[index,]
  return(coef(glm(treat~age+educ+hisp+re74+re75,data=d,family=binomial))[3])
}

boot.educ <- boot(lalonde, educ.fn, 1000)

educ_conf <- quantile(boot.educ$t, probs = c(0.025, 0.975))
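
The task asks for a hand-coded bootstrap, whereas the code above relies on boot() from the boot package. A minimal sketch of the same percentile bootstrap written as an explicit loop follows. It runs on simulated data so that it is self-contained; the data, seed, and number of replicates are assumptions for illustration.

```r
set.seed(3)  # arbitrary seed for the simulated example

# Simulated binary outcome with a single predictor
n <- 200
toy <- data.frame(x = rnorm(n))
toy$y <- rbinom(n, 1, plogis(0.3 + 0.7 * toy$x))

n_boot <- 500
boot_coef <- numeric(n_boot)

for (b in seq_len(n_boot)) {
  # resample row indices with replacement, keeping the original sample size
  idx <- sample(nrow(toy), replace = TRUE)
  boot_fit <- glm(y ~ x, data = toy[idx, ], family = binomial)
  boot_coef[b] <- coef(boot_fit)["x"]
}

# percentile interval: the middle 95% of the bootstrapped coefficients
boot_ci <- quantile(boot_coef, probs = c(0.025, 0.975))
print(boot_ci)
```

Applied to the lalonde model, the loop body would refit the treatment regression on each resample and store the age and education coefficients.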


Report bootstrapped confidence intervals for age and education here.

sprintf("The bootstrapped confidence interval for 'age' is [%f, %f].", age_conf[1], age_conf[2])

## [1] "The bootstrapped confidence interval for 'age' is [-0.015177, 0.040006]."

sprintf("The bootstrapped confidence interval for 'education' is [%f, %f].", educ_conf[1], educ_conf[2])

## [1] "The bootstrapped confidence interval for 'education' is [-0.037055, 0.199727]."

STEP 3
Then, using the simulation-based approach and the arm library, set all the predictors at their means EXCEPT
education , create a data visualization that shows the 95% confidence interval of the expected values of the
probability of receiving treatment as education varies from 3 to 16. Be sure to include axes labels and figure titles.


set.seed(43)
glm.treat <- glm(treat~age+educ+hisp+re74+re75,data=lalonde,family=binomial)

# create a function to calculate the dependent variable given coefficients and independent variables
calc_turnout_2 <- function(coef, person) {
  turnout <- coef[1] +
    coef[2] * person[1] +
    coef[3] * person[2] +
    coef[4] * person[3] +
    coef[5] * person[4] +
    coef[6] * person[5]
  return(exp(turnout) / (1 + exp(turnout)))
}
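
The last line of the function applies the inverse logit by hand. Base R provides the same transformation as plogis(), which also behaves better for very large linear predictors. The short sketch below compares the two on a few illustrative values.

```r
# A few linear-predictor values on the log-odds scale
eta <- c(-2, 0, 2)

manual  <- exp(eta) / (1 + exp(eta))  # inverse logit written out by hand
builtin <- plogis(eta)                # the same transformation from base R
print(cbind(eta, manual, builtin))

# For a very large value, exp() overflows to Inf and the manual formula gives Inf/Inf
big <- 800
naive_big  <- exp(big) / (1 + exp(big))
stable_big <- plogis(big)
print(c(naive_big, stable_big))
```

With the coefficient magnitudes in this report the manual formula is safe, so this matters mainly as a habit for other models.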

# store 1000 expected values for each education value

storage.matrix_exp_2 <- matrix(NA, nrow = 1000, ncol = 14)
for (j in c(1:1000)) {
  sim_exp <- sim(glm.treat, 1000)
  sim_coef_exp <- sim_exp@coef
  for (educ in c(3:16)) {
    person <- c(age, educ, hisp, re74, re75)
    store <- rep(0,1000)
    for (i in c(1:1000)) {
      store[i] <- calc_turnout_2(sim_coef_exp[i,],person)
    }
    storage.matrix_exp_2[j,educ-2] <- mean(store)
  }
}

conf.intervals_exp_2 <- apply(storage.matrix_exp_2, 2, quantile, probs = c(0.025, 0.975))

plot(x = c(1:100), y = c(1:100), type = "n", xlim = c(3,16), ylim = c(0,1),
     main = "Expected probability of receiving treatment by years of schooling (95% CI)",
     xlab = "Years of Schooling",
     ylab = "Probability of Receiving Treatment")

# draw one segment per education level from the 2.5% to the 97.5% quantile
# (the quantiles are in conf.intervals_exp_2, not in the first two rows of storage.matrix_exp_2)
for (educ in 3:16) {
  segments(
    x0 = educ,
    y0 = conf.intervals_exp_2[1, educ - 2],
    x1 = educ,
    y1 = conf.intervals_exp_2[2, educ - 2],
    lwd = 2)
}


STEP 4
Then, do the same thing, but this time for the predicted values of the probability of receiving treatment as
education varies from 3 to 16. Be sure to include axes labels and figure titles.


set.seed(44)
# simulate 1000 sets of coefficients
sim_pred_2 <- sim(glm.treat, 1000)
sim_coef_pred_2 <- sim_pred_2@coef

# store 1000 predicted values for each education value

storage.matrix_pred_2 <- matrix(NA, nrow = 1000, ncol = 14)
for (educ in c(3:16)) {
  for (i in c(1:1000)) {
    person <- c(age, educ, hisp, re74, re75)
    storage.matrix_pred_2[i,educ-2] <- calc_turnout_2(sim_coef_pred_2[i,],person)
  }
}

conf.intervals_pred_2 <- apply(storage.matrix_pred_2, 2, quantile, probs = c(0.025, 0.975))

plot(x = c(1:100), y = c(1:100), type = "n", xlim = c(3,16), ylim = c(0,1),
     main = "Predicted probability of receiving treatment by years of schooling (95% CI)",
     xlab = "Years of Schooling",
     ylab = "Probability of Receiving Treatment")

# draw one segment per education level from the 2.5% to the 97.5% quantile
# (the quantiles are in conf.intervals_pred_2, not in the first two rows of storage.matrix_pred_2)
for (educ in 3:16) {
  segments(
    x0 = educ,
    y0 = conf.intervals_pred_2[1, educ - 2],
    x1 = educ,
    y1 = conf.intervals_pred_2[2, educ - 2],
    lwd = 2)
}


The graph is similar to that of the previous exercise, with the confidence intervals of the expected values narrower
than those of the predicted values. The expected probability of receiving treatment for people with 3 years of
schooling is around 0.3, and it increases approximately linearly with years of schooling, reaching around 0.5 for
16 years of schooling. The interval widths are similar across all values of education, meaning that the uncertainty
is similar.
