Kickstarter Success Prediction
From Wikipedia: Kickstarter is an American public-benefit corporation based in Brooklyn, New York, that
maintains a global crowdfunding platform focused on creativity and merchandising. The company’s stated
mission is to “help bring creative projects to life”. As of May 2019, Kickstarter has received more than $4 billion in
pledges from 16.3 million backers to fund 445,000 projects, such as films, music, stage shows, comics,
journalism, video games, technology, publishing, and food-related projects.
The data was collected by Mickaël Mouillé (https://www.kaggle.com/kemical) and was last updated in 2018. Columns
are self-explanatory. Note that usd_pledged is the pledged column in US dollars (conversion done by
Kickstarter) and usd_pledged_real is the pledged column in real US dollars. Finally,
usd_goal_real is the goal column in real US dollars. You should use the real columns.
So what makes a project successful? Undoubtedly, there are many factors, but perhaps we could set up a
prediction problem here, similar to the one from the bonus part of the last assignment where we used GDP to
predict personnel contributions.
We have columns representing the number of backers, project length, the main category, and the real project goal
in USD for each project.
Let’s explore the relationship between those predictors and the dependent variable of interest — the success of a
project.
Instead of running a simple linear regression and calling it a day, let’s use cross-validation to make our prediction
a little more sophisticated.
file:///Users/tianhui/CS112/Assignment2/cs112_assignment_02_fall_2020的副本.html 1/17
2021/1/9 Kickstarter Success Prediction
# A project is successful if the real pledged value is larger than or equal to the goal value
# create a column of dummy variables to indicate the status
ks_df$success <- ifelse(ks_df$usd_pledged_real - ks_df$usd_goal_real >= 0, 1, 0)
head(ks_df)
##
## Call:
## glm(formula = success ~ backers + length + main_category + usd_goal_real,
## family = binomial, data = ks_tr)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -8.4904 -0.5478 -0.0617 0.2219 8.4904
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.326e-01 2.333e-02 -9.967 <2e-16 ***
## backers 4.997e-02 2.340e-04 213.506 <2e-16 ***
## length -1.534e-02 5.106e-04 -30.052 <2e-16 ***
## main_categoryComics -4.692e-01 3.789e-02 -12.383 <2e-16 ***
## main_categoryCrafts -6.811e-01 3.882e-02 -17.543 <2e-16 ***
## main_categoryDance 7.520e-01 5.043e-02 14.911 <2e-16 ***
## main_categoryDesign -6.777e-01 3.022e-02 -22.421 <2e-16 ***
## main_categoryFashion -5.825e-01 2.984e-02 -19.519 <2e-16 ***
## main_categoryFilm & Video 2.517e-01 2.188e-02 11.501 <2e-16 ***
## main_categoryFood -3.790e-01 3.055e-02 -12.404 <2e-16 ***
## main_categoryGames -1.710e+00 3.241e-02 -52.772 <2e-16 ***
## main_categoryJournalism -6.697e-01 5.653e-02 -11.847 <2e-16 ***
## main_categoryMusic 1.899e-01 2.224e-02 8.536 <2e-16 ***
## main_categoryPhotography -3.911e-01 3.571e-02 -10.952 <2e-16 ***
## main_categoryPublishing -4.764e-01 2.459e-02 -19.372 <2e-16 ***
## main_categoryTechnology -5.930e-01 3.328e-02 -17.820 <2e-16 ***
## main_categoryTheater 7.762e-01 3.382e-02 22.950 <2e-16 ***
## usd_goal_real -2.257e-04 1.340e-06 -168.432 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 386741 on 295507 degrees of freedom
## Residual deviance: 182413 on 295490 degrees of freedom
## AIC: 182449
##
## Number of Fisher Scoring iterations: 13
STEP 6: Predictions
Use the model you’ve inferred from the previous step to predict the success outcomes in the test set.
# predict the probability of success for the test set using the fitted model
ks_p <- predict(glm_ks, ks_te, type = "response")
# convert the probabilities into class labels at a 0.5 threshold
ks.pred <- rep("No", nrow(ks_te))
ks.pred[ks_p > .5] <- "Yes"
ks_te$ks.pred <- ks.pred
# produce a confusion matrix
acc_table <- table(ks.pred,ks_te$success)
acc_table
##
## ks.pred 0 1
## No 45585 5169
## Yes 1758 21366
##
## ks.pred_tr 0 1
## No 181631 20866
## Yes 7004 86007
## [1] "The misclassification rate of the predictions for the training sets is: 0.094312"
## [1] "The misclassification rate of the predictions for the test sets is: 0.093763"
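The code that turns the confusion matrices into the rates printed above is not shown. A minimal sketch of one way to do it follows, using the test-set counts copied from the output; the names `test_cm` and `misclass_rate` are illustrative and not taken from the original code.

```r
# Confusion matrix for the test set, copied from the output above
# (rows = predicted label, columns = actual outcome).
test_cm <- matrix(c(45585, 1758, 5169, 21366), nrow = 2,
                  dimnames = list(pred = c("No", "Yes"),
                                  actual = c("0", "1")))

# Correct predictions sit on the diagonal, so the misclassification
# rate is one minus the share of diagonal counts.
misclass_rate <- function(cm) 1 - sum(diag(cm)) / sum(cm)

test_rate <- misclass_rate(test_cm)
round(test_rate, 6)
```

The same function applied to the training-set matrix should reproduce the training rate reported above.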
# sample 5% of the data set to apply LOOCV in order to reduce running time
ks_df_sam <- ks_df[sample(nrow(ks_df),as.integer(nrow(ks_df)*0.05)),]
sprintf("The misclassification rate of the training set is: raw: %f, adjusted: %f.", cv.err$delta[1], cv.err$delta[2])
## [1] "The misclassification rate of the training set is: raw: 0.076014, adjusted: 0.076014."
sprintf("The misclassification rate of the test set is: raw: %f, adjusted: %f.", cv.err.test$delta[1], cv.err.test$delta[2])
## [1] "The misclassification rate of the test set is: raw: 0.362208, adjusted: 0.081246."
The adjusted error rates of the training and test sets resulting from the LOOCV are quite similar. The raw cross-
validation estimate of prediction error for the test set is much higher, probably because the test sample is too small
for the model trained in each round to be fitted well. And since each model is trained on approximately identical data
sets, the models are correlated with each other, which adds to the variance of the estimate.
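The `cv.err$delta` values above suggest that `cv.glm` from the boot package was used, although that call is not shown. To make explicit what LOOCV computes, here is a hand-rolled sketch in base R on a small synthetic data set; the data frame `toy` and its coefficients are made up for illustration and have nothing to do with the Kickstarter data.

```r
set.seed(1)

# Synthetic stand-in data: a binary outcome driven by one predictor.
n <- 60
toy <- data.frame(x = rnorm(n))
toy$y <- rbinom(n, 1, plogis(1.5 * toy$x))

# Leave each row out in turn, refit the model on the remaining rows,
# and predict the probability for the single held-out row.
loo_pred <- vapply(seq_len(n), function(i) {
  fit <- glm(y ~ x, family = binomial, data = toy[-i, ])
  predict(fit, newdata = toy[i, , drop = FALSE], type = "response")
}, numeric(1))

# Misclassification cost at a 0.5 threshold, averaged over held-out rows.
loocv_error <- mean((loo_pred > 0.5) != toy$y)
loocv_error
```

Each of the 60 fits shares 58 of its 59 training rows with any other fit, which is the sense in which the LOOCV models are correlated with each other.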
STEP 9: Explanations
Compare the misclassification rates from the simple method to the LOOCV method.
The simple method gives misclassification rates of about 0.094 for both the training and test sets, while the adjusted
LOOCV estimates are somewhat lower, at about 0.076 and 0.081.
In this project, we applied cross-validation to the training set and the test set separately and compared the
resulting misclassification rates. In real-world cases, however, cross-validation is typically used when there are too
few data points to set aside a separate test set but we still want to estimate the out-of-sample fit in order to select
the best model.
Linear Regression
STEP 1
Using the lalonde data set, run a linear regression that models re78 as a function of age, education,
re74, re75, hisp, and black.
data(lalonde)
# fit a linear regression model
lm1 <- lm(re78~age+educ+re74+re75+hisp+black, data=lalonde)
summary(lm1)
##
## Call:
## lm(formula = re78 ~ age + educ + re74 + re75 + hisp + black,
## data = lalonde)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9065 -4577 -1775 3186 55037
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.126e+03 2.434e+03 0.463 0.6437
## age 5.928e+01 4.409e+01 1.345 0.1794
## educ 4.293e+02 1.758e+02 2.442 0.0150 *
## re74 7.462e-02 7.699e-02 0.969 0.3330
## re75 6.676e-02 1.314e-01 0.508 0.6116
## hisp -2.125e+02 1.546e+03 -0.137 0.8907
## black -2.323e+03 1.162e+03 -1.999 0.0462 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6544 on 438 degrees of freedom
## Multiple R-squared: 0.03942, Adjusted R-squared: 0.02627
## F-statistic: 2.996 on 6 and 438 DF, p-value: 0.007035
STEP 2
Report coefficients and R-squared.
print(coef(lm1))
# hand-calculation of R-squared
# fitted values from the model (defined here so the block runs on its own)
predicted_re78 <- predict(lm1)
# named mean_re78 to avoid masking the built-in mean() function
mean_re78 <- mean(lalonde$re78)
# SSR (regression sum of squares): how far the fitted regression line is from the horizontal 'no relationship' line, the sample mean
SSR <- sum((predicted_re78 - mean_re78)^2)
# SSTO (total sum of squares): how much the data points vary around their mean
SSTO <- sum((lalonde$re78 - mean_re78)^2)
# SSE (error sum of squares): how much the data points vary around the fitted line
SSE <- sum((lalonde$re78 - predicted_re78)^2)
R_squared <- SSR/SSTO
sprintf("The calculated R-squared by hand is: %s, the answer is the same with the R-squared from the summary statistics above.", round(R_squared, 3))
## [1] "The calculated R-squared by hand is: 0.039, the answer is the same with the R-squared from the summary statistics above."
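The hand calculation relies on the decomposition SSTO = SSR + SSE, which holds for a least-squares fit with an intercept. The sketch below checks this on R's built-in `mtcars` data, used only as a stand-in so the example runs without the lalonde data; the model `mpg ~ wt + hp` is an arbitrary choice.

```r
# Stand-in model on a built-in data set.
fit <- lm(mpg ~ wt + hp, data = mtcars)

y     <- mtcars$mpg
y_hat <- fitted(fit)
y_bar <- mean(y)

ssr  <- sum((y_hat - y_bar)^2)  # variation explained by the regression
sse  <- sum((y - y_hat)^2)      # variation left in the residuals
ssto <- sum((y - y_bar)^2)      # total variation around the mean

# Two equivalent ways of writing R-squared.
r2_from_ssr <- ssr / ssto
r2_from_sse <- 1 - sse / ssto

all.equal(r2_from_ssr, r2_from_sse)
all.equal(r2_from_ssr, summary(fit)$r.squared)
```

Both comparisons should agree up to floating-point tolerance, which is why SSR/SSTO matches the Multiple R-squared in the summary output.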
STEP 3
Then, setting all the predictors at their means EXCEPT education, create a data visualization that shows the
95% confidence interval of the expected values of re78 as education varies from 3 to 16. Be sure to include
axis labels and figure titles.
set.seed(254)
# setting all predictors at their mean
age = mean(lalonde$age)
re74 = mean(lalonde$re74)
re75 = mean(lalonde$re75)
hisp = mean(lalonde$hisp)
black = mean(lalonde$black)
# create a function to calculate the dependent variable given coefficients and independent variables
calc_turnout <- function(coef, person) {
  turnout <- coef[1] +
    coef[2] * person[1] +
    coef[3] * person[2] +
    coef[4] * person[3] +
    coef[5] * person[4] +
    coef[6] * person[5] +
    coef[7] * person[6]
  return(turnout)
}
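library(arm)  # provides sim()
# one row per repetition, one column per education level (3 to 16)
storage.matrix_exp <- matrix(NA, nrow = 1000, ncol = 14)
for (j in c(1:1000)) {
  # simulate 1000 sets of coefficients
  sim_exp <- sim(lm1, 1000)
  sim_coef_exp <- sim_exp@coef
  for (educ in c(3:16)) {
    person <- c(age, educ, re74, re75, hisp, black)
    store <- rep(0, 1000)
    for (i in c(1:1000)) {
      store[i] <- calc_turnout(sim_coef_exp[i, ], person)
    }
    # expected value: average over the 1000 simulated coefficient sets
    storage.matrix_exp[j, educ - 2] <- mean(store)
  }
}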
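The step from simulated values to a plotted 95% interval is not shown. The following is a compact, vectorized sketch of that step under several assumptions: it uses the built-in `mtcars` data as a stand-in, it draws coefficients directly from a multivariate normal built from `vcov()` rather than calling `sim` from arm, and it holds the residual standard deviation fixed, which `sim` may not do. It illustrates the idea and is not the code behind the original figure.

```r
set.seed(1)

# Stand-in model: mpg as a function of weight and horsepower.
fit <- lm(mpg ~ wt + hp, data = mtcars)

n_sims <- 1000
k <- length(coef(fit))

# Draw coefficient vectors from N(beta_hat, V): if Z has independent
# standard normal entries, Z %*% chol(V) has covariance V.
z <- matrix(rnorm(n_sims * k), nrow = n_sims)
beta_sim <- sweep(z %*% chol(vcov(fit)), 2, coef(fit), "+")

# Vary one predictor over a grid and hold the other at its mean.
wt_grid <- seq(2, 5, by = 0.5)
x_new <- cbind(1, wt_grid, mean(mtcars$hp))

# One row per simulation, one column per grid point.
ev <- beta_sim %*% t(x_new)

# 95% interval per grid point from the 2.5% and 97.5% quantiles.
ci <- apply(ev, 2, quantile, probs = c(0.025, 0.975))

plot(range(wt_grid), range(ci), type = "n",
     xlab = "Weight (1000 lbs)", ylab = "Expected mpg",
     main = "95% intervals of expected values (stand-in data)")
segments(wt_grid, ci[1, ], wt_grid, ci[2, ], lwd = 2)
```

With the lalonde model, the grid would be `3:16` for education and the remaining columns of `x_new` would hold the predictor means defined earlier.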
STEP 4
Then, do the same thing, but this time for the predicted values of re78. Be sure to include axis labels and figure
titles.
set.seed(25)
# simulate 1000 sets of coefficients
sim_pred <- sim(lm1, 1000)
sim_coef_pred <- sim_pred@coef
Findings
The confidence intervals of the expected values are narrower than those of the predicted values. For the expected
values, the spread is narrower because averaging over the 1000 simulated predictions reduces the variance, so
the values concentrate around the mean.
The result is that years of schooling (3-16) has a positive influence on predicted real earnings in 1978 when
the other predictors are held at their means, and the relationship is linear. The confidence intervals of the expected
values narrow as years of schooling approaches 9 and widen after that, meaning that the estimate is
most certain when years of schooling is around 9 and least certain at the two extremes. The expected earnings
for 3 years of schooling are around $2000 and increase linearly to around $8000 for 16 years of schooling.
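Both findings, that intervals for expected values are narrower than those for predicted values and that they are narrowest in the middle of the predictor's range, can also be checked analytically with `predict()`. The sketch below uses the built-in `mtcars` data as a stand-in and a single predictor for simplicity; it is an illustration, not a reproduction of the lalonde figures.

```r
# Stand-in model with one predictor.
fit <- lm(mpg ~ wt, data = mtcars)
grid <- data.frame(wt = seq(1.5, 5.5, by = 0.5))

# Interval for the expected value (the regression line itself).
conf <- predict(fit, newdata = grid, interval = "confidence", level = 0.95)
# Interval for a new observation (regression line plus residual noise).
pred <- predict(fit, newdata = grid, interval = "prediction", level = 0.95)

conf_width <- conf[, "upr"] - conf[, "lwr"]
pred_width <- pred[, "upr"] - pred[, "lwr"]

# Prediction intervals are wider at every grid point.
all(pred_width > conf_width)

# The confidence interval is narrowest at the grid point closest to
# the sample mean of the predictor.
grid$wt[which.min(conf_width)]
mean(mtcars$wt)
```

The prediction interval carries the residual variance on top of the coefficient uncertainty, which is why it is wider everywhere.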
Logistic Regression
STEP 1
Using the lalonde data set, run a logistic regression, modeling treatment status as a function of age,
education, hisp, re74 and re75. Report and interpret the regression coefficient and 95% confidence
intervals for age and education.
print(coef(glm.treat))
The regression coefficients show that a one-unit increase in age increases the log odds of treatment by
2.816 * 10^(-3), and a one-unit increase in years of schooling increases the log odds of treatment by
1.629 * 10^(-2). However, a one-unit increase in hisp (0 to 1) decreases the log odds of treatment by 1.336 *
10^(-1), and a one-unit increase in earnings in 1974 decreases the log odds of treatment by 5.235 *
10^(-6). A one-unit increase in earnings in 1975 increases the log odds of treatment by 1.226 * 10^(-5).
Each effect holds the other predictors fixed. The 95% confidence intervals for 'age' and 'education' mean that,
under repeated sampling, about 95% of intervals constructed this way would contain the true coefficient of each variable.
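Log-odds coefficients are often easier to read as odds ratios. The sketch below shows one way to get Wald 95% intervals and odds ratios with `confint.default()` and `exp()`. The data frame `toy` is synthetic and its coefficients are invented for illustration, so the numbers it prints are unrelated to the lalonde estimates quoted above.

```r
set.seed(2)

# Synthetic stand-in data with the same kind of predictors.
n <- 500
toy <- data.frame(age  = round(runif(n, 18, 55)),
                  educ = sample(3:16, n, replace = TRUE))
toy$treat <- rbinom(n, 1, plogis(-1 + 0.01 * toy$age + 0.05 * toy$educ))

fit <- glm(treat ~ age + educ, family = binomial, data = toy)

# Coefficients on the log-odds scale.
log_odds <- coef(fit)[c("age", "educ")]

# Wald intervals: estimate +/- normal quantile * standard error.
wald_ci <- confint.default(fit, parm = c("age", "educ"), level = 0.95)

# Exponentiating gives the multiplicative change in the odds per unit.
odds_ratio <- exp(log_odds)

cbind(estimate = log_odds, wald_ci, odds_ratio = odds_ratio)
```

An odds ratio above 1 corresponds to a positive log-odds coefficient, and an interval that includes 0 on the log-odds scale includes 1 on the odds-ratio scale.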
STEP 2
Use a simple bootstrap to estimate (and report) bootstrapped confidence intervals for age and education
given the logistic regression above. Code the bootstrap algorithm yourself.
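set.seed(42)
# return the coefficient of age from a logistic regression fitted on the rows in index
age.fn <- function(data, index) {
  d <- data[index, ]
  return(coef(glm(treat ~ age + educ + hisp + re74 + re75, data = d, family = binomial))[2])
}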
sprintf("The bootstrapped confidence interval for 'age' is [%f, %f].", age_conf[1], age_conf[2])
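The resampling loop that produces `age_conf` is not shown. A minimal sketch of a hand-coded percentile bootstrap follows, assuming a statistic function with the same `(data, index)` shape as `age.fn`. It runs on a synthetic data frame `toy` with invented coefficients, and the names `coef_fn`, `n_boot`, `boot_coefs` and `boot_ci` are illustrative.

```r
set.seed(3)

# Synthetic stand-in data.
n <- 300
toy <- data.frame(age = round(runif(n, 18, 55)))
toy$treat <- rbinom(n, 1, plogis(-0.5 + 0.02 * toy$age))

# Statistic of interest: the age coefficient on the resampled rows.
coef_fn <- function(data, index) {
  d <- data[index, , drop = FALSE]
  coef(glm(treat ~ age, data = d, family = binomial))["age"]
}

# Resample row indices with replacement, refit, and keep the coefficient.
n_boot <- 500
boot_coefs <- replicate(n_boot,
                        coef_fn(toy, sample(nrow(toy), replace = TRUE)))

# Percentile interval: the middle 95% of the bootstrap estimates.
boot_ci <- quantile(boot_coefs, probs = c(0.025, 0.975))
boot_ci
```

Each bootstrap sample has the same number of rows as the original data, so the spread of `boot_coefs` approximates the sampling variability of the coefficient.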
STEP 3
Then, using the simulation-based approach and the arm library, set all the predictors at their means EXCEPT
education, and create a data visualization that shows the 95% confidence interval of the expected values of the
probability of receiving treatment as education varies from 3 to 16. Be sure to include axis labels and figure titles.
set.seed(43)
glm.treat <- glm(treat~age+educ+hisp+re74+re75,data=lalonde,family=binomial)
# create a function to calculate the dependent variable given coefficients and independent variables
calc_turnout_2 <- function(coef, person) {
  turnout <- coef[1] +
    coef[2] * person[1] +
    coef[3] * person[2] +
    coef[4] * person[3] +
    coef[5] * person[4] +
    coef[6] * person[5]
  # inverse logit: convert the linear predictor (log odds) into a probability
  return(exp(turnout) / (1 + exp(turnout)))
}
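The last line of the function above is the inverse logit written out by hand. For the small linear predictors in this model that works, but the hand-written form breaks down for very large values, where `exp()` overflows. The sketch below compares it with base R's `plogis()`, which computes the same quantity; `inv_logit_naive` is an illustrative name.

```r
# Hand-written inverse logit, as in the function above.
inv_logit_naive <- function(eta) exp(eta) / (1 + exp(eta))

eta <- c(-800, -2, 0, 2, 800)

# exp(800) overflows to Inf, so the last element is Inf / Inf = NaN.
naive <- inv_logit_naive(eta)

# plogis() handles the extremes without overflow.
stable <- plogis(eta)

cbind(eta, naive, stable)
```

For moderate values the two agree, so swapping in `plogis()` would not change the results here; it only removes the overflow risk.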
STEP 4
Then, do the same thing, but this time for the predicted values of the probability of receiving treatment as
education varies from 3 to 16. Be sure to include axis labels and figure titles.
set.seed(44)
# simulate 1000 sets of coefficients
sim_pred_2 <- sim(glm.treat, 1000)
sim_coef_pred_2 <- sim_pred_2@coef
# create a function to calculate the dependent variable given coefficients and independent variables
calc_turnout_2 <- function(coef, person) {
  turnout <- coef[1] +
    coef[2] * person[1] +
    coef[3] * person[2] +
    coef[4] * person[3] +
    coef[5] * person[4] +
    coef[6] * person[5]
  # inverse logit: convert the linear predictor (log odds) into a probability
  return(exp(turnout) / (1 + exp(turnout)))
}
The graph is similar to the one from the previous exercise, with the confidence intervals of the expected values
narrower than those of the predicted values. The expected probability of receiving treatment for people with 3 years
of schooling is around 0.3 and increases roughly linearly with years of schooling, to around 0.5 for 16 years of
schooling. The intervals are of similar width across all values of education, meaning that the uncertainty is similar.