Bitwise cst383 Final Project
# https://stackoverflow.com/questions/27934885/how-to-hide-code-from-cells-in-ipython-notebook-visualized-with-nbviewer
from IPython.display import HTML
HTML("""
<script>
code_show = false
function code_toggle() {
    if (code_show) {
        $("div.input").hide()
    }
    else {
        $("div.input").show()
    }
    code_show = !code_show
}
$(document).ready(code_toggle)
</script>
<form action="javascript:code_toggle()"><input type="submit"
value="Click here to display/hide the code."></form>
""")
<IPython.core.display.HTML object>
Setup
from datetime import datetime
from matplotlib import rcParams
from scipy.stats import zscore
from sklearn.linear_model import LinearRegression
from sklearn.metrics import roc_curve, auc, roc_auc_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier, export_graphviz
import graphviz
import itertools
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import time
sns.set()
rcParams['figure.figsize'] = 8,6
Global Functions
def print_conf_matrix(y_correct, y_predict, labels):
    """Print a labeled 2x2 confusion matrix, then the percent correct."""
    formatSize = 15  # Change this if your labels are long
    # confusion_matrix puts actual classes on rows and predicted classes on
    # columns, so "Predict" heads the columns and "Actual" heads the rows.
    matrix = confusion_matrix(y_correct, y_predict)
    print('{:^{width}}'.format("Predict", width=formatSize * 3))
    print('{:^{width}}'.format("Actual", width=formatSize), end='')
    print('{:^{width}}'.format(labels[0], width=formatSize), end='')
    print('{:^{width}}'.format(labels[1], width=formatSize))
    print('{:^{width}}'.format(labels[0], width=formatSize), end='')
    print('{:^{width}}'.format(matrix[0, 0], width=formatSize), end='')
    print('{:^{width}}'.format(matrix[0, 1], width=formatSize))
    print('{:^{width}}'.format(labels[1], width=formatSize), end='')
    print('{:^{width}}'.format(matrix[1, 0], width=formatSize), end='')
    print('{:^{width}}'.format(matrix[1, 1], width=formatSize))
    # print_percent_correct is another helper; its definition is not shown here
    print_percent_correct(y_correct, y_predict)
Preprocessing
# Read the data
df = pd.read_csv("ks-projects-201801.csv")
# Remove the goal and "usd pledged" columns; they have some bad data
df = df.drop(columns=["goal", "usd pledged"])
# Remove all rows launched before 2000. The only ones that exist
# are from 1970 (Unix start date)
df = df.drop(df[df["launched"] < "2000-01-01"].index)
# Assumed step (not shown in the original cell): parse the date columns,
# which the info() output lists as datetime64[ns]
df["launched"] = pd.to_datetime(df["launched"])
df["deadline"] = pd.to_datetime(df["deadline"])
# Convert launched and deadline to days since the Unix epoch
# (nanoseconds -> seconds -> minutes -> hours -> days)
# Decision Trees need features to be numeric
df["launchedUX"] = df["launched"].astype('int64') / 10**9 / 60 / 60 / 24
df["deadlineUX"] = df["deadline"].astype('int64') / 10**9 / 60 / 60 / 24
<class 'pandas.core.frame.DataFrame'>
Index: 370216 entries, 0 to 378660
Data columns (total 11 columns):
 #   Column         Non-Null Count   Dtype
---  ------         --------------   -----
 0   category       370216 non-null  object
 1   main_category  370216 non-null  object
 2   deadline       370216 non-null  datetime64[ns]
 3   launched       370216 non-null  datetime64[ns]
 4   state          370216 non-null  object
 5   country        370216 non-null  object
 6   usd_goal_real  370216 non-null  float64
 7   duration       370216 non-null  int64
 8   launchedUX     370216 non-null  float64
 9   deadlineUX     370216 non-null  float64
 10  stateInt       370216 non-null  int32
dtypes: datetime64[ns](2), float64(3), int32(1), int64(1), object(4)
memory usage: 32.5+ MB
clas_df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 370216 entries, 0 to 378660
Data columns (total 5 columns):
 #   Column         Non-Null Count   Dtype
---  ------         --------------   -----
 0   usd_goal_real  370216 non-null  float64
 1   duration       370216 non-null  int64
 2   launchedUX     370216 non-null  float64
 3   deadlineUX     370216 non-null  float64
 4   stateInt       370216 non-null  int32
dtypes: float64(3), int32(1), int64(1)
memory usage: 15.5 MB
Data Munging
Dropped Data Fields:
1. ID - Dropped because it is an internal Kickstarter value, irrelevant to our project.
2. name - Dropped because it is not useful for our classification models.
3. currency - Dropped because it is not useful for our classification models.
4. goal - Dropped because it contains a large amount of bad data.
5. pledged - Dropped because it does not align with our goal of predicting a campaign's
success from its initial characteristics.
6. backers - Dropped because it does not align with our goal of predicting a campaign's
success from its initial characteristics.
7. usd pledged - Dropped because it does not align with our goal of predicting a campaign's
success from its initial characteristics.
8. usd_pledged_real - Dropped because it does not align with our goal of predicting a
campaign's success from its initial characteristics.
Feature Engineering
Added Data Fields:
1. launchedUX - The time of the project's launch, measured in days since the Unix epoch
(January 1, 1970 at 00:00:00).
2. deadlineUX - The time of the project's deadline, measured in days since the Unix epoch
(January 1, 1970 at 00:00:00).
3. duration - The number of days that the project was active.
4. stateInt - 0 for a failed or canceled state, 1 for a successful state.
1. category
2. main_category
3. country
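The cells that build these fields are not all shown, so the following is only a minimal sketch of how the added fields could be derived. The toy rows, the `toy` and `epoch` names, and the exact formulas (in particular how `duration` is rounded, and treating every non-successful state as 0) are assumptions for illustration, not the notebook's actual code.

```python
import pandas as pd

# Toy rows standing in for the Kickstarter CSV (illustrative values only).
toy = pd.DataFrame({
    "launched": ["2015-08-11 12:12:28", "2017-09-02 04:43:57", "2013-01-12 00:20:50"],
    "deadline": ["2015-10-09", "2017-11-01", "2013-02-26"],
    "state": ["failed", "successful", "canceled"],
})

toy["launched"] = pd.to_datetime(toy["launched"])
toy["deadline"] = pd.to_datetime(toy["deadline"])

epoch = pd.Timestamp("1970-01-01")
one_day = pd.Timedelta(days=1)

# Fractional days since the Unix epoch.
toy["launchedUX"] = (toy["launched"] - epoch) / one_day
toy["deadlineUX"] = (toy["deadline"] - epoch) / one_day

# Whole days between launch and deadline (assumed definition of duration).
toy["duration"] = (toy["deadline"] - toy["launched"]).dt.days

# 1 for successful campaigns, 0 for every other state (assumption).
toy["stateInt"] = (toy["state"] == "successful").astype(int)

print(toy[["launchedUX", "deadlineUX", "duration", "stateInt"]])
```

Subtracting the epoch and dividing by a one-day `Timedelta` gives the same quantity as casting to `int64` nanoseconds and dividing down, but keeps the unit conversion explicit.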
Visualization
dfSuccess = df[df["state"] == "successful"]
dfFailed = df[df["state"] == "failed"]
We began our exploration by looking at the percentage of projects that were successful,
grouped by Main Category.
1. Comics
2. Dance
3. Theater
Music is an honorable mention, coming in just shy with a 49.09% success rate.
By quantifying the number of launched and successful projects by Main Category, we can see
that, out of the four previously mentioned Main Categories, Music was the only one that had a
relatively large number of projects; the others were generally niche.
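The per-category success rate and project counts described above can be computed with a single groupby. This is a sketch on made-up rows; the `toy` frame, the `summary` name, and its column names are hypothetical, and the numbers bear no relation to the real dataset.

```python
import pandas as pd

# Made-up rows; only the column names match the dataset.
toy = pd.DataFrame({
    "main_category": ["Comics", "Comics", "Music", "Music", "Music", "Dance"],
    "state": ["successful", "failed", "successful", "failed", "failed", "successful"],
})

# Count launched and successful projects per main category.
summary = (
    toy.assign(is_success=toy["state"] == "successful")
       .groupby("main_category")["is_success"]
       .agg(launched="count", successful="sum")
)

# Success rate as a percentage, highest first.
summary["success_pct"] = 100 * summary["successful"] / summary["launched"]
summary = summary.sort_values("success_pct", ascending=False)

print(summary)
```

Keeping the launched count next to the percentage makes it easy to see when a high success rate comes from a category with few projects.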
• Average project success rate stayed between 40% and 50% from 2009 to 2014.
• Average project success rate dropped to between 25% and 30% halfway through 2014 and stayed there until 2016.
• Average project success rate rose to between 30% and 40% from 2016 to 2018.
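A trend like the one summarized above can be obtained by grouping on the launch date. The sketch below groups by calendar year on made-up rows; the original analysis evidently used a finer time resolution, so the yearly binning and the `rate_by_year` name are assumptions.

```python
import pandas as pd

# Made-up rows; only the column names match the dataset.
toy = pd.DataFrame({
    "launched": pd.to_datetime(
        ["2013-03-01", "2013-07-15", "2015-02-10", "2015-11-30", "2017-05-05"]
    ),
    "state": ["successful", "failed", "failed", "failed", "successful"],
})

# The mean of a boolean Series is the fraction of True values.
rate_by_year = (
    (toy["state"] == "successful")
    .groupby(toy["launched"].dt.year)
    .mean()
    * 100
)

print(rate_by_year)
```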
# Percentage of Successful Project by Launch Month
successPercentage = 100 * (
    dfSuccess['launched'].dt.month_name().value_counts()
    / df['launched'].dt.month_name().value_counts()
)
plt.figure(figsize=(36, 12))
sns.barplot(x=successPercentage.index, y=successPercentage.values)
plt.title('Percentage of Successful Projects by Launch Month', fontsize=30)
plt.xlabel('Month', fontsize=30)
plt.ylabel('Rate of Success', fontsize=30)
plt.xticks(rotation=45, ha='right', fontsize=20)
plt.yticks(fontsize=20)
plt.tight_layout()
plt.show()
All launch months have a similar percentage of successful projects; the most successful
months were February, March, and April.
df_scale = df.copy()
# drop() returns a new DataFrame, so the result must be assigned
df_scale = df_scale.drop(['category', 'main_category', 'deadline', 'launched',
                          'country', 'launchedUX', 'deadlineUX', 'stateInt',
                          'launched_month'], axis=1)
df_scale = df_scale[df_scale["usd_goal_real"] <= 100000]
df_scale = df_scale.sample(frac=0.005)
plt.figure(figsize=(30, 18))
sns.scatterplot(data=df_scale, x="usd_goal_real", y="duration", hue="state")
plt.xlabel("USD_goal_real", fontsize=30)
plt.ylabel("Duration in days", fontsize=30)
plt.title("Goal Vs Duration with State", fontsize=30)
plt.xticks(fontsize=20)
plt.yticks(fontsize=20)
plt.legend(title="State", fontsize=30)
plt.show()
The goal does not appear to scale with a campaign's duration, although the majority of
successful projects seem to have lower goals.
def calc_success_rate_by_duration_days(x):
    # x holds the counts for one duration, indexed by (duration, state)
    duration_days = x.index[0][0]
    successful_count = x.loc[(duration_days, "successful")]
    failed_count = x.loc[(duration_days, "failed")]
    return successful_count / (successful_count + failed_count)

# df_duration_state_counts is built in a cell that is not shown here
df_success_rate_by_duration_days = df_duration_state_counts.groupby(level=0).apply(
    calc_success_rate_by_duration_days
)
plt.figure(figsize=(36, 12))
success_rate_by_duration_days = sns.barplot(
    x=df_success_rate_by_duration_days.index,
    y=df_success_rate_by_duration_days
)
Campaign success rate appears to decline gradually from about 60% to about 30% as campaign
duration increases from 0 to 60 days.
It then jumps back up to about 50% and declines gradually again to about 30% for campaigns
with a duration between 60 and 90 days.
The outliers are campaign durations of 30 and 60 days. This can be explained by the previous
visualization, which showed that almost all campaigns had a launch-to-deadline time of either
30 or 60 days.
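The cell that builds the per-duration counts is not shown. As a rough sketch under that caveat, the same success rate can be computed by counting rows per (duration, state) pair and unstacking with a fill value of 0, which avoids a missing-key lookup when a duration has no failed or no successful campaigns. The `toy`, `counts`, and `success_rate` names and the sample rows are hypothetical.

```python
import pandas as pd

# Made-up rows; only the column names match the dataset.
toy = pd.DataFrame({
    "duration": [30, 30, 30, 60, 60, 45],
    "state": ["successful", "failed", "failed", "successful", "failed", "successful"],
})

# Keep only the two outcomes used in the success-rate formula.
toy = toy[toy["state"].isin(["successful", "failed"])]

# One row per duration, one column per state, 0 where a combination is absent.
counts = toy.groupby(["duration", "state"]).size().unstack(fill_value=0)

success_rate = counts["successful"] / (counts["successful"] + counts["failed"])

print(success_rate)
```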
Machine Learning
Since our goal is to classify Kickstarter projects as "success" or "failed", we chose to rely
primarily on Decision Trees for our machine learning analysis, while also testing k-nearest
neighbors (kNN) classification.
Model 1 - kNN
start = time.time()
It appears that even a simple classification algorithm such as kNN, given a large enough k value,
is able to sufficiently generalize from the training data to produce a model with an accuracy of
approximately 68%.
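The kNN cells themselves are mostly missing from this section, so the following is only a sketch of the general approach on synthetic data. The feature choice (log of the goal plus duration), the synthetic labels, and k=25 are all assumptions chosen for illustration; they are not the notebook's settings and the printed accuracy says nothing about the real dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 2000

# Synthetic stand-ins for usd_goal_real and duration.
goal = rng.lognormal(mean=9, sigma=1.5, size=n)
duration = rng.integers(1, 92, size=n)

# Synthetic label: lower goals are more likely to succeed.
p_success = 1 / (1 + np.exp(np.log(goal) - 9))
y = (rng.random(n) < p_success).astype(int)
X = np.column_stack([np.log(goal), duration])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# kNN is distance based, so scale features using training statistics only.
scaler = StandardScaler().fit(X_train)
knn = KNeighborsClassifier(n_neighbors=25)
knn.fit(scaler.transform(X_train), y_train)

accuracy = knn.score(scaler.transform(X_test), y_test)
print("Test accuracy: {:.3f}".format(accuracy))
```

A larger k averages over more neighbors, which smooths the decision boundary and is one reason a simple model can still generalize on noisy data.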
# Category
# Duration
# LaunchedUX
# Main_Category
# Country
# usd_goal_real
# Depth: 16
# min_samples_split: 2
# min_samples_leaf: 83
# min_impurity_decrease: 0.0
best_max_depth = 16
best_min_samples_split = 2
best_min_samples_leaf = 83
best_min_impurity_decrease = 0.0
best_permutation = []
score_board = []
index = 0
start = time.time()
#index += 1
#print("Index {}".format(index))
clf.fit(X_train, y_train)
y_predict_train = clf.predict(X_train)
y_predict_test = clf.predict(X_test)
# Confusion Matrix
print_conf_matrix(y_test, y_predict_test, ["Failed", "Success"])
# Used when trying to validate the best values for targeted arguments
#scores = cross_val_score(estimator=clf, X=x, y=y, cv = 5, n_jobs=-1)
#score_board.append((i, scores.mean()))
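The construction of clf is not shown in the cells above. A minimal sketch of how a DecisionTreeClassifier could be built from the best_* values and checked with cross_val_score follows. The synthetic features and labels are assumptions for illustration; only the four hyperparameter values come from the notebook.

```python
import numpy as np
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 3000

# Synthetic stand-ins for usd_goal_real and duration.
X = np.column_stack([rng.lognormal(9, 1.5, n), rng.integers(1, 92, n)])
# Synthetic label: lower goals are more likely to succeed.
y = (rng.random(n) < 1 / (1 + np.exp(np.log(X[:, 0]) - 9))).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Hyperparameter values taken from the notebook's best_* variables.
clf = DecisionTreeClassifier(
    max_depth=16,
    min_samples_split=2,
    min_samples_leaf=83,
    min_impurity_decrease=0.0,
    random_state=0,
)
clf.fit(X_train, y_train)
test_accuracy = clf.score(X_test, y_test)

# 5-fold cross-validation, as in the commented-out validation line.
scores = cross_val_score(estimator=clf, X=X, y=y, cv=5)
print("Test accuracy: {:.3f}".format(test_accuracy))
print("Mean CV accuracy: {:.3f}".format(scores.mean()))
```

With min_samples_leaf at 83, every leaf must cover at least 83 training rows, so the tree usually stops splitting well before it reaches the maximum depth of 16.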