Machine Learning Code File

This document preprocesses a Kickstarter projects dataset for a machine learning model that predicts Kickstarter project success. It loads the dataset, displays sample rows and column information, cleans the data by dropping the unneeded 'ID' and 'name' columns, and checks for null values.

10/15/23, 10:26 PM kickstarter-success-prediction - Jupyter Notebook

Getting Started
In [1]:  import numpy as np
import pandas as pd

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.utils import class_weight

import tensorflow as tf

In [2]:  data = pd.read_csv('../input/kickstarter-projects/ks-projects-201801.csv')

localhost:8888/notebooks/kickstarter-success-prediction.ipynb 1/10

In [3]:  data

Out[3]:
        ID          name                                               category         main_category  currency  deadline       goal  ...
0       1000002330  The Songs of Adelaide & Abullah                    Poetry           Publishing     GBP       2015-10-09   1000.0  ...
1       1000003930  Greeting From Earth: ZGAC Arts Capsule For ET      Narrative Film   Film & Video   USD       2017-11-01  30000.0  ...
2       1000004038  Where is Hank?                                     Narrative Film   Film & Video   USD       2013-02-26  45000.0  ...
3       1000007540  ToshiCapital Rekordz Needs Help to Complete Album  Music            Music          USD       2012-04-16   5000.0  ...
4       1000011046  Community Film Project: The Art of Neighborhoo...  Film & Video     Film & Video   USD       2015-08-29  19500.0  ...
...     ...         ...                                                ...              ...            ...       ...             ...  ...
378656  999976400   ChknTruk Nationwide Charity Drive 2014 (Canceled)  Documentary      Film & Video   USD       2014-10-17  50000.0  ...
378657  999977640   The Tribe                                          Narrative Film   Film & Video   USD       2011-07-19   1500.0  ...
378658  999986353   Walls of Remedy- New lesbian Romantic Comedy f...  Narrative Film   Film & Video   USD       2010-08-16  15000.0  ...
378659  999987933   BioDefense Education Kit                           Technology       Technology     USD       2016-02-13  15000.0  ...
378660  999988282   Nou Renmen Ayiti! We Love Haiti!                   Performance Art  Art            USD       2011-08-16   2000.0  ...

378661 rows × 15 columns


In [4]:  data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 378661 entries, 0 to 378660
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ID 378661 non-null int64
1 name 378657 non-null object
2 category 378661 non-null object
3 main_category 378661 non-null object
4 currency 378661 non-null object
5 deadline 378661 non-null object
6 goal 378661 non-null float64
7 launched 378661 non-null object
8 pledged 378661 non-null float64
9 state 378661 non-null object
10 backers 378661 non-null int64
11 country 378661 non-null object
12 usd pledged 374864 non-null float64
13 usd_pledged_real 378661 non-null float64
14 usd_goal_real 378661 non-null float64
dtypes: float64(5), int64(2), object(8)
memory usage: 43.3+ MB

Cleaning and Preprocessing


In [5]:  unneeded_columns = ['ID', 'name']

data = data.drop(unneeded_columns, axis=1)

In [6]:  data.isna().sum()

Out[6]: category 0
main_category 0
currency 0
deadline 0
goal 0
launched 0
pledged 0
state 0
backers 0
country 0
usd pledged 3797
usd_pledged_real 0
usd_goal_real 0
dtype: int64

In [7]:  data['usd pledged'] = data['usd pledged'].fillna(data['usd pledged'].mean())

In [8]:  data.isna().sum().sum()
Out[8]: 0
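
As a minimal sketch of what the mean fill in In [7] does, the toy frame below (illustrative numbers, not taken from the dataset) replaces a missing entry with the mean of the non-missing ones and then repeats the null check from In [8]. The names `toy`, `fill_value` and `remaining_nulls` are made up for this example.

```python
import numpy as np
import pandas as pd

# Toy column with one missing value (illustrative numbers only)
toy = pd.DataFrame({'usd pledged': [10.0, np.nan, 30.0]})

# Series.mean() skips NaN by default, so the fill value is (10 + 30) / 2
fill_value = toy['usd pledged'].mean()
toy['usd pledged'] = toy['usd pledged'].fillna(fill_value)

# Same check as In [8]: total number of nulls left in the frame
remaining_nulls = int(toy.isna().sum().sum())
print(toy['usd pledged'].tolist(), remaining_nulls)
```

Mean imputation keeps the column average unchanged but shrinks its variance, which may matter less here because only about 1% of the rows are affected.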


In [9]:  data['state'].unique()

Out[9]: array(['failed', 'canceled', 'successful', 'live', 'undefined',
       'suspended'], dtype=object)

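In [10]:  # The end of this line was cut off in the printout. The `.index` lookup and the
# index reset are inferred from the contiguous 0..331674 index shown in Out[12].
data = data.drop(data.query("state != 'failed' and state != 'successful'").index).reset_index(drop=True)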

In [11]:  data['state'].unique()

Out[11]: array(['failed', 'successful'], dtype=object)
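
The same row filter can also be written as a boolean mask, which some readers find easier to follow than `drop` plus `query`. This is an alternative sketch on a hypothetical four-row frame, not the notebook's own code.

```python
import pandas as pd

# Toy frame with two decided and two undecided outcomes (illustrative values)
toy = pd.DataFrame({
    'state': ['failed', 'canceled', 'successful', 'live'],
    'goal': [1000.0, 500.0, 2000.0, 750.0],
})

# Keep only rows whose outcome is decided, then renumber the index from 0
kept = toy[toy['state'].isin(['failed', 'successful'])].reset_index(drop=True)
print(kept)
```

Resetting the index matters later only if rows are accessed by position-like labels; the model itself never sees the index.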

In [12]:  data

Out[12]:
        category         main_category  currency  deadline       goal  launched             pledged  state
0       Poetry           Publishing     GBP       2015-10-09   1000.0  2015-08-11 12:12:28      0.0  failed
1       Narrative Film   Film & Video   USD       2017-11-01  30000.0  2017-09-02 04:43:57   2421.0  failed
2       Narrative Film   Film & Video   USD       2013-02-26  45000.0  2013-01-12 00:20:50    220.0  failed
3       Music            Music          USD       2012-04-16   5000.0  2012-03-17 03:24:11      1.0  failed
4       Restaurants      Food           USD       2016-04-01  50000.0  2016-02-26 13:38:27  52375.0  successful
...     ...              ...            ...       ...             ...  ...                      ...  ...
331670  Small Batch      Food           USD       2017-04-19   6500.0  2017-03-20 22:08:22    154.0  failed
331671  Narrative Film   Film & Video   USD       2011-07-19   1500.0  2011-06-22 03:35:14    155.0  failed
331672  Narrative Film   Film & Video   USD       2010-08-16  15000.0  2010-07-01 19:40:30     20.0  failed
331673  Technology       Technology     USD       2016-02-13  15000.0  2016-01-13 18:13:53    200.0  failed
331674  Performance Art  Art            USD       2011-08-16   2000.0  2011-07-19 09:07:47    524.0  failed

331675 rows × 13 columns

Feature Engineering and Encoding


In [13]:  # np.float was removed in NumPy 1.24; the built-in float behaves the same here
data['deadline_year'] = data['deadline'].apply(lambda x: float(x[0:4]))
data['deadline_month'] = data['deadline'].apply(lambda x: float(x[5:7]))

data['launched_year'] = data['launched'].apply(lambda x: float(x[0:4]))
data['launched_month'] = data['launched'].apply(lambda x: float(x[5:7]))

data = data.drop(['deadline', 'launched'], axis=1)
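
String slicing works because both columns use a fixed `YYYY-MM-DD` prefix. An alternative sketch, assuming the same formats, parses the strings with `pd.to_datetime` and reads the parts from the `.dt` accessor. The two sample rows reuse the deadline and launch timestamps shown in the table above; the frame name `toy` is hypothetical.

```python
import pandas as pd

# Two rows with the date formats used by the dataset
toy = pd.DataFrame({
    'deadline': ['2015-10-09', '2017-11-01'],
    'launched': ['2015-08-11 12:12:28', '2017-09-02 04:43:57'],
})

for column in ['deadline', 'launched']:
    parsed = pd.to_datetime(toy[column])
    # Cast to float to match the dtype produced by the slicing approach
    toy[column + '_year'] = parsed.dt.year.astype(float)
    toy[column + '_month'] = parsed.dt.month.astype(float)

toy = toy.drop(['deadline', 'launched'], axis=1)
print(toy)
```

Parsing also makes it easy to derive features the slices cannot give directly, such as the campaign length in days.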

In [14]:  data['state'] = data['state'].apply(lambda x: 1 if x == 'successful' else 0)

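In [15]:  # The condition was cut off in the printout; selecting object-dtype columns is
# inferred from the string categories listed in the output.
{column: list(data[column].unique()) for column in data.columns if data.dtypes[column] == 'object'}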

Out[15]: {'category': ['Poetry',
  'Narrative Film',
  'Music',
  'Restaurants',
  'Food',
  'Drinks',
  'Nonfiction',
  'Indie Rock',
  'Crafts',
  'Games',
  'Tabletop Games',
  'Design',
  'Comic Books',
  'Art Books',
  'Fashion',
  'Childrenswear',
  'Theater',
  'Comics',
  'DIY',
  ...

In [16]:  def onehot_encode(df, columns, prefixes):
    df = df.copy()
    for column, prefix in zip(columns, prefixes):
        dummies = pd.get_dummies(df[column], prefix=prefix)
        df = pd.concat([df, dummies], axis=1)
        df = df.drop(column, axis=1)
    return df

In [17]:  data = onehot_encode(
    data,
    ['category', 'main_category', 'currency', 'country'],
    ['cat', 'main_cat', 'curr', 'country']
)
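
To make the effect of `onehot_encode` concrete, the sketch below applies the same function to a hypothetical three-row frame with a single categorical column. The frame and its values are invented for illustration.

```python
import pandas as pd

def onehot_encode(df, columns, prefixes):
    # Replace each listed column with one indicator column per category
    df = df.copy()
    for column, prefix in zip(columns, prefixes):
        dummies = pd.get_dummies(df[column], prefix=prefix)
        df = pd.concat([df, dummies], axis=1)
        df = df.drop(column, axis=1)
    return df

toy = pd.DataFrame({'goal': [1000.0, 30000.0, 45000.0],
                    'currency': ['GBP', 'USD', 'USD']})

encoded = onehot_encode(toy, ['currency'], ['curr'])
print(encoded)
```

Each category becomes its own `prefix_value` column, which is how four categorical columns expand the frame to the 222 columns reported in the next output.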


In [18]:  data

Out[18]:
           goal  pledged  state  backers  usd pledged  usd_pledged_real  usd_goal_real  ...
0        1000.0      0.0      0        0          0.0               0.0        1533.95  ...
1       30000.0   2421.0      0       15        100.0            2421.0       30000.00  ...
2       45000.0    220.0      0        3        220.0             220.0       45000.00  ...
3        5000.0      1.0      0        1          1.0               1.0        5000.00  ...
4       50000.0  52375.0      1      224      52375.0           52375.0       50000.00  ...
...         ...      ...    ...      ...          ...               ...            ...  ...
331670   6500.0    154.0      0        4          0.0             154.0        6500.00  ...
331671   1500.0    155.0      0        5        155.0             155.0        1500.00  ...
331672  15000.0     20.0      0        1         20.0              20.0       15000.00  ...
331673  15000.0    200.0      0        6        200.0             200.0       15000.00  ...
331674   2000.0    524.0      0       17        524.0             524.0        2000.00  ...

331675 rows × 222 columns

Splitting and Scaling


In [19]:  y = data.loc[:, 'state']
X = data.drop('state', axis=1)

In [20]:  scaler = StandardScaler()

X = scaler.fit_transform(X)

In [21]:  X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7,
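
The notebook fits the scaler on all of `X` before splitting, so the test rows contribute to the mean and standard deviation. A common alternative, sketched here with plain NumPy on invented numbers, is to compute the statistics on the training rows only and reuse them for the test rows. This is not what the notebook does; the array names are hypothetical.

```python
import numpy as np

# Toy feature column: four training rows and one (outlying) test row
X_toy = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])
X_train_toy, X_test_toy = X_toy[:4], X_toy[4:]

# Statistics come from the training rows only
mean = X_train_toy.mean(axis=0)
std = X_train_toy.std(axis=0)  # population std (ddof=0), as StandardScaler uses

X_train_scaled = (X_train_toy - mean) / std
X_test_scaled = (X_test_toy - mean) / std

print(mean, std, X_test_scaled.ravel())
```

With scikit-learn the same pattern is `fit` on the training split followed by `transform` on both splits.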

Modeling and Training


In [22]:  X.shape
Out[22]: (331675, 221)

In [23]:  y.mean()

Out[23]: 0.4038772895153388


In [24]:  # Newer scikit-learn releases require keyword arguments here. The positional
# call originally run in this cell produced the FutureWarning recorded below.
class_weights = class_weight.compute_class_weight(
    class_weight='balanced',
    classes=np.unique(y_train),
    y=y_train
)

class_weights = dict(enumerate(class_weights))
class_weights
/opt/conda/lib/python3.7/site-packages/sklearn/utils/validation.py:70: FutureWarning: Pass classes=[0 1], y=270640 0
72886 0
43654 1
126262 0
250713 0
..
38435 1
199765 0
225014 0
153449 1
305642 0
Name: state, Length: 232172, dtype: int64 as keyword args. From version 0.25 passing these as positional arguments will result in an error
FutureWarning)

Out[24]: {0: 0.8394874242489985, 1: 1.236404302907658}
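
The 'balanced' heuristic weights each class by n_samples / (n_classes * class_count), so the rarer class receives the larger weight. The pure-Python sketch below applies that formula to a hypothetical label list; `balanced_class_weights` is a helper written for this example, not a scikit-learn function.

```python
from collections import Counter

def balanced_class_weights(labels):
    # weight_c = n_samples / (n_classes * count_c)
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count)
            for cls, count in sorted(counts.items())}

# Toy labels: six negatives and four positives (illustrative only)
toy_labels = [0] * 6 + [1] * 4
weights = balanced_class_weights(toy_labels)
print(weights)
```

In the notebook the successful class makes up about 40% of the rows, which is why class 1 ends up with the weight above 1 and class 0 with the weight below 1.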


In [25]:  inputs = tf.keras.Input(shape=(221,))
x = tf.keras.layers.Dense(64, activation='relu')(inputs)
x = tf.keras.layers.Dense(64, activation='relu')(x)
outputs = tf.keras.layers.Dense(1, activation='sigmoid')(x)

model = tf.keras.Model(inputs, outputs)

model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=[
        'accuracy',
        tf.keras.metrics.AUC(name='auc')
    ]
)

batch_size = 64
epochs = 100

history = model.fit(
    X_train,
    y_train,
    validation_split=0.2,
    class_weight=class_weights,
    batch_size=batch_size,
    epochs=epochs,
    callbacks=[
        tf.keras.callbacks.EarlyStopping(
            monitor='val_loss',
            patience=3,
            restore_best_weights=True,
            verbose=1
        )
    ],
    verbose=2
)


Epoch 1/100
2903/2903 - 7s - loss: 0.4499 - accuracy: 0.7913 - auc: 0.8722 - val_loss: 0.3517 - val_accuracy: 0.8479 - val_auc: 0.9275
Epoch 2/100
2903/2903 - 7s - loss: 0.3189 - accuracy: 0.8622 - auc: 0.9374 - val_loss: 0.2883 - val_accuracy: 0.8764 - val_auc: 0.9478
Epoch 3/100
2903/2903 - 7s - loss: 0.2783 - accuracy: 0.8818 - auc: 0.9525 - val_loss: 0.2653 - val_accuracy: 0.8897 - val_auc: 0.9585
Epoch 4/100
2903/2903 - 6s - loss: 0.2481 - accuracy: 0.8947 - auc: 0.9624 - val_loss: 0.2759 - val_accuracy: 0.8839 - val_auc: 0.9571
Epoch 5/100
2903/2903 - 7s - loss: 0.2263 - accuracy: 0.9045 - auc: 0.9687 - val_loss: 0.2135 - val_accuracy: 0.9114 - val_auc: 0.9714
Epoch 6/100
2903/2903 - 7s - loss: 0.2097 - accuracy: 0.9116 - auc: 0.9731 - val_loss: 0.2329 - val_accuracy: 0.9016 - val_auc: 0.9720
Epoch 7/100
2903/2903 - 7s - loss: 0.2010 - accuracy: 0.9160 - auc: 0.9752 - val_loss: 0.2228 - val_accuracy: 0.9042 - val_auc: 0.9749
Epoch 8/100
2903/2903 - 7s - loss: 0.1937 - accuracy: 0.9196 - auc: 0.9770 - val_loss: 0.1935 - val_accuracy: 0.9214 - val_auc: 0.9781
Epoch 9/100
2903/2903 - 7s - loss: 0.1883 - accuracy: 0.9214 - auc: 0.9783 - val_loss: 0.1897 - val_accuracy: 0.9208 - val_auc: 0.9771
Epoch 10/100
2903/2903 - 10s - loss: 0.1826 - accuracy: 0.9244 - auc: 0.9796 - val_loss: 0.1807 - val_accuracy: 0.9256 - val_auc: 0.9795
Epoch 11/100
2903/2903 - 10s - loss: 0.1782 - accuracy: 0.9264 - auc: 0.9805 - val_loss: 0.1757 - val_accuracy: 0.9286 - val_auc: 0.9819
Epoch 12/100
2903/2903 - 8s - loss: 0.1749 - accuracy: 0.9278 - auc: 0.9813 - val_loss: 0.1812 - val_accuracy: 0.9243 - val_auc: 0.9792
Epoch 13/100
2903/2903 - 8s - loss: 0.1704 - accuracy: 0.9296 - auc: 0.9821 - val_loss: 0.1901 - val_accuracy: 0.9198 - val_auc: 0.9784
Epoch 14/100
2903/2903 - 7s - loss: 0.1669 - accuracy: 0.9309 - auc: 0.9829 - val_loss: 0.1700 - val_accuracy: 0.9310 - val_auc: 0.9834
Epoch 15/100
2903/2903 - 7s - loss: 0.1647 - accuracy: 0.9318 - auc: 0.9833 - val_loss: 0.1691 - val_accuracy: 0.9312 - val_auc: 0.9818
Epoch 16/100
2903/2903 - 7s - loss: 0.1609 - accuracy: 0.9336 - auc: 0.9841 - val_loss: 0.1872 - val_accuracy: 0.9240 - val_auc: 0.9810
Epoch 17/100
2903/2903 - 7s - loss: 0.1597 - accuracy: 0.9345 - auc: 0.9843 - val_loss: 0.1680 - val_accuracy: 0.9296 - val_auc: 0.9841
Epoch 18/100
2903/2903 - 7s - loss: 0.1567 - accuracy: 0.9356 - auc: 0.9849 - val_loss: 0.1672 - val_accuracy: 0.9299 - val_auc: 0.9840
Epoch 19/100
2903/2903 - 7s - loss: 0.1547 - accuracy: 0.9365 - auc: 0.9852 - val_loss: 0.1551 - val_accuracy: 0.9388 - val_auc: 0.9857
Epoch 20/100
2903/2903 - 9s - loss: 0.1534 - accuracy: 0.9369 - auc: 0.9855 - val_loss: 0.1680 - val_accuracy: 0.9327 - val_auc: 0.9832
Epoch 21/100
2903/2903 - 7s - loss: 0.1518 - accuracy: 0.9375 - auc: 0.9858 - val_loss: 0.1505 - val_accuracy: 0.9406 - val_auc: 0.9861
Epoch 22/100
2903/2903 - 7s - loss: 0.1494 - accuracy: 0.9386 - auc: 0.9862 - val_loss: 0.1539 - val_accuracy: 0.9358 - val_auc: 0.9852
Epoch 23/100
2903/2903 - 7s - loss: 0.1478 - accuracy: 0.9393 - auc: 0.9865 - val_loss: 0.1577 - val_accuracy: 0.9362 - val_auc: 0.9848
Epoch 24/100
Restoring model weights from the end of the best epoch.
2903/2903 - 7s - loss: 0.1457 - accuracy: 0.9396 - auc: 0.9870 - val_loss: 0.1565 - val_accuracy: 0.9376 - val_auc: 0.9850
Epoch 00024: early stopping
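
The stopping point in this log follows from the `patience=3` rule: training ends once the validation loss has failed to improve for three consecutive epochs, and the weights from the best epoch are restored. The pure-Python sketch below replays that rule over the `val_loss` values logged above. It is a simplified stand-in for the callback's bookkeeping, and the variable names are hypothetical.

```python
# Validation losses copied from the training log, epochs 1 to 24
val_losses = [
    0.3517, 0.2883, 0.2653, 0.2759, 0.2135, 0.2329, 0.2228, 0.1935,
    0.1897, 0.1807, 0.1757, 0.1812, 0.1901, 0.1700, 0.1691, 0.1872,
    0.1680, 0.1672, 0.1551, 0.1680, 0.1505, 0.1539, 0.1577, 0.1565,
]
patience = 3

best_loss = float('inf')
best_epoch = None
stopped_epoch = None
wait = 0

for epoch, loss in enumerate(val_losses, start=1):
    if loss < best_loss:
        # Improvement: remember this epoch and reset the counter
        best_loss, best_epoch, wait = loss, epoch, 0
    else:
        wait += 1
        if wait >= patience:
            stopped_epoch = epoch
            break

print(best_epoch, best_loss, stopped_epoch)
```

The replay stops at epoch 24 with epoch 21 as the best epoch, which matches the "early stopping" line in the log.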

Results
In [26]:  model.evaluate(X_test, y_test)

3110/3110 [==============================] - 4s 1ms/step - loss: 0.1503 - accuracy: 0.9397 - auc: 0.9861

Out[26]: [0.15034817159175873, 0.9397003054618835, 0.9860697984695435]
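
The step counts printed during training (2903) and evaluation (3110) can be checked against the row counts reported earlier. The sketch below redoes that arithmetic. It assumes that Keras rounds the training share of `validation_split` down and that `model.evaluate` uses its default batch size of 32, since no batch size is passed in In [26].

```python
import math

n_total = 331675   # rows in X, from X.shape
n_train = 232172   # length of y_train, from the class-weight warning
n_test = n_total - n_train

# validation_split=0.2 holds out the last 20% of the training rows
n_fit = math.floor(n_train * (1 - 0.2))

train_steps = math.ceil(n_fit / 64)   # batch_size=64 in model.fit
test_steps = math.ceil(n_test / 32)   # assumed default batch size for evaluate

print(n_fit, train_steps, n_test, test_steps)
```

Both results agree with the progress counters in the output, which is a quick way to confirm that the split sizes are what the code intended.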

10/15/23, 10:28 PM Kickstarter Prediction - Jupyter Notebook

In [ ]:  # Step 1: Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import AdaBoostClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score

# Step 2: Load and preprocess the dataset
df = pd.read_csv('Kickstarter001.csv')

# Keep only the columns used for modeling
df = df[['usd_pledged', 'state', 'usd_type', 'pledged', 'goal', 'country',
#df = df[['usd_pledged', 'state', 'pledged', 'goal', 'staff_pick', 'country

# Handle categorical variables
le = LabelEncoder()
df['usd_type'] = le.fit_transform(df['usd_type'])
df['country'] = le.fit_transform(df['country'])
df['state'] = le.fit_transform(df['state'])
df['staff_pick'] = le.fit_transform(df['staff_pick'])

# Define features and target variable
X = df.drop('state', axis=1)
y = df['state']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, r

# Step 3: Train and evaluate the individual models
# AdaBoost Classifier
ada_model = AdaBoostClassifier(n_estimators=50, random_state=42)
ada_model.fit(X_train, y_train)
ada_pred = ada_model.predict(X_test)

# XGBoost Classifier
xgb_model = XGBClassifier(n_estimators=100, random_state=42)
xgb_model.fit(X_train, y_train)
xgb_pred = xgb_model.predict(X_test)

# Step 4: Create the ensemble model
# You can use a simple voting approach for ensembling
from sklearn.ensemble import VotingClassifier

ensemble_model = VotingClassifier(
    estimators=[
        ('ada', ada_model),
        ('xgb', xgb_model)
    ],
    voting='hard'
)

ensemble_model.fit(X_train, y_train)
ensemble_pred = ensemble_model.predict(X_test)

# Step 5: Evaluate the models
ada_accuracy = accuracy_score(y_test, ada_pred)
xgb_accuracy = accuracy_score(y_test, xgb_pred)
ensemble_accuracy = accuracy_score(y_test, ensemble_pred)

print(f'AdaBoost Accuracy: {ada_accuracy}')
print(f'XGBoost Accuracy: {xgb_accuracy}')
print(f'Ensemble Model Accuracy: {ensemble_accuracy}')
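
With only two estimators, hard voting ties whenever the models disagree. scikit-learn documents that ties are resolved by ascending class label order, so the lower label wins. The pure-Python sketch below mirrors that rule on hypothetical predictions without calling scikit-learn; `hard_vote` is a helper written for this example.

```python
from collections import Counter

def hard_vote(predictions):
    """Majority vote per sample; ties go to the smallest label."""
    voted = []
    for sample_votes in zip(*predictions):
        counts = Counter(sample_votes)
        top = max(counts.values())
        voted.append(min(label for label, c in counts.items() if c == top))
    return voted

# Hypothetical predictions from two models on four samples
ada_toy = [0, 1, 1, 0]
xgb_toy = [0, 1, 0, 1]

ensemble_toy = hard_vote([ada_toy, xgb_toy])
print(ensemble_toy)
```

Every disagreement resolves to class 0 here, so a two-model hard vote can only predict class 1 when both models agree. Adding a third estimator or using soft voting avoids this bias.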

In [ ]:  from Py_FS.wrapper.nature_inspired import WOA

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder

df = pd.read_csv('Kickstarter001.csv')

# Keep only the columns used for modeling
df = df[['usd_pledged', 'state', 'usd_type', 'pledged', 'goal', 'country',

# Handle categorical variables
le = LabelEncoder()
df['usd_type'] = le.fit_transform(df['usd_type'])
df['country'] = le.fit_transform(df['country'])
df['state'] = le.fit_transform(df['state'])
df['staff_pick'] = le.fit_transform(df['staff_pick'])
print(df)

# Define features and target variable
X = df.drop('state', axis=1)
y = df['state']
print(y.shape)
answer = WOA(num_agents=5, max_iter=20, train_data=X.to_numpy(), train_lab
print(answer.best_agent)
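
Label encoding assigns each distinct value an integer code in sorted order. Because the cell refits one shared `le` for every column, only the last column's mapping remains available afterwards. The pure-Python sketch below keeps one mapping per column so codes can still be translated back. The helper `fit_label_mapping` and the sample values are hypothetical.

```python
def fit_label_mapping(values):
    # Sorted unique values get codes 0, 1, 2, ... (the order LabelEncoder uses)
    return {label: code for code, label in enumerate(sorted(set(values)))}

# Illustrative column values
country_toy = ['US', 'GB', 'US', 'CA']
state_toy = ['failed', 'successful', 'failed', 'successful']

# One mapping per column, kept for later decoding
country_map = fit_label_mapping(country_toy)
state_map = fit_label_mapping(state_toy)

country_codes = [country_map[v] for v in country_toy]
state_codes = [state_map[v] for v in state_toy]

print(country_map, country_codes)
print(state_map, state_codes)
```

Integer codes imply an ordering between categories, which tree-based models tolerate better than distance-based or linear ones.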

localhost:8888/notebooks/Kickstarter Prediction.ipynb 4/5



In [ ]:  import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder  # used below but missing from the original imports
from tensorflow.keras.layers import Conv1D, Dense, Flatten, Reshape
from tensorflow.keras.models import Sequential

# Load the dataset
df = pd.read_csv('Kickstarter001.csv')

# Keep only the columns used for modeling
df = df[['usd_pledged', 'state', 'usd_type', 'pledged', 'goal', 'country',

# Handle categorical variables
le = LabelEncoder()
df['usd_type'] = le.fit_transform(df['usd_type'])
df['country'] = le.fit_transform(df['country'])
df['state'] = le.fit_transform(df['state'])
df['staff_pick'] = le.fit_transform(df['staff_pick'])
print(df)

# Define features and target variable
X = df.drop('state', axis=1)
y = df['state']
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, r

# Build the model: two Conv1D layers followed by fully connected layers.
# Conv1D expects (steps, channels) per sample, so each feature row is reshaped
# to (n_features, 1) and flattened again before the Dense layers.
# padding='same' keeps the sequence length so the kernels fit even with few features.
n_features = X_train.shape[1]
model = Sequential([
    Reshape((n_features, 1), input_shape=(n_features,)),
    Conv1D(filters=1024, kernel_size=5, padding='same'),
    Conv1D(filters=512, kernel_size=3, padding='same'),
    Flatten(),
    Dense(1024, activation='relu'),
    Dense(512, activation='relu'),
    Dense(128, activation='relu'),
    Dense(64, activation='relu'),
    Dense(1, activation='sigmoid')
])

# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Train the model
model.fit(X_train, y_train, epochs=100, batch_size=64, validation_data=(X_

# Evaluate the model
loss, accuracy = model.evaluate(X_test, y_test)
print(f'Accuracy: {accuracy*100:.2f}%')
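
When stacking Conv1D layers on a short feature vector, it helps to check how many steps survive each layer. With 'valid' padding (the Keras default) a layer with stride 1 shortens the sequence by kernel_size - 1, while 'same' padding preserves the length. The helper below is a hypothetical pure-Python sketch of that arithmetic, and the input length of 10 is an invented example.

```python
import math

def conv1d_output_length(length, kernel_size, padding='valid', stride=1):
    # Number of output steps of a 1D convolution without dilation
    if padding == 'same':
        return math.ceil(length / stride)
    if padding == 'valid':
        return (length - kernel_size) // stride + 1
    raise ValueError("padding must be 'valid' or 'same'")

length = 10  # hypothetical number of input steps (features)
after_first = conv1d_output_length(length, kernel_size=5)
after_second = conv1d_output_length(after_first, kernel_size=3)

print(after_first, after_second)
```

Kernel sizes 5 and 3 with 'valid' padding therefore need at least 7 input steps to leave one output step, which is worth checking against the number of feature columns before training.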
