Final Project - Big Data
Table of Contents
Introduction ................................................................................................................................................. 3
Theoretical framework ............................................................................................................................... 3
Business and Data understanding ............................................................................................................. 4
Data preparation ......................................................................................................................................... 6
Modeling ...................................................................................................................................................... 7
ZeroR ....................................................................................................................................................... 7
J-48 ........................................................................................................................................................... 7
Logistic Regression ................................................................................................................................. 8
IBk ............................................................................................................................................................ 9
Naïve Bayes ............................................................................................................................................ 10
Clustering .............................................................................................................................................. 11
Evaluation .................................................................................................................................................. 12
Deploying ................................................................................................................................................... 14
Limitation and further study ............................................................................................................... 14
Privacy and Ethics ................................................................................................................................ 14
Conclusion ................................................................................................................................................. 15
Introduction
As is well known, big data analytics is a disruptive technology, and its use has grown rapidly in recent years. More and more data are collected every day, and many organizations rely on the data mining process and machine learning to serve their business tasks in pursuit of more efficient and more profitable paths. Given the KickStarter dataset, this report therefore encompasses seven main sections: theoretical framework, business and data understanding, data preparation, modeling, evaluation, deploying, and conclusion. Starting with a description of the main theoretical framework, the report then shows the application of the data mining process, using both supervised and unsupervised machine learning to solve a classification task: in other words, to discover predictive models that can classify a given project into one of the predefined groups, successful or failed. The research question is addressed below: given the dataset provided by the instructor, this data analytics project attempts to determine the class, successful or failed, that each Kickstarter project belongs to. However, the project is not limited to supervised segmentation; the unsupervised approach, clustering, is also applied in the hope of gaining better business understanding or making an important discovery. The dataset is accessible via Kaggle.com; it contains 192,548 instances and originally 20 attributes. In this paper, the algorithms, including ZeroR, decision tree, logistic regression, k-NN, Naïve Bayes, and clustering, are run mainly in Weka. Additionally, Tableau is used to visualize and explore the data.
Theoretical framework
Clearly, based on the research question, the problem is a classification task. Classification techniques predict a target variable, here the predefined classes, by learning from a known dataset in which the set of features is given, using a specific learning algorithm.
Business and Data understanding
Like most data analytics projects, this one did not come with a well-defined question at the very beginning. Instead, individuals, shareholders, or companies often approach a manager or data scientist with a business problem (Fawcett & Provost, 2013). For this KickStarter dataset, the Kickstarter company might come up with something like how to increase its profit. Therefore, the first duty is to convert the business problem into an actual predictive analytics solution. The main source of profit for Kickstarter is the fee charged to successfully funded projects, and as mentioned above, it is fairly straightforward that the company's goal is to maximize profit. Digging deeper into the company's capability, suppose its current approach to identifying successful projects is very poor or essentially random, which is unacceptable. A significant amount of time then needs to be spent in meetings with shareholders or the board of directors, for example, to gain more insight and avoid miscommunication. Based on this understanding, the paper develops a well-specified question, addressed above as "Can the success of a project be predicted?", that could help Kickstarter increase its profit by investing in more promising projects. The supervised models in the following sections are built to predict the status of a project and could be used to identify projects that look valuable to invest in. By allowing the right projects, the company should be able to increase its profit.
It is vital to understand the strengths and limitations of the data, since there is often a mismatch between the data and the problem. The KickStarter dataset was collected as part of the machine learning and data science community Kaggle, so it was not published for a specific problem; instead, it is open to data scientists' creativity. Digging beneath the surface of the available data, 192,548 projects are presented as instances. There are two business conditions, successful and failed, and each project in the dataset carries a label, successful or failed, indicating which condition it is in. Additionally, by clicking the Edit tab in Weka, all data are shown in a table whose columns indicate attributes and whose rows indicate instances; Figure 2 illustrates how a project is presented in Weka. It quickly becomes visible that there are different data types, including string, nominal, and numeric. Before going further, the target variable must be defined. For the KickStarter dataset, the goal is to predict the status of a project. The bottom line shows up at the deadline of the crowdfunding, so whether a project is successful is usually well evaluated: if not initially by the project's report, then later by the platform when the goal is not achieved. Thus, it can be assumed that the status of a project is reliably identified and may serve as the target for our classification task.
Of course, one can rarely understand the data without some visualization, so Tableau gives data scientists a tool to project the data into abstract images; the dashboard in Figure 3 (Dashboard by Tableau) shows some of these visualizations. A vertical bar chart shows the number of projects in each country. Clearly, some countries have so few instances that they mostly reveal sampling noise; the US contains the mass of the examples. Looking at the relationship between projects and their countries, represented on a map, it is quite obvious that the US has a large number of successful projects simply because of the irregular distribution of projects across countries. Providing the percentage of successful projects on the right-hand side is therefore useful, so that the study does not simply assume that projects launched in the US are more likely to be successful than the others.
Going back to basics, consider the easily obtained visualization from Weka, shown in Figure 4, produced simply by clicking "Visualize All". One might notice several attributes, "launched_at", "deadline", "city", and "state", that Weka cannot properly visualize because they have too many values to display; this makes sense, since their ranges of possible values are large. "launched_at", for example, contains not only the date but also the time each project was launched, so with a huge dataset the number of distinct values is naturally large. "start_Q", on the other hand, has only four possible values, which also makes sense. "country" and "currency" provide almost exactly the same values and shapes since, in some sense, one is just a different presentation of the other; a project launched in the U.S. is of course funded in USD. Turning attention to "goal_usd", it may appear to suffer from very low values, which seems unusual; however, this can be explained: some projects require a very small amount of money to complete while others need much more. Examining "goal_usd" more closely, it might be said that most of the projects launched on the Kickstarter platform are relatively small in scale.
As seen, the historical data in the KickStarter dataset are complete; no missing values were found, so the target value, whether a given project will be successful, is available, and the data are already converted to ready-to-use formats. The dataset splits between successful and failed projects not equally, but the ratio between them is not extreme, so this project ignores the problem of an imbalanced dataset. Another concern during this process is selecting informative attributes. Here individual creativity, common sense, and domain knowledge come into play; thus "name", "id", "city", "state", "start_month", "end_month", "start_Q", "end_Q", "launch_at", "deadline", and "usd_pledged" are removed. The Kickstarter id is used by the platform to verify a project's identity by giving each one a unique number, so it says little about success. Attributes with too many categories, such as "city", "state", and "sub_category", should likewise be eliminated. start_month, end_month, start_Q, end_Q, launch_at, and deadline are, in the same sense, summarized by the feature duration, and "currency" overlaps with country. Also, although the fundraising goal, usd_goal, must be specified before launching as the money needed to complete the project, the amount pledged by the crowd is not known until after the crowdfunding period is over, which is the point at which the project status is revealed. So "usd_pledged" is removed too.
Beyond the selection above, the way the data are represented leads to another concern: some data need to be normalized or converted to make them more compatible with a specific machine learning algorithm.
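The transformations meant here can be sketched in a few lines. The report performs these steps in Weka; the snippet below is only an illustration with invented values, using min-max normalization for numerics and dummy (0/1) columns for nominals:

```python
def min_max_normalize(values):
    """Scale a list of numbers into [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def one_hot(values):
    """Turn a nominal attribute into dummy (0/1) columns."""
    categories = sorted(set(values))
    return [{c: int(v == c) for c in categories} for v in values]

# Tiny illustrative slice of the dataset (not real instances).
projects = [
    {"goal_usd": 500.0,   "country": "US", "status": "successful"},
    {"goal_usd": 20000.0, "country": "GB", "status": "failed"},
    {"goal_usd": 1000.0,  "country": "US", "status": "successful"},
]

# Drop the target before building the feature table.
features = [{k: v for k, v in p.items() if k != "status"} for p in projects]

goals = min_max_normalize([p["goal_usd"] for p in features])
countries = one_hot([p["country"] for p in features])
```

After this, every numeric attribute lies in [0, 1] and every nominal value has become its own binary column, which matters most for the distance-based k-NN model used later.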
Modeling
ZeroR
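ZeroR is the simplest possible classifier: it ignores every feature and always predicts the majority class, which is why it serves as the baseline against which the later models are compared. A minimal sketch follows; the label counts are illustrative, not the real KickStarter counts:

```python
from collections import Counter

def zero_r(labels):
    """ZeroR baseline: always predict the most frequent class."""
    return Counter(labels).most_common(1)[0][0]

# Illustrative label distribution (not the real dataset's counts).
train_labels = ["successful"] * 60 + ["failed"] * 40
majority = zero_r(train_labels)

# ZeroR's accuracy is simply the share of the majority class.
baseline_accuracy = 60 / 100
```

Any model worth keeping must beat this baseline, since ZeroR achieves its accuracy without learning anything from the features.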
J-48
Continuing with the creation of a supervised segmentation, a decision tree, the default settings in Weka result in an overfitted, overly complex model. Therefore, minNumObj is changed from 2 to 3000, meaning that each leaf must contain no fewer than 3000 instances; this number is chosen in response to the large dataset. Taking a divide-and-conquer approach recursively, Figure 9 shows a decision tree, drawn upside down, built with the J48 method. It is much simpler than the default tree but leads to the same conclusions in terms of class outcomes, which in this case are successful or failed.
The leaf label successful (8425.0/3810.0) means that 8425 projects reached this node and were classified as successful, while 3810 of them were misclassified. Additionally, according to Figure 8, the tree achieved 67.7068% accuracy, which is greater than the baseline model.
On the practical side, to make a prediction for a project, the process starts by testing the value of a feature of the given project at the root node; the result leads to the branch the project should descend, and the test repeats until the project reaches a leaf node, which predicts whether it will be successful.
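The descent just described can be sketched as follows. The tree below, its split attributes, and its thresholds are invented for illustration; they are not the tree Weka actually produced:

```python
# Hypothetical two-level decision tree (not the J48 output from Figure 9).
tree = {
    "attribute": "goal_usd",
    "threshold": 10000,
    "low":  {"leaf": "successful"},
    "high": {
        "attribute": "duration",
        "threshold": 30,
        "low":  {"leaf": "successful"},
        "high": {"leaf": "failed"},
    },
}

def classify(node, project):
    """Repeat the node's test down each branch until a leaf is reached."""
    while "leaf" not in node:
        branch = "low" if project[node["attribute"]] <= node["threshold"] else "high"
        node = node[branch]
    return node["leaf"]

# A large-goal, long-duration project descends high -> high.
prediction = classify(tree, {"goal_usd": 50000, "duration": 60})
```

Each internal node tests one feature and each path from root to leaf is a conjunctive rule, which is what makes the tree readable as a set of decisions.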
Logistic Regression
As another choice of model, logistic regression is applied to estimate class probabilities. In Weka, to avoid overfitting, the dataset is first split into a training set and a testing set, 70% and 30% respectively. Unlike the decision tree, the logistic regression model scores each project by its likelihood of being successful. Looking at the results from Weka (Figure 10), what do the numbers tell us? To illustrate, two variables, main_category=journalism and main_category=dance, are selected. For the former, the negative coefficient of -1.1128 means a journalism project is less likely to be successful, and its odds ratio of 0.3286, being below 1, points the same way: the odds of a journalism project succeeding are lower than the odds of it failing relative to the reference category. Similarly, the positive coefficient of main_category=dance and its odds ratio suggest that a project in the dance category is more likely to be successful. This logistic classifier's accuracy is 68.927%, slightly higher than that of the decision tree.
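The relation between a coefficient and its odds ratio can be checked directly: the odds ratio is the exponential of the coefficient, so the two numbers Weka reports for main_category=journalism are consistent with each other:

```python
import math

# Coefficient reported by Weka for main_category=journalism.
coef_journalism = -1.1128

# The odds ratio is exp(coefficient); ~0.3286, matching Weka's output.
odds_ratio = math.exp(coef_journalism)

# Any positive coefficient yields an odds ratio above 1, i.e. the
# category makes success more likely, as with main_category=dance.
assert math.exp(0.5) > 1  # 0.5 is a hypothetical positive coefficient
```

This is why a coefficient's sign alone already tells the direction of the effect, while the odds ratio quantifies its size on a multiplicative scale.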
IBk
Using the entire dataset for prediction, k-NN is built on similarity, comparing values and calculating the distance between projects. All attributes are normalized, and all nominal attributes are converted to dummy variables, preventing attributes on a gigantic scale from dominating and providing a fair comparison. For example, considering goal_usd and name_length, a slight variation of goal_usd would otherwise influence the distance calculation unreasonably. Figure 11 shows how different values of k result in different accuracy levels in Weka:

Value of k    Accuracy (%)
1             88.4174
2             85.6794
3             81.1476
4             81.4
5             78.3285
10            77.2031
50            74.382
100           73.548
500           71.8984
1000          71.0509

Figure 11 The accuracy level given the value of k

For the default setting (k=1), distanceWeighting is simply set to no distance weighting, since for a project of unknown status the model just finds the nearest project and adopts its class, successful or failed. In contrast, when k is greater than 1, weights of 1/distance or 1-distance are assigned: sharing more similar characteristics, one project should have more say than the others. In this paper, the former is used. With the accuracy for each value of k presented in the table, the 1-NN model performs best, so it is chosen.

The prediction for each project can also be viewed. Figure 12 shows the results produced by Weka using the chosen model. The prediction margin expresses confidence on a range from -1 to 1. Consider the first highlighted example, labeled 8: the model is very confident, and its prediction is correct. The following row, on the other hand, contains a negative prediction margin, so the model predicts the project to be failed instead of successful, which is the true status. Still, compared with all the previous models, the k-NN algorithm is the best classifier, with the highest accuracy of 88.4174%.
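The chosen 1-NN scheme can be sketched as follows: compute the distance from the query project to every training project in the normalized feature space and adopt the class of the single nearest one. The feature vectors below are invented, not real projects:

```python
import math

def euclidean(a, b):
    """Distance between two projects in (already normalized) feature space."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def one_nn(train, query):
    """1-NN: adopt the class of the single nearest training project."""
    _features, label = min(train, key=lambda row: euclidean(row[0], query))
    return label

# Tiny normalized feature vectors (goal_usd, duration); values invented.
train = [
    ([0.1, 0.2], "successful"),
    ([0.9, 0.8], "failed"),
    ([0.2, 0.3], "successful"),
]

prediction = one_nn(train, [0.15, 0.25])
```

Because every feature contributes a squared difference to the distance, this sketch also shows why normalization matters: an un-scaled goal_usd would swamp every other term.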
Figure 12 k-NN by Weka
Naïve Bayes
Predicting the status of a project based on past evidence, Figure 13 shows the results provided by Weka. For the attribute "main_category", for example, the value in column 2 represents the number of projects that were successful given a specific category: in the KickStarter dataset, among 12,569 game projects, 8,389 were successful. For a numeric attribute such as duration, on the other hand, summary statistics are used. The model's accuracy is 63.2679%.
Clustering
Besides supervised machine learning, Weka also offers the unsupervised technique of clustering, which helps find meaningful groups in the data. The difference between clustering and classification is plain: previously the models tried to categorize given projects as successful or failed, while clustering intends to segregate projects into groups based on similar attributes. Before doing so, the attributes "country" and "main_category" are excluded while "usd_pledged" is added, in the hope of better capturing the possible groups. Then the numeric variables are normalized manually. Keep in mind that although clustering works with both nominal and numeric attributes, it requires distance measuring, which is more effective with numeric data, so this project keeps only duration, goal_usd, name length, and usd_pledged for clustering.

Initially, different numbers of clusters k are tried to minimize the sum of squared errors (SSE), and for this KickStarter dataset 4 is the desired number of clusters, as the SSE drops substantially, shown in Figure 14:

Value of k    Sum of squared errors (SSE)
2             3609.9884

Figure 14 The SSE given different values of k

From the final cluster centroids and visualization, one group appears full of ambitious, innovative, or creative projects; if a project takes too long, it is less attractive and riskier for fund providers than the others.
In any case, relying on the algorithm alone would not yield good clusters; some algorithms, for instance, perform worse as the size of the dataset increases. A large number of attributes can also reduce the quality of the outcome, so only a few attributes are used here, fewer being preferable. Additionally, outliers in the KickStarter dataset make clustering more difficult.
Evaluation
It is time to connect the results to the goal of predicting the status of a crowdfunding project. Even though it is difficult to assess a model's performance perfectly, with suitable measurements each algorithm can fortunately be evaluated systematically and reasonably. It is tempting to consider the simplest metrics, accuracy or error rate, to measure model performance, since they are very easy to obtain; however, they are not enough for evaluation.
To evaluate a classifier, the confusion matrix is introduced. For the KickStarter dataset, since this is binary classification, a 2x2 matrix is used for each classifier. Separating the predictions made by the model, an example confusion matrix from Weka is presented in Figure 16, with rows referring to the actual classes and columns labeled with the predicted classes. To clarify, a project in the test set of course has its actual class from the historical data, whether it is successful, as well as the class predicted by the classifier. In the confusion matrix below, the predicted classes are denoted a and b, and no label is given to the actual classes.
Going back to our dataset before moving straight to model evaluation: for a classification problem, if one class is rare or the class distribution is unbalanced, evaluation using accuracy fails, yielding a high baseline accuracy without telling us anything. Our dataset, however, is relatively balanced, so it is perhaps acceptable to simply look at the accuracy level.
But evaluation is not only about accuracy; it is important to make use of the other evaluation metrics, recall and precision. Recall is simply the true positive rate, TP/(TP+FN), while precision is calculated as TP/(TP+FP).
For example, taking successful and failed as the positive and negative classes respectively, the following shows the calculation of precision using the confusion matrix drawn from the logistic regression model:
Precision = 30557/(30557+13287)
Note that some of the evaluation metrics are already computed by Weka; the numbers above are calculated manually to clarify where they come from. Figure 18 shows an example of the results from Weka.
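The manual calculation can be replayed directly. The TP and FP counts below come from the logistic-regression confusion matrix quoted above; the FN count is illustrative, since it is not reproduced in the text:

```python
# Counts from the logistic-regression confusion matrix quoted above.
tp = 30557   # successful projects predicted successful
fp = 13287   # failed projects predicted successful

# Precision: of the projects predicted successful, how many truly were.
precision = tp / (tp + fp)   # ~0.697

# Recall needs the false negatives; the count below is hypothetical
# because the text does not reproduce it.
fn_hypothetical = 4600
recall = tp / (tp + fn_hypothetical)
```

Reading the two metrics together guards against a model that inflates one at the expense of the other, for instance by predicting "successful" for everything.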
Here it is crucial to connect the results (Figure 19) back to the evaluation process to make sure that the models built to classify the status of projects on the Kickstarter platform do any good. In the broadest view, based on the results shown in Figure 19, randomly identifying the status of projects yields an accuracy of 67.7068%, and most of the newly developed algorithms perform better than that, so to some extent they may be deemed worthwhile.
First, consider the most popular metric, mentioned many times in this paper, plain accuracy: the proportion of correct decisions. From the figure below, the accuracy values cover a wide range, from 63.2679% to 88.4174%. The Naïve Bayes result appears close to the base rate, and IBk, or k-NN, stands out with its 88.4174% accuracy, significantly better than the others.
Moving to recall and precision: a greater recall implies higher confidence that all the successful projects have been found by the model, while precision captures how often projects the model predicts to be successful actually are. As with recall, a greater number is preferable. On these metrics too, the k-NN model is clearly superior.
Deploying
Take the case of the KickStarter dataset: suppose that the Kickstarter company has never made use of machine learning in its decision making, allowing every project to use the platform freely; then deploying a model is reasonable. By implementing the model, a predictive algorithm helps the company assess the potential of a project before approval.
Limitation and further study
However, the main challenge might return to the data preparation phase: making the data more reliable, or perhaps collecting it from sources other than Kaggle.com. This may involve working with the Kickstarter company itself to develop better data or collecting new instances every day. Attributes such as "backers" or "year_launced" could be added to achieve a better outcome. Other information, such as profit and cost, might also be useful, for example to compute expected profit so that a cost-benefit matrix could be built, providing further analysis of the consequences of each decision.
Privacy and Ethics
In data mining, ethical issues should not be overlooked. In this case, they might not be a serious concern, since the project does not go into detailed personal data; moreover, the instances are projects, not individuals.
Conclusion
Throughout this paper, the data mining process has been introduced naturally through the use of the KickStarter dataset, with each important step developed into a section. The business problem of Kickstarter brings data science into practice. By applying supervised machine learning to the question addressed, the project successfully predicted the status of a project with 88.4174% accuracy using the k-NN algorithm.
Before reaching that outcome, several things had to be emphasized. After converting the business problem into a predictive data analytics solution, exploring and visualizing the given data made it apparent that project success can be predicted. The paper then moved beyond data visualization and focused on data preparation: particular attributes were selected to serve the problem, a variety of models were applied, and evaluation naturally followed. Moving a step further, the paper also discussed the deployment stage.
However, it is important to note that the result of this paper is not meant to be the only solution to the problem, nor is this specific research question the only one that may arise from the dataset, since data science depends in some sense on the individual's capability and creativity.
References
Fawcett, T., & Provost, F. (2013). Data Science for Business. Sebastopol: O’Reilly Media, Inc.
Kickstarter. (2021). Bring your creative project to life. From Kickstarter.
Kickstarter. (2021). Fees for Denmark. From Kickstarter: https://www.kickstarter.com/help/fees
Kickstarter. (2021, Nov). What do backers get in return? From Kickstarter: https://help.kickstarter.com/hc/en-us/articles/115005047953-What-do-backers-get-in-return-