Final Project - Big Data
Table of Contents
Introduction ................................................................................................................................................. 3
Theoretical framework ............................................................................................................................... 3
Business and Data understanding ............................................................................................................. 4
Data preparation ......................................................................................................................................... 6
Modeling ...................................................................................................................................................... 7
ZeroR ....................................................................................................................................................... 7
J-48 ........................................................................................................................................................... 7
Logistic Regression ................................................................................................................................. 8
IBk ............................................................................................................................................................ 9
Naïve Bayes ............................................................................................................................................ 10
Clustering .............................................................................................................................................. 11
Evaluation .................................................................................................................................................. 12
Deploying ................................................................................................................................................... 14
Limitation and further study ............................................................................................................... 14
Privacy and Ethics ................................................................................................................................ 14
Conclusion ................................................................................................................................................. 15
Introduction
As is well known, big data analytics is a disruptive technology, and its use has grown rapidly in recent years. More and more data are collected every day, and many organizations rely on the data mining process and machine learning to serve their business tasks in pursuit of more efficient and more profitable paths. Given the KickStarter dataset, this report therefore encompasses seven main sections: theoretical framework, business and data understanding, data preparation, modeling, evaluation, deploying, and conclusion. Starting with a description of the main theoretical framework, the report then shows the application of the data mining process, using both supervised and unsupervised machine learning to solve a classification task: in other words, to discover predictive models that can classify a given project into one of the predefined groups, successful or failed. The research question is addressed below: given the dataset provided by the instructor, this data analytics project attempts to determine the class, successful or failed, that each Kickstarter project belongs to. However, the project is not limited to supervised segmentation; the unsupervised approach, clustering, is also applied in the hope of gaining better business understanding or making an important discovery. The dataset is accessible via Kaggle.com; it contains 192,548 instances and originally 20 attributes. In this paper, the algorithms, including ZeroR, decision tree, logistic regression, k-NN, Naïve Bayes, and clustering, are run mainly in Weka. Additionally, Tableau is used to visualize and explore the data.
Theoretical framework
Clearly, based on the research question, the problem is a classification task. Classification techniques predict a target variable, here the predefined classes, by learning from a known dataset in which the set of features is given, using a specific learning algorithm.
Business and Data understanding
Like most data analytics projects, this one did not come with a well-defined question at the very beginning. Instead, individuals, shareholders, or companies often approach a manager or data scientist with a business problem (Fawcett & Provost, 2013). For this KickStarter dataset, the Kickstarter company might come up with something like how to increase its profit. Therefore, the first duty is to convert the business problem into an actual predictive analytics solution. The main source of profit for Kickstarter is the fee charged to successfully funded projects, and as mentioned above, it is fairly straightforward that the company's goal is to maximize profit. Digging deeper into the company's capability, suppose its current approach to identifying successful projects is very poor or essentially random, which is unacceptable. A significant amount of time then needs to be spent in meetings with shareholders or the board of directors, for example, to gain more insight and avoid miscommunication. Based on this understanding, the paper develops a well-specified question, addressed above as "Can the success of a project be predicted?", that could help Kickstarter increase its profit by investing in more promising projects. The supervised models in the following sections are built to predict the status of a project and could be used to identify projects that look valuable to invest in. By allowing the right projects, the company should be able to increase its profit.
It is vital to understand the strengths and limitations of the data, since there is often a mismatch between the data and the problem. The KickStarter dataset was collected as part of the machine learning and data science community Kaggle, so it was not published for a specific problem; instead, it is open to data scientists' creativity. Digging beneath the surface of the available data, 192,548 projects are presented as instances. There are two business conditions, successful and failed, and each project in the dataset carries a label, successful or failed, indicating which condition it is in. Additionally, by clicking the Edit tab in Weka, all data are shown in a table whose columns indicate attributes and whose rows indicate instances; Figure 2 illustrates how a project is presented in Weka. It quickly becomes visible that there are different data types, including string, nominal, and numeric. Before going further, the target variable must be defined. For the KickStarter dataset, the goal is to predict the status of a project. The bottom line shows up at the deadline of the crowdfunding, so whether a project is successful is usually well evaluated: if not initially by the project's report, then later by the platform when the goal is not achieved. Thus, it can be assumed that the status of a project is reliably identified and may serve as the target for our classification task.
Of course, one can rarely understand the data without some visualization, so Tableau gives data scientists a tool to project the data into abstract images; the dashboard in Figure 3 (Dashboard by Tableau) shows some of these visualizations. A vertical bar chart shows the number of projects in each country. Clearly, some countries have so few instances that they mostly reveal sampling noise; the US contains the mass of the examples. Looking at the relationship between projects and their countries, represented on a map, it is quite obvious that the US has a large number of successful projects simply because of the irregular distribution of projects across countries. Providing the percentage of successful projects on the right-hand side is therefore useful, so that the study does not simply assume that projects launched in the US are more likely to be successful than the others.
Going back to basics, consider the easily obtained visualization from Weka, shown in Figure 4, produced simply by clicking "Visualize All". One might notice several attributes, "launched_at", "deadline", "city", and "state", that Weka cannot properly visualize because they have too many values to display; this makes sense, since their ranges of possible values are large. "launched_at", for example, contains not only the date but also the time each project was launched, so with a huge dataset the number of distinct values is naturally large. "start_Q", on the other hand, has only four possible values, which also makes sense. "country" and "currency" provide almost exactly the same values and shapes since, in some sense, one is just a different presentation of the other; a project launched in the U.S. is of course funded in USD. Turning attention to "goal_usd", it may appear to suffer from very low values, which seems unusual; however, this can be explained: some projects require a very small amount of money to complete while others need much more. Examining "goal_usd" more closely, it might be said that most of the projects launched on the Kickstarter platform are relatively small in scale.
As seen, the historical data in the KickStarter dataset are complete; no missing values were found, so the target value, whether a given project will be successful, is available, and the data are already converted to ready-to-use formats. The dataset splits between successful and failed projects not equally, but the ratio between them is not extreme, so this project ignores the problem of an imbalanced dataset. Another concern during this process is selecting informative attributes. Here individual creativity, common sense, and domain knowledge come into play; thus "name", "id", "city", "state", "start_month", "end_month", "start_Q", "end_Q", "launch_at", "deadline", and "usd_pledged" are removed. The Kickstarter id is used by the platform to verify a project's identity by giving each one a unique number, so it says little about success. Attributes with too many categories, such as "city", "state", and "sub_category", should likewise be eliminated. start_month, end_month, start_Q, end_Q, launch_at, and deadline are, in the same sense, summarized by the feature duration, and "currency" overlaps with country. Also, although the fundraising goal, usd_goal, must be specified before launching as the money needed to complete the project, the amount pledged by the crowd is not known until after the crowdfunding period is over, which is the point at which the project status is revealed. So "usd_pledged" is removed too.
Beyond the selection above, the way the data are represented leads to another concern: some data need to be normalized or converted to make them more compatible with a specific machine learning algorithm.
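The transformations meant here can be sketched in a few lines. The report performs these steps in Weka; the snippet below is only an illustration with invented values, using min-max normalization for numerics and dummy (0/1) columns for nominals:

```python
def min_max_normalize(values):
    """Scale a list of numbers into [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def one_hot(values):
    """Turn a nominal attribute into dummy (0/1) columns."""
    categories = sorted(set(values))
    return [{c: int(v == c) for c in categories} for v in values]

# Tiny illustrative slice of the dataset (not real instances).
projects = [
    {"goal_usd": 500.0,   "country": "US", "status": "successful"},
    {"goal_usd": 20000.0, "country": "GB", "status": "failed"},
    {"goal_usd": 1000.0,  "country": "US", "status": "successful"},
]

# Drop the target before building the feature table.
features = [{k: v for k, v in p.items() if k != "status"} for p in projects]

goals = min_max_normalize([p["goal_usd"] for p in features])
countries = one_hot([p["country"] for p in features])
```

After this, every numeric attribute lies in [0, 1] and every nominal value has become its own binary column, which matters most for the distance-based k-NN model used later.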
Modeling
ZeroR
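ZeroR is the simplest possible classifier: it ignores every feature and always predicts the majority class, which is why it serves as the baseline against which the later models are compared. A minimal sketch follows; the label counts are illustrative, not the real KickStarter counts:

```python
from collections import Counter

def zero_r(labels):
    """ZeroR baseline: always predict the most frequent class."""
    return Counter(labels).most_common(1)[0][0]

# Illustrative label distribution (not the real dataset's counts).
train_labels = ["successful"] * 60 + ["failed"] * 40
majority = zero_r(train_labels)

# ZeroR's accuracy is simply the share of the majority class.
baseline_accuracy = 60 / 100
```

Any model worth keeping must beat this baseline, since ZeroR achieves its accuracy without learning anything from the features.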
J-48
Continuing with the creation of a supervised segmentation, a decision tree, the default settings in Weka result in an overfitted, overly complex model. Therefore, minNumObj is changed from 2 to 3000, meaning that each leaf must contain no fewer than 3000 instances; this number is chosen in response to the large dataset. Taking a divide-and-conquer approach recursively, Figure 9 shows a decision tree, drawn upside down, built with the J48 method. It is much simpler than the default tree but leads to the same conclusions in terms of class outcomes, which in this case are successful or failed.
The leaf label successful (8425.0/3810.0) means that 8425 projects reached this node and were classified as successful, while 3810 of them were misclassified. Additionally, according to Figure 8, the tree achieved 67.7068% accuracy, which is greater than the baseline model.
On the practical side, to make a prediction for a project, the process starts by testing the value of a feature of the given project at the root node; the result leads to the branch the project should descend, and the test repeats until the project reaches a leaf node, which predicts whether it will be successful.
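The descent just described can be sketched as follows. The tree below, its split attributes, and its thresholds are invented for illustration; they are not the tree Weka actually produced:

```python
# Hypothetical two-level decision tree (not the J48 output from Figure 9).
tree = {
    "attribute": "goal_usd",
    "threshold": 10000,
    "low":  {"leaf": "successful"},
    "high": {
        "attribute": "duration",
        "threshold": 30,
        "low":  {"leaf": "successful"},
        "high": {"leaf": "failed"},
    },
}

def classify(node, project):
    """Repeat the node's test down each branch until a leaf is reached."""
    while "leaf" not in node:
        branch = "low" if project[node["attribute"]] <= node["threshold"] else "high"
        node = node[branch]
    return node["leaf"]

# A large-goal, long-duration project descends high -> high.
prediction = classify(tree, {"goal_usd": 50000, "duration": 60})
```

Each internal node tests one feature and each path from root to leaf is a conjunctive rule, which is what makes the tree readable as a set of decisions.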
Logistic Regression
As another choice of model, logistic regression is applied to estimate class probabilities. In Weka, to avoid overfitting, the dataset is first split into a training set and a testing set, 70% and 30% respectively. Unlike the decision tree, the logistic regression model scores each project by its likelihood of being successful. Looking at the results from Weka (Figure 10), what do the numbers tell us? To illustrate, two variables, main_category=journalism and main_category=dance, are selected. For the former, the negative coefficient of -1.1128 means a journalism project is less likely to be successful, and its odds ratio of 0.3286, being below 1, points the same way: the odds of a journalism project succeeding are lower than the odds of it failing relative to the reference category. Similarly, the positive coefficient of main_category=dance and its odds ratio suggest that a project in the dance category is more likely to be successful. This logistic classifier's accuracy is 68.927%, slightly higher than that of the decision tree.
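The relation between a coefficient and its odds ratio can be checked directly: the odds ratio is the exponential of the coefficient, so the two numbers Weka reports for main_category=journalism are consistent with each other:

```python
import math

# Coefficient reported by Weka for main_category=journalism.
coef_journalism = -1.1128

# The odds ratio is exp(coefficient); ~0.3286, matching Weka's output.
odds_ratio = math.exp(coef_journalism)

# Any positive coefficient yields an odds ratio above 1, i.e. the
# category makes success more likely, as with main_category=dance.
assert math.exp(0.5) > 1  # 0.5 is a hypothetical positive coefficient
```

This is why a coefficient's sign alone already tells the direction of the effect, while the odds ratio quantifies its size on a multiplicative scale.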
IBk
Using the entire dataset for prediction, k-NN is built on similarity, comparing values and calculating the distance between projects. All attributes are normalized, and all nominal attributes are converted to dummy variables, preventing attributes on a gigantic scale from dominating and providing a fair comparison. For example, considering goal_usd and name_length, a slight variation of goal_usd would otherwise influence the distance calculation unreasonably. Figure 11 shows how different values of k result in different accuracy levels in Weka:

Value of k    Accuracy (%)
1             88.4174
2             85.6794
3             81.1476
4             81.4
5             78.3285
10            77.2031
50            74.382
100           73.548
500           71.8984
1000          71.0509

Figure 11 The accuracy level given the value of k

For the default setting (k=1), distanceWeighting is simply set to no distance weighting, since for a project of unknown status the model just finds the nearest project and adopts its class, successful or failed. In contrast, when k is greater than 1, weights of 1/distance or 1-distance are assigned: sharing more similar characteristics, one project should have more say than the others. In this paper, the former is used. With the accuracy for each value of k presented in the table, the 1-NN model performs best, so it is chosen.

The prediction for each project can also be viewed. Figure 12 shows the results produced by Weka using the chosen model. The prediction margin expresses confidence on a range from -1 to 1. Consider the first highlighted example, labeled 8: the model is very confident, and its prediction is correct. The following row, on the other hand, contains a negative prediction margin, so the model predicts the project to be failed instead of successful, which is the true status. Still, compared with all the previous models, the k-NN algorithm is the best classifier, with the highest accuracy of 88.4174%.
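The chosen 1-NN scheme can be sketched as follows: compute the distance from the query project to every training project in the normalized feature space and adopt the class of the single nearest one. The feature vectors below are invented, not real projects:

```python
import math

def euclidean(a, b):
    """Distance between two projects in (already normalized) feature space."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def one_nn(train, query):
    """1-NN: adopt the class of the single nearest training project."""
    _features, label = min(train, key=lambda row: euclidean(row[0], query))
    return label

# Tiny normalized feature vectors (goal_usd, duration); values invented.
train = [
    ([0.1, 0.2], "successful"),
    ([0.9, 0.8], "failed"),
    ([0.2, 0.3], "successful"),
]

prediction = one_nn(train, [0.15, 0.25])
```

Because every feature contributes a squared difference to the distance, this sketch also shows why normalization matters: an un-scaled goal_usd would swamp every other term.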
Figure 12 k-NN by Weka
Naïve Bayes
Predicting the status of a project based on past evidence, Figure 13 shows the results provided by Weka. For the attribute "main_category", for example, the value in column 2 represents the number of projects that were successful given a specific category: in the KickStarter dataset, among 12,569 game projects, 8,389 were successful. For a numeric attribute such as duration, on the other hand, summary statistics are used. The model's accuracy is 63.2679%.
Clustering
Besides supervised machine learning, Weka also offers the unsupervised technique of clustering, which helps find meaningful groups in the data. The difference between clustering and classification is plain: previously the models tried to categorize given projects as successful or failed, while clustering intends to segregate projects into groups based on similar attributes. Before doing so, the attributes "country" and "main_category" are excluded while "usd_pledged" is added, in the hope of better capturing the possible groups. Then the numeric variables are normalized manually. Keep in mind that although clustering works with both nominal and numeric attributes, it requires distance measuring, which is more effective with numeric data, so this project keeps only duration, goal_usd, name length, and usd_pledged for clustering.

Initially, different numbers of clusters k are tried to minimize the sum of squared errors (SSE), and for this KickStarter dataset 4 is the desired number of clusters, as the SSE drops substantially, shown in Figure 14:

Value of k    Sum of squared errors (SSE)
2             3609.9884

Figure 14 The SSE given different values of k

From the final cluster centroids and visualization, one group appears full of ambitious, innovative, or creative projects; if a project takes too long, it is less attractive and riskier for fund providers than the others.
In any case, relying on the algorithm alone would not yield good clusters; some algorithms, for instance, perform worse as the size of the dataset increases. A large number of attributes can also reduce the quality of the outcome, so only a few attributes are used here, fewer being preferable. Additionally, outliers in the KickStarter dataset make clustering more difficult.
Evaluation
It is time to connect the results to the goal of predicting the status of a crowdfunding project. Even though it is difficult to assess a model's performance perfectly, with suitable measurements each algorithm can fortunately be evaluated systematically and reasonably. It is tempting to consider the simplest metrics, accuracy or error rate, to measure model performance, since they are very easy to obtain; however, they are not enough for evaluation.
To evaluate a classifier, the confusion matrix is introduced. For the KickStarter dataset, since this is binary classification, a 2x2 matrix is used for each classifier. Separating the predictions made by the model, an example confusion matrix from Weka is presented in Figure 16, with rows referring to the actual classes and columns labeled with the predicted classes. To clarify, a project in the test set of course has its actual class from the historical data, whether it is successful, as well as the class predicted by the classifier. In the confusion matrix below, the predicted classes are denoted a and b, and no label is given to the actual classes.
Going back to our dataset before moving straight to model evaluation: for a classification problem, if one class is rare or the class distribution is unbalanced, evaluation using accuracy fails, yielding a high baseline accuracy without telling us anything. Our dataset, however, is relatively balanced, so it is perhaps acceptable to simply look at the accuracy level.
But evaluation is not only about accuracy; it is important to make use of the other evaluation metrics, recall and precision. Recall is simply the true positive rate, TP/(TP+FN), while precision is calculated as TP/(TP+FP).
For example, taking successful and failed as the positive and negative classes respectively, the following shows the calculation of precision using the confusion matrix drawn from the logistic regression model:
Precision = 30557/(30557+13287)
Note that some of the evaluation metrics are already computed by Weka; the numbers above are calculated manually to clarify where they come from. Figure 18 shows an example of the results from Weka.
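The manual calculation can be replayed directly. The TP and FP counts below come from the logistic-regression confusion matrix quoted above; the FN count is illustrative, since it is not reproduced in the text:

```python
# Counts from the logistic-regression confusion matrix quoted above.
tp = 30557   # successful projects predicted successful
fp = 13287   # failed projects predicted successful

# Precision: of the projects predicted successful, how many truly were.
precision = tp / (tp + fp)   # ~0.697

# Recall needs the false negatives; the count below is hypothetical
# because the text does not reproduce it.
fn_hypothetical = 4600
recall = tp / (tp + fn_hypothetical)
```

Reading the two metrics together guards against a model that inflates one at the expense of the other, for instance by predicting "successful" for everything.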
Here it is crucial to connect the results (Figure 19) back to the evaluation process to make sure that the models built to classify the status of projects on the Kickstarter platform do any good. In the broadest view, based on the results shown in Figure 19, randomly identifying the status of projects yields an accuracy of 67.7068%, and most of the newly developed algorithms perform better than that, so to some extent they may be deemed worthwhile.
First, consider the most popular metric, mentioned many times in this paper, plain accuracy: the proportion of correct decisions. From the figure below, the accuracy values cover a wide range, from 63.2679% to 88.4174%. The Naïve Bayes result appears close to the base rate, and IBk, or k-NN, stands out with its 88.4174% accuracy, significantly better than the others.
Moving to recall and precision: a greater recall implies higher confidence that all the successful projects have been found by the model, while precision captures how often projects the model predicts to be successful actually are. As with recall, a greater number is preferable. On these metrics too, the k-NN model is clearly superior.
Deploying
Take the case of the KickStarter dataset: suppose that the Kickstarter company has never made use of machine learning in its decision making, allowing every project to use the platform freely; then deploying a model is reasonable. By implementing the model, a predictive algorithm helps the company assess the potential of a project before approval.
Limitation and further study
However, the main challenge might return to the data preparation phase: making the data more reliable, or perhaps collecting it from sources other than Kaggle.com. This may involve working with the Kickstarter company itself to develop better data or collecting new instances every day. Attributes such as "backers" or "year_launced" could be added to achieve a better outcome. Other information, such as profit and cost, might also be useful, for example to compute expected profit so that a cost-benefit matrix could be built, providing further analysis of the consequences of each decision.
Privacy and Ethics
In data mining, ethical issues should not be overlooked. In this case, they might not be a serious concern, since the project does not go into detailed personal data; moreover, the instances are projects, not individuals.
Conclusion
Throughout this paper, the data mining process has been introduced naturally through the use of the KickStarter dataset, with each important step developed into a section. The business problem of Kickstarter brings data science into practice. By applying supervised machine learning to the question addressed, the project successfully predicted the status of a project with 88.4174% accuracy using the k-NN algorithm.
Before reaching that outcome, several things had to be emphasized. After converting the business problem into a predictive data analytics solution, exploring and visualizing the given data made it apparent that project success can be predicted. The paper then moved beyond data visualization and focused on data preparation: particular attributes were selected to serve the problem, a variety of models were applied, and evaluation naturally followed. Moving a step further, the paper also discussed the deployment stage.
However, it is important to note that the result of this paper is not meant to be the only solution to the problem, nor is this specific research question the only one that may arise from the dataset, since data science depends in some sense on the individual's capability and creativity.
References
Fawcett, T., & Provost, F. (2013). Data Science for Business. Sebastopol: O’Reilly Media, Inc.
Kickstarter. (2021). Bring your creative project to life. From Kickstarter.
Kickstarter. (2021). Fees for Denmark. From Kickstarter: https://www.kickstarter.com/help/fees
Kickstarter. (2021, Nov). What do backers get in return? From Kickstarter: https://help.kickstarter.com/hc/en-us/articles/115005047953-What-do-backers-get-in-return-