Main Steps For Doing Data Mining Project Using Weka: February 2016
By Dalia Sami Jasim
Abstract:
The fundamental aim of data mining (DM) is to analyze data from various points of view, classify it, and summarize it. DM has become widespread in almost every application area. Although we have huge amounts of data, we often lack useful information in each field, and many DM software packages and tools exist to help us extract that advantageous information. In this work we present the essential DM steps, such as preprocessing the data (removing outliers, replacing missing values, etc.), attribute selection, which aims to keep only relevant attributes and remove irrelevant and redundant ones, and the classification and assessment of varied classifier models using the WEKA tool. The WEKA software is helpful for many types of applications and can be used in different domains; it contains many algorithms for attribute selection, classification, regression and clustering.
Keywords
Data mining, Weka, preprocessing, classification.
1. Introduction:
Data mining (DM), or knowledge discovery, is the procedure of using statistical techniques and knowledge-based methods to analyze data, mine meaningful patterns from vast data sets, and turn these into helpful information. Through the DM process, diverse techniques are used to detect relationships, patterns or associations among the dataset features, which can be transformed into knowledge about past patterns and future trends. In general, four types of relationships are discussed: classes, clusters, sequences and association patterns. The classification category deals with DM techniques that look for class and cluster relationships [1].
Classification refers to the DM task that attempts to predict the class to which every observation of the dataset should be assigned, by constructing a model based on some predictor attributes. Classification methods are divided into unsupervised and supervised. In supervised classification, one attribute of the dataset contains predetermined values that represent a grouping of the data; these groups are called classes. In unsupervised classification, the objective is to partition the observations of the dataset into groups or clusters based on some logical relationship that exists among the attribute values but that has yet to be discovered [2].
To achieve classification, we first need to preprocess the data, which we obtained from the UCI web page; we use Weka for the whole data mining process. Weka is data mining software available under the GNU General Public License. The Weka system was developed at the University of Waikato in New Zealand and is available for free at http://www.cs.waikato.ac.nz/ml/weka. The developers wrote the system in Java. Weka provides implementations of data mining and machine learning algorithms; using it, a user can perform classification, association, clustering, filtering, regression, visualization, and more.
The focus of this report is to use an existing dataset (the Dresses_Attribute_Sales Data Set for the year 2014) from the UCI Machine Learning Repository and preprocess it for the data mining process. This dataset contains attributes of dresses and their recommendations according to their sales; sales are monitored on alternate days. Many preprocessing techniques are available (cleaning the data, reducing its size, and transforming it into an appropriate type). The preprocessed dataset can then be used for a recommender system, whose objective is to build meaningful recommendations for users about items or products that might interest them.
To date no one has used the dresses sales dataset from UCI, but several approaches have been proposed in the field of recommender systems using data mining. [3] proposed a medical advice recommendation system based on a hybrid method using varied classifications and unified collaborative filtering; multiple classifications based on decision tree algorithms are applied to build an accurate predictive model that predicts the disease risk diagnosis for the monitored cases. [4] designed a novel book recommendation system: readers are redirected to recommendation pages when they cannot find the required book through the library's bibliographic retrieval system. The recommendation pages contain all the essential and extended book information for readers to refer to. Readers can recommend a book on these pages, and the recommendation data is analyzed by the system to make scientific purchasing decisions; the authors proposed two formulas to compute the book value and the copy number, respectively, based on the recommendation data. In the same vein, [5] presented a recommendation technique based on opinion mining to propose top-ranked books in different disciplines of computer science. Based on customers' needs and the reviews collected from them, they categorized features for the books, analyzed those features on the basis of several characteristics, assigned weights to the categorized features according to their importance and usage, ranked the books accordingly, and finally listed the top ten.
[6] proposed a movie recommendation system that can recommend movies to new users as well as existing ones. It mines movie databases to collect all the important information required for recommendation, such as popularity and attractiveness, and generates movie swarms that are not only convenient for movie producers planning a new movie but also useful for movie recommendation. [7] introduced a different approach to recommender systems that learns rules for user preferences using classification based on decision lists. They followed two decision-list-based classification algorithms, Repeated Incremental Pruning to Produce Error Reduction and Predictive Rule Mining, to learn rules from users' past behavior. The authors also present their proposed recommendation algorithm and discuss the advantages and disadvantages of their approach compared with traditional approaches. They validated their recommender system on the MovieLens dataset, which contains one hundred thousand movie ratings from different users and is a benchmark dataset for recommender system testing.
Table 1. List of attributes

ID  Attribute       Type
1   Dress_ID        numeric
2   Style           categorical
3   Price           categorical
4   Rating          numeric
5   Size            categorical
6   Season          categorical
7   NeckLine        categorical
8   SleeveLength    categorical
9   waiseline       categorical
10  Material        categorical
11  FabricType      categorical
12  Decoration      categorical
13  PatternType     categorical
14  Recommendation  numeric

Figure 1. Attribute visualization
3.2 Are there any experts who understand the data well and with whom you can talk?
No, I do not know any experts who understand the data and with whom I could speak.
3.3 Are values missing for some attributes?
Yes, there are missing values in the attributes Price, Season, NeckLine, waiseline, Material, FabricType, Decoration and PatternType.
When we load the dataset in Weka, the right-hand panel shows information about the selected attribute: its values, how many instances in the dataset take each particular value, and, for numeric attributes, the mean and standard deviation. The class attribute is numeric, so to meet the classification requirements we must change it from numeric to nominal using the path:
Choose → filters → unsupervised → attribute → NumericToNominal
This dataset also has an ID attribute, which we must delete because its values are all distinct and it gives no benefit to classification.
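The two steps above (converting the numeric class to nominal, then removing the ID attribute) can be sketched outside Weka in plain Python; the toy rows and helper names below are invented for illustration and are not part of Weka's API.

```python
# Sketch (not Weka itself): emulate the NumericToNominal filter on the
# class attribute, then drop the fully-distinct ID attribute.
rows = [
    {"Dress_ID": 1006032852, "Style": "Sexy",   "Recommendation": 1},
    {"Dress_ID": 1212192089, "Style": "Casual", "Recommendation": 0},
    {"Dress_ID": 1190380701, "Style": "vintage", "Recommendation": 0},
]

def numeric_to_nominal(rows, attr):
    """Turn a numeric attribute into nominal (string) labels."""
    return [{**r, attr: str(r[attr])} for r in rows]

def remove_attribute(rows, attr):
    """Drop an attribute, e.g. the fully-distinct Dress_ID."""
    return [{k: v for k, v in r.items() if k != attr} for r in rows]

prepared = remove_attribute(numeric_to_nominal(rows, "Recommendation"), "Dress_ID")
print(prepared[0])  # {'Style': 'Sexy', 'Recommendation': '1'}
```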
4. Material and Methods: the preprocessing techniques applied to each attribute of the dataset:
We can handle this problem in Notepad++ by following the path Search → Replace.
We still have the same problem with Autumn and its duplicate value Automn, with Spring and its duplicate value spring, and with Winter and its duplicate value winter; we can handle these with the same procedure as above, and the result, checked in Weka, is:
Figure 7. Attribute Season before removing duplicate names
Figure 8. Attribute Season after removing duplicate names
The same problem occurs in other attributes: NeckLine (Sweetheart/sweetheart), Style (Sexy/sexy), Price (Low/low, High/high), Size (small/s/S), SleeveLength (sleeveless/sleevless/sleeevless, cap_sleeves/capsleeves, halfsleeve/half), Material (chiffon/chiffonfabric). We solve these with the same procedure.
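The Notepad++ Search → Replace cleanup amounts to mapping every duplicate spelling onto one canonical label. A minimal Python sketch using the variant pairs named above (which spelling counts as canonical is an assumption here):

```python
# Map each duplicate spelling to one canonical label, mirroring the
# Search -> Replace passes done by hand in Notepad++.
CANONICAL = {
    "Automn": "Autumn", "spring": "Spring", "winter": "Winter",
    "sweetheart": "Sweetheart", "sexy": "Sexy",
    "low": "Low", "high": "High",
    "small": "S", "s": "S",
    "sleevless": "sleeveless", "sleeevless": "sleeveless",
    "capsleeves": "cap_sleeves",
    "half": "halfsleeve", "halesleeve": "halfsleeve",
    "chiffonfabric": "chiffon",
}

def normalize(value):
    """Return the canonical spelling, or the value unchanged if clean."""
    return CANONICAL.get(value, value)

seasons = ["Autumn", "Automn", "spring", "Winter", "winter"]
print([normalize(v) for v in seasons])
# ['Autumn', 'Autumn', 'Spring', 'Winter', 'Winter']
```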
Noisy data contains errors or outliers. To check whether our dataset (Dresses Sales) has any outliers, we use the InterquartileRange filter, which adds new attributes indicating whether instance values can be considered outliers or extreme values:
Choose → filters → unsupervised → attribute → InterquartileRange, then click Apply, and the result is:
As the figure above shows, the filter adds new attributes (Outlier and ExtremeValue). According to this result we have 120 outliers in our dataset, while the extreme-value count is zero, so there are no extreme values. To remove the outlier values we can use the path:
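For intuition, here is a small Python sketch of the interquartile-range rule behind Weka's InterquartileRange filter, flagging each value as ok, an outlier, or an extreme value using the filter's default factors of 3 and 6. The quartile method and the toy ratings are illustrative assumptions; Weka's exact quartile interpolation may differ.

```python
import statistics

def quartiles(values):
    """Lower and upper quartiles via the median-of-halves method."""
    xs = sorted(values)
    mid = len(xs) // 2
    lower = xs[:mid]
    upper = xs[mid + 1:] if len(xs) % 2 else xs[mid:]
    return statistics.median(lower), statistics.median(upper)

def flag_outliers(values, out_factor=3.0, ext_factor=6.0):
    """Label each value 'ok', 'outlier' or 'extreme', mirroring the
    Outlier/ExtremeValue attributes the filter adds."""
    q1, q3 = quartiles(values)
    iqr = q3 - q1
    labels = []
    for v in values:
        if v < q1 - ext_factor * iqr or v > q3 + ext_factor * iqr:
            labels.append("extreme")
        elif v < q1 - out_factor * iqr or v > q3 + out_factor * iqr:
            labels.append("outlier")
        else:
            labels.append("ok")
    return labels

ratings = [4.6, 4.5, 4.7, 4.4, 4.8, 3.0]   # 3.0 lies well below the rest
print(flag_outliers(ratings))
# ['ok', 'ok', 'ok', 'ok', 'ok', 'outlier']
```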
4.2 Data reduction: obtain a new representation of the data that is considerably smaller in volume yet produces the same (or roughly the same) analytical results. Data reduction strategies:
Some attributes take values over a large scale; for example, the Material attribute ranges from 0 to 23. To reduce this scale we use the Discretize filter:
Attribute      Distinct values (before / after discretization)
Style          12 / 6
NeckLine       16 / 8
SleeveLength   11 / 5
Material       24 / 10
FabricType     22 / 10
Decoration     24 / 12
PatternType    14 / 7
Choose → filters → unsupervised → attribute → Discretize
Figure 11. Example of attributes before discretization
Figure 12. Example of attributes after discretization
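Weka's unsupervised Discretize filter defaults to equal-width binning; below is a rough Python sketch of that idea applied to a 0-23 Material-style scale. The helper name and sample codes are invented for illustration, and Weka's own bin labels look different.

```python
# Equal-width discretization: split the attribute's range into n_bins
# intervals of equal width and replace each value by its bin index.
def equal_width_bins(values, n_bins=10):
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    def bin_of(v):
        b = int((v - lo) / width)
        return min(b, n_bins - 1)   # the top edge falls into the last bin
    return [bin_of(v) for v in values]

material_codes = [0, 3, 7, 12, 18, 23]
print(equal_width_bins(material_codes, n_bins=10))
# [0, 1, 3, 5, 7, 9]
```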
We apply the decision tree algorithm with ten different splits of the dataset into training and testing sets, ranging from 90:10 to 10:90; the results are shown in Table 2.

Model  Data        | 5 attributes + class               | 12 attributes + class
num.   allocation  | Accuracy    Rules  Len.  MSE       | Accuracy    Rules  Len.  MSE
1      90:10       | 66 %        37     19    0.4715    | 65.7895 %   63     32    0.5064
2      80:20       | 67.6768 %   37     19    0.4903    | 58.6667 %   63     32    0.5103
3      70:30       | 69.1275 %   37     19    0.4762    | 52.2124 %   63     32    0.5816
4      60:40       | 59.0909 %   37     19    0.4968    | 54.6667 %   63     32    0.5679
5      50:50       | 63.7097 %   37     19    0.489     | 61.1702 %   63     32    0.5181
6      40:60       | 64.094 %    37     19    0.4795    | 55.3097 %   63     32    0.6245
7      30:70       | 57.6369 %   37     19    0.502     | 55.5133 %   63     32    0.5606
8      20:80       | 60.7053 %   37     19    0.5181    | 52.4917 %   63     32    0.563
9      10:90       | 54.7085 %   37     19    0.5498    | 51.4793 %   63     32    0.6231
10     66:34       | 65.6805 %   37     19    0.4725    | 53.125 %    63     32    0.5394

(Rules = number of rules; Len. = length of rules; MSE = mean square error.)
Table 2. Decision tree accuracy results for the 10 models, using 5 attributes and 12 attributes
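The evaluation scheme behind Table 2 (shuffle, split at a given train:test ratio, fit a model, score accuracy on the held-out part) can be sketched as follows. The majority-class "classifier" is a deliberately trivial stand-in for the decision tree, and all names and data here are invented for illustration.

```python
import random

def split_eval(dataset, train_frac, classifier_factory):
    """Shuffle, split train:test at train_frac, fit, and return the
    held-out accuracy -- the scheme behind the 90:10 ... 10:90 rows."""
    data = dataset[:]
    random.Random(42).shuffle(data)      # fixed seed for repeatability
    cut = int(len(data) * train_frac)
    train, test = data[:cut], data[cut:]
    model = classifier_factory(train)
    correct = sum(1 for x, y in test if model(x) == y)
    return correct / len(test)

def majority(train):
    """Trivial stand-in classifier: always predict the majority class."""
    labels = [y for _, y in train]
    top = max(set(labels), key=labels.count)
    return lambda x: top

data = [(i, i % 3 == 0) for i in range(100)]   # ~66% majority class
for frac in (0.9, 0.5, 0.1):
    pct = int(frac * 100)
    print(f"{pct}:{100 - pct} -> {split_eval(data, frac, majority):.2f}")
```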
Then we apply the Naïve Bayes algorithm with the same ten splits of the dataset into training and testing sets, ranging from 90:10 to 10:90; the results are shown in Table 3.
[Chart: classifier accuracy (50% to 80%) for Models 1-10 at splits 90:10 through 66:34; legend series include Decision Tree and Neural Network]
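For intuition, here is a minimal categorical Naive Bayes with Laplace-style smoothing: the idea behind the Table 3 classifier, though not Weka's implementation. The two-attribute toy data merely echoes the Style and Price attributes; the class names and values are invented.

```python
from collections import Counter, defaultdict

class NaiveBayes:
    """Minimal categorical Naive Bayes with add-one smoothing."""

    def fit(self, X, y):
        self.classes = Counter(y)                 # class -> frequency
        self.counts = defaultdict(Counter)        # (feature, class) -> value counts
        for row, label in zip(X, y):
            for i, v in enumerate(row):
                self.counts[(i, label)][v] += 1
        return self

    def predict(self, row):
        def score(c):
            # prior times per-feature smoothed likelihoods
            p = self.classes[c] / sum(self.classes.values())
            for i, v in enumerate(row):
                seen = self.counts[(i, c)]
                p *= (seen[v] + 1) / (sum(seen.values()) + len(seen) + 1)
            return p
        return max(self.classes, key=score)

X = [["Sexy", "Low"], ["Casual", "Average"], ["Sexy", "High"], ["Casual", "Low"]]
y = [1, 0, 1, 0]
model = NaiveBayes().fit(X, y)
print(model.predict(["Sexy", "Average"]))   # -> 1
```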
7 Conclusions
We used the dresses recommendation dataset from UCI. First we preprocessed the dataset using Weka, then we fed the resulting preprocessed data into classification models for three algorithms (DT, NB, ANN). Because Weka selects only 5 attributes when AttributeSelection is applied, we also ran the classifier models on the dataset with the original 12 attributes to see the difference between the results with and without AttributeSelection. As the results show, the dataset gives better results for all methods after attribute selection, which means that the presence of irrelevant attributes decreases accuracy. We then examined which of the three techniques performs best by comparing accuracy and mean square error on the dataset after applying AttributeSelection; our learning procedure performs best with the Naïve Bayes model, in terms of both a higher accuracy rate and a lower mean square error.
References:
1. N Padhy, Dr. P. Mishra, “The Survey of Data Mining Applications And Feature Scope” , (IJCSEIT) International
Journal of Computer Science, Engineering and Information Technology, Vol.2, No.3, June 2012 .
2. Guerra L, McGarry M, Robles V, Bielza C, Larrañaga P, Yuste R. ,”Comparison between supervised and
unsupervised classifications of neuronal cell types: A case study. Developmental neurobiology” , 71(1): 71-
82,(2011).
3. Asmaa S. Hussein, Wail M. Omar, Xue Li, Modafar Ati, "Efficient Chronic Disease Diagnosis Prediction and Recommendation System", IEEE EMBS International Conference on Biomedical Engineering and Sciences, Langkawi, 17th-19th December 2012.
4. Binge Cui, Xin Chen,” An Online Book Recommendation System Based on Web Service”, Sixth International
Conference on Fuzzy Systems and Knowledge Discovery,IEEE,2009.
5. Shahab Saquib Sohail, Jamshed Siddiqui, Rashid Ali, "Book Recommendation System Using Opinion Mining Technique", IEEE, 2013.
6. Sajal Halder, A. M. Jehad Sarkar, Young-Koo Lee,” Movie Recommendation System Based on Movie Swarm”,
Second International Conference on Cloud and Green Computing,IEEE,2012.
7. Abinash, Vineet, "An Approach to Content Based Recommender Systems using Decision List based Classification with k-DNF Rule Set", International Conference on Information Technology, 2014.
8. Nadav Golbandi, Yehuda Koren, Ronny Lempel,” Adaptive Bootstrapping of Recommender Systems Using
Decision Trees”, ACM 978-1-4503-0493-1/11/02,2011.
9. Iván Cantador, Desmond Elliott, Joemon M. Jose,” A Case Study of Exploiting Decision Trees for an Industrial
Recommender System”, Lilybank Gardens, Glasgow, G12 8QQ, UK,2009.
10. Sofia Visa, Anca Ralescu, Mircea Ionescu,” Investigating Learning Methods for Binary Data”, IEEE,2007.
11. Anika Gupta, Dr. Deepak Garg,” Applying Data Mining Techniques in Job Recommender System for Considering
Candidate Job Preferences”,IEEE,2014.
12. Mustansar Ali Ghazanfar and Adam Prügel-Bennett, "An Improved Switching Hybrid Recommender System Using Naive Bayes Classifier and Collaborative Filtering", School of Electronics and Computer Science, University of Southampton, Highfield Campus, SO17 1BJ, United Kingdom, 2010.
13. Sutheera Puntheeranurak, Pongpan Pitakpaisarnsin,” Time-aware Recommender System Using Naïve Bayes
Classifier Weighting Technique”, 2nd International Symposium on Computer, Communication, Control and
Automation (3CA 2013).
14. Meghna Khatri,” A Survey of Naïve Bayesian Algorithms for Similarity in Recommendation Systems”,
International Journal of Advanced Research in Computer Science and Software Engineering, Volume 2, Issue 5,
May 2012.
15. Anand Shanker Tewari, Tasif Sultan Ansari, Asim Gopal Barman,” Opinion Based Book Recommendation Using
Naïve Bayes Classifier”,IEEE,2014.
16. Sutheera Puntheeranurak, Supitchaya Sanprasert,” Hybrid Naive Bayes Classifier Weighting and Singular Value
Decomposition Technique for Recommender System”,IEEE,2011.
17. Maria Nadia Postorino, Giuseppe M. L. Sarné, "A Neural Network Hybrid Recommender System", Proceedings of the 2011 Conference on Neural Nets WIRN10: Proceedings of the 20th Italian Workshop on Neural Nets, 2012.
18. Anant Gupta, Dr. B. K. Tripathy, "A Generic Hybrid Recommender System based on Neural Networks", IEEE, 2014.
19. M.K.Kavitha Devi, R.Thirumalai Samy, S.Vinoth Kumar, Dr.P.Venkatesh,” Probabilistic Neural Network approach
to Alleviate sparsity and cold start problems in Collaborative Recommender Systems”,IEEE,2010.
20. Alberto Y. Hata, Danilo Habermann, Fernando S. Osorio, and Denis F. Wolf, "Road Geometry Classification using ANN", 2014 IEEE Intelligent Vehicles Symposium (IV), June 8-11, 2014, Dearborn, Michigan, USA.
21. Arthur J. Lin, Chien-Lung Hsu, Eldon Y. Li, "Improving the effectiveness of experiential decisions by recommendation systems", ACM, 2014.
22. Machine Learning Repository (UCI) , https://archive.ics.uci.edu/ml/index.html.
23. R. Kirkby, E. Frank, "WEKA Explorer User Guide", University of Waikato, November 9, 2004.
Appendix
Figure 18. Comparison of mean accuracy and mean square error for DT, NB and NN
Figure 19. Example of raw dataset