Prediction Analysis Techniques of Data Mining: A Review


H. Venkata Subbaiah, Buchi Reddy Chintakindi, Bikshpathi Dadi,
Dharavath Bhadru, Sharma Neeraj
Assistant Professor, Department of Computer Engineering,
Ellenki College of Engineering and Technology, Patelguda (V), near BHEL, Ameenpur (M),
Sangareddy Dist., Telangana 502319
ABSTRACT: Data mining is the technique of extracting necessary information from the raw information held in databases. With the prediction analysis techniques that data mining provides, future scenarios can be predicted from current information. Prediction analysis is the combination of clustering and classification. Many researchers have presented techniques for prediction analysis. In this review paper, techniques proposed by various authors are analyzed to understand the latest trends in prediction analysis.
KEYWORDS: Classification, Clustering, K-means, SVM.
1. INTRODUCTION
Data mining is the process of analyzing information for patterns and extracting interesting knowledge from it. Various data mining tools are available for analyzing different types of data. Applications of data mining include decision making, market basket analysis, production control, customer retention, scientific discovery and education systems [1]. Clustering, in this approach, places similar data in the same cluster and dissimilar data in different clusters. The clusters are generated by analyzing similar patterns in the input data. In biology, clustering can be used to derive plant and animal taxonomies, to categorize genes with the same functionality, and to gain insight into structures inherent in populations. In a city, clustering can identify similar houses and areas of land, and it is used in the same way in geology. Clustering can also be used to classify the documents available on the Web for information discovery. As an unsupervised classification method, clustering creates clusters such that objects in different clusters are distinct and objects in the same cluster are very similar to each other. In data mining, cluster analysis is considered a traditional topic that is applied to knowledge discovery. The data objects are grouped into a set of disjoint classes, which are known as clusters [2]. Objects within a class closely resemble each other, while objects in separate classes differ more.

Predictive analytics is the practice of extracting information from existing data sets in order to determine patterns and predict future outcomes and trends. Prediction analysis does not state with certainty what will happen in the future. It forecasts what might happen with an acceptable level of reliability, and it includes what-if scenarios and risk assessment. In this sense it predicts future possibilities rather than certainties. Applied to business, predictive models are used to analyze current data and historical facts in order to better understand customers, products and partners, and to identify potential risks and opportunities for a company. It uses a number of techniques, including data mining, statistical modeling and machine learning, to help analysts make future business forecasts. Predictive analytics is an area of statistics that deals with extracting information from data and using it to predict trends and behavior patterns. Predictive web analytics extends this by calculating statistical probabilities of future events online.

ISSN (PRINT): 2393-8374, (ONLINE): 2394-0697, VOLUME-3, ISSUE-10, 2016


DOI:10.21276/ijcesr.2016.3.10.1
1
INTERNATIONAL JOURNAL OF CURRENT ENGINEERING AND SCIENTIFIC RESEARCH (IJCESR)
Data modeling, machine learning, AI, deep learning algorithms and data mining are among the statistical techniques used in predictive analytics. Predictive analytics can be applied to any type of unknown, whether in the past, present or future, although the unknown event of interest is often in the future. Predictive analytics software applications use variables that can be measured and analyzed to predict the likely behavior of individuals, machinery or other entities. For example, an insurance company takes into account potential driving safety variables such as age, gender, location, type of vehicle and driving record when pricing and issuing auto insurance policies. Predictive analytics requires a high level of expertise with statistical methods and the ability to build predictive data models. It is therefore typically the domain of data scientists, statisticians and other skilled data analysts. Data engineers support them by helping to gather relevant data and prepare it for analysis, and software developers and business analysts support them with data visualization, dashboards and reports.

Clustering methods are divided into the following categories:
a. Partitioning Methods: - These methods collect the samples so as to generate clusters of highly similar objects, while dissimilar samples are placed in different clusters. These methods rely entirely on the distance between the samples [3].

b. Hierarchical Methods: - In this technique a given dataset of objects is decomposed hierarchically. Depending on how the decomposition is formed, the methods are classified into two types, divisive and agglomerative [4]. The agglomerative technique is the bottom-up technique, in which the first step is the formation of separate groups. Groups are then merged when they are near to each other.

c. Density Based Methods: - Many techniques separate the objects into clusters on the basis of the distance between the objects. However, such methods are helpful only for identifying spherical-shaped clusters, and it is difficult for them to find clusters of arbitrary shape. Arbitrarily shaped clusters can be obtained using density-based clustering.

d. Grid Based Methods: - These methods generate a grid structure by quantizing the object space into a finite number of cells. The approach is fast, and its processing time does not depend on the number of data objects.

1.1. Classification in Data Mining
In data mining, the classification technique is used to predict the group membership of data instances [5].
Prediction analysis is the process in which an outcome is predicted on the basis of current data. For example, on the basis of current weather information, a day can be predicted to be “sunny”, “rainy” or “cloudy”.
Two steps are followed within this process. They are:
a. Model Construction: Model construction describes a set of predetermined classes. The set of tuples used to construct the model is known as the training set. The model is represented as classification rules, decision trees or mathematical formulae/regression.
b. Model Usage: The second step in classification is model usage, in which the constructed model is used to classify unknown data and its accuracy is estimated on test data [6]. The known label of each test sample is compared with the result the model produces for it. The test set is independent of the training set.
1.2 SVM classifier
The support vector machine (SVM) classifier has been proposed for regression, classification and general pattern recognition. It is considered a good classifier in comparison to others because of its high generalization performance, which requires no prior knowledge to be added. Its performance remains good even when the dimension of the input space is extremely high. The SVM needs to identify the best classification function for differentiating between members of the two classes in the training data. The classification function can also be interpreted geometrically [7].
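As a rough illustration of the two classification steps in Section 1.1 and of the sign-based decision rule of a linear classifier, the Python sketch below builds a separating hyperplane from a small training set and then applies it to an independent test set. It is a simplification and not a true SVM: the hyperplane is placed through the midpoint of the two class means instead of being found by maximizing the margin. The data, the labels and all function names are invented for illustration.

```python
import math

def fit_midpoint_hyperplane(X, y):
    """Model construction: hyperplane f(x) = w.x + b through the midpoint
    of the two class means (a stand-in for SVM training, not max-margin)."""
    pos = [x for x, label in zip(X, y) if label == 1]
    neg = [x for x, label in zip(X, y) if label == -1]
    dim = len(X[0])
    mu_pos = [sum(p[d] for p in pos) / len(pos) for d in range(dim)]
    mu_neg = [sum(n[d] for n in neg) / len(neg) for d in range(dim)]
    w = [mu_pos[d] - mu_neg[d] for d in range(dim)]
    mid = [(mu_pos[d] + mu_neg[d]) / 2 for d in range(dim)]
    b = -sum(w[d] * mid[d] for d in range(dim))
    return w, b

def decision_value(w, b, x):
    """Value of the linear function f(x) = w.x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def predict(w, b, x):
    """Model usage: positive class if f(x) > 0, negative class otherwise."""
    return 1 if decision_value(w, b, x) > 0 else -1

def accuracy(w, b, X, y):
    """Compare the known test labels with the predicted labels."""
    hits = sum(predict(w, b, x) == label for x, label in zip(X, y))
    return hits / len(y)

def margin(w, b, X):
    """Shortest distance from any point in X to the hyperplane."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(abs(decision_value(w, b, x)) for x in X) / norm

# Hypothetical, linearly separable toy data with labels -1 and +1.
X_train = [[1.0, 1.0], [2.0, 1.0], [1.0, 2.0],
           [5.0, 5.0], [6.0, 5.0], [5.0, 6.0]]
y_train = [-1, -1, -1, 1, 1, 1]
# The test set is kept separate from the training set.
X_test = [[1.5, 1.5], [5.5, 5.5]]
y_test = [-1, 1]

w, b = fit_midpoint_hyperplane(X_train, y_train)
print("test accuracy:", accuracy(w, b, X_test, y_test))
print("margin on training data:", margin(w, b, X_train))
```

An SVM would replace `fit_midpoint_hyperplane` with an optimization that selects, among all separating hyperplanes, the one with the largest margin; the prediction and evaluation steps would stay the same.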

For a linearly separable dataset, the linear classification function corresponds to a separating hyperplane f(x) that passes through the middle of the two classes and separates them. Once this function has been determined, a new data instance xn is classified by testing the sign of f(xn); xn belongs to the positive class if f(xn) > 0. Because many such linear hyperplanes exist, an important objective of SVM is to find the best function by maximizing the margin between the two classes. The margin is the amount of space, or distance, between the two classes as defined by the hyperplane. Geometrically, it corresponds to the shortest distance between the closest data points and a point on the hyperplane. This definition helps in working out how to maximize the margin, so that only a few hyperplanes qualify as the solution to SVM even though so many hyperplanes are available [8].

SVM can be extended to perform regression analysis, where the aim is to produce a linear function that approximates the target function. The choice of error model matters here for support vector regression (SVR). The error is defined to be zero when the difference between the real and predicted values is within an epsilon amount. Otherwise, the epsilon-insensitive error grows linearly. The support vectors can be learned through the minimization of the Lagrangian. A benefit of support vector regression is its insensitivity to outliers.
A demerit of SVM is that its computations are not efficient enough. Many solutions have been proposed for this. One way to solve the issue is to break one big problem into a number of smaller problems. Each smaller problem involves only a few selected variables, so that it can be optimized efficiently. The process iterates until all the problems are eventually solved. The problem of learning an SVM can also be solved by finding an approximate minimum enclosing ball of a set of instances.
This review paper is based on prediction analysis, which is generally done with classification techniques.
This paper is organized as follows. Section 1 introduces prediction analysis and various classification techniques. Section 2 presents the literature survey on prediction analysis. Section 3 describes the result evaluation, in which a number of papers published by IEEE or Springer are studied.
2. Literature Review
Min Chen, et.al [9] presented a novel convolutional neural network based multimodal disease risk prediction (CNN-MDRP) algorithm. The data was gathered from a hospital and included both structured and unstructured data. Various machine learning algorithms were streamlined in order to make predictions about chronic disease that had spread in several regions. A prediction accuracy of 94.8% was achieved, along with a higher convergence speed than other similar enhanced algorithms.

Akhilesh Kumar Yadav, et.al presented an analysis of different analytic tools that have been used to extract information from large datasets, such as those in the medical field, where a huge amount of data is available [10]. The proposed algorithm was tested in different experiments and gave excellent results on real data sets. Compared with the existing simple k-means clustering algorithm, the proposed algorithm achieves better results on real-world problems.

Sanjay Chakrabotry et.al, (2014) presented an analysis of a clustering tool for forecasting [11]. Weather forecasting was performed using the proposed generic methodology of incremental K-means clustering. Forecasting and predicting weather events becomes easy using the modeled computations. Towards the end, the authors performed different experiments to check the correctness of the proposed approach.

Chew Li S. et.al, (2013) presented [12] a Student Performance Analysis System (SPAS) that keeps track of the recorded results of the students of a particular university. The proposed project was designed and analyzed to predict students' performance from their results data. The rules generated by the data mining technique and used by the proposed system provide enhanced results in predicting students' performance.

Students' grades are used to classify existing students with a data mining classification technique.

Qasem A. et.al, (2013) suggested [13] that prediction through data analysis is an important subject for forecasting stock returns. Future data can be predicted by investigating the past. Stock market investors have used historical knowledge to predict better timing for buying or selling stocks. Among the different data mining techniques available, the authors used a decision tree classifier in this work.

K.Rajalakshmi et.al, (2015) presented a study [14] related to the fast-growing medical field. In this field a large amount of data is generated every single day, and handling such a large amount of data is not an easy task. Medical data mining produces optimum results through prediction-based systems in the medical line. The K-means algorithm was used to analyze different existing diseases. The proposed prediction system based on data mining reduced cost and human effects.

BalaSundar V et.al, (2012) examined [15] real and artificial datasets that were used to predict the diagnosis of heart diseases with the help of a K-means clustering technique, in order to check its accuracy. Clustering, which is a part of cluster analysis, partitions the data into k clusters, and each observation belongs to the cluster with the nearest mean. The first step is random initialization over the whole data, after which each data point is assigned to one of the k clusters. The proposed scheme of integration of clustering was tested, and the results show that it can achieve the highest robustness and accuracy rate.

Daljit Kaur et.al (2013) explained [16] that clustering divides data so that similar objects are grouped in the same cluster and dissimilar objects are placed in different clusters. The proposed algorithm was tested, and the results show that it reduces the effort of numerical calculation and the complexity while remaining easy to implement. The proposed algorithm is also able to solve the dead unit problem.

Ming, J, et.al (2018) addressed the multi-dimensionality and nonlinearity that characterize the technical and economic data of mining enterprises. The analysis method for this data was researched using big data analysis and data mining technologies. The fluctuation pattern and influencing factors of mineral product prices were simplified, and a prediction model of mineral product prices was established using an artificial neural network [17]. Because of the limitations of technical and equipment conditions during mineral development, a lot of geological data is lost, which decreases the accuracy of the ore body shape and of the reserves estimation. A prediction model for the missing geological data was therefore established on the basis of geostatistics and artificial neural network techniques. Using the model, the regularity of the geological data of borehole groups and of all boreholes was discussed and analyzed. The authors' outcomes show that the prediction model is highly practicable and that its prediction accuracy is high.

Sakhare, A. V, et.al (2017) presented a survey of road accident analysis methods in data mining. Road accident analysis plays an important role in the transportation system. The paper describes road accident data analysis using different data mining methods and gives a study of the K-means algorithm. Clusters are created and then analyzed with the help of SOM [18], the self-organizing map, an unsupervised learning method based on neural networks. This improves the analysis accuracy. The number of accidents is increasing, and the resulting deaths and injuries are a big problem, so the road transportation system needed in daily life must be improved. The SOM is studied for finding patterns in road accident data that help to predict the reasons for accidents and that improve the accuracy of analysis compared with the k-means clustering algorithm.
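Several of the studies above rely on K-means clustering, in which each observation belongs to the cluster with the nearest mean. A minimal Python sketch of that basic procedure is given below, under stated assumptions: the points are invented, the initial centroids are passed in explicitly so that the run is reproducible (the reviewed work describes random initialization), and the function names are hypothetical. It does not reproduce any of the enhanced or incremental variants proposed by the reviewed authors.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def kmeans(points, initial_centroids, max_iter=100):
    """Basic K-means: alternate assignment and mean update until
    the assignments stop changing or max_iter is reached."""
    centroids = [list(c) for c in initial_centroids]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(len(centroids)),
                key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # converged: no point changed cluster
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members)
                                for col in zip(*members)]
    return labels, centroids

# Hypothetical two-dimensional data forming two well-separated groups.
points = [[1.0, 1.0], [1.0, 2.0], [2.0, 1.0],
          [8.0, 8.0], [9.0, 8.0], [8.0, 9.0]]
labels, centroids = kmeans(points, initial_centroids=[[1.0, 1.0], [8.0, 8.0]])
print(labels)
print(centroids)
```

Leaving an empty cluster at its previous centroid is only the simplest possible choice here; it is not the remedy for the dead unit problem proposed in [16].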

Chauhan, C, et.al (2017) proposed an approach that analyzes the victim system where an attack is occurring; a forensic tool kit generates the file and analyzes the data. Using the concept of data mining, the approach can extract previously unknown, useful information from unstructured data [19]. Predictive policing means using analytical and predictive techniques to identify criminals, and it has been found to be quite effective at doing so. Crime analysis is a methodical approach for identifying and analyzing patterns and trends in crime. With the increasing use of computerized systems, crime data analysts can help law enforcement officers to speed up the process of solving crimes. From the analysis of experimental data it is concluded that the advanced ID3 algorithm yields more reasonable and more effective classification rules.

Anoopkumar M, et.al (2016) gave a comprehensive survey of research papers that discuss different data mining methods, especially the most utilized and trendy algorithms applied in the EDM context. The paper accumulates and categorizes literature, identifies consequential work and mediates it to computing educators and professional bodies. It conducted a comprehensive study of the recent and relevant studies in this field to date [20]. The main focus of the study is on methods of analyzing educational information in order to develop models for improving academic performance and institutional effectiveness. Educational Data Mining (EDM) is an interdisciplinary research area that handles the development of methods for exploring information arising in scholastic fields. As a result, it provides intrinsic knowledge of the teaching and learning process for effective education planning. The outcomes of these studies give insight into techniques for ameliorating the pedagogical process, presaging student performance, comparing the precision of data mining algorithms, and demonstrating the maturity of open source tools.

Lee, E, et.al (2018) proposed a competition framework for game data mining using commercial game log data. The purpose of the game data mining competition was to promote research on game data mining by providing commercial game logs to the public. The goal of the competition was very different from that of other types of game AI competitions, which targeted strong or human-like AI players and content generators [21]. Game companies avoid sharing their game data with external researchers, so this approach enabled researchers to develop and apply state-of-the-art data mining techniques to game log data. The main objective was to predict whether a player would churn and when the player would churn during two periods between which the business model was changed from a monthly subscription to a free-to-play model. The outcome of the competition revealed that highly ranked competitors used deep learning, tree boosting and linear regression.
Authors | Techniques / Algorithms | Datasets | Attributes | Tools Used | Shortcoming | Results
Min Chen, et.al | Naïve Bayesian, KNN and Decision tree | Heart Diseases | 79 | MATLAB | This classifier has high complexity. | Decision tree performs better in comparison to other classifiers.
Akhilesh Kumar Yadav, et.al | Foggy K-mean Algorithm | Lung cancer Data | 9 | WEKA | Complexity is high. | Foggy k-mean performs well as compared to K-means.
Sanjay Chakrabotry et.al | Incremental k-mean clustering Algorithm | Air pollution Data | 7 | WEKA | Accuracy is less. | The accuracy of proposed method is achieved up to 83.3 percent.
Chew Li S. et.al | BF Tree classifier | Student’s Performance | 9 | WEKA | Complexity is high which increases the execution time. | BF Tree performs well as compared to other tree classifiers.

Conclusion
Prediction analysis is a data mining technique that predicts the future from current information. It combines clustering and classification: the clustering algorithm groups the data according to their similarity, and the classification algorithm assigns a class to the data. In this paper several prediction analysis algorithms are reviewed and analyzed in terms of many parameters. The literature survey covers various techniques of prediction analysis, from which the problem is formulated. The formulated problem can be solved in future work to increase the accuracy of prediction analysis.

References
[1] Abdelghani Bellaachia and Erhan Guven (2010), “Predicting Breast Cancer Survivability Using Data Mining Techniques”, Washington DC 20052, vol. 6, 2010, pp. 234-239.
[2] Oyelade, O. J, Oladipupo, O. O and Obagbuwa, I. C (2010), “Application of k-Means Clustering algorithm for prediction of Students’ Academic Performance”, International Journal of Computer Science and Information Security, vol. 7, 2010, pp. 123-128.
[3] Azhar Rauf, Mahfooz, Shah Khusro and Huma Javed (2012), “Enhanced K-Mean Clustering Algorithm to Reduce Number of Iterations and Time Complexity”, Middle-East Journal of Scientific Research, vol. 12, 2012, pp. 959-963.
[4] Osamor VC, Adebiyi EF, Oyelade JO and Doumbia S (2012), “Reducing the Time Requirement of K-Means Algorithm”, PLoS ONE, vol. 7, 2012, pp. 56-62.
[5] Azhar Rauf, Sheeba, Saeed Mahfooz, Shah Khusro and Huma Javed (2012), “Enhanced K-Mean Clustering Algorithm to Reduce Number of Iterations and Time Complexity”, Middle-East Journal of Scientific Research, vol. 5, 2012, pp. 959-963.
[6] Thair Nu Phyu, “Survey of Classification Techniques in Data Mining”, 2009, Proceedings of the International MultiConference of Engineers and Computer Scientists (IMECS), volume 3, issue 12, pp. 551-559.
[7] Chuan-Yu Chang, Chuan-Wang Chang, Yu-Meng Lin (2012), “Application of Support Vector Machine for Emotion Classification”, 2012 Sixth International Conference on Genetic and Evolutionary Computing, volume 12, issue 5, pp. 103-111.
[8] Himani Bhavsar, Mahesh H. Panchal (2012), “A Review on Support Vector Machine for Data Classification”, 2012, International Journal of Advanced Research in Computer Engineering & Technology (IJARCET), Volume 1, Issue 10.
[9] Min Chen, Yixue Hao, Kai Hwang, Fellow, IEEE, Lu Wang, and Lin Wang (2017), “Disease Prediction by Machine Learning over Big Data from Healthcare Communities”, 2017, IEEE, vol. 15, 2017, pp. 215-227.
[10] Akhilesh Kumar Yadav, Divya Tomar and Sonali Agarwal (2014), “Clustering of Lung Cancer Data Using Foggy K-Means”, International Conference on Recent Trends in Information Technology (ICRTIT), vol. 21, 2013, pp. 121-126.
[11] Sanjay Chakrabotry, Prof. N.K Nigwani and Lop Dey (2014), “Weather Forecasting using Incremental K-means Clustering”, vol. 8, 2014, pp. 142-147.
[12] Chew Li Sa., Bt Abang Ibrahim, D.H., Dahliana Hossain, E. and bin Hossin, M. (2014), “Student performance analysis system (SPAS)”, in Information and Communication Technology for The Muslim World (ICT4M), 2014 The 5th International Conference on, vol. 15, 2014, pp. 1-6.
[13] Qasem A. Al-Radaideh, Adel Abu Assaf and Eman Alnagi (2013), “Predicting Stock Prices Using Data Mining Techniques”, the International Arab Conference on Information Technology (ACIT’2013), vol. 23, 2013, pp. 32-38.
[14] K. Rajalakshmi, Dr. S. S. Dhenakaran and N. Roobin (2015), “Comparative Analysis of K-Means Algorithm in Disease Prediction”, International Journal of Science, Engineering and Technology Research (IJSETR), Vol. 4, 2015, pp. 1023-1028.
[15] BalaSundar V, T Devi and N Saravan (2012), “Development of a Data Clustering Algorithm for Predicting Heart”, International Journal of Computer Applications, vol. 48, 2012, pp. 423-428.
[16] Daljit Kaur and Kiran Jyot (2013), “Enhancement in the Performance of K-means Algorithm”, International Journal of Computer Science and Communication Engineering, vol. 2, 2013, pp. 724-729.

[17] Ming, J., Zhang, L., Sun, J. & Zhang, Y, “Analysis models of technical and economic data of mining enterprises based on big data analysis”, IEEE 3rd International Conference on Cloud Computing and Big Data Analysis (ICCCBDA), 2018.
[18] Sakhare, A. V., & Kasbe, P. S, “A review on road accident data analysis using data mining techniques”, International Conference on Innovations in Information, Embedded and Communication Systems (ICIIECS), 2017.
[19] Chauhan, C., & Sehgal, S, “A review: Crime analysis using data mining techniques and algorithms”, International Conference on Computing, Communication and Automation (ICCCA), 2017.
[20] Anoopkumar M, & Rahman, A. M. J. M. Z, “A Review on Data Mining techniques and factors used in Educational Data Mining to predict student amelioration”, International Conference on Data Mining and Advanced Computing (SAPIENCE), 2016.
[21] Lee, E., Jang, Y., Yoon, D.-M., Jeon, J., Yang, S., Lee, S., Kim, K.-J, “Game Data Mining Competition on Churn Prediction and Survival Analysis using Commercial Game Log Data”, IEEE Transactions on Games, 2018.
