A Review of Data Mining Literature
Abstract - With the progress of technology over the last three decades or so, an enormous amount of information has been converted into digital form, which has resulted in the formation of enormous data repositories. As information accumulated in these repositories, the challenge of extracting meaningful knowledge from it persisted. Data mining emerged as a tool to tackle this situation. Data mining is considered a stepping stone in the process of knowledge discovery in databases: it is the procedure of extracting hidden information from enormous databases to uncover meaningful patterns and rules. Data mining has now become an indispensable component in almost every field of human life. The present article provides an analysis of the available literature on data mining. The concept of data mining and its various methodologies are summarized. Some applications, tasks and issues related to it are also illustrated.

Keywords- Data mining, Knowledge discovery in database, Knowledge base.

1. INTRODUCTION

The availability of large amounts of data in almost every field, and the desire to extract useful information and knowledge from it, has been the main motivation that drew researchers towards data mining in the recent past. The information and knowledge extracted can be highly useful for applications ranging from small business management to complex engineering design to scientific exploration. Data mining is the analysis and scrutiny of very large data sets, with the aim of uncovering significant patterns and rules that were previously unknown. The core aim is to combine the data processing power of computers with the human capability to perceive patterns (Han and Kamber 2001)[1]. The era of data mining applications began in the 1980s, predominantly with research-driven tools focused on single tasks (Piatetsky-Shapiro 2000)[2]. In recent times data mining has become prominent among statisticians, data analysts and the MIS community. It was during the first workshop on KDD in 1989 that Piatetsky-Shapiro coined the phrase "knowledge discovery in database". The recognition of data mining and KDD should not be surprising considering the scale of data being collected from the various available sources. The collected data is too large to be examined manually, and even automatic data analysis based on classical statistics and machine learning often runs into problems when the volume is large and the collected data comprises complex objects. A rapidly growing, massive volume of data is collected from numerous sources and kept in vast and varied repositories. The collected data thus exceeds the human capacity for analysis without a powerful analysis tool; as a consequence these repositories become 'data vaults' that are seldom visited. Because decision makers lack tools to extract the valuable knowledge embedded in this enormous volume of data, vital decisions are made without using the information-rich data. Data mining tools analyze the data and uncover important patterns that were previously unknown. Every area of human life has become data intensive, which has made data mining an indispensable component.

Though the terms data mining and KDD have been used interchangeably, KDD can be seen as the overall process of extracting useful knowledge from data, whereas data mining can be seen as the core of KDD, comprising the algorithms that explore data, build models and discover unknown patterns.

2. REVIEW OF LITERATURE

Fayyad et al. (1996)[3], in their paper "From data mining to knowledge discovery in databases", described KDD as "a nontrivial process of identifying valid, novel, potentially useful and ultimately understandable patterns in data". Elaborating on the definition, data is any set of facts accessible in electronic form. Patterns are expressions in some language that describe a subset of the data or a model applicable to it. The patterns must be valid, so that they hold true for new data. Process indicates that KDD includes multiple steps, from data preparation to knowledge refinement, all repeated until the desired results are achieved. Nontrivial indicates that some form of search or inference is involved, which distinguishes it from the straightforward computation of values. Fayyad and Stolorz (1997)[4] in their paper described KDD as "generalized procedure of uncovering treasured knowledge from data with mining being one among other steps in that process that uses some algorithms for knowledge extraction process". Ling and Li (1998)[5]
437 https://sites.google.com/site/ijcsis/
ISSN 1947-5500
International Journal of Computer Science and Information Security (IJCSIS),
Vol. 14, No. 11, November 2016
proposed data mining as an effective tool for direct marketing, so as to improve product marketing in a technological age in which traditional means such as mass marketing are in decline. Using data mining, buyers' patterns can be determined in order to single out potential buyers from a customer list. It was demonstrated that data mining as a direct marketing tool can bring more profit than traditional mass marketing because it targets only the potential buyers. Goebel and Gruenwald (1999)[6], in their paper "A survey of data mining and knowledge discovery software tools", provided a general view of common knowledge discovery tasks and the various methodologies used to solve them. They proposed a feature classification scheme that was used to study knowledge discovery and data mining software. They specified some of the features that should be considered important for knowledge discovery software, so that it can be used effectively, and pointed to issues that had not been sufficiently studied.

Many organizations in the world today have very large databases that grow without limit. New data is added to these databases at a rate of millions of records per day. Databases of this type pose a new challenge and offer unique opportunities for mining data streams. Domingos and Hulten (2000)[7] described and evaluated VFDT on such huge databases. They used Hoeffding trees, which allow learning in a small constant time per example and produce a result that is asymptotically nearly identical to that of a batch learner. Differentiating data mining from information, Hand et al. (2001)[8] defined data mining as "analysis of huge datasets to discover the unsuspected relationship and to review data in more logical way so that it serves the desired results". According to Rygielski et al. (2002)[9], data mining technology has added a new dimension to CRM. The power of data mining to extract previously unknown predictive information from vast datasets has found its way into CRM to identify and evaluate valuable customers and to predict customers' shopping behavior, which helps vendors take proactive, knowledge-based decisions. Keogh et al. (2004)[10] discussed "parameter free data mining", since parameter-laden algorithms may overestimate or underestimate certain parameters and thus yield patterns that are not fully accurate. Parameter-free mining can prevent us from imposing our own presumptions or prejudices. They proposed a data mining paradigm centered on compression.

Mining streaming data is considered a hard task in knowledge discovery in databases; traditional mining approaches cannot deal with it because the data arrives in multiple, continuous and time-varying streams. Alhammady (2007)[11] introduced a novel approach for mining emerging patterns in streaming data, which was shown experimentally to have better mining complexity and classification accuracy. Most current data mining methodologies explore knowledge in a single data table, but recently many of them have been extended to the relational case. Relational data mining involves applying a data mining approach to data spread over multiple tables in order to extract the knowledge in it (Saso Dzeroski 2009)[12]. Venkatadri et al. (2011)[13] argued that appropriate techniques and methodologies will be needed in future to meet the needs of data mining as it moves into more and more complex fields, where the data is huge but full of hidden information. Silwattananusarn et al. (2012)[14] in their review paper explored the applications of data mining techniques that have evolved over time to support the knowledge management process. Data mining is used extensively across various fields, each supported by its own data mining techniques; it was shown that data mining can be integrated into a knowledge management framework and can improve that process with better knowledge. Tomar et al. (2013)[15] presented data mining as a vibrant and appealing research area that is gaining popularity in the medical domain. Data mining provides several benefits in the healthcare domain and enhances medical services in a cost-effective manner. Saurkar et al. (2014)[16] defined data mining as "interdisciplinary field which consists of integrated databases, artificial intelligence, machine learning, statistics etc.". They described data mining as a multi-step process comprising preparation of data for mining, mining algorithms, analysis of results and interpretation of results. The capability of data mining to dig deep into data and extract hidden information and knowledge from it has received tremendous attention from business professionals, who use it to generate patterns related to customer behavior, predict future sales and trends, and assist policy makers in decision making with the aim of increasing profits (Soni 2015)[17].

3. ARCHITECTURE OF DATA MINING

The architecture of data mining is shown in Figure 1.

Figure 1. Architecture of Data Mining. (credits: Han and Kamber [1]).
Knowledge Base: It is the starting point for the whole data mining process. It acts as a guide for the search and for assessing the interestingness of the resulting patterns. Such knowledge may include concept hierarchies, which organize attributes or their values into distinct levels of abstraction.

Data mining engine: It is the core component of the mining system. It consists of all the modules needed to perform data mining tasks, such as characterization, prediction, cluster analysis, outlier analysis and evolution analysis.

Pattern evaluation module: This module is usually associated with interestingness measures. It interacts continuously with the data mining engine to keep the search focused on interesting patterns. Often it uses thresholds to filter out discovered patterns; alternatively, it may be integrated with the mining module, depending on the data mining technique used.

User interface: This module acts as the connection between users and the data mining system. It lets users interact with the system easily and efficiently without exposing them to the complexity behind the process.

Data sources (www, data warehouse, database, other repositories): These are the actual sources of data; an enormous volume of historical data is required for successful data mining. Organizations typically store data in databases or data warehouses. Sometimes a data warehouse contains more than one database, text file or spreadsheet. The www is another huge source of data.

Database or data warehouse server: It contains the actual data that is ready to be fetched. Fetching data on users' request is its key responsibility.

Other process: Before data is passed on to the data warehouse server, it needs to be cleaned and integrated, because data is collected from distinct sources in different formats and cannot be used directly for the mining process. The data needs to be cleaned and integrated, and only the reliable data should be selected and passed on to the data warehouse server. This may require a number of techniques for cleaning, integration and selection.

4. DATA MINING PROCESS

Fayyad et al. (1996)[18] defined data mining as one among several steps in the process of knowledge discovery: it involves applying data analysis and discovery algorithms that, within acceptable computational efficiency limits, yield a particular enumeration of patterns over the data. This process is interactive and iterative and comprises many steps, with decisions made by the user and attempts made at every step to complete a particular discovery task, each accomplished by applying a discovery method. Some use data mining as a synonym for the knowledge discovery in databases (KDD) process, whereas many consider it an essential step of KDD that yields useful patterns or models of the data. The various processes for data mining are shown in Figure 2.
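The steps of this process (selection, preprocessing, transformation, data mining, and interpretation and evaluation) can be sketched as a small pipeline. The sketch below is only an illustration and is not taken from any of the reviewed papers: the records, field names, function names and the 0.5 support threshold are all hypothetical, and simple counting of co-occurring item pairs stands in for whatever mining algorithm a real system would choose.

```python
from collections import Counter
from itertools import combinations

# Hypothetical raw records from two sources; None marks a missing value.
SOURCE_A = [
    {"basket": 1, "item": "bread", "store": "north"},
    {"basket": 1, "item": "milk", "store": "north"},
    {"basket": 2, "item": "bread", "store": "north"},
    {"basket": 2, "item": None, "store": "north"},
]
SOURCE_B = [
    {"basket": 2, "item": "milk", "store": "south"},
    {"basket": 3, "item": "bread", "store": "south"},
    {"basket": 3, "item": "milk", "store": "south"},
    {"basket": 3, "item": "milk", "store": "south"},  # duplicate entry
    {"basket": 4, "item": "eggs", "store": "south"},
]


def select(records):
    """Selection: keep only the fields relevant to the mining task."""
    return [(r["basket"], r["item"]) for r in records]


def preprocess(*sources):
    """Preprocessing: combine sources, drop incomplete rows and duplicates."""
    combined = [row for src in sources for row in src]
    return sorted({row for row in combined if row[1] is not None})


def transform(rows):
    """Transformation: aggregate rows into one item set per basket."""
    baskets = {}
    for basket, item in rows:
        baskets.setdefault(basket, set()).add(item)
    return baskets


def mine(baskets):
    """Data mining: count how often each pair of items occurs together."""
    counts = Counter()
    for items in baskets.values():
        counts.update(combinations(sorted(items), 2))
    return counts


def evaluate(counts, n_baskets, min_support=0.5):
    """Interpretation and evaluation: keep pairs that meet a support threshold."""
    return {pair: c / n_baskets for pair, c in counts.items()
            if c / n_baskets >= min_support}


def run_pipeline():
    rows = preprocess(select(SOURCE_A), select(SOURCE_B))
    baskets = transform(rows)
    return evaluate(mine(baskets), len(baskets))


if __name__ == "__main__":
    print(run_pipeline())
```

Each function corresponds to one step, so a step can be repeated or swapped without touching the others, which mirrors the iterative nature of the process described above.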
Selection: Selecting the relevant data from distinct sources for the mining process.

Preprocessing: Because data is gathered from different sources, it contains inconsistencies. Various activities are carried out in this phase to remove them: faulty data is corrected or removed, noise and discrepancies are removed, and data from distinct sources is combined.

Transformation: Here data is transformed into a form appropriate for mining. Feature selection, sampling and aggregation may be used.

Data Mining: This is the significant step in which a mining algorithm appropriate to the patterns in the data is chosen. Extraction of data patterns is also carried out.

Interpretation and Evaluation: Mining results or patterns are interpreted and turned into knowledge by eliminating redundant and irrelevant patterns. Here an assortment of visualization and GUI techniques is used to translate the useful patterns into terms humans can understand.

5. DATA MINING TASKS

Data mining tasks are grouped into two main categories: 1. Predictive 2. Descriptive. These two are considered the primary objectives of data mining. Fayyad et al. (1996) define six main functions of data mining: 1. Classification 2. Regression 3. Clustering 4. Dependency modelling 5. Deviation detection 6. Summarization.

Classification, regression and anomaly (deviation) detection fall under the predictive category, whereas clustering and dependency modelling fall under the descriptive category. A predictive model uses some variables in the dataset to predict unknown values of other relevant variables, whereas a descriptive model identifies patterns or relationships and presents human-understandable patterns and trends in the data (Gorunescu 2011)[19].

Classification: Classification is among the classical data mining techniques and is grounded in machine learning. It finds common properties among a set of objects in a database and assigns them to different classes in accordance with a classification model. Its main objective is to analyze the training data and develop an accurate description or model for each class using the features available in the data. This method uses mathematical techniques such as decision trees, neural networks and statistics (Chen et al. 1996)[20].

Regression: It is a data mining technique that defines the association between dependent and independent variables. Prediction is accomplished with the support of regression. Statistically, regression is the mathematical model that establishes the connection between the values of the dependent variable and the values of the predictor or independent variables. In regression the predicted variable is typically continuous: a learned function maps each data item to a real-valued prediction variable. Statistical regression, neural networks and support vector machine regression are some of the commonly used regression strategies. More complex techniques such as logistic regression, decision trees or neural networks can also be used for forecasting future values, and these techniques can be combined to obtain better results.

Clustering: It is a data mining technique that groups physical or abstract objects into classes of similar objects [19]. Clustering divides a set of data (records/tuples/objects/samples) into several groups (clusters) based on predetermined similarities. The principal aim of clustering is to find groups (clusters) of objects based on affinity, so that objects within an individual cluster closely resemble each other while the clusters differ sufficiently from one another. In machine learning terminology, clustering is a form of unsupervised learning [19].

Dependency Modelling (Association Rule Mining): It is among the best-known data mining techniques and is categorized as an unsupervised technique. It aims at finding connections or relations between items or records belonging to a large dataset and describes significant dependencies among variables. An association rule is an implication of the form X → Y, where X and Y are disjoint items or itemsets, yielding if-then statements about attribute values. Such rules are commonly used in market basket analysis, which analyzes customers purchasing certain items and provides insight into the combinations that customers frequently purchase together.

Anomaly detection: As its name suggests, it deals with discovering the most significant changes or deviations from standard behavior [19].

Summarization: Though not itself among the techniques of data mining, it is a result of these techniques and deals with finding a compact description for a subset of data; it is also referred to as generalization or description.

Sequential Patterns: Sequence discovery is a data mining technique used to determine sequential patterns, associations or regular events and trends between variable data fields over a business period.

6. ISSUES IN DATA MINING

Although data mining is well developed, it still faces a variety of issues in its practical implementation [1], some of which are discussed below:

Security issues: Security is the most critical issue in any data transaction process. Given the extremely confidential nature of the data, potential illegal access to the knowledge should be prevented and secrecy must be guarded.
Mining Methodology issues: Because different users are interested in different kinds of knowledge, data mining must cover a wide spectrum of data analysis and knowledge discovery tasks, which may use the same database in different ways and require the development of numerous data mining techniques.

User interface issues: Knowledge discovered by data mining tools is useful and meaningful only as long as it is presented clearly and engagingly to the user. Because it is difficult to know in advance what can be discovered within a database, the mining process should be interactive, and results should be presented in a high-level language, visual representations or other graphical forms, so that the user can understand, interpret and use them as required.

Handling noisy and incomplete data: Data stored in databases can vary in quality because of various issues associated with the data sources: the data may be incomplete, or it may contain exceptional cases. Mining data with these irregularities introduces ambiguities into the process, causing the constructed knowledge model to over-fit the data and reducing the accuracy of the resulting knowledge, so mining methods that can handle these inconsistencies are required.

Performance issues: Efficiency and scalability are key factors when implementing data mining on a database system. Information must be extracted effectively and efficiently from databases because they are enormous in size. The algorithms used should be efficient and scalable, and their running time needs to be predictable and acceptable for large databases.

7. APPLICATIONS OF DATA MINING

Application of Data Mining in Health Care: Data mining can be highly beneficial in the healthcare system, but its success hinges on the availability of clean data. In healthcare it is used for the diagnosis and prognosis of disease, and associations among diseases can also be established. Physicians can identify effective best practices so that patients receive better and more affordable services. Because healthcare data is too complex and voluminous to be processed and analyzed by conventional means, data mining provides the methodology and tools for transforming data into information for effective decision making (Parvathi and Rautaray 2014)[21].

Application of Data Mining in Educational Systems: Data mining in educational systems is an evolving field in which researchers are showing deep interest. Millions of students enroll each year in different institutions, adding an immense volume of data. Data mining techniques can help bridge the knowledge gap in educational systems by uncovering hidden patterns, associations and anomalies. This helps stakeholders make more effective decisions, which results in an improved educational system (Romero and Ventura 2007)[22].

Application of Data Mining in CRM: Data mining in CRM is currently among the most discussed research topics in industry and academia, and the literature on the use of data mining techniques in the CRM domain has been reviewed and classified (Ngai et al. 2009)[23].

Application of Data Mining in Market Basket Analysis (MBA): Various data mining techniques are used for market basket analysis (MBA), a technique that finds associations between the items a customer puts in the shopping cart and thereby observes customers' shopping habits. Businesses can use data mining techniques to identify the buying patterns and behavior of customers, on the basis of which a range of choices can be presented to each customer according to their buying habits (Raeder and Chawla 2011)[24].

Application of Data Mining in Sports Data: Data mining techniques have also made inroads into the field of sports. A huge number of games are played, with each sport generating a massive amount of statistical data. This data needs to be maintained with regard to the scheduling of events and the statistics of individual players in these events. Data mining can be used for forecasting and analysis of performance and for strategy planning (Solieman 2006)[25].

8. CONCLUSION

This paper presented a review of the literature on data mining, a technique used to uncover hidden and useful patterns in vast datasets. These discovered trends help organizations predict the future behavior of customers or products. The study gives an overview of various data mining techniques, methods and processes, and of some issues related to data mining. In future we intend to review and compare the various algorithms used in data mining.

9. REFERENCES

[1]. Han, Jiawei, Jian Pei, and Micheline Kamber. Data mining: concepts and techniques. Elsevier, 2011.

[2]. Piatetsky-Shapiro, Gregory. "Knowledge discovery in databases: 10 years after." ACM SIGKDD Explorations Newsletter 1.2 (2000): 59-61.

[3]. Fayyad, Usama, Gregory Piatetsky-Shapiro, and Padhraic Smyth. "From data mining to knowledge discovery in databases." AI magazine 17.3 (1996): 37.

[4]. Fayyad, Usama, and Paul Stolorz. "Data mining and KDD: Promise and challenges." Future generation computer systems 13.2 (1997): 99-115.
[5]. Ling, Charles X., and Chenghui Li. "Data Mining for Direct Marketing: Problems and Solutions." KDD. Vol. 98. 1998.

[6]. Goebel, Michael, and Le Gruenwald. "A survey of data mining and knowledge discovery software tools." ACM SIGKDD explorations newsletter 1.1 (1999): 20-33.

[7]. Domingos, Pedro, and Geoff Hulten. "Mining high-speed data streams." Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2000.

[8]. Hand, David J., Heikki Mannila, and Padhraic Smyth. Principles of data mining. MIT press, 2001.

[9]. Rygielski, Chris, Jyun-Cheng Wang, and David C. Yen. "Data mining techniques for customer relationship management." Technology in society 24.4 (2002): 483-502.

[10]. Keogh, Eamonn, Stefano Lonardi, and Chotirat Ann Ratanamahatana. "Towards parameter-free data mining." Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2004.

[11]. Alhammady, Hamad. "A novel approach for mining emerging patterns in data streams." Signal Processing and Its Applications, 2007. ISSPA 2007. 9th International Symposium on. IEEE, 2007.

[12]. Džeroski, Sašo. Relational data mining. Springer US, 2009.

[18]. Fayyad, Usama, Gregory Piatetsky-Shapiro, and Padhraic Smyth. "The KDD process for extracting useful knowledge from volumes of data." Communications of the ACM 39.11 (1996): 27-34.

[19]. Gorunescu, Florin. Data Mining: Concepts, models and techniques. Vol. 12. Springer Science & Business Media, 2011.

[20]. Chen, Ming-Syan, Jiawei Han, and Philip S. Yu. "Data mining: an overview from a database perspective." IEEE Transactions on Knowledge and data Engineering 8.6 (1996): 866-883.

[21]. Parvathi, I., and Siddharth Rautaray. "Survey on data mining techniques for the diagnosis of diseases in medical domain." International Journal of Computer Science and Information Technologies 5.1 (2014): 838-846.

[22]. Romero, Cristobal, and Sebastian Ventura. "Educational data mining: A survey from 1995 to 2005." Expert systems with applications 33.1 (2007): 135-146.

[23]. Ngai, Eric WT, Li Xiu, and Dorothy CK Chau. "Application of data mining techniques in customer relationship management: A literature review and classification." Expert systems with applications 36.2 (2009): 2592-2602.

[24]. Raeder, Troy, and Nitesh V. Chawla. "Market basket analysis with networks." Social network analysis and mining 1.2 (2011): 97-113.

[25]. Solieman, Osama K. "Data mining in sports: A research overview." Dept. of Management Information Systems (2006).