
UNIT-1

Chapter-3
DATA MINING

 INTRODUCTION

 Data mining is the non-trivial process of identifying valid, novel, potentially useful,
and ultimately understandable patterns in data. With the extensive use of databases
and the explosive growth in their sizes, organizations are facing the problem of
information overload. The effective utilization of these massive volumes of data is
becoming a major problem for all enterprises. Data mining techniques support automatic
exploration of data. Data mining attempts to seek out patterns and trends in the data and
infer rules from these patterns. With these rules, the user will be able to support, review
and examine decisions in some related business or scientific area. This opens up the
possibility of a new way of interacting with databases and data warehouses.
 The evolution of data mining began when business data was first stored in computers and
technologies were developed to allow users to navigate through the data in real time. This
evolution is due to the support of three technologies that are sufficiently mature: massive
data collection, high performance computing and data mining algorithms.

 DATA MINING: DEFINITIONS

 Data mining, the extraction of hidden predictive information from large databases, is
a powerful new technology with great potential to analyze important information in the
data warehouse.
 Definitions: The term ‘data mining’ refers to the finding of relevant and useful
information from databases. Data mining and knowledge discovery in databases is a
new interdisciplinary field, merging ideas from statistics, machine learning, databases and
parallel computing. Researchers have defined the term ‘data mining’ in many ways.
Def-1:
“Data mining or knowledge discovery in databases, as it is also known, is the non-trivial
extraction of implicit, previously unknown and potentially useful information from the data.
This encompasses a number of technical approaches such as clustering, data summarization,
classification, finding dependency networks, analyzing changes and detecting anomalies.”
 Data retrieval attempts to retrieve data that is stored in the database and presents it to the user
in a way that the user can understand. It does not attempt to extract implicit information.
 One may argue that if we store ‘date-of birth’ as a field in the database and extract ‘age’ from
it, the information received from the database is not explicitly available. But all of us would
agree that the information is not ‘non-trivial’. On the other hand, if one attempts to find out
the average age of the employees in a particular company, it can be visualized as a sort of
non-trivial extraction of implicit information.
 Can we then say that extracting the average age of the employees of a department from the
employee database is a data mining task? It is definitely a type of data mining task, but at a
very low level. A higher level task would, for example, be to find correlations between the
average age and average income of individuals in an enterprise.
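To make the distinction concrete, here is a minimal Python sketch; the employee names and dates of birth are hypothetical. Deriving ‘age’ from ‘date-of-birth’ is trivial extraction, while aggregating it across employees is a (low-level) mining task:

```python
from datetime import date

# Hypothetical employee records storing only date-of-birth.
employees = [
    {"name": "Asha",  "dob": date(1985, 4, 12)},
    {"name": "Ravi",  "dob": date(1992, 11, 3)},
    {"name": "Meena", "dob": date(1978, 7, 25)},
]

def age(dob):
    # Deriving 'age' from 'date-of-birth' is trivial extraction of implicit data.
    today = date.today()
    return today.year - dob.year - ((today.month, today.day) < (dob.month, dob.day))

# Aggregating the derived field is a low-level data mining task.
average_age = sum(age(e["dob"]) for e in employees) / len(employees)
print(f"Average age: {average_age:.1f}")
```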
Def-2:
“Data mining is the search for the relationships and global patterns that exist in large
databases but are hidden among vast amounts of data, such as the relationship between
patient data and their medical diagnosis. This relationship represents valuable knowledge
about the database and the objects in the database, if the database is a faithful mirror of the
real world registered by the database.”
 Consider the employee database, and let us assume that we have a tool to determine the
relationship between fields, say the relationship between age and lunch patterns. Assume, for
example, that we find that most of the employees in their thirties like to eat pizzas, burgers or
Chinese food during their lunch break. Employees in their forties prefer to carry lunch from
their homes, and employees in their fifties take fruits and salads during lunch.
 If our tool finds this pattern from the database which records the lunch activities of all
employees for the last few months, then we can term our tool a data mining tool. Just by
examining the database, it is impossible to notice any relationship between age and lunch
patterns.
Def-3:
“Data mining refers to using a variety of techniques to identify nuggets of information or
decision making knowledge in the database and extracting these in such a way that they can be
put to use in areas such as decision support, prediction, forecasting and estimation. The data is
often voluminous, but it has low value and no direct use can be made of it. It is the hidden
information in the data that is useful.”
 Data mining is a process of finding value from volume. In any enterprise, the amount of
transactional data generated during its day-to-day operations is massive in volume.
 Data mining attempts to extract smaller pieces of valuable information from this massive
database.
Def-4:
“Discovering relations that connect variables in a database is the subject of data mining. The
data mining system self-learns from the previous history of the investigated system, formulating
and testing hypotheses about the rules which the system obeys. When concise and valuable knowledge
about the system of interest is discovered, it can and should be incorporated into some decision
support system which helps the manager to make wise and informed business decisions.”

Def-5:
“Data mining is the process of discovering meaningful new correlations, patterns and trends by
sifting through large amounts of data stored in repositories, using pattern recognition
techniques as well as statistical and mathematical techniques.”

 KDD vs. DATA MINING

 Knowledge Discovery in Database (KDD) was formalized in 1989 with reference to the
general concept of being broad and high level in the pursuit of seeking knowledge from data.
This high-level application technique is used to present and analyze data for decision-
makers.
 Data Mining is only one of the many steps involved in knowledge discovery in databases. The
various steps in the knowledge discovery process include data selection, data cleaning and
preprocessing, and the interpretation of the discovered knowledge.
 Knowledge Discovery in Databases is the process of identifying a valid, potentially useful
and ultimately understandable structure in data. This process involves selecting or
sampling data from a data warehouse, cleaning or preprocessing it, transforming or reducing
it, applying a data mining component to produce a structure, and then evaluating the derived
structure.
 Data Mining is a step in the KDD process concerned with the algorithmic means by which
patterns or structures are enumerated from the data under acceptable computational
efficiency limitations.
 Thus, the structures that are the outcomes of data mining must meet certain conditions so
that these can be considered as knowledge. These conditions are: validity,
understandability, utility, novelty and interestingness.
1. KDD is the overall process of extracting knowledge from data, whereas Data Mining is a
   step inside the KDD process which deals with identifying patterns in data.
2. KDD is the whole process of trying to make sense of data by developing appropriate
   methods or techniques, whereas data mining is the computing process of discovering
   patterns in large data sets involving methods at the intersection of machine learning,
   statistics, and database systems.
3. The goal of the KDD process is to extract knowledge from data in the context of large
   databases, whereas the goals of data mining are defined by the goal of the application,
   namely verification or discovery.
4. KDD is the overall process of converting raw data into useful information, whereas in
   data mining, algorithms are used to extract the information and patterns derived by the
   KDD process.

 STAGES OF KDD:

There are six stages/steps of KDD; they are:


 Data Selection
 Preprocessing
 Transformation
 Data Mining
 Interpretation & Evaluation
 Data Visualization

The stages of KDD, starting with the raw data and finishing with the extracted knowledge, are given
below:
 SELECTION:

o This stage is concerned with selecting the right data, that is, data relevant to some criteria.
o For example, for credit card customer profiling, we extract the types of transactions for each
type of customer, and we may not be interested in the details of the shop where the transaction
takes place.

 PREPROCESSING:

o Preprocessing is the data cleaning stage where unnecessary information is removed.

 TRANSFORMATION:

o The data is not merely transferred across, but transformed in order to be suitable for the task of
data mining.

 DATA MINING:

o This stage is concerned with the extraction of patterns from the data.

 INTERPRETATION AND EVALUATION:

o The patterns obtained in the data mining stage are converted into knowledge, which in turn, is
used to support decision making.

 DATA VISUALIZATION:

o Data visualization makes it possible for the analyst to gain a deeper, more intuitive
understanding of the data and as such can work well alongside data mining.
o Data visualization helps users to examine large volumes of data and detect the patterns
visually.
o Visual displays of data such as maps, charts and other representations allow data to be
presented compactly to the users.
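Read together, the stages form a pipeline. The following is a minimal sketch of that pipeline, assuming the pandas and scikit-learn libraries are available; the customer table, column names and values are invented for illustration:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Hypothetical raw customer-transaction data.
raw = pd.DataFrame({
    "customer_id":   [1, 2, 3, 4, 5, 6],
    "age":           [23, 45, 31, None, 52, 38],
    "monthly_spend": [120.0, 340.5, 89.9, 410.0, 55.0, 230.0],
    "shop_name":     ["A", "B", "A", "C", "B", "C"],  # irrelevant detail
})

# SELECTION: keep only the attributes relevant to the task.
data = raw[["age", "monthly_spend"]]

# PREPROCESSING: clean the data (here, drop records with missing values).
data = data.dropna()

# TRANSFORMATION: rescale so both attributes are comparable.
X = StandardScaler().fit_transform(data)

# DATA MINING: extract a pattern (here, customer groups via clustering).
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# INTERPRETATION & EVALUATION / VISUALIZATION: inspect the discovered groups.
print(data.assign(cluster=labels).groupby("cluster").mean())
```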

 DBMS vs DM

 A DBMS supports query languages, which are useful for query-driven data exploration,
whereas data mining supports automatic data exploration. If we know exactly what
information we are looking for, a DBMS query would be enough; whereas if we are not clear
about the possible correlations or patterns, then data mining techniques are useful.

 Three different ways/approaches in which data mining systems use an RDBMS:


i. 1st Approach:
 The majority of data mining systems do not use any DBMS and have their own memory
and storage management.
 They treat the database simply as a data repository (storehouse) from which data is
expected to be downloaded into their own memory structures, before the data mining
algorithm starts.
 Consequently, these systems forgo the field-proven technologies of DBMS, such as
recovery, concurrency, etc.

ii. 2nd Approach:

 In loosely coupled DBMS, DBMS is used only for storage and retrieval of data.
 For example, one can use a loosely-coupled SQL to fetch data records as required by
the mining algorithm.
 The applications use a SQL select statement to retrieve the set of records of interest
from the database.
 A loop in the application program copies records in the set one by one from the database
address space to the application address space, where computation is performed on
them. This loosely coupled approach does not use the full querying capability provided by the DBMS.
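A minimal sketch of this loosely coupled style, using Python's built-in sqlite3 module; the emp table and the computation performed on it are hypothetical:

```python
import sqlite3

# DBMS used only for storage and retrieval.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (name TEXT, age INTEGER, income REAL)")
conn.executemany("INSERT INTO emp VALUES (?, ?, ?)",
                 [("Asha", 34, 52000.0), ("Ravi", 46, 61000.0), ("Meena", 29, 48000.0)])

# A SELECT statement retrieves the set of records of interest ...
cursor = conn.execute("SELECT age, income FROM emp")

# ... and a loop copies them one by one into the application address space,
# where all computation happens, ignoring the DBMS's querying capability.
total_age, n = 0, 0
for age, income in cursor:
    total_age += age
    n += 1
print("Average age computed in the application:", total_age / n)
```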

iii. 3rd Approach:

 In the tightly coupled approach, portions of the application programs are selectively
pushed to the database system to perform the necessary computation.
 Data are stored in the database and all processing is done at the database end.
 This is different from bringing the data from the database to the data mining area.
 Instead, the data mining application goes where the data naturally reside. This
avoids performance degradation and takes full advantage of database technology.
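By contrast, a sketch of the tightly coupled style over the same hypothetical emp table: the computation is pushed to the database end, so only the result crosses the address-space boundary:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (name TEXT, age INTEGER, income REAL)")
conn.executemany("INSERT INTO emp VALUES (?, ?, ?)",
                 [("Asha", 34, 52000.0), ("Ravi", 46, 61000.0), ("Meena", 29, 48000.0)])

# The aggregation runs inside the DBMS itself; only the single result,
# not the individual records, is copied to the application.
(avg_age,) = conn.execute("SELECT AVG(age) FROM emp").fetchone()
print("Average age computed by the database:", avg_age)
```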

1. A DBMS is sometimes just called a database manager, whereas data mining is also known
   as Knowledge Discovery in Data (KDD).
2. A DBMS is a collection of computer programs dedicated to the management (i.e.
   organization, storage and retrieval) of all databases installed in a system (i.e. hard drive
   or network), whereas Data Mining is a field of computer science which deals with the
   extraction of previously unknown and interesting information from raw data.
3. A DBMS is used only for storage and retrieval of data, whereas Data Mining is a step
   inside the KDD process which deals with identifying patterns in data.
 ISSUES AND CHALLENGES IN DM:

 Data mining systems depend on databases to supply the raw input, and this raises
problems, such as the fact that those databases tend to be dynamic, incomplete, noisy and large. Other
problems arise as a result of the inadequacy and irrelevance of the information stored.
 The difficulties in data mining can be categorized as:
o Limited information
o Noise or missing data
o User interaction and prior knowledge
o Uncertainty
o Size, updates and irrelevant fields

1. Limited information:
 A database is often designed for purposes other than data mining and, sometimes,
some attributes which are essential for knowledge discovery of the application domain
are not present in the data. Thus, it may be very difficult to discover significant
knowledge about a given domain.

2. Noise and missing data :


 Attributes that rely on subjective or measurement judgments can give rise to errors, such
that some examples may be misclassified.
 Missing data can be treated in a number of ways: simply disregarding missing values,
omitting the corresponding records, inferring missing values from known values, or
treating missing data as a special value to be included additionally in the attribute
domain (see the sketch below).
 The data should be cleaned so that it is free of errors and missing data.
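These treatments map directly onto common dataframe operations; a minimal sketch assuming the pandas library, with hypothetical columns:

```python
import pandas as pd

# Hypothetical data with missing values in both a numeric and a nominal column.
df = pd.DataFrame({"age": [34, None, 46, 29], "city": ["X", "Y", None, "X"]})

# Omit the corresponding records entirely.
dropped = df.dropna()

# Infer missing numeric values from known values (here, the column mean).
inferred = df.assign(age=df["age"].fillna(df["age"].mean()))

# Treat missing data as a special value added to the attribute domain.
special = df.fillna({"city": "UNKNOWN"})

print(dropped, inferred, special, sep="\n\n")
```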

3. User interaction and prior knowledge:


 An analyst is usually not a KDD expert but simply a person making use of the data by
means of the available KDD techniques.
 Since the KDD process is by definition interactive and iterative, it is challenging to
provide a high performance, rapid-response environment that also assists the users in
the proper selection and matching of the appropriate techniques, to achieve their goals.
 There needs to be more human-computer interaction, supporting both novice and expert
users, and less emphasis on total automation.
 The use of domain knowledge is important in all steps of the KDD process.
 It would be convenient to design a KDD tool which is both interactive and iterative.

4. Uncertainty:
 This refers to the severity of error and the degree of noise in the data. Data precision is
an important consideration in a discovery system.

5. Size, updates and irrelevant fields:


 Databases tend to be large and dynamic, in that their contents keep changing as
information is added, modified or removed.
 The problem with this, from the perspective of data mining, is how to ensure that the
rules are up-to-date and consistent with the most current information.
 OTHER RELATED AREAS:

 STATISTICS:
o Statistics is a theory-rich approach to data analysis.
o Statistics, with its solid theoretical foundation, generates results that can be difficult to interpret.
o Statistics is one of the foundational principles on which data mining technology is built.
o Statistical analysis systems are used by analysts to detect unusual patterns and explain
patterns using statistical models, such as linear models.

 MACHINE LEARNING:
 Machine learning is the automation of a learning process and learning is equivalent to the
construction of rules based on observations.
 This is a broad field which includes not only learning from examples but also
reinforcement learning, learning with a teacher, etc.
 A learning algorithm takes the data set and its additional information as the input and
returns a statement, e.g. a concept representing the results of learning, as output.
 Inductive learning, where the system infers knowledge itself from observing its
environment, has two main strategies: supervised learning and unsupervised learning.

 Supervised Learning:
o Supervised learning means learning from examples, where a training set is given
which acts as examples for the classes.
o The system finds a description of each class. Once the description has been
formulated, it is used to predict the class of previously unseen objects.
o “In supervised learning, the output datasets are provided which are used to train the
machine and get the desired outputs.”
o Supervised learning is where you have input variables (x) and an output variable (Y)
and you use an algorithm to learn the mapping function from the input to the
output.
Y = f(X)
o The goal is to approximate the mapping function so well that when you have new
input data (x) you can predict the output variables (Y) for that data.
o It is called supervised learning because the process of an algorithm learning from the
training dataset can be thought of as a teacher supervising the learning process.
o Supervised learning problems can be further grouped into regression and
classification problems.
o Classification: A classification problem is when the output variable is a category,
such as “red” or “blue” or “disease” and “no disease”.
o Regression: A regression problem is when the output variable is a real value, such as
“dollars” or “weight”.
o Some popular examples of supervised machine learning algorithms are (a minimal sketch follows this list):
 Linear regression for regression problems.
 Random forest for classification and regression problems.
 Support vector machines for classification problems.
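A minimal supervised-learning sketch, assuming the scikit-learn library; the input variables (x), class labels (Y) and train/test split are hypothetical toy data:

```python
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Hypothetical toy data: input variables X with known output labels y.
X = [[25, 1], [40, 0], [35, 1], [50, 0], [23, 1], [48, 0], [31, 1], [55, 0]]
y = ["blue", "red", "blue", "red", "blue", "red", "blue", "red"]

# The labelled training set plays the role of the 'teacher'.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Learn an approximation of the mapping function Y = f(X) ...
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# ... and use it to predict the class of previously unseen objects.
print(model.predict(X_test))
print("accuracy:", model.score(X_test, y_test))
```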
 Unsupervised Learning:
o Unsupervised learning is learning from observation and discovery.
o In this mode of learning, there is no training set or prior knowledge of the classes.
The system analyzes the given set of data to observe similarities emerging out of the
subsets of the data.
o “In unsupervised learning no datasets are provided, instead the data is clustered into
different classes.”
o Unsupervised learning is where you only have input data (X) and no corresponding
output variables.
o The goal for unsupervised learning is to model the underlying structure or distribution
in the data in order to learn more about the data.
o This is called unsupervised learning because, unlike supervised learning above,
there are no correct answers and there is no teacher.
o Unsupervised learning problems can be further grouped into clustering and
association problems.
o Clustering: A clustering problem is where you want to discover the inherent
groupings in the data, such as grouping customers by purchasing behavior.
o Association: An association rule learning problem is where you want to discover
rules that describe large portions of your data, such as people that buy X also tend to
buy Y.
o Some popular examples of unsupervised learning algorithms are (a minimal sketch follows this list):
 k-means for clustering problems.
 Apriori algorithm for association rule learning problems.
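A minimal unsupervised-learning sketch using k-means, again assuming scikit-learn; the unlabelled input points are hypothetical:

```python
from sklearn.cluster import KMeans

# Only input data X; there are no output labels and no 'teacher'.
X = [[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],   # one natural grouping
     [8.0, 8.2], [7.9, 8.1], [8.3, 7.8]]   # another natural grouping

# k-means discovers the inherent groupings in the data.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("cluster labels:", km.labels_)
print("cluster centres:", km.cluster_centers_)
```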

 DM TECHNIQUES:
Researchers identify two goals of data mining: prediction and description.
 Prediction makes use of existing variables in the database in order to predict unknown or
future values of interest.
 Description focuses on finding patterns describing the data and their subsequent
presentation for user interpretation.
 The relative emphasis of both prediction and description differs with respect to the
underlying application and the technique.
 Two different classifications of DM techniques are:
 User-guided or verification-driven data mining, and
 Discovery-driven or automatic discovery of rules.

 Verification Model:
o In this process of data mining, the user makes a hypothesis and tests the hypothesis on
the data to verify its validity.
o The emphasis is on the user, who is responsible for formulating the hypothesis and
issuing the query on the data to affirm or negate the hypothesis.
 In a supermarket, for example, with a limited budget for a mailing campaign to launch a
new product, it is important to identify the section of the population most likely to buy
the new product (i.e. the demand for the new product). The user formulates a hypothesis to
identify potential customers and their common characteristics. Historical data about transactions
and demographic information can then be queried to reveal comparable purchases and the
characteristics shared by those purchasers. The whole operation can be repeated with
successive refinements of the hypothesis until the required limit is reached. Finally, the
user verifies the hypothesis against the database.

 Discovering Model:
 In the discovery model, the system automatically discovers important information hidden
in the data.
 The data is sifted in search of frequently occurring patterns, trends and generalizations
about the data, without intervention or guidance from the user.
 The discovery or data mining tools aim to reveal a large number of facts about the data
in as short a time as possible.
 For example, consider a bank database which is mined to discover the groups of
customers to target for a mailing campaign. The data is searched with no hypothesis in
mind, other than for the system to group the customers according to the common
characteristics found.
 The typical discovery-driven tasks are:
 Discovery of association rules
 Discovery of classification rules
 Clustering
 Discovery of frequent episodes
 Deviation detection

 Discovery of association rules:


 An association rule is an expression of the form X => Y, where X and Y are
sets of items. The intuitive meaning of such a rule is that the transactions of the
database which contain X tend to contain Y. Given a database, the goal is to
discover all the rules that have support and confidence greater than or equal
to the minimum support and confidence, respectively.
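Support and confidence for a rule X => Y can be computed directly from the transactions; a minimal pure-Python sketch over a hypothetical basket database:

```python
# Hypothetical transaction database (each transaction is a set of items).
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter", "jam"},
    {"bread", "butter", "jam"},
]

def support_confidence(X, Y, db):
    n_x  = sum(1 for t in db if X <= t)          # transactions containing X
    n_xy = sum(1 for t in db if (X | Y) <= t)    # transactions containing X and Y
    support = n_xy / len(db)
    confidence = n_xy / n_x if n_x else 0.0
    return support, confidence

# Rule {bread} => {butter}: do baskets containing bread tend to contain butter?
s, c = support_confidence({"bread"}, {"butter"}, transactions)
print(f"support = {s:.2f}, confidence = {c:.2f}")
```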
 Clustering:
 Clustering is a method of grouping data into different groups, so that the data
in each group share similar trends and patterns. Clustering constitutes a major
class of data mining algorithms.
 Clustering is a concept which appears in many disciplines. If a measure of similarity is
available, then there are a number of techniques for forming clusters (a simple such technique
is sketched after the list of objectives below). Another approach is to build set functions that
measure some particular property of groups.
 The objectives of clustering are:
i. To uncover natural groupings
ii. To initiate hypothesis about the data
iii. To find consistent and valid organization of the data.
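A minimal pure-Python sketch of one simple similarity-based technique, greedy threshold grouping; the points and the distance threshold are hypothetical:

```python
import math

def cluster_by_similarity(points, threshold):
    """Greedy single-pass grouping: a point joins the first cluster whose
    seed point is within `threshold`, otherwise it starts a new cluster."""
    clusters = []  # list of (seed, members) pairs
    for p in points:
        for seed, members in clusters:
            if math.dist(p, seed) <= threshold:
                members.append(p)
                break
        else:
            clusters.append((p, [p]))
    return [members for _, members in clusters]

points = [(1.0, 1.0), (1.2, 0.9), (8.0, 8.1), (7.9, 8.3), (1.1, 1.1)]
print(cluster_by_similarity(points, threshold=1.0))
```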

 Discovery of classification rules:


 Classification involves finding rules that partition the data into disjoint groups.
The input for classification is the training data set, whose class labels are
already known. Classification analyzes the training data set, constructs a
model based on the class labels, and aims to assign a class label to future
unlabelled records. Since the class field is known, this type of classification is
known as supervised learning. A set of classification rules is generated by such
a classification process, which can be used to classify future data and develop a
better understanding of each class in the database.
 There are several classification discovery models; they are decision trees, neural
networks, genetic algorithms and statistical models like linear/geometric discriminants.
The applications include credit card analysis, banking, medical applications and the
like. Consider the following example. The domestic flights in our country were at one
time operated only by Indian Airlines. Recently, many other private airlines began their
operations for domestic travel. Some of the customers of Indian Airlines started flying
with these private airlines and, as a result, Indian Airlines lost these customers. Let us
assume that Indian Airlines wants to understand why some customers remain loyal while
others leave. Ultimately, the airline wants to predict which customers it is most likely to
lose to its competitors. Their aim is to build a model based on the historical data of both
loyal customers and customers who have left. This becomes a classification problem.
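A minimal sketch of such a churn classification using a decision tree, assuming scikit-learn; the customer attributes, values and labels here are entirely hypothetical:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical historical data: [flights_per_year, avg_fare_paid, years_as_customer]
X = [[12, 4500, 8], [2, 3200, 1], [15, 5100, 10], [3, 2900, 2],
     [10, 4800, 7], [1, 3000, 1], [14, 5300, 9], [2, 3100, 1]]
y = ["loyal", "left", "loyal", "left", "loyal", "left", "loyal", "left"]

# Build a model from the labelled training data (supervised learning) ...
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# ... the learned rules partition the data into disjoint groups ...
print(export_text(tree, feature_names=["flights", "fare", "years"]))

# ... and can be used to predict which future customers are likely to leave.
print(tree.predict([[4, 3300, 2]]))
```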

 DM APPLICATION AREAS:

1. BUSINESS AND E-COMMERCE DATA


o This is a major source category of data for data mining applications. Back-office, front-
office, and network applications produce large amounts of data about business processes.
Using this data for effective decision making remains a fundamental challenge.
 BUSINESS TRANSACTIONS:
o Modern business processes are consolidating, with millions of customers and billions
of transactions. Business enterprises require necessary information for their
effective functioning in today's competitive world.
 ELECTRONIC COMMERCE:
o Not only does electronic commerce produce large data sets in which the analysis of
marketing patterns and risk patterns is critical, but it is also important to do this in
near-real time, in order to meet the demands of online transactions.

2. SCIENTIFIC, ENGINEERING AND HEALTH CARE DATA


o Scientific data and metadata tend to be more complex in structure than business data.
In addition, scientists and engineers are making increasing use of simulation and
systems with application domain knowledge.

 GENOMIC DATA:
o Genomic sequencing and mapping efforts have produced a number of databases
which are accessible on the web. In addition, there are also a wide variety of other
online databases. Finding relationships between these data sources is another
fundamental challenge for data mining.
 SENSOR DATA:
o Remote sensing data is another source of voluminous data. Remote sensing satellites
and a variety of other sensors produce large amounts of geo-referenced data. A
fundamental challenge is to understand the relationships, including causal
relationships, amongst this data.

 SIMULATION DATA:
o Simulation is now accepted as an important mode of science, supplementing theory
and experiment. Today, not only do experiments produce huge data sets, but so do
simulations. Data mining and, more generally, data intensive computing is proving to
be a critical link between theory, simulation, and experiment.

 HEALTH CARE DATA:


o Hospitals, health care organizations, insurance companies, and the concerned
government agencies accumulate large collections of data about patients and health
care-related details.

 WEB DATA:
o The data on the web is growing not only in volume but also in complexity. Web data
now includes not only text, audio and video material, but also streaming data and
numerical data.

 MULTIMEDIA DOCUMENTS:
o Today's technology for retrieving multimedia items on the web is far from
satisfactory. On the other hand, an increasingly large number of materials are on the
web, and the number of users is also growing explosively. It is becoming harder to
extract meaningful information from the archives of multimedia data as the volume
grows.

 DM APPLICATIONS:

 There is a wide range of well-established business applications for data mining. These
include customer attrition, profiling, promotion forecasting, product cross-selling, fraud
detection, targeted marketing, propensity analysis, credit scoring, risk analysis, etc. We
shall now discuss a few mock case-studies and areas of DM applications.

 HOUSING LOAN PREPAYMENT PREDICTION:


o A home-finance loan actually has an average life-span of only 7 to 10 years, due to
prepayment. Prepayment means that the loan is paid off early, rather than at the end
of, say, 25 years. People prepay loans when they refinance or when they sell their
home. The financial return that a home-finance institution derives from a loan
depends on its life-span. Therefore, it is necessary for the financial institutions to be
able to predict the life-spans of their loans. Rule discovery techniques can be used to
accurately predict the aggregate number of loan prepayments in a given quarter (or,
in a year), as a function of prevailing interest rates, borrower characteristics, and
account data. This information can be used to fine-tune loan parameters such as
interest rates, points, and fees, in order to maximize profits.
 MORTGAGE LOAN DELINQUENCY PREDICTION:
o Loan defaults usually entail expenses and losses for the banks and other lending
institutions. Data mining techniques can be used to predict whether or not a loan
would go delinquent within the succeeding 12 months, based on historical data, on
account information, borrower demographics, and economic indicators. The rules can
be used to estimate and fine-tune loan loss reserves and to gain some business
insight into the characteristics and circumstances of delinquent loans. This will also
help in deciding the funds that should be kept aside to handle bad loans.

 CRIME DETECTION:
o Crime detection is another area one might immediately associate with data mining.
Let us consider a specific case: to find patterns in 'bogus official' burglaries. A
typical example of this kind of crime is when someone turns up at the door
pretending to be from the water board, electricity board, telephone department or gas
company. At the same time as they distract the householder, their partners will search
the premises and steal cash and items of value. Victims of this sort of crime tend to
be the elderly. These cases have no obvious leads, and data mining techniques may
help in providing some unexpected connections to known perpetrators.

 STORE-LEVEL FRUITS PURCHASING PREDICTION:


o A supermarket chain called 'Fruit World' sells fruits of different types and it
purchases these fruits from the wholesale suppliers on a day-to-day basis. The
problem is to analyze fruit-buying patterns, using large volumes of data captured at
the 'basket' level. Because fruits have a short shelf-life, it is important that accurate
store-level purchasing predictions should be made to ensure optimum freshness and
availability.

END OF UNIT-1
