Lecture 8 Applications of Data Mining
A wide range of companies have deployed successful applications of data mining. While
early adopters of this technology have tended to be in information-intensive industries such
as financial services and direct mail marketing, the technology is applicable to any company
looking to leverage a large data warehouse to better manage their customer relationships.
Two critical factors for success with data mining are: a large, well-integrated data warehouse
and a well-defined understanding of the business process within which data mining is to be
applied (such as customer prospecting, retention, campaign management, and so on).
Each of these examples has a clear common ground. They leverage the knowledge about
customers implicit in a data warehouse to reduce costs and improve the value of customer
relationships. These organizations can now focus their efforts on the most important
(profitable) customers and prospects, and design targeted marketing strategies to best reach
them.
Data mining has a number of applications. The first is market
segmentation. With market segmentation, you can find behaviors that are common
among your customers, for example by looking for patterns among customers who tend to purchase the
same products at the same time. Another application of data mining is customer churn analysis.
Churn analysis allows you to estimate which customers are the most likely to stop
purchasing your products or services and go to one of your competitors. In addition to this, a
WWW.VIDYARTHIPLUS.COM V+ TEAM
company can use data mining to find out which purchases are the most likely to be
fraudulent.
For example, by using data mining a retail store may be able to determine which products are
stolen the most. By finding out which products are stolen the most, steps can be taken to
protect those products and detect those who are stealing them. While direct mail marketing is
an older technique that has been used for many years, companies that combine it with data
mining can target it far more effectively. For example, you can use data mining to find out
which customers will respond favorably to a direct mail marketing strategy. You can also use
data mining to determine the effectiveness of interactive marketing. Some of your customers
will be more likely to purchase your products online than offline, and you must identify them.
While many businesses use data mining to help increase their profits, many of them don't
realize that it can be used to create new businesses and industries. One industry that can be
created by data mining is the automatic prediction of both behaviors and trends. Imagine for a
moment that you were the owner of a fashion company and were able to precisely
predict the next big fashion trend based on the behavior and shopping patterns of your
customers. It is easy to see how profitable that could be.
You would have an advantage over your competitors. Instead of simply guessing what the
next big trend will be, you would determine it based on statistics, patterns, and logic.
Another example of automatic prediction is to use data mining to look at your past marketing
strategies. Which one worked the best? Why did it work the best? Who were the customers
that responded most favorably to it? Data mining will allow you to answer these questions,
and once you have the answers, you will be able to avoid making any mistakes that you made
in your previous marketing campaign. Data mining can allow you to become better at what
you do. It is also a powerful tool for those who deal with finances. A financial institution
such as a bank can predict the number of defaults that will occur among their customers
within a given period of time, and they can also predict the amount of fraud that will occur as
well.
Another potential application of data mining is the automatic recognition of patterns that
were not previously known. Imagine if you had a tool that could automatically search your
database to look for patterns which are hidden. If you had access to this technology, you
would be able to find relationships that could allow you to make strategic decisions.
aim to breach the privacy of services, data in a computer system or alternatively, in the
context of discovering evidence left in a computer system as part of criminal activity.
Applications of Data Mining in Computer Security concentrates heavily on the use of data
mining in the area of intrusion detection. The reason for this is twofold. First, the volume of
data dealing with both network and host activity is so large that it makes it an ideal candidate
for using data mining techniques. Second, intrusion detection is an extremely critical activity.
This book also addresses the application of data mining to computer forensics. This is a
crucial area that seeks to address the needs of law enforcement in analyzing the digital
evidence.
Data Mining can offer the individual many benefits by improving customer service and
satisfaction, and lifestyle in general. However, it also has serious implications regarding
one’s right to privacy and data security.
Data Mining can also have multiple personal uses, such as:
- Identifying patterns in medical applications.
- Choosing the best companies based on customer service.
- Classifying email messages.
Data Privacy:
In 1980, the Organisation for Economic Co-operation and Development (OECD) established a
set of international guidelines, referred to as fair information practices. These guidelines aim
to protect privacy and data accuracy.
Data Security:
Many data security enhancing techniques have been developed to help protect data.
Databases can employ a multilevel security model to classify and restrict data according to
various security levels with users permitted access to only their authorized level.
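The multilevel idea can be illustrated with a minimal sketch. The level names, the `Record` type, and the rule that a user may read rows at or below their clearance are assumptions made for illustration, not a description of any particular database product.

```python
from dataclasses import dataclass

# Hypothetical ordering of security levels, lowest to highest.
LEVELS = {"public": 0, "confidential": 1, "secret": 2}


@dataclass
class Record:
    level: str    # classification label assigned to the row
    payload: str  # the protected data


def visible_records(records, clearance):
    """Return only the records whose level does not exceed the user's clearance."""
    limit = LEVELS[clearance]
    return [r for r in records if LEVELS[r.level] <= limit]


records = [
    Record("public", "store opening hours"),
    Record("confidential", "customer purchase history"),
    Record("secret", "fraud investigation notes"),
]

# A user cleared for "confidential" never sees the "secret" row.
analyst_view = visible_records(records, "confidential")
print([r.payload for r in analyst_view])
```

Filtering at query time, as here, keeps the classification attached to each row, so the same table can serve users with different clearances.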
5.3 Tools
7. Delta Miner:
Delta Miner is a multiple-strategy tool that supports clustering, summarization,
deviation detection, and visualization.
8. IBM Intelligent Miner:
Intelligent Miner is an integrated and comprehensive set of data mining tools. It uses
decision trees, neural networks, and clustering.
9. Mine Set:
Mine Set is a comprehensive tool for data mining. Its features include extensive data
manipulation and transformation.
10. SPIRIT:
SPIRIT is a tool for exploration and modeling using Bayesian techniques.
11. WEKA:
WEKA is a software environment that integrates several machine learning tools within a
common framework and a uniform GUI.
A data mining system, DBMiner, has been developed for interactive mining of multiple-level
knowledge in large relational databases. The system implements a wide spectrum of data
mining functions, including generalization, characterization, association, classification, and
prediction.
Introduction:
With the upsurge of research and development activities on knowledge discovery in
databases, a data mining system, DBMiner, has been developed based on our studies of data
mining techniques and our experience in the development of an early system prototype,
DBlearn.
The functionalities of the knowledge discovery modules are briefly described as follows:
The characterizer generalizes a set of task-relevant data into a generalized relation which can
then be used for extraction of different kinds of rules to be viewed at multiple concept levels
from different angles.
A discriminator discovers a set of discriminant rules which summarize the features that
distinguish the class being examined from other classes.
An Association Rule Finder discovers a set of association rules at the multiple concept levels
from the relevant sets of data in a database.
A meta-rule guided miner is a data mining mechanism which takes a user specified meta-rule
form as a pattern to confine the search for desired rule.
A predictor predicts the possible values of some missing data or the value distribution of
certain attributes in a set of objects.
A data evolution evaluator evaluates the data evolution regularities for certain objects whose
behavior changes over time.
A deviation evaluator evaluates the deviation patterns for a set of task relevant data in a
database.
Data mining is the process of discovering previously unknown, actionable and profitable
information from large consolidated databases and using it to support tactical and strategic
business decisions.
The statistical techniques of data mining are familiar. They include linear and logistic
regression, multivariate analysis, principal components analysis, decision trees and neural
networks. Traditional approaches to statistical inference fail with large databases, however,
because with thousands or millions of cases and hundreds or thousands of variables there will
be a high level of redundancy among the variables, there will be spurious relationships, and
even the weakest relationships will be highly significant by any statistical test. The objective
is to build a model with significant predictive power. It is not enough just to find which
relationships are statistically significant.
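The point about significance can be made concrete with a small sketch. It assumes a Pearson correlation tested with the Fisher z-transformation (a normal approximation); the correlation value 0.01 and the sample sizes are illustrative, not taken from any real data set.

```python
import math


def correlation_p_value(r, n):
    """Two-sided p-value for a correlation r observed on n cases,
    using the Fisher z-transformation (normal approximation)."""
    z = math.atanh(r) * math.sqrt(n - 3)
    return math.erfc(abs(z) / math.sqrt(2))


r = 0.01  # a very weak relationship (illustrative value)
for n in (1_000, 1_000_000):
    p = correlation_p_value(r, n)
    # r * r is the share of variance explained, which does not depend on n.
    print(f"n={n:>9,}  p-value={p:.3g}  variance explained={r * r:.4%}")
```

With a thousand cases the weak relationship is not significant; with a million cases it is overwhelmingly significant, yet it explains the same negligible share of the variance. This is why predictive power, not significance, is the useful criterion.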
Consider a campaign offering a product or service for sale, directed at a given customer base.
Typically, about 1% of the customer base will be "responders," customers who will purchase
The data collected during the Northridge, California earthquake occupied several warehouses,
and ranged from magnetic media to bound copies of printed reports. Nautilus Systems
personnel sorted, organized, and cataloged the materials. Documents were scanned and
converted to text. Data were organized chronologically and according to situation reports,
raw data, agency data, and agency reports. For example, the Department of Transportation
had information on highways, street structures, airport structures, and related damage
assessments.
Nautilus Systems applied its proprietary data mining techniques to extract and refine data.
Geography was used to link related information, and text searches were used to group
information tagged with specific names (e.g., Oakland Bay Bridge, San Mateo, Marina). The
refined data were further analyzed to detect patterns, trends, associations and factors not
readily apparent. At that time, there was not a seismographic timeline, but it was possible to
map the disaster track to analyze the migration of damage based upon geographic location.
Many types of analyses were done. For example, the severity of damage was analyzed
according to type of physical structure, pre- versus post-1970 earthquake building codes, and
off track versus on track damage. It was clear that the earthquake building codes limited the
degree of damage.
Nautilus Systems also looked at the data coming into the command and control center. The
volume of data was so great that a lot was filtered out before it got to the decision support
level. This demonstrated the need for a management system to build intermediate decision
blocks and communicate the information where it was needed. Much of the information
needed was also geographic in nature. There was no ability to generate accurate maps for
response personnel, both route maps including blocked streets and maps defining disaster
boundaries. There were no interoperable communications between local police, the fire
department, utility companies, and the disaster field office. There were also no predefined
rules of engagement between FEMA and local resources, resulting in delayed response
(including such critical areas as firefighting).
Benefits
Nautilus Systems identified recurring data elements, data relationships and metadata, and
assisted in the construction of the Emergency Information Management System (EIMS). The
EIMS facilitates rapid building and maintenance of disaster operations plans, and provides
consistent, integrated command (decision support), control (logistics management), and
communication (information dissemination) throughout all phases of disaster management.
Its remote GIS capability provides the ability to support multiple disasters with a central GIS
team, conserving scarce resources.
Deficiencies
- A topic of any breadth may easily contain hundreds of thousands of
documents
- Many documents that are highly relevant to a topic may not contain
keywords defining them.
Web Mining Subtasks:
Resource Finding
- Task of retrieving intended web-documents.
Information Selection and Pre-Processing
- Automatically selecting and pre-processing specific information from
retrieved web resources.
Generalization
- Automatic discovery of patterns in Web Sites.
Analysis
- Validation and / or interpretation of mined patterns
Web Content Mining:
Discovery of useful information from web contents / data documents
- Web data contents: text, image, audio, video, metadata and hyperlinks
Information Retrieval View
- Assist / improve information finding
- Filter information for users based on user profiles
Database View
- Model data on the web and integrate it to support more sophisticated queries.
Web Structure Mining:
Discovers the link structure of the hyperlinks at the inter-document level to generate a
structural summary of the website and its web pages.
Direction 1: based on the hyperlinks, categorizing the web pages and generated
information.
Direction 2: Discovering the structure of the Web document itself.
Direction 3: Discovering the nature of the Web site.
Answers should be ranked by degree of relevance, based on the nearness of the
keywords, the relative frequency of the keywords, etc.
Basic Techniques:
Stop List:
A set of words that are deemed “irrelevant” even though they may appear
frequently.
E.g., a, the, of, for, with, etc.
Stop lists may vary when the document set varies.
Word Stem:
Several words are small syntactic variants of each other since they
share a common word stem.
A term frequency table:
Each entry frequent_table(i, j) =
number of occurrences of the word t_i in document d_j.
Usually, the ratio instead of the absolute number of
occurrences is used.
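The three techniques above can be combined in a minimal sketch. The stop list, the suffix-stripping `naive_stem` helper, and the two sample documents are all illustrative assumptions; a real system would use a proper stemmer and a stop list tuned to its document set.

```python
from collections import Counter

# Illustrative stop list; real stop lists depend on the document set.
STOP_WORDS = {"a", "the", "of", "for", "with", "and", "in", "is"}


def naive_stem(word):
    """Crude suffix stripping, standing in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def term_frequencies(text):
    """Return {term: relative frequency} after stop-word removal and stemming."""
    tokens = [w.strip(".,;:!?").lower() for w in text.split()]
    terms = [naive_stem(t) for t in tokens if t and t not in STOP_WORDS]
    counts = Counter(terms)
    total = sum(counts.values())
    # Store the ratio rather than the absolute count.
    return {term: count / total for term, count in counts.items()}


docs = {
    "d1": "The analyst mined the data for patterns.",
    "d2": "Mining text data with patterns in text.",
}
freq_table = {doc_id: term_frequencies(text) for doc_id, text in docs.items()}
print(freq_table["d1"])
print(freq_table["d2"])
```

Note that "mined" and "mining" collapse to the same stem, so both documents share that term even though the surface words differ.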
Similarity Metrics:
Measure the closeness of the document to a query (a set of keywords)
Relative term occurrences
Cosine Distance
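As a rough illustration of the cosine measure, the sketch below ranks two documents against a keyword query. The term vectors are hypothetical, and the function returns cosine similarity (1 means same direction); a cosine distance would be one minus this value.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse term vectors ({term: weight})."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty vector is treated as unrelated to everything
    return dot / (norm_u * norm_v)


# Hypothetical term-frequency vectors for two documents and a keyword query.
doc_vectors = {
    "d1": {"data": 2, "mining": 1, "pattern": 1},
    "d2": {"spatial": 2, "map": 1, "data": 1},
}
query = {"data": 1, "mining": 1}

# Rank documents from most to least similar to the query.
ranked = sorted(
    doc_vectors,
    key=lambda d: cosine_similarity(query, doc_vectors[d]),
    reverse=True,
)
print(ranked)
```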
Latent Semantic Indexing:
Basic Idea:
- Similar documents have similar word frequencies.
- Difficulty: the size of the term frequency matrix is very large.
- Use a singular value decomposition (SVD) technique to reduce the size of
the frequency table.
- Retain the K most significant rows of the frequency table.
Method:
- Create a term frequency matrix, freq-matrix.
- SVD Construction: Compute the singular value decomposition of the
freq-matrix by splitting it into 3 matrices, U, S, V.
Vector Identification:
- For each document d, replace its original document vector by a new
vector excluding the eliminated terms.
Index Creation:
- Store the set of all vectors, indexed by one of a number of techniques
(such as TV-tree)
Other Text Retrieval Indexing Techniques:
Inverted Index:
- Maintains two hash- or B+-tree-indexed tables.
Document Table:
- a set of document records <doc_id, postings_list>
Term Table:
- a set of term records <term, postings_list>
Answer Query: Find all docs associated with one or a set of terms.
Advantage: Easy to implement.
Disadvantage: Does not handle synonymy and polysemy well, and posting lists could be
too long (storage could be very large).
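A minimal in-memory sketch of the two tables follows. Python dictionaries stand in for the hash-indexed tables, the sample documents are invented, and a multi-term query is assumed to mean "documents containing all of the terms".

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Build the two tables: doc_id -> terms, and term -> sorted posting list of doc_ids."""
    document_table = {}
    term_table = defaultdict(set)
    for doc_id, text in docs.items():
        terms = set(text.lower().split())
        document_table[doc_id] = sorted(terms)
        for term in terms:
            term_table[term].add(doc_id)
    return document_table, {t: sorted(ids) for t, ids in term_table.items()}


def answer_query(term_table, terms):
    """Return the documents that contain every query term."""
    postings = [set(term_table.get(t, ())) for t in terms]
    return sorted(set.intersection(*postings)) if postings else []


docs = {
    1: "web mining finds patterns",
    2: "text mining ranks documents",
    3: "spatial data mining",
}
document_table, term_table = build_inverted_index(docs)
print(answer_query(term_table, ["mining"]))
print(answer_query(term_table, ["text", "mining"]))
```

The lookup is exact-match on terms, which shows the stated disadvantage directly: a query for a synonym of "mining" would find nothing.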
Signature File:
- Associate a signature with each document.
Document Clustering:
Automatically group related documents based on their contents.
Requires no training sets or predetermined taxonomies; generates a taxonomy at
runtime.
Major Steps:
Preprocessing
Remove stop words, stem, feature extraction, lexical analysis,…
Hierarchical Clustering
Compute similarities applying clustering algorithms,…
Slicing
Fan out controls, flatten the tree to configurable number of levels,…
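The major steps can be sketched in a few lines, under simplifying assumptions: preprocessing is only lowercasing and stop-word removal, similarity is Jaccard overlap of term sets, clustering is single-link agglomerative merging, and a similarity threshold (`min_similarity`, a hypothetical parameter) stands in for slicing by producing a flat grouping.

```python
STOP_WORDS = {"a", "the", "of", "for", "with", "and", "in"}


def features(text):
    """Preprocessing: lowercase, split, drop stop words (no stemming here)."""
    return {w for w in text.lower().split() if w not in STOP_WORDS}


def jaccard(a, b):
    """Similarity between two term sets."""
    return len(a & b) / len(a | b) if a | b else 0.0


def cluster(docs, min_similarity=0.2):
    """Single-link agglomerative clustering; stops when no pair of clusters
    is at least min_similarity alike, which yields a flat grouping."""
    feats = {doc_id: features(text) for doc_id, text in docs.items()}
    clusters = [[doc_id] for doc_id in docs]

    def link(c1, c2):
        # Single link: similarity of the closest pair of members.
        return max(jaccard(feats[x], feats[y]) for x in c1 for y in c2)

    while len(clusters) > 1:
        pairs = [
            (link(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        ]
        best, i, j = max(pairs)
        if best < min_similarity:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


docs = {
    "d1": "mining of spatial data",
    "d2": "spatial data cube",
    "d3": "web page ranking",
    "d4": "ranking of web documents",
}
print(cluster(docs))
```

No labels or taxonomy are supplied; the grouping emerges from the document contents alone.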
Spatial Data Mining refers to the extraction of knowledge, spatial relationships or other
interesting patterns not explicitly stored in spatial databases.
A spatial database stores a large amount of space-related data, such as maps, preprocessed
remote sensing or medical imaging data, and VLSI chip layout data.
Statistical spatial data analysis has been a popular approach to analyzing spatial data and
exploring geographic information.
The term ‘geostatistics’ is often associated with continuous geographic space, whereas the
term ‘Spatial statistics’ is often associated with discrete space.
Spatial Data Mining Applications:
Geographic information systems
Geo marketing
Remote sensing
Image database exploration
Medical Imaging
Navigation
Traffic Control
Environmental Studies
Spatial Data Cube Construction and Spatial OLAP:
A spatial data warehouse is a subject-oriented, integrated, time-variant, and non-volatile
collection of both spatial and non-spatial data in support of spatial data mining and
spatial-data-related decision-making processes.
There are three types of dimensions in a Spatial Data Cube:
A non-spatial dimension contains only non-spatial data, and its
generalizations are also non-spatial.
A Spatial-to-nonspatial dimension is a dimension whose primitive-level data are
spatial but whose generalization, starting at a certain high level, becomes non-spatial.
A Spatial-to-Spatial dimension is a dimension whose primitive level and all of its high
level generalized data are spatial.
Measures of Spatial Data Cube:
A numerical measure contains only numeric data
A Spatial measure contains a collection of pointers to spatial objects.
Computation of Spatial Measures in Spatial Data Cube Construction:
Collect and store the corresponding spatial object pointers but do not perform
precomputation of spatial measures in the spatial data cube.
Precompute and store a rough approximation of the spatial measures in the spatial
data cube.
Selectively pre-compute some spatial measures in the spatial data cube.
Mining Spatial Association and Co-Location Pattern:
Spatial Association rules can be mined in spatial databases.
A Spatial association rule is of the form A → B [s%, c%], where A and B are sets of
spatial or non-spatial predicates.
s% is the support of the rule; c% is the confidence of the rule.
For mining spatial associations related to the spatial predicate ‘close_to’, first collect the
candidates that pass the minimum support threshold by:
Applying certain rough spatial evaluation algorithms.
Evaluating the relaxed spatial predicate, ‘g_close_to’, which is a generalized
‘close_to’ covering a broader context that includes ‘close_to’, ‘touch’, and
‘intersect’.
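Support and confidence for such a rule can be computed as in the sketch below. The predicate strings and the five objects are made up for illustration, and the spatial predicates are assumed to have been evaluated already, so each object is simply a set of predicates that hold for it.

```python
def support_and_confidence(objects, antecedent, consequent):
    """Support and confidence of the rule antecedent -> consequent,
    where each object is described by a set of predicate strings."""
    n_total = len(objects)
    n_a = sum(1 for preds in objects if antecedent <= preds)
    n_ab = sum(1 for preds in objects if (antecedent | consequent) <= preds)
    support = n_ab / n_total if n_total else 0.0
    confidence = n_ab / n_a if n_a else 0.0
    return support, confidence


# Hypothetical predicate sets for five spatial objects.
objects = [
    {"is_a(school)", "close_to(sports_center)", "close_to(park)"},
    {"is_a(school)", "close_to(park)"},
    {"is_a(school)", "close_to(sports_center)", "close_to(park)"},
    {"is_a(hospital)", "close_to(park)"},
    {"is_a(school)"},
]

s, c = support_and_confidence(
    objects,
    {"is_a(school)", "close_to(sports_center)"},
    {"close_to(park)"},
)
print(f"support={s:.0%}, confidence={c:.0%}")
```

Here support is the fraction of all objects that satisfy both A and B, and confidence is the fraction of objects satisfying A that also satisfy B.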
Spatial Clustering methods:
Spatial data clustering identifies clusters, or densely populated regions, according to
some distance measurement in a large, multi dimensional data set.
Spatial Classification and Spatial Trend Analysis:
Spatial Classification analyzes spatial objects to derive classification schemes
relevant to certain spatial properties.
Example: Classify regions in a province into rich vs. poor according to the average
family income.
Trend analysis detects changes with time, such as the changes of temporal patterns in
time-series data.
Spatial trend analysis replaces time with space and studies the trend of non-spatial or
spatial data changing with space.
Example: Observe the trend of changes of the climate or vegetation with the
increasing distance from an ocean.
Regression and correlation analysis methods are often applied by utilization of spatial
data structures and spatial access methods.
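A spatial trend of this kind can be estimated with ordinary least squares, using distance in place of time as the explanatory variable. The sketch below uses made-up rainfall observations and ignores the spatial data structures and access methods a real system would use to gather them.

```python
def least_squares(xs, ys):
    """Slope and intercept of the ordinary least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Made-up observations: distance from the ocean (km) and annual rainfall (mm).
distance_km = [0, 50, 100, 150, 200]
rainfall_mm = [1200, 1100, 1000, 900, 800]

slope, intercept = least_squares(distance_km, rainfall_mm)
print(f"rainfall changes by {slope:.1f} mm per km from the coast")
```

A negative slope indicates that the attribute decreases with increasing distance from the ocean; the sign and size of the slope summarize the spatial trend.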
Mining Raster Databases:
Spatial database systems usually handle vector data that consists of points, lines,
polygons (regions) and their compositions, such as networks or partitions.
Huge amounts of space-related data are in digital raster forms such as satellite images,
remote sensing data, and computed tomography.
Review Questions
Two Marks:
1. List out some of the application areas of Data mining systems.
2. Is data mining hype or a persistent technology?
3. Write short notes on text mining.
4. What are the applications of spatial databases?
5. Define Spatial Data Mining.
6. List any five commercial data mining tools.
7. What are the different Data security techniques used in data mining?
8. What is information retrieval?
9. What is keyword-based association analysis?
10. What is HITS algorithm?
11. List out some of the challenges of WWW.
12. What is web usage mining?
13. What are the three types of dimensions in Spatial data cube?
Sixteen Marks:
1. Discuss in detail the application of Data Mining for financial data analysis.
2. Discuss the application of data mining in business.
3. Discuss in detail the applications of data mining for biomedical and DNA data analysis
and the telecommunication industry.
4. Discuss the Social impacts of Data Mining Systems.
5. Discuss about the various data mining tools.
6. Explain the Mining of Spatial databases.
7. Discuss the Mining of Text Databases.
8. What is web mining? Discuss the various web mining techniques.
Assignment Topic:
1. Explain in detail about the data mining tool DB-Miner.