5.1 Applications of Data Mining: Unit V - Data Warehousing and Data Mining - CA5010
KLNCIT MCA
A company can use data mining to find out which purchases are the most likely to be
fraudulent.
For example, by using data mining a retail store may be able to determine which products are
stolen the most. By finding out which products are stolen the most, steps can be taken to
protect those products and detect those who are stealing them. While direct mail marketing is
an older technique that has been used for many years, companies that combine it with data
mining can achieve excellent results. For example, you can use data mining to find out
which customers will respond favorably to a direct mail marketing strategy. You can also use
data mining to determine the effectiveness of interactive marketing. Some of your customers
will be more likely to purchase your products online than offline, and you must identify
them.
While many businesses use data mining to help increase their profits, many of them don't
realize that it can be used to create new businesses and industries. One industry that can be
created by data mining is the automatic prediction of both behaviors and trends. Imagine for
a moment that you owned a fashion company and could precisely predict the next big fashion
trend based on the behavior and shopping patterns of your customers. It is easy to see that
you could become very wealthy within a short period of time, and you would have an
advantage over your competitors. Instead of simply guessing what the next big trend will be,
you would determine it based on statistics, patterns, and logic.
Another example of automatic prediction is to use data mining to look at your past marketing
strategies. Which one worked the best? Why did it work the best? Who were the customers
that responded most favorably to it? Data mining will allow you to answer these questions,
and once you have the answers, you will be able to avoid repeating the mistakes of your
previous marketing campaigns. Data mining can allow you to become better at what you do.
It is also a powerful tool for those who deal with finances. A financial institution such as a
bank can predict the number of defaults that will occur among its customers within a given
period of time, and it can also predict the amount of fraud that will occur.
Another potential application of data mining is the automatic recognition of patterns that
were not previously known. Imagine if you had a tool that could automatically search your
database to look for patterns which are hidden. If you had access to this technology, you
would be able to find relationships that could allow you to make strategic decisions.
Data mining is becoming a pervasive technology in activities as diverse as using historical
data to predict the success of a marketing campaign, looking for patterns in financial
transactions to discover illegal activities or analyzing genome sequences. From this
perspective, it was just a matter of time for the discipline to reach the important area of
computer security.
Applications of Data Mining in Computer Security presents a collection of research
efforts on the use of data mining in computer security.
Data mining has been loosely defined as the process of extracting information from large
amounts of data. In the context of security, the information we are seeking is the knowledge
of whether a security breach has been experienced, and if the answer is yes, who is the
perpetrator. This information could be collected in the context of discovering intrusions that
aim to breach the privacy of services, data in a computer system or alternatively, in the
context of discovering evidence left in a computer system as part of criminal activity.
Applications of Data Mining in Computer Security concentrates heavily on the use of data
mining in the area of intrusion detection. The reason for this is twofold. First, the volume of
data dealing with both network and host activity is so large that it makes it an ideal candidate
for using data mining techniques. Second, intrusion detection is an extremely critical activity.
This book also addresses the application of data mining to computer forensics. This is a
crucial area that seeks to address the needs of law enforcement in analyzing digital
evidence.
Applications of Data Mining in Computer Security is designed to meet the needs of a
professional audience composed of researchers and practitioners in industry and graduate
level students in computer science.
5.2 Social Impacts of Data Mining
Data mining can offer the individual many benefits by improving customer service and
satisfaction, and lifestyle in general. However, it also has serious implications regarding
one's right to privacy and data security.
Is Data Mining Hype or a Persistent, Steadily Growing Business?
Data mining has recently become a very popular area for research, development and business,
as it has become an essential tool for deriving knowledge from data to help business people
in the decision-making process.
The adoption of data mining technology passes through several phases:
Innovators
Early Adopters
Chasm
Early Majority
Late Majority
Laggards
Is Data Mining Merely Managers' Business or Everyone's Business?
Data Mining will surely help company executives a great deal in understanding the market
and their business. However, one can expect that everyone will have needs and means of data
mining as it is expected that more and more powerful, user friendly, diversified and
affordable data mining systems or components are made available.
Data mining can also have multiple personal uses, such as:
Identifying patterns in medical applications
Choosing the best companies based on customer service
Classifying email messages
With more and more information accessible in electronic form and available on the web, and
with increasingly powerful data mining tools being developed and put into use, there is
increasing concern that data mining may pose a threat to our privacy and data security.
Data Privacy:
In 1980, the Organisation for Economic Co-operation and Development (OECD) established
a set of international guidelines, referred to as fair information practices. These guidelines
aim to protect privacy and data accuracy.
They include the following principles:
Purpose specification and use limitation.
Openness
Security Safeguards
Individual Participation
Data Security:
Many data security enhancing techniques have been developed to help protect data.
Databases can employ a multilevel security model to classify and restrict data according to
various security levels with users permitted access to only their authorized level.
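The multilevel security model described above can be sketched in a few lines: each record carries a classification level, and a query returns only the records at or below the user's clearance. The levels, record contents, and function names here are hypothetical, purely for illustration.

```python
# Illustrative sketch of multilevel database security: records are tagged
# with a classification level, and users see only records at or below
# their authorized level. Levels and data are hypothetical.
LEVELS = {"public": 0, "confidential": 1, "secret": 2}

records = [
    {"id": 1, "data": "branch address", "level": "public"},
    {"id": 2, "data": "customer balance", "level": "confidential"},
    {"id": 3, "data": "fraud watchlist", "level": "secret"},
]

def query(records, clearance):
    """Return only the records the user's clearance permits."""
    allowed = LEVELS[clearance]
    return [r for r in records if LEVELS[r["level"]] <= allowed]
```

A user cleared to "confidential" would retrieve only the first two records; a "secret" clearance retrieves all three.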
Some of the data security techniques are:
Encryption techniques
Intrusion detection
Secure multiparty computation
Data obscuration
5.3 Tools
Data Mining Tools:
1. Auto Class III:
Auto Class is an unsupervised Bayesian classification system for independent data.
2. Business Miner:
Business Miner is a single-strategy, easy-to-use tool based on decision trees.
3. CART:
CART is a robust data mining tool that automatically searches for important patterns
and relationships in large data sets.
4. Clementine:
It finds sequence association and clustering for web data analysis.
5. Data Engine:
Data Engine is a multiple strategy data mining tool for data modeling, combining
conventional data analysis methods with fuzzy technology.
6. DB Miner:
DB Miner is a publicly available tool for data mining. It is a multiple-strategy tool
that supports clustering and association rules.
7. Delta Miner:
Delta Miner is a multiple-strategy tool supporting clustering, summarization,
deviation detection and visualization.
8. IBM Intelligent Miner:
Intelligent Miner is an integrated and comprehensive set of data mining tools. It uses
decision trees, neural networks and clustering.
9. Mine Set:
Mine Set is a comprehensive tool for data mining. Its features include extensive data
manipulation and transformation.
10. SPIRIT:
SPIRIT is a tool for exploration and modeling using Bayesian techniques.
11. WEKA:
WEKA is a software environment that integrates several machine learning tools within
a common framework and a uniform GUI.
5.4 An Introduction to DB Miner
A data mining system, DB Miner, has been developed for interactive mining of
multiple-level knowledge in large relational databases. The system implements a wide
spectrum of data mining functions, including generalization, characterization, association,
classification and prediction.
Introduction:
With the upsurge of research and development activities on knowledge discovery in
databases, a data mining system, DB Miner, has been developed based on our studies of data
mining techniques and our experience in the development of an early system prototype,
DBLearn.
The system has the following distinct features:
1. It incorporates several interesting data mining techniques, including attribute-oriented
induction, statistical analysis, progressive deepening for mining multiple level rules
and meta-rule guided knowledge mining.
2. It performs interactive data mining and multiple concept levels on any user-specified
set of data in a database using an SQL-like Data mining Query Language, DMQL or a
GUI.
3. Efficient implementation techniques have been explored using different data
structures, including generalized relations and multiple-dimensional data cubes.
4. The data mining process may utilize user or expert defined set-grouping or schema
level concept hierarchies which can be specified flexibly, adjusted dynamically based
on data distribution and generated automatically for numerical attributes.
5. Both UNIX and PC (Windows / NT) versions of the system adopt a client / server
architecture. The latter may communicate with various commercial database systems
for data mining using ODBC technology.
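Feature 1 above mentions attribute-oriented induction. A minimal sketch of the idea: tuples are generalized by climbing a concept hierarchy (here, city to country), and identical generalized tuples are merged with a count. The hierarchy and data below are illustrative, not taken from DB Miner itself.

```python
# Minimal sketch of attribute-oriented induction: generalize an attribute
# by climbing a concept hierarchy, then merge identical generalized tuples
# while keeping counts. Hierarchy and tuples are illustrative.
from collections import Counter

city_to_country = {"Vancouver": "Canada", "Toronto": "Canada",
                   "Chicago": "USA", "New York": "USA"}

tuples = [("Vancouver", "MSc"), ("Toronto", "MSc"),
          ("Chicago", "PhD"), ("New York", "PhD"), ("Toronto", "PhD")]

def generalize(tuples, hierarchy):
    """Replace each city by its country and merge duplicates with counts."""
    return Counter((hierarchy[city], degree) for city, degree in tuples)
```

On the toy data, `generalize(tuples, city_to_country)` collapses five tuples into three generalized tuples such as ("Canada", "MSc") with count 2, which is the kind of characterized rule DB Miner presents to the user.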
Architecture and Functionalities:
The general architecture of DB Miner, shown in Figure A1, tightly integrates a relational
database system, such as Sybase SQL Server, with a concept hierarchy module and a set of
knowledge discovery modules.
Graphical User Interface:
relationship marketing, by identifying which customers are most likely to respond to the
campaign. If the response rate can be raised from 1% to, say, 1.5% of the customers contacted
(the "lift value"), then 1,000 sales could be achieved with only 66,667 mailings, reducing the
cost of mailing by one-third.
Case Study: Data Mining the Northridge Earthquake
The data collected during the Northridge, California earthquake occupied several
warehouses, and ranged from magnetic media to bound copies of printed reports. Nautilus
Systems personnel sorted, organized, and cataloged the materials. Documents were scanned
and converted to text. Data were organized chronologically and according to situation
reports, raw data, agency data, and agency reports. For example, the Department of
Transportation had information on highways, street structures, airport structures, and related
damage assessments.
Nautilus Systems applied its proprietary data mining techniques to extract and refine data.
Geography was used to link related information, and text searches were used to group
information tagged with specific names (e.g., Oakland Bay Bridge, San Mateo, Marina). The
refined data were further analyzed to detect patterns, trends, associations and factors not
readily apparent. At that time, there was not a seismographic timeline, but it was possible to
map the disaster track to analyze the migration of damage based upon geographic location.
Many types of analyses were done. For example, the severity of damage was analyzed
according to type of physical structure, pre- versus post- 1970 earthquake building codes, and
off track versus on track damage. It was clear that the earthquake building codes limited the
degree of damage.
Nautilus Systems also looked at the data coming into the command and control center. The
volume of data was so great that a lot was filtered out before it got to the decision support
level. This demonstrated the need for a management system to build intermediate decision
blocks and communicate the information where it was needed. Much of the information
needed was also geographic in nature. There was no ability to generate accurate maps for
response personnel, both route maps including blocked streets and maps defining disaster
boundaries. There were no interoperable communications between local police, the fire
department, utility companies, and the disaster field office. There were also no predefined
rules of engagement between FEMA and local resources, resulting in delayed response
(including such critical areas as firefighting).
Benefits
Nautilus Systems identified recurring data elements, data relationships and metadata, and
assisted in the construction of the Emergency Information Management System (EIMS). The
EIMS facilitates rapid building and maintenance of disaster operations plans, and provides
consistent, integrated command (decision support), control (logistics management), and
communication (information dissemination) throughout all phases of disaster management.
Its remote GIS capability provides the ability to support multiple disasters with a central GIS
team, conserving scarce resources.
Retrieving pages that are not only relevant, but also of high quality or authoritative on
the topic.
Hyperlinks can infer the notion of authority:
- The Web consists not only of pages, but also of hyperlinks pointing from
one page to another.
- These hyperlinks contain an enormous amount of latent human annotation.
- A hyperlink pointing to another web page can be considered as the
author's endorsement of the other page.
Problems with the web linkage structure:
- Not every hyperlink represents an endorsement
- One authority will seldom have its web page point to its rival authorities in
the same field
- Authoritative pages are seldom particularly descriptive.
HITS (Hyperlink-Induced Topic Search):
Explore interactions between hubs and authoritative pages.
Use an index-based search engine to form the root set.
Expand the root set into a base set
- Include all of the pages that the root set pages link to, and all of the pages
that link to a page in the root set, up to a designated size cut off.
Apply weight-propagation
- An iterative process that determines numerical estimates of hub and
authority weights
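The weight-propagation step above can be sketched compactly: each page's authority score sums the hub scores of pages linking to it, each hub score sums the authority scores of the pages it links to, and both are normalized every round. The toy link graph below is illustrative.

```python
# Sketch of HITS weight propagation on a toy link graph.
# links maps each page to the pages it points to.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
pages = sorted(links)

hub = {p: 1.0 for p in pages}
auth = {p: 1.0 for p in pages}

for _ in range(50):  # iterate until scores stabilize
    # Authority: sum of hub weights of pages that link in.
    auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
    norm = sum(v * v for v in auth.values()) ** 0.5
    auth = {p: v / norm for p, v in auth.items()}
    # Hub: sum of authority weights of pages linked to.
    hub = {p: sum(auth[q] for q in links[p]) for p in pages}
    norm = sum(v * v for v in hub.values()) ** 0.5
    hub = {p: v / norm for p, v in hub.items()}
```

Page "c", pointed to by three pages, ends with the highest authority score, while "a", which links to both "b" and "c", emerges as the strongest hub.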
Systems based on the HITS algorithm:
- Clever and Google: achieve better-quality search results than those generated
by term-index engines such as AltaVista.
Difficulties from ignoring textual contexts:
- Drifting
- Topic hijacking
Automatic Classification of Web Documents:
Assign a class label to each document from a set of predefined topic categories.
Based on a set of examples of pre-classified documents
Keyword-based document classification methods
Statistical Models
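A minimal sketch of the keyword-based method listed above: each predefined category carries a keyword set, and a document is assigned the category whose keywords it matches most often. The categories and keyword lists are illustrative.

```python
# Keyword-based document classification sketch: score each predefined
# category by keyword overlap with the document, assign the best match.
# Categories and keywords are illustrative.
CATEGORIES = {
    "sports": {"match", "score", "team", "player"},
    "finance": {"stock", "market", "bank", "profit"},
}

def classify(document):
    """Assign the category with the largest keyword overlap."""
    words = set(document.lower().split())
    scores = {c: len(words & kw) for c, kw in CATEGORIES.items()}
    return max(scores, key=scores.get)
```

For example, `classify("the team won the match with a record score")` matches three "sports" keywords and none from "finance", so the document is labeled "sports". Statistical models (e.g., naive Bayes trained on pre-classified examples) replace the hand-written keyword sets with learned word probabilities.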
Multilayered Web Information Base:
Layer 0: the Web itself
Layer 1: the Web page descriptor layer
Layer 2 and up: various web directory services constructed on top of layer 1
Applications of Web Mining:
Target potential customers for e-commerce
Improve web server system performance
Identify potential prime advertisement locations
Facilitate adaptive / personalized sites
Improve site design
Fraud / Intrusion detection
Predict users' actions
5.7 Mining Text Databases
Text Databases and Information Retrieval:
Slicing
Fan-out controls: flatten the tree to a configurable number of levels
5.8 Mining Spatial Databases
Spatial Data Mining refers to the extraction of knowledge, spatial relationships or other
interesting patterns not explicitly stored in spatial databases.
A spatial database stores a large amount of space-related data, such as maps, preprocessed
remote sensing or medical imaging data, and VLSI chip layout data.
Statistical spatial data analysis has been a popular approach to analyzing spatial data and
exploring geographic information.
The term geostatistics is often associated with continuous geographic space, whereas the
term spatial statistics is often associated with discrete space.
Spatial Data Mining Applications:
Geographic information systems
Geo marketing
Remote sensing
Image database exploration
Medical Imaging
Navigation
Traffic Control
Environmental Studies
Spatial Data Cube Construction and Spatial OLAP:
A spatial data warehouse is a subject-oriented, integrated, time-variant and non-volatile
collection of both spatial and non-spatial data in support of spatial data mining and
spatial-data-related decision-making processes.
There are three types of dimensions in a Spatial Data Cube:
A non-spatial dimension contains only non-spatial data, whose generalizations are also
non-spatial.
A Spatial-to-nonspatial dimension is a dimension whose primitive-level data are
spatial but whose generalization, starting at a certain high level, becomes non-spatial.
A Spatial-to-Spatial dimension is a dimension whose primitive level and all of its high
level generalized data are spatial.
Measures of Spatial Data Cube:
A numerical measure contains only numeric data
A Spatial measure contains a collection of pointers to spatial objects.
Computation of Spatial Measures in Spatial Data Cube Construction:
Collect and store the corresponding spatial object pointers but do not perform
precomputation of spatial measures in the spatial data cube.
Precompute and store a rough approximation of the spatial measures in the spatial
data cube.
Selectively pre-compute some spatial measures in the spatial data cube.
Mining Spatial Association and Co-Location Pattern:
Spatial Association rules can be mined in spatial databases.
A spatial association rule is of the form A ⇒ B [s%, c%], where A and B are sets of
spatial or non-spatial predicates;
s% is the support of the rule and c% is the confidence of the rule.
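The support and confidence of a rule A ⇒ B can be computed directly from the data, as sketched below. The spatial predicates (close_to, is_a) and the toy objects are illustrative; each row lists the predicates true of one spatial object.

```python
# Support and confidence for a rule A => B [s%, c%]: support is the
# fraction of objects satisfying both A and B; confidence is the fraction
# of objects satisfying A that also satisfy B. Predicates are illustrative.
objects = [
    {"is_a(house)", "close_to(beach)", "expensive"},
    {"is_a(house)", "close_to(beach)", "expensive"},
    {"is_a(house)", "close_to(highway)"},
    {"is_a(house)", "close_to(beach)"},
]

def rule_stats(objects, antecedent, consequent):
    """Return (support%, confidence%) of antecedent => consequent."""
    n = len(objects)
    a = [o for o in objects if antecedent <= o]       # A holds
    ab = [o for o in a if consequent <= o]            # A and B hold
    support = 100.0 * len(ab) / n
    confidence = 100.0 * len(ab) / len(a) if a else 0.0
    return support, confidence
```

On the toy data, close_to(beach) ⇒ expensive holds with 50% support (2 of 4 objects satisfy both) and about 67% confidence (2 of the 3 beach-side objects are expensive).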
Review Questions
Two Marks:
1.
2.
3.
4.
5.