DWDM B Tech Unit 1 Part-A
DWDM B Tech Unit 1 Part-A
UNIT-1
I. What Motivated Data Mining? Why Is It Important?
In recent years, due to the wide availability of huge amounts of data and the imminent need for turning such data
into useful information and knowledge.
o For applications ranging from market analysis, fraud detection, and customer retention, to production
control
Data mining can be viewed as a result of the natural evolution of information technology.
o The database system industry has witnessed an evolutionary path in the development of the following
functionalities: data collection and database creation, data management (including data storage and
retrieval, and database transaction processing), and advanced data analysis (involving data warehousing
and data mining).
The steady and amazing progress of computer hardware technology in the past three decades has led to
large supplies of powerful and affordable computers, data collection equipment, and storage media.
Data can now be stored in many different kinds of databases and information repositories. One data
repository architecture that has emerged is the data warehouse , a repository of multiple heterogeneous
data sources organized under a unified schema at a single site in order to facilitate management decision
making.
o On-line analytical processing (OLAP) that is, analysis techniques with functionalities such as
summarization, consolidation, and aggregation as well as the ability to view information from
different angles.
o Although OLAP tools support multidimensional analysis and decision making, additional data
analysis tools are required for in-depth analysis, such as data classification, clustering, and the
characterization of data changes over time
The abundance of data, coupled with the need for powerful data analysis tools, has been described as a
data rich but information poor situation.
Knowledge discovery as a process is depicted in Figure and consists of an iterative sequence of the following
steps:
1. Data cleaning (to remove noise and inconsistent data)
2. Data integration (where multiple data sources may be combined)
3. Data selection (where data relevant to the analysis task are retrieved from the database)
4. Data transformation (where data are transformed or consolidated into forms appropriate for mining by
performing summary or aggregation operations)
5. Data mining (an essential process where intelligent methods are applied in order to extract data patterns)
6. Pattern evaluation (to identify the truly interesting patterns representing knowledge based on some
interestingness measures)
2
7. Knowledge presentation (where visualization and knowledge representation techniques are used to present
the mined knowledge to the user)
The architecture of a typical data mining system may have the following major components
Database, data warehouse, WorldWideWeb, or other information repository: Data cleaning and data
integration techniques may be performed on the data.
Database or data warehouse server: The database or data warehouse server is responsible for fetching
the relevant data, based on the user’s data mining request
Knowledge base: This is the domain knowledge that is used to guide the search or evaluate the
interestingness of resulting patterns.( concept hierarchies, user beliefs, and additional interestingness
constraints or thresholds)
Data mining engine: This is essential to the data mining systemand ideally consists of a set of functional
modules for tasks such as characterization, association etc.
Pattern evaluation module: This component typically employs interestingness measures and interacts
with the data mining modules so as to focus the search toward interesting patterns.
User interface: Thismodule communicates between users and the data mining system, allowing the user
to interact with the system by specifying a data mining query
3
II. Data Mining—On What Kind of Data?
A database system, also called a database management system (DBMS), consists of a collection
of interrelated data, known as a database, and a set of software programs to manage and access
the data.
The software programs involve mechanisms for the definition of database structures; for data
storage; for concurrent, shared, or distributed data access; and for ensuring the consistency and
security of the information stored, despite system crashes or attempts at unauthorized access.
2. Data warehouses
Definition: A data warehouse is a repository of information collected from multiple sources, stored
under a unified schema, and that usually resides at a single site.
Data warehouses are constructed via a process of data cleaning, data integration, data transformation,
data loading, and periodic data refreshing.
To facilitate decision making, the data in a data warehouse are organized around major subjects,
such as customer, item, supplier, and activity.
The data are stored to provide information from a historical perspective (such as from the past 5–
10 years) and are typically summarized.
For example, rather than storing the details of each sales transaction, the data warehouse may
store a summary of the transactions per item type for each store or, summarized to a higher level,
for each sales region.
A data warehouse is usually modeled by a multidimensional database structure, where each
dimension corresponds to an attribute or a set of attributes in the schema, and each cell stores the
value of some aggregate measure, such as count or sales amount.
The actual physical structure of a data warehouse may be a relational data store or a
multidimensional data cube.
A data cube provides a multidimensional view of data and allows the precomputation and fast
accessing of summarized data.
By providing multidimensional data views and the precomputation of summarized data, data warehouse
systems are well suited for on-line analytical processing, or OLAP.
Although data warehouse tools help support data analysis, additional tools for data mining are required
to allow more in-depth and automated analysis.
6
3. Transactional databases
In general, a transactional database consists of a file where each record represents a transaction.
A transaction typically includes a unique transaction identity number (trans ID) and a list of the
items
Suppose you would like to dig deeper into the data by asking, “Which items sold well
together?”
This kind of market basket data analysis would enable you to bundle groups of items together as a
strategy for maximizing sales.
For example, given the knowledge that printers are commonly purchased together with computers, you
could offer an expensive model of printers at a discount to customers buying selected computers, in the
hopes of selling more of the expensive printers.
A regular data retrieval system is not able to answer queries like the one above.
However, data mining systems for transactional data can do so by identifying frequent itemsets, that is,
sets of items that are frequently sold together.
These applications require efficient data structures and scalable methods for handling complex object structures;
variable-length records; semistructured or unstructured data; text, spatiotemporal, and multimedia data; and database
schemas with complex structures and dynamic changes.
a) Object-Relational Databases
A temporal database typically stores relational data that include time-related attributes. These
attributes may involve several timestamps, each having different semantics.
A sequence database stores sequences of ordered events, with or without a concrete
notion of time. Examples include customer shopping sequences,
A time-series database stores sequences of values or events obtained over repeated
measurements of time (e.g., hourly, daily, weekly). Examples include data collected from the
stock exchange, inventory control, and the observation of natural phenomena (like temperature
and wind).
Data mining techniques can be used to find the characteristics of object evolution, or the trend of
changes for objects in the database. Such information can be useful in decision
making and strategy planning.
c) Spatial Databases and Spatiotemporal Databases
Spatial databases contain spatial-related information. Examples include geographic (map) databases, very
large-scale integration (VLSI) or computed-aided design databases, and medical and satellite image
databases.
Geographic databases have numerous applications, ranging from forestry and ecology planning to
providing public service information regarding the location of telephone and electric cables, pipes, and
sewage systems. In addition, geographic databases are commonly used in vehicle navigation
A spatial database that stores spatial objects that change with time is called a spatiotemporal
database, from which interesting information can be mined. For example, we may be able to
group the trends of moving objects and identify some strangely moving vehicles
d) Text Databases and Multimedia Databases
Text databases are databases that contain word descriptions for objects. such as product specifications,
error or bug reports, warning messages, summary reports, notes, or other documents.
Multimedia databases store image, audio, and video data. They are used in applications such as
picture content-based retrieval, voice-mail systems and recognize spoken commands
e) Heterogeneous Databases and Legacy Databases
A heterogeneous database consists of a set of interconnected, autonomous component
databases.
Many enterprises acquire legacy databases as a result of the long history of information
technology development
A legacy database is a group of heterogeneous databases that combines different kinds of data
systems, such as relational or object-oriented databases, hierarchical databases, network
databases, spreadsheets, multimedia databases, or file systems.
f) Data Streams
Many applications involve the generation and analysis of a newkind of data, called stream data,
where data flow in and out of an observation platform (or window) dynamically
8
Typical examples of data streams include various kinds of scientific and engineering data, time-
series data, and data produced in other dynamic environments, such as power supply, network
traffic, stock exchange, telecommunications, Web click streams, video surveillance, and weather
or environment monitoring.
g) The World Wide Web
The World Wide Web and its associated distributed information services, such as Yahoo!, Google,
America Online, and AltaVista, provide rich, worldwide, on-line information services, where data objects
are linked together to facilitate interactive access.
Users seeking information of interest traverse from one object via links to another.
Data discrimination is a comparison of the general features of the target class data objects against the
general features of objects from one or multiple contrasting classes.
Eg. A data mining system should be able to compare two groups of AllElectronics customers, such as
those who shop for computer products regularly (more than two times a month) versus those who rarely
shop for such products (i.e., less than three times a year). The resulting description provides a general
comparative profile of the customers, such as 80% of the customers who frequently purchase computer
products are between 20 and 40 years old and have a university education, whereas 60% of the
customers who infrequently buy such products are either seniors or youths, and have no university
degree. Drilling down on a dimension, such as occupation, or adding new dimensions, such as income
level, may help in finding even more discriminative features between the two classes.
9
Classification and Prediction
Finding models (functions) that describe and distinguish data classes or concepts for predict the
class whose label is unknown
E.g., classify countries based on climate, or classify cars based on gas mileage
Models: decision-tree, classification rules (if-then), neural network
Prediction: Predict some unknown or missing numerical values
Association Analysis
o A frequent itemset typically refers to a set of items that often appear together in a transactional
data set—for example, milk and bread, which are frequently bought together in grocery stores by
many customers.
o Multi-dimensional vs. single-dimensional association
o age(X, “20..29”) ^ income(X, “20..29K”) => buys(X, “PC”) [support = 2%, confidence = 60%]
o contains(T, “computer”) => contains(x, “software”) [support=1%, confidence=75%]
Cluster analysis
o Analyze class-labeled data objects, clustering analyze data objects without consulting a known
class label.
o Clustering based on the principle: maximizing the intra-class similarity and minimizing the
interclass similarity
Outlier analysis
Outlier: a data object that does not comply with the general behavior of the model of the data
It can be considered as noise or exception but is quite useful in fraud detection, rare events
analysis
Trend and evolution analysis
Trend and deviation: regression analysis
Sequential pattern mining, periodicity analysis
Similarity-based analysis
Each user will have a data mining task in mind, that is, some form of data analysis that he or she would like to
have performed. A data mining task can be specified in the form of a data mining query, which is input to the
data mining system.
A data mining query is defined in terms of data mining task primitives. The data mining primitives specify the
following, as illustrated in Figure
10
Example : Mining classification rules. Suppose, as a marketing manager of AllElectronics, you would like to
classify customers based on their buying patterns. You are especially interested in those customers whose salary
is no less than $40,000, and who have bought more than $1,000 worth of items, each of which is priced at no
11
less than $100. In particular, you are interested in the customer’s age, income, the types of items purchased, the
purchase location, and where the items were made. You would like to view the resulting classification in the
form of rules.
This data mining query is expressed in DMQL3 as follows, where each line of the query has been enumerated to
aid in our discussion.
Are all of the patterns interesting? Typically not—only a small fraction of the patterns potentially generated
would actually be of interest to any given user.
Several objective measures of pattern interestingness exist. These are based on the structure of discovered
patterns and the statistics underlying them.
An objective measure for association rules of the form X =>Y is rule support, representing the percentage of
transactions from a transaction database that the given rule satisfies. This is taken to be the probability P(X U
Y),where X U Y indicates that a transaction contains both X and Y, that is, the union of itemsets X and Y.
Another objective measure for association rules is confidence, which assesses the degree of certainty of the
detected association. This is taken to be the conditional probability P(Y|X), that is, the probability that a
transaction containing X also contains Y.
More formally, support and confidence are defined as
support(X =>Y) = P(X U Y):
confidence(X =>Y) = P(Y|X):
In general, each interestingness measure is associated with a threshold, which may be controlled by the user.
For example, rules that do not satisfy a confidence threshold of, say, 50% can be considered uninteresting.
Rules below the threshold likely reflect noise, exceptions, or minority cases and are probably of less value.
Can a data mining system generate all of the interesting patterns?—refers to the completeness of a data
mining algorithm. It is often unrealistic and inefficient for data mining systems to generate all of the possible
patterns. Instead, user-provided constraints and interestingness measures should be used to focus the search.
12
Can a data mining system generate only interesting patterns?— is an optimization problem in data mining. It
is highly desirable for data mining systems to generate only interesting patterns. This would be much more
efficient for users and data mining systems, because neither would have to search through the patterns generated
in order to identify the truly interesting ones. Progress has been made in
this direction; however, such optimization remains a challenging issue in data mining
Data mining is an interdisciplinary field, the confluence of a set of disciplines, including database systems,
statistics, machine learning, visualization, and information science.
1. Classification according to the kinds of databases mined: data mining system can be classified
according to the kinds of databases mined. For instance, if classifying according to data models, we may
have a relational, transactional, object-relational, or data warehouse mining system.
2. Classification according to the kinds of knowledge mined: Data mining systems can be categorized
according to the kinds of knowledge they mine, that is, based on data mining functionalities, such as
characterization, discrimination, association and correlation analysis, classification, prediction,
clustering, outlier analysis, and evolution analysis.
Moreover, data mining systems can be distinguished based on the granularity or levels of abstraction of
the knowledge mined, including generalized knowledge (at a highlevel of abstraction),primitive-level
knowledge (at a rawdata level), or knowledge atmultiple levels (considering several levels of abstraction).
3. Classification according to the kinds of techniques utilized: Data mining systems can be categorized
according to the underlying data mining techniques employed.
These techniques can be described according to the degree of user interaction involved (e.g.,
autonomous systems, interactive exploratory systems, query-driven systems) or the methods of data
analysis employed (e.g., database-oriented or data warehouse oriented techniques, machine learning,
statistics, visualization, pattern recognition, neural networks, and so on).
4. Classification according to the applications adapted: Data mining systems can also be categorized
according to the applications they adapt. For example, data mining systems may be tailored specifically
for finance, telecommunications, DNA, stock markets, e-mail, and so on.
13
VIII. Major Issues in Data Mining
The Major issues in data mining regarding mining methodology, user interaction, performance, and diverse data
types.
1. Mining different kinds of knowledge in databases: Because different users can be interested in different
kinds of knowledge, data mining should cover a wide spectrum of data analysis and knowledge
discovery tasks, including association, classification etc.
4. Data mining query languages and ad hoc data mining: Relational query languages (such as SQL)
allow users to pose ad hoc queries for data retrieval.
5. Presentation and visualization of data mining results: Discovered knowledge should be expressed in
high-level languages, visual representations, or other expressive forms so that the knowledge can be
easily understood and directly usable by humans.
6. Handling noisy or incomplete data: The data stored in a database may reflect noise, exceptional cases,
or incomplete data objects.
Performance issues: These include efficiency, scalability, and parallelization of data mining algorithms.
1. Efficiency and scalability of data mining algorithms: To effectively extract information from a huge
amount of data in databases, data mining algorithms must be efficient and scalable.
2. Parallel, distributed, and incremental mining algorithms: The huge size of many databases, the wide
distribution of data, and the computational complexity of some data mining methods are factors
motivating the development of parallel and distributed data mining algorithms.
Such algorithms divide the data into partitions, which are processed in parallel. The results from
the partitions are then merged.
14
Moreover, the high cost of some data mining processes promotes the need for incremental data
mining algorithms that incorporate database updates without having to mine the entire data again “from
scratch.” Such algorithms perform knowledge modification incrementally to amend and strengthen what
was previously discovered.
1. Handling of relational and complex types of data: Because relational databases and data warehouses
are widely used, the development of efficient and effective data mining systems for such data is
important. However, other databases may contain complex data objects, hypertext and multimedia data,
spatial data, temporal data, or transaction data.
It is unrealistic to expect one system to mine all kinds of data, given the diversity of data types
and different goals of data mining. Specific data mining systems should be constructed for mining
specific kinds of data. Therefore, one may expect to have different data mining systems for different
kinds of data.
2. Mining information from heterogeneous databases and global information systems: Local- and wide-
area computer networks (such as the Internet) connect many sources of data, forming huge, distributed,
and heterogeneous databases.
15