DWDM 1

Uploaded by banavathshilpa


Data Mining (UNIT -1)

Introduction to Data Mining:


Data mining refers to extracting or mining knowledge from large amounts of
data. The term is actually a misnomer: data mining should more appropriately
have been named knowledge mining, which emphasizes mining knowledge from
large amounts of data.
It is the computational process of discovering patterns in large data sets
involving methods at the intersection of artificial intelligence, machine learning,
statistics, and database systems. The overall goal of the data mining process is
to extract information from a data set and transform it into an understandable
structure for further use.
The key properties of data mining are
1. Automatic discovery of patterns.
2. Prediction of likely outcomes.
3. Creation of actionable information.
4. Focus on large datasets and databases.

MOTIVATION AND IMPORTANCE:


• Data Mining is defined as the procedure of extracting information from huge
sets of data.
• Data mining is mining knowledge from data.
• Data mining covers terminology such as knowledge discovery, query
languages, classification and prediction, decision tree induction, cluster
analysis, and how to mine the Web.
• There is a huge amount of data available in the information industry.
• This data is of no use until it is converted into useful information.
• It is necessary to analyse this huge amount of data and extract useful
information from it.
• Extraction of information is not the only process we need to perform.
• Data mining also involves other processes such as Data Cleaning, Data
Integration, Data Transformation, Data Mining, Pattern Evaluation and Data
Presentation.

DEFINITION OF DATA MINING?


Data Mining is defined as extracting information from huge sets of data. In
other words, we can say that data mining is the procedure of mining knowledge
from data. The information or knowledge so extracted can be used for any of the
following applications:
• Market Analysis
• Fraud Detection
• Customer Retention
• Production Control
• Science Exploration.
Major sources of data:
• Business – Web, e-commerce, transactions, stocks
• Science – remote sensing, bioinformatics, scientific simulation
• Society and everyone – news, digital cameras, YouTube
Need for turning data into knowledge: we are drowning in data, but starving
for knowledge.

Definition of Data Mining?


Extracting and ‘Mining’ knowledge from large amounts of data. “Gold Mining
from rock or sand” is same as “Knowledge mining from data”
Other terms for Data Mining:
• Knowledge Mining
• Knowledge Extraction
• Pattern Analysis

The Scope of Data Mining :


Data mining derives its name from the similarities between searching for
valuable business information in a large database — for example, finding linked
products in gigabytes of store scanner data — and mining a mountain for a vein
of valuable ore. Both processes require either sifting through an immense
amount of material, or intelligently probing it to find exactly where the value
resides.
Given databases of sufficient size and quality, data mining technology can
generate new business opportunities by providing these capabilities:
Automated prediction of trends and behaviors.
Data mining automates the process of finding predictive information in large
databases. Questions that traditionally required extensive hands-on analysis
can now be answered directly from the data — quickly.
A typical example of a predictive problem is targeted marketing. Data mining
uses data on past promotional mailings to identify the targets most likely to
maximize return on investment in future mailings. Other predictive problems
include forecasting bankruptcy and other forms of default, and identifying
segments of a population likely to respond similarly to given events.
Automated discovery of previously unknown patterns.
Data mining tools sweep through databases and identify previously hidden
patterns in one step. An example of pattern discovery is the analysis of retail
sales data to identify seemingly unrelated products that are often purchased
together. Other pattern discovery problems include detecting fraudulent credit
card transactions and identifying anomalous data that could represent data entry
keying errors.

What are Data Mining and Knowledge Discovery?


With the enormous amount of data stored in files, databases, and other
repositories, it is increasingly important, if not necessary, to develop powerful
means for analysis and perhaps interpretation of such data and for the extraction
of interesting knowledge that could help in decision-making.
Data Mining, also popularly known as Knowledge Discovery in Databases
(KDD), refers to the nontrivial extraction of implicit, previously unknown and
potentially useful information from data in databases. While data mining and
knowledge discovery in databases (or KDD) are frequently treated as
synonyms, data mining is actually part of the knowledge discovery process. The
following figure (Figure 1.1) shows data mining as a step in an iterative
knowledge discovery process.
The Knowledge Discovery in Databases process comprises a few steps
leading from raw data collections to some form of new knowledge. The iterative
process consists of the following steps:
• Data cleaning: also known as data cleansing, it is a phase in which noisy
data and irrelevant data are removed from the collection.
• Data integration: at this stage, multiple data sources, often heterogeneous,
may be combined in a common source.
• Data selection: at this step, the data relevant to the analysis is decided on and
retrieved from the data collection.
• Data transformation: also known as data consolidation, it is a phase in which
the selected data is transformed into forms appropriate for the mining
procedure.
• Data mining: it is the crucial step in which clever techniques are applied to
extract potentially useful patterns.
• Pattern evaluation: in this step, strictly interesting patterns representing
knowledge are identified based on given measures.
• Knowledge representation: is the final phase in which the discovered
knowledge is visually represented to the user. This essential step uses
visualization techniques to help users understand and interpret the data mining
results.
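The steps above can be sketched as a minimal pipeline. This is only an illustrative sketch: the data values, attribute names, and the trivial "pattern" mined at the end are all invented assumptions, not part of any real system.

```python
# A minimal sketch of the KDD steps as composable functions.
# The data values and attribute names are illustrative assumptions.

def clean(records):
    """Data cleaning: drop records with missing (noisy) values."""
    return [r for r in records if r.get("sales") is not None]

def integrate(*sources):
    """Data integration: combine multiple heterogeneous sources."""
    combined = []
    for src in sources:
        combined.extend(src)
    return combined

def select(records, attributes):
    """Data selection: keep only attributes relevant to the analysis."""
    return [{a: r[a] for a in attributes} for r in records]

def transform(records):
    """Data transformation: consolidate values into a mining-ready form."""
    return [{**r, "sales": float(r["sales"])} for r in records]

def mine(records):
    """Data mining: here, a trivial 'pattern' -- the top-selling region."""
    totals = {}
    for r in records:
        totals[r["region"]] = totals.get(r["region"], 0.0) + r["sales"]
    return max(totals, key=totals.get)

source_a = [{"region": "north", "sales": 120, "clerk": "a"}]
source_b = [{"region": "south", "sales": 80, "clerk": "b"},
            {"region": "north", "sales": None, "clerk": "c"}]

data = transform(select(clean(integrate(source_a, source_b)),
                        ["region", "sales"]))
best_region = mine(data)
print(best_region)  # north
```

Pattern evaluation and knowledge representation would follow as further stages; here the single surviving "pattern" is simply printed.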
It is common to combine some of these steps together. For instance, data
cleaning and data integration can be performed together as a pre-processing
phase to generate a data warehouse. Data selection and data transformation can
also be combined where the consolidation of the data is the result of the
selection, or, as for the case of data warehouses, the selection is done on
transformed data.
The KDD is an iterative process. Once the discovered knowledge is presented to
the user, the evaluation measures can be enhanced, the mining can be further
refined, new data can be selected or further transformed, or new data sources
can be integrated, in order to get different, more appropriate results.
Data mining derives its name from the similarities between searching for
valuable information in a large database and mining rocks for a vein of valuable
ore. Both imply either sifting through a large amount of material or ingeniously
probing the material to exactly pinpoint where the values reside.
It is, however, a misnomer, since mining for gold in rocks is usually called
“gold mining” and not “rock mining”, thus by analogy, data mining should have
been called “knowledge mining” instead. Nevertheless, data mining became the
accepted customary term and rapidly grew into a trend that even overshadowed
more general terms such as knowledge discovery in databases (KDD), which
describe a more complete process.
Other similar terms referring to data mining are: data dredging, knowledge
extraction and pattern discovery.

RELATIONAL DATABASES:
• A database system, also called a database management system (DBMS),
consists of a collection of interrelated data, known as a database, and a set of
software programs to manage and access the data.
• A relational database: is a collection of tables, each of which is assigned a
unique name.
• Each table consists of a set of attributes (columns or fields) and usually stores
a large set of tuples (records or rows).
• Each tuple in a relational table represents an object identified by a unique key
and described by a set of attribute values.
• A semantic data model, such as an entity-relationship (ER) data model, is
often constructed for relational databases.
• An ER data model represents the database as a set of entities and their
relationships.
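As a concrete illustration, a relation can be sketched in code as a set of tuples keyed by a unique attribute. The table name, attribute names, and values below are invented for illustration only.

```python
# A tiny relation: each tuple (row) is identified by a unique key
# (emp_id) and described by a set of attribute values. All names and
# values here are invented for illustration.
employee = {
    101: {"name": "Smith", "dept": "Sales"},
    102: {"name": "Jones", "dept": "Research"},
    103: {"name": "Patel", "dept": "Sales"},
}

def select_by_dept(relation, dept):
    """Relational selection: keys of all tuples whose dept matches."""
    return [key for key, row in relation.items() if row["dept"] == dept]

sales_staff = select_by_dept(employee, "Sales")  # [101, 103]
```

A real relational DBMS would express the same selection declaratively in SQL; the point here is only the table-of-keyed-tuples structure.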

DATA WAREHOUSE:
• A data warehouse is a repository of information collected from multiple
sources, stored under a unified schema, and that usually resides at a single site.
• Data warehouses are constructed via a process of data cleaning, data
integration, data transformation, data loading, and periodic data refreshing.
• The data are stored to provide information from a historical perspective and
are typically summarized.
• A data warehouse is usually modelled by a multidimensional database
structure, where each dimension corresponds to an attribute or a set of
attributes in the schema, and each cell stores the value of some aggregate
measure, such as count or sales amount.
• A data cube provides a multidimensional view of data and allows the
precomputation and fast accessing of summarized data.
What is the difference between a data warehouse and a data mart?”:
• A data warehouse collects information about subjects that span an entire
organization, and thus its scope is enterprise-wide.
• A data mart is a departmental subset of a data warehouse. It focuses on
selected subjects, and thus its scope is department-wide.
• Data warehouse systems are well suited for on-line analytical processing, or
OLAP.
• Examples of OLAP operations include drill-down and roll-up, which allow the
user to view the data at differing degrees of summarization.
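The roll-up idea can be sketched on a toy two-dimensional cube. The dimensions (region, quarter), the measure (sales), and all figures below are invented for illustration.

```python
# Sketch of roll-up on a tiny sales cube: dimensions (region, quarter),
# measure = sales amount. All figures are invented for illustration.
cells = [
    {"region": "north", "quarter": "Q1", "sales": 100},
    {"region": "north", "quarter": "Q2", "sales": 150},
    {"region": "south", "quarter": "Q1", "sales": 90},
    {"region": "south", "quarter": "Q2", "sales": 60},
]

def roll_up(cells, dim):
    """Summarize away all dimensions except `dim` (a coarser view)."""
    totals = {}
    for c in cells:
        totals[c[dim]] = totals.get(c[dim], 0) + c["sales"]
    return totals

by_region = roll_up(cells, "region")    # {'north': 250, 'south': 150}
by_quarter = roll_up(cells, "quarter")  # {'Q1': 190, 'Q2': 210}
```

Drill-down is the reverse operation: moving from a summarized view such as `by_region` back toward the detailed cells.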
Transactional Databases:
• A transactional database consists of a file where each record represents a
transaction.
• A transaction typically includes a unique transaction identity number (trans
ID) and a list of the items making up the transaction.
• The transactional database may have additional tables associated with it,
which contain other information regarding the sale, such as the date of the
transaction, the customer ID number, the ID number of the salesperson and of
the branch at which the sale occurred, and so on.
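Such a record layout can be sketched directly; every transaction ID, item name, date, and branch below is invented for illustration.

```python
# Sketch of a transactional database: each record holds a unique
# trans_id and the list of items making up the transaction, plus an
# associated table of other sale information. All values are invented.
transactions = [
    {"trans_id": "T100", "items": ["bread", "milk", "butter"]},
    {"trans_id": "T200", "items": ["milk", "diapers"]},
]

sales_info = {  # linked to transactions by trans_id
    "T100": {"date": "2024-01-05", "customer_id": 7, "branch": "B1"},
    "T200": {"date": "2024-01-06", "customer_id": 9, "branch": "B2"},
}

def items_of(trans_id):
    """Return the item list of the transaction with the given ID."""
    for t in transactions:
        if t["trans_id"] == trans_id:
            return t["items"]
    return []
```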

➢ ADVANCED DATA AND INFORMATION SYSTEMS AND ADVANCED APPLICATIONS:
The new database applications include handling spatial data (such as maps),
engineering design data (such as the design of buildings, system components, or
integrated circuits), hypertext and multimedia data (including text, image,
video, and audio data), time-related data (such as historical records or stock
exchange data), stream data (such as video surveillance and sensor data, where
data flow in and out like streams), and the World Wide Web (a huge, widely
distributed information repository made available by the Internet).

OBJECT-RELATIONAL DATABASES:
• Object-relational databases are constructed based on an object-relational data
model.
• This model extends the relational model by providing a rich data type for
handling complex objects and object orientation. Object-relational databases
are becoming increasingly popular in industry and applications.
• The object-relational data model inherits the essential concepts of
object-oriented databases.
Each object has associated with it the following:
• A set of variables that describe the objects. These correspond to attributes in
the entity relationship and relational models.
• A set of messages that the object can use to communicate with other objects,
or with the rest of the database system.
• A set of methods, where each method holds the code to implement a message.
Upon receiving a message, the method returns a value in response. For instance:
the method for the message get photo(employee) will retrieve and return a photo
of the given employee object.
• Objects that share a common set of properties can be grouped into an object
class.
• Each object is an instance of its class. Object classes can be organized into
class/subclass hierarchies so that each class represents properties that are
common to objects in that class.

➢ TEMPORAL DATABASES, SEQUENCE DATABASES, AND TIME SERIES DATABASES:
TEMPORAL DATABASE:
A temporal database typically stores relational data that include time-related
attributes. These attributes may involve several timestamps, each having
different semantics.
A sequence database stores sequences of ordered events, with or without a
concrete notion of time. Examples: include customer shopping sequences, Web
click streams, and biological sequences. A time series database stores sequences
of values or events obtained over repeated measurements of time (e.g., hourly,
daily, weekly).

➢ SPATIAL DATABASES AND SPATIOTEMPORAL DATABASES:
SPATIAL DATABASES:
• Spatial databases contain spatial-related information. Examples include
geographic (map) databases, very large-scale integration (VLSI) or
computer-aided design databases, and medical and satellite image databases.
• Spatial data may be represented in raster format, consisting of
n-dimensional bit maps or pixel maps.
• Example: a 2-D satellite image may be represented as raster data, where each
pixel registers the rainfall in a given area.
• Maps can be represented in vector format, where roads, bridges, buildings,
and lakes are represented as unions or overlays of basic geometric constructs,
such as points, lines, polygons, and the partitions and networks formed by these
components.
“What kind of data mining can be performed on spatial databases?” :
• Data mining may uncover patterns describing the characteristics of houses
located near a specified kind of location, such as a park, for instance.
• A spatial database that stores spatial objects that change with time is called a
spatiotemporal database, from which interesting information can be mined.

➢ Text Databases and Multimedia Databases:


TEXT DATABASES:
• Databases that contain word descriptions for objects.
• These word descriptions are usually not simple keywords but rather long
sentences or paragraphs, such as product specifications, error or bug reports,
warning messages, summary reports, notes, or other documents.
• Text databases may be highly unstructured (such as some Web pages on the
World Wide Web). Some text databases may be somewhat structured, that is,
semistructured (such as e-mail messages and many HTML/XML Web pages).
• Text databases with highly regular structures typically can be implemented
using relational database systems.
MULTIMEDIA DATABASES:
• Store image, audio, and video data. They are used in applications such as
picture content-based retrieval, voice-mail systems, video-on-demand systems,
the World Wide Web, and speech-based user interfaces that recognize spoken
commands.
• Multimedia databases must support large objects, because data objects such as
video can require gigabytes of storage.
• Specialized storage and search techniques are also required. Because video
and audio data require real-time retrieval at a steady and predetermined rate in
order to avoid picture or sound gaps and system buffer overflows, such data are
referred to as continuous-media data.
➢ HETEROGENEOUS DATABASES AND LEGACY DATABASES:
HETEROGENEOUS DATABASE:
It consists of a set of interconnected, autonomous component databases. The
components communicate in order to exchange information and answer queries.
Objects in one component database may differ greatly from objects in other
component databases, making it difficult to assimilate their semantics into the
overall heterogeneous database.
LEGACY DATABASE:
It is a group of heterogeneous databases that combines different kinds of data
systems, such as relational or object-oriented databases, hierarchical databases,
network databases, spreadsheets, multimedia databases, or file systems. The
heterogeneous databases in a legacy database may be connected by intra or
inter-computer networks.
DATA STREAMS:
• Many applications involve the generation and analysis of a new kind of data,
called stream data, where data flow in and out of an observation platform (or
window) dynamically.
• Such data streams have the following unique features: huge or possibly
infinite volume; dynamically changing; flowing in and out in a fixed order;
allowing only one or a small number of scans; and demanding fast (often
real-time) response times.
Examples: streams include various kinds of scientific and engineering data,
time-series data, and data produced in other dynamic environments, such as
power supply, network traffic, stock exchange, telecommunications, Web click
streams, video surveillance, and weather or environment monitoring.
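Because a stream allows only one scan, statistics over an observation window must be maintained incrementally rather than by re-reading the data. A minimal sketch, with an invented stream of sensor readings:

```python
# Streams permit only a single pass, so each value is seen exactly once
# and summarized within a sliding observation window.
from collections import deque

def windowed_means(stream, size):
    """Yield the mean of the last `size` values after each arrival;
    the stream is traversed exactly once (a single scan)."""
    window = deque(maxlen=size)  # old values fall out automatically
    for value in stream:
        window.append(value)
        yield sum(window) / len(window)

# Invented readings standing in for, e.g., sensor data.
means = list(windowed_means([2, 4, 6, 8], size=2))
# [2.0, 3.0, 5.0, 7.0]
```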
THE WORLD WIDE WEB:
The World Wide Web and its associated distributed information services, such
as Yahoo!, Google, America Online, and AltaVista, provide rich, worldwide, on-
line information services, where data objects are linked together to facilitate
interactive access.
Example: understanding user access patterns will help improve system design
by providing efficient access between highly correlated objects.
➢ DATA MINING FUNCTIONALITIES:
Data mining functionalities are used to specify the kind of patterns to be found
in data mining tasks. Data mining tasks can be classified into two categories:
descriptive and predictive. Descriptive mining tasks characterize the general
properties of the data in the database. Predictive mining tasks perform inference
on the current data in order to make predictions.
In some cases, users may have no idea regarding what kinds of patterns in their
data may be interesting, and hence may like to search for several different kinds
of patterns in parallel. Thus it is important to have a data mining system that can
mine multiple kinds of patterns to accommodate different user expectations or
applications. Furthermore, data mining systems should be able to discover
patterns at various granularities (i.e., different levels of abstraction). Data mining
systems should also allow users to specify hints to guide or focus the search for
interesting patterns. Because some patterns may not hold for all of the data in
the database, a measure of certainty or “trustworthiness” is usually associated
with each discovered pattern.
Data mining functionalities, and the kinds of patterns they can discover, are
described below.
Mining Frequent Patterns, Associations, and Correlations:
Frequent patterns, as the name suggests, are patterns that occur frequently in
data. There are many kinds of frequent patterns, including itemsets,
subsequences, and substructures.
A frequent itemset typically refers to a set of items that frequently appear
together in a transactional data set, such as milk and bread. A frequently
occurring subsequence, such as the pattern that customers tend to purchase first
a PC, followed by a digital camera, and then a memory card, is a (frequent)
sequential pattern.
A substructure can refer to different structural forms, such as graphs, trees, or
lattices, which may be combined with itemsets or subsequences. If a
substructure occurs frequently, it is called a (frequent) structured pattern.
Mining frequent patterns leads to the discovery of interesting associations and
correlations within data.
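A minimal sketch of frequent-itemset counting for pairs follows; the transactions, item names, and support threshold are invented for illustration, and a real miner (e.g., Apriori) would prune the search rather than enumerate every pair.

```python
# Find pairs of items whose support (fraction of transactions that
# contain both items) meets a minimum threshold. All data invented.
from itertools import combinations

transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "beer"},
    {"milk", "bread", "beer"},
]

def frequent_pairs(transactions, min_support):
    """Count every 2-itemset and keep those meeting min_support."""
    counts = {}
    for t in transactions:
        for pair in combinations(sorted(t), 2):
            counts[pair] = counts.get(pair, 0) + 1
    n = len(transactions)
    return {p: c / n for p, c in counts.items() if c / n >= min_support}

result = frequent_pairs(transactions, min_support=0.7)
# {('bread', 'milk'): 0.75} -- milk and bread co-occur in 3 of 4 baskets
```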
Data mining involves six common classes of tasks:
Anomaly detection (Outlier/change/deviation detection) –
The identification of unusual data records, that might be interesting or data
errors that require further investigation.
Association rule learning (Dependency modelling) –
Searches for relationships between variables. For example, a supermarket might
gather data on customer purchasing habits. Using association rule learning, the
supermarket can determine which products are frequently bought together and
use this information for marketing purposes. This is sometimes referred to as
market basket analysis.
Clustering –
is the task of discovering groups and structures in the data that are in some way
or another "similar", without using known structures in the data.
Classification –
is the task of generalizing known structure to apply to new data. For example,
an e-mail program might attempt to classify an e-mail as "legitimate" or as
"spam".
Regression –
attempts to find a function which models the data with the least error.
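For the linear case, the least-error function has a closed form. A sketch with invented sample points that happen to lie exactly on a line:

```python
# Ordinary least squares for a line y = a*x + b: the (a, b) minimizing
# the sum of squared errors over the sample points.
def fit_line(points):
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)  # slope
    b = (sy - a * sx) / n                          # intercept
    return a, b

# Invented points lying exactly on y = 2x + 1, so the fit is exact.
a, b = fit_line([(0, 1), (1, 3), (2, 5)])
```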
Summarization –
providing a more compact representation of the data set, including Visualization
and report generation.
Architecture of Data Mining:
A typical data mining system may have the following major components.
1. Knowledge Base:
This is the domain knowledge that is used to guide the search or evaluate the
interestingness of resulting patterns. Such knowledge can include concept
hierarchies, used to organize attributes or attribute values into different levels of
abstraction.
Knowledge such as user beliefs, which can be used to assess a pattern’s
interestingness based on its unexpectedness, may also be included. Other
examples of domain knowledge are additional interestingness constraints or
thresholds, and metadata (e.g., describing data from multiple heterogeneous
sources).
2. Data Mining Engine:
This is essential to the data mining system and ideally consists of a set of
functional modules for tasks such as characterization, association and
correlation analysis, classification, prediction, cluster analysis, outlier analysis,
and evolution analysis.
3. Pattern Evaluation Module:
This component typically employs interestingness measures and interacts with
the data mining modules so as to focus the search toward interesting patterns.
It may use interestingness thresholds to filter out discovered patterns.
Alternatively, the pattern evaluation module may be integrated with the mining
module, depending on the implementation of the data mining method used. For
efficient data mining, it is highly recommended to push the evaluation of
pattern interestingness as deep as possible into the mining process, so as to
confine the search to only the interesting patterns.
4. User interface:
This module communicates between users and the data mining system, allowing
the user to interact with the system by specifying a data mining query or task,
providing information to help focus the search, and performing exploratory
data mining based on the intermediate data mining results. In addition, this
component allows the user to browse database and data warehouse schemas or
data structures, evaluate mined patterns, and visualize the patterns in different
forms.

➢ CLASSIFICATION OF DATA MINING SYSTEMS:


Data mining is an interdisciplinary field, the confluence of a set of disciplines,
including database systems, statistics, machine learning, visualization, and
information science. Moreover, depending on the data mining approach used,
techniques from other disciplines may be applied, such as neural networks,
fuzzy and/or rough set theory, knowledge representation, inductive logic
programming, or high performance computing.
Data mining systems can be categorized according to various criteria, as
follows:
Classification according to the kinds of databases mined:
A data mining system can be classified according to the kinds of databases
mined. Database systems can be classified according to different criteria (such
as data models, or the types of data or applications involved), each of which
may require its own data mining technique.
Classification according to the kinds of knowledge mined:
Data mining systems can be categorized according to the kinds of knowledge
they mine, that is, based on data mining functionalities, such as characterization,
discrimination, association and correlation analysis, classification, prediction,
clustering, outlier analysis, and evolution analysis.
Classification according to the kinds of techniques utilized:
Data mining systems can be categorized according to the underlying data
mining techniques employed. These techniques can be described according to
the degree of user interaction involved (e.g., autonomous systems, interactive
exploratory systems, query-driven systems) or the methods of data analysis
employed (e.g., database-oriented or data warehouse– oriented techniques,
machine learning, statistics, visualization, pattern recognition, neural networks,
and so on).
Classification according to the applications adapted:
Data mining systems can also be categorized according to the applications they
adapt. For example, data mining systems may be tailored specifically for
finance, telecommunications, DNA, stock markets, e-mail, and so on. Different
applications often require the integration of application-specific methods.

➢ MAJOR ISSUES IN DATA MINING:


Mining different kinds of knowledge in databases. –
Different users have different needs and may be interested in different kinds
of knowledge. It is therefore necessary for data mining to cover a broad range
of knowledge discovery tasks.
Interactive mining of knowledge at multiple levels of abstraction. –
The data mining process needs to be interactive so that users can focus the
search for patterns, providing and refining data mining requests based on the
returned results.
Incorporation of background knowledge. –
Background knowledge can be used to guide the discovery process and to express
the discovered patterns, not only in concise terms but also at multiple levels
of abstraction.
Data mining query languages and ad hoc data mining. –
A data mining query language that allows the user to describe ad hoc mining
tasks should be integrated with a data warehouse query language and optimized
for efficient and flexible data mining.
Presentation and visualization of data mining results. –
Once the patterns are discovered, they need to be expressed in high-level
languages or visual representations. These representations should be easily
understandable by the users.
Handling noisy or incomplete data. –
Data cleaning methods are required that can handle noisy and incomplete
objects while mining the data regularities. Without such methods, the accuracy
of the discovered patterns will be poor.
Pattern evaluation. –
This refers to the interestingness of the discovered patterns. Patterns that
merely represent common knowledge or lack novelty are not interesting, so
discovered patterns should be evaluated against interestingness measures.
Efficiency and scalability of data mining algorithms. –
In order to effectively extract information from the huge amounts of data in
databases, data mining algorithms must be efficient and scalable.
Parallel, distributed, and incremental mining algorithms. –
Factors such as the huge size of databases, the wide distribution of data, and
the complexity of data mining methods motivate the development of parallel and
distributed data mining algorithms. These algorithms divide the data into
partitions, which are processed in parallel, and the results from the
partitions are then merged. Incremental algorithms update the mined knowledge
as the database is updated, without mining the data again from scratch.
