Major Components of Data Mining System

The document outlines the major components of a data mining system, including databases, data warehouses, data mining engines, and user interfaces, emphasizing their roles in data processing and analysis. It discusses various types of data repositories such as relational databases, data warehouses, and transactional databases, along with advanced data systems for handling complex data types. Additionally, it highlights the importance of data mining as an evolution of information technology, facilitating the extraction of useful information through systematic processes.

Uploaded by

13it11
Major components of data mining system

• Database, data warehouse, World Wide Web, or other information
repository: This is one or a set of databases, data warehouses,
spreadsheets, or other kinds of information repositories. Data cleaning and
data integration techniques may be performed on the data.
• Database or data warehouse server: The database or data warehouse
server is responsible for fetching the relevant data, based on the user’s
data mining request.
Figure 1.5 Architecture of a typical data mining system

• Knowledge base: This is the domain knowledge that is used to guide the
search or to evaluate the interestingness of resulting patterns. Such
knowledge can include concept hierarchies, used to organize attributes or
attribute values into different levels of abstraction. Knowledge such as
user beliefs, which can be used to assess a pattern’s interestingness based
on its unexpectedness, may also be included. Other examples are additional
interestingness constraints or thresholds, and metadata.
• Data mining engine: This is essential to the data mining system and
ideally consists of a set of functional modules for tasks such as
characterization, association and correlation analysis, classification,
prediction, cluster analysis, outlier analysis, and evolution analysis.
• Pattern evaluation module: This component employs interestingness
measures and interacts with the data mining modules. It may use
interestingness thresholds to filter out discovered patterns. Alternatively,
the pattern evaluation module may be integrated with the mining module,
depending on the implementation of the data mining method used.
• User interface: This module communicates between users and the data
mining system, allowing the user to interact with the system by specifying
a data mining query or task, providing information to help focus the
search, and performing exploratory data mining based on the intermediate
results. It also allows the user to browse database and data warehouse
schemas, evaluate mined patterns, and visualize the patterns in different
forms.
Data mining can be viewed as an advanced stage of on-line analytical
processing (OLAP). It is considered one of the most important frontiers in
database and information systems.
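The filtering role of the pattern evaluation module described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the text does not name specific interestingness measures, so support and confidence are assumed here, and the `Rule` class, the `evaluate_patterns` function, and the sample rules are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """A discovered pattern with two assumed interestingness measures."""
    antecedent: frozenset
    consequent: frozenset
    support: float      # fraction of records that contain the whole rule
    confidence: float   # fraction of antecedent records that also match


def evaluate_patterns(rules, min_support, min_confidence):
    """Keep only the rules that meet both interestingness thresholds."""
    return [
        r for r in rules
        if r.support >= min_support and r.confidence >= min_confidence
    ]


# Made-up candidate patterns, as a mining module might hand them over.
candidates = [
    Rule(frozenset({"computer"}), frozenset({"software"}), 0.40, 0.80),
    Rule(frozenset({"phone"}), frozenset({"printer"}), 0.02, 0.90),
    Rule(frozenset({"printer"}), frozenset({"computer"}), 0.30, 0.35),
]

interesting = evaluate_patterns(candidates, min_support=0.10, min_confidence=0.60)
```

Keeping the filter in its own function mirrors the stand-alone evaluation module; an integrated design would apply the same thresholds inside the mining step instead.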

Kinds of data:
Data mining should be applicable to any kind of data repository, as well as
to transient data such as data streams. The repositories considered here are
the following.
• Relational databases
• Data warehouses
• Transactional databases
• Advanced database systems
Relational databases:
A database system, also called a database management system (DBMS),
consists of a collection of interrelated data, known as a database, and a set
of software programs to manage and access the data. The software programs
involve mechanisms for the definition of database structures, for data
storage, for concurrent and shared data access, and for ensuring the
consistency and security of the stored information.
A relational database is a collection of tables, each of which is assigned a
unique name. Each table consists of a set of columns (attributes) and rows
(tuples). Each row represents an object that is identified by a unique key and
described by a set of attribute values.
The entity-relationship (ER) model, a semantic data model, represents the
database as a set of entities and their relationships.
Example:
The AllElectronics company is described by the following relation tables:
customer, item, employee, and branch.
The relation customer consists of a set of attributes, including a unique
customer identity number (cust_id), customer name, address, age, occupation,
annual income, credit information, category, and so on. Each of the other
relations likewise describes its entities with a set of attributes.
Figure 1.6 Fragments of relations from a relational database for AllElectronics

Relational data can be accessed by database queries written in a relational
query language such as SQL.
A query is transformed into a set of relational operations, such as join,
selection, and projection, and is then optimized for efficient processing.
A query allows retrieval of specified subsets of the data. Relational languages
also include aggregate functions such as sum, avg (average), count, max
(maximum), and min (minimum).
Data mining in relational databases goes further by searching for trends or
data patterns. It can also detect deviations, which can then be investigated
further.
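As an illustration of the query operations just described, the following minimal sketch uses Python's standard sqlite3 module with an in-memory database. The customer table follows the example above in simplified form; the purchase table, its columns, and all rows are made-up assumptions for demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (cust_id INTEGER PRIMARY KEY, name TEXT, age INTEGER);
CREATE TABLE purchase (cust_id INTEGER, item TEXT, amount REAL);
INSERT INTO customer VALUES (1, 'Ann', 34), (2, 'Bob', 51);
INSERT INTO purchase VALUES
    (1, 'computer', 900.0), (1, 'phone', 300.0), (2, 'phone', 250.0);
""")

# Join:       customer rows are matched with purchase rows on cust_id.
# Selection:  only customers younger than 40 are kept.
# Projection: only the name column and the aggregates are returned.
# Aggregates: SUM and COUNT summarize the purchases of each customer.
rows = conn.execute("""
    SELECT c.name, SUM(p.amount) AS total_spent, COUNT(*) AS n_purchases
    FROM customer AS c
    JOIN purchase AS p ON p.cust_id = c.cust_id
    WHERE c.age < 40
    GROUP BY c.name
""").fetchall()

conn.close()
```

A query like this retrieves a specified subset of the data; finding trends or deviations in it is the part left to data mining.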
Data warehouses:
A data warehouse is a repository of information collected from multiple
sources, stored under a unified schema, and usually residing at a single site.
It is constructed via a process of data cleaning, data integration, data
transformation, data loading, and periodic data refreshing.
Figure 1.7 Typical framework of a data warehouse for AllElectronics

Data warehouses are organized around major subjects, such as customer, item,
supplier, and activity, to facilitate decision making. The data are stored to
provide information from a historical perspective and are typically
summarized.
A data warehouse is usually modelled by a multidimensional database structure,
where each dimension corresponds to an attribute or a set of attributes in the
schema, and each cell stores the value of some aggregate measure.
The actual physical structure of a data warehouse may be a relational data
store or a multidimensional data cube. A data cube allows the precomputation
and fast accessing of summarized data.
Example: A data cube for AllElectronics. It has three dimensions: address (with
city values Chicago, New York, Toronto, Vancouver), time (with quarter values
Q1, Q2, Q3, Q4), and item (with item type values home entertainment, computer,
phone, security). The aggregate value stored in each cell is sales_amount.
Difference between a data warehouse and a data mart: A data warehouse collects
information about subjects that span an entire organization, so its scope is
enterprise-wide. A data mart is a department subset of a data warehouse. It
focuses on selected subjects, so its scope is department-wide.
By providing multidimensional data views, a data warehouse is well suited for
on-line analytical processing (OLAP). OLAP operations use background knowledge
to allow the presentation of data at different levels of abstraction. OLAP
operations include drill-down and roll-up, which allow the user to view the
data at differing degrees of summarization.
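The cube and the roll-up operation described above can be sketched with a plain Python dictionary. This is a toy model, not how a warehouse stores its data: the cell values are made up, the `roll_up` helper is hypothetical, and it implements roll-up only in the sense of aggregating away whole dimensions (climbing a concept hierarchy, such as city to country, is not shown).

```python
from collections import defaultdict

DIMENSIONS = ("city", "quarter", "item")

# Each cell maps (city, quarter, item) to sales_amount; values are made up.
cube = {
    ("Chicago", "Q1", "computer"): 100.0,
    ("Chicago", "Q2", "computer"): 120.0,
    ("Toronto", "Q1", "computer"): 80.0,
    ("Toronto", "Q1", "phone"): 40.0,
}


def roll_up(cells, keep):
    """Sum sales_amount over every dimension that is not listed in `keep`."""
    positions = [DIMENSIONS.index(d) for d in keep]
    totals = defaultdict(float)
    for key, amount in cells.items():
        totals[tuple(key[i] for i in positions)] += amount
    return dict(totals)


# Most summarized view: one total per city.
by_city = roll_up(cube, ("city",))

# Drilling down from that view adds the quarter dimension back in.
by_city_quarter = roll_up(cube, ("city", "quarter"))
```

Moving from `by_city_quarter` to `by_city` is a roll-up; reading the same two views in the opposite order corresponds to a drill-down.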
Figure 1.8 Multidimensional data cube commonly used for data warehousing

Transactional databases:
In general, a transactional database consists of a file where each record
represents a transaction. A transaction typically includes a unique transaction
identity number and a list of the items making up the transaction.
Figure 1.9 Fragment of a transactional database for sales at AllElectronics

The transactional database may have additional tables associated with it that
hold other information regarding the sale, such as the date of the transaction,
the customer ID number, the ID number of the salesperson, and the branch.
Example 1.3 A transactional database for AllElectronics. Transactions can be
stored in a table, with one record per transaction. A transactional database is
usually stored in a flat file or unfolded into a standard relation. Market
basket data analysis enables you to bundle groups of items together as a
strategy for maximizing sales. Data mining systems for transactional data
support this by identifying frequent itemsets, that is, sets of items that are
frequently sold together.
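To make the idea of frequent itemsets concrete, here is a minimal sketch that counts how often each pair of items is sold together and keeps the pairs that reach a minimum count. The transaction IDs, the items, the threshold, and the `frequent_pairs` function are illustrative assumptions; this brute-force count is not a scalable mining algorithm.

```python
from collections import Counter
from itertools import combinations

# One record per transaction: transaction ID -> list of items (made-up data).
transactions = {
    "T100": ["computer", "software", "printer"],
    "T200": ["computer", "software"],
    "T300": ["phone", "computer"],
    "T400": ["software", "computer"],
}


def frequent_pairs(transactions, min_count):
    """Return the item pairs that occur together in at least min_count transactions."""
    counts = Counter()
    for items in transactions.values():
        # Sort and de-duplicate so that each pair has one canonical form.
        for pair in combinations(sorted(set(items)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_count}


pairs = frequent_pairs(transactions, min_count=2)
```

With this sample data only the pair (computer, software) is frequent, which is the kind of result a retailer might use when deciding which items to bundle.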
Advanced data and information systems and advanced applications:
New database applications include handling spatial data such as maps,
engineering design data such as integrated circuits and system components,
hypertext and multimedia data, time-related data, stream data, and the World
Wide Web. These applications require efficient data structures and scalable
methods for handling complex object structures; variable-length records;
semistructured or unstructured data; text, spatiotemporal, and multimedia
data; and database schemas with dynamic changes.
Advanced database systems and specific application-oriented database systems
include object-relational database systems, temporal and time-series database
systems, spatial and spatiotemporal database systems, text and multimedia
database systems, and web-based global information systems.
These databases require sophisticated facilities to store, retrieve, and update
large amounts of complex data. They provide fertile ground and raise many
challenging research and implementation issues for data mining.
Object-relational databases:
These are constructed based on an object-relational data model. This model
extends the relational model by providing a rich data type for handling complex
objects and object orientation.
The object-relational data model inherits the essential concepts of
object-oriented databases, where each entity is considered as an object. Data
and code relating to an object are encapsulated into a single unit. Each object
has the following associated with it:
• A set of variables that describe the object. These correspond to attributes
in the entity-relationship and relational models.
• A set of messages that the object can use to communicate with other objects,
or with the rest of the database system.
• A set of methods, where each method holds the code to implement a message.
Upon receiving a message, the method returns a value in response.
Objects that share a common set of properties can be grouped into an object
class. Each object is an instance of its class. Object classes can be organized
into class/subclass hierarchies so that each class represents properties that
are common to objects in that class. For example, salesperson is a subclass of
the class employee. A salesperson object would inherit all of the variables
pertaining to its superclass, employee. Such a class inheritance feature
benefits information sharing.
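The class and subclass relationship described above can be illustrated with ordinary Python classes. This shows the object-oriented concepts only, not an object-relational database; the attribute names, the `describe` and `commission` methods, and the sample values are hypothetical.

```python
class Employee:
    """Object class: the variables describe the object, the methods answer messages."""

    def __init__(self, name, salary):
        self.name = name
        self.salary = salary

    def describe(self):
        # Method invoked when the object receives the "describe" message.
        return f"{self.name} earns {self.salary}"


class SalesPerson(Employee):
    """Subclass: inherits every variable and method of Employee."""

    def __init__(self, name, salary, commission_rate):
        super().__init__(name, salary)
        self.commission_rate = commission_rate

    def commission(self, sales_amount):
        # Behaviour that only salespeople have.
        return sales_amount * self.commission_rate


ann = SalesPerson("Ann", 50000, 0.05)
summary = ann.describe()          # inherited from Employee
bonus = ann.commission(1000)      # defined on SalesPerson only
```

Because `SalesPerson` reuses the `Employee` definition rather than repeating it, the shared properties are stored once, which is the information-sharing benefit mentioned above.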
Temporal databases, sequence databases, and time-series databases:
A temporal database stores relational data that include time-related
attributes. These attributes may involve several timestamps, each having
different semantics.
A sequence database stores sequences of ordered events, with or without a
concrete notion of time.
A time-series database stores sequences of values or events obtained over
repeated measurements of time.

Spatial databases and spatiotemporal databases:


Spatial databases contain spatial-related information. Examples include
geographic databases, very large-scale integration (VLSI) or computer-aided
design databases, and medical and satellite image databases.
Spatial data may be represented in raster format, consisting of n-dimensional
bit maps or pixel maps. For example, a 2-D satellite image may be represented
as raster data.
Maps can be represented in vector format, where roads, bridges, buildings, and
lakes are represented as unions of basic geometric constructs, such as points,
lines, polygons, and the networks formed by these components.
Geographic databases have numerous applications, ranging from forestry and
ecology planning to providing public service information regarding the location
of telephone and electric cables, pipes, and sewage systems. Mining spatial
databases may uncover patterns that describe, for example, the characteristics
of a specified kind of location or the climate of mountain areas located at
various altitudes.
The relationships among a set of spatial objects can be examined in order to
discover which subsets of objects are spatially auto-correlated. A spatial
database that stores spatial objects that change with time is called a
spatiotemporal database. Mining such data can, for example, identify the trends
of moving objects and flag strangely moving vehicles.
Text databases and multimedia databases:
Text databases are databases that contain word descriptions for objects. These
word descriptions are usually long sentences or paragraphs, such as product
specifications, error or bug reports, warning messages, and summary reports.
Text databases may be highly unstructured (such as some web pages),
semistructured (such as e-mail messages and many HTML/XML web pages), or
relatively well structured (such as library catalogue databases).
By mining text data, one may uncover general and concise descriptions of the
text documents, as well as keyword or content associations. To do this,
standard data mining methods need to be integrated with information retrieval
techniques and with the construction or use of hierarchies specifically for
text data.
Multimedia databases store image, audio, and video data. They are used in
applications such as picture content-based retrieval, voice-mail systems,
video-on-demand systems, the World Wide Web, and speech-based user interfaces.
Multimedia databases must support large objects, because data objects such as
video can require gigabytes of storage. Video and audio data require real-time
retrieval at a steady and predetermined rate to avoid picture or sound gaps.
Such data are referred to as continuous-media data.
Heterogeneous databases and legacy databases:
A heterogeneous database consists of a set of interconnected, autonomous
component databases. The components communicate in order to exchange
information and answer queries.
A legacy database is a group of heterogeneous databases that combines
different kinds of data systems, such as relational or object-oriented
databases, hierarchical databases, network databases, spreadsheets, or
multimedia databases. These databases may be connected by intra- or
inter-computer networks.
Information exchange across such databases is difficult because it requires
precise transformation rules from one representation to another, taking
semantics into account.
Data mining techniques may provide an interesting solution to the information
exchange problem by performing statistical data distribution and correlation
analysis, and by transforming the given data into higher, more generalized
conceptual levels.
Data streams:
Many applications involve the generation and analysis of a new kind of data,
called stream data, where data flow in and out of an observation platform
dynamically. Such data have unique features: huge or possibly infinite volume,
dynamic change, flow in and out in a fixed order, allowance for only one or a
small number of scans, and a demand for fast response time.
Effective and efficient management and analysis of stream data poses great
challenges to researchers. A typical query model in such a system is the
continuous query model, where predefined queries constantly evaluate incoming
streams, collect aggregate data, report the current status of data streams, and
respond to their changes.
Mining data streams involves the efficient discovery of general patterns and
dynamic changes within stream data. Most stream data reside at a low level of
abstraction, whereas analysts are often interested in higher and multiple
levels of abstraction.
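A continuous query of the kind described above can be sketched as a Python generator that sees each reading exactly once and reports an up-to-date aggregate after every arrival. The sliding-window average, the window size, and the sample readings are assumptions chosen for illustration.

```python
from collections import deque


def continuous_average(stream, window):
    """Yield the mean of the most recent `window` readings after each arrival.

    The stream is scanned once and only `window` values are kept in memory,
    so the input may be arbitrarily long.
    """
    recent = deque(maxlen=window)
    total = 0.0
    for reading in stream:
        if len(recent) == window:
            total -= recent[0]      # the oldest reading is about to be evicted
        recent.append(reading)
        total += reading
        yield total / len(recent)


# A finite list stands in for an endless stream of sensor readings.
averages = list(continuous_average(iter([1, 2, 3, 4]), window=2))
```

Keeping only a bounded window is one simple way to respect the one-scan and fast-response constraints; it trades the full history for a constant memory footprint.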
The World Wide Web
The World Wide Web and its associated distributed information services, such as
Yahoo, Google, America Online, and AltaVista, provide on-line information in
which data objects are linked together to facilitate interactive access. Users
seeking information traverse from one object via links to another.
Understanding user access patterns helps improve system design and also leads
to better marketing decisions.
Capturing user access patterns in such distributed information environments is
called web usage mining. Web pages can be highly unstructured and lack a
predefined schema, type, or pattern. Thus it is difficult for computers to
understand the semantic meaning of web pages and to organize them for
systematic information retrieval and data mining. Keyword-based searches offer
only limited help for users.
Data mining can provide additional help. Authoritative web page analysis, based
on linkages among web pages, can help rank web pages. Automated web page
clustering and classification help group and arrange web pages in a
multidimensional manner based on their contents.
Web community analysis helps identify hidden web social networks and
communities and observe their evolution. Web mining is the development of
scalable and effective web data analysis and mining methods. It helps to learn
about, characterize, and classify web pages, and to uncover web dynamics and
the associations among different web pages, users, communities, and web-based
activities.
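Ranking pages from their linkages can be illustrated with a small power-iteration sketch. The text does not say which link-analysis method is meant, so a simplified PageRank-style score is assumed here purely as an example; the `link_rank` function, the damping value, and the four-page graph are hypothetical.

```python
def link_rank(links, damping=0.85, iterations=50):
    """Score pages by incoming links using a simplified PageRank-style iteration.

    `links` maps each page to the list of pages it links to.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its score on in equal shares to the pages it cites.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without outgoing links spreads its score over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Toy web graph: A, B, and D all link to C, and C links back to A.
ranks = link_rank({"A": ["C"], "B": ["C"], "C": ["A"], "D": ["C"]})
best_page = max(ranks, key=ranks.get)
```

In this toy graph page C receives the most links and ends up with the highest score, which is the intuition behind treating heavily cited pages as authoritative.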

Data Mining and Knowledge Discovery


Data mining is devoted specifically to the processes involved in extracting
useful information by applying specific techniques based on certain knowledge
domains, such as statistics and artificial intelligence. Knowledge discovery is
a wider term: it covers the entire range of activities, from deciding business
objectives and capturing the desired data, through preparing, processing, and
arranging them and applying predefined techniques, to presenting the results in
a form the user can understand. Specifically, knowledge discovery can be
subdivided into five steps, which are performed repeatedly until the desired
result is reached; data mining is one of them.
(i) Data processing, comprising data selection, data cleaning, and data
integration.
(ii) Data transformation and organization into a form ready for fast access.
(iii) Searching and extraction using the Data Mining (DM) engine and other
techniques, such as OLAP or Online Transaction Processing (OLTP).
(iv) Knowledge presentation through a Graphical User Interface (GUI).
(v) Analysing the results and assimilating them into a knowledge domain.
Figure 2.1 shows the steps in knowledge discovery.
Fig. 2.1 Steps in knowledge discovery: Data Processing → Data Transformation →
Data Mining Engine → Knowledge Presentation through GUI → Result Analysis
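The steps of knowledge discovery can be read as a pipeline in which each stage feeds the next. The sketch below is a toy version under stated assumptions: the function names, the comma-separated record format, and the sample data are hypothetical, a simple item count stands in for the data mining engine, and a list of text lines stands in for the GUI. Step (v), analysing the results, is left to the person reading the report.

```python
from collections import Counter


def process(sources):
    """Step (i): select non-empty records, clean them, integrate all sources."""
    return [
        record.strip().lower()
        for source in sources
        for record in source
        if record and record.strip()
    ]


def transform(records):
    """Step (ii): turn each record into a set of items for fast lookup."""
    return [frozenset(record.split(",")) for record in records]


def mine(baskets, min_count):
    """Step (iii): keep the items that appear in at least min_count baskets."""
    counts = Counter(item for basket in baskets for item in basket)
    return {item: n for item, n in counts.items() if n >= min_count}


def present(patterns):
    """Step (iv): render the patterns as readable lines."""
    return [f"{item}: {n}" for item, n in sorted(patterns.items())]


# Two made-up sources with inconsistent case, stray spaces, and an empty record.
sources = [["Phone,Computer", ""], ["computer", " phone,printer "]]
report = present(mine(transform(process(sources)), min_count=2))
```

If the report is not useful, the analyst would adjust a stage, for example the threshold or the cleaning rules, and run the pipeline again, which matches the repeated execution described above.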

Importance of data mining


Data mining can be viewed as a result of the natural evolution of information
technology. The database system industry has witnessed an evolutionary path in
the development of the following functionalities:
data collection and database creation, data management (including data storage
and retrieval, and database transaction processing), and advanced data analysis
(involving data warehousing and data mining).

Figure 1.1 The evolution of database system technology


Since the 1960s, the early development of data collection and database creation
mechanisms has served as a prerequisite for the later development of effective
mechanisms for data storage and retrieval, and for query and transaction
processing.
Efficient methods for online transaction processing (OLTP), where a query is
viewed as a read-only transaction, have contributed to the evolution and wide
acceptance of relational technology as a major tool for efficient storage,
retrieval, and management of large amounts of data.
Application-oriented database systems, including spatial, temporal, multimedia,
active,
