
Data Mining L-3,4

The document discusses various types of data repositories applicable for data mining, including relational databases, data warehouses, and advanced database systems. It outlines the characteristics and structures of these databases, as well as the integration of data mining systems with database systems. Additionally, it addresses major issues in data mining, such as performance, user interaction, and the diversity of data types.


Rishi Sharma

IIIT Surat
Data for Data Mining
❖ Data mining should be applicable to any kind of data repository, as well as to
transient data, such as data streams.
➢ Relational databases,
➢ data warehouses,
➢ transactional databases,
➢ advanced database systems,
➢ flat files,
➢ data streams, and
➢ World Wide Web.
❖ Advanced database systems include object-relational databases and
specific application-oriented databases: spatial databases, time-series
databases, text databases, and multimedia databases.
Relational Databases
❖ A relational database is a collection of tables, each of which is assigned a
unique name.
❖ Each table consists of a set of attributes (columns) and usually stores a large set of
tuples (rows).
❖ Each tuple in a relational table represents an object identified by a unique key
and described by a set of attribute values.
❖ A semantic data model, such as an entity-relationship (ER) data model, is
often constructed for relational databases.
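A minimal sketch of these ideas using Python's built-in sqlite3 module. The customer table, its columns, and the sample rows are hypothetical and chosen only for illustration: cust_id plays the role of the unique key, and the remaining columns are the attribute values that describe each tuple.

```python
import sqlite3

# Hypothetical "customer" table: cust_id is the unique key,
# the other columns are the attributes describing each tuple.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer ("
    "cust_id INTEGER PRIMARY KEY, name TEXT, age INTEGER, city TEXT)"
)
conn.executemany(
    "INSERT INTO customer VALUES (?, ?, ?, ?)",
    [(1, "Asha", 34, "Surat"), (2, "Ravi", 27, "Pune"), (3, "Meena", 45, "Surat")],
)

# A relational query selects tuples by their attribute values.
rows = conn.execute(
    "SELECT cust_id, name FROM customer WHERE city = ? ORDER BY cust_id",
    ("Surat",),
).fetchall()
print(rows)
```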
Data Warehouses
❖ A data warehouse is a repository of information collected from multiple sources,
stored under a unified schema, and that usually resides at a single site.
❖ Data warehouses are constructed via a process of data cleaning, data
integration, data transformation, data loading, and periodic data refreshing.
❖ A data warehouse is usually modeled by a multidimensional database structure,
where each dimension corresponds to an attribute or a set of attributes in the
schema, and each cell stores the value of some aggregate measure.
❖ The actual physical structure of a data warehouse may be a relational data
store or a multidimensional data cube.
Fig: Framework of a data warehouse
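The multidimensional view can be sketched in a few lines of Python. This is an illustration under assumptions, not how a warehouse is actually implemented: the fact rows, the dimension names (city, quarter, item), and the build_cube helper are all hypothetical. Each cell of the resulting cube is keyed by dimension values and stores the summed sales amount as its aggregate measure.

```python
from collections import defaultdict

# Hypothetical sales facts: (city, quarter, item, amount).
facts = [
    ("Surat", "Q1", "phone", 100.0),
    ("Surat", "Q1", "laptop", 250.0),
    ("Surat", "Q2", "phone", 120.0),
    ("Pune", "Q1", "phone", 80.0),
]

def build_cube(facts, dims):
    """Sum the amount (column 3) for each combination of the chosen dimension columns."""
    cube = defaultdict(float)
    for row in facts:
        cell = tuple(row[i] for i in dims)
        cube[cell] += row[3]
    return dict(cube)

by_city_quarter = build_cube(facts, dims=(0, 1))  # 2-D view: city x quarter
by_city = build_cube(facts, dims=(0,))            # aggregated to city only
```

Dropping a dimension from dims gives a coarser summary of the same facts, which is the idea behind rolling up a cube.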
Transactional Databases
❖ A transactional database consists of a file where each record represents a
transaction. A transaction typically includes a unique transaction identity
number (trans ID) and a list of the items making up the transaction.
❖ The transactional database may have additional tables that store related
information, such as the date of the transaction, the customer ID number, and
the ID numbers of the salesperson and of the branch.
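A small sketch of how such records might be held and queried in Python. The transaction IDs, the items, and the support helper are hypothetical. The sketch counts how often each item occurs and computes the fraction of transactions that contain a given set of items.

```python
from collections import Counter

# Hypothetical transactions: trans ID -> list of items.
transactions = {
    "T100": ["bread", "milk"],
    "T200": ["bread", "butter", "milk"],
    "T300": ["butter"],
    "T400": ["bread", "milk", "eggs"],
}

# Count in how many transactions each item appears.
item_counts = Counter(
    item for items in transactions.values() for item in set(items)
)

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for items in transactions.values() if set(itemset) <= set(items))
    return hits / len(transactions)

print(item_counts["bread"], support({"bread", "milk"}))
```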
Advanced Data and Information Systems and Advanced Applications
❖ The new database applications include handling:
➢ spatial data (such as maps),
➢ engineering design data (design of buildings, system components, or integrated circuits),
➢ hypertext and multimedia data (including text, image, video, and audio data),
➢ time-related data (such as historical records or stock exchange data),
➢ stream data (such as video surveillance and sensor data streams), and
➢ the World Wide Web.
Object-Relational Databases
❖ Object-relational databases are constructed based on an object-relational
data model.
❖ This model extends the relational model by providing a rich data type for
handling complex objects and object orientation.
❖ Because many applications need to handle complex objects and structures, object-relational
databases are becoming increasingly popular in industry and applications.
Temporal Databases, Sequence Databases, and Time-Series Databases
❖ A temporal database stores relational data that include time-related attributes.
These attributes may involve several timestamps, each having different
semantics.
❖ A sequence database stores sequences of ordered events, with or without a
concrete notion of time.
❖ A time-series database stores sequences of values or events obtained over
repeated measurements of time.
❖ Data mining techniques can be used to find the characteristics of object
evolution, or the trend of changes for objects in the database. Such information
can be useful in decision making and strategy planning.
❖ E.g., banking data, stock data, and traffic information.
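One simple way to expose the trend of changes in a time series is a moving average. The sketch below is only one possible technique, and the price values are made up for illustration.

```python
def moving_average(values, window):
    """Return the mean of each consecutive run of `window` values."""
    if window <= 0 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]

# Hypothetical daily closing prices.
prices = [10.0, 12.0, 11.0, 13.0, 15.0, 14.0]
trend = moving_average(prices, window=3)
print(trend)
```

The smoothed values rise steadily even though the raw prices fluctuate, which is the kind of trend information that can support decision making.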
Spatial Databases and Spatiotemporal Databases
❖ Spatial databases contain spatial-related information.
➢ E.g., geographic (map) databases,
➢ very large-scale integration (VLSI) or computer-aided design databases, and
➢ medical and satellite image databases.
❖ Spatial data may be represented in raster format, consisting of n-dimensional
bit maps or pixel maps.
❖ A spatial database that stores spatial objects that change with time is called
a spatiotemporal database. Mining such data makes it possible to:
➢ group the trends of moving objects and identify strangely moving vehicles, or
➢ distinguish a bioterrorist attack from a normal outbreak of the flu based on the geographic
spread of a disease with time.
Text Databases and Multimedia Databases
❖ Text databases are databases that contain word descriptions for objects.
➢ These word descriptions are usually not simple keywords but rather long sentences or
paragraphs, such as product specifications, error or bug reports, warning messages,
summary reports, notes, or other documents.
❖ Multimedia databases store image, audio, and video data.
➢ E.g., picture content-based retrieval, voice-mail systems, video-on-demand systems, the
World Wide Web, and speech-based user interfaces that recognize spoken commands.
Heterogeneous Databases and Legacy Databases
❖ A heterogeneous database consists of a set of interconnected, autonomous
component databases. The components communicate in order to exchange
information and answer queries.
❖ A legacy database is a group of heterogeneous databases that combines
different kinds of data systems, such as relational or object-oriented
databases, hierarchical databases, network databases, spreadsheets,
multimedia databases, or file systems.
❖ The heterogeneous databases in a legacy database may be connected by
intra- or inter-computer networks.
Data Streams
❖ Many applications involve the generation and analysis of a new kind of data,
called stream data, where data flow in and out of an observation platform dynamically.
❖ Data streams have the following unique features: huge or possibly infinite
volume, dynamically changing, flowing in and out in a fixed order, allowing
only one or a small number of scans, and demanding fast response time.
❖ Mining data streams involves the efficient discovery of general patterns and
dynamic changes within stream data.
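Because a stream allows only one scan and can be unbounded, summaries have to be maintained incrementally in constant memory. The sketch below shows one well-known way to do that, Welford's single-pass method for the mean and variance. The RunningStats class and the sensor_stream generator are hypothetical names, and the readings are invented.

```python
class RunningStats:
    """Single-pass mean and variance (Welford's method), constant memory."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Population variance of the values seen so far."""
        return self._m2 / self.n if self.n else 0.0

def sensor_stream():
    # Hypothetical readings; a real stream would be unbounded.
    yield from [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

stats = RunningStats()
for reading in sensor_stream():
    stats.update(reading)  # each element is seen exactly once

print(stats.mean, stats.variance)
```

No reading is stored after it has been processed, so the memory used does not grow with the length of the stream.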
The World Wide Web
❖ The World Wide Web and its associated distributed information services, such
as Yahoo!, Google, America Online, and AltaVista, provide rich, worldwide,
on-line information services, where data objects are linked together to
facilitate interactive access.
❖ Capturing user access patterns in such distributed information environments
is called Web usage mining (or Weblog mining).
❖ Automated Web page clustering and classification help group and arrange
Web pages in a multidimensional manner based on their contents. Web
community analysis helps identify hidden Web social networks and
communities and observe their evolution.
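As a rough illustration of Web usage mining, the sketch below rebuilds each user's click path from a log and counts page-to-page transitions. The log format, user names, and page paths are assumptions made for the example. Real Web server logs carry more fields and need cleaning first.

```python
from collections import Counter, defaultdict

# Hypothetical Web log entries: (user, requested page), in time order.
log = [
    ("u1", "/home"), ("u1", "/products"), ("u2", "/home"),
    ("u1", "/cart"), ("u2", "/products"), ("u3", "/home"),
]

# Rebuild each user's click path from the log.
paths = defaultdict(list)
for user, page in log:
    paths[user].append(page)

# Count page-to-page transitions across all users.
transitions = Counter(
    (a, b) for pages in paths.values() for a, b in zip(pages, pages[1:])
)
print(transitions.most_common(1))
```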
Data Mining Task Primitives
A data mining task can be specified in the form of a data mining query, which is defined in
terms of data mining task primitives:
❖ The set of task-relevant data to be mined
❖ The kind of knowledge to be mined
❖ The background knowledge to be used in the discovery process
❖ The interestingness measures and thresholds for pattern evaluation
❖ The expected representation for visualizing the discovered patterns
Examples for each primitive:
❖ Task-relevant data
➢ Database or data warehouse name
➢ Database tables or data warehouse cubes
➢ Conditions for data selection
➢ Relevant attributes or dimensions
➢ Data grouping criteria
❖ Knowledge type to be mined
➢ Characterization
➢ Discrimination
➢ Association/correlation
➢ Classification/prediction
➢ Clustering
❖ Background knowledge
➢ Concept hierarchies
➢ User beliefs about relationships in the data
❖ Pattern interestingness measures
➢ Simplicity
➢ Certainty (e.g., confidence)
➢ Utility (e.g., support)
➢ Novelty
❖ Visualization of discovered patterns
➢ Rules, tables, reports, charts, graphs, decision trees, and cubes
➢ Drill-down and roll-up
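The five primitives can be pictured as the fields of a single query object. The sketch below is purely illustrative: MiningQuery, its field names, and all the values (database name, tables, thresholds) are hypothetical and do not correspond to any real data mining query language.

```python
from dataclasses import dataclass, field

@dataclass
class MiningQuery:
    """Hypothetical container for the five task primitives."""
    task_relevant_data: dict   # database, tables, selection conditions
    knowledge_type: str        # e.g. "association", "classification"
    background_knowledge: dict = field(default_factory=dict)  # concept hierarchies
    interestingness: dict = field(default_factory=dict)       # measures and thresholds
    presentation: list = field(default_factory=lambda: ["rules"])

query = MiningQuery(
    task_relevant_data={
        "database": "sales_db",
        "tables": ["customer", "purchases"],
        "where": "branch = 'Surat'",
    },
    knowledge_type="association",
    background_knowledge={"location": ["city", "state", "country"]},
    interestingness={"min_support": 0.05, "min_confidence": 0.7},
    presentation=["rules", "tables"],
)
print(query.knowledge_type, query.interestingness)
```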
Integration of a Data Mining System with a Database or Data Warehouse System
A data mining (DM) system works in an environment that requires it to communicate with other
information system components, such as database (DB) and data warehouse (DW) systems.
Possible integration schemes include:
❖ No coupling,
❖ Loose coupling,
❖ Semi-tight coupling, and
❖ Tight coupling
No Coupling

❖ The DM system does not use any function of a DB or DW system, i.e., it does not communicate with the database.
It reads from and writes to other storage, such as a file system.
❖ Drawback:
➢ DB system provides a great deal of flexibility and efficiency at storing, organizing, accessing, and
processing data. Without using a DB/DW system, a DM system may spend a substantial amount of
time finding, collecting, cleaning, and transforming data.
➢ The DM system will need to use other tools to extract data, making it difficult to integrate such a system
into an information processing environment. Thus, no coupling represents a poor design.
Loose Coupling
❖ The DM system uses some facilities of a DB or DW system: it fetches data from
a data repository managed by these systems, performs data mining, and
then stores the mining results in a file, a database, or a data warehouse.
❖ Loose coupling is better than no coupling because it can fetch any portion of
data stored in databases or data warehouses by using query processing,
indexing, and other system facilities.
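❖ Advantage: It gains some of the flexibility, efficiency, and other features provided by DB/DW systems.
❖ Drawback: Many loosely coupled systems are main-memory-based and do not exploit the data structures and
query optimization of the DB/DW system, so it is difficult to achieve high scalability and good
performance with large data sets.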
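The fetch, mine, and store cycle of loose coupling might look like the following sketch. It uses Python's sqlite3 module as a stand-in for the DB system. The purchases and item_frequency tables and their contents are hypothetical.

```python
import sqlite3
from collections import Counter

# Hypothetical purchases table managed by the DB system.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (trans_id TEXT, item TEXT, branch TEXT)")
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?, ?)",
    [("T1", "bread", "Surat"), ("T1", "milk", "Surat"),
     ("T2", "bread", "Surat"), ("T3", "milk", "Pune")],
)

# Step 1: let the DB system select the relevant portion of the data.
rows = conn.execute(
    "SELECT item FROM purchases WHERE branch = ?", ("Surat",)
).fetchall()

# Step 2: mine in main memory, outside the DB system.
item_counts = Counter(item for (item,) in rows)

# Step 3: store the mining result back in the database.
conn.execute("CREATE TABLE item_frequency (item TEXT PRIMARY KEY, freq INTEGER)")
conn.executemany("INSERT INTO item_frequency VALUES (?, ?)", item_counts.items())
stored = dict(conn.execute("SELECT item, freq FROM item_frequency").fetchall())
print(stored)
```

Step 2 holds all fetched rows in memory, which is why this design struggles with very large data sets.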
Semi-tight Coupling
❖ Besides linking a DM system to a DB/DW system, efficient implementations of a
few essential data mining primitives can be provided in the DB/DW system.
❖ These primitives can include sorting, indexing, aggregation, histogram
analysis, multiway join, and precomputation of some essential statistical
measures, such as sum, count, max, min, standard deviation.
❖ Some frequently used intermediate mining results can also be precomputed and stored in the DB/DW system.
Because these intermediate mining results are either precomputed or can be
computed efficiently, this design enhances the performance of a DM
system.
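To illustrate precomputation inside the DB system, the sketch below stores count, sum, min, and max per group in a summary table that a mining routine can read later. It again uses sqlite3 as a stand-in. The sales and sales_summary tables are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (branch TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("Surat", 100.0), ("Surat", 300.0), ("Pune", 50.0)],
)

# Precompute summary statistics once, inside the DB system.
conn.execute(
    "CREATE TABLE sales_summary AS "
    "SELECT branch, COUNT(*) AS n, SUM(amount) AS total, "
    "MIN(amount) AS lo, MAX(amount) AS hi "
    "FROM sales GROUP BY branch"
)

# A mining routine can now read the stored measures instead of rescanning sales.
summary = {
    branch: {"n": n, "total": total, "lo": lo, "hi": hi}
    for branch, n, total, lo, hi in conn.execute("SELECT * FROM sales_summary")
}
print(summary["Surat"])
```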
Tight Coupling
❖ The DM system is smoothly integrated into the DB/DW system. The data mining
subsystem is treated as one functional component of an information system.
❖ Data mining queries and functions are optimized based on mining query
analysis, data structures, indexing schemes, and query processing methods
of a DB or DW system.
❖ DM, DB, and DW systems will evolve and integrate together as one
information system with multiple functionalities. This will provide a uniform
information processing environment.
❖ It facilitates efficient implementations of data mining functions, high system
performance, and an integrated information processing environment.
Major Issues in Data Mining
Major issues in data mining fall into three groups:
❖ Mining methodology and user interaction,
❖ Performance, and
❖ Diversity of data types
Mining methodology and user interaction issues
❖ Mining different kinds of knowledge in databases
❖ Interactive mining of knowledge at multiple levels of abstraction
❖ Incorporation of background knowledge
❖ Data mining query languages and ad hoc data mining
❖ Presentation and visualization of data mining results
❖ Handling noisy or incomplete data
❖ Pattern evaluation
Performance issues
❖ Efficiency and scalability of data mining algorithms:
➢ To effectively extract information from a huge amount of data in databases, data mining algorithms
must be efficient and scalable.
➢ From a database perspective on knowledge discovery, efficiency and scalability are key issues in the
implementation of data mining systems.
❖ Parallel, distributed, and incremental mining algorithms:
➢ The huge size of many databases, the wide distribution of data, and the computational complexity of
some data mining methods are factors motivating the development of parallel and distributed data
mining algorithms.
➢ Such algorithms divide the data into partitions, which are processed in parallel. The results from the
partitions are then merged.
➢ The high cost of some data mining processes promotes the need for incremental data mining
algorithms that incorporate database updates without having to mine the entire data again “from
scratch.”
➢ Such algorithms perform knowledge modification incrementally to amend and strengthen what was
previously discovered.
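The partition, process, merge pattern and the incremental update can be sketched with item counting as the mining step. This is a toy illustration: the partitions and the helper names count_items and mine_partitioned are hypothetical. A thread pool is used only to show the structure. For CPU-bound work in CPython, a process pool or a distributed framework would be the more realistic choice.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_items(partition):
    """Local mining step: item frequencies within one partition."""
    return Counter(item for transaction in partition for item in transaction)

def mine_partitioned(partitions):
    """Process partitions concurrently, then merge the partial counts."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(count_items, partitions))
    merged = Counter()
    for partial in partials:
        merged.update(partial)
    return merged

# Hypothetical data split into two partitions.
partitions = [
    [["bread", "milk"], ["bread"]],
    [["milk", "eggs"], ["bread", "eggs"]],
]
counts = mine_partitioned(partitions)

# Incremental step: fold in new transactions without rescanning old data.
new_transactions = [["milk"], ["eggs", "butter"]]
counts.update(count_items(new_transactions))
print(counts)
```

Merging is straightforward here because counts are additive. Many mining tasks need a more careful merge step.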
Issues relating to the diversity of database types
❖ Handling of relational and complex types of data:
➢ Databases may contain complex data objects, hypertext and multimedia data, spatial data,
temporal data, or transaction data.
➢ It is unrealistic to expect one system to mine all kinds of data, given the diversity of data types and
different goals of data mining.
❖ Mining information from heterogeneous databases and global information
systems:
➢ Local- and wide-area computer networks (such as the Internet) connect many sources of data,
forming huge, distributed, and heterogeneous databases.
➢ The discovery of knowledge from different sources of structured, semistructured, or unstructured
data with diverse data semantics poses great challenges to data mining.
➢ Web mining, which uncovers interesting knowledge about Web contents, Web structures, Web
usage, and Web dynamics, becomes a very challenging and fast-evolving field in data mining.
