Kinds of Data

The document discusses various types of data repositories applicable for data mining, including relational databases, data warehouses, transactional databases, and advanced data systems. It explains the structure and functionalities of these databases, such as data organization, querying, and the importance of data mining in extracting valuable insights. Additionally, it highlights the evolution of data technology and the significance of data mining in transforming large volumes of data into actionable knowledge.

Uploaded by

13it11

Kinds of data:

Data mining should be applicable to any kind of data repository, as well as to transient data
such as data streams. The data repositories considered here include the following:
• Relational databases
• Data warehouses
• Transactional databases
• Advanced database systems
Relational databases:
A database system, also called a database management system (DBMS), consists of a
collection of interrelated data, known as a database, and a set of software programs to manage
and access the data. The software programs provide mechanisms for defining database
structures, storing data, supporting concurrent and shared data access, and ensuring the
consistency and security of the stored information.
A relational database is a collection of tables, each of which is assigned a unique name. Each
table consists of a set of columns (attributes) and rows (tuples). Each row represents an object
that is identified by a unique key and described by a set of attribute values.
A semantic data model, such as the entity-relationship (ER) model, represents the database as
a set of entities and their relationships.
Example:
The AllElectronics company is described by the following relation tables: customer, item,
employee, and branch.
The relation customer consists of a set of attributes, including a unique customer
identity number (cust_id), customer name, address, age, occupation, annual income, credit
information, category, and so on. Each of the other relations is likewise described by its own
set of attributes.
Figure 1.6 Fragments of relations from a relational database for AllElectronics

Relational data can be accessed by database queries written in a relational query
language such as SQL.
A query is transformed into a set of relational operations, such as join, selection, and
projection, and is then optimized for efficient processing.
A query allows retrieval of specified subsets of the data. Relational languages also
include aggregate functions such as sum, avg (average), count, max (maximum), and min (minimum).
Data mining applied to relational databases goes further by searching for trends or data
patterns. It can also detect deviations, which can then be investigated further.
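The relational operations and aggregate functions described above can be sketched with Python's built-in sqlite3 module. This is only an illustration: the customer table and cust_id come from the example, while the purchases table, its columns, and all rows are invented for the sketch.

```python
import sqlite3

# Hypothetical miniature of the AllElectronics schema; every row is invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (cust_id INTEGER PRIMARY KEY, name TEXT, income REAL);
    CREATE TABLE purchases (trans_id TEXT, cust_id INTEGER, amount REAL);
    INSERT INTO customer VALUES (1, 'Smith', 45000), (2, 'Jones', 62000);
    INSERT INTO purchases VALUES
        ('T100', 1, 250.0), ('T200', 1, 100.0), ('T300', 2, 400.0);
""")

# Selection (WHERE keeps matching rows) and projection (only the name column).
high_income = conn.execute(
    "SELECT name FROM customer WHERE income > 50000"
).fetchall()

# Join of the two tables, followed by the aggregates SUM and COUNT per customer.
totals = conn.execute("""
    SELECT c.name, SUM(p.amount), COUNT(*)
    FROM customer AS c JOIN purchases AS p ON c.cust_id = p.cust_id
    GROUP BY c.cust_id
    ORDER BY c.cust_id
""").fetchall()

print(high_income)
print(totals)
```

A query of this kind retrieves a specified subset of the data; finding trends or deviations in that data is the additional step that data mining contributes.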
Data warehouses:
A data warehouse is a repository of information collected from multiple sources,
stored under a unified schema, and usually residing at a single site. It is constructed via a
process of data cleaning, data integration, data transformation, data loading, and periodic
data refreshing.
Figure 1.7 Typical framework of a data warehouse for Allelectronics

Data warehouses are organized around major subjects such as customer, item, supplier, and
activity to facilitate decision making. The data are stored to provide information from a
historical perspective and are typically summarized.
A data warehouse is usually modelled by a multidimensional database structure, where each
dimension corresponds to an attribute or a set of attributes in the schema. Each cell stores
the value of some aggregate measure.
The actual physical structure of a data warehouse may be a relational data store or a
multidimensional data cube. A data cube allows the precomputation of, and fast access to,
summarized data.
Example: A data cube for AllElectronics. It has three dimensions: address (with city values
Chicago, New York, Toronto, Vancouver), time (with quarter values Q1, Q2, Q3, Q4), and item
(with item types home entertainment, computer, phone, security). The aggregate value stored
in each cell is sales_amount.
Difference between a data warehouse and a data mart: A data warehouse collects
information about subjects that span an entire organization, so its scope is enterprise-wide.
A data mart is a departmental subset of a data warehouse. It focuses on selected subjects, so
its scope is department-wide.
A data warehouse is well suited for online analytical processing (OLAP) because it provides
multidimensional data views. OLAP operations use background knowledge about the domain to
allow the presentation of data at different levels of abstraction. OLAP operations include
drill-down and roll-up, which allow the user to view the data at differing degrees of
summarization.
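A minimal sketch of roll-up on a small data cube is given here, assuming the cube is held as a Python dictionary keyed by (city, quarter, item). The sales figures and the helper name roll_up are invented for illustration, and roll-up is shown only in its dimension-reduction form (summing a dimension away), not as climbing a concept hierarchy.

```python
from collections import defaultdict

# Hypothetical cells: (city, quarter, item) -> sales_amount. Values are invented.
cube = {
    ("Chicago", "Q1", "computer"): 800.0,
    ("Chicago", "Q2", "computer"): 900.0,
    ("Chicago", "Q1", "phone"): 100.0,
    ("Toronto", "Q1", "computer"): 700.0,
    ("Toronto", "Q2", "phone"): 50.0,
}
DIMENSIONS = ("city", "quarter", "item")


def roll_up(cells, keep):
    """Sum the measure over every dimension that is not listed in keep."""
    positions = [DIMENSIONS.index(name) for name in keep]
    summary = defaultdict(float)
    for key, amount in cells.items():
        summary[tuple(key[i] for i in positions)] += amount
    return dict(summary)


# Most summarized view: total sales per city.
by_city = roll_up(cube, ["city"])
# More detailed view: sales per city and quarter.
by_city_quarter = roll_up(cube, ["city", "quarter"])

print(by_city)
print(by_city_quarter)
```

Moving from by_city to by_city_quarter corresponds to a drill-down, and the reverse direction to a roll-up. A real warehouse would precompute such summaries rather than scan the cells on every request.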
Figure 1.8 Multidimensional data cube commonly used for data warehousing

Transactional databases:
A transactional database consists of a file where each record represents a transaction. A
transaction typically includes a unique transaction identity number and a list of the items
making up the transaction.
Figure 1.9 Fragment of a transactional database for sales at AllElectronics

The transactional database may have additional tables associated with it that contain other
information regarding the sale, such as the date of the transaction, the customer ID number,
the ID number of the salesperson, and the branch at which the sale occurred.
Example 1.3 A transactional database for AllElectronics. Transactions can be stored in
a table, with one record per transaction. The transactional database is usually either stored
in a flat file or unfolded into a standard relation. Market basket data analysis identifies
which items sell well together, which enables you to bundle groups of items as a strategy for
maximizing sales. Data mining systems for transactional data can support this analysis by
identifying frequent itemsets, that is, sets of items that are frequently sold together.
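The idea of a frequent itemset can be illustrated with a brute-force count of item pairs. This is a sketch under stated assumptions rather than a production algorithm: the transaction IDs, item IDs, and the min_support threshold are invented, and only itemsets of size two are counted.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions: transaction ID -> list of item IDs.
transactions = {
    "T100": ["I1", "I3", "I8"],
    "T200": ["I2", "I8"],
    "T300": ["I1", "I3"],
    "T400": ["I1", "I3", "I8"],
}


def frequent_pairs(db, min_support):
    """Return item pairs that occur together in at least min_support transactions."""
    counts = Counter()
    for items in db.values():
        # Sort and deduplicate so that each pair has one canonical form.
        for pair in combinations(sorted(set(items)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}


result = frequent_pairs(transactions, min_support=2)
print(result)
```

Enumerating every pair is affordable only for small data; practical frequent-itemset methods prune the candidate space instead of counting all combinations.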
Advanced data and information systems and advanced applications:
New database applications include handling spatial data (such as maps),
engineering design data (such as integrated circuits and system components), hypertext and
multimedia data, time-related data, stream data, and the World Wide Web. These applications
require efficient data structures and scalable methods for handling complex object structures;
variable-length records; semistructured or unstructured data; text, spatiotemporal, and
multimedia data; and database schemas with complex structures and dynamic changes.
Advanced database systems and specific application-oriented database systems include
object-relational database systems, temporal and time-series database systems, spatial and
spatiotemporal database systems, text and multimedia database systems, and web-based
global information systems.
These databases require sophisticated facilities to store, retrieve, and update large
amounts of complex data. They also provide fertile ground for data mining and raise many
challenging research and implementation issues.
Object-relational databases:
These are constructed based on an object-relational data model. This model extends
the relational model by providing a rich set of data types for handling complex objects and
object orientation.
The object-relational data model inherits the essential concepts of object-oriented
databases, where each entity is considered as an object. Data and code relating to an object
are encapsulated into a single unit. Each object has the following associated with it:
• A set of variables that describe the object. These correspond to attributes in the
entity-relationship and relational models.
• A set of messages that the object can use to communicate with other objects, or with
the rest of the database system.
• A set of methods, where each method holds the code to implement a message. Upon
receiving a message, the method returns a value in response.
Objects that share a common set of properties can be grouped into an object class.
Each object is an instance of its class. Object classes can be organized into class/subclass
hierarchies so that each class represents properties that are common to objects in that class.
For example, salesperson is a subclass of the class employee. A salesperson object would
inherit all of the variables pertaining to its superclass, employee. Such a class inheritance
feature benefits information sharing.
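The employee and salesperson example can be sketched as ordinary Python classes to show variables, methods, and inheritance together. The attribute names (name, salary, commission) and the sample values are hypothetical and are not taken from the AllElectronics schema.

```python
class Employee:
    """Superclass: holds the variables that every employee object has."""

    def __init__(self, name, salary):
        self.name = name
        self.salary = salary

    def describe(self):
        # A method: the code that answers a "describe" message.
        return f"{self.name} earns {self.salary}"


class SalesPerson(Employee):
    """Subclass: inherits name and salary, and adds a commission variable."""

    def __init__(self, name, salary, commission):
        super().__init__(name, salary)
        self.commission = commission

    def total_pay(self):
        return self.salary + self.commission


sp = SalesPerson("Lee", 40000, 5000)
print(sp.describe())   # describe() is inherited from Employee
print(sp.total_pay())
```

The SalesPerson class does not redefine describe(); it reuses the superclass code, which is the information sharing that class inheritance provides.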
Temporal databases, sequence databases, and time-series databases:
• A temporal database stores relational data that include time-related attributes.
These attributes may involve several timestamps, each having different
semantics.
• A sequence database stores sequences of ordered events, with or without a
concrete notion of time.
• A time-series database stores sequences of values or events obtained over
repeated measurements over time.

Spatial databases and spatiotemporal databases:


Spatial databases contain spatial-related information. Examples include geographic
databases, very large-scale integration (VLSI) or computer-aided design databases, and
medical and satellite image databases.
Spatial data may be represented in raster format, consisting of n-dimensional bit maps
or pixel maps. For example, a 2-D satellite image may be represented as raster data.
Maps can be represented in vector format, where roads, bridges, buildings, and lakes
are represented as unions or overlays of basic geometric constructs, such as points, lines,
polygons, and the networks formed by these components.
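The difference between the two formats can be made concrete with one small lake stored both ways. This is an illustrative sketch only: the 4 x 4 grid, the polygon coordinates, and the function name polygon_area are invented, and the area is computed with the standard shoelace formula.

```python
# Raster format: a 2-D pixel map in which 1 marks lake and 0 marks land.
raster = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
# Area in pixel units is simply the number of lake pixels.
raster_area = sum(sum(row) for row in raster)

# Vector format: the same lake as a polygon given by its (x, y) vertices.
lake = [(1, 1), (3, 1), (3, 3), (1, 3)]


def polygon_area(vertices):
    """Area of a simple polygon by the shoelace formula."""
    total = 0
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2


vector_area = polygon_area(lake)
print(raster_area, vector_area)
```

The raster grows with the resolution of the grid, while the vector form stores only the four vertices regardless of scale.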
Geographic databases have numerous applications, ranging from forestry and ecology
planning to providing public service information regarding the location of telephone and
electric cables, pipes, and sewage systems. Mining spatial databases may uncover patterns
that describe a specified kind of location, or the climate of mountainous areas located at
various altitudes.
The relationships among a set of spatial objects can be examined in order to discover
which subsets of objects are spatially autocorrelated. A spatial database that stores spatial
objects that change with time is called a spatiotemporal database. For example, mining such
a database may reveal the trends of moving objects and identify vehicles that are moving
strangely.
Text databases and multimedia databases:
Text databases are databases that contain word descriptions for objects. These word
descriptions are usually long sentences or paragraphs, such as product specifications, error
or bug reports, warning messages, and summary reports.
Text databases may be highly unstructured (such as some web pages), semistructured (such as
email messages and many HTML/XML web pages), or relatively well structured (such as library
catalogue databases).
By mining text data, one may uncover general and concise descriptions of the text
documents, as well as keyword or content associations. To do this, standard data mining
methods need to be integrated with information retrieval techniques and with structures
built specifically for text data, such as dictionaries and thesauri.
Multimedia databases store image, audio, and video data. They are used in
applications such as picture content-based retrieval, voice-mail systems, video-on-demand
systems, the World Wide Web, and speech-based user interfaces.
Multimedia databases must support large objects, because data objects such as video
can require gigabytes of storage. Video and audio data require real-time retrieval at a steady
and predetermined rate to avoid picture or sound gaps. Such data are referred to as
continuous-media data.
Heterogeneous databases and legacy databases:
A heterogeneous database consists of a set of interconnected, autonomous component
databases. The components communicate in order to exchange information and answer
queries.
A legacy database is a group of heterogeneous databases that combines different kinds
of data systems, such as relational or object-oriented databases, hierarchical databases,
network databases, spreadsheets, or multimedia databases. These databases may be connected
by intra- or inter-computer networks.
Information exchange across such databases is difficult because it requires precise
transformation rules from one representation to another that take the diverse semantics into
account.
Data mining techniques may provide an interesting solution to the information
exchange problem by performing statistical data distribution and correlation analysis, and
by transforming the given data into higher, more generalized conceptual levels.
Data streams:
Many applications involve the generation and analysis of a new kind of data, called stream
data, where data flow in and out of an observation platform dynamically. Stream data have
unique features: huge or possibly infinite volume, dynamic change, arrival and departure in a
fixed order, allowing only one or a small number of scans, and a demand for fast response
times.
The effective and efficient management and analysis of stream data pose great challenges
to researchers. A typical query model in such a system is the continuous query model, where
predefined queries constantly evaluate incoming streams, collect aggregate data, report the
current status of data streams, and respond to their changes.
Mining data streams involves the efficient discovery of general patterns and dynamic
changes within stream data. Most stream data reside at a low level of abstraction, whereas
analysts are often interested in higher and multiple levels of abstraction.
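The continuous query model can be sketched as a generator that keeps a fixed-size window and reports an aggregate after every arrival. This is an assumption-laden illustration: the function name sliding_average, the window size, and the readings are invented, and a sliding-window mean stands in for whatever aggregate a real system would maintain.

```python
from collections import deque


def sliding_average(stream, window):
    """Yield the mean of the most recent `window` readings after each arrival."""
    recent = deque(maxlen=window)
    running = 0.0
    for value in stream:
        if len(recent) == window:
            # The oldest reading is about to be evicted, so remove it from the sum.
            running -= recent[0]
        recent.append(value)
        running += value
        yield running / len(recent)


# Each reading is seen exactly once; an iterator cannot be rewound.
readings = iter([10, 20, 30, 40])
averages = list(sliding_average(readings, window=2))
print(averages)
```

Memory use is bounded by the window size rather than by the length of the stream, which matches the single-scan constraint described above.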
The World Wide Web:
The World Wide Web and its associated distributed information services, such as
Yahoo!, Google, America Online, and AltaVista, link data objects together to facilitate
interactive access. Users seeking information traverse from one object to another via links.
Capturing user access patterns in such distributed information environments is called
web usage mining. It helps improve system design and also leads to better marketing
decisions. Web pages can be highly unstructured and lack a predefined schema, type,
or pattern. Thus it is difficult for computers to understand their semantic meaning and to
organize the information for systematic information retrieval and data mining.
Keyword-based searches offer only limited help to users.
Data mining can provide additional help. For example, authoritative web page analysis based
on the linkages among web pages can help rank web pages. Automated web page clustering and
classification help group and arrange web pages in a multidimensional manner based on their
contents.
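One way to rank pages from their linkages is a PageRank-style power iteration, sketched here. The text does not name a specific algorithm, so this choice is an assumption, as are the function name link_rank, the damping value, and the three-page graph. The sketch also assumes that every page has at least one outgoing link and that every link target appears as a key.

```python
def link_rank(links, damping=0.85, iterations=50):
    """Score pages by repeatedly passing each page's score along its out-links."""
    pages = list(links)
    rank = {page: 1.0 / len(pages) for page in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline score.
        new_rank = {page: (1.0 - damping) / len(pages) for page in pages}
        for page, targets in links.items():
            # A page splits its current score equally among the pages it links to.
            share = rank[page] / len(targets)
            for target in targets:
                new_rank[target] += damping * share
        rank = new_rank
    return rank


# Hypothetical web: A links to B and C, B links to C, C links to A.
web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = link_rank(web)
best = max(ranks, key=ranks.get)
print(best, ranks)
```

Page C is linked to by both other pages and ends up with the highest score, while B, which receives only half of A's score, ends up lowest.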
Web community analysis helps identify hidden web social networks and communities
and observe their evolution. Web mining is the development of scalable and effective web
data analysis and mining methods. It helps to learn about, characterize, and classify web
pages, and to uncover web dynamics and the associations among different web pages, users,
communities, and web-based activities.

Importance of Data Mining:


The information and knowledge gained by data mining can be used for applications
ranging from market analysis and fraud detection to customer retention. Data mining can be
viewed as a result of the natural evolution of information technology. This evolution has
passed through the development of the following functionalities:
• Data collection
• Database creation
• Data management
• Advanced data analysis
Since the 1960s, database and information technology has been evolving systematically
from primitive file processing systems to powerful database systems. Efficient methods for
online transaction processing (OLTP), where a query is viewed as a read-only transaction,
became a major tool for the efficient storage, retrieval, and management of large amounts of
data.
This progress promoted the development of advanced data models such as extended-relational,
object-oriented, object-relational, and deductive models. It also led to application-oriented
database systems, including spatial, temporal, multimedia, active, stream, sensor, scientific,
and engineering databases, as well as knowledge bases.
One data repository architecture that has emerged is the data warehouse, a repository
of multiple heterogeneous data sources organized under a unified schema at a single site to
facilitate management decision making. Data warehouse technology includes data cleaning,
data integration, and online analytical processing (OLAP) with summarization, consolidation,
and aggregation functionalities.
In addition, huge volumes of data can be accumulated beyond databases and data
warehouses. Examples are the World Wide Web and data streams, where data flow in and out like
streams, as in applications such as video surveillance, telecommunications, and sensor
networks. The abundance of data, coupled with the need for powerful data analysis tools, has
been described as a data rich but information poor situation. The widening gap between data
and information calls for the systematic development of data mining tools that will turn
data tombs into “golden nuggets” of knowledge.
