Kinds of Data
Data mining should be applicable to any kind of data repository, as well as to transient data
such as data streams. The data repositories considered here include the following:
Relational databases
Data warehouses
Transactional databases
Advanced database systems
Relational databases:
A database system, also called a database management system (DBMS), consists of a
collection of interrelated data, known as a database, and a set of software programs to manage
and access the data. The software programs provide mechanisms for defining database
structures, for data storage, for concurrent and shared data access, and for ensuring the
consistency and security of the stored data.
A relational database is a collection of tables, each of which is assigned a unique name.
Each table consists of a set of columns (attributes) and rows (tuples). Each row represents an
object that is identified by a unique key and described by a set of attribute values.
A semantic data model, such as the entity-relationship (ER) model, represents the database
as a set of entities and their relationships.
Example:
The AllElectronics company database contains the following tables: customer, item, employee,
and branch. The customer relation consists of a set of attributes, including a unique customer
identity number (cust_id), customer name, address, age, occupation, annual income, credit
information, category, and so on. The other tables likewise describe their entities with a set
of attributes.
Figure 1.6 Fragments of relations from a relational database for AllElectronics
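The customer relation described above can be sketched with Python's built-in sqlite3 module. This is only an illustration: apart from cust_id, the column names are shortened assumptions, and the rows are made-up sample data rather than values from the figure.

```python
import sqlite3

# In-memory database; column names other than cust_id are assumed.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE customer (
        cust_id TEXT PRIMARY KEY,  -- unique key identifying each row
        name    TEXT,
        address TEXT,
        age     INTEGER,
        income  REAL
    )
    """
)

# Made-up rows, for illustration only.
rows = [
    ("C1", "Alice Smith", "12 Main St, Chicago", 31, 48000.0),
    ("C2", "Bob Jones", "9 Lake Rd, Toronto", 45, 72000.0),
    ("C3", "Carol White", "3 Hill Ave, Vancouver", 27, 39000.0),
]
conn.executemany("INSERT INTO customer VALUES (?, ?, ?, ?, ?)", rows)

# A relational query: select rows by attribute value.
high_income = conn.execute(
    "SELECT cust_id, name FROM customer WHERE income > ? ORDER BY cust_id",
    (40000,),
).fetchall()
print(high_income)
```

Each row is one customer, identified by cust_id and described by the remaining attribute values; the query retrieves a subset of rows by a condition on one attribute.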
Data warehouses:
A data warehouse is organized around major subjects, such as customer, item, supplier, and
activity, to facilitate decision making. The data are stored to provide information from a
historical perspective and are typically summarized.
A data warehouse is modelled by a multidimensional database structure, where each
dimension refers to an attribute or a set of attributes in the schema. Each cell stores the value
of some aggregate measure.
The actual physical structure of a data warehouse may be a relational data store or a
multidimensional data cube. A data cube provides a multidimensional view of the data and
allows the precomputation and fast access of summarized data.
Example: A data cube for AllElectronics. It has three dimensions: address (with city values
Chicago, New York, Toronto, Vancouver), time (with quarter values Q1, Q2, Q3, Q4), and item
(with item type values home entertainment, computer, phone, security). The aggregate value
stored in each cell is sales_amount.
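A minimal sketch of such a cube, assuming a plain Python dictionary as the storage structure: each cell is keyed by one value per dimension (city, quarter, item) and holds the summed sales_amount. The sales records are made-up illustration data.

```python
from collections import defaultdict

# Made-up sales records: (city, quarter, item, sales_amount).
sales = [
    ("Chicago", "Q1", "computer", 800.0),
    ("Chicago", "Q1", "computer", 200.0),
    ("Chicago", "Q2", "phone", 150.0),
    ("Toronto", "Q1", "security", 400.0),
    ("Vancouver", "Q1", "computer", 600.0),
]

# Each cell of the cube is keyed by one value per dimension and
# stores the aggregate measure (here, the sum of sales_amount).
cube = defaultdict(float)
for city, quarter, item, amount in sales:
    cube[(city, quarter, item)] += amount

print(cube[("Chicago", "Q1", "computer")])
```

Because the aggregates are computed once, looking up a summarized value later is a single dictionary access, which is the point of precomputation in a data cube.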
Difference between a data warehouse and a data mart: A data warehouse collects
information about subjects that span an entire organization, so its scope is enterprise-wide. A
data mart is a departmental subset of a data warehouse. It focuses on selected subjects, so its
scope is department-wide.
A data warehouse is well suited to online analytical processing (OLAP) because it provides
multidimensional data views. OLAP operations use background knowledge about the domain of
the data to allow the presentation of data at different levels of abstraction. Examples of OLAP
operations include drill-down and roll-up, which allow the user to view the data at differing
degrees of summarization.
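Roll-up and drill-down can be sketched as follows, assuming the background knowledge is a concept hierarchy that maps each city to its country. The function names, the hierarchy dictionary, and the sales figures are hypothetical and only serve the illustration.

```python
from collections import defaultdict

# Hypothetical cells at the (city, quarter) level with summed sales_amount.
city_quarter_sales = {
    ("Chicago", "Q1"): 1000.0,
    ("Chicago", "Q2"): 150.0,
    ("New York", "Q1"): 700.0,
    ("Toronto", "Q1"): 400.0,
    ("Vancouver", "Q1"): 600.0,
}

# Background knowledge: a concept hierarchy mapping city to country.
city_to_country = {
    "Chicago": "USA",
    "New York": "USA",
    "Toronto": "Canada",
    "Vancouver": "Canada",
}


def roll_up_to_country(cells):
    """Summarize (city, quarter) cells at the coarser (country, quarter) level."""
    result = defaultdict(float)
    for (city, quarter), amount in cells.items():
        result[(city_to_country[city], quarter)] += amount
    return dict(result)


def drill_down(cells, country, quarter):
    """Return the city-level cells behind one (country, quarter) cell."""
    return {
        key: amount
        for key, amount in cells.items()
        if city_to_country[key[0]] == country and key[1] == quarter
    }


country_sales = roll_up_to_country(city_quarter_sales)
print(country_sales[("USA", "Q1")])
print(drill_down(city_quarter_sales, "Canada", "Q1"))
```

Roll-up climbs the hierarchy and produces fewer, more summarized cells; drill-down goes the other way and exposes the finer-grained cells behind a summary.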
Figure 1.8 Multidimensional data cube commonly used for data warehousing
Transactional databases:
A transactional database consists of a file in which each record represents a transaction. A
transaction typically includes a unique transaction identity number and a list of the items
making up the transaction.
Figure 1.9 Fragment of a transactional database for sales at AllElectronics
The transactional database may have additional tables associated with it that contain other
information regarding the sale, such as the date of the transaction, the customer ID number,
the ID number of the salesperson, and the ID of the branch at which the sale occurred.
Example 1.3 A transactional database for AllElectronics. Transactions can be stored in
a table, with one record per transaction. A transactional database is usually stored in a flat
file or unfolded into a standard relation. Market basket analysis finds items that are
frequently sold together, which lets a retailer bundle groups of items as a strategy for
maximizing sales. Data mining systems for transactional data can do this by identifying
frequent itemsets, that is, sets of items that are frequently sold together.
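The idea of finding items that are sold together can be sketched with a brute-force pair count. This is not an optimized frequent-itemset algorithm such as Apriori; it simply counts every pair of items per transaction. The transaction IDs, item IDs, and the min_support threshold are made-up assumptions.

```python
from collections import Counter
from itertools import combinations

# Made-up transactions: transaction ID -> list of items bought together.
transactions = {
    "T100": ["I1", "I3", "I8"],
    "T200": ["I2", "I8"],
    "T300": ["I1", "I8", "I3"],
    "T400": ["I1", "I2"],
}


def frequent_pairs(transactions, min_support):
    """Count every item pair; keep pairs found in at least min_support transactions."""
    counts = Counter()
    for items in transactions.values():
        # sorted(set(...)) gives each pair one canonical form per transaction.
        for pair in combinations(sorted(set(items)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}


pairs = frequent_pairs(transactions, min_support=2)
print(pairs)
```

With a threshold of two transactions, the pairs drawn from I1, I3, and I8 qualify, which suggests these items are candidates for bundling.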
Advanced data and information systems and advanced applications:
New database applications include handling spatial data (such as maps),
engineering design data (such as integrated circuits and system components), hypertext and
multimedia data, time-related data, stream data, and the World Wide Web. These applications
require efficient data structures and scalable methods for handling complex object structures;
variable-length records; semistructured or unstructured data; text, spatiotemporal, and
multimedia data; and database schemas with complex structures and dynamic changes.
Advanced database systems and specific application-oriented database systems include
object-relational database systems, temporal and time-series database systems, spatial and
spatiotemporal database systems, text and multimedia database systems, and web-based
global information systems.
These databases require sophisticated facilities to store, retrieve, and update large
amounts of complex data. They provide fertile ground for data mining and raise many
challenging research and implementation issues.
Object-relational databases:
These are constructed based on an object-relational data model. This model extends
the relational model by providing a rich set of data types for handling complex objects and
object orientation.
The object-relational data model inherits the essential concepts of object-oriented
databases, where each entity is considered an object. Data and code relating to an object
are encapsulated into a single unit. Each object has associated with it the following:
A set of variables that describe the object. These correspond to attributes in the
entity-relationship and relational models.
A set of messages that the object can use to communicate with other objects, or with
the rest of the database system.
A set of methods, where each method holds the code to implement a message. Upon
receiving a message, the corresponding method returns a value in response.
Objects that share a common set of properties can be grouped into an object class.
Each object is an instance of its class. Object classes can be organized into class/subclass
hierarchies so that each class represents properties that are common to objects in that class.
For example, salesperson is a subclass of the class employee. A salesperson object would
inherit all of the variables pertaining to its superclass, employee. Such a class inheritance
feature benefits information sharing.
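The employee and salesperson example can be sketched with ordinary Python classes. The attribute names, the annual_pay method, and the sample values are hypothetical; a real object-relational system would define such types in its own schema language.

```python
class Employee:
    """Object class: variables describe the object, methods answer messages."""

    def __init__(self, emp_id, name, salary):
        # Variables (attributes in the relational and ER sense).
        self.emp_id = emp_id
        self.name = name
        self.salary = salary

    def annual_pay(self):
        # Method implementing the "annual_pay" message; returns a value.
        return self.salary * 12


class SalesPerson(Employee):
    """Subclass: inherits every Employee variable and adds its own."""

    def __init__(self, emp_id, name, salary, commission):
        super().__init__(emp_id, name, salary)
        self.commission = commission

    def annual_pay(self):
        # Overrides the inherited method to include the commission.
        return super().annual_pay() + self.commission


sp = SalesPerson("E7", "Dana Lee", 3000, 5000)
print(sp.name, sp.annual_pay())
```

The SalesPerson instance carries emp_id, name, and salary without redefining them, which is the information sharing that class inheritance provides.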
Temporal databases, sequence databases, and time-series databases:
A temporal database stores relational data that include time-related attributes.
These attributes may involve several timestamps, each having different
semantics.
A sequence database stores sequences of ordered events, with or without a
concrete notion of time.
A time-series database stores sequences of values or events obtained over
repeated measurements of time (for example, hourly, daily, or weekly).
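The three kinds of data can be contrasted with a small sketch. All names and values here are assumptions for illustration: valid_from and valid_to stand in for time-related attributes, and branch_on is a hypothetical helper.

```python
from datetime import date

# Temporal data: relational rows carrying time-related attributes.
# valid_from / valid_to are assumed column names.
temporal_rows = [
    {"emp_id": "E7", "branch": "Chicago",
     "valid_from": date(2020, 1, 1), "valid_to": date(2021, 6, 30)},
    {"emp_id": "E7", "branch": "Toronto",
     "valid_from": date(2021, 7, 1), "valid_to": None},  # None = still current
]

# Sequence data: ordered events, here with no timestamps at all.
click_sequence = ["home", "search", "item_page", "checkout"]

# Time-series data: values obtained by repeated measurement over time.
daily_sales = [
    (date(2021, 7, 1), 120.0),
    (date(2021, 7, 2), 135.5),
    (date(2021, 7, 3), 128.0),
]


def branch_on(rows, emp_id, day):
    """Return the branch recorded for emp_id on the given day, if any."""
    for row in rows:
        ends = row["valid_to"] or date.max
        if row["emp_id"] == emp_id and row["valid_from"] <= day <= ends:
            return row["branch"]
    return None


print(branch_on(temporal_rows, "E7", date(2021, 1, 15)))
```

The temporal rows answer questions about a point in time, the sequence only preserves order, and the time series pairs each measured value with when it was taken.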