Adbms Unit5


DATA WAREHOUSING

W.H. Inmon defined a data warehouse as "a subject-oriented, integrated, time-variant, non-volatile
collection of data in support of management's decisions."

Data warehouses contain consolidated data from many sources, augmented with summary
information and covering a long time period. Warehouses are much larger than other kinds of
databases; sizes ranging from several gigabytes to terabytes are common.

Comparison between Operational database & data warehouse

Characteristic     Operational Database                Data Warehouse

Subject-oriented   Functional/process-oriented data    Subject-oriented data
                   (e.g., invoices, credits, debits)   (e.g., sales, products)
Integrated         Similar data can have different     Provides a unified view of data
                   representations                     with a common representation
Time-variant       Current transactions are stored     Data is historical in nature
Non-volatile       Data updates and deletes are        Once data are stored, no
                   common                              changes are allowed

Main Components of Data warehouses

1. Data Acquisition
2. Data Storage
3. Data Access

Data Acquisition

An organization's daily operations access and modify operational databases. Data from these
operational databases and other external sources (e.g., customer profiles supplied by external
consultants) are extracted by using gateways, or standard external interfaces supported by the
underlying DBMSs. A gateway is an application program interface that allows client programs to
generate SQL statements to be executed at a server. Standards for gateways include Open Database
Connectivity (ODBC) and Object Linking and Embedding for Databases (OLE DB) from Microsoft, and
Java Database Connectivity (JDBC).

Data is extracted from operational databases and external sources, cleaned to minimize errors and fill in
missing information when possible, and transformed to reconcile semantic mismatches. Transforming
data is typically accomplished by defining a relational view over the tables in the data sources (the
operational databases and other external sources).

Data Storage

Loading data consists of materializing such views and storing them in the warehouse. The cleaned and
transformed data is finally loaded into the warehouse. Additional preprocessing such as sorting and
generation of summary information is carried out at this stage. Data is partitioned and indexes are built
for efficiency. Due to the large volume of data, loading is a slow process. Loading a terabyte of data
sequentially can take weeks, and loading even a gigabyte can take hours. Parallelism is therefore
important for loading warehouses.
After data is loaded into a warehouse, additional measures must be taken to ensure that the data in the
warehouse is periodically refreshed to reflect updates to the data sources, and that data that is too
old is periodically purged from the warehouse.
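
The extract, clean, transform, and load steps described above can be sketched with Python's built-in sqlite3 module. This is only a minimal illustration, not how a production warehouse is loaded: the two source tables, their columns, the sample rows, and the choice to fill a missing amount with 0 are all assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Two hypothetical operational sources that represent the same facts differently.
cur.execute("CREATE TABLE store_a_sales (sale_date TEXT, product TEXT, amount_usd REAL)")
cur.execute("CREATE TABLE store_b_sales (day TEXT, item TEXT, amount_cents INTEGER)")
cur.executemany("INSERT INTO store_a_sales VALUES (?, ?, ?)",
                [("2024-01-01", "bread", 2.50), ("2024-01-01", "milk", 1.20)])
cur.executemany("INSERT INTO store_b_sales VALUES (?, ?, ?)",
                [("2024-01-01", "BREAD", 300), ("2024-01-02", "milk", None)])

# Transform: a relational view that reconciles column names, letter case, and units.
# Clean: a missing amount is filled with 0 (an arbitrary choice for this sketch).
cur.execute("""
    CREATE VIEW unified_sales AS
    SELECT sale_date, LOWER(product) AS product, amount_usd FROM store_a_sales
    UNION ALL
    SELECT day, LOWER(item), COALESCE(amount_cents, 0) / 100.0 FROM store_b_sales
""")

# Load: materialize the view as a warehouse table, then add summary data and an index.
cur.execute("CREATE TABLE warehouse_sales AS SELECT * FROM unified_sales")
cur.execute("""
    CREATE TABLE sales_summary AS
    SELECT product, SUM(amount_usd) AS total_usd
    FROM warehouse_sales GROUP BY product
""")
cur.execute("CREATE INDEX idx_sales_product ON warehouse_sales (product)")

summary = dict(cur.execute("SELECT product, total_usd FROM sales_summary"))
row_count = cur.execute("SELECT COUNT(*) FROM warehouse_sales").fetchone()[0]
print(summary, row_count)
```

A periodic refresh could rerun the load step against the view, and a purge could delete warehouse rows older than a chosen date.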

Data Access

This component provides end users with access to the stored warehouse information. Tools for querying,
reporting, OLAP (Online Analytical Processing), and statistics, as well as graphical and geographical
information systems, can be used.

Characteristics of DW

• Multidimensional conceptual view
• Generic dimensionality
• Client/server architecture
• Multi-user support
• Accessibility
• Flexible reporting
• Transparency and intuitive data manipulation

Benefits of DW

• High return on investment
• Cost effective
• Competitive advantage
• Enterprise intelligence
• Enhanced customer service
• Business reengineering

Limitations of DW

• Query intensive
• Performance tuning is hard
• Scalability can be a problem
• High demand for resources
• High maintenance
• Complexity of integration

DATA WAREHOUSE ARCHITECTURE

Legacy Systems

• Legacy systems are older-generation systems that are incompatible with current-generation
  standards and systems but are still in production use.
• E.g., applications written in COBOL that run on mainframes.
• Today's hot new system is tomorrow's legacy system!

Operational Data Store

An operational data store (ODS) is a repository of current, integrated operational data used for
analysis. It is created when the legacy system is incapable of reporting. It is one of the most recent
concepts in data warehousing. Data in an ODS is subject-oriented, integrated, volatile, and current
or near-current.

Data Warehouse

Data enters the data warehouse into an integrated structure and format. The process involves
conversion, summarization, filtering, and condensation of data.

The data warehouse DBMS is the cornerstone of the data warehousing environment. It is implemented
on RDBMS technology. Approaches such as parallel databases, multi-relational databases (MRDBs), and
multidimensional databases (MDDBs) are also used in the environment.

Metadata

Metadata is data about the data; it describes the data warehouse.


Data Marts

Data mart is a general term used to describe data in a data warehouse environment. A data mart is a
subsidiary of the data warehouse: a localized, single-purpose data warehouse implementation, i.e., a
small, single-purpose mini data warehouse.

Advantages:

• It enables departments to customize their data as it flows into the data mart.
• It enables departments to select a subset of historical data.
• Departments can select the software for their data mart.
• It is very cost effective.

Disadvantages:

• It is difficult to extend for other departments.
• Scalability
• Data integration problems

MDDBs (Multidimensional Databases)

They are tightly coupled with OLAP.

OLAP (Online Analytical Processing)

• OLAP is the interactive analysis of data, allowing data to be summarized and viewed in different
  ways in an online fashion (with negligible delay).
• Data that can be modeled as dimension attributes and measure attributes are called
  multidimensional data.
    o Given a relation used for data analysis, we can identify some of its attributes as measure
      attributes, since they measure some value and can be aggregated upon. For instance,
      the attribute number of the sales relation is a measure attribute, since it measures the
      number of units sold.
    o Some of the other attributes of the relation are identified as dimension attributes, since
      they define the dimensions on which measure attributes, and summaries of measure
      attributes, are viewed.
• The earliest OLAP systems used multidimensional arrays in memory to store data cubes, and are
  referred to as multidimensional OLAP (MOLAP) systems.
• OLAP implementations using only relational database features are called relational OLAP
  (ROLAP) systems.
• Hybrid systems, which store some summaries in memory and store the base data and other
  summaries in a relational database, are called hybrid OLAP (HOLAP) systems.
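
To make the split between dimension and measure attributes concrete, the sketch below sums a measure over every subset of the dimensions, which is the set of summaries a small data cube would hold. The sales rows, the dimension names item_name and color, and the function name data_cube are assumptions for illustration; only the measure attribute number comes from the text above.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical sales relation: item_name and color are dimension attributes,
# number is the measure attribute (units sold).
sales = [
    {"item_name": "skirt", "color": "dark", "number": 8},
    {"item_name": "skirt", "color": "pastel", "number": 35},
    {"item_name": "dress", "color": "dark", "number": 20},
    {"item_name": "dress", "color": "pastel", "number": 10},
]
dimensions = ("item_name", "color")


def data_cube(rows, dims, measure):
    """Sum the measure for every subset of the dimension attributes."""
    cube = {}
    for k in range(len(dims) + 1):
        for group_by in combinations(dims, k):
            totals = defaultdict(int)
            for row in rows:
                key = tuple(row[d] for d in group_by)
                totals[key] += row[measure]
            cube[group_by] = dict(totals)
    return cube


cube = data_cube(sales, dimensions, "number")
print(cube[("item_name",)])  # totals per item, summed over all colors
print(cube[("color",)])      # totals per color, summed over all items
print(cube[()])              # grand total
```

Each key of cube is one way of viewing the same measure; an in-memory structure like this is closer in spirit to MOLAP, while a ROLAP system would compute the same totals with GROUP BY queries.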

Data Mining

Data mining is the process of extracting valid, previously unknown, comprehensible, and actionable
information from large databases and using it for crucial business decisions.

It is a subarea of statistics (exploratory data analysis) and a subarea of AI (knowledge discovery
and machine learning).

• It is the process of semi-automatically analyzing large databases to find patterns that are:
    o valid: hold on new data with some certainty
    o novel: non-obvious to the system
    o useful: should be possible to act on the item
    o understandable: humans should be able to interpret the pattern
• It is also known as Knowledge Discovery in Databases (KDD).

Applications of Data Mining

• Banking: loan/credit card approval
    o predict good customers based on old customers
• Customer relationship management:
    o identify those who are likely to leave for a competitor
• Targeted marketing:
    o identify likely responders to promotions
• Fraud detection: telecommunications, financial transactions
    o identify fraudulent events from an online stream of events
• Manufacturing and production:
    o automatically adjust knobs when process parameters change
• Medicine: disease outcome, effectiveness of treatments
    o analyze patient disease history: find relationships between diseases
• Molecular/pharmaceutical: identify new drugs
• Scientific data analysis:
    o identify new galaxies by searching for subclusters
• Web site/store design and promotion:
    o find affinity of visitors to pages and modify layout

KDD PROCESS / STEPS

• Problem formulation
• Data collection
    o subset data: sampling might hurt if the data is highly skewed
    o feature selection: principal component analysis, heuristic search
• Pre-processing: cleaning
    o name/address cleaning, different meanings (annual, yearly), duplicate removal,
      supplying missing values
• Transformation:
    o map complex objects (e.g., time series data) to features (e.g., frequency)
• Choosing the mining task and mining method
• Result evaluation and visualization

Data mining Techniques

1. Association Rules (AR)
Data is regarded as a collection of transactions, each involving a set of items. An association
rule must correlate the presence of a set of items with another range of values for
another set of variables.
Ex: bread ⇒ milk; DB-Concepts, OS-Concepts ⇒ Networks
    o Left-hand side: antecedent; right-hand side: consequent
    o An association rule must have an associated population; the population
      consists of a set of instances.
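
A rule such as bread ⇒ milk is usually judged against its population with two standard measures, support and confidence. The notes above do not define them, so the sketch below is an added illustration; the four transactions and the function names are hypothetical.

```python
# Hypothetical population: each instance is one transaction (a set of items).
transactions = [
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"bread"},
    {"milk", "eggs"},
]


def support(itemset, population):
    """Fraction of transactions in the population that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in population) / len(population)


def confidence(antecedent, consequent, population):
    """How often the consequent holds among transactions that contain the antecedent."""
    both = set(antecedent) | set(consequent)
    return support(both, population) / support(antecedent, population)


rule_support = support({"bread", "milk"}, transactions)
rule_confidence = confidence({"bread"}, {"milk"}, transactions)
print(rule_support, rule_confidence)
```

Here bread and milk occur together in 2 of 4 transactions, and milk appears in 2 of the 3 transactions that contain bread.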

2. Classification Trees
Classification is the process of learning a model that describes different classes of data. The
classes are predetermined.

E.g., given a new automobile insurance applicant, should he or she be classified as low
risk, medium risk, or high risk?

Classification rules can be compactly shown as a decision tree.


3. Sequential Patterns
These rules define sequential patterns of transactions. For example, if a person undergoes
cardiac surgery, he or she may suffer from kidney failure in the next 14 years.

4. Patterns within time series
These rules detect similarities within positions of a time series of data taken at regular
intervals of time. Ex: daily closing stock prices, daily sales.

5. Clustering
A given population of events can be partitioned into sets of similar elements. Each such set is
called a cluster, and each record belongs to exactly one cluster. Ex: a population of women
can be grouped as ‘most-likely-to-buy’ and ‘least-likely-to-buy’.
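
As a rough illustration of clustering, the sketch below partitions one-dimensional scores into two clusters so that each record lands in exactly one. It uses k-means, which is one common technique and is not named in the notes; the scores, the starting centroids, and the function name are assumptions.

```python
def k_means_1d(values, centroids, iterations=10):
    """Assign each value to its nearest centroid, then move centroids to cluster means."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters


# Hypothetical purchase-likelihood scores for eight customers.
scores = [0.05, 0.10, 0.15, 0.20, 0.80, 0.85, 0.90, 0.95]
centroids, clusters = k_means_1d(scores, centroids=[0.0, 1.0])
print(clusters)
```

The first cluster would correspond to the least-likely-to-buy group and the second to the most-likely-to-buy group.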

Goals of DM

• Prediction
  Predict the future behavior of certain attributes within data. Ex: on the basis of seismic
  wave patterns, the probability of an earthquake can be predicted.
• Identification
  DM can identify the existence of an event, item, or activity on the basis of data patterns.
  Ex: identification of the existence of genes based on a DNA sequence.
• Classification
  DM can partition the data so that different classes can be identified based on a
  combination of parameters. Ex: loyal and regular customers.
• Optimisation
  DM can optimize the use of limited resources such as time, space, money, and materials to
  maximize output variables.

SPATIAL DATABASES

• Spatial databases store information related to spatial locations, and support efficient storage,
  indexing, and querying of spatial data.
• Special-purpose index structures are important for accessing spatial data and for processing
  spatial join queries.
• Computer-aided design (CAD) databases store design information about how objects are
  constructed, e.g., designs of buildings, aircraft, and layouts of integrated circuits.
• Geographic databases store geographic information (e.g., maps); they are often called geographic
  information systems, or GIS.

Representation of Spatial Data

• Various geometric constructs can be represented in a database in a normalized fashion.
• Represent a line segment by the coordinates of its endpoints.
• Approximate a curve by partitioning it into a sequence of segments:
    o Create a list of vertices in order, or
    o Represent each segment as a separate tuple that also carries with it the identifier of the
      curve (2D features such as roads).
• Closed polygons:
    o List the vertices in order, with the starting vertex the same as the ending vertex, or
    o Represent boundary edges as separate tuples, with each containing the identifier of the
      polygon, or
    o Use triangulation: divide the polygon into triangles.
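
The representations listed above might look as follows in code. This is a sketch under assumptions: the identifier road-1, the coordinates, and the function names are made up, and the fan triangulation shown works only for convex polygons (general polygons need a more careful algorithm).

```python
# A curve approximated by a list of vertices in order.
road_id = "road-1"  # hypothetical identifier
vertices = [(0.0, 0.0), (1.0, 0.5), (2.0, 0.0), (3.0, 1.0)]


def to_segment_tuples(curve_id, points):
    """One tuple per line segment, each carrying the identifier of its curve."""
    return [(curve_id, points[i], points[i + 1]) for i in range(len(points) - 1)]


def fan_triangulate(polygon):
    """Split a convex polygon (closing vertex not repeated) into triangles sharing vertex 0."""
    return [(polygon[0], polygon[i], polygon[i + 1]) for i in range(1, len(polygon) - 1)]


def triangle_area(a, b, c):
    """Area of a triangle from the cross product of two of its edge vectors."""
    return abs((b[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (b[1] - a[1])) / 2.0


segments = to_segment_tuples(road_id, vertices)
square = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)]
triangles = fan_triangulate(square)
area = sum(triangle_area(*t) for t in triangles)
print(len(segments), len(triangles), area)
```

Storing one tuple per segment keeps every row the same shape, which is what makes the representation normalized; the vertex list is more compact but variable in length.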

Spatial Database Queries

• Nearness queries request objects that lie near a specified location.
• Nearest neighbor queries, given a point or an object, find the nearest object
  that satisfies given conditions.
• Region queries deal with spatial regions, e.g., ask for objects that lie
  partially or fully inside a specified region.
• Other queries compute intersections or unions of regions.
• A spatial join combines two spatial relations, with the location playing the role of the
  join attribute.

• Spatial data is typically queried using a graphical query language; results are also
  displayed in a graphical manner.
• A graphical interface constitutes the front end.
• Extensions of SQL with abstract data types, such as lines, polygons, and bit maps,
  have been proposed to interface with the back end.
    o These allow relational databases to store and retrieve spatial information.
    o Queries can use spatial conditions (e.g., contains or overlaps).
    o Queries can mix spatial and nonspatial conditions.
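
The query types above can be illustrated with brute-force scans over a handful of point objects. The object list, field layout, and function names are assumptions for this sketch; a real system would answer the same queries through a spatial index rather than by checking every object.

```python
import math

# Hypothetical point objects: (name, kind, x, y).
objects = [
    ("cafe", "restaurant", 1.0, 1.0),
    ("diner", "restaurant", 4.0, 4.0),
    ("bank", "office", 1.5, 1.0),
    ("depot", "office", 9.0, 9.0),
]


def distance(obj, point):
    return math.hypot(obj[2] - point[0], obj[3] - point[1])


def nearness(objs, point, radius):
    """Nearness query: names of objects within radius of point."""
    return [o[0] for o in objs if distance(o, point) <= radius]


def nearest_neighbor(objs, point, condition=lambda o: True):
    """Nearest neighbor query, restricted to objects that satisfy condition."""
    candidates = [o for o in objs if condition(o)]
    return min(candidates, key=lambda o: distance(o, point))[0] if candidates else None


def region_query(objs, xmin, ymin, xmax, ymax):
    """Region query: names of objects inside an axis-aligned rectangle."""
    return [o[0] for o in objs if xmin <= o[2] <= xmax and ymin <= o[3] <= ymax]


here = (1.4, 1.0)
near = nearness(objects, here, 1.0)
closest_restaurant = nearest_neighbor(objects, here, lambda o: o[1] == "restaurant")
in_region = region_query(objects, 0, 0, 5, 5)
print(near, closest_restaurant, in_region)
```

The restaurant filter passed to nearest_neighbor is an example of mixing a spatial condition with a nonspatial one.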

Indexing in Spatial DB

1. Quadtrees
• Each node of a quadtree is associated with a rectangular region of space; the
  top node is associated with the entire target space.
• Each non-leaf node divides its region into four equal-sized quadrants;
  correspondingly, each such node has four child nodes corresponding to the four
  quadrants, and so on.
• Leaf nodes have between zero and some fixed maximum number of points.
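
A small point quadtree along the lines described above could be sketched as follows. The class name, the capacity parameter (the fixed maximum number of points per leaf), and the sample points are assumptions; the sketch also assumes all inserted points are distinct and lie inside the root square, since many identical points would keep splitting forever.

```python
class QuadTree:
    """Point quadtree node covering the square [x, x+size) by [y, y+size)."""

    def __init__(self, x, y, size, capacity=2):
        self.x, self.y, self.size, self.capacity = x, y, size, capacity
        self.points = []      # used only while this node is a leaf
        self.children = None  # four child nodes once the node has split

    def insert(self, p):
        if self.children is None:
            self.points.append(p)
            if len(self.points) > self.capacity:
                self._split()
        else:
            self._child_for(p).insert(p)

    def _split(self):
        """Divide the region into four equal quadrants and push the points down."""
        half = self.size / 2
        self.children = [
            QuadTree(self.x + dx * half, self.y + dy * half, half, self.capacity)
            for dy in (0, 1) for dx in (0, 1)
        ]
        pending, self.points = self.points, []
        for p in pending:
            self._child_for(p).insert(p)

    def _child_for(self, p):
        half = self.size / 2
        col = 1 if p[0] >= self.x + half else 0
        row = 1 if p[1] >= self.y + half else 0
        return self.children[row * 2 + col]

    def depth(self):
        return 1 if self.children is None else 1 + max(c.depth() for c in self.children)

    def count(self):
        if self.children is None:
            return len(self.points)
        return sum(c.count() for c in self.children)


tree = QuadTree(0, 0, 8, capacity=2)
for point in [(1, 1), (2, 2), (6, 6), (7, 1), (3, 3)]:
    tree.insert(point)
print(tree.depth(), tree.count())
```

Dense areas of the space end up with deeper subtrees, while empty quadrants stay as single empty leaves.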

2. R-Trees

• R-trees are an N-dimensional extension of B+-trees, useful for indexing sets of
  rectangles and other polygons.
• They are supported in many modern database systems, along with variants like R+-trees
  and R*-trees.
• Basic idea: generalize the notion of a one-dimensional interval associated with
  each B+-tree node to an N-dimensional interval, that is, an N-dimensional rectangle.
• We consider only the two-dimensional case (N = 2); the generalization for N > 2 is
  straightforward, although R-trees work well only for relatively small N.
• A rectangular bounding box is associated with each tree node.
    o The bounding box of a leaf node is the minimum-sized rectangle that contains all the
      rectangles/polygons associated with the leaf node.
    o The bounding box associated with a non-leaf node contains the bounding boxes
      associated with all its children.
    o The bounding box of a node serves as its key in its parent node (if any).
    o Bounding boxes of children of a node are allowed to overlap.
• A polygon is stored in only one node, and the bounding box of the node must
  contain the polygon.
    o The storage efficiency of R-trees is better than that of k-d trees or quadtrees,
      since a polygon is stored only once.
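
The role of bounding boxes can be shown with a hand-built two-level tree. This is a sketch of the search idea only, with no insertion or node splitting, and the leaf contents, box coordinates, and function names are assumptions. Boxes are written as (xmin, ymin, xmax, ymax).

```python
def bounding_box(boxes):
    """Smallest rectangle (xmin, ymin, xmax, ymax) that contains every box in boxes."""
    return (min(b[0] for b in boxes), min(b[1] for b in boxes),
            max(b[2] for b in boxes), max(b[3] for b in boxes))


def overlaps(a, b):
    """True if rectangles a and b share at least one point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


# Hypothetical leaves: each holds (name, box) entries for the stored polygons.
leaf1 = [("p1", (0, 0, 2, 2)), ("p2", (1, 1, 3, 4))]
leaf2 = [("p3", (2, 3, 6, 5)), ("p4", (5, 0, 7, 2))]

# The root stores each leaf's bounding box as that child's key.
root = [(bounding_box([box for _, box in leaf]), leaf) for leaf in (leaf1, leaf2)]


def search(node, query):
    """Names of stored boxes that overlap query, skipping children whose box misses it."""
    hits = []
    for child_box, leaf in node:
        if overlaps(child_box, query):
            hits.extend(name for name, box in leaf if overlaps(box, query))
    return hits


print(root[0][0], root[1][0])
print(search(root, (2.5, 3.5, 2.8, 3.8)))
print(search(root, (6, 0.5, 6.5, 1)))
```

The two leaf bounding boxes overlap each other, so the first query has to descend into both leaves, while the second query skips leaf1 entirely because its bounding box misses the query rectangle.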
