Adbms Unit5
Data warehouses contain consolidated data from many sources, augmented with summary
information and covering a long time period. Warehouses are much larger than other kinds of
databases; sizes ranging from several gigabytes to terabytes are common.
1. Data Acquisition
2. Data Storage
3. Data Access
Data Acquisition
An organization's daily operations access and modify operational databases. Data from these
operational databases and other external sources (e.g., customer profiles supplied by external
consultants) are extracted by using gateways, or standard external interfaces supported by the
underlying DBMSs. A gateway is an application program interface that allows client programs to
generate SQL statements to be executed at a server. Standards such as Open Database Connectivity
(ODBC) and Object Linking and Embedding for Databases (OLE DB) from Microsoft, and Java
Database Connectivity (JDBC), are commonly used for gateways.
Data is extracted from operational databases and external sources, cleaned to minimize errors and fill in
missing information when possible, and transformed to reconcile semantic mismatches. Transforming
data is typically accomplished by defining a relational view over the tables in the data sources (the
operational databases and other external sources).
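The extract, clean and transform steps above can be sketched in a few lines. This is a minimal illustration, not a production pipeline: an in-memory SQLite database stands in for an operational source, Python's DB-API plays the role of the gateway, and the table name orders, the view name clean_orders, the region codes and the rule "fill a missing amount with 0" are all assumptions made up for the example.

```python
import sqlite3

# Hypothetical operational source: an in-memory SQLite database stands in
# for an operational DBMS reached through a gateway (here, Python's DB-API).
source = sqlite3.connect(":memory:")
source.executescript("""
    CREATE TABLE orders (order_id INTEGER, cust TEXT, amount_cents INTEGER, region TEXT);
    INSERT INTO orders VALUES (1, 'alice', 1250, 'N');
    INSERT INTO orders VALUES (2, 'bob',   NULL, 'north');
    INSERT INTO orders VALUES (3, 'carol', 4000, 'S');
""")

# Transform: a relational view over the source table reconciles the two
# region encodings ('N' vs 'north') and converts cents to a decimal amount.
# Clean: a missing amount is filled with 0 (an assumption for this sketch).
source.execute("""
    CREATE VIEW clean_orders AS
    SELECT order_id,
           cust,
           COALESCE(amount_cents, 0) / 100.0 AS amount,
           CASE WHEN region IN ('N', 'north') THEN 'NORTH'
                WHEN region IN ('S', 'south') THEN 'SOUTH'
                ELSE 'UNKNOWN' END AS region
    FROM orders
""")

# Extract: the client issues SQL through the gateway and reads the rows back.
rows = source.execute(
    "SELECT order_id, cust, amount, region FROM clean_orders ORDER BY order_id"
).fetchall()
print(rows)
```

Defining the transformation as a view keeps the reconciliation logic in SQL, so the same definition can later be materialized when the data is loaded.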
Data Storage
Loading data consists of materializing such views and storing them in the warehouse. The cleaned and
transformed data is finally loaded into the warehouse. Additional preprocessing such as sorting and
generation of summary information is carried out at this stage. Data is partitioned and indexes are built
for efficiency. Due to the large volume of data, loading is a slow process. Loading a terabyte of data
sequentially can take weeks, and loading even a gigabyte can take hours. Parallelism is therefore
important for loading warehouses.
After data is loaded into a warehouse, additional measures must be taken to ensure that the data in the
warehouse is periodically refreshed to reflect updates to the data sources and to periodically purge data
that is too old from the warehouse.
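Loading, periodic refresh and purging can be illustrated with a small sketch. It assumes a single hypothetical source table src_sales and a warehouse table wh_sales in one SQLite database, and it refreshes by recopying every source row, which is a simplification; real warehouses usually propagate only the changes.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE src_sales (sale_id INTEGER PRIMARY KEY, sale_year INTEGER, amount REAL);
    INSERT INTO src_sales VALUES (1, 2015, 10.0), (2, 2023, 20.0);
    CREATE TABLE wh_sales (sale_id INTEGER PRIMARY KEY, sale_year INTEGER, amount REAL);
""")

def refresh(conn):
    """Copy source rows into the warehouse, overwriting rows that changed."""
    conn.execute(
        "INSERT OR REPLACE INTO wh_sales "
        "SELECT sale_id, sale_year, amount FROM src_sales"
    )

def purge(conn, oldest_year_kept):
    """Delete warehouse rows older than the retention cutoff."""
    conn.execute("DELETE FROM wh_sales WHERE sale_year < ?", (oldest_year_kept,))

refresh(db)                                                   # initial load
db.execute("INSERT INTO src_sales VALUES (3, 2024, 30.0)")    # source gains a row
db.execute("UPDATE src_sales SET amount = 25.0 WHERE sale_id = 2")
refresh(db)                                                   # periodic refresh
purge(db, 2020)                                               # drop data that is too old

warehouse_rows = db.execute(
    "SELECT sale_id, sale_year, amount FROM wh_sales ORDER BY sale_id"
).fetchall()
print(warehouse_rows)
```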
Data Access
The data access component provides end users with access to the stored warehouse information. Tools
such as querying, reporting, OLAP (Online Analytical Processing), statistics, and graphical and
geographical information systems can be used.
Characteristics of DW
Benefits of DW
Limitations of DW
Query intensive
Performance tuning is hard
Scalability can be a problem
High demand of resources
High maintenance
Complexity of integration
DATA WAREHOUSE ARCHITECTURE
Legacy Systems
Legacy systems are older-generation systems that are incompatible with current-generation
standards and systems but are still in production use.
An operational data store (ODS) is a repository of current and integrated operational data used for
analysis. It is created when the legacy system is incapable of reporting. It is one of the most recent
concepts in data warehousing.
Data in an ODS is subject-oriented, integrated, volatile, and current or near-current.
Data Warehouse
Data enters the data warehouse into an integrated structure and format. The process involves
conversion, summarization, filtering, and condensation of data.
Metadata
Advantages:
It enables departments to customize their data as it flows into the data mart.
It enables departments to select a subset of historic data.
Departments can select the software for their data mart.
Very cost effective
Disadvantages:
The earliest OLAP systems used multidimensional arrays in memory to store data cubes, and are
referred to as multidimensional OLAP (MOLAP) systems.
OLAP implementations using only relational database features are called relational OLAP
(ROLAP) systems.
Hybrid systems, which store some summaries in memory and store the base data and other
summaries in a relational database, are called hybrid OLAP (HOLAP) systems.
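A minimal sketch of the MOLAP idea is given below: the data cube is held as a dense in-memory array indexed by dimension positions, and summaries are computed by summing along a dimension. The dimensions (product, region) and the fact rows are hypothetical values chosen for illustration.

```python
# Dimension values (hypothetical); the cube is indexed by position in these lists.
products = ["pen", "book"]
regions = ["north", "south", "east"]

# MOLAP-style storage: a dense 2-D array where cube[p][r] is the total quantity.
cube = [[0] * len(regions) for _ in products]

# Hypothetical fact rows: (product, region, quantity sold).
facts = [
    ("pen", "north", 5), ("pen", "south", 3), ("book", "north", 7),
    ("book", "east", 2), ("pen", "north", 1),
]
for product, region, qty in facts:
    cube[products.index(product)][regions.index(region)] += qty

# Roll-up: aggregate away one dimension to obtain a summary.
by_product = {p: sum(cube[i]) for i, p in enumerate(products)}
by_region = {r: sum(row[j] for row in cube) for j, r in enumerate(regions)}
grand_total = sum(by_product.values())

print(cube, by_product, by_region, grand_total)
```

A ROLAP system would keep the same facts in a relational table and obtain each summary with a GROUP BY query instead of array arithmetic.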
Data Mining
It is the process of extracting valid, previously unknown, comprehensible and actionable information
from large databases and using it for crucial business decisions.
It is a subarea of statistics (exploratory data analysis) and a subarea of AI (knowledge discovery and
machine learning).
Targeted marketing:
Problem formulation
Data collection
Pre-processing: cleaning
Transformation:
o map complex objects (e.g., time series data) to features (e.g., frequency)
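As an illustration of the transformation step, the sketch below maps a time series to a single frequency feature by picking the strongest component of a discrete Fourier transform. The function name dominant_frequency, the sampling rate and the synthetic 4 Hz sine wave are assumptions for the example; a real pipeline would normally use an FFT library rather than this direct sum.

```python
import cmath
import math

def dominant_frequency(samples, sample_rate_hz):
    """Return the frequency (Hz) of the strongest non-constant DFT component."""
    n = len(samples)
    best_k, best_mag = 0, -1.0
    for k in range(1, n // 2 + 1):
        # k-th DFT coefficient, computed directly from the definition.
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(samples)
        )
        if abs(coeff) > best_mag:
            best_k, best_mag = k, abs(coeff)
    return best_k * sample_rate_hz / n

# Hypothetical input: one second of a 4 Hz sine wave sampled at 64 Hz.
sample_rate_hz = 64
series = [math.sin(2 * math.pi * 4 * t / sample_rate_hz) for t in range(64)]

feature = dominant_frequency(series, sample_rate_hz)
print(feature)
```

The 64 raw samples are reduced to one number that a mining algorithm can treat as an ordinary attribute.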
2. Classification Trees
It is the process of learning a model that describes different classes of data. The classes
are predetermined.
E.g., given a new automobile insurance applicant, should he or she be classified as low
risk, medium risk or high risk?
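The insurance example can be sketched as a tiny classification tree learner. This is an illustrative greedy learner using Gini impurity, not a specific published algorithm; the two features (age, prior accidents), the training rows and their risk labels are invented for the example.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Return (feature_index, threshold) that lowers weighted Gini most, or None."""
    best, best_score = None, gini(labels)
    for f in range(len(rows[0])):
        for threshold in sorted({r[f] for r in rows}):
            left = [l for r, l in zip(rows, labels) if r[f] <= threshold]
            right = [l for r, l in zip(rows, labels) if r[f] > threshold]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score:
                best, best_score = (f, threshold), score
    return best

def build_tree(rows, labels):
    """Return a class label (leaf) or a tuple (feature, threshold, left, right)."""
    split = best_split(rows, labels)
    if split is None:
        return Counter(labels).most_common(1)[0][0]
    f, t = split
    left = [(r, l) for r, l in zip(rows, labels) if r[f] <= t]
    right = [(r, l) for r, l in zip(rows, labels) if r[f] > t]
    return (f, t,
            build_tree([r for r, _ in left], [l for _, l in left]),
            build_tree([r for r, _ in right], [l for _, l in right]))

def classify(tree, row):
    """Walk the tree from the root until a leaf label is reached."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree

# Hypothetical training data: (age, prior accidents) with predetermined classes.
train_rows = [(23, 0), (27, 1), (35, 0), (48, 1), (52, 0), (24, 2), (41, 3), (60, 2)]
train_labels = ["medium", "medium", "low", "low", "low", "high", "high", "high"]

tree = build_tree(train_rows, train_labels)
new_applicant = (26, 0)
print(tree, classify(tree, new_applicant))
```

The classes are fixed in advance by the labels in the training data; the learner only decides which attribute tests separate them.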
Goals of DM
Prediction
DM can predict the future behavior of certain attributes within data. Ex: On the basis of
seismic wave patterns, the probability of an earthquake can be predicted.
Identification
DM can identify the existence of an event, item or activity on the basis of data patterns.
Ex: identification of existence of genes based on DNA sequence.
Classification
DM can partition the data so that different classes can be identified based on
combination of parameters. Ex: loyal and regular customers.
Optimisation
DM can optimize the use of limited resources such as time, space, money, and materials to
maximize output variables.
SPATIAL DATABASES
Spatial databases store information related to spatial locations, and support efficient storage,
indexing and querying of spatial data.
Special purpose index structures are important for accessing spatial data, and for processing
spatial join queries.
Computer Aided Design (CAD) databases store design information about how objects are
constructed, e.g., designs of buildings and aircraft, and layouts of integrated circuits.
Geographic databases store geographic information (e.g., maps); they are often called geographic
information systems or GIS.
A spatial join combines two spatial relations, with the location playing the role of the join attribute.
Spatial data is typically queried using a graphical query language; results are also
displayed in a graphical manner.
Graphical interface constitutes the front-end
Extensions of SQL with abstract data types, such as lines, polygons and bitmaps,
have been proposed to interface with the back-end.
o This allows relational databases to store and retrieve spatial information.
o Queries can use spatial conditions (e.g. contains or overlaps).
o Queries can mix spatial and nonspatial conditions.
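The spatial conditions and the spatial join described above can be sketched with axis-aligned rectangles standing in for a spatial data type. The Rect class, the parcels and zones relations and all coordinates are hypothetical; a real system would evaluate such predicates inside the DBMS with the help of a spatial index.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Rect:
    """Axis-aligned rectangle used here as a stand-in spatial data type."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float

    def contains(self, other):
        """True if other lies entirely inside this rectangle."""
        return (self.xmin <= other.xmin and self.ymin <= other.ymin
                and self.xmax >= other.xmax and self.ymax >= other.ymax)

    def overlaps(self, other):
        """True if the two rectangles share interior area."""
        return (self.xmin < other.xmax and other.xmin < self.xmax
                and self.ymin < other.ymax and other.ymin < self.ymax)

# Hypothetical relations: parcels(name, kind, shape) and zones(name, shape).
parcels = [
    ("p1", "residential", Rect(0, 0, 2, 2)),
    ("p2", "commercial", Rect(1, 1, 4, 4)),
    ("p3", "residential", Rect(5, 5, 6, 6)),
    ("p4", "residential", Rect(3, 0, 5, 2)),
]
zones = [("flood", Rect(0, 0, 4, 3)), ("noise", Rect(4.5, 4.5, 7, 7))]
flood_zone = zones[0][1]

# Mixed query: a nonspatial condition (kind) combined with a spatial one (overlaps).
at_risk = [name for name, kind, shape in parcels
           if kind == "residential" and shape.overlaps(flood_zone)]

# Purely spatial condition: parcels completely inside the flood zone.
fully_inside = [name for name, _, shape in parcels if flood_zone.contains(shape)]

# Spatial join: pair every parcel with every zone its location overlaps.
joined = [(p_name, z_name)
          for p_name, _, p_shape in parcels
          for z_name, z_shape in zones
          if p_shape.overlaps(z_shape)]

print(at_risk, fully_inside, joined)
```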
Indexing in Spatial DB
1. Quad Tree
Each node of a quadtree is associated with a rectangular region of space; the
top node is associated with the entire target space.
Each non-leaf node divides its region into four equal-sized quadrants;
correspondingly, each such node has four child nodes, one for each of the four
quadrants, and so on.
Leaf nodes have between zero and some fixed maximum number of points.
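The quadtree described above can be sketched as follows. This is a minimal point quadtree written for illustration: the class name QuadTree, the capacity parameter (the fixed maximum number of points per leaf) and the sample points are assumptions, regions are treated as half-open so each point belongs to exactly one quadrant, and inserting more identical points than the capacity is not handled.

```python
class QuadTree:
    """Point quadtree over the region [x, x + w) x [y, y + h)."""

    def __init__(self, x, y, w, h, capacity=2):
        self.x, self.y, self.w, self.h = x, y, w, h
        self.capacity = capacity   # fixed maximum number of points in a leaf
        self.points = []           # used only while this node is a leaf
        self.children = None       # four child quadrants once the node is split

    def insert(self, px, py):
        """Insert a point; return False if it lies outside this node's region."""
        if not (self.x <= px < self.x + self.w and self.y <= py < self.y + self.h):
            return False
        if self.children is None:
            if len(self.points) < self.capacity:
                self.points.append((px, py))
                return True
            self._split()
        return any(child.insert(px, py) for child in self.children)

    def _split(self):
        """Divide the region into four equal quadrants and push points down."""
        hw, hh = self.w / 2, self.h / 2
        self.children = [
            QuadTree(self.x + dx * hw, self.y + dy * hh, hw, hh, self.capacity)
            for dy in (0, 1) for dx in (0, 1)
        ]
        for p in self.points:
            any(child.insert(*p) for child in self.children)
        self.points = []

    def query(self, qx0, qy0, qx1, qy1):
        """Return stored points with qx0 <= x <= qx1 and qy0 <= y <= qy1."""
        if (qx1 < self.x or qx0 >= self.x + self.w
                or qy1 < self.y or qy0 >= self.y + self.h):
            return []              # query window misses this region entirely
        if self.children is None:
            return [(px, py) for px, py in self.points
                    if qx0 <= px <= qx1 and qy0 <= py <= qy1]
        found = []
        for child in self.children:
            found.extend(child.query(qx0, qy0, qx1, qy1))
        return found

# Hypothetical data: the root covers the entire 8 x 8 target space.
tree = QuadTree(0, 0, 8, 8, capacity=2)
for point in [(1, 1), (2, 6), (5, 5), (6, 2), (7, 7)]:
    tree.insert(*point)

upper_right = tree.query(4, 4, 8, 8)
print(upper_right)
```

A range query only descends into quadrants whose region intersects the query window, which is what makes the structure useful as a spatial index.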
2. R-Trees