DWM Unit 1
A data warehouse is not used for daily operations or transaction processing; it is used for making decisions.
A Data Warehouse can be viewed as a data system with the following attributes:
• It is a database designed for investigative tasks, using data from various applications.
• It supports a relatively small number of clients with relatively long interactions.
• It includes current and historical data to provide a historical perspective of information.
• Its usage is read-intensive.
• It contains a few large tables.
"Data Warehouse is a subject-oriented, integrated, and time-variant store of information in
support of management's decisions."
Basic Concepts
Characteristics of Data Warehouse
Subject-Oriented
A data warehouse targets the modeling and analysis of data for decision-makers. Therefore, data warehouses typically provide a concise and straightforward view around a particular subject, such as customer, product, or sales, instead of the organization's ongoing global operations. This is done by excluding data that is not useful for the subject and including all data the users need to understand the subject.
Integrated
A data warehouse integrates various heterogeneous data sources like RDBMS, flat files, and
online transaction records. It requires performing data cleaning and integration during data
warehousing to ensure consistency in naming conventions, attribute types, etc., among
different data sources.
Time-Variant
Historical information is kept in a data warehouse. For example, one can retrieve data from 3 months, 6 months, 12 months ago, or even older, from a data warehouse. This contrasts with a transaction system, where often only the most current file is kept.
Non-Volatile
The data warehouse is a physically separate data storage, which is transformed from the source
operational RDBMS. The operational updates of data do not occur in the data warehouse, i.e.,
update, insert, and delete operations are not performed. It usually requires only two procedures
in data accessing: Initial loading of data and access to data. Therefore, the DW does not require
transaction processing, recovery, and concurrency capabilities, which allows for substantial
speedup of data retrieval. Non-volatile means that once data has entered the warehouse, it should not change.
Data Warehouse Architecture
Data extracted from the source systems is stored in an area called the data staging area, where it is cleaned, transformed, combined, and de-duplicated to prepare the data for the data warehouse. The data staging area is usually a set of machines where simple activities like sorting and sequential processing take place. The data staging area does not provide query or presentation services; as soon as a system provides query or presentation services, it is classified as a presentation server. A presentation server is the destination machine on which data is loaded from the data staging area and stored directly for querying by end-users, report writers, and other applications.
There are three different types of systems required for a data warehouse –
1. Source Systems
2. Data Staging Area
3. Presentation Server
The data moves from the data source area through the staging area to the presentation server.
The entire process is better known as ETL (extract, transform, and load) or ETT (extract,
transform, and transfer).
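As a rough illustration of this flow, the following Python sketch extracts records from a source file, cleans and de-duplicates them in a staging step, and loads the result for the presentation server. The file names, column names, and cleaning rules are illustrative assumptions, not part of any particular warehouse.

```python
# Minimal ETL sketch: source system -> staging area -> presentation server.
# All file names and column names here are illustrative assumptions.
import csv

def extract(source_file):
    """Extract raw records from an operational source system (a CSV file here)."""
    with open(source_file, newline="") as f:
        return list(csv.DictReader(f))

def transform(records):
    """Staging-area work: clean, standardize, and de-duplicate the records."""
    cleaned, seen = [], set()
    for r in records:
        key = (r["customer_id"], r["order_date"])
        if key in seen:                                 # drop duplicate transactions
            continue
        seen.add(key)
        r["country"] = r["country"].strip().upper()     # standardize naming conventions
        r["amount"] = float(r["amount"])                # standardize attribute types
        cleaned.append(r)
    return cleaned

def load(records, warehouse_file):
    """Load the prepared data onto the presentation server (another CSV here)."""
    with open(warehouse_file, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=records[0].keys())
        writer.writeheader()
        writer.writerows(records)

# Example usage (assumes orders_source.csv exists with the columns used above):
# load(transform(extract("orders_source.csv")), "warehouse_orders.csv")
```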
Warehouse Manager –
The warehouse manager is responsible for the warehouse management process.
The operations performed by the warehouse manager include the analysis and aggregation of data, backup and archiving of data, and de-normalization of the data.
Query Manager –
Query Manager performs all the tasks associated with the management of user queries.
The complexity of the query manager is determined by the end-user access tools and the features provided by the database.
Detailed Data –
It is used to store all the detailed data in the database schema.
Detailed data is loaded into the data warehouse to complement the data collected.
Summarized Data –
Summarized Data is a part of the data warehouse that stores predefined aggregations
These aggregations are generated by the warehouse manager.
Archive and Backup Data –
The Detailed and Summarized Data are stored for the purpose of archiving and backup.
The data is relocated to storage archives such as magnetic tapes or optical disks.
Metadata –
Metadata is basically data about data.
It is used in the extraction and loading process, the warehouse management process, and the query management process.
End User Access Tools –
End-user access tools consist of analysis, reporting, and data mining tools.
By using end-user access tools, users can interact with the warehouse.
Building a Data Warehouse –
Some steps that are needed for building any data warehouse are as follows:
1. To extract the transactional data from different data sources:
For building a data warehouse, data is extracted from various data sources and stored in a central storage area. For extraction of the data, Microsoft provides a tool that comes free of cost with Microsoft SQL Server.
2. To transform the transactional data:
Many companies store their data in various DBMSs such as MS Access, MS SQL Server, Oracle, Sybase, etc. They also save data in
spreadsheets, flat files, mail systems, etc. Relating the data from all these sources is done while building a data warehouse.
3. To load the transformed data into the dimensional database:
After building a dimensional model, the data is loaded into the dimensional database. This process may combine several columns together or split one field into several columns (a small sketch of such a transformation appears after this list). There are two stages at which transformation of the data can be performed: while loading the data into the dimensional model, or while extracting the data from its origins.
4. To purchase a front-end reporting tool:
Top-notch analytical tools are available in the market from several major vendors. Microsoft has also released its own cost-effective tool, Data Analyzer.
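As referenced in step 3, a transformation may combine several columns into one or split one field into several. The small Python sketch below illustrates both; the field names (full_name, city, state) are hypothetical and not tied to any specific tool.

```python
# Illustrative transformation step: split one field into several columns
# and combine several columns into one. Field names are hypothetical.
def transform_row(row):
    first, _, last = row["full_name"].partition(" ")    # split one field into two columns
    return {
        "first_name": first,
        "last_name": last,
        "location": f'{row["city"]}, {row["state"]}',   # combine two columns into one
    }

print(transform_row({"full_name": "Asha Rao", "city": "Hubli", "state": "Karnataka"}))
# {'first_name': 'Asha', 'last_name': 'Rao', 'location': 'Hubli, Karnataka'}
```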
Database architecture for parallel processing in data warehouse
A parallel DBMS is a DBMS that runs across multiple processors or CPUs and is mainly
designed to execute query operations in parallel, wherever possible. A parallel DBMS links a number of smaller machines to achieve the same throughput as expected from a single large machine.
In Parallel Databases, mainly there are three architectural designs for parallel DBMS. They are
as follows:
1. Shared Memory Architecture
2. Shared Disk Architecture
3. Shared Nothing Architecture
1. Shared Memory Architecture – In Shared Memory Architecture, there are multiple CPUs attached to an interconnection network. They share a single global main memory and common disk arrays. Note that in this architecture, a single copy of a multi-threaded operating system and a multi-threaded DBMS can support these multiple CPUs. Shared memory is a tightly coupled architecture in which multiple CPUs share their memory. It is also known as symmetric multiprocessing (SMP). This architecture covers a wide range of systems, from personal workstations that support a few microprocessors in parallel to large RISC-based machines.
Advantages :
1. It has high-speed data access for a limited number of processors.
2. The communication is efficient.
Disadvantages :
1. It cannot scale beyond 80 or 100 CPUs in parallel.
2. The bus or the interconnection network gets blocked as a large number of CPUs are added.
2. Shared Disk Architectures :
In Shared Disk Architecture, various CPUs are attached to an interconnection network. In this,
each CPU has its own memory and all of them have access to the same disk. Also, note that
here the memory is not shared among CPUs therefore each node has its own copy of the
operating system and DBMS. Shared disk architecture is a loosely coupled architecture
optimized for applications that are inherently centralized. They are also known as clusters.
Advantages :
1. The interconnection network is no longer a bottleneck since each CPU has its own memory.
2. Load-balancing is easier in shared disk architecture.
3. There is better fault tolerance.
Disadvantages :
1. As the number of CPUs increases, the problems of interference and memory contention also increase.
2. A scalability problem also exists.
3. Shared Nothing Architecture :
In Shared Nothing Architecture, multiple CPUs are attached to an interconnection network, and each CPU has its own local memory and its own disk; the nodes share neither memory nor disks with one another.
Advantages :
1. It has better scalability as no sharing of resources is done
2. Multiple CPUs can be added
Disadvantages:
1. The cost of communication is higher, as it involves the sending of data and software interaction at both ends.
2. The cost of non-local disk access is higher than in shared disk architectures.
Note that this technology is typically used for very large databases of the order of 10^12 bytes (terabytes), or for systems that process thousands of transactions per second.
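A rough sketch of the shared-nothing idea is given below: each worker process owns one partition of the data, scans only that partition, and the coordinator merges the partial results. The data and the partitioning are invented for the example.

```python
# Shared-nothing style parallel query sketch: each worker process owns one
# partition of the data and computes a partial result; results are merged.
from multiprocessing import Pool

def partial_sum(partition):
    """Each node scans only its local partition (no shared memory or disk)."""
    return sum(row["amount"] for row in partition)

if __name__ == "__main__":
    # Invented example data, already split into per-node partitions.
    partitions = [
        [{"amount": 120.0}, {"amount": 80.0}],
        [{"amount": 45.5}],
        [{"amount": 300.0}, {"amount": 12.5}],
    ]
    with Pool(processes=len(partitions)) as pool:
        partials = pool.map(partial_sum, partitions)   # parallel partition scans
    print("total sales:", sum(partials))               # merge step on the coordinator
```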
Parallel DBMS vendors
A DBMS accepts requests from the application and instructs the operating system to transfer the appropriate data. The major DBMS vendors are Oracle, IBM, Microsoft, and Sybase (see Oracle Database, DB2, SQL Server, and ASE). MySQL and SQLite are very popular open source products.
Multi-Dimensional Data Model
It is a method used for organizing data in the database, with a good arrangement and assembly of the contents of the database.
The multidimensional data model allows customers to ask analytical questions related to market or business trends, unlike relational databases, which allow customers to access data in the form of queries. It lets users rapidly receive answers to their requests by creating and examining the data comparatively quickly.
OLAP (online analytical processing) and data warehousing use multidimensional databases.
It is used to show multiple dimensions of the data to users.
It represents data in the form of data cubes. Data cubes allow us to model and view the data from many dimensions and perspectives. A cube is defined by dimensions and facts and is represented by a fact table. Facts are numerical measures, and fact tables contain the measures of the related dimension tables or the names of the facts.
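A minimal sketch of a fact table and the cube built from it is shown below using pandas. The dimensions (location, time, item) and measures (dollars_sold, units_sold) follow the ones mentioned in these notes; the rows themselves are made up.

```python
# Fact table with three dimensions and two measures, aggregated into a cube.
import pandas as pd

fact = pd.DataFrame([
    # location, time, item, dollars_sold, units_sold  (made-up rows)
    ("Delhi",   "Q1", "Car", 500.0, 5),
    ("Delhi",   "Q2", "Bus", 300.0, 2),
    ("Kolkata", "Q1", "Car", 450.0, 4),
    ("Kolkata", "Q2", "Bus", 150.0, 1),
], columns=["location", "time", "item", "dollars_sold", "units_sold"])

# Aggregating over the dimensions gives one cell of the cube per combination.
cube = fact.groupby(["location", "time", "item"])[["dollars_sold", "units_sold"]].sum()
print(cube)
```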
Working on a Multidimensional Data Model
The multidimensional data model works on the basis of pre-decided steps.
The following stages should be followed by every project for building a Multi Dimensional
Data Model :
Stage 1 : Assembling data from the client : In the first stage, a multidimensional data model collects the correct data from the client. Mostly, software professionals give the client clarity about the range of data that can be obtained with the selected technology, and collect the complete data in detail.
Stage 2 : Grouping different segments of the system : In the second stage, the multidimensional data model recognizes and classifies all the data into the respective sections they belong to, and also makes it problem-free to apply step by step.
Stage 3 : Noticing the different proportions : The third stage forms the basis on which the design of the system rests. In this stage, the main factors are recognized according to the user's point of view. These factors are also known as "Dimensions".
Stage 4 : Preparing the actual-time factors and their respective qualities : In the fourth
stage, the factors which are recognized in the previous step are used further for identifying the
related qualities. These qualities are also known as “attributes” in the database.
Stage 5 : Finding the actuality of factors which are listed previously and their qualities
In the fifth stage, a multidimensional data model separates and differentiates the actualities from the factors collected by it. These play a significant role in the arrangement of a multidimensional data model.
Stage 6 : Building the Schema to place the data, with respect to the information collected
from the steps above : In the sixth stage, on the basis of the data which was collected
previously, a Schema is built.
1. Star Schema
• There is a fact table at the center. It contains the keys to each of the four dimensions.
• The fact table also contains the attributes, namely dollars sold and units sold.
Note − Each dimension has only one dimension table, and each table holds a set of attributes. For example, the location dimension table contains the attribute set {location_key, street, city, province_or_state, country}. This constraint may cause data redundancy. For example, "Vancouver" and "Victoria" are both cities in the Canadian province of British Columbia. The entries for such cities may cause data redundancy along the attributes province_or_state and country.
Characteristics of Star Schema
The star schema is highly suitable for data warehouse database design because of the following features:
o It creates a de-normalized database that can quickly provide query responses.
o It provides a flexible design that can be changed easily or added to throughout the
development cycle, and as the database grows.
o Its design parallels the way end-users typically think of and use the data.
o It reduces the complexity of metadata for both developers and end-users.
Advantages of Star Schema
o Query performance
o Load performance and administration
o Built-in referential integrity
o Easily understood
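A minimal star schema matching the description above (a central sales fact table with keys to four dimension tables, and a location dimension holding street, city, province_or_state, and country) can be sketched through Python's sqlite3 module. The location attributes and fact measures follow these notes; the remaining dimension names and attributes (time, item, branch) are assumptions based on the standard sales example.

```python
# Star schema sketch: one central fact table with foreign keys to four
# de-normalized dimension tables. Executed on an in-memory SQLite database.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim     (time_key INTEGER PRIMARY KEY, day TEXT, month TEXT, quarter TEXT, year INTEGER);
CREATE TABLE item_dim     (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT, type TEXT, supplier_type TEXT);
CREATE TABLE branch_dim   (branch_key INTEGER PRIMARY KEY, branch_name TEXT, branch_type TEXT);
CREATE TABLE location_dim (location_key INTEGER PRIMARY KEY, street TEXT, city TEXT,
                           province_or_state TEXT, country TEXT);
CREATE TABLE sales_fact (
    time_key     INTEGER REFERENCES time_dim(time_key),
    item_key     INTEGER REFERENCES item_dim(item_key),
    branch_key   INTEGER REFERENCES branch_dim(branch_key),
    location_key INTEGER REFERENCES location_dim(location_key),
    dollars_sold REAL,
    units_sold   INTEGER
);
""")
```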
2.Snowflake Schema
• Some dimension tables in the Snowflake schema are normalized.
• The normalization splits up the data into additional tables.
• Unlike the star schema, the dimension tables in a snowflake schema are normalized. For example, the item dimension table of the star schema is normalized and split into two dimension tables, namely the item and supplier tables.
• Now the item dimension table contains the attributes item_key, item_name, type, brand,
and supplier-key.
• The supplier key is linked to the supplier dimension table. The supplier dimension table
contains the attributes supplier_key and supplier_type.
Note − Due to normalization in the snowflake schema, redundancy is reduced; therefore, it becomes easier to maintain and saves storage space.
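The normalization described above can be sketched as a change to the item dimension of the star schema example: the supplier attributes move into their own table, and the item table keeps only a supplier_key. The snippet below, again using sqlite3, is only an illustration.

```python
# Snowflake schema sketch: the item dimension is normalized into an item
# table and a separate supplier table linked by supplier_key.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE supplier_dim (supplier_key INTEGER PRIMARY KEY, supplier_type TEXT);
CREATE TABLE item_dim (
    item_key     INTEGER PRIMARY KEY,
    item_name    TEXT,
    type         TEXT,
    brand        TEXT,
    supplier_key INTEGER REFERENCES supplier_dim(supplier_key)
);
""")
```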
Advantage of Snowflake Schema
1. The primary advantage of the snowflake schema is the improvement in query performance due to minimized disk storage requirements and the joining of smaller lookup tables.
2. It provides greater scalability in the interrelationship between dimension levels and
components.
3. No redundancy, so it is easier to maintain.
Disadvantage of Snowflake Schema
1. The primary disadvantage of the snowflake schema is the additional maintenance effort required due to the increasing number of lookup tables. It is also known as a multi-fact star schema.
2. The queries are more complex and hence more difficult to understand.
3. More tables mean more joins, and hence more query execution time.
3.Fact Constellation Schema
• A fact constellation has multiple fact tables. It is also known as galaxy schema.
• The following diagram shows two fact tables, namely sales and shipping.
Fact tables store data about sales, while dimension tables store data about the geographic region (markets, cities), clients, products, times, and channels.
c. Schema Definition
Multidimensional schema is defined using Data Mining Query Language (DMQL). The two
primitives, cube definition and dimension definition, can be used for defining the data
warehouses and data marts.
Concept Hierarchy
A concept hierarchy represents a series of mappings from a set of low-level concepts to higher-level, more general concepts. A concept hierarchy organizes information or concepts in a hierarchical structure or a specific partial order, which is used for expressing knowledge in brief, high-level terms and for mining knowledge at several levels of abstraction.
A concept hierarchy includes a set of nodes organized in a tree, where the nodes define values of an attribute, known as concepts. A specific node, "ANY", is reserved for the root of the tree. A level number is assigned to each node in a concept hierarchy. The level of the root node is one. The level of a non-root node is one more than the level of its parent. Because values are defined by nodes, the levels of nodes can also be used to describe the levels of values. A concept hierarchy enables raw information to be handled at a higher, more generalized level of abstraction.
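A tiny sketch of such a hierarchy for a location attribute is given below, with "ANY" at level 1 and each child one level below its parent, as described above. The particular values are invented.

```python
# Concept hierarchy for "location": ANY (level 1) -> country -> province -> city.
# Values are invented; levels follow the "parent level + 1" rule from the notes.
hierarchy = {
    "ANY": ["Canada", "India"],
    "Canada": ["British Columbia"],
    "British Columbia": ["Vancouver", "Victoria"],
    "India": ["Karnataka"],
    "Karnataka": ["Hubli"],
}

def level(node, root="ANY"):
    """Level of a node: the root is at level 1, a child is one level below its parent."""
    if node == root:
        return 1
    for parent, children in hierarchy.items():
        if node in children:
            return level(parent) + 1
    raise ValueError(f"unknown concept: {node}")

print(level("Vancouver"))   # 4  (ANY -> Canada -> British Columbia -> Vancouver)
```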
There are several types of concept hierarchies which are as follows −
Schema Hierarchy − Schema hierarchy represents the total or partial order between attributes
in the database. It can define existing semantic relationships between attributes. In a database,
more than one schema hierarchy can be generated by using multiple sequences and grouping of
attributes.
Set-Grouping Hierarchy − A set-grouping hierarchy organizes values for a given attribute or dimension into groups of constants or range values. It is also known as an instance hierarchy because the partial order of the hierarchy is defined on the set of instances or values of an attribute. These hierarchies often make more practical sense and are therefore used more than other hierarchies.
Operation-Derived Hierarchy − Operation-derived hierarchy is represented by a set of
operations on the data. These operations are defined by users, professionals, or the data mining
system. These hierarchies are usually defined for numerical attributes. Such operations can be as simple as a range value comparison or as complex as a data clustering or data distribution analysis algorithm.
Rule-based Hierarchy − In a rule-based hierarchy, either a whole concept hierarchy or a portion of it is defined by a set of rules and is computed dynamically based on the current data and the rule definitions. A lattice-like structure is used for graphically describing this type of hierarchy, in which each child-parent path is associated with a generalization rule.
The generation of concept hierarchies can be static or dynamic, depending on the data sets involved: when the generation of a concept hierarchy is based on a static data set it is known as static generation, and when it is based on a dynamic data set it is known as dynamic generation of the concept hierarchy.
Types of concept hierarchy
1] Binning
• Binning is a top-down splitting technique based on a specified number of bins.
• The basic notion is that for accurate discretization, the relative class frequencies should
be fairly consistent within an interval.
• Therefore, if two adjacent intervals have a very similar distribution of classes, then the
intervals can be merged.
Data Discretization
• Dividing the range of a continuous attribute into intervals.
• Interval labels can then be used to replace actual data values.
• Reduce the number of values for a given continuous attribute.
• Some classification algorithms only accept categorical attributes.
• This leads to a concise, easy-to-use, knowledge-level representation of mining results.
• Discretization techniques can be categorized based on whether it uses class information
or not such as follows:
o Supervised Discretization - This discretization process uses class information.
o Unsupervised Discretization - This discretization process does not use class
information.
• Discretization techniques can be categorized based on which direction it proceeds as
follows:
Top-down Discretization -
• The process starts by first finding one or a few points (called split points or cut points) to split the entire attribute range, and then repeats this recursively on the resulting intervals.
Bottom-up Discretization -
• Starts by considering all of the continuous values as potential split-points.
• Removes some by merging neighborhood values to form intervals, and then recursively
applies this process to the resulting intervals.
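A short sketch of unsupervised, top-down discretization by binning is shown below using pandas: equal-width bins via pandas.cut and equal-frequency bins via pandas.qcut. The age values are invented.

```python
# Unsupervised discretization of a continuous attribute (age) into intervals.
import pandas as pd

ages = pd.Series([13, 15, 16, 19, 20, 21, 25, 30, 35, 40, 52, 70])

# Equal-width binning: the value range is split into 3 intervals of equal width.
equal_width = pd.cut(ages, bins=3, labels=["young", "middle", "senior"])

# Equal-frequency binning: each interval holds roughly the same number of values.
equal_freq = pd.qcut(ages, q=3, labels=["low", "mid", "high"])

# Interval labels now replace the actual data values.
print(pd.DataFrame({"age": ages, "equal_width": equal_width, "equal_freq": equal_freq}))
```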
Characteristics of OLAP
In the FASMI characteristics of OLAP methods, the term is derived from the first letters of the following characteristics:
Fast
It means that the system is targeted to deliver most responses to users within about five seconds, with the simplest analyses taking no more than one second and very few taking more than 20 seconds.
Analysis
It means that the system can cope with any business logic and statistical analysis that is relevant for the application and the user, and keeps it easy enough for the target user. Although some
pre-programming may be needed, it is necessary to allow the user to define new ad-hoc calculations as part of the analysis and to report on the data in any desired way, without having to program; so products (like Oracle Discoverer) that do not allow adequate end-user-oriented calculation flexibility are excluded.
Share
It means that the system implements all the security requirements for confidentiality and, if multiple write access is needed, concurrent update locking at an appropriate level. Not all applications need users to write data back, but for the increasing number that do, the system should be able to handle multiple updates in a timely, secure manner.
Multidimensional
This is the basic requirement. OLAP system must provide a multidimensional conceptual view of
the data, including full support for hierarchies, as this is certainly the most logical method to
analyze business and organizations.
Information
The system should be able to hold all the data needed by the applications. Data sparsity should
be handled in an efficient manner.
Some more characteristics of OLAP are as follows:
1. Multidimensional conceptual view: OLAP systems let business users have a
dimensional and logical view of the data in the data warehouse. It helps in carrying out slice and dice operations.
2. Multi-User Support: Since OLAP systems are shared, they should provide normal database operations, including retrieval, update, concurrency control, integrity, and security.
3. Accessibility: OLAP acts as a mediator between data warehouses and front-end tools. The OLAP operations should sit between data sources (e.g., data warehouses) and an OLAP front-end.
4. Storing OLAP results: OLAP results are kept separate from data sources.
5. Uniform reporting performance: Increasing the number of dimensions or the database size should not significantly degrade the reporting performance of the OLAP system.
6. OLAP provides for distinguishing between zero values and missing values so that
aggregates are computed correctly.
7. OLAP system should ignore all missing values and compute correct aggregate values.
8. OLAP facilitates interactive querying and complex analysis for the users.
9. OLAP allows users to drill down for greater detail or roll up to aggregate metrics along a single business dimension or across multiple dimensions.
10. OLAP provides the ability to perform intricate calculations and comparisons.
11. OLAP presents results in a number of meaningful ways, including charts and graphs.
OLAP Operations in DBMS
OLAP stands for Online Analytical Processing. It is a software technology that allows users to analyze information from multiple database systems at the same time. It is based on the multidimensional data model and allows the user to query multi-dimensional data (e.g., Delhi -> 2018 -> Sales data). OLAP databases are divided into one or more cubes, and these cubes are known as hyper-cubes.
OLAP operations:
There are five basic analytical operations that can be performed on an OLAP cube:
1. Drill down: In drill-down operation, the less detailed data is converted into highly
detailed data. It can be done by:
• Moving down in the concept hierarchy
• Adding a new dimension
In the cube given in the overview section, the drill-down operation is performed by moving down the concept hierarchy of the Time dimension (Quarter -> Month).
2. Roll up: It is just opposite of the drill-down operation. It performs aggregation on the
OLAP cube. It can be done by:
• Climbing up in the concept hierarchy
3. Dice: It selects a sub-cube from the OLAP cube by selecting two or more dimensions.
In the cube given in the overview section, a sub-cube is selected by selecting following
dimensions with criteria:
• Location = “Delhi” or “Kolkata”
• Time = “Q1” or “Q2”
• Item = “Car” or “Bus”
4. Slice: It selects a single dimension from the OLAP cube, which results in the creation of a new sub-cube. In the cube given in the overview section, slice is performed on the dimension Time = “Q1”.
5. Pivot: It is also known as the rotation operation, as it rotates the current view to get a new view of the representation. In the sub-cube obtained after the slice operation, performing a pivot operation gives a new view of it.
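The five operations can be imitated on a small invented fact table using pandas; the dimension values (Delhi/Kolkata, Q1/Q2, Car/Bus) follow the examples used in this section, and the figures are made up.

```python
# OLAP operations sketched with pandas on an invented sales fact table.
import pandas as pd

fact = pd.DataFrame([
    ("Delhi",   "Q1", "Jan", "Car", 200.0),
    ("Delhi",   "Q1", "Feb", "Bus", 150.0),
    ("Delhi",   "Q2", "Apr", "Car", 300.0),
    ("Kolkata", "Q1", "Jan", "Car", 450.0),
    ("Kolkata", "Q2", "May", "Bus", 120.0),
], columns=["location", "quarter", "month", "item", "dollars_sold"])

# Roll up: aggregate upward in the Time hierarchy (month -> quarter).
roll_up = fact.groupby(["location", "quarter", "item"])["dollars_sold"].sum()

# Drill down: move back down the hierarchy to the more detailed month level.
drill_down = fact.groupby(["location", "quarter", "month", "item"])["dollars_sold"].sum()

# Slice: fix a single dimension value (Time = "Q1") to get a sub-cube.
slice_q1 = fact[fact["quarter"] == "Q1"]

# Dice: select a sub-cube using criteria on two or more dimensions.
dice = fact[fact["location"].isin(["Delhi", "Kolkata"])
            & fact["quarter"].isin(["Q1", "Q2"])
            & fact["item"].isin(["Car", "Bus"])]

# Pivot: rotate the view, e.g. locations as rows and items as columns.
pivot = fact.pivot_table(values="dollars_sold", index="location",
                         columns="item", aggfunc="sum")

print(roll_up, slice_q1, dice, pivot, sep="\n\n")
```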
Difference between OLAP and OLTP:
1. Definition – OLAP is well-known as an online database query management system, whereas OLTP is well-known as an online database modifying system.
12. Backup and Recovery – OLAP only needs backup from time to time as compared to OLTP, whereas in OLTP the backup and recovery process is maintained rigorously.