Fundamentals of Data Warehouses - Matthias Jarke
1 Data Warehouse Practice: An Overview
Matthias Jarke¹, Maurizio Lenzerini², Yannis Vassiliou³ and Panos Vassiliadis³
(1) Dept. of Computer Science V, RWTH Aachen, Ahornstraße 55, 52056 Aachen, Germany
(2) Dipartimento di Informatica e Sistemistica, Università di Roma La Sapienza, Via Salaria 113, 00198 Rome, Italy
(3) Dept. of Electrical and Computer Engineering, Computer Science Division, National Technical University of Athens, 15773 Zographou, Athens, Greece
Matthias Jarke
Email: [email protected]
Maurizio Lenzerini
Email: [email protected]
Yannis Vassiliou
Email: [email protected]
Since the beginning of data warehousing in the early 1990s, an informal consensus has been reached concerning the major terms and components involved in data warehousing. In this chapter, we first explain these main terms and components. Data warehouse vendors pursue different strategies in supporting this basic framework; we review a few of the major product families and the basic problem areas facing data warehouse practice and research today.
A data warehouse (DW) is a collection of technologies aimed at enabling the knowledge worker (executive, manager, analyst) to make better and faster decisions. The goal is to have the right information in the right place at the right time at the right cost in order to support the right decision. Traditional online transaction processing (OLTP) systems are inappropriate for decision support, and high-speed networks cannot, by themselves, solve the information accessibility problem. Data warehousing has therefore become an important strategy for integrating heterogeneous data sources and for enabling online analytic processing (OLAP).
A report from the META Group in 1996 predicted that data warehousing would be a $13 000 million industry within two years ($8000 million on hardware, $5000 million on services and systems integration), up from expenditures of $2000 million in 1995. In 1998, reality had exceeded these figures, reaching sales of $14 600 million. By 2000, the OLAP subsector alone exceeded $2500 million. Table 1.1 differentiates the trends by product sector.
Table 1.1. Estimated sales in millions of dollars [ShTy98] (* estimates are from [PeCr00])
The number and complexity of projects — with project sizes ranging from a few hundred thousand to many millions of dollars — are indicative of the difficulty of designing good data warehouses. Their expected duration highlights the need for documented quality goals and change management. The emergence of data warehousing was initially a consequence of the observation by W. Inmon and E. F. Codd in the early 1990s that operational-level online transaction processing (OLTP) and decision support applications (OLAP) cannot coexist efficiently in the same database environment, mostly due to their very different transaction characteristics. Meanwhile, data warehousing has taken on a much broader role, especially in the context of reengineering legacy systems or at least saving legacy data. Here, DWs are seen as a strategy to bring heterogeneous data together under a common conceptual and technical umbrella and to make them available for new operational or decision support applications.
A data warehouse caches selected data of interest to a customer group, so that access becomes faster, cheaper, and more effective (Fig. 1.1). As the long-term buffer between OLTP and OLAP, data warehouses face two essential questions: how to reconcile the stream of incoming data from multiple heterogeneous legacy sources, and how to customize the derived data storage to specific OLAP applications. The trade-off driving the design decisions concerning these two issues changes continuously with business needs. Therefore, design support and change management are of greatest importance if we do not want to run DW projects into dead ends.
Fig. 1.1. Data warehouses: a buffer between transaction processing and analytic processing
Vendors agree that data warehouses cannot be off-the-shelf products but must be designed and optimized with great attention to the customer situation. Traditional database design techniques do not apply since they cannot deal with DW-specific issues such as data source selection, temporal and aggregated data, and controlled redundancy management. Since the wide variety of product and vendor strategies prevents a low-level solution to these design problems at acceptable costs, serious research and development efforts continue to be necessary.
1.1 Data Warehouse Components
Figure 1.2 gives a rough overview of the usual data warehouse components and their relationships. Many researchers and practitioners view a data warehouse architecture as layers of materialized views stacked on top of each other. Since the research problems are largely formulated from this perspective, we begin with a brief summary description.
Fig. 1.2. A generic data warehouse architecture
A data warehouse architecture exhibits various layers of data in which data from one layer are derived from data of the lower layer. Data sources, also called operational databases, form the lowest layer. They may consist of structured data stored in open database systems and legacy systems, or of unstructured or semistructured data stored in files. The data sources can be either part of the operational environment of an organization or external, produced by a third party. They are usually heterogeneous, which means that the same data can be represented differently, for instance through different database schemata, in the sources.
The central layer of the architecture is the global data warehouse, sometimes called the primary or corporate data warehouse. According to Inmon [Inmo96], it is a collection of integrated, nonvolatile, subject-oriented databases designed to support the decision support system (DSS) function, in which each unit of data is relevant to some moment in time and which contains both atomic data and lightly summarized data.
The global data warehouse keeps a historical record of data. Each time it is changed, a new integrated snapshot of the underlying data sources from which it is derived is placed in line with the previous snapshots. The warehouse typically contains data that are many years old (a frequently cited average age is two years). Researchers often assume (realistically) that the global warehouse consists of a set of materialized relational views, defined in terms of other relations that are themselves constructed from the data stored in the sources.
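To make the materialized-view perspective concrete, the following SQL sketch defines a warehouse relation over two hypothetical source relations; all table and column names are invented for illustration, and the exact materialized-view syntax varies by DBMS (Oracle-style shown here).

    -- A warehouse relation as a materialized view over source data
    -- (illustrative names; Oracle-style syntax)
    CREATE MATERIALIZED VIEW dw_sales AS
    SELECT o.order_date,
           c.region,
           p.product_group,
           SUM(o.quantity * o.unit_price) AS revenue
    FROM   src_orders    o
    JOIN   src_customers c ON c.customer_id = o.customer_id
    JOIN   src_products  p ON p.product_id  = o.product_id
    GROUP  BY o.order_date, c.region, p.product_group;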
The next layer of views are the local warehouses, which contain highly aggregated data derived from the global warehouse and directly intended to support activities such as informational processing, management decisions, long-term decisions, historical analysis, trend analysis, or integrated analysis. There are various kinds of local warehouses, such as data marts or OLAP databases. Data marts are small data warehouses that contain only a subset of the enterprise-wide data warehouse. A data mart may be used only in a specific department and then contains only the data relevant to this department. For example, a data mart for the marketing department should include only customer, sales, and product information, whereas the enterprise-wide data warehouse could also contain information on employees, departments, etc. A data mart enables faster responses to queries because the volume of the managed data is much smaller than in the data warehouse and queries can be distributed between different machines. Data marts may use relational database systems or specific multidimensional data structures.
There are two major differences between the global warehouse and local data marts. First, the global warehouse results from a complex extraction-integration-transformation process, whereas the local data marts result from an extraction/aggregation process starting from the global warehouse. Second, data in the global warehouse are detailed, voluminous (since the warehouse keeps data from previous periods of time), and only lightly aggregated; in contrast, data in the local data marts are highly aggregated and less voluminous. This distinction has a number of consequences both in research and in practice, as we shall see throughout the book.
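As a sketch of that extraction/aggregation step, a marketing data mart might be populated from the hypothetical dw_sales relation above by aggregating away detail; names remain illustrative.

    -- Deriving a highly aggregated data-mart table from the
    -- detailed global warehouse (illustrative names)
    CREATE TABLE mart_revenue_by_region AS
    SELECT region,
           EXTRACT(YEAR FROM order_date) AS year,
           SUM(revenue) AS revenue
    FROM   dw_sales
    GROUP  BY region, EXTRACT(YEAR FROM order_date);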
In some cases, an intermediate layer, called an operational data store (ODS), is introduced between the operational data sources and the global data warehouse. An ODS contains subject-oriented, collectively integrated, volatile, current-valued, and detailed data. The ODS usually contains records that result from the transformation, integration, and aggregation of detailed data found in the data sources, just as for a global data warehouse; therefore, we can also consider the ODS to consist of a set of materialized relational views. The main differences with a data warehouse are the following. First, the ODS is subject to change much more frequently than a data warehouse. Second, the ODS holds only fresh and current data. Finally, aggregation in the ODS is of small granularity: the data are at most lightly summarized. The use of an ODS, according to Inmon [Inmo96], is justified for corporations that need collective, integrated operational data; it is a good support for activities such as collective operational decisions or immediate corporate information. Whether an ODS is needed usually depends on the size of the corporation, the need for immediate corporate information, and the status of integration of the various legacy systems. Figure 1.2 summarizes the different layers of data.
All the data warehouse components, processes, and data are — or at least should be — tracked and administered from a metadata repository. The metadata repository serves as an aid both to the administrator and the designer of a data warehouse. Since the data warehouse is a very complex system, its architecture (physical components, schemata) can be complicated; the volume of data is vast; and the processes employed for the extraction, transformation, cleaning, storage, and aggregation of data are numerous, sensitive to changes, and vary in time.
1.2 Designing the Data Warehouse
The design of a data warehouse is a difficult task, and there are several problems designers have to tackle. First of all, they have to achieve a semantic reconciliation of the information lying in the sources and produce an enterprise model for the data warehouse. Then, a logical structure of relations in the core of the data warehouse must be obtained, serving either as buffers for the refreshment process or as persistent data stores for querying or further propagation to data marts. This is not a simple task by itself, and it becomes even more complicated once the physical design problem arises: the designer has to choose the physical tables, processes, indexes, and data partitions that represent the logical data warehouse schema and facilitate its functionality. Finally, hardware selection and software development must also be planned by the data warehouse designer [AdVe98, ISIA97, Simo98].
It is evident that the schemata of all the data stores involved in a data warehouse environment change rapidly: changes in a corporation's business rules affect both the source schemata (of the operational databases) and the user requirements (and thereby the schemata of the data marts). Consequently, the design of a data warehouse is an ongoing process, performed iteratively throughout the lifecycle of the system [KRRT98].
There is quite a lot of discussion about the methodology for the design of a data warehouse. The two major methodologies are the top-down and the bottom-up approaches [Kimb96, KRRT98, Syba97]. In the top-down approach, a global enterprise model is constructed, which reconciles the semantic models of the sources (and later, their data). This approach is usually costly and time-consuming; nevertheless it provides a basis over which the schema of the data warehouse can evolve. The bottom-up approach focuses on the more rapid and less costly development of smaller, specialized data marts and their synthesis as the data warehouse evolves.
No matter which approach is followed, there seems to be agreement on the general idea concerning the final schema of a data warehouse. In a first layer, the ODS serves as an intermediate buffer for the most recent and detailed information from the sources; data cleaning and transformation are performed at this level. Next, a database under a denormalized star schema usually serves as the central repository of data. A star schema is a special-purpose schema in data warehouses that is oriented towards query efficiency at the cost of schema normalization (cf. Chap. 5 for a detailed description). Finally, more aggregated views on top of this star schema can also be precalculated. The OLAP tools can communicate either with the upper levels of the data warehouse or with the customized data marts; we shall detail this issue in the following sections.
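A minimal star schema sketch, with invented names, shows the idea: a central fact table whose foreign keys point to deliberately denormalized dimension tables.

    -- Minimal star schema: one fact table referencing denormalized
    -- dimension tables (all names are illustrative)
    CREATE TABLE dim_product (
        product_key   INTEGER PRIMARY KEY,
        product_name  VARCHAR(100),
        product_group VARCHAR(100)   -- denormalized: group stored inline
    );
    CREATE TABLE dim_time (
        time_key      INTEGER PRIMARY KEY,
        day           DATE,
        month         INTEGER,
        year          INTEGER        -- denormalized calendar hierarchy
    );
    CREATE TABLE fact_sales (
        product_key   INTEGER REFERENCES dim_product,
        time_key      INTEGER REFERENCES dim_time,
        quantity      INTEGER,
        revenue       DECIMAL(12,2)
    );

Queries then follow one short join path per dimension instead of a web of normalized relations, which is what buys the query efficiency mentioned above.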
1.3 Getting Heterogeneous Data into the Warehouse
Data warehousing requires access to a broad range of information sources:
Database systems (relational, object-oriented, network, hierarchical, etc.)
External information sources (information gathered from other companies, results of surveys)
Files of standard applications (e.g., Microsoft Excel, COBOL applications)
Other documents (e.g., Microsoft Word, World Wide Web)
Wrappers, loaders, and mediators are programs that bring the data of the information sources into the data warehouse. Wrappers and loaders are responsible for loading, transforming, cleaning, and updating the data from the sources to the data warehouse. Mediators integrate the data into the warehouse by resolving inconsistencies and conflicts between different information sources. Furthermore, an extraction program can examine the source data for conspicuous items that may indicate incorrect information [BaBM97].
These tools — in the commercial sector classified as Extract-Transform-Load (ETL) tools — try to automate or support tasks such as the following [Gree97] (a small cleaning-and-loading sketch in SQL follows the list):
Extraction (accessing different source databases)
Cleaning (finding and resolving inconsistencies in the source data)
Transformation (between different data formats, languages, etc.)
Loading (loading the data into the data warehouse)
Replication (replicating source databases into the data warehouse)
Analyzing (e.g., detecting invalid/unexpected values)
High-speed data transfer (important for very large data warehouses)
Checking data quality (e.g., correctness and completeness)
Analyzing metadata (to support the design of a data warehouse)
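The following sketch shows the kind of cleaning-and-loading step such tools automate, expressed in plain SQL; the table names, the gender codes, and the choice to map unexpected values to NULL are all assumptions made for illustration.

    -- Cleaning and loading customer records from a source relation
    -- into the warehouse (illustrative names and rules)
    INSERT INTO dw_customers (customer_id, name, gender)
    SELECT customer_id,
           TRIM(UPPER(name)),                -- transformation: normalize spelling
           CASE gender WHEN 'm' THEN 'M'     -- cleaning: unify encodings
                       WHEN 'f' THEN 'F'
                       ELSE NULL             -- unexpected value: flag as missing
           END
    FROM   src_customers
    WHERE  customer_id IS NOT NULL;          -- completeness check before loading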
1.4 Getting Multidimensional Data out of the Warehouse
Relational database management systems (RDBMS) are most flexible when they are used with a normalized data structure. Because normalized data structures are non-redundant, normalized relations are useful for daily operational work. The database systems used for this role, so-called OLTP systems, are optimized to support small transactions and queries using primary keys and specialized indexes.
While OLTP systems store only current information, data warehouses contain historical and summarized data. These data are used by managers to find trends and directions in markets and support them in decision making. OLAP is the technology that enables this exploitation of the information stored in the data warehouse.
Due to the complexity of the relationships between the involved entities, OLAP queries require multiple join and aggregation operations over normalized relations, thus overloading the normalized relational database.
Typical operations performed by OLAP clients include the following [ChDa97] (a SQL sketch of the first three follows the list):
Roll up (increasing the level of aggregation)
Drill down (decreasing the level of aggregation)
Slice and dice (selection and projection)
Pivot (reorienting the multidimensional view)
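Against the hypothetical fact_sales and dim_time tables sketched in Sect. 1.2, the first three of these operations map onto ordinary SQL as follows (a sketch under those assumptions, not the syntax of any particular OLAP client):

    -- Roll up: aggregate monthly revenue up to yearly revenue
    SELECT t.year, SUM(f.revenue) AS revenue
    FROM   fact_sales f JOIN dim_time t ON t.time_key = f.time_key
    GROUP  BY t.year;

    -- Slice (selection on one dimension), then drill down to months
    SELECT t.month, SUM(f.revenue) AS revenue
    FROM   fact_sales f JOIN dim_time t ON t.time_key = f.time_key
    WHERE  t.year = 1998               -- the "slice"
    GROUP  BY t.month;                 -- finer aggregation level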
Beyond these basic OLAP operations, other possible client applications on data warehouses include:
Report and query tools
Geographic information systems (GIS)
Data mining (finding patterns and trends in the data warehouse)
Decision support systems (DSS)
Executive information systems (EIS)
Statistics
The OLAP applications provide users with a multidimensional view of the data, which is somewhat different from the typical relational approach; thus their operations need special, customized support. This support is given by multidimensional database systems and relational OLAP servers.
The database management system (DBMS) used for the data warehouse itself and/or for data marts must be a high-performance system, which fulfills the requirements for complex querying demanded by the clients. The following kinds of DBMS are used for data warehousing [Weld97]:
Super-relational database systems
Multidimensional database systems
Super-relational database systems. To make RDBMS more useful for OLAP applications, vendors have added new features to the traditional RDBMS. These so-called super-relational features include support for extensions to storage formats, relational operations, and specialized indexing schemes. To provide fast response time to OLAP applications, the data are organized in a star or snowflake schema (see also Chap. 5).
The resulting data model might be very complex and hard to understand for end users. Vendors of relational database systems try to hide this complexity behind special engines for OLAP. The resulting architecture is called Relational OLAP (ROLAP). In contrast to predictions in the mid-1990s, ROLAP architectures have not been able to capture a large share of the OLAP market. Within this segment, one of the leaders is MicroStrategy [MStr97], whose architecture is shown in Fig. 1.3. The RDBMS is accessed through VLDB (very large database) drivers, which are optimized for large data warehouses.
Fig. 1.3. MicroStrategy solution [MStr97]
Fig. 1.4. MDDB in a data warehouse environment
The DSS Architect translates relational database schemas to an intuitive multidimensional model, so that users are shielded from the complexity of the relational data model. The mapping between the relational and the multidimensional data models is done by consulting the metadata. The system is controlled by the DSS Administrator. With this tool, system administrators can fine-tune the database schema, monitor the system performance, and schedule batch routines.
The DSS Server is a ROLAP server, based on a relational database system. It provides a multidimensional view of the underlying relational database. Other features include caching of query results, monitoring and scheduling of queries, and generation and maintenance of dynamic relational data marts. DSS Agent, DSS Objects, and DSS Web are interfaces to end users, programming languages, and the World Wide Web.
Other ROLAP servers are offered by Red Brick [RBSI97] (subsequently acquired by Informix, then passed on to IBM) and Sybase [Syba97]. The Red Brick system is characterized by its indexing and join technology for star schemas (Starjoin); it also includes a data mining option to find patterns, trends, and relationships in very large databases. Sybase argues that data warehouses need to be constructed in an incremental, bottom-up fashion; such vendors therefore focus on support of distributed data warehouses and data marts.
Multidimensional database systems (MDDB) directly support the way in which OLAP users visualize and work with data. OLAP requires the analysis of large volumes of complex and interrelated data and the viewing of those data from various perspectives [Kena95]. MDDB store data in n-dimensional cubes, where each dimension represents a user perspective. For example, the sales data of a company may have the dimensions product, region, and time. Because of the way the data are stored, no join operations are necessary to answer queries that retrieve sales data by one of these dimensions. Therefore, for OLAP applications, MDDB are often more efficient than traditional RDBMS [Coll96]. A problem with MDDB is that restructuring is much more expensive than in a relational database. Moreover, there is currently no standard data definition language and query language for the multidimensional data model.
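For contrast, a relational system has to compute such a cube rather than store it natively; the SQL:1999 CUBE operator (supported by several RDBMS) makes this explicit. A sketch against the illustrative star schema of Sect. 1.2:

    -- Relational approximation of a multidimensional cube: one query
    -- computes totals for every combination of product group and year
    SELECT p.product_group, t.year, SUM(f.revenue) AS revenue
    FROM   fact_sales f
    JOIN   dim_product p ON p.product_key = f.product_key
    JOIN   dim_time    t ON t.time_key    = f.time_key
    GROUP  BY CUBE (p.product_group, t.year);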
In practical multidimensional OLAP products, two market segments can be observed [PeCr00]. At the low end, desktop OLAP systems such as Cognos PowerPlay, Business Objects, or Brio focus on the efficient and user-friendly handling of relatively small data cubes on client systems. Here, the MDDB is implemented as a data retailer [Sahi96]: it gets its data from a (relational) data warehouse and offers analysis functionality to end users. As shown in Fig. 1.4, ad-hoc queries are sent directly to the data warehouse, whereas OLAP applications work on the more appropriate, multidimensional data model of the MDDB. Market leaders in this segment support hundreds of thousands of workplaces.
Fig. 1.5. Example of a DW environment for integrated financial reporting and planning
At the high end, hybrid OLAP (HOLAP) solutions aim to provide full integration of relational data warehouse solutions (aiming at scalability) and multidimensional solutions (aiming at OLAP efficiency) in complex architectures. Market leaders include Hyperion Essbase, Oracle Express, and Microsoft OLAP.
Application-oriented OLAP. As pointed out by Pendse and Creeth [PeCr00], only a few vendors can survive on generic server tools as mentioned above. Many more market niches can be found for specific application domains. Systems in this sector often provide substantial application-specific functionality in addition to (or on top of) multidimensional OLAP (MOLAP) engines. Generally speaking, application domains can be subdivided into four business functions:
Reporting and querying for standard controlling tasks
Problem and opportunity analysis (often called Business Intelligence)
Planning applications
One-of-a-kind data mining campaigns or analysis projects
Two very important application domains are sales analysis and customer relationship management on the one hand, and budgeting, financial reporting, and consolidation on the other. Interestingly, only a few of the tools on the market are able to integrate the reporting and analysis of available data with planning tasks for the future.
As an example, Fig. 1.5 shows the b2brain architecture by Thinking Networks AG [Thin01], a MOLAP-based environment for financial reporting and planning data warehouses. It exhibits typical features of advanced application-oriented OLAP environments: efficient, metadata-driven custom-tailoring to new applications within a domain; linkage to heterogeneous sources and clients, including via the Internet; and seamless integration of application-relevant features such as heterogeneous data collection, semantics-based consolidation, data mining, and planning. The architecture thus demonstrates the variety of physical structures encountered in high-end data warehousing as well as the importance of metadata, both discussed in the following subsections.
Fig. 1.6. Central architecture
1.5 Physical Structure of Data Warehouses
There are three basic architectures for a data warehouse [Weld97, Muck96]:
Centralized
Federated
Tiered
In a centralized architecture, there exists only one data warehouse, which stores all data necessary for business analysis. As shown in the previous section, the disadvantage is a loss of performance compared to distributed approaches: all queries and update operations must be processed by one database system.
On the other hand, access to data is uncomplicated because only one data model is relevant. Furthermore, building and maintaining a central data warehouse is easier than in a distributed environment. A central data warehouse is useful for companies whose existing operational framework is also centralized (Fig. 1.6).
Fig. 1.7. Federated architecture
A decentralized architecture is only advantageous if the operational environment is also distributed. In a federated architecture, the data are logically consolidated but stored in separate physical databases at the same or at different physical sites (Fig. 1.7). The local data marts store only the information relevant for a department. Because the amount of data is reduced in contrast to a central data warehouse, a local data mart may contain all levels of detail, so that detailed information can also be delivered by the local system.
Fig. 1.8. Tiered architecture
An important feature of the federated architecture is that the logical warehouse is only virtual. In contrast, in a tiered architecture (Fig. 1.8), the central data warehouse is also physical. In addition to this warehouse, there exist local data marts on different tiers, which store copies or summaries of the previous tier but not the detailed data found in a federated architecture.
Fig. 1.9. Distribution of data warehouse project costs [Inmo97]
There can also be different tiers at the source side. Imagine, for example, a supermarket company collecting data from its branches. This process cannot be done in one step because many sources have to be integrated into the warehouse. At the first level, the data of all branches in one region are collected; at the second level, the data from the regions are integrated into one data warehouse.
The advantages of the distributed architecture are (a) faster response time, because the data are located closer to the client applications, and (b) a reduced volume of data to be searched. Although several machines must be used in a distributed architecture, this may result in lower hardware and software costs because not all data must be stored in one place and queries are executed on different machines. A scalable architecture is very important for data warehousing: data warehouses are not static systems but evolve and grow over time. Because of this, the architecture chosen to build a data warehouse must be easy to extend and to restructure.
1.6 Metadata Management
Metadata play an important role in data warehousing. Before a data warehouse can be accessed efficiently, it is necessary to understand what data are available in the warehouse and where the data are located. In addition to locating the data that the end users require, metadata repositories may contain [AdCo97, MStr95, Micr96]:
Data dictionary: contains definitions of the databases being maintained and the relationships between data elements
Data flow: direction and frequency of data feed
Data transformation: transformations required when data is moved
Version control: changes to metadata are stored
Data usage statistics: a profile of data in the warehouse
Alias information: alias names for a field
Security: who is allowed to