Fundamentals of Data Warehouses

Ebook · 436 pages · 4 hours


About this ebook

Data warehouses have captured the attention of practitioners and researchers alike. But the design and optimization of data warehouses remains an art rather than a science. This book presents the first comparative review of the state of the art and best current practice of data warehouses. It covers source and data integration, multidimensional aggregation, query optimization, update propagation, metadata management, quality assessment, and design optimization. Also, based on results of the European Data Warehouse Quality project, it offers a conceptual framework by which the architecture and quality of data warehouse efforts can be assessed and improved using enriched metadata management combined with advanced techniques from databases, business modeling, and artificial intelligence. For researchers and database professionals in academia and industry, the book offers an excellent introduction to the issues of quality and metadata usage in the context of data warehouses.
Language: English
Publisher: Springer
Release date: Mar 9, 2013
ISBN: 9783662051535


    Book preview

    Fundamentals of Data Warehouses - Matthias Jarke

    1 Data Warehouse Practice: An Overview

    Matthias Jarke¹, Maurizio Lenzerini², Yannis Vassiliou³ and Panos Vassiliadis³

    (1)

    Dept. of Computer Science V, RWTH Aachen, Ahornstraße 55, 52056, Aachen, Germany

    (2)

    Dipartimento di Informatica e Sistemistica, Università di Roma La Sapienza, Via Salaria 113, 00198, Rome, Italy

    (3)

    Dept. of Electrical and Computer Engineering, Computer Science Division, National Technical University of Athens, 15773, Zographou, Athens, Greece

    Matthias Jarke

    Email: [email protected]

    Maurizio Lenzerini

    Email: [email protected]

    Yannis Vassiliou

    Email: [email protected]

    Since the beginning of data warehousing in the early 1990s, an informal consensus has been reached concerning the major terms and components involved in data warehousing. In this chapter, we first explain these main terms and components. Data warehouse vendors pursue different strategies in supporting this basic framework; we review a few of the major product families and the basic problem areas data warehouse practice and research are faced with today.

    A data warehouse (DW) is a collection of technologies aimed at enabling the knowledge worker (executive, manager, and analyst) to make better and faster decisions. It is expected to have the right information in the right place at the right time with the right cost in order to support the right decision. Traditional online transaction processing (OLTP) systems are inappropriate for decision support and high-speed networks cannot, by themselves, solve the information accessibility problem. Data warehousing has become an important strategy to integrate heterogeneous data sources and to enable online analytic processing (OLAP).

    A report from the META Group in 1996 predicted data warehousing would be a US$13,000 million industry within two years ($8,000 million on hardware, $5,000 million on services and systems integration), while 1995 represented $2,000 million in expenditures. In 1998, reality had exceeded these figures, reaching sales of $14,600 million. By 2000, the subsector of OLAP alone exceeded $2,500 million. Table 1.1 differentiates the trends by product sector.

    Table 1.1. Estimated sales in millions of dollars [ShTy98] (* Estimates are from [PeCr00])

    The number and complexity of projects — with project sizes ranging from a few hundred thousand to multiple millions of dollars — are indicative of the difficulty of designing good data warehouses. Their expected duration highlights the need for documented quality goals and change management. The emergence of data warehousing was initially a consequence of the observation by W. Inmon and E. F. Codd in the early 1990s that operational-level online transaction processing (OLTP) and decision support applications (OLAP) cannot coexist efficiently in the same database environment, mostly due to their very different transaction characteristics. Meanwhile, data warehousing has taken on a much broader role, especially in the context of reengineering legacy systems or at least saving legacy data. Here, DWs are seen as a strategy to bring heterogeneous data together under a common conceptual and technical umbrella and to make them available for new operational or decision support applications.

    A data warehouse caches selected data of interest to a customer group, so that access becomes faster, cheaper, and more effective (Fig. 1.1). As the long-term buffer between OLTP and OLAP, data warehouses face two essential questions: how to reconcile the stream of incoming data from multiple heterogeneous legacy sources, and how to customize the derived data storage to specific OLAP applications. The trade-off driving the design decisions concerning these two issues changes continuously with business needs. Therefore, design support and change management are of greatest importance if we do not want to run DW projects into dead ends.

    Fig. 1.1. Data warehouses: a buffer between transaction processing and analytic processing

    Vendors agree that data warehouses cannot be off-the-shelf products but must be designed and optimized with great attention to the customer situation. Traditional database design techniques do not apply since they cannot deal with DW-specific issues such as data source selection, temporal and aggregated data, and controlled redundancy management. Since the wide variety of product and vendor strategies prevents a low-level solution to these design problems at acceptable costs, serious research and development efforts continue to be necessary.

    1.1 Data Warehouse Components

    Figure 1.2 gives a rough overview of the usual data warehouse components and their relationships. Many researchers and practitioners share the view that a data warehouse architecture can be understood as layers of materialized views on top of each other. Since the research problems are largely formulated from this perspective, we begin with a brief summary description.

    Fig. 1.2. A generic data warehouse architecture

    A data warehouse architecture exhibits various layers of data in which data from one layer are derived from data of the lower layer. Data sources, also called operational databases, form the lowest layer. They may consist of structured data stored in open database systems and legacy systems or unstructured or semistructured data stored in files. The data sources can be either part of the operational environment of an organization or external, produced by a third party. They are usually heterogeneous, which means that the same data can be represented differently, for instance through different database schemata, in the sources.

    The central layer of the architecture is the global data warehouse, sometimes called the primary or corporate data warehouse. According to Inmon [Inmo96], it is a collection of integrated, nonvolatile, subject-oriented databases designed to support the decision support system (DSS) function, where each unit of data is relevant to some moment in time; it contains atomic data and lightly summarized data. The global data warehouse keeps a historical record of data. Each time it is changed, a new integrated snapshot of the underlying data sources from which it is derived is placed in line with the previous snapshots. The data warehouse typically contains data that may be many years old (a frequently cited average age is two years). Researchers often assume (realistically) that the global warehouse consists of a set of materialized relational views. These views are defined in terms of other relations that are themselves constructed from the data stored in the sources.
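
    This "materialized view" reading can be made concrete with a small sketch. The following Python/SQLite fragment (table and column names are invented for illustration, not taken from the book) shows a warehouse relation that, on each refresh, places a new dated snapshot in line with the previous ones instead of overwriting them:

# Minimal sketch of the snapshot-based global warehouse (hypothetical schema).
import sqlite3
from datetime import date

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE src_orders (order_id INTEGER, region TEXT, amount REAL)")
con.executemany("INSERT INTO src_orders VALUES (?, ?, ?)",
                [(1, "north", 120.0), (2, "south", 80.0)])

# The warehouse relation materializes a view over the source, plus a time stamp,
# so older snapshots remain available for historical analysis.
con.execute("CREATE TABLE dw_orders "
            "(snapshot_date TEXT, order_id INTEGER, region TEXT, amount REAL)")

def refresh_warehouse(con, snapshot_date):
    """Place a new integrated snapshot next to the previous ones."""
    con.execute("INSERT INTO dw_orders "
                "SELECT ?, order_id, region, amount FROM src_orders",
                (snapshot_date,))

refresh_warehouse(con, date.today().isoformat())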

    The next layer of views consists of the local warehouses, which contain highly aggregated data derived from the global warehouse, directly intended to support activities such as informational processing, management decisions, long-term decisions, historical analysis, trend analysis, or integrated analysis. There are various kinds of local warehouses, such as data marts or OLAP databases. Data marts are small data warehouses that contain only a subset of the enterprise-wide data warehouse. A data mart may be used only in a specific department and contain only the data relevant to this department. For example, a data mart for the marketing department should include only customer, sales, and product information, whereas the enterprise-wide data warehouse could also contain information on employees, departments, etc. A data mart enables faster response to queries because the volume of the managed data is much smaller than in the data warehouse and the queries can be distributed between different machines. Data marts may use relational database systems or specific multidimensional data structures.

    There are two major differences between the global warehouse and local data marts. First, the global warehouse results from a complex extraction-integration-transformation process. The local data marts, on the other hand, result from an extraction/aggregation process starting from the global warehouse. Second, data in the global warehouse are detailed, voluminous (since the data warehouse keeps data from previous periods of time), and only lightly aggregated. In contrast, data in the local data marts are highly aggregated and less voluminous. This distinction has a number of consequences both in research and in practice, as we shall see throughout the book.
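
    As a small illustration of this second difference, the following sketch (schema and numbers invented) derives a marketing mart from a detailed warehouse relation by pure extraction/aggregation:

# Hypothetical global-warehouse relation with detailed sales facts.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE dw_sales (day TEXT, store TEXT, product TEXT, amount REAL)")
con.executemany("INSERT INTO dw_sales VALUES (?, ?, ?, ?)", [
    ("2023-01-01", "s1", "p1", 10.0),
    ("2023-01-01", "s1", "p2", 5.0),
    ("2023-02-02", "s2", "p1", 7.0),
])

# The mart results from extraction/aggregation, not extraction/integration:
# it is highly aggregated and far less voluminous than the warehouse relation.
con.execute("CREATE TABLE mart_monthly_sales AS "
            "SELECT substr(day, 1, 7) AS month, product, SUM(amount) AS total "
            "FROM dw_sales GROUP BY month, product")
print(con.execute("SELECT * FROM mart_monthly_sales").fetchall())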

    In some cases, an intermediate layer, called an operational data store (ODS), is introduced between the operational data sources and the global data warehouse. An ODS contains subject-oriented, collectively integrated, volatile, current-valued, and detailed data. The ODS usually contains records that result from the transformation, integration, and aggregation of detailed data found in the data sources, just as for a global data warehouse. Therefore, we can also consider that the ODS consists of a set of materialized relational views. The main differences from a data warehouse are the following. First, the ODS is subject to change much more frequently than a data warehouse. Second, the ODS holds only fresh and current data. Finally, the aggregation in the ODS is of fine granularity: the data are at most weakly summarized. The use of an ODS, according to Inmon [Inmo96], is justified for corporations that need collective, integrated operational data. The ODS is a good support for activities such as collective operational decisions or immediate corporate information. This usually depends on the size of the corporation, the need for immediate corporate information, and the status of integration of the various legacy systems. Figure 1.2 summarizes the different layers of data.

    All the data warehouse components, processes, and data are — or at least should be — tracked and administered from a metadata repository. The metadata repository serves as an aid both to the administrator and the designer of a data warehouse. Since the data warehouse is a very complex system, its architecture (physical components, schemata) can be complicated; the volume of data is vast; and the processes employed for the extraction, transformation, cleaning, storage, and aggregation of data are numerous, sensitive to changes, and vary in time.

    1.2 Designing the Data Warehouse

    The design of a data warehouse is a difficult task. There are several problems designers have to tackle. First of all, they have to achieve a semantic reconciliation of the information lying in the sources and produce an enterprise model for the data warehouse. Then, a logical structure of relations in the core of the data warehouse must be obtained, either serving as buffers for the refreshment process or as persistent data stores for querying or further propagation to data marts. This is not a simple task by itself; it becomes even more complicated once the physical design problem arises: the designer has to choose the physical tables, processes, indexes, and data partitions that represent the logical data warehouse schema and facilitate its functionality. Finally, hardware selection and software development is another process that has to be planned by the data warehouse designer [AdVe98, ISIA97, Simo98].

    It is evident that the schemata of all the data stores involved in a data warehouse environment change rapidly: changes in the business rules of a corporation affect both the source schemata (of the operational databases) and the user requirements (and thus the schemata of the data marts). Consequently, the design of a data warehouse is an ongoing process, which is performed iteratively throughout the lifecycle of the system [KRRT98].

    There is quite a lot of discussion about the methodology for the design of a data warehouse. The two major methodologies are the top-down and the bottom-up approaches [Kimb96, KRRT98, Syba97]. In the top-down approach, a global enterprise model is constructed, which reconciles the semantic models of the sources (and later, their data). This approach is usually costly and time-consuming; nevertheless it provides a basis over which the schema of the data warehouse can evolve. The bottom-up approach focuses on the more rapid and less costly development of smaller, specialized data marts and their synthesis as the data warehouse evolves.

    No matter which approach is followed, there seems to be agreement on the general idea concerning the final schema of a data warehouse. In a first layer, the ODS serves as an intermediate buffer for the most recent and detailed information from the sources. Data cleaning and transformation are performed at this level. Next, a database under a denormalized star schema usually serves as the central repository of data. A star schema is a special-purpose schema in data warehouses that is oriented towards query efficiency at the cost of schema normalization (cf. Chap. 5 for a detailed description). Finally, more aggregated views on top of this star schema can also be precalculated. The OLAP tools can communicate either with the upper levels of the data warehouse or with the customized data marts; we shall detail this issue in the following sections.
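
    As a minimal sketch (with invented table and column names), such a star schema could be declared as follows; the fact table sits at the center of the star and references a small number of denormalized dimension tables:

# Hedged sketch of a denormalized star schema (names invented for illustration).
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_time    (time_id INTEGER PRIMARY KEY, day TEXT, month TEXT, year INTEGER);
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_store   (store_id INTEGER PRIMARY KEY, city TEXT, region TEXT);

-- The fact table holds the measures; OLAP queries join it with the
-- dimension tables and aggregate, trading normalization for query speed.
CREATE TABLE fact_sales (
    time_id    INTEGER REFERENCES dim_time,
    product_id INTEGER REFERENCES dim_product,
    store_id   INTEGER REFERENCES dim_store,
    amount     REAL
);
""")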

    1.3 Getting Heterogeneous Data into the Warehouse

    Data warehousing requires access to a broad range of information sources:

    Database systems (relational, object-oriented, network, hierarchical, etc.)

    External information sources (information gathered from other companies, results of surveys)

    Files of standard applications (e.g., Microsoft Excel, COBOL applications)

    Other documents (e.g., Microsoft Word, World Wide Web)

    Wrappers, loaders, and mediators are programs that load data from the information sources into the data warehouse. Wrappers and loaders are responsible for loading, transforming, cleaning, and updating the data from the sources to the data warehouse. Mediators integrate the data into the warehouse by resolving inconsistencies and conflicts between different information sources. Furthermore, an extraction program can examine the source data to find conspicuous items, which may indicate incorrect information [BaBM97].

    These tools — in the commercial sector classified as Extract-Transform-Load (ETL) tools — try to automate or support tasks such as the following [Gree97] (a toy pipeline sketch follows the list):

    Extraction (accessing different source databases)

    Cleaning (finding and resolving inconsistencies in the source data)

    Transformation (between different data formats, languages, etc.)

    Loading (loading the data into the data warehouse)

    Replication (replicating source databases into the data warehouse)

    Analyzing (e.g., detecting invalid/unexpected values)

    High-speed data transfer (important for very large data warehouses)

    Checking data quality (e.g., for correctness and completeness)

    Analyzing metadata (to support the design of a data warehouse)
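
    The following toy pipeline (all names hypothetical, not taken from any particular ETL product) illustrates four of these tasks, namely extraction, cleaning, transformation, and loading:

# Toy ETL pipeline sketch: extract -> clean -> transform -> load.
import sqlite3

def extract(rows):
    """Extraction: in practice this reads from heterogeneous source systems."""
    return list(rows)

def clean(rows):
    """Cleaning: drop records with missing or invalid amounts."""
    return [r for r in rows if r["amount"] is not None and r["amount"] >= 0]

def transform(rows):
    """Transformation: normalize currency codes to a single format."""
    return [{**r, "currency": r["currency"].upper()} for r in rows]

def load(con, rows):
    """Loading: append the reconciled records to the warehouse relation."""
    con.executemany("INSERT INTO dw_sales VALUES (:day, :amount, :currency)", rows)

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE dw_sales (day TEXT, amount REAL, currency TEXT)")
source = [{"day": "2023-01-01", "amount": 10.0, "currency": "eur"},
          {"day": "2023-01-01", "amount": None, "currency": "eur"}]  # dirty row
load(con, transform(clean(extract(source))))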

    1.4 Getting Multidimensional Data out of the Warehouse

    Relational database management systems (RDBMS) are most flexible when they are used with a normalized data structure. Because normalized data structures are non-redundant, normalized relations are well suited to daily operational work. The database systems used for this role, so-called OLTP systems, are optimized to support small transactions and queries using primary keys and specialized indexes.

    While OLTP systems store only current information, data warehouses contain historical and summarized data. These data are used by managers to find trends and directions in markets, and support them in decision making. OLAP is the technology that enables this exploitation of the information stored in the data warehouse.

    Due to the complexity of the relationships between the involved entities, OLAP queries require multiple join and aggregation operations over normalized relations, thus overloading the normalized relational database.
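
    To see why, consider a sketch of such a query over an invented normalized schema: answering even a simple "revenue by region and quarter" question already requires five joins before the aggregation can start.

# Hypothetical OLAP query against a normalized (OLTP-style) schema; the join
# chain is what overloads the operational database.
OLAP_QUERY = """
SELECT r.region_name, t.quarter, SUM(li.quantity * p.unit_price) AS revenue
FROM line_item li
JOIN orders   o ON li.order_id   = o.order_id
JOIN customer c ON o.customer_id = c.customer_id
JOIN region   r ON c.region_id   = r.region_id
JOIN product  p ON li.product_id = p.product_id
JOIN time_dim t ON o.day         = t.day
GROUP BY r.region_name, t.quarter;
"""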

    Typical operations performed by OLAP clients include [ChDa97] (illustrated in the sketch after the list):

    Roll up (increasing the level of aggregation)

    Drill down (decreasing the level of aggregation)

    Slice and dice (selection and projection)

    Pivot (reorienting the multidimensional view)
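
    These four operations can be illustrated with a small sketch using the pandas library on invented data with the dimensions region, product, and month:

# Toy cube as a flat table; each OLAP operation is one pandas expression.
import pandas as pd

df = pd.DataFrame({
    "region":  ["north", "north", "south", "south"],
    "product": ["p1", "p2", "p1", "p2"],
    "month":   ["Jan", "Jan", "Feb", "Feb"],
    "sales":   [10, 5, 7, 3],
})

roll_up    = df.groupby("region")["sales"].sum()        # coarser aggregation
drill_down = df.groupby(["region", "product", "month"])["sales"].sum()  # finer
sliced     = df[df["region"] == "north"]                # slice: one selection
diced      = df[(df["region"] == "north") & (df["product"] == "p1")]  # dice
pivoted    = df.pivot_table(index="region", columns="product",
                            values="sales", aggfunc="sum")  # reorient the view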

    Beyond these basic OLAP operations, other possible client applications on data warehouses include:

    Report and query tools

    Geographic information systems (GIS)

    Data mining (finding patterns and trends in the data warehouse)

    Decision support systems (DSS)

    Executive information systems (EIS)

    Statistics

    The OLAP applications provide users with a multidimensional view of the data, which is somewhat different from the typical relational approach; thus their operations need special, customized support. This support is given by multidimensional database systems and relational OLAP servers.

    The database management system (DBMS) used for the data warehouse itself and/or for data marts must be a high-performance system, which fulfills the requirements for complex querying demanded by the clients. The following kinds of DBMS are used for data warehousing [Weld97]:

    Super-relational database systems

    Multidimensional database systems

    Super-relational database systems. To make RDBMS more useful for OLAP applications, vendors have added new features to the traditional RDBMS. These so-called super-relational features include support for extensions to storage formats, relational operations, and specialized indexing schemes. To provide fast response time to OLAP applications, the data are organized in a star or snowflake schema (see also Chap. 5).

    The resulting data model might be very complex and hard to understand for end users. Vendors of relational database systems try to hide this complexity behind special engines for OLAP. The resulting architecture is called Relational OLAP (ROLAP). In contrast to predictions in the mid-1990s, ROLAP architectures have not been able to capture a large share of the OLAP market. Within this segment, one of the leaders is MicroStrategy [MStr97], whose architecture is shown in Fig. 1.3. The RDBMS is accessed through VLDB (very large database) drivers, which are optimized for large data warehouses.

    Fig. 1.3. MicroStrategy solution [MStr97]

    Fig. 1.4. MDDB in a data warehouse environment

    The DSS Architect translates relational database schemas to an intuitive multidimensional model, so that users are shielded from the complexity of the relational data model. The mapping between the relational and the multidimensional data models is done by consulting the metadata. The system is controlled by the DSS Administrator. With this tool, system administrators can fine-tune the database schema, monitor the system performance, and schedule batch routines.

    The DSS Server is a ROLAP server based on a relational database system. It provides a multidimensional view of the underlying relational database. Other features are the ability to cache query results, to monitor and schedule queries, and to generate and maintain dynamic relational data marts. DSS Agent, DSS Objects, and DSS Web are interfaces to end users, programming languages, or the World Wide Web.

    Other ROLAP servers are offered by Red Brick [RBSI97] (subsequently acquired by Informix, then passed on to IBM) and Sybase [Syba97]. The Red Brick system is characterized by an industry-leading indexing and join technology for star schemas (Starjoin); it also includes a data mining option to find patterns, trends, and relationships in very large databases. These vendors argue that data warehouses need to be constructed in an incremental, bottom-up fashion; they therefore focus on support of distributed data warehouses and data marts.

    Multidimensional database systems (MDDB) directly support the way in which OLAP users visualize and work with data. OLAP requires an analysis of large volumes of complex and interrelated data and viewing that data from various perspectives [Kena95]. MDDB store data in n-dimensional cubes. Each dimension represents a user perspective. For example, the sales data of a company may have the dimensions product, region, and time. Because of the way the data is stored, no join operations are necessary to answer queries which retrieve sales data by one of these dimensions. Therefore, for OLAP applications, MDDB are often more efficient than traditional RDBMS [Coll96]. A problem with MDDB is that restructuring is much more expensive than in a relational database. Moreover, there is currently no standard data definition language and query language for the multidimensional data model.
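
    The point about joins can be illustrated with a toy cube (all data invented): once sales figures are held in a product-by-region-by-time array, aggregation along any dimension is plain index arithmetic rather than a join.

# Minimal multidimensional storage sketch using numpy.
import numpy as np

products, regions, months = ["p1", "p2"], ["north", "south"], ["Jan", "Feb"]
cube = np.array([[[10, 12], [7, 9]],
                 [[5, 6], [3, 4]]])      # shape: (product, region, time)

sales_by_product = cube.sum(axis=(1, 2))  # aggregate over region and time
sales_by_region  = cube.sum(axis=(0, 2))  # aggregate over product and time
jan_slice        = cube[:, :, months.index("Jan")]  # one time slice, no join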

    In practical multidimensional OLAP products, two market segments can be observed [PeCr00]. At the low end, desktop OLAP systems such as Cognos PowerPlay, Business Objects, or Brio focus on the efficient and user-friendly handling of relatively small data cubes on client systems. Here, the MDDB is implemented as a data retailer [Sahi96]: it gets its data from a (relational) data warehouse and offers analysis functionality to end users. As shown in Fig. 1.4, ad-hoc queries are sent directly to the data warehouse, whereas OLAP applications work on the more appropriate, multidimensional data model of the MDDB. Market leaders in this segment support hundreds of thousands of workplaces.

    Fig. 1.5. Example of a DW environment for integrated financial reporting and planning

    At the high end, hybrid OLAP (HOLAP) solutions aim to provide full integration of relational data warehouse solutions (aiming at scalability) and multidimensional solutions (aiming at OLAP efficiency) in complex architectures. Market leaders include Hyperion Essbase, Oracle Express, and Microsoft OLAP.

    Application-oriented OLAP. As pointed out by Pendse and Creeth [PeCr00], only a few vendors can survive on generic server tools as mentioned above. Many more market niches can be found for specific application domains. Systems in this sector often provide lots of application-specific functionality in addition to (or on top of) multidimensional OLAP (MOLAP) engines. Generally speaking, application domains can be subdivided into four business functions:

    Reporting and querying for standard controlling tasks

    Problem and opportunity analysis (often called Business Intelligence)

    Planning applications

    One-of-a-kind data mining campaigns or analysis projects

    Two very important application domains are sales analysis and customer relationship management on the one hand, and budgeting, financial reporting, and consolidation on the other. Interestingly, only a few of the tools on the market are able to integrate the reporting and analysis of the available data with the planning tasks for the future.

    As an example, Fig. 1.5 shows the b2brain architecture by Thinking Networks AG [Thin01], a MOLAP-based environment for financial reporting and planning data warehouses. It shows some typical features of advanced application-oriented OLAP environments: efficient custom-tailoring to new applications within a domain using metadata, linkage to heterogeneous sources and clients (also via the Internet), and seamless integration of application-relevant features such as heterogeneous data collection, semantics-based consolidation, data mining, and planning. The architecture thus demonstrates the variety of physical structures encountered in high-end data warehousing as well as the importance of metadata, both to be discussed in the following subsections.

    Fig. 1.6. Central architecture

    1.5 Physical Structure of Data Warehouses

    There are three basic architectures for a data warehouse [Weld97, Muck96]:

    Centralized

    Federated

    Tiered

    In a centralized architecture, there exists only one data warehouse which stores all data necessary for business analysis. As already shown in the previous section, the disadvantage is the loss of performance compared to distributed approaches. All queries and update operations must be processed in one database system.

    On the other hand, access to data is uncomplicated because only one data model is relevant. Furthermore, building and maintaining a central data warehouse is easier than in a distributed environment. A central data warehouse is useful for companies where the existing operational framework is also centralized (Fig. 1.6).

    Fig. 1.7. Federated architecture

    A decentralized architecture is only advantageous if the operational environment is also distributed. In a federated architecture, the data is logically consolidated but stored in separate physical databases at the same or at different physical sites (Fig. 1.7). The local data marts store only the information relevant to a department. Because the amount of data is reduced in contrast to a central data warehouse, a local data mart may contain all levels of detail, so that detailed information can also be delivered by the local system.

    Fig. 1.8. Tiered architecture

    An important feature of the federated architecture is that the logical warehouse is only virtual. In contrast, in a tiered architecture (Fig. 1.8), the central data warehouse is also physical. In addition to this warehouse, there exist local data marts on different tiers which store copies or summaries of the previous tier, but not detailed data as in a federated architecture.

    Fig. 1.9. Distribution of data warehouse project costs [Inmo97]

    There can also be different tiers at the source side. Imagine, for example, a supermarket company collecting data from its branches. This process cannot be done in one step because many sources have to be integrated into the warehouse. At the first level, the data of all branches in one region are collected, and at the second level, the data from the regions are integrated into one data warehouse.
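
    A minimal sketch of this two-level collection (data invented) could look as follows:

# First tier consolidates branch feeds per region; the second tier
# integrates the regional results into the central warehouse.
branch_feeds = {
    "region_east": [{"branch": "e1", "sales": 100}, {"branch": "e2", "sales": 150}],
    "region_west": [{"branch": "w1", "sales": 80}],
}

regional = {region: sum(r["sales"] for r in rows)      # tier 1: per region
            for region, rows in branch_feeds.items()}
warehouse_total = sum(regional.values())               # tier 2: central DW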

    The advantages of the distributed architecture are (a) faster response time, because the data is located closer to the client applications, and (b) reduced volume of data to be searched. Although several machines must be used in a distributed architecture, this may result in lower hardware and software costs because not all data must be stored in one place and queries are executed on different machines. A scalable architecture is very important for data warehousing. Data warehouses are not static systems but evolve and grow over time. Because of this, the architecture chosen to build a data warehouse must be easy to extend and to restructure.

    1.6 Metadata Management

    Metadata play an important role in data warehousing. Before a data warehouse can be accessed efficiently, it is necessary to understand what data is available in the warehouse and where the data is located. In addition to locating the data that the end users require, metadata repositories may contain [AdCo97, MStr95, Micr96] (a structural sketch follows the list):

    Data dictionary: contains definitions of the databases being maintained and the relationships between data elements

    Data flow: direction and frequency of data feed

    Data transformation: transformations required when data is moved

    Version control: changes to metadata are stored

    Data usage statistics: a profile of data in the warehouse

    Alias information: alias names for a field

    Security: who is allowed to access the data
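
    As a structural sketch (fields and names invented, not taken from any cited repository product), such entries might be recorded per warehouse relation as follows:

# Hypothetical metadata-repository entry mirroring the categories above.
from dataclasses import dataclass, field

@dataclass
class TableMetadata:
    name: str
    definition: str                                       # data dictionary
    source: str                                           # data flow: origin
    refresh_frequency: str                                # data flow: feed rate
    transformations: list = field(default_factory=list)   # applied when moved
    aliases: list = field(default_factory=list)           # alias information
    allowed_roles: list = field(default_factory=list)     # security

dw_sales_meta = TableMetadata(
    name="dw_sales",
    definition="Daily sales facts integrated from regional POS systems",
    source="src_orders",
    refresh_frequency="nightly",
    transformations=["currency normalized to EUR"],
    aliases=["sales_history"],
    allowed_roles=["analyst", "dw_admin"],
)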
