This document provides an overview of data warehouse and OLAP technology. It discusses key concepts such as the components of a data warehouse including operational data sources, operational data stores, load managers, warehouse managers, query managers, and end user access tools. It also covers ETL processes, data warehouse architectures including star schemas and snowflake schemas, differences between OLTP and OLAP systems, and considerations for data warehouse implementation and conceptual modeling.
Unit 2
Data Warehouse and OLAP Technology
• A data warehouse is simply a single, complete and consistent store of data obtained from a variety of sources and made available to end users in a way they can understand and use in a business context.
• A data warehouse is a subject-oriented, integrated, time-variant and nonvolatile collection of data in support of management's decision-making process.

Data warehouse - subject oriented
• Oriented to the major subject areas of the corporation that have been defined in the data model.
• For example, for an insurance company: customer, product, transaction or activity, policy, claim, account, etc.

Data warehouse - integrated
• Data sources are heterogeneous, with no consistency in encoding or naming conventions among them.
• When data is moved into the warehouse, it is converted to a consistent representation.

Data warehouse - nonvolatile
• Operational data is regularly accessed and manipulated a record at a time, and updates are made to data in the operational environment; warehouse data, by contrast, is loaded and then read, not updated in place.

Data warehouse - time variant
• The time horizon for the data warehouse is significantly longer than that of operational systems.
• Operational database: current-value data.

Building blocks or components
• Metadata - good metadata is essential to the effective operation of a data warehouse; it is used in data collection, data transformation and data access.
• Metadata maps the translation of information from the operational system to the analytical system.

Data marts
• Data marts are smaller than data warehouses and generally contain information from a single department of a business or organisation. The current trend in data warehousing is to develop a data warehouse with several smaller related data marts for specific kinds of queries and reports.

Security
• As with any information system, the security of data is determined by the hardware, software and the procedures that created them.
• The reliability and authenticity of the data and information extracted from the warehouse will be a function of the reliability and authenticity of the warehouse and the various source systems.

Construction
• The steps in planning a data warehouse are identical to the steps for any other type of computer application. Users must be involved to determine the scope of the warehouse and what business requirements need to be met.

Why a warehouse?
• Two approaches:
• 1. Query-driven (lazy, on demand) - the traditional research approach.
• 2. Warehousing (eager).

Disadvantages of the query-driven approach
• Delay in query processing.
• Slow or unavailable information sources.
• Complex filtering and integration.
• Inefficient and potentially expensive for frequent queries.
• Competes with local processing at sources.
• Has not caught on in industry.

The warehousing approach
• Information is integrated in advance and stored in the warehouse for direct querying and analysis.

Advantages of the warehousing approach
• High query performance, but not necessarily the most current information.
• Does not interfere with local processing at sources.
• Complex queries can be run at the warehouse.

Data warehouse architectures
• 1. Single layer - every data element is stored once only; a virtual warehouse.
• 2. Two layer - real-time + derived data; the most commonly used approach in industry today.
• 3. Three-layer architecture - transformation of real-time data to derived data really requires two steps: a view level for 'particular informational needs', and the physical implementation of the data warehouse.

Data warehouse components
• 1. Operational data sources - data for the warehouse is supplied from mainframe operational data held in first-generation hierarchical and network databases, departmental data held in file systems, private data held on workstations and private servers, and external systems such as the internet, commercially available databases, or databases associated with an organisation's suppliers or customers.
• 2. Operational data store (ODS) - a repository of current and integrated operational data used for analysis. It is often structured and supplied with data in the same way as the data warehouse, but may in fact simply act as a staging area for data to be moved into the warehouse.
• 3. Load manager - also called the front-end component, it performs all the operations associated with the extraction and loading of data into the warehouse. These operations include simple transformations of the data to prepare it for entry into the warehouse.
• 4. Warehouse manager - performs all the operations associated with the management of the data in the warehouse. These include analysis of data to ensure consistency, transformation and merging of source data, creation of indexes and views, and generation of denormalisations and aggregations.
• 5. Query manager - also called the back-end component, it performs all the operations associated with the management of user queries. These include directing queries to the appropriate tables and scheduling the execution of queries.
• 6. End-user access tools - can be categorised into five main groups: data reporting and query tools, application development tools, executive information system tools, online analytical processing tools, and data mining tools.
• Diagram in data warehouse slide.

Data warehouse implementation
• Includes loading data, implementing transformation programs, designing the user interface, developing standard queries and reports, and training warehouse users.

ETL in the data warehouse
• The process of extracting data from source systems and bringing it into the data warehouse is commonly called ETL, which stands for:
• Extraction - to retrieve all the required data from the source system with as few resources as possible.
• Transformation - applies a set of rules to transform the data from the source to the target.
• This includes converting any measured data to the same dimension using the same units so that they can later be joined.
• It may also require joining data from several sources, generating aggregates, sorting, and deriving new calculated values.
• Loading - to ensure that the load is performed correctly and with as few resources as possible. The target of the load process is often a database. Referential integrity needs to be maintained by the ETL tool to ensure consistency.

Advantages of data warehouse implementation
• 1. Better data management and delivery - one of the most important advantages of using a data warehousing system in an organisation is efficient data management and delivery. It enables the storage of all types of data from different sources in a single base that can be used for analysis purposes.
• 2. Better decision making - with effective insights from business intelligence, the management of the organisation can take effective decisions based on solid data analysis.
• 3. Cost reduction - it helps avoid duplication of work, which ultimately reduces cost and increases the efficiency of the organisation.
• 4. Competitive advantage - as the organisation is able to make effective decisions, it can outperform its competitors, fully utilise its resources, and focus on activities in a better way.

Data processing models
• There are two basic data processing models:
• 1. OLTP - the main aim of OLTP is reliable and efficient processing of a large number of transactions and ensuring data consistency.
• 2. OLAP - the main aim of OLAP is efficient multidimensional processing of large data volumes.
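The extract-transform-load flow described above can be sketched in a few lines of Python. This is a minimal illustration, not a real ETL tool: the source rows, units, and the target table `product_weights` are all hypothetical.

```python
import sqlite3

# Hypothetical rows extracted from two source systems; note the
# inconsistent units (kg vs g), as is typical before transformation.
source_a = [("P1", "12.5 kg"), ("P2", "3.0 kg")]
source_b = [("P3", "4000 g")]

def transform(rows):
    """Apply a simple rule set: convert every weight to kilograms."""
    out = []
    for product_id, weight in rows:
        value, unit = weight.split()
        kg = float(value) / 1000 if unit == "g" else float(value)
        out.append((product_id, kg))
    return out

# Load step: insert the cleaned rows into the (hypothetical) warehouse table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE product_weights (product_id TEXT PRIMARY KEY, weight_kg REAL)")
rows = transform(source_a) + transform(source_b)
conn.executemany("INSERT INTO product_weights VALUES (?, ?)", rows)
conn.commit()
```

Because both sources are now expressed in the same unit, the loaded rows can later be joined and aggregated without further conversion.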
Traditional OLTP
• Traditionally, DBMSs have been used for online transaction processing (OLTP).
• Order entry: pull up an order and update the status field.
• Banking: transfer one thousand rupees from account X to account Y.
• Critical data processing tasks; detailed, up-to-date data; structured, repetitive tasks.
• Short transactions are the unit of work: read and update a few records.
• Isolation, recovery and integrity are critical.

OLTP vs OLAP
• OLTP (online transaction processing) describes processing at operational sites.
• OLAP (online analytical processing) describes processing at the warehouse.

Comparison of OLTP systems and data warehousing systems

Conceptual modelling of the data warehouse
• Three basic conceptual schemas:
• Star schema
• Snowflake schema
• Fact constellation

Star schema
• A single fact table in the middle connected to a number of dimension tables.
• Basic notion: a measure (e.g. sales quantity).
• Given a collection of numeric measures, each measure depends on a set of dimensions (e.g. sales volume as a function of product, time and location).
• The relation which relates the dimensions to the measure of interest is called the fact table (e.g. sale).
• Information about the dimensions can be represented as a collection of relations called the dimension tables (Product, Customer, Store).
• Each dimension can have a set of associated attributes.
• Diagram in data warehouse slide.

Snowflake schema
• A refinement of the star schema in which the dimensional hierarchy is represented explicitly by normalising the dimension tables.
• Diagram in data warehouse slide.

Fact constellation
• Multiple fact tables share dimension tables.

Database design methodology for the data warehouse
• 1. Choosing the process
• 2. Choosing the grain
• 3. Identifying and conforming the dimensions
• 4. Choosing the facts
• 5. Storing pre-calculations in the fact table
• 6. Rounding out the dimension tables
• 7. Choosing the duration of the database
• 8. Tracking slowly changing dimensions
• 9. Deciding the query priorities and the query modes

Choosing the process
• The process (function) refers to the subject matter of a particular data mart. The first data mart to be built should be the one that is most likely to be delivered on time, within budget, and to answer the most commercially important business questions.
• The best choice for the first data mart tends to be the one that is related to sales.

Choosing the grain
• Choosing the grain means deciding exactly what a fact table record represents.
• Only when the grain for the fact table is chosen can we identify the dimensions of the fact table.
• The grain decision for the fact table also determines the grain of each of the dimension tables.

Identifying and conforming the dimensions
• Dimensions set the context for formulating queries about the facts in the fact table.
• We identify dimensions in sufficient detail to describe things such as clients and properties at the correct grain.

Choosing the facts
• The grain of the fact table determines which facts can be used in the data mart - all facts must be expressed at the level implied by the grain.

Storing pre-calculations in the fact table
• Once the facts have been selected, they should be re-examined to determine whether there are opportunities to use pre-calculations, e.g. a profit or loss statement.

Rounding out the dimension tables
• In this step we return to the dimension tables and add as many text descriptions to the dimensions as possible.
• The text descriptions should be as understandable to the users as possible.

Choosing the duration of the data warehouse
• The duration measures how far back in time the fact table goes.
• For some companies (e.g. insurance companies) there may be a legal requirement to retain data extending back five or more years.
Tracking slowly changing dimensions
• The slowly changing dimension problem means that the proper description of the old client and the old branch must be used with the old data warehouse history.

Deciding the query priorities and the query modes
• In this step we consider physical design issues:
• the presence of pre-stored summaries and aggregates;
• security issues;
• backup issues, etc.
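To make the star schema concrete, here is a minimal sketch in SQL (run through Python's sqlite3), using the sale fact table and the Product and Store dimension tables mentioned above. All column names and data values are illustrative, and the foreign-key clauses are shown only to indicate the fact-to-dimension links.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Dimension tables: descriptive attributes for each dimension.
cur.execute("CREATE TABLE product (product_id INTEGER PRIMARY KEY, "
            "name TEXT, category TEXT)")
cur.execute("CREATE TABLE store (store_id INTEGER PRIMARY KEY, city TEXT)")

# Fact table: one numeric measure (sales_qty) keyed by its dimensions.
cur.execute("""CREATE TABLE sale (
    product_id INTEGER REFERENCES product(product_id),
    store_id   INTEGER REFERENCES store(store_id),
    sale_date  TEXT,
    sales_qty  INTEGER)""")

cur.executemany("INSERT INTO product VALUES (?, ?, ?)",
                [(1, "Pen", "Stationery"), (2, "Notebook", "Stationery")])
cur.executemany("INSERT INTO store VALUES (?, ?)",
                [(1, "Pune"), (2, "Mumbai")])
cur.executemany("INSERT INTO sale VALUES (?, ?, ?, ?)",
                [(1, 1, "2024-01-01", 10), (2, 1, "2024-01-01", 5),
                 (1, 2, "2024-01-02", 7)])

# A typical OLAP-style query: total quantity sold per city,
# joining the fact table to one dimension table.
rows = cur.execute("""
    SELECT s.city, SUM(f.sales_qty)
    FROM sale f JOIN store s ON f.store_id = s.store_id
    GROUP BY s.city ORDER BY s.city""").fetchall()
```

A snowflake schema would differ only in normalising the dimensions further, e.g. moving `category` out of `product` into its own table referenced by a key.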
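One common way to handle step 8 above (tracking slowly changing dimensions) is the so-called "Type 2" approach: instead of overwriting a dimension row when, say, a client moves branch, a new version of the row is inserted and validity dates record which version applies to old facts. The sketch below is illustrative only; the `ClientVersion` record and its fields are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class ClientVersion:
    client_id: int            # business key, shared across versions
    branch: str               # the slowly changing attribute
    valid_from: date
    valid_to: Optional[date]  # None means "current version"

history = [ClientVersion(42, "Old Branch", date(2020, 1, 1), None)]

def change_branch(history, client_id, new_branch, when):
    """Close the current version and append a new one (Type 2 change)."""
    for v in history:
        if v.client_id == client_id and v.valid_to is None:
            v.valid_to = when
    history.append(ClientVersion(client_id, new_branch, when, None))

def branch_as_of(history, client_id, when):
    """Return the branch description that was valid on a given date."""
    for v in history:
        if (v.client_id == client_id and v.valid_from <= when
                and (v.valid_to is None or when < v.valid_to)):
            return v.branch
    return None

change_branch(history, 42, "New Branch", date(2023, 6, 1))
```

Old facts dated before the change still resolve to "Old Branch", which is exactly the requirement stated above: the old description is used with the old history.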