Warehouse Complete


Bachelor of Computer Application (BCA) Semester 6 BC0058 Data Warehousing 4 Credits

Assignment Set 1 (60 Marks)

Ques 1 Explain the differences between OLTP and Data Warehouse.

Ans Application databases are OLTP (On-Line Transaction Processing) systems where every transaction has to be recorded as and when it occurs. Consider the scenario where a bank ATM has disbursed cash to a customer but was unable to record this event in the bank records. If this happened frequently, the bank wouldn't stay in business for long. So the banking system is designed to make sure that every transaction gets recorded within the time you stand before the ATM.

A Data Warehouse (DW), on the other hand, is a database (yes, you are right, it's a database) that is designed to facilitate querying and analysis. Often designed as OLAP (On-Line Analytical Processing) systems, these databases contain read-only data that can be queried and analyzed far more efficiently than your regular OLTP application databases. In this sense an OLAP system is designed to be read-optimized. Separation from your application database also ensures that your business intelligence solution is scalable (your bank and ATMs don't go down just because the CFO asked for a report), better documented and better managed.

Creation of a DW leads to a direct increase in the quality of analysis because the table structures are simpler (you keep only the needed information in simpler tables), standardized (well-documented table structures), and often de-normalized (to reduce the linkages between tables and the corresponding complexity of queries). A well-designed DW is the foundation on which successful BI (Business Intelligence) and analytics initiatives are built.

Data Warehouses usually store many months or years of data. This is to support historical analysis. OLTP systems usually store data from only a few weeks or months. The OLTP system stores only as much historical data as is needed to meet the requirements of the current transaction.
OLTP VS Data Warehouses

Property                  | OLTP                | Data Warehouse
Nature of Data Warehouse  | 3 NF                | Multidimensional
Indexes                   | Few                 | Many
Joins                     | Many                | Some
Duplicate data            | Normalized          | Denormalized
Aggregate data            | Rare                | Common
Queries                   | Mostly predefined   | Mostly ad hoc
Nature of queries         | Mostly simple       | Mostly complex
Updates                   | All the time        | Not allowed, only refreshed
Historical data           | Often not available | Essential
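The contrast between the two kinds of database can be illustrated with a small sketch. It assumes Python's built-in sqlite3 module and uses hypothetical table names (customer, account, withdrawal, dw_withdrawals) and made-up sample rows that do not come from the assignment. The OLTP side records one transaction at a time in normalized tables, while the warehouse side holds a denormalized, read-only copy that answers an aggregate question without any joins.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# OLTP side: normalized tables (hypothetical schema and sample rows).
cur.executescript("""
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE account (
    account_id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customer(customer_id)
);
CREATE TABLE withdrawal (
    withdrawal_id INTEGER PRIMARY KEY,
    account_id INTEGER REFERENCES account(account_id),
    amount REAL
);
INSERT INTO customer VALUES (1, 'Pune'), (2, 'Delhi');
INSERT INTO account VALUES (10, 1), (20, 2);
INSERT INTO withdrawal VALUES (100, 10, 50.0), (101, 10, 30.0), (102, 20, 20.0);
""")

# Typical OLTP work: a simple, predefined statement that records one event.
cur.execute("INSERT INTO withdrawal VALUES (103, 20, 40.0)")

# Warehouse side: a denormalized copy built by resolving the joins once.
cur.execute("""
CREATE TABLE dw_withdrawals AS
SELECT c.city AS city, w.amount AS amount
FROM withdrawal w
JOIN account a ON a.account_id = w.account_id
JOIN customer c ON c.customer_id = a.customer_id
""")

# Typical warehouse work: an aggregate query that needs no joins.
totals_by_city = dict(
    cur.execute(
        "SELECT city, SUM(amount) FROM dw_withdrawals GROUP BY city"
    ).fetchall()
)
print(totals_by_city)
```

In a real system the warehouse copy would live in a separate database and be refreshed on a schedule, so that analytical queries never compete with transaction recording.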

Ques 2 With the necessary diagram, explain the Data Warehouse Development Life Cycle.

Ans The Data Warehouse project

As an IT professional, you have worked on application projects before. You know what goes on in these projects and are aware of the methods needed to build the applications from planning through implementation. You have been part of the analysis, the design, the programming, or the testing phases. If you have functioned as a project manager or a team leader, you know how projects are monitored and controlled. A project is a project. If you have seen one IT project, have you not seen them all?

The answer is not a simple yes or no; Data Warehouse projects are different from projects that build transaction processing systems. If you are new to Data Warehousing, your first Data Warehouse project will reveal the major differences. We will discuss these differences and also consider ways to react to them. We will also ask a basic question about the readiness of the IT and user departments to launch a Data Warehouse project.

How about the traditional system development life cycle (SDLC) approach? Can we use this approach for Data Warehouse projects as well? If so, what are the development phases in the life cycle?

Data Warehouse Development Life Cycle

The Data Warehouse development life cycle covers two vital areas. One is warehouse management and the other is data management. The former deals with defining the project activities and gathering requirements, whereas the latter deals with modeling and designing the Warehouse.

Life Cycle of Data Warehouse Development

Life Cycle steps of a DWH (SDLC)

Ques 3 What is Metadata? What is its use in Data Warehouse Architecture?

Ans Acquisition metadata maps the translation of information from the operational system to the analytical system. This includes an extract history describing data origins, updates, algorithms used to summarize data, and frequency of extractions from operational systems. Transformation metadata includes a history of data transformations, changes in names, and other physical characteristics. Access metadata provides navigation and graphical user interfaces that allow non-technical business users to interact intuitively with the contents of the warehouse.

On top of these three types of metadata, a warehouse needs basic operational metadata, such as procedures on how a data warehouse is used and accessed, procedures on monitoring the growth of the data warehouse relative to the available storage space, and authorizations on who is responsible for and who has access to the data in the data warehouse and data in the operational system.

Technical Metadata

Technical metadata is the metadata concerned with the characteristics of the information system. It focuses on the granularity of the data. Technical metadata (ETL process metadata, back room metadata, transformation metadata) is a representation of the ETL process. It stores data mappings and transformations from source systems to the data warehouse and is mostly used by data warehouse developers, specialists and ETL modelers. Most commercial ETL applications provide a metadata repository with an integrated metadata management system to manage the ETL process definition. The definition of technical metadata is usually more complex than that of business metadata and it sometimes involves multiple dependencies.

The technical metadata can be structured in the following way:

Source: database or system definition. It can be a source system database, another Data Warehouse, a file system, etc.
Target Database: Data Warehouse instance.
Source Tables: one or more tables which are input to calculate a value of the field.
Source Columns: one or more columns which are input to calculate a value of the field.
Target Table: target DW table. The target table and column are always single in a metadata repository.
Target Column: target DW column.
Transformation: the descriptive part of a metadata entry. It usually contains a lot of information, so it is important to use a common standard throughout the organization to keep the data consistent.
Field-to-field mappings between sources and target.
Number of scanned reports and ad-hoc reports.
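The structure listed above can be sketched as a small record type. This is only an illustration, not the layout of any particular ETL tool's repository: the class name MappingEntry, the helper lineage, and all sample values are hypothetical. Each entry has exactly one target table and column but may have several source tables and columns.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class MappingEntry:
    """One technical-metadata entry (hypothetical layout)."""
    source: str                 # source database or system definition
    target_database: str        # Data Warehouse instance
    source_tables: List[str]    # one or more input tables
    source_columns: List[str]   # one or more input columns
    target_table: str           # always a single target table
    target_column: str          # always a single target column
    transformation: str         # descriptive part of the entry


def lineage(entries: List[MappingEntry], table: str, column: str) -> List[str]:
    """Return the source columns that feed one warehouse column."""
    return [
        source_column
        for entry in entries
        if entry.target_table == table and entry.target_column == column
        for source_column in entry.source_columns
    ]


# Sample repository content (made-up values for illustration only).
repository = [
    MappingEntry(
        source="billing_db",
        target_database="dw_prod",
        source_tables=["invoice", "invoice_line"],
        source_columns=["invoice_line.qty", "invoice_line.unit_price"],
        target_table="fact_sales",
        target_column="sales_amount",
        transformation="SUM(qty * unit_price) per invoice",
    ),
    MappingEntry(
        source="crm_db",
        target_database="dw_prod",
        source_tables=["customer"],
        source_columns=["customer.full_name"],
        target_table="dim_customer",
        target_column="customer_name",
        transformation="TRIM and title-case",
    ),
]

print(lineage(repository, "fact_sales", "sales_amount"))
```

Keeping the target single and the sources plural mirrors the rule that one warehouse column may be calculated from several inputs, which is what makes a lineage lookup like this possible.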

Ques 4 What is Surrogate key? When do we need it in data warehouse implementation?

Ans An important distinction between a surrogate and a primary key depends on whether the database is a current database or a temporal database. Since a current database stores only currently valid data, there is a one-to-one correspondence between a surrogate in the modelled world and the primary key of some object in the database. In this case the surrogate may be used as a primary key, resulting in the term surrogate key. In a temporal database, however, there is a many-to-one relationship between primary keys and the surrogate. Since there may be several objects in the database corresponding to a single surrogate, we cannot use the surrogate as a primary key; another attribute is required, in addition to the surrogate, to uniquely identify each object.

Although Hall et al. (1976) say nothing about this, others have argued that a surrogate should have the following characteristics:

1. the value is unique system-wide, hence never reused
2. the value is system generated
3. the value is not manipulable by the user or application
4. the value contains no semantic meaning
5. the value is not visible to the user or application
6. the value is not composed of several values from different domains.

The main reason for building a Data Warehouse application is to make data available to business users. Users know the data best, and their participation in the testing effort is a key component of the success of a Data Warehouse implementation. User Acceptance Testing (UAT) typically focuses on data loaded to the Data Warehouse and any views that have been created on top of the tables, not the mechanics of how the ETL application works. Consider the following strategies:

Use data that is either from production or as near to production data as possible. Users typically find issues once they see the "real" data, sometimes leading to design changes.
Test database views by comparing view contents to what is expected. It is important that users sign off and clearly understand how the views are created.
Plan for the system test team to support users during UAT. The users will likely have questions about how the data is populated and need to understand details of how the ETL works.
Consider how the users would require the data loaded during UAT and negotiate how often the data will be refreshed.

Ques 5 What is Data Loading? Explain the Full Refresh Loading.

Ans Two distinct groups of tasks form the data loading function. When you complete the design and construction of the Data Warehouse and go live for the first time, you do the initial loading of the data into the Data Warehouse storage. The initial load moves large volumes of data and takes a substantial amount of time. As the Data Warehouse starts functioning, you continue to extract the changes to the source data, transform the data revisions, and feed the incremental data revisions on an ongoing basis. The figure below illustrates the common types of data movements from the staging area to the Data Warehouse storage.

Data Movements
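The difference between a full refresh and an ongoing incremental load can be sketched as follows. The sketch assumes Python's built-in sqlite3 module; the table dw_sales and the functions full_refresh and incremental_load are hypothetical names chosen for illustration. A full refresh is taken here to mean erasing the contents of the target table and reloading it completely from the staging data, while the incremental load applies only new or changed rows.

```python
import sqlite3


def full_refresh(conn, staging_rows):
    """Erase the target table and reload it completely from staging."""
    with conn:  # one transaction: readers never see a half-loaded table
        conn.execute("DELETE FROM dw_sales")
        conn.executemany(
            "INSERT INTO dw_sales (sale_id, amount) VALUES (?, ?)",
            staging_rows,
        )


def incremental_load(conn, changed_rows):
    """Apply only new or changed rows on top of the existing contents."""
    with conn:
        conn.executemany(
            "INSERT OR REPLACE INTO dw_sales (sale_id, amount) VALUES (?, ?)",
            changed_rows,
        )


def current_rows(conn):
    """Return the target table contents in key order."""
    return conn.execute(
        "SELECT sale_id, amount FROM dw_sales ORDER BY sale_id"
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dw_sales (sale_id INTEGER PRIMARY KEY, amount REAL)")

# Initial load, done here as a full refresh of an empty table.
full_refresh(conn, [(1, 10.0), (2, 20.0)])

# Ongoing load: sale 2 was revised and sale 3 is new.
incremental_load(conn, [(2, 25.0), (3, 30.0)])
after_incremental = current_rows(conn)

# A later full refresh discards everything and keeps only the staged rows.
full_refresh(conn, [(4, 5.0)])
after_refresh = current_rows(conn)
```

A full refresh is simple and leaves no stale rows behind, but it moves the whole data volume every time, which is why incremental loads are usually preferred once the warehouse is in regular operation.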

Data Storage Component

The data storage for the Data Warehouse is a separate repository. The operational systems of your enterprise support the day-to-day operations. These are online transaction processing applications. The data repositories for the operational systems typically contain only the current data. Also, these data repositories contain the data structured in highly normalized formats for fast and efficient processing. In contrast, in the data repository for a Data Warehouse, you need to keep large volumes of historical data for analysis. Further, you have to keep the data in the Data Warehouse in structures suitable for analysis, and not for quick retrieval of individual pieces of information. Therefore, the data storage for the Data Warehouse is kept separate from the data storage for operational systems.

In your databases supporting operational systems, the updates to data happen as transactions occur. These transactions hit the databases in a random fashion. How and when the transactions change the data in the databases is not completely within your control. The data in the operational databases could change from moment to moment. When your analysts use the data in the Data Warehouse for analysis, they need to know that the data is stable and that it represents snapshots at specified periods. As they are working with the data, the data storage must not be in a state of continual updating. For this reason, the Data Warehouses are read-only data repositories.

Generally, the database in your Data Warehouse must be open. Depending on your requirements, you are likely to use tools from multiple vendors. The Data Warehouse must be open to different tools. Most of the Data Warehouses employ relational database management systems. Many of the Data Warehouses also employ multidimensional database management systems. Data extracted from the Data Warehouse storage is aggregated in many ways and the summary data is kept in the multidimensional databases (MDDBs).
Such multidimensional database systems are usually proprietary products.

Information Delivery Component

Who are the users that need information from the Data Warehouse? The range is fairly comprehensive. The new user comes to the Data Warehouse with no training and, therefore, needs prefabricated reports and preset queries. The casual user needs information once in a while, not regularly. This type of user also needs prepackaged information. The business analyst looks for the ability to do complex analysis using the information in the Data Warehouse. The power user wants to be able to navigate throughout the Data Warehouse, pick up interesting data, format his or her own queries, drill through the data layers, and create custom reports and ad hoc queries.

In order to provide information to the wide community of Data Warehouse users, the information delivery component includes different methods of information delivery. The figure below shows the different information delivery methods.

Information Delivery methods


Ad hoc reports are predefined reports primarily meant for novice and casual users. Provisions for complex queries, multidimensional (MD) analysis, and statistical analysis cater to the needs of the business analysts and power users. Information fed into Executive Information Systems (EIS) is meant for senior executives and high-level managers. Some Data Warehouses also provide data to data-mining applications. Data-mining applications are knowledge discovery systems where the mining algorithms help you discover trends and patterns from the usage of your data.

In your Data Warehouse, you may include several information delivery mechanisms. Most commonly, you provide for online queries and reports. The users will enter their requests online and will receive the results online. You may set up delivery of scheduled reports through e-mail or you may make adequate use of your organization's intranet for information delivery. Recently, information delivery over the Internet has been gaining ground.

Ques 6 Which Data Quality factors affect a Data Warehouse? Explain them.

Ans Data quality in Data Warehouse

Data Warehouse Components

The DWQ project will provide a neutral architectural reference model covering the design, the setting-up, the operation, the maintenance, and the evolution of data warehouses. Figure 6.1 illustrates the basic components and their relationships as seen in current practice. The terms used in this figure can be briefly explained as follows:

Sources: any data store whose content is subject to be materialized in a data warehouse.
Wrappers: load the source data into the warehouse.
Destination databases: data warehouses and data marts.
Meta database: repository for information about the other components, e.g. the schema of the source data.
Agents: used for administration (data warehouse design, scheduler for initiating updates, etc.).
Clients: display the data, for example statistical packages.

Structure of a Data Warehouse


The Linkage to Data Quality: DWQ provides assistance to DW designers by linking the main components of DW reference architecture to a formal model of data quality. Main differences to the initial model lie in the greater emphasis on historical as well as aggregated data. A data quality policy is the overall intention and direction of an organization with respect to issues concerning the quality of data products. Data quality management is the management function that determines and implements the data quality policy. A data quality system encompasses the organizational structure, responsibilities, procedures, processes and resources for implementing data quality management. Data quality control is a set of operational techniques and activities which are used to attain the quality required for a data product. Data quality assurance includes all the planned and systematic actions necessary to provide adequate confidence that a data product will satisfy a given set of quality requirements.

Quality Factors in Data Warehousing


Types of Data Quality Problems

The following quality problems occur during data warehouse creation. All of them have to be rectified during ETL processing.

Dummy values in source system fields
Absence of data in source system fields
Multipurpose fields
Cryptic data
Contradicting data
Improper use of name and address lines
Violation of business rules
Reused primary keys
Non-unique identifiers
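Three of the problems listed above (dummy values, absence of data, and non-unique identifiers) lend themselves to simple automated checks during ETL. The following is a minimal sketch under stated assumptions: the function find_quality_problems, the set DUMMY_VALUES, and the sample rows are all hypothetical, and a real project would agree its own list of dummy values with the business users. The remaining problem types, such as cryptic or contradicting data, usually need rules specific to the source system.

```python
from collections import Counter

# Placeholder values treated as dummies (an assumption for illustration).
DUMMY_VALUES = {"N/A", "UNKNOWN", "9999", "XXX"}


def find_quality_problems(rows, key_field, required_fields):
    """Return (row_index, field, problem) tuples for three basic checks."""
    problems = []
    key_counts = Counter(row.get(key_field) for row in rows)
    for index, row in enumerate(rows):
        for field in required_fields:
            value = row.get(field)
            if value is None or str(value).strip() == "":
                problems.append((index, field, "absent"))
            elif str(value).strip().upper() in DUMMY_VALUES:
                problems.append((index, field, "dummy"))
        if key_counts[row.get(key_field)] > 1:
            problems.append((index, key_field, "non-unique"))
    return problems


# Made-up source rows: a dummy phone number, a missing name, a repeated id.
source_rows = [
    {"id": 1, "name": "Asha", "phone": "9999"},
    {"id": 1, "name": None, "phone": "0201234"},
    {"id": 2, "name": "Ravi", "phone": "555"},
]

for problem in find_quality_problems(source_rows, "id", ["name", "phone"]):
    print(problem)
```

Reporting the row index and field with each finding lets the ETL process either reject the row or route it to a correction step instead of loading it silently.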
