Nirali DWM (Unit 1)

Unit 1 introduction of Data warehousing

Uploaded by

harshadchatte
Copyright
© All Rights Reserved
* Describe need and architecture for the given data warehouse.
* Explain the benefits of data warehousing of the given application.
* Describe the given data warehouse models.

* To understand Basic Concepts in Data Warehouse and Data Warehousing.
* To learn Architecture of Data Warehousing.
* To study various Data Warehouse Models (Data Mart, Virtual Warehouse, etc.).
* To understand ETL Concept in Data Warehouse.
* To know Benefits of Data Warehousing.

... very complex and tedious. Consequently, a number of methods, techniques and tools were developed to solve that problem. These included decentralized processing, extract processing, Executive Information Systems (EIS), query tools, relational databases, etc. The need for timely and accurate decisions also led to the development of Decision Support Systems (DSSs).

Data warehousing began to grow explosively starting in the mid-nineties. It is still characterized by high growth.

Data warehousing is the process of constructing and using a data warehouse. The data warehouse is a basis for informational processing. Data warehousing enables easy organization and maintenance of large data in addition to fast retrieval and analysis in the manner and depth required from time to time.

Data Warehousing, OnLine Analytical Processing (OLAP) and Data Mining represent some of the latest trends in computing environment and Information Technology (IT) applications to large-scale processing and analysis of data.

Data mining is the process of discovering new information out of data in a data warehouse, information which cannot be retrieved within the operational system.
Data mining refers to the extraction of useful information from a bulk of data or data warehouses. It is the computational process of discovering patterns in large data sets involving methods at the intersection of artificial intelligence, machine learning, statistics, and database systems.

Data warehousing is the process of constructing and using a data warehouse. A data warehouse is constructed by integrating data from multiple heterogeneous sources that support analytical reporting, structured and/or ad hoc queries, and decision making. Data warehousing involves data cleaning, data integration, and data consolidations.

Data Warehouse

A database contains information organized in columns, rows and tables that is periodically indexed to make relevant information easier to access. Many enterprises and organizations create and manage databases using a Database Management System (DBMS). Special DBMS software can be used to create and store product inventory and customer information.

Organizations most often use databases for OnLine Transaction Processing (OLTP). A database was built to store current transactions and enable fast access to specific transactions for ongoing business processes.

Most enterprises generate very large amounts of data from their OLTP systems, Point-of-Service (POS) systems, financial ATMs and the Web. The challenge faced by these enterprises/organizations with regard to this massive data-rich but information-poor collection is to extract valuable information and make it available at a particular time, place and in the form needed to support the decision-making process.

* A data warehouse is a subject oriented, integrated, time-variant, and non-volatile collection of data. This data helps analysts to take informed decisions in an organization.
* A Data Warehouse (DW) is a collection of technologies aimed at enabling the knowledge worker (executive, manager, and analyst) to make better and faster decisions.
* It is expected to have the right information in the right place at the right time with the right cost in order to support the right decision.
* Data warehouses and databases are both relational data systems, but were built to serve different purposes. A data warehouse is built to support complex queries across all the data, typically using OnLine Analytical Processing (OLAP).
* A data warehouse is a database, which is kept separate from the organization's operational database.
* A data warehouse is a collection of data specific to the entire organization.
* A data warehouse is a decision-support environment that leverages data stored in different sources, organizing it and delivering it to decision makers across the enterprise, regardless of their platform or technical skill level.
* It also provides the appropriate tools to extract specific data, convert it into business information, and monitor for changes; hence, it is possible to use this information to make insightful decisions.
* A data warehouse can provide consistent data.
* Data warehousing involves processes that extract the data, transform it, integrate it, remove any flaws and inconsistencies, store it in a data warehouse, and provide end users with access to the data so that they can carry out complex data analysis and prediction queries.
* A data warehouse ensures the consistency of management rules and conventions applied to the data.

Characteristics of Data Warehouse

The formal definition of a data warehouse by W. H. Inmon is given below:

"A data warehouse is a subject-oriented, integrated, time-varying, non-volatile collection of data in support of the management's decision-making process."

The key characteristics/features of a data warehouse are discussed below:

1. Subject Oriented: A data warehouse is subject oriented because it provides information around a subject rather than the organization's ongoing operations. These subjects can be product, customers, suppliers, sales, revenue, etc.
A data warehouse does not focus on the ongoing operations; rather, it focuses on modelling and analysis of data for decision making.
2. Integrated: A data warehouse is constructed by integrating data from heterogeneous sources such as relational databases, flat files, etc. This integration enhances the effective analysis of data.
3. Time Variant: The data collected in a data warehouse is identified with a particular time period. The data in a data warehouse provides information from the historical point of view.
4. Non-volatile: Non-volatile means the previous data is not erased when new data is added to it. A data warehouse is always a physically separated store of data. Due to this separation, a data warehouse does not require transaction processing, recovery, concurrency control, etc. A data warehouse is kept separate from the operational database and therefore frequent changes in the operational database are not reflected in the data warehouse.

Difference between Operational Database System and Data Warehouse

Operational database management systems, also called OLTP (OnLine Transaction Processing) databases, are used to manage dynamic data in real-time. Following table shows the differences between OLTP and data warehouse.

Sr. No. | Operational Database System (OLTP) | Data Warehouse (OLAP)
1. | Operational database systems are designed to support high-volume transaction processing. | Data warehousing systems are typically designed to support high-volume analytical processing (i.e., OLAP).
2. | Operational database systems are usually concerned with current data. | Data warehousing systems are usually concerned with historical data.
3. | Data within operational systems are mainly updated regularly according to need. | Non-volatile; new data may be added regularly. Once added, the data is rarely changed.
4. | It is designed for real-time business dealing and processes. | It is designed for analysis of business measures by subject area, categories, and attributes.
5. | It is optimized for a simple set of transactions, generally adding or retrieving a single row at a time per table. | It is optimized for extent loads and high, complex, unpredictable queries that access many rows per table.
6. | It is optimized for validation of incoming information during transactions, and uses validation data tables. | Loaded with consistent, valid information; requires no real-time validation.
7. | It supports thousands of concurrent clients. | It supports a few concurrent clients relative to OLTP.
8. | Operational database systems are widely functional or process-oriented. | Data warehousing systems are widely subject-oriented.
9. | Operational systems are usually optimized to perform fast inserts and updates of relatively small volumes of data. | Data warehousing systems are optimized to perform fast retrievals of relatively high volumes of data.
10. | Operational database system focuses on Data In. | Data warehousing system focuses on Data Out.
11. | Less number of data accessed. | Large number of data accessed.
12. | Relational databases are created for OLTP. | Data warehouse designed for OLAP.
13. | Data integration in operational database is application based. | Data integration in data warehouse is subject based.
14. | It provides detailed and flat relational view of data. | It provides summarized and multidimensional view of data.

Components of Data Warehouse

A data warehouse consists of various components (building blocks) which are arranged in an optimum way to get the maximum benefit out of it. The arrangement of these components mainly depends on certain circumstances and requirements of an organization.

Fig. 1.1: Components of Data Warehouse

The basic components of a data warehouse are explained below:

1. Source Data: The source data comprises the data coming into the data warehouse from different areas. This source data can be grouped into following four categories:
(i) Production Data: The data in this category comprise different segments of data chosen from the various operational systems of the enterprise. The data are selected on the basis of requirements in the data warehouse.
(ii) Internal Data: In every organization, users keep their personal data such as private spreadsheets, documents, customer profiles, departmental databases, etc. All these data form the internal data which could be useful in a data warehousing environment.
(iii) Archived Data: This is a compiled form of older data which is stored in archived files such as flat files, etc. As a data warehouse contains data spanning many years, data warehouse data are archived from time to time, depending on the requirement.
(iv) External Data: It refers to those data sources which are available outside the organization. The production, internal and archived data give an insight into what the organization is doing presently and has done in the past. However, by including external data in the data warehouse, executives can spot current trends prevailing in the market and can compare the performance of their organizations with others.

2. Data Staging Component: After extracting data from various operational systems and external sources, data need to be prepared for storing in the data warehouse. The data staging component of the data warehouse helps in making data ready to be stored in the data warehouse. It comprises following three main functions:
(i) Data Extraction: This function deals with a large number of data sources, and for each data source an appropriate technique must be applied. Data extraction is a complex and critical task, as source data may be from different source machines in various data formats and with different data models (relational, network or hierarchical). Various data extraction tools are available in the market today. The organization can use these tools for certain data sources. However, these tools incur high initial cost. For some other data sources, in-house programs can also be developed. But, these may incur development and maintenance cost. The data are extracted into a separate physical environment, such as a data staging relational database, from which moving the data into the warehouse would be easier.
(ii) Data Transformation: The data sources may contain some minor errors and inconsistencies. For example, the names are often misspelled, and street, area, or city names are incorrectly formatted or zip codes are entered incorrectly. These incorrect data need to be corrected to minimize the errors and fill in the missing information. The process of correcting and preprocessing the data is called data cleansing. The errors can be reduced to some reasonable level by looking up a database containing street names and zip codes in each city. The approximate matching of data required for this task is referred to as fuzzy lookup. In some cases, the data managers in the organization want to upgrade their data with the cleaned data. This process is known as back-flushing. These data are then transformed to accommodate semantic mismatches.
(iii) Data Loading: The cleaned and transformed data are finally loaded into the warehouse. Data are partitioned, and indexes or other access paths are built for fast and efficient retrieval of data. Data loading is a slow process due to the large volume of data. For instance, loading a terabyte of data sequentially can take weeks and a gigabyte can take hours. Thus, parallelism is important for loading warehouses. The raw data generated by a transaction processing system may be too large to store in a data warehouse; therefore some data can be stored in a summarized form. Thus, additional preprocessing such as sorting and generation of summarized data is performed at this stage.

This entire process of getting data into the data warehouse is called the Extract, Transform and Load (ETL) process. Once the data is loaded into a warehouse, it must be periodically refreshed to reflect the updates on the relations at the data sources and periodically purged of old data.

3. Data Storage Component: This component consists of a separate repository for storing desired data in the data warehouse. In the data repository of a warehouse, huge amounts of historical data are kept along with the current data in specific structures suitable for analysis; however, these repositories are made read-only in the data warehouse. This is because, for analysis, the data storage must not be in a state where continual updates are made to it.

4. Information Delivery Component: This component includes different methods for rendering information to the wide group of data warehouse users. Some common methods include ad hoc reports, Multi-Dimensional (MD) analysis, statistical analysis, Executive Information System (EIS) feed and data mining applications.

5. Metadata Component: Metadata is the data about the data. The metadata stores data in a similar way as the data dictionary or data catalogue does in a DBMS, but it also keeps information about the logical data structures, files, addresses, indexes, etc.

6. Management and Control Component: This component of the data warehouse manages and coordinates the various services and activities within the data warehouse from the beginning to the end. It also works with the database management system and enables data to be stored in the repositories. It controls the data transformation into the data warehouse and moderates information delivery to the users. It also supervises the movement of data into the staging area and from there into the data warehouse storage itself. While performing these functions, it interacts with the metadata component, as metadata is the source of information for the management module.

* Data warehousing technology is becoming essential for effective business intelligence as it enables easy organization and maintenance of large data in addition to fast retrieval and analysis in the manner and depth required from time to time.
Following points show the need and importance of data warehouse:
1. Data warehouse helps business users to access critical data from some sources all in one place.
2. It provides consistent information on various cross-functional activities.
3. Data warehousing helps to integrate many sources of data to reduce stress on the production system.
4. Data warehousing helps users to reduce total turnaround time for analysis and reporting.
5. Data warehousing helps users to access critical data from different sources in a single place, so it saves the user's time of retrieving data information from multiple sources. We can also access data from the cloud easily.
6. Data warehousing allows storing a large amount of historical data to analyze different periods and trends to make future predictions.
7. Data warehousing enhances the value of operational business applications and customer relationship management systems.
8. It separates analytics processing from transactional databases, improving the performance of both systems.
9. Data warehousing provides more accurate reports.

Data warehousing provides architectures that let organizations systematically organize, understand, and use their data to make strategic decisions. Data warehouse architecture depends upon the organization's situation.

The following architecture properties are essential for a data warehouse system:
1. Separation: Analytical and transactional processing should be kept apart as much as possible.
2. Scalability: Hardware and software architectures should be easy to upgrade as the data volume, which has to be managed and processed, and the number of users' requirements, which have to be met, progressively increase.
3. Extensibility: The architecture should be able to host new applications and technologies without redesigning the whole system.
4. Security: Monitoring accesses is essential because of the strategic data stored in data warehouses.
5. Administerability: Data warehouse management should not be overly difficult.

There are following three common types of data warehouse architectures:
1. Basic architecture (single-layer) of data warehouse.
2. Two layer architecture of data warehouse.
3. Three layer architecture of data warehouse.

Single Layer Architecture of Data Warehouse

* Single layer architecture is a basic architecture of data warehouse which is not frequently used today. Single layer architecture is shown in Fig. 1.2.
* In single layer architecture, end users can access data directly from the various sources through the data warehouse.
* The goal of basic architecture of data warehouse is to minimize the amount of data stored by removing data redundancies.
* The basic structure lets end users of the warehouse directly access data from the source systems and perform analysis, reporting, and mining on that data. This structure is useful when data sources derive from the same types of database systems.

Fig. 1.2: Basic Architecture of Data Warehouse

* The weakness of the basic architecture lies in its failure to meet the requirement for separation between analytical and transactional processing.
* Analysis queries are submitted to operational data after the middleware interprets them. In this way, the queries affect regular transactional workloads.
* In addition, although the basic architecture can meet the requirement for integration and correctness of data, it cannot log more data than the sources do.
* For these reasons, a virtual approach to data warehouses can be successful only if analysis needs are particularly restricted and the data volume to analyze is huge.

Two Layer Architecture of Data Warehouse

* The requirement for separation plays a fundamental role in defining the typical architecture of a data warehouse system, as shown in Fig. 1.3.
* Fig. 1.3 shows the two-level (two-layer) architecture of data warehouse to highlight the separation between physically available sources and data warehouses. This separation is important to clean and process operational data. Basically, it consists of following five stages:
1. Data Source: The source data is heterogeneous; it might be operational data, flat files, etc.
2. Staging Area: In this area, data stored in sources should be extracted and cleansed to remove inconsistencies and fill gaps before it is loaded into the warehouse.
3. Warehouse: It is a centralized repository which can be accessed directly, but it can also be used as a source for creating data marts.
4. Data Marts: A data mart is a partial copy of the organization's data and is designed for a specific purpose like purchasing, sales, inventory, etc.
5. Users: End users can access the processed reports, analyze them and mine them.

Fig. 1.3: Two-Level Architecture of Data Warehouse

Benefits of a Two-Layer Architecture:
1. In data warehouse systems, good quality information is always available, even when access to sources is denied temporarily for technical or organizational reasons.
2. Data warehouse analysis queries do not affect the management of transactions, the reliability of which is vital for enterprises to work properly at an operational level.
3. Data warehouses are logically structured according to the multidimensional model, while operational sources are generally based on relational or semi-structured models.

Three Tier Data Warehouse Architecture

Generally, a data warehouse adopts a three tier (layer/level) architecture. Fig. 1.4 shows the three tier architecture of data warehouse. Following are the three tiers of the data warehouse architecture:

Bottom Tier: The bottom tier of the architecture is the data warehouse database server. It is the relational database system.
We use the back end tools and utilities to feed data into the bottom tier. These back end tools and utilities perform the Extract, Clean, Load, and Refresh functions. The following are the functions of data warehouse tools and utilities:
(i) Data Extraction: Involves gathering data from multiple heterogeneous sources.
(ii) Data Cleaning: Involves finding and correcting the errors in data.
(iii) Data Transformation: Involves converting the data from legacy format to warehouse format.
(iv) Data Loading: Involves sorting, summarizing, consolidating, checking integrity, and building indices and partitions.
(v) Refreshing: Involves updating from data sources to warehouse.

Middle Tier: In the middle tier, we have the OLAP Server that can be implemented in either of the following ways:
(i) By Relational OLAP (ROLAP), which is an extended relational database management system. The ROLAP maps the operations on multidimensional data to standard relational operations.
(ii) By Multidimensional OLAP (MOLAP) model, which directly implements the multidimensional data and operations.

Top Tier: This tier is the front-end client layer. This layer holds the query tools and reporting tools, analysis tools and data mining tools.

Fig. 1.4: Three Layer Architecture of Data Warehouse

The main advantage of the reconciled data layer is that it creates a common reference data model for a whole enterprise. At the same time, it sharply separates the problems of source data extraction and integration from those of data warehouse population.

A data warehouse is an electronic system that gathers data from a wide range of sources within a company and uses the data to support management decision-making.

Data warehouse architecture is changing with time. There are two types of data warehouse architectures, namely, traditional data warehouse and cloud-based architectures.

Traditional data warehouse architecture employs a three-tier structure (as explained earlier), composed of the Bottom tier, Middle tier and Top tier.
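The statement that ROLAP maps operations on multidimensional data to standard relational operations can be made concrete with a small sketch. It assumes Python's built-in sqlite3 module as a stand-in relational engine; the sales table, its columns, and the helper names roll_up and slice_cube are hypothetical and chosen only for illustration. Roll-up and slice are used here as two typical multidimensional operations; they are not taken from the text above.

```python
import sqlite3

# Hypothetical dimensions of a tiny sales "cube" stored as a relational table.
ALLOWED_DIMENSIONS = ("product", "region", "year")

def build_sales_db():
    """Create an in-memory table with one row per (product, region, year) cell."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sales (product TEXT, region TEXT, year INTEGER, amount REAL)"
    )
    rows = [
        ("pen", "east", 2023, 100.0),
        ("pen", "west", 2023, 150.0),
        ("book", "east", 2023, 200.0),
        ("book", "east", 2024, 250.0),
    ]
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", rows)
    return conn

def roll_up(conn, dimensions):
    """Roll-up: aggregate the measure over the kept dimensions using GROUP BY."""
    for d in dimensions:
        if d not in ALLOWED_DIMENSIONS:
            raise ValueError(f"unknown dimension: {d}")
    cols = ", ".join(dimensions)
    sql = f"SELECT {cols}, SUM(amount) FROM sales GROUP BY {cols} ORDER BY {cols}"
    return conn.execute(sql).fetchall()

def slice_cube(conn, year):
    """Slice: fix one dimension to a single value using a WHERE clause."""
    return conn.execute(
        "SELECT product, region, amount FROM sales "
        "WHERE year = ? ORDER BY product, region",
        (year,),
    ).fetchall()

conn = build_sales_db()
by_product = roll_up(conn, ["product"])
by_region_year = roll_up(conn, ["region", "year"])
slice_2024 = slice_cube(conn, 2024)
```

A MOLAP server would instead keep the cells in a dedicated multidimensional structure, so no translation to GROUP BY or WHERE would be needed.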
In a traditional architecture, there are three common data warehouse models: virtual warehouse, data mart, and enterprise data warehouse.

Cloud-based data warehouses do not adhere to the traditional architecture; each data warehouse offering has a unique architecture.

The view over an operational data warehouse is known as a virtual warehouse. A virtual data warehouse is a set of separate databases, which can be queried together, so a user can effectively access all the data as if it was stored in one data warehouse.

The operational data warehouse can be a virtual but complex component of an Enterprise Data Warehouse (EDW). It is also a multi-purpose structure that enables transactional and decision support processing. Because the data originates from multiple sources, the integration often involves cleaning, resolving redundancy, and checking it against business rules for integrity.

A data mart is a subset or an aggregation of the data stored in a primary data warehouse. A data mart includes a set of information pieces relevant to a specific business area, corporate department, or category of users.

A data mart is a subject-specific data warehouse that is usually set up to meet the information needs of users of a particular department or functional unit within an organization. The size of a data mart, therefore, is generally many times smaller than an enterprise data warehouse.

Data marts contain a subset of organization-wide data that is valuable to specific groups of people in an organization. In other words, a data mart contains only those data that are specific to a particular group. For example, the marketing data mart may contain only data related to items, customers and sales. Data marts are confined to subjects.

A data mart can be called a subset of a data warehouse or a sub-group of corporate-wide data corresponding to a certain set of users. Fig. 1.5 shows a graphical representation of data marts.
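The marketing example above can be sketched in a few lines. This is only an illustration, assuming Python's built-in sqlite3 module; the table names warehouse and marketing_mart, the view payroll_view, and the sample rows are all hypothetical. The data mart is built as a physical copy of selected subjects, while the view shows the virtual-warehouse idea of querying the source without copying it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical organization-wide warehouse table, one row per record.
conn.execute("CREATE TABLE warehouse (subject TEXT, record_id INTEGER, detail TEXT)")
conn.executemany(
    "INSERT INTO warehouse VALUES (?, ?, ?)",
    [
        ("sales", 1, "order 1001"),
        ("customers", 2, "customer A"),
        ("items", 3, "item X"),
        ("payroll", 4, "salary run"),
        ("inventory", 5, "stock count"),
    ],
)

# Data mart: a physical copy of only the subjects the marketing group needs.
MARKETING_SUBJECTS = ("items", "customers", "sales")
placeholders = ", ".join("?" for _ in MARKETING_SUBJECTS)
conn.execute("CREATE TABLE marketing_mart (subject TEXT, record_id INTEGER, detail TEXT)")
conn.execute(
    "INSERT INTO marketing_mart "
    "SELECT subject, record_id, detail FROM warehouse "
    f"WHERE subject IN ({placeholders})",
    MARKETING_SUBJECTS,
)

# Virtual approach: a view copies nothing; each query runs against the source table.
conn.execute(
    "CREATE VIEW payroll_view AS "
    "SELECT record_id, detail FROM warehouse WHERE subject = 'payroll'"
)

mart_subjects = sorted(r[0] for r in conn.execute("SELECT subject FROM marketing_mart"))
view_rows = conn.execute("SELECT * FROM payroll_view").fetchall()
warehouse_count = conn.execute("SELECT COUNT(*) FROM warehouse").fetchone()[0]
mart_count = conn.execute("SELECT COUNT(*) FROM marketing_mart").fetchone()[0]
```

The mart ends up holding fewer rows than the warehouse, which matches the point that a data mart is many times smaller than the enterprise data warehouse it is drawn from.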
A data mart can be implemented using a top-down or bottom-up approach. In the former, which is called a dependent data mart, data is drawn directly from an enterprise data warehouse. In the latter, which is called an independent data mart, individual data marts are built by capturing and transforming data from existing local operational databases in a department or business area.

Fig. 1.5: Graphical Representation of Data Marts

Types of Data Mart:

Depending on the source of data, data marts can be categorized as independent data mart or dependent data mart. Independent data marts are sourced from data captured from one or more operational systems or external information providers, or from data generated locally within a particular department or geographic area. Dependent data marts are sourced directly from enterprise data warehouses.

* A dependent data mart is one whose source is another data warehouse, and all dependent data marts within an organization are typically fed by the same source - the enterprise data warehouse.

Fig. 1.6: Dependent Data Mart (Data Mart exists with Data Warehouse)

* An independent data mart is one whose source is directly from transaction systems, legacy applications, or external data feeds. An independent data mart can collect data directly from different sources.

Fig. 1.7: Independent Data Mart

Advantages of Data Marts:
1. Building a data mart is simpler as compared to implementing a corporate data warehouse.
2. Data marts are small in size.
... machines to allow users to break away from profoundly powered machines and still handle processing of the reports.
5. The cost of implementing a data mart is far less when compared to building a data warehouse.
6. Data marts are flexible.

Disadvantages of Data Marts:
1. Increase in the size of data marts results in performance deterioration and data inconsistency, and creates problems when the data warehouse needs to be upgraded.
2. The data marts are frequently short-term, temporary solutions that are not part of a corporate architecture.
3. Development can be unorganized, which creates problems when data marts are used as building blocks for creating an enterprise data warehouse.
4. The process of data access, consolidation and cleaning in data marts becomes very difficult because data marts focus on individual needs.
5. Their design is not as thorough as with a data warehouse due to limited consideration for an ultimate upgrade to an enterprise system.
6. They can be expensive in the long-term process as activities such as extraction and processing can get duplicated. Then, additional persons will be required for maintenance and support.

Difference between Data Warehouse and Data Mart:

Sr. No. | Data Warehouse | Data Mart
1. | Its scope is enterprise-wide. | Its scope is department-wide.
2. | Control and management process of data warehouse is centralized. | Its process is decentralized.
3. | Due to huge amount of data, it is complex and difficult to manage and thus takes a long time to produce the result. | Due to fewer amounts of data, it is easy to build and manage.
4. | There are many internal and external sources, thus staging design takes much more time. | There are only few internal and external sources and it is self-explanatory; thus it is faster to build.
5. | The data stored inside the data warehouse are always detailed and accurate when compared with data mart. | The data stored inside the data mart is short and limited.
6. | A data warehouse is a large repository of data collected from different sources. | A data mart is only a subtype of a data warehouse.
7. | Data warehousing includes large area of the corporation, which is why it takes a long time to process it. | Data marts are easy to use, design and implement as they only handle small amounts of data.
8. | A data warehouse is a blend of technologies and components which allows the strategic use of data. | A data mart is a simple form of a data warehouse. It is focused on a single subject.
9. | It is designed for a long period of time. | It is built with a given objective, and has a short lifespan.

An enterprise data warehouse collects information about subjects spanning the entire organization. This model sees the data warehouse as the heart of the enterprise's information system, with integrated data from all business units.

* It provides corporate-wide data integration, usually from one or more operational systems or external information providers, and is cross-functional in scope.
* It typically contains detailed data as well as summarized data, and can range in size from a few gigabytes to hundreds of gigabytes, terabytes, or beyond.
* An enterprise data warehouse may be implemented on traditional mainframes, servers, or parallel architecture platforms. It requires extensive business modeling.
* The goal of EDW is to provide a complete overview of any particular object in the data model. This is accomplished by identifying and wrangling the data from different systems and loading it into a consistent and conformed model.
* After all the information is gathered, the EDW has the capability of providing a common location where different tools can be used to perform analytical functions and predictions. The research teams can identify new trends or patterns and focus on them to help the business grow.

The ETL process encompasses data extraction, transformation, and loading. ETL tools are very important because they help in combining Logic, Raw Data and Schema into one and load the information to the Data Warehouse or Data Marts.

Fig. 1.8: ETL (Data Source, Data Staging, Data Storage)

Sometimes, ETL loads the data into the Data Marts and then information is loaded into the Data Warehouse. This approach is known as the Bottom-up approach. The approach where ETL loads information to the Data Warehouse directly is known as the Top-down approach.

Top-down approach:
* It provides a definite and consistent view of information, as information from the data warehouse is used to create data marts.
* It is a strong model and hence preferred by big companies.
* Time, cost and maintenance is high.

Bottom-up approach:
* Reports can be generated easily, as data marts are created first and it is relatively easy to interact with data marts.
* It is not as strong, but the data warehouse can be extended and more data marts can be created.

There are many reasons for adopting ETL in the organization. Some of them are given below:
1. It helps companies to analyze their business data for taking critical business decisions.
2. Transactional databases cannot answer complex business questions that can be answered by ETL.
3. A data warehouse provides a common data repository.
4. ETL provides a method of moving the data from various sources into a data warehouse.
5. As data sources change, the Data Warehouse will automatically update.
6. A well-designed and documented ETL system is almost essential to the success of a Data Warehouse project.
7. It allows verification of data transformation, aggregation and calculation rules.
8. ETL process allows sample data comparison between the source and the target system.
9. ETL process can perform complex transformations and requires an extra area to store the data.
10. It helps to migrate data into a Data Warehouse, converting the various formats and types to adhere to one consistent system.
11. ETL is a predefined process for accessing and manipulating source data into the target database.
12. ETL offers deep historical context for the business.
13. It helps to improve productivity because it codifies and reuses processes without a need for technical skills.

Steps in getting Data into the Data Warehouse:

1. Extraction:
* Extraction is the first step in the process of getting data into the data warehouse environment.
* During extraction, data is specifically identified and then taken from many different locations, referred to as the Source. The Source can be a variety of things, such as files, spreadsheets, database tables, a pipe, etc.
* Relevant data is obtained from sources in the extraction phase. Extracting means reading and understanding the source data and copying the data needed for further manipulation.
* There are two types of data warehouse extraction methods, namely, logical and physical methods, as shown in Fig. 1.9.
(1) Logical Extraction: Logical extraction has two methods:
(i) Full Extraction: This method is used when the data needs to be extracted and loaded for the first time.
In full extraction, the data from the source is extracted completely.
(ii) Incremental Extraction: In incremental extraction, the changes in the source data need to be tracked since the last successful extraction. Only these changes in data will be extracted and loaded. The changes can be detected from source data that carries a last-changed timestamp. Alternatively, a change table can be created in the source system to keep track of changes in the source data.
(2) Physical Extraction:
* Physical extraction has two methods: online and offline extraction.
(i) Online Extraction: In this process, the extraction process connects directly to the source system and extracts the source data.
(ii) Offline Extraction: The data is not extracted directly from the source system but is staged explicitly outside the original source system.

2. Transformation:
* After data is extracted, it must be physically transported to the target destination and converted into the appropriate format.
* Transformation converts data from its operational source format into a specific data warehouse format.
* This data transformation may include operations such as cleaning, joining, and validating data, or generating calculated data based on existing values.
* Whether the transformation takes place in the data warehouse or beforehand, there are common and advanced transformation types that prepare data for analysis.
* Some of these include:
Basic transformations:
(i) Cleaning: Mapping NULL to 0 or "Male" to "M" and "Female" to "F", date format consistency, etc.
(ii) Deduplication: Identifying and removing duplicate records.
(iii) Format Revision: Character set conversion, unit of measurement conversion, date/time conversion, etc.
(iv) Key Restructuring: Establishing key relationships across tables.
Advanced transformations:
(i) Derivation: Applying business rules to the data that derive new calculated values from existing data, for example, creating a revenue metric that subtracts taxes.
(ii) Filtering: Selecting only certain rows and/or columns.
(iii) Joining: Linking data from multiple sources, for example, adding ad spend data across multiple platforms, such as Google Adwords and Facebook Ads.
(iv) Splitting: Splitting a single column into multiple columns.
(v) Data Validation: Simple or complex data validation, for example, if the first three columns in a row are empty then reject the row from processing.
(vi) Summarization: Values are summarized to obtain total figures which are calculated and stored at multiple levels as business metrics, for example, adding up all purchases a customer has made to build a Customer Lifetime Value (CLV) metric.
(vii) Aggregation: Data elements are aggregated from multiple data sources and databases.
(viii) Integration: Give each unique data element one standard name with one standard definition. Data integration reconciles different data names and values for the same data element.

3. Loading:
* Loading is the final step in the ETL process. It involves loading the transformed data into the destination target, which may be a database or a data warehouse.
* Loading can be carried out in two ways, namely full load and incremental load:
(i) The full load method involves an entire data dump that occurs the first time the source is loaded into the warehouse.
(ii) The incremental load, on the other hand, takes place at regular intervals. These intervals can be streaming increments (better for smaller data volumes) or batch increments (better for larger data volumes).
* The two different methods of loading data into a warehouse are given in the following table:

Extract, Transform, Load (ETL) | Extract, Load, Transform (ELT)
ETL first extracts the data from a pool of data sources, which are typically transactional databases. | With ELT, data is immediately loaded after being extracted from the source data pools.
The data is held in a temporary staging database. | There is no staging database, meaning the data is immediately loaded into the single, centralized repository.
Transformation operations are then performed to structure and convert the data into a suitable form for the target data warehouse system. | The data is transformed inside the data warehouse system for use with business intelligence tools and analytics.
The structured data is then loaded into the warehouse, ready for analysis. |

* ETL is the traditional approach for the analysis of data.

Metadata is simply defined as data about data. The data that are used to represent other data are known as metadata. For example, the index of a book serves as metadata for the contents of the book. In other words, we can say that metadata is the summarized data that leads us to the detailed data.
In terms of a data warehouse, we can define metadata as:
1. Metadata is a road map to the data warehouse.
2. Metadata in a data warehouse defines the warehouse objects.
3. The metadata acts as a directory. This directory helps the decision support system to locate the contents of the data warehouse.
Metadata in a data warehouse is similar to the data dictionary or the data catalog in a database management system. In the data dictionary, we keep the information about the logical data structures, the information about the files and addresses, the information about the indexes, and so on. The data dictionary contains data about the data in the database.
The metadata can be broadly categorized into the following three categories:
1. Business Metadata: This metadata has the data ownership information, business definitions, and charging policies.
2. Technical Metadata: Technical metadata includes database system names, table and column names and sizes, data types and allowed values. Technical metadata also includes structural information such as primary and foreign key attributes and indices.
3. Operational Metadata: This metadata includes currency of data and data lineage. Currency of data means whether data is active, archived or purged. Lineage of data means the history of the data migrated and the transformations applied on it.

In the data warehouse architecture, metadata plays an important role as it specifies the usage, values, and features of data warehouse data. It also defines how data can be processed. It is closely connected to the data warehouse.
The generation and management of metadata serves two purposes: to minimize the effort needed to develop and administer the data warehouse, and to improve the extraction of information from it. Metadata covers:
(i) Information on the contents of the data warehouse, their location and their structure.
(ii) Information on the processes that take place in the data warehouse back-stage, concerning the refreshment of the warehouse with clean, up-to-date, semantically and structurally reconciled data.
(iii) Information on the implicit semantics of data (with respect to a common enterprise model), along with any other kind of data that aids the end-user in exploiting the information of the warehouse.
(iv) Information on the infrastructure and physical characteristics of components and the sources of the data warehouse.
(v) Information including security, authentication, and usage statistics that aids the administrator in tuning the operation of the data warehouse as appropriate.
Metadata also improves data quality, such as consistency (uniform and no duplicates), completeness, accuracy (precision and confidence of the data) and timeliness (values up to date), and it improves query, retrieval and answer quality.

Metadata has a very important role in a data warehouse. The role of metadata in a warehouse is different from that of the warehouse data, yet it is very important. The various roles of metadata are explained below:
* The metadata acts as a directory.
* Metadata are also used by query tools.
* Metadata are used in reporting tools.
* Metadata are used in extraction and cleansing tools.
* Metadata are used in transformation tools.
* Metadata also play an important role in loading functions.
* The directory helps the decision support system to locate the contents of the data warehouse.
* Metadata helps the decision support system in mapping data when data are transformed from the operational environment to the data warehouse environment.
* Metadata helps in summarization between current detailed data and highly summarized data.
* Metadata also helps in summarization between lightly detailed data and highly summarized data.

A metadata repository is a database of data about data (metadata). A metadata repository is used in the bottom tier of the data warehousing architecture. The purpose of the metadata repository is to provide a consistent and reliable means of access to data.
A metadata repository should contain the following:
1. A description of the data warehouse structure, which includes the warehouse schema, views, dimensions, hierarchies, and derived data definitions, as well as data mart locations and contents.
2. Operational metadata, which include data lineage (history of migrated data and the sequence of transformations applied to it), currency of data (active, archived, or purged), and monitoring information (warehouse usage statistics, error reports, and audit trails).
3. The algorithms used for summarization, which include measure and dimension definition algorithms, data on granularity, partitions, subject areas, aggregation, summarization, and predefined queries and reports.
4. Mapping from the operational environment to the data warehouse, which includes source databases and their contents, gateway descriptions, data partitions, data extraction, cleaning, transformation rules and defaults, data refresh and purging rules, and security (user authorization and access control).
5. Data related to system performance, which include indices and profiles that improve data access and retrieval performance, in addition to rules for the timing and scheduling of refresh, update, and replication cycles.
6. Business metadata, which include business terms and definitions, data ownership information, and charging policies.

* Various benefits of data warehousing are given below:
1. Improved Control of Data: Information in the data warehouse is under the control of data warehouse users so that, even if the source system data is removed over time, the information in the warehouse can be stored safely for extended periods of time.
2. Better Retrieval of Data: Because they are separate from operational systems, data warehouses provide retrieval of data without slowing down operational systems.
3. Increased Productivity of Corporate Decision Makers: Data warehousing improves the productivity of corporate decision makers by creating an integrated database of consistent, subject-oriented, historical data. It integrates data from multiple incompatible systems into a form that provides one consistent view of the organization. By transforming data into meaningful information, a data warehouse allows business managers to perform more substantive, accurate, and consistent analysis.
4. More Cost-Effective Decision-Making: Data warehousing helps to reduce the overall cost of the product by reducing the number of channels.
5. Better Enterprise Intelligence: It helps to provide better enterprise intelligence and enhanced customer service.
6. Potential High Returns on Investment: Return on investment (ROI) refers to the amount of increased revenue. Implementations of data warehouses and complementary business intelligence systems have enabled businesses to generate higher amounts of revenue and provide substantial cost savings.
7. Competitive Advantage: The huge returns on investment for those companies/organizations that have successfully implemented a data warehouse are evidence of the enormous competitive advantage that accompanies this technology. The competitive advantage is gained by allowing decision-makers access to data that can reveal previously unavailable, unknown, and untapped information on, for example, customers, trends, and demands.
Limitations of Data Warehouse:
1. Hidden Problems with Source Systems: Sometimes hidden problems associated with the source systems feeding the data warehouse are identified only after years of being undetected. For example, when entering the details of a new property, certain fields may allow nulls, which may result in staff entering incomplete property data even when the data is available and applicable.
2. Required Data Not Captured: In some cases, data that is very important for the data warehouse is not captured by the source systems. For example, the date of registration for the property may not be used in the source system, but it may be very important for analysis purposes.
3. Increased End-User Demands: Once a data warehouse is online, it is often the case that the number of users and queries increases, together with requests for answers to more and more complex queries.
4. Data Homogenization: Data warehousing emphasizes similarity of data formats across different data sources. This homogenization can result in the loss of some important value of the data.
5. High Demand for Resources: The data warehouse holds large amounts of data, which places a high demand on resources such as storage.
6. Data Ownership: Data warehousing may change the attitude of end-users to the ownership of data. Sensitive data owned by one department has to be loaded into the data warehouse for decision-making purposes. Sometimes that department is reluctant, because it may hesitate to share the data with others.
7. High Maintenance: Data warehouses are high-maintenance systems. Any reorganization of the business processes and the source systems may affect the data warehouse and result in high maintenance costs.
8. Long-Duration Projects: The building of a warehouse can take up to three years, which is why some organizations are reluctant to invest in a data warehouse. Data marts, by contrast, support only the requirements of a particular department and limit the functionality to that department or area.
9. Complexity of Integration: An organization must spend a significant amount of time determining how well the various data warehousing tools can be integrated into the overall solution that is needed. This can be a very difficult task, as there are a number of tools for every operation of the data warehouse.
10. Under-Estimation of Resources for Data Loading: Sometimes, we underestimate the time required to extract, clean, and load the data into the warehouse. This work may take a significant proportion of the total development time, although some tools are available to reduce the time and effort spent on this process.
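The extraction, transformation and loading steps described in this unit can be sketched in a few lines of Python. This is only a minimal illustration, not a production design: the sample rows, the `sales_fact` table, the function names `extract`, `transform`, `load` and `run_demo`, and the use of an in-memory SQLite database as a stand-in for the warehouse are all assumptions made for the example. It shows a full extraction and load the first time, followed by an incremental extraction driven by a last-changed timestamp, with cleaning, deduplication and derivation applied in between.

```python
import sqlite3

# Hypothetical source rows: (id, gender, amount, tax, updated_at).
SOURCE_ROWS = [
    (1, "Male", 100.0, 10.0, "2024-01-01"),
    (2, "Female", None, 0.0, "2024-01-02"),
    (2, "Female", None, 0.0, "2024-01-02"),  # duplicate record
    (3, "Male", 50.0, 5.0, "2024-01-05"),
]


def extract(rows, last_extracted=None):
    """Full extraction when last_extracted is None, else incremental by timestamp."""
    if last_extracted is None:
        return list(rows)
    # ISO dates compare correctly as strings.
    return [r for r in rows if r[4] > last_extracted]


def transform(rows):
    """Apply cleaning, deduplication and derivation to the extracted rows."""
    gender_codes = {"Male": "M", "Female": "F"}
    seen, out = set(), []
    for row_id, gender, amount, tax, updated_at in rows:
        if row_id in seen:  # deduplication on the key
            continue
        seen.add(row_id)
        amount = 0.0 if amount is None else amount  # cleaning: NULL -> 0
        net = amount - tax  # derivation: revenue net of tax
        out.append((row_id, gender_codes.get(gender, gender), amount, net, updated_at))
    return out


def load(conn, rows):
    """Insert new rows, or overwrite changed ones, in the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales_fact ("
        "id INTEGER PRIMARY KEY, gender TEXT, amount REAL, "
        "net_revenue REAL, updated_at TEXT)"
    )
    conn.executemany("INSERT OR REPLACE INTO sales_fact VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()


def run_demo():
    conn = sqlite3.connect(":memory:")  # stand-in for the warehouse
    # Full load: the first time the source is loaded.
    load(conn, transform(extract(SOURCE_ROWS)))
    # Incremental load: only rows changed since the last successful extraction.
    last = max(r[4] for r in SOURCE_ROWS)
    changed_source = SOURCE_ROWS + [(4, "Female", 20.0, 2.0, "2024-01-09")]
    load(conn, transform(extract(changed_source, last_extracted=last)))
    return conn.execute(
        "SELECT id, gender, amount, net_revenue FROM sales_fact ORDER BY id"
    ).fetchall()


if __name__ == "__main__":
    print(run_demo())
```

Because the transformation runs before `load` is called, this sketch follows the ETL ordering. An ELT variant would load the raw rows first and apply the same rules inside the warehouse, for example as SQL statements.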
