Data Mining and Warehousing

The document discusses data warehousing, covering its architecture, delivery processes, and the importance of data preprocessing such as cleaning and transformation. It highlights the need for a structured approach in developing data warehouses, including understanding business requirements and the integration of operational data. Additionally, it outlines the various types of data warehouses and the tools required for their management and maintenance.

Uploaded by

it202125
Copyright
© All Rights Reserved
DATA MINING AND WAREHOUSING

UNIT-1: DATA WAREHOUSING
Data warehousing: introduction, delivery process, data warehouse architecture. Data preprocessing: data cleaning, data transformation, data reduction.
... databases, issues in data mining. Introduction to fuzzy sets and fuzzy logic.
UNIT-4: SUPERVISED LEARNING
UNIT-5: CLUSTERING & ASSOCIATION RULE MINING
Hierarchical algorithms, partitional algorithms, ... Apriori and FP growth algorithms.

DATA WAREHOUSING - INTRODUCTION, DELIVERY PROCESS, DATA WAREHOUSE ARCHITECTURE

Subject-oriented - A data warehouse is constructed around main subjects like customer, supplier, sales and product. A data warehouse emphasizes the modeling and analysis of data for decision making. Data warehouses offer a simple and clear view around particular subjects by eliminating data that serve no purpose in the decision support process.

In essence, a data warehouse is a consistent data store that acts as the physical implementation of a decision support data model and holds the information needed by an enterprise to make strategic decisions. A data warehouse is also viewed as an architecture, constructed by integrating data from multiple heterogeneous sources to support structured or ad hoc queries, decision making and analytical reporting.

On the basis of this information, constructing a data warehouse needs data integration, data cleaning ... Some authors use the term "data warehousing" to refer only to the warehouse construction, whereas the term "warehouse DBMS" is used to refer to the management and utilization of data warehouses.

Q.3. What are the architectural components of a data warehouse ? (R.G.P.V., Dec. 2002)
Or
... key components of data warehouse. (R.G.P.V., June 2013)
Ans.
(i) Load Manager - The load manager is the system component that performs the operations needed to support the extract and load process. It is built using a combination of off-the-shelf tools, ...
The figure shows the architecture of the load manager. It performs the following operations -
(a) Extract the data from the source systems.
(b) Load the extracted data into a temporary data store.
(c) Perform simple transformations into a structure similar to the one in the data warehouse.
Typically, the query manager is designed in the build phase, once the database and user access tool technologies have been determined.

The warehouse manager is built from a combination of third-party systems management software, shell scripts, C programs and bespoke coding. The complexity of the warehouse manager is driven by the extent to which the operational management of the data warehouse has been automated. Third-party tools will probably contribute a maximum of 40% of the total system, with the bulk of that contribution coming from systems management tools, for example automated backup/recovery/archiving.

Fig. 1.4 Warehouse Manager Architecture
Fig. 1.5 Query Manager Architecture

Fig. 1.4 shows the architecture of a warehouse manager. It performs the following operations -
(a) Analyze the data to perform consistency and referential integrity checks.
(b) ... combine the source data in the temporary data store ...
(c) Create any new aggregations that may be required.

Fig. 1.5 shows the architecture of a query manager. It performs the following operations -
(a) Direct queries to the appropriate tables.
...

... need for developing data warehouse ?
Ans. There are several needs for developing a data warehouse, as follows - ... operational databases as well as external sources ... A data warehouse is developed for decision support queries, hence only data that is needed for ... is stored in the warehouse ...

... a variant of the joint application development approach, adapted for delivering a data warehouse. The process is staged to minimize risks. The approach does not reduce the overall delivery time-scales, but it ensures that the business benefits are delivered incrementally through the development process. The delivery process is broken into phases to reduce the project and delivery risk. The stages in the delivery process are shown in fig. 1.6.

Fig. 1.6 Delivery Process

Education and Prototyping - Organizations experiment with data analysis and educate themselves on the value of having a data warehouse. This is addressed by prototyping, ... and benefits of a data warehouse.

Business Case - ... delivers business benefits ... quantifiable. If a data warehouse does not have a clear business case, it tends to suffer from credibility problems during the delivery process. Therefore, in data warehouse projects, we need to understand the business case for investment.

(iv) Business Requirements - ... identifies the short-term ... requirements ... of growing to the full ...
(a) Understand the short-term and medium-term requirements.
(b) At least 20% of the time should be spent on understanding the likely longer-term requirements.
(c) Within this stage, we must determine -
(1) The logical model for information within the data warehouse.
(2) The source systems that provide this data (that is, the mapping rules).
(3) The business rules to be applied to data.
(4) The query profiles for the immediate requirement.

Technical Blueprint - This phase needs to deliver an overall architecture that satisfies the long-term requirements. It also delivers the components that must be implemented on a short-term basis to derive any business benefit. The blueprint identifies the following - ... the data retention policy, ... plan for hardware and infrastructure, ... and data mart architecture.

Ad hoc Query - In this phase, we configure an ad hoc query tool ... to operate a data warehouse. These tools can generate the ... It is recommended not to use these access tools when the ...

(x) Requirements Evolution - From the perspective of the delivery process, the requirements are always changeable. They are not static. The delivery process must support this and allow these changes to be reflected within the system.
This issue is addressed by designing the data warehouse around the use of data within business processes, as opposed to the data requirements of existing queries.
The architecture is designed to change and grow to match the business needs. The process operates as a pseudo-application development process, where the new requirements are continually fed into the development activities and partial deliverables are produced. These partial deliverables are fed back to the users and then reworked, ensuring that the overall system is continually updated to meet the business needs.
Q.6. Discuss three-tier architecture for data warehouse. (R.G.P.V., Dec. 2008)
Or
Briefly describe 3-tier data warehouse architecture. (R.G.P.V., June 2011)
Or
Describe the overall and typical architecture of data warehouse. (R.G.P.V., June 2013)
Or
Describe the term data warehouse architecture. (R.G.P.V., Dec. 2010, June 2014)
Or
With a neat sketch explain the architecture of a data warehouse. (R.G.P.V., May 2019)
Or
Write a short note on data warehouse architecture. (R.G.P.V., Nov. 2019)
Ans. Data warehouse uses a three-tier architecture, as shown in fig. 1.7.

In the bottom tier, back-end tools and utilities carry out data extraction, data cleaning, data transformation, and load and refresh functions to keep the data warehouse up to date. Application program interfaces called gateways are used to extract data. A gateway is supported by the underlying DBMS and permits client programs to produce SQL code to be executed at a server. ODBC (Open Database Connectivity) and OLE DB (Object Linking and Embedding, Database) by Microsoft, and JDBC (Java Database Connectivity), are examples of gateways. The bottom tier also contains a metadata repository that keeps information about the data warehouse and its contents.

An OLAP server is the middle tier. It is implemented using either a multidimensional OLAP (MOLAP) model or a relational OLAP (ROLAP) model. A MOLAP model is a special server ...

The top tier ...

Q.7. What is data warehouse ? Explain ... with diagram.
Or
What is data warehouse ? Discuss a three-tier data warehouse architecture. (R.G.P.V., Dec. 2003, 2009)
Or
What is data warehouse ? Discuss a three-tier data warehouse. (R.G.P.V., June 2014)
Ans. Refer to Q.1 and Q.6.

Q.8. Draw the data warehouse architecture and explain its components. (R.G.P.V., June 2016)
Ans. Refer to Q.6 and Q.3.

Q.9. Differentiate between two and three tier architectures of data warehouse. (R.G.P.V., Dec. 2004)
Ans. ... data mart environment. The top and bottom ... in two-tiered architecture.

Q.10. Discuss system development life cycle of a data warehouse. What ... should be considered while designing a data warehouse ?
Ans. The data warehouse development life cycle ... software development life cycle. The various phases in the implementation life cycle of a data warehouse are as follows -
(i) Planning - This is the first phase of a data warehouse ... Planning describes goals, objectives, project schedule and various tasks ... progress of the project.
(ii) Requirements - This phase specifies the basic requirements that are required to carry out ..., by organizing meetings with the customer for gathering requirements.
(iii) Analysis - This phase deals with the development of logical data warehouse and data mart models. It also defines the processes which are needed to connect data warehouse, data sources and tools together.
(iv) Design - This phase deals with the detailed design of the project.
(v) Construction - In this phase, modification is done after performing testing of those processes which are developed in the design phase. This phase validates data extraction and transformation functions and also confirms data quality.
(vi) Deployment - This phase finally deploys the data warehouse after it has gone through all the earlier phases.
(vii) Maintenance - This is the most important phase of the data warehouse life cycle. The warehouse needs to be maintained from time to time so as to keep the performance up to the mark.

The main factors that need to be considered while designing a data warehouse include heterogeneity of data sources, use of historical data, increasing size of databases and the business driven nature of data ... Besides, some other factors which should be considered ... data warehouse are data content, metadata, data distribution ...

Q.11. What are the differences between the three main types of data warehouse usage: information processing, analytical processing and data mining ? (R.G.P.V., June 2009, 2017)
Ans. There are three types of data warehouse usage -
(i) Information Processing - It supports basic statistical analysis, querying, and reporting using crosstabs, charts, tables or graphs. In data warehouse information processing, a current trend is to construct low-cost Web-based accessing tools that are then integrated with Web browsers.
(ii) Analytical Processing - It supports basic OLAP operations ...

Q.12. "... data warehouse is built on historical data and is not guaranteed to be up-to-date information". ...
Ans. ... to be up-to-date ... and make up the operational database. ... has large amount of data, which ... The nature of a data warehouse is nonvolatile. According to ..., ... store of data transformed ... data environment. A data ... only be loaded from ... updated, changed or deleted ... require to update data ...

Q.13. Give three examples of problems likely to be encountered when operational data are integrated into the data warehouse. (R.G.P.V., Dec. 2011)
Or
Give reason, why it is necessary to separate data warehouse from operational database. (R.G.P.V., June 2016)
Ans. The integration of operational data into the data warehouse can encounter three types of problems such as -
... from the various source fields to the data warehouse fields must ... transform the data to the data warehouse appropriately.
One simple example is the encoding of gender. In one ... encoded as m/f. In yet ... to populate the data warehouse is not always straightforward.

Q.14. ... advanced database systems and ... with example. (R.G.P.V., ...)
Ans. The advanced database systems can be categorised in different ... -
(i) Spatial Databases and Spatiotemporal Databases - Spatial databases hold space-related information.
Geographic databases, satellite image databases, and VLSI or computer-aided design databases are some examples of spatial databases. Spatial data may be shown in raster format, made up of n-dimensional ..., or in vector format, where ... polygons, points, lines, and the partitions and networks formed by these components. A spatiotemporal database is formed by spatial objects that change with time. Interesting ... from spatiotemporal database.

(ii) Data Stream Management Systems - ... involve the analysis and generation of a new kind of data, stream data, where data flow in and out of an observation ... Typical features of data streams include flowing in and out in a ..., permitting only one or a small number of scans, ... volume, dynamically changing, and demanding fast ... streams, network traffic, stock exchange, and telecommunications ...

(iii) ... model, object-relational databases are formed. By giving a ... handling object orientation and complex objects, this model extends the relational model. In applications and industry, object-relational databases are becoming very popular because most complicated database applications need to handle complex structures and objects.

(iv) Heterogeneous Databases and Legacy Databases - A heterogeneous database is made up of a set of autonomous, interconnected component databases. The communication between the components takes place in order to exchange information and answer queries. Objects in one component database ...

... multimedia databases, because data objects like video ..., must be supported ... may need gigabytes of ...

(vii) Temporal Databases, Sequence Databases and Time-series Databases - Data that contain time-related attributes are stored by a temporal database. These attributes may include several timestamps, each containing different semantics.
The sequencing of events in which they occurred is stored by a sequence database, with or without a concrete notion of time. Biological sequences, ... streams, and customer shopping sequences are some of the examples of sequence database.
The sequencing of values or events received over repeated measurements of time is stored by a time-series database. Typical examples of time-series database include inventory control, the observation of natural phenomena, and data gathered from the stock exchange.

Q.15. Discuss in brief various types of data warehouse. (R.G.P.V., Dec. 2002)
Or
Compare enterprise warehouse, data mart and virtual warehouse. (R.G.P.V., June 2010)
Or
Describe three data warehouse models: the enterprise warehouse, the data mart and the virtual warehouse. (R.G.P.V., June 2018)
Or
Briefly explain the data warehouse models.
Ans. There are three types of data warehouse -
(i) Data Mart - Data marts are part ... warehouse. Data marts come in many ... for a number of reasons. Data mart ... by reducing the volume of data ... A data mart has a subset of corporate-wide data ... group of users. The scope is limited to specific ... its subjects to customer ... The implementation ...
... they are local users and their applications do not even need to know the physical location of the data.
... refreshing ...

Q.16. ...
Ans. Refer to Q.1 and Q.15 (i).

Q.17. Data warehouse used some tools. What are the functions of them ?
Or
Explain the different tools required to manage a data warehouse. (R.G.P.V., May 2018)
Ans. The back-end tools and utilities are used by data warehouse systems to populate and refresh their data. The functions of these tools and utilities are -
(i) Data Extraction - Collects data from multiple, heterogeneous, and external sources.
(ii) Data Cleaning - It is used to find errors in the data and correct them as required.
(iii) Data Transformation - Used to transform the data from legacy ...
(iv) Load - ... consolidates, checks integrity, ... partitions.
(v) Refresh - ... the updates from the data sources to the warehouse.
In addition to cleaning, loading, refreshing, and metadata definition tools, data warehouse systems offer a good set of data warehouse management tools. To improve the quality of data, there are two important steps: data cleaning and data transformation.

Q.18.
Discuss data warehouse functions and explain the data flow within data warehouse. (R.G.P.V., Dec. 2002)
Ans. The processes shown in fig. 1.8 correspond to the data flows within a data warehouse.

Fig. 1.8 Process Flow Within a Data Warehouse

... source systems and makes it available to ... data and loads it into data ...
(a) Controlling the Process - ...

...

(b) Two functions are performed by each transform. The first function carries out some data smoothing, like a sum or weighted average. The second function carries out a weighted difference, which acts to bring out the detailed features of the data.
(c) The two functions are performed to pairs of data points in ... This leads to two data sets ...
(d) Until the resulting data sets obtained are of length 2, the two functions are recursively performed to the data sets obtained in the previous ...
(e) The wavelet coefficients of the transformed data are designated from ... the data sets obtained in the above iterations.

These techniques may be parametric or nonparametric. For parametric ... the data ... Nonparametric methods of storing reduced representations of the data include sampling, clustering and histograms.

... bucket represents only ...
There are several partitioning rules ... attribute values, some of them are as follows -
(i) Equal-frequency - The frequency of each bucket is constant in an equal-frequency histogram. This is also known as equidepth or equal-depth.
(ii) Equal-width - The width of each bucket range is constant in an equal-width histogram.
(iii) MaxDiff - Here, we consider the difference between each pair of adjacent values. A bucket boundary is established between each pair for the pairs containing the B - 1 largest differences, where B represents the number of buckets specified by the user.
(iv) V-Optimal - The V-optimal histogram is the one with the least variance, if we consider all of the possible histograms for a given number of buckets. Histogram variance is a weighted sum of the original values that each bucket represents, where the bucket weight represents the total number of values in the bucket.

Q.43. Write short note on clustering techniques.
Ans.
Data tuples are considered as objects. Objects within a cluster are of the same type ..., and similarity is defined on the basis of a distance function, that is, by how close the objects are in space. The "quality" of a cluster ...

The cluster representations of the data are used in data reduction to replace the actual data. This technique is more effective for data that can be organized into different clusters. The effectiveness of this technique relies on the nature of the data.

... to give estimate answers to queries. For a given set of data objects, an index tree recursively divides the multidimensional space, where the root node represents the whole space. Typically, these trees are balanced and contain internal and leaf nodes. Each parent node has keys and pointers to child nodes that collectively show the space represented by the parent node. Each leaf node has pointers to the data tuples it represents. Hence, an index tree can store aggregate and detail data at varying levels of resolution. A hierarchy of clusterings of the data set is provided by ..., where each cluster contains a label that holds for the data contained in it.

Q.44. Why sampling is used as a data reduction technique ? ... the advantage of sampling for data reduction ?
Ans. Sampling is used as a data reduction technique since it represents a large data set by a much smaller random sample of the data. Consider, ...

... discretization is performed ... as a supervised ...
Splitting (top-down) first finds one or a few points ..., then repeats this process recursively on the resulting intervals. Merging (bottom-up) ... by merging neighborhood values to form intervals, and then applies this process recursively to the resulting intervals. Discretization can be performed recursively to offer a hierarchical partitioning of the attribute values, called a concept hierarchy. For a given numerical attribute, a concept hierarchy defines a discretization of the attribute. Concept hierarchies are used to reduce the data set size by replacing low-level concepts with high-level concepts. The generalized data may be ... to use; however, such data generalization may lose some details.
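The way a concept hierarchy generalizes a numerical attribute can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the attribute is taken to be a price in dollars between 0 and 1000, and the split points, the function names interval_label and generalize, and the sample prices are all hypothetical, not taken from the source.

```python
from bisect import bisect_right
from collections import Counter

# Hypothetical split points (in dollars) for two levels of a hierarchy on price.
FINE_CUTS = [100, 200, 300, 400, 500, 600, 700, 800, 900]   # $100-wide intervals
COARSE_CUTS = [200, 400, 600, 800]                           # $200-wide intervals


def interval_label(value, cuts, lo=0, hi=1000):
    """Return the half-open interval [a, b) defined by `cuts` that contains value."""
    if not lo <= value < hi:
        raise ValueError(f"value {value} outside [{lo}, {hi})")
    bounds = [lo] + list(cuts) + [hi]
    i = bisect_right(cuts, value)        # index of the interval holding value
    return f"[${bounds[i]}, ${bounds[i + 1]})"


def generalize(prices, cuts):
    """Replace raw prices by interval labels and count the tuples per label."""
    return Counter(interval_label(p, cuts) for p in prices)


prices = [15, 120, 180, 250, 399, 400, 760, 999]   # assumed sample data
fine = generalize(prices, FINE_CUTS)      # lower level: more, narrower intervals
coarse = generalize(prices, COARSE_CUTS)  # higher level: fewer, wider intervals
```

Moving from FINE_CUTS to COARSE_CUTS reduces the number of distinct values that have to be stored or mined, at the cost of the detail inside each interval.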
Such generalization contributes to a consistent representation of data mining ... multiple mining tasks, which is a common requirement. Apart from this, mining on a reduced data set needs fewer input/output operations ... step.

... a table having all ... SQL is written as -
SELECT prod, city, sum(qty) ...
... whose aggregate value exceeds the threshold.

Fig. 1.14 shows a concept hierarchy for the attribute price in dollars. Here, to satisfy the requirement of several users, more than one concept hierarchy can be defined for the same attribute price. For a ... expert, manual definition of concept hierarchies can be a tedious and time-consuming task. Therefore, various discretization methods ... automatically generate or dynamically refine concept hierarchies for numerical attributes.

Fig. 1.14 A Concept Hierarchy

Q.47. Describe various concept hierarchy generation strategies for numerical data.
Ans. For ... hierarchies are generated ... each method ...
(i) Binning - Binning is a top-down splitting technique that depends on a given number of bins. ... smoothing technique ... sized partitions ... the values ... concept hierarchy, repeat the procedure ... partition ...
... cluster analysis ...
... analyse class distribution information in its ...
... In this method, each distinct value of a numerical attribute is initially considered to be one interval. Chi-square (χ²) tests are carried out for every pair of adjacent ...

Q.48. What do you ... concept hierarchy ... hierarchy can be generated for categorical ...
Ans. ... "{...} Prairies_Canada" and "{British Columbia, Prairies_Canada} ..." ... of attributes, but not of their partial ... a concept hierarchy can be automatically created ... attribute in the given ...

Q.49. What are the various steps involved in design and construction of a data warehouse ? (R.G.P.V., Dec. 2003)
Or
Explain the design and construction of a data warehouse. (R.G.P.V., May 2015, ...)
Ans. ... notation ... all schemata by both ... easily, and new designs and technologies can be adapted in ...
Following steps are involved in the warehouse design -
(i) Select a business process to model. For example, invoices, inventory, account administration, orders, ... The data warehouse model should be followed if the business process ...
... typical dimensions.
(iv) Select the measures ... Typical measures are numeric additive quantities ...
The data warehouse implementation ...

Q.51. Explain star schema with ... Explain the use of facts, dimensions and attributes in the star schema.
Ans. Star schema is the most common modeling paradigm. It contains a large central table (fact table) ... redundancy, and a set of smaller attendant tables (dimension tables), one for each dimension. The schema graph ... around the central fact table.
For example, consider a star schema of sales with four dimensions: time, item, branch, and location. This ... maintains keys to each of the four dimensions, along with the measures dollars_sold and units_sold.

Fig. 1.15 Star Schema for Sales

In star schema, each dimension is shown by a single table, and each table ... Gwalior and Indore are both cities in M.P. state of India. These types of entries in the dimension table will create redundancy among the attributes province_or_state and country, i.e., (..., Gwalior, M.P., India) and (..., Indore, M.P., India). In a dimension table, the attributes may form either a hierarchy or a lattice.

... technique is to save ... An end user query accesses the summarized data ... DBMS limitations ... limits, and the maximum number of records ... than raw storage by using single record ... places it in a different location ...

Q.54. Explain snowflake schema. How does it differ from star schema ?
Or
Briefly describe the similarities and differences between star and snowflake schemas. (R.G.P.V., June 2017)
Ans. The snowflake schema is slightly different from the star schema: some dimension tables are normalized, hence the data is further split into additional tables. This is the main difference between the snowflake and star schema models. The snowflake schema graph representation ...

... Differentiate between star ... with the help of an example ...
... item_key, item_name, brand, type and supplier_key, in which
supplier_key is connected to the supplier dimension table. This table contains supplier_key and supplier_type information. In the same way, ...

Q.58. ... with the help ... (R.G.P.V., Dec. 2008)
... galaxy schemas for multidimensional ... (R.G.P.V., Dec. 2011)
Ans. Snowflake Schemas - Refer to Q.54.
Galaxy Schemas - Refer to Q.56.

Q.59. Discuss the various ways of horizontal partitioning.
Ans. The various ways in which fact data could be partitioned, ... deciding on the optimum solution, are as follows -
(i) Partitioning by Time into Equal Segments - The fact table is partitioned on a time period basis, where each time period represents a significant retention period within the business. For example, if the majority of the user queries are on month-to-date values, it is probably appropriate to partition into monthly segments. If the query period is fortnight-to-date, ... fortnightly segments as long as the total number ... something in the order of 500.
Table partitions can be reused, by removing ... We have to take into account that a number of the partitions ... transactions over a busy period in the business, and ... substantially smaller.
(ii) Partitioning by Time into Different-sized Segments - When ... partition the fact table ... active data and even larger partitions for inactive data. ... using this technique are that all the detailed information remains available ...
(iii) Partitioning on a Different Dimension - There may be good reasons to partition on some other dimension. For example, ... distinct regional departments ... region tends to query on information ... more effective to partition ... that all the queries for that information ... Clearly, the benefit of this ... queries regarding a region, regardless ... It is particularly appropriate where ... organization.
(iv) Partitioning by Size of Table - ... there may not be a clear basis for ... In these instances, you should ... on a size basis. This requires metadata to identify what data is stored in each partition.
If we consider a customer ... area, we could find that the business ... there is no operational concept of time ... transactions can occur at any time. It may be inappropriate to split transactions on a daily/weekly/monthly basis.
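Item (iv) above says that size-based partitioning needs metadata to record which data each partition holds. The following is a minimal Python sketch of that idea, not a description of any particular DBMS: the names PartitionInfo, partition_by_size, find_partition and fact_part_NNN, the txn_id key and the sample rows are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class PartitionInfo:
    """Metadata describing what one size-based partition holds (hypothetical layout)."""
    name: str
    first_id: int    # smallest transaction id stored in the partition
    last_id: int     # largest transaction id stored in the partition
    row_count: int


def partition_by_size(rows, max_rows):
    """Split rows (already sorted by 'txn_id') into partitions of at most max_rows rows.

    Returns (partitions, metadata): partitions maps a partition name to its rows,
    and metadata records the id range that each partition covers.
    """
    if max_rows <= 0:
        raise ValueError("max_rows must be positive")
    partitions, metadata = {}, []
    for start in range(0, len(rows), max_rows):
        chunk = rows[start:start + max_rows]
        name = f"fact_part_{len(metadata) + 1:03d}"
        partitions[name] = chunk
        metadata.append(
            PartitionInfo(name, chunk[0]["txn_id"], chunk[-1]["txn_id"], len(chunk))
        )
    return partitions, metadata


def find_partition(metadata, txn_id):
    """Use the metadata to locate the partition that holds txn_id, or None."""
    for info in metadata:
        if info.first_id <= txn_id <= info.last_id:
            return info.name
    return None


rows = [{"txn_id": i, "value": 10 * i} for i in range(1, 11)]  # ten sample transactions
partitions, metadata = partition_by_size(rows, max_rows=4)
```

A query for a given transaction consults find_partition first and then reads only that partition. Without the metadata, every partition would have to be scanned, because a size-based split carries no business meaning of its own.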
(v) Partitioning Dimensions - In some cases, the dimension may contain such a large number of entries that it may need to be partitioned in the same way as a fact table. Put another way, we need to check the size of the dimension over the lifetime of the data warehouse. For example, let us consider ... the requirement exists to store all the variations ... that dimension may become quite large. A large ... affect query response times.
(vi) Partitioning using Round-robin Partitions - Once the data warehouse is holding the full historical complement of information, when a new partition is required the oldest partition will be archived. It is possible to archive the oldest partition prior to creating a new one and reuse the old partition for the latest data. Metadata is used in order to allow user access tools to refer to the correct table partition. The warehouse manager creates a meaningful table name, such as sales_month_to_date, which represents the data content of a physical partition. This makes it simpler to automate many of the table management facilities in the data warehouse, by allowing the system to refer to the same physical table partitions. The information period they cover will change, but this can be managed by using appropriate metadata.

Q.60. Discuss vertical partitioning in detail.
Ans. In vertical partitioning, data is split vertically. This process is shown in fig. 1.18. This process can take two forms - normalization and row splitting.
Normalization is a standard relational method of database organization. It allows common fields to be collapsed into single rows, thereby reducing space usage. For example, tables 1.1 and 1.2 show a normalization ... In the data warehouse arena the approach tends to be the other way. Large tables ... of the fact data.

Fig. 1.18 Vertical Partitioning

Table 1.1 ...
... | Store_name | Location | Region
... | Cheapo | London | SE
... | Cheapo | London | SE
... | ... | ... | N

Table 1.2 Tables After Normalization
Store_id | Store_name | Location | Region
... | Cheapo | London | SE
... | Tatty | York | N

Product_id | Quantity | Value | Sales_date | Store_id
... | ... | ... | ... | ...

Q.61. Write short note on data warehouse ...
Ans. Data warehouses contain huge volumes of data, and OLAP servers demand that decision support queries be answered in the order of seconds. Therefore, it is crucial for data warehouse systems to support efficient cube computation techniques ... Efficient implementation of a data warehouse system needs efficient computation of data cubes, indexing of OLAP data and processing of OLAP queries.

... The apex cuboid is the most generalized (least specific) and the base cuboid is the least generalized (most specific) of the cuboids.
... set of cuboids or ... perform the OLAP operations onto the selected cuboid ...
The significance of partial materialization as compared to full materialization of the data cube is that full materialization ...

Q.64. Why most data warehouse system support index structures ? Discuss methods to index OLAP data. (R.G.P.V., June 2015)
Ans. Most data warehouse systems support index structures for efficient data accessing.
Bitmap Indexing - This method ... leads to fast searching in data cubes. ... record_ID (RID) list. In this method, for a given attribute, there is a distinct bit vector, Bv, for each value v in the domain of the attribute. A total of n bits are required for each entry in the bitmap index when the domain of a given attribute consists of n values. If the attribute has value v for a given row in the data table, then the bit representing that value is set to 1 in the corresponding row of the bitmap index. All other bits for that row are set to 0. ... comparison and aggregation operations are reduced to bit arithmetic, which significantly reduces the processing time. For higher-cardinality ... using compression techniques. Bitmap indexing results in significant reduction in input/output and space because a string of characters is shown by a single bit. This method is advantageous compared to tree and hash indices.
Join Indexing - This method is popular from its use in relational database query processing. In traditional indexing, the value in a given column is mapped to a list of rows having that value. However, join indexing registers the joinable ...

Q.65. What is data mart ? What are the types of data mart ? (R.G.P.V., June 2016)
Or
Write short note on data mart. (R.G.P.V., June 2017)
Ans. Data Mart - Refer to Q.
Types of Data Mart - Data marts ...
An independent data mart may be built by a department ... without any consideration ... In an independent data mart, there is no requirement ... whole DSS requirements for an organization ... data. To create an independent data mart ... marts are famous. Looking back at the diagram, it can be seen that multiple ... The input to the independent mart is the ...

Fig. Independent Data Marts Representation

A dependent data mart is shown in fig. 1.21.

Fig. 1.21 Dependent Data Marts Representation

The opposite of an independent data mart is a dependent data mart. A dependent data mart is one which is built from data coming from the data warehouse. The dependent data mart does not depend on legacy or operational data for its source. For source data, the dependent data mart depends on the data warehouse. Forethought and investment are required by the dependent data mart. Someone to "think globally" is required ... The dependent data mart needs many users to pool their information requirements ... of the data warehouse. In other words, the dependent data mart needs advance planning, a long-term perspective, global analysis, and cooperation and coordination of the definition of requirements among different departments of an organization.

Q.66. Differentiate between a data warehouse and a data mart. What is the role of data mart in data warehousing ? (R.G.P.V., Dec. 200...)
Ans. Differentiation between Data Warehouse and Data Mart - A data warehouse gathers information about subjects that span the complete organization like customers, sales, items, assets and personnel. On the other hand, a data mart is a subset of the data warehouse ... subjects. The scope of a data warehouse is enterprise-wide, whereas that of a data mart is department-wide. Usually the ...

Q.67. Explain in brief distributed data mart ...
Ans. ... data marts together serve as ... must provide the easiest possible access ... data mart is ... separate ... network. The aim is to ... separate components ... as a single global data ... billing, sales ... reporting ...
... and other things on internet ... simple and on right time. Each company ... database, so they needed a data warehouse. All organizations ... sum of money in this technology so that they can gain ... environment of productivity. Data warehouse is a set of materialized views ... data sources. Ralph Kimball et al defined "A data warehouse is a copy of transaction data specially structured for query and analysis". But because of the competitiveness of the market, all enterprises think on a larger platform and then act on it in their enterprises. For this, changes in the data warehouse are required, and distributed data warehouses fulfill these requirements perfectly.

Q.68. What are the features of distributed data warehouses/marts ?
Ans. There are several features of distributed data warehouses/marts as follows -
(i) The data copied into a data warehouse does not change. The data warehouse is a historical record of the state of an organization. The ... frequent changes of the source ... warehouse by adding new data, ... total data warehouse effort.
(ii) Data owners lose control ...

... which are described as follows -
(a) Inmon's method
(b) White's method
(i) Inmon's Method - ... both local and global data warehouses with data ... exclusive, as shown in fig. 1.23. The local ... making functions. The global data warehouse contains ... the corporation and data integrated from the various local staging areas ...

Fig. 1.23 Representation of Inmon's Method

(ii) White's Method - This is known as ... "two-tier data warehouse", which is the combination of both central ... data warehouses and a ... The central data warehouse contains normalized ... operational systems at detailed data captured and cleaned from ... intervals. The central data warehouse maintains data ... of data derived from the detailed base data. Data collections are the user view of data warehouse data and may contain denormalized detailed data as well as
as wet a, Summarized data. A data distribution service is provided by the da to distribute data collections to decentralized data mars a tesites of the corporation. The data marts are sub sequently i fither sites of the corporation. Data mars allow Systems, which improves both performance and av eerore Data Mart Data Mart Fig. 1.24 Representation of White's Method Q.71. Explain in brief about distributed database designs. Ans. buted data environment, two approaches for distributed ; database design were introduced - the top-down approach and the bottom-up ‘approach. The top-down approach is used if the databases are non-existent ‘Although, once the databases exis, the bottom-up design is the suitable approach (Top-down Design ~ In the top-down design approach, the datz ‘a process of creating data models which contairs ies and relationships, to which successive refinements are applied it first. The data marts are then created from the dats, aml Warehousing 63 m-UP approach is appy ie 5 ropriate when Tra cant dtabase systems. The boom. eptual schemas and the objective of st deine : oo lennon komt se ced and etrogney ofthe aloe ‘2 common database: pe ‘model for describing the global the common data model the existing databases into =TA DATA, EXAMPLE OF A MULTIDII DEL, INTRODUCTION TO PATTERN WAREHOUSE ‘offer an unambiguous interpret data warehouse and the point (@ Build-time Metadata — Whenever we vwarchouse, the metadata that we generate can be termed as house terminology (iti) Control Metadata ~ Toe third way metadata is used is, of course, by the databases and other tools to manage their own operations. For example DBMS builds an internal representation of the database catalogue for use as 3 working copy from the build-time catalogue. Thi control metadata. Most control metadata programmers. 
However, one subset, which is generated and used by the tools that populate the warehouse, is of considerable interest to users and data warehouse administrators. It provides vital information about the timeliness of warehouse data and helps users track the sequence and timing of warehouse events.

Q.74. Explain multidimensional data models briefly. (R.G.P.V., June 2011)

Ans. [...] architecture. This model views data in the form of a hypercube. Data can be modeled and viewed in multiple dimensions by using a data cube. A data cube is defined by dimensions and facts.

Dimensions are the entities with respect to which an organization wants to keep records. For example, a sales data warehouse is created to keep records of the store's sales along with the dimensions item, time, branch and location. These dimensions enable the store to keep records of data such as monthly sales, branches, and locations where these items were sold. Each dimension [...]

[...] a 2-D data cube, that is, a table or spreadsheet [...]

Table 1.3 A 2-D View of Sales Data (Location = "Gwalior")

Time (quarters) | Item (types): TV | Computer | Phone | Game
Q1 | 120 | 980 | [...] | 318
Q2 | 420 | 784 | 46 | 416
Q3 | 314 | 516 | [...] | 572
Q4 | 218 | 800 | 110 | 388

Now, consider a case where we would like to view the sales data with a third dimension. Suppose we would like to view the data with the dimensions time, item and location, for the locations New York, Chicago, Agra and Gwalior. The 3-D data is represented as a series of 2-D tables as shown in table 1.4. Fig. 1.25 shows the same data in the form of a 3-D data cube.

Table 1.4 A 3-D View of Sales Data
(one 2-D table of Time (quarters) against Item (types) for each location, e.g. Location = "New York", Location = "Chicago", [...])

Fig. 1.25 A 3-D Data Cube of Sales Data

Now, consider that we would like to view our sales data with an additional fourth dimension, supplier. The 4-D cube can be represented as a series of 3-D cubes, as shown in fig. 1.26. [...] for multidimensional data storage.

Fig. 1.26 A 4-D Data Cube of Sales Data (one 3-D cube per supplier: Supplier = "SUP1", Supplier = "SUP2", [...])

Q.75. Discuss data cube technology briefly. (R.G.P.V., June 2004)
Or
Describe the term data cube. (R.G.P.V., Dec. 2010, June 2014)

Ans. Refer to Q.74.

[...] as follows -

Fig. 1.27 A Typical University Management Hierarchy

An example of the first relation, i.e., student, is given in table 1.5.

Table 1.5 The Relation Student

Student_id | Name | Country | DOB | Address
[...]
[...] | Sachin Singh | UAE | 9/9/1983 | Null
[...] | Rahul Kumar | India | 10/10/1984 | Null
9176543 | Saurav Gupta | UK | 11/11/1985 | 1, Captain Drive

Table 1.6 presents an example of the relation enrolment, in which the [...] related to the numbers [...] number of students admitted [...]. The fee is assumed to be in thousands.

Table 1.6 The Relation Enrolment

Student_id | Degree_id | Semester
[...]
8801234 | 3321 | 1999-02
8801234 | 3333 | 1999-02
8977665 | 3333 | 2000-02

Table 1.7 The Relation Degree
[...]

Table 1.8 A Two-dimensional Table of Aggregates for Semester 2000-01
Columns (Degree): BSc, LLB, MBBS, BCom, BIT, ALL. Rows (Country): Australia, India, Malaysia, Singapore, Sweden, UK, USA, ALL. [...]

Using this two-dimensional view we are able to find the number of students joining any degree from any country [...]. Some of the queries that we are quickly able to answer are -
- How many students started BIT in 2000-01?
- How many students joined from Singapore in 2000-01?

The data in table 1.8 is for a particular semester, 2000-01. A similar table [...] for other semesters. Let us assume that the data [...] table 1.9.

Table 1.9 A Two-dimensional Table of Aggregates for Semester 2001-01
Columns (Degree): BSc, LLB, MBBS, BCom, BIT, ALL. Rows (Country): Australia, India, Malaysia, Singapore, Sweden, UK, USA, ALL.

Country | BSc | LLB | MBBS | BCom | BIT | ALL
[...]
Sweden | 8 | 0 | 5 | 16 | 7 | 36
[...]

Let us now imagine that table 1.9 is put on top of table 1.8. We now have a three-dimensional cube with Semester as the vertical dimension. We now put on top of these two tables another table that gives the vertical sums, as shown in table 1.10.
Table 1.10 Two-dimensional Table of Aggregates for Both Semesters
Columns (Degree): BSc, LLB, MBBS, BCom, BIT, ALL. Rows (Country): Australia, India, Malaysia, Singapore, Sweden, UK, USA, ALL. [...]

Fig. 1.28 The Cube Formed [...]

Q.78. Briefly explain pattern warehouse.

Ans. A pattern warehouse is a [...]. Patterns [...] in a persistent manner. [...] within data warehouses. Fig. 1.30 explains the process flow of populating patterns within a pattern warehouse.

Fig. 1.30 Process Flow for Pattern [...]

Here, fig. 1.30 clearly explains [...]

Q.78. What is the concept of hierarchy?

Ans. A sequence of mappings from a set of low-level concepts to higher-level, more general concepts is known as a concept hierarchy. A concept hierarchy for the dimension location is shown in fig. 1.29. For the dimension location, the city values are Chicago, New York, [...]. Each city can be mapped to the state to which it belongs. For example, Gwalior can be mapped to MP. In the same way, states can be mapped to the country to which they belong. For example, [...] can be mapped to India. These mappings form a concept hierarchy for the dimension location, mapping a set of low-level concepts to higher-level, more general concepts.

For a given attribute or dimension, concept hierarchies can also be defined by discretizing or grouping values, resulting in a set-grouping hierarchy. For a given dimension or attribute, there can be more than one concept hierarchy on the basis of various user viewpoints. In data mining systems, concept hierarchies may be predefined if they are common to many [...]. In such systems, users are also provided with the flexibility [...] their particular needs. Concept hierarchies [...] levels of abstraction.

On the basis of statistical [...] of the data distribution, concept hierarchies can be generated automatically, or they may be provided manually by system users, domain experts, or knowledge engineers. For a user or a domain expert, manual definition of a concept hierarchy can be a tedious and time-consuming task. Fortunately, several discretization methods can be used to automatically generate or dynamically refine concept hierarchies for numerical attributes.
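The city-to-state-to-country mapping described for the dimension location can be sketched in a few lines of Python. This is only an illustrative sketch: the mapping Gwalior to MP to India follows the text, while the remaining hierarchy entries, the function names `generalize` and `roll_up`, and the sales figures are assumptions made for the example.

```python
from collections import defaultdict

# Concept hierarchy for the dimension "location": city -> state -> country.
# Gwalior -> MP -> India follows the text; the other entries are
# illustrative assumptions.
CITY_TO_STATE = {
    "Gwalior": "MP",
    "Agra": "UP",
    "Chicago": "Illinois",
    "New York": "New York State",
}
STATE_TO_COUNTRY = {
    "MP": "India",
    "UP": "India",
    "Illinois": "USA",
    "New York State": "USA",
}


def generalize(city, level):
    """Map a city to its ancestor concept at the requested level."""
    if level == "city":
        return city
    state = CITY_TO_STATE[city]
    if level == "state":
        return state
    if level == "country":
        return STATE_TO_COUNTRY[state]
    raise ValueError(f"unknown level: {level!r}")


def roll_up(sales_by_city, level):
    """Aggregate city-level sales to a coarser level of the hierarchy."""
    totals = defaultdict(int)
    for city, amount in sales_by_city.items():
        totals[generalize(city, level)] += amount
    return dict(totals)


# Hypothetical sales figures, not taken from the tables in the text.
sales = {"Gwalior": 120, "Agra": 80, "Chicago": 200, "New York": 150}

print(roll_up(sales, "state"))
print(roll_up(sales, "country"))
```

Rolling up to the country level adds the figures of all cities that share the same country, which is the sense in which low-level concepts are mapped to more general ones.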
Furthermore, many hierarchies for categorical attributes are implicit within the database schema and can be automatically defined at the schema definition level.

Relation of Concept Hierarchy to Web Mining - Tagging a document with a concept implicitly entails its tagging with all the ancestors of the concept in the hierarchy. Therefore, it is desired that a document should be tagged with the lowest concepts possible. The method to automatically tag the document to the hierarchy is a top-down approach. An evaluation function determines whether a document currently tagged to a node can also be tagged to any of its child nodes. If so, then the tag moves down the hierarchy till it cannot move any further.

[...] information for analysis purpose.

(ii) As patterns are [...] statistical analysis the whole process of data mining [...] obtaining certain results.

[...] spans up to TBs [...] management of data warehouse [...] form in data warehouse, so for even a [...]

Q.79. Give comparison among database, data warehouse and pattern warehouse.

Ans. The comparison among database, data warehouse and pattern warehouse is given in table 1.11.

Table 1.11 [...]

[...] | Database | Data Warehouse | Pattern Warehouse
Records accessed | [...] | Hundreds | Few (only [...])
Query language | SQL | DMQL | PMQL
[...]

OLAP SYSTEMS - BASIC CONCEPTS, OLAP QUERIES, TYPES OF OLAP SERVERS, OLAP OPERATIONS [...]

Characteristics of OLAP - The characteristics of OLAP systems are known as FASMI, the name derived from the first letters of the characteristics.

(i) Fast - [...] The speed of response in an OLAP system has to be like that of a search engine. If the response consumes more than, say, 2 seconds, the user is likely to move away to something else [...] the query. Achieving such performance is [...].

(ii) Analytic - Rich analytic functionality must be provided by an OLAP system and it is expected that most OLAP queries can be solved without programming. The system should be able to cope with any relevant query for the application and the user.

(iii) Shared - An OLAP system is a shared resource. However, it is unlikely to be shared by hundreds of users. An OLAP system may be used by [...]

[...] finding the most [...] be possible to find the most [...]. For example, considering the telecommunications industry and only considering one product, communications minutes, there is a large amount of data if a company wanted to analyze the sales of the product for every hour of the day (24 hours), differentiate between weekdays and weekends [...] regions to which calls are made into 50 regions [...]

[...] analyzing the costs associated [...] identify expenditures that produce a high return on investment (ROI). For example, recruiting a top salesperson may [...] costs, but the revenue generated by the salesperson may [...]

Q.3. Explain OLAP software architecture [...] diagram.

Ans. During the last half of the 1990s, demand for OLAP capabilities increased rapidly. Software vendors of all kinds attempted to capture [...] a bicycle factory. Fig. 2.1 shows three approaches supported by contemporary software packages. These approaches are as follows - [...]

Q.4. Explain the processing of OLAP queries.

Ans. The processing of an OLAP query [...] operations are to be applied - [...] multidimensional [...] multidimensional queries are mapped directly [...] structures because the storage model of a MOLAP server [...]
