
Unit 2 (Data Warehousing)

Uploaded by

rohit983999
COMPONENTS OR BUILDING BLOCKS OF DATA WAREHOUSE

Architecture is the proper arrangement of the elements. We build a data warehouse with software and hardware components. To suit the requirements of our organization, we arrange these building blocks in a particular way, and we may want to strengthen one part or another with extra tools and services. All of this depends on our circumstances.

Figure: Components or Building Blocks of Data Warehouse (source data; data staging; data storage with the data warehouse DBMS, multidimensional database, and data marts; metadata; information delivery with report/query, OLAP, and data mining).

The figure shows the essential elements of a typical warehouse. The Source Data component is shown on the left. The Data Staging element serves as the next building block. In the middle is the Data Storage component, which handles the data warehouse's data. This element not only stores and manages the data; it also keeps track of the data using the metadata repository. The Information Delivery component, shown on the right, consists of all the different ways of making the information in the data warehouse available to the users.

1. Source Data Component

Source data coming into the data warehouse may be grouped into four broad categories:

• Production Data: This type of data comes from the different operational systems of the enterprise. Based on the data requirements of the data warehouse, we choose segments of the data from the various operational systems.
• Internal Data: In each organization, users keep their "private" spreadsheets, reports, customer profiles, and sometimes even departmental databases. This is the internal data, part of which could be useful in a data warehouse.
• Archived Data: Operational systems are mainly intended to run the current business. In every operational system, we periodically take the old data and store it in archived files.
• External Data: Most executives depend on information from external sources for a large percentage of the information they use. They use statistics relating to their industry produced by external departments.

2. Data Staging Component

There are three primary functions that take place in the staging area:

• Data Extraction: This function has to deal with numerous data sources. We have to employ the appropriate technique for each data source.
• Data Transformation: Data for a data warehouse comes from many different sources. If data extraction for a data warehouse poses big challenges, data transformation presents even more significant ones. We perform several individual tasks as part of data transformation. First, we clean the data extracted from each source. Cleaning may be the correction of misspellings, the provision of default values for missing data elements, or the elimination of duplicates when we bring in the same data from several source systems. Standardization of data components forms a large part of data transformation. Data transformation also involves many forms of combining pieces of data from different sources: we combine data from a single source record or related data parts from many source records. On the other hand, data transformation also involves purging source data that is not useful and separating out source records into new combinations. Sorting and merging of data take place on a large scale in the data staging area. When the data transformation function ends, we have a collection of integrated data that is cleaned, standardized, and summarized.
• Data Loading: Two distinct categories of tasks form the data loading function. When we complete the structure and construction of the data warehouse and go live for the first time, we do the initial loading of the data into the data warehouse storage.

3. Data Storage Component

Data storage for data warehousing is a separate repository. The data repositories for the operational systems generally include only the current data. Also, these data repositories hold the data structured in highly normalized form for fast and efficient processing.

4. Information Delivery Component

The information delivery element is used to enable the process of subscribing to data warehouse files and having them transferred to one or more destinations according to some customer-specified scheduling algorithm.

Figure: Information delivery component.

5. Metadata Component

Metadata in a data warehouse is equal to the data dictionary or the data catalog in a database management system. In the data dictionary, we keep the data about the logical data structures, the data about the records and addresses, the information about the indexes, and so on.

6. Data Marts

A data mart includes a subset of corporate-wide data that is of value to a specific group of users. Its scope is confined to particular selected subjects. Data in a data warehouse should be fairly current, but not necessarily up to the minute, although developments in the data warehouse industry have made standard and incremental data dumps more achievable. Data marts are smaller than data warehouses and usually serve a single part of the organization. The current trend in data warehousing is to develop a data warehouse with several smaller related data marts for particular kinds of queries and reports.

7. Management and Control Component

The management and control elements coordinate the services and functions within the data warehouse. These components control the data transformation and the data transfer into the data warehouse storage. They also moderate the data delivery to the clients. They work with the database management systems and ensure that data is correctly saved in the repositories. They monitor the movement of information into the staging area and from there into the data warehouse storage itself.

METADATA

Metadata is data about data. In a data warehouse, it is equal to the data dictionary or the data catalog in a database management system. In the data dictionary, we keep the data about the logical data structures, the data about the records and addresses, the information about the indexes, and so on. Metadata can be stored in various forms, such as text, XML, or RDF, and can be organized using metadata standards and schemas.
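The comparison with a DBMS data dictionary can be made concrete with a small sketch. It assumes SQLite (via Python's standard sqlite3 module) as a stand-in for the warehouse DBMS; the table and index names are hypothetical and chosen only for illustration. The catalog queries return data about the logical structure and the indexes, not the business data itself.

```python
import sqlite3

# In-memory database standing in for a warehouse DBMS (hypothetical names).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer (customer_key INTEGER PRIMARY KEY, name TEXT, city TEXT)"
)
conn.execute("CREATE INDEX idx_customer_city ON customer (city)")

# Metadata about the logical data structure: column names and declared types.
# PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
columns = [(row[1], row[2]) for row in conn.execute("PRAGMA table_info(customer)")]

# Metadata about the indexes defined on the table.
# PRAGMA index_list rows start with (seq, name, ...).
indexes = [row[1] for row in conn.execute("PRAGMA index_list(customer)")]

print(columns)
print(indexes)
```

No customer rows were inserted, yet the catalog already describes the table: this is the sense in which metadata is data about data.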
There are many metadata standards that have been developed to facilitate the creation and management of metadata, such as Dublin Core, schema.org, and the Metadata Encoding and Transmission Standard (METS). Metadata schemas define the structure and format of metadata and provide a consistent framework for organizing and describing data.

Types of Metadata

There are many types of metadata that can be used to describe different aspects of data, such as its content, format, structure, and provenance. Some common types of metadata include:

1. Descriptive metadata: This type of metadata provides information about the content, structure, and format of data, and may include elements such as title, author, subject, and keywords. Descriptive metadata helps to identify and describe the content of data and can be used to improve the discoverability of data through search engines and other tools.
2. Administrative metadata: This type of metadata provides information about the management and technical characteristics of data, and may include elements such as file format, size, and creation date. Administrative metadata helps to manage and maintain data over time and can be used to support data governance and preservation.
3. Structural metadata: This type of metadata provides information about the relationships and organization of data, and may include elements such as links, tables of contents, and indices. Structural metadata helps to organize and connect data and can be used to facilitate the navigation and discovery of data.
4. Provenance metadata: This type of metadata provides information about the history and origin of data. It helps to provide context and credibility to data and can be used to support data governance and preservation.
5. Rights metadata: This type of metadata provides information about the ownership, licensing, and access controls of data, and may include elements such as copyright, permissions, and terms of use. Rights metadata helps to manage and protect the intellectual property rights of data and can be used to support data governance and compliance.
6. Educational metadata: This type of metadata provides information about the educational value and learning objectives of data, and may include elements such as learning outcomes, educational levels, and competencies. Educational metadata can be used to support the discovery and use of educational resources, and to support the design and evaluation of learning environments.

Examples of Metadata

Metadata is data that provides information about other data. Here are a few examples of metadata:

1. File metadata: This includes information about a file, such as its name, size, type, and creation date.
2. Image metadata: This includes information about an image, such as its resolution, color depth, and camera settings.
3. Music metadata: This includes information about a piece of music, such as its title, artist, album, and genre.
4. Video metadata: This includes information about a video, such as its length, resolution, and frame rate.
5. Document metadata: This includes information about a document, such as its author, title, and creation date.
6. Database metadata: This includes information about a database, such as its structure, tables, and fields.
7. Web metadata: This includes information about a web page, such as its title, keywords, and description.

Metadata Repository

A metadata repository should contain the following:

1. A description of the data warehouse structure: It includes the warehouse schema, views, dimensions, hierarchies, and derived data definitions, as well as data mart locations and contents.
2. Operational metadata: It includes data lineage (history of migrated data and the sequence of transformations applied to it), currency of data (active, archived, or purged), and monitoring information (warehouse usage statistics, error reports, and audit trails).
3. The algorithms used for summarization: These include measure and dimension definition algorithms.
4. The mapping from the operational environment to the data warehouse: It includes source descriptions, data partitions, data extraction, cleaning, and transformation rules and defaults, data refresh and purging rules, and security (user authorization and access control).
5. Data related to system performance: It includes indices and profiles that improve data access and retrieval performance, in addition to rules for the timing and scheduling of refresh, update, and replication cycles.
6. Business metadata: It includes business terms and definitions, data ownership information, and charging policies.

Benefits of Metadata Repository

A metadata repository is a centralized database or system that is used to store and manage metadata. Some of the benefits of using a metadata repository include:

1. Improved data quality: A metadata repository can ensure that metadata is consistently structured and accurate, which can improve the overall quality of the data.
2. Increased data accessibility: A metadata repository can make it easier for users to access and understand the data, by providing context and information about the data.
3. Enhanced data integration: A metadata repository can facilitate data integration by providing a common place to store and manage metadata from multiple sources.
4. Improved data governance: A metadata repository can help enforce metadata standards and policies, making it easier to ensure that data is being used and managed appropriately.
5. Enhanced data security: A metadata repository can help protect the privacy and security of metadata, by providing controls to restrict access to sensitive or confidential information.

Metadata repositories can provide many benefits in terms of improving the quality, accessibility, and management of data.

Challenges for Metadata Management

There are several challenges that can arise when managing metadata:

1. Lack of standardization: Different organizations or systems may use different standards or conventions for metadata, which can make it difficult to manage metadata effectively across different sources.
2. Data quality: Poorly structured or incorrect metadata can lead to problems with data quality, making it more difficult to use and understand the data.
3. Data integration: When integrating data from multiple sources, it can be challenging to ensure that the metadata is consistent and aligned across the different sources.
4. Data governance: Establishing and enforcing metadata standards and policies can be difficult, especially in large organizations with multiple stakeholders.
5. Data security: Ensuring the security and privacy of metadata can be a challenge, especially when working with sensitive or confidential information.

Metadata Management Software

Software for managing metadata makes it easier to assess, curate, collect, and store metadata. In order to enable data monitoring and accountability, organizations should automate data management. Examples of this kind of software include the following:

• SAP PowerDesigner by SAP: This data management system has a good level of stability. It is recognised for its ability to serve as a platform for model testing.
• SAP Information Steward by SAP: This solution's data insights make it valuable.
• IBM InfoSphere Information Governance Catalog by IBM: The ability to use Open IGC to build unique assets and data lineages is a key feature of this system.
• Alation Data Catalog by Alation: This provides a user-friendly, intuitive interface. It is valued for the queries it can publish in Structured Query Language (SQL).
• Informatica Enterprise Data Catalog by Informatica: The technology used by this solution, which can both scan and gather information from diverse sources, is highly respected.
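As a rough illustration of what a metadata repository might hold for one warehouse table, the sketch below keeps a record with ownership (business metadata), currency, and data lineage (operational metadata). It is a minimal sketch in plain Python: the class name, field names, and the sample table are all hypothetical and do not correspond to any of the products listed above.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TableMetadata:
    """One repository entry for a warehouse table (illustrative fields only)."""
    table_name: str
    owner: str                    # business metadata: data ownership
    currency: str = "active"      # operational metadata: active, archived, or purged
    lineage: List[str] = field(default_factory=list)  # transformations applied, in order

    def record_step(self, step: str) -> None:
        """Append one transformation to the table's lineage history."""
        self.lineage.append(step)


# The repository itself: a central place keyed by table name.
repository: Dict[str, TableMetadata] = {}

sales = TableMetadata(table_name="sales_fact", owner="finance")
repository[sales.table_name] = sales

sales.record_step("extracted from order system")
sales.record_step("cleaned: default values supplied for missing fields")
sales.record_step("loaded into warehouse storage")

for step in repository["sales_fact"].lineage:
    print(step)
```

Keeping such records in one place is what allows standards to be enforced consistently, which is the governance benefit described above.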
MULTIDIMENSIONAL DATA MODEL

Data warehouses and OLAP tools are based on a multidimensional data model. This model views data in the form of a data cube.

• A data cube allows data to be modeled and viewed in multiple dimensions. It is defined by dimensions and facts.
• Dimensions are the perspectives or entities with respect to which an organization wants to keep records. E.g., AllElectronics may create a sales data warehouse in order to keep records of the store's sales with respect to the dimensions time, item, branch, and location. These dimensions allow the store to keep track of things like monthly sales of items and the branches and locations at which the items were sold.
• Each dimension may have a table associated with it, called a dimension table. Dimension tables can be specified by users or experts, or automatically generated and adjusted based on data distributions.
• Facts are numeric measures. They are like quantities by which we want to analyze relationships between dimensions. E.g., facts for a sales data warehouse include dollars_sold (sales amount in dollars), units_sold (number of units sold), and amount_budgeted.
• The fact table contains the names of the facts, or measures, as well as keys to each of the related dimension tables.
• Although it is usually assumed that a cube is 3-D, in data warehousing the data cube is n-dimensional.

2-D representation of sales data for AllElectronics

Table: 2-D view of sales data according to time and item for location = "Vancouver". Rows are time (quarters), columns are item (types: home entertainment, computer, phone, security), and the measure displayed is dollars_sold (in thousands). For example, the first quarter reads 605, 825, 14, 400.

The table above shows AllElectronics sales data for items sold per quarter in the city of Vancouver. This 2-D table represents the sales with respect to the time dimension (organized in quarters) and the item dimension (organized according to the types of items sold). The fact or measure displayed is dollars_sold (in thousands).

Suppose we want to view the data in 3-D, according to time and item, as well as location, for the cities of Chicago, New York, Toronto, and Vancouver. The 3-D data can be represented as a series of 2-D tables, or equivalently as a 3-D cube.

Table: 3-D view of sales data for AllElectronics according to time, item, and location, shown as one 2-D table (time by item type) for each of location = "Chicago", "New York", "Toronto", and "Vancouver". The measure displayed is dollars_sold (in thousands).

Figure: Representation of the table as a cube, with dimensions location (cities), time (quarters), and item (types).

Suppose we would now like to view our sales data with an additional fourth dimension, such as supplier. Viewing things in 4-D becomes tricky. However, we can think of a 4-D cube as being a series of 3-D cubes. If we continue in this way, we may display any n-dimensional data as a series of (n-1)-dimensional cubes.

Figure: 4-D data cube shown as a series of 3-D cubes, one for each supplier.

The data cube is a metaphor for multidimensional data storage. The actual physical storage of such data may differ from its logical representation. The important thing to remember is that data cubes are n-dimensional and do not confine data to 3-D.

In data warehousing, a data cube such as each of those above is referred to as a cuboid. Given a set of dimensions, we can generate a cuboid for each of the possible subsets of the given dimensions. The result would form a lattice of cuboids, which is then referred to as a data cube.

Figure: Lattice of cuboids making up a 4-D data cube for time, item, location, and supplier, from the 0-D (apex) cuboid at the top to the 4-D (base) cuboid at the bottom.
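Since there is one cuboid for each subset of the dimensions, a cube with n dimensions (ignoring concept hierarchies) has 2^n cuboids. The following sketch enumerates them for the four dimensions used in these notes; it is a plain Python illustration, not part of any OLAP product.

```python
from itertools import combinations

dimensions = ["time", "item", "location", "supplier"]

# One cuboid per subset of the dimensions, grouped by how many dimensions it keeps.
lattice = {
    k: list(combinations(dimensions, k))
    for k in range(len(dimensions) + 1)
}

total = sum(len(cuboids) for cuboids in lattice.values())

for k, cuboids in lattice.items():
    print(f"{k}-D cuboids: {len(cuboids)}")
print("total cuboids:", total)
```

The single 0-D entry is the empty subset (everything summarized), and the single 4-D entry keeps all four dimensions.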
Each cuboid represents a different degree of summarization.

• The cuboid that holds the lowest level of summarization is called the base cuboid (e.g., in the figure, the 4-D cuboid for time, item, location, and supplier).
• The 0-D cuboid, which holds the highest level of summarization, is called the apex cuboid (e.g., total sales, summarized over all four dimensions).

SCHEMAS FOR MULTIDIMENSIONAL DATA MODELS

• The entity-relationship data model is commonly used in the design of relational databases, where a database schema consists of a set of entities and the relationships between them. Such a data model is appropriate for online transaction processing.
• A schema is a logical description of the entire database. It includes the name and description of all record types, including all associated data items and aggregates.
• A data warehouse, however, requires a concise, subject-oriented schema that facilitates online data analysis.
• The most popular data model for a data warehouse is a multidimensional model, which can exist in the form of a star schema, a snowflake schema, or a fact constellation schema.

SCHEMAS FOR MULTIDIMENSIONAL DATABASES

1. Star Schema

• The most common modeling paradigm is the star schema, in which the data warehouse contains (1) a large central table (fact table) containing the bulk of the data with no redundancy, and (2) a set of smaller attendant tables (dimension tables), one for each dimension. The schema graph resembles a starburst, with the dimension tables displayed in a radial pattern around the central fact table.
• Each dimension in a star schema is represented with only one dimension table.
• This dimension table contains the set of attributes.
• The diagram for this schema shows the sales data of a company with respect to four dimensions, namely time, item, branch, and location.
• There is a fact table at the center. It contains the keys to each of the four dimensions. The fact table also contains the measures, namely dollars_sold and units_sold.
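The star schema described above can be sketched with SQLite through Python's standard sqlite3 module. This is a minimal sketch under stated assumptions: the table names (sales_fact, time_dim, and so on), the single descriptive column in each dimension table, and all inserted rows are made up for illustration; only the four dimension keys and the two measures come from the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: one per dimension, each with a key and descriptive attributes.
CREATE TABLE time_dim     (time_key INTEGER PRIMARY KEY, quarter TEXT);
CREATE TABLE item_dim     (item_key INTEGER PRIMARY KEY, item_name TEXT);
CREATE TABLE branch_dim   (branch_key INTEGER PRIMARY KEY, branch_name TEXT);
CREATE TABLE location_dim (location_key INTEGER PRIMARY KEY, city TEXT);

-- Central fact table: keys to each dimension plus the numeric measures.
CREATE TABLE sales_fact (
    time_key     INTEGER REFERENCES time_dim(time_key),
    item_key     INTEGER REFERENCES item_dim(item_key),
    branch_key   INTEGER REFERENCES branch_dim(branch_key),
    location_key INTEGER REFERENCES location_dim(location_key),
    dollars_sold REAL,
    units_sold   INTEGER
);

-- Made-up sample rows.
INSERT INTO time_dim VALUES (1, 'Q1'), (2, 'Q2');
INSERT INTO item_dim VALUES (1, 'phone'), (2, 'modem');
INSERT INTO branch_dim VALUES (1, 'B1');
INSERT INTO location_dim VALUES (1, 'Vancouver'), (2, 'Toronto');
INSERT INTO sales_fact VALUES
    (1, 1, 1, 1, 100.0, 2),
    (1, 2, 1, 1, 250.0, 5),
    (2, 1, 1, 2, 300.0, 6);
""")

# A typical star-join: the fact table joined to two of its dimension tables.
rows = conn.execute("""
    SELECT l.city, t.quarter, SUM(f.dollars_sold), SUM(f.units_sold)
    FROM sales_fact AS f
    JOIN time_dim AS t     ON f.time_key = t.time_key
    JOIN location_dim AS l ON f.location_key = l.location_key
    GROUP BY l.city, t.quarter
    ORDER BY l.city, t.quarter
""").fetchall()

for row in rows:
    print(row)
```

Every query follows the same shape: start from the fact table and join outward to whichever dimension tables the question needs.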
Figure: Star schema for sales, with the sales fact table at the center and the time, item, branch, and location dimension tables arranged around it.

Note: Each dimension has only one dimension table, and each table holds a set of attributes. For example, the location dimension table contains the attribute set {location_key, street, city, province_or_state, country}. This constraint may cause data redundancy. For example, "Vancouver" and "Victoria" are both cities in the Canadian province of British Columbia. The entries for such cities may cause data redundancy along the attributes province_or_state and country.

2. Snowflake Schema

• Some dimension tables in the snowflake schema are normalized.
• The normalization splits up the data into additional tables.
• Unlike in the star schema, the dimension tables in a snowflake schema are normalized. For example, the item dimension table of the star schema is normalized and split into two dimension tables, namely the item and supplier tables. The resulting schema graph forms a shape similar to a snowflake.
• Now the item dimension table contains the attributes item_key, item_name, type, brand, and supplier_key.
• The supplier key is linked to the supplier dimension table. The supplier dimension table contains the attributes supplier_key and supplier_type.

Note: Due to normalization in the snowflake schema, redundancy is reduced; the tables therefore become easier to maintain and save storage space.

Figure: Snowflake schema for sales, with the item dimension table linked to a separate supplier dimension table.

3. Fact Constellation Schema

• Sophisticated applications may require multiple fact tables to share dimension tables. This kind of schema can be viewed as a collection of stars, and hence is called a galaxy schema or a fact constellation.
• The sales fact table is the same as that in the star schema.
• The shipping fact table has five dimensions, namely item_key, time_key, shipper_key, from_location, and to_location.
• The shipping fact table also contains two measures, namely dollars_cost and units_shipped.
• It is also possible to share dimension tables between fact tables. For example, the time, item, and location dimension tables are shared between the sales and shipping fact tables.

Figure: Fact constellation schema with the sales and shipping fact tables sharing the time, item, and location dimension tables, plus the branch and shipper dimension tables.

NOTE

• In data warehousing, there is a distinction between a data warehouse and a data mart.
• A data warehouse collects information about subjects that span the entire organization, such as customers, items, sales, assets, and personnel, and thus its scope is enterprise-wide.
• For data warehouses, the fact constellation schema is commonly used, since it can model multiple, interrelated subjects.
• A data mart, on the other hand, is a departmental subset of the data warehouse that focuses on selected subjects, and thus its scope is department-wide.
• For data marts, the star or snowflake schema is commonly used, since both are geared towards modeling single subjects, although the star schema is more popular and efficient.

ROLE OF CONCEPT HIERARCHIES IN DIMENSIONS

• A concept hierarchy defines a sequence of mappings from a set of low-level concepts to higher-level, more general concepts.
• Consider a concept hierarchy for the dimension location. City values for location include Vancouver, Toronto, New York, and Chicago. Each city, however, can be mapped to the province or state to which it belongs. For example, Vancouver can be mapped to British Columbia, and Chicago to Illinois. The provinces and states can in turn be mapped to the country (e.g., Canada or the United States) to which they belong.
• These mappings form a concept hierarchy for the dimension location, mapping a set of low-level concepts (i.e., cities) to higher-level, more general concepts (i.e., countries).

Figure: A concept hierarchy for location, with levels city, province_or_state, and country.

• Many concept hierarchies are implicit within the database schema. For example, suppose that the dimension location is described by the attributes number, street, city, province_or_state, zip_code, and country. These attributes are related by a total order, forming a concept hierarchy such as street < city < province_or_state < country.
• Alternatively, the attributes of a dimension may be organized in a partial order, forming a lattice. An example is the time dimension based on the attributes day, week, month, quarter, and year: day < {month < quarter; week} < year.
• A concept hierarchy that is a total or partial order among attributes in a database schema is called a schema hierarchy.
• Concept hierarchies that are common to many applications (e.g., for time) may be predefined in the data mining system.
• Concept hierarchies may also be defined by discretizing or grouping values for a given dimension or attribute, resulting in a set-grouping hierarchy.
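A concept hierarchy for location can be sketched as a chain of lookup tables, one per level. This is a minimal illustration in plain Python; the function name generalize and the table layout are assumptions made for the example, and the city, province or state, and country values are ordinary geographic facts.

```python
# Levels of the hierarchy, from most specific to most general.
LEVELS = ["city", "province_or_state", "country"]

# For each level, a mapping from a value to its parent one level up.
PARENT = {
    "city": {
        "Vancouver": "British Columbia",
        "Toronto": "Ontario",
        "New York": "New York",
        "Chicago": "Illinois",
    },
    "province_or_state": {
        "British Columbia": "Canada",
        "Ontario": "Canada",
        "New York": "USA",
        "Illinois": "USA",
    },
}


def generalize(value, from_level, to_level):
    """Map a value at from_level to its ancestor at the more general to_level."""
    start, end = LEVELS.index(from_level), LEVELS.index(to_level)
    if end < start:
        raise ValueError("to_level must be at or above from_level")
    for level in LEVELS[start:end]:
        value = PARENT[level][value]
    return value


print(generalize("Vancouver", "city", "province_or_state"))
print(generalize("Chicago", "city", "country"))
```

Because the levels form a total order, climbing from city to country is just two successive lookups.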
• In a set-grouping hierarchy, a total or partial order can be defined among groups of values. For the dimension price, for example, an interval ($X...$Y] denotes the range from $X (exclusive) to $Y (inclusive).
• There may be more than one concept hierarchy for a given attribute or dimension, based on different user viewpoints. For instance, a user may prefer to organize price by defining ranges for inexpensive, moderately priced, and expensive.
• Concept hierarchies may be provided manually by system users, domain experts, or knowledge engineers, or may be automatically generated based on statistical analysis of the data distribution.

OLAP OPERATIONS IN THE MULTIDIMENSIONAL DATA MODEL

In the multidimensional model, the records are organized into various dimensions, and each dimension includes multiple levels of abstraction described by concept hierarchies. This organization gives users the flexibility to view data from various perspectives. A number of OLAP data cube operations exist to materialize these different views, allowing interactive querying and analysis of the data at hand. Hence, OLAP provides a user-friendly environment for interactive data analysis.

Consider the OLAP operations that can be performed on multidimensional data. The figure shows a data cube for the sales of a shop. The cube contains the dimensions location, time, and item, where location is aggregated with respect to city values, time is aggregated with respect to quarters, and item is aggregated with respect to item types.

1. Roll-Up

The roll-up operation (also known as drill-up or aggregation operation) performs aggregation on a data cube, either by climbing up a concept hierarchy or by dimension reduction. Roll-up is like zooming out on the data cube. The figure shows the result of a roll-up operation performed on the dimension location. The hierarchy for location is defined as the order street < city < province_or_state < country. The roll-up operation aggregates the data by ascending the location hierarchy from the level of the city to the level of the country.
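A roll-up on location from cities to countries can be sketched as follows. The sales figures and the helper name roll_up are made up for illustration; the point is only that cells sharing the same parent in the hierarchy are added together.

```python
from collections import defaultdict

# Concept hierarchy step used for the roll-up: city -> country.
city_to_country = {
    "Chicago": "USA",
    "New York": "USA",
    "Toronto": "Canada",
    "Vancouver": "Canada",
}

# Cube cells keyed by (city, quarter); the values are made-up sales amounts.
sales = {
    ("Chicago", "Q1"): 440,
    ("New York", "Q1"): 1560,
    ("Toronto", "Q1"): 395,
    ("Vancouver", "Q1"): 605,
    ("Chicago", "Q2"): 500,
    ("Vancouver", "Q2"): 680,
}


def roll_up(cells, mapping):
    """Aggregate cells by replacing each city with its parent in the hierarchy."""
    totals = defaultdict(int)
    for (city, quarter), amount in cells.items():
        totals[(mapping[city], quarter)] += amount
    return dict(totals)


by_country = roll_up(sales, city_to_country)
for key in sorted(by_country):
    print(key, by_country[key])
```

The result has fewer, coarser cells than the input, which is the zooming-out effect described above.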
When a roll-up is performed by dimension reduction, one or more dimensions are removed from the cube. For example, consider a sales data cube having two dimensions, location and time. Roll-up may be performed by removing the time dimension, resulting in an aggregation of the total sales by location, rather than by location and by time.

Example

Consider the following cube illustrating the temperature of certain days recorded weekly:

Table: number of days at each recorded temperature, for Week 1 and Week 2.

Suppose we want to set up the levels hot (80-85), mild (70-75), and cool (64-69) for temperature in the above cube. To do this, we have to group the columns and add up the values according to the concept hierarchy. This operation is known as a roll-up. By doing this, we obtain the following cube:

Table: number of days at each temperature level (cool, mild, hot), for Week 1 and Week 2.

The roll-up operation groups the information by levels of temperature.

The following diagram illustrates how roll-up works.

Figure: Roll-up on location (from cities to countries), applied to a cube with dimensions location (cities), time (quarters), and item (types: Mobile, Modem, Phone, Security).

2. Drill-Down

The drill-down operation (also called roll-down) is the reverse of roll-up. Drill-down is like zooming in on the data cube. It navigates from less detailed records to more detailed data. Drill-down can be performed either by stepping down a concept hierarchy for a dimension or by adding additional dimensions.

The figure shows a drill-down operation performed on the dimension time by stepping down a concept hierarchy which is defined as day, month, quarter, and year. Drill-down occurs by descending the time hierarchy from the level of the quarter to the more detailed level of the month.

Example

Drill-down adds more detail to the given data.

Table: temperature level (cool, mild, hot) recorded for each day, Day 1 to Day 14.

The following diagram illustrates how drill-down works.

Figure: Drill-down on time (from quarters to months).

3. Slice

A slice is a subset of the cube corresponding to a single value for one dimension. For example, a slice operation is executed when the user wants a selection on one dimension of a three-dimensional cube, resulting in a two-dimensional slice. So, the slice operation performs a selection on one dimension of the given cube, thus resulting in a subcube.

Table: the resulting slice of the temperature cube, listed by day.

The following diagram illustrates how slice works.

Figure: Slice for time = "Q1", giving a 2-D table of location (cities) by item (types).

Here, slice is applied to the dimension time using the criterion time = "Q1". It forms a new subcube by fixing that one dimension.

4. Dice

The dice operation defines a subcube by performing a selection on two or more dimensions.

For example, applying the selection (time = day 3 OR time = day 4) AND (temperature = cool OR temperature = hot) to the original cube, we get the following subcube (still two-dimensional):

Table: the resulting subcube of temperature level by day.

Consider the following diagram, which shows the dice operation.

Figure: Dice on a cube with dimensions location (cities), time (quarters), and item (types).

The dice operation on the cube based on the following selection criteria involves three dimensions:

• (location = "Toronto" or "Vancouver")
• (time = "Q1" or "Q2")
• (item = "Mobile" or "Modem")

5. Pivot

The pivot operation is also called rotation. Pivot is a visualization operation that rotates the data axes in view to provide an alternative presentation of the data. It may involve swapping the rows and columns or moving one of the row dimensions into the column dimensions.

Table: the temperature data with the time and temperature axes exchanged.

Consider the following diagram, which shows the pivot operation.

Figure: Pivot of a 2-D slice, exchanging the location (cities) axis and the item (types) axis.
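The slice, dice, and pivot operations can be sketched on a tiny cube stored as a dictionary keyed by (location, time, item). The cell values and the helper names slice_cube, dice_cube, and pivot are made up for illustration; the selection criteria mirror the ones used in the text.

```python
# Cube cells keyed by (location, time, item); the values are made-up sales amounts.
cube = {
    ("Toronto", "Q1", "Mobile"): 10,
    ("Toronto", "Q1", "Modem"): 20,
    ("Toronto", "Q2", "Mobile"): 30,
    ("Vancouver", "Q1", "Mobile"): 40,
    ("Vancouver", "Q1", "Phone"): 50,
    ("Vancouver", "Q3", "Modem"): 60,
    ("Chicago", "Q1", "Mobile"): 70,
}


def slice_cube(cells, time):
    """Slice: fix the time dimension to one value; the result is 2-D (location, item)."""
    return {(loc, item): v for (loc, t, item), v in cells.items() if t == time}


def dice_cube(cells, locations, times, items):
    """Dice: select on several dimensions at once; the result is still a 3-D subcube."""
    return {
        key: v
        for key, v in cells.items()
        if key[0] in locations and key[1] in times and key[2] in items
    }


def pivot(table):
    """Pivot: swap the two axes of a 2-D table."""
    return {(col, row): v for (row, col), v in table.items()}


q1 = slice_cube(cube, "Q1")
diced = dice_cube(cube, {"Toronto", "Vancouver"}, {"Q1", "Q2"}, {"Mobile", "Modem"})
rotated = pivot(q1)

print(len(q1), "cells in the Q1 slice")
print(sorted(diced))
print(rotated[("Mobile", "Toronto")])
```

Note that slice reduces the number of dimensions, dice keeps all of them but narrows each one, and pivot changes only the presentation.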
Vancouver York Location (cities) al © scanned with OKEN Scanner Other OLAP Operations eae ng mr fatal eg etm to Other OLAP operations may contain rank in its, a well as calculating moving = 18 the tap-N or bottom-N elements in lists, ‘8 8 averages, growth res and interests, tral ate of tum, depreciation, mene conversions, and Satsticl OLAP offers analytical modeling capabilites, conning acaleultion engine for determining ratios, variance ete, and for computing measures across various dimensions It ean generate summarization, aggregation, and hierarchies at each granularity level and at every dimensions interscetion, OLAP also provide functional models for forecasting, twend analysis, and statistical analysis. In this context, the OLAP engine is powerful data analysis tool a © scanned with OKEN Scanner DATA WAREHOUSE ARCHITECTURE: 3 THER ARCL tune, ‘Dun Warsow i vettne tothe dats epost thi is anne separately fy som Mal er Data Warhoune rcietie conn he atoning ga EA pert Boxtorn Tier AMiaate Tier 1 Wp ter Query/report Analysis Top tier front-end tools \dministrat\yn Data warehouse Datamarts C= ‘8S 2 tottonti ~ S| dita warenou —” Lap J server Data Operational database Extemal Sources Three-Tier Data Warehouse Architecture Bottom Tier(Data sources and data storage) : 1. The bottom Tier usually consists of Data Sources and Data Storage. 2. Iisa warehouse database server. For Example RDBMS. : 3. In Bottom Tier, using the application program interface(called gateways), data is extracted from ‘operational and! external sources 4. Application Program Interface likes ODBC(Open Database Connection), OLL-DB(Oper-Linking and mbeding for Database), JOHCUava Database Connection) is supported. TA. stands for Extract, Transtonn, and Load, ral popular ETL tools include: IDM Infosphere {nformati MIL. Conflveet © scanned with OKEN Scanner AW. Microsoft SSIS, NV. 
Snaplogiec VIL Alooma Middle Tiers ‘The middle tie isan OLAP server that is typically implemented using either = Arcana OLAP (KOLAR) mode, aed mons DUMS at ap petation fom sanded des standard daa); oF 4 mutdinensinat OLA (MOLAP) model e, 8 pei porose server that ditety implements tnultigimensional data and operations), OLAP server models come i L thre different categories, neu i actively broken down nt several dimensions as part of elaional online analytical proessing(ROL-AP). Thi ‘sed when everything that is contained inthe epoitory is relational database system, 2. MOLAP: A different type of onlin analytical processing called multidimensional online analytical processing(MOLAP) includes directories and catalogs that are immediately integrated into its multidimensional dbase system. This fs used when all thats contained in the repository isthe multidimensional database system. HOLAP: A combination of relational and multidimensional online analytical processing paradigms is hybrid ‘online analytical processing(HOLAP). HOLAP isthe ideal option fora seamless functional flow across the database systems when the repository houses both the relational database management system and the ‘multidimensional database management system, ‘Top Tier: ‘The top tir isa front-end client layer, which includes query and reporting tools, analysis tools, and/or data mining, tools (eg, trend analysis, prediction, etc). Here area few Top Tier tools that ae often used: = SAPBW + SAS Business Intelligence = IBM Cognos # Crystal Reports ‘Microsoft BI Platform Advantages of Mult-Tier Archit Scalability: Various components can be added, deleted, or updated in accordance with the data warehouse’s shifting needs and specifications. 2. Better Performance: The several layer enable parallel and efficient processing, which enhances performance and reaction times. 3. Modularity: The architecture supports modular design, which facilitate the ereation, testing, and deployment of separate componen'. 4. 
Security: The data warehouse's overall security can be improved by applying different security measures to different layers.
5. Improved Resource Management: Different tiers can be tuned to use the appropriate hardware resources, cutting overall expenses and increasing effectiveness.
6. Easier Maintenance: Maintenance is simpler because individual components can be updated or maintained without affecting the data warehouse as a whole.
7. Improved Reliability: Using multiple tiers can offer redundancy and failover capabilities, enhancing the data warehouse's overall reliability.

DATA WAREHOUSE MODELS
From the perspective of data warehouse architecture, we have the following data warehouse models:
• Virtual Warehouse
• Data Mart
• Enterprise Warehouse

Enterprise Warehouse:
• An enterprise warehouse collects all of the information about subjects spanning the entire organization.
• It provides corporate-wide data integration, typically from one or several operational systems or external information providers, and is cross-functional in scope.
• It usually contains detailed data as well as summarized data, and can range in size from a few gigabytes to hundreds of gigabytes, terabytes, or beyond.
• An enterprise data warehouse may be implemented on traditional mainframes, computer superservers, or parallel architecture platforms. It requires extensive business modeling and may take years to design and build.

Data Mart:
• A data mart contains a subset of corporate-wide data that is important to a specific group of users.
• The scope is limited to specific selected subjects.
• For example, a marketing data mart may limit its subjects to customers, goods, and sales.
• The data contained in data marts is summarized.
• Data marts are typically implemented on low-cost departmental servers that are Unix/Linux or Windows-based.
• The implementation cycle of a data mart is more likely to be measured in weeks rather than months or years.
However, it may involve complex integration in the long run if its design and planning were not enterprise-wide.

Virtual Warehouse:
• A virtual warehouse is a set of views over an operational database.
• For efficient query processing, only a few of the possible summary views may be materialized (physically stored).
• Creating a virtual warehouse is easy, but it requires additional capacity on operational database servers.

WHAT ARE THE PROS AND CONS OF THE TOP-DOWN AND BOTTOM-UP APPROACHES TO DATA WAREHOUSE DEVELOPMENT?
The top-down development of an enterprise warehouse serves as a systematic solution and minimizes integration problems. However, it is expensive, takes a long time to develop, and lacks flexibility due to the difficulty of achieving consistency and consensus on a common data model for the entire organization.
The bottom-up approach to the design, development, and deployment of independent data marts provides flexibility, low cost, and rapid return on investment. It can, however, lead to problems when integrating various disparate data marts into a consistent enterprise data warehouse.
A recommended method for the development of data warehouse systems is to implement the warehouse in an incremental and evolutionary manner. First, a high-level corporate data model is defined within a reasonably short period (such as one or two months) that provides a corporate-wide, consistent, integrated view of data.

[Figure: data cleansing example, formatting variant entries of the same value (e.g. different spellings of "Plot No. 2") into a standard form and filling in the missing data.]

The following points must be rectified in this phase:
• Loose texts may hide valuable information. For example, "XYZ PVT Ltd" does not explicitly show that this is a Limited Partnership company.
• Different formats can be used for individual data. For example, a date can be saved as a string or as three integers.
• Matching, which associates equivalent fields in different sources.
• Selection, which reduces the number of source fields and records.
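The points above (one datum stored in different formats, matching equivalent fields across sources, and selecting only the fields that are needed) can be sketched in Python. This is a minimal illustration, not a real ETL tool: the source names `crm` and `billing`, the field names, and the sample records are all hypothetical.

```python
from datetime import date

# Hypothetical field mappings: each source names the same attribute differently.
FIELD_MAP = {
    "crm": {"cust_name": "customer_name", "dob": "birth_date"},
    "billing": {"name": "customer_name", "birth": "birth_date"},
}

# Selection: only these target fields are kept in the warehouse layout.
TARGET_FIELDS = ("customer_name", "birth_date")


def to_date(value):
    """Conversion: accept a date stored as an ISO string or as three integers."""
    if isinstance(value, str):
        return date.fromisoformat(value)   # e.g. "1990-04-23"
    year, month, day = value               # e.g. (1990, 4, 23)
    return date(year, month, day)


def transform(source, record):
    """Map one source record onto the common target layout."""
    mapping = FIELD_MAP[source]
    # Matching: rename source fields to their equivalent target fields.
    renamed = {mapping[k]: v for k, v in record.items() if k in mapping}
    # Selection: drop everything that is not a target field.
    row = {field: renamed.get(field) for field in TARGET_FIELDS}
    if row["birth_date"] is not None:
        row["birth_date"] = to_date(row["birth_date"])
    return row


crm_row = transform("crm", {"cust_name": "Asha Rao", "dob": "1990-04-23", "notes": "vip"})
billing_row = transform("billing", {"name": "Asha Rao", "birth": (1990, 4, 23)})
print(crm_row == billing_row)  # prints True: both sources now share one format
```

Keeping the mappings in a table (`FIELD_MAP`) rather than in code is one possible design choice; it lets a new source be added without touching the transformation logic.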
Cleansing and Transformation processes are often closely linked in ETL tools.

3. Loading
The load is the process of writing the data into the target database. During the load step, it is necessary to ensure that the load is performed correctly and with as few resources as possible.
Loading can be carried out in two ways:
1. Refresh: Data warehouse data is completely rewritten, which means that the older file is replaced. Refresh is usually used in combination with static extraction to populate a data warehouse initially.
2. Update: Only those changes applied to the source information are added to the data warehouse. An update is typically carried out without deleting or modifying preexisting data. This method is used in combination with incremental extraction to update data warehouses regularly.

[Handwritten page: notes on the transformation process; not legible in the scan.]

DATA QUALITY
What is Data Quality?
Data quality is defined as the degree to which data meets a company's expectations of accuracy, validity, completeness, and consistency.
By tracking data quality, a business can pinpoint potential issues harming quality and ensure that shared data is fit to be used for a given purpose.
When collected data fails to meet the company's expectations of accuracy, validity, completeness, and consistency, it can have massive negative impacts on customer service, employee productivity, and key strategies.

Why Is Data Quality Important?
Quality data is key to making accurate, informed decisions. While all data has some level of "quality," a variety of characteristics and factors determines the degree of data quality (high-quality versus low-quality).
Furthermore, different data quality characteristics will likely be more important to various stakeholders across the organization.
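Tracking data quality, as described above, usually starts with a few measurable scores. The sketch below computes three of them on a tiny hypothetical record set. The records, the score definitions, and the validity rule (an email must contain "@") are assumptions chosen for illustration, not a standard.

```python
# Hypothetical customer records; None marks a missing value.
records = [
    {"id": 1, "email": "a@example.com", "city": "Pune"},
    {"id": 2, "email": None,            "city": "Delhi"},
    {"id": 3, "email": "not-an-email",  "city": "Pune"},
    {"id": 1, "email": "a@example.com", "city": "Pune"},  # duplicate of the first row
]


def completeness(rows, field):
    """Share of rows in which the field is present (not None)."""
    return sum(r[field] is not None for r in rows) / len(rows)


def validity(rows, field, rule):
    """Share of non-missing values that satisfy the rule."""
    values = [r[field] for r in rows if r[field] is not None]
    return sum(rule(v) for v in values) / len(values) if values else 1.0


def uniqueness(rows):
    """Share of rows that remain after exact duplicates are removed."""
    distinct = {tuple(sorted(r.items())) for r in rows}
    return len(distinct) / len(rows)


scores = {
    "email completeness": completeness(records, "email"),
    "email validity": validity(records, "email", lambda v: "@" in v),
    "row uniqueness": uniqueness(records),
}
for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```

With these four rows the scores come out as 0.75, 0.67, and 0.75. Recomputing them after every load is one simple way to notice when quality starts to slip.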
A list of popular data quality characteristics and dimensions includes:
• Accuracy
• Completeness
• Consistency
• Integrity
• Reasonability
• Timeliness
• Uniqueness/Deduplication
• Validity
• Accessibility

Because data accuracy is a key attribute of high-quality data, a single inaccurate data point can wreak havoc across the entire system.
Without accuracy and reliability in data quality, executives cannot trust the data or make informed decisions. This can, in turn, increase operational costs and wreak havoc for downstream users. Analysts wind up relying on imperfect reports and drawing misguided conclusions from those findings. And the productivity of end users will diminish because flawed guidelines and practices are in place.
Poorly maintained data can lead to a variety of other problems, too. For example, out-of-date customer information may result in missed opportunities for up-selling or cross-selling products and services.
Low-quality data might also cause a company to ship its products to the wrong addresses, resulting in lowered customer satisfaction ratings, decreases in repeat sales, and higher costs due to reshipment.
And in more highly regulated industries, bad data can result in the company receiving fines for improper financial or regulatory compliance reporting.

Characteristics of Data Quality
• Accuracy: The data values must conform to actual, real-world scenarios and reflect real-world objects and events. Analysts should use verifiable sources to confirm the measure of accuracy, which is determined by how closely the values match the verified, correct information sources.
• Completeness: Completeness measures the data's ability to successfully deliver all the mandatory values that are available.
• Consistency: Data consistency describes the data's uniformity as it moves across applications and networks and when it comes from multiple sources. Consistency
also means that the same data stored in different places should agree.
• Uniqueness: Uniqueness means that no duplicated or redundant information overlaps across the datasets. No record in the dataset exists multiple times. Analysts use data cleansing and deduplication to help address a low uniqueness score.
• Validity: Data must be collected according to the organization's defined business rules and parameters. The information should also conform to the correct, accepted formats, and all dataset values should fall within the proper range.
• Redundancy: The same data should not be stored in more than one place in a system.
• Accessibility: The extent to which data is available, or easily and quickly retrievable.
• Objectivity: The extent to which the data is unbiased.
• Clarity: The data should be defined carefully so that it is easily understood.

DATA QUALITY CHALLENGES
Managing the data structure and optimization
There are many ways to process data so that it is structured in a way that will aid your future operations. As you add more and more data to your warehouse, structuring becomes increasingly difficult and can slow down the ETL process.
It also becomes increasingly difficult for system managers to qualify the data for advanced analytics. In terms of system optimization, it's important to carefully design and configure data analysis tools so that they are well suited to business needs.

Managing user expectations
As more information gets loaded into a data warehouse, management systems struggle more to find and analyze it. At the same time, business users expect refined and relevant results from any analysis they run. However, data warehouse performance can decrease as the data volume increases, which inevitably leads to reduced speed and efficiency.
It's your job to manage the expectations of your team so that they aren't frustrated when the buffering occurs.

The costs of data warehousing
A common problem with traditional data warehouses is the high failure rate.
According to a Gartner report, more than 50% of data warehouses fail at one point, not only because of the technical challenges and complex architecture but also because the projects fail to meet user requirements.
Organizations then face the same challenges when trying to update a data warehouse to accommodate new reporting requirements or data models. Even if such projects don't fail, they have high costs and long timelines. All these factors make traditional data warehouses inadequate for real-time data requirements and scalability.
On the other hand, if you go with a cloud-based data warehouse, all the maintenance rests on the cloud provider, while the cost is determined by the GBs used per month. Snowflake, for example, even has a flat rate of $23/TB/month. Google BigQuery's active storage costs $0.02 per GB per month, with the first 10 GB free each month.

Data quality
Maintaining quality data is difficult in a traditional data warehouse, where manual errors and missed updates lead to corrupt or obsolete data. This inevitably impacts business decisions and causes inaccurate data processing.
As businesses increasingly adopt digital transformation initiatives, they often run into the problem of unintended data silos. This occurs when departments rely heavily on cloud tools, accompanied by the democratization of technology, where each department is more likely to be responsible for purchasing and developing technologies for its own use.
Each of these silos represents another source system from which users need to pull, integrate, and analyze data to use it correctly in decision-making. To make matters worse, silos often don't follow the same set of business-wide standards, making data integration even more difficult.
And due to the democratization of cloud technologies, your organization might even have valuable data silos that IT doesn't know about.
Modern warehousing solutions can automate the data quality process, preventing data silos, outliers, manual errors, redundancy, and other data inconsistencies from occurring.
With an automated data warehousing solution, you are able to provide high-quality data that brings the most value to your organization.

Data accuracy
If you want your data insights and business intelligence to be reliable, the data that is analyzed in the warehouse needs to be accurate. Traditional data warehouses often suffer from inconsistencies that lead to inaccurate data as a result of manual processing and other errors.
There are several ways to get around this challenge, but the first and most important is to ensure that all data collection and storing processes are sound and that the data is verified before it enters the warehouse. Data accuracy can also be improved through regular testing.
However, with the right data warehousing solution that supports automated transfers, the chance of human error is minimal. If you use an ETL tool, not only can you prevent inaccurate data from entering your data warehouse, but you can also flag errors so that you can optimize your data accuracy at the core.

Adjusting to non-technical users
Traditional data warehouses are often complex for non-technical teams to use.
Ideally, everyone would master data analysis well enough to query data from any source and know how to use the data provided. But the reality is different. Non-technical users often need to interact with company data, which is not very efficient if you use a traditional data warehouse: they submit a request to the data team, wait for the data team to fulfill the request, and use the data once it is delivered to them.
The process might work in small teams, but for larger teams it's time-consuming and inefficient, as data teams can quickly become saturated with requests, leading to frustration and bottlenecks.
However, with modern, self-managed data warehouses and automated ETL tools, this challenge is easy to overcome. Data transfer tools like Whatagraph allow any user to move data from disparate sources to Google BigQuery without enlisting any help from the data or developer team. With point-and-click solutions, even non-technical users can operate a data warehouse without slowing down the workflow.

Data pollution
Sometimes the data gets corrupted in the source systems. Some of the common sources of data pollution are:
• System Conversions
• Data Aging
• Heterogeneous System Integration
• Poor Database Design
• Incomplete information at data entry
• Input errors
• Internationalization and Localization
• Fraud
• Lack of policy
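Two of the pollution sources listed above, incomplete information at data entry and input errors, can be caught before a row reaches the warehouse. The following is a minimal sketch of such a gate. The order rows, the required fields, and the "quantity must be positive" rule are hypothetical and stand in for whatever rules a real source system would need.

```python
# Hypothetical order rows arriving from a source system.
incoming = [
    {"order_id": 101, "quantity": 2,  "country": "IN"},
    {"order_id": 102, "quantity": -5, "country": "IN"},  # input error
    {"order_id": 103, "quantity": 1,  "country": None},  # incomplete entry
]

REQUIRED = ("order_id", "quantity", "country")


def problems(row):
    """Return a list of human-readable problems found in one row."""
    found = [f"missing {field}" for field in REQUIRED if row.get(field) is None]
    quantity = row.get("quantity")
    if quantity is not None and quantity <= 0:
        found.append("quantity must be positive")
    return found


accepted, rejected = [], []
for row in incoming:
    issues = problems(row)
    if issues:
        # Flag the row so it can be corrected in the source system.
        rejected.append((row.get("order_id"), issues))
    else:
        accepted.append(row)

print(len(accepted), "accepted;", "rejected:", rejected)
```

Rejected rows are kept together with their reasons rather than silently dropped, so the errors can be reported back and fixed where they originated.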
