Data Mining Notes Module2
Uploaded by
ChalaTamene
MODULE II: DATA PREPROCESSING

Data Preprocessing

Real-world databases are highly susceptible to noisy, missing, and inconsistent data because of their typically huge size (often several gigabytes or more) and their likely origin from multiple, heterogeneous sources. Low-quality data will lead to low-quality mining results. Data therefore needs to be preprocessed to help improve the quality of the data and, consequently, of the mining results.

There are several preprocessing techniques:
1. Data cleaning: can be applied to remove noise and correct inconsistencies in data.
2. Data integration: merges data from multiple sources into a coherent data store such as a data warehouse.
3. Data reduction: can reduce the data size by aggregating, eliminating redundant features, or clustering.
4. Data transformation (e.g., normalization): data are scaled to fall within a smaller range such as 0.0 to 1.0. This can improve the accuracy and efficiency of mining algorithms involving distance measurements.

These techniques are not mutually exclusive; they may work together. E.g., data cleaning may involve transformations to correct wrong data, such as transforming all entries for a date field to a common format.

Data Quality

Data has quality if it satisfies the requirements of the intended use. The three elements of data quality are:
(i) Accuracy
(ii) Completeness
(iii) Consistency

Inaccurate, incomplete, and inconsistent data are commonplace properties of large real-world databases and data warehouses. There are many possible reasons for inaccurate data (i.e., having incorrect attribute values):
- The data collection instruments used may be faulty.
- There may have been human or computer errors at data entry.
- Users may purposely submit incorrect data values for mandatory fields when they do not wish to submit personal information (e.g., by choosing the default value "January 1" displayed for birthday). This is known as disguised missing data.
- Errors in data transmission can also occur. There may be technology limitations, such as limited buffer size for coordinating synchronized data transfer and consumption.
- Incorrect data may also result from inconsistencies in naming conventions or data codes, or from inconsistent formats for input fields (e.g., date).
- Duplicate tuples also require data cleaning.

Major Tasks in Data Preprocessing

The major steps involved in data preprocessing are:
1) Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies.
2) Data integration: integration of multiple databases, data cubes, or files.
3) Data reduction: a reduced representation of the data set that is much smaller in volume yet produces the same analytical results.
   (i) Dimensionality reduction
   (ii) Numerosity reduction
   (iii) Data compression
4) Data transformation:
   (i) Normalization
   (ii) Concept hierarchy generation

[Figure: Steps in data preprocessing (data cleaning, data integration, data reduction, data transformation)]

I. Data Cleaning
(i) Missing values
(ii) Noisy data
(iii) Inconsistent data

(i) Missing Values

Many tuples have no recorded value for several attributes. The methods for handling this are:

1. Ignore the tuple: This is usually done when the class label is missing (if the mining task involves classification). This method is not very effective unless the tuple contains several attributes with missing values. It is especially poor when the percentage of missing values per attribute varies considerably.
2. Fill in the missing values manually: This is time consuming and may not be feasible for a large data set with many missing values.
3. Use a global constant to fill in the missing value: Replace all missing attribute values by the same constant, such as a label like "Unknown" or $-\infty$. If missing values are replaced by, say, "Unknown", the mining program may mistakenly think that they form an interesting concept. Although simple, this method is not recommended.
4. Use the attribute mean to fill in the missing value: Replace missing values with the mean of the values of that attribute.
5. Use the attribute mean or median for all samples belonging to the same class as the given tuple.
6. Use the most probable value to fill in the missing value: This may be determined with regression, inference-based tools using a Bayesian formalism, or decision tree induction.

Methods 3 to 6 bias the data, since the filled-in value may not be correct. Method 6 is a popular strategy because it uses the most information from the present data to predict missing values.

(ii) Noisy Data

Noise is a random error or variance in a measured variable. Data smoothing techniques smooth out the data to remove noise.

a) Binning
- Binning methods smooth a sorted data value by consulting its "neighborhood", that is, the values around it.
- The sorted values are distributed into a number of "buckets", or bins.
- Since binning methods consult the neighborhood of values, they perform local smoothing.
- Smoothing by bin means: each value in a bin is replaced by the mean value of the bin.
- Smoothing by bin boundaries: the minimum and maximum values in a given bin are identified as the bin boundaries. Each bin value is then replaced by the closest boundary value.
- The larger the bin width, the greater the effect of the smoothing.

E.g.: Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34. Bin size = 3.

Partition into equal-frequency bins:
Bin 1: 4, 8, 15
Bin 2: 21, 21, 24
Bin 3: 25, 28, 34

Smoothing by bin means:
Bin 1: 9, 9, 9
Bin 2: 22, 22, 22
Bin 3: 29, 29, 29

Smoothing by bin boundaries:
Bin 1: 4, 4, 15
Bin 2: 21, 21, 24
Bin 3: 25, 25, 34

Exercise: Suppose that the data for analysis includes the attribute age. The age values for the data tuples are (in increasing order) 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70. Use smoothing by bin means to smooth the above data, using a bin depth of 3.

Step 1: The data is already sorted.

Step 2: Partition the data into equal-depth bins of depth 3.
Bin 1: 13, 15, 16
Bin 2: 16, 19, 20
Bin 3: 20, 21, 22
Bin 4: 22, 25, 25
Bin 5: 25, 25, 30
Bin 6: 33, 33, 35
Bin 7: 35, 35, 35
Bin 8: 36, 40, 45
Bin 9: 46, 52, 70

Step 3: Calculate the arithmetic mean of each bin.

Step 4: Replace each of the values in each bin by the arithmetic mean calculated for the bin (means truncated to whole numbers).
Bin 1: 14, 14, 14
Bin 2: 18, 18, 18
Bin 3: 21, 21, 21
Bin 4: 24, 24, 24
Bin 5: 26, 26, 26
Bin 6: 33, 33, 33
Bin 7: 35, 35, 35
Bin 8: 40, 40, 40
Bin 9: 56, 56, 56

Clustering
- Outliers may be detected by clustering, where similar values are organized into groups, or "clusters".
- Intuitively, values that fall outside of the set of clusters may be considered outliers.

[Figure: Data points grouped into clusters; points lying outside every cluster are treated as outliers]
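The price example above can be reproduced with a short sketch. The function names here (`partition_equal_frequency`, `smooth_by_means`, `smooth_by_boundaries`) are illustrative and not part of the notes, and the tie rule for boundary smoothing (a value equally close to both boundaries goes to the lower one) is an assumption, since the notes do not specify one.

```python
def partition_equal_frequency(sorted_values, depth):
    """Split an already sorted list into consecutive bins of `depth` values."""
    return [sorted_values[i:i + depth] for i in range(0, len(sorted_values), depth)]


def smooth_by_means(bins):
    """Replace every value in a bin by the mean of that bin."""
    return [[sum(b) / len(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value by the closer of the bin minimum and maximum.

    Assumption: ties are resolved toward the lower boundary.
    """
    smoothed = []
    for b in bins:
        low, high = min(b), max(b)
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]   # sorted price data from the notes
bins = partition_equal_frequency(prices, depth=3)

print(bins)                        # the three equal-frequency bins
print(smooth_by_means(bins))       # bin means 9, 22, 29
print(smooth_by_boundaries(bins))  # [4, 4, 15], [21, 21, 24], [25, 25, 34]
```

The same functions can be applied to the age exercise with `depth=3`; note that they return exact means (for example 14.67 for the first bin) rather than truncated whole numbers.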
II. Data Integration

(i) Schema integration and object matching
- Schema integration can be tricky.
- The entity identification problem: how can equivalent real-world entities from multiple data sources be matched up? E.g., how can the data analyst be sure that customer_id in one database and cust_number in another refer to the same attribute?
- Metadata can be used to help avoid errors in schema integration.
- Examples of metadata for each attribute include the name, meaning, data type, and range of values permitted for the attribute, and the null rules for handling blank, zero, or null values.

(ii) Redundancy
- Redundancy is another important issue in data integration.
- Inconsistencies in attribute or dimension naming can also cause redundancies in the resulting data set.
- Such redundancies can be detected by correlation analysis. Given two attributes, such analysis can measure how strongly one attribute implies the other, based on the available data.
- For numeric attributes, we can evaluate the correlation between two attributes A and B by computing the correlation coefficient (Pearson's product moment coefficient):

$$r_{A,B} = \frac{\sum_{i=1}^{n} (a_i - \bar{A})(b_i - \bar{B})}{n\,\sigma_A\,\sigma_B}$$

where
  $n$ is the number of tuples,
  $a_i$, $b_i$ are the respective values of A and B in tuple $i$,
  $\bar{A}$, $\bar{B}$ are the respective mean values of A and B,
  $\sigma_A$, $\sigma_B$ are the respective standard deviations of A and B.

- If $r_{A,B} > 0$, A and B are positively correlated: the values of A increase as the values of B increase. The higher the value, the stronger the correlation; a high value indicates that A (or B) is redundant and may be removed.
- If $r_{A,B} = 0$, there is no linear correlation between A and B.
- If $r_{A,B} < 0$, A and B are negatively correlated: the values of one attribute increase as the values of the other attribute decrease. This means that each attribute discourages the other.
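As a minimal sketch of the correlation coefficient defined above, the function below uses the population standard deviation (dividing by n), which is what the n σ_A σ_B denominator implies. The function name `pearson_r` and the sample attribute values are illustrative assumptions, not data from the notes.

```python
import math


def pearson_r(a, b):
    """Pearson's product moment coefficient for two equal-length numeric attributes."""
    n = len(a)
    if n == 0 or n != len(b):
        raise ValueError("attributes must be non-empty and of equal length")
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    # Population standard deviations (divide by n, not n - 1).
    sd_a = math.sqrt(sum((x - mean_a) ** 2 for x in a) / n)
    sd_b = math.sqrt(sum((y - mean_b) ** 2 for y in b) / n)
    if sd_a == 0 or sd_b == 0:
        raise ValueError("correlation is undefined for a constant attribute")
    cross = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    return cross / (n * sd_a * sd_b)


# Hypothetical attributes: b is a rescaled copy of a, c moves against a.
a = [1, 2, 3, 4, 5]
b = [2, 4, 6, 8, 10]
c = [10, 8, 6, 4, 2]

print(pearson_r(a, b))  # close to +1: b is redundant given a
print(pearson_r(a, c))  # close to -1: negatively correlated
```

A coefficient close to +1 or -1 suggests that one of the two attributes could be dropped during integration; values near 0 only rule out a linear relationship.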
(iii) Detection and resolution of data value conflicts
- For the same real-world entity, attribute values from different sources may differ. This may be due to differences in representation, scaling, or encoding.
- E.g., a weight attribute may be stored in metric units in one system and in British imperial units in another. Prices of rooms in different cities may involve not only different currencies but also different services and taxes.

III. Data Transformation

Data are transformed or consolidated into forms appropriate for mining. This involves the following:

a) Smoothing: works to remove noise from the data. Techniques: binning, regression, clustering.
b) Aggregation: summary or aggregation operations are applied to the data. E.g., daily sales data may be aggregated so as to compute monthly and annual total amounts.
c) Generalization: low-level or primitive (raw) data are replaced by higher-level concepts through the use of concept hierarchies. E.g., a categorical attribute such as street can be generalized to higher-level concepts such as city or country.
d) Attribute construction (feature construction): new attributes are constructed and added from the given set of attributes to help the mining process. E.g., we may wish to add the attribute area based on the attributes height and width.
e) Normalization: the attribute data are scaled so as to fall within a smaller range, such as -1.0 to 1.0 or 0.0 to 1.0.

Normalization
- Normalization attempts to give all attributes an equal weight.
- It is useful for applications such as classification algorithms involving neural networks, or distance measurements such as nearest-neighbor classification and clustering.

Methods for normalization:
(a) Min-max normalization
(b) Z-score normalization
(c) Normalization by decimal scaling

(a) Min-max normalization
- Performs a linear transformation on the original data.
- Suppose that $min_A$ and $max_A$ are the minimum and maximum values of an attribute A. Min-max normalization maps a value $v$ of A to $v'$ in the range $[new\_min_A, new\_max_A]$ by computing

$$v' = \frac{v - min_A}{max_A - min_A}\,(new\_max_A - new\_min_A) + new\_min_A$$

- This normalization preserves the relationships among the original data values.
- It will encounter an "out-of-bounds" error if a future input case for normalization falls outside of the original data range for A.

E.g.: Suppose that the minimum and maximum values for the attribute income are $12,000 and $98,000, respectively. Map a value of $73,600 to the range [0.0, 1.0]. By min-max normalization,

$$v' = \frac{73600 - 12000}{98000 - 12000}\,(1.0 - 0) + 0 = 0.716$$

(b) Z-score normalization (zero-mean normalization)
- The values for an attribute A are normalized based on the mean and standard deviation of A:

$$v' = \frac{v - \bar{A}}{\sigma_A}$$

where $\bar{A}$ and $\sigma_A$ are the mean and standard deviation of attribute A.
- This method of normalization is useful when the actual minimum and maximum of attribute A are unknown, or when there are outliers that dominate the min-max normalization.

E.g.: Suppose that the mean and standard deviation of the values for income are $54,000 and $16,000, respectively. Using z-score normalization, the value $73,600 is transformed to

$$\frac{73600 - 54000}{16000} = 1.225$$

(c) Normalization by decimal scaling
- Normalizes by moving the decimal point of the values of attribute A.
- The number of decimal points moved depends on the maximum absolute value of A.
- A value $v$ of attribute A is normalized to $v'$ by computing

$$v' = \frac{v}{10^{j}}$$

where $j$ is the smallest integer such that $\max(|v'|) < 1$.

E.g.: Suppose that the recorded values of A range from -986 to 917. The maximum absolute value of A is 986. To normalize by decimal scaling, we therefore divide each value by 1000 (i.e., $j = 3$), so that -986 normalizes to -0.986 and 917 normalizes to 0.917.

Exercise: Use the following methods to normalize the given set of data: 200, 300, 400, 600, 1000.
1) Min-max normalization by setting min = 0 and max = 1
2) Z-score normalization
3) Z-score normalization using the mean absolute deviation instead of the standard deviation
4) Normalization by decimal scaling

1) Min-max normalization
200' = (200 - 200) / (1000 - 200) x (1 - 0) + 0 = 0
300' = (300 - 200) / (1000 - 200) x (1 - 0) + 0 = 0.125
400' = (400 - 200) / (1000 - 200) x (1 - 0) + 0 = 0.25
600' = (600 - 200) / (1000 - 200) x (1 - 0) + 0 = 0.50
1000' = (1000 - 200) / (1000 - 200) x (1 - 0) + 0 = 1.0
The values after min-max normalization are (0, 0.125, 0.25, 0.50, 1.0).

2) Z-score normalization
Mean = (200 + 300 + 400 + 600 + 1000) / 5 = 2500 / 5 = 500

$$\sigma = \sqrt{\frac{(200-500)^2 + (300-500)^2 + (400-500)^2 + (600-500)^2 + (1000-500)^2}{5}} = 282.8$$

200' = (200 - 500) / 282.8 = -1.06
300' = (300 - 500) / 282.8 = -0.707
400' = (400 - 500) / 282.8 = -0.354
600' = (600 - 500) / 282.8 = 0.354
1000' = (1000 - 500) / 282.8 = 1.77
The values after z-score normalization are (-1.06, -0.707, -0.354, 0.354, 1.77).

3) Z-score normalization using the mean absolute deviation
Mean absolute deviation s = (|200 - 500| + |300 - 500| + |400 - 500| + |600 - 500| + |1000 - 500|) / 5 = 1200 / 5 = 240
200' = (200 - 500) / 240 = -1.25
300' = (300 - 500) / 240 = -0.833
400' = (400 - 500) / 240 = -0.417
600' = (600 - 500) / 240 = 0.417
1000' = (1000 - 500) / 240 = 2.08
The values after normalization are (-1.25, -0.833, -0.417, 0.417, 2.08).

4) Normalization by decimal scaling
Find the smallest integer $j$ such that $\max(|v'|) < 1$, then divide each value by $10^{j}$.
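The four parts of the exercise can be checked with a small sketch. The function names below are illustrative, and the z-score uses the population standard deviation (dividing by n), which matches the value 282.8 in the worked solution. For decimal scaling the sketch simply searches for the smallest j with max(|v'|) < 1; for this data set that gives j = 4, because 1000 / 10^3 = 1 is not strictly less than 1.

```python
import math


def min_max(values, new_min=0.0, new_max=1.0):
    """Min-max normalization into [new_min, new_max]; assumes max(values) > min(values)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]


def z_score(values):
    """Z-score normalization using the population standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd for v in values]


def z_score_mad(values):
    """Z-score variant that divides by the mean absolute deviation."""
    n = len(values)
    mean = sum(values) / n
    mad = sum(abs(v - mean) for v in values) / n
    return [(v - mean) / mad for v in values]


def decimal_scaling(values):
    """Return (j, scaled values) for the smallest j with max(|v / 10**j|) < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return j, [v / 10 ** j for v in values]


data = [200, 300, 400, 600, 1000]

print([round(v, 3) for v in min_max(data)])
print([round(v, 3) for v in z_score(data)])
print([round(v, 3) for v in z_score_mad(data)])
print(decimal_scaling(data))
```

Applying `decimal_scaling` to the earlier example (values from -986 to 917) gives j = 3, in agreement with the division by 1000 shown there.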
- Suppose the data consist of sales per quarter for the years 2008 to 2010.
- If we are interested in the annual sales (total per year) rather than the total per quarter, the data can be aggregated so that the resulting data summarize the total sales per year instead of per quarter.
- The resulting data set is smaller in volume, without loss of the information necessary for the analysis task.

[Figure: Quarterly sales data aggregated into annual totals, shown for the year 2010]

- Data cubes store multidimensional, aggregated information.
- The following figure shows a data cube for multidimensional analysis of sales data with respect to annual sales per item type for each AllElectronics branch.

[Figure: Data cube of sales. Dimensions: branch (A, B, C, D), year (2002, 2003, 2004), item type (home entertainment 568, computer 750, phone 150, security 50)]

- Each cell holds an aggregate data value, corresponding to a data point in multidimensional space.
- Concept hierarchies may exist for each attribute, allowing the analysis of data at multiple abstraction levels.
- Each higher abstraction level further reduces the resulting data size.
- When replying to data mining requests, the smallest available cuboid relevant to the given task should be used.

Dimensionality Reduction
- Data sets for analysis may contain hundreds of attributes, many of which may be irrelevant to the mining task or redundant. Such attributes can slow down the mining process.
- Dimensionality reduction reduces the data set size by removing such attributes or dimensions from it.

Attribute Subset Selection
- A method for finding a minimum set of attributes such that the resulting probability distribution of the data classes is as close as possible to the original distribution obtained using all attributes.
- Mining on a reduced set of attributes has a benefit: it reduces the number of attributes appearing in the discovered patterns, helping to make the patterns easier to understand.

Attribute subset selection includes the following techniques:

1) Stepwise forward selection
- The procedure starts with an empty set of attributes as the reduced set.
- The best of the original attributes is determined and added to the reduced set.
- At each subsequent iteration or step, the best of the remaining original attributes is added to the set.

2) Stepwise backward elimination
- The procedure starts with the full set of attributes.
- At each step, it removes the worst attribute remaining in the set.

3) Combination of forward selection and backward elimination
- The stepwise forward selection and backward elimination methods are combined so that, at each step, the procedure selects the best attribute and removes the worst from among the remaining attributes.

[Figure: Greedy methods for attribute subset selection (forward selection, backward elimination, decision tree induction). Each starts from the initial attribute set {A1, A2, A3, A4, A5, A6} and ends with the reduced attribute set {A1, A4, A6}]

4) Decision tree induction
- Decision tree algorithms such as ID3, C4.5, and CART were originally intended for classification.
- Decision tree induction constructs a flowchart-like structure where each internal (non-leaf) node represents a test on an attribute, each branch corresponds to an outcome of the test, and each external (leaf) node denotes a class prediction.
- At each node, the algorithm chooses the "best" attribute to partition the data into individual classes.
- When decision tree induction is used for attribute subset selection, a tree is constructed from the given data. All attributes that do not appear in the tree are assumed to be irrelevant. The set of attributes appearing in the tree forms the reduced subset of attributes.

The stopping criteria for these methods may vary. The procedure may employ a threshold on the measure used to determine when to stop the attribute selection process.

Data Compression
- Transformations are applied so as to obtain a reduced or compressed representation of the original data.
- Two types:
  Lossless: the original data can be reconstructed from the compressed data without any loss of information.
  Lossy: only an approximation of the original data can be reconstructed.
- Lossy data compression methods: wavelet transforms and principal component analysis.

Wavelet Transform
- The discrete wavelet transform (DWT) is a linear signal processing technique that, when applied to a data vector X, transforms it to a numerically different vector X' of wavelet coefficients.
- The two vectors are of the same length.
- When applying this technique to data reduction, consider each tuple as an n-dimensional data vector, X = (x1, x2, x3, ..., xn), depicting n measurements made on the tuple from n database attributes.
- Wavelet transformed data can be truncated.
- A compressed approximation of the data can be retained by storing only a small fraction of the strongest of the wavelet coefficients.
- E.g., all wavelet coefficients larger than some user-specified threshold can be retained, and all other coefficients are set to 0.
- The resulting data representation is very sparse, so operations that can take advantage of data sparsity are computationally very fast if performed in wavelet space.
- This technique also works to remove noise without smoothing out the main features of the data, making it effective for data cleaning.
- Given a set of coefficients, an approximation of the original data can be constructed by applying the inverse of the DWT used.
- The DWT is closely related to the discrete Fourier transform (DFT), a signal processing technique involving sines and cosines.
- The DWT achieves better lossy compression and provides a more accurate approximation of the original data for the same number of coefficients retained.
- The DWT requires less space than the DFT.

[Figure: Examples of wavelet families: (a) Haar-2, (b) Daubechies-4]

- Popular wavelet transforms include Haar-2, Daubechies-4, Daubechies-6, etc.
- The general procedure for applying a discrete wavelet transform uses a hierarchical pyramid algorithm that halves the data at each iteration, resulting in fast computational speed.

Hierarchical Pyramid Algorithm
1. The length L of the input data vector must be an integer power of 2. This condition can be met by padding the data vector with zeros as necessary.
2. Each transform involves applying two functions. The first applies some data smoothing, such as a sum or weighted average. The second performs a weighted difference.
3. The two functions are applied to pairs of data points in X, which results in two data sets of length L/2.
4. The two functions are recursively applied to the data sets obtained in the previous loop, until the resulting data sets obtained are of length 2.
5. Selected values from the data sets obtained in the previous iterations are designated the wavelet coefficients of the transformed data.

- Equivalently, a matrix multiplication can be applied to the input data to obtain the wavelet coefficients, where the matrix used depends on the given DWT.
- Wavelet transforms can be applied to multidimensional data such as a data cube. This is done by first applying the transform to the first dimension, then to the second, and so on.

Applications of wavelet transforms:
- Compression of fingerprint images
- Computer vision
- Analysis of time series data
- Data cleaning

E.g.: Data: 56, 40, 8, 24, 48, 48, 40, 16. At each step, every pair of values is replaced by its average, and half of the pairwise difference is kept as a detail coefficient.

56  40   8  24  48  48  40  16      (original data)
48  16  48  28 |  8  -8   0  12     (averages | differences)
32  38 | 16  10 |  8  -8   0  12
35 | -3 | 16  10 |  8  -8   0  12   (transformed data)

Inverse transform: we start from the bottom row. We add and subtract the difference to and from the mean, and repeat the process up to the first row.

35  -3  16  10   8  -8   0  12
32  38  16  10   8  -8   0  12
48  16  48  28   8  -8   0  12
56  40   8  24  48  48  40  16

Principal Component Analysis
- Refer to your machine learning notes.

Numerosity Reduction
- Reduce the data volume by choosing alternative, smaller forms of data representation, e.g., histograms, clustering, sampling.

Regression and Log-Linear Models

1) Linear regression: y = wx + b
- The two regression coefficients, w and b, specify the line and are to be estimated using the data at hand.
- They are estimated by applying the least squares criterion to the known values of Y1, Y2, ..., X1, X2, ....

2) Multiple regression: y = b0 + b1 x1 + b2 x2
- Many nonlinear functions can be transformed into the above.

3) Log-linear models
- Approximate discrete multidimensional probability distributions.
- Estimate the probability of each point (tuple) in the multidimensional space.

1. Histograms
* Use binning to approximate data distributions
* Popular form of data reduction
* A histogram for an attribute A partitions the data distribution of A into disjoint subsets, or buckets
* If each bucket represents only a single attribute-value/frequency pair, the buckets are called singleton buckets

Histogram Analysis - Example
Price data: 1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 12, 14, 14, 15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30

[Figure: Equal-width histogram for price with a bucket size of $10; histogram for price using singleton buckets]

Histogram Analysis
Partitioning rules:
- Equal-width: the width of each bucket range is uniform
- Equal-frequency: the frequency of each bucket is constant
- MaxDiff: consider the difference between each pair of adjacent values
- V-optimal: the histogram with the least variance. Histogram variance is a weighted sum of the original values that each bucket represents, where bucket weights equal the number of values in the bucket

2. Clustering
* Diameter: a measure of cluster quality, the maximum distance between any two objects in the cluster
* Centroid distance: an alternative measure of cluster quality, the average distance of each cluster object from the cluster centroid
* Can have hierarchical clustering and be stored in multi-dimensional index tree structures

3. Sampling
* Data reduction technique. Allows a large data set to be represented by a much smaller random sample of the data.
* Sampling: obtaining a small sample s to represent the whole data set N
* Allows a mining algorithm to run in complexity that is potentially sub-linear to the size of the data
* Key principle: choose a representative subset of the data
  - Simple random sampling may have very poor performance in the presence of skew
  - Develop adaptive sampling methods, e.g., stratified sampling

Types of Sampling
* Simple random sampling
  - There is an equal probability of selecting any particular item
* Sampling without replacement
  - Once an object is selected, it is removed from the population
* Sampling with replacement
  - A selected object is not removed from the population
* Stratified sampling
  - Partition the data set, and draw samples from each partition (proportionally, i.e., approximately the same percentage of the data)

Exercise: For the following data: 0, 4, 12, 16, 16, 18, 24, 26, 28
a) Construct an equal-width histogram of 3 buckets.
b) Construct an equal-frequency histogram of 3 buckets.

* Data: 0, 4, 12, 16, 16, 18, 24, 26, 28
* Equal width
  - Bin 1: 0, 4 [-, 10)
  - Bin 2: 12, 16, 16, 18 [10, 20)
  - Bin 3: 24, 26, 28 [20, +)
* Equal frequency
  - Bin 1: 0, 4, 12 [-, 14)
  - Bin 2: 16, 16, 18 [14, 21)
  - Bin 3: 24, 26, 28 [21, +)

[Figure: Equal-width and equal-frequency histograms for the data above]

Discretization
Three types of attributes:
- Nominal: values from an unordered set, e.g., color, profession
- Ordinal: values from an ordered set, e.g., military or academic rank
- Numeric: numbers, e.g., integer or real values

Discretization: divide the range of a continuous attribute into intervals
- Interval labels can then be used to replace actual data values
- Reduce data size by discretization
- Supervised vs. unsupervised
- Split (top-down) vs. merge (bottom-up)
- Discretization can be performed recursively on an attribute
- Prepare for further analysis, e.g., classification

Data Discretization Methods
Typical methods (all the methods can be applied recursively):
- Binning: top-down split, unsupervised
- Histogram analysis: top-down split, unsupervised
- Clustering analysis: unsupervised, top-down split or bottom-up merge
- Decision-tree analysis: supervised, top-down split
- Correlation (e.g., $\chi^2$) analysis: supervised, bottom-up merge

Simple Discretization: Binning
* Equal-width (distance) partitioning
  - Divides the range into N intervals of equal size: uniform grid
  - If A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B - A) / N
  - The most straightforward, but outliers may dominate the presentation
  - Skewed data is not handled well
* Equal-depth (frequency) partitioning
  - Divides the range into N intervals, each containing approximately the same number of samples
  - Good data scaling
  - Managing categorical attributes can be tricky

Binning Methods for Data Smoothing
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
  - Bin 1: 4, 8, 9, 15
  - Bin 2: 21, 21, 24, 25
  - Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
  - Bin 1: 9, 9, 9, 9
  - Bin 2: 23, 23, 23, 23
  - Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
  - Bin 1: 4, 4, 4, 15
  - Bin 2: 21, 21, 25, 25
  - Bin 3: 26, 26, 26, 34

Discretization by Classification & Correlation Analysis
* Classification (e.g., decision tree analysis)
  - Supervised: given class labels, e.g., cancerous vs. benign
  - Uses entropy to determine the split point (discretization point)
  - Top-down, recursive split
* Correlation analysis (e.g., ChiMerge: $\chi^2$-based discretization)
  - Bottom-up merge: find the best neighboring intervals (those having similar distributions of classes, i.e., low $\chi^2$ values) to merge
  - Merge is performed recursively, until a predefined stopping condition is met

Concept Hierarchy Generation
* A concept hierarchy organizes concepts (i.e., attribute values) hierarchically and is usually associated with each dimension in a data warehouse
* Concept hierarchies facilitate drilling and rolling in data warehouses to view data at multiple granularities
* Concept hierarchy formation: recursively reduce the data by collecting and replacing low-level concepts (such as numeric values for age) by higher-level concepts (such as youth, adult, or senior)
* Concept hierarchies can be explicitly specified by domain experts and/or data warehouse designers
* Concept hierarchies can be automatically formed for both numeric and nominal data

Concept Hierarchy Generation for Nominal Data
* Specification of a partial/total ordering of attributes explicitly at the schema level by users or experts
  - street < city < state < country
* Specification of a hierarchy for a set of values by explicit data grouping
  - {Urbana, Champaign, Chicago} < Illinois
* Specification of only a partial set of attributes
  - E.g., only street < city, not others
* Automatic generation of hierarchies (or attribute levels) by the analysis of the number of distinct values
  - E.g., for a set of attributes: {street, city, state, country}

Automatic Concept Hierarchy Generation
* Some hierarchies can be automatically generated based on the analysis of the number of distinct values per attribute in the data set
  - The attribute with the most distinct values is placed at the lowest level of the hierarchy
  - Exceptions, e.g., weekday, month, quarter, year

[Figure: Generated hierarchy, from highest to lowest level: country > province_or_state (365 distinct values) > city (3,567 distinct values) > street (674,339 distinct values)]
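The distinct-value heuristic above can be sketched in a few lines. The function name `order_by_distinct_values` and the five sample records are hypothetical, chosen only so that each attribute has a different number of distinct values; they are not data from the notes.

```python
def order_by_distinct_values(records, attributes):
    """Order attributes from the highest hierarchy level (fewest distinct values)
    to the lowest level (most distinct values)."""
    counts = {attr: len({rec[attr] for rec in records}) for attr in attributes}
    hierarchy = sorted(attributes, key=lambda attr: counts[attr])
    return hierarchy, counts


# Hypothetical location records used only for illustration.
records = [
    {"street": "1 Main St", "city": "Urbana", "state": "Illinois", "country": "USA"},
    {"street": "2 Oak Ave", "city": "Urbana", "state": "Illinois", "country": "USA"},
    {"street": "3 Lake Dr", "city": "Chicago", "state": "Illinois", "country": "USA"},
    {"street": "4 Pine Rd", "city": "Madison", "state": "Wisconsin", "country": "USA"},
    {"street": "5 Elm St", "city": "Madison", "state": "Wisconsin", "country": "USA"},
]

hierarchy, counts = order_by_distinct_values(
    records, ["street", "city", "state", "country"]
)
print(counts)                  # distinct values per attribute
print(" > ".join(hierarchy))   # country > state > city > street
```

As the exceptions listed above suggest, the ordering is only a heuristic: for time attributes such as weekday (7 distinct values) and month (12 distinct values), counting distinct values would put the attributes in the wrong order, so the generated hierarchy should be reviewed by a person.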
You might also like
The Subtle Art of Not Giving a F*ck: A Counterintuitive Approach to Living a Good Life
From Everand
The Subtle Art of Not Giving a F*ck: A Counterintuitive Approach to Living a Good Life
Mark Manson
4/5 (6125)
Principles: Life and Work
From Everand
Principles: Life and Work
Ray Dalio
4/5 (627)
The Gifts of Imperfection: Let Go of Who You Think You're Supposed to Be and Embrace Who You Are
From Everand
The Gifts of Imperfection: Let Go of Who You Think You're Supposed to Be and Embrace Who You Are
Brene Brown
4/5 (1148)
Never Split the Difference: Negotiating As If Your Life Depended On It
From Everand
Never Split the Difference: Negotiating As If Your Life Depended On It
Chris Voss
4.5/5 (932)
The Glass Castle: A Memoir
From Everand
The Glass Castle: A Memoir
Jeannette Walls
4/5 (8214)
Grit: The Power of Passion and Perseverance
From Everand
Grit: The Power of Passion and Perseverance
Angela Duckworth
4/5 (631)
Sing, Unburied, Sing: A Novel
From Everand
Sing, Unburied, Sing: A Novel
Jesmyn Ward
4/5 (1253)
The Perks of Being a Wallflower
From Everand
The Perks of Being a Wallflower
Stephen Chbosky
4/5 (8365)
Shoe Dog: A Memoir by the Creator of Nike
From Everand
Shoe Dog: A Memoir by the Creator of Nike
Phil Knight
4.5/5 (860)
Her Body and Other Parties: Stories
From Everand
Her Body and Other Parties: Stories
Carmen Maria Machado
4/5 (877)