Assignment :- 3
Name :- Sanjivanee Mohan Jarhad
Sto :- Frm ja
Rol) mo !~ ¢€24036
Advanced Database Management System
Q.1) Explain in detail the KDD process.
- KDD stands for Knowledge Discovery in Databases.
- The KDD process involves discovering useful (hidden) patterns, exploratory data analysis, information harvesting and unsupervised pattern recognition.
- KDD is the process of finding useful information and patterns in data.
- Data mining is the use of algorithms to extract the information and patterns derived by the KDD process.
- The KDD process consists of the following steps:
i) Selection
- The target subset of data and the attributes of interest are identified by examining the entire raw dataset.
- The data needed for the data mining process may be obtained from many different and heterogeneous data sources.
- This step obtains the data from various databases, files and non-electronic sources.
ii) Data Cleaning
- Noise and outliers are removed, field values are transformed to common units, and some new fields are created by combining existing fields to facilitate analysis.
- The data is typically put in a relational format, and several tables might be combined in a denormalization step.
iii) Data Preprocessing
- The data to be used by the process may have incorrect or missing data.
- There may be anomalous data from multiple sources, and different activities may be performed at this time.
- Erroneous data may be corrected or removed, whereas missing data must be supplied or predicted.
iv) Data Transformation
- Data from different sources must be converted into a common format for processing.
- The data may be encoded or transformed into more usable formats.
- Data reduction may be used to reduce the number of possible data values being considered.
v) Data Mining
- Based on the data mining task being performed, this step applies algorithms to the transformed data to generate the desired results.
- We apply the DM algorithms to extract interesting patterns.
- E.g., apply association rule learning to identify items that are frequently bought together.
vi) Interpretation/Evaluation
- The patterns are presented to end-users in an understandable form.
- How the DM results are presented to the users is extremely important because the usefulness of the results depends on it.
- Various visualization and GUI strategies are used.
[Figure: stages of the KDD process, from initial data through transformed data to knowledge]
- The KDD process is vital for discovering valuable insights from data, which can significantly enhance decision-making and strategy development in various fields.
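The six KDD steps can be sketched end to end on a toy example. This is only an illustrative sketch: the records, the variable names and the choice of pair counting as the mining task are assumptions made here, not part of the notes.

```python
from collections import Counter
from itertools import combinations

# Hypothetical raw records: (customer, item, price as entered).
raw = [
    ("c1", "bread", "2.0"), ("c1", "milk", "1.5"),
    ("c2", "bread", "2.0"), ("c2", "milk", "1.5"), ("c2", "eggs", None),
    ("c3", "milk", "1.5"), ("c3", "eggs", "3.0"),
    ("c4", "bread", "2.0"), ("c4", "milk", "1.5"),
]

# i) Selection: keep only the attributes of interest (customer, item).
selected = [(cust, item) for cust, item, _price in raw]

# ii)/iii) Cleaning and preprocessing: drop duplicates and empty items.
cleaned = sorted({(c, i) for c, i in selected if i})

# iv) Transformation: one basket (set of items) per customer.
baskets = {}
for cust, item in cleaned:
    baskets.setdefault(cust, set()).add(item)

# v) Data mining: count how often each pair of items is bought together.
pair_counts = Counter()
for items in baskets.values():
    pair_counts.update(combinations(sorted(items), 2))

# vi) Interpretation/evaluation: present the most frequent pair.
top_pair, top_count = pair_counts.most_common(1)[0]
print(f"{top_pair[0]} and {top_pair[1]} are bought together in {top_count} baskets")
```

Each stage feeds the next, so a mistake made early (for example, keeping duplicate rows) would distort the pattern reported at the end.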
Q.2) Write a short note on:
i) Data Processing
- Data processing involves transforming raw data into useful information.
- Data processing means processing data in order to convert its format.
- Data processing is the series of operations performed on data to transform, analyse and organize it into a useful format.
- The goal is to extract pertinent information that can be applied in decision-making processes or to support existing technologies.
- The following is the process of data processing:
Data Collection -> Data Preparation -> Data Input -> Data Processing -> Data Output and Interpretation -> Data Storage
The following are the types of data processing:
a) Real-Time Processing
- It is essential for tasks that require immediate handling of data upon receipt, providing instant processing and feedback.
b) Multiprocessing (Parallel Processing)
- It involves utilizing multiple processing units or CPUs to handle various tasks simultaneously. This approach allows for more efficient data processing, particularly for complex computations that can be broken down into smaller, concurrent tasks, thereby speeding up overall processing time.
c) Distributed Processing
- It involves spreading computational tasks across multiple computers or devices to improve processing speed and reliability.
d) Manual Data Processing
- It requires human intervention for the input, processing and output of data, typically without the aid of electronic devices.
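The idea behind parallel processing, breaking one computation into smaller concurrent tasks and combining the partial results, can be sketched as below. This is a minimal sketch under assumptions: the function names are hypothetical, and a thread pool stands in for multiple processing units (for CPU-bound work in CPython, a process pool would normally be used instead).

```python
from concurrent.futures import ThreadPoolExecutor

def sum_of_squares(chunk):
    """Work done on one chunk: an independent, smaller task."""
    return sum(x * x for x in chunk)

def parallel_sum_of_squares(values, workers=4):
    """Split values into chunks, process them concurrently, combine results."""
    size = max(1, len(values) // workers)
    chunks = [values[i:i + size] for i in range(0, len(values), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial = list(pool.map(sum_of_squares, chunks))
    return sum(partial)

data = list(range(1, 101))
total = parallel_sum_of_squares(data)
print(total)
```

The split only works because each chunk can be processed without knowing anything about the others.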
ii) Data Cleaning
- Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate or incomplete data within a dataset.
- It is a process that removes data that does not belong in your dataset.
- Data cleaning is also known as data cleansing and data scrubbing.
The following are the ways to clean data:
i) Remove duplicate or irrelevant observations
ii) Fix structural errors
iii) Filter unwanted outliers
iv) Handle missing data
v) Validate and QA
- It is a crucial step in the data preparation process, playing an important role in ensuring the accuracy, reliability and overall quality of a dataset.
- Moreover, clean data facilitates more effective modeling and pattern recognition, as algorithms perform optimally when fed high-quality, error-free input.
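The five ways of cleaning data listed above can be sketched on a small table. The records, the plausible age range and the choice of the median as fill value are assumptions made for illustration only.

```python
from statistics import median

# Hypothetical records: (name, city, age). None marks a missing value.
records = [
    ("Asha", "Pune ", 23),
    ("Asha", "Pune ", 23),      # exact duplicate
    ("Ravi", "pune", None),     # missing age, inconsistent capitalisation
    ("Meera", "Mumbai", 25),
    ("Kiran", "MUMBAI", 240),   # implausible age (outlier)
    ("Neha", "Pune", 27),
]

# i) Remove duplicate observations, keeping first occurrences in order.
deduped = list(dict.fromkeys(records))

# ii) Fix structural errors: trim whitespace, use one capitalisation.
fixed = [(name, city.strip().title(), age) for name, city, age in deduped]

# iii) Filter unwanted outliers: assumed plausible age range of 0 to 120.
filtered = [r for r in fixed if r[2] is None or 0 <= r[2] <= 120]

# iv) Handle missing data: fill missing ages with the median known age.
known_ages = [age for _, _, age in filtered if age is not None]
fill = median(known_ages)
clean = [(n, c, fill if a is None else a) for n, c, a in filtered]

# v) Validate: every record now has a tidy city and a plausible age.
assert all(c == c.strip().title() and 0 <= a <= 120 for _, c, a in clean)
print(clean)
```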
The following are the techniques for cleaning data:
a) Ignore the tuples
b) Fill in the missing value
c) Binning method
d) Regression
e) Clustering
The following are some usages of data cleaning:
a) Data integration
b) Data migration
c) Data transformation
d) Data debugging in ETL processes
- The characteristics of data cleaning are accuracy, coherence, validity, uniformity, data verification and clean data backflow.
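As an illustration of the binning method, the sketch below smooths sorted values by replacing each one with the mean of its bin. The function name, the equal-size bins and the sample prices are assumptions chosen for the example.

```python
def smooth_by_bin_means(values, bin_size):
    """Sort values, split into equal-size bins, replace each value by its bin mean."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        mean = sum(bin_values) / len(bin_values)
        smoothed.extend([mean] * len(bin_values))
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
smoothed = smooth_by_bin_means(prices, bin_size=3)
print(smoothed)
```

Noise inside each bin is averaged away, while the overall trend across bins is kept.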
iii) Data Reduction
- Data reduction is a method of reducing the size of the original data so that it may be represented in a much smaller space.
- It aims to define the data more compactly.
- While preserving the integrity of the original data, data reduction techniques are utilised to generate a reduced version of the dataset, i.e. one substantially smaller in volume.
- The outcome of DM is unaffected by data reduction.
- The goal of data reduction is to make information more compact.
- It is easier to use sophisticated and computationally expensive algorithms when the amount of data is smaller.
- The data can be reduced in terms of the number of rows (records) or the number of columns (dimensions).
Following are some techniques of data reduction:
a) Data Sampling
b) Dimensionality Reduction
c) Data Compression
d) Data Discretization
e) Feature Selection
Methods of Data Reduction
i) Data Cube Aggregation
ii) Dimension Reduction
- Step-wise forward selection
- Step-wise backward selection
- Combination of forward and backward selection
iii) Data Compression
iv) Numerosity Reduction
v) Discretization & Concept Hierarchy Operation
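Two of the ideas above, data cube aggregation and data sampling, can be sketched as follows. The sales figures, the roll-up from quarters to years and the sample size are assumptions for illustration.

```python
import random
from collections import defaultdict

# Hypothetical quarterly sales: (year, quarter, amount).
sales = [
    (2022, "Q1", 200), (2022, "Q2", 250), (2022, "Q3", 300), (2022, "Q4", 350),
    (2023, "Q1", 220), (2023, "Q2", 270), (2023, "Q3", 310), (2023, "Q4", 400),
]

# Data cube aggregation: roll quarterly rows up to one row per year.
yearly = defaultdict(int)
for year, _quarter, amount in sales:
    yearly[year] += amount
yearly = dict(yearly)

# Data sampling: simple random sample without replacement (fewer rows).
rng = random.Random(0)          # fixed seed so the run is repeatable
sample = rng.sample(sales, k=4)

print(yearly)
print(len(sales), "rows reduced to", len(sample), "sampled rows")
```

Aggregation reduces the data without losing the yearly totals, whereas sampling keeps the original level of detail for only a subset of rows.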
Q.3) Explain Dimensionality Reduction in detail.
- Dimensionality reduction is a process or technique to reduce the number of dimensions or features in a dataset.
- The goal is to decrease the dataset's complexity by reducing the number of features while keeping the most important properties of the original data.
- Dimensionality reduction is a way of converting a higher-dimensional dataset into a lower-dimensional dataset while ensuring that it provides similar information.
- The following are the methods of dimensionality reduction:
i) Feature selection
ii) Feature extraction
i) Feature Selection
- It selects a subset of the original features based on their importance.
- The common methods include:
a) Filter methods
These use statistical measures like correlation, mutual information or a variance threshold; e.g. variance thresholding removes low-variance features.
b) Wrapper methods
These evaluate subsets of features based on model performance, e.g. RFE (Recursive Feature Elimination).
c) Embedded methods
These include techniques like lasso regression that penalize less important features during model training.
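A filter method such as variance thresholding can be sketched in a few lines. The dataset, the feature names and the threshold value of 0.2 are hypothetical; population variance is used here as one reasonable choice.

```python
from statistics import pvariance

def variance_threshold(columns, threshold):
    """Return names of features whose population variance exceeds threshold."""
    return [name for name, values in columns.items()
            if pvariance(values) > threshold]

# Hypothetical dataset stored column-wise: feature name -> values.
columns = {
    "age":      [23, 35, 41, 29, 52],
    "constant": [1, 1, 1, 1, 1],              # zero variance
    "flag":     [0, 0, 0, 0, 1],              # variance 0.16
    "income":   [30.0, 42.0, 58.0, 35.0, 71.0],
}

kept = variance_threshold(columns, threshold=0.2)
print(kept)
```

Features that barely change across records carry little information for separating them, which is why they are dropped first.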
ii) Feature Extraction
- It involves transforming the data from a high-dimensional space to a lower-dimensional space.
i) PCA (Principal Component Analysis)
- Projects data onto orthogonal axes called principal components.
- The first component captures the maximum variance, followed by the remaining components in decreasing order of variance.
ii) LDA (Linear Discriminant Analysis)
- Similar to PCA but focuses on maximizing class separability.
- It is a supervised method that aims to find a linear combination of features that best separates different classes.
iii) t-Distributed Stochastic Neighbour Embedding (t-SNE)
- A non-linear technique for high-dimensional data visualization.
- It reduces dimensionality while maintaining the relative distance between data points.
iv) Autoencoders
- Neural networks designed to learn a compressed representation of input data.
- They reduce dimensionality through a bottleneck layer.
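To make the PCA description concrete, the sketch below finds the first principal component of a small 2-D dataset and projects the points onto it, reducing two features to one. It is a minimal sketch that assumes 2-D input with some spread, and it uses the closed-form eigenvalue of a 2x2 covariance matrix rather than a general-purpose library routine; all names are hypothetical.

```python
import math

def center(points):
    """Subtract the mean of each coordinate from every 2-D point."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    return [(x - mean_x, y - mean_y) for x, y in points]

def first_principal_component(centered):
    """Return (unit direction, share of total variance) for centred 2-D data."""
    n = len(centered)
    # Entries of the 2x2 sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum(x * x for x, _ in centered) / (n - 1)
    syy = sum(y * y for _, y in centered) / (n - 1)
    sxy = sum(x * y for x, y in centered) / (n - 1)
    # Largest eigenvalue of a symmetric 2x2 matrix, in closed form.
    largest = (sxx + syy) / 2 + math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    # Matching eigenvector (largest - syy, sxy); axis-aligned if sxy is zero.
    if sxy != 0:
        vx, vy = largest - syy, sxy
    else:
        vx, vy = (1.0, 0.0) if sxx >= syy else (0.0, 1.0)
    norm = math.hypot(vx, vy)
    return (vx / norm, vy / norm), largest / (sxx + syy)

points = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8)]
centered = center(points)
direction, variance_share = first_principal_component(centered)
# Project each 2-D point onto the first component: 2 features become 1.
scores = [x * direction[0] + y * direction[1] for x, y in centered]
print(direction, round(variance_share, 4))
print(scores)
```

Because the sample points lie almost on a straight line, nearly all of the variance is captured by the single projected coordinate.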
Applications
i) Data compression
ii) Noise reduction
iii) Improved visualization
iv) Preprocessing step
Q.4) Explain the application of data mining in Business Intelligence.
- Data mining plays a pivotal role in Business Intelligence by enabling organizations to extract valuable insights from large datasets.
- These insights support better decision-making, improve operational efficiency, enhance customer experiences, and drive competitive advantage.
- The following are the applications of DM in BI:
i) Customer Relationship Management (CRM)
ii) Sales and marketing optimization
iii) Fraud detection & prevention
iv) Financial analysis and forecasting
v) Supply chain and operations management
vi) Human Resource Analytics
i) CRM
- DM helps businesses understand customer behavior and preferences, leading to improved customer satisfaction and loyalty.
- It includes applications such as customer segmentation, churn prediction and personalized marketing.
ii) Sales and Marketing Optimization
- DM provides insights into sales patterns, marketing effectiveness and market trends.
- It is based on Market Basket Analysis, campaign effectiveness and sales forecasting.
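Market basket analysis is usually summarised with the support, confidence and lift of a rule such as "bread -> butter". The sketch below computes them for hypothetical baskets; the function name and data are assumptions, and the antecedent and consequent are assumed to occur in at least one basket.

```python
def rule_metrics(baskets, antecedent, consequent):
    """Support, confidence and lift of the rule antecedent -> consequent."""
    n = len(baskets)
    a = sum(1 for b in baskets if antecedent <= b)            # baskets with antecedent
    c = sum(1 for b in baskets if consequent <= b)            # baskets with consequent
    both = sum(1 for b in baskets if (antecedent | consequent) <= b)
    support = both / n              # share of all baskets containing both
    confidence = both / a           # share of antecedent baskets that also have consequent
    lift = confidence / (c / n)     # confidence relative to the consequent's base rate
    return support, confidence, lift

baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]
support, confidence, lift = rule_metrics(baskets, {"bread"}, {"butter"})
print(support, confidence, lift)
```

A lift above 1 suggests the two items are bought together more often than their individual popularity alone would explain.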
iii) Fraud Detection and Prevention
- DM identifies unusual patterns in financial transactions which could indicate fraud.
- Predictive models assign risk scores to transactions or entities, helping prioritize investigations.
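As a rough illustration of risk scoring, the sketch below scores each transaction by how far its amount lies from the average. This is a simple statistical stand-in, not a trained predictive model; the amounts, the function name and the cut-off of 2.0 are assumptions.

```python
from statistics import mean, pstdev

def risk_scores(amounts):
    """Score each transaction by its distance from the mean, in standard deviations."""
    mu, sigma = mean(amounts), pstdev(amounts)
    return [abs(x - mu) / sigma if sigma else 0.0 for x in amounts]

amounts = [120, 95, 110, 105, 130, 2500, 100, 115]
scores = risk_scores(amounts)

# Rank transactions so the highest-risk ones are investigated first.
ranked = sorted(zip(scores, amounts), reverse=True)
flagged = [amount for score, amount in ranked if score > 2.0]
print(flagged)
```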
iv) Financial Analysis and Forecasting
- Predictive analytics determines creditworthiness by analyzing historical data, payment history and demographic information.
- Clustering and classification algorithms help identify the most and least profitable products, services or customer segments.
v) Supply Chain and Operations Management
- DM optimizes various aspects of supply chain operations, enhancing efficiency and reducing costs.
- DM identifies inefficiencies in production or logistics and suggests improvements.
Q.5) Explain Bayes' Theorem and the Naive Bayes Theorem.
- The Naive Bayes algorithm is based on conditional probabilities.
- It uses Bayes' Theorem, a formula that calculates a probability by counting the frequency of values and combinations of values in the historical data.
- The advantage of Naive Bayes is its speed.
- The "Naive" part of the name indicates the simplifying assumption made by the Naive Bayes classifier.
- The "Bayes" part of the name refers to Reverend Thomas Bayes, an 18th century statistician and theologian who formulated Bayes' theorem.
- The general statement of Bayes' Theorem is: "The conditional probability of an event A, given the occurrence of another event B, is equal to the product of the probability of B given A and the probability of A, divided by the probability of event B", i.e.
\[ P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)} \]
Derivation of Bayes' Theorem
- The proof of Bayes' Theorem is given as follows. According to the conditional probability formula,
\[ P(E_i \mid A) = \frac{P(E_i \cap A)}{P(A)} \tag{1} \]
- Then, by using the multiplication rule of probability, we get
\[ P(E_i \cap A) = P(E_i)\, P(A \mid E_i) \tag{2} \]
- Now, by the total probability theorem,
\[ P(A) = \sum_{k} P(E_k)\, P(A \mid E_k) \tag{3} \]
- Substituting the values of \(P(E_i \cap A)\) and \(P(A)\) from eq. (2) and eq. (3) into eq. (1), we get
\[ P(E_i \mid A) = \frac{P(E_i)\, P(A \mid E_i)}{\sum_{k} P(E_k)\, P(A \mid E_k)} \]
- Bayes' theorem is also known as the formula for the probability of "causes".
- The \(E_i\)'s are a partition of the sample space, and at any given time only one of the events occurs.
Terms related to Bayes' theorem:
i) Hypotheses
- The events happening in the sample space, \(E_1, E_2, \ldots, E_n\), are called the hypotheses.
ii) Priori Probability
- It is the initial probability of an event occurring before any new data is taken into account. \(P(E_i)\) is the priori probability of hypothesis \(E_i\).
iii) Posterior Probability
- Considering new information, \(P(E_i \mid A)\) is considered as the posterior probability of hypothesis \(E_i\).
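A small numeric example shows how priori probabilities turn into posterior probabilities. The three machines, their production shares and their defect rates are invented for illustration; the function name is hypothetical.

```python
def posterior(priors, likelihoods):
    """Apply Bayes' theorem: return P(E_i | A) for every hypothesis E_i."""
    # P(A) by the total probability theorem.
    evidence = sum(p * l for p, l in zip(priors, likelihoods))
    return [p * l / evidence for p, l in zip(priors, likelihoods)]

# Hypothetical setting: machines E1, E2, E3 make 50%, 30% and 20% of all items,
# with defect rates P(A | E_i) of 1%, 2% and 3%. A = "the item is defective".
priors = [0.5, 0.3, 0.2]
likelihoods = [0.01, 0.02, 0.03]

post = posterior(priors, likelihoods)
print([round(p, 4) for p in post])
```

Although E1 makes half of all items, its low defect rate means a defective item is less likely to have come from E1 than from either of the other two machines.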
The following are the assumptions of Naive Bayes:
i) Feature independence
ii) No missing data
iii) Continuous features are normally distributed
iv) Discrete features have multinomial distributions
a)
Feechh
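The frequency-counting idea behind Naive Bayes can be sketched for categorical features as follows. This is a minimal sketch: the weather-style data, the function names and the add-one smoothing (which assumes two possible values per feature) are assumptions, not part of the notes.

```python
from collections import Counter, defaultdict

def train(rows, labels):
    """Count class frequencies and per-class feature-value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)   # (feature index, class) -> value counts
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
    return class_counts, value_counts

def predict(row, class_counts, value_counts):
    """Pick the class maximising P(class) * product of P(value | class)."""
    total = sum(class_counts.values())
    best_label, best_score = None, -1.0
    for label, count in class_counts.items():
        score = count / total
        for i, value in enumerate(row):
            # Add-one (Laplace) smoothing, assuming 2 possible values per feature.
            score *= (value_counts[(i, label)][value] + 1) / (count + 2)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training data: (outlook, temperature) -> play outside?
rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"),
        ("rainy", "hot"), ("sunny", "hot"), ("rainy", "mild")]
labels = ["no", "no", "yes", "no", "no", "yes"]

model = train(rows, labels)
print(predict(("rainy", "mild"), *model))
print(predict(("sunny", "hot"), *model))
```

Multiplying the per-feature probabilities is exactly where the feature independence assumption is used.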