IR - Manual

Uploaded by

ae685233
A.I.S.S.M.S
INSTITUTE OF INFORMATION TECHNOLOGY
Kennedy Road, Near R.T.O., Pune - 411001.

Assignment 1

Problem Statement: Write a program for preprocessing of a text document, such as stop word removal and stemming.

Objective: To understand the concepts of information retrieval and web mining.

Theory: Text data derived from natural language is unstructured and noisy. Text preprocessing converts text into a clean and consistent format that can be fed into a model for further analysis and learning.

Text preprocessing techniques may be general, so that they are applicable to many types of applications, or they may be specific to a task. For example, the methods for processing scientific documents with equations and other mathematical symbols can be quite different from those for dealing with comments on social media.

The NLP Preprocessing Pipeline
An NLP system takes textual data and reads, processes, analyzes and interprets the text. As a first step, the text is converted into a structured format using several stages. The output from one stage becomes the input to the next, hence the name "preprocessing pipeline".

Segmentation: It involves breaking up text into its corresponding sentences. While this may seem like a trivial task, it has a few challenges. In the English language, a period normally indicates the end of a sentence [...]

Tokenization: This stage involves breaking a sentence into "tokens", the basic building blocks upon which analysis and other methods are built.

Change to lowercase: Changing the case involves converting all uppercase characters to lowercase so that all words are in a consistent format. Lowercasing is the more frequent choice.

Spell Correction: Many NLP applications include a step to correct the spelling of all words in the text.

Stemming: The term "stem" is borrowed from linguistics and is used to refer to the base or root form of a word, for example "learn" for "learning".

Text normalization: It is a preprocessing stage that converts text to a canonical representation. A common application is the handling of shortened words or words spelled in different ways.

Conclusion: By using the above steps, we have performed preprocessing of a text document.
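
The stages described above can be sketched as a small pipeline. This is a minimal illustration using only the Python standard library, not the program the assignment expects: the stop-word list and the suffix rules are hypothetical and deliberately tiny, and the stemmer is a naive suffix stripper rather than an established algorithm such as Porter's.

```python
import re

# Hypothetical, deliberately tiny stop-word list; real lists are much longer.
STOP_WORDS = {"a", "an", "and", "are", "is", "of", "the", "to", "in", "on"}

# Hypothetical suffix rules for a naive stemmer (not the Porter algorithm).
SUFFIXES = ("ing", "ed", "es", "s")


def segment(text):
    """Split text into sentences at '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def tokenize(sentence):
    """Lowercase a sentence and return its alphabetic tokens."""
    return re.findall(r"[a-z]+", sentence.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Run each stage in order; the output of one stage feeds the next."""
    result = []
    for sentence in segment(text):
        tokens = remove_stop_words(tokenize(sentence))
        result.append([stem(t) for t in tokens])
    return result


print(preprocess("The students are learning. Cats jumped on the dogs!"))
# [['student', 'learn'], ['cat', 'jump', 'dog']]
```

The simple period rule in `segment` would wrongly split after abbreviations such as "Dr.", which is one of the segmentation challenges mentioned in the theory.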
Assignment

Problem Statement: [...] documents using inverted files.

Objective:
1) Evaluate and analyze retrieved information.
2) To study indexing, inverted files and searching with the help of an inverted file.

Theory: An inverted index is an index data structure storing a mapping from content, such as words or numbers, to its locations in a document or a set of documents.

Inverted Index
Creating an inverted index: We will create a word-level inverted index, that is, it will return the list of lines in which a word is present. We will also create a dictionary in which the keys represent the words present in the file and the value of each dictionary entry represents the lines of the file in which that word occurs.

Functions used:
open(): used to open the file.
read(): This function is used to read the content of the file.
seek(0): returns the cursor to the beginning of the file.

Tokenize the data into individual words: Apply linguistic preprocessing by converting each word in the sentences into tokens. Tokenizing the sentences helps with creating the terms for the upcoming indexing operation.
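    def tokenize_words(file_contents):
        """
        Tokenize file contents.

        Parameters
        ----------
        file_contents : list
            A list of strings containing the contents of the file.

        Returns
        -------
        list
            A list of lists of strings: each sentence tokenized into words.
        """
        result = []
        for i in range(len(file_contents)):
            tokenized = file_contents[i].split()
            result.append(tokenized)
        return result

Create an inverted index.

    # `line` (number of lines), `array` (list of lines) and `tokens_without_sw`
    # (tokens with stop words removed) are assumed to be defined earlier.
    inverted_index = {}
    for i in range(line):
        check = array[i].lower().split()  # compare whole words, not substrings
        for item in tokens_without_sw:
            if item in check:
                if item not in inverted_index:
                    inverted_index[item] = []
                inverted_index[item].append(i + 1)  # 1-based line number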
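
As a self-contained illustration of the word-level inverted index described above, the following sketch builds the dictionary from a list of lines and looks words up in it. The function names `build_inverted_index` and `search`, and the sample lines, are hypothetical and stand in for the contents of a real file; stop word removal is left out to keep the example short.

```python
from collections import defaultdict


def build_inverted_index(lines):
    """Map each lowercased word to the sorted 1-based line numbers containing it."""
    index = defaultdict(set)
    for line_no, line in enumerate(lines, start=1):
        for word in line.lower().split():
            # Strip surrounding punctuation so "index." and "index" match.
            index[word.strip(".,;:!?\"'()")].add(line_no)
    index.pop("", None)  # drop tokens that were punctuation only
    return {word: sorted(nums) for word, nums in index.items()}


def search(index, word):
    """Return the line numbers for a word, or an empty list if it is absent."""
    return index.get(word.lower(), [])


# Hypothetical sample lines standing in for the contents of a file.
lines = [
    "Information retrieval uses an inverted index.",
    "An index maps words to lines.",
    "Retrieval is fast with an index.",
]
index = build_inverted_index(lines)
print(search(index, "index"))      # [1, 2, 3]
print(search(index, "retrieval"))  # [1, 3]
print(search(index, "missing"))    # []
```

Using a set per word avoids recording the same line twice when a word is repeated within a line.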

Conclusion: [...] documents using inverted files.

Assignment

Problem Statement: Write a program to construct a Bayesian network considering medical data. Use this model to demonstrate the diagnosis of heart patients using the standard Heart Disease Data Set.

Objective:
i) Evaluate and analyze retrieved information.
ii) To study the Bayesian network model.

Theory: A Bayesian network is a directed acyclic graph in which each edge corresponds to a conditional dependency and each node corresponds to a unique random variable.

A Bayesian network consists of two major parts: a directed acyclic graph and a set of conditional probability distributions.
The directed acyclic graph is a set of random variables represented by nodes.
The conditional probability distribution of a node is defined for every possible outcome of its parent nodes.

[Figure: the corresponding directed acyclic graph; node labels are only partly legible, one of them being "Computer malfunction".]

Dataset
Title: Heart Disease Databases
The Cleveland database contains 76 attributes, but all published experiments refer to using a subset of 14 of them.

Class distribution (only partly legible):
Database     ...    3    4    Total
Cleveland    ...   35   13    ...

Attribute information:
i) age: age in years
ii) sex: sex (1 = male; 0 = female)
iii) cp: chest pain type
     value 1: typical angina
     value 2: atypical angina
     value 3: non-anginal pain
iv) trestbps: resting blood pressure (in mm Hg on admission to the hospital)
v) chol: serum cholesterol in mg/dl
vi) fbs: fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
vii) restecg: resting electrocardiographic results
viii) thalach: maximum heart rate achieved
ix) exang: exercise induced angina (1 = yes; 0 = no)
x) oldpeak: ST depression induced by exercise relative to rest
xi) slope: the slope of the peak exercise ST segment
xii) thal: 3 = normal; 6 = fixed defect
xiii) Heart disease: integer valued from 0 to 4

Conclusion: In this assignment, we have constructed a Bayesian network considering medical data.
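
To make the two parts of a Bayesian network concrete, here is a minimal sketch of a three-node network (sex → heart disease → exang) with inference by enumeration, written in plain Python. The structure is an assumption chosen for illustration, and every probability in the tables is made up: none of them is estimated from the Cleveland data. A real solution would learn the tables from the dataset.

```python
from itertools import product

# Nodes of the hypothetical network: male -> disease -> exang.
VARIABLES = ("male", "disease", "exang")

# Illustrative conditional probability tables (made-up numbers,
# NOT estimated from the Cleveland data).
P_MALE = 0.7                                       # P(male = True)
P_DISEASE_GIVEN_MALE = {True: 0.5, False: 0.25}    # P(disease = True | male)
P_EXANG_GIVEN_DISEASE = {True: 0.6, False: 0.1}    # P(exang = True | disease)


def joint(a):
    """Joint probability of one full assignment: product of each node's CPT entry."""
    p = P_MALE if a["male"] else 1 - P_MALE
    p_d = P_DISEASE_GIVEN_MALE[a["male"]]
    p *= p_d if a["disease"] else 1 - p_d
    p_e = P_EXANG_GIVEN_DISEASE[a["disease"]]
    p *= p_e if a["exang"] else 1 - p_e
    return p


def query(target, evidence):
    """P(target = True | evidence) by enumerating every full assignment."""
    totals = {True: 0.0, False: 0.0}
    for values in product((True, False), repeat=len(VARIABLES)):
        a = dict(zip(VARIABLES, values))
        if all(a[name] == value for name, value in evidence.items()):
            totals[a[target]] += joint(a)
    return totals[True] / (totals[True] + totals[False])


print(query("disease", {}))                # prior probability of disease
print(query("disease", {"exang": True}))   # after observing exercise induced angina
```

With these illustrative tables, observing exercise induced angina raises the probability of disease from 0.425 to about 0.816, which is the kind of diagnostic reasoning the assignment asks the model to demonstrate.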
Assignment - 4

Problem Statement: Implement e-mail spam filtering using a text classification algorithm with a dataset.

Objective: i) Basic concept of spam filtering algorithms.

Theory: In the new era of technical advancement, electronic mails (e-mails) have gathered significant users for professional, commercial and personal communications. In 2014, on average every person received 30 emails each day and overall 296 billion emails were sent that year.
Because of the high demand and huge user base, there is an upsurge in unwanted emails, also known as spam.

Content Based Filtering Technique:
Algorithms analyze words and phrases inside the content of emails to segregate them into spam and non-spam categories.

Case Based Filtering Technique:
Algorithms trained on well annotated spam and non-spam marked emails try to classify incoming emails into two categories.

Heuristic or Rule Based Spam Filtering Technique:
Algorithms use predefined rules to assign a score to the messages in emails.

Adaptive Spam Filtering Technique:
Algorithms classify incoming emails into various groups and, based on the comparison scores of every group with the defined set of groups, spam and non-spam emails get segregated.
[Figure: spam filtering workflow with a training phase and a testing phase. Legible labels: Email corpus, Train, Test, Pre-process, Feature extract, Training model, Spam, Not spam.]

Step 1: E-mail Data Collection: The dataset contained in a corpus plays a crucial role in assessing the performance of any spam filter.

Step 2: Preprocessing of E-mail Content: As the dataset is in text format, we will need preprocessing to clean it. We mainly perform tokenization of the mail [...]

Step 3: After preprocessing we can have a large number of words. Here we can maintain a database that contains the frequency of [...]

Step 4: kNN (k-Nearest Neighbour) Implementation: Similar to a clustering algorithm, the kNN algorithm serves the purpose of grouping: for a new incoming e-mail it looks at the closest k instances [...]
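
The kNN step can be sketched as follows, using only the Python standard library. This is an illustration under stated assumptions, not the assignment's reference solution: the training messages and their labels are hypothetical toy data, word counts serve as the features, Euclidean distance is the assumed distance measure, and the names `vectorize`, `distance` and `knn_classify` are invented for this example.

```python
import math
from collections import Counter

# Hypothetical toy training set; a real filter would be trained on an e-mail corpus.
TRAIN = [
    ("win free money now", "spam"),
    ("free prize claim now", "spam"),
    ("cheap loans win cash", "spam"),
    ("meeting agenda for monday", "ham"),
    ("project report attached", "ham"),
    ("lunch on monday", "ham"),
]


def vectorize(text):
    """Turn a message into word-frequency features."""
    return Counter(text.lower().split())


def distance(a, b):
    """Euclidean distance between two word-frequency vectors."""
    words = set(a) | set(b)
    return math.sqrt(sum((a[w] - b[w]) ** 2 for w in words))


def knn_classify(train, message, k=3):
    """Label a message by majority vote among its k nearest training messages."""
    query = vectorize(message)
    neighbours = sorted(train, key=lambda item: distance(vectorize(item[0]), query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


print(knn_classify(TRAIN, "claim free money"))           # spam
print(knn_classify(TRAIN, "project meeting on monday"))  # ham
```

An odd value of k avoids ties when there are only two classes.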
Step 5: Performance Analysis: Let us check the model's performance. Even a single missed important message may cause a [...]

Conclusion: Because of the number of spam emails sent daily and the amount of money people lose every year because of these spam scams, spam filtering becomes a primary need of all email service providing companies.
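Assignment

Problem Statement: Implement the Agglomerative hierarchical clustering algorithm using an appropriate dataset.

Objective: To study random walk, and to evaluate and analyze retrieved information using the Page Ranking algorithm.

Theory:
Prerequisites: Agglomerative clustering is one of the most common hierarchical clustering techniques.
Dataset: Credit Card dataset.

Step 1: Importing the required libraries

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import scipy.cluster.hierarchy as shc
    from sklearn.preprocessing import StandardScaler, normalize
    from sklearn.decomposition import PCA

Step 2: Loading and cleaning the data

    # Change the working location to the location of the file
    # and load the file into the DataFrame X.
    # Dropping the CUST_ID column from the data
    X = X.drop('CUST_ID', axis=1)
    # Handling the missing values (forward fill)
    X.ffill(inplace=True)

Step 3: Preprocessing the data

    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
    X_normalized = normalize(X_scaled)
    X_normalized = pd.DataFrame(X_normalized)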
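Step 4: Reducing the dimensionality of the data

    pca = PCA(n_components=2)
    X_principal = pca.fit_transform(X_normalized)
    X_principal = pd.DataFrame(X_principal)
    X_principal.columns = ['P1', 'P2']

Step 5: Visualizing the working of the dendrogram

    plt.title('Visualising the data')
    Dendrogram = shc.dendrogram(shc.linkage(X_principal, method='ward'))

To determine the number of clusters from the dendrogram, imagine all the horizontal lines as being completely horizontal. Then, after calculating the maximum distance between any two horizontal lines, draw a horizontal line at that maximum distance.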

Step 6: Building and visualizing the different clustering models for different values of k
a) k = 2

Step 7: Evaluating the different models and visualizing the results
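
To show what the agglomerative merging in the steps above does, here is a dependency-free sketch of bottom-up clustering with Ward's merge criterion, the same criterion selected by `method='ward'` in the dendrogram step. It is a naive illustration, not the SciPy or scikit-learn implementation: the function names are invented, the six sample points are hypothetical, and the pairwise search makes it suitable only for very small inputs.

```python
def centroid(points):
    """Mean of a list of equal-length points."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def ward_cost(a, b):
    """Increase in within-cluster sum of squares if clusters a and b are merged."""
    ca, cb = centroid(a), centroid(b)
    sq_dist = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * sq_dist


def agglomerative(points, k):
    """Start with one cluster per point and merge the cheapest pair until k remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        pairs = (
            (i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        i, j = min(pairs, key=lambda ij: ward_cost(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical 2-D points standing in for the two principal components P1 and P2.
points = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (5.0, 5.0), (5.1, 4.8), (4.9, 5.2)]
for cluster in agglomerative(points, k=2):
    print(cluster)
```

On real data, scikit-learn's `AgglomerativeClustering(n_clusters=2)` with its `fit_predict` method performs the same kind of clustering far more efficiently.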

Conclusion: We have studied random walk, and evaluated and analyzed retrieved information using the Page Ranking algorithm.
