DM Unit 5
Government information
Digital libraries
Web applications.
The Web also contains hidden and unindexed data, where unindexed data refers to the data produced dynamically through queries.
It discovers the relation between web pages and the pages to which their hyperlinks point. The relation is determined on the basis of synonyms or similar content found on the web pages.
It also determines the networks in a certain domain. This determination makes the process of querying easier and more efficient. Moreover, it helps to find authorities and overview sites for the subjects pointing to multiple authorities. It allows studying the inter-document structure (i.e., the structure of documents within the web).
The collection of hyperlinked pages, say V, can be represented by a directed graph G = (V, E). In this graph, V is the set of nodes (pages) and a directed edge (a, b) ∈ E means that page a points to page b. The out-degree of 'a' is the count of nodes linked from 'a', whereas the in-degree of 'a' is the count of nodes linked to it.
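The out-degree and in-degree described above can be counted directly from the edge list. The following is a minimal sketch; the page names and the edge list are hypothetical and are used only for illustration.

```python
from collections import defaultdict


def degrees(edges):
    """Return (out_degree, in_degree) for a list of directed edges (a, b)."""
    out_degree = defaultdict(int)
    in_degree = defaultdict(int)
    for a, b in edges:
        out_degree[a] += 1   # page a points to one more page
        in_degree[b] += 1    # page b is pointed to by one more page
        # Make sure both endpoints appear in both tables, even with count 0.
        out_degree[b] += 0
        in_degree[a] += 0
    return dict(out_degree), dict(in_degree)


# Hypothetical web graph: an edge (a, b) means page a links to page b.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "A")]
out_deg, in_deg = degrees(edges)
print(out_deg)
print(in_deg)
```

Here page A has out-degree 2 because it links to B and C, and page C has in-degree 2 because both A and B link to it.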
There are different algorithms that are employed in web structure mining. Specifically, PageRank, HITS and CLEVER are
Look for the SIA GROUP LOGO on the TITLE COVER before you buy
UNIT-5 Web and Text Mining
A Weblog entry records information about the requested web page (i.e., its URL), the IP address of the requester, and a timestamp. Web-based e-commerce servers need to maintain a large Weblog database consisting of Weblog records. This Weblog database can be accessed with the help of Weblog mining techniques, which must be developed by considering the following aspects,
The success of any Weblog mining technique depends on how much relevant and reliable information can be retrieved.
After collecting this relevant information, the technique must be able to purify, compress and transform the database.
Information about frequently referred web pages, peak usage time, etc., can be obtained by performing multidimensional OLAP analysis on the Weblog entries. This information is very essential.
Let us consider an example of how web usage mining is helpful in day-to-day life. A computer science student types a search query such as "Python" in a search engine. After getting the searched results, he selects topics that are named "Python programming language" for viewing. From this, the web usage mining system can deduce that the pages with the title "Python programming language" are of more interest to the user than the pages about the python snake. Hence, it increases the page rank of the pages selected. This facilitates the user to get relevant pages from the next search. Thus, the search engine's ability in search will be improved, as the search has been filtered and personalized.
PageRank
PageRank is the most effective and popular algorithm to improve web search. Ideally, the popular search engine Google uses the PageRank algorithm. The working of the algorithm initiates by employing the incoming and outgoing links to calculate the web page score. The score is computed depending upon the popularity of the web page. When the traditional retrieval strategy produces two documents of the same rank, the PageRank algorithm changes the similarity measure with respect to the popular document, wherein a document is said to be popular if it has many incoming links.
Where,
C(D) = Number of pages to which page D connects.
d = Dampening factor, ranging from 0 to 1.
Example
With the common dampening factor of 0.85 and setting the beginning rank of each page, the number of iterations performed before the scores converged was 8.
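The iterative computation of PageRank can be sketched as follows. This is only an illustrative sketch and not the formula printed in this text: it assumes the commonly quoted form PR(A) = (1 - d) + d × Σ PR(T)/C(T), summed over the pages T that link to A, where C(T) is the number of outgoing links of T and d is the dampening factor. The link structure, the starting rank of 1.0, the tolerance and the function name are all assumptions made for the example, and every page is assumed to have at least one outgoing link.

```python
def pagerank(links, d=0.85, tol=1e-6, max_iter=200):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) until the scores settle.

    links maps each page to the list of pages it points to.
    Returns the final ranks and the number of iterations performed.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    rank = {p: 1.0 for p in pages}          # assumed starting rank for every page
    iteration = 0
    for iteration in range(1, max_iter + 1):
        new_rank = {}
        for p in pages:
            # Sum the rank share passed on by every page q that links to p.
            incoming = sum(rank[q] / len(links[q])
                           for q in pages if p in links.get(q, ()))
            new_rank[p] = (1 - d) + d * incoming
        delta = max(abs(new_rank[p] - rank[p]) for p in pages)
        rank = new_rank
        if delta < tol:                      # scores have converged
            break
    return rank, iteration


# Hypothetical three-page web: A links to B and C, B links to C, C links to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks, iterations = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this small graph page C receives links from both A and B, so it ends up with the highest score, while page B, which is linked only from A, ends up with the lowest.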
(ii) Social Network
Social networks (link analysis) are the networks wherein the associations among data objects are represented.
(a) Technological Domain
The examples of social networks in this domain are electrical power grids, telephone call graphs, spread of computer viruses, WWW, etc.
(b) Sociology Domain
The examples of social networks in this domain are exchange of e-mail messages within corporations, chat rooms, friendships, etc.
(c) Biology Domain
In this domain, the examples range from epidemiological networks and cellular and metabolic networks to the neural network of the nematode worm Caenorhabditis elegans.
5.3 TEXT MINING
61. Explain in detail about text mining.
Answer
Model Paper-I, Q11(a)
Text Mining
Text mining is defined as a process of extracting high-quality information from text (document) databases. The main purpose of this mining is to process unstructured information and to extract meaningful numeric indices from the database so as to make the information accessible to different data mining algorithms. Text mining is an important part of the data mining process because such mining enables the user to make a comparison among several documents, provide priority to essential documents or identify the procedure of several documents. The tasks of text mining include,
The historic text mining techniques use keywords and frequency counts so as to mine the text. On the other hand, modern text mining techniques use artificial neural networks for mining the text based on some semantic network analysis. Such mined text is helpful for the creation of data, summarizations, semantic text based navigation, etc. The functionalities possessed by the modern text mining techniques are,
(i) Text mining extracts and delivers the accurate semantic network of the database. This network is built by considering the basic concepts and relations existing in the database. Thus, it gives a detailed description about the text and helps in the further analysis.
(ii) It controls the size of data and provides high-quality, accurate and summarized data.
(iv) It performs navigation of the knowledge base from concepts of the semantic network to useful information. This is done by using the hyperlinks.
(i) Information Retrieval
Information retrieval simply refers to a domain which is being developed concurrently with database systems. Information retrieval typically deals with the arrangement and extraction of information from an enormous amount of text-based documents. Information retrieval systems are not concerned with the problems of database systems like transaction management, concurrency control and updation. Similarly, database systems are not concerned with the problems of information retrieval systems.
Information retrieval came into existence with many new applications due to the large quantity of available text information. Nowadays many information retrieval systems are available, which include on-line document management systems, digital library systems and also many developed web search portals. The problem with an information retrieval system is to locate useful documents in the document cluster depending upon the user's request. When a user needs to retrieve a small portion of the available information, the user itself starts retrieving useful information from the cluster.
In document choosing methods, the system provides the user with documents based on a boolean expression. This method can operate well when the user possesses good knowledge of the document collection and has the ability of formulating a good query.
In these methods, the queries are used to assign priorities to all documents based on their order of relevance. In other words, the more relevant document is assigned the first priority, a less relevant document is assigned the second priority and so on. In contrast to document choosing methods, these methods are more efficient and useful for common users and their exploratory queries. Whenever a user inputs a boolean query, the IR system processes the query based on the keywords and returns a list of prioritized documents. Nowadays, many different priority methods are available which mostly rely on mathematical terms such as probability, statistics, algebra and logic. The primary purpose of all these methods is to match the user query keywords with the keywords available in the document and also provide priority to all documents based on their order of relevance. This category defines a vector space model in which both document and query are represented as vectors in a high-dimensional space correlating to all the keywords. It also applies an adequate similarity measure in order to evaluate the similarity between the document vector and the query vector.
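The vector space model can be illustrated with a small sketch. It assumes raw keyword counts as vector components and cosine similarity as the similarity measure; the text does not fix either choice, so both are assumptions. The documents, the query and the function names are hypothetical.

```python
import math
from collections import Counter


def term_vector(text):
    """Represent a text as a keyword-count vector (one dimension per keyword)."""
    return Counter(text.lower().split())


def cosine_similarity(v1, v2):
    """Cosine of the angle between two keyword-count vectors."""
    dot = sum(v1[t] * v2[t] for t in v1 if t in v2)
    norm1 = math.sqrt(sum(c * c for c in v1.values()))
    norm2 = math.sqrt(sum(c * c for c in v2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)


def rank_documents(query, documents):
    """Return (score, document) pairs, most relevant document first."""
    q = term_vector(query)
    scored = [(cosine_similarity(q, term_vector(doc)), doc) for doc in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Hypothetical document collection and query.
documents = [
    "python programming language tutorial",
    "python snake habitat",
    "web mining with link analysis",
]
ranked = rank_documents("python programming language", documents)
for score, doc in ranked:
    print(round(score, 3), doc)
```

The document sharing all three query keywords receives the first priority, the document sharing only "python" receives the second priority, and the document with no common keyword scores zero.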
(ii) Information Extraction
The information extraction approach is a text mining approach wherein semantic information is provided as input so as to discover relevant information. This approach is highly advanced and requires semantic text analysis.
Document classification is mostly used in various aspects including automated topic tagging, topic directory creation, identifying document writing styles and also classification of a document's hyperlinks.
There are various classification methods used for classifying document databases. They include the vector-space model, feature selection methods, Bayesian classification, support vector machines and association-based classification.
The documents can be organized or structured with the help of the document clustering technique. However, this organizing is done in an unsupervised way. Since the overall document space is highly dimensional, initially it is mandatory to reduce this document space to a lower dimension.
Unstructured documents refer to open texts, like news stories, which can be interpreted differently by different readers. In the majority of research, a set of specific words is used not only to represent unstructured documents but also to extract various features from such documents. This allows converting an unstructured document to a structured document.
(i) Word Occurrences
The set of specific words considers word-by-word statistics of the training corpus, where each word acts as a feature. In a document, a feature is called boolean on the basis of whether a word occurs in the document or not. On the other hand, a feature is called frequency on the basis of the frequency of the word in the document.
(ii) Stop-Words
The selection of features involves eradication of case sensitivity, stop-words, punctuation and uncommon words. Some of the examples of stop words are a, about, also, among, are, around, at, by, etc.
(iii) Latent Semantic Indexing
Latent semantic indexing is also referred to as latent semantic analysis. It converts the vectors present in the original document to a low-dimensional space. It does so by analyzing the document to identify the meaning or concept of document terms. This allows placing the similar documents under the same topic.
(iv) Stemming
The process of reducing words to their respective morphological roots is referred to as "stemming". For instance, consider words such as "depositing", "depositor", "deposition" and "deposited". These four words can be stemmed to their morphological root "deposit". Here, the word "deposit" is used as a feature rather than the above four words.
(v) n-gram
In addition to the above feature representations, text documents also support other feature representations like,
(a) Usage of information regarding word positions in the document.
(b) Usage of n-gram representation.
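Some of the feature representations listed above (boolean and frequency features, stop-word removal and n-grams) can be sketched as follows. The stop-word set is only the short list of examples given above, and the sample sentence, vocabulary and function names are hypothetical. Stemming and latent semantic indexing are left out of this sketch.

```python
import string
from collections import Counter

# Only the example stop words quoted above; a real list would be much longer.
STOP_WORDS = {"a", "about", "also", "among", "are", "around", "at", "by"}


def tokenize(text):
    """Lower-case the text, strip punctuation and drop stop words."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [word for word in cleaned.split() if word not in STOP_WORDS]


def frequency_features(tokens):
    """Frequency feature: how many times each word occurs in the document."""
    return Counter(tokens)


def boolean_features(tokens, vocabulary):
    """Boolean feature: whether each vocabulary word occurs in the document."""
    present = set(tokens)
    return {word: word in present for word in vocabulary}


def ngrams(tokens, n):
    """All runs of n consecutive tokens, in order of appearance."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical sample document.
tokens = tokenize("Data mining discovers patterns; patterns are also data.")
freq = frequency_features(tokens)
flags = boolean_features(tokens, ["data", "snake"])
bigrams = ngrams(tokens, 2)
print(tokens)
print(freq)
print(flags)
print(bigrams)
```

The stop words "are" and "also" are removed before counting, so they never become features.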
A text episode is defined as a pair α = (V, ≤). In this structure, 'V' represents a set of feature vectors and '≤' represents a partial order on V. The text episode α is said to occur within a text sequence 'S' if there exists an approach that satisfies the feature vectors in V.
For instance, consider the text "Datamining discovers patterns". This text can be represented as,
Here, all the occurrences of a text episode are not considered. Instead, a limitation is set such that an episode must occur within a window of size 'W'. For instance, consider W = 2. The subsequence (Information_noun_singular), (discover_noun_singular) lies within the window, whereas the subsequence (Information_noun_singular), (databases_noun_plural) does not lie within the window.
The support of 'α' in S is described as the minimum number of α occurrences in S. Hence, the episode discovery technique of sequence mining can be employed for identifying frequent episodes in a text.
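The window restriction can be illustrated with a minimal sketch. It assumes a serial episode (the feature vectors must appear in the given order) and treats a window of size W as W consecutive positions of the text sequence; both are simplifying assumptions. The text sequence below, including the "in_preposition" entry, is hypothetical.

```python
def occurs_within_window(sequence, episode, w):
    """Check whether the ordered episode occurs inside some window of w positions."""
    if not episode:
        return True
    for start, item in enumerate(sequence):
        if item != episode[0]:
            continue
        # Examine only the w positions beginning at this candidate start.
        window = sequence[start:start + w]
        matched = 0
        for feature in window:
            if feature == episode[matched]:
                matched += 1
                if matched == len(episode):
                    return True
    return False


# Hypothetical text sequence of feature vectors, written as word_class_number.
text_sequence = [
    "information_noun_singular",
    "discover_noun_singular",
    "in_preposition",
    "databases_noun_plural",
]

near = occurs_within_window(
    text_sequence, ["information_noun_singular", "discover_noun_singular"], 2)
far = occurs_within_window(
    text_sequence, ["information_noun_singular", "databases_noun_plural"], 2)
print(near, far)
```

With W = 2 the first pair is found because its two feature vectors are adjacent, while the second pair is rejected because its feature vectors are four positions apart.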
5.3.2 Hierarchy of Categories, Text Clustering
15. Discuss in detail about hierarchy of categories.
Answer
Model Paper-I, Q10(b)
Hierarchy of categories refers to the process of organizing data into hierarchical groups. It is used to express the relevancy of documents, which can be accomplished in many ways. One of the methods is to categorize the data based on different attributes of the concept hierarchy. For this reason, it is preferred to tag a document with the lowest concepts. To perform automatic tagging, a top-down approach is adopted. The possibility that an already tagged document can be tagged to its child nodes can be determined using an evaluation function. If such tagging is possible then the tag moves downwards till it reaches its limit.
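The top-down tagging described above can be sketched as follows. The text does not specify the evaluation function, so this sketch assumes a simple one: the fraction of a node's keywords that appear in the document, compared against a threshold. The concept hierarchy, keywords, threshold and function names are all hypothetical.

```python
def evaluate(document_words, node):
    """Assumed evaluation function: share of the node's keywords found in the document."""
    keywords = node["keywords"]
    if not keywords:
        return 0.0
    return len(keywords & document_words) / len(keywords)


def tag_document(document_words, root, threshold=0.5):
    """Start at the root and move the tag down while some child still qualifies."""
    current = root
    while True:
        scored = [(evaluate(document_words, child), child)
                  for child in current["children"]]
        qualifying = [pair for pair in scored if pair[0] >= threshold]
        if not qualifying:
            return current["name"]          # the tag has reached its limit
        # Move the tag to the best-scoring child and continue downwards.
        current = max(qualifying, key=lambda pair: pair[0])[1]


# Hypothetical concept hierarchy.
hierarchy = {
    "name": "computing", "keywords": set(), "children": [
        {"name": "programming", "keywords": {"code", "program"}, "children": [
            {"name": "python", "keywords": {"python", "interpreter"}, "children": []},
        ]},
        {"name": "databases", "keywords": {"sql", "table"}, "children": []},
    ],
}

deep_tag = tag_document({"python", "code", "program", "interpreter"}, hierarchy)
shallow_tag = tag_document({"code", "program", "java"}, hierarchy)
print(deep_tag, shallow_tag)
```

The first document is tagged with the lowest concept "python", whereas the second document stops at "programming" because no child node qualifies.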
The hierarchy of documents generated carries a group of documents at every node which are common in terms of the concepts associated with that node. Such a document hierarchy is useful in numerous text mining processes.
Text Clustering
Text clustering is one of the essential functionalities of text mining. It is performed using one of the clustering techniques after the identification of unstructured text features. Ward's minimum variance method is one of the commonly used text clustering algorithms. It belongs to the category of agglomerative hierarchical clustering techniques that produce compact clusters.
The dissimilarities existing among feature vectors are measured in terms of the Euclidean metric or Hamming distance. Here, the clustering process initiates with 'n' clusters, where each of these clusters is associated with one text. At each step, two of these clusters are merged. The pair to merge is chosen by minimizing a criterion defined in terms of the mean value of the dissimilarity for the cluster and 'n', the total number of cluster elements.
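Agglomerative clustering with Ward's method can be sketched as follows. The sketch assumes the standard Ward merge cost, n_a × n_b / (n_a + n_b) times the squared Euclidean distance between the two cluster means, which may differ in notation from the formula printed in this text. The two-dimensional feature vectors are hypothetical.

```python
def centroid(points):
    """Mean vector of a cluster of equal-length numeric tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def ward_cost(a, b):
    """Increase in within-cluster variance caused by merging clusters a and b."""
    mean_a, mean_b = centroid(a), centroid(b)
    squared_distance = sum((x - y) ** 2 for x, y in zip(mean_a, mean_b))
    return len(a) * len(b) / (len(a) + len(b)) * squared_distance


def ward_clustering(points, k):
    """Start with one cluster per text and merge the cheapest pair until k remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                cost = ward_cost(clusters[i], clusters[j])
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]   # merge cluster j into cluster i
        del clusters[j]
    return clusters


# Hypothetical feature vectors of four texts.
points = [(0, 0), (0, 1), (10, 10), (10, 11)]
clusters = ward_clustering(points, 2)
print(clusters)
```

The two texts near the origin form one compact cluster and the two texts near (10, 10) form the other.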
SPECTRUM ALL-IN-ONE JOURNAL FOR ENGINEERING STUDENTS, SIA GROUP
DATA MINING [JNTU-HYDERABAD]
Scatter/Gather
Scatter/gather is a text clustering based interface used for grouping documents based on their content similarity. This technique allows the user to perform the following,
(i) Scattering documents into clusters or groups.
(ii) Gathering a subset of the groups.