DM Unit 5

This document discusses different types of web mining including web content mining, web structure mining, and web usage mining. It provides details on each type: - Web content mining extracts information from web page contents, both textual and non-textual information. - Web structure mining analyzes the link structure between websites to determine relationships between pages. It represents websites as graphs. - Web usage mining analyzes web server log files to discover patterns of user behavior and page visits. It aims to identify frequently visited pages and user sessions.

Uploaded by

shashank goud

DATA MINING [JNTU-HYDERABAD]


5.2 WEB CONTENT MINING, WEB STRUCTURE MINING, WEB USAGE MINING
Q. Discuss in detail about different types of web mining.
Answer : Model Paper-I, Q10(a)

The different types of web mining are as follows,

(i) Web Content Mining

Web content mining is a process of extracting relevant information from web contents. Basically, web content comprises not only textual information but also graphical information, real-time information and hyperlinks. The textual information in web content data is a combination of unstructured (free text), semi-structured (HTML pages) and highly structured (database-generated HTML pages) data. However, web content data is mostly unstructured, due to which text mining techniques can be used for performing web content mining.

Some examples of the data contained in the web include,

Government information

Digital libraries

Business information of many commercial organizations

Web applications.

The web also contains hidden and unindexed data, where unindexed data refers to the data produced dynamically through queries.

(ii) Web Structure Mining

Web structure mining refers to the process of analyzing the nodes and connection structure of a particular web site. This is typically done using graph theory.


Web structure mining checks the link structure, i.e., the hyperlinks among various websites, and classifies the webpages on the basis of the hyperlinks found. It discovers the relation between webpages and the pages to which their links point. The relation is determined on the basis of synonyms or similar content found on the webpages.

It also determines the networks in a certain domain. This determination makes the process of querying easier and more efficient. Moreover, it helps to find out authorities and overview sites for the subjects pointing to multiple authorities. It allows to study the inter-document structure (i.e., the structure of documents within the web).

The collection of hyperlinked pages, say V, can be represented by a directed graph G = (V, E). In this graph, V is the set of nodes (pages) and the directed edge (a, b) ∈ E means that page a points to page b. The out-degree of 'a' is the count of nodes linked from 'a', whereas the in-degree of 'a' is the count of nodes linked to it.
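The in-degree and out-degree described above can be sketched in Python as follows. This is only a minimal illustration: the edge list and the function name `degrees` are assumptions made for the example, not part of the original notes.

```python
from collections import defaultdict

def degrees(edges):
    """Count out-degree and in-degree for each page in a directed graph."""
    out_degree = defaultdict(int)
    in_degree = defaultdict(int)
    for source, target in edges:
        out_degree[source] += 1   # the source page points to one more page
        in_degree[target] += 1    # the target page is pointed to by one more page
    return dict(out_degree), dict(in_degree)

# Hypothetical link structure: a -> b, a -> c, b -> c
edges = [("a", "b"), ("a", "c"), ("b", "c")]
out_degree, in_degree = degrees(edges)
print(out_degree)
print(in_degree)
```

Pages that never appear as a source (or as a target) are simply absent from the corresponding dictionary, which means their degree is zero.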

There are different algorithms that are employed in web structure mining. Specifically, PageRank, HITS and CLEVER are used to find the quality rank or relevancy of web pages.

(iii) Web Usage Mining

For answer refer Unit-V, Q9.

Q9. Explain in detail about web usage mining.
Answer : Model Paper-I, Q10(a)

Web Usage Mining

Web usage mining simply refers to the process of searching the Weblog records in order to determine the ways in which users access the Web pages.

Look for the SIA GROUP LOGO on the TITLE COVER before you buy
UNIT-5 Web and Text Mining

A Weblog entry is created every time a Web page is accessed. Each entry contains information about the requested page (i.e., its URL), the IP address of the requester, and a timestamp.

Web-based e-commerce servers need to maintain a large Weblog database holding an enormous number of Weblog records. This Weblog database can be accessed with the help of Weblog mining techniques, which must be developed by considering the following aspects,

The success of any Weblog mining technique depends on how much relevant and reliable information can be retrieved. After collecting this relevant information, the technique must be able to purify, compress and transform the data.

Information about the most frequent users, frequently referred Web pages, peak usage times, etc., can be obtained by performing multidimensional OLAP analysis on the Weblog entries. This information is very essential in identifying the potential customers, users, market conditions, etc.

Patterns and Web accessing trends, such as sequential patterns, can be identified by performing data mining on the Weblog records.
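As a rough illustration of this kind of analysis, the sketch below counts the most frequently requested pages and the peak usage hour from a handful of Weblog entries. The entries, the tuple layout (URL, IP address, timestamp) and the function names are all assumptions made for the example; real server logs have their own formats.

```python
from collections import Counter

# Hypothetical Weblog entries: (requested URL, requester IP, ISO timestamp)
weblog = [
    ("/index.html", "10.0.0.1", "2024-01-01T10:00:00"),
    ("/courses.html", "10.0.0.2", "2024-01-01T10:05:00"),
    ("/index.html", "10.0.0.2", "2024-01-01T10:07:00"),
    ("/index.html", "10.0.0.3", "2024-01-01T14:30:00"),
    ("/contact.html", "10.0.0.1", "2024-01-01T10:45:00"),
]

def top_pages(entries, n):
    """Return the n most frequently requested URLs with their counts."""
    return Counter(url for url, _ip, _ts in entries).most_common(n)

def peak_hour(entries):
    """Return the hour of day (as 'HH') with the most requests."""
    hours = Counter(ts[11:13] for _url, _ip, ts in entries)
    return hours.most_common(1)[0][0]

print(top_pages(weblog, 1))
print(peak_hour(weblog))
```

The same counting idea extends to requests per IP address, which gives a simple view of the most active users.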

Let us consider an example to show how web usage mining is helpful in day-to-day life. A computer science student types a search query as "Python" in a search engine and, after getting the searched results, he selects the topics that are named as "Python programming language" for viewing. Given this, the web usage mining system can deduce that the pages with the title "Python programming language" are more interesting to the user than the pages about the python snake. Hence, it increases the page rank of the pages selected and of similar pages on the Python programming language. This facilitates the user to get relevant pages from the next search. Thus, the search engine's ability to search will be improved, as the search has been filtered and personalized.

Q10. Discuss about the following techniques for modeling web topology,

(i) PageRank
(ii) Social Network
Answer : Model Paper, Q11

(i) PageRank
PageRank is the most effective and popular algorithm to improve web search. Ideally, the popular search engine Google makes use of the PageRank algorithm. The working of the algorithm initiates by employing the incoming and outgoing links so as to calculate the web page score. The score is computed depending upon the popularity of the web page, not upon the user's query.

When the traditional retrieval strategy produces two documents of the same rank, the PageRank algorithm enhances the similarity measure with respect to the popular document, wherein a document is said to be popular if it has many page links to the document. The algorithm is well supported by web pages but fails to support the documents which may not have hyperlinks.

The computation of the PageRank corresponding to page A, with pages D_1, D_2, ..., D_n pointing to A, is specified as,

\[ \mathrm{PageRank}(A) = (1 - d) + d \sum_{i=1}^{n} \frac{\mathrm{PageRank}(D_i)}{C(D_i)} \]

SPECTRUM ALL-IN-ONE JOURNAL FOR ENGINEERING STUDENTS - SIA GROUP



Where,
C(D_i) : Number of pages to which page D_i connects.
d : Dampening factor ranging from 0 to 1.

The role of the dampening factor is to assign some PageRank to the pages which do not have links to them. At the time of allocating PageRank, the dampening factor also smooths the weights assigned to the other links. This affects the time required for the PageRank to converge.

The complete PageRank computation is performed in an iterative manner. At first, an arbitrary PageRank is allotted to all the pages. Then, using the previous scores, the calculation is repeated until the new scores no longer change to a larger extent.

Example
With the common dampening factor of 0.85 and an initial PageRank set for each page, 8 iterations were performed before the scores converged.

Figure: PageRank Calculation
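The iterative calculation can be sketched in Python as below. This is a minimal sketch under stated assumptions: it uses the commonly cited form PageRank(A) = (1 - d) + d × Σ PageRank(D_i)/C(D_i), the three-page link structure is hypothetical (it is not the graph of the figure), every page is assumed to have at least one outgoing link, and the starting score of 1.0 and the tolerance are illustrative choices.

```python
def pagerank(links, d=0.85, tol=1e-6, max_iter=100):
    """Iteratively compute PageRank scores.

    links maps each page to the list of pages it points to.
    Returns (scores, number of iterations performed).
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    scores = {page: 1.0 for page in pages}   # arbitrary starting score (assumed)
    for iteration in range(1, max_iter + 1):
        new_scores = {}
        for page in pages:
            # Pages that point to the current page
            incoming = [src for src in pages if page in links.get(src, [])]
            new_scores[page] = (1 - d) + d * sum(
                scores[src] / len(links[src]) for src in incoming
            )
        change = max(abs(new_scores[p] - scores[p]) for p in pages)
        scores = new_scores
        if change < tol:   # scores no longer change noticeably
            break
    return scores, iteration

# Hypothetical graph: A -> B, A -> C, B -> C, C -> A
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks, iterations = pagerank(links)
print(ranks, iterations)
```

In this small graph, page C receives links from both A and B, so it ends up with the highest score, while B, which is linked only from A, ends up with the lowest.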

(ii) Social Network
Social networks (link analysis) are the networks wherein the association among different entities is represented in the form of links within a graph. These networks are basically a collection of heterogeneous and multi-relational data represented using a graph. These graphs are large in size, consisting of nodes and edges. The nodes within the graph indicate objects and the edges indicate unidirectional links that specify the association between two objects. Both nodes and links have attributes, and objects contain class labels.

Social networks generally specify the notion of "small worlds", which mainly focuses on the networks among individuals. The characteristic feature of these small world networks is the presence of a high degree of local clustering for a small fraction of the nodes. However, there is not much difference in the degree of clustering between the nodes.

The different domains in the social networks include,

(a) Technological Domain
The examples of social network in this domain are electrical power grids, telephone call graphs, spread of computer viruses, WWW, etc.

(b) Sociology Domain
The examples of social network in this domain are exchange of e-mail messages within corporations, newsgroups, chatrooms, friendships, etc.

(c) Biology Domain
In this domain, the examples range from epidemiological networks and cellular and metabolic networks to the neural network of the nematode worm Caenorhabditis elegans.

5.3 TEXT MINING
Q11. Explain in detail about text mining.
Answer : Model Paper-I, Q11(a)

Text Mining

Text mining is defined as a process of extracting high quality information from text (document) databases. The main purpose of this mining is to process unstructured information and to extract meaningful numeric indices from the database so as to make the information accessible to different data mining algorithms. Text mining is an important part of the data mining process because such mining enables the user to make a comparison among several documents, provide priority to essential documents or identify the procedure of several documents. The tasks of text mining include,

(i) Text categorization

(ii) Text clustering

(iii) Sentiment analysis

(iv) Document summarization.

The historic text mining techniques use keywords and frequency counts so as to mine the text. On the other hand, modern text mining techniques use artificial neural networks for mining the text based on some semantic network analysis. Such mined text is helpful for the creation of data summarizations, semantic text-based navigation, etc. The functionalities possessed by the modern text mining techniques are,

(i) Text mining extracts and delivers the accurate semantic network of the database. This network is built by considering the basic concepts and relations existing in the database. Thus, it gives a detailed description about the text and helps in the further analysis.

(ii) It controls the size of data and provides high quality, accurate and summarized data.

(iii) It focuses on a particular subject so as to explore the words of the text.

(iv) It performs navigation of the knowledge base from concepts of the semantic network to useful information. This is done by using the hyperlinks.

(v) It creates a structure that describes the semantics of analyzed texts.

(vi) It clusters the text present in different locations of the semantic network.

(vii) It analyzes the queries to retrieve semantically important information from the text documents.

Q12. Discuss about,

(i) Information retrieval

(ii) Information extraction.

Answer : Model Paper-I, Q11

(i) Information Retrieval
Information retrieval simply refers to a domain which is being developed concurrently with the database systems. Information retrieval typically deals with the arrangement and extraction of information from an enormous amount of text-based documents.

Information retrieval and database systems are separately used to manage different types of data. Information retrieval systems are not concerned with the problems of database systems like transaction management, concurrency control, recovery and updation. Similarly, database systems are not concerned with the problems of information retrieval systems like keyword-based search, unstructured documents and the analysis of useful information.

Information retrieval came into existence with many new applications due to the large quantity of available text information. Nowadays, many information retrieval systems are available, which include on-line document management systems, digital library systems and also many developed web search portals. The problem with an information retrieval system is to locate useful documents in the document cluster depending upon the user's request. When a user needs to retrieve a small portion of the available information, the user itself starts retrieving useful information from the cluster. When a user needs to retrieve a huge amount of available information, the information retrieval system may also start providing useful information to the user depending upon the user's requirements. This process can be simply referred to as information filtering and the systems in this process are referred to as filtering systems.

Generally, information retrieval methods are classified into two major categories. They are,

(a) Document choosing methods

(b) Document priority methods.

(a) Document Choosing Methods

In these methods, the query represents constraints in order to choose useful documents from the cluster. The "Boolean retrieval model" is one such method, wherein the user provides a boolean expression of keywords (keywords combined with operators such as "and" and "or"). In this model, a set of keywords is used to represent a document. The information retrieval system accepts a boolean query from the user and provides documents based on the boolean expression. This method can operate well when the user possesses good knowledge of the document collection and has the ability of formulating a good query.
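A minimal sketch of the Boolean retrieval model is given below, assuming each document is represented by a set of keywords. The documents, keywords and helper names (`build_index`, `docs_with`) are hypothetical and chosen only for illustration; boolean operators are mapped onto Python set operations.

```python
# Hypothetical documents, each represented by a set of keywords
documents = {
    "d1": {"web", "mining", "usage"},
    "d2": {"text", "mining", "clustering"},
    "d3": {"web", "structure", "graph"},
}

def build_index(documents):
    """Build an inverted index: keyword -> set of document ids."""
    index = {}
    for doc_id, keywords in documents.items():
        for keyword in keywords:
            index.setdefault(keyword, set()).add(doc_id)
    return index

def docs_with(index, keyword):
    """Documents containing the keyword (empty set if the keyword is unknown)."""
    return index.get(keyword, set())

index = build_index(documents)

# "web AND mining" -> set intersection
web_and_mining = docs_with(index, "web") & docs_with(index, "mining")
# "text OR graph" -> set union
text_or_graph = docs_with(index, "text") | docs_with(index, "graph")
# "mining AND NOT web" -> set difference
mining_not_web = docs_with(index, "mining") - docs_with(index, "web")

print(web_and_mining, text_or_graph, mining_not_web)
```

A document is either chosen or not chosen by such a query; no ordering among the chosen documents is produced.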

(b) Document Priority Methods

In these methods, the query is used to assign priorities to all documents based on their order of relevance. In other words, the most relevant document is assigned the first priority, a less relevant document is assigned the second priority and so on. In contrast to document choosing methods, these methods are more efficient and useful for common users and their exploratory queries. Whenever a user inputs a query, the IR system processes the query based on the keywords and returns a list of prioritized documents. Nowadays, many different priority methods are available, which mostly rely on mathematical foundations such as probability, statistics, algebra and logic. The primary purpose of all these methods is to match the user query keywords with the keywords available in the document and also provide priority to all documents based on their order of relevance. This category defines a vector space model in which both document and query are represented as vectors in a high dimensional space correlating to all the keywords. It also applies an adequate similarity measure in order to evaluate the similarity between the document vector and the query vector. The documents can then be prioritized by using these similarity values.
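The vector space idea can be sketched as follows. This is a minimal sketch that assumes raw term counts as vector components and cosine similarity as the similarity measure (one common choice; the notes do not name a specific measure). The documents and the function names are hypothetical.

```python
import math
from collections import Counter

def to_vector(text):
    """Represent a text as a term-count vector (keyword -> count)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank(query, documents):
    """Return document ids ordered from most to least similar to the query."""
    q = to_vector(query)
    scored = [(cosine(q, to_vector(text)), doc_id) for doc_id, text in documents.items()]
    return [doc_id for _score, doc_id in sorted(scored, key=lambda p: p[0], reverse=True)]

# Hypothetical document collection
documents = {
    "d1": "web usage mining of web logs",
    "d2": "text mining and text clustering",
    "d3": "graph theory",
}
print(rank("web mining", documents))
```

Unlike the boolean model, every document receives a score, so even partially matching documents appear in the prioritized list.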

(ii) Information Extraction
The information extraction approach is a text mining approach wherein semantic information is provided as input so as to discover relevant information. This approach is highly advanced and requires semantic text analysis, which can be classified into the following two types,

(a) Document Classification Analysis

In this text mining approach, automated document classification is essential due to the existence of an enormous number of web documents. These web documents need to be systematically arranged into classes for providing document retrieval and subsequent mining analysis.
Document classification is mostly used in various aspects including automated topic tagging, topic directory creation, identifying document writing styles and also classification of a document's hyperlinks.

There are various classification methods used for classifying document databases. They include vector-space model, feature selection methods, Bayesian classification, support vector machines and association-based classification.

(b) Document Clustering Analysis

The documents can be organized or structured with the help of the document clustering technique. However, this organizing is done in an unsupervised way. Since the overall document space is highly dimensional, initially it is mandatory to reduce this document space to a lower dimensional space for better understanding of the space's structure.

After obtaining the low dimensional space, conventional clustering algorithms like spectral clustering, mixture model clustering, latent semantic indexing and locality preserving indexing can be applied.

5.3.1 Unstructured Text, Episode Rule Discovery for Texts

Q13. Discuss in detail about unstructured documents.
Answer : Model Paper, Q10(b)

Unstructured documents refer to open texts, like news stories, which can be interpreted differently by different readers. In the majority of researches, a set of specific words is used not only to represent unstructured documents but also to extract various features from such documents. This allows to convert an unstructured document into a structured document.

The various features of unstructured documents are as follows,

(i) Word Occurrences

The set of specific words considers the training corpus word by word, where each word acts as a feature. In a document, a feature is called boolean on the basis of whether a word occurs in the document or not.

On the other hand, a feature is called frequency on the basis of the frequency of the word in the document.
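The boolean and frequency features can be sketched as below. The vocabulary, the sample document and the function names are assumptions made for illustration, and simple whitespace splitting stands in for real tokenization.

```python
from collections import Counter

def frequency_features(document, vocabulary):
    """Frequency feature: how many times each vocabulary word occurs."""
    counts = Counter(document.lower().split())
    return [counts[word] for word in vocabulary]

def boolean_features(document, vocabulary):
    """Boolean feature: 1 if the vocabulary word occurs at all, else 0."""
    return [1 if count > 0 else 0 for count in frequency_features(document, vocabulary)]

# Hypothetical vocabulary and document
vocabulary = ["web", "mining", "text"]
document = "Web mining and web usage mining"

print(frequency_features(document, vocabulary))
print(boolean_features(document, vocabulary))
```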

(ii) Stop-Words
The selection of features involves the eradication of case sensitivity, stop-words, punctuation and uncommon words. Some of the examples of stop-words are a, about, also, among, are, around, at, by, etc.
(iii) Latent Semantic Indexing

Latent semantic indexing is also referred to as latent semantic analysis. It converts the vectors present in the original document to a low dimensional space. It does so by analyzing the document to identify the meaning or concept of document terms. This allows to place the similar documents under the same topic.
(iv) Stemming
The process of reducing words to their respective morphological roots is referred to as "stemming". For instance, consider words such as "depositing", "depositor", "deposition" and "deposited". These four words can be stemmed to their morphological root "deposit". Here, the word "deposit" is used as a feature rather than the above four words.
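A deliberately naive suffix-stripping sketch of stemming is shown below. It is not a standard stemming algorithm such as Porter's; the suffix list and the minimum stem length are assumptions picked so that the four example words reduce to "deposit".

```python
# Hypothetical suffix list, chosen only for this illustration
SUFFIXES = ("ing", "ion", "ed", "or")

def naive_stem(word, min_stem_length=3):
    """Strip the first matching suffix, keeping at least min_stem_length letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_length:
            return word[: -len(suffix)]
    return word

words = ["depositing", "depositor", "deposition", "deposited"]
stems = [naive_stem(w) for w in words]
print(stems)
```

Such a simple rule set will mis-stem many ordinary words, which is why practical systems rely on more careful rule-based or dictionary-based stemmers.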

(v) n-gram
In addition to the above feature representations, text documents also support other feature representations like,
(a) Usage of information regarding word positions in the document.

(b) Usage of n-gram representation.
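An n-gram representation over words can be sketched as follows; the function name `ngrams` and the sample sentence are chosen only for illustration.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "data mining discovers patterns".split()
bigrams = ngrams(tokens, 2)
print(bigrams)
```

The index `i` of each n-gram in the returned list also records the word position at which it starts, so the same loop covers the positional information mentioned above.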

(vi) Part-of-Speech
Part-of-Speech (POS) is one of the significant features and its tags include twenty five possible values. However, four tags are used frequently. They are noun, verb, adjective and adverb. Thus, numeric values can be assigned to these words, i.e., '1' can be assigned to noun, '2' can be assigned to verb, '3' to adjective, '4' to adverb and '5' to any other word.

(vii) Positional Collocations

These feature values refer to the words which are adjacent to a particular word by one or two positions from one of its two sides (left/right).
(viii) Higher Order Features
Higher order features involve phrases, terms, hypernyms, document concept categories, named entities, dates, e-mail addresses, organizations, locations or URLs. Feature selection methods like information gain, mutual information, cross entropy or odds ratio can be applied for further reduction of features.

After extracting all the features, unstructured text is converted to structured form and techniques like discovering frequent item sets, frequent sequences and episode rules can be employed.
Q14. Discuss in detail about episode rule discovery for texts.
Answer : Model Paper, Q18

Ahonen et al. intend to employ sequence mining techniques for structured text. Here, sequential data such as text may be regarded as a sequence of pairs (feature vector, index).

Figure: A pair consists of a feature vector and an index

Feature Vector

It contains an ordered collection of features.

Index

It consists of information regarding the position of the word in the sequence.

A text episode is defined as a pair α = (V, ≤). In this structure, 'V' represents a set of feature vectors and '≤' represents a partial order on V. The text episode α is said to occur within a text sequence 'S' if there exists a way of satisfying the feature vectors in V with the feature vectors in S such that the partial order (≤) is respected.

For instance, consider the text "Datamining discovers patterns". This text can be represented as,

(Datamining_noun_singular, 1), (discovers_verb_singular, 2), (patterns_noun_singular, 3).

Likewise, consider another text "Information discovery in databases". This text can be represented as,

(Information_noun_singular, 1), (discovery_noun_singular, 2), (in_preposition, 3), (databases_noun_plural, 4)

Here, all the occurrences of a text episode are not considered. Instead, a limitation is set, i.e., the episode must be within a window of size 'W'. For instance, consider W = 2. The subsequence (Information_noun_singular), (discovery_noun_singular) lies within the window, whereas the subsequence (Information_noun_singular), (databases_noun_plural) does not lie within the window.

The support of 'α' in S is described as the number of minimal occurrences of α in S. Hence, the episode discovery technique of sequence mining can be employed for identifying frequent episodes in a text.
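The window restriction can be sketched as below for the simple case of a serial episode (features that must occur in a given order). This is a minimal sketch under stated assumptions: an occurrence counts as "within the window" when the span from its first to its last index covers at most W positions, each feature vector is reduced to a single string, and the function name `occurs_within_window` is hypothetical.

```python
def occurs_within_window(sequence, episode, window):
    """Check whether the features in `episode` occur in order within
    `window` consecutive positions of `sequence`.

    sequence: list of (feature, index) pairs, ordered by index.
    episode: list of features that must appear in this order.
    """
    for start, (feature, index) in enumerate(sequence):
        if feature != episode[0]:
            continue
        matched = 1
        for later_feature, later_index in sequence[start + 1:]:
            if later_index - index + 1 > window:
                break   # outside the window that starts at `index`
            if matched < len(episode) and later_feature == episode[matched]:
                matched += 1
        if matched == len(episode):
            return True
    return False

text = [
    ("Information_noun_singular", 1),
    ("discovery_noun_singular", 2),
    ("in_preposition", 3),
    ("databases_noun_plural", 4),
]

near = occurs_within_window(text, ["Information_noun_singular", "discovery_noun_singular"], 2)
far = occurs_within_window(text, ["Information_noun_singular", "databases_noun_plural"], 2)
print(near, far)
```

Counting how often such windowed occurrences appear across a collection of texts is the basis for deciding which episodes are frequent.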

5.3.2 Hierarchy of Categories, Text Clustering
Q15. Discuss in detail about hierarchy of categories.
Answer : Model Paper-I, Q10(b)

Hierarchy of categories refers to the process of organizing data into hierarchical groups. It is used to express the relevancy of documents, which can be accomplished in many ways. One of the methods is to categorize the data based on different attributes like date, year, topics, etc.

If many documents are assigned to one category, they can carry different topics at the same time. For this reason, a concept hierarchy categorizes the documents into a set of categories using a data structure. The concepts in this technique are represented in the form of a directed acyclic graph. These concepts are assigned unique names. Suppose x and y represent two concepts in the graph, then an edge from x to y means that x is more general than y.

A group of concepts can also be used for a single text document, where these concepts map the document content. Whenever a document is tagged with a concept, it leads to tagging of that document with all the ancestors of that concept in the concept hierarchy. For this reason, it is preferred to tag a document with the lowest concepts. To perform automatic tagging, a top-down approach is adopted. The possibility that an already tagged document can be tagged to its child nodes can be determined using an evaluation function. If such tagging is possible, then the tag moves downwards till it reaches its limit.

The hierarchy of documents generated carries a group of documents at every node, which are common in terms of the concepts associated with that node. Such a document hierarchy is useful in numerous text mining processes.

The concept hierarchy is considered to be given a priori.
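The effect of tagging a document with a concept can be sketched as below, assuming the concept hierarchy is stored as a mapping from each concept to its more general (parent) concepts. The concept names and the function name `implied_tags` are hypothetical.

```python
# Hypothetical concept hierarchy: concept -> list of more general concepts
parents = {
    "computer science": [],
    "data mining": ["computer science"],
    "text mining": ["data mining"],
    "web mining": ["data mining"],
}

def implied_tags(concept, parents):
    """Return the concept together with all of its ancestors in the hierarchy."""
    tags = set()
    stack = [concept]
    while stack:
        current = stack.pop()
        if current in tags:
            continue   # already visited through another path in the DAG
        tags.add(current)
        stack.extend(parents.get(current, []))
    return tags

tags = implied_tags("text mining", parents)
print(sorted(tags))
```

Because the more general concepts follow automatically, tagging with the lowest applicable concept carries the most information.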

Q16. Discuss in detail about text clustering.

Answer : Model Paper-I, Q11(b)

Text Clustering

Text clustering is one of the essential functionalities of text mining. It is performed using one of the clustering techniques after the identification of unstructured text features. Ward's minimum variance method is one of the commonly used text clustering algorithms. It belongs to the category of agglomerative hierarchical clustering techniques that produce compact clusters.

The dissimilarities existing among feature vectors are measured in terms of the Euclidean metric or Hamming distance. Here, the clustering process initiates with 'n' clusters, where each of these clusters is associated with one text. Among these clusters, any two can be merged to form a new cluster.

The following criterion is followed to generate the new cluster,

\[ \min \; \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2 \]

Where, x̄ represents the mean value of the dissimilarity for the new cluster, x_i represents the dissimilarity value of its i-th element and 'n' represents the total number of cluster elements.
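One agglomerative merge step under Ward's method can be sketched as below. This is a minimal sketch that assumes Euclidean feature vectors and uses the standard Ward merge cost, i.e., the increase in within-cluster sum of squares, n_a × n_b / (n_a + n_b) × ||m_a − m_b||², where m_a and m_b are the cluster centroids. The function names and sample points are hypothetical, and the form of the cost may differ from the criterion as printed in the notes.

```python
def centroid(cluster):
    """Mean vector of the points in a cluster."""
    dims = len(cluster[0])
    return [sum(point[d] for point in cluster) / len(cluster) for d in range(dims)]

def ward_cost(a, b):
    """Increase in within-cluster sum of squares if clusters a and b are merged."""
    ca, cb = centroid(a), centroid(b)
    squared_distance = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * squared_distance

def best_merge(clusters):
    """Return (cost, i, j) for the pair of clusters that is cheapest to merge."""
    best = None
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            cost = ward_cost(clusters[i], clusters[j])
            if best is None or cost < best[0]:
                best = (cost, i, j)
    return best

# Start with one cluster per text (hypothetical 2-D feature vectors)
clusters = [[(0.0, 0.0)], [(0.0, 1.0)], [(5.0, 5.0)]]
print(best_merge(clusters))
```

Repeating this step, each time replacing the chosen pair by their union, builds the full hierarchy from 'n' single-text clusters up to one cluster.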
Scatter/Gather
Scatter/gather is a text clustering based interface used for grouping documents based on their content similarity. This technique allows the user to perform the following,
(i) Scattering documents into clusters or groups.
(ii) Gathering a subset of the groups.
(iii) Forming new groups by rescattering the subset of the groups.

In scatter/gather, the clusters are named with a list of topical terms that describe the type of documents existing within a cluster. Moreover, they can also be represented in other ways, such as by their summaries. If there exist several documents within a cluster, then the user can perform re-clustering, in other words, regrouping the subset of documents further to form smaller groups. The type of theme of the cluster may get changed when regrouping is performed, as the documents in a subgroup describe a distinct collection of topics compared to the documents in the large group.
distinct collection of topics compared to
