
UNIT-4

2 Marks
1. What is stemming?
Stemming is a process which reduces words to their morphological roots by removing affixes. For example, the words “informing”, “information”, “informer”, and “informed” would all be stemmed to their common root “inform”, and only the root is used as the feature instead of the four original words.
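A toy Python sketch to make the idea concrete; the suffix list and length guard are illustrative assumptions, not the full Porter stemmer that real text mining systems typically use:

# Toy suffix-stripping stemmer (illustrative rules only, not the
# full Porter algorithm).
def stem(word):
    for suffix in ("ation", "ing", "er", "ed"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

words = ["informing", "information", "informer", "informed"]
print(sorted({stem(w) for w in words}))   # ['inform']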
2. What is sequence mining?
An efficient approach to mining causal events is sequence mining. Sequence mining can be applied to DNA sequences, signal processing and speech analysis. Given a set of data sequences, the problem is to discover sequences that are frequent, in the sense that the percentage of data sequences containing them exceeds the user-specified minimal support. It is widely used in retailing, where a data sequence corresponds to the items bought by an individual customer over time. Frequent patterns are useful for predicting future customer behaviour.

3. What is text mining?


Text mining is the extension of the data mining approach to textual data and is concerned with various tasks, such as the extraction of information implicitly contained in a collection of documents, or similarity-based structuring.
4. What is web content mining?
Web content mining extracts information from within a page; it describes the discovery of useful information from the web contents.

5.What is temporal data mining?


Temporal data mining is defined as the non-trivial extraction of implicit, potentially useful, previously unrecorded information with an implicit or explicit time content, from large quantities of data.

6.Define index node


An index node is a node whose out-degree is significantly larger than the average out-degree of the graph.
7. Define reference node
A reference node is a node whose in-degree is significantly larger than the average in-degree of the graph.

8.What is co-citation?
For a pair of nodes, p and q, the co-citation is the number of nodes that point to both p and q.

9.What is unstructured text?


Unstructured documents are free texts, such as news stories. For an unstructured document,
features are extracted to convert it to a structured form.

10. What are the four different levels of temporality?


a) Static
b) Sequences
c) Timestamped
d) Fully Temporal

11. What is spatial mining?


Spatial mining is a branch of data mining that deals with spatial (location or geo-referenced)
data.

12. Define frequency of a sequence


The frequency of a sequence s, with respect to database D, is the total number of input
sequences in D that support it.

13. What is parallel and serial episode?


Parallel episode: If the partial order ≤ of the episode is the trivial partial order, the episode is called a parallel episode. In a parallel episode there is no constraint on the relative order of the events.
Serial episode: If the partial order is a total ordering, the episode is called a serial episode. A serial episode on events A, B, C occurs in a given sequence only if A, B and C occur in the order A -> B -> C relatively close together. There can be other events between these three.
14. What is the occurrence of an episode in an event sequence?
An episode is said to occur in a sequence if the events corresponding to the nodes of the episode appear in the sequence, preserving the partial order of the episode.
15.What is Latent Semantic indexing?
Latent Semantic Indexing (LSI) transforms the original document vectors to a lower dimensional space by analysing the correlational structure of terms in the document collection, such that similar documents that do not share terms are placed in the same topic.
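A minimal numpy sketch of the idea, assuming a toy term-document count matrix and rank k = 2; real LSI systems work on weighted matrices (e.g., tf-idf) at much larger scale:

import numpy as np

# Toy term-document count matrix: rows = terms, columns = documents.
A = np.array([[2., 1., 0., 0.],
              [1., 2., 0., 0.],
              [0., 0., 1., 2.],
              [0., 0., 2., 1.]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                                     # number of latent topics kept
doc_vecs = (np.diag(s[:k]) @ Vt[:k]).T    # documents in the k-dim topic space
print(doc_vecs)                           # docs 1-2 and 3-4 pair up by topic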

16.What is Part-Of-Speech?
One important feature is the POS tag. There can be 25 possible values for POS tags; the most common tags are noun, verb, adjective and adverb. Thus, we can assign the numbers 1, 2, 3, 4 and 5 depending on whether the word is a noun, verb, adjective, adverb or any other part of speech, respectively.

17.Define stop-words
Stop-words are words that are so commonly used that they carry very little useful information, so text mining applications eliminate them. Feature selection includes removing the case, punctuation, infrequent words and stop-words.
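A small Python sketch of this feature selection step; the stop-word list here is a tiny illustrative sample, not a standard list:

import string

STOP_WORDS = {"the", "is", "a", "of", "to", "and", "in"}   # illustrative sample

def clean(text):
    # Remove case and punctuation, then drop stop-words.
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [w for w in text.split() if w not in STOP_WORDS]

print(clean("The extraction of information in a collection of documents."))
# ['extraction', 'information', 'collection', 'documents']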
18.What are the different types of web mining?
a) Web content mining
b) Web structure mining
c) Web usage mining
19. Define Bibliographic Coupling.
For a pair of nodes, p and q, the bibliographic coupling is equal to the number of nodes that receive links from both p and q.
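Both co-citation and bibliographic coupling can be read directly off the link set of a graph. A sketch, assuming a toy set of directed links (u, v) meaning “u points to v”:

# Toy directed links; illustrative graph only.
links = {("a", "p"), ("a", "q"), ("b", "p"), ("b", "q"), ("p", "x"), ("q", "x")}

def co_citation(p, q):
    # Number of nodes that point to both p and q.
    return len({u for (u, v) in links if v == p} &
               {u for (u, v) in links if v == q})

def bibliographic_coupling(p, q):
    # Number of nodes that receive links from both p and q.
    return len({v for (u, v) in links if u == p} &
               {v for (u, v) in links if u == q})

print(co_citation("p", "q"))             # 2: both a and b point to p and q
print(bibliographic_coupling("p", "q"))  # 1: both p and q point to x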
20. Define page rank.
A page can have a high PageRank if there are many pages that point to it, or if there are
some pages that point to it which have a high PageRank. Naturally, pages that are well cited
from many places around the web are worth looking at. Also, pages that have perhaps only
one citation from something like the Yahoo! Homepage are also generally worth looking at.
PageRank handles both these cases and everything in between, by recursively propagating
weights through the link structure of the web.
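A minimal power-iteration sketch of this recursive weight propagation; the toy graph, damping factor and iteration count are illustrative assumptions, and dangling pages are ignored for simplicity:

def pagerank(out_links, d=0.85, iters=50):
    # out_links: node -> set of nodes it points to (no dangling pages here).
    nodes = list(out_links)
    pr = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        pr = {n: (1 - d) / len(nodes) +
                 d * sum(pr[m] / len(out_links[m])
                         for m in nodes if n in out_links[m])
              for n in nodes}
    return pr

graph = {"a": {"b", "c"}, "b": {"c"}, "c": {"a"}}
print(pagerank(graph))   # "c" ends up with the highest rank: two in-links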

5 Marks
1. Explain GSP algorithm
GSP makes multiple passes over the database. In the first pass, all single items (1-sequences) are counted. From this, a set of candidate 2-sequences is formed, and another pass is made to gather their support. The frequent 2-sequences are used to generate candidate 3-sequences, and this process is repeated until no more frequent sequences are found. There are two main steps in this algorithm.
1. Candidate Generation: Given the set of frequent (k-1)-sequences F(k-1), the candidates for the next pass are generated by joining F(k-1) with itself. A pruning phase eliminates any sequence at least one of whose subsequences is not frequent.
2. Support Counting: A hash-tree based search is employed for efficient support counting. Finally, non-maximal frequent sequences are removed.
F1 = set of frequent 1-sequences
k = 2
do while F(k-1) ≠ Φ
    generate candidate set Ck (set of candidate k-sequences)
    for all input sequences s in the database D do
        increment count of all a in Ck if s supports a
    Fk = {a ∈ Ck such that its frequency exceeds the threshold}
    k = k + 1
end do
the set of all frequent sequences is the union of all Fk
The above algorithm resembles the Apriori algorithm. One main difference is the generation of candidates. Let us assume that A->B and A->C are two frequent 2-sequences. The items involved in these sequences are (A,B) and (A,C). The candidate generation in the usual Apriori algorithm would give (A,B,C) as a 3-itemset, but the candidate generation in this context generates A->B->C, A->C->B and A->BC. The GSP algorithm discovers frequent sequences, allowing for time constraints such as maximum gap and minimum gap among the sequence elements. It supports a sliding window, i.e., a time interval within which items are observed as belonging to the same event, even if they originate from different events.
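A compact Python sketch of the GSP loop, under simplifying assumptions: each event is a single item (so a sequence is a tuple of items), support counting scans the database directly instead of using a hash-tree, and the time constraints and sliding window are omitted:

def is_subsequence(cand, seq):
    # True if cand occurs in seq in order (not necessarily contiguously).
    it = iter(seq)
    return all(item in it for item in cand)

def gsp(db, min_support):
    # Frequent 1-sequences.
    freq = {(i,) for seq in db for i in seq}
    freq = {c for c in freq
            if sum(is_subsequence(c, s) for s in db) >= min_support}
    all_freq = set(freq)
    while freq:
        # Candidate generation: join F(k-1) with itself ...
        cands = {a + (b[-1],) for a in freq for b in freq if a[1:] == b[:-1]}
        # ... then prune candidates with an infrequent (k-1)-subsequence.
        cands = {c for c in cands
                 if all(c[:j] + c[j + 1:] in freq for j in range(len(c)))}
        # Support counting over the database.
        freq = {c for c in cands
                if sum(is_subsequence(c, s) for s in db) >= min_support}
        all_freq |= freq
    return all_freq

db = [("A", "B", "C"), ("A", "C", "B"), ("A", "B", "C", "D")]
print(sorted(gsp(db, 2)))   # includes ('A','B','C'), supported by 2 sequences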
2. Write a note on web text mining.
Due to the continuous growth of the volumes of text data, automated extraction of implicit,
previously unknown, and potentially useful information becomes more necessary to properly
utilize this vast source of knowledge.

Text mining, therefore, corresponds to the extension of the data mining approach to textual data and is concerned with various tasks, such as the extraction of information implicitly contained in a collection of documents, or similarity-based structuring.
Text expresses a vast range of information, but encodes the information in a form that is difficult to interpret automatically. When the data is structured, it is easy to define the set of items, and hence, it becomes easy to employ the traditional mining techniques. Identifying individual items or terms is not so obvious in a textual database. Thus, unstructured data, particularly free-running text, places a new demand on data mining methodology.
The inherent nature of textual data, namely its unstructured character, motivates the development of separate text mining techniques. One way is to impose a structure on the textual database and use any of the known data mining techniques meant for structured databases. The other approach is to develop a very specific technique for mining that exploits the inherent characteristics of textual databases.

3. What are the features of unstructured text?


Word Occurrences: The bag-of-words or vector representation takes single words found in the training corpus as features, ignoring the sequence in which the words occur. This representation is based on statistics about single words in isolation. Such a feature is said to be Boolean if we consider only whether a word occurs or does not occur in a document. The feature is said to be frequency-based if the frequency of the word in a document is taken into consideration. (A sketch contrasting the two appears after this list.)
Stop-words: Feature selection includes removing the case, punctuation, infrequent words and stop-words.
Latent Semantic Indexing (LSI): LSI transforms the original document vectors to a lower dimensional space by analysing the correlational structure of terms in the document collection, such that similar documents that do not share terms are placed in the same topic.
Stemming: Stemming is a process which reduces words to their morphological roots. For example, the words “informing”, “information”, “informer”, and “informed” would be stemmed to their common root “inform”, and only the root is used as the feature instead of the four original words.
Part-of-Speech (POS): One important feature is the POS tag. There can be 25 possible values for POS tags; the most common tags are noun, verb, adjective and adverb. Thus, we can assign the numbers 1, 2, 3, 4 and 5 depending on whether the word is a noun, verb, adjective, adverb or any other part of speech, respectively.
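The sketch below contrasts the Boolean and frequency-based word-occurrence features over a fixed vocabulary; the vocabulary and tokens are illustrative:

from collections import Counter

vocab = ["inform", "mining", "text", "web"]     # illustrative vocabulary

def word_features(tokens):
    counts = Counter(tokens)
    boolean = [1 if counts[t] else 0 for t in vocab]    # occurs or not
    frequency = [counts[t] for t in vocab]              # how often
    return boolean, frequency

print(word_features(["text", "mining", "text"]))
# ([0, 1, 1, 0], [0, 1, 2, 0])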

4. Write a note on web usage mining


Web usage mining finds patterns of usage of web pages. It deals with studying data generated by the web surfer’s sessions or behaviours. Note
that web content and structure mining utilize the real or primary data on the web. But web
usage mining mines the secondary data derived from the interactions of the users with the
web. The secondary data includes the data from the web server access logs, proxy server logs,
browser logs, user profiles, registration data, user sessions or transactions, cookies, user
queries, bookmark data, mouse clicks and scrolls, and any other data which are the results of
these interactions. This data can be accumulated by the web server. Analyses of the web
access logs of different web sites can facilitate an understanding of the user behaviour and the
web structure, thereby improving the design of this colossal collection of information.
There are two main approaches in web usage mining.
1. General Access Pattern Tracking
2. Customized Usage Tracking

5. Explain the two main approaches in web usage mining.


i) General Access Pattern Tracking: This is to learn user navigation patterns. The general
access pattern tracking analyses the web logs to understand access patterns and trends. These
analyses can shed better light on the structure and grouping of resource providers.
ii) Customized Usage Tracking: This is to learn a user profile, or user modelling, in adaptive
interfaces. Customized Usage Tracking analyses individual trends. Its purpose is to customize
websites to users. The information displayed, the depth of the site structure, and the format of
the resources can all be dynamically customized for each user over time, based on their
access pattern. For effective web usage mining, an important cleaning and data
transformation step may be needed before analysis. The mining techniques for web usage
mining can be classified into two commonly used approaches. The first approach maps the
usage data of the web server into relational tables before a (traditional) data mining technique
is performed. In general, typical data mining methods (such as clustering and classification) could be used to mine the usage data after the data have been pre-processed to the desired form. However, modifications of typical data mining methods are also used, such as composite association rules, extensions of a traditional sequence discovery algorithm (MIDAS), and hypertext probabilistic grammars.
The second approach uses the log data directly by utilizing special pre-processing techniques. The web usage data can also be represented using a graph. This approach uses some background and domain knowledge of navigation templates, web content, site topology, concept hierarchies, and syntactic constraints.

6.Write a note on web structure mining


Web structure mining discovers the structure of the hyperlinks between documents. Web
structure mining is concerned with discovering the model underlying the link structures of
the web. This model can be used to categorize web pages and is useful to generate
information such as the similarity and relationship between different web sites. While web content mining attempts to explore the structure within a document (intra-document structure), web structure mining studies the structure of documents within the web itself (inter-document structure).
There are many algorithms to calculate the quality rank or relevancy of each web page. Algorithms such as HITS, PageRank and CLEVER have been proposed to model web topology. These models are mainly applied as methods to calculate the quality rank or relevancy of each web page, and they are among the techniques useful in modelling web topology.

7. Explain the purpose of web mining


Web mining is the application of data mining techniques to discover patterns from the World Wide Web. It uses automated methods to extract both structured and unstructured data from web pages, server logs and link structures.
The World Wide Web has become a very popular medium of publishing. Though the web is
rich with information, gathering and making sense of this data is difficult because publication
on the web is largely unorganized. Having learnt the data mining techniques, a question that
naturally comes to our minds is whether we can extract implicit, previously unknown
information from the massive collection of documents available in the web. We interact with
the web for the following purposes:

a) Finding Relevant Information: We either browse or use the search service when we want
to find specific information on the web. We usually specify a simple keyword query and the
response from a web search engine is a list of pages, ranked based on their similarity to the
query. However, today’s search tools have the following problems:
i) Low precision: This is due to the irrelevance of many of the search results. We may get
many pages of information which are not really relevant to our query.
ii) Low recall: This is due to the inability to index all the information available on the web.
Because some of the relevant pages are not properly indexed, we may not get those pages
through any of the search engines.

b) Discovering New Knowledge from the Web: We can define the above problem as a query-triggered process (retrieval-oriented). On the other hand, we can have a data-triggered process that assumes that we already have a collection of web data and we want to extract potentially useful knowledge out of it (data mining-oriented).

c) Personalized Web Page Synthesis: We may wish to synthesize a web page for different
individuals from the available set of web pages. Individuals have their own preferences in the
style of contents and presentations while interacting with the web. The information providers
like to create a system which responds to user queries by potentially aggregating
information from several sources.

d) Learning About Individual Users: This is about learning what customers do and what customers want, mass-customizing the information to the intended consumers or even personalizing it to the individual user.
Web mining provides a set of technologies to solve these problems. However, web mining is not the only tool available to handle such problems; techniques from other research areas, such as databases (DB), Information Retrieval (IR) and Natural Language Processing (NLP), can also be used.

8. Explain the relationship between information retrieval and information extraction in text data mining.
IR is concerned with finding and ranking documents that match the user’s information needs; in essence, IR is the automatic retrieval of all relevant documents. IE has the goal of transforming a collection of documents into information that is more readily digested and analysed, often with the help of an IR system. IE extracts relevant facts from the documents, while IR selects relevant documents. Thus, in general, IE works at a finer granularity level than IR does on the documents. Most IE systems use machine learning or data mining techniques to learn the extraction patterns or rules for documents semi-automatically or automatically. Within this view, text mining is part of the IE process.

9.Write a note on web content mining


Web content mining extracts information from within a page. Web content mining describes
the discovery of useful information from the web contents. The web contents could include a
very broad range of data; the web contains many kinds of data. Much of the government information is gradually being placed on the web in recent years. We also know of the existence of digital libraries that are accessible from the web. Many commercial institutions are transforming their business and services electronically. There is another type of web content: web applications reached through web interfaces.
Some of the web content data are hidden data, and some are generated dynamically as a result of queries and reside in DBMSs.
Basically, the web content consists of several types of data such as text, image, audio, video and metadata, as well as hyperlinks. Recent research on mining multiple types of data is termed multimedia data mining. The textual parts of web content data consist of unstructured data such as free texts, semi-structured data such as HTML documents, and more structured data such as data in tables or database-generated HTML pages.

10. Explain episode rule discovery for texts


Episode rule discovery applies sequence mining techniques to text data. Here the text is considered as sequential data consisting of a sequence of pairs (feature vector, index), where the feature vector is an ordered set of features and the index contains information about the position of the word in the sequence. A feature can be any of the textual features listed earlier.

Ex: The text “Knowledge Discovery in Databases” can be represented as the sequence
S = ((knowledge_noun_singular, 1), (discovery_noun_singular, 2), (in_preposition, 3), (databases_noun_plural, 4)).
Instead of considering all occurrences of an episode, a restriction is set that the episode must occur within a pre-specified window of size w. Thus we examine the subsequences S1 of S such that the difference of the indices in S1 is at most w.
Ex: For w = 2, the subsequence (knowledge_noun_singular, discovery_noun_singular) is an episode contained in the window, but the subsequence (knowledge_noun_singular, databases_noun_plural) is not contained within the window.
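A sketch of the window check on this example, assuming plain words stand in for the full (feature vector, index) pairs and each word occurs only once:

# Indexed word sequence for "Knowledge Discovery in Databases".
S = [("knowledge", 1), ("discovery", 2), ("in", 3), ("databases", 4)]

def occurs_within(episode, seq, w):
    # Serial episode check: the words must appear in order, and the
    # difference of their indices must be at most w.
    positions = {word: idx for word, idx in seq}   # assumes unique words
    if any(word not in positions for word in episode):
        return False
    idxs = [positions[word] for word in episode]
    in_order = all(a < b for a, b in zip(idxs, idxs[1:]))
    return in_order and max(idxs) - min(idxs) <= w

print(occurs_within(["knowledge", "discovery"], S, 2))   # True
print(occurs_within(["knowledge", "databases"], S, 2))   # False: gap of 3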

11. Write a note on temporal association rules.


Association rules identify whether a particular subset of items is supported by an adequate number of transactions. A static association rule discovers the association between two events. A temporal association rule suggests a number of interpretations, such as:
• The earlier event plays some role in causing the later event.
• There is a third set of reasons that causes both the events.
• The confluence of events is coincidental.
Temporal association rules are also called causal rules. They describe relationships where changes in one event cause changes in other parts of the domain. They are used in scientific investigations and the medical domain, where static properties such as gender and temporal properties such as medical treatment are taken into account.

12. Write GSP algorithm


The algorithm for solving the sequence mining problem is a level-wise Apriori-style algorithm in which all the frequent sequences are discovered level by level. The first step simply means counting the occurrences of all singleton elements in the database. Then the transactions are filtered by removing the non-frequent items, so that at the end each transaction consists of only the frequent elements. We can use this modified database as input to the GSP algorithm. This process requires one pass over the whole database.
GSP makes multiple passes over the database. In the first pass, all single items (1-sequences) are counted. From this, a set of candidate 2-sequences is formed, and another pass is made to gather their support. The frequent 2-sequences are used to generate candidate 3-sequences, and this process is repeated until no more frequent sequences are found. There are two main steps in this algorithm.
1. Candidate Generation: Given the set of frequent (k-1)-sequences F(k-1), the candidates for the next pass are generated by joining F(k-1) with itself. A pruning phase eliminates any sequence at least one of whose subsequences is not frequent.
2. Support Counting: A hash-tree based search is employed for efficient support counting. Finally, non-maximal frequent sequences are removed.
F1 = set of frequent 1-sequences
k = 2
do while F(k-1) ≠ Φ
    generate candidate set Ck (set of candidate k-sequences)
    for all input sequences s in the database D do
        increment count of all a in Ck if s supports a
    Fk = {a ∈ Ck such that its frequency exceeds the threshold}
    k = k + 1
end do
the set of all frequent sequences is the union of all Fk
The above algorithm resembles the Apriori algorithm. One main difference is the generation of candidates. Let us assume that A->B and A->C are two frequent 2-sequences. The items involved in these sequences are (A,B) and (A,C). The candidate generation in the usual Apriori algorithm would give (A,B,C) as a 3-itemset, but the candidate generation in this context generates A->B->C, A->C->B and A->BC. The GSP algorithm discovers frequent sequences, allowing for time constraints such as maximum gap and minimum gap among the sequence elements. It supports a sliding window, i.e., a time interval within which items are observed as belonging to the same event, even if they originate from different events.
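The candidate-generation difference can be made concrete with a short sketch, assuming events are represented as frozensets inside ordered tuples (an illustrative representation only, not full GSP):

# Joining the frequent 2-sequences A->B and A->C yields three sequence
# candidates, not the single itemset (A,B,C) that Apriori would produce.
A, B, C = "A", "B", "C"
s1 = (frozenset({A}), frozenset({B}))    # A -> B
s2 = (frozenset({A}), frozenset({C}))    # A -> C

candidates = [
    s1 + (frozenset({C}),),              # A -> B -> C
    s2 + (frozenset({B}),),              # A -> C -> B
    (frozenset({A}), frozenset({B, C})), # A -> BC  (B and C in one event)
]
for cand in candidates:
    print(" -> ".join("".join(sorted(e)) for e in cand))
# Output:
# A -> B -> C
# A -> C -> B
# A -> BC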

13. Write a note on spatial mining.


Spatial mining is a branch of data mining that deals with spatial (location or geo-referenced) data. Consider a map of the city of Hyderabad containing various natural and man-made geographical features and clusters of points representing houses. The houses may be noteworthy because of their size, historical interest or current market value. Clustering algorithms exist to assign each point to exactly one cluster, with the number of clusters defined by the user. We can mine a variety of information by identifying relationships, for example “70% of the Banjara migrants settle in the city around the market area”. Such information could be useful to investors or prospective home buyers, and the approach also applies to other domains such as satellite images, photographs, and oil and gas exploration. The problem is not trivial: there may be a large number of features, and we need to detect relationships among a large number of geo-referenced objects without incurring heavy overheads.

14 Write a note on sequence mining


Sequence Mining

● An efficient approach to mining causal events is sequence mining. Sequence mining can be applied to DNA sequences, signal processing and speech analysis.
● Given a set of data sequences, the problem is to discover sequences that are frequent, in the sense that the percentage of data sequences containing them exceeds the user-specified minimal support.
● It is widely used in retailing, where a data sequence corresponds to the items bought by an individual customer over time. Frequent patterns are useful for predicting future customer behaviour.
Sequence Mining Problem: Let Σ = {i1, i2, ….., im} be a set of m distinct items. An event is a non-empty, unordered collection of items; without any loss of generality, we write the items in an event in some predefined order. An event is denoted by (i1, i2, …., ik), where each ij is an item in Σ.
Definition of Sequence: A sequence is an ordered list of events. A sequence α is denoted as (α1 -> α2 -> … -> αq), where each αi is an event. A subsequence is a sequence within a sequence; its items need not be adjacent in time, but their ordering should not violate the time ordering of the supporting events. A subsequence can be obtained by deleting some items and/or events. Formally, the sequence s = (α1 -> α2 -> … -> αq) is said to be a subsequence of s1 = (β1 -> β2 -> … -> βr) if there exist indices t1 < t2 < … < tq such that α1 ⊆ βt1, α2 ⊆ βt2, …, αq ⊆ βtq. A sequence s1 is said to support another sequence s if s is a subsequence of s1.
Definition—Frequency: The frequency of a sequence s, with respect to database D, is the
total number of input sequences in D that support it.
Definition—Frequent Sequence: A frequent sequence is a sequence whose frequency exceeds some user-specified threshold. A frequent sequence is maximal if it is not a subsequence of another frequent sequence.
Example:
Customer   A B C D E F
C1         1 1 0 1 1 1
C2         0 0 1 1 0 1
C1         0 1 1 0 1 1
C3         0 0 0 1 1 1
C2         1 1 1 1 0 1
C1         0 1 0 0 1 0
C3         1 0 1 1 1 0
C2         0 1 0 0 1 0
C4         1 1 1 1 1 1
C1         0 0 1 1 1 1
C4         0 1 0 0 0 1
C3         1 0 1 1 1 1
C2         0 1 1 0 0 0
C4         1 0 1 1 1 1
C2         0 1 0 0 0 0

One customer’s transactions from the table:

Customer   A B C D E F
C1         1 1 0 1 1 1
C1         0 1 1 0 1 1
C1         0 1 0 0 1 0
C1         0 0 1 1 1 1

Σ = {A, B, C, D, E, F}
Sequence 1: (A,B,D,E,F) -> (B,C,E,F) -> (B,E) -> (C,D,E,F)
Sequence 2: (C,D,F) -> (A,B,C,D,F) -> (B,E) -> (B,C) -> (B)
Sequence 3: (D,E,F) -> (A,C,D,E) -> (A,C,D,E,F)
Sequence 4: (A,B,C,D,E,F) -> (B,F) -> (A,C,D,E,F)

Thus, the database D consists of 4 sequences. AC (that is, the single event (A,C)) is not a subsequence of sequence 1, but it is a subsequence of sequences 2, 3 and 4. By AC we mean that there should be a transaction containing both A and C. By A -> C, we mean that there is a transaction containing C which appears after (not necessarily immediately after) another transaction containing A.
The frequency of AC in D is 3. Note that we do not count multiple occurrences of AC in the same sequence. B -> BE is not supported by sequence 4, which supports BE -> B. The subsequence B -> BE indicates that a transaction containing B and E follows another transaction containing B; BE -> B represents a subsequence in which a transaction containing B and E occurs before a transaction containing B.
Normally the sequence mining problem concerns the temporal order of events within a sequence of transactions. We may also need to focus on the temporal distance between the events: that is, A -> C should indicate the temporal gap between the transaction containing A and the transaction containing C. This is relevant when the correlation between the two events comes to an end after a period of time. For example, when we are trying to find the causal relationship between consumption of a beverage and a stomach upset, it is irrelevant to correlate two events separated by a gap of one year. We can specify the time distance between two events in terms of a threshold d. So, B ->d BE denotes that the event containing B and E occurs not more than d transactions after the transaction containing B. For example, there is no sequence which supports ABD ->1 D, whereas ABD ->2 D is supported by sequence 4. When d = 1 we say that the problem is the continuous sequence mining problem. A simple sequence mining problem is one where each transaction contains a single item.
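A sketch that encodes the four data sequences above and checks support, assuming events are Python sets and a sequence is a tuple of events:

def supports(seq, sub):
    # seq supports sub if sub's events can be matched, in order,
    # to events of seq that contain them.
    i = 0
    for event in seq:
        if i < len(sub) and sub[i] <= event:   # subset test
            i += 1
    return i == len(sub)

# The four data sequences from the example.
D = [
    ({"A","B","D","E","F"}, {"B","C","E","F"}, {"B","E"}, {"C","D","E","F"}),
    ({"C","D","F"}, {"A","B","C","D","F"}, {"B","E"}, {"B","C"}, {"B"}),
    ({"D","E","F"}, {"A","C","D","E"}, {"A","C","D","E","F"}),
    ({"A","B","C","D","E","F"}, {"B","F"}, {"A","C","D","E","F"}),
]

ac = ({"A", "C"},)                 # AC: one event containing both A and C
a_then_c = ({"A"}, {"C"})          # A -> C: C in a later transaction than A
print(sum(supports(s, ac) for s in D))        # 3 (sequences 2, 3 and 4)
print(sum(supports(s, a_then_c) for s in D))  # 4 (all four sequences)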
15. Explain the types of temporal data.
Temporal data mining is defined as the non-trivial extraction of implicit, potentially useful,
previously unrecorded information with an implicit or explicit time content, from large
quantity of data. Temporal mining, aims at mining new information which takes into account
the temporal aspects of data.
Types of Temporal Data
1. Static: Static data are free of any temporal reference, and inferences derived from such data are free of any temporality.
2. Sequences (Ordered sequence of events): In this category there is no explicit reference to
time, but there exists a sort of temporal relationship between data items. If a transaction
occurs before another transaction, we call it a sequence. The relations ‘before’, ‘after’,
‘during’, ‘overlap’ indicate the sequence.
3. Timestamped: In this category, the temporal information is explicit and there is an exact temporal relationship between the items. Examples include census data and land-use data.
4. Fully Temporal: In this category, the validity of the data element is time-dependent.

16. Explain the relationship between information retrieval and computational linguistics in text data mining.

Text mining corresponds to the extension of the data mining approach to textual data and is concerned with various tasks, such as the extraction of information implicitly contained in a collection of documents, or similarity-based structuring.

Information Retrieval (IR): IR is concerned with finding and ranking documents that match the user’s information needs. The IR community deals with textual information through keyword-based document representation: a body of text is analysed by its constituent words, and various techniques are used to build the core words for a document. In essence, IR is the automatic retrieval of all relevant documents. The goals of IR are:

1. To find documents that are similar, based on some specification of the user.

2. To find the right index terms in a collection, so that querying will return the appropriate documents.
Computational Linguistics: Computational linguistics computes statistics over large text
collections in order to discover useful patterns. These patterns are used to inform algorithms for
various sub-problems within natural language processing such as part-of-speech tagging. The aim of
text mining is also similar. However, in computational linguistics, patterns are discovered to aid other
problems within the same domain, whereas text mining is aimed at discovering unknown information
for different applications.
