UNIT 4 Data Mining
2 Marks
1. What is stemming?
Stemming is a process which reduces words to their morphological roots. For example, the
words “informing”, “information”, “informer” and “informed” would all be stemmed to their
common root “inform”, and only this root is used as the feature instead of the original four
words.
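As a sketch, stemming can be approximated by suffix stripping; the rules below are illustrative assumptions, not the Porter algorithm:

```python
# A minimal suffix-stripping stemmer sketch. The suffix list and the
# minimum-stem-length rule are illustrative assumptions only.
def naive_stem(word):
    for suffix in ("ation", "ing", "er", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = ["informing", "information", "informer", "informed"]
print([naive_stem(w) for w in words])  # all four reduce to "inform"
```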
2. What is sequence mining?
An efficient approach to mining causal events is sequence mining. Sequence mining can be
applied to DNA sequences, signal processing and speech analysis. Given a set of data
sequences, the problem is to discover sequences that are frequent, in the sense that the
percentage of data sequences containing them exceeds the user-specified minimum support.
It is widely used in retailing, where a data sequence corresponds to the items bought by an
individual customer over time. Frequent patterns are useful for predicting future customer
behaviour.
8. What is co-citation?
For a pair of nodes, p and q, the co-citation is the number of nodes that point to both p and
q.
16. What is Part-Of-Speech?
One important feature is the POS tag. There can be 25 possible values for POS tags. The most
common tags are noun, verb, adjective and adverb. Thus, we can assign a number 1, 2, 3, 4 or
5, depending on whether the word is a noun, verb, adjective, adverb or any other,
respectively.
17. Define stop-words
Stop-words are words that are used so commonly that they carry very little useful
information; text mining applications eliminate them.
Feature selection includes removing the case, punctuation, infrequent words and stop-words.
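These feature-selection steps can be sketched in Python; the stop-word list here is a tiny illustrative sample, not a standard list:

```python
import re

# Sketch of the feature-selection steps named above: lowercasing,
# punctuation removal, stop-word removal. The stop-word set is a
# small illustrative sample only.
STOP_WORDS = {"the", "is", "in", "a", "of", "to", "and"}

def select_features(text):
    tokens = re.findall(r"[a-z]+", text.lower())   # remove case and punctuation
    return [t for t in tokens if t not in STOP_WORDS]

print(select_features("Knowledge Discovery in Databases is the core of data mining."))
# ['knowledge', 'discovery', 'databases', 'core', 'data', 'mining']
```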
18.What are the different types of web mining?
a) Web content mining
b) Web structure mining
c) Web usage mining
19. Define Bibliographic Coupling.
Bibliographic Coupling: For a pair of nodes, p and q, the bibliographic coupling is equal to
the number of nodes that have links from both p and q.
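Both measures can be computed directly from a link graph; the small graph below is hypothetical:

```python
# Co-citation and bibliographic coupling over a hypothetical link graph.
# links[u] is the set of pages that u points to.
links = {
    "u": {"p", "q"},
    "v": {"p", "q", "r"},
    "w": {"q"},
    "p": {"r", "s"},
    "q": {"r", "s", "t"},
}

def co_citation(p, q):
    # number of nodes that point to both p and q
    return sum(1 for u, out in links.items() if p in out and q in out)

def bibliographic_coupling(p, q):
    # number of nodes that receive links from both p and q
    return len(links.get(p, set()) & links.get(q, set()))

print(co_citation("p", "q"))            # 2: u and v point to both
print(bibliographic_coupling("p", "q")) # 2: both p and q point to r and s
```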
20. Define page rank.
A page can have a high PageRank if there are many pages that point to it, or if there are
some pages that point to it which have a high PageRank. Naturally, pages that are well cited
from many places around the web are worth looking at. Also, pages that have perhaps only
one citation from something like the Yahoo! Homepage are also generally worth looking at.
PageRank handles both these cases and everything in between, by recursively propagating
weights through the link structure of the web.
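A minimal sketch of this recursive propagation as power iteration over a small hypothetical link graph (damping factor 0.85; dangling pages are not handled):

```python
# PageRank by power iteration. links maps each page to the list of
# pages it points to; the graph and iteration count are assumptions.
def pagerank(links, d=0.85, iters=50):
    nodes = list(links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        new = {v: (1 - d) / n for v in nodes}
        for u, outs in links.items():
            if outs:
                share = d * rank[u] / len(outs)  # u splits its rank over its out-links
                for v in outs:
                    new[v] += share
        rank = new
    return rank

links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(links)
# "c" ends up with the highest rank: both "a" and "b" point to it
```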
5 Mark
1. Explain GSP algorithm
GSP makes multiple passes over the database. In the first pass, all single items (1-sequences)
are counted. From these, a set of candidate 2-sequences is formed, and another pass is made
to gather their support. The frequent 2-sequences are used to generate candidate 3-sequences,
and this process is repeated until no more frequent sequences are found. There are two main
steps in this algorithm.
1. Candidate Generation: Given the set of frequent (k-1)-sequences F(k-1), the candidates
for the next pass are generated by joining F(k-1) with itself. A pruning phase eliminates any
sequence at least one of whose subsequences is not frequent.
2. Support Counting: A hash-tree based search is employed for efficient support counting.
Finally, non-maximal frequent sequences are removed.
F1 = set of frequent 1-sequences
k = 2
do while F(k-1) ≠ Φ
    generate the candidate set Ck (set of candidate k-sequences) from F(k-1);
    for all input sequences s in the database D do
        increment the count of all a in Ck if s supports a;
    Fk = {a ∈ Ck such that its frequency exceeds the threshold};
    k = k + 1;
end do
The set of all frequent sequences is the union of all Fk.
The above algorithm looks like the Apriori algorithm. One main difference is the generation
of candidates. Let us assume that A→B and A→C are two frequent 2-sequences. The items
involved in these sequences are (A, B) and (A, C). Candidate generation in the usual Apriori
algorithm would give (A, B, C) as a 3-itemset, but candidate generation in this context
generates A→B→C, A→C→B and A→(BC). The GSP algorithm discovers frequent
sequences, allowing for time constraints such as maximum gap and minimum gap among the
sequence elements. It supports a sliding window, i.e., a time interval within which items are
observed as belonging to the same event, even if they originate from different events.
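The specific join described above (two frequent 2-sequences sharing their first item) can be sketched as follows; this is a toy illustration of that one case, not the general GSP join condition:

```python
# Sequences are tuples of events; each event is a frozenset of items.
# From the frequent 2-sequences A->B and A->C, build the candidate
# 3-sequences A->B->C, A->C->B and A->(BC).
def join_2sequences(s1, s2):
    # assumes both 2-sequences start with the same event
    (head1, x), (head2, y) = s1, s2
    assert head1 == head2
    return [
        (head1, x, y),              # A -> B -> C
        (head1, y, x),              # A -> C -> B
        (head1, frozenset(x | y)),  # A -> (BC): B and C in one event
    ]

A, B, C = frozenset("A"), frozenset("B"), frozenset("C")
candidates = join_2sequences((A, B), (A, C))
```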
2. Write a note on web text mining.
Due to the continuous growth of the volumes of text data, automated extraction of implicit,
previously unknown, and potentially useful information becomes more necessary to properly
utilize this vast source of knowledge.
Text mining, therefore, corresponds to the extension of the data mining approach to textual
data and is concerned with various tasks, such as extraction of information implicitly
contained in collections of documents, or similarity-based structuring.
Text expresses a vast range of information, but encodes it in a form that is difficult to
interpret automatically. When data is structured, it is easy to define the set of items, and
hence it becomes easy to employ traditional mining techniques. Identifying individual items
or terms is not so obvious in a textual database. Thus, unstructured data, particularly
free-running text, places a new demand on data mining methodology.
The inherent nature of textual data, namely its unstructured character, motivates the
development of separate text mining techniques. One way is to impose a structure on the
textual database and use any of the known data mining techniques meant for structured
databases. The other approach is to develop a very specific technique for mining that
exploits the inherent characteristics of textual databases.
a) Finding Relevant Information: We either browse or use the search service when we want
to find specific information on the web. We usually specify a simple keyword query and the
response from a web search engine is a list of pages, ranked based on their similarity to the
query. However, today’s search tools have the following problems:
i) Low precision: This is due to the irrelevance of many of the search results. We may get
many pages of information which are not really relevant to our query.
ii) Low recall: This is due to the inability to index all the information available on the web.
Because some of the relevant pages are not properly indexed, we may not get those pages
through any of the search engines.
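With hypothetical numbers, the two measures look like this:

```python
# Illustrative (made-up) figures: a query returns 20 pages, only 5 of
# which are relevant, while 50 relevant pages exist in total.
retrieved, relevant_retrieved, relevant_total = 20, 5, 50

precision = relevant_retrieved / retrieved       # 5/20 = 0.25 -> low precision
recall = relevant_retrieved / relevant_total     # 5/50 = 0.10 -> low recall
```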
b) Discovering New Knowledge from the Web: We can define the above problem as a
query – triggered process (retrieval oriented). On the other hand, we can have a data –
triggered process that assumes that we already have a collection of web data and we want to
extract potentially useful knowledge out of it (data mining – oriented).
c) Personalized Web Page Synthesis: We may wish to synthesize a web page for different
individuals from the available set of web pages. Individuals have their own preferences in the
style of content and presentation while interacting with the web. Information providers
would like to create a system which responds to user queries by potentially aggregating
information from several sources.
d) Learning about Individual Users: This is about knowing what customers do and what
customers want, mass-customizing the information to the intended consumers or even
personalizing it to the individual user. Web mining provides a set of technologies to solve
these problems. However, web mining is not the only tool available to handle such problems;
techniques from other research areas, such as Databases (DB), Information Retrieval (IR)
and Natural Language Processing (NLP), can also be used.
Ex: The text “Knowledge Discovery in Databases” can be represented as the sequence S =
((knowledge_noun_singular, 1), (discovery_noun_singular, 2), (in_preposition, 3),
(databases_noun_plural, 4)). Instead of considering all occurrences of an episode, a
restriction is set that the episode must occur within a pre-specified window of size ‘w’. Thus
we examine the substrings S1 of S such that the difference of the indices in S1 is at most ‘w’.
Ex: For w = 2, the subsequence (knowledge_noun_singular, discovery_noun_singular) is an
episode contained in the window, but the subsequence (knowledge_noun_singular,
databases_noun_plural) is not contained within the window.
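The window constraint can be checked with a one-line test on the word indices:

```python
# Window check for the example above: an episode is contained in a
# window of size w if the difference of its word indices is at most w.
S = [("knowledge_noun_singular", 1), ("discovery_noun_singular", 2),
     ("in_preposition", 3), ("databases_noun_plural", 4)]

def within_window(episode_indices, w):
    return max(episode_indices) - min(episode_indices) <= w

print(within_window([1, 2], w=2))  # knowledge, discovery -> True
print(within_window([1, 4], w=2))  # knowledge, databases -> False
```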
● An efficient approach to mining causal events is sequence mining. Sequence mining can
be applied to DNA sequences, signal processing and speech analysis.
● Given a set of data sequences, the problem is to discover sequences that are frequent, in
the sense that the percentage of data sequences containing them exceeds the user-specified
minimum support.
● It is widely used in retailing, where a data sequence corresponds to the items bought by
an individual customer over time. Frequent patterns are useful for predicting future
customer behaviour.
¤ Sequence Mining Problem: Let Σ = {i1, i2, ..., im} be a set of m distinct items. An event is
a non-empty unordered collection of items. Without any loss of generality, we write the
items in an event in some predefined order. An event is denoted by {i1, i2, ..., ik}, where
each ij is an item in Σ.
¤ Definition of Sequence: A sequence is an ordered list of events. A sequence α is denoted
as (α1 → α2 → … → αq), where each αi is an event. A subsequence is a sequence within a
sequence. Its items need not be adjacent in time, but their ordering should not violate the
time ordering of the supporting events. A subsequence can be obtained by deleting some
items and/or events. The formal definition of subsequence: the sequence
s = (α1 → α2 → … → αq) is said to be a subsequence of s1 = (β1 → β2 → … → βr) if there
exist indices t1 < t2 < … < tq of s1 such that α1 ⊆ βt1, α2 ⊆ βt2, …, αq ⊆ βtq. A sequence
s1 is said to support another sequence s if s is a subsequence of s1.
Definition—Frequency: The frequency of a sequence s, with respect to a database D, is the
total number of input sequences in D that support it.
Definition—Frequent Sequence: A frequent sequence is a sequence whose frequency exceeds
some user-specified threshold. A frequent sequence is maximal if it is not a subsequence of
any other frequent sequence.
Example:
Customer  A B C D E F
C1        1 1 0 1 1 1
C2        0 0 1 1 0 1
C1        0 1 1 0 1 1
C3        0 0 0 1 1 1
C2        1 1 1 1 0 1
C1        0 1 0 0 1 0
C3        1 0 1 1 1 0
C2        0 1 0 0 1 0
C4        1 1 1 1 1 1
C1        0 0 1 1 1 1
C4        0 1 0 0 0 1
C3        1 0 1 1 1 1
C2        0 1 1 0 0 0
C4        1 0 1 1 1 1
C2        0 1 0 0 0 0
One sequence from the table (customer C1):
Customer  A B C D E F
C1        1 1 0 1 1 1
C1        0 1 1 0 1 1
C1        0 1 0 0 1 0
C1        0 0 1 1 1 1
Σ = {A, B, C, D, E, F}
Sequence 1: (A,B,D,E,F) → (B,C,E,F) → (B,E) → (C,D,E,F)
Sequence 2: (C,D,F) → (A,B,C,D,F) → (B,E) → (B,C) → (B)
Sequence 3: (D,E,F) → (A,C,D,E) → (A,C,D,E,F)
Sequence 4: (A,B,C,D,E,F) → (B,F) → (A,C,D,E,F)
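A sketch that encodes the four data sequences and counts support according to the subsequence definition above (events as Python frozensets):

```python
# Events are frozensets; a sequence is a list of events in time order.
def is_subsequence(s, s1):
    i = 0  # index of the next event of s to match
    for event in s1:
        if i < len(s) and s[i] <= event:  # alpha_i is a subset of beta_t
            i += 1
    return i == len(s)

def frequency(s, D):
    # number of input sequences in D that support s
    return sum(is_subsequence(s, d) for d in D)

F = frozenset
D = [
    [F("ABDEF"), F("BCEF"), F("BE"), F("CDEF")],       # customer C1
    [F("CDF"), F("ABCDF"), F("BE"), F("BC"), F("B")],  # customer C2
    [F("DEF"), F("ACDE"), F("ACDEF")],                 # customer C3
    [F("ABCDEF"), F("BF"), F("ACDEF")],                # customer C4
]

# ({D,F}) -> ({B}) is supported by C1, C2 and C4 but not C3
print(frequency([F("DF"), F("B")], D))  # 3
```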
Information Retrieval (IR): IR is concerned with finding and ranking documents that match the
user’s information needs. The IR community’s way of dealing with textual information is a
keyword-based document representation. A body of text is analysed into its constituent words,
and various techniques are used to build the core words for a document. Actually, IR is the
automatic retrieval of all relevant documents. The goals of IR are:
1. To find documents that are similar, based on some specification of the user.
2. To find the right index terms in a collection, so that querying will return the appropriate
documents.
Computational Linguistics: Computational linguistics computes statistics over large text
collections in order to discover useful patterns. These patterns are used to inform algorithms for
various sub-problems within natural language processing such as part-of-speech tagging. The aim of
text mining is also similar. However, in computational linguistics, patterns are discovered to aid other
problems within the same domain, whereas text mining is aimed at discovering unknown information
for different applications.