
Advanced-Applications

Data Mining IOE - Chapter 7 Notes

Unit 7

Advanced Applications

Advanced Applications
• Multimedia data mining
• Similarity search in multimedia data
• Mining association in multimedia data
• An introduction to text mining
• Natural language processing and information extraction
• Web mining
– Web content mining
– Web structure mining
– Web usage mining

Mining Complex Data Types

Mining Sequence Data
• A sequence is an ordered list of events.
• Sequences may be categorized into three groups, based on the
characteristics of the events they describe:
(1) time-series data,
(2) symbolic sequence data, and
(3) biological sequences.

Mining Sequence Data
• Time-series data consist of long sequences of numeric data,
recorded at equal time intervals (e.g., per minute, per hour, or
per day).
• Time-series data can be generated by many natural and economic
processes, such as stock markets and scientific, medical, or
natural observations.

Mining Sequence Data
• Symbolic sequence data consist of long sequences of event or
nominal data, which typically are not observed at equal time
intervals.
• For many such sequences, gaps (i.e., lapses between recorded
events) do not matter much.
• Examples include customer shopping sequences and web click
streams, as well as sequences of events in science and
engineering and in natural and social developments.

Mining Sequence Data
• Biological sequences include DNA and protein sequences. Such
sequences are typically very long, and carry important,
complicated, but hidden semantic meaning.
• Here, gaps are usually important.

Mining Sequence Data: Similarity Search in
Time-Series Data
• Unlike normal database queries, which find data that match a
given query exactly, a similarity search finds data sequences
that differ only slightly from the given query sequence.
• Many time-series similarity queries require subsequence
matching, that is, finding a set of sequences that contain
subsequences that are similar to a given query sequence.
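
Subsequence matching can be sketched as a sliding-window scan. The example below is only an illustration: it assumes Euclidean distance as the similarity measure and a brute-force scan, and the function name `best_subsequence_match` and the sample values are hypothetical.

```python
import math

def best_subsequence_match(series, query):
    """Return (start_index, distance) of the window of `series` closest to `query`.

    Brute-force scan: every window of len(query) is compared to the query
    using Euclidean distance (an assumed choice of similarity measure).
    """
    m = len(query)
    if m == 0 or m > len(series):
        raise ValueError("query must be non-empty and no longer than the series")
    best_start, best_dist = 0, math.inf
    for start in range(len(series) - m + 1):
        window = series[start:start + m]
        dist = math.sqrt(sum((w - q) ** 2 for w, q in zip(window, query)))
        if dist < best_dist:
            best_start, best_dist = start, dist
    return best_start, best_dist

# Hypothetical data: the query shape appears, slightly perturbed, at index 2.
series = [1.0, 1.2, 3.1, 5.0, 4.1, 2.0, 1.1, 0.9]
query = [3.0, 5.0, 4.0]
start, dist = best_subsequence_match(series, query)
```

The match is approximate by design: the best window differs slightly from the query, which is exactly what distinguishes a similarity search from an exact-match query.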

Mining Sequence Data: Similarity Search in
Time-Series Data
• For similarity search, it is often necessary to first perform data or
dimensionality reduction and transformation of time-series data.
Typical dimensionality reduction techniques include
(1) the discrete Fourier transform (DFT),
(2) discrete wavelet transforms (DWT), and
(3) singular value decomposition (SVD)-based principal component analysis (PCA).
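
As a minimal sketch of DFT-based reduction, the code below keeps only the first `k` Fourier coefficients of each series and compares series in that reduced space. It assumes a naive O(n²) DFT scaled by 1/√n (so that the full transform preserves Euclidean distance); the names `dft_features` and `euclidean` and the sample series are hypothetical.

```python
import cmath
import math

def dft_features(series, k):
    """First k DFT coefficients, scaled by 1/sqrt(n).

    With this scaling the full transform preserves Euclidean distance
    (Parseval's theorem), so keeping only k coefficients can never
    overestimate the true distance between two series.
    """
    n = len(series)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * f * t / n) for t, x in enumerate(series))
        / math.sqrt(n)
        for f in range(k)
    ]

def euclidean(a, b):
    """Euclidean distance; abs() handles both real and complex entries."""
    return math.sqrt(sum(abs(x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical series of length 8, reduced to 3 coefficients each.
a = [1, 2, 3, 4, 5, 6, 7, 8]
b = [1, 2, 3, 5, 5, 6, 8, 8]
reduced_distance = euclidean(dft_features(a, 3), dft_features(b, 3))
true_distance = euclidean(a, b)
```

Because the reduced distance is a lower bound on the true distance, it can be used to discard clearly dissimilar candidates cheaply before comparing the full sequences.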

Mining Sequence Data: Regression and
Trend Analysis in Time-Series Data
• Regression analysis of time-series data has been studied
substantially in the fields of statistics and signal analysis.
• However, one may often need to go beyond pure regression analysis
and perform trend analysis for many practical applications.

Mining Sequence Data: Regression and
Trend Analysis in Time-Series Data
• Trend analysis builds an integrated model using the following four major
components or movements to characterize time-series data:
1. Trend or long-term movements: These indicate the general direction in
which a time-series graph is moving over time. Trend curves, such as the
dashed curve indicated in the figure, can be found using, for example, the
weighted moving average and least squares methods.
2. Cyclic movements: These are the long-term oscillations about a trend line
or curve.
3. Seasonal variations: These are nearly identical patterns that a time series
appears to follow during corresponding seasons of successive years, such as
holiday shopping seasons. For effective trend analysis, the data often need to be
“deseasonalized” based on a seasonal index computed by autocorrelation.
4. Random movements: These characterize sporadic changes due to chance
events such as labor disputes or announced personnel changes within
companies.
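
The two trend-finding methods named in item 1 can be sketched as follows. This is an illustration under stated assumptions: the weights `[1, 4, 1]`, the sample observations, and the function names are hypothetical, and the least squares fit is restricted to a straight line.

```python
def weighted_moving_average(series, weights):
    """Slide `weights` over `series` and return the weighted mean of each window."""
    k = len(weights)
    total = sum(weights)
    return [
        sum(w * x for w, x in zip(weights, series[i:i + k])) / total
        for i in range(len(series) - k + 1)
    ]

def least_squares_line(series):
    """Fit y = slope * t + intercept to the points (t, series[t]), t = 0, 1, ..."""
    n = len(series)
    if n < 2:
        raise ValueError("need at least two points to fit a line")
    t_mean = (n - 1) / 2
    y_mean = sum(series) / n
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(series))
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    slope = sxy / sxx
    return slope, y_mean - slope * t_mean

# Hypothetical observations; the centre of each window gets four times the weight.
observations = [3, 7, 2, 0, 4, 5, 9, 7, 2]
smoothed = weighted_moving_average(observations, [1, 4, 1])
slope, intercept = least_squares_line(observations)
```

The moving average smooths out short-term fluctuations but loses the first and last points of the series, while the least squares line summarizes the overall direction with just a slope and an intercept.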

Multimedia Data Mining
“What is a multimedia database?”
• A multimedia database system stores and manages a large
collection of multimedia data, such as audio, video, image,
graphics, speech, text, document, and hypertext data, which
contain text, text markups, and linkages.
• Multimedia database systems are increasingly common owing to
the popular use of audio-video equipment, digital cameras,
CD-ROMs, and the Internet.
• Typical multimedia database systems include NASA’s EOS (Earth
Observing System), various kinds of image and audio-video
databases, and Internet databases.

Similarity Search in Multimedia Data

“When searching for similarities in multimedia data, can we
search on either the data description or the data content?”
• The answer is yes.
• For similarity searching in multimedia data, we consider two main
families of multimedia indexing and retrieval systems:
(1) description-based retrieval systems,
– which build indices and perform object retrieval based on
image descriptions, such as keywords, captions, size, and
time of creation;
(2) content-based retrieval systems,
– which support retrieval based on the image content, such as
color histogram, texture, pattern, image topology, and the
shape of objects and their layouts and locations within the
image.

Similarity Search in Multimedia Data

Several approaches have been proposed and studied for similarity-based
retrieval in image databases, based on image signature:
• Color histogram–based signature:
– In this approach, the signature of an image includes color
histograms based on the color composition of an image
regardless of its scale or orientation.
– This method does not contain any information about shape,
image topology, or texture.
– Thus, two images with similar color composition but that
contain very different shapes or textures may be identified as
similar, although they could be completely unrelated
semantically.
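
The weakness described above can be seen in a small sketch. It assumes images are given as flat lists of `(r, g, b)` tuples, that each channel is quantized into 4 bins, and that histograms are compared by histogram intersection; these choices and all names are illustrative assumptions rather than a prescribed method.

```python
from collections import Counter

def color_histogram(pixels, levels=4):
    """Normalized histogram over quantized (r, g, b) bins.

    `levels` is the number of bins per channel (assumed to divide 256).
    Pixel positions are ignored, so layout and shape are not captured.
    """
    step = 256 // levels
    counts = Counter((r // step, g // step, b // step) for r, g, b in pixels)
    total = len(pixels)
    return {bin_: count / total for bin_, count in counts.items()}

def histogram_intersection(h1, h2):
    """Similarity in [0, 1]: sum of per-bin minima of two normalized histograms."""
    return sum(min(value, h2.get(bin_, 0.0)) for bin_, value in h1.items())

RED, BLUE, GREEN = (255, 0, 0), (0, 0, 255), (0, 255, 0)

# Two hypothetical 16-pixel "images": same colors, very different layouts.
stripes = [RED] * 8 + [BLUE] * 8
checker = [RED, BLUE] * 8
all_green = [GREEN] * 16

same_colors = histogram_intersection(color_histogram(stripes), color_histogram(checker))
different_colors = histogram_intersection(color_histogram(stripes), color_histogram(all_green))
```

Here the striped and checkered images receive the maximum similarity score even though their spatial structure is entirely different, because the signature records only how much of each color is present.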

Similarity Search in Multimedia Data

Several approaches have been proposed and studied for similarity-based
retrieval in image databases, based on image signature:
• Multifeature composed signature:
– In this approach, the signature of an image includes a
composition of multiple features: color histogram, shape,
image topology, and texture.
– The extracted image features are stored as metadata, and
images are indexed based on such metadata.

Similarity Search in Multimedia Data

Several approaches have been proposed and studied for similarity-based
retrieval in image databases, based on image signature:
• Wavelet-based signature:
– This approach uses the dominant wavelet coefficients of an
image as its signature.
– Wavelets capture shape, texture, and image topology information
in a single unified framework.
– However, since this method computes a single signature for
an entire image, it may fail to identify images containing similar
objects where the objects differ in location or size.
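
The location sensitivity noted above can be illustrated with a simplified sketch. It assumes a one-dimensional row of pixel intensities as a stand-in for a full image, an unnormalized Haar wavelet, and a signature made of the positions and signs of the `m` largest coefficients; all names and values are hypothetical.

```python
def haar_transform(signal):
    """Full unnormalized 1D Haar decomposition; len(signal) must be a power of two.

    Returns [overall average, coarsest detail, ..., finest details].
    """
    coeffs = []
    current = list(signal)
    while len(current) > 1:
        averages = [(current[i] + current[i + 1]) / 2 for i in range(0, len(current), 2)]
        details = [(current[i] - current[i + 1]) / 2 for i in range(0, len(current), 2)]
        coeffs = details + coeffs  # coarser details go in front of finer ones
        current = averages
    return current + coeffs

def signature(signal, m):
    """Positions and signs of the m largest-magnitude Haar coefficients."""
    coeffs = haar_transform(signal)
    top = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]), reverse=True)[:m]
    return {(i, coeffs[i] > 0) for i in top}

# The same bright "object" (two pixels of intensity 9) at two different locations.
row_a = [0, 0, 9, 9, 0, 0, 0, 0]
row_b = [0, 0, 0, 0, 0, 0, 9, 9]
sig_a = signature(row_a, 3)
sig_b = signature(row_b, 3)
shared = sig_a & sig_b
```

Although both rows contain the same object, moving it changes which coefficients dominate, so the two signatures share only the overall-average term.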

Text Mining
• In reality, a substantial portion of the available information is
stored in text databases (or document databases), which consist
of large collections of documents from various sources, such as
news articles, research papers, books, digital libraries, e-mail
messages, and Web pages.

Text Mining
• Nowadays most of the information in government, industry,
business, and other institutions is stored electronically, in the
form of text databases.
• Data stored in most text databases are semistructured data in
that they are neither completely unstructured nor completely
structured.
• For example, a document may contain a few structured fields,
such as title, authors, publication date, category, and so on, but
also contain some largely unstructured text components, such as
abstract and contents.

Text Mining

Text Mining

Some Text Mining Applications

Text Mining: Classification of News

Text Mining: Sentiment Analysis

Text Mining: Search Log Mining

Text Mining: Search Vs Discovery

Text Mining: Process

Text Mining: Text Preprocessing

Text Mining: Text Preprocessing
Syntactic and Linguistic Text Preprocessing

Text Mining: Text Preprocessing

Stopword Removal

Text Mining: Text Preprocessing
Stemming

Text Mining: Text Preprocessing
Some Basic Stemming Rules

Text Mining: Feature Generation

Text Mining: Feature Generation
Bag-of-Words: The Term-Document Matrix

Text Mining: Feature Generation
Bag-of-Words: Feature Generation

Text Mining: Feature Generation
The TF-IDF Term Weighting Scheme

Text Mining: Feature Generation
Word Embeddings

Text Mining: Feature Generation
Embedding Methods and Pretrained Models

Text Mining: Feature Selection

Text Mining: Feature Selection
Filter Tokens by POS Tags

Text Mining: Pattern Discovery

Text Mining: Pattern Discovery
Document Clustering

Text Mining: Pattern Discovery
Jaccard Coefficient

Text Mining: Pattern Discovery
Example: Jaccard Coefficient
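
As a minimal sketch, the Jaccard coefficient of two documents can be computed as the size of the intersection of their term sets divided by the size of their union. The code assumes lowercase whitespace tokenization, and the two sample documents are hypothetical.

```python
def jaccard(doc_a, doc_b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| over the documents' term sets."""
    a = set(doc_a.lower().split())
    b = set(doc_b.lower().split())
    if not a and not b:
        return 1.0  # convention assumed here: two empty documents are identical
    return len(a & b) / len(a | b)

# Hypothetical documents: 3 shared terms out of 6 distinct terms overall.
doc1 = "data mining finds patterns"
doc2 = "text mining finds patterns in text"
similarity = jaccard(doc1, doc2)
```

Because the documents are treated as sets, repeated words (such as the second "text" in `doc2`) do not affect the score.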

Text Mining: Pattern Discovery
Cosine Similarity

Text Mining: Pattern Discovery
Example: Cosine Similarity and TF-IDF

Text Mining: Pattern Discovery
Example: Cosine Similarity and TF-IDF
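
A minimal sketch of combining the two ideas: each document becomes a sparse TF-IDF vector, and documents are compared by the cosine of the angle between their vectors. The code assumes one common TF-IDF variant (relative term frequency times the natural log of N/df) and whitespace tokenization; the three sample documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per document.

    Weight = (term count / document length) * ln(N / document frequency).
    This is one of several TF-IDF variants, chosen here for simplicity.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append(
            {term: (count / len(tokens)) * math.log(n / df[term]) for term, count in tf.items()}
        )
    return vectors

def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either has zero length."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

docs = [
    "web mining uses web logs",
    "text mining uses documents",
    "stock prices form a time series",
]
vectors = tfidf_vectors(docs)
related = cosine(vectors[0], vectors[1])    # documents share "mining" and "uses"
unrelated = cosine(vectors[0], vectors[2])  # documents share no terms
```

Terms that occur in every document get an IDF of zero under this variant, so they contribute nothing to the similarity, which is the intended effect of the weighting.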

Text Mining: Pattern Discovery
Document Classification

Text Mining: Pattern Discovery
Example Application: Sentiment Analysis

Example Application: Sentiment Analysis

GoEmotions: A Dataset for Fine-Grained Emotion Classification

Web Mining
• Web mining is the use of data mining techniques to extract knowledge
from web data.

• Web data includes:
- web documents
- hyperlinks between documents
- usage logs of web sites

• The WWW is a huge, widely distributed, global information service centre
and therefore constitutes a rich source for data mining.

Web Mining

Web Mining: Issues

• Web data sets can be very large
- Tens to hundreds of terabytes
• Cannot mine on a single server
- Need large farms of servers
• Proper organization of hardware and software is needed to mine
multi-terabyte data sets
• Difficulty in finding relevant information
• Extracting new knowledge from the web

Web Mining: Issues

