Advanced Applications
Multimedia data mining
Similarity search in multimedia data
Mining association in multimedia data
An introduction to text mining
Natural language processing and information extraction
Web mining
– Web content mining
– Web structure mining
– Web usage mining
Mining Complex Data Types
Mining Sequence Data
A sequence is an ordered list of events.
Sequences may be categorized into three groups, based on the
characteristics of the events they describe:
(1) time-series data,
(2) symbolic sequence data, and
(3) biological sequences.
Time-series data consist of long sequences of numeric data,
recorded at equal time intervals (e.g., per minute, per hour, or
per day).
Time-series data can be generated by many natural and economic
processes such as stock markets, and scientific, medical, or
natural observations.
Symbolic sequence data consist of long sequences of event or
nominal data, which typically are not observed at equal time
intervals.
For many such sequences, gaps (i.e., lapses between recorded
events) do not matter much.
Examples include customer shopping sequences and web click
streams, as well as sequences of events in science and
engineering and in natural and social developments.
Biological sequences include DNA and protein sequences. Such
sequences are typically very long, and carry important,
complicated, but hidden semantic meaning.
Here, gaps are usually important.
Mining Sequence Data: Similarity Search in
Time-Series Data
Unlike normal database queries, which find data that match a
given query exactly, a similarity search finds data sequences
that differ only slightly from the given query sequence.
Many time-series similarity queries require subsequence
matching, that is, finding a set of sequences that contain
subsequences that are similar to a given query sequence.
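A minimal sketch of subsequence matching, assuming Euclidean distance over a sliding window as the similarity measure and a user-chosen threshold. The function names, the `epsilon` parameter, and the sample values are hypothetical and serve only as an illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def best_subsequence_match(series, query):
    """Return (offset, distance) of the window of `series` closest to `query`."""
    m = len(query)
    if m == 0 or m > len(series):
        raise ValueError("query must be non-empty and no longer than the series")
    best_offset, best_dist = 0, float("inf")
    # Slide a window of the query's length over the series.
    for i in range(len(series) - m + 1):
        d = euclidean(series[i:i + m], query)
        if d < best_dist:
            best_offset, best_dist = i, d
    return best_offset, best_dist


def similar_sequences(database, query, epsilon):
    """Names of sequences that contain a subsequence within `epsilon` of `query`."""
    return [
        name
        for name, series in database.items()
        if len(series) >= len(query)
        and best_subsequence_match(series, query)[1] <= epsilon
    ]


# Hypothetical data: only sequence "a" contains a close match to the query.
database = {"a": [1, 2, 3, 10, 11, 12, 3, 2], "b": [0, 0, 0, 0]}
print(similar_sequences(database, [10, 11, 12.5], epsilon=1.0))
```

This brute-force scan compares the query against every window, which is why reduction and indexing techniques matter for long sequences.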
For similarity search, it is often necessary to first perform data or
dimensionality reduction and transformation of time-series data.
Typical dimensionality reduction techniques include
(1) the discrete Fourier transform (DFT),
(2) discrete wavelet transforms (DWT), and
(3) principal component analysis (PCA) based on singular value decomposition (SVD).
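As an illustration of the DFT option, the sketch below keeps only the first k coefficients of an orthonormal DFT as a reduced feature vector. It is an assumption-laden example, not a prescribed method: the function names and sample values are hypothetical, and the naive O(n·k) transform is chosen for readability rather than speed. With the 1/sqrt(n) scaling, the distance between reduced feature vectors never exceeds the true Euclidean distance, so filtering on it cannot miss a true match.

```python
import cmath
import math


def dft_features(signal, k):
    """First k coefficients of the orthonormal DFT of `signal`."""
    n = len(signal)
    if not 0 < k <= n:
        raise ValueError("k must satisfy 0 < k <= len(signal)")
    scale = 1 / math.sqrt(n)  # orthonormal scaling preserves Euclidean distance
    return [
        scale * sum(x * cmath.exp(-2j * cmath.pi * f * t / n)
                    for t, x in enumerate(signal))
        for f in range(k)
    ]


def feature_distance(fa, fb):
    """Euclidean distance between two complex feature vectors."""
    return math.sqrt(sum(abs(a - b) ** 2 for a, b in zip(fa, fb)))


def euclidean(a, b):
    """Euclidean distance in the original time domain."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical series of length 8, reduced to 3 coefficients each.
a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
b = [2.0, 2.0, 4.0, 4.0, 6.0, 6.0, 8.0, 8.0]
reduced = feature_distance(dft_features(a, 3), dft_features(b, 3))
print(reduced <= euclidean(a, b))  # the reduced distance is a lower bound
```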
Mining Sequence Data: Regression and
Trend Analysis in Time-Series Data
Regression analysis of time-series data has been studied
substantially in the fields of statistics and signal analysis.
However, one may often need to go beyond pure regression analysis
and perform trend analysis for many practical applications.
Trend analysis builds an integrated model using the following four major
components or movements to characterize time-series data:
1. Trend or long-term movements: These indicate the general direction in
which a time-series graph is moving over time, for example, using weighted
moving average and the least squares methods to find trend curves such as the
dashed curve indicated in Figure
2. Cyclic movements: These are the long-term oscillations about a trend line
or curve.
3. Seasonal variations: These are nearly identical patterns that a time series
appears to follow during corresponding seasons of successive years such as
holiday shopping seasons. For effective trend analysis, the data often need to be
“deseasonalized” based on a seasonal index computed by autocorrelation.
4. Random movements: These characterize sporadic changes due to chance
events such as labor disputes or announced personnel changes within
companies.
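The trend component in item 1 can be sketched with the two methods named there: a (weighted) moving average and a least squares line. This is a minimal illustration under stated assumptions. The function names, the weights (1, 4, 1), and the sample values are hypothetical, and the time axis is assumed to be the index 0, 1, 2, ... of each observation.

```python
def moving_average(values, window):
    """Simple moving average of order `window`."""
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


def weighted_moving_average(values, weights):
    """Moving average where each window position has its own weight."""
    w, total = len(weights), sum(weights)
    return [
        sum(v * wt for v, wt in zip(values[i:i + w], weights)) / total
        for i in range(len(values) - w + 1)
    ]


def least_squares_line(values):
    """Fit y = slope * t + intercept, with t = 0, 1, ..., n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    t_mean = (n - 1) / 2
    y_mean = sum(values) / n
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(values))
    slope = sxy / sxx
    return slope, y_mean - slope * t_mean


# Hypothetical sample values.
series = [3, 7, 2, 0, 4, 5, 9, 7, 2]
print(moving_average(series, 3))                  # [4.0, 3.0, 2.0, 3.0, 6.0, 7.0, 6.0]
print(weighted_moving_average(series, [1, 4, 1])) # [5.5, 2.5, 1.0, 3.5, 5.5, 8.0, 6.5]
print(least_squares_line([1, 3, 5, 7]))           # slope 2, intercept 1
```

A moving average smooths out short-term fluctuation but loses the first and last points of the series; the least squares line instead summarizes the whole series with two numbers.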
Multimedia Data Mining
“What is a multimedia database?”
A multimedia database system stores and manages a large
collection of multimedia data, such as audio, video, image,
graphics, speech, text, document, and hypertext data, which
contain text, text markups, and linkages.
Multimedia database systems are increasingly common owing to
the popular use of audio-video equipment, digital cameras,
CD-ROMs, and the Internet.
Typical multimedia database systems include NASA’s EOS (Earth
Observing System), various kinds of image and audio-video
databases, and Internet databases.
Similarity Search in Multimedia Data
Text Mining
In reality, a substantial portion of the available information is
stored in text databases (or document databases), which consist
of large collections of documents from various sources, such as
news articles, research papers, books, digital libraries, e-mail
messages, and Web pages.
Nowadays most of the information in government, industry,
business, and other institutions is stored electronically, in the
form of text databases.
Data stored in most text databases are semistructured data in
that they are neither completely unstructured nor completely
structured.
For example, a document may contain a few structured fields,
such as title, authors, publication date, category, and so on, but
also contain some largely unstructured text components, such as
abstract and contents.
Some Text Mining Applications
Text Mining: Classification of News
Text Mining: Sentiment Analysis
Text Mining: Search Log Mining
Text Mining: Search Vs Discovery
Text Mining: Process
Text Mining: Text Preprocessing
Syntactic and Linguistic Text Preprocessing
Stopword Removal
Stemming
Some Basic Stemming Rules
Text Mining: Feature Generation
Bag-of-Words: The Term-Document Matrix
Bag-of-Words: Feature Generation
The TF-IDF Term Weighting Scheme
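A minimal sketch of one common TF-IDF variant, assuming term frequency is the term count divided by document length and inverse document frequency is log(N / df). Other variants (smoothing, log-scaled TF) exist, and this one is not necessarily the formula intended here. The function name and the tokenized sample documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: tf = count / document length, idf = log(N / df).
    """
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical tokenized documents.
docs = [
    ["data", "mining", "text"],
    ["text", "mining"],
    ["web", "mining"],
]
for weights in tf_idf(docs):
    print(weights)
```

In this variant a term that occurs in every document (here "mining") gets weight 0, while a term confined to one document (here "web") gets the highest weight.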
Word Embeddings
Embedding Methods and Pretrained Models
Text Mining: Feature Selection
Filter Tokens by POS Tags
Text Mining: Pattern Discovery
Document Clustering
Jaccard Coefficient
Example: Jaccard Coefficient
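A short sketch of the Jaccard coefficient on two documents treated as sets of tokens: the size of the intersection divided by the size of the union. The sample sentences are hypothetical, and returning 0.0 for two empty documents is an assumed convention (some definitions use 1.0 instead).

```python
def jaccard(tokens_a, tokens_b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two token collections."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    if not union:
        return 0.0  # assumed convention for two empty documents
    return len(set_a & set_b) / len(union)


# Hypothetical documents: 2 shared tokens out of 6 distinct tokens.
doc_a = "text mining finds patterns".split()
doc_b = "web mining finds links".split()
print(jaccard(doc_a, doc_b))
```

Because it works on sets, the coefficient ignores how often a term occurs, which is the main difference from weighted measures such as cosine similarity.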
Cosine Similarity
Example: Cosine Similarity and TF-IDF
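A minimal sketch of cosine similarity between two documents represented as sparse {term: weight} dictionaries, such as TF-IDF vectors. The weights below are made-up illustrative numbers, not computed from any real corpus, and returning 0.0 for an all-zero vector is an assumed convention.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse {term: weight} vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumed convention when a vector has no weight
    return dot / (norm_u * norm_v)


# Hypothetical term weights for three documents.
doc1 = {"data": 0.4, "mining": 0.1, "text": 0.3}
doc2 = {"text": 0.5, "mining": 0.1}
doc3 = {"web": 0.6, "links": 0.2}
print(cosine_similarity(doc1, doc2))  # shares terms with doc1, so above 0
print(cosine_similarity(doc1, doc3))  # no shared terms, so 0.0
```

Cosine similarity depends only on the direction of the vectors, so a long document and a short one with the same term proportions score as identical.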
Document Classification
Example Application: Sentiment Analysis
GoEmotions: A Dataset for Fine-Grained Emotion Classification
Web Mining
Web mining is the use of data mining techniques to extract knowledge
from web data.
Web Mining: Issues