Unit-5 Data Mining AIML


Swaroopa Rani B, Asst.Professor

UNIT - V
Advanced Concepts: Basic concepts in Mining data streams–Mining Time–series data––Mining
sequence patterns in Transactional databases– Mining Object– Spatial– Multimedia–Text and
Web data – Spatial Data mining– Multimedia Data mining–Text Mining– Mining the World
Wide Web.

Basic concepts in Mining data streams:

We generate and transmit vast amounts of digital data every second. Data that is generated and
transmitted continuously is called a data stream. Extracting valuable knowledge from such
massive data is a demanding task that takes time, effort, and skill.

Data stream mining techniques are therefore needed to extract insights from the data as it
arrives. This section introduces data streams and the main techniques used to mine them.

What is Data Stream?

A data stream is a continuous, fast-changing, ordered sequence of data items transmitted at very
high speed over some interval of time. Data leaves the sender and becomes available at the
receiver's side almost immediately. Streaming does not mean downloading the data or storing it
on storage devices.


Sources of Data Stream

There are many sources of data streams; a few widely used ones are listed below:

 Internet traffic
 Sensor data
 Real-time ATM transactions
 Live event data
 Call records
 Satellite data
 Audio listening
 Watching videos
 Real-time surveillance systems
 Online transactions

What are Data Streams in Data Mining?

Data stream mining is the extraction of knowledge and valuable insights from a continuous
stream of data using stream processing software. It can be considered a subfield of machine
learning, knowledge extraction, and data mining. In data stream mining, large amounts of data
must be analyzed in real time. The extracted knowledge is represented as models and patterns
of potentially infinite streams of information.


Characteristics of Data Stream in Data Mining

A data stream has the following characteristics:

 Continuous Stream of Data: The data stream is an unbounded, continuous flow that results
in big data. Multiple data streams may arrive simultaneously.
 Time Sensitive: Elements of a data stream carry timestamps, and each element is relevant
only for a certain period. After that time, it loses its significance.
 Data Volatility: Raw stream data is volatile and is not stored. Once mining and analysis
are done, the information is summarized or discarded.
 Concept Drift: Data streams are unpredictable. The distribution of the data changes or
evolves over time.

Data streams are produced by various data stream generators. Data mining techniques are then
applied to extract knowledge and patterns from the streams. These techniques therefore need to
handle multi-dimensional, multi-level data online and in a single pass.
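To make the single-pass, summarize-then-discard idea concrete, here is a minimal Python sketch. It is an illustration under assumptions, not part of the original notes: the class name RunningStats, the generator sensor_stream and the sample readings are all hypothetical. Each value is seen once, folded into a small summary (count, mean, variance, using Welford's update), and then dropped.

```python
class RunningStats:
    """Single-pass summary of a numeric stream (Welford's update)."""

    def __init__(self):
        self.n = 0        # number of items seen so far
        self.mean = 0.0   # running mean
        self._m2 = 0.0    # running sum of squared deviations from the mean

    def update(self, x):
        # Fold one new value into the summary; the raw value is not stored.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        # Population variance of everything seen so far.
        return self._m2 / self.n if self.n else 0.0


def sensor_stream():
    # Hypothetical stand-in for a real, unbounded stream source.
    for reading in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
        yield reading


stats = RunningStats()
for value in sensor_stream():
    stats.update(value)   # one pass, constant memory

print(stats.n, stats.mean, stats.variance)
```

The memory used does not grow with the length of the stream, which is the property that stream mining algorithms generally need.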

Data Streams in Data Mining Techniques


Data stream mining techniques extract patterns and insights from a data stream. A wide range
of algorithms is available for stream mining; they fall into four main families of techniques.

1. Classification

Classification is a supervised learning technique. A classifier model is built from training
data (past data with output labels). The model is then used to predict the label of unlabeled
instances that arrive continuously through the data stream. Predictions are made for new items
that the model has never seen, while instances whose labels are already known are used to train
the model.

Generally speaking, a stream mining classifier is ready to do either of the following tasks at
any moment:

 Receive an unlabeled item and predict its label based on the current model.
 Receive labels for past items and use them to train the model.
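The two tasks above are often interleaved: each arriving item is first used for prediction and then, once its label is known, for training. The following Python sketch illustrates that loop under stated assumptions. The learner is a deliberately trivial majority-class model, and the names MajorityClassLearner, prequential_accuracy and toy_stream are hypothetical, chosen only for this example.

```python
from collections import Counter


class MajorityClassLearner:
    """Trivial stream classifier: always predicts the most frequent label seen so far."""

    def __init__(self):
        self.counts = Counter()

    def predict(self, x):
        if not self.counts:
            return None                      # nothing learned yet
        return self.counts.most_common(1)[0][0]

    def learn(self, x, y):
        self.counts[y] += 1                  # incremental update, one item at a time


def prequential_accuracy(stream, model):
    """Predict each item first, then train on it once its label is available."""
    correct = total = 0
    for x, y in stream:
        if model.predict(x) == y:
            correct += 1
        total += 1
        model.learn(x, y)
    return correct / total if total else 0.0


# Hypothetical labeled stream of (features, label) pairs.
toy_stream = [
    ({"amount": 10}, "ok"),
    ({"amount": 12}, "ok"),
    ({"amount": 900}, "fraud"),
    ({"amount": 11}, "ok"),
]
accuracy = prequential_accuracy(toy_stream, MajorityClassLearner())
print(accuracy)  # 0.5
```

Any model that exposes predict and learn in this way could be plugged into the same loop.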

Best Known Classification Algorithms

Let’s discuss the best-known classification algorithms for predicting the labels for data streams.


Lazy Classifier or k-Nearest Neighbor

The k-Nearest Neighbor or k-NN classifier predicts the new items’ class labels based on the class
label of the closest instances. In particular, the lazy classifier outputs the majority class label of
the k instances closest to the one to predict.
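As an illustration, the majority vote among the k closest instances can be sketched in Python as follows. Because a stream is unbounded, this sketch keeps only the most recent labeled instances in a fixed-size window; that window, the class name WindowKNN and the Euclidean distance are assumptions made for the example, not details given in these notes.

```python
import math
from collections import Counter, deque


class WindowKNN:
    """k-NN over a sliding window of the most recent labeled instances."""

    def __init__(self, k=3, window_size=100):
        self.k = k
        self.window = deque(maxlen=window_size)   # oldest items drop out automatically

    def learn(self, x, y):
        self.window.append((tuple(x), y))

    def predict(self, x):
        if not self.window:
            return None
        # Sort stored instances by Euclidean distance to x and keep the k closest.
        nearest = sorted(self.window, key=lambda item: math.dist(item[0], x))[: self.k]
        votes = Counter(label for _, label in nearest)
        return votes.most_common(1)[0][0]          # majority class label


knn = WindowKNN(k=3, window_size=100)
for point, label in [((0, 0), "a"), ((0, 1), "a"), ((5, 5), "b"), ((6, 5), "b"), ((5, 6), "b")]:
    knn.learn(point, label)

print(knn.predict((0.2, 0.1)))   # a
print(knn.predict((5.5, 5.5)))   # b
```

The classifier is called lazy because learn only stores instances; all the work happens at prediction time.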

Naive Bayes

Naive Bayes is a classifier based on Bayes’ theorem. It is a probabilistic model called ‘naive’
because it assumes conditional independence between input features. The basic idea is to
compute a probability for each one of the class labels based on the attribute values and select the
class with the highest probability as the label for the new item.
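Naive Bayes suits streams because the model is just a set of counts that can be updated one item at a time. The sketch below is a minimal illustration for categorical attributes. The class name StreamingNaiveBayes, the add-one (Laplace) smoothing and the weather-style toy data are assumptions introduced for the example.

```python
import math
from collections import defaultdict


class StreamingNaiveBayes:
    """Incremental Naive Bayes for categorical attributes (counts only)."""

    def __init__(self):
        self.class_counts = defaultdict(int)     # label -> number of items
        self.feature_counts = defaultdict(int)   # (label, index, value) -> count
        self.values = defaultdict(set)           # index -> distinct values seen

    def learn(self, x, y):
        self.class_counts[y] += 1
        for i, v in enumerate(x):
            self.feature_counts[(y, i, v)] += 1
            self.values[i].add(v)

    def predict(self, x):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_label in self.class_counts.items():
            # log P(label) + sum of log P(value | label), assuming independence
            score = math.log(n_label / total)
            for i, v in enumerate(x):
                count = self.feature_counts.get((label, i, v), 0)
                n_values = len(self.values[i] | {v})
                score += math.log((count + 1) / (n_label + n_values))  # add-one smoothing
            if score > best_score:
                best_label, best_score = label, score
        return best_label


nb = StreamingNaiveBayes()
for x, y in [
    (("sunny", "hot"), "no"),
    (("sunny", "mild"), "no"),
    (("rainy", "mild"), "yes"),
    (("rainy", "cool"), "yes"),
    (("overcast", "hot"), "yes"),
]:
    nb.learn(x, y)

print(nb.predict(("sunny", "hot")))    # no
print(nb.predict(("rainy", "cool")))   # yes
```

Working with sums of logarithms rather than products of probabilities avoids numerical underflow when there are many attributes.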


Decision Trees

As the name signifies, a decision tree classifier builds a tree structure from training data and
then uses it to predict the class labels of unseen data items. Decision trees are popular because
their predictions are easy to understand. In data stream mining, the Hoeffding tree is the
state-of-the-art decision tree classifier, and the Hoeffding adaptive tree is a more advanced
variant.
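The Hoeffding tree takes its name from the Hoeffding bound, which it uses to decide when enough stream items have been seen at a leaf to choose a split attribute. For a quantity with range R observed n times, the bound with confidence 1 - delta is epsilon = sqrt(R^2 ln(1/delta) / (2n)). The Python sketch below shows only this split test, not a full tree. The function names and the default delta are assumptions for illustration, and tie-breaking rules used by practical implementations are left out.

```python
import math


def hoeffding_bound(value_range, delta, n):
    """epsilon such that the true mean is within epsilon of the observed mean
    with probability 1 - delta, after n observations."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_gain, second_gain, n, n_classes=2, delta=1e-7):
    """Split when the best attribute beats the runner-up by more than epsilon."""
    value_range = math.log2(n_classes)   # range of information gain for n_classes labels
    return (best_gain - second_gain) > hoeffding_bound(value_range, delta, n)


print(round(hoeffding_bound(1.0, 1e-7, 200), 4))          # about 0.2007
print(should_split(best_gain=0.30, second_gain=0.05, n=200))
print(should_split(best_gain=0.30, second_gain=0.25, n=200))
```

Because epsilon shrinks as n grows, a leaf that cannot decide yet simply waits for more items instead of storing the data.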

Logistic Regression

Despite its name, Logistic Regression is a classification algorithm, not a regression algorithm.
It is used to estimate discrete, typically binary, values such as 0/1, yes/no, or true/false. It
predicts the probability that an event occurs by fitting the known instances of the data stream
to a logistic function.
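On a stream, a logistic model can be fitted one instance at a time with stochastic gradient descent. The following Python sketch is one possible illustration, assuming numeric features and 0/1 labels. The class name OnlineLogisticRegression, the learning rate and the toy data are hypothetical.

```python
import math


class OnlineLogisticRegression:
    """Binary logistic regression updated one instance at a time (SGD)."""

    def __init__(self, n_features, learning_rate=0.1):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict_proba(self, x):
        # Logistic (sigmoid) function applied to the linear score.
        z = self.bias + sum(w * xi for w, xi in zip(self.weights, x))
        return 1.0 / (1.0 + math.exp(-z))

    def learn(self, x, y):
        # Gradient of the log loss for one example is (p - y) * x.
        error = self.predict_proba(x) - y
        self.weights = [w - self.learning_rate * error * xi
                        for w, xi in zip(self.weights, x)]
        self.bias -= self.learning_rate * error


model = OnlineLogisticRegression(n_features=1)
for _ in range(50):                                   # replay a tiny stream
    for x, y in [([-2.0], 0), ([-1.0], 0), ([1.0], 1), ([2.0], 1)]:
        model.learn(x, y)

print(model.predict_proba([2.0]) > 0.5)    # True
print(model.predict_proba([-2.0]) < 0.5)   # True
```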

Ensembles

Ensembles combine several classifiers and often predict better than any individual classifier.
The data is divided into subsets, and the different subsets are fed to the different classifiers
of the ensemble model. Bagging and boosting are two types of ensemble models. The ADWIN
bagging method is widely used in data stream mining.

2. Regression

Regression is also a supervised learning technique. It predicts real-valued labels for the stream
instances rather than the discrete values predicted by classification. Otherwise the idea is the
same as in classification: either predict the real-valued label of unknown items using the
regressor model, or train and adjust the model using known, labeled data.

Best Known Regression Algorithms

Most regression algorithms are counterparts of the classification algorithms above. Below are the
best-known regression algorithms for predicting the labels for data streams.

 Lazy learner or k-Nearest Neighbor
 Naive Bayes
 Decision Trees
 Linear Regression
 Ensembles

3. Clustering

Clustering is an unsupervised learning technique. It is useful when we have unlabeled instances
and want to find homogeneous groups in them based on the similarities of the data items. The
groups are not known before the clustering process. With continuous data streams, clusters are
formed from the data seen so far, and new items keep being added to the different groups.


Best Known Clustering Algorithms

Let’s discuss the best-known clustering algorithms for group segmentation of data streams.

K-means Clustering

The k-means clustering method is the most used and straightforward method for clustering. It
starts by randomly selecting k centroids. After that, repeat two steps until the stopping criteria
are met: first, assign each instance to the nearest centroid, and second, recompute the cluster
centroids by taking the mean of all the items in that cluster.
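The two repeated steps can be sketched in a few lines of Python. This is a small batch version for illustration only. The function name k_means, the fixed random seed, the stopping rule (assignments no longer change, or max_iter is reached) and the toy points are assumptions, and stream variants of k-means differ in how they update the centroids.

```python
import math
import random


def k_means(points, k, max_iter=100, seed=0):
    """Plain k-means: returns (centroids, assignment of each point to a centroid index)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # start from k randomly chosen points
    assignment = None
    for _ in range(max_iter):
        # Step 1: assign each instance to the nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:       # stopping criterion: nothing changed
            break
        assignment = new_assignment
        # Step 2: recompute each centroid as the mean of its cluster members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(coord) / len(members) for coord in zip(*members))
    return centroids, assignment


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, assignment = k_means(points, k=2)
print(assignment)
print(sorted(centroids))
```

On this toy data the first three points end up in one cluster and the last three in the other; which cluster gets index 0 depends on the random start.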

Hierarchical Clustering

In hierarchical clustering, a hierarchy of clusters is built and represented as a dendrogram. For
example, PERCH is a hierarchical algorithm used for clustering online data streams.

Density-based Clustering

DBSCAN is a widely used density-based clustering algorithm. It groups points that lie in dense
regions and treats points in sparse regions as noise, which resembles the way humans naturally
pick out clusters.

4. Frequent Pattern Mining

Frequent pattern mining is an essential task in unsupervised learning. It is used to describe the
data and to find association rules or discriminative features that further help classification and
clustering tasks. It is based on two concepts:

 Frequent Item Set: A collection of items that frequently occur together.
 Association Rule: An indicator of a strong relationship between items or itemsets.

Best Known Frequent Pattern Mining Algorithms

Below are the best-known frequent pattern mining algorithms for finding frequent itemsets in
data.

 Apriori
 Eclat
 FP-growth
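To illustrate frequent itemsets and association rules, here is a small level-wise search in the spirit of Apriori, written in Python. It is a sketch under assumptions rather than a faithful or optimized implementation: the function names, the market-basket transactions and the support threshold are hypothetical, and support is counted as an absolute number of transactions.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support count} for all itemsets with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items if support(frozenset([i])) >= min_support}
    result = {}
    size = 1
    while current:
        for itemset in current:
            result[itemset] = support(itemset)
        size += 1
        # Join frequent itemsets of the previous level into larger candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == size}
        # Prune: every (size - 1)-subset of a frequent itemset must itself be frequent.
        current = {
            c for c in candidates
            if all(frozenset(s) in result for s in combinations(c, size - 1))
            and support(c) >= min_support
        }
    return result


def confidence(itemsets, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent, from the support counts."""
    a = frozenset(antecedent)
    return itemsets[a | frozenset(consequent)] / itemsets[a]


baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
found = frequent_itemsets(baskets, min_support=3)
print(len(found))                                   # 8
print(confidence(found, {"diapers"}, {"beer"}))     # 0.75
```

The pruning step relies on the fact that an itemset cannot be frequent unless all of its subsets are frequent, which keeps the number of candidates manageable.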


Mining Time–series data:


A time series is a collection of values obtained from measurements taken in order over time. It
is an ordered sequence of data points, usually recorded at uniform time intervals. Time series
data mining builds on our natural ability to visualize the shape of data.

Time Series Analysis comprises methods for analyzing time-series data in order to extract
meaningful statistics, rules and patterns. These rules and patterns might be used to build
forecasting models that are able to predict future developments.

Does the database play a vital role in Time Series mining?

A database is a collection of data gathered from different sources, in which the data are stored
in structured or unstructured form.
A time series database consists of sequences of values or events that change with time. Data are
recorded at regular intervals.

Applications of Time Series Mining:

1. Financial:
1.1 Stock price evaluation
1.2 Measurement of inflation

2. Industry:
2.1 Determining power consumption

3. Scientific:
3.1 Analysis of experiment results

4. Meteorological:
4.1 Study of the processes and phenomena of the atmosphere, mainly for weather forecasting

Components of a time series:

1. Trend
2. Cycle
3. Seasonal
4. Irregular

Categories of Time-Series Movements:

1. Long-term or trend movements:
The general direction in which a time series moves over a long interval of time. It shows the
general tendency of the data to increase or decrease over a long period of time.


2. Cyclic movements or cycle variations:
Long-term oscillations about a trend line or curve, for example business cycles. This oscillatory
movement has a period of oscillation of more than a year.

3. Seasonal movements or seasonal variations:
Almost identical patterns that a time series appears to follow during corresponding months of
successive years. This variation will be present in a time series if the data are recorded hourly,
daily, weekly or monthly.

4. Irregular or random movements:
These fluctuations are unforeseen, uncontrollable and unpredictable. They are not regular
variations and are purely random or irregular.
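One common way to expose the trend movement is to smooth the series with a moving average. The Python sketch below is a minimal illustration on synthetic data. The function name moving_average, the seasonal pattern, and the choice of a window equal to the seasonal period are assumptions made for the example.

```python
def moving_average(series, window):
    """Moving average over consecutive windows; returns len(series) - window + 1 values."""
    return [
        sum(series[i:i + window]) / window
        for i in range(len(series) - window + 1)
    ]


# Synthetic series: a linear trend plus a seasonal pattern that repeats every 4 steps.
seasonal = [2, -1, -2, 1]                          # sums to zero over one period
series = [t + seasonal[t % 4] for t in range(12)]

# Averaging over one full period cancels the seasonal movement and leaves the trend.
trend = moving_average(series, window=4)
print(series)
print(trend)   # rises steadily by 1.0 per step
```

Subtracting the estimated trend from the original values would leave the seasonal and irregular movements for separate analysis.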

Components for Time Series Analysis

Mining sequence patterns in Transactional databases:

 Sequential pattern mining tries to find relationships between occurrences of sequential
events, that is, to find whether the occurrences follow any specific order.
 We can find sequential patterns for specific individual items, and we can also find
sequential patterns across different items.

Application:

 Sequential pattern mining is widely used in the analysis of DNA sequences.
 Sequential pattern mining is also used in many other areas, such as mining user access
patterns for web sites and using the history of symptoms to predict certain kinds of
disease. Retailers can also use it to make inventory control more efficient.


Challenges in Sequential pattern mining:

 A huge number of possible sequential patterns are hidden in databases.
 A mining algorithm should find the complete set of patterns and be highly efficient and
scalable.

Example:

 Consider the database shown in Fig 1, which has been sorted on customer id and
transaction time. Fig 2 shows this database expressed as a set of customer sequences.
 With minimum support set to 25%, i.e. a minimum support of 2 customers, two
sequences, <(30) (90)> and <(30) (40 70)>, are maximal among those satisfying the
support constraint and are the desired sequential patterns.
 The sequential pattern <(30) (90)> is supported by customers 1 and 4. Customer 4 buys
items (40 70) in between items 30 and 90, but supports the pattern <(30) (90)> since we
are looking for patterns that are not necessarily contiguous. The sequential pattern
<(30) (40 70)> is supported by customers 2 and 4. Customer 2 buys 60 along with 40 and
70, but supports this pattern since (40 70) is a subset of (40 60 70).
 An example of a sequence that does not have minimum support is the sequence
<(10 20) (30)>, which is supported only by customer 2. The sequences
<(30)>, <(40)>, <(70)>, <(90)>, <(30) (40)>, <(30) (70)> and <(40 70)>, though having
minimum support, are not in the answer because they are not maximal.

Customer Id    Transaction Time    Items Bought
1              June 25 '93         30
1              June 30 '93         90
2              June 10 '93         10, 20
2              June 15 '93         30
2              June 20 '93         40, 60, 70
3              June 25 '93         30, 50, 70
4              June 25 '93         30
4              June 30 '93         40, 70
4              July 25 '93         90
5              June 12 '93         90

Fig 1: Database Sorted by Customer Id and Transaction Time

Customer Id    Customer Sequence
1              <(30) (90)>
2              <(10 20) (30) (40 60 70)>
3              <(30 50 70)>
4              <(30) (40 70) (90)>
5              <(90)>

Fig 2: Customer Sequence Version of the Database


Sequential Patterns with support > 25%
<(30) (90)>
<(30) (40 70)>

Fig 3: The answer set
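The support counts in this example can be checked with a short Python sketch. It encodes the customer sequences of Fig 2 as lists of item sets and tests whether a pattern is contained in a sequence, in order but not necessarily contiguously. The function names contains and supporters are hypothetical, and the sketch only counts support for given patterns; it does not search for patterns the way GSP or PrefixSpan do.

```python
def contains(customer_sequence, pattern):
    """True if every itemset of pattern is a subset of some transaction,
    with the matching transactions occurring in the same order."""
    position = 0
    for itemset in pattern:
        # Skip forward to the first remaining transaction that includes this itemset.
        while position < len(customer_sequence) and not itemset <= customer_sequence[position]:
            position += 1
        if position == len(customer_sequence):
            return False
        position += 1          # the next itemset must match a later transaction
    return True


def supporters(database, pattern):
    """Customer ids whose sequence contains the pattern."""
    return [cid for cid, seq in database.items() if contains(seq, pattern)]


# Customer sequences from Fig 2.
database = {
    1: [{30}, {90}],
    2: [{10, 20}, {30}, {40, 60, 70}],
    3: [{30, 50, 70}],
    4: [{30}, {40, 70}, {90}],
    5: [{90}],
}

print(supporters(database, [{30}, {90}]))        # [1, 4]
print(supporters(database, [{30}, {40, 70}]))    # [2, 4]
print(supporters(database, [{10, 20}, {30}]))    # [2]
```

With five customers and a minimum support of 2 customers, the first two patterns qualify and the third does not, which agrees with the answer set in Fig 3.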

Scalable Methods for Mining Sequential Patterns:

 Apriori-based Approaches:

1. GSP (Generalized Sequential Patterns): a sequential pattern mining algorithm based on
candidate generation and testing. It integrates time constraints, relaxes the definition of a
transaction, and also considers knowledge of taxonomies.
2. SPADE (Sequential PAttern Discovery using Equivalence classes): an algorithm that finds
frequent sequences using efficient lattice search techniques and simple joins.

 Pattern-Growth-based Approaches:

1. PrefixSpan (Mining Sequential Patterns by Prefix Projections): it mainly uses database
projection to make the database for the next pass much smaller, and consequently makes the
algorithm faster.

Mining Object:
A data mining object is only an empty container until it has been processed. Processing a data
mining model is also called training.

Processing mining structures: A mining structure gets data from an external data source, as
defined by the column bindings and usage metadata, and reads the data. This data is read in full
and then analyzed to extract various statistics. Analysis Services stores a compact representation
of the data, which is suitable for analysis by data mining algorithms, in a local cache. You can
either keep this cache or delete it after your models have been processed. By default, the cache is
stored. For more information, see Process a Mining Structure.

Processing mining models: A mining model is empty, containing definitions only, until it is
processed. To process a mining model, the mining structure that it is based on must have been
processed. The mining model gets the data from the mining structure cache, applies any filters
that may have been created on the model, and then passes the data set through the algorithm to
detect patterns. After the model is processed, the model stores only the results of processing, not
the data itself. For more information, see Process a Mining Model.

The following diagram illustrates the flow of data when a mining structure is processed, and
when a mining model is processed.


Viewing the Results of Processing


After a mining structure has been processed, it contains a compact representation of the data for
use in statistical analysis. If the cache has not been cleared, you can access the data in this cache
in the following ways:

 Creating a Data Mining Extensions (DMX) query on the model and drilling through to
the structure. For more information, see SELECT FROM <model>.CASES (DMX).
 Browsing a model based on the structure, and using one of the options in the user
interface to drill through to structure cases. For more information, see Data Mining
Model Viewers, or Drill Through to Case Data from a Mining Model.
 Creating a DMX query on the structure cases. For more information, see SELECT
FROM <structure>.CASES.

After a mining model has been processed, it contains only the patterns that were derived from
analysis, and mappings from the model results to the cached training data. You can browse or
query the model results, called model content, or you can query the model and structure cases, if
they have been cached.

The model content for each mining model depends on the algorithm that was used to create it.
For example, if one model is a clustering model and another is a decision trees model, the model
content is very different even though the models use exactly the same data. For more
information, see Mining Model Content (Analysis Services - Data Mining).


Processing Requirements
Processing requirements may differ depending on whether your mining models are based solely
on relational data or on a multidimensional data source.

For a relational data source, processing requires only that you create training data and run mining
algorithms on that data. However, mining models that are based on OLAP objects, such as
dimensions and measures, require that the underlying data be in a processed state. This may
require that the multidimensional objects be processed to populate the mining model.

Spatial Data Mining:


Spatial data mining is the application of data mining to spatial models. In spatial data mining,
analysts use geographical or spatial information to produce business intelligence or other results.
This requires specific techniques and resources to get the geographical data into relevant and
useful formats.

Challenges involved in spatial data mining include identifying patterns or finding objects that are
relevant to the questions that drive the research project. Analysts may be looking in a large database
field or other extremely large data set in order to find just the relevant data, using GIS/GPS tools or
similar systems.

One interesting thing about the term "spatial data mining" is that it is generally used to talk about
finding useful and non-trivial patterns in data. In other words, just setting up a visual map of geographic
data may not be considered spatial data mining by experts. The core goal of a spatial data mining project
is to distinguish the information in order to build real, actionable patterns to present, excluding things
like statistical coincidence, randomized spatial modeling or irrelevant results. One way analysts may do
this is by combing through data looking for "same-object" or "object-equivalent" models to provide
accurate comparisons of different geographic locations.

Mining Text and Web data:

Data Mining is a popular subject among Customer-focused companies. Many companies rely on
Data to target customers based on their personal preferences to maximize profits. Data Mining is
a broader term that means mining the Data and extracting the information, which can help while
making decisions, marketing strategies, building new customer relationships, and much more.

What is Data Mining?

Data Mining is a process of finding patterns and extracting useful information from the pool of
large data sets by transforming the data with a bunch of business rules. With the help of Data
Mining procedures, Raw datasets are converted into valuable datasets, which developers can
further use to analyze and determine the patterns.


Data Mining is an effective procedure for any organization as it helps improve the marketing
strategies and helps them target the customer base based on the data. With the help of structured
data, it also allows you to study different aspects of data and then get more innovative ideas to
increase productivity and sales.

The Data Mining process breaks down into the following steps –

1. Collect, Extract, Transform and Load the data into the data warehouse
2. Store and manage the data in the database or on the cloud.
3. Provide access to data to the business analyst, management teams, and Information
Technology professionals.

What is Text Mining?

Text Mining is a subset of Data Mining, and it involves the processing of data from various text
documents. It is the process of transforming unstructured text into a structured format and
interpreting these data to identify patterns. In Text Mining, various deep learning algorithms are
used to evaluate the text and generate useful information effectively.

The basic idea behind Text Mining is to find patterns in large datasets that can be used for
various purposes. Text Mining requires both sophisticated linguistic and statistical techniques to
analyze unstructured text data and provide valuable insights. Text mining consists of a
wide variety of methods and technologies such as:

 Keyword-based Technologies: Keyword-based technologies depend on selecting the
keywords that the input data contains; the input is then filtered as a series of character strings.
 Statistics Technologies: Statistical technology refers to systems that are based entirely
on machine learning. Such a system uses a set of texts to build a model and, in turn, uses the
same model to manage and categorize text.
 Linguistic-based Technologies: A linguistic-based system uses Natural Language
Processing (NLP). The NLP models read the input text and analyze its structure,
grammar, logic, and context.
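As a small illustration of the keyword-based approach, the Python sketch below turns unstructured text into tokens and counts how often chosen keywords occur. The tokenizer, the tiny stop-word list, the function names and the sample sentence are all assumptions for this example; real systems use far richer linguistic processing.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list.
STOP_WORDS = {"the", "a", "an", "is", "and", "of", "to", "in"}


def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def keyword_counts(text, keywords):
    """Structured output: how many times each keyword occurs in the text."""
    tokens = [t for t in tokenize(text) if t not in STOP_WORDS]
    counts = Counter(tokens)
    return {k: counts[k] for k in keywords}


review = "The delivery was late. Late delivery again, and the refund is pending."
result = keyword_counts(review, ["delivery", "late", "refund", "price"])
print(result)   # {'delivery': 2, 'late': 2, 'refund': 1, 'price': 0}
```

The dictionary of counts is a simple structured representation of the unstructured input, which later steps such as categorization can work with.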

What is Web Mining?

Web Mining is the process of extracting useful information that is readily available on the
Internet (or World Wide Web). Web Mining is a subset of Data Mining. It helps to analyze user
activities on different web pages and track them over a period of time to understand customers’
behavior and surfing patterns. Web Mining is broadly categorized into three main subcategories.

There are three main types of Web Data, as shown in the above image. Let’s discuss in brief
these Web Data types.


1. Web Content Data: The most common forms of Web Content data are HTML web
pages, images, etc. All these data types together constitute Web Content data. The main
format for Internet/Web content is HTML. Rendering differs slightly from browser to
browser, but the basic layout structure is the same everywhere.
2. Web Structure Data: On a typical web page, the contents are arranged within HTML
tags. The pages are hyperlinked, allowing users to navigate back and forth to find
relevant information. The relationships/links that describe the connections between
web pages are the web structure data.
3. Web Usage Data: Web usage data is generated mainly by the Web Server and the
Application Server. The Web/Application server collects log data, including
information about the users such as their geographical location, the time, and the content
they interacted with. The data in these log files are categorized into three types based on
the source they come from:

 Server-side
 Client-side
 Proxy-side
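A first step in mining server-side usage data is to parse the raw log lines into fields. The Python sketch below assumes that the server writes the widely used Common Log Format (host, identity, user, timestamp, request line, status, size); that format, the sample lines, the addresses and the function name page_hits are assumptions for illustration, since these notes do not specify a log layout.

```python
import re
from collections import Counter

# Assumed layout: host ident user [time] "METHOD path protocol" status size
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)


def page_hits(lines):
    """Count successful (status 200) requests per page; skip lines that do not parse."""
    hits = Counter()
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match and match.group("status") == "200":
            hits[match.group("path")] += 1
    return hits


sample_log = [
    '203.0.113.5 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326',
    '203.0.113.5 - - [10/Oct/2023:13:56:01 +0000] "GET /products.html HTTP/1.1" 200 512',
    '198.51.100.7 - - [10/Oct/2023:13:57:12 +0000] "GET /index.html HTTP/1.1" 200 2326',
    '198.51.100.7 - - [10/Oct/2023:13:57:40 +0000] "GET /missing.html HTTP/1.1" 404 0',
    'not a log line',
]
hits = page_hits(sample_log)
print(hits.most_common())   # [('/index.html', 2), ('/products.html', 1)]
```

Once the lines are parsed, the per-user sequences of pages can be fed to the sequential pattern or clustering techniques described earlier.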

Difference Between Data Mining vs Text Mining vs Web Mining

Now that you have a brief understanding of Data Mining, Text Mining, and Web Mining, this
section covers the differences between them in more detail. The key differences are discussed
under the following headings:

 Data Mining vs Text Mining vs Web Mining: Generic
 Data Mining vs Text Mining vs Web Mining: Process
 Data Mining vs Text Mining vs Web Mining: Use Case

Data Mining vs Text Mining vs Web Mining: Generic

The data mining process extracts, transforms, and loads the data into the data warehouse.
Business users use data mining tools to present the analyzed data in a presentable form such as
tables, graphs, or charts. Data points such as currencies, dates, and names are easy to link and do
not require an understanding of their context.

On the other hand, Text mining processes texts in the form of text documents,
emails, social media posts, etc. Text mining also faces significant challenges with linguistic texts
and SMS language.

Web Data Mining is a technique that extracts data from the Web. It can use data from Web
servers or from web page scraping. Web Mining has to deal with many log files to extract relevant
information.


Data Mining vs Text Mining vs Web Mining: Process

Data mining mainly focuses on data-dependent activities such as accounting, purchasing, CRM,
etc. The data is easily accessible and homogeneous. Once the algorithm is determined, it is
easy to process the data and extract the relevant information.

On the other hand, Text mining is a complex process that takes a long time to deploy. Text
mining includes several steps such as language guessing, tokenization, text segmentation, etc.

In Web Data mining, all the data is based on the logs collected from the Web Servers.
Analyzing these logs is a complex process because logs generally contain too much information,
and hence several business rules must be determined before extracting data from the
weblogs.

Data Mining vs Text Mining vs Web Mining: Use Case

Data mining is a robust industrial technology that has been used for decades.

Text Mining, on the other hand, has been a complex, domain-specific, and language-specific
tool, and hence it was never valued as a ‘must-have.’

Web Mining is a relatively new process that came into existence after the origin of the World
Wide Web. Web Mining is considered a critical aspect of mining for understanding
user behavior over the internet.

With the advent of digitalization, the rise of social networks, and increased connectivity,
companies are now more concerned about their online reputation. They are looking for ways to
increase customer loyalty in a world of increasing choices.

Data Mining vs Text Mining vs Web Mining: Comparison Table

Base for Comparison: Concept
 Data Mining: Data mining is the statistical technique of processing raw data into a
structured form.
 Text Mining: Text mining is the subset of Data Mining that involves processing unstructured
text documents into a structured format.
 Web Mining: Web mining is a subset of Data Mining that involves processing the data related
to the Web. It can be Web Logs, Web Structure data, or Web Content data.

Base for Comparison: Data Retrieval
 Data Mining: Data is mined and then stored in the data warehouse. The data stored in
databases and spreadsheets are used to gather information and perform analysis.
 Text Mining: Text data are stored in text documents, emails, and logs and then processed to
gather high-quality information.
 Web Mining: Web data can be in the form of structure, content, and usage data and is later
converted into useful information.

Base for Comparison: Types of Data
 Data Mining: Knowledge is discovered from structured data, which is homogeneous and easy
to access.
 Text Mining: Text Mining involves data from text documents, emails, logs, PDFs, etc.
 Web Mining: Web mining mainly deals with three types of data, i.e., Web Structure Data,
Web Content Data, and Web Usage Data.

Base for Comparison: Application
 Data Mining: Data Mining is used in fields like medicine, marketing, healthcare, etc.
 Text Mining: Text Mining is used in fields like customer profile analysis, bioscience, etc.
 Web Mining: Web Mining is used to extract information from the web, analyze weblogs, etc.

Base for Comparison: Data Format
 Data Mining: In Data Mining, the data is stored in a structured format.
 Text Mining: In Text Mining, the data is stored in an unstructured format.
 Web Mining: In Web Mining, the data is structured as well as unstructured. The data format
depends upon the type of mining method.

Base for Comparison: Skills Required
 Data Mining: To retrieve meaningful data with Data Mining, one must be aware of data
cleansing techniques, machine learning algorithms, statistics, and probability.
 Text Mining: Text mining requires pattern recognition techniques and Natural Language
Processing to enrich the meaning of the text.
 Web Mining: In web mining, application-level knowledge, data engineering, statistics, and
probability are required to successfully retrieve the information from weblogs.

Base for Comparison: Techniques Used
 Data Mining: Statistical techniques are most helpful in analyzing data.
 Text Mining: In Text Mining, computational linguistic principles are used to evaluate the
meaning of the text.
 Web Mining: In web mining, sequential pattern, clustering, and associative mining principles
are used.

Multimedia Data mining:


Multimedia mining is a subfield of data mining that is used to find interesting information or
implicit knowledge in multimedia databases. Mining in multimedia is also referred to as automatic
annotation or annotation mining. Mining multimedia data involves two or more data types, such
as text and video, or text, video, and audio.

Multimedia data mining is an interdisciplinary field that integrates image processing and
understanding, computer vision, data mining, and pattern recognition. Multimedia data mining
discovers interesting patterns from multimedia databases that store and manage large collections
of multimedia objects, including image data, video data, audio data, sequence data and hypertext
data containing text, text markups, and linkages. Issues in multimedia data mining include
content-based retrieval and similarity search, generalization and multidimensional analysis.
Multimedia data cubes contain additional dimensions and measures for multimedia information.


The framework that manages different types of multimedia data stored, delivered, and utilized in
different ways is known as a multimedia database management system. There are three classes of
multimedia databases: static, dynamic, and dimensional media. The content of the Multimedia
Database management system is as follows:

 Media data: The actual data representing an object.
 Media format data: Information about the format of the media data after it goes through the
acquisition, processing and encoding phases, such as sampling rate, resolution and encoding
scheme.
 Media keyword data: Keyword descriptions relating to the generation of the data, also
known as content descriptive data. Example: date, time and place of recording.
 Media feature data: Content-dependent data such as the distribution of colours, kinds of
texture and different shapes present in the data.
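
The four kinds of content listed above can be pictured as fields of one record. The following is a minimal sketch only; the class names `MediaFormat` and `MediaObject`, the field names and all sample values are assumptions made for illustration and do not come from any particular multimedia DBMS.

```python
from dataclasses import dataclass, field


@dataclass
class MediaFormat:
    """Media format data: how the media was acquired and encoded."""
    encoding: str                  # e.g. "JPEG" or "MPEG"
    resolution: tuple              # (width, height) in pixels
    sampling_rate_hz: float = 0.0  # only meaningful for audio or video


@dataclass
class MediaObject:
    """One multimedia object together with its four kinds of stored content."""
    media_data: bytes                              # the actual data
    media_format: MediaFormat                      # media format data
    keywords: dict = field(default_factory=dict)   # media keyword data
    features: dict = field(default_factory=dict)   # media feature data


# All values below are made up for the example.
photo = MediaObject(
    media_data=b"\xff\xd8\xff",  # placeholder bytes, not a real image
    media_format=MediaFormat(encoding="JPEG", resolution=(640, 480)),
    keywords={"date": "2023-01-15", "place": "studio"},
    features={"dominant_colour": "blue", "texture": "smooth"},
)
print(photo.media_format.encoding, photo.keywords["place"])
```

Keeping keyword data and feature data in separate fields mirrors the distinction in the list: keywords describe how the data was generated, while features are derived from the content itself.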

Types of Multimedia Applications

Types of multimedia applications based on data management characteristics are:

1. Repository applications: A large amount of multimedia data and metadata (media format
data, media keyword data, media feature data) is stored for retrieval purposes, e.g.,
repositories of satellite images, engineering drawings and radiology scans.
2. Presentation applications: They involve delivering multimedia data subject to temporal
constraints. Optimal viewing or listening requires the DBMS to deliver data at a certain rate,
offering a quality of service above a certain threshold. Here data is processed as it is
delivered. Example: annotation of video and audio data, real-time editing analysis.
3. Collaborative work using multimedia information: It involves executing a complex task
by merging drawings and change notifications. Example: intelligent healthcare network.

Challenges with Multimedia Database

There are still many challenges to multimedia databases, such as:


1. Modelling: Work in this area can improve both database and information retrieval
techniques; documents constitute a specialized area and deserve special consideration.
2. Design: The conceptual, logical and physical design of multimedia databases has not yet
been fully addressed, because performance and tuning issues at each level are far more
complex. The data comes in a variety of formats such as JPEG, GIF, PNG and MPEG, which
are not easy to convert from one form to another.
3. Storage: Storing a multimedia database on any standard disk presents problems of
representation, compression, mapping to device hierarchies, archiving and buffering
during input-output operations. In a DBMS, a BLOB (Binary Large Object) facility allows
untyped bitmaps to be stored and retrieved.
4. Performance: Physical limitations dominate in applications involving video playback or
audio-video synchronization. The use of parallel processing may alleviate some
problems, but such techniques are not yet fully developed. Apart from this, a multimedia
database consumes a lot of processing time and bandwidth.
5. Queries and retrieval: For multimedia data such as images, video and audio, accessing data
through queries opens up many issues, such as efficient query formulation, query execution
and optimization, which still need to be worked on.
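
The BLOB facility mentioned under Storage can be tried out with SQLite, used here only as a convenient stand-in for a DBMS. This is a minimal sketch: the table name `media`, its columns and the payload bytes are assumptions for illustration.

```python
import sqlite3

# In-memory database so the example leaves nothing on disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE media (id INTEGER PRIMARY KEY, name TEXT, content BLOB)")

payload = bytes(range(256))  # stand-in for image or audio bytes
conn.execute(
    "INSERT INTO media (name, content) VALUES (?, ?)", ("sample.bin", payload)
)
conn.commit()

# The BLOB comes back as an untyped byte string; the DBMS does not interpret it.
row = conn.execute(
    "SELECT content FROM media WHERE name = ?", ("sample.bin",)
).fetchone()
restored = bytes(row[0])
print(len(restored), restored == payload)
conn.close()
```

Because the database treats the BLOB as opaque bytes, any content-based query still needs separately stored feature or keyword data, which is part of the query and retrieval challenge listed above.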

Where is Multimedia Database Applied?

A multimedia database is applied in the following areas:

 Documents and record management: Industries and businesses keep detailed records
and various documents. For example, insurance claim records.
 Knowledge dissemination: A multimedia database is a very effective tool for knowledge
dissemination because it provides several resources. For example, electronic books.


 Education and training: Computer-aided learning materials can be designed using
multimedia sources, which are nowadays very popular sources of learning. Example:
digital libraries.
 Travelling: Marketing, advertising, retailing, entertainment and travel. For example, a
virtual tour of cities.
 Real-time control and monitoring: With active database technology, multimedia
presentation of information can effectively monitor and control complex tasks. For
example, manufacturing operation control.

Categories of Multimedia Data Mining

Multimedia mining refers to analyzing a large amount of multimedia information to extract
patterns based on their statistical relationships. Multimedia data mining is classified into two
broad categories: static and dynamic media. Static media contains text (digital library, creating
SMS and MMS) and images (photos and medical images). Dynamic media contains audio
(music and MP3 sounds) and video (movies). The image below shows the categories of
multimedia data mining.

1. Text Mining

Text is the most common medium for the formal exchange of information. Text mining
evaluates large amounts of natural language text and detects patterns in it to find useful
information. Text mining, also referred to as text data mining, is used to find meaningful
information in unstructured texts from various sources.

2. Image Mining

Image mining systems can discover meaningful information or image patterns from a huge
collection of images. Image mining deals with how the low-level pixel representation contained
in a raw image or image sequence can be processed to recognize high-level spatial objects and
relationships. It draws on digital image processing, image understanding, databases, AI, etc.

3. Video Mining

Video mining is the unsupervised discovery of interesting patterns from large amounts of video
data; video data is multimedia data that combines text, images, metadata, visuals and audio. It is
commonly used in security and surveillance, entertainment, medicine, sports and education
programs. Processing steps include indexing, automatic segmentation, content-based retrieval,
classification and detecting triggers.

4. Audio Mining

Audio mining plays an important role in multimedia applications. It is a technique by which the
content of an audio signal can be automatically searched, analyzed and decomposed with wavelet
transformation. It is generally used in automatic speech recognition, where the analysis tries to
find any speech within the audio. Band energy, frequency centroid, zero-crossing rate, pitch
period and bandwidth are often used for audio processing.
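
Some of the audio features named above can be computed directly from a frame of samples. The sketch below covers zero-crossing rate, frequency centroid and, as a simplification of band energy, the mean energy of the whole frame. The test tone, the sample rate and the function names are assumptions for illustration; a naive DFT is used to keep the example dependency-free, where real systems would use an FFT library.

```python
import cmath
import math


def zero_crossing_rate(samples):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(samples) - 1)


def frequency_centroid(samples, sample_rate):
    """Magnitude-weighted mean frequency, using a naive O(n^2) DFT."""
    n = len(samples)
    weighted, total = 0.0, 0.0
    for k in range(n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(samples)
        )
        magnitude = abs(coeff)
        weighted += (k * sample_rate / n) * magnitude
        total += magnitude
    return weighted / total if total else 0.0


def mean_energy(samples):
    """Average squared amplitude of the frame."""
    return sum(x * x for x in samples) / len(samples)


sample_rate = 800  # Hz, illustrative
tone_hz = 100      # pure test tone, illustrative
frame = [
    math.sin(2 * math.pi * tone_hz * t / sample_rate + 0.1) for t in range(80)
]

zcr = zero_crossing_rate(frame)
centroid = frequency_centroid(frame, sample_rate)
energy = mean_energy(frame)
print(round(zcr, 3), round(centroid, 1), round(energy, 3))
```

For a pure tone the centroid sits at the tone frequency and the zero-crossing rate is close to twice the tone frequency divided by the sample rate; speech and music give broader, time-varying values, which is what makes these features useful for mining.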

Application of Multimedia Mining

There are different kinds of applications of multimedia data mining, some of which are as
follows:

 Digital Library: The collection of digital data is stored and maintained in a digital
library, which makes it essential to convert different digital data formats into text, images,
video, audio, etc.
 Traffic Video Sequences: To discover important but previously unidentified
knowledge from traffic video sequences, detailed analysis and mining are performed
based on vehicle identification, traffic flow, and the queue temporal relations of
vehicles at an intersection. This provides an economical approach to regular traffic
monitoring.


 Medical Analysis: Multimedia mining is widely used in the medical field, particularly
for analyzing medical images. Various data mining techniques are used for image
classification. Examples include automatic 3D delineation of highly aggressive brain
tumours, automatic localization and identification of vertebrae in 3D CT scans, MRI scans,
ECG and X-ray.
 Customer Perception: It contains details about customers' opinions, products or
services, customer complaints, customer preferences, and the level of customer
satisfaction with products or services, which are collected together. Many companies have
call centres that receive telephone calls from customers; the audio data is used for topic
detection, resource assignment and evaluation of the quality of services.
 Media Making and Broadcasting: Broadcasting companies, such as radio stations and TV
channels, can apply multimedia mining to monitor their content, search for more efficient
approaches and improve quality.
 Surveillance system: It consists of collecting, analyzing and summarizing audio, video or
audiovisual information about specific areas such as government organizations,
multinational companies, shopping malls, banks, forests, agricultural areas, highways, etc.
The main use of this technology is in the field of security; hence it can be utilized by the
military, police and private companies that provide security services.

Process of Multimedia Data Mining

The image below shows the present architecture, which includes the types of the multimedia
mining process. Data collection is the initial stage of the learning system; pre-processing
extracts significant features from raw data and includes data cleaning, transformation,
normalization, feature extraction, etc. Learning can be straightforward if informative features
can be identified at the preprocessing stage. The complete process depends heavily on the nature
of the raw data and the problem domain. The product of preprocessing is the training set. A
learning model must be selected for the given training set to learn from it and make the
multimedia model more consistent.
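
Two of the preprocessing steps named above, data cleaning and normalization, can be sketched in a few lines. This is an illustration under stated assumptions: the raw rows, their two feature columns and the choice of min-max scaling are made up for the example and are not prescribed by the text.

```python
def clean(rows):
    """Data cleaning: drop rows that contain a missing (None) value."""
    return [row for row in rows if None not in row]


def min_max_normalize(rows):
    """Normalization: rescale every column to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [
            (value - lo) / (hi - lo) if hi > lo else 0.0
            for value, lo, hi in zip(row, lows, highs)
        ]
        for row in rows
    ]


# Hypothetical raw feature rows: (brightness, duration_seconds)
raw = [(10, 2.0), (20, None), (30, 4.0), (50, 6.0)]

# The product of preprocessing is the training set.
training_set = min_max_normalize(clean(raw))
print(training_set)
```

Scaling all columns to a common range keeps a feature with large numeric values, such as a duration in seconds, from dominating distance-based learning models later in the process.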


Converting unstructured data to structured data: Data that resides in fixed fields within a
record or file is called structured data, and such data is stored in sequential form. Structured
data can be easily entered, stored, queried and analyzed. Unstructured data is a bitstream, for
example, the pixel representation of an image, audio or video, or the character representation of
text. These files may have an internal structure, but they are still considered "unstructured"
because their data does not fit neatly in a database. For example, images and videos of different
objects have some similarities (each might represent an interpretation of a building) without
having a clear structure.

Current data mining tools operate on structured data, which resides in large relational
databases, while data in multimedia databases is semi-structured or unstructured. Hence, the
semi-structured or unstructured multimedia data is converted into structured data, and then the
current data mining tools are used to extract knowledge. The sequence or time element is what
differs between unstructured and structured data mining. The architecture for converting
unstructured data to structured data, which is used for extracting information from the
unstructured database, is shown in the above image. Data mining tools are then applied to the
stored structured databases.
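
As a small illustration of this conversion, a raw pixel bitstream can be summarized into a record with fixed fields that a conventional data mining tool could consume. The function name `pixels_to_record`, the chosen fields and the tiny 2x2 image are assumptions for this sketch; real systems extract far richer features.

```python
def pixels_to_record(image_id, width, height, rgb_bytes):
    """Summarize a raw RGB byte stream as a fixed-field record."""
    if len(rgb_bytes) != width * height * 3:
        raise ValueError("byte stream does not match the stated dimensions")
    count = width * height
    return {
        "image_id": image_id,
        "width": width,
        "height": height,
        "mean_red": sum(rgb_bytes[0::3]) / count,    # every third byte from 0
        "mean_green": sum(rgb_bytes[1::3]) / count,  # every third byte from 1
        "mean_blue": sum(rgb_bytes[2::3]) / count,   # every third byte from 2
    }


# A made-up 2x2 image: two red pixels and two blue pixels.
raw_pixels = bytes([255, 0, 0, 255, 0, 0, 0, 0, 255, 0, 0, 255])
record = pixels_to_record("img-001", 2, 2, raw_pixels)
print(record)
```

Each such record is one row of a structured table, so many images together form the kind of relational data that existing mining tools already handle.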

Architecture for Multimedia Data Mining

Multimedia mining architecture is given in the below image. The architecture has several
components. Important components are Input, Multimedia Content, Spatiotemporal
Segmentation, Feature Extraction, Finding similar Patterns, and Evaluation of Results.


1. The input stage comprises the multimedia database that is used to find patterns and perform
the data mining.
2. Multimedia content is the data selection stage, which requires the user to select the
databases, subsets of fields, or data for data mining.
3. Spatio-temporal segmentation deals with moving objects in the image sequences of
videos, and it is useful for object segmentation.
4. Feature extraction is the preprocessing step that involves integrating data from various
sources and making choices about how to characterize or code certain data fields so that they
can serve as inputs to the pattern-finding stage. Such representation choices are required
because certain fields could include data at levels of detail that are not considered suitable
for the pattern-finding stage. In multimedia data mining (MDM), the preprocessing stage is
significant because of the unstructured nature of multimedia records.
5. Finding similar patterns is the heart of the whole data mining process. The hidden
patterns and trends in the data are uncovered in this stage. Approaches used in this stage
include association, classification, clustering, regression, time-series analysis and
visualization.
6. Evaluation of results is the stage used to assess the mined results; it is important for
determining whether a prior stage must be revisited. This stage also consists of reporting
and using the extracted knowledge to produce new actions, products, services, or marketing
strategies.
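
The spatio-temporal segmentation stage can be illustrated with frame differencing, one simple way to locate a moving object between two video frames. The text does not prescribe a specific method, so this technique, the threshold value and the 4x4 grey-level frames are all assumptions for the sketch.

```python
def moving_pixel_mask(previous_frame, current_frame, threshold=30):
    """Mark pixels whose grey level changed by more than `threshold`."""
    return [
        [
            1 if abs(cur - prev) > threshold else 0
            for prev, cur in zip(prev_row, cur_row)
        ]
        for prev_row, cur_row in zip(previous_frame, current_frame)
    ]


def bounding_box(mask):
    """Smallest (top, left, bottom, right) box covering all marked pixels."""
    points = [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]
    if not points:
        return None
    rows = [r for r, _ in points]
    cols = [c for _, c in points]
    return (min(rows), min(cols), max(rows), max(cols))


# Two made-up 4x4 frames: a bright object moves one pixel to the right.
frame_a = [[0, 0, 0, 0], [0, 200, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
frame_b = [[0, 0, 0, 0], [0, 0, 200, 0], [0, 0, 0, 0], [0, 0, 0, 0]]

mask = moving_pixel_mask(frame_a, frame_b)
box = bounding_box(mask)
print(mask[1], box)
```

The box around the changed pixels is a crude segment of the moving object; its position and size can then be passed to the feature extraction stage.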

Models for Multimedia Mining

The models used to mine multimedia data are very important. Four multimedia mining models
are commonly used: classification, association rules, clustering and statistical modelling.


1. Classification: Classification is a technique for multimedia data analysis that learns
from the properties of a given set of multimedia objects, which are divided into predefined
class labels. Classification organizes data into categories for more effective and efficient
use; it creates a function that maps a data item into one of several predefined classes by
taking a training data set as input and building a model of the class attribute based on the
rest of the attributes. Decision tree classification has an intuitive nature that matches the
user's conceptual model without loss of accuracy. Hidden Markov Models are used to classify
multimedia data such as images and videos, for example as indoor or outdoor games.
2. Association Rule: Association rule mining is one of the most important data mining
techniques; it helps find relations between data items in huge databases. Associations in
multimedia mining can involve both image content and non-image content features.
Mining the frequently occurring patterns between different images becomes mining the
repeated patterns in a set of transactions. Multi-relational association rule mining is used
to display multiple reports for the same image. Multiple-level association rule techniques
are also used in image classification.
3. Clustering: Cluster analysis divides the data objects into multiple groups or clusters,
so that similar objects fall into the same group. In multimedia mining, the clustering
technique can be applied to group similar images, objects, sounds, videos and texts.
Clustering algorithms can be divided into several methods: hierarchical methods,
density-based methods, grid-based methods, model-based methods, k-means algorithms,
and graph-based models.
4. Statistical Modeling: Statistical mining models determine the statistical validity of test
parameters and have been used to test hypotheses, undertake correlation studies, and
transform and prepare data for further analysis. They are used to establish links between
words and partitioned image regions to form a simple co-occurrence model.
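
Of the four models, clustering is the easiest to show compactly. Below is a minimal k-means sketch over made-up two-dimensional feature vectors (for instance, brightness and texture scores of images). Seeding the centroids with the first k points is a simplification chosen to keep the example deterministic, and the data values are assumptions.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, k, iterations=10):
    """Plain k-means; the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignment, centroids


# Hypothetical (brightness, texture) feature vectors for four images.
features = [(0.10, 0.20), (0.90, 0.80), (0.15, 0.25), (0.85, 0.90)]
assignment, centroids = k_means(features, k=2)
print(assignment)
```

Images that receive the same cluster index have similar feature vectors, which is the grouping of "similar images, objects, sounds, videos and texts" described in the clustering item.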


Issues in Multimedia Mining

Major issues in multimedia data mining include content-based retrieval and similarity search,
multidimensional analysis, classification and prediction analysis, and mining associations in
multimedia data.

1. Content-based retrieval and Similarity search

Content-based retrieval in multimedia is a challenging problem because multimedia data requires
detailed analysis starting from pixel values. For similarity search in multimedia data, we
consider two main families of multimedia retrieval systems:

 Description-based retrieval systems build indices and perform object retrieval based on
image descriptions, such as keywords, captions, size, and creation time.
 Content-based retrieval systems support retrieval based on image content, for example,
colour histogram, texture, shape, objects, and wavelet transforms.
 Use of content-based retrieval systems: Visual features are used to index images and to
support object retrieval based on feature similarity, which is highly desirable in many
applications. These applications include diagnosis, weather prediction, TV production,
internet search engines for pictures, and e-commerce.
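
Content-based retrieval by histogram similarity can be sketched as follows. To keep the example short, a single intensity channel stands in for a full colour histogram, and histogram intersection is used as the similarity measure. The image names, pixel values, bin count and function names are assumptions for illustration.

```python
def intensity_histogram(pixels, bins=4):
    """Normalized histogram of values in 0-255 with `bins` equal buckets."""
    counts = [0] * bins
    for value in pixels:
        counts[min(value * bins // 256, bins - 1)] += 1
    return [c / len(pixels) for c in counts]


def histogram_intersection(h1, h2):
    """Similarity in [0, 1]; 1 means identical normalized histograms."""
    return sum(min(a, b) for a, b in zip(h1, h2))


def most_similar(query_pixels, database):
    """Return image names ranked from most to least similar to the query."""
    query = intensity_histogram(query_pixels)
    scores = {
        name: histogram_intersection(query, intensity_histogram(pixels))
        for name, pixels in database.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical image database: name -> flat list of pixel intensities.
database = {
    "night": [10, 20, 30, 40],
    "noon": [220, 230, 240, 250],
    "dusk": [10, 20, 120, 130],
}
query_pixels = [15, 25, 35, 200]
ranking = most_similar(query_pixels, database)
print(ranking)
```

A description-based system would instead match the query against keywords or captions; the content-based version needs no manual annotation, at the cost of computing and indexing features for every image.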

2. Multidimensional Analysis

To perform multidimensional analysis of large multimedia databases, multimedia data cubes may
be designed and constructed similarly to traditional data cubes from relational data. A
multimedia data cube has several dimensions. For example: the size of the image or video in
bytes; the width and height of the frames (two dimensions); the date on which the image or
video was created or last modified; the format type of the image or video; the frame sequence
duration in seconds; the Internet domain of pages referencing the image or video; the keywords;
a colour dimension; and an edge orientation dimension. A multimedia data cube can have
additional dimensions and measures for multimedia data, such as colour, texture, and shape.
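
A very small data cube over image metadata can be built by aggregating a measure for every combination of dimensions. In this sketch the dimensions (`format`, `year`), the measure (`size_bytes`) and all rows are assumptions; production OLAP engines use far more efficient storage than nested dictionaries.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical image metadata rows.
images = [
    {"format": "JPEG", "year": 2022, "size_bytes": 1000},
    {"format": "JPEG", "year": 2023, "size_bytes": 2000},
    {"format": "PNG", "year": 2023, "size_bytes": 500},
    {"format": "JPEG", "year": 2023, "size_bytes": 1500},
]


def build_cube(rows, dimensions, measure):
    """Aggregate count and total of `measure` for every subset of dimensions."""
    cube = {}
    for r in range(len(dimensions) + 1):
        for dims in combinations(dimensions, r):
            cells = defaultdict(lambda: {"count": 0, "total": 0})
            for row in rows:
                key = tuple(row[d] for d in dims)
                cells[key]["count"] += 1
                cells[key]["total"] += row[measure]
            cube[dims] = dict(cells)
    return cube


cube = build_cube(images, ("format", "year"), "size_bytes")
print(cube[("format",)][("JPEG",)])            # roll-up over year
print(cube[("format", "year")][("JPEG", 2023)])  # one base cell
print(cube[()][()])                            # grand total
```

Adding a colour or texture dimension only requires another key in each row, which is how a multimedia data cube extends a traditional one.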

MultiMediaMiner is a multimedia data mining system prototype, an extension of the
DBMiner system that handles multimedia data. The Image Excavator component of
MultiMediaMiner uses image contextual information, like HTML tags on Web pages, to derive
keywords. By navigating online directory structures, like the Yahoo! directory, it is possible to
build hierarchies of keywords mapped onto the directories in which the image was found.

3. Classification and Prediction Analysis

Classification and predictive analysis have been used for mining multimedia data, particularly in
scientific analysis like astronomy, seismology, and geoscientific analysis. Decision tree
classification is an important method in reported image data mining applications. For example,
consider sky images that astronomers have carefully classified and that serve as the training
set. Models can then be built to recognize galaxies, stars and other stellar objects based on
properties like magnitudes, areas, intensity, image moments and orientation.


Image data mining classification and clustering are closely connected to image analysis and
scientific data mining. Image data often comes in large volumes and needs substantial
processing power, such as parallel and distributed processing. Hence, many image analysis
techniques and scientific data analysis methods can be applied to image data mining.

4. Mining Associations in Multimedia Data

Association rules involving multimedia objects have been mined in image and video
databases. Three categories can be observed:

 Associations between image content and non-image content features
 Associations among image contents that are not related to spatial relationships
 Associations among image contents related to spatial relationships

First, an image contains multiple objects, each with various features such as colour, shape,
texture, keyword, and spatial location, so many possible associations can be made. Second,
because a picture containing multiple repeated objects is an important feature in image analysis,
the recurrence of similar objects should not be ignored in association analysis. Third, spatial
relationships among objects in multimedia images can be used to discover object associations
and correlations. To mine associations between multimedia objects, we can treat every image as
a transaction and find commonly occurring patterns among different images.
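
Treating every image as a transaction can be illustrated with support and confidence over the objects detected in each image. The object labels and the four transactions below are made up, and only item pairs are counted to keep the sketch short; full algorithms such as Apriori extend this to larger itemsets.

```python
from itertools import combinations


def frequent_pairs(transactions, min_support):
    """Return {pair: support} for item pairs meeting the support threshold."""
    total = len(transactions)
    counts = {}
    for items in transactions:
        for pair in combinations(sorted(items), 2):
            counts[pair] = counts.get(pair, 0) + 1
    return {
        pair: n / total for pair, n in counts.items() if n / total >= min_support
    }


def confidence(transactions, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent."""
    with_antecedent = [t for t in transactions if antecedent in t]
    if not with_antecedent:
        return 0.0
    hits = sum(1 for t in with_antecedent if consequent in t)
    return hits / len(with_antecedent)


# Hypothetical images, each described by the set of objects found in it.
image_transactions = [
    {"sky", "sea", "boat"},
    {"sky", "sea"},
    {"sky", "tree"},
    {"sea", "boat"},
]

pairs = frequent_pairs(image_transactions, min_support=0.5)
boat_implies_sea = confidence(image_transactions, "boat", "sea")
print(pairs, boat_implies_sea)
```

Here every image that contains a boat also contains the sea, so the rule boat -> sea has confidence 1.0, while the reverse rule is weaker because one sea image has no boat.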

Mining the World Wide Web:

Data Mining- World Wide Web

Over the last few years, the World Wide Web has become a significant source of information
and simultaneously a popular platform for business. Web mining can be defined as the method of
utilizing data mining techniques and algorithms to extract useful information directly from the
web, such as Web documents and services, hyperlinks, Web content, and server logs. The World
Wide Web contains a large amount of data that provides a rich source for data mining. The
objective of Web mining is to look for patterns in Web data by collecting and examining data in
order to gain insights.


What is Web Mining?


Web mining can broadly be seen as the application of adapted data mining techniques to the web,
whereas data mining is defined as the application of algorithms to discover patterns in mostly
structured data, embedded in a knowledge discovery process. A distinctive property of web
mining is that it deals with a variety of data types. The web has multiple aspects that yield
different approaches to the mining process: web pages consist of text, web pages are linked via
hyperlinks, and user activity can be monitored via web server logs. These three features lead to
the differentiation between three areas: web content mining, web structure mining, and web
usage mining.

There are three types of web mining:

1. Web Content Mining:

Web content mining can be used to extract useful data, information and knowledge from web
page content. In web content mining, each web page is considered as an individual document.
One can take advantage of the semi-structured nature of web pages, as HTML provides
information that concerns not only the layout but also the logical structure. The primary
task of content mining is data extraction, where structured data is extracted from unstructured
websites. The objective is to facilitate data aggregation over various web sites by using the
extracted structured data. Web content mining can be utilized to distinguish topics on the web.
For example, if a user searches for a specific task on a search engine, the user gets a list of
suggestions.
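
The data extraction task described above can be sketched with Python's standard `html.parser` module, which turns the logical structure of an HTML page into a structured record. The sample page, the class name `LinkAndTitleExtractor` and the choice of fields (title and links) are assumptions for illustration.

```python
from html.parser import HTMLParser


class LinkAndTitleExtractor(HTMLParser):
    """Collect the page title and every hyperlink as structured fields."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data.strip()


# A made-up page standing in for downloaded web content.
page = """<html><head><title>Data Mining Notes</title></head>
<body><a href="/unit-5">Unit 5</a> <a href="/unit-4">Unit 4</a></body></html>"""

parser = LinkAndTitleExtractor()
parser.feed(page)
record = {"title": parser.title, "links": parser.links}
print(record)
```

Records extracted this way from many sites share the same fields, which is what makes aggregation across web sites possible.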


2. Web Structure Mining:

Web structure mining is used to discover the link structure of hyperlinks. It identifies how
web pages are linked, either directly to each other or as part of a link network. In web
structure mining, one considers the web as a directed graph, with the web pages being the
vertices that are connected by hyperlinks. The most important application in this regard is the
Google search engine, which ranks its results primarily with the PageRank algorithm. It
treats a page as highly relevant when it is frequently linked to by other highly relevant
pages. Structure and content mining methodologies are usually combined. For example,
web structure mining can help organizations analyze the link network between two
commercial sites.
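
The idea that a page is relevant when other relevant pages link to it can be sketched with a textbook power-iteration version of PageRank on a tiny directed graph. The four-page graph, the damping factor of 0.85 and the fixed iteration count are assumptions for this example; it is not a description of how any production search engine is implemented.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank on {page: [pages it links to]}."""
    pages = list(graph)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, out_links in graph.items():
            if out_links:
                share = damping * rank[page] / len(out_links)
                for target in out_links:
                    new_rank[target] += share
            else:
                # A page with no out-links spreads its rank over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical web graph: the vertices are pages, the edges are hyperlinks.
web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(web)
best = max(ranks, key=ranks.get)
print(best, {page: round(score, 3) for page, score in ranks.items()})
```

Page C is linked to by three pages and therefore ends up with the highest rank, while page D, which nobody links to, keeps only the baseline share.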

3. Web Usage Mining:

Web usage mining is used to extract useful data, information and knowledge from weblog
records, and assists in recognizing user access patterns for web pages. In mining the usage
of web resources, one considers the records of requests made by visitors to a website, which
are often collected as web server logs. While the content and structure of the collection of web
pages follow the intentions of the authors of the pages, the individual requests demonstrate how
users actually see these pages. Web usage mining may disclose relationships that were not
intended by the creator of the pages.

Some of the methods to identify and analyze the web usage patterns are given below:

I. Session and visitor analysis:

Preprocessed data can be analyzed through session analysis, which covers visitor records,
days, times, sessions, etc. This data can be used to analyze visitor behavior.

A report is created after this analysis, which contains the details of frequently visited web
pages and common entry and exit pages.
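
Session analysis can be sketched by splitting preprocessed log records into sessions and then counting entry pages, exit pages and page visits. The log records, the 30-minute timeout and the record layout `(visitor, timestamp_seconds, page)` are assumptions for illustration.

```python
from collections import Counter

SESSION_TIMEOUT = 30 * 60  # seconds; a common but arbitrary choice


def sessionize(log, timeout=SESSION_TIMEOUT):
    """Split (visitor, timestamp, page) records into per-visitor sessions."""
    sessions = []
    last_seen = {}  # visitor -> (timestamp of last hit, index of open session)
    for visitor, timestamp, page in sorted(log):
        previous = last_seen.get(visitor)
        if previous is None or timestamp - previous[0] > timeout:
            sessions.append([page])  # start a new session
            index = len(sessions) - 1
        else:
            index = previous[1]
            sessions[index].append(page)
        last_seen[visitor] = (timestamp, index)
    return sessions


# Hypothetical preprocessed log records.
log = [
    ("u1", 0, "/home"),
    ("u1", 60, "/notes"),
    ("u1", 4000, "/home"),
    ("u2", 10, "/notes"),
    ("u2", 20, "/exit"),
]

sessions = sessionize(log)
entry_pages = Counter(s[0] for s in sessions)
exit_pages = Counter(s[-1] for s in sessions)
page_visits = Counter(page for s in sessions for page in s)
print(sessions, entry_pages.most_common(1))
```

The three counters correspond to the contents of the report mentioned above: frequently visited pages, common entry pages and common exit pages.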

II. OLAP (Online Analytical Processing):

OLAP performs multidimensional analysis of complex data.

OLAP can be performed on various parts of log-related data in a specific period.

OLAP tools can be used to derive important business intelligence metrics.

Challenges in Web Mining:


The web poses great challenges for resource and knowledge discovery, based on the
following observations:


 The complexity of web pages:

Web pages don't have a unifying structure. They are extremely complicated compared to
traditional text documents. There is an enormous number of documents in the digital library of
the web, and this library is not organized according to any specific order.

 The web is a dynamic data source:

The data on the internet is quickly updated. For example, news, weather, shopping, financial
news, sports, and so on.

 Diversity of user communities:

The user community on the web is quickly expanding. These users have different interests,
backgrounds, and usage purposes. There are over a hundred million workstations
connected to the internet, and the number is still increasing tremendously.

 Relevancy of data:

It is considered that a particular person is generally interested in only a small portion of the
web, while the rest of the web contains data that is unfamiliar to the user and may lead to
unwanted results.

 The web is too broad:

The size of the web is tremendous and rapidly increasing. It appears that the web is too huge for
data warehousing and data mining.

Mining the Web's Link Structures to recognize Authoritative Web Pages:

The web comprises pages as well as hyperlinks pointing from one page to another. When the
creator of a Web page creates a hyperlink pointing to another Web page, this can be considered
as the creator's endorsement of the other page. The collective endorsement of a given page by
various creators on the web may indicate the significance of the page and may naturally lead to
the discovery of authoritative web pages. The web linkage data provides rich information about
the relevance, quality, and structure of the web's content, and thus is a rich source for web
mining.

Application of Web Mining:

Web mining has extensive applications because of the many uses of the web. Some
applications of web mining are listed below.

 Marketing and conversion tool
 Data analysis of website and application performance
 Audience behavior analysis
 Advertising and campaign performance analysis
 Testing and analysis of a site
