Unit-5 Data Mining AIML
UNIT - V
Advanced Concepts: Basic concepts in Mining data streams – Mining Time-series data – Mining
sequence patterns in Transactional databases – Mining Object, Spatial, Multimedia, Text and
Web data – Spatial Data mining – Multimedia Data mining – Text Mining – Mining the World
Wide Web.
Vast amounts of digital data are generated and transmitted every second in the real world. Data
that is continuously generated and transmitted in this way is called a Data Stream. Extracting
valuable knowledge from such massive data is a big task that takes considerable time, effort,
and skill.
Data stream mining techniques are therefore needed to extract valuable insights from the data as
it arrives at the receiver's end. This section introduces the data stream and its mining
techniques.
A Data Stream is a continuous, fast-changing, and ordered chain of data transmitted at very high
speed. It is an ordered sequence of information for a specific interval. Data transferred from
the sender's side appears immediately at the receiver's side. Streaming does not mean
downloading the data or storing the information on storage devices.
There are many sources of data streams; a few widely used sources are listed below:
Internet traffic
Sensor data
Real-time ATM transactions
Live event data
Call records
Satellite data
Audio listening
Watching videos
Real-time surveillance systems
Online transactions
Data stream mining is the extraction of knowledge and valuable insights from a continuous
stream of data using stream processing software. It can be considered a subset of the general
concepts of machine learning, knowledge extraction, and data mining. In data stream mining, a
large amount of data must be analyzed in real time. The extracted knowledge is represented as
models and patterns over infinite streams of information.
Continuous Stream of Data: The data stream is an infinite, continuous stream that results
in big data. In data streaming, multiple data streams are passed simultaneously.
Time Sensitive: Data streams are time-sensitive, and elements of data streams carry
timestamps. An element is relevant only for a certain period; after that time it loses its
significance.
Data Volatility: No data is stored in data streaming, as it is volatile. Once the data mining
and analysis are done, the information is summarized or discarded.
Concept Drifting: Data streams are very unpredictable. The data changes or evolves
with time, so a model learned from earlier data may no longer fit later data.
Data streams are produced by various data stream generators. Data mining techniques are then
applied to extract knowledge and patterns from the streams. These techniques must therefore
handle multi-dimensional, multi-level data online and in a single pass.
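The single-pass requirement can be illustrated with a running summary that is updated as each element arrives and never stores the stream. The sketch below uses Welford's method for mean and variance; the class name `RunningStats` and the sample readings are illustrative assumptions, not part of the notes.

```python
class RunningStats:
    """Single-pass mean and variance (Welford's method)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        # Each element is folded into the summary and can then be discarded.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        # Population variance; 0.0 before any element has arrived.
        return self._m2 / self.n if self.n else 0.0


stats = RunningStats()
for reading in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:  # made-up sensor readings
    stats.update(reading)

print(stats.n, stats.mean, stats.variance)
```

Only three numbers are kept in memory regardless of how long the stream runs, which matches the volatility property described above.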
1. Classification
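Generally speaking, a stream mining classifier is ready at any moment to do either one of two
tasks: predict the class label of a new, unlabelled item, or train and adjust the model using
an item whose label is known.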
Let’s discuss the best-known classification algorithms for predicting the labels for data streams.
k-Nearest Neighbor (Lazy Classifier)
The k-Nearest Neighbor or k-NN classifier predicts a new item's class label based on the class
labels of the closest instances. In particular, this lazy classifier outputs the majority class
label of the k instances closest to the one to predict.
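A minimal sketch of k-NN on a stream, assuming the classifier keeps only a fixed-size sliding window of recent labelled items (the window, the class name `WindowKNN`, and the sample points are assumptions made for illustration):

```python
from collections import Counter, deque
import math


class WindowKNN:
    def __init__(self, k=3, window_size=100):
        self.k = k
        # deque(maxlen=...) silently drops the oldest item when full.
        self.window = deque(maxlen=window_size)

    def learn(self, features, label):
        self.window.append((tuple(features), label))

    def predict(self, features):
        if not self.window:
            return None
        point = tuple(features)
        nearest = sorted(self.window, key=lambda item: math.dist(item[0], point))
        votes = Counter(label for _, label in nearest[: self.k])
        return votes.most_common(1)[0][0]  # majority label among the k nearest


knn = WindowKNN(k=3, window_size=5)
stream = [((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"),
          ((5, 5), "b"), ((5, 6), "b"), ((6, 5), "b")]
for x, y in stream:
    knn.learn(x, y)

print(knn.predict((5.5, 5.5)), knn.predict((0.2, 0.1)))
```

Because the window holds five items, the first item of the stream has already been forgotten by the time the predictions are made.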
Naive Bayes
Naive Bayes is a classifier based on Bayes’ theorem. It is a probabilistic model called ‘naive’
because it assumes conditional independence between input features. The basic idea is to
compute a probability for each one of the class labels based on the attribute values and select the
class with the highest probability as the label for the new item.
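The idea can be sketched for categorical attributes with counts that are updated one item at a time, which suits a stream. The class name `StreamingNaiveBayes`, the Laplace smoothing, and the toy weather items are assumptions for illustration.

```python
from collections import defaultdict
import math


class StreamingNaiveBayes:
    def __init__(self):
        self.class_counts = defaultdict(int)
        # feature_counts[label][(attribute index, value)] -> count
        self.feature_counts = defaultdict(lambda: defaultdict(int))
        self.values_seen = defaultdict(set)  # distinct values per attribute

    def learn(self, features, label):
        self.class_counts[label] += 1
        for i, value in enumerate(features):
            self.feature_counts[label][(i, value)] += 1
            self.values_seen[i].add(value)

    def predict(self, features):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            score = math.log(count / total)  # log prior of the class
            for i, value in enumerate(features):
                # Laplace smoothing keeps unseen values from zeroing the product.
                numerator = self.feature_counts[label][(i, value)] + 1
                denominator = count + len(self.values_seen[i])
                score += math.log(numerator / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


nb = StreamingNaiveBayes()
items = [(("sunny", "hot"), "no"), (("sunny", "mild"), "no"),
         (("rainy", "mild"), "yes"), (("rainy", "cool"), "yes"),
         (("overcast", "hot"), "yes")]
for features, label in items:
    nb.learn(features, label)

print(nb.predict(("sunny", "hot")), nb.predict(("rainy", "cool")))
```

Log probabilities are summed instead of multiplying raw probabilities, which avoids numeric underflow when there are many attributes.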
Decision Trees
As the name signifies, a decision tree classifier builds a tree structure from training data and
then uses it to predict the class labels of unseen data items. Its predictions are easy to
understand. In data stream mining, the Hoeffding tree is the state-of-the-art decision tree
classifier, and the Hoeffding adaptive tree is a more advanced variant.
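The Hoeffding tree takes its name from the Hoeffding bound, which it uses to decide when enough stream items have been seen at a leaf to commit to a split. The sketch below shows only that test; the function names and the example gain values are assumptions for illustration, not a full tree implementation.

```python
import math


def hoeffding_bound(value_range, delta, n):
    """epsilon = sqrt(R^2 * ln(1/delta) / (2n)) for n observations.

    value_range (R) is the range of the split measure, and delta is the
    allowed probability that the true mean differs by more than epsilon.
    """
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_gain, second_gain, value_range, delta, n):
    # Split once the best attribute leads the runner-up by more than epsilon.
    return (best_gain - second_gain) > hoeffding_bound(value_range, delta, n)


# Hypothetical gains for the two best attributes at a leaf.
print(should_split(0.30, 0.15, value_range=1.0, delta=1e-7, n=100))
print(should_split(0.30, 0.15, value_range=1.0, delta=1e-7, n=1000))
```

The bound shrinks as more items arrive, so a lead that is not yet convincing after 100 items can become sufficient after 1000.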
Logistic Regression
Despite its name, Logistic Regression is a classification algorithm, not a regression
algorithm. It is used to estimate discrete values such as 0/1, yes/no, or true/false. It
predicts the probability of occurrence of an event by fitting data to a logit function based on
known instances of the data stream.
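A minimal sketch of logistic regression trained online, one labelled instance at a time, using stochastic gradient descent on the log loss. The class name, learning rate, and the one-dimensional toy stream are assumptions for illustration.

```python
import math


class OnlineLogisticRegression:
    def __init__(self, n_features, learning_rate=0.1):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict_proba(self, x):
        z = self.bias + sum(w * xi for w, xi in zip(self.weights, x))
        return 1.0 / (1.0 + math.exp(-z))  # logistic (sigmoid) function

    def learn(self, x, y):
        # For log loss, the gradient with respect to z is (prediction - label).
        error = self.predict_proba(x) - y
        self.weights = [w - self.learning_rate * error * xi
                        for w, xi in zip(self.weights, x)]
        self.bias -= self.learning_rate * error


model = OnlineLogisticRegression(n_features=1)
# Toy stream: the label is 1 when the single feature is positive.
for _ in range(200):
    for value in (-2.0, -1.0, 1.0, 2.0):
        model.learn([value], 1 if value > 0 else 0)

print(model.predict_proba([2.0]) > 0.5, model.predict_proba([-2.0]) < 0.5)
```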
Ensembles
Ensembles combine different classifiers and can predict better than the individual classifiers.
Data is divided into distinct subsets, and these subsets are fed to the different classifiers
of the ensemble model. Bagging and boosting are two types of ensemble models. The ADWIN bagging
method is widely used in data stream mining.
2. Regression
Regression is also a supervised learning technique. It predicts real values of the label
attribute for stream instances rather than discrete values as in classification. The idea is
otherwise similar to classification: a regressor either predicts the real-valued label of
unknown items or trains and adjusts the model using known, labelled data.
Regression algorithms are also the same as classification algorithms. Below are the best-known
regression algorithms for predicting the labels for data streams.
3. Clustering
Let’s discuss the best-known clustering algorithms for group segmentation of data streams.
K-means Clustering
The k-means clustering method is the most used and straightforward method for clustering. It
starts by randomly selecting k centroids. After that, repeat two steps until the stopping criteria
are met: first, assign each instance to the nearest centroid, and second, recompute the cluster
centroids by taking the mean of all the items in that cluster.
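The two repeated steps can be sketched in plain Python as follows. The function name `k_means`, the fixed random seed, the stopping rule (assignments no longer change), and the sample points are assumptions for illustration.

```python
import math
import random


def k_means(points, k, max_iterations=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # randomly select k initial centroids
    assignment = None
    for _ in range(max_iterations):
        # Step 1: assign each instance to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:  # stopping criterion: nothing moved
            break
        assignment = new_assignment
        # Step 2: recompute each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignment


points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.0), (8.5, 9.5)]
centroids, assignment = k_means(points, k=2)
print(assignment)
```

The cluster numbers themselves are arbitrary; what matters is which points end up sharing a number.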
Hierarchical Clustering
Density-based Clustering
DBSCAN is used for density-based clustering. It is based on the natural human clustering
approach.
Frequent pattern mining is an essential task in unsupervised learning. It is used to describe the
data and find the association rules or discriminative features in data that will further help
classification and clustering tasks. It is based on two rules.
Below are the best-known frequent pattern mining algorithms for finding frequent itemsets in
data.
Apriori
Eclat
FP-growth
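As a rough illustration of frequent itemset mining, the sketch below follows the level-wise idea behind Apriori: count candidates, keep those meeting the minimum support, and build larger candidates only from the survivors. The function name, the absolute support threshold, and the shopping-basket data are assumptions for illustration; it is not an optimized implementation.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support count} for itemsets seen in >= min_support transactions."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted({item for t in transactions for item in t})
    frequent = {}
    candidates = [frozenset([item]) for item in items]
    size = 1
    while candidates:
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        size += 1
        # Build larger candidates from surviving items, pruning any candidate
        # that has an infrequent subset one item smaller.
        survivors = sorted({item for c in level for item in c})
        candidates = [
            frozenset(combo)
            for combo in combinations(survivors, size)
            if all(frozenset(sub) in level for sub in combinations(combo, size - 1))
        ]
    return frequent


baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]
result = frequent_itemsets(baskets, min_support=3)
print(result[frozenset({"bread", "milk"})], len(result))
```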
Time Series Analysis comprises methods for analyzing time-series data in order to extract
meaningful statistics, rules and patterns. These rules and patterns might be used to build
forecasting models that are able to predict future developments.
1. Financial:
1.1 Used for stock price evaluation
1.2 For the measurement of Inflation
2. Industry:
2.1 Determine the power consumption
3. Scientific:
3.1 Used for experiment results
4. Meteorological:
4.1 Concerned with the processes and phenomena of the atmosphere, basically for forecasting
weather
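As a small illustration of a forecasting model built from a time series, the sketch below applies simple exponential smoothing, where each smoothed value blends the newest observation with the previous smoothed value. The function name, the smoothing factor `alpha`, and the sample series are assumptions for illustration.

```python
def exponential_smoothing(series, alpha):
    """Return the smoothed series; its last value serves as a one-step-ahead forecast."""
    smoothed = [series[0]]
    for value in series[1:]:
        smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed


power_readings = [10, 12, 14, 16]  # made-up consumption values
smoothed = exponential_smoothing(power_readings, alpha=0.5)
print(smoothed)        # [10, 11.0, 12.5, 14.25]
print(smoothed[-1])    # forecast for the next period
```

A larger `alpha` makes the forecast react faster to recent values; a smaller one smooths out more noise.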
Application:
Example:
Consider the database shown in fig 1 (this database has been sorted on customer-id and
transaction-time). Fig 2 shows this database expressed as a set of customer sequences.
With minimum support set to 25%, i.e. a minimum support of 2 customers, two
sequences, <(30) (90)> and <(30) (40 70)>, are maximal among those satisfying the
support constraint and are the desired sequential patterns.
The sequential pattern <(30) (90)> is supported by customers 1 and 4. Customer 4 buys
items (40 70) in between items 30 and 90, but supports the pattern <(30) (90)> since we
are looking for patterns that are not necessarily contiguous. The sequential pattern
<(30) (40 70)> is supported by customers 2 and 4. Customer 2 buys 60 along with 40 and
70, but supports this pattern since (40 70) is a subset of (40 60 70).
An example of a sequence that does not have minimum support is the sequence
<(10 20) (30)>, which is only supported by customer 2. The sequences
<(30)>, <(40)>, <(70)>, <(90)>, <(30) (40)>, <(30) (70)> and <(40 70)>, though having
minimum support, are not in the answer because they are not maximal.
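The support test used in this example can be sketched as follows: a customer supports a pattern if each element of the pattern is a subset of some transaction, with the matching transactions appearing in order. The sequences for customers 1, 2 and 4 are reconstructed from the description above; customers 3 and 5 are assumed for illustration because the figures are not reproduced here, and the function names are hypothetical.

```python
def supports(customer_sequence, pattern):
    """True if every pattern element is a subset of a distinct, later transaction."""
    position = 0
    for element in pattern:
        # Skip transactions that do not contain this element.
        while (position < len(customer_sequence)
               and not element <= customer_sequence[position]):
            position += 1
        if position == len(customer_sequence):
            return False
        position += 1  # the next element must match a strictly later transaction
    return True


customers = {
    1: [{30}, {90}],
    2: [{10, 20}, {30}, {40, 60, 70}],
    3: [{30, 50, 70}],          # assumed
    4: [{30}, {40, 70}, {90}],
    5: [{90}],                  # assumed
}


def supporters(pattern):
    return sorted(cid for cid, seq in customers.items() if supports(seq, pattern))


print(supporters([{30}, {90}]))       # [1, 4]
print(supporters([{30}, {40, 70}]))   # [2, 4]
print(supporters([{10, 20}, {30}]))   # [2]
```

Matching the earliest possible transaction for each element is enough here, because taking an earlier match never rules out a later one.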
Apriori-based Approaches:
1. GSP (a sequential pattern mining algorithm based on candidate generate-and-test): It
integrates time constraints and relaxes the definition of a transaction; it also considers
knowledge of taxonomies.
2. SPADE (Sequential PAttern Discovery using Equivalence classes): SPADE is an algorithm
proposed to find frequent sequences using efficient lattice search techniques and simple
joins.
Mining Object:
A data mining object is only an empty container until it has been processed. Processing a data
mining model is also called training.
Processing mining structures: A mining structure gets data from an external data source, as
defined by the column bindings and usage metadata, and reads the data. This data is read in full
and then analyzed to extract various statistics. Analysis Services stores a compact representation
of the data, which is suitable for analysis by data mining algorithms, in a local cache. You can
either keep this cache or delete it after your models have been processed. By default, the cache is
stored. For more information, see Process a Mining Structure.
Processing mining models: A mining model is empty, containing definitions only, until it is
processed. To process a mining model, the mining structure that it is based on must have been
processed. The mining model gets the data from the mining structure cache, applies any filters
that may have been created on the model, and then passes the data set through the algorithm to
detect patterns. After the model is processed, the model stores only the results of processing, not
the data itself. For more information, see Process a Mining Model.
The following diagram illustrates the flow of data when a mining structure is processed, and
when a mining model is processed.
Creating a Data Mining Extensions (DMX) query on the model and drilling through to
the structure. For more information, see SELECT FROM <model>.CASES (DMX).
Browsing a model based on the structure, and using one of the options in the user
interface to drill through to structure cases. For more information, see Data Mining
Model Viewers, or Drill Through to Case Data from a Mining Model.
Creating a DMX query on the structure cases. For more information, see SELECT
FROM <structure>.CASES.
After a mining model has been processed, it contains only the patterns that were derived from
analysis, and mappings from the model results to the cached training data. You can browse or
query the model results, called model content, or you can query the model and structure cases, if
they have been cached.
The model content for each mining model depends on the algorithm that was used to create it.
For example, if one model is a clustering model and another is a decision trees model, the model
content is very different even though the models use exactly the same data. For more
information, see Mining Model Content (Analysis Services - Data Mining).
Processing Requirements
Processing requirements may differ depending on whether your mining models are based solely
on relational data or on a multidimensional data source.
For a relational data source, processing requires only that you create training data and run
mining algorithms on that data. However, mining models that are based on OLAP objects, such as
dimensions and measures, require that the underlying data be in a processed state. This may
require that the multidimensional objects be processed to populate the mining model.
Challenges involved in spatial data mining include identifying patterns or finding objects that are
relevant to the questions that drive the research project. Analysts may be searching a large database
field or another extremely large data set to find just the relevant data, using GIS/GPS tools or
similar systems.
One interesting thing about the term "spatial data mining" is that it is generally used to talk about
finding useful and non-trivial patterns in data. In other words, just setting up a visual map of geographic
data may not be considered spatial data mining by experts. The core goal of a spatial data mining project
is to distinguish the information in order to build real, actionable patterns to present, excluding things
like statistical coincidence, randomized spatial modeling or irrelevant results. One way analysts may do
this is by combing through data looking for "same-object" or "object-equivalent" models to provide
accurate comparisons of different geographic locations.
Data Mining is a popular subject among Customer-focused companies. Many companies rely on
Data to target customers based on their personal preferences to maximize profits. Data Mining is
a broader term that means mining the Data and extracting the information, which can help while
making decisions, marketing strategies, building new customer relationships, and much more.
Data Mining is a process of finding patterns and extracting useful information from large data
sets by transforming the data with a set of business rules. With the help of Data Mining
procedures, raw datasets are converted into valuable datasets, which developers can further
analyze to determine patterns.
Data Mining is an effective procedure for any organization as it helps improve the marketing
strategies and helps them target the customer base based on the data. With the help of structured
data, it also allows you to study different aspects of data and then get more innovative ideas to
increase productivity and sales.
The Data Mining process breaks down into the following steps –
1. Collect, Extract, Transform and Load the data into the data warehouse
2. Store and manage the data in the database or on the cloud.
3. Provide access to data to the business analyst, management teams, and Information
Technology professionals.
Text Mining is a subset of Data Mining, and it involves the processing of data from various text
documents. It is the process of transforming unstructured text into a structured format and
interpreting these data to identify patterns. In Text Mining, various deep learning algorithms are
used to evaluate the text and generate useful information effectively.
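One simple way to turn unstructured text into a structured format is a term-document matrix, where each row is a document and each column counts one word. The sketch below is a minimal bag-of-words version; the function names, the tokenization rule, and the two sample sentences are assumptions for illustration.

```python
from collections import Counter
import re


def tokenize(text):
    # Lowercase the text and keep runs of letters, digits and apostrophes.
    return re.findall(r"[a-z0-9']+", text.lower())


def term_document_matrix(documents):
    counts = [Counter(tokenize(doc)) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    rows = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, rows


docs = ["Data mining finds patterns.", "Text mining finds patterns in text."]
vocabulary, rows = term_document_matrix(docs)
print(vocabulary)
for row in rows:
    print(row)
```

Once the text is in this tabular form, the usual data mining techniques such as clustering or classification can be applied to the rows.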
The basic idea behind Text Mining is to find patterns in large datasets that can be used for
various purposes. Text Mining requires both sophisticated linguistic and statistical techniques
to analyze unstructured text data and provide valuable insights. Text mining consists of a
wide variety of methods and technologies such as:
Web Mining is a process of extracting various useful information readily available on the
Internet (or World Wide Web). Web Mining is a subset of Data Mining. It helps to analyze user
activities on different web pages and track them over a period of time to understand customers’
behavior and surfing patterns. Web Mining is broadly categorized into three main subcategories.
There are three main types of Web Data, as shown in the above image. Let's discuss these Web
Data types in brief.
1. Web Content Data: Common forms of Web Content data are HTML, web
pages, images, etc. All these various data types constitute Web Content data. The main
layout format for Internet/Web content is HTML; rendering differs slightly from browser
to browser, but the basic layout structure is the same everywhere.
2. Web Structure Data: On a typical web page, the contents are arranged within HTML
tags. The pages are hyperlinked, allowing users to navigate back and forth to find
relevant information. The relationships/links describing the connections between
webpages are web structure data.
3. Web Usage Data: Web usage data is generated mainly by the Web Server and Application
Server behind a typical web page. The Web/Application server collects log data, including
information about the users such as their geographical location, time, the content they
interacted with, etc. The data in these log files is categorized into three types based on
the source it comes from:
Server-side
Client-side
Proxy side.
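Server-side usage data can be illustrated by parsing web server log lines and counting page hits. The sketch assumes the server writes the Common Log Format; the addresses, paths, and the names `LOG_PATTERN` and `parse_line` are made up for illustration.

```python
from collections import Counter
import re

# host ident user [time] "method path protocol" status size
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_line(line):
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None  # None for malformed lines


log_lines = [
    '192.0.2.1 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326',
    '192.0.2.2 - - [10/Oct/2023:13:56:01 +0000] "GET /products.html HTTP/1.1" 200 1045',
    '192.0.2.1 - - [10/Oct/2023:13:57:12 +0000] "GET /products.html HTTP/1.1" 200 1045',
    'malformed line',
]
records = [r for r in map(parse_line, log_lines) if r is not None]
page_hits = Counter(r["path"] for r in records)
print(page_hits.most_common())
```

The same parsed records could be grouped by host and ordered by time to reconstruct each visitor's navigation path.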
Now that you have a brief understanding of Data Mining, Text Mining, and Web Mining, this
section describes the differences between them, which will help you better understand these
different Mining types. The key differences between Data Mining, Text Mining, and Web Mining
are listed below:
The data mining process extracts, transforms, and loads the data into the data warehouse.
Business users then use data mining tools to present the analyzed data in a presentable form
such as tables, graphs, or charts. Data points such as currencies, dates, and names are easy to
link and do not require understanding their context.
On the other hand, Text mining processes texts in the form of text documents, emails, social
media posts, etc. Text mining also faces significant challenges with linguistic texts and SMS
language.
Web Data Mining is a technique that extracts data from the Web. It can use data from Web
servers or from web page scraping. Web Mining has to deal with many log files to extract
relevant information.
Data mining mainly focuses on data-dependent activities such as accounting, purchasing, CRM,
etc. The Data is easily accessible and homogeneous. Once the algorithm is determined, it is
easier to process the data and extract the relevant information.
On the other hand, Text mining is a complex process requiring a long time to deploy. Text
mining includes several steps like language guessing, tokenization, text segmentation, etc.
In Web Data mining, the entire Data is based on the logs collected from the Web Servers.
Analyzing these logs is a complex process, as logs generally contain too much information;
hence several business rules must be pre-determined before extracting data from the
weblogs.
Data mining is a robust industrial technology that has been used for decades.
Text Mining, on the other hand, was a complex, domain-specific, and language-specific
tool, and hence it was never valued as a ‘must-have.’
Web Mining is a relatively new process, and it came into existence after the origin of the World
Wide Web. Web Mining is considered to be a critical mining aspect in terms of understanding
user behavior over the internet.
With the advent of digitalization, the rise of social networks, and increased connectivity,
companies are now more concerned about their online reputation. They are looking for ways to
increase customer loyalty in a world of increasing choices.
| Basis for Comparison | Data Mining | Text Mining | Web Mining |
| --- | --- | --- | --- |
| Concept | Data mining is the statistical technique of processing the raw data into the structural form. | Text mining is the subset of Data Mining that involves processing unstructured text documents into a structured format. | Web mining is a subset of Data Mining that involves processing the data related to the Web. It can be Web Logs, Web Structure data, or Web Content data. |
| Data Retrieval | Data is mined and then stored in the data warehouse. The data stored in Databases and spreadsheets are used to gather information and perform analysis. | Text Data are stored in Text Documents, emails, and logs and then processed to gather high-quality information. | Web Data can be in the form of Structure, Content, and usage data and is later converted into useful information. |
| Types of Data | Knowledge is discovered from structured data, which is homogeneous and easy to access. | Text Mining involves data from text documents, emails, logs, PDFs, etc. | Web mining mainly deals with three types of data, i.e., Web Structure Data, Web Content Data, and Web Usage Data. |
| Application | Data Mining is used in fields like medicine, marketing, healthcare, etc. | Text Mining is used in fields like customer profile analysis, bioscience, etc. | Web Mining is used to extract information from the web, analyze weblogs, etc. |
| Data Format | In Data Mining, the data is stored in a structured format. | In Text Mining, the data is stored in an unstructured format. | In Web Mining, the data is structured as well as unstructured. The data format depends upon the type of Mining method. |
| Skills Required | To retrieve meaningful data from Data Mining, one must be aware of Data cleansing techniques, machine learning algorithms, statistics, and probability. | Text mining requires pattern recognition techniques and Natural language processing to enrich the meaning of the text. | In web mining, Application-level knowledge, Data engineering, statistics, and probability are required to successfully retrieve the information from weblogs. |
| Techniques Used | Statistical techniques are most helpful in analyzing data. | In Text Mining, Computational linguistic principles are used to evaluate the meaning of the text. | In web mining, Sequential pattern, clustering, and associative mining principles are used. |
Multimedia data mining is an interdisciplinary field that integrates image processing and
understanding, computer vision, data mining, and pattern recognition. Multimedia data mining
discovers interesting patterns from multimedia databases that store and manage large collections
of multimedia objects, including image data, video data, audio data, sequence data and hypertext
data containing text, text markups, and linkages. Issues in multimedia data mining include
content-based retrieval and similarity search, generalization and multidimensional analysis.
Multimedia data cubes contain additional dimensions and measures for multimedia information.
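Content-based retrieval and similarity search can be sketched by comparing feature vectors extracted from multimedia objects. The example below ranks stored items by cosine similarity to a query vector; the three-value "colour histogram" features, the file names, and the function names are hypothetical values chosen for illustration.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def most_similar(query, database, top_n=2):
    # Rank stored objects by similarity of their feature vectors to the query.
    ranked = sorted(database.items(),
                    key=lambda item: cosine_similarity(query, item[1]),
                    reverse=True)
    return [name for name, _ in ranked[:top_n]]


# Hypothetical (red, green, blue) histogram features per image.
image_features = {
    "sunset.jpg": [0.8, 0.1, 0.1],
    "forest.jpg": [0.1, 0.8, 0.1],
    "beach.jpg": [0.5, 0.2, 0.3],
}
print(most_similar([0.9, 0.05, 0.05], image_features))
```

In practice the feature vectors would come from a feature extraction step applied to the images, video frames, or audio clips.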
The framework that manages different types of multimedia data stored, delivered, and utilized in
different ways is known as a multimedia database management system. There are three classes of
multimedia databases: static, dynamic, and dimensional media. The content of the Multimedia
Database management system is as follows:
1. Modelling: Work in this area can improve database versus information retrieval
techniques; documents constitute a specialized area and deserve special
consideration.
2. Design: The conceptual, logical and physical design of multimedia databases has not yet
been addressed fully, because performance and tuning issues at each level are far more
complex. Multimedia databases hold a variety of formats like JPEG, GIF, PNG, MPEG, which
are not easy to convert from one form to another.
3. Storage: Storage of a multimedia database on any standard disk presents the problems of
representation, compression, mapping to device hierarchies, archiving and buffering
during input-output operations. In a DBMS, a BLOB (Binary Large Object) facility allows
untyped bitmaps to be stored and retrieved.
4. Performance: Physical limitations dominate an application involving video playback or
audio-video synchronization. The use of parallel processing may alleviate some
problems, but such techniques are not yet fully developed. Apart from this, a multimedia
database consumes a lot of processing time and bandwidth.
5. Queries and retrieval: For multimedia data like images, video, and audio, accessing data
through queries opens up many issues, such as efficient query formulation, query execution
and optimization, which need to be worked upon.
A multimedia database is applied in areas such as:
Documents and record management: Industries and businesses keep detailed records
and various documents. For example, insurance claim records.
Knowledge dissemination: A multimedia database is a very effective tool for knowledge
dissemination in terms of providing several resources. For example, electronic books.
1. Text Mining
Text is the most general medium for the exchange of information. Text Mining evaluates a huge
amount of natural language text and detects patterns to find useful information. Text Mining,
also referred to as text data mining, is used to find meaningful information in unstructured
texts from various sources.
2. Image Mining
Image mining systems can discover meaningful information or image patterns from a huge
collection of images. Image mining deals with how the low-level pixel representation contained
in a raw image or image sequence can be processed to recognize high-level spatial objects and
relationships. It draws on digital image processing, image understanding, databases, AI, etc.
3. Video Mining
Video mining is the unsupervised discovery of interesting patterns in large amounts of video
data. Video data combines several kinds of multimedia data such as text, image, metadata,
visuals and audio. It is commonly used in security and surveillance, entertainment, medicine,
sports and education programs. The processing steps are indexing, automatic segmentation,
content-based retrieval, classification and detecting triggers.
4. Audio Mining
Audio mining plays an important role in multimedia applications. It is a technique by which the
content of an audio signal can be automatically searched, analyzed and decomposed with wavelet
transformation. It is generally used in automatic speech recognition, where the analysis tries
to find any speech within the audio. Band energy, frequency centroid, zero-crossing rate, pitch
period and bandwidth are features often used for audio processing.
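Of the audio features listed, the zero-crossing rate is the simplest to compute: it is the fraction of adjacent sample pairs in which the signal changes sign. A minimal sketch, with made-up sample values and an assumed function name:

```python
def zero_crossing_rate(samples):
    """Fraction of adjacent sample pairs whose signs differ."""
    if len(samples) < 2:
        return 0.0
    crossings = sum(
        1
        for previous, current in zip(samples, samples[1:])
        if (previous >= 0) != (current >= 0)
    )
    return crossings / (len(samples) - 1)


print(zero_crossing_rate([1, -1, 1, -1, 1]))             # 1.0
print(zero_crossing_rate([1, 2, 3, 2, 1]))               # 0.0
print(zero_crossing_rate([0.5, 0.2, -0.1, -0.4, 0.3]))   # 0.5
```

Noisy or unvoiced sounds tend to give a higher rate than smooth, low-pitched ones, which is why the feature is useful for separating speech from other audio.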
There are different kinds of applications of multimedia data mining, some of which are as
follows:
Digital Library: A digital library stores and maintains a collection of digital data,
which makes it essential to convert different digital data formats into text, images,
video, audio, etc.
Traffic Video Sequences: To determine important but previously unidentified
knowledge from the traffic video sequences, detailed analysis and mining are to be
performed based on vehicle identification, traffic flow, and queue temporal relations of
the vehicle at an intersection. This provides an economic approach for regular traffic
monitoring processes.
Medical Analysis: Multimedia mining is primarily used in the medical field, particularly
for analyzing medical images. Various data mining techniques are used for image
classification. Examples, Automatic 3D delineation of highly aggressive brain tumours,
Automatic localization and identification of vertebrae in 3D CT scans, MRI Scans, ECG
and X-Ray.
Customer Perception: It contains details about customers' opinions of products or
services, customers' complaints, customers' preferences, and the level of customer
satisfaction with products or services, which are collected together. Many companies have
call centres that receive telephone calls from customers; the audio data serves for topic
detection, resource assignment and evaluation of the quality of services.
Media Making and Broadcasting: Radio stations and TV channels are run by broadcasting
companies, and multimedia mining can be applied to monitor their content to search for
more efficient approaches and improve their quality.
Surveillance system: It consists of collecting, analyzing and summarizing audio, video or
audiovisual information about specific areas like government organizations, multinational
companies, shopping malls, banks, forests, agricultural areas, highways, etc.
The main use of this technology is in the field of security; hence it can be utilized by
the military, police and private companies, since they provide security services.
The below image shows the present architecture, which includes the types of the multimedia
mining process. Data collection is the initial stage of the learning system. Pre-processing
extracts significant features from raw data; it includes data cleaning, transformation,
normalization, feature extraction, etc. Learning can be direct if informative types can be
recognized at the preprocessing stage. The complete process depends heavily on the nature of
the raw data and the problem domain. The product of preprocessing is the training set. A
learning model must be selected for the specified training set to learn from it and make the
multimedia model more stable.
Converting Un-structured data to structured data: Data that resides in a fixed field within a
record or file is called structured data, and these data are stored in sequential form.
Structured data is easily entered, stored, queried and analyzed. Unstructured data is a
bitstream, for example, pixel representation for an image, audio or video and character
representation for text. These files may have an internal structure, but they are still
considered "unstructured" because their data does not fit neatly in a database. For example,
images and videos of different objects have some similarities - each represents an
interpretation of a building without a clear structure.
Current data mining tools operate on structured data, which resides in huge volumes in
relational databases, while data in multimedia databases is semi-structured or unstructured.
Hence, the semi-structured or unstructured multimedia data is converted into structured data,
and then the current data mining tools are used to extract the knowledge. The sequence or time
element is what differs between unstructured and structured data mining. The architecture for
converting unstructured data to structured data, which is used for extracting information from
the unstructured database, is shown in the above image. Data mining tools are then applied to
the stored structured databases.
Multimedia mining architecture is given in the below image. The architecture has several
components. Important components are Input, Multimedia Content, Spatiotemporal
Segmentation, Feature Extraction, Finding similar Patterns, and Evaluation of Results.
1. The input stage comprises a multimedia database used to find the patterns and perform
the data mining.
2. Multimedia Content is the data selection stage that requires the user to select the
databases, subset of fields, or data for data mining.
3. Spatio-temporal segmentation deals with moving objects in the image sequences of
videos, and it is useful for object segmentation.
4. Feature extraction is the preprocessing step that involves integrating data from various
sources and making choices about how to characterize or code certain data fields so that
they can serve as inputs to the pattern-finding stage. Such representation choices are
required because certain fields could include data at various levels and are not considered
in the finding similar patterns stage. In MDM, the preprocessing stage is significant
because of the unstructured nature of multimedia records.
5. Finding similar patterns is the heart of the whole data mining process. The hidden
patterns and trends in the data are uncovered in this stage. Approaches used in this
stage include association, classification, clustering, regression,
time-series analysis and visualization.
6. Evaluation of Results is the stage used to evaluate the results; it is important for
determining whether a prior stage must be revisited. This stage
consists of reporting and using the extracted knowledge to produce new actions, products,
services, or marketing strategies.
The models used to mine multimedia data are very important. Four multimedia mining models are
commonly used: classification, association rule, clustering and statistical modelling.
1. Classification: Classification is a technique for multimedia data analysis that learns
from the properties of a specified set of multimedia objects. Each object is assigned to one
of a set of predefined class labels. Classification organizes data into categories for more
effective and efficient use; it creates a function that maps a data item into one of many
predefined classes by taking a training data set as input and
building a model of the class attribute based on the rest of the attributes. Decision tree
classification is intuitive in that it matches the user's conceptual model without loss of
accuracy. Hidden Markov Models are used to classify multimedia data such as images and
videos, for example as indoor or outdoor games.
2. Association Rule: Association rule mining is one of the most important data mining techniques
and helps find relations between data items in huge databases. There are two types of
associations in multimedia mining: those involving image content features and those involving
non-image content features. Mining the frequently occurring patterns between different images
becomes mining the repeated patterns in a set of transactions. Multi-relational association
rule mining is used to display multiple reports for the same image. Multiple-level
association rule techniques are also used in image classification.
3. Clustering: Cluster analysis divides the data objects into multiple groups or clusters,
so that similar objects fall into the same group. In multimedia mining, the
clustering technique can be applied to group similar images, objects, sounds, videos and
texts. Clustering algorithms can be divided into several methods: hierarchical methods,
density-based methods, grid-based methods, model-based methods, k-means algorithms,
and graph-based models.
4. Statistical Modeling: Statistical mining models determine the statistical validity of test
parameters and have been used to test hypotheses, undertake correlation studies, and
transform and prepare data for further analysis. This approach is used to establish links between
words and partitioned image regions to form a simple co-occurrence model.
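The clustering model described above can be illustrated with a minimal k-means sketch. It assumes each multimedia object has already been reduced to a small numeric feature vector; the `features` values, the deterministic "first k points" initialization and the fixed iteration count are simplifications chosen for illustration, not part of any particular multimedia mining system.

```python
def nearest(point, centroids):
    """Return the index of the centroid closest to point (squared Euclidean distance)."""
    distances = [sum((a - b) ** 2 for a, b in zip(point, c)) for c in centroids]
    return distances.index(min(distances))

def kmeans(points, k, iterations=10):
    """Plain k-means; the first k points are used as initial centroids (an assumption)."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iterations):
        # Assignment step: put every point into the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # Update step: move each centroid to the mean of its members.
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = [sum(dim) / len(members) for dim in zip(*members)]
    return [nearest(p, centroids) for p in points], centroids

# Hypothetical 2-D feature vectors, e.g. (share of warm colours, share of cool colours).
features = [(0.9, 0.1), (0.1, 0.8), (0.8, 0.2), (0.2, 0.9), (0.85, 0.15)]
labels, centroids = kmeans(features, k=2)
print(labels)
```

Objects with the same label form one group of similar items. Real systems usually pick initial centroids at random and repeat the run several times.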
Major issues in multimedia data mining include content-based retrieval, similarity search,
multidimensional analysis, classification, prediction analysis and mining associations in multimedia
data.
A description-based retrieval system builds indices and performs object retrieval based on image
descriptions, such as keywords, captions, size, and creation time.
A content-based retrieval system supports retrieval based on image content, for example, colour
histogram, texture, shape, objects, and wavelet transform.
Use of content-based retrieval systems: Visual features are used to index images and to support
object retrieval based on feature similarity, which is very desirable in various applications. These
applications include diagnosis, weather prediction, TV production and internet search
engines for pictures and e-commerce.
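Content-based retrieval by colour histogram can be sketched as follows. This is a minimal illustration under stated assumptions: images are given as plain lists of RGB pixels, colours are quantised into 4 bins per channel, and similarity is measured by histogram intersection. The image names and pixel values are hypothetical.

```python
from collections import Counter

def colour_histogram(pixels, bins_per_channel=4):
    """Quantise RGB pixels (0-255 per channel) into a normalised histogram."""
    step = 256 // bins_per_channel
    counts = Counter((r // step, g // step, b // step) for r, g, b in pixels)
    total = len(pixels)
    return {bin_: n / total for bin_, n in counts.items()}

def histogram_intersection(h1, h2):
    """Similarity in [0, 1]: sum over bins of the smaller of the two bin values."""
    return sum(min(v, h2.get(k, 0.0)) for k, v in h1.items())

def retrieve(query_pixels, database, top_k=2):
    """Rank the database images by histogram similarity to the query image."""
    q = colour_histogram(query_pixels)
    scored = [(histogram_intersection(q, colour_histogram(p)), name)
              for name, p in database.items()]
    return [name for score, name in sorted(scored, reverse=True)[:top_k]]

# Hypothetical toy "images": short lists of RGB pixels.
database = {
    "sunset": [(250, 120, 30)] * 8 + [(200, 60, 20)] * 2,
    "forest": [(20, 150, 40)] * 9 + [(30, 90, 30)] * 1,
    "beach": [(240, 200, 120)] * 5 + [(250, 120, 30)] * 5,
}
query = [(255, 125, 25)] * 10
print(retrieve(query, database))
```

Here the orange query image is closest to "sunset", then "beach", because they share the most mass in the same colour bins. Texture, shape or wavelet features would be compared in the same way, using a different feature extractor.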
2. Multidimensional Analysis
To perform multidimensional analysis of large multimedia databases, multimedia data cubes may
be designed and constructed similarly to traditional data cubes from relational data. A
multimedia data cube has several dimensions. Examples are the size of the image or video in
bytes; the width and height of the frames (two dimensions); the date on which the image or
video was created or last modified; the format type of the image or video; the frame sequence
duration in seconds; the Internet domain of pages referencing the image or video; the keywords;
a colour dimension; and an edge orientation dimension. A multimedia data cube can have additional
dimensions and measures for multimedia data, such as colour, texture, and shape.
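A roll-up over a multimedia data cube can be sketched with a few lines of Python. The records, field names and the two measures (count and total bytes) below are hypothetical; the sketch only shows how choosing a subset of dimensions aggregates the cube cells.

```python
from collections import defaultdict

# Hypothetical image records; the field names are illustrative only.
images = [
    {"format": "jpeg", "year": 2020, "domain": "edu", "bytes": 120_000},
    {"format": "jpeg", "year": 2021, "domain": "com", "bytes": 300_000},
    {"format": "png", "year": 2021, "domain": "com", "bytes": 80_000},
    {"format": "png", "year": 2021, "domain": "edu", "bytes": 50_000},
]

def roll_up(records, dimensions):
    """Aggregate count and total size for each combination of the chosen dimensions."""
    cells = defaultdict(lambda: {"count": 0, "bytes": 0})
    for rec in records:
        key = tuple(rec[d] for d in dimensions)
        cells[key]["count"] += 1
        cells[key]["bytes"] += rec["bytes"]
    return dict(cells)

by_format = roll_up(images, ["format"])
by_format_year = roll_up(images, ["format", "year"])
print(by_format)
print(by_format_year)
```

Dropping a dimension from the list (for example going from format and year to format only) corresponds to a roll-up; adding one corresponds to a drill-down.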
MultiMediaMiner is a multimedia data mining system prototype, an extension of the
DBMiner system that handles multimedia data. The Image Excavator component of
MultiMediaMiner uses image contextual information, like HTML tags on Web pages, to derive
keywords. By navigating online directory structures, like the Yahoo! directory, it is possible to build
hierarchies of keywords mapped onto the directories in which the image was found.
Classification and predictive analysis have been used for mining multimedia data, particularly in
scientific analysis like astronomy, seismology, and geoscientific analysis. Decision tree
classification is an important method in reported image data mining applications. For example,
sky images that astronomers have carefully classified can be used as the training set. From it,
one can create models for recognizing galaxies, stars and other stellar objects based on properties like
magnitudes, areas, intensity, image moments and orientation.
Image data mining classification and clustering are closely connected to image analysis and
scientific data mining. Image data often come in large volumes and need substantial
processing power, such as parallel and distributed processing. Hence, many image analysis
techniques and scientific data analysis methods can be applied to image data mining.
Association rules involving multimedia objects have been mined in image and video
databases. Three points should be noted:
First, an image contains multiple objects, each with various features such as colour, shape,
texture, keyword, and spatial locations, so that many possible associations can be made. Second,
an image may contain multiple repeated objects, which is an essential feature in image analysis, so the
recurrence of similar objects should not be ignored in association analysis. Third, the spatial
relationships among the objects in multimedia images can be used to discover object
associations and correlations. With the associations between multimedia objects, we can treat
every image as a transaction and find commonly occurring patterns among different images.
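Treating every image as a transaction, as described above, can be sketched with a brute-force frequent-itemset counter. This is not a full Apriori implementation, and it makes two loudly stated simplifications: each image is represented only by a list of hypothetical object labels, and repeated objects within one image are collapsed to a single item, which drops exactly the recurrence information the text says should not be ignored.

```python
from collections import Counter
from itertools import combinations

def frequent_itemsets(transactions, min_support, max_size=2):
    """Count itemsets up to max_size and keep those whose support meets min_support."""
    counts = Counter()
    for items in transactions:
        unique = sorted(set(items))  # simplification: recurrence within an image is dropped
        for size in range(1, max_size + 1):
            counts.update(combinations(unique, size))
    n = len(transactions)
    return {itemset: c / n for itemset, c in counts.items() if c / n >= min_support}

# Hypothetical object labels detected in four images.
images = [
    ["sky", "sun", "sea"],
    ["sky", "sea", "boat"],
    ["sky", "tree"],
    ["sea", "boat", "sky"],
]
patterns = frequent_itemsets(images, min_support=0.75)
print(patterns)
```

With a minimum support of 0.75, only "sky", "sea" and the pair ("sea", "sky") survive, since they occur in at least three of the four images.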
Over the last few years, the World Wide Web has become a significant source of information
and simultaneously a popular platform for business. Web mining can be defined as the method of
utilizing data mining techniques and algorithms to extract useful information directly from the
web, such as Web documents and services, hyperlinks, Web content, and server logs. The World
Wide Web contains a large amount of data that provides a rich source for data mining. The
objective of Web mining is to look for patterns in Web data by collecting and examining data in
order to gain insights.
Web content mining can be used to extract useful data, information, and knowledge from web
page content. In web content mining, each web page is considered as an individual document.
One can take advantage of the semi-structured nature of web pages, as HTML
provides information that concerns not only the layout but also the logical structure. The primary
task of content mining is data extraction, where structured data is extracted from unstructured
websites. The objective is to facilitate data aggregation over various web sites by using the
extracted structured data. Web content mining can be utilized to distinguish topics on the web.
For example, if a user searches for a specific topic on a search engine, the user will get a
list of suggestions.
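The data extraction task can be sketched with Python's standard `html.parser` module. The class name `LinkAndTitleExtractor`, the sample page and the choice of extracted fields (title and link targets) are assumptions made for illustration; the point is only that HTML tags expose logical structure that can be turned into a structured record.

```python
from html.parser import HTMLParser

class LinkAndTitleExtractor(HTMLParser):
    """Collect the page title and all hyperlink targets from an HTML document."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Only text that appears inside <title> is kept.
        if self._in_title:
            self.title += data.strip()

# Hypothetical page content.
page = """<html><head><title>Data Mining Notes</title></head>
<body><h1>Unit 5</h1>
<a href="/streams">Streams</a> <a href="/web-mining">Web mining</a>
</body></html>"""

parser = LinkAndTitleExtractor()
parser.feed(page)
record = {"title": parser.title, "links": parser.links}
print(record)
```

Records of this shape, extracted from many sites, are what makes aggregation across web sites possible.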
Web structure mining can be used to discover the link structure of hyperlinks. It is used to identify
how web pages are connected, either by direct links or through a network of links. In web structure
mining, one considers the web as a directed graph, with the web pages being the vertices that are
connected by hyperlinks. The most important application in this regard is the Google search engine,
which estimates the ranking of its results primarily with the PageRank algorithm. PageRank
considers a page to be highly relevant when it is frequently linked to by other highly
relevant pages. Structure and content mining methodologies are usually combined. For example,
web structure mining can be beneficial to organizations to regulate the network between two
commercial sites.
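The idea behind PageRank on a directed graph can be sketched with a short power iteration. This is a simplified textbook version, not Google's production algorithm; the damping factor of 0.85, the fixed 50 iterations and the four-page graph `web` are assumptions chosen for illustration.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration; graph maps each page to its out-links."""
    pages = list(graph)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share (the "random jump").
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, out_links in graph.items():
            if out_links:
                # A page splits its damped rank evenly among the pages it links to.
                share = damping * rank[page] / len(out_links)
                for target in out_links:
                    new_rank[target] += share
            else:
                # A page without out-links spreads its rank over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical web graph: every link target also appears as a key.
web = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
ranks = pagerank(web)
for page, score in sorted(ranks.items(), key=lambda kv: kv[1], reverse=True):
    print(page, round(score, 3))
```

Page C ends up with the highest score because three pages link to it, while D, which no page links to, keeps only the baseline share.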
Web usage mining is used to extract useful data, information, and knowledge from weblog
records, and assists in recognizing user access patterns for web pages. In mining the usage
of web resources, one considers the records of requests made by visitors to a website, which
are often collected as web server logs. While the content and structure of the collection of web
pages follow the intentions of the authors of the pages, the individual requests demonstrate how
the consumers see these pages. Web usage mining may disclose relationships that were not
intended by the creator of the pages.
Some of the methods to identify and analyze web usage patterns are given below:
The analysis of preprocessed data can be accomplished through session analysis, which incorporates
the visitor records, days, times, sessions, etc. This data can be utilized to analyze the visitor's
behavior.
A report is created after this analysis, which contains the details of frequently visited web
pages and common entry and exit points.
OLAP can be performed on various parts of the log-related data for a specific period.
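Session analysis of preprocessed log data can be sketched as follows. The log tuples, the visitor identifiers and the 30-minute inactivity timeout are assumptions made for illustration; real server logs need parsing and cleaning first.

```python
from collections import Counter
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)  # assumed inactivity gap that ends a session

# Hypothetical preprocessed log records: (visitor, timestamp, requested page).
log = [
    ("10.0.0.1", "2024-01-01 09:00", "/home"),
    ("10.0.0.1", "2024-01-01 09:05", "/products"),
    ("10.0.0.2", "2024-01-01 09:10", "/home"),
    ("10.0.0.1", "2024-01-01 11:00", "/home"),
    ("10.0.0.1", "2024-01-01 11:02", "/cart"),
    ("10.0.0.2", "2024-01-01 09:12", "/products"),
]

def sessionize(records, timeout=SESSION_TIMEOUT):
    """Group each visitor's requests into sessions separated by long inactivity."""
    sessions = []
    last_seen = {}  # visitor -> (last timestamp, index of that visitor's current session)
    parsed = sorted((v, datetime.strptime(t, "%Y-%m-%d %H:%M"), p) for v, t, p in records)
    for visitor, ts, page in parsed:
        prev = last_seen.get(visitor)
        if prev is None or ts - prev[0] > timeout:
            sessions.append([page])  # start a new session
            index = len(sessions) - 1
        else:
            index = prev[1]
            sessions[index].append(page)
        last_seen[visitor] = (ts, index)
    return sessions

sessions = sessionize(log)
page_visits = Counter(page for s in sessions for page in s)
entry_pages = Counter(s[0] for s in sessions)
exit_pages = Counter(s[-1] for s in sessions)
print(sessions)
print(page_visits.most_common(), entry_pages, exit_pages)
```

The three counters correspond to the report contents named above: frequently visited pages, common entry points and common exit points.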
Web pages do not have a unifying structure. They are far more complex than
traditional text documents. There is an enormous number of documents in the digital library of
the web, and these documents are not organized in any specific order.
The data on the internet is updated rapidly. Examples include news, weather, shopping, financial
news, sports, and so on.
The user community on the web is expanding rapidly. These users have different interests,
backgrounds, and usage purposes. There are over a hundred million workstations
connected to the internet, and the number is still increasing rapidly.
Relevancy of data:
A particular person is generally interested in only a small portion of the web,
while the rest of the web contains data that is unfamiliar to the user and may
lead to unwanted results.
The size of the web is tremendous and rapidly increasing. It appears that the web is too large for
data warehousing and data mining.
The web consists of pages as well as hyperlinks pointing from one page to another. When the
creator of a Web page creates a hyperlink pointing to another Web page, this can be considered as
the creator's endorsement of the other page. The collective endorsement of a given page by various
creators on the web may indicate the significance of the page and may naturally lead to the
discovery of authoritative web pages. The web linkage data provide rich information about the
relevance, the quality, and the structure of the web's content, and thus are a rich source for web mining.
Web mining has extensive applications because of the various uses of the web. Some
applications of web mining are listed below.