Clustering


• Clustering is a data science technique in machine learning that groups similar rows in a data set.

Key characteristics of clustering:
1. Grouping the data based on information found in the data
2. Classifying objects into groups of related attributes
3. A bottom-up approach
4. Grouping objects according to logical relationships

Types of Data Used in Cluster Analysis

• Interval-scaled variables
• Binary variables
• Nominal, ordinal, and ratio variables
• Variables of mixed types

Interval-Scaled Variables

Interval-scaled variables are continuous measurements of a roughly linear scale.

Typical examples include weight and height, latitude and longitude coordinates (e.g., when
clustering houses), and weather temperature.

The measurement unit used can affect the clustering analysis. For example, changing
measurement units from meters to inches for height, or from kilograms to pounds for weight,
may lead to a very different clustering structure.
In general, expressing a variable in smaller units will lead to a larger range for that variable, and
thus a larger effect on the resulting clustering structure.

To help avoid dependence on the choice of measurement units, the data should be standardized.
Standardizing measurements attempts to give all variables an equal weight.

This is especially useful when given no prior knowledge of the data. However, in some
applications, users may intentionally want to give more weight to a certain set of variables than
to others.

For example, when clustering basketball player candidates, we may prefer to give more weight to
the variable height.
Binary Variables

A binary variable is a variable that can take only 2 values.

For example, a gender variable can generally take 2 values: male and female.

Contingency Table for Binary Data

Consider two objects i and j, each described by binary variables with values 0 and 1. A 2x2 contingency table counts how the variables agree:

              object j
                1    0
  object i  1   a    b
            0   c    d

where a is the number of variables equal to 1 for both objects, b the number equal to 1 for i and 0 for j, c the number equal to 0 for i and 1 for j, and d the number equal to 0 for both.

Let p = a + b + c + d (the total number of variables).

Simple matching coefficient (invariant, if the binary variable is symmetric):

d(i, j) = (b + c) / (a + b + c + d)

Jaccard coefficient (noninvariant if the binary variable is asymmetric):

d(i, j) = (b + c) / (a + b + c)
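As a minimal sketch, the two coefficients can be computed from the contingency counts a, b, c, d of two binary vectors (the vectors below are hypothetical example data):

```python
# Count the 2x2 contingency cells a, b, c, d for two binary vectors,
# then compute the simple matching and Jaccard coefficients.
def contingency(x, y):
    a = sum(1 for i, j in zip(x, y) if i == 1 and j == 1)
    b = sum(1 for i, j in zip(x, y) if i == 1 and j == 0)
    c = sum(1 for i, j in zip(x, y) if i == 0 and j == 1)
    d = sum(1 for i, j in zip(x, y) if i == 0 and j == 0)
    return a, b, c, d

def simple_matching(x, y):
    a, b, c, d = contingency(x, y)
    return (b + c) / (a + b + c + d)

def jaccard(x, y):
    a, b, c, _ = contingency(x, y)   # d (the 0-0 matches) is ignored
    return (b + c) / (a + b + c)

x = [1, 0, 1, 1, 0, 0]
y = [1, 1, 1, 0, 0, 0]
```

Here a = 2, b = 1, c = 1, d = 2, so the simple matching dissimilarity is 2/6 while the Jaccard dissimilarity is 2/4: dropping the 0-0 matches makes the asymmetric measure stricter.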

Nominal or Categorical Variables

A generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow,
blue, green.
Method 1: Simple matching

The dissimilarity between two objects i and j can be computed based on simple matching:

d(i, j) = (p - m) / p

where m is the number of matches (i.e., the number of variables for which i and j are in the same state) and p is the total number of variables.

Method 2: Use a large number of binary variables

Create a new binary variable for each of the M nominal states.
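The simple-matching dissimilarity of Method 1 can be sketched in a few lines; the color/shape/size attributes below are hypothetical:

```python
# d(i, j) = (p - m) / p for two objects described by nominal variables.
def nominal_dissimilarity(i, j):
    p = len(i)                                   # total number of variables
    m = sum(1 for a, b in zip(i, j) if a == b)   # number of matching states
    return (p - m) / p

d = nominal_dissimilarity(["red", "round", "small"],
                          ["red", "square", "small"])
```

Two of the three attributes match, so the dissimilarity is (3 - 2) / 3 = 1/3.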

Ordinal Variables

An ordinal variable can be discrete or continuous.

In ordinal variables, order is important, e.g., rank.

Ordinal variables can be treated like interval-scaled variables:

• Replace each value xif by its rank rif in {1, ..., Mf}.
• Map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by

  zif = (rif - 1) / (Mf - 1)

• Then compute the dissimilarity using methods for interval-scaled variables.
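The rank-to-[0, 1] mapping above is easy to sketch; the ordered grade labels are hypothetical example states:

```python
# Replace an ordinal value by its rank r, then map onto [0, 1]
# via z = (r - 1) / (M - 1).
GRADES = ["fair", "good", "excellent"]            # ordered states, M = 3

def to_unit_interval(value, states=GRADES):
    r = states.index(value) + 1                   # rank r in {1, ..., M}
    return (r - 1) / (len(states) - 1)

zs = [to_unit_interval(g) for g in ["fair", "excellent", "good"]]
```

"fair" maps to 0.0, "good" to 0.5, and "excellent" to 1.0, after which ordinary interval-scaled distance measures apply.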

Ratio-Scaled Variables

Ratio-scaled variable: a positive measurement on a nonlinear scale, approximately at an exponential scale, such as Ae^(Bt) or Ae^(-Bt).

Methods:
• Treat them like interval-scaled variables. This is not a good choice, because the nonlinear scale distorts the distances.
• Apply a logarithmic transformation, i.e., y = log(x), and then treat the result as interval-scaled.
• Treat them as continuous ordinal data and treat their ranks as interval-scaled.

Variables of Mixed Types

A database may contain all six types of variables: symmetric binary, asymmetric binary, nominal, ordinal, interval, and ratio.

Variables of these different kinds appearing together are called mixed-type variables.

Categorization of Major Clustering Methods


The clustering methods can be classified into the following categories:
o K-means
o Partitioning Method
o Hierarchical Method
o Density-based Method
o Grid-based Method
o Model-based Method
o Constraint-based Method

1. K-means
Given k, the k-means algorithm is implemented in four steps:
1. Partition the objects into k nonempty subsets.
2. Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., mean point, of the cluster).
3. Assign each object to the cluster with the nearest seed point.
4. Go back to Step 2; stop when no new assignments are made.
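The four steps above can be sketched in pure Python. This is a minimal illustration on hypothetical 2-D points; it assumes clusters stay nonempty (a production version would re-seed empty clusters):

```python
# A minimal k-means sketch: round-robin initial partition, then iterate
# centroid computation and reassignment until assignments stop changing.
def kmeans(points, k, max_iter=100):
    # Step 1: partition objects into k nonempty subsets (round-robin here).
    assign = [i % k for i in range(len(points))]
    centroids = []
    for _ in range(max_iter):
        # Step 2: compute seed points as centroids of the current clusters.
        centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            centroids.append(tuple(sum(xs) / len(members) for xs in zip(*members)))
        # Step 3: assign each object to the cluster with the nearest seed point.
        new_assign = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])))
            for p in points
        ]
        # Step 4: go back to Step 2; stop when no assignments change.
        if new_assign == assign:
            break
        assign = new_assign
    return assign, centroids

pts = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.2, 4.9)]
labels, centers = kmeans(pts, 2)
```

With this data the first two points end up in one cluster and the last two in the other, regardless of the (poor) round-robin initialization.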

2. Partitioning Method
Suppose we are given a database of n objects; the partitioning method constructs k partitions of the data.

Each partition represents a cluster, and k ≤ n. This means the method classifies the data into k groups, which satisfy the following requirements:

o Each group contains at least one object.
o Each object must belong to exactly one group.

Typical methods:
K-means, k-medoids, CLARANS

3. Hierarchical Methods
This method creates a hierarchical decomposition of the given set of data objects. There are two approaches:

o Agglomerative Approach
o Divisive Approach

Agglomerative Approach

This approach is also known as the bottom-up approach. We start with each object forming a separate group, and keep merging the objects or groups that are close to one another. Merging continues until all of the groups are merged into one or until the termination condition holds.

Divisive Approach

This approach is also known as the top-down approach. We start with all of the objects in the same cluster. In each iteration, a cluster is split into smaller clusters. Splitting continues until each object is in its own cluster or the termination condition holds.

Disadvantage
This method is rigid, i.e., once a merge or split is done, it can never be undone.

Partitioning Method (K-Mean) in Data Mining


Partitioning Method: This clustering method classifies the information into multiple groups based on the characteristics and similarity of the data. It is up to the data analysts to specify the number of clusters that have to be generated. Given a database D that contains N objects, the partitioning method constructs K user-specified partitions of the data, in which each partition represents a cluster and a particular region. Many algorithms come under the partitioning method; some of the popular ones are K-Means, PAM (K-Medoids), and the CLARA algorithm (Clustering Large Applications). In this article, we will see the working of the K-Means algorithm in detail.

Hierarchical Clustering in Data Mining



A hierarchical clustering method works by grouping data into a tree of clusters. Hierarchical clustering begins by treating every data point as a separate cluster. Then, it repeatedly executes the following steps:
1. Identify the 2 clusters that are closest together, and
2. Merge the 2 most similar clusters. We continue these steps until all the clusters are merged together.
In hierarchical clustering, the aim is to produce a hierarchical series of nested clusters. A diagram called a dendrogram (a tree-like diagram that records the sequences of merges or splits) graphically represents this hierarchy; it is an inverted tree that describes the order in which points are merged (bottom-up view) or clusters are broken up (top-down view).
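The two repeated steps can be sketched as a small pure-Python agglomerative routine. This is an illustrative sketch on hypothetical 1-D data using single-linkage distance (the cluster-to-cluster distance is the closest pair of members); real implementations use efficient linkage algorithms:

```python
# Repeatedly find the closest pair of clusters and merge them,
# until only k clusters remain.
def agglomerative(points, k):
    clusters = [[p] for p in points]   # every point starts as its own cluster
    while len(clusters) > k:
        # Step 1: identify the two closest clusters (single linkage).
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        # Step 2: merge the two closest clusters.
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

groups = agglomerative([1.0, 1.2, 4.8, 5.0, 5.2], 2)
```

Recording the order of the merges (and the distance at which each occurs) is exactly the information a dendrogram displays.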
Density-Based Clustering in Data Mining
Density-based clustering refers to a method that is based on a local cluster criterion, such as density-connected points. In this tutorial, we will discuss density-based clustering with examples.

What is Density-based Clustering?

Density-based clustering is one of the most popular unsupervised learning methodologies used in model building and machine learning algorithms. Data points in the low-density regions separating two clusters are considered noise. The surroundings within a radius ε of a given object are known as the ε-neighborhood of the object. If the ε-neighborhood of the object contains at least a minimum number, MinPts, of objects, then the object is called a core object.

Density-Based Clustering - Background

There are two parameters used in density-based clustering:

Eps: the maximum radius of the neighborhood.

MinPts: the minimum number of points in an Eps-neighborhood of a point.

The Eps-neighborhood of a point i is defined as

NEps(i) = { k | k belongs to D and dist(i, k) <= Eps }

Directly density reachable:

A point i is directly density reachable from a point k with respect to Eps and MinPts if

i belongs to NEps(k)

and k satisfies the core point condition:

|NEps(k)| >= MinPts

Density reachable:

A point i is density reachable from a point j with respect to Eps and MinPts if there is a chain of points p1, ..., pn with p1 = j and pn = i such that each point pi+1 is directly density reachable from pi.

Density connected:

A point i is density connected to a point j with respect to Eps and MinPts if there is a point o such that both i and j are density reachable from o with respect to Eps and MinPts.
Working of Density-Based Clustering
Suppose a set of objects is denoted by D'. We say that an object i is directly density reachable from the object j only if it is located within the ε-neighborhood of j, and j is a core object.

An object i is density reachable from the object j with respect to ε and MinPts in a given set of objects D' only if there is a chain of objects p1, ..., pn with p1 = j and pn = i such that each pi+1 is directly density reachable from pi with respect to ε and MinPts.

An object i is density connected to an object j with respect to ε and MinPts in a given set of objects D' only if there is an object o belonging to D' such that both i and j are density reachable from o with respect to ε and MinPts.
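The Eps-neighborhood and core-point definitions above can be checked directly. This is a minimal sketch on hypothetical 2-D points; `eps_neighborhood` and `is_core` are illustrative helper names, not a library API:

```python
# Compute the Eps-neighborhood of each point and check the core point
# condition |NEps(p)| >= MinPts.
import math

def eps_neighborhood(points, i, eps):
    """Indices k with dist(points[i], points[k]) <= eps (includes i itself)."""
    return [k for k, q in enumerate(points)
            if math.dist(points[i], q) <= eps]

def is_core(points, i, eps, min_pts):
    return len(eps_neighborhood(points, i, eps)) >= min_pts

pts = [(0, 0), (0, 1), (1, 0), (10, 10)]
core = [i for i in range(len(pts)) if is_core(pts, i, eps=1.5, min_pts=3)]
```

The three nearby points are core objects, while the isolated point at (10, 10) is not; DBSCAN grows clusters by following direct density reachability outward from such core objects.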

Major Features of Density-Based Clustering


The primary features of density-based clustering are given below:

o It is a single-scan method, i.e., it needs to examine the database only once.
o It requires density parameters as a termination condition.
o It is used to manage noise in data clusters.
o Density-based clustering is used to identify clusters of arbitrary shape and size.

Density-Based Clustering Methods


DBSCAN

DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise. It depends on a density-based notion of cluster. It can identify clusters of arbitrary shape in a spatial database with outliers.

OPTICS

OPTICS stands for Ordering Points To Identify the Clustering Structure. It produces a significant ordering of the database with respect to its density-based clustering structure. The cluster ordering contains information equivalent to the density-based clusterings obtained from a wide range of parameter settings. OPTICS methods are beneficial for both automatic and interactive cluster analysis, including determining an intrinsic clustering structure.

DENCLUE

DENCLUE is a density-based clustering method by Hinneburg and Keim. It enables a compact mathematical description of arbitrarily shaped clusters in high-dimensional data, and it performs well on data sets with a large amount of noise.

What is Model-Based Clustering?

Model-based clustering is a statistical approach to data clustering. The observed (multivariate) data is assumed to have been generated from a finite mixture of component models. Each component model is a probability distribution, generally a parametric multivariate distribution.

For instance, in a multivariate Gaussian mixture model, each component is a multivariate Gaussian distribution. The component responsible for generating a particular observation determines the cluster to which the observation belongs.

Model-based clustering attempts to optimize the fit between the given data and some mathematical model, and is based on the assumption that the data are generated by a mixture of underlying probability distributions.

The types of model-based clustering are as follows −

Statistical approach − Expectation-maximization (EM) is a popular iterative refinement algorithm. It can be viewed as an extension to k-means:

• It assigns each object to a cluster according to a weight (probability of membership).
• New means are computed based on these weighted assignments.

The basic idea is as follows −

• Start with an initial estimate of the parameter vector.
• Iteratively rescore the patterns against the mixture density produced by the parameter vector.
• The rescored patterns are then used to update the parameter estimates.
• Patterns belonging to the same cluster are those placed by their scores in the same component.

Algorithm
• Initially, assign the k cluster centers randomly.
• Iteratively refine the clusters based on the following two steps −

Expectation step − Assign each data point Xi to cluster Ck with the probability

P(Xi ∈ Ck) = P(Ck | Xi) = P(Ck) P(Xi | Ck) / P(Xi)

Maximization step − Re-estimate the model parameters, e.g., the cluster means:

mk = (1/N) Σ(i=1..N) [ Xi P(Xi ∈ Ck) / Σj P(Xi ∈ Cj) ]
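The two EM steps can be sketched for a two-component 1-D Gaussian mixture. This is a deliberately simplified illustration: equal fixed priors and variances, hypothetical data, and the standard weighted-average form of the mean update:

```python
# Alternate the expectation step (membership weights) and the
# maximization step (weighted means) until the means settle.
import math

def em_means(xs, m=(0.0, 1.0), var=1.0, iters=50):
    m1, m2 = m
    for _ in range(iters):
        # Expectation step: P(Xi in Ck) ∝ P(Ck) P(Xi | Ck); equal priors cancel.
        w1 = []
        for x in xs:
            p1 = math.exp(-(x - m1) ** 2 / (2 * var))
            p2 = math.exp(-(x - m2) ** 2 / (2 * var))
            w1.append(p1 / (p1 + p2))      # membership weight for cluster 1
        # Maximization step: re-estimate each mean as a weighted average.
        m1 = sum(w * x for w, x in zip(w1, xs)) / sum(w1)
        m2 = sum((1 - w) * x for w, x in zip(w1, xs)) / sum(1 - w for w in w1)
    return m1, m2

data = [0.1, -0.2, 0.3, 9.8, 10.1, 10.3]
m1, m2 = em_means(data, m=(0.0, 5.0))
```

On this data the means converge near the two group averages (about 0.07 and 10.07), the soft-assignment analogue of k-means centroids.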

Machine learning approach − Machine learning is an approach that builds complex algorithms for processing huge amounts of data and delivers results to its users. It uses complex programs that can learn through experience and make predictions.

The algorithms improve by themselves through the frequent input of training data. The main objective of machine learning is to learn from data and build models from data that can be understood and used by humans.

A well-known approach of this kind is incremental conceptual learning, which produces a hierarchical clustering in the form of a classification tree. Each node defines a concept and includes a probabilistic representation of that concept.

Limitations

• The assumption that the attributes are independent of each other is often too strong, because correlations may exist.
• It is not suitable for clustering large database data: it tends to produce skewed trees, and computing the probability distributions is expensive.

Neural network approach − The neural network approach represents each cluster as an exemplar, acting as a prototype of the cluster. New objects are assigned to the cluster whose exemplar is the most similar according to some distance measure.

Applications of Data Mining

Scientific Analysis: Scientific simulations generate bulk data every day. This includes data collected from nuclear laboratories, data about human psychology, etc. Data mining techniques are capable of analyzing these data. Today we can capture and store new data faster than we can analyze the old data already accumulated. Examples of scientific analysis:
• Sequence analysis in bioinformatics
• Classification of astronomical objects
• Medical decision support.
Intrusion Detection: A network intrusion refers to any unauthorized activity on a digital network. Network intrusions often involve stealing valuable network resources. Data mining techniques play a vital role in intrusion detection, finding network attacks and anomalies. These techniques help in selecting and refining useful and relevant information from large data sets, and help classify relevant data for an Intrusion Detection System. An Intrusion Detection System generates alarms about foreign invasions in the system based on the network traffic. For example:
• Detecting security violations
• Misuse detection
• Anomaly detection

Business Transactions: Every business transaction is memorized for perpetuity. Such transactions are usually time-related and can be inter-business deals or intra-business operations. The effective and timely use of this data for competitive decision-making is definitely one of the most important problems to solve for businesses that struggle to survive in a highly competitive world. Data mining helps to analyze these business transactions and identify marketing approaches for decision-making. Examples:
• Direct mail targeting
• Stock trading
• Customer segmentation
• Churn prediction (Churn prediction is one of the most popular Big Data use cases in
business)
Market Basket Analysis: Market Basket Analysis is a technique built on a careful study of the purchases made by a customer in a supermarket. This concept identifies the patterns of items frequently purchased together by customers. This analysis can help companies to promote deals, offers, and sales, and data mining techniques help achieve this analysis task. Example:
• Data mining concepts are in use for Sales and marketing to provide better customer service,
to improve cross-selling opportunities, to increase direct mail response rates.
• Customer Retention in the form of pattern identification and prediction of likely defections
is possible by Data mining.
• Risk assessment and fraud detection also use data mining concepts to identify inappropriate or unusual behavior.
Education: For analyzing the education sector, data mining uses the Educational Data Mining (EDM) method. This method generates patterns that can be used both by learners and educators. Using EDM, we can perform educational tasks such as:
• Predicting students admission in higher education
• Predicting students profiling
• Predicting student performance
• Teachers teaching performance
• Curriculum development
• Predicting student placement opportunities
Research: Data mining techniques can perform predictions, classification, clustering, associations, and grouping of data with precision in the research area. Rules generated by data mining are unique for finding results. In most technical research in data mining, we create a training model and a testing model. The train/test strategy measures the precision of the proposed model: we split the data set into two sets, a training data set and a testing data set. The training data set is used to design the model, whereas the testing data set is used to evaluate it. Example:
• Classification of uncertain data.
• Information-based clustering.
• Decision support system
• Web Mining
• Domain-driven data mining
• IoT (Internet of Things)and Cybersecurity
• Smart farming IoT(Internet of Things)
Healthcare and Insurance: A pharmaceutical company can examine its recent sales force activity and the outcomes to improve the targeting of high-value physicians and figure out which marketing activities will have the best effect in the upcoming months. In the insurance sector, data mining can help to predict which customers will buy new policies, identify behavior patterns of risky customers, and identify fraudulent behavior of customers.
• Claims analysis i.e which medical procedures are claimed together.
• Identify successful medical therapies for different illnesses.
• Characterizes patient behavior to predict office visits.
Transportation: A diversified transportation company with a large direct sales force can apply data mining to identify the best prospects for its services. A large consumer merchandise organization can apply data mining to improve its sales process to retailers.
• Determine the distribution schedules among outlets.
• Analyze loading patterns.
Financial/Banking Sector: A credit card company can leverage its vast warehouse of
customer transaction data to identify customers most likely to be interested in a new credit
product.
• Credit card fraud detection.
• Identify ‘Loyal’ customers.
• Extraction of information related to customers.
• Determine credit card spending by customer groups.

Data Mining For Intrusion Detection and Prevention


The security of our computer systems and data is at continual risk. The extensive growth of the
Internet and the increasing availability of tools and tricks for intruding and attacking networks
have prompted intrusion detection and prevention to become a critical component of
networked systems.
Intrusion
Unauthorized access by an intruder involves stealing valuable resources and misusing those resources, e.g., through worms and viruses. Intrusion prevention techniques such as user authentication and sharing of encrypted information are not enough on their own, because systems are becoming more complex day by day, so we need an additional layer of security controls.
Intruder
An intruder is an entity that is trying to gain unauthorized access to a system or a network. As a result, the data present in that system may be corrupted, along with an imbalance in the environment of that network.
Intruders are majorly of two types:
• Masquerader (outside intruder) – no authority to use the network or system
• Misfeasor (inside intruder) – authorized access to limited applications

Intrusion detection system

An Intrusion Detection System (IDS) is a device or an application that detects unusual indications, monitors traffic, and reports its results to an administrator, but cannot take action to prevent unusual activity. The system protects the confidentiality, integrity, and availability of data and information systems from internet attacks. As networks extend dynamically, the possibilities of risks and the chances of malicious intrusions increase as well.
Types of attacks majorly detected by intrusion detection systems:
• Scanning attacks
• Denial of service (DoS) attacks
• Penetration attacks

Fig.2 Architecture of IDS

Intrusion prevention system

An Intrusion Prevention System (IPS) is basically an extension of the Intrusion Detection System that can protect the system from suspicious activities, viruses, and threats; once any unwelcome activity is identified, the IPS also takes action against it, such as closing access points and configuring firewalls.
The majority of intrusion detection and prevention systems use either signature-based detection or anomaly-based detection.

1. Signature-Based – The signature-based system uses a library of signatures of known attacks. It is basically a pattern-based system that monitors the packets on the network and compares them against a database of signatures from existing attacks or a list of attack patterns. If a signature matches a pattern, the system detects the intrusion, alerts the admin, and takes preventive action such as blocking the IP address or deactivating the user account from accessing the application. E.g., antiviruses.
Advantage
• Works well for detecting known attacks.
Disadvantage
• Fails to identify new or unknown attacks.
• Requires regular updates with new attack signatures.
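Signature-based matching can be illustrated with a toy sketch. The signature strings and payload below are hypothetical examples; real systems such as Snort use far richer rule languages than plain substring matching:

```python
# Compare each packet payload against a small database of known attack
# signatures and report every signature that matches.
SIGNATURES = {
    "sql-injection": "' OR 1=1",
    "path-traversal": "../../etc/passwd",
}

def match_signatures(payload):
    """Return the names of all signatures found in the payload."""
    return [name for name, pattern in SIGNATURES.items() if pattern in payload]

alerts = match_signatures("GET /index.php?id=' OR 1=1 --")
```

A payload that matches no signature produces no alert, which is exactly why this approach fails on new or unknown attacks.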
2. Anomaly-Based – The anomaly-based system watches for any abnormal activity. If such activity is detected, the system blocks entry to the target host immediately. This system follows a baseline pattern: first we train the system with a suitable baseline, then we compare activity against that baseline; anything that crosses the baseline is treated as suspicious activity and an alert is triggered to the administrator.
Advantage
• Ability to detect unknown attacks.
Disadvantage
• Higher complexity; detection is sometimes difficult and there are chances of false alarms.
Data Mining and Recommender Systems
Data mining makes use of various methodologies from statistics and different algorithms, like classification models, clustering, and regression models, to exploit the insights present in large data sets. It helps us to predict outcomes based on the history of events that have taken place: for example, the amount a person spends on a monthly basis based on his previous transactions, or the frequently bought items, like bread, butter, and jam, which are always bought together. Trends in the market can also be analyzed, like the demand for umbrellas during the rainy season and the demand for ice cream during the summer. The main objective here is to analyze the patterns present in the data set and obtain useful information based on the target required.
What could be the yield of the crops in the present year? What are the chances of a person
having a particular disease when all the symptoms are given? What is the expected sale of
groceries in a particular month? What is the expected number of customers purchasing clothes
from a particular supermarket? What is the loss/ profit percentage expected in the coming
year? All these questions can be answered provided we use an accurate model for training the
data, identify the patterns present in the datasets, and more importantly, we need to have a
sufficient amount of data to arrive at accurate and efficient results.
In particular, data mining draws upon ideas such as sampling, estimation, and hypothesis testing from statistics, and search algorithms, modeling techniques, and learning theories from computing, pattern recognition, and machine learning.
Recommender systems:
The recommender system mainly deals with the likes and dislikes of the users. Its major objective is to recommend to a user an item that the user has a high chance of liking or needing, based on his previous purchases. It is like having a personalized team who understand our likes and dislikes and help us in making decisions regarding a particular item, without any bias, by making use of the large amount of data in the repositories that is generated day by day. The aim of recommender systems is to supply easily accessible, high-quality recommendations to the user community; in effect, they aim to act as an efficient personal advisor.
Which movie/ web series should I watch next? Which book should I read next? Which items
should I buy which would match the items bought earlier? Which are the magazines that I
should be reading? Will that match with the genre I like? Should I go to a particular place?
Will I like that? All these questions can be answered with the help of the recommender system.
Here what we do is find the similarity of the users or items from whom the recommendation
has to be made with that of all the users or items which are present in the datasets. We find the
pattern of likes and dislikes having the highest similarity. Then we make use of that pattern to
suggest whether an item or place or movie or book has to be suggested or not.
• User-based recommendation: Here we calculate Pearson's similarity measure, which is needed to determine closely related users, i.e., users whose likes and dislikes follow the same pattern. The computational operations are based on the formula for Pearson similarity: in the numerator, the two users' ratings are mean-centered (each user's mean rating is subtracted), multiplied together, and summed; in the denominator, the mean-centered ratings of each user are squared and summed, and the square roots of these sums are multiplied. Dividing the numerator by the denominator gives the similarity measure.
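The Pearson computation just described can be sketched directly; the rating vectors below are hypothetical, with one entry per co-rated item:

```python
# Pearson similarity between two users' rating vectors over the same items.
import math

def pearson(u, v):
    mu, mv = sum(u) / len(u), sum(v) / len(v)   # mean rating per user
    du = [x - mu for x in u]                    # mean-centered ratings
    dv = [x - mv for x in v]
    num = sum(a * b for a, b in zip(du, dv))
    den = math.sqrt(sum(a * a for a in du)) * math.sqrt(sum(b * b for b in dv))
    return num / den

sim = pearson([5, 3, 4, 4], [4, 2, 3, 3])
```

These two users rate every item one point apart, so after mean-centering their patterns coincide exactly and the similarity is 1.0, the property that makes Pearson robust to users who rate consistently high or low.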
• Item-based recommendation: The initial aim is to obtain the mean-adjusted rating matrix. The mean-adjusted matrix is used in predicting the rating a new user would give an item, and it reduces errors caused by the users, as some tend to give very high ratings most of the time and some tend to give very low ratings most of the time. To reduce this inconsistency, we subtract each user's mean rating from that user's ratings. The next step is the calculation of the similarity measure between items, for which we can use cosine similarity. The computational operations are based on the formula for cosine similarity: in the numerator, the ratings different users gave to the two items are multiplied together and summed; in the denominator, the ratings of each item are squared and summed, and the square roots of these sums are multiplied. Dividing the numerator by the denominator gives the similarity measure.
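The cosine similarity used for item-item comparison can be sketched as follows; the two vectors are hypothetical ratings of two items, one entry per user (in practice they would come from the mean-adjusted matrix):

```python
# Cosine similarity between two items' rating vectors (one entry per user).
import math

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den

sim = cosine([1, 0, 2], [2, 0, 4])
```

The second vector is exactly twice the first, so the angle between them is zero and the similarity is 1.0; cosine similarity compares rating directions, not magnitudes.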
In the above two methods, we get the similarity measure based on which we make the
prediction of whether the item has to be suggested or not to a particular user or whether the
item is relevant or not.
Selecting the best technique based on the specifics of the application domain, identifying the salient success factors behind different techniques, and comparing several techniques against an optimality criterion are all needed for effective analysis. Recommender systems have historically been evaluated using offline experiments that attempt to estimate the prediction error of recommendations on an existing dataset of transactions.
Spatial Data Mining
Spatial data mining is a specialized subfield of data mining that deals with extracting knowledge from spatial data. Spatial data refers to data that is associated with a particular location or geography. Examples of spatial data include maps, satellite images, GPS data, and other geospatial information.

Text Data Mining


Text data mining can be described as the process of extracting essential data from standard
language text. All the data that we generate via text messages, documents, emails, files are
written in common language text. Text mining is primarily used to draw useful insights or
patterns from such data.
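As a minimal sketch of this process, the snippet below reduces free text to term frequencies, the first step from which patterns can be drawn; the tiny stop-word list is illustrative, not a standard one:

```python
import re
from collections import Counter

# A tiny, illustrative stop-word list; real text mining pipelines
# use much larger, language-specific lists.
STOP_WORDS = {"the", "a", "an", "is", "to", "of", "and", "in"}

def term_frequencies(text):
    """Tokenize lowercase text and count non-stop-word terms."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)

freqs = term_frequencies("The product is great and the service is great too")
```

From counts like these, a text mining tool can surface recurring themes (here, that "great" dominates the feedback) across thousands of documents.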
The text mining market has experienced exponential growth and adoption over the last few years
and is also expected to gain significant growth and adoption in the coming years. One of the
primary reasons behind the adoption of text mining is the higher competition in the business
market, with many organizations seeking value-added solutions to compete with other
organizations. With increasing competition in business and changing customer perspectives,
organizations are making huge investments to find a solution that is capable of analyzing
customer and competitor data to improve competitiveness. The primary sources of data are
e-commerce websites, social media platforms, published articles, surveys, and many more. The
larger part of the generated data is unstructured, which makes it challenging and expensive for
organizations to analyze manually. This challenge, combined with the exponential growth in data
generation, has led to the growth of analytical tools that are not only able to handle large
volumes of text data but also support decision-making. Text mining software empowers a user to
draw useful information from a huge set of available data sources.

What is Multimedia Data Mining?


Multimedia mining is a subfield of data mining that is used to find interesting information or
implicit knowledge in multimedia databases. Mining in multimedia is also referred to as automatic
annotation or annotation mining. Mining multimedia data involves two or more data types, such
as text and video, or text, video, and audio.

Multimedia data mining is an interdisciplinary field that integrates image processing and
understanding, computer vision, data mining, and pattern recognition. Multimedia data mining
discovers interesting patterns from multimedia databases that store and manage large collections
of multimedia objects, including image data, video data, audio data, sequence data and hypertext
data containing text, text markups, and linkages. Issues in multimedia data mining
include content-based retrieval and similarity search, generalization and multidimensional
analysis. Multimedia data cubes contain additional dimensions and measures for multimedia
information.

The framework that manages different types of multimedia data stored, delivered, and utilized in
different ways is known as a multimedia database management system. There are three classes of
multimedia databases: static, dynamic, and dimensional media. The content of the Multimedia
Database management system is as follows:

o Media data: The actual data representing an object.
o Media format data: Information about the format of the media data after it goes through the
acquisition, processing, and encoding phases, such as the sampling rate, resolution, and
encoding scheme.
o Media keyword data: Keyword descriptions relating to the generation of the data, also
known as content-descriptive data. Example: the date, time, and place of recording.
o Media feature data: Content-dependent data such as the distribution of colours, the kinds of
texture, and the different shapes present in the data.
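The four content classes above can be pictured as one record per media object; the Python sketch below uses hypothetical field names, not the schema of any particular multimedia DBMS:

```python
from dataclasses import dataclass, field

@dataclass
class MediaObject:
    """Illustrative record mirroring the four content classes of a
    multimedia DBMS; field names are invented for this sketch."""
    media_data: bytes        # the actual object, e.g. raw image/audio bytes
    format_data: dict        # sampling rate, resolution, encoding scheme
    keyword_data: dict       # date, time, place of recording (descriptive)
    feature_data: dict = field(default_factory=dict)  # colours, texture, shapes

clip = MediaObject(
    media_data=b"\x00\x01",
    format_data={"sampling_rate_hz": 44100, "encoding": "PCM"},
    keyword_data={"date": "2024-05-01", "place": "studio"},
)
```

Keeping keyword data and feature data in separate fields reflects the distinction drawn above: the former describes how the data was generated, the latter is derived from the content itself.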

Data Mining- World Wide Web

Over the last few years, the World Wide Web has become a significant source of information
and simultaneously a popular platform for business. Web mining can be defined as the method of
utilizing data mining techniques and algorithms to extract useful information directly from the
web, such as Web documents and services, hyperlinks, Web content, and server logs. The World
Wide Web contains a large amount of data that provides a rich source for data mining. The
objective of Web mining is to look for patterns in Web data by collecting and examining data in
order to gain insights.

What is Web Mining?

Web mining can broadly be seen as the application of adapted data mining techniques to the web,
whereas data mining is defined as the application of algorithms to discover patterns in mostly
structured data, embedded in a knowledge discovery process. Web mining has the distinctive
property of dealing with a set of various data types. The web has multiple aspects that yield
different approaches to the mining process: web pages consist of text, web pages are linked via
hyperlinks, and user activity can be monitored via web server logs. These three features lead to
the differentiation between three areas: web content mining, web structure mining, and web
usage mining.
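As an illustration of the web usage mining side, the sketch below parses one server log line in a simplified form of the common log format; the regular expression and field names are assumptions of this sketch:

```python
import re

# Simplified common-log-format pattern: host, timestamp, request
# line, and status code; other fields are skipped.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) \S+'
)

def parse_log_line(line):
    """Return the named fields of one log line, or None if it
    does not match the expected format."""
    m = LOG_PATTERN.match(line)
    return m.groupdict() if m else None

entry = parse_log_line(
    '192.168.0.1 - - [10/Oct/2023:13:55:36 +0000] '
    '"GET /index.html HTTP/1.1" 200 2326'
)
```

Aggregating such parsed entries (pages per visitor, common navigation paths, error rates) is the starting point of web usage mining.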

Privacy, security and social impacts of Data Mining


Data mining is the process of intelligently discovering useful information in large amounts of
data to solve real-life problems. It is a combination of two words: data and mining. Data is a
collection of instances, and mining is designed to filter out useful information. Data mining, also
called knowledge discovery in databases (KDD), is responsible for analyzing data from different
perspectives and classifying it. There are many data mining techniques used to transform raw
data into useful information. It has various applications, such as detecting anomalous behavior,
detecting fraud and abuse, uncovering terrorist activities, and investigating crimes through lie
detection. Data mining can offer many benefits by improving customer service, satisfaction, and
lifestyle in general. Data mining is present in many aspects of our daily lives, whether we realize
it or not. It affects how we shop, how we work, and what we search for.

Data mining with Intelligent Miner

Data mining is an innovative way of gaining new and valuable business insights by analyzing the
information held in your company database. These insights can enable you to identify market
niches, and they support and facilitate the making of well-informed business decisions.
Essentially, data mining is a ground-breaking way to leverage the information that your company
already has in order to plan a business strategy for the future.

Data mining uncovers this in-depth business intelligence by using advanced analytical and
modeling techniques. With data mining, you can ask far more sophisticated questions of your
data than you can with conventional querying methods. The information that data mining
provides can lead to an immense improvement in the quality and dependability of business
decision-making.

Conventional methods can tell a bank, for example, which of the bank account types that it
provides is the most profitable. However, only data mining enables the bank to create profiles of
the customers who already have this type of account. The bank can then use data mining to find
other customers who match that profile, so that it can accurately target a marketing campaign to
them.
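The profile-matching idea can be sketched as computing the average profile (centroid) of the existing account holders and ranking other customers by their distance to it; the features (age, income in thousands) and all numbers below are invented for illustration:

```python
import math

# Profiles of customers who already hold the profitable account type,
# as (age, income_in_thousands) pairs; values are invented.
holders = [(34, 80), (40, 95), (38, 88)]
prospects = {"p1": (36, 85), "p2": (62, 40)}

# Average profile of the known group.
centroid = tuple(sum(v) / len(holders) for v in zip(*holders))

# Rank prospects by Euclidean distance to the centroid: the closest
# matches are the best candidates for the marketing campaign.
ranked = sorted(prospects, key=lambda p: math.dist(prospects[p], centroid))
```

A production system would use many more features and scale them first, but the ranking-by-similarity-to-a-profile step is the same.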

Data mining can identify patterns in company data, for example in records of supermarket
purchases. If customers buy product A and product B, for instance, which product C are they
most likely to buy as well? Accurate answers to questions like these are invaluable aids to
marketing strategies.
Data mining can identify the characteristics of a known group of customers, for example, those
who have a proven record as poor credit risks. The company can then use these characteristics to
screen new customers and to predict if they also will be poor credit risks.

Data mining can automatically perform time series forecasts without requiring statistical skills.
You can use time series forecasts to optimize warehouse stock management.
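A minimal sketch of such a forecast, assuming a simple moving average rather than any particular product's algorithm, with invented demand figures:

```python
# Invented weekly demand figures for one warehouse item.
demand = [120, 130, 125, 140, 135, 150]

def moving_average_forecast(series, window=3):
    """Forecast the next value as the mean of the last `window`
    observations; a deliberately simple baseline model."""
    recent = series[-window:]
    return sum(recent) / len(recent)

next_demand = moving_average_forecast(demand)
```

The forecast value can then feed a reorder decision; real stock-management systems would account for trend and seasonality as well.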

Data mining tools ease and automate the process of discovering this kind of information from
large stores of data.

• Model training
A data mining analyst wants to obtain the rules contained in historic data. A specific mining
algorithm is selected, configured, and applied to a specified set of input data. The execution of
the mining algorithm is called the training phase. The result of the training phase is a data mining
model.
• The data mining process
The data mining process comprises different steps such as building, testing, or working with the
mining models.
• Introducing database objects and data mining in SQL
The database objects that are provided by Intelligent Miner reflect the current standard for data
mining in the context of SQL.
• Using interfaces
Intelligent Miner provides a set of user-defined methods, functions, and stored procedures
to Db2. You invoke these database objects by using SQL statements.

Trends in Data Mining


Data mining is one of the most widely used methods to extract data from different sources and
organize them for better usage. Despite having different commercial systems for data mining,
many challenges come up when they are actually implemented. With the rapid evolution in the
field of data mining, companies are expected to stay abreast with all the new developments.

Complex algorithms form the basis for data mining as they allow data segmentation to identify
trends and patterns, detect variations, and predict the probabilities of various events. The raw
data may come in both analog and digital formats and is inherently based on the source of the
data. Companies need to keep track of the latest data mining trends and stay updated to do well
in the industry and overcome challenging competition.

Corporations can use data mining to discover customers' choices, build good relationships with
customers, increase revenue, and reduce risks.
