
UNIT -2 DATA MINING

Introduction to Data Mining

• Data mining is the process of extracting valuable insights and knowledge from large datasets. It is an
interdisciplinary field that combines statistics, computer science, and domain expertise to discover
hidden patterns and relationships within data.
• What is the data mining definition? In simple terms, data mining is the process of extracting knowledge
from data. It involves using advanced algorithms and techniques to uncover patterns, correlations, and
trends that are not easily discernible using traditional methods.
• Data mining has become an important tool for organizations of all sizes and across various industries.
From healthcare and finance to marketing and retail, data mining is being used to improve decision-
making, optimize business processes, and gain a competitive advantage.
• One of the key benefits of data mining is that it allows organizations to gain insights into their customers,
products, and operations that would otherwise be very difficult or impossible to obtain. For example,
data mining can be used to identify which customers are most likely to purchase a particular product,
which products are most profitable, and which business processes can be optimized to reduce costs.
• Let’s explore data mining in detail in the following sections, covering what data mining is, why it is
important, and how the data mining process works.

Why is Data Mining Important?

Here are some of the reasons why data mining is important:

• Improved Decision Making:


By extracting insights and patterns from large datasets, data mining helps individuals and organizations
make more informed decisions. This can lead to better business outcomes, more effective public
policies, and improved scientific research.
• Cost Reduction:
Data mining can help organizations identify cost-saving opportunities and improve operational efficiency.
For example, data mining can be used to optimize supply chain management, reduce waste, and
streamline production processes.
• Competitive Advantage:
In today's data-driven economy, organizations that can mine and analyze data effectively have a
competitive advantage over those that do not. By using data mining to identify patterns in customer
behavior, market trends, and other areas, organizations can gain valuable insights that help them stay
ahead of the competition.
• Improved Customer Experience:
By analyzing customer data, organizations can gain a better understanding of customer needs and
preferences and tailor products and services accordingly. This leads to a more personalized customer
experience and can help build brand loyalty.
• Fraud Detection:
Data mining can be used to identify fraudulent activities, such as credit card fraud, insurance fraud, and
identity theft. By analyzing patterns in data, organizations can identify anomalies and flag potential fraud
cases.
• Scientific Research:
Data mining is increasingly being used in scientific research to analyze complex datasets and identify
new relationships between variables. This has led to important discoveries in fields such as genetics,
astronomy, and environmental science.

Data Mining Process

Below are the steps used in a typical data mining process; a minimal code sketch follows the list:

• Problem Definition:
The first step in the data mining process is to clearly define the problem that needs to be solved. This
involves identifying the business or research question that needs to be answered and defining the scope
of the analysis.
• Identifying Required Data:
Once the problem has been defined, the next step is to identify the data that is needed to answer the
question. This involves identifying the data sources, the data format, and any data quality issues that
need to be addressed.
• Data Preparation and Pre-processing:
Once the required data has been identified, the next step is to prepare the data for mining. This involves
cleaning and transforming the data, handling missing data, and creating new variables or features as
needed.
• Data Modelling:
The next step is to create a model that can be used to analyze the data and answer the questions. This
involves selecting an appropriate machine learning algorithm, tuning the model parameters, and testing
the model using a validation dataset.
• Model Evaluation:
Once the model has been trained and tested, the next step is to evaluate its performance. This involves
assessing the accuracy and effectiveness of the model, identifying any areas where the model can be
improved, and deciding whether the model is suitable for deployment.
• Model Deployment:
The final step in the data mining process is to deploy the model in a production environment. This
involves integrating the model into existing business processes, creating user interfaces and reports to
help stakeholders interpret the results, and monitoring the model's performance over time.
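To make the steps above concrete, here is a minimal, hedged sketch of the modelling and evaluation stages using scikit-learn. The synthetic dataset, the random-forest model, and the parameter choices are illustrative assumptions, not part of the original notes.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Problem definition / data identification: a synthetic binary-classification task
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# Data preparation: hold out a validation set for later evaluation
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)

# Data modelling: fit a classifier on the training data
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Model evaluation: check accuracy on unseen data before deciding on deployment
print("Validation accuracy:", accuracy_score(y_valid, model.predict(X_valid)))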

Types of Data Mining Techniques

Here are a few of the most commonly used data mining techniques; a short code sketch follows the list:

• Classification:
Classification is a technique that categorizes data into predefined classes or groups. This involves
training a machine learning model on a set of labeled data and then using the model to classify new,
unlabeled data.
• Clustering:
Clustering is a technique used to group similar data points based on their characteristics. This can be
useful for identifying patterns or segments in the data or for identifying outliers or anomalies.
• Association Rule Mining:
Association rule mining is a technique that is used to identify relationships between variables in a
dataset. This involves identifying sets of items that frequently occur together, such as items commonly
purchased in a retail setting.
• Regression Analysis:
Regression analysis is a statistical technique that is used to identify relationships between variables in a
dataset. This involves fitting a model to the data that can be used to predict the value of one variable
based on the values of other variables.
• Anomaly Detection:
Anomaly detection is a technique used to identify unusual or unexpected patterns in data. This can be
useful for identifying fraudulent activity, network intrusions, or other types of abnormal behavior.
• Text Mining:
Text mining is a technique used to extract information from unstructured text data. This involves using
natural language processing techniques to identify patterns and relationships in text data, such as
sentiment analysis or topic modeling.
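As a brief illustration of two of the techniques listed above, the hedged sketch below applies classification and clustering to the small Iris dataset; the dataset and parameters are assumptions chosen only for demonstration.

from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Classification: learn predefined classes from labelled data, then predict
clf = DecisionTreeClassifier(random_state=0).fit(X, y)
print("Predicted class of first sample:", clf.predict(X[:1]))

# Clustering: group unlabelled points by similarity of their features
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("Cluster labels of first five samples:", km.labels_[:5])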

Data Mining Applications

Here's an overview of some of the most common data mining applications across different industries:

• Marketing:
Data mining can be used to identify customer segments, predict customer behavior, and develop
targeted marketing campaigns. Businesses can gain insights into customer preferences by analyzing
customer data and developing more effective marketing strategies.
• Fraud Detection:
Data mining can be used to identify fraudulent activity, such as credit card fraud, insurance fraud, and
identity theft. By analyzing patterns in data, data mining algorithms can detect unusual behavior and alert
organizations to potential fraud.
• Healthcare:
Data mining can be used to analyze patient data, identify risk factors for diseases, and develop treatment
plans. By analyzing large amounts of patient data, healthcare organizations can gain insights into patient
health and improve the quality of care.
• Finance:
Data mining can be used to analyze financial data, identify trends and patterns, and predict future market
conditions. By analyzing financial data, businesses can gain insights into market conditions and make
better investment decisions.
• E-commerce:
Data mining can be used to analyze customer behavior, predict customer preferences, and develop
targeted marketing campaigns. By analyzing customer data, e-commerce businesses can develop more
effective marketing strategies and increase sales.
• Manufacturing:
Data mining can be used in manufacturing to improve supply chain management, optimize production
processes, and predict equipment failure.

KDD stands for Knowledge Discovery in Databases, which is the process of extracting useful knowledge
from large amounts of data. It is an area of interest to researchers and professionals in various fields,
such as artificial intelligence, machine learning, pattern recognition, databases, statistics, and data
visualization. Data mining is a key component of the KDD process.

What is KDD in Data Mining

KDD (Knowledge Discovery in Databases) is a process of discovering useful knowledge and insights from
large and complex datasets. The KDD process involves a range of techniques and methodologies,
including data preprocessing, data transformation, data mining, pattern evaluation, and knowledge
representation. KDD and data mining are closely related processes, with data mining being a key
component and subset of the KDD process.

The KDD process aims to identify hidden patterns, relationships, and trends in data that can be used to
make predictions, decisions, and recommendations. KDD is a broad and interdisciplinary field used in
various industries, such as finance, healthcare, marketing, e-commerce, etc. KDD is very important for
organizations and businesses as it enables them to derive new insights and knowledge from their data,
which can be further used to improve decision-making, enhance the customer experience, improve
business processes, support strategic planning, optimize operations, and drive business growth.

KDD Process in Data Mining

The KDD process in data mining is a multi-step process that involves various stages to extract useful knowledge
from large datasets. The following are the main steps involved in the KDD process -

• Data Selection - The first step in the KDD process is identifying and selecting the relevant data for
analysis. This involves choosing the relevant data sources, such as databases, data warehouses, and
data streams, and determining which data is required for the analysis.
• Data Preprocessing - After selecting the data, the next step is data preprocessing. This step involves
cleaning the data, removing outliers, and handling missing, inconsistent, or irrelevant data. This step is
critical, as data quality can significantly impact the accuracy and effectiveness of the analysis.
• Data Transformation - Once the data is preprocessed, the next step is to transform it into a format that
data mining techniques can analyze. This step involves reducing the data dimensionality, aggregating the
data, normalizing it, and discretizing it to prepare it for further analysis.
• Data Mining - This is the heart of the KDD process and involves applying various data mining techniques
to the transformed data to discover hidden patterns, trends, relationships, and insights. A few of the most
common data mining techniques include clustering, classification, association rule mining, and anomaly
detection.
• Pattern Evaluation - After the data mining, the next step is to evaluate the discovered patterns to
determine their usefulness and relevance. This involves assessing the quality of the patterns, evaluating
their significance, and selecting the most promising patterns for further analysis.
• Knowledge Representation - This step involves representing the knowledge extracted from the data in a
way humans can easily understand and use. This can be done through visualizations, reports, or other
forms of communication that provide meaningful insights into the data.
• Deployment - The final step in the KDD process is to deploy the knowledge and insights gained from the
data mining process to practical applications. This involves integrating the knowledge into decision-
making processes or other applications to improve organizational efficiency and effectiveness.

In summary, the KDD process in data mining involves several steps to extract useful knowledge from large
datasets. It is a comprehensive and iterative process that requires careful consideration of each step to ensure
the accuracy and effectiveness of the analysis. A short code sketch of the early KDD steps is given below.
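The sketch maps the first three KDD steps (data selection, preprocessing, and transformation) onto simple pandas operations. The table, column names, and thresholds are hypothetical and serve only to illustrate the flow.

import pandas as pd

raw = pd.DataFrame({
    "age": [25, 32, None, 51, 200],               # contains a missing value and an outlier
    "income": [40000, 52000, 61000, 75000, 68000],
})

# Data selection: keep only the columns relevant to the question
data = raw[["age", "income"]].copy()

# Data preprocessing: drop impossible values and impute missing ones
data = data[data["age"].isna() | (data["age"] < 120)]
data["age"] = data["age"].fillna(data["age"].median())

# Data transformation: normalize income to [0, 1] and discretize age into bins
data["income_norm"] = (data["income"] - data["income"].min()) / (data["income"].max() - data["income"].min())
data["age_bin"] = pd.cut(data["age"], bins=[0, 18, 40, 65, 120], labels=["child", "young", "adult", "senior"])
print(data)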

Advantages of KDD in Data Mining

KDD in data mining is a powerful approach for extracting useful knowledge and insights from large datasets. It is
very important for organizations as it has a lot of advantages. Some of the advantages of KDD in data mining are -

• Helps in Decision Making - KDD can help make informed and data-driven decisions by discovering
hidden patterns, trends, and relationships in data that might not be immediately apparent.
• Improves Business Performance - KDD can help organizations improve their business performance by
identifying areas for improvement, optimizing processes, and reducing costs.
• Saves Time and Resources - KDD can help save time and resources by automating the data analysis
process and identifying the most relevant and significant information or knowledge.
• Increases Efficiency - KDD can help organizations streamline their processes, optimize their resources,
and increase their overall efficiency.
• Enhances Customer Experience - KDD can help organizations improve customer experience by
understanding customer behavior, preferences, and requirements and giving personalized products and
services.
• Fraud Detection - KDD can help detect fraud and identify fraudulent behavior by analyzing patterns in
data and identifying anomalies or unusual behavior.

Disadvantages of KDD in Data Mining

While KDD (Knowledge Discovery in Databases) is a powerful approach to extracting useful knowledge and
insights from large datasets, there are also some potential disadvantages to consider -

• Requires High-Quality Data - KDD relies on high-quality data to generate accurate and meaningful
insights. If the data is incomplete, inconsistent, or of poor quality, it can lead to inaccurate, misleading
results and flawed conclusions.
• Complexity - KDD is a complex and time-consuming process that requires specialized skills and
knowledge to perform effectively. The complexity can also make interpreting and communicating the
results challenging to non-experts.
• Privacy and Compliance Concerns - KDD can raise ethical concerns related to privacy, compliance,
bias, and discrimination. For example, data mining techniques can extract sensitive information about
individuals without their consent or reinforce existing biases or stereotypes.
• High Cost - KDD can be expensive, requiring specialized software, hardware, and skilled professionals
to perform the analysis. The cost can be prohibitive for smaller organizations or those with limited
resources.

When we talk about data mining, we usually discuss knowledge discovery from data. To understand the data,
it is necessary to discuss data objects, data attributes, and the types of data attributes. Mining data
involves understanding the data and finding relationships within it, and for this we need to discuss data
objects and attributes.
Data objects are an essential part of a database. A data object represents an entity and can be seen as a
group of attributes describing that entity. For example, in a sales database, data objects may represent
customers, sales, or purchases. When data objects are stored as rows in a database, they are called data tuples.

Attribute:
An attribute is a data field that represents a characteristic or feature of a data object. For a customer
object, attributes can be customer ID, address, etc. The set of attributes used to describe a given object
is known as an attribute vector or feature vector.

Type of attributes :
Identifying attribute types is the first step of data preprocessing: we distinguish between the different
types of attributes and then preprocess the data accordingly. Attributes fall into two broad groups:
1. Qualitative (Nominal (N), Ordinal (O), Binary (B))
2. Quantitative (Numeric, Discrete, Continuous)

Qualitative Attributes:

1. Nominal Attributes – related to names: The values of a nominal attribute are names of things or symbols.
The values represent some category or state, which is why nominal attributes are also referred to as
categorical attributes, and there is no order (rank, position) among the values of a nominal attribute.
Example: hair colour (black, brown, grey) or marital status (single, married, divorced); the values are just labels with no inherent order.

2. Binary Attributes: A binary attribute has only two values/states, for example yes or no, affected or
unaffected, true or false.
• Symmetric: Both values are equally important (e.g., gender).
• Asymmetric: Both values are not equally important (e.g., a test result, where the positive outcome matters more).

3. Ordinal Attributes: An ordinal attribute has values with a meaningful sequence or ranking (order) between
them, but the magnitude of the difference between values is not known; the order shows which value ranks
higher, but not by how much.

Quantitative Attributes:

1. Numeric: A numeric attribute is quantitative because it is a measurable quantity, represented by integer
or real values. Numeric attributes are of two types: interval and ratio.
• An interval-scaled attribute has values whose differences are interpretable, but there is no true
reference point (zero point). Interval-scaled data can be added and subtracted but cannot meaningfully be
multiplied or divided. Consider temperature in degrees Centigrade: if one day's temperature is twice the
value of another day's, we cannot say that the first day is twice as hot.
• A ratio-scaled attribute is a numeric attribute with a fixed zero point. If a measurement is ratio-scaled,
we can speak of a value as being a multiple (or ratio) of another value. The values are ordered, we can
compute the differences between values, and the mean, median, mode, quantile range, and five-number
summary can be given.

2. Discrete: A discrete attribute has a finite or countably infinite set of values; these values can be
numerical or categorical.

Example: the number of students in a class, ZIP codes, or a person's profession.

3. Continuous: A continuous attribute has an infinite number of possible values and is typically represented
as a floating-point number; there can be many values between, say, 2 and 3.
Example: height, weight, and temperature, which can take any real value within a range.
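As an illustration of the attribute types above, the short sketch below represents hypothetical customer data in pandas; the column names and values are assumptions.

import pandas as pd

customers = pd.DataFrame({
    "city": ["Delhi", "Mumbai", "Delhi"],            # nominal: labels with no order
    "owns_card": [True, False, True],                # binary: two states
    "satisfaction": ["low", "high", "medium"],       # ordinal: ordered categories
    "age": [23, 41, 35],                             # numeric, discrete
    "height_m": [1.62, 1.78, 1.70],                  # numeric, continuous
})

# Ordinal values get an explicit order; nominal values do not
customers["satisfaction"] = pd.Categorical(
    customers["satisfaction"], categories=["low", "medium", "high"], ordered=True)
print(customers.dtypes)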

Data mining refers to extracting or mining knowledge from large amounts of data. In other words, data mining
is the science, art, and technology of exploring large and complex bodies of data in order to discover useful
patterns. Theoreticians and practitioners are continually seeking improved techniques to make the process
more efficient, cost-effective, and accurate. Any situation can be analyzed in two ways in data mining:
• Statistical Analysis: In statistics, data is collected, analyzed, explored, and presented to identify
patterns and trends. Alternatively, it is referred to as quantitative analysis.
• Non-statistical Analysis: This analysis provides generalized information and includes sound, still
images, and moving images.
In statistics, there are two main categories (a short example follows this list):
• Descriptive Statistics: The purpose of descriptive statistics is to organize data and identify the
main characteristics of that data. Graphs or numbers summarize the data. Average, mode, standard
deviation (SD), and correlation are some of the commonly used descriptive statistical methods.
• Inferential Statistics: The process of drawing conclusions based on probability theory and
generalizing from the data. By analyzing sample statistics, you can infer parameters about populations
and build models of relationships within the data.
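A small example of descriptive statistics in pandas follows; the study-hours data is invented purely for illustration.

import pandas as pd

scores = pd.DataFrame({"hours_studied": [2, 4, 4, 6, 8],
                       "exam_score": [55, 60, 65, 80, 90]})

print("Mean score:", scores["exam_score"].mean())                  # average
print("Mode of hours:", scores["hours_studied"].mode().tolist())   # most frequent value
print("Standard deviation:", scores["exam_score"].std())           # spread of scores
print("Correlation:")
print(scores.corr())                                               # relationship between columns
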
There are various statistical terms that one should be aware of while dealing with statistics. Some of these
are:
• Population
• Sample
• Variable
• Quantitative Variable
• Qualitative Variable
• Discrete Variable
• Continuous Variable
Now, let’s discuss statistical methods: the analysis of raw data using mathematical formulas, models, and
techniques. Through the use of statistical methods, information is extracted from research data, and
different ways are available to judge the robustness of research outputs.
In fact, the statistical methods used in data mining today are typically drawn from the vast statistical
toolkit developed to answer problems arising in other fields, and they are taught in standard science
curricula. It is often necessary to formulate and test several hypotheses; doing so helps assess the
validity of a data mining endeavour when attempting to draw inferences from the data under study. These
issues become more pronounced when more complex and sophisticated statistical estimators and tests are used.
For extracting knowledge from databases containing different types of observations, a variety of statistical
methods are available in Data Mining and some of these are:
• Logistic regression analysis
• Correlation analysis
• Regression analysis
• Discriminant analysis
• Linear discriminant analysis (LDA)
• Classification
• Clustering
• Outlier detection
• Classification and regression trees
• Correspondence analysis
• Nonparametric regression
• Statistical pattern recognition
• Categorical data analysis
• Time-series methods for trends and periodicity
• Artificial neural networks
Now, let’s try to understand some of the important statistical methods which are used in data mining:
Linear Regression: The linear regression method uses the best linear relationship between the independent
and dependent variables to predict the target variable. To achieve the best fit, the distances between the
fitted line and the actual observations at each point should be as small as possible; a good fit is one for
which no other line would produce a smaller total error. Simple linear regression and multiple linear
regression are the two major types of linear regression. Simple linear regression predicts the dependent
variable by fitting a linear relationship to a single independent variable, while multiple linear regression
fits the best linear relationship between several independent variables and the dependent variable.
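A minimal sketch of simple and multiple linear regression with scikit-learn is shown below; the synthetic data and coefficients are assumptions used only to demonstrate the fit.

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 2))                    # two independent variables
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 1, 100)    # dependent variable with noise

# Multiple linear regression: fit the best linear relationship to both predictors
multi = LinearRegression().fit(X, y)
print("Coefficients:", multi.coef_, "Intercept:", multi.intercept_)

# Simple linear regression: a single predictor only
simple = LinearRegression().fit(X[:, [0]], y)
print("Simple-regression slope:", simple.coef_[0])
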
Classification: This is a data mining method in which a collection of data is categorized into classes so
that it can be analyzed and predictions can be made more accurately. Classifying very large datasets is an
effective way to analyze them, and classification is one of several methods aimed at improving the efficiency
of the analysis process. Logistic regression and discriminant analysis stand out as two major classification techniques.
Logistic Regression: Logistic regression is widely applied in machine learning and predictive analytics. In
this approach, the dependent variable is either binary (binary logistic regression) or multinomial
(multinomial logistic regression), i.e., it takes one of two, or one of several, possible categories. With a
logistic regression equation, one can estimate probabilities describing the relationship between the
independent variables and the dependent variable.
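An illustrative logistic regression sketch is given below; the breast-cancer dataset is only a convenient stand-in for a binary dependent variable.

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)            # binary dependent variable (0/1)
model = LogisticRegression(max_iter=5000).fit(X, y)

# Estimated probabilities of each class for the first observation
print("P(class 0), P(class 1):", model.predict_proba(X[:1])[0])
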
Discriminant Analysis: Discriminant analysis is a statistical method that analyzes data based on measurements
of categories or clusters and assigns new observations to one of several populations identified a priori.
Discriminant analysis models the distribution of the predictors separately in each response class and then
uses Bayes' theorem to flip these around into estimates of the probability of each response category given
the value of X. These models can be either linear or quadratic.
Linear Discriminant Analysis: In linear discriminant analysis (LDA), each observation is assigned a
discriminant score that is used to classify it into a response-variable class. These scores are obtained by
combining the independent variables in a linear fashion. The model assumes that observations are drawn from
a Gaussian distribution and that the predictor variables share a common covariance matrix across all k
levels of the response variable Y.
Quadratic Discriminant Analysis: Quadratic discriminant analysis (QDA) provides an alternative approach.
LDA and QDA both assume Gaussian distributions for the observations within each class of Y. Unlike LDA,
QDA allows each class to have its own covariance matrix, so the predictor variables may have different
variances (and covariances) across the k levels of Y.
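The short sketch below contrasts LDA and QDA as described above on a toy dataset; the dataset choice is an assumption for illustration only.

from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                            QuadraticDiscriminantAnalysis)

X, y = load_iris(return_X_y=True)

lda = LinearDiscriminantAnalysis().fit(X, y)       # shared covariance across classes
qda = QuadraticDiscriminantAnalysis().fit(X, y)    # per-class covariance matrices

print("LDA training accuracy:", lda.score(X, y))
print("QDA training accuracy:", qda.score(X, y))
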
Correlation Analysis: In statistical terms, correlation analysis captures the strength of the relationship
between a pair of variables. The values of such variables are usually stored in the columns or rows of a
database table and represent properties of an object.
Regression Analysis: Based on a set of numeric data, regression is a data mining method that predicts a
range of numerical values (also known as continuous values). You could, for instance, use regression to
predict the cost of goods and services based on other variables. A regression model is used across numerous
industries for forecasting financial data, modeling environmental conditions, and analyzing trends.
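A small numpy sketch of correlation analysis and a simple regression prediction follows; the paired advertising and sales figures are invented for illustration.

import numpy as np

advertising = np.array([1.0, 2.0, 3.0, 4.0, 5.0])    # e.g. spend per month
sales = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Correlation analysis: strength of the linear relationship between the pair
r = np.corrcoef(advertising, sales)[0, 1]
print("Pearson correlation:", round(r, 3))

# Regression analysis: fit a line and predict a numeric value from it
slope, intercept = np.polyfit(advertising, sales, deg=1)
print("Predicted sales at spend 6.0:", slope * 6.0 + intercept)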

Why is Data Preprocessing Important

Data Preprocessing is an important step in the Data Preparation stage of a Data Science development lifecycle
that will ensure reliable, robust, and consistent results. The main objective of this step is to ensure and check
the quality of data before applying any Machine Learning or Data Mining methods. Let’s review some of its
benefits -

• Accuracy - Data Preprocessing will ensure that input data is accurate and reliable by ensuring there are
no manual entry errors, no duplicates, etc.
• Completeness - It ensures that missing values are handled, and data is complete for further analysis.
• Consistent - Data Preprocessing ensures that input data is consistent, i.e., the same data kept in
different places should match.
• Timeliness - It checks whether data is updated regularly and available on time.
• Trustworthiness - It checks whether data comes from trustworthy sources.
• Interpretability - Raw data is generally unusable, and Data Preprocessing converts raw data into an
interpretable format.

Key Steps in Data Preprocessing

Let’s explore a few of the key steps involved in the Data Preprocessing stage -

Data Cleaning

Data Cleaning uses methods to handle incorrect, incomplete, inconsistent, or missing values; a short code
sketch follows this list. Some of the techniques for Data Cleaning include -

• Handling Missing Values


o Input data can contain missing or NULL values, which must be handled before applying any
Machine Learning or Data Mining techniques.
o Missing values can be handled by many techniques, such as removing rows/columns containing
NULL values and imputing NULL values using mean, mode, regression, etc.
• De-noising
o De-noising is a process of removing noise from the data. Noisy data is meaningless data that is
not interpretable or understandable by machines or humans. It can occur due to data entry errors,
faulty data collection, etc.
o De-noising can be performed by applying many techniques, such as binning the features, using
regression to smoothen the features to reduce noise, clustering to detect the outliers, etc.
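The sketch below shows one possible way to handle missing values and to de-noise a feature by binning, using pandas; the column names and values are hypothetical.

import pandas as pd

df = pd.DataFrame({"age": [25, None, 47, 33, 29],
                   "income": [42000, 51000, 1000000, 58000, 46000]})

# Handling missing values: impute with the column mean (rows could also be dropped)
df["age"] = df["age"].fillna(df["age"].mean())

# De-noising by binning: replace each income with the mean of its quartile bin
df["income_bin"] = pd.qcut(df["income"], q=4)
df["income_smoothed"] = df.groupby("income_bin", observed=True)["income"].transform("mean")
print(df)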

Data Integration

Data Integration can be defined as combining data from multiple sources; a short sketch follows this list.
A few of the issues to be considered during Data Integration include the following -

• Entity Identification Problem - It can be defined as identifying objects/features from multiple databases
that correspond to the same entity. For example, customer_id in database A and customer_number in
database B may belong to the same entity.
• Schema Integration - It is used to merge two or more database schema/metadata into a single schema.
It essentially takes two or more schema as input and determines a mapping between them. For example,
entity type CUSTOMER in one schema may have CLIENT in another schema.
• Detecting and Resolving Data Value Concepts - The data can be stored in various ways in different
databases, and it needs to be taken care of while integrating them into a single dataset. For example,
dates can be stored in various formats such as DD/MM/YYYY, YYYY/MM/DD, or MM/DD/YYYY, etc.
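An illustrative sketch of the integration issues above is shown below, matching entity identifiers and resolving date formats before merging; all table and column names are assumptions.

import pandas as pd

db_a = pd.DataFrame({"customer_id": [1, 2], "signup": ["05/01/2023", "12/03/2023"]})
db_b = pd.DataFrame({"customer_number": [1, 2], "last_order": ["2023-06-30", "2023-07-15"]})

# Entity identification: customer_id and customer_number refer to the same entity
db_b = db_b.rename(columns={"customer_number": "customer_id"})

# Resolve differing date representations before merging
db_a["signup"] = pd.to_datetime(db_a["signup"], format="%d/%m/%Y")
db_b["last_order"] = pd.to_datetime(db_b["last_order"], format="%Y-%m-%d")

merged = db_a.merge(db_b, on="customer_id")
print(merged)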

Data Reduction

Data Reduction is used to reduce the volume or size of the input data. Its main objective is to reduce
storage and analysis costs and improve storage efficiency; a short PCA sketch follows this list. A few of
the popular techniques to perform Data Reduction include -

• Dimensionality Reduction - It is the process of reducing the number of features in the input dataset. It
can be performed in various ways, such as selecting features with the highest importance, Principal
Component Analysis (PCA), etc.
• Numerosity Reduction - In this method, various techniques can be applied to reduce the volume of data
by choosing alternative smaller representations of the data. For example, a variable can be approximated
by a regression model, and instead of storing the entire variable, we can store the regression model to
approximate it.
• Data Compression - In this method, data is compressed. Data Compression can be lossless or lossy
depending on whether the information is lost or not during compression.
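A brief dimensionality-reduction sketch using PCA follows, as one possible data reduction technique; the digits dataset and the number of components are assumptions.

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)                 # 64 features per sample
pca = PCA(n_components=10).fit(X)
X_reduced = pca.transform(X)

print("Original shape:", X.shape, "Reduced shape:", X_reduced.shape)
print("Variance retained:", round(pca.explained_variance_ratio_.sum(), 3))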

Data Transformation

Data Transformation is a process of converting data into a format that helps in building efficient ML models
and deriving better insights; a short sketch follows this list. A few of the most common methods for Data
Transformation include -

• Smoothing - Data Smoothing is used to remove noise in the dataset, and it helps identify important
features and detect patterns. Therefore, it can help in predicting trends or future events.
• Aggregation - Data Aggregation is the process of transforming large volumes of data into an organized
and summarized format that is more understandable and comprehensive. For example, a company
may look at monthly sales data of a product instead of raw sales data to understand its performance
better and forecast future sales.
• Discretization - Data Discretization is a process of converting numerical or continuous variables into a
set of intervals/bins. This makes data easier to analyze. For example, the age features can be converted
into various intervals such as (0-10, 11-20, ..) or (child, young, …).
• Normalization - Data Normalization is a process of converting a numeric variable into a specified range
such as [-1,1], [0,1], etc. A few of the most common approaches to performing normalization are Min-
Max Normalization, Data Standardization or Data Scaling, etc.
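A short sketch of normalization and discretization with scikit-learn preprocessing utilities is given below; the age values and bin settings are assumptions.

import numpy as np
from sklearn.preprocessing import KBinsDiscretizer, MinMaxScaler

ages = np.array([[5], [17], [26], [43], [68]])

# Normalization: rescale values to the [0, 1] range
scaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(ages)

# Discretization: convert the continuous ages into 3 ordinal bins
binned = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform").fit_transform(ages)

print("Normalized:", scaled.ravel())
print("Binned:", binned.ravel())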

Applications of Data Preprocessing

Data Preprocessing is important in the early stages of a Machine Learning and AI application development
lifecycle. A few of the most common usage or application include -

• Improved Accuracy of ML Models - Preprocessing techniques such as Data Cleaning and Data Transformation
ensure that data is complete, accurate, and understandable, resulting in efficient and accurate ML models.
• Reduced Costs - Data Reduction techniques can help companies save storage and compute costs by
reducing the volume of the data.
• Visualization - Preprocessed data is easily consumable and understandable and can be further used to
build dashboards to gain valuable insights.

Data similarity and dissimilarity are important measures in data mining that help in identifying patterns and
trends in datasets. Similarity measures are used to determine how similar two datasets or data points are, while
dissimilarity measures are used to determine how different they are. In this section, we will discuss some
commonly used measures of similarity and dissimilarity in data mining.

Basics of Similarity and Dissimilarity Measures

Similarity Measure

• A similarity measure is a mathematical function that quantifies the degree of similarity between two
objects or data points. It is a numerical score measuring how alike two data points are.
• It takes two data points as input and produces a similarity score as output, typically ranging from 0
(completely dissimilar) to 1 (identical or perfectly similar).
• A similarity measure can be based on various mathematical techniques such as Cosine similarity,
Jaccard similarity, and Pearson correlation coefficient.
• Similarity measures are generally used to identify duplicate records, equivalent instances, or identifying
clusters.

Dissimilarity Measure

• A dissimilarity measure is a mathematical function that quantifies the degree of dissimilarity between
two objects or data points. It is a numerical score measuring how different two data points are.
• It takes two data points as input and produces a dissimilarity score as output, ranging from 0 (identical or
perfectly similar) to 1 (completely dissimilar). A few dissimilarity measures also have infinity as their
upper limit.
• A dissimilarity measure can be obtained by using different techniques such as Euclidean distance,
Manhattan distance, and Hamming distance.
• Dissimilarity measures are often used in identifying outliers, anomalies, or clusters. A short sketch of both kinds of measures follows.
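The sketch below computes one similarity measure (cosine similarity) and one dissimilarity measure (Euclidean distance) for two example vectors; the vectors are arbitrary.

import numpy as np
from numpy.linalg import norm

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 5.0])

# Cosine similarity: close to 1 means the vectors point in similar directions
cosine_sim = np.dot(a, b) / (norm(a) * norm(b))

# Euclidean distance: 0 means identical points; larger values mean more dissimilar
euclidean_dist = norm(a - b)

print("Cosine similarity:", round(cosine_sim, 3))
print("Euclidean distance:", round(euclidean_dist, 3))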

Similarity and Dissimilarity Measures for Different Data Types

• For nominal variables, these measures are binary, indicating whether two values are equal or not.
• For ordinal variables, the measure is based on the difference between two values, normalized by the
maximum possible distance. For the remaining (numeric) variables, it is simply a distance function.
