Data Mining Unit 1 (MSC Ds 3 Sem)
The data mining tutorial provides basic and advanced concepts of data mining. Our data mining tutorial
is designed for learners and experts.
Data mining is one of the most useful techniques that help entrepreneurs, researchers, and individuals
to extract valuable information from huge sets of data. Data mining is also called Knowledge Discovery
in Database (KDD). The knowledge discovery process includes Data cleaning, Data integration, Data
selection, Data transformation, Data mining, Pattern evaluation, and Knowledge presentation.
Our Data mining tutorial includes all topics of Data mining such as applications, Data mining vs
Machine learning, Data mining tools, Social Media Data mining, Data mining techniques, Clustering
in data mining, Challenges in Data mining, etc.
The process of extracting information from huge sets of data to identify patterns, trends, and useful data that allow a business to take data-driven decisions is called Data Mining.
In other words, we can say that Data Mining is the process of investigating hidden patterns in information from various perspectives and categorizing it into useful data. This data is collected and assembled in particular areas such as data warehouses, analyzed efficiently with data mining algorithms, and used to support decision-making and other data requirements, ultimately to cut costs and generate revenue.
Data mining is the act of automatically searching large stores of information to find trends and patterns that go beyond simple analysis procedures. Data mining utilizes complex mathematical algorithms to segment the data and evaluate the probability of future events. Data Mining is also called
Knowledge Discovery of Data (KDD).
Data Mining is a process used by organizations to extract specific data from huge databases to solve
business problems. It primarily turns raw data into useful information.
Data Mining is similar to Data Science: it is carried out by a person, in a specific situation, on a particular data set, with an objective. This process includes various types of services such as text mining, web mining, audio and video mining, pictorial data mining, and social media mining. It is done through software that is simple or highly specialized. By outsourcing data mining, all the work can be done faster with low operating costs. Specialized firms can also use new technologies to collect data that is impossible to locate manually. There are tons of information available on various platforms, but very little knowledge is accessible. The biggest challenge is to analyze the data to extract important information that can be used to solve a problem or support company development. There are many powerful instruments and techniques available to mine data and find better insights from it.
Relational Database:
A relational database is a collection of multiple data sets formally organized by tables, records, and
columns from which data can be accessed in various ways without having to reorganize the database tables. Tables convey and share information, which facilitates data searchability, reporting, and
organization.
Data warehouses:
A Data Warehouse is the technology that collects the data from various sources within the organization
to provide meaningful business insights. The huge amount of data comes from multiple places such as
Marketing and Finance. The extracted data is utilized for analytical purposes and helps in decision-
making for a business organization. The data warehouse is designed for the analysis of data rather than
transaction processing.
Data Repositories:
The Data Repository generally refers to a destination for data storage. However, many IT professionals use the term more specifically to refer to a particular kind of setup within an IT structure. For example, a
group of databases, where an organization has kept various kinds of information.
Object-Relational Database:
A combination of an object-oriented database model and relational database model is called an object-
relational model. It supports Classes, Objects, Inheritance, etc.
One of the primary objectives of the Object-relational data model is to close the gap between the
Relational database and the object-oriented model practices frequently utilized in many programming
languages, for example, C++, Java, C#, and so on.
Transactional Database:
A transactional database refers to a database management system (DBMS) that can roll back a database transaction if it is not completed appropriately. Even though this was a unique capability long ago, today most relational database systems support transactional database activities.
Disadvantages of Data Mining:
o There is a probability that organizations may sell useful data about their customers to other organizations for money. According to reports, American Express has sold credit card purchase details of its customers to other organizations.
o Many data mining analytics software packages are difficult to operate and require advanced training to work on.
o Different data mining instruments operate in distinct ways due to the different algorithms
used in their design. Therefore, the selection of the right data mining tools is a very challenging
task.
o Data mining techniques are not always precise, which may lead to severe consequences in certain conditions.
The following are areas where data mining is widely used:
Data Mining in Healthcare:
Data mining in healthcare has excellent potential to improve the health system. It uses data and analytics for better insights and to identify best practices that will enhance health care services and reduce costs. Analysts use data mining approaches such as machine learning, multi-dimensional databases, data visualization, soft computing, and statistics. Data mining can be used to forecast the number of patients in each category. The procedures ensure that patients get intensive care at the right place and at the right time. Data mining also enables healthcare insurers to recognize fraud and abuse.
Market Basket Analysis:
Market basket analysis is a modeling method based on the hypothesis that if you buy a specific group of products, you are more likely to buy another group of products. This technique may enable the retailer to understand the purchase behavior of a buyer. This data may assist the retailer in understanding the requirements of the buyer and altering the store's layout accordingly. Using differential analysis, results can be compared between various stores and between customers in different demographic groups.
Data Mining in Education:
Educational data mining is a newly emerging field concerned with developing techniques that explore knowledge from the data generated in educational environments. EDM objectives are identified as predicting students' future learning behavior, studying the impact of educational support, and promoting learning science. An institution can use data mining to make precise decisions and also to predict students' results. With these results, the institution can concentrate on what to teach and how to teach.
Data Mining in Manufacturing Engineering:
Knowledge is the best asset possessed by a manufacturing company. Data mining tools can be beneficial for finding patterns in complex manufacturing processes. Data mining can be used in system-level design to discover the relationships between product architecture, product portfolio, and customer data needs. It can also be used to forecast the product development period, cost, and expectations, among other tasks.
Data Mining in CRM:
Customer Relationship Management (CRM) is all about acquiring and retaining customers, enhancing customer loyalty, and implementing customer-oriented strategies. To maintain a good relationship with customers, a business organization needs to collect data and analyze it. With data mining technologies, the collected data can be used for analytics.
Data Mining in Fraud Detection:
Billions of dollars are lost to fraud every year. Traditional methods of fraud detection are time-consuming and complex. Data mining helps derive meaningful patterns and turn data into information. An ideal fraud detection system should protect the data of all users. Supervised methods use a collection of sample records classified as fraudulent or non-fraudulent. A model is constructed from this data, and the technique is then used to identify whether a new record is fraudulent or not.
Data Mining in Lie Detection:
Apprehending a criminal is not the hardest part; bringing out the truth from a suspect is the real challenge. Law enforcement may use data mining techniques to investigate offenses, monitor suspected terrorist communications, and so on. This also includes text mining, which seeks meaningful patterns in data that is usually unstructured text. The information collected from previous investigations is compared, and a model for lie detection is constructed.
Data Mining in Financial Banking:
The digitalization of the banking system generates an enormous amount of data with every new transaction. Data mining techniques can help bankers solve business-related problems in banking and finance by identifying trends, causalities, and correlations in business information and market prices that are not immediately evident to managers or executives because the data volume is too large or is produced too rapidly for experts to inspect. Managers can use this data for better targeting, acquiring, retaining, and segmenting profitable customers.
Although data mining is very powerful, it faces many challenges during its execution. Various
challenges could be related to performance, data, methods, and techniques, etc. The process of data
mining becomes effective when the challenges or problems are correctly recognized and adequately
resolved.
Incomplete and noisy data:
The process of extracting useful data from large volumes of data is data mining. Real-world data is heterogeneous, incomplete, and noisy. Data in huge quantities will usually be inaccurate or unreliable. These problems may occur due to data measuring instruments or because of human errors. Suppose a retail chain collects the phone numbers of customers who spend more than $500, and the accounting employees put the information into their system. The person may make a digit mistake when entering a phone number, which results in incorrect data. Some customers may not be willing to disclose their phone numbers, which results in incomplete data. The data could also get changed due to human or system error. All these consequences (noisy and incomplete data) make data mining challenging.
Data Distribution:
Real-world data is usually stored on various platforms in a distributed computing environment. It might be in a database, in individual systems, or even on the internet. In practice, it is quite a tough task to bring all the data into a centralized data repository, mainly due to organizational and technical concerns. For example, various regional offices may have their own servers to store their data, and it is not feasible to store all the data from all the offices on a central server. Therefore, data mining requires the development of tools and algorithms that allow the mining of distributed data.
Complex Data:
Real-world data is heterogeneous, and it could be multimedia data, including audio and video, images,
complex data, spatial data, time series, and so on. Managing these various types of data and extracting
useful information is a tough task. Most of the time, new technologies, new tools, and methodologies
would have to be refined to obtain specific information.
Performance:
The data mining system's performance relies primarily on the efficiency of algorithms and techniques
used. If the designed algorithm and techniques are not up to the mark, then the efficiency of the data
mining process will be affected adversely.
Data Privacy and Security:
Data mining usually leads to serious issues in terms of data security, governance, and privacy. For example, if a retailer analyzes the details of purchased items, it reveals data about the buying habits and preferences of customers without their permission.
Data Visualization:
In data mining, data visualization is a very important process because it is the primary method that
shows the output to the user in a presentable way. The extracted data should convey the exact meaning
of what it intends to express. But many times, representing the information to the end-user in a precise
and easy way is difficult. Because the input data and the output information are complicated, very efficient and successful data visualization processes need to be implemented.
There are many more challenges in data mining in addition to the above-mentioned problems. More
problems are disclosed as the actual data mining process begins, and the success of data mining relies
on getting rid of all these difficulties.
Prerequisites
Before learning the concepts of Data Mining, you should have a basic understanding of Statistics,
Database Knowledge, and Basic programming language.
Data mining includes the utilization of refined data analysis tools to find previously unknown,
valid patterns and relationships in huge data sets. These tools can incorporate statistical
models, machine learning techniques, and mathematical algorithms, such as neural networks
or decision trees. Thus, data mining incorporates analysis and prediction.
Depending on various methods and technologies from the intersection of machine learning,
database management, and statistics, professionals in data mining have devoted their careers
to better understanding how to process and make conclusions from the huge amount of data,
but what are the methods they use to make it happen?
In recent data mining projects, various major data mining techniques have been developed
and used, including association, classification, clustering, prediction, sequential patterns, and
regression.
1. Classification:
This technique is used to obtain important and relevant information about data and metadata.
This data mining technique helps to classify data in different classes.
i. Classification of Data mining frameworks as per the type of data sources mined:
This classification is as per the type of data handled, for example, multimedia, spatial data, text data, time-series data, World Wide Web data, and so on.
ii. Classification of data mining frameworks as per the database involved:
This classification is based on the data model involved, for example, object-oriented database, transactional database, relational database, and so on.
iii. Classification of data mining frameworks as per the kind of knowledge
discovered:
This classification depends on the types of knowledge discovered or the data mining functionalities, for example, discrimination, classification, clustering, characterization, etc. Some frameworks tend to be comprehensive, offering several data mining functionalities together.
iv. Classification of data mining frameworks according to data mining techniques
used:
This classification is as per the data analysis approach utilized, such as neural
networks, machine learning, genetic algorithms, visualization, statistics, data
warehouse-oriented or database-oriented, etc.
The classification can also take into account the level of user interaction involved in
the data mining procedure, such as query-driven systems, autonomous systems, or
interactive exploratory systems.
2. Clustering:
Clustering is a division of information into groups of connected objects. Describing the data by a few clusters inevitably loses certain fine details, but achieves simplification. It models data by its clusters. From a historical point of view, data modeling puts clustering in a framework rooted in statistics, mathematics, and numerical analysis. From a machine learning point of view,
clusters relate to hidden patterns, the search for clusters is unsupervised learning, and the
subsequent framework represents a data concept. From a practical point of view, clustering
plays an extraordinary job in data mining applications. For example, scientific data exploration,
text mining, information retrieval, spatial database applications, CRM, Web analysis,
computational biology, medical diagnostics, and much more.
In other words, we can say that Clustering analysis is a data mining technique to identify
similar data. This technique helps to recognize the differences and similarities between the
data. Clustering is very similar to the classification, but it involves grouping chunks of data
together based on their similarities.
3. Regression:
Regression analysis is a data mining process used to identify and analyze the relationship between variables in the presence of other factors. It is used to estimate the likely value of a specific variable. Regression is primarily a form of planning and modeling. For
example, we might use it to project certain costs, depending on other factors such as
availability, consumer demand, and competition. Primarily it gives the exact relationship
between two or more variables in the given data set.
4. Association Rules:
This data mining technique helps to discover a link between two or more items. It finds a
hidden pattern in the data set.
Association rules are if-then statements that help to show the probability of interactions between data items within large data sets in different types of databases. Association rule mining has several applications and is commonly used to discover sales correlations in transactional data or patterns in medical data sets.
The way the algorithm works is that you have various data, for example, a list of grocery items that you have been buying for the last six months. It calculates the percentage of items being purchased together.
o Lift:
This measure compares the confidence of the rule with how often item B is purchased on its own.
Lift = Confidence / ((Item B) / (Entire dataset))
o Support:
This measure shows how often items A and B are purchased together, compared to the overall dataset.
Support = (Transactions containing both Item A and Item B) / (Entire dataset)
o Confidence:
This measure shows how often item B is purchased when item A is purchased as well.
Confidence = (Transactions containing both Item A and Item B) / (Transactions containing Item A)
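As a rough illustration of these three measures, the Python sketch below computes support, confidence, and lift for a rule A => B over a small list of transactions; the item names and data are invented for the example.

# Hedged sketch: support, confidence, and lift for the rule A => B
# over a toy transaction list (all data below is illustrative).
transactions = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "eggs"},
]

def support(itemset, transactions):
    # Fraction of transactions that contain every item in `itemset`.
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

A, B = {"milk"}, {"eggs"}
support_ab = support(A | B, transactions)            # P(A and B)
confidence = support_ab / support(A, transactions)   # P(B | A)
lift = confidence / support(B, transactions)         # confidence relative to P(B)
print(f"support={support_ab:.2f}, confidence={confidence:.2f}, lift={lift:.2f}")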
5. Outlier Detection:
This type of data mining technique relates to the observation of data items in a data set that do not match an expected pattern or expected behavior. This technique may be used in various domains such as intrusion detection, fraud detection, etc. It is also known as Outlier Analysis or Outlier Mining. An outlier is a data point that diverges too much from the rest of the dataset. The majority of real-world datasets contain outliers. Outlier detection plays a significant role in the data mining field. Outlier detection is valuable in numerous fields like network intrusion identification, credit or debit card fraud detection, detecting outliers in wireless sensor network data, etc.
6. Sequential Patterns:
The sequential pattern is a data mining technique specialized for evaluating sequential data to discover sequential patterns. It comprises finding interesting subsequences in a set of sequences, where the value of a sequence can be measured in terms of different criteria like length, occurrence frequency, etc.
In other words, this technique of data mining helps to discover or recognize similar patterns in
transaction data over some time.
7. Prediction:
Prediction uses a combination of other data mining techniques, such as trend analysis, clustering, and classification. It analyzes past events or instances in the right sequence to predict a future event.
The process of finding patterns, trends, associations, or significant information within sizable
datasets is known as data mining. In data mining, we analyze the raw data in such a way that we can extract relevant knowledge from structured or unstructured data using a variety of techniques and algorithms. Finding hidden knowledge that can be applied to
prediction, classification, and other data-driven tasks is the goal of data mining.
o Data Collection: Gathering data from various sources, such as databases, websites,
sensors, or logs.
o Data Preprocessing: Cleaning and transforming data to remove noise, handle
missing values, and make it suitable for analysis.
o Exploratory Data Analysis: EDA stands for exploratory data analysis in data mining.
It is the process of first examining a dataset to learn about its characteristics, such as
the distribution of the data and any potential outliers.
o Pattern Discovery: Identifying patterns or relationships in the data, such as
associations, clusters, or predictive models, through algorithms.
o Model Evaluation: Depending on the particular task, the accuracy, precision, recall,
and other metrics of the discovered patterns or models are evaluated to determine
their quality and effectiveness.
o Knowledge Interpretation: In Data mining, knowledge interpretation is a process in
which we turn the discovered pattern into knowledge that can be used in decision-
making across industries, business, healthcare and more.
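As a rough sketch of how these steps might look in code, the example below runs a tiny preprocessing and pattern-discovery pipeline; it assumes pandas and scikit-learn are available, and the column names and values are hypothetical.

# Illustrative sketch of the collection -> preprocessing -> pattern discovery flow.
# Assumes pandas and scikit-learn; the data and column names are hypothetical.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Data collection (here: an in-memory frame standing in for a real source)
df = pd.DataFrame({
    "age":    [23, 45, 31, None, 52, 36],
    "income": [32000, 84000, 51000, 47000, None, 61000],
})

# Data preprocessing: handle missing values and scale the features
df = df.fillna(df.mean(numeric_only=True))
X = StandardScaler().fit_transform(df[["age", "income"]])

# Pattern discovery: group similar records with a simple clustering model
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
df["cluster"] = model.labels_

# Knowledge interpretation: inspect what characterizes each cluster
print(df.groupby("cluster").mean())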
Data mining is essential in industries like marketing (customer segmentation and
recommendation systems), finance (fraud detection and risk assessment), healthcare
(disease diagnosis and treatment planning), and many others where access to large amounts
of data can be used to gain an advantage over competitors or increase the accuracy of
decisions.
While powerful and valuable for drawing insights from data, data mining has difficulties and
problems. Major problems with data mining include:
o Data Quality: The outcomes of data mining can be significantly impacted by poor data
quality, which can include missing values, outliers, inaccuracies, and inconsistencies.
Preprocessing and data cleansing are crucial steps to take to solve this problem.
o Data Security and Privacy: Mining sensitive or private data raises privacy issues. A
crucial issue is ensuring that data mining procedures adhere to privacy laws and
safeguard sensitive personal information about individuals.
o Scalability: Handling large datasets can be difficult from a computational standpoint.
Large-scale data mining tasks require effective algorithms and parallel processing
techniques.
o Complexity and Dimensionality: High-dimensional data may be subject to the "curse
of dimensionality," which makes it difficult to identify significant trends and connections.
Techniques for dimensional reduction are frequently needed.
o Overfitting: Overfitting occurs when a model performs well on training data but poorly
on new, unforeseen data. It is caused by overly complex models that fit the training
data too closely. Methods like regularization and cross-validation address this problem.
o Bias and Fairness: If the data used to train the models are biased, data mining
processes may produce biased or unfair results. Fairness in data mining is a growing
concern, particularly in applications like lending or hiring.
o Interpretability: Some sophisticated machine learning and data mining algorithms can create models that are complex and challenging to understand. It can be difficult to comprehend and justify the outcomes of these models, especially in crucial decision-making domains.
o Algorithm Selection: Deciding which data mining algorithm is best for a given
problem can be challenging. Depending on the attributes of the data and the analysis's
goals, different algorithms may perform better or worse.
o Computational Resources: Data mining tasks may require significant computational
resources like memory and processing power. Managing and gaining access to these
resources can be difficult, particularly for smaller organizations.
o Bias in Training Data: The models may be biased if the training data used to create
them is not representative of the population in the real world. Predictions that are unfair
or inaccurate may result from this bias.
o Lack of Domain Knowledge: A thorough understanding of the studied domain is
frequently necessary for effective data mining. Making informed decisions and
correctly interpreting results can be difficult without domain expertise.
These issues must be addressed to ensure data mining is applied effectively and responsibly.
Technical know-how, ethical considerations, and regulatory compliance are needed.
In data mining, we usually discuss knowledge discovery from data. To learn about the data, it is necessary to discuss data objects, data attributes, and types of data attributes. Mining data includes knowing about the data and finding relations between data, and for this, we need to discuss data objects and attributes.
Data objects are an essential part of a dataset. A data object represents an entity and can be thought of as a group of attributes of that entity. For example, a sales data object may represent customers, sales, or purchases. When data objects are stored in a database, they are called data tuples.
What are Data Attributes?
Data attributes refer to the specific characteristics or properties that describe individual data
objects within a dataset.
These attributes provide meaningful information about the objects and are used to analyze,
classify, or manipulate the data.
Understanding and analyzing data attributes is fundamental in fields such as data mining, machine learning, and data analysis, as they form the basis for deriving insights and making informed decisions from the data.
Within predictive models, attributes serve as the predictors influencing an outcome. In descriptive
models, attributes constitute the pieces of information under examination for inherent patterns or
correlations.
We can say that a set of attributes used to describe a given object is known as an attribute vector or feature vector.
Examples of data attributes include numerical values (e.g., age, height), categorical labels (e.g., color,
type), textual descriptions (e.g., name, description), or any other measurable or qualitative aspect of
the data objects.
Types of attributes:
The initial phase of working with data involves categorizing attributes into different types, which serves as a foundation for subsequent data processing steps. Attributes can be broadly classified into two main types:
1. Qualitative (Nominal (N), Ordinal (O), Binary(B)).
2. Quantitative (Numeric, Discrete, Continuous)
Qualitative Attributes:
1. Nominal Attributes :
Nominal attributes, as related to names, refer to categorical data where the values represent different
categories or labels without any inherent order or ranking. These attributes are often used to represent
names or labels associated with objects, entities, or concepts.
Example: the attribute hair_color with values such as black, brown, blond, red, and grey, or marital_status with values such as single, married, divorced, and widowed.
2. Binary Attributes: Binary attributes are a type of qualitative attribute where the data can take on
only two distinct values or states. These attributes are often used to represent yes/no,
presence/absence, or true/false conditions within a dataset. They are particularly useful for
representing categorical data where there are only two possible outcomes. For instance, in a medical
study, a binary attribute could represent whether a patient is affected or unaffected by a particular
condition.
Symmetric: In a symmetric attribute, both values or states are considered equally important or
interchangeable. For example, in the attribute “Gender” with values “Male” and “Female,” neither
value holds precedence over the other, and they are considered equally significant for analysis
purposes.
Asymmetric: An asymmetric attribute indicates that the two values or states are not equally
important or interchangeable. For instance, in the attribute “Result” with values “Pass” and “Fail,”
the states are not of equal importance; passing may hold greater significance than failing in
certain contexts, such as academic grading or certification exams.
3. Ordinal Attributes : Ordinal attributes are a type of qualitative attribute where the values possess
a meaningful order or ranking, but the magnitude between values is not precisely quantified. In other
words, while the order of values indicates their relative importance or precedence, the numerical
difference between them is not standardized or known.
Example: the attribute grade with values such as A+, A, B+, B, and C, or size with values such as small, medium, and large.
Quantitative Attributes:
1. Numeric: A numeric attribute is quantitative because it is a measurable quantity, represented in integer or real values. Numerical attributes are of two types: interval-scaled and ratio-scaled.
An interval-scaled attribute has values whose differences are interpretable, but the attribute does not have a true reference point (a true zero). Data on an interval scale can be added and subtracted but cannot be multiplied or divided. Consider the example of temperature in degrees Centigrade: if the temperature of one day is twice that of another day, we cannot say that one day is twice as hot as the other.
A ratio-scaled attribute is a numeric attribute with a fixed zero point. If a measurement is ratio-scaled, we can speak of a value as being a multiple (or ratio) of another value. The values are ordered, and we can also compute the difference between values, as well as the mean, median, mode, quantile range, and five-number summary.
2. Discrete : Discrete data refer to information that can take on specific, separate values rather than a
continuous range. These values are often distinct and separate from one another, and they can be
either numerical or categorical in nature.
Example: the number of students in a class or the number of items purchased in a transaction (0, 1, 2, ...).
3. Continuous : Continuous data, unlike discrete data, can take on an infinite number of possible
values within a given range. It is characterized by being able to assume any value within a specified
interval, often including fractional or decimal values.
Example: height, weight, or temperature, which can take any value within a range, including fractional values such as 1.75 m.
• For data preprocessing to be successful, it is essential to have an overall picture of our data. Basic
statistical descriptions can be used to identify properties of the data and highlight which data values
should be treated as noise or outliers.
• For data preprocessing tasks, we want to learn about data characteristics regarding both central
tendency and dispersion of the data.
• Measures of data dispersion include quartiles, interquartile range (IQR) and variance.
• These descriptive statistics are of great help in understanding the distribution of the data.
• We look at various ways to measure the central tendency of data, including the mean, weighted mean, trimmed mean, median, mode, and midrange.
1. Mean :
• The mean of a data set is the average of all the data values. The sample mean x̄ is the point estimator of the population mean μ.
2. Median :
• The median of a data set is the value in the middle when the data items are arranged in ascending
order. Whenever a data set has extreme values, the median is the preferred measure of central
location.
• The median is the measure of location most often reported for annual income and property value data. A few extremely large incomes or property values can inflate the mean.
• For an odd number of observations (7 observations: 26, 18, 29, 12, 14, 27, 19), the numbers in ascending order are 12, 14, 18, 19, 26, 27, 29, and the median is the middle value, 19.
• For an even number of observations (8 observations: 26, 18, 29, 12, 14, 27, 30, 19), the numbers in ascending order are 12, 14, 18, 19, 26, 27, 29, 30, and the median is the average of the two middle values, (19 + 26)/2 = 22.5.
3. Mode:
• The mode of a data set is the value that occurs with the greatest frequency. The greatest frequency can occur at two or more different values. If the data have exactly two modes, the data are bimodal. If the data have more than two modes, the data are multimodal.
• Weighted mean: Sometimes, each value in a set may be associated with a weight, the weights reflect
the significance, importance or occurrence frequency attached to their respective values.
• Trimmed mean: A major problem with the mean is its sensitivity to extreme (e.g., outlier) values.
Even a small number of extreme values can corrupt the mean. The trimmed mean is the mean
obtained after cutting off values at the high and low extremes.
• For example, we can sort the values and remove the top and bottom 2 % before computing the
mean. We should avoid trimming too large a portion (such as 20 %) at both ends as this can result in
the loss of valuable information.
• Holistic measure is a measure that must be computed on the entire data set as a whole. It cannot
be computed by partitioning the given data into subsets and merging the values obtained for the
measure in each subset.
• First quartile (Q1): The first quartile is the value, where 25% of the values are smaller than Q1 and
75% are larger.
• Third quartile (Q3): The third quartile is the value, where 75 % of the values are smaller than Q3 and
25% are larger.
• The box plot is a useful graphical display for describing the behavior of the data in the middle as well
as at the ends of the distributions. The box plot uses the median and the lower and upper quartiles. If
the lower quartile is Q1 and the upper quartile is Q3, then the difference (Q3 - Q1) is called the
interquartile range or IQR.
Variance :
• The variance is a measure of variability that utilizes all the data. It is based on the difference between
the value of each observation (xᵢ) and the mean (x̄ for a sample, μ for a population).
• The variance is the average of the squared differences between each data value and the mean.
Standard Deviation :
• The standard deviation of a data set is the positive square root of the variance. It is measured in the
same units as the data, making it more easily interpreted than the variance.
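The measures discussed above can be computed with Python's standard statistics module and NumPy, as the small sketch below shows; it reuses the eight observations from the median example, and the mode is shown on a separate toy list.

# Sketch of the descriptive statistics discussed above, using the
# eight observations from the median example.
import statistics
import numpy as np

data = [26, 18, 29, 12, 14, 27, 30, 19]

print("mean     =", statistics.mean(data))          # arithmetic average
print("median   =", statistics.median(data))        # average of the two middle values: 22.5
print("mode     =", statistics.mode([1, 2, 2, 3]))  # most frequent value (toy list)

q1, q3 = np.percentile(data, [25, 75])               # first and third quartiles
print("IQR      =", q3 - q1)                         # interquartile range
print("variance =", statistics.variance(data))       # sample variance
print("std dev  =", statistics.stdev(data))          # sample standard deviation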
1. Scatter diagram
• While working with statistical data, it is often observed that there are connections between sets of data. For example, the mass and height of persons are related: the taller the person, the greater his/her mass.
• Scatter diagrams can be used to find out whether or not two sets of data are connected. For example, a scatter diagram can show the relationship between children's age and height.
• A scatter diagram is a tool for analyzing relationship between two variables. One variable is plotted
on the horizontal axis and the other is plotted on the vertical axis.
• The pattern of their intersecting points can graphically show relationship patterns. Commonly a
scatter diagram is used to prove or disprove cause-and-effect relationships.
• While a scatter diagram shows relationships, it does not by itself prove that one variable causes the other. In addition to showing possible cause-and-effect relationships, a scatter diagram can show that two variables stem from a common cause that is unknown, or that one variable can be used as a surrogate for the other.
2. Histogram
• A histogram is used to summarize discrete or continuous data. In a histogram, the data are grouped
into ranges (e.g. 10-19, 20-29) and then plotted as connected bars. Each bar represents a range of
data.
• To construct a histogram from a continuous variable you first need to split the data into intervals,
called bins. Each bin contains the number of occurrences of scores in the data set that are contained
within that bin.
• The width of each bar is proportional to the width of each category and the height is proportional
to the frequency or percentage of that category.
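A minimal matplotlib sketch of building a histogram by splitting a continuous variable into bins; the data here is randomly generated purely for illustration.

# Sketch: summarizing a continuous variable with a histogram.
# Assumes matplotlib and NumPy; the data is synthetic.
import numpy as np
import matplotlib.pyplot as plt

values = np.random.normal(loc=50, scale=10, size=500)  # synthetic measurements

# Split the range into bins; each bar's height is the count of values in that bin
plt.hist(values, bins=range(10, 100, 10), edgecolor="black")
plt.xlabel("Value range (bin)")
plt.ylabel("Frequency")
plt.title("Histogram of a continuous variable")
plt.show()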
3. Line graphs
• Typical examples of the types of data that can be presented using line graphs are monthly rainfall
and annual unemployment rates.
• Line graphs are particularly useful for identifying patterns and trends in the data such as seasonal
effects, large changes and turning points.
• As well as time series data, line graphs can also be appropriate for displaying data that are measured
over other continuous variables such as distance.
• For example, a line graph could be used to show how pollution levels vary with increasing distance
from a source or how the level of a chemical varies with depth of soil.
• In a line graph the x-axis represents the continuous variable (for example year or distance from the
initial measurement) whilst the y-axis has a scale and indicates the measurement.
• Several data series can be plotted on the same line chart and this is particularly useful for analysing
and comparing the trends in different datasets.
• Line graph is often used to visualize rate of change of a quantity. It is more useful when the given
data has peaks and valleys. Line graphs are very simple to draw and quite convenient to interpret.
4. Pie charts
• A pie chart is a type of graph in which a circle is divided into sectors that each represent a proportion of the whole.
Each sector shows the relative size of each value.
• A pie chart displays data, information and statistics in an easy to read "pie slice" format with varying
slice sizes telling how much of one data element exists.
• Pie chart is also known as circle graph. The bigger the slice, the more of that particular data was
gathered. The main use of a pie chart is to show comparisons.
• Various applications of pie charts can be found in business, school and at home. For business pie
charts can be used to show the success or failure of certain products or services.
• At school, pie chart applications include showing how much time is allotted to each subject. At home
pie charts can be useful to see expenditure of monthly income in different needs.
• Reading a pie chart is as easy as figuring out which slice of an actual pie is the biggest. However, legends and labels on pie charts can be hard to align and read.
• The human visual system is more efficient at perceiving and discriminating between lines and line
lengths rather than two-dimensional areas and angles.
Data visualization is the graphical representation of information. In this section, we study what data visualization is and its importance, with use cases.
Introduction
Measuring similarity and dissimilarity in data mining is an important task that helps identify patterns
and relationships in large datasets. To quantify the degree of similarity or dissimilarity between two
data points or objects, mathematical functions called similarity and dissimilarity measures are used.
Similarity measures produce a score that indicates the degree of similarity between two data points,
while dissimilarity measures produce a score that indicates the degree of dissimilarity between two
data points. These measures are crucial for many data mining tasks, such as identifying duplicate
records, clustering, classification, and anomaly detection.
Let’s understand measures of similarity and dissimilarity in data mining and explore various methods
to use these measures.
Similarity Measure
For nominal variables, similarity measures are binary, indicating whether two values are equal or not.
For ordinal variables, dissimilarity is the difference between two values, normalized by the maximum distance. For other variable types, it is simply a distance function.
Distance is a typical measure of dissimilarity between two data points or objects, whereas similarity is
a measure of how similar or alike two data points or objects are. Distance measures typically produce
a non-negative value that increases as the data points become more dissimilar. Distance measures are
fundamental principles for various algorithms, such as KNN, K-Means, etc. On the other hand,
similarity measures typically produce a non-negative value that increases as the data points become
more similar.
Similarity Measures
Similarity measures are mathematical functions used to determine the degree of similarity
between two data points or objects. These measures produce a score that indicates how
similar or alike the two data points are.
It takes two data points as input and produces a similarity score as output, typically ranging
from 0 (completely dissimilar) to 1 (identical or perfectly similar).
Similarity measures also have some well-known properties -
o sim(A, B) = 1 (or maximum similarity) only if A = B
o Typical range: 0 ≤ sim ≤ 1
o Symmetry: sim(A, B) = sim(B, A) for all A and B
Now let’s explore a few of the most commonly used similarity measures in data mining.
Cosine Similarity
Cosine similarity is a widely used similarity measure in data mining and information retrieval. It
measures the cosine of the angle between two non-zero vectors in a multi-dimensional space. In the
context of data mining, these vectors represent the feature vectors of two data points. The cosine
similarity score ranges from 0 to 1, with 0 indicating no similarity and 1 indicating perfect similarity.
The cosine similarity between two vectors is calculated as the dot product of the vectors divided by
the product of their magnitudes. This calculation can be represented mathematically as follows -
cos(θ) = (A · B) / (||A|| ||B||) = Σ_{i=1}^{n} A_i B_i / ( √(Σ_{i=1}^{n} A_i²) · √(Σ_{i=1}^{n} B_i²) )
where A and B are the feature vectors of two data points, "." denotes the dot product, and "||" denotes
the magnitude of the vector.
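A small NumPy sketch of this formula (the vectors below are made up for illustration):

# Sketch: cosine similarity as dot product over the product of magnitudes.
import numpy as np

def cosine_similarity(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity([1, 2, 3], [2, 4, 6]))  # 1.0 (same direction)
print(cosine_similarity([1, 0], [0, 1]))        # 0.0 (orthogonal vectors)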
Jaccard Similarity
The Jaccard similarity is another widely used similarity measure in data mining, particularly in text
analysis and clustering. It measures the similarity between two sets of data by calculating the ratio of
the intersection of the sets to their union. The Jaccard similarity score ranges from 0 to 1, with 0
indicating no similarity and 1 indicating perfect similarity.
J(A, B) = |A ∩ B| / |A ∪ B| = |A ∩ B| / (|A| + |B| − |A ∩ B|)
where |A ∩ B| is the size of the intersection of sets A and B, and |A ∪ B| is the size of the union of sets A and B.
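A minimal Python sketch of the Jaccard similarity between two sets (the example sets are invented):

# Sketch: Jaccard similarity of two sets (intersection over union).
def jaccard_similarity(a, b):
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention: two empty sets are identical
    return len(a & b) / len(a | b)

print(jaccard_similarity({"milk", "bread", "eggs"}, {"milk", "bread", "butter"}))  # 0.5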
Pearson Correlation Coefficient
The Pearson correlation coefficient is a widely used similarity measure in data mining and statistical
analysis. It measures the linear correlation between two continuous variables, X and Y. The Pearson
correlation coefficient ranges from -1 to +1, with -1 indicating a perfect negative correlation, 0
indicating no correlation, and +1 indicating a perfect positive correlation. The Pearson correlation
coefficient is commonly used in data mining applications such as feature selection and regression
analysis. It can help identify variables that are highly correlated with each other, which can be useful
for reducing the dimensionality of a dataset. In regression analysis, it can also be used to predict the
value of one variable based on the value of another variable.
The Pearson correlation coefficient between two variables, X and Y, is calculated as follows -
ρ(X, Y) = cov(X, Y) / (σ_X σ_Y) = Σ_{i=1}^{n} (X_i − X̄)(Y_i − Ȳ) / ( √(Σ_{i=1}^{n} (X_i − X̄)²) · √(Σ_{i=1}^{n} (Y_i − Ȳ)²) )
where cov(X, Y) is the covariance between variables X and Y, and σ_X and σ_Y are the standard deviations of variables X and Y, respectively.
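A short NumPy sketch of the Pearson correlation coefficient, computed both directly from the definition and with the built-in correlation matrix; the sample values are invented.

# Sketch: Pearson correlation coefficient via NumPy.
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 5, 4, 5], dtype=float)

# Direct use of the definition: covariance divided by the product of std devs
r_manual = np.sum((x - x.mean()) * (y - y.mean())) / (
    np.sqrt(np.sum((x - x.mean()) ** 2)) * np.sqrt(np.sum((y - y.mean()) ** 2))
)

# Equivalent built-in: off-diagonal entry of the correlation matrix
r_builtin = np.corrcoef(x, y)[0, 1]

print(round(r_manual, 4), round(r_builtin, 4))  # both values should match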
Sørensen-Dice Coefficient
The Sørensen-Dice coefficient, also known as the Dice similarity index or Dice coefficient, is a
similarity measure used to compare the similarity between two sets of data, typically used in the
context of text or image analysis. The coefficient ranges from 0 to 1, with 0 indicating no similarity
and 1 indicating perfect similarity. The Sørensen-Dice coefficient is commonly used in text analysis
to compare the similarity between two documents based on the set of words or terms they contain. It
is also used in image analysis to compare the similarity between two images based on the set of pixels
they contain.
S(A, B) = 2|A ∩ B| / (|A| + |B|)
where |A ∩ B| is the size of the intersection of sets A and B, and |A| and |B| are the sizes of sets A and B, respectively.
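A minimal sketch of the Sørensen-Dice coefficient applied to two token sets (the example sentences are invented):

# Sketch: Sørensen-Dice coefficient between two sets of tokens.
def dice_coefficient(a, b):
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))

doc1 = "data mining finds hidden patterns".split()
doc2 = "data mining finds frequent patterns".split()
print(dice_coefficient(doc1, doc2))  # 0.8: 4 shared words, 5 + 5 total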
Choosing an appropriate similarity measure depends on the nature of the data and the specific task at
hand. Here are some factors to consider when choosing a similarity measure -
Different similarity measures are suitable for different data types, such as continuous or categorical data, text or image data, etc. For example, the Pearson correlation coefficient is only suitable for continuous variables.
Some similarity measures are sensitive to the scale of measurement of the data.
The choice of similarity measure also depends on the specific task at hand. For example,
cosine similarity is often used in information retrieval and text mining, while Jaccard
similarity is commonly used in clustering and recommendation systems.
Some similarity measures are more robust to noise and outliers in the data than others. For
example, the Sørensen-Dice coefficient is less sensitive to noise.
Dissimilarity Measures
Dissimilarity measures are used to quantify the degree of difference or distance between
two objects or data points.
Dissimilarity measures can be considered the inverse of similarity measures, where the
similarity measure returns a high value for similar objects and a low value for dissimilar
objects, and the dissimilarity measure returns a low value for similar objects and a high value
for dissimilar objects.
Dissimilarity measures also have some well-known properties -
o Positivity: dissim(A, B) ≥ 0 for all A and B, and dissim(A, B) = 0 only if A = B.
o Symmetry: dissim(A, B) = dissim(B, A) for all A and B.
o Triangle inequality: dissim(A, C) ≤ dissim(A, B) + dissim(B, C) for all points A, B, and C.
Let’s explore a few of the commonly used dissimilarity or distance measures in data mining.
Euclidean Distance
Euclidean distance is a commonly used dissimilarity measure that quantifies the distance between two
points in a multidimensional space. It is named after the ancient Greek mathematician Euclid, who
first studied its properties. The Euclidean distance between two points X and Y in an n-dimensional space is defined as the square root of the sum of the squared differences between their corresponding coordinates, as shown below -
d(X, Y) = √( Σ_{i=1}^{n} (X_i − Y_i)² )
Euclidean distance is commonly used in clustering, classification, and anomaly detection applications
in data mining and machine learning. It has the advantage of being easy to interpret and visualize.
However, it can be sensitive to the scale of the data and may not perform well when dealing with
high-dimensional data or data with outliers.
Manhattan Distance
Manhattan distance, also known as city block distance, is a dissimilarity measure that quantifies the
distance between two points in a multidimensional space. It is named after the geometric structure of
the streets in Manhattan, where the distance between two points is measured by the number of blocks
one has to walk horizontally and vertically to reach the other point. The Manhattan distance between
two points x and y in an n-dimensional space is defined as the sum of the absolute differences between their corresponding coordinates, as shown below -
d_M(x, y) = Σ_{i=1}^{n} |x_i − y_i|
In data mining and machine learning, the Manhattan distance is commonly used in clustering,
classification, and anomaly detection applications. It is particularly useful when dealing with high-
dimensional data, sparse data, or data with outliers, as it is less sensitive to extreme values than the
Euclidean distance. However, it may not be suitable for data that exhibit complex geometric structures
or nonlinear relationships between features.
Minkowski Distance
Minkowski distance is a generalization of Euclidean distance and Manhattan distance, which are
special cases of Minkowski distance. The Minkowski distance between two points x and y in an n-dimensional space can be defined as -
D(x, y) = ( Σ_{i=1}^{n} |x_i − y_i|^p )^(1/p)
where p is a parameter that determines the degree of the Minkowski distance. When p = 1, the Minkowski distance reduces to the Manhattan distance, and when p = 2, it reduces to the Euclidean distance. When p > 2, it is sometimes referred to as a "higher-order" distance metric.
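A small sketch of the Minkowski distance, with Manhattan (p = 1) and Euclidean (p = 2) falling out as special cases; the points are invented for illustration.

# Sketch: Minkowski distance, with Manhattan (p=1) and Euclidean (p=2)
# as special cases.
import numpy as np

def minkowski_distance(x, y, p=2):
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sum(np.abs(x - y) ** p) ** (1 / p)

a, b = [1, 2, 3], [4, 6, 3]
print(minkowski_distance(a, b, p=1))  # Manhattan: |1-4| + |2-6| + |3-3| = 7
print(minkowski_distance(a, b, p=2))  # Euclidean: sqrt(9 + 16 + 0) = 5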
Hamming Distance
Hamming distance is a distance metric used to measure the dissimilarity between two strings of equal
length. It is defined as the number of positions at which the corresponding symbols in the two strings
are different.
For example, consider the strings "101010" and "111000". The Hamming distance between these two strings is two, since there are two positions at which the corresponding symbols differ: the second and fifth positions.
Hamming distance is often used in error-correcting codes and cryptography, where it is important to
detect and correct errors in data transmission. It is also used in data mining and machine learning
applications to compare categorical or binary data, such as DNA sequences or binary feature vectors.
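A minimal sketch of the Hamming distance, using the example strings from the text:

# Sketch: Hamming distance between two equal-length strings.
def hamming_distance(s1, s2):
    if len(s1) != len(s2):
        raise ValueError("Hamming distance requires equal-length inputs")
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))

print(hamming_distance("101010", "111000"))  # 2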
Similar to similarity measures, choosing the appropriate dissimilarity measure also depends on the
nature of the data and the specific task at hand. Here are some factors to consider when selecting a
dissimilarity measure -
Different dissimilarity measures are appropriate for different types of data. For example,
Hamming distance is suitable for binary or string data, while Euclidean distance is
appropriate for continuous numerical data.
The scale of the data can also affect the choice of dissimilarity measure. For instance, if the
range of a feature is much larger than the range of another feature, Euclidean distance may
not be the best measure to use. In this case, normalization or standardization of the data
may be required, or a different measure, such as Manhattan distance, could be used.
The number of features or dimensions in the data can also impact the choice of dissimilarity
measure. For high-dimensional data, a more robust measure such as Mahalanobis distance
may be more appropriate.
Measures for similarity and dissimilarity in data mining can be learned from labeled data in a
supervised learning setting. One common approach to learning supervised similarity and
dissimilarity measures is to train a binary classifier that distinguishes between similar and
dissimilar pairs of examples. Another approach is to train a regression model that predicts
the degree of similarity or dissimilarity between pairs of examples.
Similarly, ensemble techniques are used for measures of similarity and dissimilarity in data
mining when no single measure performs well on its own or when multiple measures
capture different aspects of the data. Ensemble techniques for similarity and dissimilarity
measures involve combining multiple measures into a single measure to improve the overall
performance. For example, in a k-NN classifier, each neighbor can be classified using a
different similarity measure. The final classification decision can then be made by combining
the decisions of all neighbors using an ensemble technique.
Recently, deep learning is also being used to learn measures of similarity and dissimilarity in
data mining. It involves using neural networks to learn the similarity or dissimilarity between
pairs of examples. One common approach to deep learning for similarity and dissimilarity
measures is to use Siamese networks. Siamese networks consist of two identical neural
networks that share weights. Each network takes as input one of the two examples to be
compared and produces a feature vector that represents the example. The two feature
vectors are then compared using a distance or similarity measure, such as Euclidean distance
or cosine similarity, to produce a final output.
Conclusion
Measures of similarity and dissimilarity are essential tools in data mining for comparing and
analyzing data. These measures allow us to quantify the similarity or dissimilarity between
two data points or data sets and identify patterns and relationships in complex datasets.
Many different measures of similarity and dissimilarity are available, such as cosine
similarity, Jaccard similarity, Euclidean distance, Hamming distance, etc. Choosing the
appropriate measure depends on the specific task and the characteristics of the data being
analyzed.
Ensemble techniques and deep learning approaches can also be used to combine or learn
similarity and dissimilarity measures, effectively improving performance and robustness.
Unit 2
Frequent pattern mining in data mining is the process of identifying patterns or
associations within a dataset that occur frequently. This is typically done by analyzing
large datasets to find items or sets of items that appear together frequently.
Frequent pattern mining is an essential task in data mining that aims to uncover recurring patterns or itemsets in a given dataset. It involves identifying sets of items that occur together frequently in a transactional or relational database. This process can offer valuable insight into the relationships and associations among different items or features within the data.
Here is a more detailed explanation of frequent pattern mining:
Transactional and Relational Databases:
Frequent pattern mining can be applied to transactional databases, where each transaction consists of a set of items. For instance, in a retail dataset, each transaction may represent a customer's purchase with items like bread, milk, and eggs. It can also be used with relational databases, where data is organized into multiple related tables. In this case, frequent patterns can represent relationships among different attributes or columns.
Support and Frequent Itemsets:
The support of an itemset is defined as the proportion of transactions in the database that contain that particular itemset. It represents the frequency or occurrence of the itemset in the dataset. Frequent itemsets are sets of items whose support is above a specified minimum support threshold. These itemsets are considered interesting and are the primary focus of frequent pattern mining.
Apriori Algorithm:
The Apriori algorithm is one of the most well-known and widely used algorithms for frequent pattern mining. It uses a breadth-first search strategy to discover frequent itemsets efficiently. The algorithm works in multiple iterations. It starts by finding frequent individual items by scanning the database once and counting the occurrences of each item. It then generates candidate itemsets of size 2 by combining the frequent itemsets of size 1. The support of these candidate itemsets is calculated by scanning the database again. The process continues iteratively, generating candidate itemsets of size k and calculating their support until no more frequent itemsets can be found.
Support-based Pruning:
During the Apriori algorithm's execution, support-based pruning is used to reduce the search space and improve efficiency. If an itemset is found to be infrequent (i.e., its support is below the minimum support threshold), then all of its supersets are also guaranteed to be infrequent. Therefore, these supersets are pruned from further consideration. This pruning step significantly decreases the number of potential itemsets that need to be evaluated in subsequent iterations.
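The sketch below is a simplified, from-scratch illustration of the Apriori idea (iterative candidate generation plus support-based pruning), not a production implementation; the transactions and the minimum support threshold are invented for the example.

# Simplified Apriori sketch: iteratively grow frequent itemsets and prune
# candidates whose support falls below the threshold. Toy data, for illustration.
transactions = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "eggs", "bread"},
    {"milk", "eggs"},
]
min_support = 0.4  # an itemset must appear in at least 40% of transactions

def support(itemset):
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

# Level 1: frequent single items
items = {item for t in transactions for item in t}
frequent = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]
all_frequent = list(frequent)

k = 2
while frequent:
    # Candidate generation: join (k-1)-itemsets that differ by one item
    candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
    # Support-based pruning: keep only candidates meeting the threshold
    frequent = [c for c in candidates if support(c) >= min_support]
    all_frequent.extend(frequent)
    k += 1

for itemset in all_frequent:
    print(set(itemset), round(support(itemset), 2))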
Association Rule Mining:
Frequent itemsets can be further examined to discover association rules, which represent relationships between different items. An association rule consists of an antecedent (left-hand side) and a consequent (right-hand side), both of which are itemsets. For instance, {milk, bread} => {eggs} is an association rule. Association rules are produced from frequent itemsets by considering different combinations of items and calculating measures such as support, confidence, and lift. Support measures the frequency of both the antecedent and the consequent appearing together, while confidence measures the conditional probability of the consequent given the antecedent. Lift indicates the strength of the association between the antecedent and the consequent, relative to their individual supports.
Applications:
Frequent pattern mining has various practical uses in different domains. Some examples
include market basket analysis, customer behavior analysis, web mining, bioinformatics,
and network traffic analysis. Market basket analysis involves analyzing customer purchase
patterns to identify connections between items and enhance sales strategies. In
bioinformatics, frequent pattern mining can be used to identify common patterns in DNA
sequences, protein structures, or gene expressions, leading to insights in genetics and
drug design. Web mining can employ frequent pattern mining to discover navigational
patterns, user preferences, or collaborative filtering recommendations on the web.
In summary, frequent pattern mining is a data mining approach used to find recurring patterns or itemsets in transactional or relational databases. It involves locating sets of items that occur together often and has numerous uses in different fields. The Apriori algorithm is a popular technique used to efficiently detect frequent itemsets, and association rule mining can be carried out to obtain significant relationships between items.
There are several different algorithms used for frequent pattern mining, including:
1. Apriori algorithm: This is one of the most commonly used algorithms for frequent
pattern mining. It uses a “bottom-up” approach to identify frequent itemsets and then
generates association rules from those itemsets.
2. ECLAT algorithm: This algorithm uses a “depth-first search” approach to identify
frequent itemsets. It is particularly efficient for datasets with a large number of items.
3. FP-growth algorithm: This algorithm uses a “compression” technique (building an FP-tree) to find frequent
patterns efficiently without repeated candidate generation. It is particularly efficient for datasets with a large number of
transactions.
Frequent pattern mining has many applications, such as Market Basket Analysis,
Recommender Systems, Fraud Detection, and many more.
Advantages:
1. It can find useful information which is not visible in simple data browsing
2. It can find interesting association and correlation among data items
Disadvantages:
1. It can generate a large number of patterns
2. With high dimensionality, the number of patterns can be very large, making it difficult to
interpret the results.
The increasing power of computer technology creates a large amount of data and storage.
Databases are growing rapidly; in this computerized world everything is shifting online,
and data is becoming a new currency. Data comes in different shapes and sizes and is
collected in different ways. Data mining brings many benefits: it helps us improve a
particular process and, in some cases, leads to cost savings or revenue generation.
Data mining is commonly used to search a large amount of data for patterns and trends, and
not only for searching; it uses the discovered patterns to develop actionable next
steps.
Data mining is the process of converting raw data into suitable patterns based on trends.
Data mining produces different types of patterns, and frequent pattern mining is one of them.
This concept was introduced for mining transaction databases. Frequent patterns are
patterns (such as itemsets, subsequences, or substructures) that appear frequently in a
database. Frequent pattern mining is an analytical process that finds such frequent patterns,
associations, or causal structures in various databases. The aim of this process is to find the
items that frequently occur together in transactions. From frequent patterns we can identify
strongly correlated items and the characteristics and associations they share, and we can then
go further into clustering and association analysis.
Frequent pattern mining is a major concern in data mining: it plays a major role in mining
associations and correlations and discloses intrinsic and important properties of a dataset.
Frequent pattern mining can be done by using association rules with particular algorithms,
such as the Eclat and Apriori algorithms. It searches for recurring relationships in a data
set and helps to uncover its inherent regularities.
Frequent Item set in Data set (Association Rule Mining)
INTRODUCTION:
1. Frequent item sets are a fundamental concept in
association rule mining, which is a technique used in data mining to discover
relationships between items in a dataset. The goal of association rule mining is to
identify relationships between items in a dataset that occur frequently together.
2. A frequent item set is a set of items that occur together frequently in a dataset. The
frequency of an item set is measured by the support count, which is the number of
transactions or records in the dataset that contain the item set. For example, if a
dataset contains 100 transactions and the item set {milk, bread} appears in 20 of those
transactions, the support count for {milk, bread} is 20.
3. Association rule mining algorithms, such as Apriori or FP-Growth, are used to find
frequent item sets and generate association rules. These algorithms work by iteratively
generating candidate item sets and pruning those that do not meet the minimum
support threshold. Once the frequent item sets are found, association rules can be
generated by using the concept of confidence, which is the ratio of the number of
transactions that contain the item set and the number of transactions that contain the
antecedent (left-hand side) of the rule.
4. Frequent item sets and association rules can be used for a variety of tasks such as
market basket analysis, cross-selling and recommendation systems. However, it
should be noted that association rule mining can generate a large number of rules,
many of which may be irrelevant or uninteresting. Therefore, it is important to use
appropriate measures such as lift and conviction to evaluate the interestingness of the
generated rules.
Association Mining searches for frequent items in the data set. In frequent mining
usually, interesting associations and correlations between item sets in transactional and
relational databases are found. In short, Frequent Mining shows which items appear
together in a transaction or relationship.
Need of Association Mining: Frequent mining is the generation of association rules from
a Transactional Dataset. If there are 2 items X and Y purchased frequently then it’s good
to put them together in stores or provide some discount offer on one item on purchase of
another item. This can really increase sales. For example, it is likely to find that if a
customer buys Milk and bread he/she also buys Butter. So the association rule
is {milk, bread} => {butter}. So the seller can suggest the customer buy butter if
he/she buys Milk and Bread.
Important Definitions :
Support: It is one of the measures of interestingness. This tells about the usefulness
and certainty of rules. 5% support means that 5% of the total transactions in the database
follow the rule.
Support(A -> B) = Support_count(A ∪ B) / Total number of transactions
Confidence: A confidence of 60% means that 60% of the customers who purchased
milk and bread also bought butter.
Confidence(A -> B) = Support_count(A ∪ B) / Support_count(A)
If a rule satisfies both minimum support and minimum confidence, it is a strong rule.
Support_count(X): Number of transactions in which X appears. If X is A union B then
it is the number of transactions in which A and B both are present.
Maximal Itemset: An itemset is maximal frequent if none of its supersets are frequent.
Closed Itemset: An itemset is closed if none of its immediate supersets have the same
support count as the itemset.
K-Itemset: An itemset which contains K items is a K-itemset. An itemset is frequent if
its support count is greater than or equal to the minimum support count.
Example On finding Frequent Itemsets – Consider the given dataset with given
transactions.
Let's say the minimum support count is 3.
The relation that holds is: maximal frequent => closed frequent => frequent.
1-frequent itemsets:
{A} = 3 // not closed due to {A, C}; not maximal
{B} = 4 // not closed due to {B, D}; not maximal
{C} = 4 // not closed due to {C, D}; not maximal
{D} = 5 // closed itemset, since no immediate superset has the same count; not maximal
2-frequent itemsets:
{A, B} = 2 // not frequent because support count < minimum support count, so ignore
{A, C} = 3 // not closed due to {A, C, D}
{A, D} = 3 // not closed due to {A, C, D}
{B, C} = 3 // not closed due to {B, C, D}
{B, D} = 4 // closed but not maximal due to {B, C, D}
{C, D} = 4 // closed but not maximal due to {B, C, D}
3-frequent itemsets:
{A, B, C} = 2 // ignore, not frequent because support count < minimum support count
{A, B, D} = 2 // ignore, not frequent because support count < minimum support count
{A, C, D} = 3 // maximal frequent
{B, C, D} = 3 // maximal frequent
4-frequent itemsets:
{A, B, C, D} = 2 // ignore, not frequent
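The transaction table for this example is not reproduced above, so the sketch below assumes a 5-transaction dataset that is consistent with the support counts listed (this reconstruction is an assumption); it recomputes the counts so the frequent, closed, and maximal labels can be checked in Python.
from itertools import combinations

# Assumed transactions consistent with the counts in the worked example
transactions = [{'A', 'B', 'C', 'D'}, {'A', 'B', 'C', 'D'},
                {'A', 'C', 'D'}, {'B', 'C', 'D'}, {'B', 'D'}]
min_support = 3
items = sorted({i for t in transactions for i in t})

def count(itemset):
    return sum(1 for t in transactions if set(itemset) <= t)

# Keep only itemsets whose support count meets the threshold
frequent = {}
for k in range(1, len(items) + 1):
    for combo in combinations(items, k):
        c = count(combo)
        if c >= min_support:
            frequent[frozenset(combo)] = c

for itemset, c in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    supersets = [s for s in frequent if itemset < s]
    closed = all(frequent[s] < c for s in supersets)   # no frequent superset with the same count
    maximal = not supersets                            # no frequent superset at all
    print(sorted(itemset), c, "closed" if closed else "", "maximal" if maximal else "")
Running this reproduces the labels above, for example {A, C, D} and {B, C, D} come out as both closed and maximal.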
ADVANTAGES OR DISADVANTAGES:
Advantages of using frequent item sets and association rule mining include: they can uncover
useful information that is not visible through simple data browsing, and they can reveal
interesting associations and correlations among data items that support tasks such as market
basket analysis and recommendations.
Disadvantages of using frequent item sets and association rule mining include:
1. Large number of generated rules: Association rule mining can generate a large number
of rules, many of which may be irrelevant or uninteresting, which can make it difficult to
identify the most important patterns.
2. Limited in detecting complex relationships: Association rule mining is limited in its
ability to detect complex relationships between items, and it only considers the co-
occurrence of items in the same transaction.
3. Can be computationally expensive: As the number of items and transactions increases,
the number of candidate item sets also increases, which can make the algorithm
computationally expensive.
4. Need to define the minimum support and confidence threshold: The minimum support
and confidence threshold must be set before the association rule mining process,
which can be difficult and requires a good understanding of the data.
Introduction
In recent years, data mining has grown rapidly, and this growth has also brought challenges.
Large data sets can be turned into valuable information that drives decision-making. We can
say that data mining is a kind of art used to uncover patterns, but we have to remember one
thing: not all the patterns discovered through data mining are equally valuable.
In the data mining process, the concept of interestingness deals with distinguishing trivial
patterns from significant ones. In this article, we learn about various aspects of the
interestingness of discovered patterns. Let's discuss those sections below.
Section 1: Defining Interestingness in Data Mining
This section will define the interestingness of the data obtained in mining.
o Domain knowledge:
With the help of domain knowledge, we can judge whether a discovered pattern is interesting
for the domain. We must also understand how patterns evolve within that domain.
o Societal Implications:
We also have to consider the potential societal impacts of data mining and the
importance of ethical decision-making.
Pre-requisites:
In data mining, pattern evaluation is the process of assessing the quality of discovered patterns. This
process is important in order to determine whether the patterns are useful and whether they can be
trusted. There are a number of different measures that can be used to evaluate patterns, and the choice
of measure will depend on the application.
There are several ways to evaluate pattern mining algorithms:
1. Accuracy
The accuracy of a data mining model is a measure of how correctly the model predicts the target values.
The accuracy is measured on a test dataset, which is separate from the training dataset that was used to
train the model. There are a number of ways to measure accuracy, but the most common is to calculate
the percentage of correct predictions. This is known as the accuracy rate.
Other measures of accuracy include the root mean squared error (RMSE) and the mean absolute error
(MAE). The RMSE is the square root of the mean squared error, and the MAE is the mean of the
absolute errors. The accuracy of a data mining model is important, but it is not the only thing that should
be considered. The model should also be robust and generalizable.
A model that is 100% accurate on the training data but only 50% accurate on the test data is not a good
model. The model is overfitting the training data and is not generalizable to new data. A model that is
80% accurate on the training data and 80% accurate on the test data is a good model. The model is
generalizable and can be used to make predictions on new data.
2. Classification Accuracy
This measures how accurately the patterns discovered by the algorithm can be used to classify new data.
This is typically done by taking a set of data that has been labeled with known class labels and then
using the discovered patterns to predict the class labels of the data. The accuracy can then be computed
by comparing the predicted labels to the actual labels.
Classification accuracy is one of the most popular evaluation metrics for classification models, and it is
simply the percentage of correct predictions made by the model. Although it is a straightforward and
easy-to-understand metric, classification accuracy can be misleading in certain situations. For
example, if we have a dataset with a very imbalanced class distribution, such as 100 instances of class
0 and 1,000 instances of class 1, then a model that always predicts class 1 will achieve a high
classification accuracy of about 91% (1,000 correct predictions out of 1,100). However, this model is
clearly not very useful, since it is not making any correct predictions for class 0.
There are a few different ways to evaluate classification models, such as precision and recall, which are
more informative in imbalanced datasets. Precision is the percentage of correct predictions made by
the model for a particular class, and recall is the percentage of instances of a particular class that were
correctly predicted by the model. In the above example, if we looked at precision and recall for class
0, we would see that the model has a precision of 0% and a recall of 0%.
Another way to evaluate classification models is to use a confusion matrix. A confusion matrix is a
table that shows the number of correct and incorrect predictions made by the model for each class. This
can be a helpful way to visualize the performance of a model and to identify where it is making mistakes.
For example, in the above example, the confusion matrix would show that the model is making all
predictions for class 1 and no predictions for class 0.
Overall, classification accuracy is a good metric to use when evaluating classification models. However,
it is important to be aware of its limitations and to use other evaluation metrics in situations where
classification accuracy could be misleading.
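The imbalanced-class example above can be reproduced with scikit-learn. This short sketch builds arrays that mirror the 100-vs-1,000 scenario (the data is synthetic, not from a real model) and shows how accuracy can look high while precision and recall for the minority class are zero.
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix

# 100 instances of class 0 and 1,000 of class 1; the "model" always predicts class 1
y_true = np.array([0] * 100 + [1] * 1000)
y_pred = np.ones_like(y_true)

print("accuracy:", accuracy_score(y_true, y_pred))                               # about 0.91
print("precision (class 0):", precision_score(y_true, y_pred, pos_label=0, zero_division=0))
print("recall (class 0):", recall_score(y_true, y_pred, pos_label=0, zero_division=0))
print("confusion matrix:")
print(confusion_matrix(y_true, y_pred))                                          # [[0 100] [0 1000]]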
3. Clustering Accuracy
This measures how accurately the patterns discovered by the algorithm can be used to cluster new data.
This is typically done by taking a set of data that has been labeled with known cluster labels and then
using the discovered patterns to predict the cluster labels of the data. The accuracy can then be computed
by comparing the predicted labels to the actual labels.
There are a few ways to evaluate the accuracy of a clustering algorithm:
External indices: these indices compare the clusters produced by the algorithm to some known
ground truth. For example, the Rand Index or the Jaccard coefficient can be used if the ground truth
is known.
Internal indices: these indices assess the goodness of clustering without reference to any external
information. Commonly used internal indices include the silhouette coefficient and the Dunn index.
Stability: this measures how robust the clustering is to small changes in the data. A clustering
algorithm is said to be stable if, when applied to different samples of the same data, it produces the
same results.
Efficiency: this measures how quickly the algorithm converges to the correct clustering.
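As a concrete illustration of an external and an internal index, here is a short scikit-learn sketch on synthetic blobs; the dataset, the number of clusters, and the random seeds are invented for the example.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, silhouette_score

# Synthetic data with a known ground truth (3 blobs)
X, y_true = make_blobs(n_samples=300, centers=3, random_state=42)
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# External index: compares the clustering to the known labels
print("adjusted Rand index:", adjusted_rand_score(y_true, labels))
# Internal index: judges cluster quality from the data alone
print("silhouette score:", silhouette_score(X, labels))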
4. Coverage
This measures how many of the possible patterns in the data are discovered by the algorithm. It can
be computed by dividing the number of patterns discovered by the algorithm by the total number of
possible patterns. A Coverage Pattern is a type of sequential pattern that is found by looking
for items that tend to appear together in sequential order. For example, a coverage pattern might be
“customers who purchase item A also tend to purchase item B within the next month.”
To evaluate a coverage pattern, analysts typically look at two things: support and confidence. Support
is the percentage of transactions that contain the pattern. Confidence is the number of transactions
that contain the whole pattern divided by the number of transactions that contain the first item in the pattern.
For example, consider the following coverage pattern: “customers who purchase item A also tend to
purchase item B within the next month.” If the support for this pattern is 0.1%, that means that 0.1% of
all transactions contain the pattern. If the confidence for this pattern is 80%, that means that 80% of the
transactions that contain item A also contain item B.
Generally, a higher support and confidence value indicates a stronger pattern. However, analysts must
be careful to avoid overfitting, which is when a pattern is found that is too specific to the data and would
not be generalizable to other data sets.
5. Visual Inspection
This is perhaps the most common method, where the data miner simply looks at the patterns to see if
they make sense. In visual inspection, the data is plotted in a graphical format and the pattern is
observed. This method is used when the data is not too large and can be easily plotted. It is also used
when the data is categorical in nature. Visual inspection is a pattern evaluation method in data mining
where the data is visually inspected for patterns. This can be done by looking at a graph or plot of the
data, or by looking at the raw data itself. This method is often used to find outliers or unusual patterns.
6. Running Time
This measures how long it takes for the algorithm to find the patterns in the data. This is typically
measured in seconds or minutes. There are a few different ways to measure the performance of a
machine learning algorithm, but one of the most common is to simply measure the amount of time it
takes to train the model and make predictions. This is known as the running time pattern evaluation.
There are a few different things to keep in mind when measuring the running time of an algorithm. First,
you need to take into account the time it takes to load the data into memory. Second, you need to account
for the time it takes to pre-process the data if any. Finally, you need to account for the time it takes to
train the model and make predictions.
In general, the running time of an algorithm will increase as the amount of data increases. This is
because the algorithm has to process more data in order to learn from it. However, there are some
algorithms that are more efficient than others and can scale to large datasets better. When comparing
different algorithms, it is important to keep in mind the specific dataset that is being used. Some
algorithms may be better suited for certain types of data than others. In addition, the running time can
also be affected by the hardware that is being used.
7. Support
The support of a pattern is the percentage of the total number of records that contain the pattern. Support
Pattern evaluation is a process of finding interesting and potentially useful patterns in data. The purpose
of support pattern evaluation is to identify interesting patterns that may be useful for decision-making.
Support pattern evaluation is typically used in data mining and machine learning applications.
There are a variety of ways to evaluate support patterns. One common approach is to use a support
metric, which measures the number of times a pattern occurs in a dataset. Another common approach
is to use a lift metric, which measures the ratio of the occurrence of a pattern to the expected occurrence
of the pattern.
Support pattern evaluation can be used to find a variety of interesting patterns in data, including
association rules, sequential patterns, and co-occurrence patterns. Support pattern evaluation is an
important part of data mining and machine learning, and can be used to help make better decisions.
8. Confidence
The confidence of a pattern is the percentage of times that the pattern is found to be correct. Confidence
Pattern evaluation is a method of data mining that is used to assess the quality of patterns found in data.
This evaluation is typically performed by calculating the percentage of times a pattern is found in a data
set and comparing this percentage to the percentage of times the pattern is expected to be found based
on the overall distribution of data. If the percentage of times a pattern is found is significantly higher
than the expected percentage, then the pattern is said to be a strong confidence pattern.
9. Lift
The lift of a pattern is the ratio of the number of times that the pattern is found to be correct to the
number of times that the pattern is expected to be correct. Lift Pattern evaluation is a data mining
technique that can be used to evaluate the performance of a predictive model. The lift pattern is a
graphical representation of the model’s performance and can be used to identify potential problems with
the model.
The lift pattern is a plot of the true positive rate (TPR) against the false positive rate (FPR). The TPR is
the percentage of positive instances that are correctly classified by the model, while the FPR is the
percentage of negative instances that are incorrectly classified as positive. Ideally, the TPR would be
100% and the FPR would be 0%, but this is rarely the case in practice. The lift pattern can be used to
evaluate how close the model is to this ideal.
A good model will have a curve that rises well above the diagonal line, towards the top-left corner,
meaning it achieves a high TPR at a low FPR. A model whose curve lies close to the diagonal is
performing no better than random guessing and is not performing well. Poor performance can be
caused by a number of factors, including imbalanced data, poor feature selection, or overfitting.
The lift pattern can be a useful tool for identifying potential problems with a predictive model. It is
important to remember, however, that the lift pattern is only a graphical representation of the model’s
performance, and should be interpreted in conjunction with other evaluation measures.
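Since the plot described here is the TPR-against-FPR curve, it can be produced with scikit-learn's roc_curve. The sketch below uses a synthetic dataset and an arbitrary model purely for illustration.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, roc_auc_score

# Synthetic, imbalanced binary classification problem
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = model.predict_proba(X_te)[:, 1]          # predicted probability of the positive class

fpr, tpr, thresholds = roc_curve(y_te, scores)    # points of the TPR-vs-FPR curve
print("AUC:", roc_auc_score(y_te, scores))        # area under the curve; 0.5 corresponds to random guessing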
10. Prediction
The prediction of a pattern is the percentage of times that the pattern is found to be correct. Prediction
Pattern evaluation is a data mining technique used to assess the accuracy of predictive models. It is used
to determine how well a model can predict future outcomes based on past data. Prediction Pattern
evaluation can be used to compare different models, or to evaluate the performance of a single model.
Prediction Pattern evaluation involves splitting the data set into two parts: a training set and a test set.
The training set is used to train the model, while the test set is used to assess the accuracy of the model.
To evaluate the accuracy of the model, the prediction error is calculated. Prediction Pattern evaluation
can be used to improve the accuracy of predictive models. By using a test set, predictive models can be
fine-tuned to better fit the data. This can be done by changing the model parameters or by adding new
features to the data set.
11. Precision
Precision Pattern Evaluation is a method for analyzing data that has been collected from a variety of
sources. This method can be used to identify patterns and trends in the data, and to evaluate the accuracy
of data. Precision Pattern Evaluation can be used to identify errors in the data, and to determine the
cause of the errors. This method can also be used to determine the impact of the errors on the overall
accuracy of the data.
Precision Pattern Evaluation is a valuable tool for data mining and data analysis. This method can be
used to improve the accuracy of data, and to identify patterns and trends in the data.
12. Cross-Validation
This method involves partitioning the data into two sets, training the model on one set, and then testing
it on the other. This can be done multiple times, with different partitions, to get a more reliable estimate
of the model’s performance. Cross-validation is a model validation technique for assessing how the
results of a data mining analysis will generalize to an independent data set. It is mainly used in settings
where the goal is prediction, and one wants to estimate how accurately a predictive model will perform
in practice. Cross-validation is also referred to as out-of-sample testing.
Cross-validation is a pattern evaluation method that is used to assess the accuracy of a model. It does
this by splitting the data into a training set and a test set. The model is then fit on the training set and
the accuracy is measured on the test set. This process is then repeated a number of times, with the
accuracy being averaged over all the iterations.
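A minimal cross-validation sketch with scikit-learn follows; the dataset and the model are placeholders chosen only for illustration.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(random_state=0)

# 5-fold cross-validation: fit on 4 folds, test on the held-out fold, repeat, then average
scores = cross_val_score(model, X, y, cv=5)
print("fold accuracies:", scores)
print("mean accuracy:", scores.mean())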
13. Test Set
This method involves partitioning the data into two sets, training the model on the training set, and
then testing it on the held-out test set. This is cheaper than cross-validation, but the resulting
estimate can be less reliable, especially when the data set is small. There are a number of ways to evaluate the performance of a model
on a test set. The most common is to simply compare the predicted labels to the true labels and compute
the percentage of instances that are correctly classified. This is called accuracy. Another popular metric
is precision, which is the number of true positives divided by the sum of true positives and false
positives. The recall is the number of true positives divided by the sum of true positives and false
negatives. These metrics can be combined into the F1 score, which is the harmonic mean of precision
and recall.
14. Bootstrapping
This method involves randomly sampling the data with replacement, training the model on the sampled
data, and then testing it on the original data. This can be used to get a distribution of the model’s
performance, which can be useful for understanding how robust the model is. Bootstrapping is a
resampling technique used to estimate the accuracy of a model. It involves randomly selecting a sample
of data from the original dataset and then training the model on this sample. The model is then tested
on another sample of data that is not used in training. This process is repeated a number of times, and
the average accuracy of the model is calculated.
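A hedged sketch of the bootstrap procedure described above, using scikit-learn's resample; the dataset, model, and number of rounds are illustrative choices, and this variant tests each round on the rows that were not drawn into the sample.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

X, y = load_breast_cancer(return_X_y=True)
n = len(X)
accuracies = []

for i in range(20):                                              # number of bootstrap rounds (arbitrary)
    idx = resample(np.arange(n), replace=True, random_state=i)   # sample row indices with replacement
    oob = np.setdiff1d(np.arange(n), idx)                        # rows not drawn in this sample
    model = DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx])
    accuracies.append(model.score(X[oob], y[oob]))               # evaluate on the unseen rows

print("mean bootstrap accuracy:", np.mean(accuracies))
print("spread of the estimate:", np.std(accuracies))
The spread across rounds gives an idea of how robust the model's performance estimate is.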
Unit 3
Data Mining: Data mining in general terms means mining or digging deep into data that is
in different forms to gain patterns, and to gain knowledge on that pattern. In the process of
data mining, large data sets are first sorted, then patterns are identified and relationships
are established to perform data analysis and solve problems.
Classification is a task in data mining that involves assigning a class label to each instance
in a dataset based on its features. The goal of classification is to build a model that
accurately predicts the class labels of new instances based on their features.
There are two main types of classification: binary classification and multi-class
classification. Binary classification involves classifying instances into two classes, such as
“spam” or “not spam”, while multi-class classification involves classifying instances into
more than two classes.
The process of building a classification model typically involves the following steps:
Data Collection:
The first step in building a classification model is data collection. In this step, the data
relevant to the problem at hand is collected. The data should be representative of the
problem and should contain all the necessary attributes and labels needed for
classification. The data can be collected from various sources, such as surveys,
questionnaires, websites, and databases.
Data Preprocessing:
The second step in building a classification model is data preprocessing. The collected
data needs to be preprocessed to ensure its quality. This involves handling missing values,
dealing with outliers, and transforming the data into a format suitable for analysis. Data
preprocessing also involves converting the data into numerical form, as most classification
algorithms require numerical input.
Handling Missing Values: Missing values in the dataset can be handled by replacing them
with the mean, median, or mode of the corresponding feature or by removing the entire
record.
Dealing with Outliers: Outliers in the dataset can be detected using various statistical
techniques such as z-score analysis, boxplots, and scatterplots. Outliers can be removed
from the dataset or replaced with the mean, median, or mode of the corresponding feature.
Data Transformation: Data transformation involves scaling or normalizing the data to bring
it into a common scale. This is done to ensure that all features have the same level of
importance in the analysis.
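To ground the preprocessing steps above, here is a small pandas/scikit-learn sketch; the toy DataFrame and its column names are invented. A missing value is imputed with the mean, an outlier is detected with the boxplot (IQR) rule mentioned above and replaced with the median, and the features are then scaled.
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy data with a missing value and an outlier
df = pd.DataFrame({'age': [25, 32, np.nan, 41, 38],
                   'income': [40_000, 52_000, 48_000, 45_000, 900_000]})

# Handle missing values: replace with the column mean
df['age'] = df['age'].fillna(df['age'].mean())

# Detect outliers with the IQR (boxplot) rule and replace them with the median
q1, q3 = df['income'].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = (df['income'] < q1 - 1.5 * iqr) | (df['income'] > q3 + 1.5 * iqr)
df.loc[outliers, 'income'] = df['income'].median()

# Data transformation: bring all features onto a common scale
scaled = StandardScaler().fit_transform(df)
print(scaled)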
Feature Selection:
The third step in building a classification model is feature selection. Feature selection
involves identifying the most relevant attributes in the dataset for classification. This can be
done using various techniques, such as correlation analysis, information gain, and
principal component analysis.
Correlation Analysis: Correlation analysis involves identifying the correlation between the
features in the dataset. Features that are highly correlated with each other can be removed
as they do not provide additional information for classification.
Information Gain: Information gain is a measure of the amount of information that a feature
provides for classification. Features with high information gain are selected for
classification.
Principal Component Analysis:
Principal Component Analysis (PCA) is a technique used to reduce the dimensionality of
the dataset. PCA identifies the most important features in the dataset and removes the
redundant ones.
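A short PCA sketch with scikit-learn (the dataset is chosen only for illustration), reducing the feature space while keeping most of the variance:
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)     # PCA is sensitive to feature scale

pca = PCA(n_components=2)                        # keep the two most informative components
X_reduced = pca.fit_transform(X_scaled)

print("original shape:", X.shape)                # (150, 4)
print("reduced shape:", X_reduced.shape)         # (150, 2)
print("variance kept:", pca.explained_variance_ratio_.sum())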
Model Selection:
The fourth step in building a classification model is model selection. Model selection
involves selecting the appropriate classification algorithm for the problem at hand. There
are several algorithms available, such as decision trees, support vector machines, and
neural networks.
Decision Trees: Decision trees are a simple yet powerful classification algorithm. They
divide the dataset into smaller subsets based on the values of the features and construct a
tree-like model that can be used for classification.
Support Vector Machines: Support Vector Machines (SVMs) are a popular classification
algorithm used for both linear and nonlinear classification problems. SVMs are based on
the concept of maximum margin, which involves finding the hyperplane that maximizes the
distance between the two classes.
Neural Networks:
Neural Networks are a powerful classification algorithm that can learn complex patterns in
the data. They are inspired by the structure of the human brain and consist of multiple
layers of interconnected nodes.
Model Training:
The fifth step in building a classification model is model training. Model training involves
using the selected classification algorithm to learn the patterns in the data. The data is
divided into a training set and a validation set. The model is trained using the training set,
and its performance is evaluated on the validation set.
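A minimal sketch of this training step follows; the dataset, the chosen algorithm, and the split ratio are illustrative assumptions. The data is split into a training set and a validation set, a decision tree is fit on the former and evaluated on the latter.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

model = DecisionTreeClassifier(max_depth=3, random_state=0)
model.fit(X_train, y_train)                        # learn patterns from the training set

val_accuracy = accuracy_score(y_val, model.predict(X_val))
print("validation accuracy:", val_accuracy)        # performance on data not used for training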
Model Evaluation:
The sixth step in building a classification model is model evaluation. Model evaluation
involves assessing the performance of the trained model on a test set. This is done to
ensure that the model generalizes well to new, unseen data.
Classification is a widely used technique in data mining and is applied in a variety of
domains, such as email filtering, sentiment analysis, and medical diagnosis.
Classification: It is a data analysis task, i.e. the process of finding a model that describes
and distinguishes data classes and concepts. Classification is the problem of identifying to
which of a set of categories (subpopulations) a new observation belongs, on the basis
of a training set of data containing observations whose category membership is
known.
Example: Before starting any project, we need to check its feasibility. In this case, a
classifier is required to predict class labels such as ‘Safe’ and ‘Risky’ for adopting the
Project and to further approve it. It is a two-step process such as:
1. Learning Step (Training Phase): Construction of Classification Model
Different Algorithms are used to build a classifier by making the model learn using the
training set available. The model has to be trained for the prediction of accurate
results.
2. Classification Step (Testing Phase): The model is used to predict class labels for test
data, and the accuracy of the classification rules is estimated from how well the
predictions match the known labels of the test data.
1. Discriminative: It is a very basic classifier that determines just one class for each row
of data. It models the decision boundary directly from the observed data and depends
heavily on the quality of the data rather than on assumed distributions.
Example: Logistic Regression
2. Generative: It models the distribution of individual classes and tries to learn the model
that generates the data behind the scenes by estimating assumptions and distributions
of the model. Used to predict the unseen data.
Example: Naive Bayes Classifier
Detecting spam emails by looking at previous data. Suppose there are 100 emails, of
which 25% are spam (Class A) and 75% are non-spam (Class B). Now a user wants to
check whether an email containing the word "cheap" should be termed spam.
In Class A (the 25 spam emails), 20 out of 25 emails contain the word "cheap".
In Class B (the 75 non-spam emails), only 5 out of 75 emails contain the word
"cheap".
So, if an email contains the word "cheap", what is the probability of it being spam?
P(spam | cheap) = 20 / (20 + 5) = 80%.
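Working that example through in Python, taking the counts as stated above (20 of the 25 spam emails and 5 of the 75 non-spam emails contain the word "cheap"):
# Counts from the example: 25 spam and 75 non-spam emails
spam_with_cheap = 20          # spam emails containing the word "cheap"
nonspam_with_cheap = 5        # non-spam emails containing the word "cheap"

# P(spam | "cheap") = emails that are spam AND contain "cheap"
#                     / all emails that contain "cheap"
p_spam_given_cheap = spam_with_cheap / (spam_with_cheap + nonspam_with_cheap)
print(p_spam_given_cheap)     # 0.8, i.e. 80%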
Classifiers Of Machine Learning:
1. Decision Trees
2. Bayesian Classifiers
3. Neural Networks
4. K-Nearest Neighbour
5. Support Vector Machines
6. Linear Regression
7. Logistic Regression
Associated Tools and Languages: Used to mine/ extract useful information from raw
data.
Main Languages used: R, SAS, Python, SQL
Major Tools used: RapidMiner, Orange, KNIME, Spark, Weka
Libraries used: Jupyter, NumPy, Matplotlib, Pandas, ScikitLearn, NLTK, TensorFlow,
Seaborn, Basemap, etc.
Real–Life Examples :
Market Basket Analysis:
It is a modeling technique that has been associated with frequent transactions of
buying some combination of items.
Example: Amazon and many other Retailers use this technique. While viewing some
products, certain suggestions for the commodities are shown that some people have
bought in the past.
Weather Forecasting:
Changing Patterns in weather conditions needs to be observed based on parameters
such as temperature, humidity, wind direction. This keen observation also requires the
use of previous records in order to predict it accurately.
Advantages:
Mining Based Methods are cost-effective and efficient
Helps in identifying criminal suspects
Helps in predicting the risk of diseases
Helps Banks and Financial Institutions to identify defaulters so that they may approve
Cards, Loan, etc.
Disadvantages:
Privacy: When data is shared, there are chances that a company may give some information
about their customers to other vendors or use this information for their own profit.
Accuracy Problem: An accurate model must be selected in order to get the best
accuracy and results.
APPLICATIONS:
Text classification is a fundamental task in natural language processing (NLP), with applications
ranging from spam detection to sentiment analysis and document categorization.
Two popular machine learning algorithms for text classification are Naive Bayes classifier (NB)
and Support Vector Machines (SVM). Both approaches have their strengths and weaknesses, making
them suitable for different types of text classification tasks. In this article, we'll explore and compare
Naive Bayes and SVM for text classification, highlighting their key differences, advantages, and
limitations.
Naive Bayes Classifier (NB)
Naive Bayes is a probabilistic machine learning model widely used for text classification tasks. Despite its
seemingly simplistic name, its effectiveness stems from its strong theoretical foundation and its ability to
efficiently handle high-dimensional text data. It is particularly effective with high-dimensional data and
can handle large datasets efficiently. The algorithm's simplicity, speed, and ability to work well with
limited data make it a popular choice, especially when computational resources are a consideration in
real-world applications.
Probabilistic Foundation: NB leverages Bayes' theorem, calculating the probability of a
text belonging to a particular class based on the individual probabilities of its constituent words appearing
in that class.
Naivety Assumption: The "naive" aspect lies in its assumption that word occurrences are
independent of each other within a class. While this assumption rarely holds perfectly true, it
surprisingly leads to strong performance in many real-world scenarios.
Flexibility: NB works well with both multinomial and Bernoulli word representations, adapting
to different text characteristics. Multinomial captures word frequency within a document, while
Bernoulli considers mere presence or absence.
NB requires minimal memory and training time, making it ideal for applications requiring fast predictions
and quick adaptation to new data.
Support Vector Machines (SVM)
Support Vector Machines are a powerful algorithm that excels at distinguishing between different text categories, making it valuable
for tasks like sentiment analysis, topic labeling, and spam detection. At its heart, SVM aims to find
the optimal hyperplane, a decision boundary within a high-dimensional space that cleanly separates
different text classes. Imagine plotting each text document as a point based on its extracted features
(e.g., word presence, frequency). SVM seeks the hyperplane that maximizes the margin between these
classes, ensuring clear distinction even for unseen data.
The SVM model is trained on labeled data, where each document belongs to a specific category. The
model learns the optimal hyperplane that best separates these categories in the feature space. For
new documents, based on their feature vectors, the model predicts the class they belong to by placing
them on the appropriate side of the hyperplane.
While SVMs work with linear hyperplanes by default, the 'kernel trick' allows them to handle non-
linear relationships between features. This is crucial for text, where complex semantic relationships
exist between words.
SVMs often exhibit high accuracy on text classification tasks, even with smaller datasets. They can
effectively handle the sparse data inherent in text, where many features might be absent in individual
documents.
Naive Bayes vs. SVM for Text Classification
Advantages of Naive Bayes: simple and easy to implement; computationally efficient; works well with small datasets.
Advantages of SVM: effective in high-dimensional spaces; robust to overfitting; flexibility in choosing kernel functions; can capture complex relationships.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer

# Load the 20 Newsgroups dataset (train and test splits)
newsgroups_train = fetch_20newsgroups(subset='train')
newsgroups_test = fetch_20newsgroups(subset='test')

# Convert the raw text into TF-IDF features
tfidf_vectorizer = TfidfVectorizer()
X_train = tfidf_vectorizer.fit_transform(newsgroups_train.data)
X_test = tfidf_vectorizer.transform(newsgroups_test.data)
y_train = newsgroups_train.target
y_test = newsgroups_test.target
Step 4 Training Classifiers:
We instantiate Multinomial Naïve Bayes and SVM classifiers and train them using the training data
(X_train, y_train).
nb_classifier = MultinomialNB(): Initializing a Naïve Bayes classifier object of the
MultinomialNB class.
nb_classifier.fit(X_train, y_train): Training the Naïve Bayes classifier using the TF-IDF
features (X_train) and the corresponding target labels (y_train).
svm_classifier = SVC(kernel='linear'): Initializing an SVM classifier object of the SVC class
with a linear kernel.
svm_classifier.fit(X_train, y_train): Training the SVM classifier using the TF-IDF features
(X_train) and the corresponding target labels (y_train).
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC

# Train Naïve Bayes classifier
nb_classifier = MultinomialNB()
nb_classifier.fit(X_train, y_train)

# Train SVM classifier with a linear kernel (as described above)
svm_classifier = SVC(kernel='linear')
svm_classifier.fit(X_train, y_train)
Step 5 Model Evaluation and Prediction:
We use the trained classifiers to make predictions on the testing data. We print classification reports
containing various evaluation metrics such as precision, recall, and F1-score for both Naïve Bayes and
SVM classifiers using the classification_report function.
nb_predictions = nb_classifier.predict(X_test): Making predictions on the testing data using the
trained Naïve Bayes classifier and storing the predictions in the variable nb_predictions.
svm_predictions = svm_classifier.predict(X_test): Making predictions on the testing data using
the trained SVM classifier and storing the predictions in the variable svm_predictions.
print(classification_report(y_test, nb_predictions,
target_names=newsgroups_test.target_names)): Printing the classification report for the Naïve
Bayes classifier, which includes precision, recall, F1-score, and support for each class.
print(classification_report(y_test, svm_predictions,
target_names=newsgroups_test.target_names)): Printing the classification report for the SVM
classifier, similar to the one printed for the Naïve Bayes classifier.
from sklearn.metrics import classification_report

# Evaluate classifiers
nb_predictions = nb_classifier.predict(X_test)
svm_predictions = svm_classifier.predict(X_test)

# Print classification reports
print("Naïve Bayes Classification Report:")
print(classification_report(y_test, nb_predictions, target_names=newsgroups_test.target_names))
print("SVM Classification Report:")
print(classification_report(y_test, svm_predictions, target_names=newsgroups_test.target_names))
Basic Understanding of Bayesian Belief Networks
Bayesian Belief Network (BBN) is a graphical model that represents the probabilistic relationships
among variables. It is used to handle uncertainty and make predictions or decisions based on
probabilities.
Graphical Representation: Variables are represented as nodes in a directed acyclic graph
(DAG), and their dependencies are shown as edges.
Conditional Probabilities: Each node’s probability depends on its parent nodes, expressed
as P(Variable | Parents).
Probabilistic Model: Built from probability distributions, BBNs apply probability theory for
tasks like prediction and anomaly detection.
Bayesian Belief Networks are valuable tools for understanding and solving problems involving
uncertain events. They are also known as Bayes networks, belief networks, decision networks, or
Bayesian models.
(Note: A classifier assigns data in a collection to desired categories.)
Consider this example:
In the above figure, we have an alarm ‘A’ – a node, say installed in a house of a person ‘gfg’,
which rings upon two probabilities i.e burglary ‘B’ and fire ‘F’, which are – parent nodes of the
alarm node. The alarm is the parent node of two probabilities P1 calls ‘P1’ & P2 calls ‘P2’ person
nodes.
Upon the instance of burglary or fire, ‘P1’ and ‘P2’ call the person ‘gfg’, respectively. But there are
a few drawbacks in this case: sometimes ‘P1’ may forget to call the person ‘gfg’, even after
hearing the alarm, as he has a tendency to forget things quickly. Similarly, ‘P2’ sometimes fails to
call the person ‘gfg’, as he is only able to hear the alarm from a certain distance.
Calculating Conditional Probability of Events in a Bayesian Network
Find the probability that ‘P1’ is true (P1 has called ‘gfg’), ‘P2’ is true (P2 has called ‘gfg’) when the
alarm ‘A’ rang, but no burglary ‘B’ and fire ‘F’ has occurred.
=> P ( P1, P2, A, ~B, ~F) [ where- P1, P2 & A are ‘true’ events and ‘~B’ & ‘~F’ are ‘false’ events]
[Note: The values mentioned below are neither calculated nor computed; they are observed values.]
Burglary ‘B’ –
P (B=T) = 0.001 (‘B’ is true i.e burglary has occurred)
P (B=F) = 0.999 (‘B’ is false i.e burglary has not occurred)
Fire ‘F’ –
P (F=T) = 0.002 (‘F’ is true i.e fire has occurred)
P (F=F) = 0.998 (‘F’ is false i.e fire has not occurred)
Alarm ‘A’ –
B F P (A=T) P (A=F)
T T 0.95 0.05
T F 0.94 0.06
F T 0.29 0.71
F F 0.001 0.999
The alarm ‘A’ node can be ‘true’ or ‘false’ ( i.e may have rung or may not have rung). It has two
parent nodes burglary ‘B’ and fire ‘F’ which can be ‘true’ or ‘false’ (i.e may have occurred or
may not have occurred) depending upon different conditions.
Person ‘P1’ –
A P (P1=T) P (P1=F)
T 0.95 0.05
F 0.05 0.95
The person ‘P1’ node can be ‘true’ or ‘false’ (i.e may have called the person ‘gfg’ or not) . It has a
parent node, the alarm ‘A’, which can be ‘true’ or ‘false’ (i.e may have rung or may not have rung
,upon burglary ‘B’ or fire ‘F’).
Person ‘P2’ –
A P (P2=T) P (P2=F)
T 0.80 0.20
F 0.01 0.99
The person ‘P2’ node can be ‘true’ or false’ (i.e may have called the person ‘gfg’ or not). It has a
parent node, the alarm ‘A’, which can be ‘true’ or ‘false’ (i.e may have rung or may not have
rung, upon burglary ‘B’ or fire ‘F’).
Solution: Considering the observed probabilistic scan –
With respect to the question — P ( P1, P2, A, ~B, ~F) , we need to get the probability of ‘P1’. We
find it with regard to its parent node – alarm ‘A’. To get the probability of ‘P2’, we find it with regard
to its parent node — alarm ‘A’.
We find the probability of alarm ‘A’ node with regard to ‘~B’ & ‘~F’ since burglary ‘B’ and fire ‘F’
are parent nodes of alarm ‘A’.
From the observed probabilistic scan, we can deduce –
P(P1, P2, A, ~B, ~F)
= P(P1 | A) * P(P2 | A) * P(A | ~B, ~F) * P(~B) * P(~F)
= 0.95 * 0.80 * 0.001 * 0.999 * 0.998
≈ 0.00076
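The chain-rule product above can be checked with a few lines of Python; the probabilities are the ones tabulated in the example.
# Conditional probability tables from the example
p_not_b = 0.999                 # P(~B): no burglary
p_not_f = 0.998                 # P(~F): no fire
p_a_given_not_b_not_f = 0.001   # P(A | ~B, ~F): alarm rings with no burglary and no fire
p_p1_given_a = 0.95             # P(P1 | A): P1 calls when the alarm rings
p_p2_given_a = 0.80             # P(P2 | A): P2 calls when the alarm rings

joint = p_p1_given_a * p_p2_given_a * p_a_given_not_b_not_f * p_not_b * p_not_f
print(round(joint, 5))          # ≈ 0.00076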
Association Rule Mining:
Once the frequent patterns are known, it is easy to generate association rules from them:
for each frequent pattern X:
for each non-empty proper subset Y of X:
calculate the confidence of the rule Y -> (X - Y);
if it is greater than the minimum confidence threshold, keep the rule.
Two algorithms that mine the frequent-itemset lattice needed for this step are:
1. Apriori algorithm
2. Eclat algorithm
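The loop sketched above can be written directly in Python. This is a hedged illustration: the dictionary of frequent itemsets and their support counts is a made-up input, standing in for the output of Apriori or Eclat.
from itertools import combinations

# Assumed input: frequent itemsets with their support counts
support_count = {frozenset(['milk']): 6, frozenset(['bread']): 7,
                 frozenset(['butter']): 5, frozenset(['milk', 'bread']): 5,
                 frozenset(['milk', 'butter']): 4, frozenset(['bread', 'butter']): 4,
                 frozenset(['milk', 'bread', 'butter']): 4}
min_confidence = 0.7

for x, x_count in support_count.items():
    if len(x) < 2:
        continue
    for r in range(1, len(x)):                          # every non-empty proper subset Y of X
        for y in map(frozenset, combinations(x, r)):
            confidence = x_count / support_count[y]     # support(X) / support(Y)
            if confidence >= min_confidence:
                print(set(y), "->", set(x - y), f"confidence={confidence:.2f}")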
Working principle (a simple point-of-sale application for any supermarket which has a
good range of products):
the product data is entered into the database.
the taxes and commissions are entered.
the product is purchased and taken to the bill counter.
the billing operator scans the product with the bar-code machine; the system checks and
matches the product in the database and then shows the information of the product.
the bill is paid by the customer, and he receives the products.
Backpropagation is an algorithm that backpropagates the errors from the output nodes to the input
nodes. Therefore, it is simply referred to as the backward propagation of errors. It is used in many
applications of neural networks in data mining, such as character recognition, signature verification, etc.
Neural Network:
Neural networks are an information processing paradigm inspired by the human nervous system. Just
as the human nervous system has biological neurons, neural networks have artificial neurons;
artificial neurons are mathematical functions modeled on biological neurons.
The human brain is estimated to have about 10 billion neurons, each connected to an average of 10,000
other neurons. Each neuron receives a signal through a synapse, which controls the effect of the
signal on the neuron.
Backpropagation:
Backpropagation is a widely used algorithm for training feedforward neural networks. It computes the
gradient of the loss function with respect to the network weights. It is very efficient, rather than naively
directly computing the gradient concerning each weight. This efficiency makes it possible to use
gradient methods to train multi-layer networks and update weights to minimize loss; variants such as
gradient descent or stochastic gradient descent are often used.
The backpropagation algorithm works by computing the gradient of the loss function with respect to
each weight via the chain rule, computing the gradient layer by layer, and iterating backward from the
last layer to avoid redundant computation of intermediate terms in the chain rule.
Features of Backpropagation:
1. It is a gradient descent method, as used in the case of a simple perceptron network with a
differentiable unit.
2. It is different from other networks in the way the weights are calculated
during the learning period of the network.
3. Training is done in three stages:
the feed-forward of the input training pattern
the calculation and backpropagation of the error
the updating of the weights
Working of Backpropagation:
Neural networks use supervised learning to generate output vectors from input vectors that the network
operates on. The network compares the generated output to the desired output and computes an error
when the result does not match the desired output vector. It then adjusts the weights according to this
error to move the output towards the desired values.
Backpropagation Algorithm:
Parameters :
x = input training vector x = (x1, x2, …, xn).
t = target output vector t = (t1, t2, …, tn).
δk = error at output unit.
δj = error at hidden layer.
α = learning rate.
V0j = bias of hidden unit j.
Training Algorithm :
Step 1: Initialize the weights to small random values.
Step 2: While the stopping condition is false, do steps 3 to 9.
Step 3: For each training pair, do steps 4 to 8 (Feed-Forward).
Step 4: Each input unit receives the input signal xi and transmits it to all units in the hidden layer.
Step 5: Each hidden unit zj (j = 1 to a) sums its weighted input signals to calculate its net input
z_inj = v0j + Σ xi vij (i = 1 to n)
It applies the activation function zj = f(z_inj) and sends this signal to all units in the layer above, i.e.
the output units.
Each output unit yk (k = 1 to m) sums its weighted input signals
y_ink = w0k + Σ zj wjk (j = 1 to a)
and applies its activation function to calculate the output signal
yk = f(y_ink)
Backpropagation Error :
Step 6: Each output unit yk (k = 1 to m) receives a target pattern corresponding to the input pattern,
and its error term is calculated as:
δk = (tk – yk) f'(y_ink)
Step 7: Each hidden unit zj (j = 1 to a) sums its delta inputs from the units in the layer above
δ_inj = Σ δk wjk (k = 1 to m)
The error information term is calculated as:
δj = δ_inj f'(z_inj)
Updation of weight and bias :
Step 8: Each output unit yk (k = 1 to m) updates its bias and weights (j = 1 to a). The weight correction
term is given by:
Δwjk = α δk zj
and the bias correction term is given by Δw0k = α δk.
Therefore wjk(new) = wjk(old) + Δwjk
w0k(new) = w0k(old) + Δw0k
Each hidden unit zj (j = 1 to a) updates its bias and weights (i = 0 to n). The weight correction
term is
Δvij = α δj xi
and the bias correction term is
Δv0j = α δj
Therefore vij(new) = vij(old) + Δvij
v0j(new) = v0j(old) + Δv0j
Step 9: Test the stopping condition. The stopping condition can be the minimization of the error or
reaching a set number of epochs.
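A minimal NumPy sketch of the update rules above for a tiny 2-2-1 network with sigmoid units follows; the training data (logical OR), learning rate, architecture, and number of epochs are invented for illustration.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])    # input training vectors
t = np.array([[0.], [1.], [1.], [1.]])                    # target vectors (logical OR)
alpha = 0.5                                               # learning rate

V = rng.normal(scale=0.5, size=(2, 2)); v0 = np.zeros(2)  # input-to-hidden weights and biases
W = rng.normal(scale=0.5, size=(2, 1)); w0 = np.zeros(1)  # hidden-to-output weights and bias

for epoch in range(5000):
    # Feed-forward (Steps 4-5)
    z_in = X @ V + v0
    z = sigmoid(z_in)
    y_in = z @ W + w0
    y = sigmoid(y_in)

    # Backpropagation of error (Steps 6-7); for the sigmoid, f'(x) = f(x) * (1 - f(x))
    delta_k = (t - y) * y * (1 - y)               # error terms at the output units
    delta_j = (delta_k @ W.T) * z * (1 - z)       # error terms at the hidden units

    # Weight and bias updates (Step 8)
    W += alpha * z.T @ delta_k
    w0 += alpha * delta_k.sum(axis=0)
    V += alpha * X.T @ delta_j
    v0 += alpha * delta_j.sum(axis=0)

print("outputs after training:", y.ravel())       # should approach the targets 0, 1, 1, 1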
Backpropagation is “backpropagation of errors” and is very useful for training neural networks. It’s
fast, easy to implement, and simple. Apart from the number of inputs and the learning rate,
backpropagation does not require many parameters to be set, and it is a flexible method because no
prior knowledge of the network is required.
Types of Backpropagation
Disadvantages:
It is sensitive to noisy data and irregularities. Noisy data can lead to inaccurate results.
Performance is highly dependent on input data.
Training can take a considerable amount of time.
A matrix-based approach is preferred over a mini-batch approach.
Backpropagation is an algorithm that backpropagates the errors from the output nodes to the input
nodes. Therefore, it is simply referred to as the backward propagation of errors. It uses in the vast
applications of neural networks in data mining like Character recognition, Signature verification, etc.
Neural Network:
Neural networks are an information processing paradigm inspired by the human nervous system. Just
like in the human nervous system, we have biological neurons in the same way in neural networks we
have artificial neurons, artificial neurons are mathematical functions derived from biological neurons.
The human brain is estimated to have about 10 billion neurons, each connected to an average of 10,000
other neurons. Each neuron receives a signal through a synapse, which controls the effect of the
signconcerning on the neuron.
Backpropagation:
Backpropagation is a widely used algorithm for training feedforward neural networks. It computes the
gradient of the loss function with respect to the network weights. It is very efficient, rather than naively
directly computing the gradient concerning each weight. This efficiency makes it possible to use
gradient methods to train multi-layer networks and update weights to minimize loss; variants such as
gradient descent or stochastic gradient descent are often used.
The backpropagation algorithm works by computing the gradient of the loss function with respect to
each weight via the chain rule, computing the gradient layer by layer, and iterating backward from the
last layer to avoid redundant computation of intermediate terms in the chain rule.
Features of Backpropagation:
1. It is a gradient-descent method, as used in the case of a simple perceptron network with a differentiable unit.
2. It differs from other networks in the way the weights are calculated during the learning period of the network.
3. Training is done in three stages:
the feed-forward of the input training pattern,
the calculation and backpropagation of the error, and
the updating of the weights.
Working of Backpropagation:
Neural networks use supervised learning to generate output vectors from the input vectors on which the network operates. The network compares the generated output with the desired output and computes an error if they do not match. The weights are then adjusted according to this error so that the output moves closer to the desired output.
Backpropagation Algorithm:
Parameters:
x = input training vector, x = (x1, x2, …, xn).
t = target output vector, t = (t1, t2, …, tm).
δk = error term at output unit k.
δj = error term at hidden unit j.
α = learning rate.
v0j = bias on hidden unit j, w0k = bias on output unit k.
Training Algorithm:
Step 1: Initialize the weights to small random values.
Step 2: While the stopping condition is false, do steps 3 to 9.
Step 3: For each training pair, do steps 4 to 8 (feed-forward).
Step 4: Each input unit receives the input signal xi and transmits it to all the units in the hidden layer.
Step 5: Each hidden unit zj (j=1 to a) sums its weighted input signals to calculate its net input
zinj = v0j + Σ xi vij (i=1 to n),
applies its activation function, zj = f(zinj), and sends this signal to all units in the layer above, i.e., the output units.
Each output unit yk (k=1 to m) sums its weighted input signals
yink = w0k + Σ zj wjk (j=1 to a)
and applies its activation function to calculate the output signal
yk = f(yink).
Backpropagation of Error:
Step 6: Each output unit yk (k=1 to m) receives a target pattern corresponding to the input training pattern, and its error term is calculated as
δk = (tk – yk) f′(yink).
Step 7: Each hidden unit zj (j=1 to a) sums its delta inputs from all units in the layer above
δinj = Σ δk wjk (k=1 to m),
and its error term is calculated as
δj = δinj f′(zinj).
Updating of weights and biases:
Step 8: Each output unit yk (k=1 to m) updates its bias and weights (j=1 to a). The weight correction term is given by
Δwjk = α δk zj
and the bias correction term is given by
Δw0k = α δk,
therefore
wjk(new) = wjk(old) + Δwjk
w0k(new) = w0k(old) + Δw0k.
Each hidden unit zj (j=1 to a) updates its bias and weights (i=1 to n). The weight correction term is
Δvij = α δj xi
and the bias correction term is
Δv0j = α δj,
therefore
vij(new) = vij(old) + Δvij
v0j(new) = v0j(old) + Δv0j.
Step 9: Test the stopping condition. The stopping condition can be the minimization of the total error or the completion of a fixed number of epochs.
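To make the steps above concrete, here is a minimal NumPy sketch of the training loop, assuming a single hidden layer and a sigmoid activation; the variable names (v, v0, w, w0, alpha) follow the notation of Steps 1-9, and the XOR data at the end is only a toy illustration, not part of the original algorithm description.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_deriv(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def train_backprop(X, T, n_hidden=4, alpha=0.1, epochs=5000):
    """One-hidden-layer backpropagation following Steps 1-9 above."""
    rng = np.random.default_rng(0)
    n_in, n_out = X.shape[1], T.shape[1]
    # Step 1: initialize weights (v: input->hidden, w: hidden->output) and biases
    v = rng.uniform(-0.5, 0.5, (n_in, n_hidden));  v0 = np.zeros(n_hidden)
    w = rng.uniform(-0.5, 0.5, (n_hidden, n_out)); w0 = np.zeros(n_out)

    for _ in range(epochs):                      # Step 2: repeat until stopping condition
        for x, t in zip(X, T):                   # Step 3: for each training pair
            # Feed-forward (Steps 4-5)
            z_in = v0 + x @ v;   z = sigmoid(z_in)
            y_in = w0 + z @ w;   y = sigmoid(y_in)
            # Backpropagation of error (Steps 6-7)
            delta_k = (t - y) * sigmoid_deriv(y_in)          # output error terms
            delta_j = (delta_k @ w.T) * sigmoid_deriv(z_in)  # hidden error terms
            # Weight and bias updates (Step 8)
            w += alpha * np.outer(z, delta_k);  w0 += alpha * delta_k
            v += alpha * np.outer(x, delta_j);  v0 += alpha * delta_j
    return v, v0, w, w0

# Toy usage: learn the XOR mapping
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)
v, v0, w, w0 = train_backprop(X, T)
```

In this sketch the stopping condition of Step 9 is simply a fixed number of epochs; an error threshold could be tested instead.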
Backpropagation, short for "backward propagation of errors," is very useful for training neural networks.
Types of Backpropagation
The two common types are static backpropagation, used with feedforward networks that map a static input to a static output, and recurrent backpropagation, used with networks containing feedback, where activations are propagated until they reach a fixed value.
Advantages:
It is fast, simple, and easy to implement; apart from the number of inputs it has no parameters to tune; and it is a flexible method because no prior knowledge of the network is required.
Disadvantages:
It is sensitive to noisy data and irregularities; noisy data can lead to inaccurate results.
Performance is highly dependent on the input data.
Training can take a considerable amount of time.
A matrix-based approach is needed rather than a mini-batch approach.
Unit 4
Clustering helps to split data into several subsets. Each of these subsets contains data similar to each other, and these subsets are called clusters. Once the data from a customer base is divided into clusters, we can make an informed decision about who we think is best suited for a given product.
Let's understand this with an example. Suppose we are a marketing manager, and we have a new, tempting product to sell. We are sure that the product would bring enormous profit, as long as it is sold to the right people. So, how can we tell who is best suited for the product from our company's huge customer base?
Clustering, falling under the category of unsupervised machine learning, is one of the
problems that machine learning algorithms solve.
Clustering only utilizes input data, to determine patterns, anomalies, or similarities in its input
data.
o The intra-cluster similarity is high, which means the data points inside a cluster are similar to one another.
o The inter-cluster similarity is low, which means each cluster holds data that is not similar to the data in other clusters.
What is a Cluster?
o Clustering is the method of converting a group of abstract objects into classes of similar
objects.
o Clustering is a method of partitioning a set of data or objects into a set of significant
subclasses called clusters.
o It helps users to understand the structure or natural grouping in a data set and is used either as a stand-alone tool to get better insight into the data distribution or as a pre-processing step for other algorithms.
Important points:
o In many applications, clustering analysis is widely used, such as data analysis, market
research, pattern recognition, and image processing.
o It assists marketers in finding different groups in their client base and characterizing their customer groups based on purchasing patterns.
o It helps in allocating documents on the internet for data discovery.
o Clustering is also used in tracking applications such as detection of credit card fraud.
o As a data mining function, cluster analysis serves as a tool to gain insight into the
distribution of data to analyze the characteristics of each cluster.
o In biology, it can be used to determine plant and animal taxonomies, categorize genes with similar functionality, and gain insight into structures inherent in populations.
o It helps in the identification of areas of similar land that are used in an earth observation
database and the identification of house groups in a city according to house type,
value, and geographical location.
Clustering analysis has been an evolving problem in data mining due to its variety of applications. The advent of various data clustering tools in the last few years and their widespread use in a broad range of applications, including image processing, computational biology, mobile communication, medicine, and economics, has contributed to the popularity of these algorithms. The main issue with data clustering algorithms is that they cannot be standardized: an advanced algorithm may give the best results with one type of data set but may fail or perform poorly with other kinds of data sets. Although many efforts have been made to design algorithms that perform well in all situations, no significant success has been achieved so far. Many clustering tools have been proposed, but each algorithm has its own advantages and disadvantages and cannot work in all real situations.
1. Scalability:
Scalability in clustering implies that as we boost the amount of data objects, the time to perform
clustering should approximately scale to the complexity order of the algorithm. For example,
if we perform K- means clustering, we know it is O(n), where n is the number of objects in the
data. If we raise the number of data objects 10 folds, then the time taken to cluster them should
also approximately increase 10 times. It means there should be a linear relationship. If that is
not the case, then there is some error with our implementation process.
The algorithm should be scalable; if it is not, we cannot get appropriate results when the amount of data grows.
2. Interpretability:
The outcomes of clustering should be interpretable, comprehensible, and usable.
3. Discovery of clusters with arbitrary shape:
The clustering algorithm should be able to find clusters of arbitrary shape. It should not be limited to distance measures that tend to discover only spherical clusters of small size.
4. Ability to deal with different types of attributes:
Algorithms should be capable of being applied to any kind of data, such as interval-based (numeric) data, binary data, and categorical data.
5. Ability to deal with noisy data:
Databases contain data that is noisy, missing, or erroneous. Some algorithms are sensitive to such data and may produce poor-quality clusters.
6. High dimensionality:
The clustering algorithm should be able to handle not only low-dimensional data but also high-dimensional data space.
1. Model Assessment:
Partitioning allows us to assess how well a model generalizes to unseen data. Testing the model on a separate dataset lets us check its performance and make informed decisions about its deployment.
2. Preventing Overfitting:
Partitioning helps to detect and mitigate overfitting, a common issue in machine learning. Overfit models perform very well on the training data yet fail to generalize to new data. Partitioning ensures that models are tested on data they have not seen during training, helping to uncover overfitting.
3. Tuning Hyperparameters:
Finding the correct hyperparameters for machine learning algorithms is important in data mining. Partitioning provides the means to tune these hyperparameters while avoiding leakage of information from the test set.
Through partitioning, data quality issues such as outliers and missing values become apparent when models are applied to the test set. This insight can prompt improvements in data cleaning and preprocessing.
This section explores partitioning strategies in data mining, giving a comprehensive understanding of their importance and applications. It is a useful resource for both beginners and experienced data mining practitioners, offering a deep dive into partitioning techniques and their key role in ensuring the effectiveness and reliability of data-driven insights and predictions. By the end of this section, readers will be equipped with the knowledge and tools needed to apply partitioning in data mining effectively.
1. Random Sampling
Random sampling involves selecting a subset of data points from a larger dataset with no particular pattern or bias. It is done by picking data points at random, which can be done with or without replacement.
Use Cases:
Random sampling is commonly used for tasks such as creating training and test sets for machine learning models, conducting surveys, and estimating population parameters.
Disadvantages:
o Lack of control: It may not guarantee that specific subsets of the data are equally represented, which can be an issue for imbalanced datasets.
o Variability: Because of its randomness, the composition of the sampled data can vary between different runs.
Challenges:
o Determining sample size: Deciding on an appropriate sample size can be difficult and may require statistical techniques.
o Bias if not done properly: Random sampling can introduce bias into the dataset if it is not carried out correctly.
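As an illustration, the following Python sketch performs simple random sampling with NumPy and scikit-learn; the synthetic dataset and the 80/20 split ratio are arbitrary choices made only for demonstration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy dataset: 100 rows, 3 features, binary labels (illustrative only)
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = rng.integers(0, 2, size=100)

# Simple random split: 80% training, 20% test, no stratification
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Random sampling *with* replacement (a bootstrap-style sample)
idx = rng.choice(len(X), size=50, replace=True)
X_bootstrap, y_bootstrap = X[idx], y[idx]
```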
2. Stratified Sampling
Stratified sampling is a strategy in which the dataset is divided into subgroups, or strata, based on specific characteristics, and random sampling is then performed within each stratum. This guarantees that every subgroup is represented in the sample.
Use Cases:
Stratified sampling is valuable when you need to guarantee that particular subgroups within your data are adequately represented in the sample. It is often used in statistical surveys, polling, and clinical studies.
Challenges:
o Identifying strata: Determining the appropriate strata and their attributes can be challenging.
o Sampling within strata: Ensuring proper random samples within each stratum can be difficult.
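A minimal scikit-learn sketch of stratified sampling is shown below; the imbalanced toy labels are invented purely to show how the stratify parameter preserves class proportions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced toy labels: 90 samples of class 0, 10 of class 1
X = np.arange(100).reshape(-1, 1).astype(float)
y = np.array([0] * 90 + [1] * 10)

# Stratified split: class proportions are preserved in train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

print(np.bincount(y_train), np.bincount(y_test))  # roughly 72/8 vs 18/2
```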
3. K-Fold Cross-Validation
K-Fold Cross-Validation is a strategy used to assess the performance of a machine learning model. The dataset is partitioned into K subsets, or "folds." The model is trained on K-1 of these folds and tested on the remaining fold, and this cycle is repeated K times, with each fold used as the test set exactly once.
Use Cases:
K-Fold Cross-Validation is widely used for model selection and hyperparameter tuning in machine learning. It helps to estimate how well a model will generalize to new, unseen data.
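The following is a small scikit-learn sketch of K-Fold Cross-Validation; the iris dataset and the logistic regression model are only illustrative choices. The last two lines also show Leave-One-Out Cross-Validation, which is discussed in the next subsection.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# 5-fold cross-validation: train on 4 folds, test on the remaining fold, 5 times
kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=kf)
print("mean 5-fold accuracy:", scores.mean())

# Leave-One-Out (LOOCV) is the extreme case where K equals the number of samples
loo_scores = cross_val_score(model, X, y, cv=LeaveOneOut())
print("LOOCV accuracy:", loo_scores.mean())
```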
4. Leave-One-Out Cross-Validation (LOOCV)
LOOCV is the extreme form of K-Fold Cross-Validation in which K equals the number of samples, so each iteration trains on all data points except one and tests on that single held-out point.
Use Cases:
LOOCV is valuable when you have a limited amount of data and need to make the most of it for model assessment.
Disadvantages:
o High variance: LOOCV can show high variance in its performance estimates, as every iteration involves training on nearly identical datasets.
o Computationally intensive: It can be computationally expensive for large datasets.
5. Holdout Validation
Holdout validation splits the dataset once into a training set and a test set. It is commonly used for quick model assessment and testing and is appropriate for situations where computational resources are limited.
Disadvantages:
o Inconsistency: The quality of the partition can differ depending on the random split, possibly leading to biased results.
o Smaller test set: The test set may be small, potentially affecting the reliability of performance estimates.
Implementing partitioning strategies in data mining is a pivotal step in preparing datasets for various analytical and predictive tasks. This part covers the practical aspects of implementing partitioning strategies, including the choice of programming tools and libraries, the setting of partitioning ratios, and the handling of imbalanced datasets.
1. Choosing Programming Tools and Libraries:
i) Specific Libraries:
In the programming domain, Python is a leader with libraries such as scikit-learn, which offers a rich set of functions, for example train_test_split for splitting datasets into training and testing subsets. On the R side, packages such as caret and rsample serve similar needs, providing a solid framework for partitioning data programmatically.
2. Establishing Partitioning Ratios:
The size of your dataset has a critical impact on your partitioning ratios. Larger datasets allow a smaller proportion to be allocated to testing, while smaller datasets may require a relatively larger test set to guarantee robust model assessment. Adapting to the dataset size is key to sound model development.
o In the presence of imbalanced datasets, where one class significantly outnumbers the others, maintaining a similar class distribution in both the training and the test sets is vital.
o For example, stratified sampling can help preserve class proportions, ensuring fair model assessment and robustness.
3. Handling Imbalanced Datasets:
i) Challenges of Imbalanced Data:
Imbalanced datasets present special difficulties in data mining. Models tend to favour the majority class, often resulting in poor performance on the minority classes. This is especially problematic in classification tasks where the minority classes carry critical significance.
o Examining real-world case studies can be valuable for gathering practical knowledge on handling imbalanced datasets.
o These case studies, whether found in academic research papers, data mining competitions (e.g., Kaggle), or industry reports, offer a rich collection of challenges and solutions related to imbalanced data.
1. Data Preprocessing
Cleaning and Imputation:
o Before partitioning, handle missing values, outliers, and obvious errors; common imputation strategies include filling missing values with the mean, median, or mode.
Feature Engineering:
o Feature engineering involves creating new features or modifying existing ones to improve model performance.
o Techniques may include one-hot encoding, feature scaling, creating interaction terms, and generating domain-specific features.
o The objective is to provide the model with relevant and informative input features.
Normalization and Scaling:
o Normalization ensures that features are on comparable scales, preventing a few features from dominating others during model training.
o Common techniques include min-max scaling, z-score normalization, and robust scaling.
o Choosing the right normalization strategy depends on the distribution of your data and the requirements of your algorithm.
2. Picking the Right Partitioning Technique
Considerations for Different Data Types:
o The choice of partitioning strategy should consider the nature of the data. For instance:
o For time-series data, temporal splitting or rolling-window validation may be suitable.
o For text data, stratified sampling based on text categories can be valuable.
o For image data, random sampling or k-fold cross-validation can be used.
o Understanding the inherent characteristics of the data is critical for choosing a suitable partitioning technique.
Model Complexity and Size:
o The complexity of your machine learning model plays a significant role in partitioning.
o If you have a large model with many parameters, you may require more data for training, which can influence the partitioning ratios.
o Less complex models require less data yet still benefit from partitioning to assess their performance.
Computational Resources:
o Consider the computational resources available to you, such as processing power and memory.
o Cross-validation strategies like k-fold require training the model multiple times, which can be resource-intensive.
o Holdout validation or a smaller number of folds may be more appropriate when resource constraints exist.
3. Results
Evaluation Metrics:
o After partitioning and model training, it is essential to examine evaluation metrics to assess model performance.
o Common metrics include accuracy, precision, recall, and F1-score for classification, and MAE, MSE, and R2 for regression.
o Understand the strengths and weaknesses of each metric with respect to your problem in order to make informed decisions.
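As a quick illustration, the scikit-learn snippet below computes the metrics mentioned above on small hand-made label vectors; the numbers are arbitrary and serve only to show the function calls.

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, mean_absolute_error,
                             mean_squared_error, r2_score)

# Classification example (true vs. predicted labels on a held-out test set)
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("F1-score :", f1_score(y_true, y_pred))

# Regression example
r_true = [3.0, -0.5, 2.0, 7.0]
r_pred = [2.5,  0.0, 2.0, 8.0]
print("MAE:", mean_absolute_error(r_true, r_pred))
print("MSE:", mean_squared_error(r_true, r_pred))
print("R2 :", r2_score(r_true, r_pred))
```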
Visualizing Model Performance:
o Visualization can help you gain insight into how your model is performing.
o Tools such as ROC curves, precision-recall curves, confusion matrices, and learning curves can visually represent performance.
o Visualizations help in recognizing areas where the model succeeds or struggles.
Iterative Refinement:
o Use the insights gained from evaluation and visualization to iteratively refine preprocessing, partitioning choices, and model hyperparameters.
Case Studies
Healthcare predictive modelling uses data mining techniques to predict patient outcomes or diseases. Partitioning is crucial here, as it separates data into training and testing sets. For instance, historical patient data can be split, with one part used to train a model and the other to test its predictive accuracy.
In the financial sector, fraud detection relies on partitioning to develop robust models. Partitioning ensures that fraud detection algorithms are tested on independent datasets, helping them identify fraudulent transactions accurately. For example, historical transaction data can be partitioned for model development and validation.
Partitioning methods are also used to split social media data into training and testing sets for sentiment analysis. This involves classifying user-generated content (e.g., tweets, comments, reviews) into positive, negative, or neutral sentiments to gain insights into public opinion and customer satisfaction. It helps businesses and organizations understand customer sentiment, improve products or services, and make informed marketing decisions.
With the exponential growth of data, conventional partitioning methods may no longer be efficient enough. Challenges arise in handling and processing large datasets. To address this, distributed computing and parallel processing techniques can be employed to ensure efficient partitioning.
Ethical concerns related to data protection and bias can affect partitioning. Data partitioning should be done with fairness and transparency in mind, especially when dealing with sensitive data, and proper anonymization and bias mitigation techniques should be employed.
Advanced partitioning methods, like adaptive and dynamic partitioning, are evolving to
enhance data mining processes. These techniques aim to optimize partitioning for improved
model performance and efficiency.
A Hierarchical clustering method works via grouping data into a tree of clusters. Hierarchical
clustering begins by treating every data point as a separate cluster. Then, it repeatedly executes the
subsequent steps:
1. Identify the 2 clusters which can be closest together, and
2. Merge the 2 maximum comparable clusters. We need to continue these steps until all the clusters
are merged together.
In hierarchical clustering, the aim is to produce a hierarchical series of nested clusters. A dendrogram, a tree-like diagram that records the sequences of merges or splits, graphically represents this hierarchy: it is an inverted tree that describes the order in which points are merged (bottom-up view) or clusters are split (top-down view).
What is Hierarchical Clustering?
Hierarchical clustering is a method of cluster analysis in data mining that creates a hierarchical representation of the clusters in a
dataset. The method starts by treating each data point as a separate cluster and then iteratively
combines the closest clusters until a stopping criterion is reached. The result of hierarchical clustering
is a tree-like structure, called a dendrogram, which illustrates the hierarchical relationships among the
clusters.
Hierarchical clustering has several advantages over other clustering methods
The ability to handle non-convex clusters and clusters of different sizes and densities.
The ability to handle missing data and noisy data.
The ability to reveal the hierarchical structure of the data, which can be useful for understanding
the relationships among the clusters.
Drawbacks of Hierarchical Clustering
The need for a criterion to stop the clustering process and determine the final number of clusters.
The computational cost and memory requirements of the method can be high, especially for large
datasets.
The results can be sensitive to the initial conditions, linkage criterion, and distance metric used.
In summary, Hierarchical clustering is a method of data mining that groups similar data points
into clusters by creating a hierarchical structure of the clusters.
This method can handle different types of data and reveal the relationships among the clusters.
However, it can have high computational cost and results can be sensitive to some conditions.
Types of Hierarchical Clustering
Basically, there are two types of hierarchical Clustering:
1. Agglomerative Clustering
2. Divisive clustering
1. Agglomerative Clustering
Initially, consider every data point as an individual cluster and, at every step, merge the nearest pair of clusters (it is a bottom-up method). At first, every data point is considered an individual entity or cluster; at every iteration, clusters merge with other clusters until one cluster is formed.
The algorithm for Agglomerative Hierarchical Clustering is:
1. Consider every data point as an individual cluster.
2. Calculate the similarity of each cluster with all the other clusters (compute the proximity matrix).
3. Merge the clusters that are highly similar or close to each other.
4. Recalculate the proximity matrix for the merged clusters.
5. Repeat steps 3 and 4 until only a single cluster remains.
Let’s see the graphical representation of this algorithm using a dendrogram.
Note: This is just a demonstration of how the algorithm works; no actual calculations have been performed below, and all the proximities among the clusters are assumed.
Let’s say we have six data points A, B, C, D, E, and F.
Step-1: Consider each alphabet as a single cluster and calculate the distance of one cluster from
all the other clusters.
Step-2: In the second step, comparable clusters are merged together to form single clusters. Let's say cluster (B) and cluster (C) are very similar to each other, so we merge them; similarly for clusters (D) and (E). At the end of this step we get the clusters [(A), (BC), (DE), (F)].
Step-3: We recalculate the proximity according to the algorithm and merge the two nearest
clusters([(DE), (F)]) together to form new clusters as [(A), (BC), (DEF)]
Step-4: Repeating the same process; The clusters DEF and BC are comparable and merged
together to form a new cluster. We’re now left with clusters [(A), (BCDEF)].
Step-5: At last, the two remaining clusters are merged together to form a single cluster
[(ABCDEF)].
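The following SciPy sketch reproduces the flavour of the A-F example; the 2-D coordinates assigned to the six points are invented purely for illustration, and single linkage is just one possible choice of merge criterion.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# Hypothetical 2-D coordinates for the six points A..F (illustration only)
labels = ["A", "B", "C", "D", "E", "F"]
points = np.array([[0.0, 0.0], [4.0, 4.0], [4.2, 4.1],
                   [9.0, 9.0], [9.1, 9.2], [7.0, 10.5]])

# Agglomerative clustering: start with 6 singleton clusters and
# repeatedly merge the two closest clusters (single linkage here)
Z = linkage(points, method="single", metric="euclidean")

# The dendrogram records the sequence of merges
dendrogram(Z, labels=labels)
plt.title("Dendrogram of the A-F example")
plt.show()

# Cut the tree to obtain, e.g., two flat clusters
print(fcluster(Z, t=2, criterion="maxclust"))
```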
2. Divisive Hierarchical clustering
We can say that Divisive Hierarchical clustering is precisely the opposite of Agglomerative
Hierarchical clustering. In Divisive Hierarchical clustering, we take into account all of the data points
as a single cluster and in every iteration, we separate the data points from the clusters which aren’t
comparable. In the end, we are left with N clusters.
Density-based clustering refers to a method that is based on local cluster criterion, such as
density connected points. In this tutorial, we will discuss density-based clustering with
examples.
MinPts: MinPts refers to the minimum number of points in an Eps neighborhood of that point.
A point i is considered directly density-reachable from a point k with respect to Eps and MinPts if
i belongs to NEps(k), and
k is a core point, i.e., NEps(k) contains at least MinPts points.
Density reachable:
A point i is density-reachable from a point j with respect to Eps and MinPts if there is a chain of points i1, …, in with i1 = j and in = i such that each point il+1 is directly density-reachable from il.
Density connected:
A point i is density-connected to a point j with respect to Eps and MinPts if there is a point o such that both i and j are density-reachable from o with respect to Eps and MinPts.
An object i is density-reachable from an object j with respect to ε and MinPts in a given set of objects D only if there is a chain of objects i1, …, in with i1 = j and in = i such that each il+1 is directly density-reachable from il with respect to ε and MinPts.
An object i is density-connected to an object j with respect to ε and MinPts in a given set of objects D only if there is an object o belonging to D such that both i and j are density-reachable from o with respect to ε and MinPts.
Major Features of Density-Based Clustering
The primary features of Density-based clustering are given below.
o It is a one-scan method: the database is scanned only once.
o It requires density parameters as a termination condition.
o It can handle noise in the data.
o Density-based clustering is used to identify clusters of arbitrary size.
Density-Based Clustering Methods
DBSCAN
DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise. It depends
on a density-based notion of cluster. It also identifies clusters of arbitrary size in the spatial
database with outliers.
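A minimal scikit-learn sketch of DBSCAN is shown below; the two-moons dataset and the parameter values (eps=0.2, min_samples=5) are illustrative choices, with eps playing the role of Eps and min_samples the role of MinPts described earlier.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two crescent-shaped clusters that centroid-based methods struggle with
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# eps plays the role of Eps and min_samples the role of MinPts above
db = DBSCAN(eps=0.2, min_samples=5).fit(X)

# Label -1 marks noise points; other labels are clusters of arbitrary shape
n_clusters = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
print("clusters found:", n_clusters)
print("noise points  :", list(db.labels_).count(-1))
```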
OPTICS
OPTICS stands for Ordering Points To Identify the Clustering Structure. It produces an ordering of the database with respect to its density-based clustering structure. This cluster ordering contains information equivalent to the density-based clusterings obtained over a wide range of parameter settings. OPTICS methods are beneficial for both automatic and interactive cluster
analysis, including determining an intrinsic clustering structure.
DENCLUE
DENCLUE (DENsity-based CLUstEring) models the overall density of a point set as the sum of influence functions of the individual data points and identifies clusters by means of local density maxima (density attractors).
Grid-based approaches include STING, which explores statistical data stored in the grid cells; WaveCluster, which clusters objects using a wavelet transform approach; and CLIQUE, which defines a grid- and density-based approach for clustering in high-dimensional data space.
When we deal with datasets that have multidimensional characteristics, we need the help of a grid-based approach. Such data includes spatial data such as geographical information, image data, or datasets with multiple attributes. If we divide this data space into a grid, we gain various advantages of the grid-based method. Some of these advantages are as follows.
1. Data Partitioning
This is a clustering method that classifies the information into many groups based on the characteristics and similarity of the data. With the help of data analysis, we can specify the number of clusters to be generated. Using the partitioning method, the data is organized into a user-specified number (K) of partitions, in which each partition represents a cluster and a particular region. Many algorithms are based on the data partitioning method, such as K-Means, PAM (K-Medoids), and CLARA (Clustering Large Applications).
2. Data Reduction
Data reduction is a technique used in data mining to reduce the size of a dataset while still preserving the most important information. We use it when the amount of data is too large to be processed efficiently or when the dataset contains a large amount of irrelevant or redundant information.
3. Local Pattern Discovery
With the help of the grid-based method, we can identify local patterns or trends within the data. By analyzing the data within individual cells, patterns and relationships that remain hidden at the level of the entire dataset can be uncovered. This is especially valuable for finding localized phenomena within data.
4. Scalability
This method is known for its scalability. It can handle large datasets, making it particularly useful when dealing with high-dimensional data. The partitioning of space inherently reduces the complexity of the analysis.
5. Density Estimation
Density-based clustering is one of the most popular unsupervised learning methodologies used in model building and machine learning algorithms. Data points lying in the low-density regions that separate two clusters are considered noise. The surroundings within a radius ε of a given object are known as the ε-neighbourhood of the object. If the ε-neighbourhood of an object contains at least a minimum number MinPts of objects, the object is called a core object.
6. Clustering and Classification
The grid-based mining method divides the space of instances into a finite number of cells. Clustering techniques are then applied using the cells of the grid, instead of individual data points, as the base units. The biggest advantage of this method is that it improves processing time.
7. Grid-Based Indexing
We can use grid-based indexing, which enables efficient access and retrieval of data. These structures organize the data based on the grid partitions, enhancing query performance and retrieval.
Several popular methods are based on the grid-based approach, and each has its own strengths and applications. Some of these methods are described below.
K-Means Clustering
K-Means is a centroid-based algorithm in which each cluster is associated with a centroid. The main purpose of this algorithm is to minimize the sum of distances between the data points and their corresponding cluster centroids.
This algorithm can take all the unlabeled datasets, divide those data sets into k-number of
clusters, and then repeat this process until it does not find the best clusters. The value of k
should be predetermined in this algorithm.
The algorithm mainly performs two tasks:
o The first is to determine the best positions for the K centre points, or centroids, by an iterative process.
o The second is to assign each data point to its closest centroid. The data points near a given centroid form a cluster.
How does the K-Means Algorithm Work?
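As a rough answer to the question above, here is a minimal NumPy sketch of the two iterative steps (assign each point to its nearest centroid, then recompute the centroids as cluster means); it assumes Euclidean distance and a fixed k, and the toy 2-D points are invented. In practice an optimized library implementation such as scikit-learn's KMeans would normally be used.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal K-Means: assign points to the nearest centroid, then recompute."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # initial centres
    for _ in range(n_iter):
        # Step 1: assign each data point to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0)
                                  if np.any(labels == j) else centroids[j]
                                  for j in range(k)])
        if np.allclose(new_centroids, centroids):   # converged
            break
        centroids = new_centroids
    return labels, centroids

# Toy usage: two obvious groups of 2-D points
X = np.array([[1, 1], [1.2, 0.8], [0.9, 1.1],
              [8, 8], [8.1, 7.9], [7.9, 8.2]], dtype=float)
labels, centroids = kmeans(X, k=2)
print(labels, centroids)
```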
STING (Statistical Information Grid)
STING stores, for each cell of the grid, statistical information about the attributes, such as the mean, maximum, and minimum values, which are precomputed and stored as statistical parameters. These statistical parameters are useful for query processing and other data analysis tasks. Its working procedure follows several steps.
Evaluation of Clustering is a process that determines the quality and value of clustering
outcomes in data mining and machine learning.
In data mining, to assess how well the data points have been clustered, we need to choose an appropriate clustering algorithm, set its parameters, and decide which metrics or techniques will be used for evaluation.
The main objective of clustering evaluation is to analyze the data with specific objectives to
improve performance and provide a better understanding of clustering solutions.
The following are some major reasons why Clustering is so important in data mining:
1. Pattern Discovery
In data mining, with the help of Clustering, we can discover the patterns and connections in
data. Because of this, it becomes simple to understand, and we can analyze the data by
combining similar data points that help us to reveal the unstructured data.
2. Data Summarization
With the help of Clustering, we can also summarize large data sets into a smaller cluster that
is much easier to manage. The data analysis process can be made simpler by working with
clusters rather than individual data points.
3. Anomaly Detection
Clustering helps us identify anomalies and outliers in the data. Data points that
are not part of any cluster or that form small, unusual clusters could indicate errors or unusual
events that need to be addressed.
4. Customer Segmentation
Clustering is a technique used in business and marketing to divide customers into different
groups according to their behaviour, preferences, or demographics. This segmentation
enables the customization of marketing plans and product offerings for particular customer
groups.
5. Image and Document Categorization: Clustering is useful for categorizing images and
documents. It assists in classifying and organizing texts, images, or documents based on
similarities, making it simpler to manage and retrieve information.
6. Recommendation Systems
In data mining, we can use clustering in e-commerce and content recommendation systems to group similar users and products. This helps recommendation systems suggest content that a user is likely to find interesting, based on the preferences of their cluster.
7. Scientific Research
In scientific research, clustering helps group similar observations, such as genes with similar functionality or species with similar characteristics, making large collections of measurements easier to explore.
8. Data preprocessing
Clustering can be used to reduce the dimensionality and noise in data as a preprocessing step
in data mining. The data is streamlined and made ready for additional analysis.
9. Risk Assessment
Using Clustering, we can find the risks and spot fraud in the finance sector. It also helps in
grouping unusual patterns in financial transactions for additional investigation.
In conclusion, Clustering is a flexible and essential data mining technique for organizing,
comprehending, and making sense of complex datasets. It is a useful tool for extracting important information from data, and it has broad applications in a variety of fields, from business and marketing to scientific research and beyond.
There are several clustering algorithms, and each has a distinctive methodology. The most
typical ones are:
1. Hierarchical Clustering
Hierarchical Clustering is a well-liked and effective method in data analysis and mining for
classifying data points into hierarchical cluster structures. Clusters are created iteratively
based on the similarity between data points using a bottom-up or top-down approach. A
dendrogram, which graphically depicts the relationships between data points and clusters, is
produced by hierarchical Clustering.
2. K means Clustering
A common data mining and machine learning technique called K-Means clustering involves
dividing data points into a predetermined number of clusters, denoted by the letter "K."
o Centroid-based: In K-Means clustering, each cluster is represented by a centroid, which is the average of the data points assigned to that cluster.
o K determination: In K-Means clustering, the number of clusters K must be chosen in advance, which can be difficult; techniques such as the silhouette score and the elbow method can be used to find a suitable value of K.
o Iterative Algorithm: K-Means employs an iterative process to minimize cluster
variance. Data points are assigned to the closest centroid after cluster centroids are
first randomly initialized. The process of recalculating centroids involves taking the
cluster mean and repeating it until convergence is achieved.
3. DBSCAN
Density-based spatial Clustering of Applications with Noise, or DBSCAN for short, is a widely
used clustering algorithm in machine learning and data mining. Compared to other clustering
algorithms, DBSCAN doesn't necessitate predetermining the number of clusters and works
especially well with datasets with unevenly shaped clusters and varied cluster sizes.
Evaluating the quality of the clustering results is essential to judge how well a clustering
algorithm has performed and whether or not Clustering has successfully revealed significant
patterns in the data. The following are a few typical clustering evaluation metrics:
1. Metrics for internal evaluation:
o Silhouette Score: The silhouette score measures how similar each data point is to its own cluster compared with neighbouring clusters. It ranges from +1 (well-separated clusters) to -1 (poor clustering).
o Davies-Bouldin Index: The average similarity between each cluster and its most
similar cluster is measured by the Davies-Bouldin Index. A lower value indicates Better
Clustering.
o Dunn Index: The Dunn Index is the ratio of the minimum inter-cluster distance to the maximum intra-cluster distance. Higher values indicate better-defined clusters.
o Calinski-Harabasz Index (Variance Ratio Criterion): Calculates the ratio of between-cluster variance to within-cluster variance. Higher values indicate better clustering.
o Xie-Beni Index: The Xie-Beni Index measures how far apart and compact a cluster is
while considering intra and inter-cluster distances.
o Divergence-Based Measures: Based on divergence, metrics such as the Davies-
Bouldin Index and Dunn Index show how different clusters are. These metrics work
well for evaluating density and cluster separation.
2. Metrics for external evaluation:
o Adjusted Rand Index (ARI): The true labels and the cluster assignments are
compared using the Adjusted Rand Index, which accounts for random variation. The
range of ARI values is +1 (perfect agreement) to -1 (no agreement).
o Normalized Mutual Information (NMI): Calculates the mutual information between the cluster assignments and the true labels, normalized to a 0-1 scale.
o Fowlkes-Mallows Index (FMI): Determines the geometric mean of recall and
precision between the cluster assignments and true labels.
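The scikit-learn snippet below illustrates several of these metrics on a synthetic dataset; the blob data and K-Means settings are arbitrary illustrative choices, and the Dunn and Xie-Beni indices are not part of scikit-learn, so they are omitted here.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (silhouette_score, davies_bouldin_score,
                             calinski_harabasz_score, adjusted_rand_score,
                             normalized_mutual_info_score, fowlkes_mallows_score)

X, y_true = make_blobs(n_samples=300, centers=3, random_state=42)
y_pred = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# Internal metrics: need only the data and the cluster assignments
print("Silhouette       :", silhouette_score(X, y_pred))
print("Davies-Bouldin   :", davies_bouldin_score(X, y_pred))
print("Calinski-Harabasz:", calinski_harabasz_score(X, y_pred))

# External metrics: compare the assignments against known ground-truth labels
print("ARI:", adjusted_rand_score(y_true, y_pred))
print("NMI:", normalized_mutual_info_score(y_true, y_pred))
print("FMI:", fowlkes_mallows_score(y_true, y_pred))
```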
Limitation of Clustering
Clustering also has some limitations. The following are typical ones:
1. Sensitivity to Initialization
The performance of many clustering algorithms, most notably K-Means, depends on the initial location of the cluster centroids or seed points. The process becomes less robust when minor changes in the initialization produce different clustering results.
2. Determining the Number of Clusters
For certain clustering algorithms (e.g., K-Means), it is necessary to know how many clusters (K) will be used ahead of time. Selecting the appropriate K value can be difficult, and a wrong choice can lead to sub-optimal results.
3. Scalability
Some clustering algorithms may not be effective for large datasets due to their computational
complexity. As the size of the dataset grows, hierarchical Clustering, for instance, may become
computationally costly.
4. Lack of Ground Truth
When using unsupervised clustering, no labelled data or ground truth may be available to assess the quality of the clustering. The evaluation then relies on heuristics and internal metrics, which are not always reliable.
5. Validity of Clusters
How well the data is clustered depends on the data itself and the algorithm employed. The produced clusters may not always be meaningful or relevant to the problem at hand.
6. Subjectivity
Choosing the best clustering algorithm and parameter configurations is frequently a matter of
opinion and the analyst's assessment. For the same dataset, different algorithms may yield
different outcomes.
Unit 5
In time-series data, data is measured as the long series of the numerical or textual data at equal time
intervals per minute, per hour, or per day. Time-series data mining is performed on the data obtained
from the stock markets, scientific data, and medical data. In time series mining it is not possible to
find the data that exactly matches the given query. We employ the similarity search method that finds
the data sequences that are similar to the given query string. In the similarity search method,
subsequence matching is performed to find the subsequences that are similar to a given query string.
In order to perform the similarity search, dimensionality reduction of the complex data is applied to transform the time-series data into a reduced numerical representation.
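As an illustration of subsequence matching, the sketch below slides a query over a longer series and scores each window with z-normalized Euclidean distance; this is just one common similarity measure, and the toy series is invented for demonstration.

```python
import numpy as np

def znorm(s):
    """Z-normalize a sequence so that scale and offset do not dominate the match."""
    s = np.asarray(s, dtype=float)
    std = s.std()
    return (s - s.mean()) / std if std > 0 else s - s.mean()

def best_subsequence_match(series, query):
    """Slide the query over the series and return (best_start, best_distance)."""
    series = np.asarray(series, dtype=float)
    m = len(query)
    q = znorm(query)
    best_start, best_dist = -1, np.inf
    for start in range(len(series) - m + 1):
        window = znorm(series[start:start + m])
        dist = np.linalg.norm(window - q)          # Euclidean distance
        if dist < best_dist:
            best_start, best_dist = start, dist
    return best_start, best_dist

# Toy usage: find where the query pattern occurs in a longer series
series = [1, 2, 3, 10, 11, 12, 3, 2, 1, 10, 12, 14]
query = [10, 11, 12]
print(best_subsequence_match(series, query))
```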
Symbolic sequences are composed of long nominal data sequences, which dynamically change their
behavior over time intervals. Examples of the Symbolic Sequences include online customer shopping
sequences as well as sequences of events of experiments. Mining of Symbolic Sequences is called
Sequential Mining. A sequential pattern is a subsequence that exists more frequently in a set of
sequences. so it finds the most frequent subsequence in a set of sequences to perform the mining.
Many scalable algorithms have been built to find out the frequent subsequence. There are also
algorithms to mine the multidimensional and multilevel sequential patterns.
Biological sequences are the long sequences of nucleotides and data mining of biological sequences is
required to find the features of the DNA of humans. Biological sequence analysis is the first step of
data mining to compare the alignment of the biological sequences. Two species are similar to each
other only if their nucleotide (DNA, RNA) and protein sequences are close and similar. During the
data mining of Biological Sequences, the degree of similarity between nucleotide sequences is
measured. The degree of similarity obtained by sequence alignment of nucleotides is essential in
determining the homology between two sequences.
There can be the situation of alignment of two or more input biological sequences by identifying similar
sequences with long subsequences. Amino acid (protein) sequences are also compared and aligned.
Graph Pattern Mining can be done by using Apriori-based and pattern growth-based approaches. We
can mine the subgraphs of the graph and the set of closed graphs. A closed graph g is the graph that
doesn’t have a super graph that carries the same support count as g. Graph Pattern Mining is applied
to different types of graphs such as frequent graphs, coherent graphs, and dense graphs. We can also
improve the mining efficiency by applying the user constraints on the graph patterns. Graph patterns
are of two types: homogeneous graph patterns, where the nodes or links of the graph are of the same type and have similar features, and heterogeneous graph patterns, where the nodes and links are of different types.
5. Statistical Modeling of Networks:
A network is a collection of nodes where each node represents the data and the nodes are linked
through edges, representing relationships between data objects. If all the nodes and links connecting
the nodes are of the same type, then the network is homogeneous such as a friend network or a web
page network. If the nodes and links connecting the nodes are of different types, then the network is
heterogeneous such as health-care networks (linking the different parameters such as doctors, nurses,
patients, diseases together in the network). Graph Pattern Mining can be further applied to the
network to derive the knowledge and useful patterns from the network.
Spatial data is the geo space-related data that is stored in large data repositories. The spatial data is
represented in “vector” format and geo-referenced multimedia format. A spatial database is
constructed from large geographic data warehouses by integrating geographical data of multiple
sources of areas. we can construct spatial data cubes that contain information about the spatial
dimensions and measures. It is possible to perform the OLAP operations on the spatial data for spatial
data analysis. Spatial data mining is performed on spatial data warehouses, spatial databases, and
other geospatial data repositories. Spatial Data mining discovers knowledge about the geographic
areas. The preprocessing of spatial data involves several operations like spatial clustering, spatial
classification, spatial modeling, and outlier detection in spatial data.
Cyber-Physical System Data can be mined by constructing a graph or network of data. A cyber-
physical system (CPS) is a heterogeneous network that consists of a large number of interconnected
nodes that store patients or medical information. The links in the CPS network represent the
relationships between the nodes. Cyber-physical systems store dynamic, inconsistent, and interdependent data that contains spatiotemporal information. Mining cyber-physical data treats the current situation as a query to access data from a large information database, and it involves real-time calculations and analysis to prompt responses from the CPS system. CPS analysis requires rare-event
detection and anomaly analysis in cyber-physical data streams, in cyber-physical networks, and the
processing of Cyber-Physical Data involves the integration of stream data with real-time automated
control processes.
Multimedia data objects include image data, video data, audio data, website hyperlinks, and linkages.
Multimedia data mining tries to find out interesting patterns from multimedia databases. This includes
the processing of the digital data and performs tasks like image processing, image classification,
video and audio data mining, and pattern recognition. Multimedia data mining is becoming a very interesting research area because data from social media platforms such as Twitter and Facebook can be analyzed through it to derive interesting trends and patterns.
Web mining is essential to discover crucial patterns and knowledge from the Web. Web content
mining analyzes data of several websites which includes the web pages and the multimedia data such
as images in the web pages. Web mining is done to understand the content of web pages, unique users
of the website, unique hypertext links, web page relevance and ranking, web page content summaries,
time that the users spent on the particular website, and understand user search patterns. Web mining
also finds out the best search engine and determines the search algorithm used by it. So it helps
improve search efficiency and finds the best search engine for the users.
Text mining is the subfield of data mining, machine learning, Natural Language processing, and
statistics. Most of the information in our daily life is stored as text such as news articles, technical
papers, books, email messages, blogs. Text Mining helps us to retrieve high-quality information from
text such as sentiment analysis, document summarization, text categorization, text clustering. We
apply machine learning models and NLP techniques to derive useful information from the text. This is
done by finding out the hidden patterns and trends by means such as statistical pattern learning and
statistical language modeling. In order to perform text mining, we need to preprocess the text by applying techniques such as stemming and lemmatization and then convert the textual data into data vectors.
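The sketch below shows one common way to preprocess text and turn it into vectors, assuming NLTK (for the Porter stemmer) and scikit-learn (for TF-IDF) are available; the three example documents are invented for illustration.

```python
import re
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "Data mining discovers hidden patterns in large text collections.",
    "Text mining applies NLP techniques to documents and blogs.",
    "Sentiment analysis classifies reviews as positive or negative.",
]

# Stemming: reduce words to their stems before vectorization
stemmer = PorterStemmer()

def stem_text(text):
    tokens = re.findall(r"[a-z]+", text.lower())
    return " ".join(stemmer.stem(tok) for tok in tokens)

stemmed_docs = [stem_text(d) for d in documents]

# TF-IDF turns the preprocessed text into numeric data vectors
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(stemmed_docs)
print(X.shape)                                  # (3 documents, vocabulary size)
print(vectorizer.get_feature_names_out()[:10])  # first few vocabulary terms
```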
The data that is related to both space and time is Spatiotemporal data. Spatiotemporal data mining
retrieves interesting patterns and knowledge from spatiotemporal data. Spatiotemporal Data mining
helps us to find the value of the lands, the age of the rocks and precious stones, predict the weather
patterns. Spatiotemporal data mining has many practical applications like GPS in mobile phones,
timers, Internet-based map services, weather services, satellite, RFID, sensor.
Stream data is data that changes dynamically; it is noisy and inconsistent and contains multidimensional features of different data types, so it is often stored in NoSQL database systems.
The volume of the stream data is very high and this is the challenge for the effective mining of stream
data. While mining the Data Streams we need to perform the tasks such as clustering, outlier analysis,
and the online detection of rare events in data streams.
Unit 5
Data Mining is the process of collecting data and then processing them to find useful patterns with the
help of statistics and machine learning processes. By finding the relationship between the database, the
peculiarities can be easily identified. Aggregating useful datasets from a heap of data in the database helps in the growth of many industries we depend on in our daily life and enhances customer service. We
can’t deny the fact that we live in a world of data. From the local grocery store to detecting network
frauds, data mining plays a significant role. Beyond the benefits, data mining has negative impacts on
society like privacy breaches and security problems. This article shows both the positive and negative
effects of data mining on society.
Positive effects of data mining on society
Data mining has influenced our lives whether we feel it or not. Its applications are widely used in many
fields to reduce strain and time. It has also supplemented the life of humans. Let’s see some of the
examples.
Customer relationship management: By using the techniques of data mining the company
provides customized and preferred services for customers which provides a pleasant experience
while using the services. By aggregating and grouping the data, the company can create
advertisements only when needed and it can reach the right people who require the service. By
targeting the customer, unwanted promotional activities can be avoided which saves a lot of money
for the company. The customer also doesn’t get annoyed when heaps of junk mails and messages
are not sent. Data mining can also help in saving time and provide satisfaction to the customers.
Personalized search engines: In the world of data and networks, our lives have become intertwined with web browsers and search engines. They have obtained an inevitable place in our lifestyle, knowledge gathering, and so on.
With the help of data mining algorithms, the suggestions and the order of websites are tailored
according to the gathered information by summarizing it. Ranking pages according to their content and number of visits also helps the search engine provide relevant results for the user's query. By giving a personalized environment, spam and misleading advertisements can be avoided.
By data mining, frequent spam accounts can be identified and they are automatically moved into
the spam folder. For e.g. Gmail has a spam folder where unwanted and frequent junk messages are
placed instead of piling up in the inbox. Web-wide tracking is a process in which the system keeps track of every website a user visits. By incorporating mechanisms such as DoubleClick in these websites, the visited websites can be noted, and personalized lifestyle and educational ads are then made visible on sites relevant to that user.
Mining in the health sector: Data mining helps in maintaining the health and welfare of a human.
The layers of data mining embedded in pharmaceutical industries help to analyze data, to establish
relationships while creating and improving drugs. It also helps in analyzing the effects of drugs on
patients, the side effects, and the outcomes. It also helps in tracking the number of chronically ill patients and ICU patients, which helps in reducing the overflow of admissions in hospitals.
Some medicines can also cause side effects or other benefits regardless of what disease it treats. In
such cases, data mining can largely influence the growth of the health sector.
E-shopping: E- retail platforms are one of the fastest-growing major industries in the world. From
books, movies, groceries, lifestyles everything is listed on online e-retail platforms. This cannot run
successfully without the help of data mining and predictive analysis. By these techniques, cross-
selling and holding onto regular customers have become possible. Data mining helps in announcing
offers and discounts to keep the customers intact and to increase sales. By using the algorithms of
data science, the e-commerce website can largely influence the customers using targeted ad
campaigns which will surge the number of users as well as it provide satisfactory results to
customers.
Crime prevention: Data mining plays a huge role in the prevention of crimes and reducing fraud
rates. In telecommunication industries, it helps in identifying subscription theft and super-imposed
frauds. It also helps in the identification of fraudulent calls. By doing this, user security can be
ensured and prevent the company from facing a huge loss. It also plays an important role in police
departments for identifying key patterns in crime and predicting them. It also helps in identifying
the unsolved crimes committed by the same criminal by establishing a relationship between
previous and present datasets in the crime database. By extracting and aggregating data, the police
department can identify future crimes and prevent them. It also helps in identifying the cause of
crime and the criminal behind that. This application largely supports the safety of people.
Negative effects of data mining on society
The exploitation of data and discrimination: By agreeing to the terms and conditions provided
by a company, the company gets access to collect data of the customers. From age groups to
economical status, the company profiles the customers. By customer profiling, they get to know the
datasets of rich, poor, elder, or younger. Some unethical or devious companies offer low credits or
inferior deals to the customer in an area where fewer sales rate is noted. For. eg. An unethical
company decreasing the credit scores in the loyalty card of people connected to a branch whose
transactions are less. Sometimes while profiling customers, wrongly accusing a customer happens.
Though he is faultless, his needs and comfort are denied. Even though the company declares the
customer faultless after investigations, still the wrongly accused customer struggles mentally and
this incident will negatively impact his life. Certain companies don’t take the responsibility of
securing the data of customers which makes the data vulnerable and causes privacy breaches.
Health-related ethical problem: Using data mining techniques, the companies can extract data
about the health problems of the employees. They can also relate the summarized dataset with the
datasets from the past history of previous employees. By discovering the pattern of diseases and
frequencies, the company chooses the specific insurance plans accordingly. But, there is a chance
that the company uses this data while hiring new employees. Hence, they avoid hiring people with
a higher frequency of sickness. Insurance companies collect this data so they can avoid policies
with companies with a high risk of health issues.
Privacy breach: Every single piece of data we enter into the database of the internet is indirectly
under the control of data miners. When used for unethical purposes, the privacy of an individual is
invaded. Certain companies use this data to identify people who are not yet customers but have the potential to become customers of that company. In this way, the company sends targeted advertisements and increases customer traffic. For example, in telecommunication industries, the call details of customers are collected to enhance business growth and to maintain low customer churn. But the company may use the
data selfishly for its growth and it leads to the exploitation of privacy. Thus every single piece of
data given to the network stands for greater risk under the influence of data mining.
Manipulation of data and unethical problems: There are circumstances in which normal data provided by a customer or user becomes manipulative data. For instance, when a customer runs a promotion on social media, it suggests he has a good financial status in his growing business; using such information, miners can obtain data unethically to gain profit or access. The spreading of false information and erroneous opinions through social media can mislead people, because when data miners collect this information it is treated as fact, which can lead to scams. Also, when predictive analytics and machine learning algorithms are used by governments to predict the outcome of an event, the prediction may sometimes fail and create a disastrous effect on the public. When predictions are based on unverified, unsafe sources, they can lead to severe losses, and the company may fail.
Invasive marketing: The junk advertisements that pile up on your mobile while using social media or other platforms are the result of data mining. Targeted ads benefit both seller and customer and save time, but when they become intense and unethical, wrong products are pushed through advertisements that may negatively influence the life of the user. From browser histories to previously purchased items, data is extracted and used to influence the user to buy other products, which may sometimes be harmful. This aggressive technique will cause undesirable effects on the
user. Every discovery or field has its own merits and demerits. A part of that application may help
the human and a part may degrade the values and ethics of society. As a part of the society we live
in, it is our duty to use the applications of technology following the rules and maintaining ethics.
Industries, companies and marketing agents should respect the privacy of individual humans and
should provide the space they need. When every single person out there takes responsibility for the
proper handling of data, data mining would be a gift of technology that could build and ease our
life in so many ways.
Data mining is one of the most widely used methods to extract data from different sources and organize it for better use. Although many commercial data mining systems exist, numerous challenges come up when they are actually implemented. With the rapid evolution of the field, companies are expected to stay abreast of all the new developments. Complex algorithms form the basis for data mining, as they allow data segmentation to identify trends and patterns, detect variations, and predict the probabilities of various events. The raw data may come in both analog and digital formats, and its form depends on the source of the data. Companies need to keep track of the latest data mining trends and stay updated to do well in the industry and overcome challenging competition.
Corporations can use data mining to discover customers' choices, build good relationships with customers, increase revenue, and reduce risks.
o Data Reduction
o Indexing Methods
o Similarity Search Methods (a minimal sketch follows this list)
o Query Languages
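These four topics are typically discussed in the context of mining time-series data. As an illustration of a similarity search method, here is a minimal Python sketch (not from the source; the series, query values, and function names are hypothetical): it slides a window across a series and ranks the windows by Euclidean distance to a query pattern.

    import math

    def euclidean(a, b):
        # Straight-line distance between two equal-length windows
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    def similarity_search(series, query, top_k=3):
        # Return the top_k (distance, start_index) pairs most similar to the query
        w = len(query)
        scored = [(euclidean(series[i:i + w], query), i)
                  for i in range(len(series) - w + 1)]
        return sorted(scored)[:top_k]

    series = [1, 2, 3, 10, 11, 12, 2, 3, 4, 9, 10, 11]   # hypothetical readings
    query = [10, 11, 12]                                  # pattern of interest
    print(similarity_search(series, query))               # best match starts at index 3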
2. Mining Symbolic Sequence
A symbolic sequence comprises an ordered list of elements that can be recorded with or without a sense of time. Such sequences arise in many settings, including consumer shopping sequences, web clickstreams, software execution traces, biological sequences, etc.
Mining sequential patterns entails identifying the subsequences that frequently appear in one or more sequences. As a result of substantial research in this area, many scalable algorithms have been developed. Alternatively, we can mine only the set of closed sequential patterns, where a sequential pattern s is closed if there exists no sequential pattern s' such that s is a proper subsequence of s' and s' has the same support as s.
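As a minimal sketch of the support counting that underlies sequential pattern mining (not the scalable algorithms referenced above; the sequence database and function names are hypothetical), the following Python snippet checks order-preserving containment of a pattern in customer shopping sequences and counts its support.

    def is_subsequence(pattern, sequence):
        # True if the pattern's items appear in the sequence in the same relative order
        it = iter(sequence)
        return all(item in it for item in pattern)

    def support(pattern, database):
        # Number of sequences in the database that contain the pattern
        return sum(is_subsequence(pattern, seq) for seq in database)

    # Hypothetical shopping / clickstream sequences
    db = [["home", "laptop", "mouse", "checkout"],
          ["home", "mouse", "checkout"],
          ["home", "laptop", "checkout"]]
    print(support(("home", "checkout"), db))   # 3: the pattern occurs in every sequence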
Businesses that have been slow in adopting the process of data mining are now catching up with the others. Extracting important information through data mining is widely used to make critical business decisions. In the coming decade, we can expect data mining to become as ubiquitous as some of the more prevalent technologies used today. Data mining concepts are still evolving; the latest trends include the following:
1. Application exploration
Data mining is increasingly used to explore applications in other areas, such as financial
analysis, telecommunications, biomedicine, wireless security, and science.
Multimedia data mining is one of the latest methods that is catching on because of the growing ability to capture useful data accurately. It involves extracting data from different kinds of multimedia sources such as audio, text, hypertext, video, and images. The data is converted into a numerical representation in different formats. This method can be used for clustering and classification, performing similarity checks, and identifying associations.
Another trend involves mining data from mobile devices to get information about individuals. Despite several challenges, such as complexity, privacy, and cost, this method has enormous potential across various industries, especially in studying human-computer interaction.
Data mining features are increasingly finding their way into many enterprise software use
cases, from sales forecasting in CRM SaaS platforms to cyber threat detection in intrusion
detection/prevention systems. The embedding of data mining into vertical market software
applications enables prediction capabilities for any number of industries and opens up new
realms of possibilities for unique value creation.
Spatial and geographic data mining, another trending type, extracts information from environmental, astronomical, and geographical data, including images taken from outer space. It can reveal aspects such as distance and topology, which are mainly used in geographic information systems and other navigation applications.
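As a small illustration of the distance computations such systems rely on (not from the source; the coordinates and function name are hypothetical examples), the Python sketch below computes the great-circle distance between two geographic points using the haversine formula.

    import math

    def haversine_km(lat1, lon1, lat2, lon2):
        # Great-circle distance in kilometres between two (latitude, longitude) points
        r = 6371.0  # mean Earth radius in km
        phi1, phi2 = math.radians(lat1), math.radians(lat2)
        dphi = math.radians(lat2 - lat1)
        dlam = math.radians(lon2 - lon1)
        a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
        return 2 * r * math.asin(math.sqrt(a))

    # Hypothetical example: New Delhi to Mumbai, roughly 1,150 km
    print(round(haversine_km(28.6139, 77.2090, 19.0760, 72.8777)))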
Time-series data mining is primarily applied to the study of cyclical and seasonal trends. This practice is also helpful in analyzing random events that occur outside the normal series of events. Retail companies mainly use this method to study customers' buying patterns and behaviors.
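As a minimal sketch of how such a series might be screened for unusual events (an assumption-based illustration, not from the source; the sales figures, window size, and threshold are hypothetical), the Python snippet below flags days whose sales deviate sharply from a trailing moving average.

    def flag_anomalies(values, window=7, threshold=1.5):
        # Flag indices whose value differs from the trailing moving average
        # by more than `threshold` times that average
        flags = []
        for i in range(window, len(values)):
            avg = sum(values[i - window:i]) / window
            if abs(values[i] - avg) > threshold * avg:
                flags.append(i)
        return flags

    daily_sales = [100, 98, 103, 99, 101, 97, 102, 100, 350, 99]   # hypothetical
    print(flag_anomalies(daily_sales))   # [8]: the one-day spike stands out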
Both the pharmaceutical and health care industries have long been innovators in the category
of data mining. The recent rapid development of coronavirus vaccines is directly attributed to
advances in pharmaceutical testing data mining techniques, specifically signal detection
during the clinical trial process for new drugs. In health care, specialized data mining
techniques are being used to analyze DNA sequences for creating custom therapies, make
better-informed diagnoses, and more.
Today's data mining solutions typically integrate ML and big data stores to provide advanced
data management functionality alongside sophisticated data analysis techniques. Earlier
incarnations of data mining involved manual coding by specialists with a deep background in
statistics and programming. Modern techniques are highly automated, with AI/ML replacing
most of these previously manual processes for developing pattern-discovering algorithms.
If history is any indication, significant product consolidation in the data mining space is
imminent as larger database vendors acquire data mining tooling startups to augment their
offerings with new features. The current fragmented market and a broad range of data mining
players resemble the adjacent big data vendor landscape that continues to undergo
consolidation.
Data is a set of discrete objective facts about an event or a process that have little use by
themselves unless converted into information. We have been collecting numerous data, from
simple numerical measurements and text documents to more complex information such as
spatial data, multimedia channels, and hypertext documents.
Nowadays, large quantities of data are being accumulated; the amount of data collected is said to almost double every year. To extract data or derive knowledge from this massive data, data mining techniques are used. Data mining is used almost everywhere a large amount of data is stored and processed. For example, banks typically use data mining to find prospective customers who could be interested in credit cards, personal loans, or insurance. Since banks have the transaction details and detailed profiles of their customers, they analyze all this data to find patterns that help them predict which customers are likely to be interested in personal loans, and so on.
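As a hedged, minimal sketch of that idea (assuming scikit-learn is installed; the customer features, figures, and labels are entirely hypothetical), the Python snippet below fits a small decision tree on past responses to loan offers and scores new customers.

    from sklearn.tree import DecisionTreeClassifier

    # Hypothetical features: [monthly_income, avg_balance, card_transactions_per_month]
    X = [[30000, 5000, 4], [80000, 60000, 25], [45000, 12000, 10],
         [120000, 90000, 30], [25000, 2000, 2], [70000, 40000, 18]]
    y = [0, 1, 0, 1, 0, 1]   # 1 = responded to a past personal-loan offer

    model = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

    # Shortlist likely prospects among new customers
    new_customers = [[90000, 70000, 22], [28000, 3000, 3]]
    print(model.predict(new_customers))   # e.g. [1 0]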
Basically, the motive behind mining data, whether commercial or scientific, is the same –
the need to find useful information in data to enable better decision-making or a better
understanding of the world around us.
“Extraction of interesting information or patterns from data in large databases is known as
data mining.”
According to William J. Frawley, “Data mining, or KDD (Knowledge Discovery in Databases) as it is also known, is the nontrivial extraction of implicit, previously unknown, and potentially useful information from data.”
Technically, data mining is the computational process of analyzing data from different perspectives, dimensions, and angles and categorizing or summarizing it into meaningful information. Data mining can be applied to any type of data, e.g., data warehouses, transactional databases, relational databases, multimedia databases, spatial databases, time-series databases, and the World Wide Web.
Data mining provides competitive advantages in the knowledge economy. It does this by
providing the maximum knowledge needed to rapidly make valuable business decisions
despite the enormous amounts of available data.
There are many measurable benefits that have been achieved in different application areas
from data mining. So, let’s discuss different applications of Data Mining:
Scientific Analysis: Scientific simulations are generating bulks of data every day, including data collected from nuclear laboratories, data about human psychology, etc. Data mining techniques are capable of analyzing these data. We can now capture and store new data faster than we can analyze the old data already accumulated. Examples of scientific analysis:
Sequence analysis in bioinformatics
Classification of astronomical objects
Medical decision support.
Intrusion Detection: A network intrusion refers to any unauthorized activity on a digital network. Network intrusions often involve stealing valuable network resources. Data mining techniques play a vital role in detecting intrusions, network attacks, and anomalies. They help in selecting and refining useful and relevant information from large data sets, and they help classify relevant data for an intrusion detection system, which raises alarms about foreign invasions in the network traffic. For example (a minimal sketch follows this list):
Detect security violations
Misuse Detection
Anomaly Detection
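As a minimal sketch of the anomaly-detection idea (assuming scikit-learn; the traffic features and numbers are hypothetical), the Python snippet below fits an Isolation Forest on normal connection statistics and marks a suspicious connection as an outlier.

    from sklearn.ensemble import IsolationForest

    # Hypothetical features: [packets_per_second, bytes_per_packet, failed_logins]
    normal_traffic = [[50, 500, 0], [55, 480, 0], [48, 510, 1],
                      [52, 495, 0], [60, 470, 1], [47, 505, 0]]
    detector = IsolationForest(contamination=0.1, random_state=0).fit(normal_traffic)

    suspicious = [[900, 60, 25]]          # burst of small packets, many failed logins
    print(detector.predict(suspicious))   # [-1] marks the connection as anomalous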
Business Transactions: Every business transaction is recorded for perpetuity. Such transactions are usually time-related and can be inter-business deals or intra-business operations. Using this data effectively and within a reasonable time frame for competitive decision-making is one of the most important problems for businesses struggling to survive in a highly competitive world. Data mining helps analyze these business transactions and identify marketing approaches and decision-making. Examples:
Direct mail targeting
Stock trading
Customer segmentation
Churn prediction (Churn prediction is one of the most popular Big Data use cases in
business)
Market Basket Analysis: Market basket analysis is a technique for carefully studying the purchases made by a customer in a supermarket. It identifies patterns of items frequently purchased together, and this analysis can help companies promote deals, offers, and sales; data mining techniques help achieve this task (a minimal sketch follows the list below). Examples:
Data mining concepts are used in sales and marketing to provide better customer service, improve cross-selling opportunities, and increase direct mail response rates.
Customer retention is made possible by data mining through pattern identification and prediction of likely defections.
Risk assessment and fraud detection also use data mining concepts to identify inappropriate or unusual behavior.
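As a minimal sketch of market basket analysis (not from the source; the baskets and support threshold are hypothetical), the Python snippet below counts how often item pairs appear together and keeps the pairs bought together in at least half of the baskets.

    from collections import Counter
    from itertools import combinations

    baskets = [{"bread", "milk", "butter"},
               {"bread", "butter"},
               {"milk", "eggs"},
               {"bread", "milk", "eggs", "butter"}]

    pair_counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(basket), 2):
            pair_counts[pair] += 1

    min_support = len(baskets) // 2   # pair must appear in at least half the baskets
    print([pair for pair, count in pair_counts.items() if count >= min_support])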
Education: For analyzing the education sector, data mining uses the Educational Data Mining (EDM) method. This method generates patterns that can be used both by learners and educators. Using EDM, we can perform educational tasks such as:
Predicting students' admission to higher education
Student profiling
Predicting student performance
Evaluating teachers' teaching performance
Curriculum development
Predicting student placement opportunities
Research: Data mining techniques can perform prediction, clustering, association, and grouping of data with precision in the research area. Rules generated by data mining are unique for finding results. In most technical research in data mining, we create a training model and a testing model. The train/test approach is a strategy to measure the accuracy of the proposed model: we split the data set into two sets, a training data set and a testing data set. The training data set is used to build the model, whereas the testing data set is used to evaluate it (a minimal sketch follows the list below). Examples:
Classification of uncertain data.
Information-based clustering.
Decision support system
Web Mining
Domain-driven data mining
IoT (Internet of Things) and Cybersecurity
Smart farming with IoT (Internet of Things)
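As a minimal sketch of the train/test strategy described above (assuming scikit-learn; the Iris dataset and tree settings are just an illustrative choice), the Python snippet below splits a data set, fits a model on the training part, and measures accuracy on the held-out testing part.

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42)   # 70% training, 30% testing

    model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
    print(accuracy_score(y_test, model.predict(X_test)))   # e.g. around 0.95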
Healthcare and Insurance: A pharmaceutical company can examine its recent sales force activity and its outcomes to improve the targeting of high-value physicians and determine which promotional activities will have the greatest effect in the upcoming months. In the insurance sector, data mining can help predict which customers will buy new policies, identify behavior patterns of risky customers, and identify fraudulent behavior of customers. Examples:
Claims analysis, i.e., which medical procedures are claimed together.
Identify successful medical therapies for different illnesses.
Characterize patient behavior to predict office visits.
Transportation: A diversified transportation company with a large direct sales force can apply data mining to identify the best prospects for its services. A large consumer goods organization can apply data mining to improve its sales process to retailers.
Determine the distribution schedules among outlets.
Analyze loading patterns.
Financial/Banking Sector: A credit card company can leverage its vast warehouse of customer transaction data to identify customers most likely to be interested in a new credit product (a minimal sketch of customer segmentation follows the list below).
Credit card fraud detection.
Identify ‘Loyal’ customers.
Extraction of information related to customers.
Determine credit card spending by customer groups.
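As a minimal sketch of segmenting customers by spending (assuming scikit-learn; the spending figures and cluster count are hypothetical), the Python snippet below groups credit card customers with k-means, one way to determine credit card spending by customer groups.

    from sklearn.cluster import KMeans

    # Hypothetical features: [monthly_spend, transactions_per_month]
    customers = [[200, 5], [250, 6], [220, 4],
                 [3000, 60], [2800, 55], [3200, 70]]

    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
    print(kmeans.labels_)   # e.g. [0 0 0 1 1 1]: low spenders vs. high spenders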