Chapter 4 New
Association rules are "if-then" statements that help show the probability of relationships
between data items within large data sets in various types of databases. Association rule mining
has a number of applications and is widely used to help discover sales correlations in
transactional data and relationships in medical data sets.
In data science, association rules are used to find correlations and co-occurrences between data
sets. They are ideally used to explain patterns in data from seemingly independent information
repositories, such as relational databases and transactional databases. The act of using association
rules is sometimes referred to as "association rule mining" or "mining associations."
Medicine. Doctors can use association rules to help diagnose patients. There are many variables
to consider when making a diagnosis, as many diseases share symptoms. By using association
rules and machine learning-fueled data analysis, doctors can determine the conditional
probability of a given illness by comparing symptom relationships in the data from past cases. As
new diagnoses get made, machine learning models can adapt the rules to reflect the updated
data.
Retail. Retailers can collect data about purchasing patterns, recording purchase data as item
barcodes are scanned by point-of-sale systems. Machine learning models can look for co-
occurrence in this data to determine which products are most likely to be purchased together.
The retailer can then adjust marketing and sales strategy to take advantage of this information.
User experience (UX) design. Developers can collect data on how consumers use a website they
create. They can then use associations in the data to optimize the website user interface -- by
analyzing where users tend to click and what maximizes the chance that they engage with a call
to action, for example.
Entertainment. Services like Netflix and Spotify can use association rules to fuel their content
recommendation engines. Machine learning models analyze past user behavior data for
frequent patterns, develop association rules and use those rules to recommend content that a
user is likely to engage with, or organize content in a way that is likely to put the most
interesting content for a given user first.
Association rule mining, at a basic level, involves the use of machine learning models to analyze
data for patterns, or co-occurrences, in a database. It identifies frequent if-then associations,
which themselves are the association rules.
An association rule has two parts: an antecedent (if) and a consequent (then). An antecedent is an
item found within the data. A consequent is an item found in combination with the antecedent.
Association rules are created by searching data for frequent if-then patterns and using the criteria
support and confidence to identify the most important relationships. Support is an indication of
how frequently the items appear in the data. Confidence indicates the number of times the if-then
statements are found true. A third metric, called lift, can be used to compare confidence with
expected confidence, or how many times an if-then statement is expected to be found true.
Association rules are calculated from itemsets, which are made up of two or more items. If rules
were built from all possible itemsets, there could be so many rules that they would hold little
meaning. For that reason, association rules are typically created only from itemsets that are well
represented in the data.
The strength of a given association rule is measured by two main parameters: support and
confidence. Support refers to how often a given rule appears in the database being mined.
Confidence refers to how often a given rule turns out to be true in practice. A rule may
show a strong correlation in a data set because it appears very often but may occur far less when
applied. This would be a case of high support but low confidence.
Conversely, a rule might not stand out much in a data set because it appears rarely, yet when its
antecedent does occur, the consequent follows almost every time. This would be a case of high
confidence and low support. Using these measures together helps analysts separate coincidence
from meaningful correlation and allows them to properly value a given rule.
A third parameter, known as the lift value, is the ratio of the rule's confidence to its expected
confidence, that is, to the support of the consequent. If the lift value is less than 1, there is a
negative correlation between the data points; if it is greater than 1, there is a positive correlation;
and if the ratio equals 1, there is no correlation.
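To make these measures concrete, here is a minimal Python sketch (the transaction list and function names are invented for illustration) that computes support, confidence, and lift for a hypothetical "if diapers then beer" rule:

# Illustrative sketch: computing support, confidence, and lift for a
# rule "if A then B" from a list of transactions. The data is made up.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]

def support(itemset, transactions):
    # Fraction of transactions that contain every item in the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    # Of the transactions containing the antecedent, the fraction that
    # also contain the consequent.
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

def lift(antecedent, consequent, transactions):
    # Observed confidence divided by the confidence expected if the
    # antecedent and consequent were independent.
    return confidence(antecedent, consequent, transactions) / support(consequent, transactions)

rule_a, rule_b = {"diapers"}, {"beer"}
print(support(rule_a | rule_b, transactions))   # support of {diapers, beer}
print(confidence(rule_a, rule_b, transactions)) # how often beer follows diapers
print(lift(rule_a, rule_b, transactions))       # > 1 suggests positive association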
Popular algorithms that use association rules include AIS, SETM, Apriori and variations of the
latter.
With the AIS algorithm, itemsets are generated and counted as the algorithm scans the data. In
transaction data, the AIS algorithm determines which large itemsets are contained in a transaction,
and new candidate itemsets are created by extending those large itemsets with other items in the
same transaction.
The SETM algorithm also generates candidate itemsets as it scans a database, but this algorithm
accounts for the itemsets at the end of its scan. New candidate itemsets are generated the same
way as with the AIS algorithm, but the transaction ID of the generating transaction is saved with
the candidate itemset in a sequential data structure. At the end of the pass, the support count of
candidate itemsets is created by aggregating the sequential structure. The downside of both the
AIS and SETM algorithms is that each one can generate and count many small candidate
itemsets, according to published materials from Dr. Saed Sayad, author of Real Time Data
Mining.
With the Apriori algorithm, candidate itemsets are generated using only the large itemsets of the
previous pass. The large itemsets of the previous pass are joined with themselves to generate all
itemsets whose size is larger by one. Each generated itemset that has a subset which is not large
is then deleted; the remaining itemsets are the candidates. The Apriori algorithm relies on the
property that any subset of a frequent itemset must itself be frequent. With this approach, the
algorithm reduces the number of candidates being considered by exploring only the itemsets whose
support count is greater than the minimum support count, according to Sayad.
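The join-and-prune idea behind Apriori can be sketched in a few lines of Python. This is only an illustration of the principle on a made-up transaction list and support threshold, not Sayad's exact formulation or a production implementation:

# Illustrative sketch of the Apriori idea: candidates of size k are built
# only from frequent (large) itemsets of size k-1, and any candidate with
# an infrequent subset is pruned before its support is counted.
from itertools import combinations

transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]
min_support = 0.6  # minimum fraction of transactions

def frequent_itemsets(transactions, min_support):
    n = len(transactions)
    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    current = [frozenset([i]) for i in items
               if sum(i in t for t in transactions) / n >= min_support]
    all_frequent = list(current)
    k = 2
    while current:
        # Join step: combine frequent (k-1)-itemsets to form k-item candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = [c for c in candidates
                      if all(frozenset(s) in current for s in combinations(c, k - 1))]
        # Count support and keep only the frequent candidates.
        current = [c for c in candidates
                   if sum(c <= t for t in transactions) / n >= min_support]
        all_frequent.extend(current)
        k += 1
    return all_frequent

print(frequent_itemsets(transactions, min_support))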
In data mining, association rules are useful for analyzing and predicting customer behavior. They
play an important part in customer analytics, market basket analysis, product clustering, catalog
design and store layout.
Programmers use association rules to build programs capable of machine learning. Machine
learning is a type of artificial intelligence (AI) that seeks to build programs that improve at a
task without being explicitly programmed to do so.
A classic example of association rule mining refers to a relationship between diapers and beers.
The example, which seems to be fictional, claims that men who go to a store to buy diapers are
also likely to buy beer. Data that would point to that might look like this:
A supermarket has 200,000 customer transactions. About 4,000 transactions, or about 2% of the
total number of transactions, include the purchase of diapers. About 5,500 transactions (2.75%)
include the purchase of beer. Of those, about 3,500 transactions (1.75% of the total) include both
the purchase of diapers and beer. If diaper and beer purchases were unrelated, the expected overlap
would be far smaller (roughly 0.055% of transactions, or about 110 of them). The fact that about
87.5% of diaper purchases also include the purchase of beer therefore indicates a link between
diapers and beer.
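Plugging the counts from this example into the measures defined earlier shows why the rule stands out; a quick Python check:

# Checking the supermarket numbers from the example above.
total = 200_000
diapers = 4_000          # transactions containing diapers
beer = 5_500             # transactions containing beer
both = 3_500             # transactions containing both

support_both = both / total                                           # 0.0175 -> 1.75%
confidence = both / diapers                                           # 0.875  -> 87.5%
expected_if_independent = (diapers / total) * (beer / total) * total  # only about 110 transactions
lift = support_both / ((diapers / total) * (beer / total))            # roughly 31.8, far above 1

print(support_both, confidence, expected_if_independent, lift)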
What is Classification?
Classification is the task of identifying the category or class label of a new observation. First, a set
of data is used as training data: the input data and the corresponding outputs are given to the
algorithm, so the training data set includes the input data and their associated class labels. Using
the training dataset, the algorithm derives a model, or classifier. The derived model can be a
decision tree, a mathematical formula, or a neural network. In classification, when unlabeled data
is given to the model, the model should find the class to which it belongs. The new data provided
to the model is the test data set.
The functioning of classification can be illustrated with a bank loan application, where the model
must label each new applicant as a safe or a risky borrower. There are two stages in the data
classification system: creating the classifier (or model) and applying the classifier for classification.
1. Developing the classifier or model creation: This level is the learning stage or the
learning process. The classification algorithms construct the classifier in this stage. A
classifier is constructed from a training set composed of database records and their
corresponding class names. Each record that makes up the training set belongs to a
category or class. We may also refer to these records as samples, objects, or data points.
2. Applying the classifier for classification: The classifier is used for classification at this
level. The test data are used here to estimate the accuracy of the classification rules.
If the accuracy is deemed sufficient, the rules can be applied to new data records.
Applications include:
o Sentiment Analysis: Sentiment analysis is highly helpful in social media
monitoring. We can use it to extract social media insights. With advanced machine
learning algorithms, we can build sentiment analysis models that read and analyze
even misspelled words. Accurately trained models provide consistently accurate
outcomes in a fraction of the time manual analysis would take.
o Document Classification: We can use document classification to organize the
documents into sections according to the content. Document classification refers
to text classification; we can classify the words in the entire document. And with
the help of machine learning classification algorithms, we can execute it
automatically.
o Image Classification: Image classification assigns an image to one of a set of
trained categories. These could relate to the caption of the image, a statistical
value, or a theme. You can tag images to train your model for the relevant
categories by applying supervised learning algorithms.
o Machine Learning Classification: It uses statistically demonstrable
algorithm rules to execute analytical tasks that would take humans hundreds of
hours to perform.
3. Data Classification Process: The data classification process can be categorized into five
steps:
o Create the goals, strategy, workflows, and architecture of data classification.
o Classify the confidential details that we store.
o Apply labels by tagging the data.
o Use the results to improve security and compliance.
o Recognize that data is complex and that classification is a continuous process.
The data classification life cycle produces an excellent structure for controlling the flow of data
in an enterprise. Businesses need to account for data security and compliance at each level. Data
classification helps us do this at every stage, from origin to deletion. The data life cycle has the
following stages:
1. Origin: Sensitive data is produced in various formats, such as emails, Excel and Word
files, Google documents, social media, and websites.
2. Role-based practice: Role-based security restrictions are applied to all sensitive data by
tagging it based on in-house protection policies and compliance rules.
3. Storage: Here, we have the obtained data, including access controls and encryption.
4. Sharing: Data is continually distributed among agents, consumers, and co-workers from
various devices and platforms.
5. Archive: Here, data is eventually archived within an industry's storage systems.
6. Publication: Through the publication of data, it can reach customers. They can then view
and download it in the form of dashboards.
What is Prediction?
Another process of data analysis is prediction. It is used to find a numerical output. As in
classification, the training dataset contains the inputs and their corresponding numerical output
values. The algorithm derives the model, or predictor, from the training dataset. The model
should find a numerical output when new data is given. Unlike classification, this method
does not have a class label; the model predicts a continuous-valued function or an ordered value.
Regression is generally used for prediction. Predicting the value of a house based on facts
such as the number of rooms, the total area, and so on is an example of prediction.
For example, suppose a marketing manager needs to predict how much a particular customer
will spend at his company during a sale. In this case, we need to forecast a numerical value, so
this data processing activity is an example of numeric prediction. Here, a model or predictor is
developed that forecasts a continuous-valued or ordered function.
The major issue is preparing the data for Classification and Prediction. Preparing the data
involves the following activities, such as:
1. Data Cleaning: Data cleaning involves removing the noise and treatment of missing values. The
noise is removed by applying smoothing techniques, and the problem of missing values is solved
by replacing a missing value with the most commonly occurring value for that attribute.
2. Relevance Analysis: The database may also have irrelevant attributes. Correlation analysis is
used to know whether any two given attributes are related.
3. Data Transformation and reduction: The data can be transformed by any of the following
methods.
o Normalization: The data is transformed using normalization. Normalization involves
scaling all values of a given attribute so that they fall within a small specified range.
Normalization is used when neural networks or methods involving distance
measurements are used in the learning step (a small pandas sketch of cleaning and
normalization follows this list).
o Generalization: The data can also be transformed by generalizing it to the higher
concept. For this purpose, we can use the concept hierarchies.
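As a rough illustration of the cleaning and normalization steps above, here is a small pandas sketch; the column names and values are invented for the example:

# Illustrative sketch: fill missing values with the most common value for
# the attribute, then min-max normalize a numeric attribute to [0, 1].
import pandas as pd

df = pd.DataFrame({
    "income": [48_000, 52_000, None, 61_000, 75_000],
    "employment": ["salaried", "self-employed", "salaried", None, "salaried"],
})

# Data cleaning: replace a missing value with the most commonly
# occurring value for that attribute.
df["employment"] = df["employment"].fillna(df["employment"].mode()[0])
df["income"] = df["income"].fillna(df["income"].mode()[0])

# Normalization: scale all values of the attribute into a small range.
df["income_scaled"] = (df["income"] - df["income"].min()) / (df["income"].max() - df["income"].min())

print(df)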
Here are the criteria for comparing the methods of Classification and Prediction, such as:
Accuracy: The accuracy of the classifier refers to its ability to predict the class label correctly,
and the accuracy of the predictor refers to how well a given predictor can estimate the unknown
value.
Speed: The speed of the method depends on the computational cost of generating and using
the classifier or predictor.
Robustness: Robustness is the ability to make correct predictions or classifications. In the
context of data mining, robustness is the ability of the classifier or predictor to make correct
predictions from noisy data or data with missing values.
Scalability: Scalability refers to the ability of the classifier or predictor to be constructed
efficiently and to perform well as the amount of data grows.
The decision tree, applied to existing data, is a classification model. We can get a class
prediction by applying it to new data for which the class is unknown. The assumption is that the
new data comes from a distribution similar to the data we used to construct our decision tree. In
many instances, this is a correct assumption, so we can use the decision tree to build a predictive
model. Classification or prediction is the process of finding a model that describes the classes or
concepts of the data. The purpose is to use this model to predict the class of objects whose class
label is unknown. Below are some major differences between classification and prediction.
Classification: For example, the grouping of patients based on their medical records can be
considered a classification.
Prediction: For example, we can think of prediction as predicting the correct treatment for a
particular disease for a person.
Common classification algorithms include the following:
Logistic Regression
Naive Bayes
K-Nearest Neighbors
Decision Tree
Support Vector Machines
Logistic Regression
Logistic regression is a calculation used to predict a binary outcome: either something happens,
or does not. This can be exhibited as Yes/No, Pass/Fail, Alive/Dead, etc.
Independent variables are analyzed to determine the binary outcome with the results falling into
one of two categories. The independent variables can be categorical or numeric, but the
dependent variable is always categorical. Written like this:
P(Y=1|X) or P(Y=0|X)
This can be used to calculate the probability of a word having a positive or negative connotation
(0, 1, or on a scale between). Or it can be used to determine the object contained in a photo (tree,
flower, grass, etc.), with each object given a probability between 0 and 1.
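A minimal scikit-learn sketch of this idea, using an invented pass/fail data set; the feature names and values are assumptions made for illustration:

# Illustrative sketch: predicting a binary (0/1) outcome with logistic regression.
from sklearn.linear_model import LogisticRegression

# Features: [hours_studied, classes_missed]; outcome: 1 = pass, 0 = fail.
X = [[10, 0], [2, 5], [8, 1], [1, 6], [7, 2], [3, 4]]
y = [1, 0, 1, 0, 1, 0]

model = LogisticRegression()
model.fit(X, y)

# predict_proba returns P(Y=0|X) and P(Y=1|X) for a new observation.
print(model.predict_proba([[6, 1]]))
print(model.predict([[6, 1]]))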
Naive Bayes
Naive Bayes calculates the possibility of whether a data point belongs within a certain category
or does not. In text analysis, it can be used to categorize words or phrases as belonging to a
preset “tag” (classification) or not. For example:
To decide whether or not a phrase should be tagged as "sports," you need to calculate:
P(A|B) = P(B|A) × P(A) / P(B)
In other words, the probability of A, if B is true, is equal to the probability of B, if A is true, times
the probability of A being true, divided by the probability of B being true.
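A small sketch of this idea with scikit-learn's multinomial naive Bayes; the phrases, tags, and the choice of a bag-of-words representation are assumptions made for the example:

# Illustrative sketch: deciding whether a phrase should be tagged "sports"
# with a naive Bayes text classifier over word counts.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

phrases = [
    "the team won the match",
    "great goal in the final minute",
    "the election results were announced",
    "parliament passed the new budget",
]
tags = ["sports", "sports", "not_sports", "not_sports"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(phrases, tags)

print(model.predict(["a thrilling match with two late goals"]))
print(model.predict_proba(["a thrilling match with two late goals"]))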
K-nearest Neighbors
K-nearest neighbors (k-NN) is a pattern recognition algorithm that uses training datasets to find
the k closest relatives in future examples.
When k-NN is used in classification, a new data point is placed in the category of its nearest
neighbors. If k = 1, the point is assigned to the class of its single nearest neighbor; for larger k,
the class is decided by a plurality vote of its k nearest neighbors.
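A minimal k-NN classification sketch with scikit-learn, using invented two-dimensional points and k = 3:

# Illustrative sketch: k-nearest neighbors classification with k = 3.
from sklearn.neighbors import KNeighborsClassifier

X = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]]
y = ["A", "A", "A", "B", "B", "B"]

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)

# A new point is assigned the class held by the plurality of its
# three nearest neighbors.
print(knn.predict([[2, 2]]))   # expected: "A"
print(knn.predict([[7, 8]]))   # expected: "B"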
Decision Tree
A decision tree is a supervised learning algorithm that is well suited to classification problems, as
it's able to order classes at a precise level. It works like a flow chart, separating data points into
two similar categories at a time, from the "tree trunk" to "branches" to "leaves," where the
categories become more and more finely similar. This creates categories within categories,
allowing for organic classification with limited human supervision.
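A brief decision tree sketch with scikit-learn; the loan-style features and labels are invented for illustration, and export_text prints the learned flow-chart-like splits:

# Illustrative sketch: a small decision tree classifier on made-up loan data.
from sklearn.tree import DecisionTreeClassifier, export_text

# Features: [income in thousands, years employed]; labels: approve / reject.
X = [[25, 1], [60, 5], [45, 3], [20, 0], [80, 10], [30, 2]]
y = ["reject", "approve", "approve", "reject", "approve", "reject"]

tree = DecisionTreeClassifier(max_depth=2)
tree.fit(X, y)

# Print the learned splits as a readable flow chart.
print(export_text(tree, feature_names=["income", "years_employed"]))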
Common prediction algorithms include the following:
Linear Regression
Logistic Regression
Neural Network
Decision Trees
Naive Bayes
1. Linear Regression
Linear Regression falls under the category of supervised learning, in which the variable that
needs to be predicted is known as the dependent variable and the variable through which we
predict it is known as the independent variable.
The data collected through the data mining process is typically stored in a CSV file, which is
then loaded into a Jupyter Notebook where we perform predictive analysis and apply ML
algorithms to the data. The first step is to read the data and perform some basic exploratory data
analysis; we then train a model on the dataset so it can make future predictions.
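A sketch of that workflow, assuming a hypothetical houses.csv file with rooms, total_area, and price columns (the file name and column names are invented for the example):

# Illustrative sketch: read a CSV, look at the data, train linear regression.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv("houses.csv")           # hypothetical file collected earlier
print(df.head())                          # basic exploratory look at the data
print(df.describe())

X = df[["rooms", "total_area"]]           # independent variables (assumed columns)
y = df["price"]                           # dependent variable to be predicted

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LinearRegression().fit(X_train, y_train)
print(model.score(X_test, y_test))        # R^2 on held-out data
print(model.predict([[3, 120]]))          # predict the price of a new house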
2. Logistic Regression
Logistic Regression is used to predict a dependent variable by analyzing the relationship between
one or more existing independent variables. This model can take many input criteria into
consideration. Based on earlier observations of the independent variables, we predict the outcome
category of the dependent variable by estimating the probability of falling into each category.
The main difference between Logistic and Linear Regressions is Logistic Regression is used
when the response variable is categorical such as yes/no, true/false while Linear Regression is
used when the response variable is continuous such as hours, height and weight.
3. Neural Network
Neural Network Algorithm is developed by considering the human brain that takes a set of units
as input and transfers results to a predefined output. It tries to predict the dependent variable in a
way a human brain would. A Neural Network for prediction is made by taking a web of input
nodes, an output node, and a hidden node present between the two nodes. The hidden layer
between the two nodes is what makes this prediction technique unique and efficient than other
predictive tools. Every time data passes through the web the algorithm incorporates the data that
passes through it by giving weights to the nodes in the hidden layer.
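A minimal sketch of a one-hidden-layer network used as a numeric predictor, with invented data; scikit-learn's MLPRegressor stands in for the general idea:

# Illustrative sketch: a small feed-forward neural network as a predictor.
from sklearn.neural_network import MLPRegressor

# Inputs: [rooms, total_area]; output: price in thousands.
X = [[2, 60], [3, 90], [4, 120], [3, 100], [5, 160], [2, 55]]
y = [110, 180, 250, 200, 330, 100]

net = MLPRegressor(hidden_layer_sizes=(8,),   # one hidden layer of 8 nodes
                   max_iter=5000,
                   random_state=0)
net.fit(X, y)

print(net.predict([[4, 130]]))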
4. Decision Trees
The decision tree is an important algorithm in predictive modeling because it lets us represent
decisions visually. Based on certain conditions, we derive all possible outcomes using a branching
methodology. Decision trees come in two forms:
Classification trees
Regression trees
The classification tree is used to separate a dataset into different classes when the response
variable is categorical in nature.
Regression trees are used when the response variable is numerical or continuous. A decision
tree algorithm builds a tree that represents the classification rules; the leaves of the tree are the
predicted decisions.
5. Naive Bayes
This algorithm works on Bayes' probability theorem, alternatively known as Bayes' rule or
Bayes' law. It is a simple algorithm known for its effectiveness in quickly building predictive
models and making predictions with them.
Clustering or cluster analysis is a machine learning technique, which groups the unlabelled
dataset. It can be defined as "A way of grouping the data points into different clusters,
consisting of similar data points. The objects with the possible similarities remain in a group
that has less or no similarities with another group."
It does it by finding some similar patterns in the unlabelled dataset such as shape, size, color,
behavior, etc., and divides them as per the presence and absence of those similar patterns.
After applying this clustering technique, each cluster or group is assigned a cluster ID. An ML
system can use this ID to simplify the processing of large and complex datasets.
Example: Let's understand the clustering technique with the real-world example of a shopping mall.
When we visit any shopping mall, we can observe that items with similar uses are grouped together:
t-shirts are grouped in one section and trousers in another, and in the produce section, apples,
bananas, mangoes, and so on are grouped separately so that we can easily find what we need. The
clustering technique works in the same way. Another example of clustering is grouping documents
according to topic.
The clustering technique can be widely used for various tasks. Some of the most common uses of
this technique are:
Market Segmentation
Statistical data analysis
Social network analysis
Image segmentation
Anomaly detection, etc.
Apart from these general uses, clustering is used by Amazon in its recommendation system to
provide recommendations based on a user's past product searches. Netflix also uses this technique
to recommend movies and web series to its users based on their watch history.
As an illustration of how a clustering algorithm works, imagine a collection of different fruits
being divided into several groups whose members share similar properties.
Types of Clustering Methods
The clustering methods are broadly divided into hard clustering (each data point belongs to only
one group) and soft clustering (a data point can belong to more than one group). Various other
approaches to clustering also exist. Below are the main clustering methods used in machine
learning:
1. Partitioning Clustering
2. Density-Based Clustering
3. Distribution Model-Based Clustering
4. Hierarchical Clustering
5. Fuzzy Clustering
Partitioning Clustering
It is a type of clustering that divides the data into non-hierarchical groups. It is also known as the
centroid-based method. The most common example of partitioning clustering is the K-Means
Clustering algorithm.
In this type, the dataset is divided into a set of k groups, where k defines the number of
pre-defined groups. The cluster centers are created in such a way that each data point is closer to
its own cluster's centroid than to the centroid of any other cluster.
Density-Based Clustering
The density-based clustering method connects the highly-dense areas into clusters, and the
arbitrarily shaped distributions are formed as long as the dense region can be connected. This
algorithm does it by identifying different clusters in the dataset and connects the areas of high
densities into clusters. The dense areas in data space are divided from each other by sparser
areas.
These algorithms can face difficulty in clustering the data points if the dataset has varying
densities and high dimensions.
Distribution Model-Based Clustering
In the distribution model-based clustering method, the data is divided based on the probability of
how a dataset belongs to a particular distribution. The grouping is done by assuming some
distributions commonly Gaussian Distribution.
The example of this type is the Expectation-Maximization Clustering algorithm that uses
Gaussian Mixture Models (GMM).
Hierarchical Clustering
Hierarchical clustering can be used as an alternative to partitioning clustering, as there is no
requirement to pre-specify the number of clusters to be created. In this technique, the dataset
is divided into clusters to create a tree-like structure, which is also called a dendrogram. The
observations, or any desired number of clusters, can be selected by cutting the tree at the
appropriate level. The most common example of this method is the agglomerative hierarchical
algorithm.
Fuzzy Clustering
Fuzzy clustering is a type of soft method in which a data object may belong to more than one
group or cluster. Each data object has a set of membership coefficients that express its degree of
membership in each cluster. The fuzzy C-means algorithm is the example of this type of
clustering; it is sometimes also known as the fuzzy k-means algorithm.
Clustering Algorithms
Clustering algorithms can be divided according to the models explained above. Many different
clustering algorithms have been published, but only a few are commonly used. The choice of
clustering algorithm depends on the kind of data we are using: some algorithms require an
estimate of the number of clusters in the given dataset, while others work from the minimum
distance between observations in the dataset.
Here we discuss the most popular clustering algorithms that are widely used in machine
learning:
1. K-Means algorithm: The k-means algorithm is one of the most popular clustering algorithms. It
partitions the dataset by dividing the samples into k clusters of roughly equal variance. The
number of clusters must be specified in advance. It is fast and requires relatively few
computations, with linear complexity of O(n). (A minimal k-means sketch appears after this list.)
2. Mean-shift algorithm: Mean-shift algorithm tries to find the dense areas in the smooth density
of data points. It is an example of a centroid-based model, that works on updating the
candidates for centroid to be the center of the points within a given region.
3. DBSCAN Algorithm: It stands for Density-Based Spatial Clustering of Applications with Noise. It
is an example of a density-based model similar to the mean-shift, but with some remarkable
advantages. In this algorithm, the areas of high density are separated by the areas of low
density. Because of this, the clusters can be found in any arbitrary shape.
4. Expectation-Maximization Clustering using GMM: This algorithm can be used as an alternative
to the k-means algorithm, or in cases where k-means can fail. In GMM, it is assumed that the
data points are Gaussian distributed.
5. Agglomerative Hierarchical algorithm: The Agglomerative hierarchical algorithm performs the
bottom-up hierarchical clustering. In this, each data point is treated as a single cluster at the
outset and then successively merged. The cluster hierarchy can be represented as a tree-
structure.
6. Affinity Propagation: It is different from other clustering algorithms in that it does not require
the number of clusters to be specified. In this algorithm, pairs of data points exchange messages
until convergence. Its O(N²T) time complexity is the main drawback of this
algorithm.
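As referenced in the k-means item above, here is a minimal k-means sketch with scikit-learn; the points and the choice of k = 2 are invented:

# Illustrative sketch: k-means clustering on a few two-dimensional points.
from sklearn.cluster import KMeans

X = [[1, 2], [1, 4], [2, 3], [8, 8], [9, 9], [8, 10]]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print(labels)                   # cluster ID assigned to each point
print(kmeans.cluster_centers_)  # one centroid per cluster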
Applications of Clustering
Below are some commonly known applications of clustering technique in Machine Learning:
In Identification of Cancer Cells: The clustering algorithms are widely used for the identification
of cancerous cells. It divides the cancerous and non-cancerous data sets into different groups.
In Search Engines: Search engines also work on the clustering technique. The search result
appears based on the closest object to the search query. It does it by grouping similar data
objects in one group that is far from the other dissimilar objects. The accurate result of a query
depends on the quality of the clustering algorithm used.
Customer Segmentation: It is used in market research to segment the customers based on their
choice and preferences.
In Biology: It is used in the biology stream to classify different species of plants and animals
using the image recognition technique.
In Land Use: The clustering technique is used to identify areas of similar land use in a GIS
database. This can be very useful for determining the purpose for which a particular area of land
is best suited.