
Data Analytics and Model Evaluation

Hierarchical clustering is an unsupervised machine learning algorithm that groups unlabeled datasets into clusters. It creates a hierarchy of clusters in the form of a tree called a dendrogram. Unlike K-means clustering, the number of clusters does not need to be predetermined. There are two approaches - agglomerative, which is a bottom-up approach that starts with each data point as a cluster and merges them; and divisive, which is a top-down approach that starts with all data points in one cluster and splits them. Agglomerative hierarchical clustering is popular and works by iteratively merging the closest clusters until all data points are in one cluster, represented by the dendrogram.


https://www.javatpoint.com/k-means-clustering-algorithm-in-machine-learning

Hierarchical Clustering in Machine Learning


Hierarchical clustering is another unsupervised machine learning algorithm used to group unlabeled datasets into clusters. It is also known as hierarchical cluster analysis (HCA).

In this algorithm, we develop the hierarchy of clusters in the form of a tree, and this tree-
shaped structure is known as the dendrogram.

Sometimes the results of K-means clustering and hierarchical clustering may look similar, but the two differ in how they work. In particular, hierarchical clustering has no requirement to predetermine the number of clusters, as we did in the K-means algorithm.

The hierarchical clustering technique has two approaches:

1. Agglomerative: a bottom-up approach in which the algorithm starts by treating each data point as a single cluster and merges clusters until only one cluster is left.
2. Divisive: the reverse of the agglomerative algorithm; a top-down approach that starts with all data points in one cluster and repeatedly splits it.

Why hierarchical clustering?


Since we already have other clustering algorithms such as K-means, why do we need hierarchical clustering? As we saw with K-means clustering, that algorithm has some challenges: the number of clusters must be predetermined, and it tends to create clusters of similar size. To address these two challenges, we can opt for the hierarchical clustering algorithm, because it does not require prior knowledge of the number of clusters.

In this topic, we will discuss the Agglomerative Hierarchical clustering algorithm.

Agglomerative Hierarchical clustering


The agglomerative hierarchical clustering algorithm is a popular example of HCA. To group the data into clusters, it follows a bottom-up approach: the algorithm treats each data point as a single cluster at the beginning and then repeatedly merges the closest pair of clusters. It does this until all clusters are merged into a single cluster that contains the entire dataset.

This hierarchy of clusters is represented in the form of the dendrogram.

How Does Agglomerative Hierarchical Clustering Work?


The working of the AHC algorithm can be explained using the steps below:

o Step 1: Treat each data point as a single cluster. If there are N data points, the number of clusters will also be N.
o Step 2: Take the two closest data points or clusters and merge them to form one cluster, leaving N-1 clusters.
o Step 3: Again, take the two closest clusters and merge them together, leaving N-2 clusters.
o Step 4: Repeat Step 3 until only one cluster is left.
o Step 5: Once all the clusters are combined into one big cluster, use the dendrogram to divide the clusters as the problem requires.

Note: To better understand hierarchical clustering, it is advised to have a look at k-means clustering.

Measure for the distance between two clusters


As we have seen, the distance between the two closest clusters is crucial for hierarchical clustering. There are various ways to calculate the distance between two clusters, and these ways define the rule for clustering. These measures are called linkage methods. Some of the popular linkage methods are given below:
1. Single Linkage: the shortest distance between the closest points of the two clusters.

2. Complete Linkage: the farthest distance between two points of two different clusters. It is one of the popular linkage methods, as it forms tighter clusters than single linkage.

3. Average Linkage: the distance between every pair of points (one from each cluster) is added up and divided by the number of pairs to give the average distance between the two clusters. It is also one of the most popular linkage methods.
4. Centroid Linkage: the distance between the centroids of the two clusters is calculated.

Any of the above approaches can be applied, depending on the type of problem or business requirement.

Working of the Dendrogram in Hierarchical Clustering


The dendrogram is a tree-like structure that records each merge step the HC algorithm performs. In the dendrogram plot, the y-axis shows the Euclidean distances between the data points, and the x-axis shows all the data points of the given dataset.

The working of the dendrogram can be explained with a diagram in which the left part shows how clusters are created in agglomerative clustering and the right part shows the corresponding dendrogram.

o As discussed above, the data points P2 and P3 combine first to form a cluster, and correspondingly a dendrogram link is created connecting P2 and P3 with a rectangular shape. The height is decided according to the Euclidean distance between the data points.
o In the next step, P5 and P6 form a cluster, and the corresponding dendrogram is created. It is higher than the previous one, as the Euclidean distance between P5 and P6 is a little greater than that between P2 and P3.
o Again, two new dendrograms are created that combine P1, P2, and P3 in one dendrogram, and P4, P5, and P6 in another.
o At last, the final dendrogram is created that combines all the data points together.

We can cut the dendrogram tree structure at any level as per our requirement.
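As an illustrative sketch (not part of the original article), agglomerative clustering and dendrogram cutting can be tried with SciPy; the toy coordinates below are made up purely for demonstration:

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
import matplotlib.pyplot as plt

# Hypothetical 2-D data points chosen only for illustration
X = np.array([[1, 2], [1.5, 1.8], [5, 8], [8, 8], [1, 0.6], [9, 11]])

# Agglomerative clustering; 'single', 'complete', 'average', 'centroid'
# and 'ward' correspond to the linkage methods described above
Z = linkage(X, method='ward')

# Plot the dendrogram (x-axis: data points, y-axis: merge distance)
dendrogram(Z)
plt.xlabel('Data points')
plt.ylabel('Distance')
plt.show()

# "Cut" the tree at a chosen distance to obtain flat cluster labels
labels = fcluster(Z, t=5, criterion='distance')
print(labels)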

https://www.javatpoint.com/hierarchical-clustering-in-machine-learning
If you receive huge amounts of unstructured data in the form of text (emails, social media
conversations, chats), you’re probably aware of the challenges that come with analyzing this
data.

Manually processing and organizing text data takes time, it’s tedious, inaccurate, and it can be
expensive if you need to hire extra staff to sort through text.


In this guide, learn more about what text analysis is, how to perform text analysis using AI tools,
and why it’s more important than ever to automatically analyze your text in real time.

1. Text Analysis Basics
2. Methods & Techniques
3. How Does Text Analysis Work?
4. How to Analyze Text Data
5. Use Cases and Applications
6. Tools and Resources
7. Tutorial

What Is Text Analysis?

Text analysis (TA) is a machine learning technique used to automatically extract valuable
insights from unstructured text data. Companies use text analysis tools to quickly digest online
data and documents, and transform them into actionable insights.

You can use text analysis to extract specific information, like keywords, names, or company information from thousands of emails, or categorize survey responses by sentiment and topic.

Text Analysis vs. Text Mining vs. Text Analytics
Firstly, let's dispel the myth that text mining and text analysis are two different processes. The
terms are often used interchangeably to explain the same process of obtaining data through
statistical pattern learning. To avoid any confusion here, let's stick to text analysis.

So, text analytics vs. text analysis: what's the difference?

Text analysis delivers qualitative results and text analytics delivers quantitative results. If a
machine performs text analysis, it identifies important information within the text itself, but if it
performs text analytics, it reveals patterns across thousands of texts, resulting in graphs, reports,
tables etc.

Let's say a customer support manager wants to know how many support tickets were solved by
individual team members. In this instance, they'd use text analytics to create a graph that
visualizes individual ticket resolution rates.

However, it's likely that the manager also wants to know what proportion of tickets resulted in a positive or negative outcome.

By analyzing the text within each ticket, and subsequent exchanges, customer support managers
can see how each agent handled tickets, and whether customers were happy with the outcome.

Basically, the challenge in text analysis is decoding the ambiguity of human language, while in
text analytics it's detecting patterns and trends from the numerical results.

Why Is Text Analysis Important?


When you put machines to work on organizing and analyzing your text data, the insights and
benefits are huge.

Let's take a look at some of the advantages of text analysis, below:

Text Analysis Is Scalable

Text analysis tools allow businesses to structure vast quantities of information, like emails, chats,
social media, support tickets, documents, and so on, in seconds rather than days, so you can
redirect extra resources to more important business tasks.

Analyze Text in Real-time

Businesses are inundated with information and customer comments can appear anywhere on the
web these days, but it can be difficult to keep an eye on it all. Text analysis is a game-changer
when it comes to detecting urgent matters, wherever they may appear, 24/7 and in real time. By
training text analysis models to detect expressions and sentiments that imply negativity or
urgency, businesses can automatically flag tweets, reviews, videos, tickets, and the like, and take
action sooner rather than later.

AI Text Analysis Delivers Consistent Criteria

Humans make errors. Fact. And the more tedious and time-consuming a task is, the more errors
they make. By training text analysis models to your needs and criteria, algorithms are able to
analyze, understand, and sort through data much more accurately than humans ever could.

Text data derived from natural language is unstructured and noisy. Text preprocessing
involves transforming text into a clean and consistent format that can then be fed into a
model for further analysis and learning.

Text preprocessing techniques may be general so that they are applicable to many
types of applications, or they can be specialized for a specific task. For example, the
methods for processing scientific documents with equations and other mathematical
symbols can be quite different from those for dealing with user comments on social
media.

However, some steps, such as sentence segmentation, tokenization, spelling correction, and stemming, are common to both.

Here's what you need to know about text preprocessing to improve your natural
language processing (NLP).

The NLP Preprocessing Pipeline


A natural language processing system for textual data reads, processes, analyzes, and
interprets text. As a first step, the system preprocesses the text into a more structured
format using several different stages. The output from one stage becomes an input for
the next—hence the name “preprocessing pipeline.”

An NLP pipeline for document classification might include steps such as sentence
segmentation, word tokenization, lowercasing, stemming or lemmatization, stop word
removal, and spelling correction. Some or all of these commonly used text
preprocessing stages are used in typical NLP systems, although the order can vary
depending on the application.

Segmentation

Segmentation involves breaking up text into corresponding sentences. While this may
seem like a trivial task, it has a few challenges. For example, in the English language, a
period normally indicates the end of a sentence, but many abbreviations, including
“Inc.,” “Calif.,” “Mr.,” and “Ms.,” and all fractional numbers contain periods and introduce
uncertainty unless the end-of-sentence rules accommodate those exceptions.

Tokenization

The tokenization stage involves converting a sentence into a stream of words, also
called “tokens.” Tokens are the basic building blocks upon which analysis and other
methods are built.

Many NLP toolkits allow users to input multiple criteria based on which word boundaries are determined. For example, you can use whitespace or punctuation to determine whether one word has ended and the next one has started. Again, in some instances these rules might fail: words such as don’t and it’s contain punctuation marks themselves and have to be dealt with separately.
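As a hedged sketch of the segmentation and tokenization stages using NLTK (the resource name 'punkt' may vary between NLTK versions):

import nltk
nltk.download('punkt', quiet=True)  # sentence/word tokenizer models
from nltk.tokenize import sent_tokenize, word_tokenize

text = "Mr. Smith moved to Calif. last year. He doesn't regret it."

# Sentence segmentation: the tokenizer handles abbreviations such as "Mr." and "Calif."
print(sent_tokenize(text))

# Word tokenization: contractions such as "doesn't" are split into separate tokens
print(word_tokenize("He doesn't regret it."))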

Change Case

Changing the case involves converting all text to lowercase or uppercase so that all
word strings follow a consistent format. Lowercasing is the more frequent choice in NLP
software.

Spell Correction

Many NLP applications include a step to correct the spelling of all words in the text.

Stop-Words Removal

"Stop words" are frequently occurring words used to construct sentences. In the English
language, stop words include is, the, are, of, in, and and. For some NLP applications,
such as document categorization, sentiment analysis, and spam filtering, these words
are redundant, and so are removed at the preprocessing stage.
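For example, a minimal stop-word removal step using NLTK's English stop-word list might look like the following sketch (not prescribed by the article):

import nltk
nltk.download('stopwords', quiet=True)
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

tokens = ['the', 'cat', 'is', 'in', 'the', 'hat']
filtered = [t for t in tokens if t not in stop_words]
print(filtered)  # frequent words such as 'the', 'is', 'in' are removed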

Stemming

The term word stem is borrowed from linguistics and used to refer to the base or root
form of a word. For example, learn is a base word for its variants such as learn, learns,
learning, and learned.

Stemming is the process of converting all words to their base form, or stem. Normally, a
lookup table is used to find the word and its corresponding stem. Many search engines
apply stemming for retrieving documents that match user queries. Stemming is also
used at the preprocessing stage for applications such as emotion identification and text
classification.

Lemmatization

Lemmatization is a more advanced form of stemming and involves converting all words
to their corresponding root form, called “lemma.” While stemming reduces all words to
their stem via a lookup table, it does not employ any knowledge of the parts of speech
or the context of the word. This means stemming can’t distinguish which meaning of the
word right is intended in the sentences “Please turn right at the next light” and “She is
always right.”

The stemmer would stem right to right in both sentences; the lemmatizer would treat
right differently based upon its usage in the two phrases.

A lemmatizer also converts different word forms or inflections to a standard form. For
example, it would convert less to little, wrote to write, slept to sleep, etc.

A lemmatizer works with more rules of the language and contextual information than
does a stemmer. It also relies on a dictionary to look up matching words. Because of
that, it requires more processing power and time than a stemmer to generate output.
For these reasons, some NLP applications only use a stemmer and not a lemmatizer.
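The difference between the two stages can be seen in a short NLTK sketch (assuming the WordNet data is available; the outputs in the comments are what the Porter stemmer and WordNet lemmatizer typically produce):

import nltk
nltk.download('wordnet', quiet=True)
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

words = ['learning', 'learned', 'wrote', 'slept']

# Stemming strips suffixes without using part-of-speech information
print([stemmer.stem(w) for w in words])                    # e.g. ['learn', 'learn', 'wrote', 'slept']

# Lemmatization maps inflected forms to dictionary lemmas (here treated as verbs)
print([lemmatizer.lemmatize(w, pos='v') for w in words])   # e.g. ['learn', 'learn', 'write', 'sleep']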

Text Normalization
Text normalization is the preprocessing stage that converts text to a canonical
representation. A common application is the processing of social media posts, where
input text is shortened or words are spelled in different ways. For example, hello might
be written as hellooo or something might appear as smth, and different people might
choose to write real time, real-time, or realtime. Text normalization cleans the text and
ideally replaces all words with their corresponding canonical representation. In the last
example, all three forms would be converted to realtime. Many text normalization stages
also replace emojis in text with a corresponding word. For example, :-) is replaced by
happy face.
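A minimal normalization step can be sketched with a plain dictionary lookup; the mappings below are illustrative only, and production systems use much larger resources:

# Hypothetical canonical mappings, mirroring the examples above
canonical = {
    'hellooo': 'hello',
    'smth': 'something',
    'real time': 'realtime',
    'real-time': 'realtime',
    ':-)': 'happy face',
}

def normalize(text):
    # Replace each known variant with its canonical form
    for variant, standard in canonical.items():
        text = text.replace(variant, standard)
    return text

print(normalize('hellooo, ping me in real time about smth :-)'))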

Parts of Speech Tagging

One of the more advanced text preprocessing techniques is parts of speech (POS) tagging. This step augments the input text with additional information about the sentence’s grammatical structure. Each word is therefore assigned to one of the predefined categories such as noun, verb, or adjective. This step is also sometimes referred to as grammatical tagging.
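A brief POS-tagging sketch with NLTK (the tagger resource name may differ across NLTK versions; the tags shown are typical Penn Treebank tags):

import nltk
nltk.download('punkt', quiet=True)
nltk.download('averaged_perceptron_tagger', quiet=True)
from nltk import pos_tag, word_tokenize

# Returns a list of (word, tag) pairs, e.g. ('She', 'PRP'), ('is', 'VBZ'), ('always', 'RB'), ...
print(pos_tag(word_tokenize('She is always right')))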

Is Text Preprocessing Really Necessary?

The simple answer is yes. Text preprocessing improves the performance of an NLP
system. For tasks such as sentiment analysis, document categorization, document
retrieval based upon user queries, and more, adding a text preprocessing layer provides
more accuracy.

Stages such as stemming, lemmatization, and text normalization make the vocabulary
size more manageable and transform the text into a more standard form across a
variety of documents acquired from different sources.

Once you have a clear idea of the type of application you are developing and the source
and nature of text data, you can decide on which preprocessing stages can be added to
your NLP pipeline. Most of the NLP toolkits on the market include options for all of the
preprocessing stages discussed above.

https://towardsdatascience.com/text-preprocessing-in-natural-language-processing-using-python-6113ff5decd8
A Simple Explanation of the Bag-of-Words
Model
A quick, easy introduction to the Bag-of-Words model and
how to implement it in Python.
NOVEMBER 30, 2019

The bag-of-words (BOW) model is a representation that turns arbitrary text into fixed-
length vectors by counting how many times each word appears. This process is often
referred to as vectorization.

Let’s understand this with an example. Suppose we wanted to vectorize the following:

 the cat sat
 the cat sat in the hat
 the cat with the hat

We’ll refer to each of these as a text document.

Step 1: Determine the Vocabulary

We first define our vocabulary, which is the set of all words found in our document set. The only words that are found in the 3 documents above are: the, cat, sat, in, hat, and with.

Step 2: Count

To vectorize our documents, all we have to do is count how many times each word
appears:

Document the cat sat in hat with

the cat sat 1 1 1 0 0 0

the cat sat in the hat 2 1 1 1 1 0

the cat with the hat 2 1 0 0 1 1

Now we have length-6 vectors for each document!

 the cat sat: [1, 1, 1, 0, 0, 0]
 the cat sat in the hat: [2, 1, 1, 1, 1, 0]
 the cat with the hat: [2, 1, 0, 0, 1, 1]

Notice that we lose contextual information, e.g. where in the document the word
appeared, when we use BOW. It’s like a literal bag-of-words: it only tells you what words
occur in the document, not where they occurred.

Implementing BOW in Python


Now that you know what BOW is, I’m guessing you’ll probably need to implement it.
Here’s my preferred way of doing it, which uses Keras’s Tokenizer class:

from keras.preprocessing.text import Tokenizer

docs = [
'the cat sat',
'the cat sat in the hat',
'the cat with the hat',
]

## Step 1: Determine the Vocabulary


tokenizer = Tokenizer()
tokenizer.fit_on_texts(docs)
print(f'Vocabulary: {list(tokenizer.word_index.keys())}')

## Step 2: Count
vectors = tokenizer.texts_to_matrix(docs, mode='count')
print(vectors)

Running that code gives us:

Vocabulary: ['the', 'cat', 'sat', 'hat', 'in', 'with']


[[0. 1. 1. 1. 0. 0. 0.]
[0. 2. 1. 1. 1. 1. 0.]
[0. 2. 1. 0. 1. 0. 1.]]

Notice that the vectors here have length 7 instead of 6 because of the extra 0 element at
the beginning. This is an inconsequential detail - Keras reserves index 0 and never
assigns it to any word.

How is BOW useful?


Despite being a relatively basic model, BOW is often used for Natural Language
Processing (NLP) tasks like Text Classification. Its strengths lie in its simplicity: it’s
inexpensive to compute, and sometimes simpler is better when positioning or
contextual info aren’t relevant.

I’ve written a blog post that uses BOW for profanity detection - check it out if you’re
curious to see BOW in action!

The term "bag of words" refers to a popular and simple technique used in natural language processing
(NLP) and information retrieval tasks. It represents a text document as an unordered collection or "bag"
of its individual words, disregarding grammar and word order. This technique focuses on the presence or
absence of words in a document rather than their sequence.

The bag of words model involves several steps:


1. Tokenization: The document is divided into individual words or tokens. Punctuation and other non-
word characters are often removed, and the text is split based on whitespace or other delimiters.

2. Vocabulary creation: A vocabulary or dictionary is created by listing all unique words present in the
document corpus. Each word is assigned a unique index or identifier.

3. Encoding: Each document is represented as a numerical vector, where the length of the vector is equal
to the size of the vocabulary. The value at each position in the vector indicates the frequency, presence,
or other statistics associated with the corresponding word in the vocabulary.

4. Vectorization: The textual data is converted into numerical feature vectors, typically using methods
such as one-hot encoding or term frequency-inverse document frequency (TF-IDF) representation.

The bag of words approach has some limitations. It discards valuable information about word order,
grammar, and semantics, as it treats each word independently. It also ignores the context and meaning
of the words. Nevertheless, it has been widely used for various text-based tasks, such as document
classification, sentiment analysis, and information retrieval, especially when the focus is on keyword-
based analysis rather than understanding the overall structure of the text.
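For comparison with the Keras example earlier, the same bag-of-words representation can also be built with scikit-learn's CountVectorizer (a sketch assuming a reasonably recent scikit-learn version):

from sklearn.feature_extraction.text import CountVectorizer

docs = [
    'the cat sat',
    'the cat sat in the hat',
    'the cat with the hat',
]

# Tokenization, vocabulary creation, and counting happen in one step
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)          # sparse document-term matrix

print(vectorizer.get_feature_names_out())   # the vocabulary, in alphabetical order
print(X.toarray())                          # one count vector per document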

https://www.engati.com/glossary/bag-of-words
TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical measure used to evaluate the
importance of a term (word) within a document in the context of a collection of documents or corpus. It
is commonly employed in natural language processing (NLP) and text mining tasks, including topic
modeling.

TF-IDF combines two components:

1. Term Frequency (TF): This measures the frequency of a term within a document. It indicates how often
a term appears in a document relative to the total number of terms in that document. A higher TF value
signifies that a term is more relevant to the document.

2. Inverse Document Frequency (IDF): This quantifies the rarity of a term across the entire corpus. It
measures the logarithmically scaled inverse fraction of documents that contain the term. Terms that
appear in fewer documents are given a higher IDF value, indicating their significance and distinctiveness.

The TF-IDF score for a term in a document is calculated by multiplying its TF value with its IDF value. The
formula for TF-IDF is as follows:
TF-IDF = (Term Frequency in Document) * (Inverse Document Frequency)

In the context of topic modeling, TF-IDF is often employed as a preprocessing step to identify and extract
important features from a collection of documents. These features, represented as TF-IDF vectors,
capture the relative importance of terms within each document and across the entire corpus. Topic
modeling algorithms like Latent Dirichlet Allocation (LDA) can then be applied to these TF-IDF vectors to
discover latent topics present in the corpus.

By applying TF-IDF, words that are frequent within a document but rare in the overall corpus receive
higher weights, making them more influential in determining the document's topic. This helps to identify
and prioritize key terms or features associated with specific topics. In essence, TF-IDF serves as a
weighting scheme that highlights the salient terms that contribute significantly to the characterization of
topics within a corpus.

By leveraging TF-IDF in topic modeling, researchers and analysts can effectively extract and explore the
underlying themes and topics present in a collection of documents, facilitating tasks such as document
clustering, categorization, recommendation systems, and information retrieval.
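As a short sketch of the idea with scikit-learn (reusing the toy documents from the bag-of-words section above):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    'the cat sat',
    'the cat sat in the hat',
    'the cat with the hat',
]

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)   # rows are documents, columns are vocabulary terms

# Terms appearing in every document (e.g. 'the', 'cat') receive low IDF and hence
# low weights, while rarer terms such as 'in' or 'with' are weighted more heavily
# within the documents that contain them.
print(tfidf.get_feature_names_out())
print(X.toarray().round(2))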
Social network analysis (SNA) is a field of study that examines the relationships and interactions between
individuals, organizations, or other entities. It focuses on understanding the structure, dynamics, and
patterns of social networks. SNA provides a framework and set of methods to analyze and visualize the
relationships within a network, uncovering valuable insights about social systems.

At its core, social network analysis recognizes that social interactions occur within a larger network of
connections. These connections can be represented graphically, where each node represents an entity
(e.g., a person, organization, or website), and the edges represent the relationships or interactions
between them (e.g., friendships, collaborations, or information flows). By studying these network
structures and properties, social network analysts aim to understand how information, influence,
resources, and behaviors flow through social systems.

Social network analysis has gained significant attention and application in various fields, including
sociology, anthropology, psychology, organizational behavior, communication studies, and computer
science. It offers a powerful lens to explore and analyze social phenomena, such as the spread of ideas,
the formation of social groups, the diffusion of innovations, the dynamics of online communities, and
the influence of individuals within a network.

The methods and tools used in social network analysis include:

1. Network visualization: Graphs and visual representations help in understanding and interpreting the
structure and patterns of a social network.

2. Centrality measures: These measures identify the most important or influential nodes within a
network, based on their connections and positions.

3. Community detection: Algorithms and techniques to identify clusters or communities within a network, revealing groups of nodes that are densely connected.

4. Network metrics: Quantitative measures that capture properties of a network, such as density,
clustering coefficient, path length, and assortativity.

5. Diffusion and contagion modeling: Analyzing the spread of information, behaviors, or diseases through
a network, examining how they propagate and influence individuals.

6. Social network mining: Extracting patterns and insights from large-scale social network data, often
using machine learning and data mining techniques.

Social network analysis provides valuable insights into various real-world applications, including social
media analysis, organizational dynamics, epidemiology, marketing, and recommendation systems. By
uncovering hidden relationships, influential individuals, and community structures, SNA contributes to a
deeper understanding of social systems and facilitates decision-making processes in various domains.
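Several of these methods are available in the NetworkX library; the tiny friendship graph below is purely hypothetical and serves only to illustrate centrality, community detection, and a few network metrics:

import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# A small hypothetical friendship network
G = nx.Graph()
G.add_edges_from([('Ann', 'Bob'), ('Ann', 'Cat'), ('Bob', 'Cat'),
                  ('Cat', 'Dan'), ('Dan', 'Eve'), ('Dan', 'Fay')])

# Centrality measures highlight influential nodes
print(nx.degree_centrality(G))
print(nx.betweenness_centrality(G))

# Community detection groups densely connected nodes
print(list(greedy_modularity_communities(G)))

# Network metrics: density and clustering coefficient
print(nx.density(G), nx.average_clustering(G))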

Business analysis is a discipline that focuses on identifying, analyzing, and solving business problems and
improving organizational processes. It involves understanding the needs and objectives of a business and
using analytical techniques to drive informed decision-making and achieve desired outcomes.

The role of a business analyst is to bridge the gap between business stakeholders and technology teams,
ensuring that solutions align with business goals and requirements. Business analysts work across
various industries and sectors, including finance, healthcare, retail, and information technology, among
others.
The key activities and components of business analysis include:

1. Understanding Business Needs: Business analysts work closely with stakeholders to identify and
articulate business needs, goals, and challenges. They gather requirements by conducting interviews,
workshops, and data analysis to gain a comprehensive understanding of the organization's current state
and desired future state.

2. Requirements Elicitation and Documentation: Business analysts gather requirements by engaging with
stakeholders to identify and document their needs. This involves creating business requirements
documents (BRDs), use cases, user stories, and other artifacts that capture the functional and non-
functional requirements of a project or initiative.

3. Analysis and Problem Solving: Business analysts analyze the gathered requirements and perform gap
analysis to identify areas of improvement and potential solutions. They use various techniques such as
process modeling, data analysis, and feasibility studies to evaluate different options and recommend the
most suitable course of action.

4. Solution Design and Evaluation: Business analysts collaborate with stakeholders and subject matter
experts to design solutions that address the identified business needs. This includes creating functional
specifications, wireframes, and prototypes to communicate the proposed solution. They also participate
in solution evaluation and validation to ensure that it meets the intended objectives.

5. Facilitating Communication and Collaboration: Business analysts act as facilitators and mediators
between business stakeholders and technology teams. They bridge the communication gap, ensuring
that requirements are understood by all parties involved. They facilitate meetings, workshops, and
discussions to foster collaboration and resolve conflicts.

6. Change Management and Implementation Support: Business analysts play a crucial role in managing
organizational change and ensuring the successful implementation of solutions. They create change
management plans, conduct impact assessments, and provide support during the implementation
phase. They also assist in user training and documentation to ensure smooth adoption of new processes
or technologies.

Overall, business analysis enables organizations to make informed decisions, streamline processes, and
achieve their strategic objectives. It requires a blend of analytical skills, communication abilities, domain
knowledge, and a deep understanding of business operations. By applying business analysis techniques
and methodologies, organizations can enhance efficiency, drive innovation, and gain a competitive
advantage in the marketplace.

Classification is a type of supervised machine learning problem where the goal is to predict, for one or more observations, the category or class they belong to.

An important element of any machine learning workflow is the evaluation of the performance of the model. This is the process where we use the trained model to make predictions on previously unseen, labelled data. In the case of classification, we then evaluate how many of these predictions the model got right.

In real-world classification problems, it is usually impossible for a model to be 100% correct. When evaluating a model it is, therefore, useful to know not only how wrong the model was, but in which way the model was wrong.

“All models are wrong, but some are useful” (George Box)

For example, if we are trying to predict whether a tumour is benign or cancerous, we might be happier to trade off the model incorrectly predicting that a tumour is cancerous in a small number of cases, rather than have the serious consequences of missing a cancer diagnosis.

On the flip side, if we were a retailer deciding which transactions were fraudulent, we might be happier for a small number of fraudulent transactions to be missed, rather than risk turning away good customers.

In both of these cases, we would optimise a model to perform better for certain outcomes, and therefore we may use different metrics to select the final model. As a consequence of these trade-offs, when selecting a classifier there are a variety of metrics you should use to optimise a model for your specific use case.

In the following article, I am going to give a simple description of eight different performance metrics and techniques you can use to evaluate a classifier.

1. Accuracy
The overall accuracy of a model is simply the number of correct predictions divided by the total number of predictions. An accuracy score gives a value between 0 and 1; a value of 1 indicates a perfect model.

Accuracy. Image by Author


This metric should rarely be used in isolation, as on imbalanced data,
where one class is much larger than another, the accuracy can be
highly misleading.

If we go back to the cancer example, imagine we have a dataset where only 1% of the samples are cancerous. A classifier that simply predicts all outcomes as benign would achieve an accuracy score of 99%. However, this model would, in fact, be useless and dangerous, as it would never detect a cancerous observation.
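A quick sketch with made-up labels shows how misleading accuracy can be on such imbalanced data:

import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# 1,000 hypothetical samples, only 1% of which are cancerous (label 1)
y_true = np.array([1] * 10 + [0] * 990)
y_pred = np.zeros(1000, dtype=int)        # a "model" that always predicts benign

print(accuracy_score(y_true, y_pred))     # 0.99 -- looks excellent
print(recall_score(y_true, y_pred))       # 0.0  -- never detects a cancerous case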

2. Confusion Matrix
A confusion matrix is an extremely useful tool to observe in which
way the model is wrong (or right!). It is a matrix that compares the
number of predictions for each class that are correct and those that are
incorrect.

In a confusion matrix, there are 4 numbers to pay attention to.

True positive: the number of positive observations the model correctly predicted as positive.

False positive: the number of negative observations the model incorrectly predicted as positive.

True negative: the number of negative observations the model correctly predicted as negative.

False negative: the number of positive observations the model incorrectly predicted as negative.

The image below shows a confusion matrix for a classifier. Using this
we can understand the following:

 The model correctly predicted 3,383 negative samples but incorrectly predicted 46 as positive.
 The model correctly predicted 962 positive observations but incorrectly predicted 89 as negative.
 We can see from this confusion matrix that the data sample is imbalanced, with the negative class having a higher volume of observations.

Confusion matrix example (plotted using Pycaret). Image by Author


3. AUC/ROC
A classifier such as logistic regression will return the probability of an
observation belonging to a particular class as the prediction output.
For the model to be useful this is usually converted to a binary value
e.g. either the sample belongs to the class or it doesn’t. To do this a
classification threshold is used, for example, we might say that if the
probability is above 0.5 then the sample belongs to class 1.

The ROC (Receiver Operating Characteristic) curve is a plot of the performance of the model (the true positive rate against the false positive rate) at all classification thresholds. The AUC is the measurement of the entire two-dimensional area under the curve and as such is a measure of the performance of the model at all possible classification thresholds.

ROC curves plot the accuracy of the model and therefore are best
suited to diagnose the performance of models where the data is not
imbalanced.
ROC curve example (plotted using Pycaret). Image by Author

4. Precision
Precision measures how good the model is at correctly identifying the positive class. In other words, out of all predictions for the positive class, how many were actually correct? Using this metric alone to optimise a model, we would be minimising the false positives. This might be desirable for our fraud detection example, but would be less useful for diagnosing cancer, as we would have little understanding of positive observations that are missed.

Precision. Image by author


5. Recall
Recall tells us how good the model is at correctly predicting all the positive observations in the dataset. However, it does not include information about the false positives, so it would be more useful in the cancer example.

Usually, precision and recall are observed together by constructing a precision-recall curve. This can help to visualise the trade-offs between the two metrics at different thresholds.

6. F1 score
The F1 score is the harmonic mean of precision and recall. The F1
score will give a number between 0 and 1. If the F1 score is 1.0 this
indicates perfect precision and recall. If the F1 score is 0 this means
that either the precision or the recall is 0.

In machine learning, classification is the process of categorizing a given set of data into different categories. To measure the performance of a classification model, we use the confusion matrix.
Confusion Matrix
A confusion matrix is a matrix that summarizes the
performance of a machine learning model on a set of test
data. It is often used to measure the performance of
classification models, which aim to predict a categorical
label for each input instance. The matrix displays the
number of true positives (TP), true negatives (TN), false
positives (FP), and false negatives (FN) produced by the
model on the test data.
For binary classification, the matrix is a 2x2 table; for multi-class classification, the matrix shape equals the number of classes, i.e., for n classes it will be n x n.
A 2x2 confusion matrix is shown below for an image recognition task with Dog and Not Dog classes.

                        Actual: Dog          Actual: Not Dog
Predicted: Dog          True Positive (TP)   False Positive (FP)
Predicted: Not Dog      False Negative (FN)  True Negative (TN)

 True Positive (TP): the total count of cases where both the predicted and the actual value are Dog.
 True Negative (TN): the total count of cases where both the predicted and the actual value are Not Dog.
 False Positive (FP): the total count of cases where the prediction is Dog while the actual value is Not Dog.
 False Negative (FN): the total count of cases where the prediction is Not Dog while the actual value is Dog.

Example

Index      1    2        3    4        5    6        7    8    9        10
Actual     Dog  Dog      Dog  Not Dog  Dog  Not Dog  Dog  Dog  Not Dog  Not Dog
Predicted  Dog  Not Dog  Dog  Not Dog  Dog  Dog      Dog  Dog  Not Dog  Not Dog
Result     TP   FN       TP   TN       TP   FP       TP   TP   TN       TN

 Actual Dog Counts = 6
 Actual Not Dog Counts = 4
 True Positive Counts = 5
 False Positive Counts = 1
 True Negative Counts = 3
 False Negative Counts = 1
                        Actual: Dog              Actual: Not Dog
Predicted: Dog          True Positive (TP = 5)   False Positive (FP = 1)
Predicted: Not Dog      False Negative (FN = 1)  True Negative (TN = 3)

Confusion Matrix

Implementations of Confusion Matrix in Python

Steps:
 Import the necessary libraries: NumPy, confusion_matrix from sklearn.metrics, seaborn, and matplotlib.
 Create the NumPy arrays for the actual and predicted labels.
 Compute the confusion matrix.
 Plot the confusion matrix with the help of a seaborn heatmap.
# Import the necessary libraries
import numpy as np
from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

# Create the NumPy arrays for the actual and predicted labels
actual = np.array(
    ['Dog', 'Dog', 'Dog', 'Not Dog', 'Dog', 'Not Dog', 'Dog', 'Dog', 'Not Dog', 'Not Dog'])
predicted = np.array(
    ['Dog', 'Not Dog', 'Dog', 'Not Dog', 'Dog', 'Dog', 'Dog', 'Dog', 'Not Dog', 'Not Dog'])

# Compute the confusion matrix; with scikit-learn, rows correspond to the
# actual labels and columns to the predicted labels
cm = confusion_matrix(actual, predicted)

# Plot the confusion matrix as a seaborn heatmap
sns.heatmap(cm,
            annot=True,
            fmt='g',
            xticklabels=['Dog', 'Not Dog'],
            yticklabels=['Dog', 'Not Dog'])
plt.ylabel('Actual', fontsize=13)
plt.xlabel('Prediction', fontsize=13)
plt.title('Confusion Matrix', fontsize=17)
plt.show()

Output:

Confusion Matrix

From the confusion matrix, we can find the following metrics


Accuracy: Accuracy is used to measure the performance of
the model. It is the ratio of Total correct instances to
the total instances.

For the above case:


Accuracy = (5+3)/(5+3+1+1) = 8/10 = 0.8
Precision: Precision is a measure of how accurate a model’s
positive predictions are. It is defined as the ratio of
true positive predictions to the total number of positive
predictions made by the model
For the above case:
Precision = 5/(5+1) =5/6 = 0.8333
Recall: Recall measures the effectiveness of a
classification model in identifying all relevant instances
from a dataset. It is the ratio of the number of true
positive (TP) instances to the sum of true positive and
false negative (FN) instances.

For the above case:


Recall = 5/(5+1) =5/6 = 0.8333
F1-Score: The F1-score is used to evaluate the overall performance of a classification model. It is the harmonic mean of precision and recall.

For the above case:

F1-Score = (2 * 0.8333 * 0.8333) / (0.8333 + 0.8333) = 0.8333
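The same figures can be reproduced with scikit-learn by encoding Dog as 1 and Not Dog as 0 (a verification sketch, not part of the original tutorial):

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# The Dog / Not Dog labels from the worked example above, encoded as 1 / 0
actual    = [1, 1, 1, 0, 1, 0, 1, 1, 0, 0]
predicted = [1, 0, 1, 0, 1, 1, 1, 1, 0, 0]

print(accuracy_score(actual, predicted))    # 0.8
print(precision_score(actual, predicted))   # 0.833...
print(recall_score(actual, predicted))      # 0.833...
print(f1_score(actual, predicted))          # 0.833...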
Holdout Method is the simplest sort of method to evaluate a
classifier. In this method, the data set (a collection of
data items or examples) is separated into two sets, called
the Training set and Test set.
A classifier performs the function of assigning data items in a given collection to a target category or class.
Example –
E-mails in our inbox being classified into spam and non-spam.
A classifier should be evaluated to find out its accuracy, error rate, and error estimates. This can be done using various methods. One of the most basic methods for evaluating a classifier is the holdout method.
In the holdout method, the data set is partitioned such that most of the data belongs to the training set and the remaining data belongs to the test set.
Example –
If there are 20 data items present, 12 are placed in
training set and remaining 8 are placed in test set.
 After partitioning the data set into two sets, the training set is used to build a model/classifier.
 After construction of the classifier, we use the data items in the test set to test the accuracy, error rate, and error estimate of the model/classifier.
However, it is vital to remember two points with regard to the holdout method. These are:
If as many data items as possible are placed in the training set for construction of the model/classifier, the classifier's error rates and estimates will be very low and its accuracy will be high. This is the sign of a good classifier/model.
Example –
A student ‘gfg’ is coached by a teacher. The teacher teaches her all possible topics which might appear in the exam. Hence, she tends to make very few mistakes in the exam, thus performing well.
In other words, if more training data are used to construct a classifier, it is better prepared for the data from the test set used to test it.
If more data items are present in the test set and are used to test the classifier built using the training set, we can obtain a more accurate evaluation of the classifier with respect to its accuracy, error rate, and error estimate.
Example –
A student ‘gfg’ is coached by a teacher. The teacher teaches her some topics which might appear in the exam. If the student ‘gfg’ is given a number of exams on the basis of this coaching, an accurate picture of the student's weak and strong points can be found.
In the same way, if more test data are used to evaluate the constructed classifier, its error rate, error estimate, and accuracy can be determined more accurately.
Problem:
During partitioning of the whole data set into two parts, i.e., training set and test set, suppose all data items belonging to class GFG1 are placed entirely in the test set, such that none of the data items of class GFG1 are in the training set. It is evident that the model/classifier built is not trained using data items of class GFG1.
Solution:
Stratification is a technique in which data items belonging to class GFG1 are divided and placed into the two data sets, i.e., training set and test set, in equal proportion, so that the model/classifier is trained on data items belonging to class GFG1.
Example –
All four data items belonging to class GFG1 here are divided equally and placed, two data items each, into the two data sets – training set and test set, as shown in the sketch below.
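Stratified splitting is what scikit-learn's train_test_split does when given a stratify argument; the sketch below mirrors the GFG1 example with made-up data:

from sklearn.model_selection import train_test_split

# 20 data items, 4 of which belong to class 'GFG1'
X = list(range(20))
y = ['GFG1'] * 4 + ['GFG2'] * 16

# stratify=y keeps the class proportions equal in both splits; with a 50-50
# split, the training set and the test set each receive two 'GFG1' items
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0)

print(y_train.count('GFG1'), y_test.count('GFG1'))   # 2 2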
Hold-out Method for Training Machine Learning Models
May 22, 2023 by Ajitesh Kumar

The hold-out method for training machine learning models is a technique that involves splitting the data into different sets: one set for training, and other sets for validation and testing. The hold-out method is used to check how well a machine learning model will perform on new data. In this post, you will learn about the hold-out method used during the process of training a machine learning model.
When evaluating machine learning (ML) models, the questions that arise are whether the model is the best model available from the model's hypothesis space in terms of generalization error on the unseen / future data set, whether the model is trained and tested using the most appropriate method, and, out of the available models, which model to select. These questions are taken care of using what is called the hold-out method.
Instead of using the entire dataset for training, different sets called the validation set and the test set are separated or set aside (hence the "hold-out" name) from the entire dataset, and the model is trained only on what is termed the training dataset.
Table of Contents

 What is the Hold-out method for training ML models?
o Hold-out method for Model Evaluation
 Python Code for Training / Test Data Split
o Hold-out method for Model Selection
 Python Code for Training / Validation / Test Data Split
 Different types of Hold-out methods

What is the Hold-out method for training ML models?
The hold-out method for training a machine learning model is the process of splitting the
data into different splits and using one split for training the model and other splits for
validating and testing the models. The hold-out method is used for both model
evaluation and model selection. The following represents the data splits used in hold
out method.

When the entire data is used for training the model using different algorithms, the problem
of evaluating the models and selecting the most optimal model remains. The primary task is
to find out which model out of all models has the lowest generalization error. In other
words, which model makes a better prediction on future or unseen datasets than all other
models. This is where the need to have some mechanism arises wherein the model is trained
on one data set, and, validated and tested on another dataset. This is where the hold-out
method comes into the picture.
Hold-out method for Model Evaluation
The hold-out method for model evaluation represents the mechanism of splitting the
dataset into training and test datasets. The model is trained on the training set and then
tested on the testing set to get the most optimal model. This approach is often used when
the data set is small and there is not enough data to split into three sets (training, validation,
and testing). This approach has the advantage of being simple to implement, but it can be
sensitive to how the data is divided into two sets. If the split is not random, then the results
may be biased. Overall, the hold out method for model evaluation is a good starting point
for training machine learning models, but it should be used with caution. The following
represents the hold-out method for model evaluation.

Fig 1. Hold-out method for model evaluation


In the above diagram, you may note that the data set is split into two parts. One split is set aside or held out for training the model. Another set is set aside or held out for testing or evaluating the model. The split percentage is decided based on the volume of the data available for training purposes. Generally, a 70-30% split is used, where 70% of the dataset is used for training and 30% is used for testing the model.

This technique is well suited if the goal is to compare the models based on the model
accuracy on the test dataset and select the best model. However, there is always a
possibility that trying to use this technique can result in the model fitting well
to the test dataset. In other words, the models are trained to improve model accuracy on
the test dataset assuming that the test dataset represents the population. The test error,
thus, becomes an optimistically biased estimation of generalization error.
However, that is not desired. The final model fails to generalize well to the unseen or future
dataset as it is trained to fit well (or overfit) concerning the test data.
The following is the process of using the hold-out method for model evaluation:

 Split the dataset into two parts (preferably based on a 70-30% split; However,
the percentage split will vary)
 Train the model on the training dataset; While training the model, some fixed
set of hyperparameters is selected.
 Test or evaluate the model on the held-out test dataset
 Train the final model on the entire dataset to get a model which can generalize
better on the unseen or future dataset.
Note that this process is used for model evaluation based on splitting the dataset into
training and test datasets and using a fixed set of hyperparameters. There is another
technique of splitting the data into three sets and using these three sets for model selection
or hyperparameters tuning. We will look at that technique in the next section.

Python Code for Training / Test Data Split


The following Python code showcases how to use the hold-out method for evaluating the
performance of a machine learning model by splitting the data into training and test
datasets. In this example, the well-known Iris dataset is employed. The code begins by
loading the Iris dataset using the load_iris() function from
the sklearn.datasets module. Subsequently, the data is divided into training and test sets
using train_test_split() from sklearn.model_selection, with a test dataset size of
30% and a fixed random state for reproducibility. A logistic regression model is then
initialized, trained on the training dataset, and utilized to make predictions on the test
dataset.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the data into training and test datasets
X_train, X_test, y_train, y_test = train_test_split(X, y,
    test_size=0.3, random_state=42)

# Initialize and train the model
model = LogisticRegression()
model.fit(X_train, y_train)

# Make predictions on the test dataset
y_pred = model.predict(X_test)

# Evaluate the model performance
accuracy = accuracy_score(y_test, y_pred)

print(f"Accuracy: {accuracy}")

Random subsampling, also known as random subsampling with replacement or bootstrapping, is a statistical technique used to generate multiple samples from a given dataset by randomly selecting observations (samples) with replacement. This method is commonly used in machine learning, statistical analysis, and resampling techniques.

Here's how random subsampling works:

1. Dataset: Consider a dataset with a total of N observations (samples).

2. Sample Generation: Random subsampling involves generating multiple samples by randomly selecting
observations from the original dataset. Each sample is created by randomly selecting observations with
replacement, which means that the same observation can be chosen more than once for a particular
sample.

3. Sample Size: The sample size for each subsample is typically the same as the original dataset, but it
can also be smaller or larger, depending on the requirements of the analysis. The sample size is denoted
by n, where n ≤ N.

4. Repetition: The process of generating subsamples is typically repeated multiple times to obtain a set
of subsamples. The number of repetitions is denoted by R.

5. Analysis: Each subsample can be used independently for analysis, such as training machine learning
models, estimating statistical parameters, assessing variability, or conducting hypothesis testing.

Random subsampling offers several advantages:


1. Variability Assessment: By generating multiple subsamples, random subsampling allows for the
assessment of variability and uncertainty associated with the estimates or results obtained from the
original dataset.

2. Sample Size Flexibility: Random subsampling provides flexibility in choosing the sample size for each
subsample. Researchers can control the sample size based on computational constraints or statistical
requirements.

3. Robustness: Random subsampling helps to create robust estimates by incorporating random variations
and reducing the impact of outliers or extreme observations.

4. Model Evaluation: In machine learning, random subsampling is often used for model evaluation.
Multiple subsamples can be used to train and validate models, allowing for an estimation of model
performance on unseen data.

However, it's important to note that random subsampling does not guarantee that each observation will
be selected in each subsample. Due to the random selection process, some observations may be
excluded from certain subsamples, while others may be duplicated. The number of unique observations
in each subsample is expected to be lower than the total number of observations in the original dataset.

Overall, random subsampling is a useful technique for generating multiple samples from a dataset,
enabling robust analysis, variability assessment, and model evaluation.
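A small sketch of the sampling step using scikit-learn's resample utility (the dataset and repetition count are arbitrary):

import numpy as np
from sklearn.utils import resample

data = np.arange(10)      # an original dataset with N = 10 observations

# Generate R = 3 bootstrap subsamples of size n = N, drawn with replacement
for r in range(3):
    sample = resample(data, replace=True, n_samples=len(data), random_state=r)
    print(sample, '-> unique observations:', len(np.unique(sample)))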
Understanding AUC - ROC Curve

Sarang Narkhede, published in Towards Data Science, Jun 26, 2018

Understanding AUC - ROC Curve [Image 1] (Image courtesy: My Photoshopped Collection)

In Machine Learning, performance measurement is an essential task. So when it comes to a classification problem, we can count on an AUC - ROC Curve. When we need to check or visualize the performance of a classification problem, we use the AUC (Area Under The Curve) ROC (Receiver Operating Characteristics) curve. It is one of the most important evaluation metrics for checking any classification model’s performance. It is also written as AUROC (Area Under the Receiver Operating Characteristics).
Note: For better understanding, I suggest you read my article
about Confusion Matrix.

This blog aims to answer the following questions:

1. What is the AUC - ROC Curve?

2. Defining terms used in AUC and ROC Curve.

3. How to speculate the performance of the model?

4. Relation between Sensitivity, Specificity, FPR, and Threshold.

5. How to use AUC - ROC curve for the multiclass model?

What is the AUC - ROC Curve?


AUC - ROC curve is a performance measurement for classification problems at various threshold settings. ROC is a probability curve and AUC represents the degree or measure of separability. It tells how much the model is capable of distinguishing between classes. The higher the AUC, the better the model is at predicting 0 classes as 0 and 1 classes as 1. By analogy, the higher the AUC, the better the model is at distinguishing between patients with the disease and patients without the disease.

The ROC curve is plotted with TPR against the FPR where TPR is on
the y-axis and FPR is on the x-axis.
AUC - ROC Curve [Image 2] (Image courtesy: My Photoshopped Collection)

Defining terms used in AUC and ROC Curve.

TPR (True Positive Rate) / Recall / Sensitivity
TPR = TP / (TP + FN)   [Image 3]

Specificity
Specificity = TN / (TN + FP)   [Image 4]

FPR
FPR = FP / (TN + FP) = 1 - Specificity   [Image 5]

How to speculate about the performance of the model?

An excellent model has an AUC near 1, which means it has a good measure of separability. A poor model has an AUC near 0, which means it has the worst measure of separability; in fact, it is reciprocating the result, predicting 0s as 1s and 1s as 0s. And when AUC is 0.5, it means the model has no class separation capacity whatsoever.

Let’s interpret the above statements.

As we know, ROC is a curve of probability. So let's plot the distributions of those probabilities:

Note: Red distribution curve is of the positive class (patients with disease) and the green distribution curve is of the negative class (patients with no disease).
[Image 6 and 7] (Image courtesy: My Photoshopped Collection)

This is an ideal situation. When the two curves don’t overlap at all, the model has an ideal measure of separability. It is perfectly able to distinguish between the positive class and the negative class.
[Image 8 and 9] (Image courtesy: My Photoshopped Collection)

When two distributions overlap, we introduce type 1 and type 2 errors.


Depending upon the threshold, we can minimize or maximize them.
When AUC is 0.7, it means there is a 70% chance that the model will be
able to distinguish between positive class and negative class.
[Image 10 and 11] (Image courtesy: My Photoshopped Collection)

This is the worst situation. When AUC is approximately 0.5, the model
has no discrimination capacity to distinguish between positive class
and negative class.
[Image 12 and 13] (Image courtesy: My Photoshopped Collection)

When AUC is approximately 0, the model is actually reciprocating the


classes. It means the model is predicting a negative class as a positive
class and vice versa.

The relation between Sensitivity, Specificity, FPR, and Threshold.
Sensitivity and Specificity are inversely proportional to each other. So when we increase Sensitivity, Specificity decreases, and vice versa.

Sensitivity⬆️, Specificity⬇️ and Sensitivity⬇️,


Specificity⬆️

When we decrease the threshold, we get more positive values, thus increasing the sensitivity and decreasing the specificity.

Similarly, when we increase the threshold, we get more negative values, thus we get higher specificity and lower sensitivity.

As we know, FPR is 1 - specificity. So when we increase TPR, FPR also increases, and vice versa.

TPR⬆️, FPR⬆️ and TPR⬇️, FPR⬇️


How to use the AUC ROC curve for the multi-class model?
In a multi-class model, we can plot N AUC ROC curves for N classes using the One vs. All methodology. So, for example, if you have three classes named X, Y, and Z, you will have one ROC for X classified against Y and Z, another ROC for Y classified against X and Z, and a third for Z classified against X and Y.
AUC-ROC, or Area Under the Receiver Operating Characteristic Curve, is a widely used evaluation metric
in machine learning and binary classification tasks. It provides a measure of the performance and
predictive power of a classification model.

The ROC curve is created by plotting the True Positive Rate (TPR) against the False Positive Rate (FPR) at
various classification thresholds. The TPR is also known as sensitivity or recall, representing the
proportion of positive instances correctly classified as positive. The FPR is the ratio of negative instances
incorrectly classified as positive. Each point on the ROC curve corresponds to a different threshold
setting for classifying positive and negative instances.

The AUC-ROC is the area under this ROC curve, ranging from 0 to 1. It provides a measure of the
classifier's ability to distinguish between positive and negative instances across all possible threshold
settings. The higher the AUC-ROC value, the better the classifier's performance.

An AUC-ROC of 0.5 indicates that the classifier performs no better than random guessing, while an AUC-
ROC of 1 indicates a perfect classifier with a clear separation between positive and negative instances.
AUC-ROC values between 0.5 and 1 represent varying degrees of classification performance.

The AUC-ROC metric is particularly useful when dealing with imbalanced datasets or when the cost of
false positives and false negatives is not balanced. It is commonly used in medical diagnostics, credit
scoring, fraud detection, and many other applications where binary classification is essential.

In summary, AUC-ROC provides a concise summary of a binary classifier's performance across various
classification thresholds, enabling effective comparison and evaluation of different models.
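A compact sketch of computing and plotting an ROC curve with scikit-learn (the synthetic, imbalanced dataset is generated only for illustration):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt

# Synthetic, imbalanced binary classification data
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
probs = model.predict_proba(X_test)[:, 1]        # probability of the positive class

fpr, tpr, thresholds = roc_curve(y_test, probs)  # TPR and FPR at every threshold
print('AUC:', roc_auc_score(y_test, probs))

plt.plot(fpr, tpr, label='model')
plt.plot([0, 1], [0, 1], linestyle='--', label='random guess (AUC = 0.5)')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()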
