CSC413 Lecture Note

The document outlines the machine learning workflow, including steps from data collection to model prediction, and discusses various types of machine learning such as supervised, unsupervised, and reinforcement learning. It elaborates on model evaluation metrics, the importance of addressing model bias, and techniques for regularization to prevent overfitting. Additionally, it covers natural language processing concepts like tokenization, normalization, stemming, and lemmatization, emphasizing their roles in text analysis.

Machine Learning

Machine learning workflow:


Step 1: Data Collection
Step 2: Data Cleaning
Step 3: Model Development
Step 4: Model Training
Step 5: Model Evaluation
Step 6: Hyperparameter Tuning
Step 7: Model Prediction
Step 8: Feedback Incorporation (optional)

Data collection involves gathering the raw data needed for the problem.


Data cleaning involves handling missing values, correcting data formats, and standardizing and
normalizing the data.
Model development is the process of selecting and building a suitable model.
Model training is the process of fitting the model's parameters to the training data.
Model evaluation is the process of measuring a model's performance so that it can be improved.
Hyperparameter tuning is the process of choosing the best hyperparameters for a model.
Model prediction is the process of making predictions with a trained model.
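A minimal end-to-end sketch of this workflow in Python (assuming scikit-learn is available; the dataset and model choices are illustrative only, not prescribed by these notes):

# Minimal, illustrative machine-learning workflow sketch (assumes scikit-learn is installed).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Steps 1-2: data collection and cleaning (here a built-in, already-clean dataset)
X, y = load_iris(return_X_y=True)

# Steps 3-4: model development and training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Step 5: model evaluation
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))

# Step 6: hyperparameter tuning (search over the regularization strength C)
search = GridSearchCV(LogisticRegression(max_iter=1000), {"C": [0.1, 1.0, 10.0]}, cv=5)
search.fit(X_train, y_train)
print("best C:", search.best_params_)

# Step 7: model prediction on new (here: held-out) data
print("predictions:", search.predict(X_test[:5]))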

Machine learning is divided into:


 Supervised learning
 Unsupervised learning
 Reinforcement learning
Supervised learning is a type of learning in which the data is labelled. Supervised learning is
divided into:
 Classification
 Regression
Classification
Classification deals with categorical targets, e.g. email spam detection.
Examples of classification models are:
 Logistic regression
 Naïve Bayes
 K nearest neighbour
 Support Vector Machine
Regression
Regression deals with continuous targets, e.g. stock price prediction, rainfall prediction, etc.
Examples of regression models are:
 Linear regression
 Ridge regression

Regression Models

 Linear Regression: Used to predict continuous numerical values.


o Example: Predicting house prices based on features like square footage,
number of bedrooms, and location.
 Ridge Regression: A regularization technique that adds a penalty term to the loss
function to prevent overfitting.
o Example: Predicting customer churn.
 Lasso Regression: Another regularization technique that can be used for feature
selection.
o Example: Predicting stock prices.

Classification Models

 Logistic Regression: Used to predict binary categorical values (e.g., 0 or 1).


o Example: Classifying emails as spam or not spam.
 Decision Trees: Creates a tree-like structure to make decisions based on features.
o Example: Predicting whether a customer will purchase a product based on
their demographics and browsing history.
 Random Forests: An ensemble of decision trees, often used for improved accuracy
and robustness.
o Example: Classifying images as cats or dogs.
 Support Vector Machines (SVMs): Finds the optimal hyperplane to separate data
points of different classes.
o Example: Identifying fraudulent credit card transactions.
 Naive Bayes: Assumes features are independent given the class label.
o Example: Spam filtering.
 K-Nearest Neighbors (KNN): Predicts the class or value of a new data point based
on the majority class or average value of its k nearest neighbors.

Example: Recommender systems.


Unsupervised Learning
Unsupervised learning is learning from data that is not labelled. Unsupervised learning is
categorized into:
 Clustering, e.g. identifying fake news, document analysis.
Examples of clustering models are:
 K-means, hierarchical clustering, etc.
 Dimensionality reduction, e.g. analysis of written text.
Examples of dimensionality reduction models:
 Principal component analysis.

Types of Unsupervised Learning:

1. Clustering:
o Groups similar data points together.
o Examples: customer segmentation, image segmentation.
o Algorithms: K-means clustering, hierarchical clustering, DBSCAN.
2. Dimensionality Reduction:
o Reduces the number of features while preserving essential information.
o Examples: visualizing high-dimensional data, feature engineering.

Algorithms: Principal Component Analysis (PCA), t-SNE, Autoencoders.
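A short illustrative sketch of both families (assuming scikit-learn and NumPy; the random data is made up purely for demonstration):

# Illustrative sketch of the two unsupervised-learning families above (assumes scikit-learn).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))          # unlabelled data: 100 points, 5 features

# Clustering: group similar points into k clusters
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("cluster labels:", kmeans.labels_[:10])

# Dimensionality reduction: keep 2 components that preserve most of the variance
X_2d = PCA(n_components=2).fit_transform(X)
print("reduced shape:", X_2d.shape)    # (100, 2)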


Metrics

Metrics: metrics are used to measure the performance of a model. There is no single best
metric; the choice usually depends on the nature of the problem, and factors such as the dataset
and what needs to be achieved also influence it. Examples of metrics are accuracy, precision,
recall, root mean squared error, and the ROC curve and its AUC.

Classification Metrics

Accuracy is the fraction of predictions that are correct:

Accuracy = # correct / # total = (TP + TN) / (TP + FP + FN + TN)

Accuracy alone can be a misleading measure because it hides class imbalance: one class may be
rare, e.g. HIV-positive cases or fraudulent transactions.

Precision ("exactness"): what percentage of the instances that the classifier labelled as positive
are actually positive?

Precision = TP / (TP + FP)

Recall ("completeness"): what percentage of the positive instances did the classifier label as
positive?
Recall = TP / (TP + FN)
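As an illustration, the three quantities above can be computed directly from confusion-matrix counts (the counts below are invented for illustration):

# Computing accuracy, precision, and recall from confusion-matrix counts,
# following the formulas above (the counts are made up for illustration).
TP, FP, FN, TN = 40, 10, 5, 945   # e.g. a rare positive class

accuracy  = (TP + TN) / (TP + FP + FN + TN)
precision = TP / (TP + FP)
recall    = TP / (TP + FN)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
# Accuracy is high (0.985) even though 5 of the 45 positives were missed:
# this is the class-imbalance problem mentioned above.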
Model Bias
Model Bias: this is a situation in which a model is biased against (or towards) a certain group,
such as a race, gender, or religion. In short, we do not want a racist algorithm. Such bias does
not appear overnight; it starts with the data. If you do not examine and correct your data for
this issue, the model will inherit the problem, because the model learns from whatever data it
is fed.

Consider a dataset whose features include a column B giving the proportion of Black residents
by town (the classic Boston housing dataset is a well-known example). A feature like this should
be handled, e.g. by deleting it: if it is allowed to be part of what is given to the model, you
will end up with a racist algorithm.
Train Test Split
Train Test Split: this is the process of splitting data into a training set and a test set. In machine
learning it is advisable to divide the data into three parts:
a. Training set: used during training to fit the model.
b. Validation set: used for validation during model evaluation.
c. Test set: used to test the model after training; this should be data the model has not seen
before.

Train Test Split Ratio: when dividing data into training, validation, and test sets, 80% is
usually assigned to the training set, and the remaining 20% is split in two for validation and testing.
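A sketch of the 80/10/10 split described above (assuming scikit-learn; X and y stand in for any features and labels):

# Sketch of an 80/10/10 train/validation/test split (assumes scikit-learn).
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(1000).reshape(-1, 1)
y = np.arange(1000)

# First hold out 20%, then split that 20% in half for validation and test.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.2, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

print(len(X_train), len(X_val), len(X_test))   # 800 100 100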

Overfitting and Underfitting

Overfitting: a model is said to be overfitting when it performs well on training data but badly
on test data. The major causes of overfitting are lack of data and, sometimes, too much (or too
little) training time. Underfitting: a model is said to be underfitting when it performs badly
both during training and during testing. The major causes of underfitting are lack of data or a
model that is too simple.

Regularization

Regularization: this is a technique used to reduce or avoid overfitting. Some common
regularization techniques are:

a. Data Augmentation
b. Weight Decay
c. Dropout
d. Undersampling
e. Oversampling
f. SMOTE

Data Augmentation: the more data a model has for training, the better it is at predicting unseen
data. However, getting enough data is often not easy; it is usually time-consuming and expensive.
One of the simplest remedies is to add synthetic data to the existing data. Synthetic data is
artificial data created from the existing data through transformations such as rotation,
translation, cropping, and affine warps. Weight Decay: in weight decay, an extra term is added
to the original loss function to form a regularized loss function; the extra term is known as the
regularization term. Dropout: the basic idea behind dropout is to stop neurons from becoming
overly dependent on other neurons. During each training pass a node is turned off with
probability p and kept with probability 1 − p; in this way co-adaptation is avoided.
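As a sketch of what weight decay and dropout look like in practice, here is a minimal PyTorch example (assuming PyTorch is installed; the layer sizes and hyperparameter values are illustrative assumptions):

# Sketch of weight decay and dropout in PyTorch (layer sizes and hyperparameters are illustrative).
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(20, 64),
    nn.ReLU(),
    nn.Dropout(p=0.5),      # each hidden unit is dropped with probability p during training
    nn.Linear(64, 2),
)

# Weight decay adds a penalty on large weights to the loss via the optimizer.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)

x = torch.randn(8, 20)
model.train()               # dropout active
out_train = model(x)
model.eval()                # dropout disabled at evaluation time
out_eval = model(x)
print(out_train.shape, out_eval.shape)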

Reinforcement Learning
In reinforcement learning the goal is for the agent to reach a goal state while collecting the
maximum reward.
An example of a reinforcement learning agent is the Q-learning agent.
In reinforcement learning we have states, actions, and rewards. The PEAS framework is used to
specify an agent's task environment and how its performance is measured:
P - Performance measure
E - Environment
A - Actuators
S - Sensors
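A minimal sketch of the tabular update behind the Q-learning agent mentioned above, using the standard rule Q(s,a) ← Q(s,a) + α(r + γ max_a' Q(s',a') − Q(s,a)); the states, actions, and transition below are invented for illustration:

# Minimal tabular Q-learning update sketch (states, actions, and the sample transition are made up).
from collections import defaultdict

Q = defaultdict(float)          # Q-table: (state, action) -> estimated return
alpha, gamma = 0.1, 0.9         # learning rate and discount factor
actions = ["left", "right"]

def q_update(s, a, r, s_next):
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

# One example transition: in state s0 the agent took "right", got reward 1, reached s1.
q_update("s0", "right", 1.0, "s1")
print(Q[("s0", "right")])       # 0.1 after a single update from zero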

Agent

An agent is anything that can be viewed as perceiving its environment through sensors and
acting upon that environment through actuators (includes humans, robots, chatbots,
thermostats).

An analysis framework: a whole-agent view of intelligent agents based on rationality

Performance measure: defines what "good behaviour" is in a specific context. There is no one
fixed measure across all tasks, but we insist on an objective measure that an outside authority
can establish (i.e. the agent should not itself define what is good, as humans tend to do).
It is better to design performance measures according to what you (as the designer) want in the
environment, rather than according to how one thinks the agent should behave.

Environment: a specification of the physical (or virtual) environment the agent is expected to
operate in.
Actuators: the types and physical properties of the actuators available to the agent. Limits
what the agent can do.

Sensors: the types and physical properties of the sensors available to the agent. Limits what
the agent can know about the environment.

Example: Designing an automated taxi driver

 Performance measure? safety, reach destination, maximise profits, obey laws,


passenger comfort...
 Environment? urban streets, motorways, traffic, pedestrians, weather, customers...
 Actuators? steer, accelerate, brake, horn, speak/display,...
 Sensors? video, accelerometers, GPS, engine sensors, keyboard,...
Natural Language Processing

Introduction to text: Tokenization, vector representation, cosine similarity, and lemmatization

What is text?
• At the lowest level, string or streams of characters (or bytes)

“<div class=”md”><p>Depends on who it's for I guess? Gender? Age? Nationality? Might help
people give you ideas. In general, I'd say something tartan maybe? House of Tartan sells
blankets and scarves etc (along with lots of other generic Scottish stuff). If your gifts are for
artsy people then you could maybe get some Charles Rennie Mackintosh related gifts (the shop
in the Lighthouse sells loads of stuff like that - might be worth having a wander around there
in general).</p></div>”

What is this text about?


What Scottish stuff have you guys given as gifts?
• Some of those character are markup – we can use an HTML parser to remove the tags and
focus on the main text
 Fairly easy for JSON, HTML & XML, might be more difficult for Word, PDF, etc.
Text Processing
Most tasks need to perform text normalization
1. Segmenting/tokenizing terms in running text, e.g. “House of Tartan sells blankets” →
“House”, “of”, “Tartan”, “sells”, “blankets”
Processing Text: Tokenisation
Tokenization: Tokenization is the process of breaking a text into individual units, which are
typically words or subwords (n-grams). These individual units are called tokens. The main
goal of tokenization is to divide the text into meaningful chunks, making it easier to process
and analyze the language data.
 Assuming (HTML) markup is removed, we need to identify the "tokens", or separate
terms
 Separate the sequence of characters into a sequence of tokens (roughly “words”)
 A token (or term) is the technical name for a meaningful sequence of characters
 In English, this may involve:
 Splitting on punctuation: _-.?!,;:"()'&£$
 Splitting on whitespace characters: " " TAB \n \r
 Other languages may be more difficult to tokenise
Example
Input Text: "Hello, how are you?"
Tokenization Output: ["Hello", ",", "how", "are", "you", "?"]
Exercise
Tokenize the string below
[He didn’t like the U.S. movie “Snakes on a train, revenge of Viper-man!”, now playing in
the U.K.]
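A simple regex-based tokenizer sketch (one possible approach, not the only one) reproduces the example output above; applied to the exercise string, a rule this simple would split "U.S." into separate tokens, which illustrates why tokenization is harder than it looks:

# Simple regex tokenizer: split out word characters and keep punctuation as separate tokens.
import re

def tokenize(text):
    # \w+ matches runs of word characters; [^\w\s] matches single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("Hello, how are you?"))
# ['Hello', ',', 'how', 'are', 'you', '?']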

2. Normalize or ‘canonicalize’ tokens into a normal form

Normalization: Normalization is the process of transforming text into a standard, consistent


format. The objective of normalization is to eliminate variations in the text that do not
contribute to the meaning but could affect the analysis or modeling. It includes various
operations like converting text to lowercase, removing punctuation, handling special
characters, and expanding contractions.
Example:
Original Text: "I haven't seen it!"
Normalization Output: "i have not seen it"
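A small normalization sketch (the contraction list here is a tiny illustrative sample, not a complete one):

# Normalization sketch: lowercase, expand a few contractions, strip punctuation.
import re

CONTRACTIONS = {"haven't": "have not", "i'd": "i would", "it's": "it is"}  # illustrative sample

def normalize(text):
    text = text.lower()
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    text = re.sub(r"[^\w\s]", "", text)       # remove punctuation
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace

print(normalize("I haven't seen it!"))        # "i have not seen it"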

 May “case fold” aka lower-case text, remove numbers, punctuation, etc., e.g.,
“House”, “of”, “Tartan”, “sells”, “blankets” →
“house”, “of”, “tartan”, “sell”, “blanket”.
 Lemma: same stem, part of speech, rough word sense
 cat and cats = same lemma
 Wordform: the full inflected surface form
 cat and cats = different wordforms
Stemming
Stemming is a text normalization technique used in natural language processing to reduce
words to their base or root form. The process involves removing prefixes, suffixes, and
inflections from words to obtain the core form, known as the stem. The resulting stems may
not always be actual words, but they represent the common base that related words share.
The purpose of stemming is to group words with similar meanings together, even if they have
different forms, so that they can be treated as the same word during text analysis. This helps
in reducing data sparsity and improving the efficiency of NLP tasks like text search,
information retrieval, and sentiment analysis.
Example of stemming:
Consider the following words:
Running
Runs
Ran
Applying stemming to these words would yield the common base "run":
Stemmed words:
Running -> run
Runs -> run
Ran -> run
Here are a few more examples of stemming:
Going -> go
Jumps -> jump
Jumping -> jump
Happily -> happi (Note: The stem "happi" is not an actual English word but represents the
common base.)
 Goal: Group words that are similar
• e.g., “computer”, “computers”, “computing”, “compute”
 Definition: Process for reducing inflected words to their stem or root form
 In practice: rule-based models to remove suffixes of words , e.g. "-er", "-ed", "-s", "-
ing”
 The outcome may not be a word, e.g. comput
 Usually operates on a single word without context
 Fast, easy to implement, usually effective -- errors may not be “too bad”
 Porter Stemmer is the most widely used algorithm
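A short sketch using NLTK's Porter stemmer (assuming nltk is installed):

# Stemming with NLTK's Porter stemmer.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["computer", "computers", "computing", "compute", "running", "jumps"]:
    print(word, "->", stemmer.stem(word))
# The "comput*" family all reduce to the non-word stem "comput",
# illustrating that stems need not be real words.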

Lemmatization
Lemmatization is another text normalization technique used in natural language processing to
reduce words to their base or dictionary form, called lemmas. Unlike stemming, which
simply trims off prefixes or suffixes, lemmatization takes into account the word's context and
part of speech (POS) to produce valid words.
The main goal of lemmatization is to transform different inflected forms of a word into a
single, canonical form. This allows words with the same meaning to be treated as a single
token during text analysis, improving the accuracy and interpretability of NLP tasks.
Example of lemmatization:
Consider the following words with different forms:
Running
Runs
Ran
By applying lemmatization, the words are transformed into their lemmas:
Lemmatized words:
Running -> run
Runs -> run
Ran -> run
Here are a few more examples of lemmatization:
Better -> good
Best -> good
Cats -> cat
Went -> go
In these examples, you can see that lemmatization produces valid words that represent the
base or dictionary form of the original words, taking into account the context and part of
speech. By doing so, lemmatization helps in consolidating words with the same meaning and
reducing data sparsity.

 Definition: process of grouping together the different inflected forms of a word to a


base form
 Uses context (part-of-speech patterns) to be more precise
 E.g. meeting (noun) vs. to meet (verb)
 Have to find correct dictionary headword form
 e.g. using a linguistic language dictionary like Wordnet
 More linguistically principled than stemming, but depends on accuracy of context and
completeness of dictionary
 Processing is slower than stemming
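A short sketch using NLTK's WordNet lemmatizer (assuming nltk is installed; the WordNet data may need a one-time download):

# Lemmatization with NLTK's WordNet lemmatizer.
import nltk
nltk.download("wordnet", quiet=True)      # one-time download of the WordNet data
nltk.download("omw-1.4", quiet=True)
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("cats"))                # cat
print(lemmatizer.lemmatize("running", pos="v"))    # run   (needs the verb POS tag)
print(lemmatizer.lemmatize("went", pos="v"))       # go
print(lemmatizer.lemmatize("better", pos="a"))     # good  (adjective POS tag)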

3. Segment long sequences of tokens (usually into sentences)


- Build a binary classifier (later) and/or use heuristic rules, e.g.
[Might, help, people, give, you, ideas, ., I'd, say, something, tartan, maybe, ?] →
<s>Might help people give you ideas.</s><s>I'd say something tartan maybe?</s>

Text Representation
Text collections
 For a large collection of text documents we define various properties
 Type (or lexeme): an element of the vocabulary
 A normalized unique token
 N = number of all token occurrences (word count)
 V = vocabulary = set of types (unique normalized tokens) |V| is the size of the
vocabulary
 A vocabulary is stored in a data structure called a dictionary.
In natural language processing (NLP), a vocabulary refers to the collection of unique words
or tokens present in a corpus or a specific text dataset. Building a vocabulary is an essential
step in NLP tasks such as text classification, language modeling, and machine translation.
The vocabulary helps in representing text data in a numerical format that machine learning
algorithms can process. Here's an explanation of vocabulary in NLP with examples:
Corpus and Tokens: A corpus is a collection of text documents or sentences. Tokens are the
individual units or elements of a text, typically words or characters, obtained by splitting the
text. Consider the following corpus:
Corpus: "I love natural language processing. It is fascinating!"
Tokens: ['I', 'love', 'natural', 'language', 'processing', '.', 'It', 'is', 'fascinating', '!']
Vocabulary: The vocabulary is a set of unique tokens present in the corpus. It represents all
the distinct words that occur in the text dataset. In the above corpus, the vocabulary would be:
Vocabulary: {'I', 'love', 'natural', 'language', 'processing', '.', 'It', 'is', 'fascinating', '!'}
Vocabulary Size: The vocabulary size refers to the total number of unique words in the
vocabulary. In the above example, the vocabulary size is 10.
Word Frequency: Word frequency indicates how often each word occurs in the corpus. It
provides insights into the importance or prevalence of specific words. For example, the word
frequency of the corpus could be:
Word Frequency: {'I': 1, 'love': 1, 'natural': 1, 'language': 1, 'processing': 1, '.': 1, 'It': 1, 'is': 1,
'fascinating': 1, '!': 1}
Out-of-Vocabulary (OOV) Tokens: OOV tokens refer to words or tokens that are not present
in the vocabulary. They often occur when encountering new or unseen words in test or
production data. Proper handling of OOV tokens is crucial in NLP tasks to avoid issues with
unknown words.
Example: Suppose the corpus is "I enjoy reading books." and the vocabulary is {'I', 'enjoy',
'reading'}. If the test data contains the word "books," which is not in the vocabulary, it would
be considered an OOV token.
Building and managing an effective vocabulary is important in NLP tasks, and it involves
techniques such as tokenization, removing stop words, handling OOV tokens, and
maintaining an appropriate vocabulary size. The vocabulary enables the conversion of text
data into numerical representations that machine learning algorithms can work with for
various NLP tasks.
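A sketch of building a vocabulary, word frequencies, and a simple OOV fallback for the example corpus above (the <UNK> token is a common convention assumed here, not something fixed by the notes):

# Vocabulary, word frequencies, and OOV handling for the example corpus.
import re
from collections import Counter

corpus = "I love natural language processing. It is fascinating!"
tokens = re.findall(r"\w+|[^\w\s]", corpus)

vocabulary = set(tokens)                 # unique tokens
word_freq = Counter(tokens)              # how often each token occurs
print(len(vocabulary), word_freq["love"])

# Map unseen words to a special <UNK> token at lookup time.
def lookup(word):
    return word if word in vocabulary else "<UNK>"

print([lookup(w) for w in ["I", "enjoy", "processing"]])   # ['I', '<UNK>', 'processing']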

Representing the text


 We can represent a piece of text by the terms that occur in it. Mathematically, this would
be a vector for each document

 This is called "one-hot" encoding


 1 in a term's column if that term occurs in the document
 0 otherwise
 We may consider this as representing each document as a set of its terms
 All of our vectors have |V| dimensions, to cover all words we might have encountered
in all documents analysed
Implementing One-Hot Encoding
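No implementation is given in the notes, so here is one minimal sketch, assuming lower-cased whitespace tokenization and a sorted shared vocabulary:

# One possible one-hot (set-of-terms) encoding: 1 if the term occurs in the
# document, 0 otherwise, over a shared |V|-dimensional vocabulary.
def build_vocab(documents):
    return sorted({term for doc in documents for term in doc.lower().split()})

def one_hot(document, vocab):
    terms = set(document.lower().split())
    return [1 if term in terms else 0 for term in vocab]

docs = ["house of tartan sells blankets", "tartan scarves and blankets"]
vocab = build_vocab(docs)
print(vocab)
print([one_hot(d, vocab) for d in docs])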
Text (Set) Similarity

Why set-based similarity?


 Work well for short pieces of text
 Person names, product titles, etc.…
 Tweets or sentences
 Simple (trivial) to compute with basic data structures
 Fundamental building block of more complex (learned) functions
The overlap coefficient, also known as the Szymkiewicz-Simpson coefficient, is a similarity
measure used to compare the similarity between two sets. It calculates the ratio of the
intersection of two sets to the smaller of the two sets. The formula for the overlap coefficient
is:
Overlap Coefficient (O) = |A ∩ B| / min(|A|, |B|)
where:
|A ∩ B| is the size of the intersection of sets A and B.
|A| is the size of set A.
|B| is the size of set B.
The overlap coefficient ranges from 0 to 1, where 0 indicates no overlap (no shared elements)
between the sets, and 1 indicates complete overlap (both sets are identical).
Let's illustrate the overlap coefficient with some examples:
Example 1: Set A: {1, 2, 3, 4, 5} Set B: {3, 4, 5, 6, 7}
|A ∩ B| = {3, 4, 5} (intersection) |A| = 5 |B| = 5
Overlap Coefficient (O) = |A ∩ B| / min(|A|, |B|) = 3 / 5 = 0.6
Example 2: Set A: {apple, banana, orange} Set B: {orange, pear, peach}
|A ∩ B| = {orange} (intersection) |A| = 3 |B| = 3
Overlap Coefficient (O) = |A ∩ B| / min(|A|, |B|) = 1 / 3 ≈ 0.3333
Example 3: Set A: {red, green, blue} Set B: {purple, yellow}
|A ∩ B| = {} (intersection - no common elements) |A| = 3 |B| = 2
Overlap Coefficient (O) = |A ∩ B| / min(|A|, |B|) = 0 / 2 = 0
In these examples, we calculated the overlap coefficient for different sets, demonstrating how
it quantifies the similarity or overlap between sets. Keep in mind that this coefficient is
commonly used in various fields, such as information retrieval and data mining, to assess the
similarity between sets of data.

The Dice coefficient, also known as the Sørensen–Dice coefficient or Dice similarity index,
is a similarity measure used to compare the similarity between two sets. It calculates the ratio
of twice the intersection of two sets to the sum of the sizes of the sets. The formula for the
Dice coefficient is:
Dice Coefficient (D) = 2 * |A ∩ B| / (|A| + |B|)
where:
|A ∩ B| is the size of the intersection of sets A and B.
|A| is the size of set A.
|B| is the size of set B.
The Dice coefficient ranges from 0 to 1, where 0 indicates no overlap (no shared elements)
between the sets, and 1 indicates complete overlap (both sets are identical).
Let's illustrate the Dice coefficient with some examples:
Example 1: Set A: {1, 2, 3, 4, 5} Set B: {3, 4, 5, 6, 7}
|A ∩ B| = {3, 4, 5} (intersection) |A| = 5 |B| = 5
Dice Coefficient (D) = 2 * |A ∩ B| / (|A| + |B|) = 2 * 3 / (5 + 5) = 6 / 10 = 0.6
Example 2: Set A: {apple, banana, orange} Set B: {orange, pear, peach}
|A ∩ B| = {orange} (intersection) |A| = 3 |B| = 3
Dice Coefficient (D) = 2 * |A ∩ B| / (|A| + |B|) = 2 * 1 / (3 + 3) = 2 / 6 ≈ 0.3333
Example 3: Set A: {red, green, blue} Set B: {purple, yellow}
|A ∩ B| = {} (intersection - no common elements) |A| = 3 |B| = 2
Dice Coefficient (D) = 2 * |A ∩ B| / (|A| + |B|) = 2 * 0 / (3 + 2) = 0 / 5 = 0
In these examples, we calculated the Dice coefficient for different sets, demonstrating how it
quantifies the similarity or overlap between sets. Like the Overlap coefficient, the Dice
coefficient is commonly used in various fields, such as image segmentation, natural language
processing, and data clustering, to measure the similarity between sets of data.

The Jaccard similarity, also known as the Jaccard index, is a similarity measure used to
compare the similarity between two sets. It calculates the ratio of the size of the intersection
of two sets to the size of their union. The formula for the Jaccard similarity is:
Jaccard Similarity (J) = |A ∩ B| / |A ∪ B|
where:
|A ∩ B| is the size of the intersection of sets A and B.
|A ∪ B| is the size of the union of sets A and B.
The Jaccard similarity ranges from 0 to 1, where 0 indicates no similarity (no shared
elements) between the sets, and 1 indicates complete similarity (both sets are identical).
Let's illustrate the Jaccard similarity with some examples:
Example 1: Set A: {1, 2, 3, 4, 5} Set B: {3, 4, 5, 6, 7}
|A ∩ B| = {3, 4, 5} (intersection) |A ∪ B| = {1, 2, 3, 4, 5, 6, 7} (union)
Jaccard Similarity (J) = |A ∩ B| / |A ∪ B| = 3 / 7 ≈ 0.4286
Example 2: Set A: {apple, banana, orange} Set B: {orange, pear, peach}
|A ∩ B| = {orange} (intersection) |A ∪ B| = {apple, banana, orange, pear, peach} (union)
Jaccard Similarity (J) = |A ∩ B| / |A ∪ B| = 1 / 5 = 0.2
Example 3: Set A: {red, green, blue} Set B: {purple, yellow}
|A ∩ B| = {} (intersection - no common elements) |A ∪ B| = {red, green, blue, purple,
yellow} (union)
Jaccard Similarity (J) = |A ∩ B| / |A ∪ B| = 0 / 5 = 0
In these examples, we calculated the Jaccard similarity for different sets, demonstrating how
it quantifies the similarity or overlap between sets. The Jaccard similarity is widely used in
various fields, such as data mining, recommendation systems, and text analysis, to measure
the similarity between sets of data.
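Sketch implementations of the three set-similarity measures, checked against Example 1 above:

# Overlap, Dice, and Jaccard coefficients, checked on Set A = {1..5}, Set B = {3..7}.
def overlap(a, b):
    return len(a & b) / min(len(a), len(b))

def dice(a, b):
    return 2 * len(a & b) / (len(a) + len(b))

def jaccard(a, b):
    return len(a & b) / len(a | b)

A = {1, 2, 3, 4, 5}
B = {3, 4, 5, 6, 7}
print(overlap(A, B))   # 0.6
print(dice(A, B))      # 0.6
print(jaccard(A, B))   # 0.42857...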

Example
Consider two sentences d1 and d2 below. Calculate their Jaccard similarity.
 d1: James decided to quit smoking but it was not an easy decision.
 d2: Though it was not an easy decision, James decided to quit smoking
Solution:
To calculate the Jaccard similarity between two sentences, we first need to preprocess the
sentences to remove any punctuation, convert all words to lowercase, and split them into
individual tokens (words). Then, we calculate the Jaccard similarity as the size of the
intersection divided by the size of the union of the tokens in the two sentences.
Let's preprocess the sentences and calculate their Jaccard similarity:
Step 1: Preprocess the sentences
d1: "James decided to quit smoking but it was not an easy decision." d2: "Though it was not
an easy decision, James decided to quit smoking."
After preprocessing, the tokenized versions of the sentences are:
d1_tokens: ['james', 'decided', 'to', 'quit', 'smoking', 'but', 'it', 'was', 'not', 'an', 'easy', 'decision']
d2_tokens: ['though', 'it', 'was', 'not', 'an', 'easy', 'decision', 'james', 'decided', 'to', 'quit',
'smoking']
Step 2: Calculate the Jaccard similarity
Intersection (common tokens): ['james', 'decided', 'to', 'quit', 'smoking', 'it', 'was', 'not', 'an',
'easy', 'decision'] Union (all unique tokens): ['james', 'decided', 'to', 'quit', 'smoking', 'but', 'it',
'was', 'not', 'an', 'easy', 'decision', 'though']
Jaccard similarity = Size of Intersection / Size of Union = 11 / 13 ≈ 0.8462
So, the Jaccard similarity between d1 and d2 is approximately 0.8462 (rounded to four
decimal places).
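The same calculation can be verified with a few lines of Python (simple regex preprocessing assumed):

# Verifying the worked example: preprocess the two sentences and compute Jaccard similarity.
import re

def word_set(sentence):
    return set(re.findall(r"\w+", sentence.lower()))

d1 = "James decided to quit smoking but it was not an easy decision."
d2 = "Though it was not an easy decision, James decided to quit smoking"

s1, s2 = word_set(d1), word_set(d2)
print(len(s1 & s2), len(s1 | s2))      # 11 13
print(len(s1 & s2) / len(s1 | s2))     # 0.8461...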

Geometric similarity & Text Distributions


 Bag-of-words text representation
 Geometric similarity
 Text distributions

Bag-of-words
The Bag of Words (BoW) model is a common technique used in Natural Language
Processing (NLP) to represent text data in a numerical format. In this model, a document is
represented as an unordered collection or "bag" of words, where the frequency of each word
is used as a feature. The order and structure of the text are disregarded, and only the
occurrence of words is considered. Let's see some examples of the Bag of Words
representation:
Example 1: Consider the following two sentences:
Sentence 1: "I love natural language processing." Sentence 2: "Natural language processing is
fascinating."
Step 1: Create a vocabulary The vocabulary consists of all unique words present in the
sentences:
Vocabulary: ["I", "love", "natural", "language", "processing", "is", "fascinating"]
Step 2: Vectorize the sentences Convert each sentence into a numerical vector based on word
frequencies:
Vector for Sentence 1: [1, 1, 1, 1, 1, 0, 0] Vector for Sentence 2: [0, 0, 1, 1, 1, 1, 1]
Example 2: Consider another set of sentences:
Sentence 1: "The cat sits on the mat." Sentence 2: "The dog jumps over the fence."
Step 1: Create a vocabulary Vocabulary: ["The", "cat", "sits", "on", "mat", "dog", "jumps",
"over", "fence"]
Step 2: Vectorize the sentences Vectors for the sentences (counting "the" case-insensitively, so it
occurs twice in each sentence):
Vector for Sentence 1: [2, 1, 1, 1, 1, 0, 0, 0, 0] Vector for Sentence 2: [2, 0, 0, 0, 0, 1, 1, 1, 1]
In both examples, we have converted the sentences into numerical vectors based on word
frequencies. Each element in the vector corresponds to a word in the vocabulary, and its
value represents the number of occurrences of that word in the sentence. The Bag of Words
representation is straightforward and easy to implement, but it doesn't consider word order or
semantic meaning, which can be limiting for certain NLP tasks. Despite its limitations, the
Bag of Words model is a fundamental concept that forms the basis for more advanced text
processing techniques in NLP.
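A manual bag-of-words sketch for Example 1 above (lower-casing everything and keeping the vocabulary in first-occurrence order, which is an assumption of this sketch):

# Manual bag-of-words vectors for the two sentences in Example 1.
import re

def tokens(text):
    return re.findall(r"\w+", text.lower())

sent1 = "I love natural language processing."
sent2 = "Natural language processing is fascinating."

vocab = list(dict.fromkeys(tokens(sent1) + tokens(sent2)))   # ordered, unique
def bow(text):
    toks = tokens(text)
    return [toks.count(term) for term in vocab]

print(vocab)          # ['i', 'love', 'natural', 'language', 'processing', 'is', 'fascinating']
print(bow(sent1))     # [1, 1, 1, 1, 1, 0, 0]
print(bow(sent2))     # [0, 0, 1, 1, 1, 1, 1]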
Term Frequency & Bag-of-words

 The one-hot encoding treated each document as a set of terms…


 If we count frequencies of terms, we are treating each document as a bag A bag allows
duplicate items
 Hence this representation is often called bag-of-words.
Bag-of-Words model
 Assumption: If a term occurs lots in a document it should imply something about what
that document is about. A relaxation of the binary occurrence assumption.
 Recording the term frequency information provides more information of
aboutness
 Bag-of-words is also a representation of text:
 We can use a dictionary exactly the same as a one-hot encoding to keep
counts of term occurrences
 E.g. [0,5,0,0,9,1,1,4,0,0,0] - dense representation
 E.g. 2:5 5:9 10:1 11:1 12:4 - sparse representation (uses less memory)
Bag-of-words representation
A Document-Term Matrix (DTM) is a tabular representation of a collection of text
documents, where rows correspond to documents, columns correspond to terms (words or n-
grams), and the cells contain the frequency of each term in each document. It is a common
way to represent text data in numerical format for various Natural Language Processing
(NLP) tasks. Let's see some examples of a Document-Term Matrix:
Example 1: Consider the following three short text documents:
Document 1: "I love natural language processing." Document 2: "Natural language
processing is fascinating." Document 3: "NLP is a subfield of artificial intelligence."
Step 1: Create a vocabulary The vocabulary consists of all unique words present in the
documents:
Vocabulary: ["I", "love", "natural", "language", "processing", "is", "fascinating", "a",
"subfield", "of", "artificial", "intelligence"]
Step 2: Build the Document-Term Matrix Count the occurrences of each term in each
document:
Document     I  love  natural  language  processing  is  fascinating  NLP  a  subfield  of  artificial  intelligence

Document 1   1  1     1        1         1           0   0            0    0  0         0   0           0

Document 2   0  0     1        1         1           1   1            0    0  0         0   0           0

Document 3   0  0     0        0         0           1   0            1    1  1         1   1           1

Example 2: Consider another set of documents:


Document 1: "The cat sits on the mat." Document 2: "The dog jumps over the fence."
Document 3: "The mat is soft."
Step 1: Create a vocabulary Vocabulary: ["The", "cat", "sits", "on", "mat", "dog", "jumps",
"over", "fence", "is", "soft"]
Step 2: Build the Document-Term Matrix Count the occurrences of each term in each
document:

Document The cat sits on mat dog jumps over fence is soft

Document 1 2 1 1 1 1 0 0 0 0 0 0

Document 2 2 0 0 0 0 1 1 1 1 0 0

Document 3 1 0 0 0 1 0 0 0 0 1 1
In both examples, we created Document-Term Matrices by counting the occurrences of each
term in each document. The DTM allows us to represent textual data in a structured
numerical format that can be used as input for various machine learning algorithms in NLP
tasks.

 An entire corpus can be represented as a document-term matrix (DTM)


 Each row is a document, represented as a vector of its word occurrences.
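A document-term matrix can also be built with scikit-learn's CountVectorizer (assuming a recent scikit-learn; note its default token pattern drops one-character tokens such as "I" and "a", and its columns come out in alphabetical order rather than the order shown above):

# Building a document-term matrix with CountVectorizer.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "I love natural language processing.",
    "Natural language processing is fascinating.",
    "NLP is a subfield of artificial intelligence.",
]

# Pass a token pattern that keeps one-character tokens like "I" and "a".
vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
dtm = vectorizer.fit_transform(docs)          # sparse matrix: rows = documents, columns = terms
print(vectorizer.get_feature_names_out())
print(dtm.toarray())                          # dense document-term matrix of term counts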
Text Geometric Similarity
Cosine similarity is a common similarity measure used in Natural Language Processing
(NLP) to compare the similarity between two text documents. It calculates the cosine of the
angle between two vectors, representing the term frequency of words in the documents. Let's
see some examples of cosine similarity in NLP:
Example 1: Consider two sentences:
Sentence 1: "I love natural language processing." Sentence 2: "Natural language processing is
fascinating."
Step 1: Create a vocabulary and vectorize the sentences Create a vocabulary containing all
unique words from the sentences, and represent each sentence as a numerical vector based on
word frequencies:
Vocabulary: ["I", "love", "natural", "language", "processing", "is", "fascinating"]
Vector for Sentence 1: [1, 1, 1, 1, 1, 0, 0] Vector for Sentence 2: [0, 0, 1, 1, 1, 1, 1]
Step 2: Calculate the Cosine Similarity Use the cosine similarity formula to calculate the
similarity between the two vectors:
Cosine Similarity = (A · B) / (||A|| * ||B||)
where (A · B) is the dot product of the vectors A and B, and ||A|| and ||B|| are the magnitudes
(Euclidean norms) of the vectors A and B, respectively.
Cosine Similarity = (1*0 + 1*0 + 1*1 + 1*1 + 1*1 + 0*1 + 0*1) / (√(1+1+1+1+1) *
√(0+0+1+1+1+1+1)) = 3 / (√5 * √5) = 3 / 5 = 0.6
Example 2: Consider two more sentences:
Sentence 1: "The cat sits on the mat." Sentence 2: "The dog jumps over the fence."
Step 1: Create a vocabulary and vectorize the sentences
Vocabulary: ["The", "cat", "sits", "on", "mat", "dog", "jumps", "over", "fence"]
Vector for Sentence 1: [2, 1, 1, 1, 1, 0, 0, 0, 0] Vector for Sentence 2: [2, 0, 0, 0, 0, 1, 1, 1, 1]
Step 2: Calculate the Cosine Similarity
Cosine Similarity = (2*2 + 1*0 + 1*0 + 1*0 + 1*0 + 0*1 + 0*1 + 0*1 + 0*1) / (√(4+1+1+1+1) *
√(4+1+1+1+1)) = 4 / (√8 * √8) = 4 / 8 = 0.5
In both examples, we calculated the cosine similarity between pairs of sentences, which gives
us a numerical value representing their similarity. Higher cosine similarity values indicate
greater similarity between the sentences, while lower values indicate less similarity. The
cosine similarity is widely used in NLP for tasks such as document clustering, information
retrieval, and recommendation systems.

Exercise: Cosine Similarity


 Consider two documents D1 and D2 represented with a bag-of-words representation
 D1 = (1, 0, 3), D2 = (4, 2, 1)
 Calculate their cosine similarity

To calculate the cosine similarity between two documents represented as bag-of-words


vectors, we use the formula:

Cosine Similarity = (A · B) / (||A|| * ||B||)

Where: A · B is the dot product between vectors A and B. ||A|| is the magnitude (Euclidean
norm) of vector A. ||B|| is the magnitude (Euclidean norm) of vector B.

Let's calculate the cosine similarity for D1 and D2:

D1 = (1, 0, 3) D2 = (4, 2, 1)
Step 1: Calculate the dot product (A · B) A · B = (1 * 4) + (0 * 2) + (3 * 1) = 4 + 0 + 3 = 7

Step 2: Calculate the magnitude (Euclidean norm) of each vector ||A|| = sqrt((1^2) + (0^2) +
(3^2)) = sqrt(1 + 0 + 9) = sqrt(10) ≈ 3.1623 ||B|| = sqrt((4^2) + (2^2) + (1^2)) = sqrt(16 + 4 +
1) = sqrt(21) ≈ 4.5826

Step 3: Calculate the cosine similarity Cosine Similarity = (A · B) / (||A|| * ||B||) = 7 / (3.1623
* 4.5826) ≈ 7 / 14.4914 ≈ 0.4830

So, the cosine similarity between D1 and D2 is approximately 0.4830 (rounded to four
decimal places).
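The exercise can be verified with NumPy:

# Verifying the exercise: cosine similarity of D1 = (1, 0, 3) and D2 = (4, 2, 1).
import numpy as np

d1 = np.array([1.0, 0.0, 3.0])
d2 = np.array([4.0, 2.0, 1.0])

cos = d1.dot(d2) / (np.linalg.norm(d1) * np.linalg.norm(d2))
print(f"{cos:.4f}")   # 0.4830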

Problem 1: Raw Term frequency (tf)


 The term frequency (tft,d) of term t in document d is defined as the number of times
that t occurs in d.
 We used raw tf in cosine similarity, but it has a problem
 A document with 10 occurrences of the term is more related than a document
with 1 occurrence of the term.
 But it is not 10 times more relevant.
 Aboutness does not increase linearly with term frequency.

Document frequency
 Rare terms are more informative than frequent terms Recall “stop” words
 Consider a term in the query that is rare in the collection (e.g., arachnocentric)
 A document containing this term is very likely to be useful for the query arachnocentric
- its certainly not the kind of word that just happens to appear in a document by chance
 We want a high weight for rare terms like arachnocentric.
 Frequent terms are less informative than rare terms
 Consider a term that is frequent in the collection (e.g., high, increase, line)
 A document containing such a term is more likely to be about that word than a document
that doesn’t
 But it’s not a sure indicator of aboutness.
 For frequent terms, we still want positive weights for words like high,
increase, and line
 But lower weights than for rare terms.
 We will use document frequency (df) to capture this
 I.e. the number of documents in a corpus that contain a term
Example
 Consider a document containing 100 words wherein the word cat appears 3 times. Now,
assume we have 10 million documents and the word cat appears in one thousand of
these.
 What is the natural term frequency?
 What is the IDF value for cat (log_10)?
 What is the TF-IDF score (using log_2 weighting for TF)?
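A numerical sketch of this exercise (the log-weighted TF variant shown, 1 + log2(count), is only one common convention and is an assumption here, not something fixed by the notes):

# Working the exercise numerically.
import math

count_in_doc, doc_length = 3, 100
num_docs, docs_with_term = 10_000_000, 1_000

tf_natural = count_in_doc / doc_length                 # 3 / 100 = 0.03
idf = math.log10(num_docs / docs_with_term)            # log10(10,000) = 4.0
print(tf_natural, idf, tf_natural * idf)               # 0.03  4.0  0.12

tf_log2 = 1 + math.log2(count_in_doc)                  # 1 + log2(3) ≈ 2.585 (assumed TF weighting)
print(tf_log2 * idf)                                   # ≈ 10.34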
