CSC413 Lecture Note
Regression Models
Classification Models
1. Clustering:
o Groups similar data points together.
o Examples: customer segmentation, image segmentation.
o Algorithms: K-means clustering, hierarchical clustering, DBSCAN (see the short sketch after this list).
2. Dimensionality Reduction:
o Reduces the number of features while preserving essential information.
o Examples: visualizing high-dimensional data, feature engineering.
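As a minimal sketch of both ideas, here is an illustrative example using scikit-learn (my assumption; the synthetic data, K-means with 3 clusters, and PCA with 2 components are not from the lecture):

# Minimal sketch: K-means clustering and PCA-based dimensionality reduction.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))          # 100 samples, 5 features (made-up data)

# Clustering: group similar points into 3 clusters.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:10])             # cluster assignment of the first 10 points

# Dimensionality reduction: keep 2 components that preserve most of the variance.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
print(X_2d.shape)                      # (100, 2)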
Metrics: Metrics are used to measure the performance of a model. There is no single best
metric; the choice usually depends on the nature of the problem. Factors such as the dataset
and what needs to be achieved also influence the choice of metric. Examples of metrics are
accuracy, precision, recall, root mean squared error, ROC, and AUC.
Classification Metrics
Accuracy isn’t always a good measure because it hides class imbalance. One class may be
rare, e.g. HIV-positive cases or fraudulent transactions.
Precision: “exactness” – what percent of the instances that the classifier labelled as positive
are actually positive?
Precision = TP / (TP + FP)
Recall: “completeness” – what percent of the positive instances did the classifier label as
positive?
Recall = TP / (TP + FN)
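As a small worked sketch (the confusion-matrix counts below are made up for illustration), these metrics can be computed directly from the counts, and they show why accuracy can be misleading on imbalanced data:

# Hypothetical confusion-matrix counts for a rare positive class (e.g. fraud).
tp, fp, fn, tn = 30, 10, 20, 940

accuracy  = (tp + tn) / (tp + tn + fp + fn)   # 0.97 -- looks great despite the imbalance
precision = tp / (tp + fp)                    # 0.75 -- "exactness"
recall    = tp / (tp + fn)                    # 0.60 -- "completeness"
print(accuracy, precision, recall)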
Model Bias
Model Bias: this is a situation whereby a model is biased towards or against a certain group.
A group can be a race, gender, religion, etc. In short, we don’t want a racist algorithm.
However, bias doesn’t happen overnight; it starts from the data. When you don’t work on your
data to address this issue, you are going to face this problem. The model feeds on the data,
and as such the problem starts from the data.
If you look at the features of the dataset, you will find a feature B meaning the proportion of
Black residents by town. This kind of feature should be handled by deleting it. When you
allow it to be part of what is given to the model, you will end up with a racist algorithm.
Train Test Split
Train Test Split: This is the process of splitting data into a training set and a test set. In
machine learning it is advisable to divide the data into three parts:
a. Training set: used during training to fit the model.
b. Validation set: used for validation during model evaluation.
c. Test set: used for testing the model after training. This is a part of the data the model has
not seen before.
Train Test Split Ratio: when dividing data into training, validation, and test sets, 80% is
usually assigned to the training set, and the remaining 20% is divided into two for validation
and testing.
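A minimal sketch of such an 80/10/10 split, assuming scikit-learn's train_test_split is available (the arrays X and y are placeholders):

# Split data into 80% train, 10% validation, 10% test.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(1000).reshape(-1, 1)   # placeholder features
y = np.arange(1000)                  # placeholder labels

# First hold out 20%, then split that 20% equally into validation and test.
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.2, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)
print(len(X_train), len(X_val), len(X_test))   # 800 100 100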
Overfitting: A model is said to be overfitting when it performs well on the training data but
badly on the test data. The major causes of overfitting are lack of data and, sometimes, too
much training time.
Underfitting: A model is said to be underfitting when it performs badly both during training
and during testing. The major cause of underfitting is lack of data.
Regularization
a. Data Augmentation
b. Weight decay
c. Dropout
d. Undersampling
e. Oversampling
f. SMOTE
Data Augmentation: The more data a model has for training, the better it is at predicting
unseen data. However, getting enough data is often not easy; it is usually time-consuming
and expensive. One of the simplest remedies is to add synthetic data to our existing data.
Synthetic data is artificial data created from the existing data through some form of
transformation, such as rotation, translation, cropping, and affine transforms.
Weight Decay: In weight decay, an extra term is added to the initial loss function to form a
regularized loss function; the extra term is known as the regularization term. For example,
with an L2 penalty the regularized loss is L_reg = L + λ‖w‖², where λ controls the strength of
the penalty.
Dropout: The basic idea behind dropout is to make neurons less dependent on other neurons.
During each training pass there is a probability p of a node being turned off and a probability
1 − p of it being kept; this way co-adaptation is avoided.
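A minimal PyTorch sketch of weight decay and dropout (PyTorch, the layer sizes, and the λ value are my assumptions for illustration, not specified in the note):

# Illustrative sketch: weight decay (L2 regularization) and dropout in PyTorch.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(20, 64),
    nn.ReLU(),
    nn.Dropout(p=0.5),     # each unit is dropped with probability p = 0.5 during training
    nn.Linear(64, 2),
)

# weight_decay applies an L2 penalty on the weights during the optimizer step.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)

model.train()              # dropout is active in train mode ...
model.eval()               # ... and disabled in eval mode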
Reinforcement Learning
In reinforcement learning the goal is for the agent to reach a goal state while collecting the
maximum reward.
An example of a reinforcement learning agent is a Q-learning agent.
In reinforcement learning we have states, actions, and rewards. The PEAS framework is used
to describe the task and measure the agent's performance:
P- Performance measure
E-Environment
A-Actuators
S-Sensors
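A minimal sketch of the tabular Q-learning update mentioned above; the toy environment, learning rate, discount factor, and exploration rate here are illustrative assumptions, not from the lecture:

# Tabular Q-learning sketch: update Q(s, a) from (state, action, reward, next state).
import numpy as np

n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.9, 0.1    # learning rate, discount factor, exploration rate

def choose_action(state):
    # Epsilon-greedy: mostly exploit the best known action, sometimes explore.
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    return int(np.argmax(Q[state]))

def update(state, action, reward, next_state):
    # Q-learning update: move Q(s, a) toward reward + gamma * max_a' Q(s', a').
    target = reward + gamma * np.max(Q[next_state])
    Q[state, action] += alpha * (target - Q[state, action])

# One illustrative transition: from state 0, take an action, get reward +1, land in state 3.
update(0, choose_action(0), 1.0, 3)
print(Q[0])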
Agent
An agent is anything that can be viewed as perceiving its environment through sensors and
acting upon that environment through actuators (includes humans, robots, chatbots,
thermostats).
Performance measure: Defines what "good behaviour" is in a specific context. There is no one
fixed measure across all tasks, but we insist on an objective measure that an outside authority
can establish (i.e. the agent should not itself define what is good, as humans tend to do).
It is better to design performance measures according to what you (as the designer) want in
the environment, rather than how one thinks the agent should behave.
Environment: A specification of the physical (or virtual) environment the agent is expected to
operate in.
Actuators: The types and physical properties of the actuators available to the agent. These
limit what the agent can do.
Sensors: The types and physical properties of the sensors available to the agent. These limit
what the agent can know about the environment.
What is text?
• At the lowest level, string or streams of characters (or bytes)
"<div class="md"><p>Depends on who it's for I guess? Gender? Age? Nationality? Might help
people give you ideas. In general, I'd say something tartan maybe? House of Tartan sells
blankets and scarves etc (along with lots of other generic Scottish stuff). If your gifts are for
artsy people then you could maybe get some Charles Rennie Mackintosh related gifts (the shop
in the Lighthouse sells loads of stuff like that - might be worth having a wander around there
in general).</p></div>"
We may “case fold” (i.e. lower-case the text), remove numbers, punctuation, etc., e.g.,
“House”, “of”, “Tartan”, “sells”, “blankets”
becomes
“house”, “of”, “tartan”, “sell”, “blanket”.
Lemma: same stem, part of speech, rough word sense
cat and cats = same lemma
Wordform: the full inflected surface form
cat and cats = different wordforms
Stemming
Stemming is a text normalization technique used in natural language processing to reduce
words to their base or root form. The process involves removing prefixes, suffixes, and
inflections from words to obtain the core form, known as the stem. The resulting stems may
not always be actual words, but they represent the common base that related words share.
The purpose of stemming is to group words with similar meanings together, even if they have
different forms, so that they can be treated as the same word during text analysis. This helps
in reducing data sparsity and improving the efficiency of NLP tasks like text search,
information retrieval, and sentiment analysis.
Example of stemming:
Consider the following words:
Running
Runs
Ran
Applying stemming to these words would yield the common base "run":
Stemmed words:
Running -> run
Runs -> run
Ran -> run
Here are a few more examples of stemming:
Going -> go
Jumps -> jump
Jumping -> jump
Happily -> happi (Note: The stem "happi" is not an actual English word but represents the
common base.)
Goal: Group words that are similar
• e.g., “computer”, “computers”, “computing”, “compute”
Definition: Process for reducing inflected words to their stem or root form
In practice: rule-based models to remove suffixes of words, e.g. "-er", "-ed", "-s", "-ing"
The outcome may not be a word, e.g. comput
Usually operates on a single word without context
Fast, easy to implement, usually effective -- errors may not be “too bad”
Porter Stemmer is the most widely used algorithm
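A minimal sketch using NLTK's Porter stemmer (assuming nltk is installed); note that the resulting stem need not be a real word:

# Porter stemming sketch: rule-based suffix stripping, one word at a time, no context.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["computer", "computers", "computing", "compute"]:
    print(word, "->", stemmer.stem(word))
# All four typically reduce to the non-word stem "comput".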
Lemmatization
Lemmatization is another text normalization technique used in natural language processing to
reduce words to their base or dictionary form, called lemmas. Unlike stemming, which
simply trims off prefixes or suffixes, lemmatization takes into account the word's context and
part of speech (POS) to produce valid words.
The main goal of lemmatization is to transform different inflected forms of a word into a
single, canonical form. This allows words with the same meaning to be treated as a single
token during text analysis, improving the accuracy and interpretability of NLP tasks.
Example of lemmatization:
Consider the following words with different forms:
Running
Runs
Ran
By applying lemmatization, the words are transformed into their lemmas:
Lemmatized words:
Running -> run
Runs -> run
Ran -> run
Here are a few more examples of lemmatization:
Better -> good
Best -> good
Cats -> cat
Went -> go
In these examples, you can see that lemmatization produces valid words that represent the
base or dictionary form of the original words, taking into account the context and part of
speech. By doing so, lemmatization helps in consolidating words with the same meaning and
reducing data sparsity.
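A minimal sketch with NLTK's WordNet lemmatizer (assuming nltk and the WordNet data are installed); note that the part of speech has to be supplied for the verb and adjective forms:

# Lemmatization sketch: dictionary-based and part-of-speech aware.
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)             # make sure the WordNet data is available
lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize("running", pos="v"))  # run
print(lemmatizer.lemmatize("ran", pos="v"))      # run
print(lemmatizer.lemmatize("went", pos="v"))     # go
print(lemmatizer.lemmatize("better", pos="a"))   # good
print(lemmatizer.lemmatize("cats"))              # cat (default POS is noun)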
Text Representation
Text collections
For a large collection of text documents we define various properties
Type (or lexeme): an element of the vocabulary
A normalized unique token
N = number of all token occurrences (word count)
V = vocabulary = set of types (unique normalized tokens)
|V| = the size of the vocabulary
A vocabulary is stored in a data structure called a dictionary.
In natural language processing (NLP), a vocabulary refers to the collection of unique words
or tokens present in a corpus or a specific text dataset. Building a vocabulary is an essential
step in NLP tasks such as text classification, language modeling, and machine translation.
The vocabulary helps in representing text data in a numerical format that machine learning
algorithms can process. Here's an explanation of vocabulary in NLP with examples:
Corpus and Tokens: A corpus is a collection of text documents or sentences. Tokens are the
individual units or elements of a text, typically words or characters, obtained by splitting the
text. Consider the following corpus:
Corpus: "I love natural language processing. It is fascinating!"
Tokens: ['I', 'love', 'natural', 'language', 'processing', '.', 'It', 'is', 'fascinating', '!']
Vocabulary: The vocabulary is a set of unique tokens present in the corpus. It represents all
the distinct words that occur in the text dataset. In the above corpus, the vocabulary would be:
Vocabulary: {'I', 'love', 'natural', 'language', 'processing', '.', 'It', 'is', 'fascinating', '!'}
Vocabulary Size: The vocabulary size refers to the total number of unique words in the
vocabulary. In the above example, the vocabulary size is 10.
Word Frequency: Word frequency indicates how often each word occurs in the corpus. It
provides insights into the importance or prevalence of specific words. For example, the word
frequency of the corpus could be:
Word Frequency: {'I': 1, 'love': 1, 'natural': 1, 'language': 1, 'processing': 1, '.': 1, 'It': 1, 'is': 1,
'fascinating': 1, '!': 1}
Out-of-Vocabulary (OOV) Tokens: OOV tokens refer to words or tokens that are not present
in the vocabulary. They often occur when encountering new or unseen words in test or
production data. Proper handling of OOV tokens is crucial in NLP tasks to avoid issues with
unknown words.
Example: Suppose the corpus is "I enjoy reading books." and the vocabulary is {'I', 'enjoy',
'reading'}. If the test data contains the word "books," which is not in the vocabulary, it would
be considered an OOV token.
Building and managing an effective vocabulary is important in NLP tasks, and it involves
techniques such as tokenization, removing stop words, handling OOV tokens, and
maintaining an appropriate vocabulary size. The vocabulary enables the conversion of text
data into numerical representations that machine learning algorithms can work with for
various NLP tasks.
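A minimal sketch of building the vocabulary and word-frequency table for the example corpus above; the simple regex tokenizer here (which keeps punctuation marks as separate tokens, like the example) is an assumption:

# Build tokens, vocabulary, and word frequencies for the example corpus.
import re
from collections import Counter

corpus = "I love natural language processing. It is fascinating!"

# Simple tokenizer: words and punctuation marks as separate tokens.
tokens = re.findall(r"\w+|[^\w\s]", corpus)
vocabulary = set(tokens)                 # set of unique tokens (types)
word_freq = Counter(tokens)              # how often each token occurs

print(tokens)
print(vocabulary)
print(len(vocabulary))                   # vocabulary size: 10
print(word_freq)

# An out-of-vocabulary check for a new word:
print("books" in vocabulary)             # False -> "books" would be an OOV token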
The Dice coefficient, also known as the Sørensen–Dice coefficient or Dice similarity index,
is a similarity measure used to compare the similarity between two sets. It calculates the ratio
of twice the intersection of two sets to the sum of the sizes of the sets. The formula for the
Dice coefficient is:
Dice Coefficient (D) = 2 * |A ∩ B| / (|A| + |B|)
where:
|A ∩ B| is the size of the intersection of sets A and B.
|A| is the size of set A.
|B| is the size of set B.
The Dice coefficient ranges from 0 to 1, where 0 indicates no overlap (no shared elements)
between the sets, and 1 indicates complete overlap (both sets are identical).
Let's illustrate the Dice coefficient with some examples:
Example 1: Set A: {1, 2, 3, 4, 5} Set B: {3, 4, 5, 6, 7}
|A ∩ B| = {3, 4, 5} (intersection) |A| = 5 |B| = 5
Dice Coefficient (D) = 2 * |A ∩ B| / (|A| + |B|) = 2 * 3 / (5 + 5) = 6 / 10 = 0.6
Example 2: Set A: {apple, banana, orange} Set B: {orange, pear, peach}
|A ∩ B| = {orange} (intersection) |A| = 3 |B| = 3
Dice Coefficient (D) = 2 * |A ∩ B| / (|A| + |B|) = 2 * 1 / (3 + 3) = 2 / 6 ≈ 0.3333
Example 3: Set A: {red, green, blue} Set B: {purple, yellow}
|A ∩ B| = {} (intersection - no common elements) |A| = 3 |B| = 2
Dice Coefficient (D) = 2 * |A ∩ B| / (|A| + |B|) = 2 * 0 / (3 + 2) = 0 / 5 = 0
In these examples, we calculated the Dice coefficient for different sets, demonstrating how it
quantifies the similarity or overlap between sets. Like the Overlap coefficient, the Dice
coefficient is commonly used in various fields, such as image segmentation, natural language
processing, and data clustering, to measure the similarity between sets of data.
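A minimal sketch of the Dice coefficient as a Python function, reproducing the three examples above (the value returned for two empty sets is my convention, not from the note):

# Dice coefficient between two sets: 2 * |A ∩ B| / (|A| + |B|).
def dice_coefficient(a, b):
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0          # convention for two empty sets (an assumption)
    return 2 * len(a & b) / (len(a) + len(b))

print(dice_coefficient({1, 2, 3, 4, 5}, {3, 4, 5, 6, 7}))               # 0.6
print(dice_coefficient({"apple", "banana", "orange"},
                       {"orange", "pear", "peach"}))                    # 0.3333...
print(dice_coefficient({"red", "green", "blue"}, {"purple", "yellow"})) # 0.0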
The Jaccard similarity, also known as the Jaccard index, is a similarity measure used to
compare the similarity between two sets. It calculates the ratio of the size of the intersection
of two sets to the size of their union. The formula for the Jaccard similarity is:
Jaccard Similarity (J) = |A ∩ B| / |A ∪ B|
where:
|A ∩ B| is the size of the intersection of sets A and B.
|A ∪ B| is the size of the union of sets A and B.
The Jaccard similarity ranges from 0 to 1, where 0 indicates no similarity (no shared
elements) between the sets, and 1 indicates complete similarity (both sets are identical).
Let's illustrate the Jaccard similarity with some examples:
Example 1: Set A: {1, 2, 3, 4, 5} Set B: {3, 4, 5, 6, 7}
|A ∩ B| = {3, 4, 5} (intersection) |A ∪ B| = {1, 2, 3, 4, 5, 6, 7} (union)
Jaccard Similarity (J) = |A ∩ B| / |A ∪ B| = 3 / 7 ≈ 0.4286
Example 2: Set A: {apple, banana, orange} Set B: {orange, pear, peach}
|A ∩ B| = {orange} (intersection) |A ∪ B| = {apple, banana, orange, pear, peach} (union)
Jaccard Similarity (J) = |A ∩ B| / |A ∪ B| = 1 / 5 = 0.2
Example 3: Set A: {red, green, blue} Set B: {purple, yellow}
|A ∩ B| = {} (intersection - no common elements) |A ∪ B| = {red, green, blue, purple,
yellow} (union)
Jaccard Similarity (J) = |A ∩ B| / |A ∪ B| = 0 / 5 = 0
In these examples, we calculated the Jaccard similarity for different sets, demonstrating how
it quantifies the similarity or overlap between sets. The Jaccard similarity is widely used in
various fields, such as data mining, recommendation systems, and text analysis, to measure
the similarity between sets of data.
Example
Consider two sentences d1 and d2 below. Calculate their Jaccard similarity.
d1: James decided to quit smoking but it was not an easy decision.
d2: Though it was not an easy decision, James decided to quit smoking
Solution:
To calculate the Jaccard similarity between two sentences, we first need to preprocess the
sentences to remove any punctuation, convert all words to lowercase, and split them into
individual tokens (words). Then, we calculate the Jaccard similarity as the size of the
intersection divided by the size of the union of the tokens in the two sentences.
Let's preprocess the sentences and calculate their Jaccard similarity:
Step 1: Preprocess the sentences
d1: "James decided to quit smoking but it was not an easy decision." d2: "Though it was not
an easy decision, James decided to quit smoking."
After preprocessing, the tokenized versions of the sentences are:
d1_tokens: ['james', 'decided', 'to', 'quit', 'smoking', 'but', 'it', 'was', 'not', 'an', 'easy', 'decision']
d2_tokens: ['though', 'it', 'was', 'not', 'an', 'easy', 'decision', 'james', 'decided', 'to', 'quit',
'smoking']
Step 2: Calculate the Jaccard similarity
Intersection (common tokens): ['james', 'decided', 'to', 'quit', 'smoking', 'it', 'was', 'not', 'an',
'easy', 'decision'] Union (all unique tokens): ['james', 'decided', 'to', 'quit', 'smoking', 'but', 'it',
'was', 'not', 'an', 'easy', 'decision', 'though']
Jaccard similarity = Size of Intersection / Size of Union = 11 / 13 ≈ 0.8462
So, the Jaccard similarity between d1 and d2 is approximately 0.8462 (rounded to four
decimal places).
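A minimal sketch that reproduces this worked example in Python; the preprocessing (lowercasing and stripping punctuation before splitting into words) matches the steps above:

# Jaccard similarity between two sentences after simple preprocessing.
import re

def jaccard(d1, d2):
    # Lowercase, keep only alphabetic words, then compare the token sets.
    t1 = set(re.findall(r"[a-z]+", d1.lower()))
    t2 = set(re.findall(r"[a-z]+", d2.lower()))
    return len(t1 & t2) / len(t1 | t2)

d1 = "James decided to quit smoking but it was not an easy decision."
d2 = "Though it was not an easy decision, James decided to quit smoking"
print(round(jaccard(d1, d2), 4))   # 0.8462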
Bag-of-words
The Bag of Words (BoW) model is a common technique used in Natural Language
Processing (NLP) to represent text data in a numerical format. In this model, a document is
represented as an unordered collection or "bag" of words, where the frequency of each word
is used as a feature. The order and structure of the text are disregarded, and only the
occurrence of words is considered. Let's see some examples of the Bag of Words
representation:
Example 1: Consider the following two sentences:
Sentence 1: "I love natural language processing." Sentence 2: "Natural language processing is
fascinating."
Step 1: Create a vocabulary The vocabulary consists of all unique words present in the
sentences:
Vocabulary: ["I", "love", "natural", "language", "processing", "is", "fascinating"]
Step 2: Vectorize the sentences Convert each sentence into a numerical vector based on word
frequencies:
Vector for Sentence 1: [1, 1, 1, 1, 1, 0, 0] Vector for Sentence 2: [0, 0, 1, 1, 1, 1, 1]
Example 2: Consider another set of sentences:
Sentence 1: "The cat sits on the mat." Sentence 2: "The dog jumps over the fence."
Step 1: Create a vocabulary Vocabulary: ["The", "cat", "sits", "on", "mat", "dog", "jumps",
"over", "fence"]
Step 2: Vectorize the sentences Vectors for the sentences:
Vector for Sentence 1: [1, 1, 1, 1, 1, 0, 0, 0, 0] Vector for Sentence 2: [1, 0, 0, 0, 0, 1, 1, 1, 1]
In both examples, we have converted the sentences into numerical vectors based on word
frequencies. Each element in the vector corresponds to a word in the vocabulary, and its
value represents the number of occurrences of that word in the sentence. The Bag of Words
representation is straightforward and easy to implement, but it doesn't consider word order or
semantic meaning, which can be limiting for certain NLP tasks. Despite its limitations, the
Bag of Words model is a fundamental concept that forms the basis for more advanced text
processing techniques in NLP.
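A minimal hand-rolled sketch of the bag-of-words vectors for Example 1 (in practice a library class such as scikit-learn's CountVectorizer is typically used; the helper below is just illustrative and counts case-insensitively):

# Bag-of-words vectors for Example 1, built against a fixed vocabulary.
import re

vocabulary = ["I", "love", "natural", "language", "processing", "is", "fascinating"]

def bow_vector(sentence, vocab):
    # Count how many times each vocabulary word appears in the sentence
    # (case-insensitive, punctuation ignored).
    tokens = re.findall(r"\w+", sentence.lower())
    return [tokens.count(word.lower()) for word in vocab]

s1 = "I love natural language processing."
s2 = "Natural language processing is fascinating."
print(bow_vector(s1, vocabulary))   # [1, 1, 1, 1, 1, 0, 0]
print(bow_vector(s2, vocabulary))   # [0, 0, 1, 1, 1, 1, 1]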
Term Frequency & Bag-of-words
Binary bag-of-words (term presence):
Document 1   1 1 1 1 1 0 0 0 0 0 0
Document 2   0 0 1 1 1 1 1 0 0 0 0
Document 3   0 0 0 0 1 1 0 1 1 1 1

Term frequency counts:
             The  cat  sits  on  mat  dog  jumps  over  fence  is  soft
Document 1    2    1    1    1    1    0     0     0      0    0    0
Document 2    2    0    0    0    0    1     1     1      1    0    0
Document 3    1    0    0    0    1    0     0     0      0    1    1
In both examples, we created Document-Term Matrices by counting the occurrences of each
term in each document. The DTM allows us to represent textual data in a structured
numerical format that can be used as input for various machine learning algorithms in NLP
tasks.
Cosine Similarity
Cosine similarity measures the similarity between two vectors as the cosine of the angle
between them:
Cosine Similarity = (A · B) / (||A|| * ||B||)
Where: A · B is the dot product between vectors A and B, ||A|| is the magnitude (Euclidean
norm) of vector A, and ||B|| is the magnitude (Euclidean norm) of vector B.
Example: consider two document vectors
D1 = (1, 0, 3) D2 = (4, 2, 1)
Step 1: Calculate the dot product (A · B): A · B = (1 * 4) + (0 * 2) + (3 * 1) = 4 + 0 + 3 = 7
Step 2: Calculate the magnitude (Euclidean norm) of each vector: ||A|| = sqrt((1^2) + (0^2) +
(3^2)) = sqrt(1 + 0 + 9) = sqrt(10) ≈ 3.1623; ||B|| = sqrt((4^2) + (2^2) + (1^2)) = sqrt(16 + 4 +
1) = sqrt(21) ≈ 4.5826
Step 3: Calculate the cosine similarity: Cosine Similarity = (A · B) / (||A|| * ||B||) = 7 / (3.1623
* 4.5826) ≈ 7 / 14.4914 ≈ 0.4830
So, the cosine similarity between D1 and D2 is approximately 0.4830 (rounded to four
decimal places).
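A minimal NumPy sketch that reproduces this calculation:

# Cosine similarity between two term-frequency vectors.
import numpy as np

def cosine_similarity(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

d1 = [1, 0, 3]
d2 = [4, 2, 1]
print(round(cosine_similarity(d1, d2), 4))   # 0.483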
Document frequency
Rare terms are more informative than frequent terms (recall “stop” words).
Consider a term in the query that is rare in the collection (e.g., arachnocentric).
A document containing this term is very likely to be useful for the query arachnocentric
- it’s certainly not the kind of word that just happens to appear in a document by chance.
We therefore want a high weight for rare terms like arachnocentric.
Frequent terms are less informative than rare terms.
Consider a term that is frequent in the collection (e.g., high, increase, line).
A document containing such a term is more likely to be about that word than a document
that doesn’t, but it’s not a sure indicator of aboutness.
For frequent terms, we still want positive weights for words like high, increase, and line,
but lower weights than for rare terms.
We will use document frequency (df) to capture this, i.e. the number of documents in a corpus
that contain a term. The inverse document frequency of a term t is idf_t = log10(N / df_t),
where N is the total number of documents in the collection.
Example
Consider a document containing 100 words wherein the word cat appears 3 times. Now,
assume we have 10 million documents and the word cat appears in one thousand of
these.
What is the natural term frequency?
What is the IDF value for cat (log_10)?
What is the TF-IDF score (with log_2 for TF)?
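A sketch of one way to answer these questions, assuming “natural” term frequency means the raw count and the log-scaled TF is 1 + log2(tf); other weighting conventions would give different numbers:

# TF-IDF for the "cat" example, under the assumed weighting conventions above.
import math

tf_raw = 3              # "cat" appears 3 times in the document
N = 10_000_000          # total documents in the collection
df = 1_000              # documents containing "cat"

tf_natural = tf_raw                        # natural TF = raw count: 3
idf = math.log10(N / df)                   # log10(10,000,000 / 1,000) = 4.0
tf_log2 = 1 + math.log2(tf_raw)            # 1 + log2(3) ≈ 2.585
tf_idf = tf_log2 * idf                     # ≈ 10.34

print(tf_natural, idf, round(tf_log2, 3), round(tf_idf, 2))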