CSC413 Lecture Note

The document outlines the machine learning workflow, including steps from data collection to model prediction, and discusses various types of machine learning such as supervised, unsupervised, and reinforcement learning. It elaborates on model evaluation metrics, the importance of addressing model bias, and techniques for regularization to prevent overfitting. Additionally, it covers natural language processing concepts like tokenization, normalization, stemming, and lemmatization, emphasizing their roles in text analysis.

Machine Learning

Machine learning workflow:


Step 1: Data Collection
Step 2: Data Cleaning
Step 3: Model Development
Step 4: Model Training
Step 5: Model Evaluation
Step 6: Hyperparameter Tuning
Step 7: Model Prediction
Step 8: Feedback Incorporation (optional)

Data collection involves gathering the raw data needed for the problem.


Data cleaning involves handling missing values, correcting data formats, and standardizing and
normalizing the data.
Model development is the process of selecting and building a suitable model.
Model training is the process of fitting the model's parameters to the training data.
Model evaluation is the process of measuring a model's performance so that it can be improved.
Hyperparameter tuning is the process of choosing the best hyperparameters for a model.
Model prediction is the process of making predictions with a trained model.
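A minimal end-to-end sketch of this workflow in Python (assuming scikit-learn is available; the dataset and model choices are illustrative only, not prescribed by these notes):

# Minimal, illustrative machine-learning workflow sketch (assumes scikit-learn is installed).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Steps 1-2: data collection and cleaning (here a built-in, already-clean dataset)
X, y = load_iris(return_X_y=True)

# Steps 3-4: model development and training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Step 5: model evaluation
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))

# Step 6: hyperparameter tuning (search over the regularization strength C)
search = GridSearchCV(LogisticRegression(max_iter=1000), {"C": [0.1, 1.0, 10.0]}, cv=5)
search.fit(X_train, y_train)
print("best C:", search.best_params_)

# Step 7: model prediction on new (here: held-out) data
print("predictions:", search.predict(X_test[:5]))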

Machine learning is divided into:


 Supervised learning
 Unsupervised learning
 Reinforcement learning
Supervised learning is a type of learning in which the data is labelled. Supervised learning is
divided into:
 Classification
 Regression
Classification
Classification deals with categorical targets, e.g. email spam detection.
Examples of classification models are:
 Logistic regression
 Naïve Bayes
 K nearest neighbour
 Support Vector Machine
Regression
Regression deals with continuous targets, e.g. stock price prediction, rainfall prediction, etc.
Examples of regression models are:
 Linear regression
 Ridge regression

Regression Models

 Linear Regression: Used to predict continuous numerical values.


o Example: Predicting house prices based on features like square footage,
number of bedrooms, and location.
 Ridge Regression: A regularization technique that adds a penalty term to the loss
function to prevent overfitting.
o Example: Predicting customer churn.
 Lasso Regression: Another regularization technique that can be used for feature
selection.
o Example: Predicting stock prices.

Classification Models

 Logistic Regression: Used to predict binary categorical values (e.g., 0 or 1).


o Example: Classifying emails as spam or not spam.
 Decision Trees: Creates a tree-like structure to make decisions based on features.
o Example: Predicting whether a customer will purchase a product based on
their demographics and browsing history.
 Random Forests: An ensemble of decision trees, often used for improved accuracy
and robustness.
o Example: Classifying images as cats or dogs.
 Support Vector Machines (SVMs): Finds the optimal hyperplane to separate data
points of different classes.
o Example: Identifying fraudulent credit card transactions.
 Naive Bayes: Assumes features are independent given the class label.
o Example: Spam filtering.
 K-Nearest Neighbors (KNN): Predicts the class or value of a new data point based
on the majority class or average value of its k nearest neighbors.

Example: Recommender systems.


Unsupervised Learning
Unsupervised learning is learning from data that is not labelled. Unsupervised learning is
categorized into:
 Clustering, e.g. identifying fake news, document analysis.
Examples of clustering models are:
 K-means, hierarchical clustering, etc.
 Dimensionality reduction, e.g. analysis of written text.
Examples of dimensionality reduction models:
 Principal component analysis.

Types of Unsupervised Learning:

1. Clustering:
o Groups similar data points together.
o Examples: customer segmentation, image segmentation.
o Algorithms: K-means clustering, hierarchical clustering, DBSCAN.
2. Dimensionality Reduction:
o Reduces the number of features while preserving essential information.
o Examples: visualizing high-dimensional data, feature engineering.

Algorithms: Principal Component Analysis (PCA), t-SNE, Autoencoders.
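A short illustrative sketch of both families (assuming scikit-learn and NumPy; the random data is made up purely for demonstration):

# Illustrative sketch of the two unsupervised-learning families above (assumes scikit-learn).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))          # unlabelled data: 100 points, 5 features

# Clustering: group similar points into k clusters
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("cluster labels:", kmeans.labels_[:10])

# Dimensionality reduction: keep 2 components that preserve most of the variance
X_2d = PCA(n_components=2).fit_transform(X)
print("reduced shape:", X_2d.shape)    # (100, 2)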


Metrics

Metrics: metrics are used to measure the performance of a model. There is no single best
metric; the choice usually depends on the nature of the problem, and factors such as the dataset
and what needs to be achieved also influence it. Examples of metrics are accuracy, precision,
recall, root mean squared error, and the ROC curve and its AUC.

Classification Metrics

Accuracy is the fraction of predictions that are correct:

Accuracy = # correct / # total = (TP + TN) / (TP + FP + FN + TN)

Accuracy alone can be a misleading measure because it hides class imbalance: one class may be
rare, e.g. HIV-positive cases or fraudulent transactions.

Precision ("exactness"): what percentage of the instances that the classifier labelled as positive
are actually positive?

Precision = TP / (TP + FP)

Recall ("completeness"): what percentage of the positive instances did the classifier label as
positive?
Recall = TP / (TP + FN)
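As an illustration, the three quantities above can be computed directly from confusion-matrix counts (the counts below are invented for illustration):

# Computing accuracy, precision, and recall from confusion-matrix counts,
# following the formulas above (the counts are made up for illustration).
TP, FP, FN, TN = 40, 10, 5, 945   # e.g. a rare positive class

accuracy  = (TP + TN) / (TP + FP + FN + TN)
precision = TP / (TP + FP)
recall    = TP / (TP + FN)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
# Accuracy is high (0.985) even though 5 of the 45 positives were missed:
# this is the class-imbalance problem mentioned above.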
Model Bias
Model Bias: this is a situation in which a model is biased against (or towards) a certain group,
such as a race, gender, or religion. In short, we do not want a racist algorithm. Such bias does
not appear overnight; it starts with the data. If you do not examine and correct your data for
this issue, the model will inherit the problem, because the model learns from whatever data it
is fed.

Consider a dataset whose features include a column B giving the proportion of Black residents
by town (the classic Boston housing dataset is a well-known example). A feature like this should
be handled, e.g. by deleting it: if it is allowed to be part of what is given to the model, you
will end up with a racist algorithm.
Train Test Split
Train Test Split: this is the process of splitting data into a training set and a test set. In machine
learning it is advisable to divide the data into three parts:
a. Training set: used during training to fit the model.
b. Validation set: used for validation during model evaluation.
c. Test set: used to test the model after training; this should be data the model has not seen
before.

Train Test Split Ratio: when dividing data into training, validation, and test sets, 80% is
usually assigned to the training set, and the remaining 20% is split in two for validation and testing.
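A sketch of the 80/10/10 split described above (assuming scikit-learn; X and y stand in for any features and labels):

# Sketch of an 80/10/10 train/validation/test split (assumes scikit-learn).
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(1000).reshape(-1, 1)
y = np.arange(1000)

# First hold out 20%, then split that 20% in half for validation and test.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.2, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

print(len(X_train), len(X_val), len(X_test))   # 800 100 100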

Overfitting and Underfitting

Overfitting: a model is said to be overfitting when it performs well on training data but badly
on test data. The major causes of overfitting are lack of data and, sometimes, too much (or too
little) training time. Underfitting: a model is said to be underfitting when it performs badly
both during training and during testing. The major causes of underfitting are lack of data or a
model that is too simple.

Regularization

Regularization: this is a technique used to reduce or avoid overfitting. Some common
regularization techniques are:

a. Data Augmentation
b. Weight Decay
c. Dropout
d. Undersampling
e. Oversampling
f. SMOTE

Data Augmentation: the more data a model has for training, the better it is at predicting unseen
data. However, getting enough data is often not easy; it is usually time-consuming and expensive.
One of the simplest remedies is to add synthetic data to the existing data. Synthetic data is
artificial data created from the existing data through transformations such as rotation,
translation, cropping, and affine warps. Weight Decay: in weight decay, an extra term is added
to the original loss function to form a regularized loss function; the extra term is known as the
regularization term. Dropout: the basic idea behind dropout is to stop neurons from becoming
overly dependent on other neurons. During each training pass a node is turned off with
probability p and kept with probability 1 − p; in this way co-adaptation is avoided.
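As a sketch of what weight decay and dropout look like in practice, here is a minimal PyTorch example (assuming PyTorch is installed; the layer sizes and hyperparameter values are illustrative assumptions):

# Sketch of weight decay and dropout in PyTorch (layer sizes and hyperparameters are illustrative).
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(20, 64),
    nn.ReLU(),
    nn.Dropout(p=0.5),      # each hidden unit is dropped with probability p during training
    nn.Linear(64, 2),
)

# Weight decay adds a penalty on large weights to the loss via the optimizer.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)

x = torch.randn(8, 20)
model.train()               # dropout active
out_train = model(x)
model.eval()                # dropout disabled at evaluation time
out_eval = model(x)
print(out_train.shape, out_eval.shape)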

Reinforcement Learning
In reinforcement learning the goal is for the agent to reach a goal state while collecting the
maximum reward.
An example of a reinforcement learning agent is the Q-learning agent.
In reinforcement learning we have states, actions, and rewards. The PEAS framework is used to
specify an agent's task environment and how its performance is measured:
P - Performance measure
E - Environment
A - Actuators
S - Sensors
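A minimal sketch of the tabular update behind the Q-learning agent mentioned above, using the standard rule Q(s,a) ← Q(s,a) + α(r + γ max_a' Q(s',a') − Q(s,a)); the states, actions, and transition below are invented for illustration:

# Minimal tabular Q-learning update sketch (states, actions, and the sample transition are made up).
from collections import defaultdict

Q = defaultdict(float)          # Q-table: (state, action) -> estimated return
alpha, gamma = 0.1, 0.9         # learning rate and discount factor
actions = ["left", "right"]

def q_update(s, a, r, s_next):
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

# One example transition: in state s0 the agent took "right", got reward 1, reached s1.
q_update("s0", "right", 1.0, "s1")
print(Q[("s0", "right")])       # 0.1 after a single update from zero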

Agent

An agent is anything that can be viewed as perceiving its environment through sensors and
acting upon that environment through actuators (includes humans, robots, chatbots,
thermostats).

An analysis framework: a whole-agent view of intelligent agents based on rationality

Performance measure: defines what "good behaviour" is in a specific context. There is no one
fixed measure across all tasks, but we insist on an objective measure that an outside authority
can establish (i.e. the agent should not itself define what is good, as humans tend to do).
It is better to design performance measures according to what you (as the designer) want in the
environment, rather than according to how one thinks the agent should behave.

Environment: a specification of the physical (or virtual) environment the agent is expected to
operate in.
Actuators: the types and physical properties of the actuators available to the agent. Limits
what the agent can do.

Sensors: the types and physical properties of the sensors available to the agent. Limits what
the agent can know about the environment.

Example: Designing an automated taxi driver

 Performance measure? safety, reach destination, maximise profits, obey laws,


passenger comfort...
 Environment? urban streets, motorways, traffic, pedestrians, weather, customers...
 Actuators? steer, accelerate, brake, horn, speak/display,...
 Sensors? video, accelerometers, GPS, engine sensors, keyboard,...
Natural Language Processing

Introduction to text: Tokenization, vector representation, cosine similarity, and lemmatization

What is text?
• At the lowest level, string or streams of characters (or bytes)

“<div class=”md”><p>Depends on who it's for I guess? Gender? Age? Nationality? Might help
people give you ideas. In general, I'd say something tartan maybe? House of Tartan sells
blankets and scarves etc (along with lots of other generic Scottish stuff). If your gifts are for
artsy people then you could maybe get some Charles Rennie Mackintosh related gifts (the shop
in the Lighthouse sells loads of stuff like that - might be worth having a wander around there
in general).</p></div>”

What is this text about?


What Scottish stuff have you guys given as gifts?
• Some of those character are markup – we can use an HTML parser to remove the tags and
focus on the main text
 Fairly easy for JSON, HTML & XML, might be more difficult for Word, PDF, etc.
Text Processing
Most tasks need to perform text normalization
1. Segmenting/tokenizing terms in running text, e.g. “House of Tartan sells blankets” →
“House”, “of”, “Tartan”, “sells”, “blankets”
Processing Text: Tokenisation
Tokenization: Tokenization is the process of breaking a text into individual units, which are
typically words or subwords (n-grams). These individual units are called tokens. The main
goal of tokenization is to divide the text into meaningful chunks, making it easier to process
and analyze the language data.
 Assuming (HTML) markup is removed, we need to identify the "tokens", or separate
terms
 Separate the sequence of characters into a sequence of tokens (roughly “words”)
 A token (or term) is the technical name for a meaningful sequence of characters
 In English, this may involve:
 Splitting on punctuation: _-.?!,;:"()'&£$
 Splitting on whitespace characters: " " TAB \n \r
 Other languages may be more difficult to tokenise
Example
Input Text: "Hello, how are you?"
Tokenization Output: ["Hello", ",", "how", "are", "you", "?"]
Exercise
Tokenize the string below
[He didn’t like the U.S. movie “Snakes on a train, revenge of Viper-man!”, now playing in
the U.K.]
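A simple regex-based tokenizer sketch (one possible approach, not the only one) reproduces the example output above; applied to the exercise string, a rule this simple would split "U.S." into separate tokens, which illustrates why tokenization is harder than it looks:

# Simple regex tokenizer: split out word characters and keep punctuation as separate tokens.
import re

def tokenize(text):
    # \w+ matches runs of word characters; [^\w\s] matches single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("Hello, how are you?"))
# ['Hello', ',', 'how', 'are', 'you', '?']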

2. Normalize or ‘canonicalize’ tokens into a normal form

Normalization: Normalization is the process of transforming text into a standard, consistent


format. The objective of normalization is to eliminate variations in the text that do not
contribute to the meaning but could affect the analysis or modeling. It includes various
operations like converting text to lowercase, removing punctuation, handling special
characters, and expanding contractions.
Example:
Original Text: "I haven't seen it!"
Normalization Output: "i have not seen it"
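A small normalization sketch (the contraction list here is a tiny illustrative sample, not a complete one):

# Normalization sketch: lowercase, expand a few contractions, strip punctuation.
import re

CONTRACTIONS = {"haven't": "have not", "i'd": "i would", "it's": "it is"}  # illustrative sample

def normalize(text):
    text = text.lower()
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    text = re.sub(r"[^\w\s]", "", text)       # remove punctuation
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace

print(normalize("I haven't seen it!"))        # "i have not seen it"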

 May “case fold” aka lower-case text, remove numbers, punctuation, etc., e.g.,
“House”, “of”, “Tartan”, “sells”, “blankets” →
“house”, “of”, “tartan”, “sell”, “blanket”.
 Lemma: same stem, part of speech, rough word sense
 cat and cats = same lemma
 Wordform: the full inflected surface form
 cat and cats = different wordforms
Stemming
Stemming is a text normalization technique used in natural language processing to reduce
words to their base or root form. The process involves removing prefixes, suffixes, and
inflections from words to obtain the core form, known as the stem. The resulting stems may
not always be actual words, but they represent the common base that related words share.
The purpose of stemming is to group words with similar meanings together, even if they have
different forms, so that they can be treated as the same word during text analysis. This helps
in reducing data sparsity and improving the efficiency of NLP tasks like text search,
information retrieval, and sentiment analysis.
Example of stemming:
Consider the following words:
Running
Runs
Ran
Applying stemming to these words would yield the common base "run":
Stemmed words:
Running -> run
Runs -> run
Ran -> run
Here are a few more examples of stemming:
Going -> go
Jumps -> jump
Jumping -> jump
Happily -> happi (Note: The stem "happi" is not an actual English word but represents the
common base.)
 Goal: Group words that are similar
• e.g., “computer”, “computers”, “computing”, “compute”
 Definition: Process for reducing inflected words to their stem or root form
 In practice: rule-based models to remove suffixes of words , e.g. "-er", "-ed", "-s", "-
ing”
 The outcome may not be a word, e.g. comput
 Usually operates on a single word without context
 Fast, easy to implement, usually effective -- errors may not be “too bad”
 Porter Stemmer is the most widely used algorithm
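A short sketch using NLTK's Porter stemmer (assuming nltk is installed):

# Stemming with NLTK's Porter stemmer.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["computer", "computers", "computing", "compute", "running", "jumps"]:
    print(word, "->", stemmer.stem(word))
# The "comput*" family all reduce to the non-word stem "comput",
# illustrating that stems need not be real words.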

Lemmatization
Lemmatization is another text normalization technique used in natural language processing to
reduce words to their base or dictionary form, called lemmas. Unlike stemming, which
simply trims off prefixes or suffixes, lemmatization takes into account the word's context and
part of speech (POS) to produce valid words.
The main goal of lemmatization is to transform different inflected forms of a word into a
single, canonical form. This allows words with the same meaning to be treated as a single
token during text analysis, improving the accuracy and interpretability of NLP tasks.
Example of lemmatization:
Consider the following words with different forms:
Running
Runs
Ran
By applying lemmatization, the words are transformed into their lemmas:
Lemmatized words:
Running -> run
Runs -> run
Ran -> run
Here are a few more examples of lemmatization:
Better -> good
Best -> good
Cats -> cat
Went -> go
In these examples, you can see that lemmatization produces valid words that represent the
base or dictionary form of the original words, taking into account the context and part of
speech. By doing so, lemmatization helps in consolidating words with the same meaning and
reducing data sparsity.

 Definition: process of grouping together the different inflected forms of a word to a


base form
 Uses context (part-of-speech patterns) to be more precise
 E.g. meeting (noun) vs. to meet (verb)
 Have to find correct dictionary headword form
 e.g. using a linguistic language dictionary like Wordnet
 More linguistically principled than stemming, but depends on accuracy of context and
completeness of dictionary
 Processing is slower than stemming
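A short sketch using NLTK's WordNet lemmatizer (assuming nltk is installed; the WordNet data may need a one-time download):

# Lemmatization with NLTK's WordNet lemmatizer.
import nltk
nltk.download("wordnet", quiet=True)      # one-time download of the WordNet data
nltk.download("omw-1.4", quiet=True)
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("cats"))                # cat
print(lemmatizer.lemmatize("running", pos="v"))    # run   (needs the verb POS tag)
print(lemmatizer.lemmatize("went", pos="v"))       # go
print(lemmatizer.lemmatize("better", pos="a"))     # good  (adjective POS tag)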

3. Segment long sequences of tokens (usually into sentences)


- Build a binary classifier (later) and/or use heuristic rules, e.g.
[Might, help, people, give, you, ideas, ., I'd, say, something, tartan, maybe, ?] →
<s>Might help people give you ideas.</s><s>I'd say something tartan maybe?</s>

Text Representation
Text collections
 For a large collection of text documents we define various properties
 Type (or lexeme): an element of the vocabulary
 A normalized unique token
 N = number of all token occurrences (word count)
 V = vocabulary = set of types (unique normalized tokens) |V| is the size of the
vocabulary
 A vocabulary is stored in a data structure called a dictionary.
In natural language processing (NLP), a vocabulary refers to the collection of unique words
or tokens present in a corpus or a specific text dataset. Building a vocabulary is an essential
step in NLP tasks such as text classification, language modeling, and machine translation.
The vocabulary helps in representing text data in a numerical format that machine learning
algorithms can process. Here's an explanation of vocabulary in NLP with examples:
Corpus and Tokens: A corpus is a collection of text documents or sentences. Tokens are the
individual units or elements of a text, typically words or characters, obtained by splitting the
text. Consider the following corpus:
Corpus: "I love natural language processing. It is fascinating!"
Tokens: ['I', 'love', 'natural', 'language', 'processing', '.', 'It', 'is', 'fascinating', '!']
Vocabulary: The vocabulary is a set of unique tokens present in the corpus. It represents all
the distinct words that occur in the text dataset. In the above corpus, the vocabulary would be:
Vocabulary: {'I', 'love', 'natural', 'language', 'processing', '.', 'It', 'is', 'fascinating', '!'}
Vocabulary Size: The vocabulary size refers to the total number of unique words in the
vocabulary. In the above example, the vocabulary size is 10.
Word Frequency: Word frequency indicates how often each word occurs in the corpus. It
provides insights into the importance or prevalence of specific words. For example, the word
frequency of the corpus could be:
Word Frequency: {'I': 1, 'love': 1, 'natural': 1, 'language': 1, 'processing': 1, '.': 1, 'It': 1, 'is': 1,
'fascinating': 1, '!': 1}
Out-of-Vocabulary (OOV) Tokens: OOV tokens refer to words or tokens that are not present
in the vocabulary. They often occur when encountering new or unseen words in test or
production data. Proper handling of OOV tokens is crucial in NLP tasks to avoid issues with
unknown words.
Example: Suppose the corpus is "I enjoy reading books." and the vocabulary is {'I', 'enjoy',
'reading'}. If the test data contains the word "books," which is not in the vocabulary, it would
be considered an OOV token.
Building and managing an effective vocabulary is important in NLP tasks, and it involves
techniques such as tokenization, removing stop words, handling OOV tokens, and
maintaining an appropriate vocabulary size. The vocabulary enables the conversion of text
data into numerical representations that machine learning algorithms can work with for
various NLP tasks.
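A sketch of building a vocabulary, word frequencies, and a simple OOV fallback for the example corpus above (the <UNK> token is a common convention assumed here, not something fixed by the notes):

# Vocabulary, word frequencies, and OOV handling for the example corpus.
import re
from collections import Counter

corpus = "I love natural language processing. It is fascinating!"
tokens = re.findall(r"\w+|[^\w\s]", corpus)

vocabulary = set(tokens)                 # unique tokens
word_freq = Counter(tokens)              # how often each token occurs
print(len(vocabulary), word_freq["love"])

# Map unseen words to a special <UNK> token at lookup time.
def lookup(word):
    return word if word in vocabulary else "<UNK>"

print([lookup(w) for w in ["I", "enjoy", "processing"]])   # ['I', '<UNK>', 'processing']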

Representing the text


 We can represent a piece of text by the terms that occur in it. Mathematically, this would
be a vector for each document

 This is called "one-hot" encoding


 1 in a term's column if that term occurs in the document
 0 otherwise
 We may consider this as representing each document as a set of its terms
 All of our vectors have |V| dimensions, to cover all words we might have encountered
in all documents analysed
Implementing One-Hot Encoding
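No implementation is given in the notes, so here is one minimal sketch, assuming lower-cased whitespace tokenization and a sorted shared vocabulary:

# One possible one-hot (set-of-terms) encoding: 1 if the term occurs in the
# document, 0 otherwise, over a shared |V|-dimensional vocabulary.
def build_vocab(documents):
    return sorted({term for doc in documents for term in doc.lower().split()})

def one_hot(document, vocab):
    terms = set(document.lower().split())
    return [1 if term in terms else 0 for term in vocab]

docs = ["house of tartan sells blankets", "tartan scarves and blankets"]
vocab = build_vocab(docs)
print(vocab)
print([one_hot(d, vocab) for d in docs])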
Text (Set) Similarity

Why set-based similarity?


 Work well for short pieces of text
 Person names, product titles, etc.…
 Tweets or sentences
 Simple (trivial) to compute with basic data structures
 Fundamental building block of more complex (learned) functions
The overlap coefficient, also known as the Szymkiewicz-Simpson coefficient, is a similarity
measure used to compare the similarity between two sets. It calculates the ratio of the
intersection of two sets to the smaller of the two sets. The formula for the overlap coefficient
is:
Overlap Coefficient (O) = |A ∩ B| / min(|A|, |B|)
where:
|A ∩ B| is the size of the intersection of sets A and B.
|A| is the size of set A.
|B| is the size of set B.
The overlap coefficient ranges from 0 to 1, where 0 indicates no overlap (no shared elements)
between the sets, and 1 indicates complete overlap (both sets are identical).
Let's illustrate the overlap coefficient with some examples:
Example 1: Set A: {1, 2, 3, 4, 5} Set B: {3, 4, 5, 6, 7}
|A ∩ B| = {3, 4, 5} (intersection) |A| = 5 |B| = 5
Overlap Coefficient (O) = |A ∩ B| / min(|A|, |B|) = 3 / 5 = 0.6
Example 2: Set A: {apple, banana, orange} Set B: {orange, pear, peach}
|A ∩ B| = {orange} (intersection) |A| = 3 |B| = 3
Overlap Coefficient (O) = |A ∩ B| / min(|A|, |B|) = 1 / 3 ≈ 0.3333
Example 3: Set A: {red, green, blue} Set B: {purple, yellow}
|A ∩ B| = {} (intersection - no common elements) |A| = 3 |B| = 2
Overlap Coefficient (O) = |A ∩ B| / min(|A|, |B|) = 0 / 2 = 0
In these examples, we calculated the overlap coefficient for different sets, demonstrating how
it quantifies the similarity or overlap between sets. Keep in mind that this coefficient is
commonly used in various fields, such as information retrieval and data mining, to assess the
similarity between sets of data.

The Dice coefficient, also known as the Sørensen–Dice coefficient or Dice similarity index,
is a similarity measure used to compare the similarity between two sets. It calculates the ratio
of twice the intersection of two sets to the sum of the sizes of the sets. The formula for the
Dice coefficient is:
Dice Coefficient (D) = 2 * |A ∩ B| / (|A| + |B|)
where:
|A ∩ B| is the size of the intersection of sets A and B.
|A| is the size of set A.
|B| is the size of set B.
The Dice coefficient ranges from 0 to 1, where 0 indicates no overlap (no shared elements)
between the sets, and 1 indicates complete overlap (both sets are identical).
Let's illustrate the Dice coefficient with some examples:
Example 1: Set A: {1, 2, 3, 4, 5} Set B: {3, 4, 5, 6, 7}
|A ∩ B| = {3, 4, 5} (intersection) |A| = 5 |B| = 5
Dice Coefficient (D) = 2 * |A ∩ B| / (|A| + |B|) = 2 * 3 / (5 + 5) = 6 / 10 = 0.6
Example 2: Set A: {apple, banana, orange} Set B: {orange, pear, peach}
|A ∩ B| = {orange} (intersection) |A| = 3 |B| = 3
Dice Coefficient (D) = 2 * |A ∩ B| / (|A| + |B|) = 2 * 1 / (3 + 3) = 2 / 6 ≈ 0.3333
Example 3: Set A: {red, green, blue} Set B: {purple, yellow}
|A ∩ B| = {} (intersection - no common elements) |A| = 3 |B| = 2
Dice Coefficient (D) = 2 * |A ∩ B| / (|A| + |B|) = 2 * 0 / (3 + 2) = 0 / 5 = 0
In these examples, we calculated the Dice coefficient for different sets, demonstrating how it
quantifies the similarity or overlap between sets. Like the Overlap coefficient, the Dice
coefficient is commonly used in various fields, such as image segmentation, natural language
processing, and data clustering, to measure the similarity between sets of data.

The Jaccard similarity, also known as the Jaccard index, is a similarity measure used to
compare the similarity between two sets. It calculates the ratio of the size of the intersection
of two sets to the size of their union. The formula for the Jaccard similarity is:
Jaccard Similarity (J) = |A ∩ B| / |A ∪ B|
where:
|A ∩ B| is the size of the intersection of sets A and B.
|A ∪ B| is the size of the union of sets A and B.
The Jaccard similarity ranges from 0 to 1, where 0 indicates no similarity (no shared
elements) between the sets, and 1 indicates complete similarity (both sets are identical).
Let's illustrate the Jaccard similarity with some examples:
Example 1: Set A: {1, 2, 3, 4, 5} Set B: {3, 4, 5, 6, 7}
|A ∩ B| = {3, 4, 5} (intersection) |A ∪ B| = {1, 2, 3, 4, 5, 6, 7} (union)
Jaccard Similarity (J) = |A ∩ B| / |A ∪ B| = 3 / 7 ≈ 0.4286
Example 2: Set A: {apple, banana, orange} Set B: {orange, pear, peach}
|A ∩ B| = {orange} (intersection) |A ∪ B| = {apple, banana, orange, pear, peach} (union)
Jaccard Similarity (J) = |A ∩ B| / |A ∪ B| = 1 / 5 = 0.2
Example 3: Set A: {red, green, blue} Set B: {purple, yellow}
|A ∩ B| = {} (intersection - no common elements) |A ∪ B| = {red, green, blue, purple,
yellow} (union)
Jaccard Similarity (J) = |A ∩ B| / |A ∪ B| = 0 / 5 = 0
In these examples, we calculated the Jaccard similarity for different sets, demonstrating how
it quantifies the similarity or overlap between sets. The Jaccard similarity is widely used in
various fields, such as data mining, recommendation systems, and text analysis, to measure
the similarity between sets of data.
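Sketch implementations of the three set-similarity measures, checked against Example 1 above:

# Overlap, Dice, and Jaccard coefficients, checked on Set A = {1..5}, Set B = {3..7}.
def overlap(a, b):
    return len(a & b) / min(len(a), len(b))

def dice(a, b):
    return 2 * len(a & b) / (len(a) + len(b))

def jaccard(a, b):
    return len(a & b) / len(a | b)

A = {1, 2, 3, 4, 5}
B = {3, 4, 5, 6, 7}
print(overlap(A, B))   # 0.6
print(dice(A, B))      # 0.6
print(jaccard(A, B))   # 0.42857...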

Example
Consider two sentences d1 and d2 below. Calculate their Jaccard similarity.
 d1: James decided to quit smoking but it was not an easy decision.
 d2: Though it was not an easy decision, James decided to quit smoking
Solution:
To calculate the Jaccard similarity between two sentences, we first need to preprocess the
sentences to remove any punctuation, convert all words to lowercase, and split them into
individual tokens (words). Then, we calculate the Jaccard similarity as the size of the
intersection divided by the size of the union of the tokens in the two sentences.
Let's preprocess the sentences and calculate their Jaccard similarity:
Step 1: Preprocess the sentences
d1: "James decided to quit smoking but it was not an easy decision." d2: "Though it was not
an easy decision, James decided to quit smoking."
After preprocessing, the tokenized versions of the sentences are:
d1_tokens: ['james', 'decided', 'to', 'quit', 'smoking', 'but', 'it', 'was', 'not', 'an', 'easy', 'decision']
d2_tokens: ['though', 'it', 'was', 'not', 'an', 'easy', 'decision', 'james', 'decided', 'to', 'quit',
'smoking']
Step 2: Calculate the Jaccard similarity
Intersection (common tokens): ['james', 'decided', 'to', 'quit', 'smoking', 'it', 'was', 'not', 'an',
'easy', 'decision'] Union (all unique tokens): ['james', 'decided', 'to', 'quit', 'smoking', 'but', 'it',
'was', 'not', 'an', 'easy', 'decision', 'though']
Jaccard similarity = Size of Intersection / Size of Union = 11 / 13 ≈ 0.8462
So, the Jaccard similarity between d1 and d2 is approximately 0.8462 (rounded to four
decimal places).
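The same calculation can be verified with a few lines of Python (simple regex preprocessing assumed):

# Verifying the worked example: preprocess the two sentences and compute Jaccard similarity.
import re

def word_set(sentence):
    return set(re.findall(r"\w+", sentence.lower()))

d1 = "James decided to quit smoking but it was not an easy decision."
d2 = "Though it was not an easy decision, James decided to quit smoking"

s1, s2 = word_set(d1), word_set(d2)
print(len(s1 & s2), len(s1 | s2))      # 11 13
print(len(s1 & s2) / len(s1 | s2))     # 0.8461...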

Geometric similarity & Text Distributions


 Bag-of-words text representation
 Geometric similarity
 Text distributions

Bag-of-words
The Bag of Words (BoW) model is a common technique used in Natural Language
Processing (NLP) to represent text data in a numerical format. In this model, a document is
represented as an unordered collection or "bag" of words, where the frequency of each word
is used as a feature. The order and structure of the text are disregarded, and only the
occurrence of words is considered. Let's see some examples of the Bag of Words
representation:
Example 1: Consider the following two sentences:
Sentence 1: "I love natural language processing." Sentence 2: "Natural language processing is
fascinating."
Step 1: Create a vocabulary The vocabulary consists of all unique words present in the
sentences:
Vocabulary: ["I", "love", "natural", "language", "processing", "is", "fascinating"]
Step 2: Vectorize the sentences Convert each sentence into a numerical vector based on word
frequencies:
Vector for Sentence 1: [1, 1, 1, 1, 1, 0, 0] Vector for Sentence 2: [0, 0, 1, 1, 1, 1, 1]
Example 2: Consider another set of sentences:
Sentence 1: "The cat sits on the mat." Sentence 2: "The dog jumps over the fence."
Step 1: Create a vocabulary Vocabulary: ["The", "cat", "sits", "on", "mat", "dog", "jumps",
"over", "fence"]
Step 2: Vectorize the sentences Vectors for the sentences (counting "the" case-insensitively, so it
occurs twice in each sentence):
Vector for Sentence 1: [2, 1, 1, 1, 1, 0, 0, 0, 0] Vector for Sentence 2: [2, 0, 0, 0, 0, 1, 1, 1, 1]
In both examples, we have converted the sentences into numerical vectors based on word
frequencies. Each element in the vector corresponds to a word in the vocabulary, and its
value represents the number of occurrences of that word in the sentence. The Bag of Words
representation is straightforward and easy to implement, but it doesn't consider word order or
semantic meaning, which can be limiting for certain NLP tasks. Despite its limitations, the
Bag of Words model is a fundamental concept that forms the basis for more advanced text
processing techniques in NLP.
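A manual bag-of-words sketch for Example 1 above (lower-casing everything and keeping the vocabulary in first-occurrence order, which is an assumption of this sketch):

# Manual bag-of-words vectors for the two sentences in Example 1.
import re

def tokens(text):
    return re.findall(r"\w+", text.lower())

sent1 = "I love natural language processing."
sent2 = "Natural language processing is fascinating."

vocab = list(dict.fromkeys(tokens(sent1) + tokens(sent2)))   # ordered, unique
def bow(text):
    toks = tokens(text)
    return [toks.count(term) for term in vocab]

print(vocab)          # ['i', 'love', 'natural', 'language', 'processing', 'is', 'fascinating']
print(bow(sent1))     # [1, 1, 1, 1, 1, 0, 0]
print(bow(sent2))     # [0, 0, 1, 1, 1, 1, 1]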
Term Frequency & Bag-of-words

 The one-hot encoding treated each document as a set of terms…


 If we count frequencies of terms, we are treating each document as a bag A bag allows
duplicate items
 Hence this representation is often called bag-of-words.
Bag-of-Words model
 Assumption: If a term occurs lots in a document it should imply something about what
that document is about. A relaxation of the binary occurrence assumption.
 Recording the term frequency information provides more information of
aboutness
 Bag-of-words is also a representation of text:
 We can use a dictionary exactly the same as a one-hot encoding to keep
counts of term occurrences
 E.g. [0,5,0,0,9,1,1,4,0,0,0] - dense representation
 E.g. 2:5 5:9 10:1 11:1 12:4 - sparse representation (uses less memory)
Bag-of-words representation
A Document-Term Matrix (DTM) is a tabular representation of a collection of text
documents, where rows correspond to documents, columns correspond to terms (words or n-
grams), and the cells contain the frequency of each term in each document. It is a common
way to represent text data in numerical format for various Natural Language Processing
(NLP) tasks. Let's see some examples of a Document-Term Matrix:
Example 1: Consider the following three short text documents:
Document 1: "I love natural language processing." Document 2: "Natural language
processing is fascinating." Document 3: "NLP is a subfield of artificial intelligence."
Step 1: Create a vocabulary The vocabulary consists of all unique words present in the
documents:
Vocabulary: ["I", "love", "natural", "language", "processing", "is", "fascinating", "a",
"subfield", "of", "artificial", "intelligence"]
Step 2: Build the Document-Term Matrix Count the occurrences of each term in each
document:
Document     I  love  natural  language  processing  is  fascinating  NLP  a  subfield  of  artificial  intelligence

Document 1   1  1     1        1         1           0   0            0    0  0         0   0           0

Document 2   0  0     1        1         1           1   1            0    0  0         0   0           0

Document 3   0  0     0        0         0           1   0            1    1  1         1   1           1

Example 2: Consider another set of documents:


Document 1: "The cat sits on the mat." Document 2: "The dog jumps over the fence."
Document 3: "The mat is soft."
Step 1: Create a vocabulary Vocabulary: ["The", "cat", "sits", "on", "mat", "dog", "jumps",
"over", "fence", "is", "soft"]
Step 2: Build the Document-Term Matrix Count the occurrences of each term in each
document:

Document The cat sits on mat dog jumps over fence is soft

Document 1 2 1 1 1 1 0 0 0 0 0 0

Document 2 2 0 0 0 0 1 1 1 1 0 0

Document 3 1 0 0 0 1 0 0 0 0 1 1
In both examples, we created Document-Term Matrices by counting the occurrences of each
term in each document. The DTM allows us to represent textual data in a structured
numerical format that can be used as input for various machine learning algorithms in NLP
tasks.

 An entire corpus can be represented as a document-term matrix (DTM)


 Each row is a document, represented as a vector of its word occurrences.
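A document-term matrix can also be built with scikit-learn's CountVectorizer (assuming a recent scikit-learn; note its default token pattern drops one-character tokens such as "I" and "a", and its columns come out in alphabetical order rather than the order shown above):

# Building a document-term matrix with CountVectorizer.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "I love natural language processing.",
    "Natural language processing is fascinating.",
    "NLP is a subfield of artificial intelligence.",
]

# Pass a token pattern that keeps one-character tokens like "I" and "a".
vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
dtm = vectorizer.fit_transform(docs)          # sparse matrix: rows = documents, columns = terms
print(vectorizer.get_feature_names_out())
print(dtm.toarray())                          # dense document-term matrix of term counts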
Text Geometric Similarity
Cosine similarity is a common similarity measure used in Natural Language Processing
(NLP) to compare the similarity between two text documents. It calculates the cosine of the
angle between two vectors, representing the term frequency of words in the documents. Let's
see some examples of cosine similarity in NLP:
Example 1: Consider two sentences:
Sentence 1: "I love natural language processing." Sentence 2: "Natural language processing is
fascinating."
Step 1: Create a vocabulary and vectorize the sentences Create a vocabulary containing all
unique words from the sentences, and represent each sentence as a numerical vector based on
word frequencies:
Vocabulary: ["I", "love", "natural", "language", "processing", "is", "fascinating"]
Vector for Sentence 1: [1, 1, 1, 1, 1, 0, 0] Vector for Sentence 2: [0, 0, 1, 1, 1, 1, 1]
Step 2: Calculate the Cosine Similarity Use the cosine similarity formula to calculate the
similarity between the two vectors:
Cosine Similarity = (A · B) / (||A|| * ||B||)
where (A · B) is the dot product of the vectors A and B, and ||A|| and ||B|| are the magnitudes
(Euclidean norms) of the vectors A and B, respectively.
Cosine Similarity = (1*0 + 1*0 + 1*1 + 1*1 + 1*1 + 0*1 + 0*1) / (√(1+1+1+1+1) *
√(0+0+1+1+1+1+1)) = 3 / (√5 * √5) = 3 / 5 = 0.6
Example 2: Consider two more sentences:
Sentence 1: "The cat sits on the mat." Sentence 2: "The dog jumps over the fence."
Step 1: Create a vocabulary and vectorize the sentences
Vocabulary: ["The", "cat", "sits", "on", "mat", "dog", "jumps", "over", "fence"]
Vector for Sentence 1: [2, 1, 1, 1, 1, 0, 0, 0, 0] Vector for Sentence 2: [2, 0, 0, 0, 0, 1, 1, 1, 1]
Step 2: Calculate the Cosine Similarity
Cosine Similarity = (2*2 + 1*0 + 1*0 + 1*0 + 1*0 + 0*1 + 0*1 + 0*1 + 0*1) / (√(4+1+1+1+1) *
√(4+1+1+1+1)) = 4 / (√8 * √8) = 4 / 8 = 0.5
In both examples, we calculated the cosine similarity between pairs of sentences, which gives
us a numerical value representing their similarity. Higher cosine similarity values indicate
greater similarity between the sentences, while lower values indicate less similarity. The
cosine similarity is widely used in NLP for tasks such as document clustering, information
retrieval, and recommendation systems.

Exercise: Cosine Similarity


 Consider two documents D1 and D2 represented with a bag-of-words representation
 D1 = (1, 0, 3), D2 = (4, 2, 1)
 Calculate their cosine similarity

To calculate the cosine similarity between two documents represented as bag-of-words


vectors, we use the formula:

Cosine Similarity = (A · B) / (||A|| * ||B||)

Where: A · B is the dot product between vectors A and B. ||A|| is the magnitude (Euclidean
norm) of vector A. ||B|| is the magnitude (Euclidean norm) of vector B.

Let's calculate the cosine similarity for D1 and D2:

D1 = (1, 0, 3) D2 = (4, 2, 1)
Step 1: Calculate the dot product (A · B) A · B = (1 * 4) + (0 * 2) + (3 * 1) = 4 + 0 + 3 = 7

Step 2: Calculate the magnitude (Euclidean norm) of each vector ||A|| = sqrt((1^2) + (0^2) +
(3^2)) = sqrt(1 + 0 + 9) = sqrt(10) ≈ 3.1623 ||B|| = sqrt((4^2) + (2^2) + (1^2)) = sqrt(16 + 4 +
1) = sqrt(21) ≈ 4.5826

Step 3: Calculate the cosine similarity Cosine Similarity = (A · B) / (||A|| * ||B||) = 7 / (3.1623
* 4.5826) ≈ 7 / 14.4914 ≈ 0.4830

So, the cosine similarity between D1 and D2 is approximately 0.4830 (rounded to four
decimal places).
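The exercise can be verified with NumPy:

# Verifying the exercise: cosine similarity of D1 = (1, 0, 3) and D2 = (4, 2, 1).
import numpy as np

d1 = np.array([1.0, 0.0, 3.0])
d2 = np.array([4.0, 2.0, 1.0])

cos = d1.dot(d2) / (np.linalg.norm(d1) * np.linalg.norm(d2))
print(f"{cos:.4f}")   # 0.4830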

Problem 1: Raw Term frequency (tf)


 The term frequency (tft,d) of term t in document d is defined as the number of times
that t occurs in d.
 We used raw tf in cosine similarity, but it has a problem
 A document with 10 occurrences of the term is more related than a document
with 1 occurrence of the term.
 But it is not 10 times more relevant.
 Aboutness does not increase linearly with term frequency.

Document frequency
 Rare terms are more informative than frequent terms Recall “stop” words
 Consider a term in the query that is rare in the collection (e.g., arachnocentric)
 A document containing this term is very likely to be useful for the query arachnocentric
- its certainly not the kind of word that just happens to appear in a document by chance
 We want a high weight for rare terms like arachnocentric.
 Frequent terms are less informative than rare terms
 Consider a term that is frequent in the collection (e.g., high, increase, line)
 A document containing such a term is more likely to be about that word than a document
that doesn’t
 But it’s not a sure indicator of aboutness.
 For frequent terms, we still want positive weights for words like high,
increase, and line
 But lower weights than for rare terms.
 We will use document frequency (df) to capture this
 I.e. the number of documents in a corpus that contain a term
Example
 Consider a document containing 100 words wherein the word cat appears 3 times. Now,
assume we have 10 million documents and the word cat appears in one thousand of
these.
 What is the natural term frequency?
 What is the IDF value for cat (log_10)?
 What is the TF-IDF score (using log_2 weighting for TF)?
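A numerical sketch of this exercise (the log-weighted TF variant shown, 1 + log2(count), is only one common convention and is an assumption here, not something fixed by the notes):

# Working the exercise numerically.
import math

count_in_doc, doc_length = 3, 100
num_docs, docs_with_term = 10_000_000, 1_000

tf_natural = count_in_doc / doc_length                 # 3 / 100 = 0.03
idf = math.log10(num_docs / docs_with_term)            # log10(10,000) = 4.0
print(tf_natural, idf, tf_natural * idf)               # 0.03  4.0  0.12

tf_log2 = 1 + math.log2(count_in_doc)                  # 1 + log2(3) ≈ 2.585 (assumed TF weighting)
print(tf_log2 * idf)                                   # ≈ 10.34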
