
Fall 2024

EECE 490: Introduction to Machine Learning
Chapter 2: Preparing Data for Statistical Machine Learning Algorithms
Types of Data in Machine Learning

EECE 490: Introduction to ML 2


Categories of Data In Machine Learning

Data in machine learning is studied from three different dimensions:


1. Data Source
2. Data Structure
3. Data Use Case

EECE 490: Introduction to ML 3


Data in ML based on sources

EECE 490: Introduction to ML 4


Time Series Data

Time series data is data collected


over time at regular or irregular
intervals.

EECE 490: Introduction to ML 5


Time Series Data: Popular Real-World Use Case

Netflix uses time series data to analyze:

- User Viewing Patterns


- Content Engagement Over Time
- Personalized Recommendations
- Seasonal Trends
- User Retention Strategies

EECE 490: Introduction to ML 6


Text Data
Text data can be:

- Natural Language: any language that


occurs naturally in a human community
by a process of use, repetition, and
change.
- Programming Language: Python, C++,
Java,..
- Markup Language: HTML, XML,..
- Symbolic Text: mathematical equations
- Log Files: logs reported by software or
hardware
EECE 490: Introduction to ML 7
Text Data: Real-World Use Case

AI personal assistants have gained extreme popularity in the last couple of years,
helping automate or accelerate day-to-day work.

EECE 490: Introduction to ML 8


Image Data

Image data is visual data


collected via cameras or other
imaging devices.

EECE 490: Introduction to ML 9


Image Data: Real-World Use Case

Google Images uses ML on uploaded images to identify similar images available on the
internet.

EECE 490: Introduction to ML 10


Video Data

Video data refers to a


sequence of images (frames)
captured and displayed in rapid
succession over time, typically
accompanied by audio.

EECE 490: Introduction to ML 11


Video Data: Real-World Use Case

Tesla autopilot uses real-time


footage captured through
dashcams to identify
surrounding objects.

EECE 490: Introduction to ML 12


Audio Data

Audio data refers to sound signals


captured in digital or analog form,
typically represented as waveforms or
frequency spectrums, encompassing
speech, music, environmental sounds,
or other acoustic information for
analysis and processing.

EECE 490: Introduction to ML 13


Audio Data: Real-World Use Case

Music generation is among the most


popular use cases for audio data in
AI.

EECE 490: Introduction to ML 14


Tabular Data

Tabular data is structured data


organized in rows and columns,
resembling a table or
spreadsheet, where each row
represents an individual data
instance (e.g., a customer,
transaction, or observation), and
each column represents a
specific feature or attribute
associated with the instances.

EECE 490: Introduction to ML 15


Tabular Data: Real-World Use Case

SoFi utilizes machine learning


algorithms to assess applicants'
creditworthiness by analyzing various
data sources, including educational
attainment, utility payments, insurance
claims, and mobile phone usage.

EECE 490: Introduction to ML 16


Data in ML Based on Structure

EECE 490: Introduction to ML 17


Structured Data

EECE 490: Introduction to ML 18


Example of structured data

Data to predict the price of a house.

EECE 490: Introduction to ML 19


Unstructured Data

EECE 490: Introduction to ML 20


Example of unstructured data

Dataset for predicting


handwritten digits.

EECE 490: Introduction to ML 21


Semi-Structured Data

EECE 490: Introduction to ML 22


Semi-structured data example

Data to analyze sales patterns.

EECE 490: Introduction to ML 23


Data in ML based on purpose

EECE 490: Introduction to ML 24


Training Data

Training data is fed into a machine


learning algorithm to teach it how to
perform a task or to analyze the data
patterns. Once the algorithm is
trained, it is called a machine learning
model.

EECE 490: Introduction to ML 25


Validation Data

The validation dataset is used to check the model’s performance during the training
process. It allows us to choose which model ‘settings’ result in the highest
accuracy.

EECE 490: Introduction to ML 26


Testing Dataset

The testing dataset is used to check the performance of the model on data it has not
seen during training.

EECE 490: Introduction to ML 27


The Dataset Split

EECE 490: Introduction to ML 28


Features: Explicit and Implicit Information
in the Dataset

EECE 490: Introduction to ML 29


Stock Price Prediction Example

Let’s say we want to predict the price of a stock, what information would you look
at to make this prediction?

EECE 490: Introduction to ML 30


Stock Price Prediction Example

EECE 490: Introduction to ML 31


Stock Price Prediction Example

If we wanted to feed the machine learning algorithm training data to allow it to


learn how to predict the prices of stock, the features would be the input to the
algorithm.

EECE 490: Introduction to ML 32


Features of a Machine Learning Dataset

Features are the individual


measurable properties or
characteristics of the data used by
a machine learning model to
make predictions or decisions.

EECE 490: Introduction to ML 33


Impact of Feature Properties on ML Model

Not all features have an equal


contribution to the model’s
prediction, but all features must be
relevant.

EECE 490: Introduction to ML 34


Implicit Information within Features

The role of the machine learning model is to understand the underlying patterns
and information that the features hold. This implicit set of information is learned
and stored within the model’s trainable parameters.

EECE 490: Introduction to ML 35


Features in non-tabular data types

- Text Data: The features could be only the input text (for generation purposes) or
could include derived features like “most frequent words” and “document length”
(for prediction purposes).
- Image Data: The features could include only the color intensities inside the
image, or could also include shape, pixel depth, and similar properties.
- Time Series and Audio Data: The features could include only the timestamp
and the value for each interval, but could also include multiple other inferred
features like spectral and pitch features.

EECE 490: Introduction to ML 36


How ML Algorithms Ingest Data

EECE 490: Introduction to ML 37


Numerical Representations

Machine learning models can only


process data in numerical
representations because
mathematical computations
underlie their functionality.
Algorithms rely on numerical
operations like matrix
multiplications, dot products,
and gradient calculations, which
require data to be encoded as
numbers.

EECE 490: Introduction to ML 38


Numerical Representation of Tabular Data

The inputs and outputs of the machine learning model should be in numerical
format. Let’s say we want to predict if a person will get approved on a home loan.

EECE 490: Introduction to ML 39


Numerical Representation of Tabular Data

We have two types of features (including output): Numeric and Categorical

EECE 490: Introduction to ML 40


Numerical Representation of Tabular Data

The goal is to transform all the features to a numeric representation. Obviously, we


do not need to change the features that are already numeric, so let’s explore the
categorical features.

First, we need to import our data from Kaggle.

EECE 490: Introduction to ML 41


Numerical Representation of Tabular Data

After importing our data, we need to load it. From the dataset card on Kaggle, we
see that we have two CSV files: one for training and one for testing. We will work
on the training file.

To load the training CSV file, we will use the pandas library.
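A minimal sketch of this step; the file name "train.csv" is an assumption, so use the name given on the dataset card.

```python
import pandas as pd

# Load the training split of the home-loan dataset.
# "train.csv" is a placeholder file name.
df = pd.read_csv("train.csv")

print(df.shape)    # (number of rows, number of columns)
print(df.head())   # first five rows of the table
```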

EECE 490: Introduction to ML 42


Numerical Representation of Tabular Data

Now that we have our dataset, let’s select the categorical features. We can identify
which features are categorical by using the .info() method in pandas, which reports
the data type of each feature (categorical features appear with the ‘object’ dtype).
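A minimal sketch, assuming df is the DataFrame loaded above:

```python
# Categorical (string) features are reported with dtype "object".
df.info()
```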

EECE 490: Introduction to ML 43


Numerical Representation of Tabular Data

Alternatively, we can use the .select_dtypes() to select the features that have the
‘object’ type and print a list of their names.
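A sketch of the same selection with .select_dtypes(), again assuming df is the loaded DataFrame:

```python
# Keep only the columns whose dtype is "object" and list their names.
categorical_cols = df.select_dtypes(include="object").columns.tolist()
print(categorical_cols)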

EECE 490: Introduction to ML 44


Numerical Representation of Tabular Data

The process of transforming a


categorical feature into its
numeric representation is called
‘encoding’.

EECE 490: Introduction to ML 45


Examples of encoding methods

Feature: Color of a rose: Red, White, Yellow, Pink

1. Label Encoding: 1, 2, 3, 0 (assigned in alphabetical order of the categories
unless specified otherwise, so Pink = 0, Red = 1, White = 2, Yellow = 3)
2. One-hot encoding: [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]
3. Ordinal Encoding: 0, 1, 2, 3 (according to order of occurrence or hierarchy)
4. Target Encoding: 0.8, 0.7, 0.9, 0.6 (mean target value, e.g., price or rating,
calculated for each category)

EECE 490: Introduction to ML 46


Numerical Representation of Tabular Data

Let’s explore how we can use each encoding type in python code for the ‘property
area’ feature in our dataset. The first thing we need to do is take a look at the
‘categories’ of this feature and their values.
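A quick way to inspect those categories; the exact column name ("Property_Area") is an assumption, so check df.columns for the name used in your copy of the data.

```python
# List the categories of the property-area feature and how often each occurs.
print(df["Property_Area"].unique())
print(df["Property_Area"].value_counts())
```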

EECE 490: Introduction to ML 47


Numerical Representation of Tabular Data

Using label encoding:
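A minimal sketch with scikit-learn, assuming df and the "Property_Area" column from above:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
# fit_transform assigns integer labels in alphabetical order of the categories.
df["Property_Area_label"] = le.fit_transform(df["Property_Area"])
print(dict(zip(le.classes_, range(len(le.classes_)))))  # category -> integer
```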

EECE 490: Introduction to ML 48


Numerical Representation of Tabular Data

Using One Hot Encoding:
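A minimal sketch using pandas, under the same column-name assumption:

```python
# One binary column per category; exactly one of them is 1 for each row.
one_hot = pd.get_dummies(df["Property_Area"], prefix="Property_Area")
df = pd.concat([df, one_hot], axis=1)
print(one_hot.head())
```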

EECE 490: Introduction to ML 49


Numerical Representation of Tabular Data

Using ordinal encoding:
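A minimal sketch with scikit-learn; the category order below is an assumed hierarchy for illustration only:

```python
from sklearn.preprocessing import OrdinalEncoder

# Assumed ordering: Rural < Semiurban < Urban.
encoder = OrdinalEncoder(categories=[["Rural", "Semiurban", "Urban"]])
df["Property_Area_ordinal"] = encoder.fit_transform(df[["Property_Area"]]).ravel()
print(df[["Property_Area", "Property_Area_ordinal"]].head())
```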

EECE 490: Introduction to ML 50


Numerical Representation of Tabular Data

Using target encoding with ‘Loan Amount Term’ as the target variable.
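A minimal sketch of target encoding with a groupby mean; both column names ("Property_Area", "Loan_Amount_Term") are assumptions about the dataset.

```python
# Replace each category with the mean of the chosen numeric target
# computed over the rows belonging to that category.
target_means = df.groupby("Property_Area")["Loan_Amount_Term"].mean()
df["Property_Area_target"] = df["Property_Area"].map(target_means)
print(target_means)
```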

EECE 490: Introduction to ML 51


Numerical Representation of Image Data

To understand how to represent an image in a numerical format, we need to


understand what makes an image. The smallest building block of an image is
called a pixel.

EECE 490: Introduction to ML 52


Numerical Representation of Image Data

The size of an image is described by the number of pixels in the height and the
width.

Example: 1920x1080. This means the image is 1,920 pixels wide and 1,080 pixels
tall, resulting in a grand total of 2,073,600 pixels.

Each pixel has an intensity value that describes the intensity of color at that pixel.

EECE 490: Introduction to ML 53


Numerical Representation of Image Data

The matrix of pixel intensities in


an image is used as a numeric
representation of that image.
The range of an image’s pixel
intensity depends on the image
type.
We will go through the main image
types: Black and white, Grayscale,
RGB, RGBA, Multi-Spectral, and
Depth Maps

EECE 490: Introduction to ML 54


Manipulating Image Data: OpenCV Library

To visualize the pixel values, we will use the OpenCV library to load the image and
print its values.

For readability, we will resize the images to a smaller size.
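A minimal sketch of those two steps; "example.jpg" is a placeholder for any local image file.

```python
import cv2

# Load the image as a 2-D array of grayscale intensities (0-255).
img = cv2.imread("example.jpg", cv2.IMREAD_GRAYSCALE)

# Shrink to 8x8 so the pixel matrix is readable when printed.
small = cv2.resize(img, (8, 8))
print(small)
print(small.shape)   # (8, 8)
```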

EECE 490: Introduction to ML 55


Numerical Representation of Image Data

Black and white images: Pixels are either 0 (black) or 1 (white), representing a
binary image.

EECE 490: Introduction to ML 56


Numerical Representation of Image Data

Grayscale Images: Pixel values


range from 0 (black) to 255 (white) for
8-bit images. Higher bit-depth
grayscale images (e.g., 16-bit) extend
this range.

EECE 490: Introduction to ML 57


Numerical Representation of Image Data

RGB images: each pixel in an RGB image


can be represented as three values: Red,
Green, and Blue.

Thus, RGB images can be represented as three channels, holding the red, green,
and blue intensities respectively.
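A quick way to see the three channels with OpenCV; the path is a placeholder, and note that OpenCV stores the channels in BGR order by default.

```python
import cv2

img = cv2.imread("example.jpg")           # placeholder path; H x W x 3 array
print(img.shape)                          # e.g. (1080, 1920, 3)

blue, green, red = cv2.split(img)         # one 2-D intensity matrix per channel
print(red.shape, green.shape, blue.shape)
```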

EECE 490: Introduction to ML 58


Numerical Representation of Image Data

RGBA images: Similar to RGB but with an


additional alpha channel for transparency,
where the alpha value typically ranges
from 0 (fully transparent) to 255 (fully
opaque).

EECE 490: Introduction to ML 59


Numerical Representation of Image Data

Multi-Spectral images: Images captured


across multiple wavelengths of the
electromagnetic spectrum, typically beyond
the standard visible light range (red, green,
and blue).
Commonly include 4–12 spectral bands,
compared to the 3 bands (RGB) in standard
images. Examples: Visible (RGB),
near-infrared (NIR), short-wave infrared
(SWIR), and ultraviolet (UV).

EECE 490: Introduction to ML 60


Numerical Representation of Image Data

Depth Maps: A representation of the


distances between the camera (or
sensor) and objects in a scene, where
each pixel's value corresponds to the
depth (or distance) of that point. These
maps are typically grayscale images
where brighter pixels represent objects
closer to the camera, and darker pixels
represent objects farther away.

EECE 490: Introduction to ML 61


Numerical Representation: Audio & Time Series Data

Depending on your application, time


series and audio data can be
transformed to either an image
representation or tabular
representation.

EECE 490: Introduction to ML 62


Numerical Representation: Audio & Time Series Data
Sliding Window: An approach used to
capture discrete instances of continuous
data that can be used for image or
tabular transformation.
After segmenting the data into windows,
a ‘window function’ is applied to each
segment to reduce the edge effects
(discontinuities) that would otherwise be
introduced by dividing the signal into
chunks.
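A minimal sketch of sliding-window segmentation with a Hamming window applied to each segment; the signal here is synthetic and the window/hop sizes are illustrative.

```python
import numpy as np

signal = np.random.randn(16000)   # e.g. one second of audio sampled at 16 kHz
win_len, hop = 400, 160           # 25 ms windows with a 10 ms hop

window = np.hamming(win_len)      # tapers each segment toward zero at its edges
segments = [
    signal[start:start + win_len] * window
    for start in range(0, len(signal) - win_len + 1, hop)
]
print(len(segments), segments[0].shape)
```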

EECE 490: Introduction to ML 63


Numerical Representation: Audio & Time Series Data

A window function is a mathematical


function that is applied to each segment
of the audio signal before performing any
analysis (like Fourier Transform). The
purpose of the window function is to
smooth the signal at the edges of each
segment, reducing the abrupt
discontinuities.
This minimizes the spectral leakage
(energy spreading into other frequency
bins) that results from discontinuities at the
edges of the windows.

EECE 490: Introduction to ML 64


Numerical Representation: Audio & Time Series Data

Popular window functions:

1. Hamming Window:

2. Blackman-Harris Window:
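For reference, the standard definitions of these two windows, for a window of length N and sample index 0 ≤ n ≤ N−1, are:

```latex
% Hamming window
w(n) = 0.54 - 0.46\cos\!\left(\frac{2\pi n}{N-1}\right)

% Blackman-Harris window (4-term)
w(n) = a_0 - a_1\cos\!\left(\frac{2\pi n}{N-1}\right)
     + a_2\cos\!\left(\frac{4\pi n}{N-1}\right)
     - a_3\cos\!\left(\frac{6\pi n}{N-1}\right),
\quad a_0 = 0.35875,\; a_1 = 0.48829,\; a_2 = 0.14128,\; a_3 = 0.01168
```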

EECE 490: Introduction to ML 65


Numerical Representation: Audio & Time Series Data
Tabular Representation
Tabular representation of Audio and Time series data can be achieved by
documenting the values of each window as a row in a structured representation.
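A minimal sketch of one such tabular representation: each window is summarized as a row of simple statistical features. The signal and window sizes are illustrative.

```python
import numpy as np
import pandas as pd

signal = np.random.randn(16000)
win_len, hop = 400, 160
segments = [signal[s:s + win_len] for s in range(0, len(signal) - win_len + 1, hop)]

# One row per window, one column per summary feature.
table = pd.DataFrame({
    "mean": [w.mean() for w in segments],
    "std":  [w.std() for w in segments],
    "min":  [w.min() for w in segments],
    "max":  [w.max() for w in segments],
})
print(table.head())
```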

EECE 490: Introduction to ML 66


Numerical Representation: Audio & Time Series Data
Image Representation
Spectrograms are created by:
1. Applying Short Time Fourier Transform
on the window after the window function
has been applied.
2. Then, STFT is converted to a log scale to
better match human hearing sensitivity (we
are more sensitive to changes in lower
frequencies than higher ones).
3. The spectrogram is then created by
stacking the frequency content of each
windowed segment along the time axis.
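A minimal sketch of these steps using scipy (assumed available) on a synthetic tone: an STFT over Hamming-windowed segments, followed by conversion to a log (dB) scale.

```python
import numpy as np
from scipy import signal

fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t)           # one second of a 440 Hz tone

# STFT: windowed segments stacked along the time axis.
f, times, Zxx = signal.stft(x, fs=fs, window="hamming", nperseg=400, noverlap=240)

# Log scale, with a small constant to avoid log(0).
log_spec = 20 * np.log10(np.abs(Zxx) + 1e-10)
print(log_spec.shape)   # (frequency bins, time frames)
```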

EECE 490: Introduction to ML 67


Reminder of Fourier Transform

The Fourier Transform (FT) converts a signal from the time domain (how a
signal changes over time) to the frequency domain (what frequencies are
present in the signal and their amplitudes).
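A small numeric illustration with NumPy's FFT: a signal built from 50 Hz and 120 Hz sine waves goes in, and those two frequencies come out as the dominant bins.

```python
import numpy as np

fs = 1000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 50 * t) + 0.5 * np.sin(2 * np.pi * 120 * t)

spectrum = np.fft.rfft(x)                      # frequency-domain representation
freqs = np.fft.rfftfreq(len(x), d=1 / fs)      # frequency of each bin in Hz

peak_freqs = freqs[np.argsort(np.abs(spectrum))[-2:]]
print(sorted(peak_freqs))                      # ~[50.0, 120.0]
```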

EECE 490: Introduction to ML 68


Reminder of Fourier Transform

Fourier transform variations:

EECE 490: Introduction to ML 69


Numerical Representation: Audio & Time Series Data
Image Representation
A more advanced version of the spectrogram is
the Mel-frequency image.
1. Compute the spectrogram of the signal.
2. Apply the Mel Filter Bank that divides the
frequency range (e.g., 0–8000 Hz for a 16
kHz signal) into N Mel bands (e.g., 40
bands).
3. For each Mel band, compute the power
(square of the magnitude) of the
frequencies within that band.
4. Plot the log power values of the Mel bands
over time to create a 2D image
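A sketch of these steps with librosa (assuming it is installed); the parameter values, including the 40 Mel bands, are illustrative and the signal is synthetic.

```python
import numpy as np
import librosa

sr = 16000
y = np.sin(2 * np.pi * 440 * np.arange(sr) / sr).astype(np.float32)

# Spectrogram -> Mel filter bank -> power per band, all in one call,
# then conversion of the power values to a log (dB) scale.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400, hop_length=160, n_mels=40)
log_mel = librosa.power_to_db(mel)
print(log_mel.shape)   # (40 Mel bands, time frames)
```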

EECE 490: Introduction to ML 70


Numerical Representation of Text Data

Before we transform text into its numeric format, we need to pre-process the text:
Mandatory pre-processing:
1. Tokenization
2. Vocabulary Building
Optional pre-processing:
1. Remove links, tags, emojis, stop words, extra spaces and punctuations.
2. Transform all text into lower case.
3. Replace accented characters, numbers, and special characters with a different
representation.

EECE 490: Introduction to ML 71


Numerical Representation of Text: Tokenization

Tokenization: The process of segmenting text into smaller, discrete units


called tokens, such as words, subwords, or characters. These tokens serve as
the basic building blocks for representing text in a machine learning model,
where each token is mapped to a unique index or representation.

EECE 490: Introduction to ML 72


Numerical Representation of Text: Tokenization

Five main types of tokenization are used to process text for natural language
models:

1. Character level tokenization.


2. Byte level tokenization.
3. Word level tokenization.
4. Sub-word level tokenization.
5. Sentence level tokenization.

EECE 490: Introduction to ML 73


Numerical Representation of Text: Tokenization

Character Tokenization
Definition: Breaks text into individual characters.
Advantages:
- Handles rare or unknown words effectively.
- Compact vocabulary size.
- Robust to spelling variations and errors.
Disadvantages:
- Produces longer sequences, increasing computational cost.
- Loss of semantic information at the character level.
- Requires deeper models.
Use-Cases:
- Languages with no spaces (e.g., Chinese, Japanese).
- Handling rare or OOV words.
- Autocompletion and text generation tasks.

EECE 490: Introduction to ML 74


Numerical Representation of Text: Tokenization

Python implementation of character level tokenization.
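A minimal sketch of character-level tokenization in plain Python (the sample string is illustrative):

```python
text = "Machine learning"

# Every character, including spaces, becomes a token.
char_tokens = list(text)

# Map each unique character to an integer index.
vocab = {ch: idx for idx, ch in enumerate(sorted(set(char_tokens)))}
ids = [vocab[ch] for ch in char_tokens]

print(char_tokens)
print(ids)
```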

EECE 490: Introduction to ML 75


Numerical Representation of Text: Tokenization
Byte Level Tokenization
Definition: Breaks text into individual bytes or characters, typically at the byte level, to handle any character, including rare or unseen ones.
Advantages:
- Handles all characters, including special and unseen ones, with no OOV issues.
- Compact vocabulary size.
- Works well with any text, regardless of language or structure.
Disadvantages:
- Produces very fine-grained tokenization (often single-character level), which can lead to long sequences.
- Loss of higher-level semantic meaning.
Use-Cases:
- Language-agnostic tasks.
- Low-resource languages.
- Models requiring robust handling of all characters, such as multilingual or code-switching tasks.

EECE 490: Introduction to ML 76


Numerical Representation of Text: Tokenization

Python implementation of byte level tokenization.

Some characters are represented by multiple bytes, like emojis.
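A minimal sketch of byte-level tokenization, showing how a single emoji expands into several UTF-8 bytes (the sample string is illustrative):

```python
text = "Hi 👋"

# Encode the string as UTF-8 and treat each byte (0-255) as one token.
byte_tokens = list(text.encode("utf-8"))

print(byte_tokens)                  # the emoji alone contributes four bytes
print(len(text), len(byte_tokens))  # 4 characters, 7 byte tokens
```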

EECE 490: Introduction to ML 77


Numerical Representation of Text: Tokenization
Word Tokenization
Definition: Breaks text into individual words, usually based on spaces or punctuation.
Advantages:
- Easy to understand and implement.
- Retains semantic meaning at the word level.
- Works well for languages with clear word boundaries (e.g., English).
Disadvantages:
- Struggles with out-of-vocabulary (OOV) words.
- Large vocabulary size increases computational cost.
- Sensitive to spelling variations.
Use-Cases:
- Traditional NLP tasks (e.g., sentiment analysis, text classification).

EECE 490: Introduction to ML 78


Numerical Representation of Text: Tokenization

Python implementation of word


level tokenization.

Different methods and libraries


might tokenize the same string
differently based on the
pre-processing techniques used.
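A minimal sketch of two simple word-level tokenizers in plain Python, illustrating how different rules give different tokens for the same string; a dedicated library such as NLTK or spaCy could be used instead.

```python
import re

text = "Data isn't just numbers, it's context."

by_spaces = text.split()                      # split on whitespace only
by_regex = re.findall(r"\w+|[^\w\s]", text)   # separate words from punctuation

print(by_spaces)   # punctuation stays attached to words
print(by_regex)    # punctuation becomes its own tokens
```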

EECE 490: Introduction to ML 79


Numerical Representation of Text: Tokenization
Sub-Word Tokenization
Definition: Breaks words into smaller units, such as subwords, often based on frequency or statistical patterns.
Advantages:
- Handles out-of-vocabulary (OOV) words by splitting them into known subwords.
- Balances vocabulary size and sequence length.
- Retains semantic meaning at a finer granularity.
Disadvantages:
- Requires a pre-defined vocabulary, which may not adapt well to new domains.
- Complex tokenization process compared to word-level approaches.
Use-Cases:
- Pre-trained language models (e.g., BERT, GPT, RoBERTa).
- Multilingual NLP tasks.
- Domains with frequent compound words or technical jargon.

EECE 490: Introduction to ML 80


Numerical Representation of Text: Tokenization

Sub-word tokenization implementation methods: Byte Pair Encodings:

The basic idea of BPE is to iteratively merge the most frequent pair of
consecutive bytes or characters in a text corpus until a predefined vocabulary
size is reached. The resulting subword units can be used to represent the original
text in a more compact and efficient way.

BPE is one of the most popular tokenization methods and is widely used in the
training of LLMs.

EECE 490: Introduction to ML 81


Numerical Representation of Text: Tokenization
How does BPE work?
1. Separate the text into characters. These are your initial set of tokens.
2. Select the two tokens that occur most frequently next to each other.
3. Merge these two tokens, t_left and t_right, into a new token.
4. Repeat until the number of tokens in your vocabulary meets a threshold.
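A minimal sketch of these merge steps on a toy corpus; real BPE tokenizers add more bookkeeping (word counts, end-of-word handling, saved merge rules), so this is illustrative only.

```python
from collections import Counter

def bpe_merges(words, num_merges):
    # Each word starts as a sequence of characters plus an end-of-word marker.
    corpus = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent pair of tokens occurs in the corpus.
        pairs = Counter()
        for word, freq in corpus.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent adjacent pair
        merges.append(best)
        # Replace every occurrence of the best pair with a single merged token.
        new_corpus = Counter()
        for word, freq in corpus.items():
            merged, i = [], 0
            while i < len(word):
                if i < len(word) - 1 and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges

print(bpe_merges(["low", "low", "lower", "lowest"], num_merges=3))
```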

EECE 490: Introduction to ML 82


Numerical Representation of Text: Tokenization

Sub-word tokenization implementation methods: Word Piece:

Similar to byte pair encoding, but instead of using frequency to determine the
existence of a sub-word, a probability function (P) is used to determine the
likelihood of sub-words existing in the sentence (S).

EECE 490: Introduction to ML 83


Numerical Representation of Text: Tokenization

BERT, a very famous language model, uses WordPiece tokenization method. So,
we can just load its tokenizer.
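A minimal sketch of loading that tokenizer, here via the Hugging Face transformers library (assumed installed; the model weights are downloaded on first use):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

tokens = tokenizer.tokenize("Tokenization handles uncommon words gracefully.")
print(tokens)                                  # sub-word pieces are prefixed with "##"
print(tokenizer.convert_tokens_to_ids(tokens)) # integer ids fed to the model
```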

EECE 490: Introduction to ML 84


Numerical Representation of Text: Tokenization

Sub-word tokenization implementation methods: Unigram Language Model:

The Unigram Language Model is a subword tokenization method used in


SentencePiece that selects the best subwords for a given corpus using a
probabilistic framework. It is particularly effective for creating a compact
vocabulary while maintaining good coverage of linguistic phenomena.

Unlike Word Piece and BPE, which iteratively merges subwords, this method
starts with a large set of subwords and prunes them to reach the desired
vocabulary size.

EECE 490: Introduction to ML 85


Numerical Representation of Text: Tokenization

How does ULM work (in the SentencePiece tokenizer)?

1. Create a set of all possible sub-word candidates, called unigrams (by corpus
segmentation or by including all characters in the text).
2. For a sentence S, each subword t_i is assigned a probability, and the sentence
probability is the product P(S) = P(t_1) · P(t_2) · ... · P(t_n); the goal is to
maximize P(S) over the training dataset.
3. Calculate the probability of each subword appearing in the tokenized text,
adjust it to maximize the likelihood of the training data, and filter out
low-probability unigrams.
4. Repeat until the number of unique tokens reaches a threshold.

EECE 490: Introduction to ML 86


Numerical Representation of Text: Tokenization

A unigram language model is trained on a corpus of data and can be loaded and
used for tokenization.
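A minimal sketch with the sentencepiece library (assumed installed); "corpus.txt" is a placeholder text file, and the vocabulary size must be small enough for the corpus you actually train on.

```python
import sentencepiece as spm

# Train a tiny unigram model on a local text file.
spm.SentencePieceTrainer.train(
    input="corpus.txt", model_prefix="unigram", vocab_size=500, model_type="unigram"
)

# Load the trained model and tokenize a sentence into sub-word pieces.
sp = spm.SentencePieceProcessor(model_file="unigram.model")
print(sp.encode("Preparing data for machine learning.", out_type=str))
```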

EECE 490: Introduction to ML 87


Numerical Representation of Text: Tokenization
Sentence Tokenization
Definition: Breaks text into complete sentences, treating each sentence as a single unit or token.
Advantages:
- Captures broader semantic meaning by preserving sentence context.
- Reduces sequence length compared to word or character-level tokenization.
- Facilitates tasks that require sentence understanding (e.g., sentiment analysis, machine translation).
Disadvantages:
- May lose finer details from within sentences (e.g., individual word meaning).
- Cannot handle complex sentence structures (e.g., subword-level nuances, context beyond the sentence).
Use-Cases:
- Sentence-based tasks such as sentiment analysis, machine translation, question answering, document classification.

EECE 490: Introduction to ML 88


Numerical Representation of Text: Tokenization

Sentence tokenizers are straightforward: they are either rule-based, like NLTK,
spaCy, regex, and CoreNLP, or a machine learning model can be trained to predict
sentence boundaries when rule-based approaches become too nuanced.
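A minimal sketch using NLTK's rule-based sentence tokenizer (assuming NLTK is installed and its punkt model can be downloaded):

```python
import nltk

nltk.download("punkt", quiet=True)   # sentence-boundary model used by sent_tokenize

text = "Dr. Smith went to Washington. She arrived at 5 p.m. It was raining."
print(nltk.sent_tokenize(text))      # abbreviations like "Dr." do not end sentences
```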

EECE 490: Introduction to ML 89


Numerical Representation of Text:
Vocabulary Building

The vocabulary is a set of unique tokens available in the corpus. Based on the
tokenization strategy, they could be characters, bytes, words, sub-words, or
sentences.

EECE 490: Introduction to ML 90


Numerical Representation of Text:
Vocabulary Building
Special tokens are added to the vocabulary to account for different cases:

1. <UNK>: for unknown words (words that did not occur in the training corpus)
2. <STRT>: for tokens at the beginning of sentences
3. <END>: for the last token in a sentence or paragraph.
4. </W>: in case of byte, character, or sub-word encoding, this token is used to
indicate the end of a word.

There are many more special sequences that we will come across in the
upcoming chapters.
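A minimal sketch of building a word-level vocabulary with special tokens reserved at fixed indices; the toy corpus is illustrative.

```python
corpus = ["the cat sat", "the dog sat on the mat"]

# Collect the unique word-level tokens in the corpus.
tokens = {word for sentence in corpus for word in sentence.split()}

# Reserve indices for special tokens before the corpus tokens.
special = ["<UNK>", "<STRT>", "<END>"]
vocab = {tok: idx for idx, tok in enumerate(special + sorted(tokens))}
print(vocab)

# Unknown words at inference time fall back to the <UNK> index.
print(vocab.get("bird", vocab["<UNK>"]))
```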

EECE 490: Introduction to ML 91


Numerical Representation of Text Data

Now that we have a defined set of


vocabulary words for our corpus, we
can use one of two methods:

- Vectorization: Using the


statistical properties of the text to
represent it.
- Embedding: Using a deep
learning model to output a vector
representation of the token.

EECE 490: Introduction to ML 92


Numerical Representation of Text Data

EECE 490: Introduction to ML 93


Numerical Representation of Text Data: Vectorization
Indexing Vocab Tokens
Each token in the vocabulary is indexed with an integer. The choice of token
indices can be based on the order of occurrence, frequency (as in statistical
methods like Bag of Words), alphabetical order, or specific rules defined by the
tokenizer (in BPE, the indexes are created during the training process).

EECE 490: Introduction to ML 94


Numerical Representation of Text Data: Vectorization
Bag of Words
Bag of Words: This method uses word-level tokenization. Each unique word in
the corpus is assigned an index, and the text is represented as a vector indicating
the frequency of each word.
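A minimal sketch with scikit-learn's CountVectorizer on a toy corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["the cat sat on the mat", "the dog sat on the log"]

# Each document becomes a vector of word counts over the shared vocabulary.
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())   # the vocabulary (column order)
print(bow.toarray())                        # one count vector per document
```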

EECE 490: Introduction to ML 95


Numerical Representation of Text Data: Vectorization
Bag of Words

EECE 490: Introduction to ML 96


Numerical Representation of Text Data: Vectorization
TF-IDF
TF-IDF can be applied on any tokenization level and combines two components:

1. Term Frequency (TF): Measures how often a word appears in a document.


2. Inverse Document Frequency (IDF): Measures how unique or rare a word is
across the entire corpus.
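A minimal sketch with scikit-learn's TfidfVectorizer on the same toy corpus:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["the cat sat on the mat", "the dog sat on the log"]

# Words that are frequent in a document but rare across the corpus get the
# highest weights; words shared by every document are down-weighted.
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())
print(tfidf.toarray().round(2))
```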

EECE 490: Introduction to ML 97


Numerical Representation of Text Data: Vectorization
TF-IDF

EECE 490: Introduction to ML 98


Numerical Representation of Text Data: Vectorization
TF-IDF

EECE 490: Introduction to ML 99


Numerical Representation of Text Data: Embeddings

An embedding is a dense, multidimensional representation of a token that


encodes semantic, contextual, and statistical information, capturing
relationships and patterns within the data.

EECE 490: Introduction to ML 100


Numerical Representation of Text Data: Embeddings
Embeddings are generated through machine learning models that are trained
specifically to output this numeric representation of text.

Unlike traditional vectorization techniques, the values within an embedding vector


are not directly interpretable. Instead, embeddings representing similar
meanings are positioned closer together in the embedding space.

EECE 490: Introduction to ML 101


Numerical Representation of Text Data: Embeddings

We previously mentioned that indexing our


vocabulary could be sufficient to represent
text in a numeric format. However, this
approach disregards the semantic and
structural properties of the text.

A solution to this limitation is to input the


vocabulary into an embedding model,
which learns dense vector representations
for each token. These embeddings capture
the semantic relationships and contextual
information of the text.

EECE 490: Introduction to ML 102


Numerical Representation of Text Data: Embeddings
Word2Vec Model
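A minimal sketch of training a Word2Vec model, here with the gensim library (an assumption, since any Word2Vec implementation would do); real embeddings need far more data than this toy corpus.

```python
from gensim.models import Word2Vec

sentences = [["the", "cat", "sat", "on", "the", "mat"],
             ["the", "dog", "sat", "on", "the", "log"]]

# Learn 50-dimensional dense vectors for each token in the toy corpus.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=50)

print(model.wv["cat"].shape)          # a 50-dimensional embedding vector
print(model.wv.most_similar("cat"))   # nearest tokens in the embedding space
```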

EECE 490: Introduction to ML 103


Numerical Representation of Video Data
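Since video is a sequence of image frames, a natural numeric representation is a stack of the per-frame pixel arrays. A minimal sketch with OpenCV; "clip.mp4" is a placeholder path.

```python
import cv2
import numpy as np

cap = cv2.VideoCapture("clip.mp4")   # placeholder video file
frames = []
while True:
    ok, frame = cap.read()           # one H x W x 3 array per frame
    if not ok:
        break
    frames.append(frame)
cap.release()

video = np.stack(frames)             # shape: (frames, height, width, channels)
print(video.shape)
```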

EECE 490: Introduction to ML 104


Thank You

EECE 490: Introduction to ML 105
