C2 - Preparing Data for Statistical Machine Learning Models
AI personal assistants have gained enormous popularity in the last couple of years,
helping automate or accelerate day-to-day work.
The validation dataset is used to check the model's performance during the training
process. It allows us to choose which model 'settings' (hyperparameters) result in the
highest accuracy.
The testing dataset is used to check the performance of the model on data it has not
seen during training.
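As a quick illustration, these splits can be produced with scikit-learn's train_test_split; the sketch below uses dummy data and an 80/10/10 split, both of which are arbitrary choices for demonstration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Dummy data standing in for real features and labels.
X = np.random.rand(100, 5)
y = np.random.randint(0, 2, size=100)

# First split off 20% of the data, then divide that part half-and-half
# into validation and test sets (80/10/10 overall).
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.2, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 80 10 10
```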
Let's say we want to predict the price of a stock. What information would you look
at to make this prediction?
The role of the machine learning model is to understand the underlying patterns
and information that the features hold. This implicit set of information is learned
and stored within the model's trainable parameters.
- Text Data: The features could be only the input text (for generation), or could
include useful derived features like "most frequent words" and "document length"
(for prediction).
- Image Data: The features could include only the color intensities inside the
image, or could also include shape, pixel depth, and more.
- Time Series and Audio Data: The features could include only the timestamp
and the value for each interval, but could also include other inferred
features like spectral and pitch features.
The inputs and outputs of the machine learning model should be in numerical
format. Let's say we want to predict whether a person will get approved for a home loan.
After downloading our data, we need to load it. From the dataset card on Kaggle, we
see that we have two CSV files: one for training and one for testing. We will work
with the training file.
To load the training CSV file, we will use the pandas library.
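A minimal sketch of this step is shown below; the file name train.csv is an assumption and should match the file downloaded from Kaggle.

```python
import pandas as pd

# Load the training split; the file name "train.csv" is an assumption.
df = pd.read_csv("train.csv")

# Quick look at the first few rows.
print(df.head())
```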
Now that we have our dataset, let's select the categorical features. We can identify
which features are categorical by using the .info() function in pandas, which tells
us the data type of each feature.
Alternatively, we can use .select_dtypes() to select the features that have the
'object' type and print a list of their names.
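Both approaches look roughly like this, assuming the DataFrame from the previous step is called df:

```python
# Option 1: inspect the dtype of every column (categorical features show up as "object").
df.info()

# Option 2: select only the object-typed columns and list their names.
categorical_features = df.select_dtypes(include="object").columns.tolist()
print(categorical_features)
```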
Let's explore how we can use each encoding type in Python code for the 'property
area' feature in our dataset. The first thing we need to do is take a look at the
'categories' of this feature and their values.
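A quick way to look at the categories and how often each occurs is value_counts(); the column name 'Property_Area' is an assumption about how the feature is spelled in the CSV.

```python
# Inspect the categories of the property area feature and how often each occurs.
# The column name "Property_Area" is an assumption; adjust it to match your CSV.
print(df["Property_Area"].value_counts())
```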
We can also use target encoding, with 'Loan Amount Term' as the target variable.
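A minimal sketch of target encoding with pandas: each category is replaced by the mean of the chosen target variable within that category. The column names below are assumptions about the dataset's spelling.

```python
# Target encoding: replace each category by the mean of the target variable
# ("Loan_Amount_Term") observed for that category.
# Column names are assumptions; adjust them to match your CSV.
means = df.groupby("Property_Area")["Loan_Amount_Term"].mean()
df["Property_Area_encoded"] = df["Property_Area"].map(means)

print(df[["Property_Area", "Property_Area_encoded"]].head())
```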
The size of an image is described by the number of pixels in the height and the
width.
Example: 1920x1080. This means that the screen will have a width of 1,920
pixels while the height of the screen will be 1,080 pixels. This results in a grand
total of 1,920 × 1,080 = 2,073,600 pixels on-screen.
Each pixel has an intensity value that describes the intensity of color at that pixel.
To visualize the pixel values, we will use the OpenCV library to load the image and
print its values.
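A minimal sketch of this step (the file name image.jpg is an assumption):

```python
import cv2

# Load the image; the file name "image.jpg" is an assumption.
img = cv2.imread("image.jpg")

# img is a NumPy array of shape (height, width, 3) with BGR channel order.
print(img.shape)

# Pixel intensities are integers in [0, 255]; print the top-left 2x2 block of pixels.
print(img[:2, :2])
```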
Black and white images: Pixels are either 0 (black) or 1 (white), representing a
binary image.
1. Hamming Window:
2. Blackman-Harris Window:
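Both windows can be generated with NumPy and SciPy; the window length of 256 samples below is an arbitrary choice.

```python
import numpy as np
from scipy.signal.windows import blackmanharris

# Arbitrary window length of 256 samples.
N = 256

# Hamming window: a raised cosine that tapers the edges of a frame.
hamming = np.hamming(N)

# Blackman-Harris window: stronger side-lobe suppression than Hamming.
bh = blackmanharris(N)

print(hamming[:5], bh[:5])
```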
The Fourier Transform (FT) converts a signal from the time domain (how a
signal changes over time) to the frequency domain (what frequencies are
present in the signal and their amplitudes).
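A small sketch with NumPy's FFT: a 440 Hz sine wave is generated and its dominant frequency is recovered from the spectrum (the sampling rate and the tone are arbitrary choices).

```python
import numpy as np

# Generate one second of a 440 Hz sine wave sampled at 8 kHz (arbitrary choices).
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
signal = np.sin(2 * np.pi * 440 * t)

# FFT: time domain -> frequency domain (real-input variant).
spectrum = np.fft.rfft(signal)
freqs = np.fft.rfftfreq(len(signal), d=1 / sample_rate)

# The peak magnitude should sit at (approximately) 440 Hz.
print(freqs[np.argmax(np.abs(spectrum))])
```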
Before we transform text into its numeric format, we need to pre-process the text:
Mandatory pre-processing:
1. Tokenization
2. Vocabulary Building
Optional pre-processing:
1. Remove links, tags, emojis, stop words, extra spaces, and punctuation.
2. Transform all text into lowercase.
3. Replace accented characters, numbers, and special characters with a different
representation.
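A minimal sketch of a few of these optional steps using Python's re module (the example sentence is made up):

```python
import re

text = "Check out https://example.com!!  AMAZING results 😀 <br>"

# Remove links, HTML tags, and emojis / non-ASCII symbols.
text = re.sub(r"https?://\S+", " ", text)
text = re.sub(r"<[^>]+>", " ", text)
text = text.encode("ascii", "ignore").decode()

# Lowercase, drop punctuation, and collapse extra spaces.
text = text.lower()
text = re.sub(r"[^\w\s]", " ", text)
text = re.sub(r"\s+", " ", text).strip()

print(text)  # "check out amazing results"
```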
Five main types of tokenization are used to process text for natural language
models:
Character Tokenization: Breaks text into individual characters.
- Pros: Handles rare or unknown words effectively. Compact vocabulary size. Robust to spelling variations and errors.
- Cons: Produces longer sequences, increasing computational cost. Loss of semantic information at the character level. Requires deeper models.
- Best for: Languages with no spaces (e.g., Chinese, Japanese). Handling rare or OOV words. Autocompletion and text generation tasks.

Byte-Level Tokenization: Breaks text into individual bytes or characters, typically at the byte level, to handle any character, including rare or unseen ones.
- Pros: Handles all characters, including special and unseen ones, with no OOV issues. Compact vocabulary size. Works well with any text, regardless of language or structure.
- Cons: Produces very fine-grained tokenization (often single-character level), which can lead to long sequences. Loss of higher-level semantic meaning.
- Best for: Language-agnostic tasks. Low-resource languages. Models requiring robust handling of all characters, such as multilingual or code-switching tasks.

Word Tokenization: Breaks text into individual words, usually based on spaces or punctuation.
- Pros: Easy to understand and implement. Retains semantic meaning at the word level. Works well for languages with clear word boundaries (e.g., English).
- Cons: Struggles with out-of-vocabulary (OOV) words. Large vocabulary size increases computational cost. Sensitive to spelling variations.
- Best for: Traditional NLP tasks (e.g., sentiment analysis, text classification).
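A tiny illustration of the word-, character-, and byte-level strategies on the same string, using plain Python (real tokenizers are more involved):

```python
text = "Tokenization matters."

# Word-level: split on whitespace.
print(text.split())                     # ['Tokenization', 'matters.']

# Character-level: every character becomes a token.
print(list(text)[:8])                   # ['T', 'o', 'k', 'e', 'n', 'i', 'z', 'a']

# Byte-level: work on the UTF-8 bytes of the text.
print(list(text.encode("utf-8"))[:8])   # [84, 111, 107, 101, 110, 105, 122, 97]
```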
The basic idea of BPE is to iteratively merge the most frequent pair of
consecutive bytes or characters in a text corpus until a predefined vocabulary
size is reached. The resulting subword units can be used to represent the original
text in a more compact and efficient way.
BPE is one of the most popular tokenization methods and is widely used in the
training of LLMs.
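As an illustration, GPT-2's tokenizer is a byte-level BPE tokenizer and can be loaded from the Hugging Face transformers library (assuming the library is installed and the model files can be downloaded):

```python
from transformers import AutoTokenizer

# GPT-2 uses a byte-level BPE tokenizer.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

print(tokenizer.tokenize("Tokenization matters."))
print(tokenizer.encode("Tokenization matters."))
```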
Similar to byte pair encoding, but instead of using raw frequency to decide which
sub-words to merge, a probability function (P) is used to determine the
likelihood of sub-words occurring in the sentence (S).
BERT, a very famous language model, uses the WordPiece tokenization method, so
we can just load its tokenizer.
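A minimal sketch using the Hugging Face transformers library (assuming it is installed):

```python
from transformers import AutoTokenizer

# BERT's tokenizer is a WordPiece tokenizer; subword continuations are prefixed with "##".
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

print(tokenizer.tokenize("Tokenization matters."))
# e.g. ['token', '##ization', 'matters', '.']
```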
Unlike WordPiece and BPE, which iteratively merge subwords, this method
starts with a large set of subwords and prunes them to reach the desired
vocabulary size.
A unigram language model is trained on a corpus of data and can be loaded and
used for tokenization.
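For example, XLNet's tokenizer is a SentencePiece Unigram model and can be loaded the same way we loaded BERT's (the model choice is illustrative; the sentencepiece package must be installed):

```python
from transformers import AutoTokenizer

# XLNet's tokenizer is a SentencePiece Unigram model.
tokenizer = AutoTokenizer.from_pretrained("xlnet-base-cased")

print(tokenizer.tokenize("Tokenization matters."))
```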
Sentence Tokenization: Breaks text into complete sentences, treating each sentence as a single unit or token.
- Pros: Captures broader semantic meaning by preserving sentence context. Reduces sequence length compared to word- or character-level tokenization. Facilitates tasks that require sentence understanding (e.g., sentiment analysis, machine translation).
- Cons: May lose finer details from within sentences (e.g., individual word meaning). Cannot handle complex structures (e.g., subword-level nuances, context beyond the sentence).
- Best for: Sentence-based tasks such as sentiment analysis, machine translation, question answering, and document classification.
Sentence tokenizers are straightforward: they are either rule-based, like NLTK,
spaCy, regex, and CoreNLP, or a machine learning model can be trained to predict
sentence boundaries, since rule-based approaches might become too nuanced.
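A minimal rule-based example with NLTK's sentence tokenizer (assuming nltk is installed; the Punkt model is downloaded on first use):

```python
import nltk
from nltk.tokenize import sent_tokenize

# Download the Punkt sentence boundary models (resource name depends on the NLTK version).
nltk.download("punkt")
nltk.download("punkt_tab")

text = "It was cold. We stayed inside! Did it snow?"
print(sent_tokenize(text))
# ['It was cold.', 'We stayed inside!', 'Did it snow?']
```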
The vocabulary is a set of unique tokens available in the corpus. Based on the
tokenization strategy, they could be characters, bytes, words, sub-words, or
sentences. In addition, a few special tokens are usually added to the vocabulary:
1. <UNK>: for unknown words (words that did not occur in the training corpus)
2. <STRT>: for tokens at the beginning of sentences
3. <END>: for the last token in a sentence or paragraph.
4. </W>: in case of byte, character, or sub-word encoding, this token is used to
indicate the end of a word.
There are many more special sequences that we will come across in the
upcoming chapters.
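As a small sketch, here is one common way to build a word-level vocabulary that includes a few of the special tokens above (the mapping scheme is illustrative):

```python
corpus = ["the cat sat", "the dog ran"]

# Special tokens first, then the unique words found in the corpus.
special_tokens = ["<UNK>", "<STRT>", "<END>"]
words = sorted({word for sentence in corpus for word in sentence.split()})
vocab = {token: idx for idx, token in enumerate(special_tokens + words)}

print(vocab)

# Map a new sentence to ids, falling back to <UNK> for unseen words.
sentence = "the cat ran fast"
ids = [vocab.get(word, vocab["<UNK>"]) for word in sentence.split()]
print(ids)
```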