Module 1 Lab 1
This lab notebook introduces feature extraction for two types of data: text and images.
The goal is to help you understand how to transform raw data into features that machine
learning algorithms can use. Below, I explain each major part step by step, assuming no
prior background in the field.
Key Observations
Bigram frequencies (how often each pair of adjacent characters occurs) are similar across
topics within the same language but differ significantly between languages.
Bigram frequency is therefore a good feature for distinguishing languages, but not topics.
Dimensionality Reduction
With unigrams, a text of any length is reduced to 26 features (one count per letter, assuming
only the letters a-z are kept).
With bigrams, you get 26×26 = 676 features (one count per ordered pair of letters).
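The counts above can be sketched in a few lines of Python. This is a minimal illustration, not the notebook's own code: the helper names (clean, unigram_features, bigram_features) are hypothetical, and it assumes the text is lowercased and stripped of everything except a-z before counting.

```python
import string
from itertools import product

ALPHABET = string.ascii_lowercase  # assumption: features cover only a-z


def clean(text):
    """Lowercase the text and keep only the letters a-z."""
    return "".join(ch for ch in text.lower() if ch in ALPHABET)


def unigram_features(text):
    """Return 26 counts, one per letter."""
    letters = clean(text)
    return [letters.count(ch) for ch in ALPHABET]


def bigram_features(text):
    """Return 676 counts, one per ordered pair of letters."""
    letters = clean(text)
    # Map each pair "aa", "ab", ..., "zz" to a fixed position 0..675.
    index = {a + b: i for i, (a, b) in enumerate(product(ALPHABET, repeat=2))}
    counts = [0] * len(index)
    # Spaces were removed by clean(), so pairs can span word boundaries.
    for first, second in zip(letters, letters[1:]):
        counts[index[first + second]] += 1
    return counts


sample = "Feature extraction turns raw text into numbers."
print(len(unigram_features(sample)))  # 26
print(len(bigram_features(sample)))   # 676
```

Whatever the length of the input, the output always has 26 or 676 entries, which is what makes these counts usable as fixed-size feature vectors.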
Further Exploration
Try using different languages, topics, or text sources.
Visualize trigrams (three-character sequences) or higher-order n-grams for more complex
patterns.
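As a starting point for that exploration, here is a hedged sketch of a general n-gram counter. The function name char_ngrams and the sample sentence are made up for illustration, and it again assumes only the letters a-z are kept. Note that the number of possible n-grams grows as 26^n, so trigrams already span 26^3 = 17,576 possible features.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count overlapping n-character sequences of letters in text."""
    # Assumption: lowercase the text and drop everything except a-z.
    letters = "".join(ch for ch in text.lower() if "a" <= ch <= "z")
    return Counter(letters[i:i + n] for i in range(len(letters) - n + 1))


trigrams = char_ngrams("the theme of the thesis", 3)
print(trigrams["the"])          # 4
print(trigrams.most_common(3))  # the most frequent triplets and their counts
```

Using a Counter instead of a fixed-length list keeps only the n-grams that actually occur, which is convenient once the full feature space becomes too large to plot.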
Data type | Feature | Description | What it captures
Text | Trigram Count | Frequency of letter triplets | Captures more complex patterns
Image | Hull Pixel Count | Number of pixels in filled digit | Measures overall digit area
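The hull pixel count can be illustrated with a small sketch. The notebook's own hull computation is not shown in this section, so this version rests on an assumption: "filled digit" is taken to mean the digit's pixels plus any background pixels completely enclosed by them. The function name filled_pixel_count and the toy image are hypothetical.

```python
from collections import deque


def filled_pixel_count(image):
    """Count digit pixels plus background pixels fully enclosed by the digit.

    `image` is a list of rows; nonzero values are digit pixels.
    """
    rows, cols = len(image), len(image[0])
    outside = [[False] * cols for _ in range(rows)]
    queue = deque()

    # Seed the flood fill with every background pixel on the image border.
    for r in range(rows):
        for c in range(cols):
            on_border = r in (0, rows - 1) or c in (0, cols - 1)
            if on_border and not image[r][c]:
                outside[r][c] = True
                queue.append((r, c))

    # Spread through 4-connected background pixels reachable from the border.
    while queue:
        r, c = queue.popleft()
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols:
                if not image[nr][nc] and not outside[nr][nc]:
                    outside[nr][nc] = True
                    queue.append((nr, nc))

    # Everything not reachable from outside is either digit or enclosed hole.
    return rows * cols - sum(map(sum, outside))


# A tiny ring shape, like a crude "0": 8 digit pixels around 1 hole pixel.
ring = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 0, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
print(filled_pixel_count(ring))  # 9
```

For the ring, the plain pixel count would be 8, while the filled count is 9. The difference is the enclosed hole, which is why this feature reflects the overall area a digit occupies.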
If you have any specific part of the code or concept you want to explore further, let me know!