Detailed Explanation of the PDF: "AIML Module 01 Lab 01 Features"
This lab notebook introduces the concept of feature extraction from data, focusing on two
types: text data and image data. The goal is to help you understand how to transform raw data
into features that machine learning algorithms can use. Below, I will explain every major part,
step by step, as if you are new to the field.
Part 1: Features of Text
Why Extract Features from Text?
Machine learning algorithms work with numbers, not raw text.
To use text as input, we must convert it into a set of features (numerical values that capture
important information about the text).
Downloading Text Data
The lab downloads Wikipedia articles on topics like "Giraffe," "Elephant," "Machine," and
"Artificial Intelligence" in English, French, and Hindi.
This is done using the wikipedia Python package.
Cleaning the Text
The text is cleaned to remove all special characters, numbers, and spaces, leaving only
lowercase alphabetic characters.
This simplifies the data and ensures consistency across languages.
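The cleaning step described above can be sketched as follows. This is a minimal illustration, not the lab's exact code: the function name clean_text is hypothetical, and it assumes that "alphabetic" means the 26 unaccented letters a-z, so accented letters and non-Latin scripts are dropped as well.

```python
import re

def clean_text(text):
    """Lowercase the text and drop everything that is not a letter a-z.

    Assumption: only the 26 unaccented Latin letters are kept, so digits,
    spaces, punctuation, and accented characters are all removed.
    """
    return re.sub(r"[^a-z]", "", text.lower())

print(clean_text("Giraffes are 5.5 m tall!"))  # giraffesaremtall
```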
What are N-grams?
N-grams are contiguous sequences of 'n' items (characters or words) from a given text.
Unigram: Single character (n=1)
Bigram: Pair of characters (n=2)
Trigram: Sequence of three characters (n=3)
N-grams help capture the structure and patterns in text.
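To make the definitions above concrete, here is a small sketch that lists character n-grams with a sliding window. The helper name char_ngrams is hypothetical and is used only for illustration.

```python
def char_ngrams(text, n):
    """Return every contiguous run of n characters, in order of appearance."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

word = "banana"
print(char_ngrams(word, 1))  # ['b', 'a', 'n', 'a', 'n', 'a']
print(char_ngrams(word, 2))  # ['ba', 'an', 'na', 'an', 'na']
print(char_ngrams(word, 3))  # ['ban', 'ana', 'nan', 'ana']
```

A text of length m yields m - n + 1 n-grams, so longer n-grams give slightly fewer items from the same text.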
Counting N-grams
The frequency of each n-gram is counted using Python’s Counter from the collections
module.
These counts are used as features representing the text.
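A minimal sketch of the counting step, assuming the text has already been cleaned to lowercase letters. Counter is the standard-library class mentioned above; the wrapper name ngram_counts is hypothetical.

```python
from collections import Counter

def ngram_counts(text, n):
    """Count how often each character n-gram occurs in the text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

counts = ngram_counts("banana", 2)
print(counts["an"], counts["na"], counts["ba"])  # 2 2 1
print(counts["zz"])                              # 0 (unseen n-grams count as zero)
```

Because a Counter returns 0 for missing keys, n-grams that never occur in a text do not need special handling when two texts are compared.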
Visualizing N-gram Frequencies
The frequencies are plotted as histograms (bar charts).
By comparing these plots for different languages, you can see that the patterns of n-gram
frequencies are unique to each language.
Key Observations
Bigram frequencies (pairs of characters) are similar across topics in the same language but
differ significantly between languages.
Therefore, bigram frequency is a good feature for distinguishing languages, not topics.
Dimensionality Reduction
By using unigrams, you reduce the text to 26 features (one for each letter).
Using bigrams, you get 26×26 = 676 features (all possible pairs of letters).
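The fixed feature sizes above can be illustrated with a short sketch that maps any cleaned text to a vector with one slot per possible n-gram. The function name ngram_feature_vector is hypothetical, and dividing by the total count (so that texts of different lengths are comparable) is an assumption of this sketch, not something the lab is stated to do.

```python
from collections import Counter
from itertools import product
import string

ALPHABET = string.ascii_lowercase  # the 26 letters kept by the cleaning step

def ngram_feature_vector(text, n):
    """Return relative n-gram frequencies in a fixed alphabetical order.

    Assumes `text` contains only the letters a-z.
    """
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    vocabulary = ["".join(p) for p in product(ALPHABET, repeat=n)]
    total = sum(counts.values()) or 1  # avoid dividing by zero on empty text
    return [counts[gram] / total for gram in vocabulary]

print(len(ngram_feature_vector("banana", 1)))  # 26
print(len(ngram_feature_vector("banana", 2)))  # 676
```

The vector length depends only on n (26 to the power n), never on the length of the text, which is what makes the counts usable as input to a learning algorithm.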
Further Exploration
Try using different languages, topics, or text sources.
Visualize trigrams (three-character sequences) or higher-order n-grams for more complex
patterns.
Part 2: Features from Images (Written Numbers)
Dataset Used: MNIST
The MNIST dataset contains images of handwritten digits (0-9), each as a 28×28 pixel
grayscale image.
Visualizing the Data
Images of the digits "1" and "0" are displayed to understand what the data looks like.
Simple Feature: Sum of Pixels
Despite the name, this feature counts the number of non-black (active) pixels in each image rather than adding up their intensities.
This feature alone can distinguish between some digits (e.g., "1" has fewer active pixels
than "0").
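A minimal sketch of this feature with NumPy, using tiny 5×5 toy arrays in place of real 28×28 MNIST images. The function name active_pixel_count and the threshold parameter are hypothetical.

```python
import numpy as np

def active_pixel_count(img, threshold=0):
    """Count pixels brighter than `threshold` (0 means any non-black pixel)."""
    return int(np.count_nonzero(np.asarray(img) > threshold))

# Toy "1": a vertical stroke three pixels long.
one = np.zeros((5, 5), dtype=np.uint8)
one[1:4, 2] = 255

# Toy "0": a 3x3 ring with an empty centre.
zero = np.zeros((5, 5), dtype=np.uint8)
zero[1:4, 1:4] = 255
zero[2, 2] = 0

print(active_pixel_count(one), active_pixel_count(zero))  # 3 8
```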
Advanced Feature: Counting "Hole" Pixels
A "hole pixel" is a black (background) pixel that lies inside a region fully enclosed by non-black pixels, so it cannot be reached from the image border.
The algorithm fills in the holes and counts how many such pixels exist.
This feature is especially useful for distinguishing "0" (which has a hole) from "1" (which
does not).
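One way to find hole pixels is to flood-fill the background starting from the image border: any background pixel the fill never reaches must be enclosed by the digit. The sketch below takes that approach with 4-connected neighbours. It is an illustration under those assumptions, not necessarily how the lab implements it, and the name hole_mask is hypothetical.

```python
from collections import deque
import numpy as np

def hole_mask(img):
    """Boolean mask of background pixels that are not reachable from the border."""
    ink = np.asarray(img) > 0
    h, w = ink.shape
    outside = np.zeros((h, w), dtype=bool)
    queue = deque()
    # Seed the fill with every background pixel on the image border.
    for r in range(h):
        for c in range(w):
            on_border = r in (0, h - 1) or c in (0, w - 1)
            if on_border and not ink[r, c]:
                outside[r, c] = True
                queue.append((r, c))
    # Spread through 4-connected background pixels.
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < h and 0 <= nc < w
                    and not ink[nr, nc] and not outside[nr, nc]):
                outside[nr, nc] = True
                queue.append((nr, nc))
    # Holes are background pixels the fill never reached.
    return ~ink & ~outside

zero = np.zeros((5, 5), dtype=np.uint8)
zero[1:4, 1:4] = 255
zero[2, 2] = 0            # enclosed centre pixel
one = np.zeros((5, 5), dtype=np.uint8)
one[1:4, 2] = 255         # plain vertical stroke

print(int(hole_mask(zero).sum()), int(hole_mask(one).sum()))  # 1 0
```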
Visualizing Hole Pixels
The lab displays side-by-side images of the original digit and the image showing only the
hole pixels.
Feature: Hull Pixels
The "hull" of an image is the digit with all holes filled in.
Counting the number of hull pixels provides another feature for classification.
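The hull can be sketched as "everything that is not outside the digit". The version below pads the image with a background frame, grows the outside region one step at a time through background pixels, and treats whatever remains as the hull. The padding-and-growing approach and the name hull_mask are assumptions of this sketch rather than the lab's stated method.

```python
import numpy as np

def hull_mask(img):
    """Boolean mask of the digit with its enclosed holes filled in."""
    ink = np.asarray(img) > 0
    padded = np.pad(ink, 1, constant_values=False)  # add a background frame
    outside = np.zeros_like(padded)
    outside[0, :] = outside[-1, :] = True
    outside[:, 0] = outside[:, -1] = True
    while True:
        grown = outside.copy()
        # Extend the outside region by one pixel in each of four directions.
        grown[1:, :] |= outside[:-1, :]
        grown[:-1, :] |= outside[1:, :]
        grown[:, 1:] |= outside[:, :-1]
        grown[:, :-1] |= outside[:, 1:]
        grown &= ~padded          # the outside region never enters ink pixels
        if np.array_equal(grown, outside):
            break                 # nothing changed, so the fill is complete
        outside = grown
    return ~outside[1:-1, 1:-1]   # remove the frame; non-outside pixels form the hull

zero = np.zeros((5, 5), dtype=np.uint8)
zero[1:4, 1:4] = 255
zero[2, 2] = 0            # ring of 8 ink pixels around 1 hole pixel
one = np.zeros((5, 5), dtype=np.uint8)
one[1:4, 2] = 255

print(int(hull_mask(zero).sum()), int(hull_mask(one).sum()))  # 9 3
```

For a digit without holes, the hull count equals the active pixel count; the difference between the two is the number of hole pixels.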
Feature: Boundary Pixels
Boundary pixels are those on the edge of the digit (where the digit meets the background).
The algorithm finds these by comparing each pixel to its neighbors.
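A minimal sketch of the neighbour comparison, assuming a pixel belongs to the boundary when it is part of the digit and at least one of its four direct neighbours is background (pixels beyond the image edge count as background). The name boundary_mask is hypothetical, and the lab may use a different neighbourhood or apply the check to the filled hull instead.

```python
import numpy as np

def boundary_mask(img):
    """Boolean mask of digit pixels that touch the background on at least one side."""
    ink = np.asarray(img) > 0
    padded = np.pad(ink, 1, constant_values=False)
    # A pixel is interior only if its up, down, left and right neighbours are all ink.
    all_neighbours_ink = (
        padded[:-2, 1:-1] & padded[2:, 1:-1] &
        padded[1:-1, :-2] & padded[1:-1, 2:]
    )
    return ink & ~all_neighbours_ink

block = np.zeros((5, 5), dtype=np.uint8)
block[1:4, 1:4] = 255     # solid 3x3 square: only the centre pixel is interior
stroke = np.zeros((5, 5), dtype=np.uint8)
stroke[1:4, 2] = 255      # one-pixel-wide stroke: every pixel is on the boundary

print(int(boundary_mask(block).sum()), int(boundary_mask(stroke).sum()))  # 8 3
```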
Visualizing Boundary Pixels
The lab shows images with boundary pixels highlighted, helping you see the outline of the
digit.
Summary Table: Features Extracted
Data Type | Feature Name | What It Measures | Why It’s Useful
Text | Unigram Count | Frequency of each letter | Captures language/character patterns
Text | Bigram Count | Frequency of letter pairs | Captures language structure
Text | Trigram Count | Frequency of letter triplets | Captures more complex patterns
Image | Pixel Sum | Number of active (non-black) pixels | Simple shape/size feature
Image | Hole Pixel Count | Number of enclosed "hole" pixels | Distinguishes digits with/without holes
Image | Hull Pixel Count | Number of pixels in filled digit | Measures overall digit area
Image | Boundary Pixels | Number of edge pixels | Captures digit outline
Key Takeaways for Beginners
Feature extraction is the process of turning raw data (text or images) into numbers that a
computer can understand.
For text, features are often based on the frequency of letters or groups of letters (n-grams).
For images, features can be as simple as counting pixels or as complex as measuring holes
and boundaries.
Visualizing features helps you understand how well they separate different classes (e.g.,
different digits or languages).
The choice of features is crucial for the success of any machine learning model.
If you have any specific part of the code or concept you want to explore further, let me know!