Module 1 Lab 1

The document introduces feature extraction from text and image data for machine learning. It covers methods for transforming raw text into numerical features using n-grams and discusses image features from the MNIST dataset, including pixel counts and boundary detection. Key takeaways emphasize the importance of feature selection in machine learning and the visualization of features for understanding data separation.

Uploaded by

katrao39798

Detailed Explanation of the PDF: "AIML Module 01 Lab 01 Features"

This lab notebook introduces the concept of feature extraction from data, focusing on two
types: text data and image data. The goal is to help you understand how to transform raw data
into features that machine learning algorithms can use. Below, I will explain every major part,
step by step, as if you are new to the field.

Part 1: Features of Text

Why Extract Features from Text?


Machine learning algorithms work with numbers, not raw text.
To use text as input, we must convert it into a set of features (numerical values that capture
important information about the text).

Downloading Text Data


The lab downloads Wikipedia articles on topics like "Giraffe," "Elephant," "Machine," and
"Artificial Intelligence" in English, French, and Hindi.
This is done using the wikipedia Python package.

Cleaning the Text


The text is cleaned to remove all special characters, numbers, and spaces, leaving only
lowercase alphabetic characters.
This simplifies the data and ensures consistency across languages.
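
The cleaning step can be sketched as follows. This is a minimal illustration, not the lab's own code: the function name clean_text and the exact rule (lowercase first, then keep only a-z) are assumptions based on the description above.

```python
import re


def clean_text(raw: str) -> str:
    """Lowercase the text and keep only the ASCII letters a-z (assumed rule)."""
    lowered = raw.lower()
    # Remove every character that is not a lowercase ASCII letter:
    # digits, punctuation, whitespace, accented letters, non-Latin scripts.
    return re.sub(r"[^a-z]", "", lowered)


print(clean_text("The giraffe is 5.5 m tall!"))  # thegiraffeismtall
```

Under this assumed rule, accented letters (common in French) and non-Latin scripts (such as Devanagari for Hindi) are dropped entirely, so the lab's actual cleaning code may treat those languages differently.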

What are N-grams?


N-grams are contiguous sequences of 'n' items (characters or words) from a given text. This lab works with character n-grams.
Unigram: Single character (n=1)
Bigram: Pair of characters (n=2)
Trigram: Sequence of three characters (n=3)
N-grams help capture the structure and patterns in text.

Counting N-grams

The frequency of each n-gram is counted using Python’s Counter from the collections
module.
These counts are used as features representing the text.
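
As a rough sketch of the counting step, the snippet below slides a window of length n over cleaned text and tallies the results with Counter. The helper names char_ngrams and ngram_counts are hypothetical, chosen for illustration.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> list[str]:
    """Return every contiguous run of n characters, in order of appearance."""
    # A text shorter than n yields an empty list.
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_counts(text: str, n: int) -> Counter:
    """Map each character n-gram to the number of times it occurs."""
    return Counter(char_ngrams(text, n))


sample = "banana"
print(char_ngrams(sample, 2))         # ['ba', 'an', 'na', 'an', 'na']
print(ngram_counts(sample, 2)["an"])  # 2
```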

Visualizing N-gram Frequencies


The frequencies are plotted as histograms (bar charts).
By comparing these plots for different languages, you can see that the n-gram frequency
patterns differ visibly from one language to another.

Key Observations
Bigram frequencies (pairs of characters) are similar across topics in the same language but
differ significantly between languages.
Therefore, bigram frequency is a good feature for distinguishing languages, not topics.
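
One way to put a number on "similar" versus "different" is to compare two bigram count profiles with cosine similarity. The lab itself compares plots by eye; the measure and the function names below are assumptions added for illustration.

```python
from collections import Counter
from math import sqrt


def bigram_profile(text: str) -> Counter:
    """Count character bigrams in already-cleaned text."""
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors.

    Close to 1.0 means the same relative bigram usage; 0.0 means no shared bigrams.
    """
    # Counter returns 0 for bigrams that are missing from b.
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty profile has no direction
    return dot / (norm_a * norm_b)
```

If the observation above holds, two cleaned English articles should score higher against each other than an English article scores against a French one. The snippet only provides the measure; it does not demonstrate that result.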

Dimensionality Reduction
By using unigrams, a text of any length is reduced to 26 features (one count per letter a-z).
Using bigrams, you get 26×26 = 676 features (one count per possible pair of letters).
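
The sketch below shows how counts could be laid out as fixed-length vectors of 26 and 676 entries. It assumes the text has already been reduced to a-z; the name feature_vector and the alphabetical ordering of the entries are illustrative choices, not taken from the lab.

```python
from collections import Counter
from itertools import product
from string import ascii_lowercase

UNIGRAMS = list(ascii_lowercase)                                    # 26 entries
BIGRAMS = ["".join(p) for p in product(ascii_lowercase, repeat=2)]  # 676 entries


def feature_vector(text: str, n: int) -> list[int]:
    """Return counts for every possible unigram (n=1) or bigram (n=2)."""
    if n not in (1, 2):
        raise ValueError("this sketch only supports n=1 or n=2")
    vocab = UNIGRAMS if n == 1 else BIGRAMS
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    # N-grams that never occur get a 0, so the length is always fixed.
    # N-grams containing characters outside a-z are ignored.
    return [counts[gram] for gram in vocab]


print(len(feature_vector("hello", 1)))  # 26
print(len(feature_vector("hello", 2)))  # 676
```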

Further Exploration
Try using different languages, topics, or text sources.
Visualize trigrams (three-character sequences) or higher-order n-grams for more complex
patterns.

Part 2: Features from Images (Written Numbers)

Dataset Used: MNIST


The MNIST dataset contains images of handwritten digits (0-9), each as a 28×28 pixel
grayscale image.

Visualizing the Data


Images of the digits "1" and "0" are displayed to understand what the data looks like.

Simple Feature: Sum of Pixels


For each image, count the number of non-black (active) pixels.
This feature alone can distinguish between some digits (e.g., "1" has fewer active pixels
than "0").
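
A minimal sketch of this feature, written in plain Python on nested lists so it runs without extra packages. The 5×5 grids are invented stand-ins for 28×28 MNIST images, and the assumption is that black corresponds to a pixel value of 0.

```python
def count_active_pixels(image: list[list[int]]) -> int:
    """Count pixels whose intensity is above 0 (assumed to mean non-black)."""
    return sum(1 for row in image for value in row if value > 0)


# Toy 5x5 digits (values invented for illustration, not real MNIST data).
ONE = [
    [0, 0, 255, 0, 0],
    [0, 0, 255, 0, 0],
    [0, 0, 255, 0, 0],
    [0, 0, 255, 0, 0],
    [0, 0, 255, 0, 0],
]
ZERO = [
    [0, 255, 255, 255, 0],
    [0, 255, 0, 255, 0],
    [0, 255, 0, 255, 0],
    [0, 255, 0, 255, 0],
    [0, 255, 255, 255, 0],
]

print(count_active_pixels(ONE), count_active_pixels(ZERO))  # 5 12
```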

Advanced Feature: Counting "Hole" Pixels

A "hole pixel" is a black (background) pixel inside a region fully enclosed by non-black pixels, so it cannot be reached from the image border without crossing the digit.
The algorithm fills in the holes and counts how many such pixels exist.
This feature is especially useful for distinguishing "0" (which has a hole) from "1" (which
does not).
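
One possible way to find hole pixels is a flood fill from the image border: any background pixel the fill cannot reach must be enclosed by the digit. This is a sketch of that idea, not necessarily the lab's algorithm. The 4-connected neighbourhood, the function names, and the toy grid are assumptions.

```python
from collections import deque


def hole_mask(image):
    """Mark background pixels that are not connected to the image border."""
    rows, cols = len(image), len(image[0])
    outside = [[False] * cols for _ in range(rows)]
    queue = deque()
    # Seed the fill with every background pixel on the border.
    for r in range(rows):
        for c in range(cols):
            on_border = r in (0, rows - 1) or c in (0, cols - 1)
            if on_border and image[r][c] == 0:
                outside[r][c] = True
                queue.append((r, c))
    # Spread through 4-connected background pixels (assumed connectivity).
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and image[nr][nc] == 0 and not outside[nr][nc]):
                outside[nr][nc] = True
                queue.append((nr, nc))
    # Holes are background pixels the fill never reached.
    return [[image[r][c] == 0 and not outside[r][c] for c in range(cols)]
            for r in range(rows)]


def count_hole_pixels(image):
    return sum(cell for row in hole_mask(image) for cell in row)


# Toy 5x5 "0" (invented for illustration): a ring with a 3-pixel hole.
ZERO = [
    [0, 255, 255, 255, 0],
    [0, 255, 0, 255, 0],
    [0, 255, 0, 255, 0],
    [0, 255, 0, 255, 0],
    [0, 255, 255, 255, 0],
]
print(count_hole_pixels(ZERO))  # 3
```

If the ring is broken, for example by setting the top-middle pixel to 0, the fill leaks inside and the count drops to 0.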

Visualizing Hole Pixels


The lab displays side-by-side images of the original digit and the image showing only the
hole pixels.

Feature: Hull Pixels


The "hull" of an image is the digit with all holes filled in.
Counting the number of hull pixels provides another feature for classification.
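
Under the definition above, the hull can be computed as everything that is not "outside" background, where outside means background reachable from the image border. The sketch below follows that reading. The function names, the 4-connected fill, and the toy grids are assumptions rather than the lab's code.

```python
def hull_mask(image):
    """Mark every pixel that belongs to the digit or to a hole inside it."""
    rows, cols = len(image), len(image[0])
    outside = set()
    # Start from all background pixels on the image border.
    stack = [(r, c) for r in range(rows) for c in range(cols)
             if (r in (0, rows - 1) or c in (0, cols - 1)) and image[r][c] == 0]
    while stack:
        r, c = stack.pop()
        if (r, c) in outside:
            continue
        outside.add((r, c))
        # Visit 4-connected background neighbours (assumed connectivity).
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols and image[nr][nc] == 0:
                stack.append((nr, nc))
    return [[(r, c) not in outside for c in range(cols)] for r in range(rows)]


def count_hull_pixels(image):
    return sum(cell for row in hull_mask(image) for cell in row)


# Toy 5x5 digits (invented for illustration, not real MNIST data).
ZERO = [
    [0, 255, 255, 255, 0],
    [0, 255, 0, 255, 0],
    [0, 255, 0, 255, 0],
    [0, 255, 0, 255, 0],
    [0, 255, 255, 255, 0],
]
ONE = [[0, 0, 255, 0, 0] for _ in range(5)]

print(count_hull_pixels(ZERO), count_hull_pixels(ONE))  # 15 5
```

For the toy "0", the hull count (15) equals the 12 stroke pixels plus the 3 enclosed pixels; for the toy "1", which has no hole, it equals the stroke pixel count.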

Feature: Boundary Pixels


Boundary pixels are those on the edge of the digit (where the digit meets the background).
The algorithm finds these by comparing each pixel to its neighbors.
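
The neighbour comparison could look roughly like this. The sketch assumes a pixel counts as boundary when it is non-black and at least one of its four direct neighbours is background, with positions outside the image treated as background. The lab's exact rule may differ, and the names and toy grid are illustrative.

```python
OFFSETS = ((1, 0), (-1, 0), (0, 1), (0, -1))  # 4-connected neighbours


def boundary_mask(image):
    """Mark digit pixels that touch at least one background neighbour."""
    rows, cols = len(image), len(image[0])

    def is_background(r, c):
        # Positions outside the image are treated as background.
        return not (0 <= r < rows and 0 <= c < cols) or image[r][c] == 0

    return [[image[r][c] > 0
             and any(is_background(r + dr, c + dc) for dr, dc in OFFSETS)
             for c in range(cols)]
            for r in range(rows)]


def count_boundary_pixels(image):
    return sum(cell for row in boundary_mask(image) for cell in row)


# Toy 5x5 image (invented for illustration): a filled 3x3 square.
SQUARE = [
    [0, 0, 0, 0, 0],
    [0, 255, 255, 255, 0],
    [0, 255, 255, 255, 0],
    [0, 255, 255, 255, 0],
    [0, 0, 0, 0, 0],
]
print(count_boundary_pixels(SQUARE))  # 8
```

Only the centre pixel of the square is excluded, because all four of its neighbours belong to the shape.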

Visualizing Boundary Pixels


The lab shows images with boundary pixels highlighted, helping you see the outline of the
digit.

Summary Table: Features Extracted


Data Type | Feature Name      | What It Measures                    | Why It’s Useful
----------|-------------------|-------------------------------------|----------------------------------------
Text      | Unigram Count     | Frequency of each letter            | Captures language/character patterns
Text      | Bigram Count      | Frequency of letter pairs           | Captures language structure
Text      | Trigram Count     | Frequency of letter triplets        | Captures more complex patterns
Image     | Pixel Sum         | Number of active (non-black) pixels | Simple shape/size feature
Image     | Hole Pixel Count  | Number of enclosed "hole" pixels    | Distinguishes digits with/without holes
Image     | Hull Pixel Count  | Number of pixels in filled digit    | Measures overall digit area
Image     | Boundary Pixels   | Number of edge pixels               | Captures digit outline


Key Takeaways for Beginners
Feature extraction is the process of turning raw data (text or images) into numbers that a
computer can understand.
For text, features are often based on the frequency of letters or groups of letters (n-grams).
For images, features can be as simple as counting pixels or as complex as measuring holes
and boundaries.
Visualizing features helps you understand how well they separate different classes (e.g.,
different digits or languages).
The choice of features is crucial for the success of any machine learning model.

If you have any specific part of the code or concept you want to explore further, let me know!
