Module 1 Lab 1

The document introduces feature extraction from text and image data for machine learning. It covers methods for transforming raw text into numerical features using n-grams and discusses image features from the MNIST dataset, including pixel counts and boundary detection. Key takeaways emphasize the importance of feature selection in machine learning and the visualization of features for understanding data separation.

Uploaded by

katrao39798

Detailed Explanation of the PDF: "AIML Module 01 Lab 01 Features"

This lab notebook introduces the concept of feature extraction from data, focusing on two
types: text data and image data. The goal is to help you understand how to transform raw data
into features that machine learning algorithms can use. Below, I will explain every major part,
step by step, as if you are new to the field.

Part 1: Features of Text

Why Extract Features from Text?

Machine learning algorithms work with numbers, not raw text.
To use text as input, we must convert it into a set of features (numerical values that capture
important information about the text).

Downloading Text Data

The lab downloads Wikipedia articles on topics like "Giraffe," "Elephant," "Machine," and
"Artificial Intelligence" in English, French, and Hindi.
This is done using the wikipedia Python package.

Cleaning the Text

The text is converted to lowercase, and all special characters, digits, and whitespace are
removed, leaving only lowercase alphabetic characters.
This simplifies the data and makes texts in different languages directly comparable.
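
The cleaning step can be sketched as follows. This is a minimal illustration, not the lab's own code: the function name clean_text and the choice of an ASCII-only a-z filter are assumptions made here.

```python
import re

def clean_text(raw):
    """Lowercase the text and keep only the letters a-z."""
    lowered = raw.lower()
    # Drop everything that is not a lowercase ASCII letter:
    # digits, punctuation, whitespace, and accented or non-Latin characters.
    return re.sub(r"[^a-z]", "", lowered)

print(clean_text("The Giraffe (Giraffa) is 5.5 m tall!"))
```

Note that an ASCII-only filter like this one also drops accented French letters and would remove Devanagari text entirely, so a real pipeline for Hindi would need a different character set.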

What are N-grams?

N-grams are contiguous sequences of n items (characters or words) from a given text. This lab
uses characters:
Unigram: a single character (n=1)
Bigram: a pair of adjacent characters (n=2)
Trigram: a sequence of three adjacent characters (n=3)
N-grams help capture the structure and patterns in text.

Counting N-grams

The frequency of each n-gram is counted using Python's Counter class from the collections
module.
These counts are used as features representing the text.
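
A small sketch of character n-gram counting with Counter is shown here. The helper names char_ngrams and ngram_counts are hypothetical and chosen only for this example; the input is assumed to be already cleaned text.

```python
from collections import Counter

def char_ngrams(text, n):
    """Return every contiguous run of n characters in text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def ngram_counts(text, n):
    """Count how often each character n-gram occurs."""
    return Counter(char_ngrams(text, n))

sample = "banana"
print(ngram_counts(sample, 1))  # unigram counts
print(ngram_counts(sample, 2))  # bigram counts
```

A text of length L yields L - n + 1 n-grams, so a text shorter than n produces none.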

Visualizing N-gram Frequencies

The frequencies are plotted as bar charts.
Comparing these plots across languages shows that each language has its own characteristic
pattern of n-gram frequencies.

Key Observations

Bigram frequencies (pairs of characters) are similar across topics in the same language but
differ significantly between languages.
Therefore, bigram frequency is a good feature for distinguishing languages, not topics.

Dimensionality Reduction

Using unigram counts reduces a text of any length to 26 features (one for each letter).
Using bigram counts gives 26×26 = 676 features (one for each possible pair of letters).
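
To make the 676-feature idea concrete, the sketch below turns cleaned text into a fixed-length list of bigram counts. The names BIGRAMS and bigram_vector are assumptions for illustration, and the alphabetical ordering of the pairs is one arbitrary but consistent choice.

```python
from collections import Counter
from itertools import product
from string import ascii_lowercase

# All 26 * 26 = 676 possible letter pairs, in a fixed alphabetical order.
BIGRAMS = ["".join(pair) for pair in product(ascii_lowercase, repeat=2)]

def bigram_vector(cleaned_text):
    """Map cleaned text (letters a-z only) to a list of 676 bigram counts."""
    counts = Counter(cleaned_text[i:i + 2] for i in range(len(cleaned_text) - 1))
    # Counter returns 0 for pairs that never occur, so every slot is filled.
    return [counts[b] for b in BIGRAMS]

vec = bigram_vector("banana")
print(len(vec))  # 676
```

Because every text maps to a vector of the same length and ordering, vectors from different articles can be compared position by position.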

Further Exploration

Try using different languages, topics, or text sources.
Visualize trigrams (three-character sequences) or higher-order n-grams for more complex
patterns.

Part 2: Features from Images (Written Numbers)

Dataset Used: MNIST

The MNIST dataset contains images of handwritten digits (0-9), each as a 28×28 pixel
grayscale image.

Visualizing the Data

Images of the digits "1" and "0" are displayed to understand what the data looks like.

Simple Feature: Sum of Pixels

For each image, count the number of non-black (active) pixels.
This feature alone can distinguish between some digits (e.g., "1" has fewer active pixels
than "0").
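
The active-pixel count can be sketched in a few lines. This example uses plain nested lists and two hand-made 5×5 toy images instead of real MNIST data; the lab itself most likely works on array data, so treat the function name and inputs as assumptions.

```python
def count_active_pixels(image):
    """Count pixels whose intensity is above zero (non-black)."""
    return sum(1 for row in image for pixel in row if pixel > 0)

# Toy stand-ins for a "1" and a "0"; real MNIST images are 28x28.
one = [
    [0, 0, 255, 0, 0],
    [0, 0, 255, 0, 0],
    [0, 0, 255, 0, 0],
    [0, 0, 255, 0, 0],
    [0, 0, 255, 0, 0],
]
zero = [
    [0, 255, 255, 255, 0],
    [0, 255, 0, 255, 0],
    [0, 255, 0, 255, 0],
    [0, 255, 0, 255, 0],
    [0, 255, 255, 255, 0],
]
print(count_active_pixels(one), count_active_pixels(zero))  # 5 12
```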

Advanced Feature: Counting "Hole" Pixels

A "hole pixel" is a black (background) pixel that lies in a region completely enclosed by the
digit's non-black pixels.
The algorithm fills in the holes and counts how many such pixels exist.
This feature is especially useful for distinguishing "0" (which has a hole) from "1" (which
does not).
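
One common way to find enclosed pixels is a flood fill from the image border: any black pixel the fill cannot reach must be inside a hole. The sketch below takes that approach as an assumption; the notes do not say which method the lab uses, and the names hole_mask and count_hole_pixels are hypothetical.

```python
from collections import deque

def hole_mask(image):
    """Return a grid of booleans marking black pixels enclosed by the digit."""
    rows, cols = len(image), len(image[0])
    outside = [[False] * cols for _ in range(rows)]
    queue = deque()
    # Seed the flood fill with every black pixel on the image border.
    for r in range(rows):
        for c in range(cols):
            on_border = r in (0, rows - 1) or c in (0, cols - 1)
            if on_border and image[r][c] == 0:
                outside[r][c] = True
                queue.append((r, c))
    # Spread through 4-connected black pixels; anything reached is outside the digit.
    while queue:
        r, c = queue.popleft()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if not (0 <= nr < rows and 0 <= nc < cols):
                continue
            if image[nr][nc] == 0 and not outside[nr][nc]:
                outside[nr][nc] = True
                queue.append((nr, nc))
    # A hole pixel is black but was never reached from the border.
    return [[image[r][c] == 0 and not outside[r][c] for c in range(cols)]
            for r in range(rows)]

def count_hole_pixels(image):
    return sum(cell for row in hole_mask(image) for cell in row)

ring = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 0, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
c_shape = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
print(count_hole_pixels(ring), count_hole_pixels(c_shape))  # 1 0
```

The closed ring has one enclosed pixel, while the open "C" shape has none because its interior connects to the border.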

Visualizing Hole Pixels

The lab displays side-by-side images of the original digit and the image showing only the
hole pixels.

Feature: Hull Pixels

The "hull" of an image is the digit with all holes filled in.
Counting the number of hull pixels provides another feature for classification.
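
As a rough illustration of the hull count, the sketch below uses a simple heuristic: a black pixel is treated as filled if there is a digit pixel somewhere to its left, right, above, and below. This heuristic is an assumption made for the example, not necessarily the lab's method, and it can over-fill concave shapes that are not truly closed. The names hull_mask and count_hull_pixels are hypothetical.

```python
def hull_mask(image):
    """Mark digit pixels plus black pixels that have digit pixels on all four sides."""
    rows, cols = len(image), len(image[0])
    mask = [[False] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if image[r][c] > 0:
                mask[r][c] = True  # digit pixels always belong to the hull
                continue
            left = any(image[r][k] > 0 for k in range(c))
            right = any(image[r][k] > 0 for k in range(c + 1, cols))
            above = any(image[k][c] > 0 for k in range(r))
            below = any(image[k][c] > 0 for k in range(r + 1, rows))
            mask[r][c] = left and right and above and below
    return mask

def count_hull_pixels(image):
    return sum(cell for row in hull_mask(image) for cell in row)

ring = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 0, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
print(count_hull_pixels(ring))  # 9: eight ring pixels plus the filled center
```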

Feature: Boundary Pixels

Boundary pixels are those on the edge of the digit (where the digit meets the background).
The algorithm finds these by comparing each pixel to its neighbors.
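
The neighbor comparison can be sketched like this. The example assumes a 4-neighborhood (up, down, left, right) and treats the area beyond the image edge as background; both choices, and the name boundary_mask, are assumptions for illustration.

```python
def boundary_mask(image):
    """Mark digit pixels that touch a black pixel or the image edge."""
    rows, cols = len(image), len(image[0])
    mask = [[False] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if image[r][c] == 0:
                continue  # background pixels are never boundary pixels here
            for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                outside_image = not (0 <= nr < rows and 0 <= nc < cols)
                if outside_image or image[nr][nc] == 0:
                    mask[r][c] = True
                    break
    return mask

# A solid 3x3 block: every pixel except the center touches the background.
block = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
mask = boundary_mask(block)
print(sum(cell for row in mask for cell in row))  # 8
```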

Visualizing Boundary Pixels

The lab shows images with boundary pixels highlighted, helping you see the outline of the
digit.

Summary Table: Features Extracted

| Data Type | Feature Name | What It Measures | Why It's Useful |
|---|---|---|---|
| Text | Unigram Count | Frequency of each letter | Captures language/character patterns |
| Text | Bigram Count | Frequency of letter pairs | Captures language structure |
| Text | Trigram Count | Frequency of letter triplets | Captures more complex patterns |
| Image | Pixel Sum | Number of active (non-black) pixels | Simple shape/size feature |
| Image | Hole Pixel Count | Number of enclosed "hole" pixels | Distinguishes digits with/without holes |
| Image | Hull Pixel Count | Number of pixels in filled digit | Measures overall digit area |
| Image | Boundary Pixels | Number of edge pixels | Captures digit outline |

Key Takeaways for Beginners

Feature extraction is the process of turning raw data (text or images) into numbers that a
computer can understand.
For text, features are often based on the frequency of letters or groups of letters (n-grams).
For images, features can be as simple as counting pixels or as complex as measuring holes
and boundaries.
Visualizing features helps you understand how well they separate different classes (e.g.,
different digits or languages).
The choice of features is crucial for the success of any machine learning model.

If you have any specific part of the code or concept you want to explore further, let me know!