Ngram Experiment 3


Department of Computer Science & Engineering (AI&ML)

BE SEM: VII    AY: 2024-25

Subject: Natural Language Processing Lab

Aim: Implementation of: (i) Display BoW of an input text (ii) Display N-Gram of an input text

Theory:

Machine learning algorithms cannot work with raw text directly; the text must first be converted into
vectors of numbers. This process is called feature extraction.

The bag-of-words model is a popular and simple feature extraction technique used when we
work with text. It describes the occurrence of each word within a document.

To use this model, we need to:

• Design a vocabulary of known words (also called tokens)

• Choose a measure of the presence of known words

Any information about the order or structure of words is discarded; that is why it is called a
bag of words. The model only captures whether a known word occurs in a document, not where
in the document it occurs.
The intuition is that similar documents have similar content, and that from the content alone we
can learn something about the meaning of a document.
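
As a minimal sketch of this idea (pure Python; the two sample sentences are only illustrative), a bag-of-words representation of an input text can be displayed by tokenizing each document and counting word occurrences against a shared vocabulary:

from collections import Counter

# Illustrative corpus; any input text works the same way.
documents = [
    "It was the best of times",
    "it was the worst of times",
]

# Build the vocabulary of known words (case is ignored; word order is discarded).
vocabulary = sorted({word for doc in documents for word in doc.lower().split()})
print("Vocabulary:", vocabulary)

for doc in documents:
    counts = Counter(doc.lower().split())
    # The BoW vector records, for each vocabulary word, how often it occurs.
    print(doc, "->", [counts[word] for word in vocabulary])

Note that the two vectors differ only at "best" and "worst": similar documents produce similar vectors, which is exactly the intuition above.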

Because the vocabulary can grow very large, there is pressure to decrease its size when using a
bag-of-words model. There are simple text cleaning techniques that can be used as a first step
(a short cleaning sketch follows the list), such as:

• Ignoring case

• Ignoring punctuation



• Ignoring frequent words that don’t contain much information, called stop words, like “a,”
“of,” etc.
• Fixing misspelled words.

• Reducing words to their stem (e.g. “play” from “playing”) using stemming algorithms.
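
A minimal cleaning sketch covering these steps (assuming NLTK is installed for its PorterStemmer; the inline stop list and the sample sentence are only illustrative):

import string
from nltk.stem import PorterStemmer  # assumed available; any stemmer would do

# Tiny illustrative stop list; a real pipeline would use a fuller one,
# e.g. nltk.corpus.stopwords (which requires a one-time download).
STOP_WORDS = {"a", "an", "the", "of", "is", "was", "it"}

stemmer = PorterStemmer()

def clean_tokens(text):
    text = text.lower()                                               # ignore case
    text = text.translate(str.maketrans("", "", string.punctuation))  # ignore punctuation
    # Drop stop words and reduce the remaining words to their stems.
    return [stemmer.stem(tok) for tok in text.split() if tok not in STOP_WORDS]

print(clean_tokens("The boys were playing, and the dogs barked!"))
# ['boy', 'were', 'play', 'and', 'dog', 'bark']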

A more sophisticated approach is to create a vocabulary of grouped words. This both changes
the scope of the vocabulary and allows the bag-of-words model to capture a little more meaning
from the document.

In this approach, each word or token is called a “gram”. Creating a vocabulary of two-word
pairs is, in turn, called a bigram model. Again, only the bigrams that appear in the corpus are
modeled, not all possible bigrams.

An N-gram is an N-token sequence of words: a 2-gram (more commonly called a bigram) is a
two-word sequence of words like “please turn”, “turn your”, or “your homework”, and a 3-gram
(more commonly called a trigram) is a three-word sequence of words like “please turn your” or
“turn your homework”.
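
Displaying the N-grams of an input text reduces to sliding a window of size N over the token list. A short sketch (pure Python, reusing the example phrase above):

def ngrams(tokens, n):
    # Every contiguous run of n tokens is one n-gram.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "please turn your homework".split()
print("Bigrams: ", ngrams(tokens, 2))
# [('please', 'turn'), ('turn', 'your'), ('your', 'homework')]
print("Trigrams:", ngrams(tokens, 3))
# [('please', 'turn', 'your'), ('turn', 'your', 'homework')]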

Once a vocabulary has been chosen, the occurrence of words in example documents needs to
be scored.

Some simple scoring methods include:

• Counts. Count the number of times each word appears in a document.

• Frequencies. Calculate the frequency with which each word appears in a document out of all
the words in the document (see the scoring sketch below).
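
A sketch of both scoring methods (assuming scikit-learn is installed; the two sample documents are only illustrative):

from sklearn.feature_extraction.text import CountVectorizer  # assumed installed

documents = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

vectorizer = CountVectorizer()
count_matrix = vectorizer.fit_transform(documents)  # raw counts per document
vocab = vectorizer.get_feature_names_out()

for doc, row in zip(documents, count_matrix.toarray()):
    print(doc)
    print("  counts:     ", dict(zip(vocab, row)))
    # Frequencies: each count divided by the total words scored in the document.
    print("  frequencies:", {w: round(c / row.sum(), 2) for w, c in zip(vocab, row)})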

Limitations of Bag-of-Words
The bag-of-words model is very simple to understand and implement and offers a lot of
flexibility for customization on your specific text data.

It has been used with great success on prediction problems like language modeling and
document classification.

Nevertheless, it suffers from some shortcomings, such as:



Vocabulary: The vocabulary requires careful design, specifically to manage its size, which
impacts the sparsity of the document representations.

Sparsity: Sparse representations are harder to model both for computational reasons (space and
time complexity) and also for information reasons, where the challenge is for the models to
harness so little information in such a large representational space.

Examples of Use Cases

1. Autocomplete and Spell Checkers: By predicting the next word in a sequence or suggesting
corrections.
2. Speech Recognition: To predict the most likely sequence of words from a sequence of
sounds.
3. Machine Translation: To find the most probable sequence of words in the target
language given a sequence of words in the source language.
4. Text Generation: In chatbots or content generation systems to produce coherent text.

Overall, N-gram models provide a balance between simplicity and effectiveness, making them a
valuable tool in the NLP toolkit, especially for tasks involving local context and manageable data
sizes.

Conclusion:
The bag-of-words model is a popular and simple feature extraction technique used when we
work with text; it describes the occurrence of each word within a document. Statistical language
models, in essence, are models that assign probabilities to sequences of words. In this practical,
we implemented the simplest such model, the n-gram, which assigns probabilities to sentences
and sequences of words. Often a simple bigram approach works better than a unigram (1-gram)
bag-of-words model for tasks like document classification.

