Week 7 - Show in Class - Text Processing

Text Pre-processing and TF-IDF: Foundations of Text Analysis

Text Pre-processing: Preparing Text for Analysis

In the field of AI and data analytics, we often encounter data in the form of
unstructured text. To effectively analyze this text using computational
methods, we need to transform it into a structured format that machines
can understand. This process is called text pre-processing.

Why is Text Pre-processing Necessary?

- Unstructured Data: Raw text is often messy and lacks a defined structure. It may contain various inconsistencies, irrelevant information, and formatting that can hinder analysis.

- Numerical Input for AI: Most AI and machine learning models require numerical input. Text data, being symbolic, needs to be converted into a numerical representation.

Common Text Pre-processing Steps

1. Tokenization: Breaking Down Text

- Tokenization is the process of splitting text into smaller units called tokens.

- Tokens can be words, subwords, or characters.

- This step converts a continuous string of text into discrete elements.

- For example, the sentence "Welcome to the world of AI!" can be tokenized into the following list of tokens: ["Welcome", "to", "the", "world", "of", "AI", "!"]

- Python libraries like NLTK provide tools for tokenization; a short sketch follows.
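A minimal sketch with NLTK (one assumption here: the "punkt" tokenizer data has already been fetched once with nltk.download("punkt")):

    # Tokenize a sentence into word-level tokens with NLTK.
    from nltk.tokenize import word_tokenize

    text = "Welcome to the world of AI!"
    tokens = word_tokenize(text)
    print(tokens)  # ['Welcome', 'to', 'the', 'world', 'of', 'AI', '!']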

2. Cleaning: Making Text Consistent

- Cleaning involves removing or standardizing irrelevant information to reduce noise and improve data consistency.

- Common cleaning operations include:

  - Removing punctuation (!, ?, ., etc.)

  - Removing special characters (#, @, *, etc.)

  - Converting text to lowercase (to treat "The" and "the" the same)

  - Removing numbers (if not relevant to the analysis)

  - Handling abbreviations and contractions (e.g., "Dr." to "Doctor", "it's" to "it is")

  - Removing extra whitespace

- For example, the input "Welcome to the world of AI!!! It's amazing, isn't it?" can be cleaned to: "welcome to the world of ai it is amazing isnt it". (Notice that "isn't" simply lost its apostrophe rather than being expanded to "is not": the order in which cleaning steps are applied matters.)
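A rough sketch of these operations using Python's re module; the exact rules, and their order, depend on the task:

    # Clean a sentence: lowercase, expand one contraction,
    # strip punctuation/digits, collapse whitespace.
    import re

    text = "Welcome to the world of AI!!! It's amazing, isn't it?"
    text = text.lower()
    text = text.replace("it's", "it is")      # illustrative only; real code would use a contraction map
    text = re.sub(r"[^a-z\s]", "", text)      # drop punctuation, digits, special characters
    text = re.sub(r"\s+", " ", text).strip()  # collapse extra whitespace
    print(text)  # welcome to the world of ai it is amazing isnt it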

3. Stop Word Removal: Filtering Out Commonplace Words

- Stop words are common words that appear frequently in a language but carry little meaningful information for many text analysis tasks.

- Examples of stop words in English include "the", "is", "a", "and", "in", "to", "I", and "you".

- Removing stop words can help focus on the more important terms in a text.

- For example, the sentence "The quick brown fox jumps over the lazy dog" becomes "quick brown fox jumps lazy dog" after stop word removal.

- NLTK provides lists of stop words for various languages; a short sketch follows.
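A minimal sketch using NLTK's English stop word list (assuming the "stopwords" and "punkt" data have been downloaded once):

    # Filter stop words out of a tokenized sentence.
    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize

    stop_words = set(stopwords.words("english"))
    tokens = word_tokenize("The quick brown fox jumps over the lazy dog")
    filtered = [t for t in tokens if t.lower() not in stop_words]
    print(filtered)  # ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']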

4. Stemming: Reducing Words to Their Roots

- Stemming reduces words to their root or base form by removing suffixes.

- It is a simpler and faster approach than lemmatization.

- For example, with the Porter stemmer:

  - "running", "runs" -> "run"

  - "easily" -> "easili", "easy" -> "easi"

- Because stemmers only strip surface suffixes, irregular forms such as "ran" are left unchanged; mapping "ran" to "run" requires lemmatization.

- Note that stemming does not always produce a valid word. For example, both "university" and "universe" might be stemmed to "univers".
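A short sketch with NLTK's Porter stemmer:

    # Stem a few words with the Porter stemmer; note that the
    # outputs are not always valid words.
    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    for word in ["running", "runs", "easily", "easy", "university", "universe"]:
        print(word, "->", stemmer.stem(word))
    # running -> run, runs -> run, easily -> easili, easy -> easi,
    # university -> univers, universe -> univers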

5. Lemmatization: Finding the Dictionary Form

- Lemmatization reduces words to their base or dictionary form, called the lemma.

- It is more sophisticated than stemming because it considers the word's meaning and context; in practice, lemmatizers often need the word's part of speech to pick the correct lemma.

- Lemmatization ensures that the resulting word is a valid word.

- For example:

  - "better", "best" -> "good"

  - "went" -> "go"

  - "are", "is", "was" -> "be"

- Lemmatization is generally more computationally expensive than stemming.
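A short sketch with NLTK's WordNet lemmatizer (assuming the "wordnet" data has been downloaded once; the pos argument supplies the part of speech, "a" for adjective and "v" for verb):

    # Lemmatize a few words to their dictionary forms.
    from nltk.stem import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()
    print(lemmatizer.lemmatize("better", pos="a"))  # good
    print(lemmatizer.lemmatize("went", pos="v"))    # go
    print(lemmatizer.lemmatize("was", pos="v"))     # be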

Text Analysis: Weighing Word Importance with TF-IDF

Once the text has been pre-processed, we can begin to analyze its
content. A common technique for this is TF-IDF, which helps us
understand the importance of words within a document relative to a
collection of documents.

What is TF-IDF?

TF-IDF stands for Term Frequency-Inverse Document Frequency. It is a statistical measure that assigns a score to each word in a document based on its importance.

- Term Frequency (TF): Measures how often a word appears in a specific document. The more times a word appears in a document, the more relevant it is to the document's content.

- Inverse Document Frequency (IDF): Measures how rare a word is across a collection of documents (corpus). Words that appear in many documents are less informative than words that appear in only a few.

The TF-IDF score for a term t in a document d is calculated by multiplying the two components:

TF-IDF(t, d) = TF(t, d) * IDF(t)

A high TF-IDF score indicates that a word is frequent in a given document but rare across the corpus, suggesting that it is an important word for understanding the document's content.
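A hand-rolled sketch for a toy corpus, assuming the common textbook definitions TF(t, d) = count(t, d) / len(d) and IDF(t) = log(N / df(t)); libraries such as scikit-learn use smoothed variants, so their exact scores differ:

    # Compute TF-IDF by hand for a three-document toy corpus.
    import math

    corpus = [
        "the cat sat on the mat".split(),
        "the dog sat on the log".split(),
        "cats and dogs are pets".split(),
    ]
    N = len(corpus)

    def tf(term, doc):
        return doc.count(term) / len(doc)

    def idf(term):
        df = sum(1 for doc in corpus if term in doc)  # documents containing the term
        return math.log(N / df)

    doc = corpus[0]
    for term in ["the", "cat"]:
        print(term, round(tf(term, doc) * idf(term), 3))
    # "the" appears in 2 of the 3 documents, so it scores low (0.135);
    # "cat" appears in only 1 document, so it scores higher (0.183).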

Why is TF-IDF Useful?

- Identifies Important Words: TF-IDF helps to highlight the words that are most characteristic of a document.

- Filters Out Common Words: It downweights the importance of common words (like "the", "is", "and") that appear frequently in all documents and thus provide little discriminatory power.

- Applications: TF-IDF is widely used in various applications, including:

  - Information Retrieval: Ranking search results based on their relevance to a query.

  - Text Classification: Categorizing documents into different groups or topics.

  - Keyword Extraction: Identifying the most important words or phrases in a document.
