
Language Engineering

Prepared by: Abdelrahman M. Safwat

Section (5) – Mini Project


Idea

 We want to create a Python program that acts as a small search engine.


 It should accept a search term and return which documents contain that
term.
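As a minimal sketch of the behaviour we're after (the three sample strings below are stand-ins for the .txt files we'll load later):

```python
# A toy inverted index: map each word to the numbers of the documents
# that contain it. The sample strings stand in for real files.
docs = ["python is fun", "search engines index words", "python search"]

index = {}
for i, doc in enumerate(docs, start=1):
    for word in doc.split():
        index.setdefault(word, []).append(i)

print(index["python"])  # [1, 3]
print(index["search"])  # [2, 3]
```

The rest of the section builds this up step by step, with tokenization and lemmatization added in between.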

Extracting all text from text files

 First, we need to extract all text from the text files in the directory
we’re in.
import os

# collect the contents of every .txt file in the current directory
dir_list = sorted(os.listdir("."))
text_list = []

for file in dir_list:
  if file.endswith(".txt"):
    with open(file) as f:
      text_list.append(f.read())
Extracting all text from text files

 Once we’ve extracted all the text, we’ll need to tokenize it and make
a list of all the words we have, so that we can start storing which
documents each word occurs in.

Extracting all text from text files

 Since we’re making a search engine, we can’t just extract the text and
make a list of words. Many of those words will be different forms
of the same word (e.g. go, going, went).
 That’s why we need to lemmatize all the words.

Lemmatizing words

 Lemmatizing words requires that we get the POS tag of each word,
and then pass the word and its tag to the lemmatizer.

Lemmatizing words

import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

# map Penn Treebank POS tags (from nltk.pos_tag) to WordNet POS tags
def nltk_tag_to_wordnet_tag(nltk_tag):
    if nltk_tag.startswith('J'):
        return wordnet.ADJ
    elif nltk_tag.startswith('V'):
        return wordnet.VERB
    elif nltk_tag.startswith('N'):
        return wordnet.NOUN
    elif nltk_tag.startswith('R'):
        return wordnet.ADV
    else:
        return None
Lemmatizing words

def lemmatize_sentence(sentence):
    lemmatizer = WordNetLemmatizer()
    # tag each token, then convert the tags to WordNet form
    nltk_tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    wordnet_tagged = map(lambda x: (x[0], nltk_tag_to_wordnet_tag(x[1])), nltk_tagged)
    lemmatized_sentence = []
    for word, tag in wordnet_tagged:
        if tag is None:
            # no usable tag: keep the word as-is
            lemmatized_sentence.append(word)
        else:
            lemmatized_sentence.append(lemmatizer.lemmatize(word, tag))
    return " ".join(lemmatized_sentence)

Lemmatizing words

new_text_list = []

for text in text_list:
  new_text_list.append(lemmatize_sentence(text))

Tokenizing all text

 Next, we need to create a list of all words.


 As word_tokenize() returns a list, this will create a list of lists, and that’s
why we’ll use itertools’ chain() function to unpack the list of lists into one list.
 As the list will contain duplicate words, we’ll turn the list into a set, which
removes all duplicates.

from itertools import chain

all_words = []

for text in new_text_list:
  all_words.append(nltk.word_tokenize(text))

all_words = list(chain(*all_words))
all_words_set = set(all_words)
Creating word index

 Next, we’ll loop through all the texts and check whether the word exists
in each text; if so, we store the index of that text.

words_index = {}

counter = 1

for word in all_words_set:
  words_index[word] = []
  for text in new_text_list:
    if word in text:
      words_index[word].append(counter)
    counter += 1
  counter = 1
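One caveat worth noting: `word in text` is a substring test, so "art" would also match "start". A stricter variant (a sketch, with made-up sample texts standing in for new_text_list) checks membership in the token list instead:

```python
new_text_list = ["the art of go", "start the engine"]

words_index = {}
for word in ["art", "start"]:
    words_index[word] = []
    for counter, text in enumerate(new_text_list, start=1):
        # split() gives whole words, so "art" no longer matches "start"
        if word in text.split():
            words_index[word].append(counter)

print(words_index)  # {'art': [1], 'start': [2]}
```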
Testing our search engine

search = "from"

words_index[search]
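Note that words_index[search] raises a KeyError if the term was never indexed; .get() with a default is a safer lookup (the small dict below is a stand-in for the real index):

```python
# Stand-in for the index built earlier.
words_index = {"from": [1, 3]}

search = "from"
print(words_index.get(search, []))     # [1, 3]
print(words_index.get("missing", []))  # [] instead of a KeyError
```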

Graphing how many times a word occurs
in all the texts

 We stored the per-document occurrence counts of the word in list_Frequency.
 We used matplotlib.pyplot to show list_Frequency in a graph.
 scatter() takes two parameters (x-axis, y-axis) and shows the graph as points.
 bar() shows the graph as bars.
 We also showed the size of list_Frequency.
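The bullet points above can be sketched as follows. list_Frequency and new_text_list are the names used in these slides; the sample texts are stand-ins, and the matplotlib calls are commented out so the sketch runs headless:

```python
# Count how many times a word occurs in each document.
new_text_list = ["the cat see the dog", "the dog run", "a bird sing"]

word = "the"
list_Frequency = [text.split().count(word) for text in new_text_list]
print(list_Frequency)  # [2, 1, 0]

# Plotting with matplotlib:
# import matplotlib.pyplot as plt
# x = range(1, len(list_Frequency) + 1)  # size of list_Frequency on the x-axis
# plt.scatter(x, list_Frequency)         # graph as points
# plt.bar(x, list_Frequency)             # graph as bars
# plt.show()
```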
Graphing how many times a word occurs
in all the texts

[graph shown on this slide]
Try with me!

 Use the mini project to show a graph of how many times a word
occurs in all the texts.

Task #1

 Use the mini project to show the closest word in all the texts if the
word is not found.
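One possible approach (the slides leave the method open): the standard library's difflib can suggest the closest matches from the vocabulary. The small word set below is a stand-in for all_words_set:

```python
import difflib

# Stand-in vocabulary; in the mini project this would be all_words_set.
all_words_set = {"go", "going", "engine", "search", "python"}

search = "serch"
if search not in all_words_set:
    # up to 3 vocabulary words most similar to the misspelled term
    suggestions = difflib.get_close_matches(search, all_words_set, n=3)
    print(suggestions)  # ['search']
```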

Task #2

 Use the mini project to show a list of words that have the same meaning.

Try it out yourself

 Code:
https://colab.research.google.com/drive/17g3Co7YGZuhufZPZ99rCR
SH_84zpUNLo?usp=sharing

Thank you for your attention!

References

 https://medium.com/@gaurav5430/using-nltk-for-lemmatizing-sentences-c1bfff963258

