
Language Engineering

Prepared by: Abdelrahman M. Safwat

Section (5) – Mini Project


Idea

 We want to create a Python program that acts as a small search engine.


 It should accept a search term and return which documents contain that
term.
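As a minimal sketch of the behaviour we're after (the three sample strings below are stand-ins for the .txt files we'll load later):

```python
# A toy inverted index: map each word to the numbers of the documents
# that contain it. The sample strings stand in for real files.
docs = ["python is fun", "search engines index words", "python search"]

index = {}
for i, doc in enumerate(docs, start=1):
    for word in doc.split():
        index.setdefault(word, []).append(i)

print(index["python"])  # [1, 3]
print(index["search"])  # [2, 3]
```

The rest of the section builds this up step by step, with tokenization and lemmatization added in between.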

Extracting all text from text files

 First, we need to extract all text from the text files in the directory
we’re in.
import os

# collect the contents of every .txt file in the current directory
dir_list = sorted(os.listdir("."))
text_list = []

for file in dir_list:
  if file.endswith(".txt"):
    with open(file) as f:
      text_list.append(f.read())
Extracting all text from text files

 Once we’ve extracted all the text, we’ll need to tokenize it and make
a list of all the words we have, so that we can start storing which
documents each word occurs in.

Extracting all text from text files

 Since we’re making a search engine, we can’t just extract the text and
make a list of words. Many of those words will be different forms
of the same word (e.g. go, going, went).
 That’s why we need to lemmatize all the words.

Lemmatizing words

 Lemmatizing words requires that we get the POS tag of each word,
and then pass the word and its tag to the lemmatizer.

Lemmatizing words

import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

# map Penn Treebank POS tags (from nltk.pos_tag) to WordNet POS tags
def nltk_tag_to_wordnet_tag(nltk_tag):
    if nltk_tag.startswith('J'):
        return wordnet.ADJ
    elif nltk_tag.startswith('V'):
        return wordnet.VERB
    elif nltk_tag.startswith('N'):
        return wordnet.NOUN
    elif nltk_tag.startswith('R'):
        return wordnet.ADV
    else:
        return None
Lemmatizing words

def lemmatize_sentence(sentence):
    lemmatizer = WordNetLemmatizer()
    # tag each token, then convert the tags to WordNet form
    nltk_tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    wordnet_tagged = map(lambda x: (x[0], nltk_tag_to_wordnet_tag(x[1])), nltk_tagged)
    lemmatized_sentence = []
    for word, tag in wordnet_tagged:
        if tag is None:
            # no usable tag: keep the word as-is
            lemmatized_sentence.append(word)
        else:
            lemmatized_sentence.append(lemmatizer.lemmatize(word, tag))
    return " ".join(lemmatized_sentence)

Lemmatizing words

new_text_list = []

for text in text_list:
  new_text_list.append(lemmatize_sentence(text))

Tokenizing all text

 Next, we need to create a list of all words.


 As word_tokenize() returns a list, this will create a list of lists, and that’s
why we’ll use itertools’ chain() function to unpack the list of lists into one list.
 As the list will contain duplicate words, we’ll turn the list into a set, which
removes all duplicates.

from itertools import chain

all_words = []

for text in new_text_list:
  all_words.append(nltk.word_tokenize(text))

all_words = list(chain(*all_words))
all_words_set = set(all_words)
Creating word index

 Next, we’ll loop through all the texts and check whether the word exists
in each text; if so, we store the index of that text.

words_index = {}

counter = 1

for word in all_words_set:
  words_index[word] = []
  for text in new_text_list:
    if word in text:
      words_index[word].append(counter)
    counter += 1
  counter = 1
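One caveat worth noting: `word in text` is a substring test, so "art" would also match "start". A stricter variant (a sketch, with made-up sample texts standing in for new_text_list) checks membership in the token list instead:

```python
new_text_list = ["the art of go", "start the engine"]

words_index = {}
for word in ["art", "start"]:
    words_index[word] = []
    for counter, text in enumerate(new_text_list, start=1):
        # split() gives whole words, so "art" no longer matches "start"
        if word in text.split():
            words_index[word].append(counter)

print(words_index)  # {'art': [1], 'start': [2]}
```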
Testing our search engine

search = "from"

words_index[search]
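Note that words_index[search] raises a KeyError if the term was never indexed; .get() with a default is a safer lookup (the small dict below is a stand-in for the real index):

```python
# Stand-in for the index built earlier.
words_index = {"from": [1, 3]}

search = "from"
print(words_index.get(search, []))     # [1, 3]
print(words_index.get("missing", []))  # [] instead of a KeyError
```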

Graphing how many times a word occurs
in all the texts

 We stored the per-document occurrence counts of the word in list_Frequency.
 We used matplotlib.pyplot to show list_Frequency in a graph.
 scatter() takes two parameters (x-axis, y-axis) and shows the graph as points.
 bar() shows the graph as bars.
 We also showed the size of list_Frequency.
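The bullet points above can be sketched as follows. list_Frequency and new_text_list are the names used in these slides; the sample texts are stand-ins, and the matplotlib calls are commented out so the sketch runs headless:

```python
# Count how many times a word occurs in each document.
new_text_list = ["the cat see the dog", "the dog run", "a bird sing"]

word = "the"
list_Frequency = [text.split().count(word) for text in new_text_list]
print(list_Frequency)  # [2, 1, 0]

# Plotting with matplotlib:
# import matplotlib.pyplot as plt
# x = range(1, len(list_Frequency) + 1)  # size of list_Frequency on the x-axis
# plt.scatter(x, list_Frequency)         # graph as points
# plt.bar(x, list_Frequency)             # graph as bars
# plt.show()
```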
Graphing how many times a word occurs
in all the texts

[graph shown on this slide]
Try with me!

 Use the mini project to show a graph of how many times a word
occurs in all the texts.

Task #1

 Use the mini project to show the closest word in all the texts if the
word is not found.
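One possible approach (the slides leave the method open): the standard library's difflib can suggest the closest matches from the vocabulary. The small word set below is a stand-in for all_words_set:

```python
import difflib

# Stand-in vocabulary; in the mini project this would be all_words_set.
all_words_set = {"go", "going", "engine", "search", "python"}

search = "serch"
if search not in all_words_set:
    # up to 3 vocabulary words most similar to the misspelled term
    suggestions = difflib.get_close_matches(search, all_words_set, n=3)
    print(suggestions)  # ['search']
```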

Task #2

 Use the mini project to show a list of words that have the same meaning.

Try it out yourself

 Code:
https://colab.research.google.com/drive/17g3Co7YGZuhufZPZ99rCR
SH_84zpUNLo?usp=sharing

Thank you for your attention!

References

 https://medium.com/@gaurav5430/using-nltk-for-lemmatizing-sentences-c1bfff963258

