NLP Lab Manual

Lab manual for svce students

Vidya Vikas Education Trust's
Universal College of Engineering, Kaman Road, Vasai-401208
Accredited B+ Grade by NAAC
Department of Computer Engineering

LAB Manual
CSL803: Computational Lab-II: Natural Language Processing
Sem: VII    AY: 2021-2022
Mrs. Vishakha Shelke

Name: Jinay Likhinesh Shah    Roll No.: 122    Div: B    Class: BE Computer

List of Experiments
1. Preprocessing of text (Tokenization, Filtration, Script Validation, Stop Word Removal, Stemming)
2. Morphological Analysis
3. N-gram model
4. POS tagging
5. Chunking
6. Named Entity Recognition
7. Virtual Lab on Word Generator
8. Mini Project based on NLP Application

Name: Jinay    Roll No.: 122    Div: B    Class: BE    Year: 2021 - 2022

Experiment No. 1

Aim: To study preprocessing of text (Tokenization, Filtration, Script Validation, Stop Word Removal, Stemming).

Theory:

To preprocess text means to bring it into a form that is predictable and analyzable for your task, where a task is a combination of approach and domain. Machine learning needs data in numeric form, so encoding techniques (Bag of Words, bi-gram/n-gram, TF-IDF, Word2Vec) are used to encode text into numeric vectors. Before encoding, the text data must first be cleaned; this process of preparing (or cleaning) text data before encoding is called text preprocessing, and it is the very first step in solving NLP problems.

Tokenization: Tokenization is about splitting strings of text into smaller pieces, or "tokens". Paragraphs can be tokenized into sentences, and sentences can be tokenized into words.
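As a minimal illustration of the idea (a sketch written for this manual, not the lab's own code), a naive word tokenizer can be built in pure Python with a regular expression; real projects would typically use a library tokenizer such as NLTK's `word_tokenize` instead:

```python
import re

def tokenize(text):
    # Split into word tokens and standalone punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)

sentence = "Paragraphs can be tokenized into sentences, and sentences into words."
print(tokenize(sentence))
```

Note how punctuation becomes its own token rather than sticking to the preceding word, which matters for the word counts and filtering steps described later.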
Filtration: If we are doing simple word counts, or trying to visualize our text with a word cloud, stopwords are among the most frequently occurring words but don't really tell us anything; we are often better off tossing them out of the text. By checking the Filter Stopwords option in a text pre-processing tool, these words can be filtered out automatically.

Script Validation: The script of the input text must be validated properly, i.e., checked so that the characters belong to the expected writing system before further processing.

Stemming: Stemming is the process of reducing inflected words (e.g. troubled, troubles) to their root form (e.g. trouble). The "root" in this case may not be a real root word, but just a canonical form of the original word. Stemming uses a crude heuristic process that chops off the ends of words in the hope of correctly transforming them into their root form. So the words "trouble", "troubled" and "troubles" might actually be converted to "troubl" instead of "trouble", because the ends were just chopped off (ugh, how crude!). There are different algorithms for stemming. The most common algorithm, which is also known to be empirically effective for English, is Porter's algorithm. Here is an example of stemming in action with the Porter stemmer:

    original_word    stemmed_word
0   connect          connect
1   connected        connect
2   connection       connect
3   connections      connect
4   connects         connect

Stopword Removal: Stop words are a set of commonly used words in a language. Examples of stop words in English are "a", "the", "is" and "are". The intuition behind using stop word lists is that, by removing low-information words from text, we can focus on the important words instead. For example, in the context of a search system, if your search query is "what is text preprocessing?", you want the search system to focus on surfacing documents that talk about text preprocessing over documents that talk about "what is".
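The "chop off the ends" heuristic can be illustrated with a toy suffix-stripping stemmer (a deliberate simplification written for this manual, not the actual Porter algorithm), which reproduces the connect table above:

```python
def crude_stem(word):
    # Strip common inflectional suffixes, longest first: a crude
    # approximation of what a real stemmer like Porter's does.
    for suffix in ("ions", "ion", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

for w in ["connect", "connected", "connection", "connections", "connects"]:
    print(w, "->", crude_stem(w))
```

Because the rules only chop characters, a word like "troubling" would come out as "troubl", exactly the kind of non-word canonical form discussed above.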
This can be done by preventing all words from your stop word list from being analyzed. Stop words are commonly applied in search systems, text classification applications, topic modeling, topic extraction and others. In practice, stop word removal, while effective in search and topic extraction systems, often proves non-critical in classification systems. However, it does help reduce the number of features in consideration, which helps keep your models decently sized.

Here is an example of stop word removal in action, with all stop words replaced by a dummy character W:

original sentence = this is a text full of content and we need to clean it up
sentence with stop words removed = W W W text full W content W W W W clean W W

Code:

String handling:

s = "what it is what it isnt"
print(len(s))
words = s.split()
print(len(words))
print(words)
print(sorted(words))

Output:

23
6
['what', 'it', 'is', 'what', 'it', 'isnt']
['is', 'isnt', 'it', 'it', 'what', 'what']

File handling (tokenization and filtering):

for line in open("file.txt"):
    for word in line.split():
        if word.endswith("ing"):
            print(word)
            print(len(word))

Output:

eating
6
dancing
7
jumping
7

file.txt:

I like eating in a restaurant. I like dancing too. My daughter likes bungee jumping

Conclusion: In the above experiment we studied preprocessing of text in detail, covering filtration, stop word removal, tokenization, stemming and script validation, and implemented and successfully executed the code for it.

Name: Jinay    Roll No.: 122    Div: B    Class: BE
Year: 2021 - 2022

Experiment No. 2

Aim: To study Morphological Analysis.

Theory:

Morphological Analysis: In morphological analysis, each particular word is analyzed individually. Non-word tokens such as punctuation are removed from the words, and the remaining words are assigned categories. For instance, take the sentence "Ram's iPhone cannot convert the video from .mkv to .mp4". The sentence is analyzed word by word: "Ram" is a proper noun, the "'s" in "Ram's" is a possessive suffix, and ".mkv" and ".mp4" are file extensions. Each word is assigned a syntactic category. The file extensions present in the sentence, which behave as adjectives in this example, are identified, as is the possessive suffix. This is a very important step, because the judgment of prefixes and suffixes depends on the syntactic category of the word. For example, the suffix -s in "boys" and "swims" behaves differently: one makes a noun plural, while the other marks a third-person singular verb. If a prefix or suffix is incorrectly interpreted, the meaning and understanding of the sentence change completely. The interpretation assigns a category to the word, and hence removes the uncertainty from it.

Regular Expressions: Regular expressions, also called regexes, are a very powerful programming tool used for a variety of purposes such as feature extraction from text, string replacement and other string manipulations. A regular expression is a set of characters, or a pattern, which is used to find substrings in a given string: for example, extracting all hashtags from a tweet, or getting email IDs or phone numbers from a large unstructured text. In short, if there is a pattern in any string, you can easily extract, substitute and perform a variety of other string manipulation operations using regular expressions.
Regular expressions are a language in themselves, since they have their own compilers, and almost all popular programming languages support working with regexes.

Stop Word Removal: The words which are generally filtered out before processing a natural language are called stop words. These are actually the most common words in any language (articles, prepositions, pronouns, conjunctions, etc.) and do not add much information to the text. Examples of a few stop words in English are "the", "a", "an", "so" and "what". Stop words are available in abundance in any human language. By removing these words, we remove the low-level information from our text in order to give more focus to the important information. In other words, the removal of such words does not have negative consequences for the model we train for our task. Removal of stop words definitely reduces the dataset size and thus reduces the training time, due to the fewer tokens involved in training.

Sample text with stop words              Without stop words
GeeksforGeeks - A Computer Science       GeeksforGeeks, Computer, Science,
Portal for Geeks                         Portal, Geeks
Can listening be exhausting?             Listening, Exhausting
I like reading, so I read                Like, Reading, read

Synonyms: The word synonym defines the relationship between different words that have a similar meaning. A simple way to decide whether two words are synonymous is to check for substitutability: two words are synonyms in a context if they can be substituted for each other without changing the meaning of the sentence.
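A minimal sketch of stop word removal with a small hand-picked stop list (in practice NLTK's stopwords corpus would supply the full English list):

```python
# A tiny illustrative stop list; real lists contain well over a hundred words.
STOP_WORDS = {"a", "an", "the", "is", "are", "so", "what", "i", "be", "can", "for"}

def remove_stop_words(text):
    # Keep only tokens whose lowercase form is not in the stop list.
    return [w for w in text.split() if w.lower() not in STOP_WORDS]

print(remove_stop_words("Can listening be exhausting"))
```

This reproduces the behaviour of the table above: only the content-bearing words survive.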
Stemming: Stemming is the process of reducing a word to its word stem, the form that suffixes and prefixes attach to, or to the root of the word, known as a lemma. Stemming is important in natural language understanding (NLU) and natural language processing (NLP).

Code:

Regular expressions:

import re
text = "The 5 biggest animals are 1. Elephant, 2 Rhino and 3 Dinosaur"
text = text.lower()
print(text)
result = re.sub(r"\d+", "", text)
print(result)

Output:

the 5 biggest animals are 1. elephant, 2 rhino and 3 dinosaur
the  biggest animals are . elephant,  rhino and  dinosaur

Contraction expansion and punctuation removal:

import re

def punctuations(raw_review):
    text = raw_review
    text = text.replace("n't", " not")
    text = text.replace("'s", " is")
    text = text.replace("'re", " are")
    text = text.replace("'ve", " have")
    text = text.replace("'m", " am")
    text = text.replace("'d", " would")
    text = text.replace("'ll", " will")
    text = text.replace("in'", "ing")
    letters_only = re.sub("[^a-zA-Z]", " ", text)
    return "".join(letters_only)

t = "Hows's my team doin', you're supposed to be not loosin'"
p = punctuations(t)
print(p)

Output:

Hows is my team doing  you are supposed to be not loosing

Synonyms:

import nltk
nltk.download('wordnet')
from nltk.corpus import wordnet

synonyms = []
for syn in wordnet.synsets('Machine'):
    for lemma in syn.lemmas():
        synonyms.append(lemma.name())
print(synonyms)

Output:

['machine', 'machine', 'machine', 'machine', 'simple_machine', 'machine', 'political_machine', 'car', 'auto', 'automobile', 'machine', 'motorcar', 'machine', 'machine']

Stemming:

from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
print(stemmer.stem('eating'))
print(stemmer.stem('ate'))

Output:

eat
ate

Conclusion: Thus, in the above experiment we studied morphological analysis in detail, along with stemming, synonyms, stop
word removal and regular expressions, and implemented and successfully executed the corresponding code.

Name: Jinay    Roll No.: 122    Div: B    Class: BE    Year: 2021 - 2022

Experiment No. 3

Aim: To study the N-gram model.

Theory:

Given a sequence of N-1 words, an N-gram model predicts the most probable word that might follow this sequence. It is a probabilistic model that is trained on a corpus of text. Such a model is useful in many NLP applications, including speech recognition, machine translation and predictive text input.

An N-gram model is built by counting how often word sequences occur in corpus text and then estimating the probabilities. Since a simple N-gram model has limitations, improvements are often made via smoothing, interpolation and backoff.

An N-gram model is one type of Language Model (LM), which is about finding the probability distribution over word sequences. Consider two sentences: "There was heavy rain" vs. "There was heavy flood". From experience, we know that the former sentence sounds better. An N-gram model will tell us that "heavy rain" occurs much more often than "heavy flood" in the training corpus. Thus, the first sentence is more probable and will be selected by the model.

A model that simply relies on how often a word occurs without looking at previous words is called a unigram model. If a model considers only the previous word to predict the current word, it is called a bigram model. If two previous words are considered, it is a trigram model.

An N-gram model for the above example would calculate the following probability:
P(There was heavy rain’) «= ~—~P("There’, ‘was’, ‘heavy’, rain’) = P((There)P(was'!There')P(‘heavy'!There was')P(‘rain'There was heavy’) Since it's impractical to calculate these conditional probabilities, using Markov assumption, we approximate this to a bigram model: (There was heavy rain’) ~ P(There')P(‘was'|There’)P(‘heavy'|was)P(‘rain heavy’) In speech recognition, input may be noisy and this can lead to wrong speech-to-text conversions. N-gram models can correct this based ou their knowledge of the probabilities, Likewise, N-gram models are used in machine translation to produce more natural sentences in the target language ul ‘Vidya Vikas Education Trust’s Universal College of Engineering, Kaman Road, Vasai-401208 Accredited Bt Grade by NAAC When correcting for spelling errors, sometimes dictionary lookups will not help. For example, in the phrase "in about fifteen mineuts" the word ‘minuets’ is a valid dictionary word but it's incomect in this context. N-gram models can correct such errors. N-gram models are usually at word level. It's also been used at character level to do stemming, separate the root word from the suffix. By looking at N-gram statistics. we could also uages or differentiate between US and UK spellings. For example, 'sz' is common in ‘b' and ‘kp’ are common in Tgbo. In general, many NLP applications benefit from N-gram models including part-of-speech tagging. natural language generation, word similarity, sentiment extraction and predictive text input Code: import re from nltk.util import ngrams = "Machine learning is an important part of Al rong" ‘and Al is going to become inmporant for daily functio tokens = [token for token in s.split(" ")] output = list(ngrams(tokens, 2)) print(output) Output: [( Machine’, learning’), (learning, "s'), (is! 
[('Machine', 'learning'), ('learning', 'is'), ('is', 'an'), ('an', 'important'), ('important', 'part'), ('part', 'of'), ('of', 'AI'), ('AI', 'and'), ('and', 'AI'), ('AI', 'is'), ('is', 'going'), ('going', 'to'), ('to', 'become'), ('become', 'important'), ('important', 'for'), ('for', 'daily'), ('daily', 'functioning')]

Conclusion: Thus, in the above experiment we studied the N-gram model in detail with the help of theory, then implemented and successfully executed the code.

Name: Jinay    Roll No.: 122    Div: B    Class: BE    Year: 2021 - 2022

Experiment No. 4

Aim: To study POS tagging.

Theory:

POS tagging is the process of converting a sentence to a list of words, or a list of tuples where each tuple has the form (word, tag). The tag in this case is a part-of-speech tag, and signifies whether the word is a noun, adjective, verb, and so on.

Part of Speech    Tag
Noun              n
Verb              v
Adjective         a
Adverb            r

Default tagging is a basic step for part-of-speech tagging. It is performed using the DefaultTagger class, which takes the tag to assign ('tag') as a single argument. NN is the tag for a singular noun. DefaultTagger is most useful when it gets to work with the most common part-of-speech tag; that is why a noun tag is recommended.

Class hierarchy: DefaultTagger extends SequentialBackoffTagger, which defines choose_tag(); SequentialBackoffTagger in turn extends the base tagger interface, which defines tag() and evaluate().

Tagging is a kind of classification that may be defined as the automatic assignment of descriptors to tokens. Here the descriptor is called a tag, which may represent part-of-speech, semantic information and so on. Part-of-Speech (PoS) tagging, then, may be defined as the process of assigning one of the parts of speech to a given word. It is generally called POS tagging.
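The DefaultTagger behaviour described above can be sketched in plain Python (a toy mirror of the NLTK class written for this manual; the real implementation lives in nltk.tag and follows the same choose_tag()/tag() interface):

```python
class ToyDefaultTagger:
    """Assigns the same tag to every token, like NLTK's DefaultTagger."""

    def __init__(self, tag):
        self.tag_to_assign = tag

    def choose_tag(self, tokens, index):
        # A default tagger ignores the token and its context entirely.
        return self.tag_to_assign

    def tag(self, tokens):
        return [(tok, self.choose_tag(tokens, i)) for i, tok in enumerate(tokens)]

tagger = ToyDefaultTagger("NN")
print(tagger.tag(["Hello", "World"]))  # [('Hello', 'NN'), ('World', 'NN')]
```

In NLTK this class is typically used as the last resort in a backoff chain, catching words that more informed taggers could not tag.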
In simple words, we can say that POS tagging is the task of labelling each word in a sentence with its appropriate part of speech. We already know that parts of speech include nouns, verbs, adverbs, adjectives, pronouns, conjunctions and their sub-categories. Most POS tagging falls under rule-based POS tagging, stochastic POS tagging or transformation-based tagging.

Rule-based POS Tagging

One of the oldest techniques of tagging is rule-based POS tagging. Rule-based taggers use a dictionary or lexicon for getting the possible tags for each word. If the word has more than one possible tag, rule-based taggers use hand-written rules to identify the correct one. Disambiguation can also be performed in rule-based tagging by analyzing the linguistic features of a word along with its preceding and following words. For example, if the preceding word of a word is an article, then the word must be a noun.

Stochastic POS Tagging

Another technique of tagging is stochastic POS tagging. The question that arises here is which model can be called stochastic: any model that incorporates frequency or probability (statistics) can be called stochastic, and any number of different approaches to the problem of part-of-speech tagging can be referred to as stochastic tagging. The simplest stochastic taggers apply the following approaches to POS tagging.

Word Frequency Approach

In this approach, the stochastic taggers disambiguate words based on the probability that a word occurs with a particular tag. We can also say that the tag encountered most frequently with the word in the training set is the one assigned to an ambiguous instance of that word. The main issue with this approach is that it may yield an inadmissible sequence of tags.
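The word frequency approach can be sketched as a lookup tagger trained on a tiny hand-made tagged corpus (the corpus below is invented for illustration; a real tagger would train on something like the Penn Treebank):

```python
from collections import Counter, defaultdict

# Tiny invented training corpus of (word, tag) pairs.
tagged_corpus = [
    ("the", "DT"), ("dog", "NN"), ("runs", "VBZ"),
    ("the", "DT"), ("run", "NN"),     # "run" seen once as a noun...
    ("dogs", "NNS"), ("run", "VBP"),  # ...and once as a verb
    ("run", "VBP"),                   # verb again, so VBP now wins
]

# Count how often each word appears with each tag.
tag_counts = defaultdict(Counter)
for word, tag in tagged_corpus:
    tag_counts[word][tag] += 1

def most_frequent_tag(word, default="NN"):
    counts = tag_counts.get(word)
    return counts.most_common(1)[0][0] if counts else default

print(most_frequent_tag("run"))   # VBP (2 verb observations vs 1 noun)
print(most_frequent_tag("xyz"))   # NN  (unseen word falls back to the default)
```

Because each word is tagged in isolation, nothing stops the tagger from emitting an impossible tag sequence, which is exactly the weakness noted above.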
Tag Sequence Probabilities

This is another approach to stochastic tagging, in which the tagger calculates the probability of a given sequence of tags occurring. It is also called the n-gram approach, because the best tag for a given word is determined by the probability with which it occurs with the n previous tags.

Transformation-based Tagging

Transformation-based tagging is also called Brill tagging. It is an instance of transformation-based learning (TBL), a rule-based algorithm for the automatic tagging of POS to given text. TBL allows us to have linguistic knowledge in a readable form, and transforms one state to another by applying transformation rules. It draws inspiration from both of the previously explained taggers, rule-based and stochastic. Like rule-based tagging, it is based on rules that specify what tags need to be assigned to what words; like stochastic tagging, it is a machine learning technique in which rules are automatically induced from data.

HMM for POS Tagging

The POS tagging process is the process of finding the sequence of tags which is most likely to have generated a given word sequence. We can model this process using a Hidden Markov Model (HMM), where the tags are the hidden states that produce the observable output, i.e., the words.
Code:

import nltk
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')

text = nltk.word_tokenize("And now for Everything completely Same")
print(nltk.pos_tag(text))

Output:

[('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('Everything', 'VBG'), ('completely', 'RB'), ('Same', 'JJ')]

Conclusion: Thus, in the above experiment we studied POS tagging, learned about the different types of POS tagging, and implemented and successfully executed the code for POS tagging.

Name: Jinay    Roll No.: 122    Div: B    Class: BE    Year: 2021 - 2022

Experiment No. 5

Aim: To study Chunking.

Theory:

Chunk extraction, or partial parsing, is the process of extracting meaningful short phrases from a sentence tagged with part-of-speech tags. Chunks are made up of words, and the kinds of words are defined using the part-of-speech tags. One can even define patterns of words that can't be part of a chunk; such words are known as chinks. A ChunkRule class specifies what words or patterns to include and exclude in a chunk.

Defining chunk patterns: Chunk patterns are normal regular expressions which are modified and designed to match sequences of part-of-speech tags. Angle brackets are used to specify an individual tag, for example <NN> to match a noun tag; one can define multiple tags in the same way.

Chunking is a process of extracting phrases from unstructured text. Instead of just simple tokens, which may not represent the actual meaning of the text, it is advisable to use a phrase such as "South Africa" as a single unit instead of "South" and "Africa" as separate words.

In NLP, chunking is also described as changing a perception by moving a "chunk", or a group of bits of information, in the direction of a deductive or inductive conclusion through the use of language.
Chunking up or down allows the speaker to use certain language patterns, utilizing the natural internal process through language, to reach for higher meanings or to search for more specific bits of missing information. When we "chunk up", the language gets more abstract and there are more chances for agreement; when we "chunk down", we tend to look for the specific details that may have been missing in the chunk up. As an example, if you ask the question "for what purpose cars?" you may get the answer "transport", which is a higher chunk and more abstract. If you ask "what specifically about a car?" you will start to get smaller pieces of information about the car. Lateral thinking is the process of chunking up and then looking for other examples: for example, "for what intentions cars?", "transportation", "what are other examples of transportation?", "Buses!"

Code:

Noun phrase chunking:

import nltk

sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
            ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
grammar = "NP: {<DT>?<JJ>*<NN>}"
cp = nltk.RegexpParser(grammar)
result = cp.parse(sentence)
print(result)
result.draw()

Output:

(S
  (NP the/DT little/JJ yellow/JJ dog/NN)
  barked/VBD
  at/IN
  (NP the/DT cat/NN))

Conclusion: Thus, in the above experiment we studied chunking, implemented the code for it and successfully executed it.

Name: Jinay    Roll No.: 122    Div: B    Class: BE    Year: 2021 - 2022

Experiment No.
6

Aim: To study Named Entity Recognition.

Theory:

Named Entity Recognition (NER) is a standard NLP problem which involves spotting named entities (people, places, organizations, etc.) in a chunk of text and classifying them into a predefined set of categories. Some of the practical applications of NER include:

* Scanning news articles for the people, organizations and locations reported.
* Providing concise features for search optimization: instead of searching the entire content, one may simply search for the major entities involved.
* Quickly retrieving geographical locations talked about in Twitter posts.

In any text document, there are particular terms that represent specific entities that are more informative and have a unique context. These entities are known as named entities, which more specifically refers to terms that represent real-world objects like people, places and organizations, often denoted by proper names. A naive approach could be to find these by looking at the noun phrases in text documents. Named entity recognition (NER), also known as entity chunking/extraction, is a popular technique used in information extraction to identify and segment named entities and classify or categorize them under various predefined classes.

[Figure: NER pipeline. A preprocessing module (text segmentation, tokenization, morphosyntactic analysis) feeds a machine-learning NER module, followed by a post-processing step that produces the output.]

How NER works

At the heart of any NER model is a two-step process:

1. Detect a named entity
2. Categorize the entity

Beneath this lie a couple of things. Step one involves detecting a word or string of words that form an entity. Each word represents a token: "The Great Lakes" is a string of three tokens that represents one entity.
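The two-step detect-and-categorize process can be sketched with a toy gazetteer-based recognizer (the entity lists are invented for illustration; production systems learn entity boundaries and categories from data rather than from a fixed dictionary):

```python
# Invented gazetteer mapping known entity phrases to categories.
GAZETTEER = {
    "The Great Lakes": "LOCATION",
    "New York": "LOCATION",
    "Omnicom": "ORGANIZATION",
}

def find_entities(text):
    """Step 1: detect known phrases in the text; step 2: attach their category."""
    found = []
    for phrase, category in GAZETTEER.items():
        if phrase in text:
            found.append((phrase, category))
    return found

print(find_entities("Omnicom opened an office in New York."))
```

A dictionary lookup like this cannot handle unseen entities or ambiguous spans, which is precisely why statistical tagging schemes such as IOB labelling are used in practice.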
Inside-outside-beginning (IOB) tagging is a common way of indicating where entities begin and end. The second step requires the creation of entity categories.

How is NER used?

NER is suited to any situation in which a high-level overview of a large quantity of text is helpful. With NER you can, at a glance, understand the subject or theme of a body of text and quickly group texts based on their relevancy or similarity. Some notable NER use cases include:

Human resources: Speed up the hiring process by summarizing applicants' CVs; improve internal workflows by categorizing employee complaints and questions.

Customer support: Improve response times by categorizing user requests, complaints and questions and filtering by priority keywords.

Code:

Named entity recognition:

locs = [('Omnicom', 'IN', 'New York'),
        ('DDB Needham', 'IN', 'New York'),
        ('Kaplan Thaler Group', 'IN', 'New York'),
        ('BBDO South', 'IN', 'Atlanta'),
        ('Georgia-Pacific', 'IN', 'Atlanta')]
query = [e1 for (e1, rel, e2) in locs if e2 == 'Atlanta']
print(query)

Output:

['BBDO South', 'Georgia-Pacific']

Conclusion: Thus, in the above experiment we studied named entity recognition, how it works and how it can be used, then implemented the code for it and successfully executed it.

Name: Jinay    Roll No.: 122    Div: B    Class: BE    Year: 2021 - 2022

Experiment No. 7

Aim: Virtual Lab on Word Generation.

Theory:

Given the root and suffix information, a word can be generated.
For example:

Hindi:   rt=ladakaa, cat=n, gen=m, num=sg, case=obl
Hindi:   rt=ladakaa, cat=n, gen=m, num=pl
English: rt=boy, cat=n, num=pl
English: rt=play, cat=v, num=sg, per=3, tense=pr

Morphological analysis and generation are inverse processes. Analysis may involve non-determinism, since more than one analysis may be possible; generation is a deterministic process. In case a language allows spelling variation, then to that extent generation would also involve non-determinism.

Conclusion: Thus, in the above experiment we studied word generation.

Name: Jinay    Roll No.: 122    Div: B    Class: BE    Year: 2021 - 2022

Experiment No. 8

Aim: Mini project based on an NLP application.

Name of Group Members: Jinay Shah (122), Pashwa Shah (123)

Theory:

Word prediction tools can be very helpful for kids who struggle with writing. To use word prediction, the child needs to use a keyboard to write. This can be an onscreen keyboard on a smartphone or digital tablet, or a physical keyboard connected to a device or computer. Suggestions are shown on the screen, for example at the top of an onscreen keyboard; the child clicks or taps on a suggested word and it is inserted into the writing. There are also advanced word prediction tools available. They include:

Tools that read word choices aloud with text-to-speech. This is important for kids with reading issues who can't read what the suggestions are.

Tools that make suggestions tailored to specific topics. For instance, the words used in a history paper will differ a lot from those in a science report.
To make suggestions more accurate, kids can pick special dictionaries for what they are writing about.

Tools that display word suggestions in example sentences. This can help kids decide between words that are confusing, like "to", "too" and "two".

Code:

import bs4 as bs
import urllib.request
import re
import nltk

scrapped_data = urllib.request.urlopen('https://en.wikipedia.org/wiki/Artificial_intelligence')
article = scrapped_data.read()

parsed_article = bs.BeautifulSoup(article, 'lxml')
paragraphs = parsed_article.find_all('p')
article_text = ""
for p in paragraphs:
    article_text += p.text

Remove stop words:

try:
    import string
    from nltk.corpus import stopwords
    import nltk
except Exception as e:
    print(e)

class PreProcessText(object):
    def __init__(self):
        pass

    def __remove_punctuation(self, text):
        """Takes a string; returns the string with punctuation removed."""
        message = []
        for x in text:
            if x in string.punctuation:
                pass
            else:
                message.append(x)
        return "".join(message)

    def __remove_stopwords(self, text):
        """Takes a string; returns a list of words with stop words removed."""
        words = []
        for x in text.split():
            if x.lower() in stopwords.words('english'):
                pass
            else:
                words.append(x)
        return words

    def token_words(self, text=''):
        """Takes a string; returns the list of tokens used to train the model."""
        message = self.__remove_punctuation(text)
        words = self.__remove_stopwords(message)
        return words

import nltk
flag = nltk.download("stopwords")
if flag == False:
    print("Failed to download stop words")
else:
    print("Downloaded stop words...")

helper = PreProcessText()
words = helper.token_words(text=article_text)

from gensim.models import Word2Vec

model = Word2Vec([words], size=100, window=5, min_count=1, workers=4)  # gensim < 4.0; newer versions use vector_size instead of size
vocabulary = model.wv.vocab
sim_words = model.wv.most_similar('machine')
print(sim_words)

Output:

[('supervised', 0.2835510373115539), ('favor', 0.3093134164810180), ('provide', 0.3048184812068939), ('would', 0.2965786159038122), ('species', 0.2742482125759125), ('collectively', 0.2736391425132751), ('transformative', 0.2721229718618392), ('advanced', 0.2719863653182983), ...]

Conclusion: Thus, we completed the mini-project on similar-word prediction successfully.
