Laboratory Practice VI Natural Language Processing
Mini Project
Sentence Autocompletion
Group Members
Omkar Jagtap(19CO033)
Shreya Jagtap(19CO035)
Abhishek Mulik(19CO051)
Tanvi Paigude(19CO066)
Class: BE Computer A
Faculty:
Aim:
To develop a sentence autocompletion model.
Abstract:
Imagine that you are a representative replying to customers online and you find
yourself asking the same questions over and over. Would you like to get
automatic suggestions instead of typing the same thing again and again?
Introduction:
Autocomplete is a user interface function in which an application predicts a
word or phrase that the user needs to type without the user having to type it
entirely.
In modern applications, word completion (also called autocomplete or
autosuggest) is a popular user interface feature. Its aim is to predict what the
user wants to type and complete parts of the text automatically.
By providing available options, the aim is to speed up typing, assist those with
typing difficulties, correct or prevent spelling errors, and support information
retrieval. Darragh and Witten's work on the Reactive Keyboard from 1983 may
be the earliest example of the concept. Several other methods have been
proposed since then, but the basic concept has remained the same.
Word processors (MS Word, OpenOffice.org), programming editors (Emacs,
Eclipse), desktop applications (web browsers, e-mail clients), HTML form
elements on websites, web applications (Google Suggest, web-based e-mail
clients), mobile phone interfaces, Unix terminals, and so on all have the
feature.
Whenever you search for something on Google, after typing two or three letters it
suggests possible search terms. And if you search for something with typos, it
corrects them and still finds relevant results for you. Isn't it amazing?
It is something we use every day but rarely pay much attention to. It is an
important application of natural language processing and a good illustration of
what NLP means for a great many people all over the world, including you and me.
Search autocomplete and autocorrect both help us find the right results more
efficiently. Nowadays, many websites, such as Facebook and Quora, have also
started using this feature.
Dataset
The file contains 22K conversations between a customer and a representative.
For the purpose of this project, we are only interested in completing the
threads of the representative.
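As a concrete illustration, the dataset can be loaded with pandas and filtered down to the representative's messages. The file name and column names below are assumptions for the sake of the sketch; the actual layout of the file may differ.

    import pandas as pd

    # Load the conversation dump; file name and column names are assumed for illustration.
    df = pd.read_csv("customer_support_conversations.csv")
    print(len(df))  # roughly 22K conversations

    # Keep only the representative's side of each thread.
    rep = df[df["author"] == "representative"]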
Data Selection and Cleaning:
The processing will separate the threads of the customer and the
representative, split the text into sentences based on punctuation (we are
going to keep the punctuation), clean up the resulting text with a few light
regex rules, and keep only the sentences longer than one word.
Finally, since the representative tends to ask the same questions over
and over again, autocompletion is extremely valuable because it can propose a
complete sentence. In our case, we count the number of occurrences of
each sentence so that we can use it as a feature later, and then remove the
duplicates.
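A minimal sketch of this cleaning step is shown below, using standard Python regex and pandas. The example messages and the exact regex rules are illustrative assumptions rather than the report's final pipeline.

    import re
    import pandas as pd

    # Example representative messages standing in for the real dataset threads.
    rep_messages = [
        "Hi! How can I help you today?",
        "Could you send me your order number? Thanks!",
        "Could you send me your order number? Thanks!",
    ]

    sentences = []
    for text in rep_messages:
        # Split into sentences on punctuation, keeping the punctuation itself.
        for sent in re.findall(r"[^.!?]+[.!?]?", text):
            sent = re.sub(r"\s+", " ", sent).strip()  # light regex cleanup
            if len(sent.split()) > 1:                 # keep only sentences longer than one word
                sentences.append(sent)

    # Count the occurrences of each sentence (kept as a feature) and drop duplicates.
    counts = (pd.Series(sentences)
                .value_counts()
                .rename_axis("sentence")
                .reset_index(name="n_occurrences"))
    print(counts)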
Implementation:
Generate TF-IDF vectorizer:
In information retrieval, tf-idf or TFIDF, short for term frequency-inverse
document frequency, is a numerical statistic intended to reflect how important
a word is to a document in a collection or corpus. It is one of the most
commonly used weighting schemes in information retrieval and text mining.
The tf-idf value increases in direct proportion to the number of times a term
appears in the document and is offset by the number of documents in the corpus
that contain the word, which compensates for the fact that some words appear
more frequently in general. The TF-IDF weight represents the relative
importance of a term within the document and the whole corpus.
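A minimal sketch of building such a vectorizer with scikit-learn's TfidfVectorizer is given below. The character n-gram settings are assumptions chosen because they handle partially typed words well; they are not necessarily the project's final parameters.

    from sklearn.feature_extraction.text import TfidfVectorizer

    # Candidate sentences produced by the cleaning step (illustrative stand-ins).
    sentences = [
        "Could you send me your order number?",
        "How can I help you today?",
        "Your refund has been processed.",
    ]

    # Character n-grams inside word boundaries match prefixes of words the user is typing.
    vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
    tfidf_matrix = vectorizer.fit_transform(sentences)  # one TF-IDF row per candidate sentence

    print(tfidf_matrix.shape)  # (number of sentences, vocabulary size)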
TF stands for Term Frequency:
It measures how frequently a term appears in a document. Since document sizes
vary, a term may appear more often in a long document than in a short one.
Therefore, term frequency is often normalized by the length of the document.
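One common normalized form of term frequency, consistent with the description above, is:

    TF(t, d) = f_{t,d} / \sum_{t' \in d} f_{t',d}

where f_{t,d} is the raw count of term t in document d and the denominator is the total number of terms in d.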