Unit 1a
Unit-1
Introduction
Components of NLP
Natural Language Understanding (NLU)
NLU is the process of enabling machines to comprehend and interpret human language. It involves the analysis
of input text or speech to extract meaning, context, and intent.
Natural Language Generation (NLG)
NLG is the process of producing natural-language text from structured data or internal representations. Its main stages include:
Text Planning: Deciding what information to include and how to structure it.
Sentence Generation: Creating grammatically correct and contextually appropriate sentences.
Lexical Choice: Selecting appropriate words and vocabulary for the generated text.
Referring Expression Generation: Deciding how to refer to entities mentioned in the text.
Coherence and Cohesion: Ensuring that the generated text flows logically and is cohesive.
NLG is used in various applications such as automatic summarization, report generation, chatbots, and content
creation.
Approaches and Models for Applying Natural Language Processing
Classical approach to NLP
• Rule-Based Systems:
• Syntax and Grammar Rules:
• Semantic Analysis:
• Named Entity Recognition (NER):
• Shallow Natural Language Processing:
• Information Retrieval Techniques:
• Machine Translation with Rules:
• Expert Systems:
Rule-Based Approaches to NLP
• Syntax and Grammatical Rules:
• Named Entity Recognition (NER):
• Semantic Rules:
• Sentiment Analysis Rules:
• Question Answering Rules:
• Dialogue Management Rules:
• Template-based NLG:
• Hybrid Approaches:
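A minimal sketch of the rule-based flavour listed above: a lexicon-driven sentiment classifier with one negation rule. The word lists and the rule are illustrative assumptions, not any standard lexicon.

```python
# Toy rule-based sentiment classifier: lexicon lookup plus a negation rule.
# The word lists below are illustrative assumptions, not a standard lexicon.
POSITIVE = {"good", "great", "excellent", "helpful"}
NEGATIVE = {"bad", "poor", "terrible", "slow"}
NEGATORS = {"not", "never", "no"}

def sentiment(text):
    tokens = text.lower().split()
    score = 0
    for i, tok in enumerate(tokens):
        polarity = 1 if tok in POSITIVE else -1 if tok in NEGATIVE else 0
        # Rule: a negator immediately before a sentiment word flips its polarity.
        if polarity and i > 0 and tokens[i - 1] in NEGATORS:
            polarity = -polarity
        score += polarity
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"
```

Real rule-based systems layer many such hand-written rules (intensifiers, scope of negation, domain lexicons), which is exactly what makes them precise but costly to maintain.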
https://www.uni-due.de/SHE/REV_Levels_Chart.htm
Source: https://towardsdatascience.com/linguistic-knowledge-in-natural-language-processing-
332630f43ce1
Morphology
At this stage we care about the words that make up the sentence, how they are formed, and how they change depending on their context. Some examples of tasks at this level include:
• Prefixes/suffixes
• Singularization/pluralization
• Gender detection
• Word inflection (modification of a word to express different grammatical categories such as tense, case, voice, etc.). Forms of inflection include conjugation (inflection of verbs) and declension (inflection of nouns, adjectives, adverbs, etc.).
• Lemmatization (recovering the base form of a word, i.e. the reverse of inflection)
• Spell checking
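Suffix stripping is the simplest way to undo inflection. Below is a toy stemmer in plain Python; the rules are a deliberately crude sketch, far simpler than a real algorithm such as Porter's.

```python
# Toy suffix-stripping stemmer illustrating inflection removal.
# The rules below are an illustrative sketch, not the Porter algorithm.
SUFFIX_RULES = [("ies", "y"), ("ing", ""), ("ed", ""), ("s", "")]

def stem(word):
    for suffix, replacement in SUFFIX_RULES:
        # Only strip if enough of the word remains to be a plausible stem.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            base = word[: -len(suffix)] + replacement
            # Undo consonant doubling: "runn" -> "run".
            if len(base) >= 2 and base[-1] == base[-2] and base[-1] not in "aeiou":
                base = base[:-1]
            return base
    return word
```

Note the difference from lemmatization: a stemmer only chops characters, so its output need not be a dictionary word, while a lemmatizer maps each form to its actual lemma.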
Source: https://towardsdatascience.com/linguistic-knowledge-in-natural-language-processing-
332630f43ce1
Syntax (Parsing)
In this stage, we focus on the relationships between the words within a sentence, i.e. how a sentence is constructed.
Source: https://towardsdatascience.com/linguistic-knowledge-in-natural-language-processing-
Semantics
Once we’ve understood the syntactic structures, we are better prepared to get at the “meaning” of the sentence (for a fun read on what meaning can actually mean in NLP, see the Twitter discussion linked from the source article).
Some examples of tasks performed at this stage include:
• Named Entity Recognition (NER)
• Relationship Extraction
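A pattern-based sketch of NER, the first task above: tag capitalized word sequences as candidate entities. Real NER systems use gazetteers, context features, or learned models; this regex is only an illustrative assumption and will, for example, falsely match any sentence-initial word.

```python
import re

# Toy pattern-based NER: capitalized word sequences become candidate entities.
# Illustrative only -- sentence-initial words are false positives.
ENTITY_PATTERN = re.compile(r"\b(?:[A-Z][a-z]+)(?:\s+[A-Z][a-z]+)*\b")

def find_entities(text):
    return ENTITY_PATTERN.findall(text)
```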
Source: https://towardsdatascience.com/linguistic-knowledge-in-natural-language-processing-
332630f43ce1
Pragmatics
At this level, we try to understand the text as a whole. Popular problems that we’re
trying to solve at this stage are:
• Topic modelling
• Co-reference
• Summarization
• Question answering
Source: https://towardsdatascience.com/linguistic-knowledge-in-natural-language-processing-
332630f43ce1
Text processing
Theory and practice of automating the creation or manipulation of electronic text.
Representation of data:
• Text.
• Images.
• Audio.
• Videos.
• Text extraction.
• Text classification.
Extracting individual, small bits of information from large text data is called text extraction.
Assigning labels to text data depending on its content is called text classification.
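The two operations just defined can be contrasted on the same input. The email regex and the category keyword lists below are illustrative assumptions.

```python
import re

# Category keyword lists are illustrative assumptions for the sketch.
CATEGORIES = {
    "billing": {"invoice", "charge", "refund"},
    "support": {"help", "agent", "response"},
}

def extract_emails(text):
    """Text extraction: pull small, specific bits (here, email addresses)."""
    return re.findall(r"[\w.+-]+@[\w-]+\.[\w.]+", text)

def classify(text):
    """Text classification: assign a label based on the content."""
    words = set(text.lower().split())
    scores = {label: len(words & kws) for label, kws in CATEGORIES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "other"
```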
Text analysis vs. Text mining vs. Text analytics
• Text mining: used to obtain structured data from text by statistical pattern learning.
• Both text analysis and text mining are qualitative processes.
• Text analytics is a quantitative process.
• Example:
– Banking service: customer satisfaction.
– Text analysis: individual performance of a customer support executive, based on feedback terms like "good" and "bad".
– Text analytics:
• Overall performance of all the support executives.
• Graph for visualizing the performance of the entire support team.
• Overall count of issues resolved.
Text processing tools
• Statistical methods
• Text classification methods
• Text extraction methods
Tools and methodologies: Statistical methods
• Statistical methods:
– Word frequency: identify the most regularly used words or expressions present in a specific text.
– Collocation: method for identifying common words that appear together.
– Concordance: shows each occurrence of a word in its surrounding context (keyword-in-context), grounding the word in natural usage.
– TF-IDF: weighs the importance of a word in a document relative to the whole corpus.
• Keyword extraction: identifying and detecting the most relevant words inside a text.
• Entity extraction: gathering information on specific relevant elements while discarding all the irrelevant ones.
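Two of the statistical methods above, word frequency and TF-IDF, fit in a few lines of plain Python. The three-document corpus is an illustrative assumption; note how TF-IDF gives the ubiquitous "the" a weight of zero while rarer terms score higher.

```python
import math
from collections import Counter

# Toy corpus (illustrative documents).
docs = [
    "the bank approved the loan",
    "the river bank was muddy",
    "the loan rate was high",
]
tokenized = [d.split() for d in docs]

# Word frequency: most common terms in the first document.
freq = Counter(tokenized[0])

def tf_idf(term, doc_tokens, corpus):
    """Classic tf-idf: term frequency times inverse document frequency."""
    tf = doc_tokens.count(term) / len(doc_tokens)
    df = sum(1 for d in corpus if term in d)   # documents containing the term
    idf = math.log(len(corpus) / df)
    return tf * idf
```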
Scope of text analysis/processing
• Large documents:
– Referred to for overall context.
– Cross-examining multiple documents.
• Individual sentences:
– Gathering specific information.
– Identifying emotional or intentional content.
• Data preparation.
• Data analysis.
Data gathering
• Text analysis: gathering the required data that needs to be analyzed.
• Internal data:
– Email.
– Chat messages.
– CRM tools.
– Databases.
– Surveys.
– Spreadsheets.
– Product analysis report.
• Tokenization:
– Identifies and recognizes the basic units of text.
– The process of breaking text up into meaningful elements (tokens).
– Keeps the meaningful parts of the text and discards meaningless sections.
– Often paired with stop-word removal, which drops very frequent function words.
• Stemming:
– Reduces a word to its root (stem) while preserving its core meaning.
– Strips affixes such as prefixes and suffixes.
• Lemmatization:
– Uses part-of-speech information to remove inflection and return the dictionary form (lemma) of a word.
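The preprocessing steps above can be sketched in plain Python. The stop-word list and the lemma lookup table are illustrative assumptions; real lemmatizers use POS tags and morphological rules rather than a fixed table.

```python
import re

# Illustrative stop-word list and lemma table (assumptions for the sketch).
STOP_WORDS = {"the", "a", "an", "is", "are", "was", "were", "to", "of"}
LEMMA_TABLE = {"flights": "flight", "was": "be", "better": "good"}

def tokenize(text):
    """Tokenization: break text into meaningful word units."""
    return re.findall(r"[a-z']+", text.lower())

def remove_stop_words(tokens):
    """Drop very frequent function words that carry little content."""
    return [t for t in tokens if t not in STOP_WORDS]

def lemmatize(token):
    """Lemmatization: map an inflected form to its dictionary form (lemma)."""
    return LEMMA_TABLE.get(token, token)
```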
Dependency parsing
– Represents a sentence as directed grammatical relations (head to dependent) between its words.
Constituency parsing
– Uses syntactic phrase structure: abstract nodes (categories) associated with words and groups of words.
Uses of a text corpus
• Education:
– Data driven learning.
– Concordance usage.
– Generalization extraction from data.
• General usages:
– Native speaker intuition.
– Frequency of occurrence.
– Relationship as per usage.
Traits of a good text corpus
• Depth:
– Example: a word list covering the top 50,000 words, not just the top 5,000.
• Recency:
– Example: outdated texts say "courting" where current texts say "dating".
• Metadata:
– Example: Source, Genre, Type.
• Genre:
– Example: Text from newspapers, journals, etc.
• Size:
– Example: half a million English words.
• Clean:
– Example: correctly categorized entries, e.g. Noun: flower, fruit; Verb: eat, smell.
Annotation and Storage of Corpus
• Modern corpora are digitally stored.
• A popular approach is to use the Extensible Markup Language (XML) for encoding and storing corpus files containing two types of elements: content (for keeping the linguistic data collected for the corpus) and markup (for keeping annotations).
• Annotation enriches a corpus and facilitates the discovery of generalizations and knowledge which is difficult, if not impossible, to obtain from a raw corpus.
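The content/markup split described above can be made concrete with a tiny XML fragment: the word text is the content, the `pos` attributes are the markup. The tag names (`s`, `w`) are illustrative, not a specific corpus standard.

```python
import xml.etree.ElementTree as ET

# Tiny XML-encoded corpus fragment. Tag names are illustrative assumptions.
corpus_xml = """
<corpus>
  <s id="1">
    <w pos="DT">The</w>
    <w pos="NN">dog</w>
    <w pos="VBZ">barks</w>
  </s>
</corpus>
"""

root = ET.fromstring(corpus_xml)
# Recover (word, annotation) pairs from content + markup.
tagged = [(w.text, w.get("pos")) for w in root.iter("w")]
```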
Example: POS Tagging
• All currently available balanced corpora are POS-tagged.
• A syntactically annotated corpus, however, is conventionally called a treebank.
• The standard way is manual annotation by human annotators.
• Modern automated annotation predicts tags using statistical models or heuristic rules.
• Another way is crowdsourcing.
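A sketch of the heuristic-rule style of automated annotation mentioned above: tag each token from capitalization and suffix rules. The rules and the default tag are illustrative assumptions (tag names follow the Penn Treebank convention).

```python
# Toy heuristic POS tagger; rules and default tag are illustrative assumptions.
SUFFIX_TAGS = [("ing", "VBG"), ("ed", "VBD"), ("ly", "RB"), ("s", "NNS")]

def heuristic_tag(token):
    if token[0].isupper():
        return "NNP"             # capitalized -> proper-noun guess
    for suffix, tag in SUFFIX_TAGS:
        if token.endswith(suffix):
            return tag
    return "NN"                  # default: common noun

tags = [(t, heuristic_tag(t)) for t in "Maria quickly signed papers".split()]
```

Real automated annotation replaces such hand rules with statistical models trained on manually annotated data, then corrects residual errors by hand or via crowdsourcing.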
Annotations in text corpus (1 of 2)