
NLP

Unit-1

Introduction
Components of NLP
Natural Language Understanding (NLU)
NLU is the process of enabling machines to comprehend and interpret human language. It involves the analysis
of input text or speech to extract meaning, context, and intent.

Tokenization: Breaking down the input into individual words or tokens.


Part-of-Speech (POS) Tagging: Assigning grammatical categories (nouns, verbs, etc.) to each token.
Named Entity Recognition (NER): Identifying and classifying entities such as names of people,
organizations, locations, etc.
Syntax and Semantics Analysis: Understanding the grammatical structure and meaning of
sentences.
Sentiment Analysis: Determining the emotional tone expressed in the text.
Intent Recognition: Identifying the purpose or goal behind a given user input.
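As an illustration of the first few NLU steps, here is a minimal sketch using the spaCy library (it assumes the small English model en_core_web_sm is installed; the example sentence is made up):

```python
# A minimal NLU sketch with spaCy: tokenization, POS tagging and NER.
# Assumes `pip install spacy` and `python -m spacy download en_core_web_sm`.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is opening a new office in Bengaluru next year.")

for token in doc:          # tokenization + part-of-speech tagging
    print(token.text, token.pos_)

for ent in doc.ents:       # named entity recognition
    print(ent.text, ent.label_)
```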
Natural Language Generation (NLG)
NLG is the process of generating human-like language or text based on underlying data or information. It
involves transforming structured data into coherent and contextually relevant natural language output.

Text Planning: Deciding what information to include and how to structure it.
Sentence Generation: Creating grammatically correct and contextually appropriate
sentences.
Lexical Choice: Selecting appropriate words and vocabulary for the generated text.
Referring Expression Generation: Deciding how to refer to entities mentioned in
the text.
Coherence and Cohesion: Ensuring that the generated text flows logically and is
cohesive.

NLG is used in various applications such as automatic summarization, report generation, chatbots, and content
creation.
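A very small template-based NLG sketch is shown below; the weather record and templates are illustrative assumptions, not a real system, but they show text planning, lexical choice, and sentence generation on structured data:

```python
# A minimal template-based NLG sketch: structured data in, sentences out.
# The record fields and templates are hypothetical examples.
record = {"city": "Pune", "temp_max": 34, "temp_min": 22, "condition": "partly cloudy"}

def generate_weather_report(r):
    # Text planning: decide which facts to verbalize and in what order.
    facts = [f"Today in {r['city']} the weather will be {r['condition']}."]
    # Lexical choice: pick wording based on the data.
    spread = "a wide" if r["temp_max"] - r["temp_min"] > 10 else "a narrow"
    facts.append(
        f"Temperatures will stay in {spread} range between "
        f"{r['temp_min']} and {r['temp_max']} degrees Celsius."
    )
    # Coherence and cohesion: join the sentences into one paragraph.
    return " ".join(facts)

print(generate_weather_report(record))
```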
Approaches and Models for Applying Natural Language Processing
Classical approach to NLP
• Rule-Based Systems:
• Syntax and Grammar Rules:
• Semantic Analysis:
• Named Entity Recognition (NER):
• Shallow Natural Language Processing:
• Information Retrieval Techniques:
• Machine Translation with Rules:
• Expert Systems:
Rule-based approaches in NLP
• Syntax and Grammatical Rules:
• Named Entity Recognition (NER):
• Semantic Rules:
• Sentiment Analysis Rules:
• Question Answering Rules:
• Dialogue Management Rules:
• Template-based NLG:
• Hybrid Approaches:

They are particularly suitable for well-defined and rule-bound tasks, but their effectiveness may be limited in more complex and dynamic language understanding scenarios.
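A toy rule-based sketch is given below (the sentiment lexicon and the date pattern are illustrative assumptions) to show how hand-written rules can drive sentiment analysis and simple entity extraction:

```python
# A minimal rule-based sketch: keyword-lexicon sentiment and a regex entity rule.
import re

POSITIVE = {"good", "great", "excellent", "happy"}
NEGATIVE = {"bad", "poor", "terrible", "unhappy"}

def rule_based_sentiment(text):
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

# A simple pattern rule for a date-like entity
DATE_RULE = re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b")

text = "The support was excellent, ticket raised on 12/01/2024 was resolved quickly."
print(rule_based_sentiment(text))   # positive
print(DATE_RULE.findall(text))      # ['12/01/2024']
```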
Traditional approaches
Tokenization
involves breaking text into a sequence of tokens, roughly corresponding to words.
Part-of-speech tagging
identifying the part of speech each word belongs to
Chunking
grouping individual tokens into larger, meaningful phrases (shallow parsing) that can then be reasoned about as units.
Named-entity recognition
locating and classifying named entities
Co-reference resolution
identifying all the expressions that refer to the very same entity in a text.
Semantic role labeling
assigning roles to the constituents or phrases in sentences.
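A short sketch of some of these traditional steps with NLTK is given below: tokenization, POS tagging, and chunking noun phrases with a hand-written grammar (it assumes the required NLTK tokenizer and tagger data have already been downloaded; the grammar and sentence are illustrative):

```python
# Shallow NLP with NLTK: tokenize, POS-tag, then chunk noun phrases.
# Assumes the NLTK tokenizer/tagger data are available (e.g. via nltk.download()).
import nltk

sentence = "The quick brown fox jumped over the lazy dog."
tokens = nltk.word_tokenize(sentence)   # tokenization
tagged = nltk.pos_tag(tokens)           # part-of-speech tagging

# Chunking: a simple grammar that groups determiner + adjectives + noun into an NP
grammar = "NP: {<DT>?<JJ>*<NN.*>+}"
chunker = nltk.RegexpParser(grammar)
tree = chunker.parse(tagged)
print(tree)
```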
Statistical approaches in NLP
Statistical approaches in Natural Language Processing (NLP) involve the use of statistical models and machine
learning algorithms to automatically learn patterns, relationships, and structures from large amounts of
linguistic data.
• Corpus-based Learning
• Probabilistic Models
• N-gram Models
• Hidden Markov Models (HMMs)
• Maximum Likelihood Estimation (MLE)
• Conditional Random Fields (CRFs)
• Machine Learning Algorithms
– Support Vector Machines (SVMs)
– Decision trees
– Neural networks
• Word Embedding
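A tiny illustration of the statistical idea is sketched below: a bigram model whose probabilities are maximum likelihood estimates computed from corpus counts (the toy corpus is made up):

```python
# A toy bigram language model estimated with maximum likelihood (MLE) counts.
from collections import Counter, defaultdict

corpus = [
    ["<s>", "i", "like", "nlp", "</s>"],
    ["<s>", "i", "like", "deep", "learning", "</s>"],
    ["<s>", "nlp", "is", "fun", "</s>"],
]

unigram_counts = Counter()
bigram_counts = defaultdict(Counter)
for sent in corpus:
    unigram_counts.update(sent)
    for w1, w2 in zip(sent, sent[1:]):
        bigram_counts[w1][w2] += 1

def p(w2, w1):
    """MLE estimate P(w2 | w1) = count(w1, w2) / count(w1)."""
    return bigram_counts[w1][w2] / unigram_counts[w1]

print(p("like", "i"))    # 2/2 = 1.0
print(p("nlp", "like"))  # 1/2 = 0.5
```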
Adaptive models
Employ adaptive or self-learning models (such as neural networks) that help to improve predictions on ever-changing data.
Examples
long short-term memory networks (LSTMs)
Used to classify, process, and predict time-series data based on time lags of unknown size and duration
between important events.
Generative adversarial networks (GANs)
belong to unsupervised machine learning and comprise two neural networks, one of which generates candidates while the other evaluates them.
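Below is a minimal, untrained sketch of an LSTM-based text classifier in PyTorch; the layer sizes and the dummy batch are illustrative assumptions, and a real model would of course be trained on labeled sequences:

```python
# A minimal LSTM text-classifier sketch in PyTorch (untrained, illustrative sizes).
import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    def __init__(self, vocab_size=1000, embed_dim=64, hidden_dim=128, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)   # (batch, seq_len, embed_dim)
        _, (hidden, _) = self.lstm(embedded)   # hidden: (1, batch, hidden_dim)
        return self.fc(hidden[-1])             # (batch, num_classes)

model = LSTMClassifier()
dummy_batch = torch.randint(0, 1000, (4, 12))  # 4 sequences of 12 token ids
print(model(dummy_batch).shape)                # torch.Size([4, 2])
```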
Understanding linguistics

https://www.uni-due.de/SHE/REV_Levels_Chart.htm

Source: https://towardsdatascience.com/linguistic-knowledge-in-natural-language-processing-332630f43ce1
Morphology
At this stage we care about the words that make up the sentence, how they are formed, and how they change depending on their context. Some examples of these include:
• Prefixes/suffixes
• Singularization/pluralization
• Gender detection
• Word inflection (modification of a word to express different grammatical categories such as tense, case, voice, etc.). Other forms of inflection include conjugation (inflection of verbs) and declension (inflection of nouns, adjectives, adverbs, etc.).
• Lemmatization (the base form of the word, or the reverse of inflection)
• Spell checking

Source: https://towardsdatascience.com/linguistic-knowledge-in-natural-language-processing-332630f43ce1
Syntax (Parsing)
In this stage, we focus more on the relationship of the words within a sentence, i.e. how a sentence is constructed.

Syntactic analysis is usually done at sentence level, whereas for morphology the analysis is done at word level.

When we are building dependency trees or processing parts of speech, we are basically analyzing the syntax of the sentence.

Source: https://towardsdatascience.com/linguistic-knowledge-in-natural-language-processing-332630f43ce1
Semantics
Once we’ve understood the syntactic structures, we are more prepared to get into the “meaning” of
the sentence (for a fun read on what meaning can actually mean in NLP — head over here to dive
into a Twitter discussion on the subject ).
Some example of tasks performed at this stage include:
• Named Entity Recognition (NER)
• Relationship Extraction

Source: https://towardsdatascience.com/linguistic-knowledge-in-natural-language-processing-332630f43ce1
Pragmatics
At this level, we try to understand the text as a whole. Popular problems that we’re
trying to solve at this stage are:
• Topic modelling
• Co-reference
• Summarization
• Question & Answering

Source: https://towardsdatascience.com/linguistic-knowledge-in-natural-language-processing-332630f43ce1
Text processing
Theory and practice of automating the creation or manipulation of electronic text.

Representation of data:
• Text.
• Images.
• Audio.
• Videos.

Analyzing the data, which may be structured or unstructured, to obtain structured information:

• Text extraction.
• Text classification.
Extracting individual, small bits of information from large text data is called text extraction.
Assigning values or categories to text data depending upon its content is called text classification.
Text analysis vs. Text mining vs. Text analytics
• Text mining is used to obtain data by statistical pattern learning.
• Both text analysis and text mining are qualitative processes.
• Text analytics is a quantitative process.

• Example:
– Banking service: Customer satisfaction.
– Text analysis: Individual performance of the customer support executive; words used in the feedback like "good", "bad".
– Text analytics:
• Overall performance of all the support executives.
• Graph for visualizing the performance of the entire support team.
– Text analytics for overall count of issues resolved.
Text processing tools
• Statistical methods
• Text classification methods
• Text extraction methods
Tools and methodologies: Statistical methods
• Statistical methods:
– Word frequency: Identify the most regularly used expressions or words that are present in a specific text.
– Collocation: Method for identifying common words that appear together.
– Concordance: Shows each occurrence of a word or phrase together with its surrounding context.
– TF-IDF: Identifies the importance of words in a document.

Figure: Statistical methods


Source: http://grjenkin.com/articles/category/data-science/106322/big-data-data-science-and-machine-learning-explained
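A small sketch of two of these statistical methods, word frequency and collocation, using NLTK (the sample text is illustrative):

```python
# Word frequency and collocation discovery with NLTK on a toy text.
import nltk
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

text = ("natural language processing makes machines understand natural language "
        "and natural language generation produces language").split()

# Word frequency: the most common words in the text
freq = nltk.FreqDist(text)
print(freq.most_common(3))

# Collocation: word pairs that co-occur more often than chance
bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(text)
print(finder.nbest(bigram_measures.likelihood_ratio, 3))
```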
Tools and methodologies: Text classification
• Text classification:
– Content is analyzed and classified into multiple predefined groups based upon the analysis.

Figure: Text classification


Source: https://hackernoon.com/text-classification-simplified-with-facebooks-fasttext-b9d3022ac9cb
Tools and methodologies: Text classification
• Topic analysis: Identify and interpret large collections of text according to the individual topics assigned.
• Sentiment analysis: Understanding the emotional tone represented in a textual message.

Figure: Language classification


Source: https://www.kdnuggets.com/2018/03/5-things-sentiment-analysis-classification.html
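A compact sketch of text classification with scikit-learn, combining TF-IDF features with a linear classifier; the tiny training set and its labels are illustrative assumptions:

```python
# Text classification sketch: TF-IDF features + logistic regression.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = ["great product, very happy", "terrible service, very bad",
               "good support and fast reply", "bad quality, poor experience"]
train_labels = ["positive", "negative", "positive", "negative"]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(train_texts, train_labels)

print(clf.predict(["the support team was good"]))  # likely ['positive']
```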
Tools and methodologies: Text extraction
• Text extraction: Process of gathering valuable pieces of information present within the text
data.

Figure: Text extraction


Source: https://www.upgrad.com/blog/what-is-text-mining-techniques-and-applications/

• Keyword extraction: Identifying and detecting the most relevant words inside a text.

• Entity extraction: Useful for gathering information on specific relevant elements while discarding all other irrelevant elements.
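A small keyword-extraction sketch: rank the words of a document by their TF-IDF weight using scikit-learn (the documents are illustrative):

```python
# Keyword extraction sketch: pick the top TF-IDF terms of a document.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the battery life of this phone is excellent",
    "the camera quality of this phone is poor",
    "battery replacement was quick and cheap",
]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(docs)

# Top keywords for the first document
terms = vectorizer.get_feature_names_out()
scores = tfidf[0].toarray().ravel()
print(sorted(zip(terms, scores), key=lambda x: -x[1])[:3])
```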
Scope of text analysis/processing
• Large documents:
– Referred to for context.
– Cross-examine multiple documents.

• Individual sentences:
– Gathering specific information.
– Identify the emotional or intentional activities.

• Parts of the sentences:
– Sentiments of the words can be analyzed.
– Better understanding of the natural language.
– Provided for the machine to analyze and understand.
Importance of text analysis
• Business growth:
– Extraction of information to identify the customer.

• Real-time analysis:
– Urgent requirements or complaints are handled on a real-time basis.
– Categorized as priority.
– May require multiple analyses.

• Checking for consistency:
– Detect the latest models.
– Analyzing.
– Understanding.
– Sharing the available data accurately.
Working principles of text analysis
• Data gathering.

• Data preparation.

• Data analysis.
Data gathering
• Text analysis: Gathering the required data that need to be analyzed.

• Internal data:
– Email.
– Chat messages.
– CRM tools.
– Databases.
– Surveys.
– Spreadsheets.
– Product analysis report.

Figure: Text analysis


Source: https://voziq.com/customer-retention/improving-customer-retention-strategies-with-unstructured-customer-data/attachment/common-sources-of-unstructured-data/
Data gathering
• External data: The external data do not belong to the organization and are available freely
through other sources.

• Web scraping tools.


• Open data.

Figure: Web scraping tools


Source: https://strikedeck.com/top-10-customer-data-sources/
Data preparation
• Before text is analyzed by any machine learning algorithm, it needs to be prepared.

• Tokenization:
– Identify and recognize the units of text.
– Process of breaking up text characters into meaningful elements (tokens).
– Analyze the meaningful parts of the text and discard the meaningless sections.
– Often followed by removal of the very frequent stop words found in a sentence.

• Stemming:
– Used to reduce a word to its root while still conveying meaning.
– Removes unnecessary characters such as prefixes and suffixes.

• Lemmatization:
– Uses the part of speech of a word to remove inflection and return its dictionary form (lemma).
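A short preparation sketch with NLTK covering tokenization, stop-word removal, stemming, and lemmatization (it assumes the punkt, stopwords, and wordnet data have already been downloaded; the sentence is illustrative):

```python
# Text preparation sketch: tokenize, drop stop words, stem, lemmatize.
# Assumes the required NLTK data (punkt, stopwords, wordnet) are available.
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

text = "The children were playing happily in the gardens."

tokens = nltk.word_tokenize(text.lower())            # tokenization
content = [t for t in tokens if t.isalpha()
           and t not in stopwords.words("english")]  # drop frequent stop words
print(content)  # ['children', 'playing', 'happily', 'gardens']

stemmer = PorterStemmer()
print([stemmer.stem(t) for t in content])            # crude roots, e.g. 'happili'

lemmatizer = WordNetLemmatizer()
print([lemmatizer.lemmatize(t, pos="v") for t in content])  # dictionary forms, e.g. 'play'
```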
Dependency parsing
Constituency parsing
– Uses syntactic structures: abstract nodes associated with words and abstract categories.

Figure: Constituency parsing


Source: http://www.cs.cornell.edu/courses/cs5740/2017sp/lectures/13-parsing-const.pdf
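A minimal dependency-parsing sketch with spaCy, printing each token, its dependency label, and its syntactic head (it assumes en_core_web_sm is installed; the sentence is made up):

```python
# Dependency parsing with spaCy: every token points to its head with a labeled relation.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The cat chased the small mouse.")

for token in doc:
    print(f"{token.text:10} --{token.dep_:8}--> {token.head.text}")
```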
Lexical analysis, Syntactic analysis, Semantic
analysis, Discourse integration, Pragmatic
analysis
Lexical analysis
• Lexical analysis is the process of converting a sequence of characters into a
sequence of tokens. A lexer is generally combined with a parser, which together
analyzes the syntax of programming languages, web pages, and so forth.
• Lexers and parsers are most often used for compilers but can be used for other computer language tools, such as pretty printers or linters.
• Lexical analysis is also an important analysis during the early stage of natural
language processing, where text or sound waves are segmented into words and
other units

How can it be done?

• Classical way: lookup in a dictionary.
• Traditional way: tokenization by ML methods.
• Tokenization.
• Tags or recognition.
Syntactic analysis
• Parsing, syntax analysis, or syntactic analysis is the process of analyzing a string of symbols, either in natural language, computer languages, or data structures, conforming to the rules of a formal grammar.
• It is used in the analysis of computer languages, referring to the syntactic
analysis of the input code into its component parts to facilitate the writing of
compilers and interpreters.
• Grammatical rules are applied to categories and groups of words, not individual
words. Syntactic analysis is a very important part of NLP that helps in
understanding the grammatical meaning of any sentence.
Syntactic analysis
Semantic analysis
• Semantic Analysis attempts to understand the meaning of Natural Language.
• Semantic Analysis of Natural Language captures the meaning of the given text
while considering context, logical structuring of sentences, and grammar roles.
• Semantic analysis can begin with the relationship between individual words.
Discourse integration
The analysis and identification of the larger context for any smaller part of natural
language structure

For example, in "Ram met Mohan and he smiled", "he" can be Ram or Mohan; the surrounding discourse is needed to resolve it.


Pragmatic analysis
• Refers to the study of how language is used in context to convey meaning.
• Pragmatic analysis deals with outside world knowledge, which means knowledge that is external to the documents and/or queries.
Corpus
What Is a Corpus?
A corpus is a collection of examples of language in use that are selected
and compiled in a principled way.
• The term corpus refers to the intention for it to be a representative
body of evidence for the study of language and language use. In the
most general terms, the purpose of a corpus is to document a
language. Hence a corpus is often an essential component of
language documentation or language archives.
• The construction of a corpus starts with decisions on design criteria.
Corpus design criteria are mainly driven by the purpose of the corpus,
but may also be affected by meta-theoretical concerns such as
evaluation methods, reusability, and interoperability.
Corpus creation
• A basic reference resource that records the features of the natural language.
• Created based upon the grammar and context.
• Used for linguistic analysis and hypothesis testing.

Figure: Corpus creation


Source: https://devopedia.org/text-corpus-for-nlp
Types of corpora
Single Language (or monolingual) corpus
Balanced (General purpose) Corpus
• Example: English language
Specialized Corpus
• Example: Engineering, medical in English language
Synchronic Corpus
• contains language data that are produced in roughly the same time period
Diachronic Corpus
• contains data from different time periods
Spoken Corpus
Written Corpus
Mixed Corpus

Multilanguage (or multilingual) corpus
Parallel Corpus
• contains texts of the same content in different languages (e.g., an original text and its translation(s) in one or more other languages)
Comparable Corpus
• Multilanguage corpus containing a collection of texts from two or more languages collected under the same set of criteria
• Language for general purposes corpora:
– Economic corpora.
– Legal corpora.
– Medical corpora.

• Multilingual parallel corpora:
– L1: L2 bidirectional.
– L1 translation L2.
Usage areas of corpora
• Translation:
– Parallel corpora.
– Native corpora.

• Education:
– Data driven learning.
– Concordance usage.
– Generalization extraction from data.

• General usages:
– Native speaker intuition.
– Frequency of occurrence.
– Relationship as per usage.
Traits of a good text corpus
• Depth:
– Example: Wordlist: Top 50000 words and not just top 5000 words.

• Recent:
– Example: outdated texts use "courting" vs. current-age texts use "dating".

• Metadata:
– Example: Source, Genre, Type.

• Genre:
– Example: Text from newspapers, journals, etc.

• Size:
– Example: half million English words.

• Clean:
– Example: Noun – Flower, Fruit; Verb – Eat, Smell.
Annotation and Storage of Corpus
• Modern corpora are digitally stored.
• A popular approach is to use the Extensible Markup Language (XML) for encoding and storing corpus files, containing two types of elements: content (for keeping the linguistic data collected for the corpus) and markup (for keeping annotations).
• Annotation enriches a corpus and facilitates discovery of generalizations and knowledge which is difficult, if not impossible, to obtain from a raw corpus. Example: POS tagging.
• All currently available balanced corpora are POS-tagged.
• A syntactically annotated corpus, however, is conventionally called a treebank.
• The standard way is manual annotation by annotators.
• Modern automated annotation predicts labels using either statistical or heuristic rules.
• Another way is crowdsourcing.
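As a sketch of how an annotated corpus fragment could be stored as XML, with content in the element text and POS annotations as markup attributes, the snippet below also reads the fragment back with Python's ElementTree (the tag names and attributes are illustrative, not a standard corpus schema):

```python
# Illustrative XML storage of a POS-annotated sentence, parsed with ElementTree.
import xml.etree.ElementTree as ET

xml_fragment = """
<sentence id="s1">
  <token pos="PRP">They</token>
  <token pos="VBD">picnicked</token>
  <token pos="IN">by</token>
  <token pos="DT">the</token>
  <token pos="NN">pool</token>
</sentence>
"""

root = ET.fromstring(xml_fragment)
for tok in root.findall("token"):
    print(tok.text, tok.get("pos"))   # content plus its markup annotation
```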
Annotations in text corpus (1 of 2)

Figure: Inline annotations


Source: https://www.researchgate.net/figure/A-piece-of-biology-text-annotated-with-multiple-ontologies-Different-color-highlights_fig1_261028765
Corpus-Words
They picnicked by the pool, then lay back on the grass and looked at the stars.

• Types are the number of distinct words in a corpus.
• Tokens are the total number N of running words.

If we ignore punctuation, the above sentence has 16 tokens and 14 types.
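A quick way to verify these counts in Python (ignoring punctuation, as above):

```python
# Count tokens (running words) and types (distinct words) in the example sentence.
import re

sentence = "They picnicked by the pool, then lay back on the grass and looked at the stars."
tokens = re.findall(r"[A-Za-z]+", sentence)        # ignore punctuation

print(len(tokens))                                  # 16 tokens
print(len(set(t.lower() for t in tokens)))          # 14 types ('the' repeats)
```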
Datasheet or Data statement
The best way is for the corpus creator to build a datasheet or data statement
A datasheet specifies properties of a dataset like:
Motivation: Why was the corpus collected, by whom, and who funded it?
Situation: When and in what situation was the text written/spoken? For example, was there a task?
Was the language originally spoken conversation, edited text, social media communication,
monologue vs. dialogue?
Language variety: What language (including dialect/region) was the corpus in?
Speaker demographics: What was, e.g., age or gender of the authors of the text?
Collection process: How big is the data? If it is a subsample how was it sampled? Was the data
collected with consent? How was the data pre-processed, and what metadata is available?
Annotation process: What are the annotations, what are the demographics of the annotators, how
were they trained, how was the data annotated?
Distribution: Are there copyright or other intellectual property restrictions?
NLP Libraries
• Scikit-learn: It provides a wide range of algorithms for building machine
learning models in Python.
• Natural Language Toolkit (NLTK): NLTK is a complete toolkit for all NLP techniques.
• Pattern: It is a web mining module for NLP and machine learning.
• TextBlob: It provides an easy interface for basic NLP tasks like sentiment analysis, noun phrase extraction, or POS tagging.
• Quepy: Quepy is used to transform natural language questions into queries
in a database query language.
• SpaCy: SpaCy is an open-source NLP library which is used for Data
Extraction, Data Analysis, Sentiment Analysis, and Text Summarization.
• Gensim: Gensim works with large datasets and processes data streams.
