
CCS369 - TEXT AND SPEECH ANALYSIS

By
C.Jerin Mahibha
Assoc. Prof / CSE
UNIT I NATURAL LANGUAGE BASICS
• Foundations of natural language processing
• Language Syntax and Structure
• Text Preprocessing and Wrangling
• Text tokenization
• Stemming
• Lemmatization
• Removing stop-words
• Feature engineering for text representation
• Bag of Words model
• Bag of N-Grams model
• TF-IDF model

COURSE OBJECTIVES:
Understand natural language processing basics
COURSE OUTCOME:
CO1: Explain existing and emerging deep learning architectures for text and speech
processing
Text Book:
"Text Analytics with Python: A Practical Real-World Approach to Gaining Actionable
Insights from Your Data" by Dipanjan Sarkar - Chapters 1 and 3
Foundations of natural language processing
• Big Data - “the 3 V’s”—volume, variety, and velocity of data
• Examples
• social media - tweets, status updates, comments, hashtags, articles, blogs, wikis
• retail and e-commerce -customer reviews and feedback
• Challenges associated with textual data
  - Effective storage and management of the data, which is unstructured
  - Analyzing the data and trying to extract meaningful patterns and useful insights
• Natural language processing (NLP) - deals with natural languages, which, unlike
  programming languages, are not easily understood by machines
• Textual data - highly unstructured - does not follow or adhere to structured or
  regular syntax and patterns, so mathematical or statistical models cannot be
  directly used to analyze it
The Philosophy of Language
Natural Language
• evolved by humans through natural use and communication
• not constructed and created artificially - like computer programming language
• can be communicated in different forms - speech, writing, or even signs
• Eg: English, Japanese, and Sanskrit
Philosophy deals with four problems and seeks answers to solve them:
• The nature of meaning in a language
o concerned with the semantics of a language and the nature of meaning
o how words, which have their own meanings, are structured together to form meaningful sentences
o solved using - Syntax, semantics, grammars, and parse trees
• The use of language
• how language is used as an entity in various scenarios and communication between human beings
• analyzing speech and the usage of language when speaking-speaker’s intent, tone, content and actions
• Language cognition
• focuses on how the cognitive functions of the human brain are responsible for understanding and
interpreting language
• how the mind works in combining and relating words into sentences and then into a meaningful message
• The relationship between language and reality
• extent of truth of expressions originating from language
Two popular models
1. Triangle of reference model
2. Direction of fit model
Language Acquisition and Usage
Language Acquisition
• ability of acquiring and producing languages
o god gifted
o word meaning mapping
o behavioral theory - by imitating and hearing from adults
o language acquisition device - syntax, semantics, concepts of parts of speech, and
grammar
o autonomy of syntax - "Colorless green ideas sleep furiously" (Chomsky)
• process by which human beings utilize their cognitive abilities, knowledge,
and experience to understand language based on hearing and perception and
start using it in terms of words, phrases, and sentences to communicate with
other human beings
Language Usage
• different ways in which language is used in communication
• three main categories of speech acts:
1. Locutionary acts
o concerned with the actual delivery of the sentence when communicated from one human being to another by speaking it
2. Illocutionary acts
o focus further on the actual semantics and significance of the sentence which was communicated
o five different classes of illocutionary speech acts
 Assertives
• speech acts that communicate how things are already existent in the world.
• Represent word-to-world direction of fit
• The Earth revolves round the Sun
 Directives
• speech acts that the sender communicates to the receiver asking or directing them to do something.
• Get me the book from the table
 Commissives
• speech acts that commit the sender or speaker who utters them to some future voluntary act or action
• I promise to be there tomorrow for the ceremony
 Expressives
• reveal a speaker or sender’s disposition and outlook toward a particular proposition communicated through the message
• Congratulations on graduating top of the class
 Declarations
• powerful speech acts that have the capability to change the reality based on the declared proposition in the message communicated by the speaker/sender
• I hereby declare him to be guilty of all charges.
3. Perlocutionary acts
o actual effect the communication had on its receiver, which is more psychological or behavioral
Get me the book from the table spoken by a father to his child
 phrase when spoken by the father - locutionary act
 directs the child to get the book from the table - illocutionary act
 he brings the book from the table to his father - perlocutionary act
Linguistics
 scientific study of language - its form and syntax, its meaning and semantics, and its context of use
 detailed exploration of linguistics is not needed for text analytics
 main distinctive areas of study:
Phonetics :
• Study of the acoustic properties of sounds produced by the human vocal tract during speech
• Smallest individual unit of human speech in a specific language is called a phoneme
Phonology :
• Study of sound patterns as interpreted in the human mind and used for distinguishing between different
phonemes
• includes phonemes, accents, tone, and syllable structures
Syntax :
• Study of sentences, phrases, words, and their structures
Semantics :
• Study of meaning in language
 Lexical semantics : Meanings of words and symbols using morphology and syntax.
 Compositional semantics : Studying relationships among words and combination of words and understanding the
meanings of phrases and sentences and how they are related.
Morphology :
• study of the structure and meaning of distinctive units or morphemes -smallest unit of language that has
distinctive meaning
• includes things like words, prefixes, suffixes, and so on
Linguistics - Contd
Lexicon :
• Study of properties of words and phrases used in a language and how they build the vocabulary of the
language.
• Include what kinds of sounds are associated with meanings for words
Pragmatics :
• Study of how both linguistic and nonlinguistic factors like context and scenario might affect the meaning of an
expression of a message or an utterance
Discourse analysis :
• This analyzes language and exchange of information in the form of sentences across conversations among
human beings.
• Could be spoken, written, or even signed.
Stylistics :
• Study of language with a focus on the style of writing, including the tone, accent, dialogue, grammar, and type
of voice.
Semiotics :
• Study of signs, symbols, and sign processes and how they communicate meaning.
• Things like analogy, metaphors, and symbolism are covered in this area.

Syntax and semantics are some of the most important concepts that often form the foundations of natural
language processing
Language Syntax and Structure
• set of specific rules, conventions, and principles to combine
• words into phrases, phrases into clauses, and clauses into sentences
• related to each other in a hierarchical structure
• sentence is a structured format of representing a collection of words
provided they follow certain syntactic rules like grammar.
Sentences with proper syntax not only help us give proper structure
and relate words together but also help them convey meaning based
on the order or position of the words. Considering our previous
hierarchy of sentence → clause → phrase → word, we can construct
the hierarchical sentence tree
Words
• smallest units in a language that are independent and have a meaning of their own
• word can be comprised of several morphemes
• useful to annotate and tag words and analyze them into their parts of speech (POS) -syntactic categories
Open classes - consist of an infinite set of words, accept new additions - nouns, verbs, adjectives, and adverbs
Closed classes - consist of a closed and finite set of words and do not accept new additions - Pronouns
Clauses
• group of words with some relation between
• usually contains a subject and a predicate
• can act as independent sentences , or several clauses can be combined together to form a sentence
• can be subdivided into several categories based on syntax
 Declarative : standard statements, which are declared with a neutral tone and which could be factual or non-factual
- Grass is green
 Imperative : usually in the form of a request, command, rule, or advice; the tone is that of an order - Please do not talk in class
 Relative: subordinate clauses- dependent on another part of the sentence that usually contains a word, phrase, or
even a clause- John just mentioned that he wanted a soda
 Interrogative : usually are in the form of questions- Didn’t you go to school?
 Exclamative : used to express shock, surprise, or even compliments- What an amazing race!

Grammar
• consists of a set of rules used in determining how to position words, phrases, and clauses when constructing
sentences -Subject-Verb-Object (SVO)
• subdivided into two main classes—based on their representations for linguistic syntax and structure
 dependency grammars
 Constituency grammars
Dependency grammars
• word-based grammars
• Dependencies in this context are labeled word-word relations
• the word that has no dependency is called the root of the sentence - usually the verb
Constituency Grammars
• also called phrase structure grammars
• sentence can be represented by several constituents derived from it
• represent the internal structure of sentences in terms of a hierarchically ordered structure of
their constituents
• S → NP VP, where S is the sentence or clause, divided into the subject, denoted by the
noun phrase (NP), and the predicate, denoted by the verb phrase (VP)
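As an illustration, a constituency grammar can be sketched with NLTK's CFG class; the toy grammar and sentence below are assumptions for demonstration, not from the source:

import nltk

# A minimal toy constituency grammar (assumed for illustration)
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det N
VP -> V NP
Det -> 'the'
N -> 'dog' | 'ball'
V -> 'chased'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse('the dog chased the ball'.split()):
    print(tree)
# (S (NP (Det the) (N dog)) (VP (V chased) (NP (Det the) (N ball))))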
Word Order Typology
• field that specifically deals with trying to classify languages based on their syntax, structure, and functionality
• classify them according to their dominant word orders, also known as word order typology
Text Pre-processing and Wrangling
• Machine learning (ML) algorithms - usually work with input features that are numeric in nature
• Text data is highly unstructured
• Need to clean, normalize, and pre-process the initial textual data
• pre-processing
• techniques to convert raw text into well-defined sequences of linguistic components that have standard structure
• helps in cleaning and standardization of the text, which aids analytical systems and increases the accuracy of classifiers
• robust text pre-processing system - essential part of any application on NLP and text analytics
• “garbage in, garbage out” - if we do not process the text properly - end up with unwanted and irrelevant results
• Popular text pre-processing techniques
 Tokenization
 Tagging
 Chunking
 Stemming
 Lemmatization
• Basic operations
 Dealing with misspelled text
 Removing stop words
 Handling irrelevant components - based on the problem to be solved
Text tokenization
• process of breaking down or splitting textual data into smaller meaningful
components called tokens
• tokens are independent and minimal textual components that have some definite
syntax and semantics
• text document-sentences -clauses, phrases, and words
• popular tokenization techniques include
• sentence tokenization and
• word tokenization
Sentence Tokenization
• process of splitting a text corpus into sentences
• act as the first level of tokens
• also known as sentence segmentation
• text corpus - text where each paragraph comprises several sentences
• look for specific delimiters between sentences,
• a period (.)
• a newline character (\n)
• a semi-colon (;)
• NLTK framework
• sent_tokenize – default tokenizer- uses PunktSentenceTokenizer class internally
• PunktSentenceTokenizer
• RegexpTokenizer
• Pre-trained sentence tokenization models
• tokenizer - quite intelligent
• doesn’t just use periods to delimit sentences
• also considers other punctuation and the capitalization of words
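A minimal sketch of the default sentence tokenizer (the sample text is an assumption for illustration):

import nltk
# nltk.download('punkt')  # required once for the default tokenizer models

text = "We will learn NLP basics. Tokenization splits text! Does it use punctuation?"
sentences = nltk.sent_tokenize(text)
print(sentences)
# ['We will learn NLP basics.', 'Tokenization splits text!', 'Does it use punctuation?']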
3. RegexpTokenizer
• use specific regular expression-based patterns to segment sentences
• sample regex pattern to tokenize sentences:
  SENTENCE_TOKENS_PATTERN = r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<![A-Z]\.)(?<=\.|\?|\!)\s'
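A sketch of regex-based sentence tokenization with the pattern above; gaps=True treats whitespace matched after sentence-ending punctuation as a separator between sentences:

from nltk.tokenize import RegexpTokenizer

text = "We will learn NLP basics. Tokenization splits text! Does it use punctuation?"
SENTENCE_TOKENS_PATTERN = r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<![A-Z]\.)(?<=\.|\?|\!)\s'
regex_st = RegexpTokenizer(pattern=SENTENCE_TOKENS_PATTERN, gaps=True)
print(regex_st.tokenize(text))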
4. Pre-trained sentence tokenization models
• can tokenize text of other languages - e.g., German text
• Implemented in two ways:
  1. sent_tokenize, which is already trained
  2. load a pre-trained tokenization model for German text into a
     PunktSentenceTokenizer instance and perform the same operation
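A sketch of both approaches; the German sample text is an assumption, and the model path follows classic NLTK packaging (newer NLTK versions may package the Punkt models differently):

import nltk
# nltk.download('punkt')  # required once

german_text = 'Wie geht es Ihnen? Mir geht es gut.'  # sample input (assumption)

# Approach 1: sent_tokenize with the language argument
print(nltk.sent_tokenize(german_text, language='german'))

# Approach 2: load the pre-trained German Punkt model directly
german_tokenizer = nltk.data.load('tokenizers/punkt/german.pickle')
print(german_tokenizer.tokenize(german_text))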
Word Tokenization
• process of splitting or segmenting sentences into their constituent words.
• important in many processes - cleaning and normalizing text
• nltk provides interfaces for word tokenization
word_tokenize – default - instance or object of the TreebankWordTokenizer class
TreebankWordTokenizer
RegexpTokenizer
Inherited tokenizers from RegexpTokenizer
2. TreebankWordTokenizer
• based on the Penn Treebank
• uses various regular expressions to tokenize the text
• one primary assumption - sentence tokenization performed beforehand
• output - similar to word_tokenize() - both use the same tokenizing mechanism
• Penn Treebank - www.cis.upenn.edu/~treebank/tokenizer.sed
• main features
• Splits and separates out periods that appear at the end of a sentence
• Splits and separates commas and single quotes when followed by whitespaces
• Most punctuation characters are split and separated into independent tokens
• Splits words with standard contractions - for example, don't becomes do and n't
3. RegexpTokenizer class
• two main parameters:
  • regex pattern for building the tokenizer
  • gaps parameter - if True, the pattern is used to find the gaps between the
    tokens; otherwise the pattern matches the tokens themselves
4. Inherited tokenizers from RegexpTokenizer
• WordPunctTokenizer - uses the pattern r'\w+|[^\w\s]+'
• WhitespaceTokenizer - based on whitespaces - tabs, newlines, and spaces.
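A sketch exercising the word tokenizers above on an assumed sample sentence:

from nltk.tokenize import (word_tokenize, TreebankWordTokenizer,
                           RegexpTokenizer, WordPunctTokenizer,
                           WhitespaceTokenizer)

sentence = "The brown fox wasn't that quick and he couldn't win the race"  # assumption

print(word_tokenize(sentence))                     # default, Treebank-based
print(TreebankWordTokenizer().tokenize(sentence))  # same mechanism as above
print(RegexpTokenizer(pattern=r'\w+', gaps=False).tokenize(sentence))
print(WordPunctTokenizer().tokenize(sentence))     # splits off all punctuation
print(WhitespaceTokenizer().tokenize(sentence))    # splits only on whitespace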
Stemming
• Word stems - base form of a word
• Stemming - generating the base form of a word from its inflected form
• reverse of inflection
• helps in standardizing words
• Used in applications like classifying , clustering text, information retrieval
• Used by search engines - to give better and more accurate results
• Morphemes
• the smallest independent unit in any natural language
• consist of units that are stems and affixes - prefixes, suffixes - change
meaning or create a new word
• Inflection
• creating new words by attaching affixes
• NLTK package has several implementations for stemmers
• stemmer can be chosen based on the problem and after trial and
error
• stemmers are implemented in the stem module; they implement the StemmerI
  interface from the nltk.stem.api module
• users can create their own stemmer using this interface as the base class,
  building a stemmer with user-defined rules
1. Porter stemmer
• popular stemmer
• five different phases for reduction of inflections - each phase has its own set of rules
• There also exists a Porter2 algorithm, with some improvements
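A minimal sketch of the Porter stemmer in nltk:

from nltk.stem import PorterStemmer

ps = PorterStemmer()
print(ps.stem('jumping'), ps.stem('jumps'), ps.stem('jumped'))  # jump jump jump
print(ps.stem('lying'))    # lie
print(ps.stem('strange'))  # strang - stems need not be dictionary words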
2. Lancaster stemmer
• based on the Lancaster stemming algorithm, also known as the Paice/Husk
stemmer
• is an iterative stemmer
• has over 120 rules specifying specific removal or replacement for affixes to
obtain the word stems
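The Lancaster stemmer is used the same way; its output often differs from Porter on the same words:

from nltk.stem import LancasterStemmer

ls = LancasterStemmer()
print(ls.stem('jumping'))  # jump
print(ls.stem('lying'))    # lying
print(ls.stem('strange'))  # strange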
3. RegexpStemmer
• build your own stemmer based completely on custom-defined rules expressed
  as regular expressions
• uses regular expressions to identify the morphological affixes in words
• any part of the string matching the pattern is removed
• min parameter - minimum word length required before stemming is applied
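A sketch with an assumed suffix pattern; min=4 leaves words shorter than four characters untouched:

from nltk.stem import RegexpStemmer

rs = RegexpStemmer('ing$|s$|ed$', min=4)  # assumed custom suffix rules
print(rs.stem('jumping'))  # jump
print(rs.stem('lying'))    # ly
print(rs.stem('strange'))  # strange - no matching affix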
4. SnowballStemmer
• supports stemming in 13 different languages besides English
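A sketch of the Snowball stemmer on German words (examples assumed):

from nltk.stem import SnowballStemmer

print(SnowballStemmer.languages)  # tuple of supported languages
ss = SnowballStemmer('german')
print(ss.stem('autobahnen'))  # autobahn
print(ss.stem('springen'))    # spring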
Lemmatization
• process similar to stemming - remove word affixes to get to a
  base form of the word
• here the base form is the root word (lemma), which is always present in the
  dictionary
• unlike the root stem, which may not be a lexicographically correct word and
  may not be present in the dictionary
• considerably slower than stemming, because the base form must be validated
  against the dictionary
• The nltk package
  - uses WordNet
  - considers the word's syntax and semantics, like its part of speech -
    nouns, verbs, and adjectives
• Uses the WordNetLemmatizer class
uses the morphy() function belonging to the WordNetCorpusReader class
Uses part of speech by checking the Wordnet corpus
use a recursive technique for removing affixes from the word until a match is
found in WordNet
If no match is found, the input word itself is returned unchanged
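A minimal sketch of the WordNetLemmatizer; the pos parameter ('n', 'v', 'a') drives the WordNet lookup:

from nltk.stem import WordNetLemmatizer
# nltk.download('wordnet')  # required once

wnl = WordNetLemmatizer()
print(wnl.lemmatize('cars', pos='n'))     # car
print(wnl.lemmatize('running', pos='v'))  # run
print(wnl.lemmatize('fancier', pos='a'))  # fancy
print(wnl.lemmatize('ate', pos='v'))      # eat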
Removing stop-words
• Words that have little or no significance
• Removed from text during processing - retain words having
maximum significance and context
• Usually occur the most
• Words like a, the , me , and so on
• No universal or exhaustive list of stopwords
• Each domain or language may have its own set of stopwords
• List of all English stopwords in nltk's vocabulary -
  nltk.corpus.stopwords.words('english')
• negations like not and no are also stopwords - removing them can destroy the
  actual context, which is essential in tasks like sentiment analysis
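A sketch of stop-word removal (sample sentence assumed):

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
# nltk.download('stopwords')  # required once

stop_words = set(stopwords.words('english'))
tokens = word_tokenize('The brown fox is quick and he is jumping over the lazy dog')
filtered = [t for t in tokens if t.lower() not in stop_words]
print(filtered)  # ['brown', 'fox', 'quick', 'jumping', 'lazy', 'dog']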
Feature Engineering for Text representation
• Dataset- has many data points
• Rows, columns of the dataset - are various features or properties of the dataset
• Features
- unique, measurable attributes or properties for each data point
- usually numeric in nature
- can be absolute numeric values or categorical
• Process of encoding categorical features as binary features - one-hot encoding
• Process of extracting and selecting features - feature extraction or feature
engineering
• extracted features - fed into ML algorithms for learning patterns - can be applied
on new data points for getting insights.
• ML algorithms
- Base is mathematical operation - optimization and minimizing loss and error
- Expect features in the form of numeric vectors
- Textual data - challenge - transforming textual data and extract numeric features
- Use feature-extraction concepts and techniques
The Vector Space Model or Term Vector Model
• Very useful when dealing with textual data
• Very popular in information retrieval and document ranking
• Uses mathematical and algebraic model for transforming
• Represent text documents as numeric vectors of specific terms that form the vector dimensions
• Mathematically this can be defined as D ∈ VS, where
  D - document
  VS - document vector space
• The number of dimensions or columns for each document = total number of
  distinct terms or words across all documents, n
• The document D in the vector space can be represented as
  D = {wD1, wD2, ..., wDn}
  where wDn - weight for word n in document D
Weight
- is a numeric value
- can represent
 frequency of the word in the document
 average frequency of occurrence, or
 TF-IDF weight
Feature-extraction techniques:
• Bag of Words model
• TF-IDF model
• Advanced word vectorization models
Points to remember :
• build a feature extractor using some transformations and mathematical
operations
• reuse the same process when extracting features from new documents to be
predicted
• do not rebuild the whole algorithm again based on the new documents
Bag of Words model
• one of the simplest yet most powerful feature extraction techniques
• convert text documents into vectors
• represents the frequency of all the distinct words for that specific
document.
• the mathematical notation for document D stays D = {wD1, wD2, ..., wDn}, where the
  weight for each word is now its frequency of occurrence in that document
Bag of N-Grams model
• Same model can be used for
 individual word occurrences
 occurrences for n-grams - n-gram Bag of Words - frequency of distinct n-
grams
• The same code snippet can be used to implement both a Bag of Words and a
  Bag of N-Grams model - set the ngram_range parameter
• feature-extraction process, model, and vocabulary
 always based on the training data
 will never change or get influenced on newer documents
Bag of Words model - Implementation
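A minimal sketch of a Bag of Words implementation with scikit-learn's CountVectorizer (toy corpus assumed):

from sklearn.feature_extraction.text import CountVectorizer

corpus = ['the sky is blue', 'the sky is so blue']  # toy corpus (assumption)
cv = CountVectorizer()
bow = cv.fit_transform(corpus)
print(cv.get_feature_names_out())  # vocabulary learned from the training data
print(bow.toarray())               # term-frequency vectors, one row per document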
Bag of N-Grams model - Implementation
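The same vectorizer with ngram_range set yields a Bag of N-Grams model; this sketch extracts bigrams only from the same assumed corpus:

from sklearn.feature_extraction.text import CountVectorizer

corpus = ['the sky is blue', 'the sky is so blue']  # same toy corpus (assumption)
bv = CountVectorizer(ngram_range=(2, 2))            # bigrams only
bow_ngrams = bv.fit_transform(corpus)
print(bv.get_feature_names_out())  # ['is blue', 'is so', 'sky is', 'so blue', 'the sky']
print(bow_ngrams.toarray())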
TF-IDF model
• Bag of Words model - completely based on absolute frequencies of word occurrences
• problem - insignificant words with high frequencies across all documents can
  overshadow more meaningful but rarer words
• Solved using TF-IDF
TF-IDF stands for Term Frequency-Inverse Document Frequency
• Combination of two metrics:
Term frequency (tf) and
Inverse document frequency (idf)
• Originally developed as a metric for ranking functions and has come to be a
part of information retrieval and text feature extraction
• Mathematically, TF-IDF is the product of two metrics and can be represented as
  tfidf(w, D) = tf(w, D) × idf(w)
1. Term frequency (tf)
• computed using the Bag of Words model
• raw frequency value of that term in a particular document
• Mathematically it can be represented as
  tf(w, D) = fwD
  where fwD - frequency for word w in document D
• other representations and computations for term frequency
• converting frequency to a binary feature where 1-the term has occurred in
the document and 0 - it has not.
• normalize the absolute raw frequency using logarithms or averaging the
frequency.
• using the raw frequency in our computations
2. Inverse document frequency (idf )
• inverse of the document frequency for each term
• It is computed by dividing the total number of documents in our
corpus by the document frequency for each term and then applying
logarithmic scaling on the result
• adding 1 to the document frequency - smoothing - prevents division-by-zero errors
• adding 1 to the result - avoids completely ignoring terms that might otherwise
  have zero idf
• Mathematically idf can be represented by
  idf(w) = 1 + log( N / (1 + df(w)) )
  where N - total number of documents, df(w) - document frequency of word w
3. Term frequency-inverse document frequency (tf-idf)
• Computed by multiplying tf and idf
• final TF-IDF metric - is a normalized version of the tfidf matrix
• Normalize the tf-idf matrix
• by dividing it with the L2 norm of the matrix- Euclidean norm- square root of
the sum of the square of each term’s tfidf weight.
• Mathematically we can represent the final tf-idf feature vector as
  tfidf = tfidf / ||tfidf||
  where ||tfidf|| represents the Euclidean L2 norm for the tf-idf matrix


Implementation

Two ways:
1. TfidfTransformer() - applied on top of Bag of Words (count) features
2. TfidfVectorizer() - a generic vectorizer that works directly on the raw documents
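A minimal sketch of both ways with scikit-learn (toy corpus assumed; the defaults are spelled out explicitly):

from sklearn.feature_extraction.text import (CountVectorizer, TfidfTransformer,
                                             TfidfVectorizer)

corpus = ['the sky is blue', 'the sky is so blue']  # toy corpus (assumption)

# Way 1: TfidfTransformer on top of Bag of Words count features
bow = CountVectorizer().fit_transform(corpus)
tt = TfidfTransformer(norm='l2', use_idf=True, smooth_idf=True)
print(tt.fit_transform(bow).toarray().round(2))

# Way 2: TfidfVectorizer works directly on the raw documents
tv = TfidfVectorizer(norm='l2', use_idf=True, smooth_idf=True)
print(tv.fit_transform(corpus).toarray().round(2))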
Internal Working

Implement the mathematical equations to compute the tfidf-based feature vectors:
• Load necessary dependencies
• compute the term frequencies (TF) - using Bag of Words based features
• compute document frequencies (DF)
• compute inverse document frequency (idf)
• compute the tfidf feature matrix using matrix multiplication
• divide it with the L2 norm
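A sketch of those steps with numpy, following the document's formula idf(w) = 1 + log(N / (1 + df(w))) (toy corpus assumed; note scikit-learn's own smoothing differs slightly):

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['the sky is blue', 'the sky is so blue']  # toy corpus (assumption)

# Steps 1-2: term frequencies via Bag of Words based features
cv = CountVectorizer()
tf = cv.fit_transform(corpus).toarray().astype(float)

# Step 3: document frequencies - number of documents containing each term
df = np.sum(tf > 0, axis=0)

# Step 4: inverse document frequency with smoothing
N = tf.shape[0]
idf = 1.0 + np.log(N / (1.0 + df))

# Step 5: tfidf matrix (element-wise product; idf broadcasts across rows)
tfidf = tf * idf

# Step 6: divide each row by its L2 norm
norms = np.linalg.norm(tfidf, axis=1, keepdims=True)
print((tfidf / norms).round(2))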
THANK YOU
