Tsa Unit1
By
C.Jerin Mahibha
Assoc. Prof / CSE
UNIT I NATURAL LANGUAGE BASICS
Foundations of natural language processing
Language Syntax and Structure
Text Preprocessing and Wrangling
• Text tokenization
• Stemming
• Lemmatization
• Removing stop-words
Feature engineering for text representation
• Bag of Words model
• Bag of N-Grams model
• TF-IDF model
COURSE OBJECTIVES:
Understand natural language processing basics
COURSE OUTCOME:
CO1: Explain existing and emerging deep learning architectures for text and speech processing
Text Book :
“Text Analytics with Python: A Practical Real-World approach to Gaining Actionable
insights from your data” by Dipanjan Sarkar - Chapter 1, 3
Foundations of natural language processing
• Big Data - “the 3 V’s”—volume, variety, and velocity of data
• Examples
• social media - tweets, status updates, comments, hashtags, articles, blogs, wikis
• retail and e-commerce - customer reviews and feedback
• Challenges associated with textual data
Effective storage and management of the data - unstructured
Analyzing the data and trying to extract meaningful patterns and useful insights
Natural language processing (NLP) - works with natural languages, which differ from programming languages that
are easily understood by machines.
Textual data - highly unstructured - does not follow or adhere to structured or
regular syntax and patterns — mathematical or statistical models cannot be
directly used to analyze it
The Philosophy of Language
Natural Language
• evolved by humans through natural use and communication
• not constructed and created artificially - like computer programming language
• can be communicated in different forms - speech, writing, or even signs
• Eg: English, Japanese, and Sanskrit
The philosophy of language deals with four problems and seeks answers to them:
• The nature of meaning in a language
o concerned with the semantics of a language and the nature of meaning
o how words, which have their own meanings, are structured together to form meaningful sentences
o solved using - Syntax, semantics, grammars, and parse trees
• The use of language
o how language is used as an entity in various scenarios and in communication between human beings
o analyzing speech and the usage of language when speaking: the speaker’s intent, tone, content, and actions
• Language cognition
o focuses on how the cognitive functions of the human brain are responsible for understanding and interpreting language
o how the mind works in combining and relating words into sentences and then into a meaningful message
• The relationship between language and reality
o extent of truth of expressions originating from language
o two popular models
1. Triangle of reference model
2. Direction of fit model
Language Acquisition and Usage
Language Acquisition
• ability of acquiring and producing languages
o a God-given gift
o word-meaning mapping
o behavioral theory - by hearing and imitating adults
o language acquisition device - syntax, semantics, concepts of parts of speech, and grammar
o autonomy of syntax - “Colorless green ideas sleep furiously” is grammatical but nonsensical
• process by which human beings utilize their cognitive abilities, knowledge,
and experience to understand language based on hearing and perception and
start using it in terms of words, phrases, and sentences to communicate with
other human beings
Language Usage
• different ways in which language is used in communication
• three main categories of speech acts:
1. Locutionary acts
o concerned with the actual delivery of the sentence when communicated from one human being to another by speaking it
2. Illocutionary acts
o focus further on the actual semantics and significance of the sentence which was communicated
o five different classes of illocutionary speech acts
Assertives
• speech acts that communicate how things are already existent in the world.
• Represent word-to-world direction of fit
• The Earth revolves round the Sun
Directives
• speech acts that the sender communicates to the receiver asking or directing them to do something.
• Get me the book from the table
Commissives
• speech acts that commit the sender or speaker who utters them to some future voluntary act or action
• I promise to be there tomorrow for the ceremony
Expressives
• reveal a speaker or sender’s disposition and outlook toward a particular proposition communicated through the message
• Congratulations on graduating top of the class
Declarations
• powerful speech acts that have the capability to change reality based on the declared proposition in the message communicated by the speaker/sender
• I hereby declare him to be guilty of all charges.
3. Perlocutionary acts
o actual effect the communication had on its receiver, which is more psychological or behavioral
Example: “Get me the book from the table”, spoken by a father to his child
• the phrase as spoken by the father - locutionary act
• it directs the child to get the book from the table - illocutionary act
• the child brings the book from the table to his father - perlocutionary act
Linguistics
scientific study of language - form and syntax of language, meaning and semantics, and context of use
detailed exploration of linguistics is not needed for text analytics
main distinctive areas of study:
Phonetics :
• Study of the acoustic properties of sounds produced by the human vocal tract during speech
• Smallest individual unit of human speech in a specific language is called a phoneme
Phonology :
• Study of sound patterns as interpreted in the human mind and used for distinguishing between different
phonemes
• includes phonemes, accents, tone, and syllable structures
Syntax :
• Study of sentences, phrases, words, and their structures
Semantics :
• Study of meaning in language
Lexical semantics : Meanings of words and symbols using morphology and syntax.
Compositional semantics : Studying relationships among words and combination of words and understanding the
meanings of phrases and sentences and how they are related.
Morphology :
• study of the structure and meaning of distinctive units or morphemes - a morpheme is the smallest unit of language that has
distinctive meaning
• includes things like words, prefixes, suffixes, and so on
Lexicon :
• Study of properties of words and phrases used in a language and how they build the vocabulary of the
language.
• Include what kinds of sounds are associated with meanings for words
Pragmatics :
• Study of how both linguistic and nonlinguistic factors like context and scenario might affect the meaning of an
expression of a message or an utterance
Discourse analysis :
• This analyzes language and exchange of information in the form of sentences across conversations among
human beings.
• Could be spoken, written, or even signed.
Stylistics :
• Study of language with a focus on the style of writing, including the tone, accent, dialogue, grammar, and type
of voice.
Semiotics :
• Study of signs, symbols, and sign processes and how they communicate meaning.
• Things like analogy, metaphors, and symbolism are covered in this area.
Syntax and semantics are some of the most important concepts that often form the foundations to natural
language processing
Language Syntax and Structure
• set of specific rules, conventions, and principles for combining words into phrases, phrases into clauses, and clauses into sentences
• these units are related to each other in a hierarchical structure
• sentence is a structured format of representing a collection of words
provided they follow certain syntactic rules like grammar.
Sentences with proper syntax not only help us give proper structure
and relate words together but also help them convey meaning based
on the order or position of the words. Considering our previous
hierarchy of sentence → clause → phrase → word, we can construct
the hierarchical sentence tree
Words
• smallest units in a language that are independent and have a meaning of their own
• a word can comprise several morphemes
• useful to annotate and tag words and analyze them into their parts of speech (POS), i.e. their syntactic categories
Open classes - consist of an infinite set of words and accept new additions - N, V, ADJ and ADV
Closed classes - consist of a closed and finite set of words and do not accept new additions - e.g. pronouns
Clauses
• group of words with some relation between them
• usually contains a subject and a predicate
• can act as independent sentences , or several clauses can be combined together to form a sentence
• can be subdivided into several categories based on syntax
Declarative : standard statements, which are declared with a neutral tone and which could be factual or non-factual
- Grass is green
Imperative : usually in the form of a request, command, rule, or advice; the tone is that of an order - Please do not talk in class
Relative : subordinate clauses that depend on another part of the sentence, which usually contains a word, phrase, or
even a clause - John just mentioned that he wanted a soda
Interrogative : usually are in the form of questions- Didn’t you go to school?
Exclamative : used to express shock, surprise, or even compliments- What an amazing race!
Grammar
• consists of a set of rules used in determining how to position words, phrases, and clauses when constructing
sentences, e.g. Subject-Verb-Object (SVO) order
• subdivided into two main classes, based on their representations for linguistic syntax and structure
Dependency grammars
Constituency grammars
Dependency grammars
• word-based grammars
• Dependencies in this context are labeled word-word relations
• the word that has no dependency is called the root of the sentence - in most cases the verb
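The root idea above can be illustrated with plain Python instead of a real parser. This is only a sketch: the sentence, the relation labels (nsubj, dobj) and the helper name find_root are assumptions made for the example, and the relations are written by hand rather than produced by a dependency parser.

```python
# Hand-written dependency relations for "John eats apples".
# Each triple is (head word, relation label, dependent word).
# The labels are illustrative assumptions, not parser output.
relations = [
    ("eats", "nsubj", "John"),
    ("eats", "dobj", "apples"),
]


def find_root(relations):
    """Return the word that heads relations but depends on no other word."""
    heads = {head for head, _, _ in relations}
    dependents = {dep for _, _, dep in relations}
    roots = heads - dependents
    if len(roots) != 1:
        raise ValueError("expected exactly one root, found %r" % sorted(roots))
    return roots.pop()


root = find_root(relations)
print(root)  # eats
```

Here the verb "eats" is the only word that never appears as a dependent, so it is reported as the root.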
Constituency Grammars
• also called phrase structure grammars
• sentence can be represented by several constituents derived from it
• represent the internal structure of sentences in terms of a hierarchically ordered structure of
their constituents
• S→NP VP where S is the sentence or clause, and it is divided into the subject, denoted by the
noun phrase (NP) and the predicate, denoted by the verb phrase (VP).
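The rule S → NP VP can be tried out with a toy constituency grammar. The sketch below assumes NLTK is installed; the grammar, its vocabulary and the example sentence are made up for illustration and are far smaller than any realistic grammar.

```python
import nltk

# Toy phrase structure grammar (an assumption for this example).
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> DT NN
VP -> VBZ NP
DT -> 'the'
NN -> 'dog' | 'ball'
VBZ -> 'chases'
""")

parser = nltk.ChartParser(grammar)
tokens = "the dog chases the ball".split()

# parse() yields every tree the grammar allows for the token sequence.
trees = list(parser.parse(tokens))
tree = trees[0]
print(tree)

# The top-level constituents of S are the subject NP and the predicate VP.
top_level = [child.label() for child in tree]
print(top_level)  # ['NP', 'VP']
```

The printed tree shows the hierarchically ordered constituents: S at the top, NP and VP below it, and the words as leaves.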
Word Order Typology
• field that specifically deals with trying to classify languages based on their syntax, structure, and functionality
• classify them according to their dominant word orders, also known as word order typology
Text Pre-processing and Wrangling
• Machine learning (ML) algorithms - usually work with input features that are numeric in nature
• Text data is highly unstructured
• Need to clean, normalize, and pre-process the initial textual data
• pre-processing
• techniques to convert raw text into well-defined sequences of linguistic components that have standard structure
• helps in cleaning and standardizing the text, which helps analytical systems, for example by increasing the accuracy of classifiers
• robust text pre-processing system - essential part of any application on NLP and text analytics
• “garbage in, garbage out” - if we do not process the text properly, we end up with unwanted and irrelevant results
• Popular text pre-processing techniques
Tokenization
Tagging
Chunking
Stemming
Lemmatization
• Basic operations
Dealing with misspelled text
Removing stop words
Handling irrelevant components - based on the problem to be solved
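As a minimal sketch of one of the basic operations listed above, stop-word removal can be written in a few lines of plain Python. The stop-word set and the function name remove_stop_words are assumptions for this example; in practice a larger curated list would normally be used.

```python
# A tiny hand-picked stop-word list (an assumption for the example).
STOP_WORDS = {"the", "is", "a", "an", "of", "and", "in", "to"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens whose lowercase form is in the stop-word set."""
    return [tok for tok in tokens if tok.lower() not in stop_words]


tokens = "The quality of the input is a key to the result".split()
filtered = remove_stop_words(tokens)
print(filtered)  # ['quality', 'input', 'key', 'result']
```

Which words count as irrelevant depends on the problem to be solved, so the list is best treated as a parameter rather than a constant.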
Text tokenization
• process of breaking down or splitting textual data into smaller meaningful
components called tokens
• tokens are independent and minimal textual components that have some definite
syntax and semantics
• a text document breaks down into sentences, and sentences into clauses, phrases, and words
• popular tokenization techniques include
• sentence tokenization and
• word tokenization
Sentence Tokenization
• process of splitting a text corpus into sentences
• act as the first level of tokens
• also known as sentence segmentation
• text corpus - text where each paragraph comprises several sentences
• look for specific delimiters between sentences,
• a period (.)
• a newline character (\n)
• a semi-colon (;)
• NLTK framework
• sent_tokenize – default tokenizer- uses PunktSentenceTokenizer class internally
• PunktSentenceTokenizer
• RegexpTokenizer
• Pre-trained sentence tokenization models
• the default tokenizer is quite intelligent
• doesn’t just use periods to delimit sentences
• also considers other punctuation and the capitalization of words
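A hedged sketch of the PunktSentenceTokenizer class mentioned above, assuming NLTK is installed. The instance here is created without any training text, so it has learned no abbreviations; it only demonstrates that sentence boundaries are found at "?" and "!" as well as at periods. The sample text is made up.

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer

# Untrained instance: no abbreviation list has been learned, so this only
# shows the basic handling of sentence-final punctuation.
tokenizer = PunktSentenceTokenizer()

text = "Is it raining? Yes! Take an umbrella with you."
sentences = tokenizer.tokenize(text)
for sentence in sentences:
    print(sentence)
```

sent_tokenize offers the same interface as a single function call, but it relies on a pre-trained model that has to be downloaded separately.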
3. RegexpTokenizer
• uses specific regular expression-based patterns to segment sentences.
• regex pattern to tokenize sentences:
SENTENCE_TOKENS_PATTERN = r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<![A-Z]\.)(?<=\.|\?|\!)\s'
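The pattern above can be plugged into NLTK's RegexpTokenizer. The sketch assumes NLTK is installed; the variable name regex_st and the sample text are chosen for illustration. The negative lookbehinds keep the tokenizer from splitting after abbreviations such as "Dr." or "p.m.".

```python
from nltk.tokenize import RegexpTokenizer

SENTENCE_TOKENS_PATTERN = (
    r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<![A-Z]\.)(?<=\.|\?|\!)\s'
)

# gaps=True: the pattern describes the separators between sentences,
# not the sentences themselves.
regex_st = RegexpTokenizer(pattern=SENTENCE_TOKENS_PATTERN, gaps=True)

text = "Dr. Smith arrived at 5 p.m. yesterday. He was tired! Was the trip long?"
sentences = regex_st.tokenize(text)
for sentence in sentences:
    print(sentence)
```

The whitespace after "Dr." and "p.m." is not treated as a boundary, so the text is split into three sentences.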
4. Pre-trained sentence tokenization models
• tokenize text of other languages - German text
• Implemented in two ways
• sent_tokenize , which is already trained
• load a pre-trained tokenization model on German text into a PunktSentenceTokenizer instance and
perform the same operation.
1. sent_tokenize
2. Load a pre-trained tokenization model on German text into a PunktSentenceTokenizer
Word Tokenization
• process of splitting or segmenting sentences into their constituent words.
• important in many processes - cleaning and normalizing text
• nltk provides interfaces for word tokenization
word_tokenize – default - instance or object of the TreebankWordTokenizer class
TreebankWordTokenizer
RegexpTokenizer
Inherited tokenizers from RegexpTokenizer
2. TreebankWordTokenizer
• based on the Penn Treebank
• uses various regular expressions to tokenize the text
• one primary assumption - sentence tokenization performed beforehand
• output - similar to word_tokenize() - both use the same tokenizing mechanism
• Penn Treebank - www.cis.upenn.edu/~treebank/tokenizer.sed
• main features
• Splits and separates out periods that appear at the end of a sentence
• Splits and separates commas and single quotes when followed by whitespaces
• Most punctuation characters are split and separated into independent tokens
• Splits words with standard contractions, for example don’t is split into do and n’t
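The features listed above can be observed on a single sentence. This sketch assumes NLTK is installed and that sentence tokenization has already been done, so the input is one sentence; the sentence itself is made up.

```python
from nltk.tokenize import TreebankWordTokenizer

treebank_wt = TreebankWordTokenizer()

# One already-segmented sentence, written with a plain ASCII apostrophe.
sentence = "I don't like green eggs, Sam."
tokens = treebank_wt.tokenize(sentence)
print(tokens)
# ['I', 'do', "n't", 'like', 'green', 'eggs', ',', 'Sam', '.']
```

The contraction is split into do and n't, the comma becomes its own token, and the sentence-final period is separated from the last word.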
3. RegexpTokenizer class
two main parameters
• regex pattern for building the tokenizer
• gaps parameter - if True, the pattern is used to find the gaps between the tokens; otherwise it is used to find the tokens themselves
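The effect of the gaps parameter is easiest to see by tokenizing the same sentence twice. The sketch assumes NLTK is installed; the two patterns and the sentence are illustrative choices, not the only reasonable ones.

```python
from nltk.tokenize import RegexpTokenizer

sentence = "The quick brown fox isn't here."

# gaps=False (the default): the pattern matches the tokens themselves.
token_pattern_wt = RegexpTokenizer(pattern=r"\w+", gaps=False)
word_tokens = token_pattern_wt.tokenize(sentence)
print(word_tokens)
# ['The', 'quick', 'brown', 'fox', 'isn', 't', 'here']

# gaps=True: the pattern matches the separators between tokens.
gap_pattern_wt = RegexpTokenizer(pattern=r"\s+", gaps=True)
gap_tokens = gap_pattern_wt.tokenize(sentence)
print(gap_tokens)
# ['The', 'quick', 'brown', 'fox', "isn't", 'here.']
```

With the token pattern, punctuation is dropped and the contraction is broken apart; with the gap pattern, everything between whitespace is kept as it is.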
4. Inherited tokenizers from RegexpTokenizer
• WordPunctTokenizer - uses the pattern r'\w+|[^\w\s]+'
• WhitespaceTokenizer - based on whitespaces - tabs, newlines, and spaces.
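A short comparison of the two inherited tokenizers, assuming NLTK is installed. The sample strings are made up; the second one deliberately mixes a space, a tab and a newline.

```python
from nltk.tokenize import WhitespaceTokenizer, WordPunctTokenizer

# WordPunctTokenizer separates runs of word characters from runs of punctuation.
wordpunct_tokens = WordPunctTokenizer().tokenize("Can't wait, really!")
print(wordpunct_tokens)
# ['Can', "'", 't', 'wait', ',', 'really', '!']

# WhitespaceTokenizer splits only on spaces, tabs and newlines.
whitespace_tokens = WhitespaceTokenizer().tokenize("Can't wait,\treally!\nYes")
print(whitespace_tokens)
# ["Can't", 'wait,', 'really!', 'Yes']
```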
Stemming
• Word stem - base form of a word
• Stemming
• Generating base form of a word from its inflected form
• reverse of inflection
• helps in standardizing words
• Used in applications like classifying or clustering text, and information retrieval
• Used by search engines - to give better and more accurate results
• Morphemes
• the smallest independent unit in any natural language
• consist of units that are stems and affixes - prefixes, suffixes - change
meaning or create a new word
• Inflection
• creating new words by attaching affixes
• NLTK package has several implementations for stemmers
• stemmer can be chosen based on the problem and after trial and
error
• stemmers are implemented in the nltk.stem module; each one inherits the StemmerI interface defined in the nltk.stem.api module
• a user can create a custom stemmer by using StemmerI as the base class
• this allows building a stemmer with user-defined rules
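A minimal sketch of a user-defined stemmer built on the StemmerI base class, assuming NLTK is installed. The class name SuffixStemmer, its suffix list and the minimum stem length are hypothetical choices for illustration, not rules taken from any established algorithm.

```python
from nltk.stem.api import StemmerI


class SuffixStemmer(StemmerI):
    """Hypothetical rule-based stemmer: strip the first matching suffix."""

    def __init__(self, suffixes=("ing", "ed", "s"), min_stem_length=3):
        self.suffixes = suffixes
        self.min_stem_length = min_stem_length

    def stem(self, token):
        for suffix in self.suffixes:
            stem_length = len(token) - len(suffix)
            # Only strip when enough of the word is left to act as a stem.
            if token.endswith(suffix) and stem_length >= self.min_stem_length:
                return token[:stem_length]
        return token


stemmer = SuffixStemmer()
stems = [stemmer.stem(word) for word in ["jumping", "jumped", "jumps", "is"]]
print(stems)  # ['jump', 'jump', 'jump', 'is']
```

Rules this simple break down quickly on real text, which is why the stemmer is usually chosen after trial and error on the problem at hand.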
1. Porter stemmer
• popular stemmer
• five different phases for reduction of
inflections - each phase has its own set of
rules.
• There also exists a Porter2 algorithm - with
some improvements
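A short usage sketch of the Porter stemmer in NLTK, assuming the package is installed. The word list is an illustrative choice.

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()

words = ["jumping", "jumps", "jumped", "strange"]
porter_stems = [ps.stem(word) for word in words]
print(porter_stems)  # ['jump', 'jump', 'jump', 'strang']
```

The last result shows that a stem does not have to be a dictionary word: the rules only strip and rewrite affixes.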
2. Lancaster stemmer
• based on the Lancaster stemming algorithm, also known as the Paice/Husk
stemmer
• is an iterative stemmer
• has over 120 rules specifying specific removal or replacement for affixes to
obtain the word stems
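The Lancaster stemmer is used in the same way as the other NLTK stemmers. The sketch assumes NLTK is installed; the word list is made up for the example.

```python
from nltk.stem import LancasterStemmer

ls = LancasterStemmer()

words = ["jumping", "jumps", "jumped"]
lancaster_stems = [ls.stem(word) for word in words]
print(lancaster_stems)  # ['jump', 'jump', 'jump']
```

Because the algorithm is iterative, a word may pass through several rules before no further rule applies.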
3. RegexpStemmer
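RegexpStemmer removes whatever a user-supplied regular expression matches. A minimal sketch, assuming NLTK is installed; the pattern and the min value are illustrative choices, and min sets the shortest word length that will be stemmed at all.

```python
from nltk.stem import RegexpStemmer

# Strip a trailing "ing", "s" or "ed" from words of at least 4 characters.
rs = RegexpStemmer("ing$|s$|ed$", min=4)

words = ["jumping", "jumps", "jumped", "lying", "is"]
regexp_stems = [rs.stem(word) for word in words]
print(regexp_stems)  # ['jump', 'jump', 'jump', 'ly', 'is']
```

The result for "lying" shows the limits of a purely pattern-based approach: the suffix is removed even though what remains is not a useful stem.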
TF-IDF model - two ways
1. TfidfTransformer()
2. TfidfVectorizer()
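The two classes named above come from scikit-learn. The sketch below assumes scikit-learn and NumPy are installed and uses default parameters throughout; the three-document corpus is made up for the example. It checks that both routes give the same TF-IDF matrix.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

corpus = [
    "the sky is blue",
    "the sun is bright",
    "the sun in the sky is bright",
]

# Way 1: build raw term counts first, then re-weight them.
counts = CountVectorizer().fit_transform(corpus)
tfidf_from_counts = TfidfTransformer().fit_transform(counts)

# Way 2: go from raw text to TF-IDF weights in one step.
tfidf_direct = TfidfVectorizer().fit_transform(corpus)

same = np.allclose(tfidf_from_counts.toarray(), tfidf_direct.toarray())
print(tfidf_direct.shape)
print(same)
```

The first route is convenient when a Bag of Words count matrix already exists; the second is shorter when starting from raw text.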
Generic Vectorizer
Internal Working