NLP Textbook Star Edu (nlp notes)
1. Introduction to Natural Language Processing

A natural language is a language spoken by people, e.g. English, Hindi, Marathi, as opposed to artificial or programming languages like C, C++, Java, etc. A natural language or ordinary language is any language that has evolved naturally in humans through use and repetition, without conscious planning or premeditation. Natural languages can take different forms, such as written text, speech or signing. They are distinguished from constructed and formal languages such as those used to program computers or to study logic.

Natural language processing (NLP) is a branch of artificial intelligence that helps computers understand, interpret and manipulate human language. NLP draws from many disciplines, including computer science and computational linguistics, in its pursuit to fill the gap between human communication and computer understanding.

Natural Language Processing (NLP) is a field of research and application that determines the way computers can be used to understand and manage natural language text or speech to do useful things. The term "natural" in this context serves to distinguish human languages (such as Gujarati, English, Spanish and French) from computer languages (such as C, C++, Java and Prolog). The definition of Natural Language Processing clarifies that it is a theoretically motivated range of computational techniques (multiple methods or techniques for language analysis) for analyzing and representing naturally occurring text (such as English or Gujarati) at one or more levels of linguistic analysis, for the purpose of achieving human-like language processing for a range of tasks or applications.

2. Need for Natural Language Processing

Significant growth in the volume and variety of data is due to the amount of unstructured text data; in fact, up to 80% of all your data is unstructured text data. Companies collect huge amounts of documents, emails, social media posts, and other text-based information to get to know their customers better and to offer services or market their products. Most of this data is unused and untouched. Text analytics, through the use of natural language processing, can unlock the business value within these vast data stores so that businesses can fully utilize them.

Consider the example given in Figure 1.

[Figure 1: Sample text in "natural" form: a passage of scrambled pseudo-words that is unreadable to a human viewer]

Computers "see" text in English the same way you have just seen the scrambled text in Figure 1. Normally, people have no trouble understanding natural language, as they have common-sense knowledge, reasoning capacity, and experience. Unless we teach computers to do the same, they will not understand any natural language.

3. Goals of Natural Language Processing

1. The ultimate goal of natural language processing is for computers to achieve human-like comprehension of texts/languages. When this is achieved, computer systems will be able to understand, draw inferences from, summarize, translate and generate accurate and natural human text and language.
2. The goal of natural language processing is to specify a language comprehension and production theory to such a level of detail that a person is able to write a computer program which can understand and produce natural language.
3. The basic goal of NLP is to accomplish human-like language processing. The choice of the word "processing" is very deliberate and should not be replaced with "understanding".
For although the field of NLP was originally referred to as Natural Language Understanding (NLU), that goal has not yet been accomplished. A full NLU system would be able to:
- Paraphrase an input text.
- Translate the text into another language.
- Answer questions about the contents of the text.
- Draw inferences from the text.

4. Brief overview of NLP

The field of study that focuses on the interactions between human language and computers is called Natural Language Processing, or NLP for short. It sits at the intersection of computer science, artificial intelligence, and computational linguistics.

The essence of Natural Language Processing lies in making computers understand natural language. That's not an easy task, though. Computers can understand the structured form of data like spreadsheets and the tables in a database, but human languages, texts, and voices form an unstructured category of data, and it is difficult for a computer to understand it; there arises the need for Natural Language Processing.

There's a lot of natural language data out there in various forms, and it would be very convenient if computers could understand and process that data. We can train models in accordance with expected output in different ways. Humans have been writing for thousands of years; there are a lot of literature pieces available, and it would be great if we made computers understand those. But the task is never going to be easy. There are various challenges, like understanding the correct meaning of a sentence, correct Named-Entity Recognition (NER), correct prediction of various parts of speech, and coreference resolution (arguably the most challenging of all).

Computers can't truly understand human language. If we feed enough data and train a model properly, it can distinguish and try categorizing various parts of speech (noun, verb, adjective, etc.) based on previously fed data and experience. If it encounters a new word, it makes the nearest guess, which can be embarrassingly wrong a few times.

It's very difficult for a computer to extract the exact meaning from a sentence. For example, "The boy radiated fire-like vibes": did he actually radiate fire? As you can see, extracting the intended meaning is going to be complicated.

[Figure 2: NLP in the computer science taxonomy: NLP draws on databases, algorithms, networking and robotics, and feeds applications such as search/information retrieval (using ontology), machine translation, text categorization, and extractive summarization]

5. History of NLP

NLP began in the 1950s as the intersection of artificial intelligence and linguistics. NLP was originally distinct from text information retrieval (IR), which employs highly scalable statistics-based techniques to index and search large volumes of text efficiently; Manning et al. provide an excellent introduction to IR. With time, however, NLP and IR have converged somewhat. Currently, NLP borrows from several very diverse fields, requiring today's NLP researchers and developers to broaden their mental knowledge base significantly.

Early simplistic approaches, for example word-for-word Russian-to-English machine translation, were defeated by homographs: identically spelled words with multiple meanings.
7. Levels of NLP

Natural Language Processing works on multiple levels and, most often, these different areas synergize well with each other. NLP can broadly be divided into various levels, as shown in the figure.

Phonology: It deals with the interpretation of speech sounds within and across words.

Morphology: It is a study of the way words are built up from smaller meaning-bearing units called morphemes. For example, the word 'fox' has a single morpheme, while the word 'cats' has two morphemes: the morpheme 'cat' and the morpheme '-s', which distinguishes the singular and plural concepts. A morphological lexicon is the list of stems and affixes together with basic information, such as whether the stem is a Noun stem or a Verb stem [21]. The detailed analysis of this component is discussed in chapter 4.

Syntax: It is a study of the formal relationships between words: how words are clustered into classes, how they combine into phrases and sentences, and what structural role each word plays.

Semantics: It is a study of the meaning of words and of how these meanings combine to form sentence-level meaning.

Discourse: It works with units of text longer than a sentence. Within the discourse context, one task at this level is resolution: replacing words such as pronouns with the entities they refer to. Discourse structure recognition determines the function of sentences in the text, which adds to the meaningful representation of the text.

Reasoning: To produce an answer to a question which is not explicitly stored in a database, a Natural Language Interface to Database (NLIDB) carries out reasoning based on data stored in the database. For example, consider a database that holds academic information about students, and a user posing a query such as: 'Which student is likely to fail in the Maths subject?' To answer the query, the NLIDB needs a domain expert to narrow down the reasoning process.

8. Knowledge in Language Processing

A natural language understanding system must have knowledge about what the words mean, how words combine to form sentences, how word meanings combine to form sentence meanings, and so on. The different forms of knowledge required for natural language understanding are given below.

PHONETIC AND PHONOLOGICAL KNOWLEDGE
Phonetics is the study of language at the level of sounds, while phonology is the study of the combination of sounds into organized units of speech, the formation of syllables and larger units. Phonetic and phonological knowledge are essential for speech-based systems, as they deal with how words are related to the sounds that realize them.

MORPHOLOGICAL KNOWLEDGE
Morphology concerns word formation. It is a study of the patterns of formation of words by the combination of sounds into minimal distinctive units of meaning called morphemes. Morphological knowledge concerns how words are constructed from morphemes.

SYNTACTIC KNOWLEDGE
Syntax is the level at which we study how words combine to form phrases, phrases combine to form clauses, and clauses join to make sentences. Syntactic analysis concerns sentence formation. It deals with how words can be put together to form correct sentences. It also determines what structural role each word plays in the sentence and what phrases are subparts of what other phrases.

SEMANTIC KNOWLEDGE
It concerns the meanings of words and sentences: the context-independent meaning, that is, the meaning a sentence has regardless of the context in which it is used. Defining the meaning of a sentence is part of this level.

PRAGMATIC KNOWLEDGE
It concerns how sentences are used in different situations and how use affects the interpretation of a sentence.

[Figure: stages of analysis mapping a user request through syntactic, semantic and pragmatic processing onto an executable command such as lpr /ali/stuff.init]

DISCOURSE KNOWLEDGE
This level of linguistic processing deals with the analysis of structure and meaning beyond a single sentence, making connections between words and sentences.
Anaphora resolution is achieved by identifying the entity referenced by an anaphor (most commonly in the form of, but not limited to, a pronoun). An example is shown in Figure 5, with a sentence of the form "... voted for Obama because he was most ...": the pronoun "he" must be resolved to its antecedent.

[Figure 5: Anaphora resolution illustration]

11. Applications of NLP

Applications of NLP include information extraction, machine learning systems, question answering systems, dialogue systems, email routing, telephone banking, speech management, multilingual query processing, and natural language interfaces to database systems. Current interactive applications may be classified into the following categories.

Speech Recognition / Speech Understanding and Synthesis / Speech Generation: A speech understanding system attempts to perform semantic and pragmatic processing of a spoken utterance to understand what the user is saying and act on what is being said. The research areas in this category include linguistic analysis and the design and development of efficient and effective algorithms for speech recognition and synthesis.

Language Translator: It is the task of automatically converting one natural language into another, preserving the meaning of the input text and producing an equivalent text in the output language. The research area in this category includes language modelling.

Information Retrieval (IR): It is a scientific discipline that deals with the analysis, design and implementation of computerized systems that address the representation, organization of, and access to large amounts of heterogeneous information encoded in digital format. The search engine is the well-known application of IR: it accepts a query from the user and returns the relevant documents. It returns documents, not the relevant answers; users are left to extract answers from the returned documents. The research areas in IR include information searching, information extraction, information categorization and information summarization from unstructured information.

Information Extraction: It includes the extraction of structured information from unstructured text. It is an activity of filling a predefined template from natural language text. The research areas in this category include identifying named entities, resolving anaphora, and identifying relationships between entities.

Question Answering (QA): It is passage retrieval in a specific domain: finding answers for a given question from a large collection of documents.

Natural Language Interface to Database (NLIDB): It is the process of querying a database by asking questions in natural language. It determines the grammar and style of the sentence and, based on that, gives the answer.

Dialogue Systems: The study of dialog between humans and computers. The research areas in this category include the design of conversational agents, human-robot dialog, and the analysis of human-human dialog.

Discourse Management / Story Understanding / Text Generation: The task of identifying the discourse structure is to identify the nature of the discourse relationships between sentences, such as elaboration, explanation and contrast, and also the speech acts in a chunk of text (for example, yes-no question, statement and assertion).

Expected Questions
1. What is Natural Language Processing (NLP)? Discuss the various stages involved in the NLP process with suitable examples.
2. What is Natural Language Understanding? Discuss the various levels of analysis under it with examples.
3. What do you mean by ambiguity in natural language? Explain with suitable examples. Discuss various ways to resolve ambiguity in NL.
4. What do you mean by lexical ambiguity and syntactic ambiguity in natural language? What are the different ways to resolve these ambiguities?
5. List various applications of NLP and discuss any two applications in detail.

Morphology Analysis

1. What are words?

Words are the fundamental building block of language. Every human language, spoken, signed, or written, is composed of words. Every area of speech and language processing, from speech recognition to machine translation to information retrieval on the web, requires extensive knowledge about words. Psycholinguistic models of language processing and models from generative linguistics are also heavily based on lexical knowledge.

Words are orthographic tokens separated by white space. In some languages the distinction between words and sentences is less clear. Chinese and Japanese use no white space between words, so a run of characters can segment several ways:

nowhitespace -> no white space / no whites pace / now hit esp ace

In Turkish, a single word can represent a complete "sentence":
E.g. uygarlastiramadiklarimizdanmissinizcasina: "(behaving) as if you are among those whom we could not civilize"

Morphology is the study of the structure and formation of words. Its most important unit is the morpheme, which is defined as the "minimal unit of meaning". Consider a word like "unhappiness". It has three morphemes:

Prefix: un-
Stem: happy
Suffix: -ness

Bound Morphemes: These are lexical items incorporated into a word as a dependent part. They cannot stand alone, but must be connected to other morphemes. Bound morphemes operate in the connection processes by means of derivation, inflection, and compounding. Free morphemes, on the other hand, are autonomous, can occur on their own, and are thus also words at the same time. Technically, bound morphemes and free morphemes are said to differ in terms of their distribution or freedom of occurrence. As a rule, lexemes consist of at least one free morpheme.

Morphology handles the formation of words by using morphemes:
- base form (stem, lemma), e.g. believe
- affixes (suffixes, prefixes, infixes), e.g. un-, -able, -ly

Morphological parsing is the task of recognizing the morphemes inside a word, e.g. hands, foxes, children. It is important for many tasks like machine translation and information retrieval, and useful in parsing, text simplification, etc.

Survey of English Morphology

Morphology is the study of the way words are built up from smaller meaning-bearing units, morphemes. A morpheme is often defined as the minimal meaning-bearing unit in a language. So for example the word fox consists of a single morpheme (the morpheme fox) while the word cats consists of two: the morpheme cat and the morpheme -s. As this example suggests, it is often useful to distinguish two broad classes of morphemes: stems and affixes. The exact details of the distinction vary from language to language, but intuitively, the stem is the 'main' morpheme of the word, supplying the main meaning, while the affixes add 'additional' meanings of various kinds.

Affixes are divided into prefixes, suffixes, infixes, and circumfixes. Prefixes precede the stem, suffixes follow it, circumfixes do both, and infixes are inserted inside the stem. The word eats is composed of a stem eat and the suffix -s; unbuckle is composed of a stem buckle and the prefix un-. English doesn't have good examples of circumfixes or infixes; in Tagalog, for example, the affix um can be infixed into the stem hingi 'borrow' to produce humingi.

Prefixes and suffixes are often called concatenative morphology, since a word is composed of a number of morphemes concatenated together.
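As a rough illustration of concatenative parsing, here is a toy Python sketch that segments a word by stripping one optional prefix and one optional suffix. The tiny PREFIXES, SUFFIXES and STEMS lists and the segment() helper are hypothetical, for illustration only; real morphological parsers use full lexica, morphotactic models and spelling rules, as discussed later in this chapter.

# Toy morpheme segmentation by affix stripping (illustrative only).
PREFIXES = ["un", "dis", "re"]
SUFFIXES = ["ness", "ly", "ing", "ed", "s"]
STEMS = {"happy", "happi", "hand", "walk", "like"}

def segment(word):
    # Try every (prefix, suffix) combination, including the empty affix,
    # and accept the first split whose middle part is a known stem.
    for pre in [""] + PREFIXES:
        for suf in [""] + SUFFIXES:
            if word.startswith(pre) and word.endswith(suf):
                core = word[len(pre):len(word) - len(suf)]
                if core in STEMS:
                    return (pre, core, suf)
    return None

print(segment("unhappiness"))  # ('un', 'happi', 'ness')
print(segment("walking"))      # ('', 'walk', 'ing')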
Some languages, however, have extensive non-concatenative morphology, which such simple affix concatenation cannot describe: morphemes are combined in more complex ways. The Tagalog infixation example above is one example of non-concatenative morphology, since two morphemes (hingi and um) are intermingled. Another kind of non-concatenative morphology is called templatic morphology or root-and-pattern morphology. This is very common in Arabic, Hebrew and other Semitic languages; in Hebrew, for example, a verb is built from a consonantal root interleaved with a vowel template.

Derivation combines a word stem with a grammatical morpheme, usually resulting in a word of a different class, often with a meaning hard to predict exactly. For example, the verb computerize can take the derivational suffix -ation to produce the noun computerization.

Inflectional Morphology and Derivational Morphology

Morphemes are defined as the smallest meaning-bearing units. Morphemes can be classified in various ways. One common classification is to separate those morphemes that mark the grammatical forms of words (-s, -ed, -ing and others) from those that form new lexemes conveying new meanings, e.g. un- and -ment. The former are inflectional morphemes and form a key part of grammar; the latter are derivational morphemes and play a role in word-formation, as we have seen. The following criteria help to distinguish the two types:

* Effect: Inflectional morphemes encode grammatical categories and relations, thus marking word-forms, while derivational morphemes create new lexemes.
* Position: Derivational morphemes are closer to the stem than inflectional morphemes, cf. amendments (amend[stem] - ment[derivational] - s[inflectional]) and legalized (legal[stem] - ize[derivational] - ed[inflectional]).
* Productivity: Inflectional morphemes are highly productive, which means that they can be attached to the vast majority of the members of a given class (say, verbs, nouns or adjectives), whereas derivational morphemes tend to be more restricted with regard to their scope of application. For example, the past morpheme can in principle be attached to all verbs; suffixation by means of the adjective-forming derivational morpheme -able, however, is largely restricted to dynamic transitive verbs, which excludes formations such as *bleedable or *lieable.
* Class properties: Inflectional morphemes make up a closed and fairly stable class of items which can be listed exhaustively, while derivational morphemes are much more numerous and more open to changes in their membership.

Both inflectional and derivational morphemes are bound morphemes: they cannot occur by themselves, in isolation. Inflected words are variations of already existing lexemes; they change only the grammatical shape. Therefore many of the inflected forms are not listed in a dictionary: if you know the word surprise and look it up, you will not find a separate entry for surprise-s, which simply expresses the plural.

Derivational morphology, on the other hand, uses affixes to create new words out of already existing lexemes. Typical affixes are -ness, -ish, -ship and so on. These affixes do not change the grammatical form of a word the way inflectional affixes do, but instead often create a new meaning of the base or change the word class of the base. An example would be the word light. The plural form light-s would be considered inflectional morphology, but if we consider de-light, the prefix de- has changed the meaning of the word completely: we now do not think of light in the form of sunshine or lamps anymore, but instead about a feeling. Also, if we consider en-light(en), the affix en- has changed the word class of light from noun to verb.
INFLECTIONAL MORPHOLOGY

Inflection is a morphological process that adapts existing words so that they function effectively in sentences without changing the category of the base morphemes. Inflection can be seen as the "realization of morphosyntactic features through morphological means". But what exactly does that mean? It means that the inflectional forms of a word do not have to be listed in a dictionary, since we can guess their meaning from the root word. We know, when we see the word, what it connects to, and most times can even guess the difference from its original. For example, let us consider help-s, help-ed and help-er. According to what I have said about words listed in a dictionary, all of these variants might be inflectional morphemes; but then, on the other hand, does help-s really need an extra listing, or can we guess from the root and the suffix -s what it means? Does our natural feeling and instinct for language tell us that the suffix -s indicates the third person singular and that help-s is therefore just a variant of help (considering help as a verb and not a noun here)? Yes it does: as a native speaker one instantly knows that -s, as also the past-form indicator -ed, is a grammatical variant of the root lexeme help. Both are inflectional morphemes; the root, here being help, is what remains when we remove all affixes.

To illustrate this, consider the following two sentences:
1. I help my grandmother in her garden.
2. He is my grandmother's help.

Here our general knowledge of words and their meaning shows us that in 1., help is used as a verb and expresses us working with our grandmother in order to support her. In 2., help is a noun and stands for the person that regularly supports my grandmother. This variation of a word without actually changing its form is called a zero morpheme: it cannot only distinguish verb and noun (which makes it a derivational morpheme) but also singular and plural, which makes it an inflectional morpheme. I will talk about this later in 2.2., Inflection in nouns, though.

"We may define inflectional morphology as the branch of morphology that deals with paradigms. It is therefore concerned with two things: on the one hand, with the semantic oppositions among categories; and on the other, with the formal means, including inflections, that distinguish them." (Matthews, 1991)

Inflectional morphology, then, changes the word form and determines the grammar; it does not form a new lexeme, but rather a variant of a lexeme that does not need its own entry in the dictionary.

word stem + grammatical morphemes, e.g. cat + s

English inflection applies only to nouns, verbs, and some adjectives.

Nouns
- plural, regular: +s, +es; irregular: mouse - mice, ox - oxen
- many spelling rules, e.g. -y -> -ies, as in butterfly - butterflies
- possessive: +'s, +'

Verbs
- main verbs (sleep, eat, walk)
- modal verbs (can, will, should)
- primary verbs (be, have, do)

VERB INFLECTIONAL SUFFIXES
1. The suffix -s functions in the present simple as the third person singular marking of the verb: to work - he work-s
2. The suffix -ed functions in the past simple of regular verbs: to love - lov-ed
3. The suffixes -ed/-en function in the marking of the past participle in the perfect aspect: to study - studied - studied / to eat - ate - eaten
4. The suffix -ing functions in the marking of the present participle, the gerund, and in the marking of the continuous aspect: to eat - eating / to study - studying

NOUN INFLECTIONAL SUFFIXES
1. The suffix -s functions in the marking of the plural of nouns: dog - dogs
2. The suffix -'s functions as a possessive marker (Saxon genitive): Laura's book

ADJECTIVE INFLECTIONAL SUFFIXES
- The suffix -er functions as a comparative marker: quick - quicker
- The suffix -est functions as a superlative marker: quick - quickest

DERIVATIONAL MORPHOLOGY

Derivation is concerned with the way morphemes are connected to existing lexemes as affixes. Derivational morphology is a type of word formation that creates new lexemes, either by changing syntactic category or by adding substantial new meaning to a free or bound base. Derivation may be contrasted with inflection on the one hand and with compounding on the other. The distinctions between derivation and inflection, and between derivation and compounding, however, are not always clear-cut. New words may be derived by a variety of formal means, including affixation, reduplication, internal modification of various sorts, subtraction, and conversion. Affixation is the best attested cross-linguistically, especially prefixation and suffixation. Reduplication is also widespread, along with various internal changes like ablaut and root-and-pattern morphology. Derived words may fit into a number of semantic categories. For adjectives and nouns, agent and participant, collective and abstract noun categories are well-attested, as are relational and evaluative categories. Languages frequently also have ways of deriving negatives and relational adjectives. Most languages have derivation in some form, and the study of derivation has also been important in a number of psycholinguistic debates concerning the perception and production of language.

Derivational morphology is defined as morphology that creates new lexemes, either by changing the syntactic category (part of speech) of a base or by adding substantial, non-grammatical meaning, or both. On the one hand, derivation may be distinguished from inflectional morphology, which typically does not change category but rather modifies lexemes to fit into various syntactic contexts, expressing distinctions like number, case, tense, aspect, and person, among others. On the other hand, derivation differs from compounding, which also creates new lexemes, but by combining two or more bases rather than by affixation, reduplication, subtraction, or internal modification of various sorts. Although these distinctions are generally useful, in practice applying them is not always easy.

We can distinguish affixes of two principal types:
1. Prefixes: attached at the beginning of a lexical item or base morpheme, e.g. un-, pre-, post-, dis-, im-, etc.
2. Suffixes: attached at the end of a lexical item, e.g. -age, -ing, -ful, -able, -ness, -hood, -ly, etc.

EXAMPLES OF MORPHOLOGICAL DERIVATION
a. Lexical item (free morpheme): like (verb)
b. Lexical item + prefix (bound morpheme) dis- = dislike (verb)

Derivational affixes can cause semantic change:
- The prefix pre- means before; post- means after; un- means not (unhappy = not happy).
- The prefix de- added to a verb conveys a sense of reversal or negativity: to decompose, to defame.

Derivation Versus Inflection

The distinction between derivation and inflection is a functional one rather than a formal one, as Booij (2000, p. 360) has pointed out. Either derivation or inflection may be effected by formal means like affixation, reduplication, internal modification of bases, and other morphological processes. But derivation serves to create new lexemes, while inflection prototypically serves to modify lexemes to fit different grammatical contexts. In the clearest cases, derivation changes category, for example taking a verb like employ and making it a noun (employment, employer, employee) or an adjective (employable), or taking a noun like union and making it a verb (unionize) or an adjective (unionesque). Derivation need not change category, however.
For example, the derivation of abstract nouns from concrete ones in English (king - kingdom; child - childhood) does not change category. Inflection covers categories like number, case, tense and aspect (perfective, imperfective, habitual), categories that languages mark grammatically.

It is sometimes said that what distinguishes inflection from derivation is that inflection is invariably relevant to syntax while derivation is not. But Booij (1996) has argued that even this criterion is problematic unless we are clear what we mean by relevance to syntax. Case inflections, for example, mark grammatical context, and are therefore clearly inflectional. Number-marking on verbs is arguably inflectional when it is triggered by the number of the subject or object, but number on nouns, or tense and aspect on verbs, is a matter of semantic choice, independent of the grammatical configuration. Booij therefore distinguishes what he calls contextual inflection, inflection triggered by distinctions elsewhere in a sentence, from inherent inflection, inflection that does not depend on the syntactic context, the latter being closer to derivation than the former. Some theorists (Bybee, 1985; Dressler, 1989) postulate a continuum from derivation to inflection, with no clear dividing line between them. Another position is that of Scalise (1984), who has argued that evaluative morphology is neither inflectional nor derivational but rather constitutes a third category of morphology.

2. Stemming and Lemmatization

In natural language processing, there may come a time when you want your program to recognize that the words "ask" and "asked" are just different tenses of the same verb. This is the idea of reducing different forms of a word to a core root. Words that are derived from one another can be mapped to a central word or symbol, especially if they have the same core meaning.

Maybe this is in an information retrieval setting and you want to boost your algorithm's recall. Or perhaps you are trying to analyze word usage in a corpus and wish to condense related words so that you don't have as much variability. Either way, this technique of text normalization may be useful to you. This is where something like stemming or lemmatization comes in.

For grammatical reasons, documents are going to use different forms of a word, such as organize, organizes, and organizing. Additionally, there are families of derivationally related words with similar meanings, such as democracy, democratic, and democratization. In many situations, it seems as if it would be useful for a search for one of these words to return documents that contain another word in the set.

The goal of both stemming and lemmatization is to reduce inflectional forms, and sometimes derivationally related forms, of a word to a common base form. For instance:

am, are, is => be
car, cars, car's, cars' => car

Stemming Algorithm Examples

Porter stemmer: This stemming algorithm is an older one. It's from the 1980s and its main concern is removing the common endings of words so that they can be resolved to a common form. It's not too complex and development on it is frozen. Typically, it's a nice starting basic stemmer, but it's not really advised for any production or complex application. Instead, it has its place in research as a nice, basic stemming algorithm that can guarantee reproducibility. It is also a very gentle stemming algorithm compared to others.

Snowball stemmer: This algorithm is also known as the Porter2 stemming algorithm. It is almost universally accepted as better than the Porter stemmer, even being acknowledged as such by the individual who created the Porter stemmer. That being said, it is also more aggressive than the Porter stemmer. A lot of the things added to the Snowball stemmer were because of issues noticed with the Porter stemmer. There is about a 5% difference in the way that Snowball stems versus Porter.
Lancaster stemmer: Just for fun, the Lancaster stemming algorithm is another algorithm that you can use. This one is the most aggressive stemming algorithm of the bunch. However, if you use the stemmer in NLTK, you can add your own custom rules to this algorithm very easily; it's a good choice for that. One complaint about this stemming algorithm, though, is that it sometimes is overly aggressive and can really transform words beyond recognition.

Lemmatization usually refers to doing things properly, with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma. Lemmatization is a more calculated process: it involves resolving words to their dictionary form. In fact, a lemma of a word is its dictionary or canonical form.

Because lemmatization is more nuanced in this respect, it requires a little more to make it work. For lemmatization to resolve a word to its lemma, it needs to know its part of speech. That requires extra computational linguistics power, such as a part-of-speech tagger. This allows it to do better resolutions (like resolving is and are to be).

Another thing to note about lemmatization is that it is often harder to create a lemmatizer for a new language than a stemming algorithm, because lemmatizers require a lot more knowledge about the structure of a language; it's a much more intensive process than just setting up a heuristic stemming algorithm.

If confronted with the token saw, stemming might return just s, whereas lemmatization would attempt to return either see or saw depending on whether the use of the token was as a verb or a noun. The two may also differ in that stemming most commonly collapses derivationally related words, whereas lemmatization commonly only collapses the different inflectional forms of a lemma. Linguistic processing for stemming or lemmatization is often done by an additional plug-in component to the indexing process, and a number of such components exist, both commercial and open-source.

The most common algorithm for stemming English, and one that has repeatedly been shown to be empirically very effective, is Porter's algorithm (Porter, 1980). The entire algorithm is too long and intricate to present here, but we will indicate its general nature. Porter's algorithm consists of 5 phases of word reductions, applied sequentially. Within each phase there are various conventions to select rules, such as selecting the rule from each rule group that applies to the longest suffix. In the first phase, this convention is used with the following rule group:

Rule              Example
SSES -> SS        caresses -> caress
IES  -> I         ponies   -> poni
SS   -> SS        caress   -> caress
S    -> (drop)    cats     -> cat

Stemming increases recall while harming precision. As an example of what can go wrong, note that the Porter stemmer stems all of the following words to oper:

operate operating operates operation operative operatives operational

However, since operate in its various forms is a common verb, we would expect to lose considerable precision on queries such as the following with Porter stemming:

operational and research
operating and system
operative and dentistry

In a case like this, moving to a lemmatizer would not completely fix the problem, because particular inflectional forms are used in particular collocations: a sentence with the words operate and system is not a good match for the query operating and system. Getting better value from term normalization depends more on pragmatic issues of word use than on formal issues of linguistic morphology. The situation is different for languages with much more morphology (such as Spanish, German, Hindi and Finnish).
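To make the comparison concrete, here is a short sketch using NLTK; it assumes nltk is installed and the WordNet data has been fetched once via nltk.download("wordnet"). The word list is illustrative, and outputs in comments are indicative of typical behavior.

# Comparing the three stemmers above with WordNet lemmatization.
from nltk.stem import (LancasterStemmer, PorterStemmer,
                       SnowballStemmer, WordNetLemmatizer)

porter = PorterStemmer()
snowball = SnowballStemmer("english")
lancaster = LancasterStemmer()
lemmatizer = WordNetLemmatizer()

for w in ["caresses", "ponies", "operating", "organization"]:
    print(w, porter.stem(w), snowball.stem(w), lancaster.stem(w),
          lemmatizer.lemmatize(w))
# e.g. caresses -> caress and ponies -> poni under Porter, per the rule
# group above; the three stemmers differ in how aggressively they cut.

# Lemmatization needs the part of speech to resolve forms like "are":
print(lemmatizer.lemmatize("are", pos="v"))  # 'be'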
3. Regular Expression

A regular expression (RE) is a language for specifying text search strings or search patterns. Regular expression languages are used for searching texts in UNIX tools (vi, Perl, Emacs, grep) and in Microsoft Word. Usually, search patterns are used by string-searching algorithms for "find" or "find and replace" operations on strings, or for input validation. It is a technique developed in theoretical computer science and formal language theory.

The regular expression dialects of these tools are almost identical, and many RE features exist in the various web search engines. Besides this practical use, the regular expression is an important theoretical tool throughout computer science and linguistics. A regular expression is a formula in a special language that is used for specifying simple classes of strings. A string is a sequence of symbols; for the purpose of most text-based search techniques, a string is a sequence of alphanumeric characters (letters, numbers, spaces, and punctuation). For these purposes a space is just a character like any other, and we represent it with the symbol ␣. Formally, a regular expression is an algebraic notation for characterizing a set of strings. Thus, regular expressions can be used to specify search strings as well as to define a language in a formal way.

A regular expression is a pattern describing a certain amount of text. The name comes from the mathematical theory on which they are based, but we will not dig into that. You will usually find the name abbreviated to "regex" or "regexp". Regular expressions (regex or regexp) are extremely useful in extracting information from any text by searching for one or more matches of a specific search pattern (i.e. a specific sequence of ASCII or Unicode characters).

/\b(\w*NLP\w*)\b/g

Figure 1: matching of the string NLP in a given text on the site https://www.regextester.com/

Regular Expression Patterns

Anchors — ^ and $
^The       matches any string that starts with The
end$       matches a string that ends with end
^The end$  exact string match (starts and ends with The end)
roar       matches any string that has the text roar in it

Quantifiers — * + ? and {}
abc*    matches a string that has ab followed by zero or more c
abc+    matches a string that has ab followed by one or more c
abc?    matches a string that has ab followed by zero or one c
abc{2}  matches a string that has ab followed by 2 c

Or operator — | or []
a(b|c)  matches a string that has a followed by b or c
a[bc]   same as previous

Character classes — \d \w \s and .
\d  matches a single character that is a digit
\w  matches a word character (alphanumeric character plus underscore)
\s  matches a whitespace character (includes tabs and line breaks)
.   matches any character

\d, \w and \s also present their negations with \D, \W and \S respectively. For example, \D will perform the inverse match with respect to that obtained with \d: it matches a single non-digit character.

In order to be taken literally, you must escape the characters ^.[$()|*+?{\ with a backslash, as they have special meaning. For example, \$\d matches a string that has a $ before one digit. We can also match non-printable characters like tabs \t, new-lines \n, and carriage returns \r.
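A minimal Python sketch of the patterns above; the sample sentence is illustrative.

import re

text = "NLP tools such as regexes cost $5 to try."

# \b\w*NLP\w*\b : a whole word containing "NLP" (the Figure 1 pattern)
print(re.findall(r"\b\w*NLP\w*\b", text))  # ['NLP']

# \$\d : an escaped dollar sign followed by a single digit
print(re.search(r"\$\d", text).group())    # '$5'

# ^ and $ anchor to the start and end of the string
print(bool(re.match(r"^NLP", text)))       # True
print(bool(re.search(r"try\.$", text)))    # True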
When constructing a regex, don't forget a fundamental concept: flags. A regex usually comes within this form /abc/, where the search pattern is delimited by two slash characters /. At the end we can specify a flag with these values (we can also combine them with each other):

g (global) does not return after the first match, restarting the subsequent searches from the end of the previous match
m (multi-line) when enabled, ^ and $ will match the start and end of a line, instead of the whole string
i (insensitive) makes the whole expression case-insensitive (for example /aBc/i would match AbC)

Grouping and capturing — ()
a(bc)       parentheses create a capturing group with value bc
a(?:bc)*    using ?: we disable the capturing group
a(?<foo>bc) using ?<foo> we put a name to the group

This operator is very useful when we need to extract information from strings or data using your preferred programming language. Any multiple occurrences captured by several groups will be exposed in the form of a classical array: we will access their values by index. If we put a name to the groups (using (?<foo>...)), we will be able to retrieve the group values using the match result like a dictionary, where the keys will be the names of the groups.

Bracket expressions — []
[abc]       matches a string that has either an a or a b or a c -> is the same as a|b|c
[a-c]       same as previous, written as a range
[a-fA-F0-9] a string that represents a single hexadecimal digit, case insensitively
[0-9]%      a string that has a character from 0 to 9 before a % sign
[^a-zA-Z]   a string that has not a letter from a to z or from A to Z. In this case the ^ is used as negation of the expression

Greedy and lazy match

The quantifiers (* + {}) are greedy operators, so they expand the match as far as they can through the provided text.
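Here is a small sketch of capturing and named groups in Python; the log line and group names are illustrative. Note that Python spells named groups (?P<name>...) where other engines write (?<name>...).

import re

log = "2024-01-15 ERROR disk full"

m = re.match(r"(?P<date>\d{4}-\d{2}-\d{2}) (?P<level>\w+) (?P<msg>.+)", log)

print(m.group(1))        # '2024-01-15'  (access by index)
print(m.group("level"))  # 'ERROR'       (access by name)
print(m.groupdict())     # {'date': '2024-01-15', 'level': 'ERROR', 'msg': 'disk full'}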
For example, <.+> matches <div>simple div</div> in

This is a <div>simple div</div> test
In order to catch only the div tag we can use a ? to make the quantifier lazy: <.+?> matches any character one or more times included inside < and >, expanded only as needed, so it stops at the first closing >.

Boundaries — \b and \B

\babc\b performs a "whole words only" search. \b represents an anchor, like the caret and the dollar sign, matching the position between a word character and a non-word character; \B matches its negation.
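A brief Python sketch of the greedy/lazy distinction and the \b anchor; the sample string is illustrative.

import re

html = "This is a <div>simple div</div> test"

print(re.search(r"<.+>", html).group())   # greedy: '<div>simple div</div>'
print(re.search(r"<.+?>", html).group())  # lazy:   '<div>'

# \b limits matches to whole words: every "div" here is bounded by
# non-word characters (space, <, /, >), so all three occurrences match.
print(re.findall(r"\bdiv\b", html))       # ['div', 'div', 'div']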
4. Finite Automata

A regular expression is more than just a convenient metalanguage for text searching. First, a regular expression is one way of describing a finite-state automaton (FSA). Finite-state automata are the theoretical foundation of a good deal of the computational work we will describe in this book. Any regular expression can be implemented as a finite-state automaton (except regular expressions that use the memory feature; more on this later). Symmetrically, any finite-state automaton can be described with a regular expression. Second, a regular expression is one way of characterizing a particular kind of formal language called a regular language. Both regular expressions and finite-state automata can be used to describe regular languages. The relation among these three theoretical constructions is sketched out in Figure 2.

[Figure 2: The relationship between finite automata, regular expressions, and regular languages]

A formal language is completely determined by the 'words in the dictionary', rather than by any grammatical rules. A (formal) language L over an alphabet Σ is just a set of strings in Σ*; thus any subset L ⊆ Σ* determines a language over Σ. The language determined by a regular expression r over Σ is L(r) = {v ∈ Σ* | v matches r}. Two regular expressions r and s (over the same alphabet) are equivalent iff L(r) and L(s) are equal sets (i.e. have exactly the same members).

A finite automaton has a finite set of states with which it processes its input. Finite State Automata (FSA) can be:
- Deterministic: for each input there is one and only one state to which the automaton can transition from its current state.
- Nondeterministic: the automaton can be in several states at once.

Deterministic finite state automaton (DFA)
1. A finite set of states, often denoted Q
2. A finite set of input symbols, often denoted Σ
3. A transition function that takes as arguments a state and an input symbol and returns a state. The transition function is commonly denoted δ. If q is a state and a is a symbol, then δ(q, a) is a state p (and in the graph that represents the automaton there is an arc from q to p labeled a)
4. A start state, one of the states in Q
5. A set of final or accepting states F (F ⊆ Q)

A DFA is a tuple A = (Q, Σ, δ, q0, F).

Other notations for DFAs

Transition diagrams
* Each state is a node
* For each state q ∈ Q and each symbol a ∈ Σ, let δ(q, a) = p. Then the transition diagram has an arc from q to p, labeled a
* There is an arrow to the start state q0
* Nodes corresponding to final states are marked with a double circle

Transition tables
* Tabular representation of a function
* The rows correspond to the states and the columns to the input symbols
* The entry for the row corresponding to state q and the column corresponding to input a is the state δ(q, a)

Example: design a DFA to accept the language L = {w | w has both an even number of 0s and an even number of 1s}. Such an automaton is A = ({q0, q1, q2, q3}, {0, 1}, δ, q0, {q0}), where the transition function δ is given by its transition table.

Extended transition function

The transition function describes what happens when we start in any state and follow any sequence of inputs. If δ is our transition function, then the extended transition function is denoted by δ̂. The extended transition function takes a state q and a string w and returns a state p: the state that the automaton reaches when starting in state q and processing the sequence of inputs w.

Formal definition of the extended transition function, by induction on the length of the input string:

Basis: δ̂(q, ε) = q. If we are in a state q and read no inputs, then we are still in state q.

Induction: Suppose w is a string of the form xa; that is, a is the last symbol of w, and x is the string consisting of all but the last symbol. Then δ̂(q, w) = δ(δ̂(q, x), a).

To compute δ̂(q, w), first compute δ̂(q, x), the state that the automaton is in after processing all but the last symbol of w. Suppose this state is p, i.e., δ̂(q, x) = p. Then δ̂(q, w) is what we get by making a transition from state p on the last symbol of w.

The Language of a DFA

The language of a DFA A = (Q, Σ, δ, q0, F), denoted L(A), is defined by L(A) = {w | δ̂(q0, w) is in F}. The language of A is the set of strings w that take the start state q0 to one of the accepting states. If L is L(A) for some DFA, then L is a regular language.
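As a concrete illustration of these definitions, here is a minimal Python sketch of the even-0s/even-1s DFA; the state names and the dictionary encoding of δ are just one possible representation. q0 tracks (even 0s, even 1s) and is both the start state and the only accepting state.

def run_dfa(w):
    delta = {
        ("q0", "0"): "q1", ("q0", "1"): "q2",
        ("q1", "0"): "q0", ("q1", "1"): "q3",
        ("q2", "0"): "q3", ("q2", "1"): "q0",
        ("q3", "0"): "q2", ("q3", "1"): "q1",
    }
    state = "q0"
    for symbol in w:                    # this loop computes the extended
        state = delta[(state, symbol)]  # transition function, one symbol at a time
    return state == "q0"                # accept iff the final state is in F

print(run_dfa("0101"))  # True  (two 0s and two 1s)
print(run_dfa("010"))   # False (two 0s but only one 1)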
Nondeterministic Finite Automata (NFA)

An NFA has the power to be in several states at once. This ability is often expressed as an ability to "guess" something about its input. Each NFA accepts a language that is also accepted by some DFA. NFAs are often more succinct and easier to design than DFAs. We can always convert an NFA to a DFA, but the latter may have exponentially more states than the NFA (a rare case). The difference between the DFA and the NFA is the type of the transition function δ: for an NFA, δ is a function that takes a state and an input symbol as arguments (like the DFA transition function), but returns a set of zero or more states (rather than returning exactly one state, as the DFA must).

Example: an NFA accepting strings that end in 01 is a nondeterministic automaton that accepts all and only the strings of 0s and 1s that end in 01.

NFA: Formal definition

A nondeterministic finite automaton (NFA) is a tuple A = (Q, Σ, δ, q0, F) where:
1. Q is a finite set of states
2. Σ is a finite set of input symbols
3. q0 ∈ Q is the start state
4. F (F ⊆ Q) is the set of final or accepting states
5. δ, the transition function, is a function that takes a state in Q and an input symbol in Σ as arguments and returns a subset of Q

The only difference between an NFA and a DFA is in the type of value that δ returns. For the strings-ending-in-01 example, the NFA is ({q0, q1, q2}, {0, 1}, δ, q0, {q2}), where the transition function δ is given by its transition table.

Extended transition function

Basis: δ̂(q, ε) = {q}. Without reading any input symbols, we are only in the state we began in.

Induction: Suppose w is a string of the form xa; that is, a is the last symbol of w and x is the string consisting of all but the last symbol. Also suppose that δ̂(q, x) = {p1, p2, ..., pk}. Let

⋃ (i = 1..k) δ(pi, a) = {r1, r2, ..., rm}

Then δ̂(q, w) = {r1, r2, ..., rm}. We compute δ̂(q, w) by first computing δ̂(q, x) and by then following any transition from any of these states that is labeled a.

Example: the NFA accepting strings that end in 01, processing w = 00101:
1. δ̂(q0, ε) = {q0}
2. δ̂(q0, 0) = δ(q0, 0) = {q0, q1}
3. δ̂(q0, 00) = δ(q0, 0) ∪ δ(q1, 0) = {q0, q1} ∪ ∅ = {q0, q1}
4. δ̂(q0, 001) = δ(q0, 1) ∪ δ(q1, 1) = {q0} ∪ {q2} = {q0, q2}
5. δ̂(q0, 0010) = δ(q0, 0) ∪ δ(q2, 0) = {q0, q1} ∪ ∅ = {q0, q1}
6. δ̂(q0, 00101) = δ(q0, 1) ∪ δ(q1, 1) = {q0} ∪ {q2} = {q0, q2}

The Language of an NFA

The language of an NFA A = (Q, Σ, δ, q0, F), denoted L(A), is defined by L(A) = {w | δ̂(q0, w) ∩ F ≠ ∅}. The language of A is the set of strings w ∈ Σ* such that δ̂(q0, w) contains at least one accepting state. The fact that choosing to follow some sequences of transitions using the input symbols of w leads to a non-accepting state, or to no state at all, does not prevent w from being accepted by the NFA as a whole.

Equivalence of Deterministic and Nondeterministic Finite Automata

Every language that can be described by some NFA can also be described by some DFA. The DFA in practice has about as many states as the NFA, although it often has more transitions. In the worst case, the smallest equivalent DFA can have 2^n states (for a smallest NFA with n states).
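The set-valued δ translates directly into code: we track the whole set of active states, exactly mirroring the δ̂ computation traced above. A minimal sketch of the "ends in 01" NFA, with missing table entries standing for the empty set:

def run_nfa(w):
    delta = {
        ("q0", "0"): {"q0", "q1"},  # guess: maybe this 0 starts the final 01
        ("q0", "1"): {"q0"},
        ("q1", "1"): {"q2"},        # q2 is the accepting state
    }
    states = {"q0"}                  # the NFA is in a set of states at once
    for symbol in w:
        states = set().union(*(delta.get((q, symbol), set()) for q in states))
    return "q2" in states            # accept iff the set intersects F

print(run_nfa("00101"))  # True
print(run_nfa("0010"))   # False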
5. Finite-State Morphological Parsing

Consider a simple example: parsing just the productive nominal plural (-s) and the verbal progressive (-ing). Our goal will be to take input forms like those in the first column below and produce output forms like those in the second column.

Input     Morphologically Parsed Output
cats      cat +N +PL
cat       cat +N +SG
cities    city +N +PL
geese     goose +N +PL
goose     (goose +N +SG) or (goose +V)
gooses    goose +V +3SG
merging   merge +V +PRES-PART
caught    (catch +V +PAST-PART) or (catch +V +PAST)

The second column contains the stem of each word as well as assorted morphological features. These features specify additional information about the stem. For example the feature +N means that the word is a noun; +SG means it is singular, +PL that it is plural. We consider +SG to be a primitive unit that means 'singular'. Note that some of the input forms (like caught or goose) will be ambiguous between different morphological parses.

In order to build a morphological parser, we'll need at least the following:
1. a lexicon: the list of stems and affixes, together with basic information about them (whether a stem is a Noun stem or a Verb stem, etc.)
2. morphotactics: the model of morpheme ordering that explains which classes of morphemes can follow other classes of morphemes inside a word. For example, the rule that the English plural morpheme follows the noun rather than preceding it.
3. orthographic rules: these spelling rules are used to model the changes that occur in a word, usually when two morphemes combine (for example the y -> ie spelling rule discussed above that changes city + -s to cities rather than citys).

6. Building a Finite-State Lexicon

A lexicon is a repository for words. The simplest possible lexicon would consist of an explicit list of every word of the language (every word, i.e., including abbreviations ('AAA') and proper names ('Jane' or 'Beijing')), as follows:

a, AAA, AA, Aachen, aardvark, aardwolf, aba, abaca, aback, ...

Since it will often be inconvenient or impossible, for the various reasons we discussed above, to list every word in the language, computational lexicons are usually structured with a list of each of the stems and affixes of the language together with a representation of the morphotactics that tells us how they can fit together. There are many ways to model morphotactics; one of the most common is the finite-state automaton. A simple finite-state model for English nominal inflection might look like Figure 3.

[Figure 3: A finite-state automaton for English nominal inflection, with arcs for reg-noun, irreg-sg-noun, irreg-pl-noun and the plural -s]

The FSA in Figure 3 assumes that the lexicon includes regular nouns (reg-noun) that take the regular -s plural (e.g. cat, dog, fox, aardvark). These are the vast majority of English nouns, since for now we will ignore the fact that the plural of words like fox has an inserted e: foxes. The lexicon also includes irregular noun forms that don't take -s: both singular irreg-sg-noun (goose, mouse) and plural irreg-pl-noun (geese, mice).

[Figure 4: A finite-state automaton for English verbal inflection, with stem classes reg-verb-stem, irreg-verb-stem and irreg-past-verb-form, plus affix arcs for the preterite (-ed), past participle (-ed), progressive (-ing) and 3rd singular (-s)]

This lexicon has three stem classes (reg-verb-stem, irreg-verb-stem, and irreg-past-verb-form), plus four affix classes (-ed past, -ed participle, -ing participle, and 3rd singular -s). Examples from the accompanying table include irregular stems like cut, speak and sing, and irregular past forms like caught, ate, sang, eaten and spoken.

Some models of English derivation, in fact, are based on more complex context-free grammars. As a preliminary example, though, of the kind of analysis derivation would require, consider a small part of the morphotactics of English adjectives, taken from Antworth, who offers the following data:

big, bigger, biggest
cool, cooler, coolest, coolly
red, redder, reddest
clear, clearer, clearest, clearly, unclear, unclearly
happy, happier, happiest, happily
unhappy, unhappier, unhappiest, unhappily
real, unreal, really

An initial hypothesis might be that adjectives can have an optional prefix (un-), an obligatory root (big, cool, etc.) and an optional suffix (-er, -est, or -ly). This might suggest the FSA in Figure 5 (states q0, q1, and q2, with arcs un-, adj-root, and -er/-est/-ly). Alas, while this FSA will recognize all the adjectives in the table above, it will also recognize ungrammatical forms like unbig, redly, and realest. We need to set up classes of roots and specify which can occur with which suffixes. So adj-root1 would include adjectives that can occur with un- and -ly (clear, happy, and real) while adj-root2 would include adjectives that can't (big, cool, and red). Antworth (1990) presents Figure 6 as a partial solution to these problems. This gives an idea of the complexity to be expected from English derivation. Similar FSAs can be built for derivational suffixes such as -ity.

Figure 8 shows the noun-recognition FSA produced by expanding the nominal inflection FSA of Figure 3 with sample regular and irregular nouns for each class. We can use the FSA of Figure 8 to recognize strings like aardvarks by simply starting at the initial state and comparing the input letter by letter with each word on each outgoing arc.
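A toy sketch of using such a lexicon FSA for recognition, assuming the small word lists above (a real lexicon would be far larger). As the text notes, the toy model deliberately ignores the e-insertion spelling rule, so it overaccepts forms like foxs.

reg_noun = {"cat", "dog", "fox", "aardvark"}
irreg_sg_noun = {"goose", "mouse"}
irreg_pl_noun = {"geese", "mice"}

def accepts_noun(word):
    # Irregular forms are listed directly in the lexicon.
    if word in irreg_sg_noun or word in irreg_pl_noun:
        return True
    # Regular singular: a reg-noun arc straight to the final state.
    if word in reg_noun:
        return True
    # Regular plural: the reg-noun arc followed by the plural -s arc.
    if word.endswith("s") and word[:-1] in reg_noun:
        return True
    return False

for w in ["aardvarks", "geese", "gooses", "foxs"]:
    print(w, accepts_noun(w))
# aardvarks True, geese True, gooses False, and foxs True (!) because
# the fox + es spelling change is handled by orthographic rules later.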
Finite-State Transducers

We've now seen that FSAs can represent the morphotactic structure of a lexicon and can be used for word recognition. A transducer maps between one representation and another; a finite-state transducer or FST is a type of finite automaton which maps between two sets of symbols. We can visualize an FST as a two-tape automaton which recognizes or generates pairs of strings. Intuitively, we can do this by labeling each arc in the finite-state machine with two symbols, one from each tape. The FST thus has a more general function than an FSA: where an FSA defines a formal language by defining a set of strings, an FST defines a relation between sets of strings. Another way of looking at an FST is as a machine that reads one string and generates another. Here's a summary of these ways of thinking about transducers:

- FST as recognizer: a transducer that takes a pair of strings as input and outputs accept if the string-pair is in the string-pair language, and reject if it is not.
- FST as generator: a machine that outputs pairs of strings of the language. Thus the output is a yes or no, and a pair of output strings.
- FST as translator: a machine that reads a string and outputs another string.
- FST as set relater: a machine that computes relations between sets.

All of these have applications in speech and language processing. For morphological parsing (and for many other NLP applications), we will apply the FST-as-translator metaphor, taking as input a string of letters and producing as output a string of morphemes.

An FST can be formally defined in a number of ways; we will rely on the following definition, based on what is called the Mealy machine extension to a simple FSA:

- Q: a finite set of states q0, q1, ..., qn−1
- Σ: a finite set corresponding to the input alphabet
- Δ: a finite set corresponding to the output alphabet
- q0 ∈ Q: the start state
- F ⊆ Q: the set of final states
- δ(q, w): the transition function or transition matrix between states. Given a state q ∈ Q and a string w ∈ Σ*, δ(q, w) returns a set of new states Q' ⊆ Q. δ is thus a function from Q × Σ* to 2^Q (because there are 2^Q possible subsets of Q). δ returns a set of states rather than a single state because a given input may be ambiguous as to which state it maps to.
- σ(q, w): the output function giving the set of possible output strings for each state and input. Given a state q ∈ Q and a string w ∈ Σ*, σ(q, w) gives a set of output strings, each a string o ∈ Δ*. σ is thus a function from Q × Σ* to 2^(Δ*).

Where FSAs are isomorphic to regular languages, FSTs are isomorphic to regular relations. Regular relations are sets of pairs of strings, a natural extension of the regular languages, which are sets of strings. Like FSAs and regular languages, FSTs and regular relations are closed under union, although in general they are not closed under difference, complementation and intersection (although some useful subclasses of FSTs are closed under these operations; in general, FSTs that are not augmented with the ε are more likely to have such closure properties). Besides union, FSTs have two additional closure properties that turn out to be extremely useful:

- Inversion: The inversion of a transducer T (T⁻¹) simply switches the input and output labels. Thus, if T maps from the input alphabet I to the output alphabet O, T⁻¹ maps from O to I.
- Composition: If T1 is a transducer from I1 to O1 and T2 a transducer from O1 to O2, then T1 ∘ T2 maps from I1 to O2.

Inversion is useful because it makes it easy to convert an FST-as-parser into an FST-as-generator. Composition is useful because it allows us to take two transducers that run in series and replace them with one more complex transducer. Composition works as in algebra: applying T1 ∘ T2 to an input sequence S is identical to applying T1 to S and then T2 to the result; thus T1 ∘ T2(S) = T2(T1(S)).

Figure 10, for example, shows the composition of [a:b]+ with [b:c]+ to produce [a:c]+.

[Figure 10: The composition of [a:b]+ with [b:c]+]

The projection of an FST is the FSA that is produced by extracting only one side of the relation.
We can refer to the projection to the left or upper side of the relation as the upper or first projection, and the projection to the lower or right side of the relation as the lower or second projection.

Morphological Parsing with Finite-State Transducers

Let's now turn to the task of morphological parsing. Given the input cats, for instance, we'd like to output cat +N +PL, telling us that cat is a plural noun. In the finite-state morphology paradigm, we represent a word as a correspondence between a lexical level, which represents a concatenation of the morphemes making up the word, and the surface level, which represents the concatenation of letters which make up the actual spelling of the word.

[Figure: Schematic examples of the lexical and surface tapes; the actual transducers will involve intermediate tapes as well]

Following Koskenniemi (1983), we allow each arc only to have a single symbol from each alphabet. We can then combine the two symbol alphabets Σ and Δ to create a new alphabet, Σ', which makes the relationship to FSAs quite clear. Σ' is a finite alphabet of complex symbols. Each complex symbol is composed of an input-output pair i : o, with one symbol i from the input alphabet Σ and one symbol o from the output alphabet Δ; thus Σ' ⊆ Σ × Δ. Σ and Δ may each also include the epsilon symbol ε. Thus, where an FSA accepts a language stated over a finite alphabet of single symbols, such as the alphabet of our sheep language:

Σ = {b, a, !}

an FST defined this way accepts a language stated over pairs of symbols, as in:

Σ' = {a : a, b : b, ! : !, a : !, a : ε, ε : !}

In two-level morphology, the pairs of symbols in Σ' are also called feasible pairs. Thus each feasible-pair symbol a : b in the transducer alphabet Σ' expresses how the symbol a from one tape is mapped to the symbol b on the other tape. For example, a : ε means that an a on the upper tape will correspond to nothing on the lower tape. Just as for an FSA, we can write regular expressions in the complex alphabet Σ'. Since it is most common for symbols to map to themselves, in two-level morphology we call pairs like a : a default pairs, and just refer to them by the single letter a.

We are now ready to build an FST morphological parser out of our earlier morphotactic FSAs and lexica, by adding an extra "lexical" tape and the appropriate morphological features. Figure 12 shows an augmentation of the nominal-inflection FSA with the nominal morphological features (+Sg and +Pl) that correspond to each morpheme. The symbol ^ indicates a morpheme boundary, while the symbol # indicates a word boundary. The morphological features map to the empty string ε or the boundary symbols, since there is no segment corresponding to them on the output tape.

[Figure 12: A schematic transducer for English nominal number inflection. The symbols above each arc represent elements of the morphological parse on the lexical tape; the symbols below each arc represent the surface tape.]
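As a rough illustration of the translator metaphor, here is a toy parse function mapping surface nouns to lexical-tape strings in the spirit of Figure 12. This is not the actual two-level implementation, which composes lexicon, morphotactic and orthographic transducers; the lexicon entries and the parse_noun() helper are illustrative only.

lexicon = {
    "cat": "reg-noun", "fox": "reg-noun",
    "goose": "irreg-sg-noun", "geese": "irreg-pl-noun",
}

def parse_noun(surface):
    """Return lexical-tape strings for a surface noun, cf. cats -> cat+N+PL."""
    parses = []
    if lexicon.get(surface) in ("reg-noun", "irreg-sg-noun"):
        parses.append(surface + "+N+SG")
    if lexicon.get(surface) == "irreg-pl-noun":
        # e.g. geese maps back to the lemma goose on the lexical tape
        lemma = {"geese": "goose"}.get(surface, surface)
        parses.append(lemma + "+N+PL")
    if surface.endswith("s") and lexicon.get(surface[:-1]) == "reg-noun":
        parses.append(surface[:-1] + "+N+PL")
    return parses

print(parse_noun("cats"))   # ['cat+N+PL']
print(parse_noun("geese"))  # ['goose+N+PL']
print(parse_noun("goose"))  # ['goose+N+SG']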
