>> Word tokenization is an important part
of text processing. Every natural language
processing task has to normalize the text in some way: we start by segmenting or tokenizing the words, we often have to normalize the format of each of the words, and as part of this process we're going to have to break out sentences from the text. So let's start by talking about word tokenization.

How many words are there in a sentence? Here's a sentence: "I do uh main- mainly business data processing." How many words are in that sentence? It's a complicated question. Is "uh" a word? How about the cut-off "main-" of "mainly"? We call things like "main-" a fragment, and we call things like "uh" a filled pause. For certain applications we might want to count these, for example if we're dealing with speech synthesis or speech recognition, or correcting disfluencies.

What about "cat" and "cats"? We talked about the cat in the hat. So we define the term lemma: two words are the same lemma if they have the same stem, the same part of speech, and roughly the same word sense. "Cat" and "cats" are both nouns and they have similar meaning, so we say that cat and cats are the same lemma, the same word in that sense. We define the term wordform to mean the full inflected surface form, so cat and cats by that definition are different words: they're different wordforms. We're going to use different definitions depending on our goals.

So let's look at an example sentence: "They lay back on the San Francisco grass and looked at the stars and their", and so on. Let's ask how many words there are in this sentence; count for yourself. We can define words in a couple of ways: word types, how many vocabulary elements, how many unique words there are; and word tokens, how many instances of those types there are in running text.

So how many tokens do we have here? It should be easy to count: one, two, three, four, five and so on. If we count "San" and "Francisco" separately, we end up with fifteen; if we count "San Francisco" as one token, we end up with fourteen. So even the definition of a word depends a little bit on what we do with our spaces. How about types? Count for yourself. It's thirteen types, again depending on how we count, because we have multiple copies: the word "the" appears twice, and again it depends on whether we count "San Francisco" as one word or two. And remember our lemmas: we might decide that "they" and "their" are the same lemma, although different wordforms, and count them as the same type, depending again on our goal.

In general we're going to refer to the number of tokens, which comes up whenever we're counting things, as capital N. And we'll use capital V to mean the vocabulary, the set of different types; we'll use set notation, so the cardinality |V| is the size of the vocabulary, although sometimes, for simplicity, we'll just use capital V to mean the vocabulary size when it's not ambiguous.
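To make that concrete, here is one quick way to check those counts from the command line, using the same Unix tools introduced a little later in this lecture. This is a sketch, not part of the original demo: it uses only the words quoted above, and it splits purely on spaces, so "San" and "Francisco" are counted as separate tokens.

    # Count word tokens and word types in the example sentence (splitting on spaces only).
    s="They lay back on the San Francisco grass and looked at the stars and their"
    echo "$s" | tr -s ' ' '\n' | wc -l              # 15 word tokens (San and Francisco counted separately)
    echo "$s" | tr -s ' ' '\n' | sort -u | wc -l    # 13 word types ("the" and "and" each occur twice)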
So how many tokens and types are there in the kinds of data sets we look at in natural language processing? Let's look at a few. Data sets of text are called corpora, and here are three important corpora. The Switchboard corpus of phone conversations has 2.4 million word tokens, and there are 20,000 word types in those 2.4 million words. Shakespeare has just under a million word tokens; Shakespeare is quite a small corpus, he wrote about 800,000 words in his lifetime, and in that less than a million words he actually used 31,000 distinct words.

So he had a famously broad vocabulary. And if you look at a very large corpus, the Google N-grams corpus, that has a trillion tokens, a very large number of words, and thirteen million types. So how many words are there in English? If you look at conversation, 20,000 different words. If you look at Shakespeare, about 31,000 words. If you combine the two, probably not quite the sum of the two, but some larger number. And if you look at the Google N-grams we have thirteen million; of course some of those are probably URLs and email addresses, but even if you eliminate all of those, the number of words in a language is very large, maybe there are a million words of English. In fact, Church and Gale suggested that the size of the vocabulary grows faster than the square root of the number of tokens, |V| > O(N^(1/2)): as you see N tokens, expect more than sqrt(N) vocabulary items. So the vocabulary keeps growing and growing, and it's names and other kinds of things that contribute to this growth in vocabulary.

Now we're going to introduce some standard Unix tools that are used for text processing. I have here a corpus of Shakespeare, Shakespeare's complete works; you can see here are the sonnets, and it goes on to a play. Let's start by extracting all the words in the corpus. We're going to do this using the tr program. The tr program takes a set of characters and maps every instance of those characters into another character. We specify tr -c, which means complement: take every character that is not one of these characters and turn it into this character. In this case, take every non-alphabetic character and turn it into a newline. So we're going to replace all the periods and commas and spaces in Shakespeare with newlines, and in this way create one word per line. Let's look at that: we've now turned the sonnets into one word per line.

Now we're going to sort those words, to let us look at the unique word types. You can see here are all the a's; "a" occurs a lot in Shakespeare, and this is a very boring way to look through all of Shakespeare, we don't want to do this. So instead let's use the program uniq: uniq takes that sorted file and tells us, for each unique type, the count of times it occurs. Let's try that. Here we have all the words in Shakespeare, with a count along the left; this is the output of uniq, and we can walk through it. We know that in Shakespeare the word "Achievement" with a capital A occurs once, the word "Achilles" appears 79 times, the word "acquaint" six times, and so on.

That's interesting, but it would be nice if we didn't have to look at these words in alphabetical order and could look at them in frequency order instead. So let's take this same list of words and re-sort it by frequency. Now the most frequent word in Shakespeare is the word "the", followed by the word "I", followed by the word "and", and then we have the actual counts in Shakespeare. So here is the whole lexicon of Shakespeare sorted in frequency order. There are some problems, though: the word "and" occurs twice, because we didn't map our uppercase words to lowercase words. So let's fix the mapping of case first and try again. We map all of the uppercase letters in Shakespeare to lowercase letters, and we pipe that to another instance of the tr program, which replaces all of the non-alphabetics with newlines. Then we do our sorting as before, we use uniq -c to get each individual type with its count, and then we sort again, where -n means numerically and -r means reverse, starting from the highest count. Let's do that. All right, now we've solved the problem of the "and"s: we only have lowercase "and", and we don't have an uppercase "And" appearing.
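Putting the whole pipeline together, it looks roughly like this. The filename and some details, such as the -s flag that squeezes runs of newlines into one, are my reconstruction of the commands described above rather than a verbatim copy of the demo.

    # Rough reconstruction of the pipeline described above (filename assumed):
    #   1. fold upper case to lower case
    #   2. turn every run of non-alphabetic characters into a newline (one word per line)
    #   3. sort so identical words are adjacent, then count each type with uniq -c
    #   4. sort the counts numerically (-n), highest first (-r)
    tr 'A-Z' 'a-z' < shakespeare.txt |
      tr -sc 'a-z' '\n' |
      sort |
      uniq -c |
      sort -n -r |
      head

The top of that list should be function words like "the", "i", and "and", as in the demo.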
But we have another problem: we have this "d" here. Why is the word "d", or the word "s", so frequent in Shakespeare? They come from splitting contractions and possessives at the apostrophe, which brings up the next issue: we have to decide on a standard for what counts as a word. For example, if our input is "Finland's capital", how we tokenize "Finland's" depends on our goal. We might choose to keep the apostrophe, so we have "Finland's"; we might choose to replace the apostrophe with nothing; or we might choose to eliminate the "'s" altogether. Similarly, we might choose to expand "what're" to "what are" and "I'm" to "I am", because if we're looking for all the cases of "I" for some sentiment analysis task, or looking for cases of negation for some other task, we might want to turn "isn't" into "is not" (there's a rough sketch of this kind of expansion below).

How about Hewlett-Packard? We have to decide whether a name like Hewlett-Packard is going to be represented with the hyphen, as one word, or with a space. The same is true of phrases like "state-of-the-art". For words like "lowercase" we'll have to decide whether they should have a dash, no dash at all, or a space. We talked about the issue of San Francisco. And then periods become a huge issue: we have to decide whether we're going to represent "m.p.h." with the periods left in, and then all of our algorithms that use periods for splitting things are going to have to be sensitive to this.
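If we do decide to expand clitics before splitting, one very rough way to do it with the same Unix tools is a chain of sed substitutions in front of the tr step. The particular rules and the input filename below are purely illustrative assumptions, not the lecture's recipe; a real tokenizer needs many more rules and more care, for example to distinguish the possessive in "Finland's" from the contraction in "it's".

    # Illustrative only: expand a few contractions, then tokenize as before.
    # The apostrophe is kept in the character class so remaining forms like Finland's stay intact.
    sed -e "s/what're/what are/g" \
        -e "s/I'm/I am/g" \
        -e "s/isn't/is not/g" < input.txt |
      tr -sc "A-Za-z'" '\n' |
      sort | uniq -c | sort -n -r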
The issue of tokenization becomes even more complicated in other languages. In French we have phrases like "l'ensemble": do we want the "l'" to be a separate word, and if so do we turn it into the full article "le", or keep it as "l'", or just an "l" by itself? We'd like it to match the same word "ensemble" even if a different article occurs before it. So we're going to want to break these up for some purposes, but then we're stuck with these sorts of non-words.

Another issue we have to deal with: in German, long nouns are not segmented as they are in English. A phrase like "life insurance company employee" would be written with spaces in English, but in German we get a very long phrase spelled as a single word (Lebensversicherungsgesellschaftsangestellter). So for German tasks like information retrieval, we're going to need to do compound splitting.

In Chinese and Japanese we have a different problem: there are no spaces at all between the words. Here's an example; we've shown the original Chinese sentence, and then the sentence segmented out, "Sharapova now lives in US", and so on. In English the words are already segmented; in Chinese they aren't, so if we want to do natural language processing on Chinese, we need some way of breaking things up into words. Similarly, in Japanese we have the problem that there are no spaces between words, plus the problem that there are multiple alphabets intermingled: there's the katakana alphabet, there's the hiragana alphabet, there are kanji, which are like the Chinese characters, and there's romaji, the roman letters, which is another complicating issue that has to be dealt with in tokenizing Japanese.

Word tokenization in Chinese is a common research problem that has to be addressed when doing any kind of Chinese natural language processing. Characters in Chinese represent a single syllable, often a single morpheme, and the average word is about 2.4 characters long, so a word has to be assembled out of roughly two or three characters. There are lots of complicated algorithms for this, but there is a standard baseline segmentation algorithm called MaxMatch, the maximum matching algorithm, also called the greedy algorithm.

So let's look at MaxMatch as an algorithm. We're given a word list of Chinese, that is, a vocabulary or dictionary, and a string. We start a pointer at the beginning of the string, we find the longest word in the dictionary that matches the string starting at the pointer, we move the pointer past that word in the string, and then we continue from the new pointer position.

Let's see an example of that working; I'm going to pick an English example because it's easier to think about. Imagine English were written like Chinese, with no spaces, so we have a phrase like "thecatinthehat", all run together, and we have a dictionary with words like "the" and "cat". We look at this and ask: what's the longest word in our dictionary that matches the beginning? The longest word is "the", because "thec" is not a word, and "theca" is not a word, and so on. So we take "the", and now we've gotten to here, and then we ask what's the longest word starting with "c", and the longest word is "cat". Then we ask what's the longest word starting with "i", and so on, and we do a good job.

How about the phrase "the table down there"? We take the spaces out: "thetabledownthere". What is our MaxMatch segmentation algorithm going to do with that? Think a little for yourself. You may expect it to produce "the table down there", but there's a problem: English has a lot of long words. English has the word "theta", as in the variable. So instead of "the table down there", we're going to get "theta", then "bled", then "own", then "there": "theta bled own there". So MaxMatch is in fact not a generally good algorithm for this kind of pseudo-English, English without spaces, because English has very long words and short words all mixed together. But since Chinese in general has relatively consistent word lengths, this works very well for Chinese. And it turns out that modern probabilistic segmentation algorithms work even better. So that's the end of our section on word tokenization.
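To make the greedy loop concrete, here is a small sketch of MaxMatch in shell. The tiny hard-coded dictionary exists only to reproduce the two English examples above; a real segmenter would use a full Chinese word list.

    #!/usr/bin/env bash
    # MaxMatch sketch: at each position take the longest dictionary word that
    # matches; if nothing matches, fall back to emitting a single character.
    dict=(the cat in hat table down there theta bled own)   # made-up toy dictionary

    maxmatch() {
      local s=$1 i=0
      local -a out=()
      while (( i < ${#s} )); do
        local best=""
        for w in "${dict[@]}"; do
          # keep the longest dictionary word that matches at position i
          if [[ "${s:$i:${#w}}" == "$w" && ${#w} -gt ${#best} ]]; then
            best=$w
          fi
        done
        [[ -z $best ]] && best=${s:$i:1}   # unknown character: emit it alone
        out+=("$best")
        (( i += ${#best} ))
      done
      echo "${out[*]}"
    }

    maxmatch thecatinthehat      # -> the cat in the hat
    maxmatch thetabledownthere   # -> theta bled own there

With this dictionary the first call comes out right, and the second produces exactly the "theta bled own there" failure discussed above, which is the point: greedy longest-match is only as good as the mix of word lengths in the language.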