Detail NLP

The document provides an introduction to natural language processing and demonstrates how to analyze text data using Python and the NLTK library. It shows how to clean HTML tags from text, tokenize words, and calculate word frequencies as an example of text analysis and compares implementing these tasks with only Python versus using NLTK functions.


Chapter 1

Let's search for something in the running example, where mystring is the same
string object, and we will try to look for some patterns in it. A substring search is
one of the common use cases of the re module. Let's implement this:
>>># We have to import the re module to use regular expressions
>>>import re
>>>if re.search('Python', mystring):
>>>    print "We found python"
>>>else:
>>>    print "NO"

Once this is executed, we get the message as follows:


We found python

We can do more pattern finding using regular expressions. One of the common
functions used to find all occurrences of a pattern in a string is findall. It looks for
the given pattern in the string, and gives you a list of all the matching substrings:
>>>import re
>>>print re.findall('!',mystring)
['!', '!']

As we can see, there were two instances of "!" in mystring, and findall
returned both as a list.
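As a quick sketch of a richer pattern (written in Python 3 syntax, and assuming mystring holds the running example phrase from earlier in the chapter), findall can also match whole word patterns rather than single characters:

```python
import re

# Stand-in for the chapter's running example string
mystring = "Monty Python ! and the holy Grail !"

# findall returns every non-overlapping match as a list of strings;
# here the pattern matches any capitalized word
print(re.findall(r'\b[A-Z]\w+', mystring))  # ['Monty', 'Python', 'Grail']
```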

Dictionaries
The other most commonly used data structure is the dictionary, also known as an
associative array/memory in other programming languages. Dictionaries are
data structures indexed by keys, which can be of any immutable type; strings
and numbers, for example, can always be keys.
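A minimal sketch of this rule (in Python 3 syntax, with made-up keys and values) shows both sides: immutable keys work, mutable ones do not:

```python
# Keys of several immutable types can live in one dictionary
d = {'name': 'Python', 1991: 'first release year', (3, 4): 'a version tuple'}
print(d['name'], d[1991], d[(3, 4)])

# A mutable type such as a list cannot be a key
try:
    d[['a', 'list']] = 'this fails'
except TypeError:
    print('lists are not hashable, so they cannot be dictionary keys')
```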

Dictionaries are a handy data structure used widely across programming
languages to implement many algorithms. Python dictionaries are among the most
elegant implementations of hash tables in any programming language. It is easy to
work with dictionaries, and the great thing is that with a few nuggets of code you can
build a very complex data structure, while the same task could take much more time
and coding effort in other languages. This gives the programmer more time to focus
on algorithms rather than on the data structure itself.

Introduction to Natural Language Processing

Let's use one of the very common use cases of dictionaries: getting the frequency
distribution of words in a given text. With just a few lines of the following code, you
can get the frequency of words. Try the same task in any other language and you
will understand how amazing Python is:
>>># declare a dictionary
>>>word_freq = {}
>>>for tok in mystring.split():
>>>    if tok in word_freq:
>>>        word_freq[tok] += 1
>>>    else:
>>>        word_freq[tok] = 1
>>>print word_freq
{'!': 2, 'and': 1, 'holy': 1, 'Python': 1, 'Grail': 1, 'the': 1, 'Monty': 1}
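As a side note, the standard library already ships this exact idiom: collections.Counter builds the same frequency dictionary in one line (sketched here in Python 3 syntax with the same example string):

```python
from collections import Counter

mystring = "Monty Python ! and the holy Grail !"
# Counter consumes any iterable and counts its elements
word_freq = Counter(mystring.split())
print(word_freq['!'], word_freq['Monty'])  # 2 1
```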

Writing functions
As in any other programming language, Python has its own way of writing functions.
A function in Python starts with the keyword def, followed by the function name and
parentheses (). Similar to any other programming language, any arguments are
placed within these parentheses. The function header ends with a colon (:). The
initial lines of the body are typically a docstring (documentation), then we have the
code body, and the function ends with a return statement. For example, in the given
example the function wordfreq starts with the def keyword, takes a single argument
mystring, and ends with a return statement.
>>>import sys
>>>def wordfreq(mystring):
>>>    '''
>>>    Function to generate the frequency distribution of the given text
>>>    '''
>>>    print mystring
>>>    word_freq = {}
>>>    for tok in mystring.split():
>>>        if tok in word_freq:
>>>            word_freq[tok] += 1
>>>        else:
>>>            word_freq[tok] = 1

>>>    print word_freq
>>>    return word_freq


>>>def main():
>>>    # Use the text passed on the command line, if any
>>>    text = sys.argv[1] if len(sys.argv) > 1 else "This is my first python program"
>>>    wordfreq(text)
>>>if __name__ == '__main__':
>>>    main()

This is the same code that we wrote in the previous section; the idea of writing it in
the form of a function is to make the code reusable and readable. The interpreter
style of writing Python is also very common, but for writing big programs it is good
practice to use functions/classes and one of the programming paradigms. We also
want you to write and run your first Python program. You need to follow these
steps to achieve this:

1. Open an empty Python file mywordfreq.py in your preferred text editor.
2. Write or copy the preceding code snippet into the file.
3. Open the command prompt in your operating system.
4. Run the following command:
$ python mywordfreq.py "This is my first python program !!"

5. The output should be:
{'This': 1, 'is': 1, 'python': 1, 'first': 1, 'program': 1, 'my': 1, '!!': 1}
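If you are on Python 3, where print is a function, a sketch of the same script looks like this; the dict.get trick is an equivalent replacement for the if/else counting used above:

```python
import sys

def wordfreq(mystring):
    """Generate the frequency distribution of the given text."""
    word_freq = {}
    for tok in mystring.split():
        # get() returns 0 when the token has not been seen yet
        word_freq[tok] = word_freq.get(tok, 0) + 1
    return word_freq

def main():
    # Use the text passed on the command line, if any
    text = sys.argv[1] if len(sys.argv) > 1 else "This is my first python program"
    print(wordfreq(text))

if __name__ == '__main__':
    main()
```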

Now you have a very basic understanding of some common data structures that
Python provides, and you can write and run a full Python program. This much of an
introduction to Python should be enough to manage for the initial chapters.

Please have a look at some Python tutorials on the following
website to learn more about Python:
https://wiki.python.org/moin/BeginnersGuide

Diving into NLTK


Instead of going further into the theoretical aspects of natural language processing,
let's start with a quick dive into NLTK. I am going to start with some basic example
use cases of NLTK. There is a good chance that you have already done something
similar. First, I will give a typical Python programmer approach, and then move on
to NLTK for a much more efficient, robust, and clean solution.


We will start analyzing with some example text content. For the current example, I
have taken the content from Python's home page.
>>>import urllib2
>>># urllib2 is used to download the html content of the web link
>>>response = urllib2.urlopen('http://python.org/')
>>># You can read the entire content of a file using the read() method
>>>html = response.read()
>>>print len(html)
47020
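For readers on Python 3, urllib2 no longer exists; its urlopen lives in urllib.request, and read() returns bytes that must be decoded. A sketch, with the live download left commented out so the example does not depend on network access (the stand-in string below is not the real python.org page):

```python
from urllib.request import urlopen  # Python 3 home of urllib2.urlopen

# Live version (needs network access):
# html = urlopen('http://python.org/').read().decode('utf-8')

# Offline stand-in for the downloaded bytes:
raw = b"<html><body>The official home of the Python Programming Language</body></html>"
html = raw.decode('utf-8')
print(len(html))
```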

We don't have any clue about the kind of topics that are discussed in this URL, so
let's start with an exploratory data analysis (EDA). Typically in a text domain, EDA
can have many meanings, but we will go with a simple case of seeing what kinds of
terms dominate the document. What are the topics? How frequent are they? The
process will involve some level of preprocessing. We will first try to do this in a pure
Python way, and then we will do it using NLTK.

Let's start with cleaning the html tags. One way to do this is to select just the
tokens, including numbers and characters. Anybody who has worked with regular
expressions should be able to convert an html string into a list of tokens:
>>># Split the string on whitespace
>>>tokens = [tok for tok in html.split()]
>>>print "Total no of tokens :" + str(len(tokens))
>>># First 100 tokens
>>>print tokens[0:100]
Total no of tokens :2860
['<!doctype', 'html>', '<!--[if', 'lt', 'IE', '7]>', '<html', 'class="no-js', 'ie6', 'lt-ie7', 'lt-ie8', 'lt-ie9">', '<![endif]-->', '<!--[if', 'IE', '7]>', '<html', 'class="no-js', 'ie7', 'lt-ie8', 'lt-ie9">', '<![endif]-->', 'type="text/css"', 'media="not', 'print,', 'braille,' ...]

As you can see, there is an excess of html tags and other unwanted characters when
we use the preceding method. A cleaner version of the same task will look something
like this:
>>>import re
>>># Split on any run of non-word characters
>>># https://docs.python.org/2/library/re.html
>>>tokens = re.split(r'\W+', html)


>>>print len(tokens)
>>>print tokens[0:100]
5787
['', 'doctype', 'html', 'if', 'lt', 'IE', '7', 'html', 'class', 'no',
'js', 'ie6', 'lt', 'ie7', 'lt', 'ie8', 'lt', 'ie9', 'endif', 'if',
'IE', '7', 'html', 'class', 'no', 'js', 'ie7', 'lt', 'ie8', 'lt', 'ie9',
'endif', 'if', 'IE', '8', 'msapplication', 'tooltip', 'content', 'The',
'official', 'home', 'of', 'the', 'Python', 'Programming', 'Language',
'meta', 'name', 'apple' ...]

This looks much cleaner now. But you can still do more; I leave it to you to try
to remove as much noise as you can. You can clean some of the HTML tags that are
still popping up. You probably also want to use word length as a criterion and
remove words that have a length of one; this will remove elements like 7 and 8,
which are just noise in this case. Now, instead of writing some of these preprocessing
steps from scratch, let's move to NLTK for the same task. There is a function called
clean_html() that can do all the cleaning that we were looking for:

>>>import nltk
>>># http://www.nltk.org/api/nltk.html#nltk.util.clean_html
>>>clean = nltk.clean_html(html)
>>># clean will have entire string removing all the html noise
>>>tokens = [tok for tok in clean.split()]
>>>print tokens[:100]
['Welcome', 'to', 'Python.org', 'Skip', 'to', 'content', '&#9660;',
'Close', 'Python', 'PSF', 'Docs', 'PyPI', 'Jobs', 'Community', '&#9650;',
'The', 'Python', 'Network', '&equiv;', 'Menu', 'Arts', 'Business' ...]

Cool, right? This definitely is much cleaner and easier to do.
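One caveat for readers on newer NLTK versions: clean_html() was removed in NLTK 3.x; calling it raises NotImplementedError and directs you to BeautifulSoup's get_text() instead. If you prefer to stay in the standard library, a sketch with html.parser achieves similar tag stripping, and also applies the length-one filter suggested earlier (the sample string here is a stand-in, not the real python.org page):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML document."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

parser = TextExtractor()
parser.feed("<html><body><h1>Welcome</h1><p>to the Python 3 home page</p></body></html>")
tokens = ' '.join(parser.parts).split()
# Drop length-one tokens, which are mostly noise here
tokens = [tok for tok in tokens if len(tok) > 1]
print(tokens)
```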

Let's try to get the frequency distribution of these terms. First, let's do it the pure
Python way; then I will show you the NLTK recipe.
>>>import operator
>>>freq_dis={}
>>>for tok in tokens:
>>> if tok in freq_dis:
>>> freq_dis[tok]+=1
>>> else:
>>> freq_dis[tok]=1
>>># We want to sort this dictionary on values ( freq in this case )
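The sorting step that the last comment announces can be sketched like this (freq_dis here is a small made-up stand-in, since the real counts depend on the downloaded page):

```python
import operator

freq_dis = {'Python': 5, 'the': 9, 'Welcome': 1}
# Sort the (token, count) pairs by count, highest first
sorted_freq_dist = sorted(freq_dis.items(), key=operator.itemgetter(1), reverse=True)
print(sorted_freq_dist)  # [('the', 9), ('Python', 5), ('Welcome', 1)]
```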
