Detail NLP

The document provides an introduction to natural language processing and demonstrates how to analyze text data using Python and the NLTK library. It shows how to clean HTML tags from text, tokenize words, and calculate word frequencies as an example of text analysis and compares implementing these tasks with only Python versus using NLTK functions.


Chapter 1

Let's search for something in the running example, where mystring is the same
string object, and we will try to look for some patterns in it. A substring search is
one of the common use cases of the re module. Let's implement this:
>>># We have to import the re module to use regular expressions
>>>import re
>>>if re.search('Python', mystring):
>>>    print "We found python"
>>>else:
>>>    print "NO"

Once this is executed, we get the message as follows:


We found python

We can do more pattern finding using regular expressions. One of the common
functions used to find all occurrences of a pattern in a string is findall. It looks for
the given pattern in the string, and gives you a list of all the matching substrings:
>>>import re
>>>print re.findall('!',mystring)
['!', '!']

As we can see, there were two instances of "!" in mystring, and findall
returned both as a list.
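As a quick sketch of a richer pattern (written in Python 3 syntax, and assuming mystring holds the running example phrase from earlier in the chapter), findall can also match whole word patterns rather than single characters:

```python
import re

# Stand-in for the chapter's running example string
mystring = "Monty Python ! and the holy Grail !"

# findall returns every non-overlapping match as a list of strings;
# here the pattern matches any capitalized word
print(re.findall(r'\b[A-Z]\w+', mystring))  # ['Monty', 'Python', 'Grail']
```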

Dictionaries
The other most commonly used data structure is the dictionary, also known as an
associative array/memory in other programming languages. Dictionaries are
data structures indexed by keys, which can be of any immutable type; strings
and numbers, for example, can always be keys.
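A minimal sketch of this rule (in Python 3 syntax, with made-up keys and values) shows both sides: immutable keys work, mutable ones do not:

```python
# Keys of several immutable types can live in one dictionary
d = {'name': 'Python', 1991: 'first release year', (3, 4): 'a version tuple'}
print(d['name'], d[1991], d[(3, 4)])

# A mutable type such as a list cannot be a key
try:
    d[['a', 'list']] = 'this fails'
except TypeError:
    print('lists are not hashable, so they cannot be dictionary keys')
```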

Dictionaries are a handy data structure used widely across programming
languages to implement many algorithms. Python dictionaries are among the most
elegant implementations of hash tables in any programming language. It is easy to
work with dictionaries, and the great thing is that with a few nuggets of code you can
build a very complex data structure, while the same task could take much more time
and coding effort in other languages. This gives the programmer more time to focus
on algorithms rather than on the data structure itself.

Introduction to Natural Language Processing

Let's use one of the very common use cases of dictionaries: getting the frequency
distribution of words in a given text. With just a few lines of the following code, you
can get the frequency of words. Try the same task in any other language and you
will understand how amazing Python is:
>>># declare a dictionary
>>>word_freq = {}
>>>for tok in mystring.split():
>>>    if tok in word_freq:
>>>        word_freq[tok] += 1
>>>    else:
>>>        word_freq[tok] = 1
>>>print word_freq
{'!': 2, 'and': 1, 'holy': 1, 'Python': 1, 'Grail': 1, 'the': 1, 'Monty': 1}
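As a side note, the standard library already ships this exact idiom: collections.Counter builds the same frequency dictionary in one line (sketched here in Python 3 syntax with the same example string):

```python
from collections import Counter

mystring = "Monty Python ! and the holy Grail !"
# Counter consumes any iterable and counts its elements
word_freq = Counter(mystring.split())
print(word_freq['!'], word_freq['Monty'])  # 2 1
```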

Writing functions
As in any other programming language, Python has its own way of writing functions.
A function in Python starts with the keyword def, followed by the function name and
parentheses (). Similar to any other programming language, any arguments are
placed within these parentheses. The function header ends with a colon (:). The
initial lines of the body are typically a docstring (documentation), then we have the
code body, and the function ends with a return statement. For example, in the given
example the function wordfreq starts with the def keyword, takes a single argument
mystring, and ends with a return statement.
>>>import sys
>>>def wordfreq(mystring):
>>>    '''
>>>    Function to generate the frequency distribution of the given text
>>>    '''
>>>    print mystring
>>>    word_freq = {}
>>>    for tok in mystring.split():
>>>        if tok in word_freq:
>>>            word_freq[tok] += 1
>>>        else:
>>>            word_freq[tok] = 1

>>>    print word_freq
>>>    return word_freq


>>>def main():
>>>    # Use the text passed on the command line, if any
>>>    text = sys.argv[1] if len(sys.argv) > 1 else "This is my first python program"
>>>    wordfreq(text)
>>>if __name__ == '__main__':
>>>    main()

This is the same code that we wrote in the previous section; the idea of writing it in
the form of a function is to make the code reusable and readable. The interpreter
style of writing Python is also very common, but for writing big programs it is good
practice to use functions/classes and one of the programming paradigms. We also
want you to write and run your first Python program. You need to follow these
steps to achieve this:

1. Open an empty Python file mywordfreq.py in your preferred text editor.
2. Write or copy the preceding code snippet into the file.
3. Open the command prompt in your operating system.
4. Run the following command:
$ python mywordfreq.py "This is my first python program !!"

5. The output should be:
{'This': 1, 'is': 1, 'python': 1, 'first': 1, 'program': 1, 'my': 1, '!!': 1}
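If you are on Python 3, where print is a function, a sketch of the same script looks like this; the dict.get trick is an equivalent replacement for the if/else counting used above:

```python
import sys

def wordfreq(mystring):
    """Generate the frequency distribution of the given text."""
    word_freq = {}
    for tok in mystring.split():
        # get() returns 0 when the token has not been seen yet
        word_freq[tok] = word_freq.get(tok, 0) + 1
    return word_freq

def main():
    # Use the text passed on the command line, if any
    text = sys.argv[1] if len(sys.argv) > 1 else "This is my first python program"
    print(wordfreq(text))

if __name__ == '__main__':
    main()
```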

Now you have a very basic understanding of some common data structures that
Python provides, and you can write and run a full Python program. This much of an
introduction to Python should be enough to manage for the initial chapters.

Please have a look at some Python tutorials on the following
website to learn more about Python:
https://wiki.python.org/moin/BeginnersGuide

Diving into NLTK


Instead of going further into the theoretical aspects of natural language processing,
let's start with a quick dive into NLTK. I am going to start with some basic example
use cases of NLTK. There is a good chance that you have already done something
similar. First, I will give a typical Python programmer approach, and then move on
to NLTK for a much more efficient, robust, and clean solution.


We will start analyzing with some example text content. For the current example, I
have taken the content from Python's home page.
>>>import urllib2
>>># urllib2 is used to download the html content of the web link
>>>response = urllib2.urlopen('http://python.org/')
>>># You can read the entire content of a file using the read() method
>>>html = response.read()
>>>print len(html)
47020
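For readers on Python 3, urllib2 no longer exists; its urlopen lives in urllib.request, and read() returns bytes that must be decoded. A sketch, with the live download left commented out so the example does not depend on network access (the stand-in string below is not the real python.org page):

```python
from urllib.request import urlopen  # Python 3 home of urllib2.urlopen

# Live version (needs network access):
# html = urlopen('http://python.org/').read().decode('utf-8')

# Offline stand-in for the downloaded bytes:
raw = b"<html><body>The official home of the Python Programming Language</body></html>"
html = raw.decode('utf-8')
print(len(html))
```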

We don't have any clue about the kind of topics that are discussed in this URL, so
let's start with an exploratory data analysis (EDA). Typically in a text domain, EDA
can have many meanings, but we will go with a simple case of seeing what kinds of
terms dominate the document. What are the topics? How frequent are they? The
process will involve some level of preprocessing. We will first try to do this in a pure
Python way, and then we will do it using NLTK.

Let's start with cleaning the html tags. One way to do this is to select just the
tokens, including numbers and characters. Anybody who has worked with regular
expressions should be able to convert an html string into a list of tokens:
>>># Split the string on whitespace
>>>tokens = [tok for tok in html.split()]
>>>print "Total no of tokens :" + str(len(tokens))
>>># First 100 tokens
>>>print tokens[0:100]
Total no of tokens :2860
['<!doctype', 'html>', '<!--[if', 'lt', 'IE', '7]>', '<html', 'class="no-js', 'ie6', 'lt-ie7', 'lt-ie8', 'lt-ie9">', '<![endif]-->', '<!--[if', 'IE', '7]>', '<html', 'class="no-js', 'ie7', 'lt-ie8', 'lt-ie9">', '<![endif]-->', 'type="text/css"', 'media="not', 'print,', 'braille,' ...]

As you can see, there is an excess of html tags and other unwanted characters when
we use the preceding method. A cleaner version of the same task will look something
like this:
>>>import re
>>># Split on any run of non-word characters
>>># https://docs.python.org/2/library/re.html
>>>tokens = re.split(r'\W+', html)


>>>print len(tokens)
>>>print tokens[0:100]
5787
['', 'doctype', 'html', 'if', 'lt', 'IE', '7', 'html', 'class', 'no',
'js', 'ie6', 'lt', 'ie7', 'lt', 'ie8', 'lt', 'ie9', 'endif', 'if',
'IE', '7', 'html', 'class', 'no', 'js', 'ie7', 'lt', 'ie8', 'lt', 'ie9',
'endif', 'if', 'IE', '8', 'msapplication', 'tooltip', 'content', 'The',
'official', 'home', 'of', 'the', 'Python', 'Programming', 'Language',
'meta', 'name', 'apple' ...]

This looks much cleaner now. But you can still do more; I leave it to you to try
to remove as much noise as you can. You can clean some of the HTML tags that are
still popping up. You probably also want to use word length as a criterion and
remove words that have a length of one; this will remove elements like 7 and 8,
which are just noise in this case. Now, instead of writing some of these preprocessing
steps from scratch, let's move to NLTK for the same task. There is a function called
clean_html() that can do all the cleaning that we were looking for:

>>>import nltk
>>># http://www.nltk.org/api/nltk.html#nltk.util.clean_html
>>>clean = nltk.clean_html(html)
>>># clean will have entire string removing all the html noise
>>>tokens = [tok for tok in clean.split()]
>>>print tokens[:100]
['Welcome', 'to', 'Python.org', 'Skip', 'to', 'content', '&#9660;',
'Close', 'Python', 'PSF', 'Docs', 'PyPI', 'Jobs', 'Community', '&#9650;',
'The', 'Python', 'Network', '&equiv;', 'Menu', 'Arts', 'Business' ...]

Cool, right? This definitely is much cleaner and easier to do.
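One caveat for readers on newer NLTK versions: clean_html() was removed in NLTK 3.x; calling it raises NotImplementedError and directs you to BeautifulSoup's get_text() instead. If you prefer to stay in the standard library, a sketch with html.parser achieves similar tag stripping, and also applies the length-one filter suggested earlier (the sample string here is a stand-in, not the real python.org page):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML document."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

parser = TextExtractor()
parser.feed("<html><body><h1>Welcome</h1><p>to the Python 3 home page</p></body></html>")
tokens = ' '.join(parser.parts).split()
# Drop length-one tokens, which are mostly noise here
tokens = [tok for tok in tokens if len(tok) > 1]
print(tokens)
```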

Let's try to get the frequency distribution of these terms. First, let's do it the pure
Python way; then I will show you the NLTK recipe.
>>>import operator
>>>freq_dis={}
>>>for tok in tokens:
>>> if tok in freq_dis:
>>> freq_dis[tok]+=1
>>> else:
>>> freq_dis[tok]=1
>>># We want to sort this dictionary on values ( freq in this case )
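The sorting step that the last comment announces can be sketched like this (freq_dis here is a small made-up stand-in, since the real counts depend on the downloaded page):

```python
import operator

freq_dis = {'Python': 5, 'the': 9, 'Welcome': 1}
# Sort the (token, count) pairs by count, highest first
sorted_freq_dist = sorted(freq_dis.items(), key=operator.itemgetter(1), reverse=True)
print(sorted_freq_dist)  # [('the', 9), ('Python', 5), ('Welcome', 1)]
```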
