Natural Language Processing: Python and NLTK
By Jacob Perkins, Nitin Hardeniya, Deepti Chopra, Nisheeth Joshi, and Iti Mathur
Table of Contents
Natural Language Processing: Python and NLTK
Natural Language Processing: Python and NLTK
Credits
Preface
What this learning path covers
What you need for this learning path
Who this learning path is for
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
1. Module 1
1. Introduction to Natural Language Processing
Why learn NLP?
Let's start playing with Python!
Lists
Helping yourself
Regular expressions
Dictionaries
Writing functions
Diving into NLTK
Your turn
Summary
2. Text Wrangling and Cleansing
What is text wrangling?
Text cleansing
Sentence splitter
Tokenization
Stemming
Lemmatization
Stop word removal
Rare word removal
Spell correction
Your turn
Summary
3. Part of Speech Tagging
What is Part of speech tagging
Stanford tagger
Diving deep into a tagger
Sequential tagger
N-gram tagger
Regex tagger
Brill tagger
Machine learning based tagger
Named Entity Recognition (NER)
NER tagger
Your Turn
Summary
4. Parsing Structure in Text
Shallow versus deep parsing
The two approaches in parsing
Why we need parsing
Different types of parsers
A recursive descent parser
A shift-reduce parser
A chart parser
A regex parser
Dependency parsing
Chunking
Information extraction
Named-entity recognition (NER)
Relation extraction
Summary
5. NLP Applications
Building your first NLP application
Other NLP applications
Machine translation
Statistical machine translation
Information retrieval
Boolean retrieval
Vector space model
The probabilistic model
Speech recognition
Text classification
Information extraction
Question answering systems
Dialog systems
Word sense disambiguation
Topic modeling
Language detection
Optical character recognition
Summary
6. Text Classification
Machine learning
Text classification
Sampling
Naive Bayes
Decision trees
Stochastic gradient descent
Logistic regression
Support vector machines
The Random forest algorithm
Text clustering
K-means
Topic modeling in text
Installing gensim
References
Summary
7. Web Crawling
Web crawlers
Writing your first crawler
Data flow in Scrapy
The Scrapy shell
Items
The Sitemap spider
The item pipeline
External references
Summary
8. Using NLTK with Other Python Libraries
NumPy
ndarray
Indexing
Basic operations
Extracting data from an array
Complex matrix operations
Reshaping and stacking
Random numbers
SciPy
Linear algebra
eigenvalues and eigenvectors
The sparse matrix
Optimization
pandas
Reading data
Series data
Column transformation
Noisy data
matplotlib
Subplot
Adding an axis
A scatter plot
A bar plot
3D plots
External references
Summary
9. Social Media Mining in Python
Data collection
Data extraction
Trending topics
Geovisualization
Influencers detection
Influencer friends
Summary
10. Text Mining at Scale
Different ways of using Python on Hadoop
Python streaming
Hive/Pig UDF
Streaming wrappers
NLTK on Hadoop
A UDF
Python streaming
Scikit-learn on Hadoop
PySpark
Summary
2. Module 2
1. Tokenizing Text and WordNet Basics
Introduction
Tokenizing text into sentences
Getting ready
How to do it...
How it works...
There's more...
Tokenizing sentences in other languages
See also
Tokenizing sentences into words
How to do it...
How it works...
There's more...
Separating contractions
PunktWordTokenizer
WordPunctTokenizer
See also
Tokenizing sentences using regular expressions
Getting ready
How to do it...
How it works...
There's more...
Simple whitespace tokenizer
See also
Training a sentence tokenizer
Getting ready
How to do it...
How it works...
There's more...
See also
Filtering stopwords in a tokenized sentence
Getting ready
How to do it...
How it works...
There's more...
See also
Looking up Synsets for a word in WordNet
Getting ready
How to do it...
How it works...
There's more...
Working with hypernyms
Part of speech (POS)
See also
Looking up lemmas and synonyms in WordNet
How to do it...
How it works...
There's more...
All possible synonyms
Antonyms
See also
Calculating WordNet Synset similarity
How to do it...
How it works...
There's more...
Comparing verbs
Path and Leacock Chodorow (LCH) similarity
See also
Discovering word collocations
Getting ready
How to do it...
How it works...
There's more...
Scoring functions
Scoring ngrams
See also
2. Replacing and Correcting Words
Introduction
Stemming words
How to do it...
How it works...
There's more...
The LancasterStemmer class
The RegexpStemmer class
The SnowballStemmer class
See also
Lemmatizing words with WordNet
Getting ready
How to do it...
How it works...
There's more...
Combining stemming with lemmatization
See also
Replacing words matching regular expressions
Getting ready
How to do it...
How it works...
There's more...
Replacement before tokenization
See also
Removing repeating characters
Getting ready
How to do it...
How it works...
There's more...
See also
Spelling correction with Enchant
Getting ready
How to do it...
How it works...
There's more...
The en_GB dictionary
Personal word lists
See also
Replacing synonyms
Getting ready
How to do it...
How it works...
There's more...
CSV synonym replacement
YAML synonym replacement
See also
Replacing negations with antonyms
How to do it...
How it works...
There's more...
See also
3. Creating Custom Corpora
Introduction
Setting up a custom corpus
Getting ready
How to do it...
How it works...
There's more...
Loading a YAML file
See also
Creating a wordlist corpus
Getting ready
How to do it...
How it works...
There's more...
Names wordlist corpus
English words corpus
See also
Creating a part-of-speech tagged word corpus
Getting ready
How to do it...
How it works...
There's more...
Customizing the word tokenizer
Customizing the sentence tokenizer
Customizing the paragraph block reader
Customizing the tag separator
Converting tags to a universal tagset
See also
Creating a chunked phrase corpus
Getting ready
How to do it...
How it works...
There's more...
Tree leaves
Treebank chunk corpus
CoNLL2000 corpus
See also
Creating a categorized text corpus
Getting ready
How to do it...
How it works...
There's more...
Category file
Categorized tagged corpus reader
Categorized corpora
See also
Creating a categorized chunk corpus reader
Getting ready
How to do it...
How it works...
There's more...
Categorized CoNLL chunk corpus reader
See also
Lazy corpus loading
How to do it...
How it works...
There's more...
Creating a custom corpus view
How to do it...
How it works...
There's more...
Block reader functions
Pickle corpus view
Concatenated corpus view
See also
Creating a MongoDB-backed corpus reader
Getting ready
How to do it...
How it works...
There's more...
See also
Corpus editing with file locking
Getting ready
How to do it...
How it works...
4. Part-of-speech Tagging
Introduction
Default tagging
Getting ready
How to do it...
How it works...
There's more...
Evaluating accuracy
Tagging sentences
Untagging a tagged sentence
See also
Training a unigram part-of-speech tagger
How to do it...
How it works...
There's more...
Overriding the context model
Minimum frequency cutoff
See also
Combining taggers with backoff tagging
How to do it...
How it works...
There's more...
Saving and loading a trained tagger with pickle
See also
Training and combining ngram taggers
Getting ready
How to do it...
How it works...
There's more...
Quadgram tagger
See also
Creating a model of likely word tags
How to do it...
How it works...
There's more...
See also
Tagging with regular expressions
Getting ready
How to do it...
How it works...
There's more...
See also
Affix tagging
How to do it...
How it works...
There's more...
Working with min_stem_length
See also
Training a Brill tagger
How to do it...
How it works...
There's more...
Tracing
See also
Training the TnT tagger
How to do it...
How it works...
There's more...
Controlling the beam search
Significance of capitalization
See also
Using WordNet for tagging
Getting ready
How to do it...
How it works...
See also
Tagging proper names
How to do it...
How it works...
See also
Classifier-based tagging
How to do it...
How it works...
There's more...
Detecting features with a custom feature detector
Setting a cutoff probability
Using a pre-trained classifier
See also
Training a tagger with NLTK-Trainer
How to do it...
How it works...
There's more...
Saving a pickled tagger
Training on a custom corpus
Training with universal tags
Analyzing a tagger against a tagged corpus
Analyzing a tagged corpus
See also
5. Extracting Chunks
Introduction
Chunking and chinking with regular expressions
Getting ready
How to do it...
How it works...
There's more...
Parsing different chunk types
Parsing alternative patterns
Chunk rule with context
See also
Merging and splitting chunks with regular expressions
How to do it...
How it works...
There's more...
Specifying rule descriptions
See also
Expanding and removing chunks with regular expressions
How to do it...
How it works...
There's more...
See also
Partial parsing with regular expressions
How to do it...
How it works...
There's more...
The ChunkScore metrics
Looping and tracing chunk rules
See also
Training a tagger-based chunker
How to do it...
How it works...
There's more...
Using different taggers
See also
Classification-based chunking
How to do it...
How it works...
There's more...
Using a different classifier builder
See also
Extracting named entities
How to do it...
How it works...
There's more...
Binary named entity extraction
See also
Extracting proper noun chunks
How to do it...
How it works...
There's more...
See also
Extracting location chunks
How to do it...
How it works...
There's more...
See also
Training a named entity chunker
How to do it...
How it works...
There's more...
See also
Training a chunker with NLTK-Trainer
How to do it...
How it works...
There's more...
Saving a pickled chunker
Training a named entity chunker
Training on a custom corpus
Training on parse trees
Analyzing a chunker against a chunked corpus
Analyzing a chunked corpus
See also
6. Transforming Chunks and Trees
Introduction
Filtering insignificant words from a sentence
Getting ready
How to do it...
How it works...
There's more...
See also
Correcting verb forms
Getting ready
How to do it...
How it works...
See also
Swapping verb phrases
How to do it...
How it works...
There's more...
See also
Swapping noun cardinals
How to do it...
How it works...
See also
Swapping infinitive phrases
How to do it...
How it works...
There's more...
See also
Singularizing plural nouns
How to do it...
How it works...
See also
Chaining chunk transformations
How to do it...
How it works...
There's more...
See also
Converting a chunk tree to text
How to do it...
How it works...
There's more...
See also
Flattening a deep tree
Getting ready
How to do it...
How it works...
There's more...
The cess_esp and cess_cat treebank
See also
Creating a shallow tree
How to do it...
How it works...
See also
Converting tree labels
Getting ready
How to do it...
How it works...
See also
7. Text Classification
Introduction
Bag of words feature extraction
How to do it...
How it works...
There's more...
Filtering stopwords
Including significant bigrams
See also
Training a Naive Bayes classifier
Getting ready
How to do it...
How it works...
There's more...
Classification probability
Most informative features
Training estimator
Manual training
See also
Training a decision tree classifier
How to do it...
How it works...
There's more...
Controlling uncertainty with entropy_cutoff
Controlling tree depth with depth_cutoff
Controlling decisions with support_cutoff
See also
Training a maximum entropy classifier
Getting ready
How to do it...
How it works...
There's more...
Megam algorithm
See also
Training scikit-learn classifiers
Getting ready
How to do it...
How it works...
There's more...
Comparing Naive Bayes algorithms
Training with logistic regression
Training with LinearSVC
See also
Measuring precision and recall of a classifier
How to do it...
How it works...
There's more...
F-measure
See also
Calculating high information words
How to do it...
How it works...
There's more...
The MaxentClassifier class with high information words
The DecisionTreeClassifier class with high information words
The SklearnClassifier class with high information words
See also
Combining classifiers with voting
Getting ready
How to do it...
How it works...
See also
Classifying with multiple binary classifiers
Getting ready
How to do it...
How it works...
There's more...
See also
Training a classifier with NLTK-Trainer
How to do it...
How it works...
There's more...
Saving a pickled classifier
Using different training instances
The most informative features
The Maxent and LogisticRegression classifiers
SVMs
Combining classifiers
High information words and bigrams
Cross-fold validation
Analyzing a classifier
See also
8. Distributed Processing and Handling Large Datasets
Introduction
Distributed tagging with execnet
Getting ready
How to do it...
How it works...
There's more...
Creating multiple channels
Local versus remote gateways
See also
Distributed chunking with execnet
Getting ready
How to do it...
How it works...
There's more...
Python subprocesses
See also
Parallel list processing with execnet
How to do it...
How it works...
There's more...
See also
Storing a frequency distribution in Redis
Getting ready
How to do it...
How it works...
There's more...
See also
Storing a conditional frequency distribution in Redis
Getting ready
How to do it...
How it works...
There's more...
See also
Storing an ordered dictionary in Redis
Getting ready
How to do it...
How it works...
There's more...
See also
Distributed word scoring with Redis and execnet
Getting ready
How to do it...
How it works...
There's more...
See also
9. Parsing Specific Data Types
Introduction
Parsing dates and times with dateutil
Getting ready
How to do it...
How it works...
There's more...
See also
Timezone lookup and conversion
Getting ready
How to do it...
How it works...
There's more...
Local timezone
Custom offsets
See also
Extracting URLs from HTML with lxml
Getting ready
How to do it...
How it works...
There's more...
Extracting links directly
Parsing HTML from URLs or files
Extracting links with XPaths
See also
Cleaning and stripping HTML
Getting ready
How to do it...
How it works...
There's more...
See also
Converting HTML entities with BeautifulSoup
Getting ready
How to do it...
How it works...
There's more...
Extracting URLs with BeautifulSoup
See also
Detecting and converting character encodings
Getting ready
How to do it...
How it works...
There's more...
Converting to ASCII
UnicodeDammit conversion
See also
A. Penn Treebank Part-of-speech Tags
3. Module 3
1. Working with Strings
Tokenization
Tokenization of text into sentences
Tokenization of text in other languages
Tokenization of sentences into words
Tokenization using TreebankWordTokenizer
Tokenization using regular expressions
Normalization
Eliminating punctuation
Conversion into lowercase and uppercase
Dealing with stop words
Calculate stopwords in English
Substituting and correcting tokens
Replacing words using regular expressions
Example of the replacement of a text with another text
Performing substitution before tokenization
Dealing with repeating characters
Example of deleting repeating characters
Replacing a word with its synonym
Example of substituting word a with its synonym
Applying Zipf's law to text
Similarity measures
Applying similarity measures using the edit distance algorithm
Applying similarity measures using Jaccard's Coefficient
Applying similarity measures using the Smith Waterman distance
Other string similarity metrics
Summary
2. Statistical Language Modeling
Understanding word frequency
Develop MLE for a given text
Hidden Markov Model estimation
Applying smoothing on the MLE model
Add-one smoothing
Good Turing
Kneser Ney estimation
Witten Bell estimation
Develop a back-off mechanism for MLE
Applying interpolation on data to get mix and match
Evaluate a language model through perplexity
Applying Metropolis-Hastings in modeling languages
Applying Gibbs sampling in language processing
Summary
3. Morphology – Getting Our Feet Wet
Introducing morphology
Understanding stemmer
Understanding lemmatization
Developing a stemmer for non-English language
Morphological analyzer
Morphological generator
Search engine
Summary
4. Parts-of-Speech Tagging – Identifying Words
Introducing parts-of-speech tagging
Default tagging
Creating POS-tagged corpora
Selecting a machine learning algorithm
Statistical modeling involving the n-gram approach
Developing a chunker using pos-tagged corpora
Summary
5. Parsing – Analyzing Training Data
Introducing parsing
Treebank construction
Extracting Context Free Grammar (CFG) rules from Treebank
Creating a probabilistic Context Free Grammar from CFG
CYK chart parsing algorithm
Earley chart parsing algorithm
Summary
6. Semantic Analysis – Meaning Matters
Introducing semantic analysis
Introducing NER
A NER system using Hidden Markov Model
Training NER using Machine Learning Toolkits
NER using POS tagging
Generation of the synset id from Wordnet
Disambiguating senses using Wordnet
Summary
7. Sentiment Analysis – I Am Happy
Introducing sentiment analysis
Sentiment analysis using NER
Sentiment analysis using machine learning
Evaluation of the NER system
Summary
8. Information Retrieval – Accessing Information
Introducing information retrieval
Stop word removal
Information retrieval using a vector space model
Vector space scoring and query operator interaction
Developing an IR system using latent semantic indexing
Text summarization
Question-answering system
Summary
9. Discourse Analysis – Knowing Is Believing
Introducing discourse analysis
Discourse analysis using Centering Theory
Anaphora resolution
Summary
10. Evaluation of NLP Systems – Analyzing Performance
The need for evaluation of NLP systems
Evaluation of NLP tools (POS taggers, stemmers, and morphological analyzers)
Parser evaluation using gold data
Evaluation of IR system
Metrics for error identification
Metrics based on lexical matching
Metrics based on syntactic matching
Metrics using shallow semantic matching
Summary
B. Bibliography
Index
Natural Language Processing: Python and NLTK
Natural Language Processing: Python and NLTK
Learn to build expert NLP and machine learning projects using NLTK and other Python libraries
A course in three modules
BIRMINGHAM - MUMBAI
Natural Language Processing: Python and NLTK
Copyright © 2016 Packt Publishing
All rights reserved. No part of this course may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this course to ensure the accuracy of the information presented. However, the information contained in this course is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this course.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this course by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Published on: November 2016
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-78728-510-1
www.packtpub.com
Credits
Authors
Nitin Hardeniya
Jacob Perkins
Deepti Chopra
Nisheeth Joshi
Iti Mathur
Reviewers
Afroz Hussain
Sujit Pal
Kumar Raj
Patrick Chan
Mohit Goenka
Lihang Li
Maurice HT Ling
Jing (Dave) Tian
Arturo Argueta
Content Development Editor
Aishwarya Pandere
Production Coordinator
Arvindkumar Gupta
Preface
NLTK is one of the most popular and widely used libraries in the natural language processing (NLP) community. The beauty of NLTK lies in its simplicity, where most complex NLP tasks can be implemented using a few lines of code. Start off by learning how to tokenize text into component words. Explore and make use of the WordNet language dictionary. Learn how and when to stem or lemmatize words. Discover various ways to replace words and perform spelling correction. Create your own custom text corpora and corpus readers, including a MongoDB-backed corpus. Use part-of-speech taggers to annotate words with their parts of speech. Create and transform chunked phrase trees using partial parsing. Dig into feature extraction for text classification and sentiment analysis. Learn how to do parallel and distributed text processing, and how to store word distributions in Redis.
This learning path will teach you all that and more, in a hands-on learn-by-doing manner. Become an expert in using NLTK for Natural Language Processing with this useful companion.
What this learning path covers
Module 1, NLTK Essentials, talks about all the preprocessing steps required in any text mining/NLP task. In this module, we discuss tokenization, stemming, stop word removal, and other text cleansing processes in detail and how easy it is to implement these in NLTK.
Module 2, Python 3 Text Processing with NLTK 3 Cookbook, explains how to use corpus readers and create custom corpora. It also covers how to use some of the corpora that come with NLTK. It covers the chunking process, also known as partial parsing, which can identify phrases and named entities in a sentence. It also explains how to train your own custom chunker and create specific named entity recognizers.
Module 3, Mastering Natural Language Processing with Python, covers how to calculate word frequencies and perform various language modeling techniques. It also talks about the concept and application of Shallow Semantic Analysis (that is, NER) and WSD using Wordnet.
It will help you understand and apply the concepts of Information Retrieval and text summarization.
What you need for this learning path
Module 1:
We need the following software for this module:
Module 2:
You will need Python 3 and the listed Python packages. For this learning path, the author used Python 3.3.5. To install the packages, you can use pip (https://pypi.python.org/pypi/pip/). The following is the list of packages, in requirements format, with the version numbers used while writing this learning path (an installation sketch follows the list):
NLTK>=3.0a4
pyenchant>=1.6.5
lockfile>=0.9.1
numpy>=1.8.0
scipy>=0.13.0
scikit-learn>=0.14.1
execnet>=1.1
pymongo>=2.6.3
redis>=2.8.0
lxml>=3.2.3
beautifulsoup4>=4.3.2
python-dateutil>=2.0
charade>=1.0.3
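One way to install all of these at once (an installation sketch, not part of the original text) is to save the preceding list to a requirements.txt file and point pip at it:
$ pip install -r requirements.txt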
You will also need NLTK-Trainer, which is available at https://github.com/japerk/nltk-trainer.
Beyond Python, there are a couple of recipes that use MongoDB and Redis, both NoSQL databases. These can be downloaded from http://www.mongodb.org/ and http://redis.io/, respectively.
Module 3:
For all the chapters, Python 2.7 or 3.2+ is used. NLTK 3.0 must be installed, on either a 32-bit or a 64-bit machine. The required operating system is Windows, Mac, or Unix.
Who this learning path is for
If you are an NLP or machine learning enthusiast and an intermediate Python programmer who wants to quickly master NLTK for natural language processing, then this Learning Path will do you a lot of good. Students of linguistics and semantic/sentiment analysis professionals will find it invaluable.
Reader feedback
Feedback from our readers is always welcome. Let us know what you think about this course—what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.
To send us general feedback, simply e-mail <[email protected]>, and mention the course's title in the subject of your message.
If there is a topic that you have expertise in and you are interested in either writing or contributing to a course, see our author guide at www.packtpub.com/authors.
Customer support
Now that you are the proud owner of a Packt course, we have a number of things to help you to get the most from your purchase.
Downloading the example code
You can download the example code files for this course from your account at http://www.packtpub.com. If you purchased this course elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.
You can download the code files by following these steps:
Log in or register to our website using your e-mail address and password.
Hover the mouse pointer on the SUPPORT tab at the top.
Click on Code Downloads & Errata.
Enter the name of the course in the Search box.
Select the course for which you're looking to download the code files.
Choose from the drop-down menu where you purchased this course from.
Click on Code Download.
You can also download the code files by clicking on the Code Files button on the course's webpage at the Packt Publishing website. This page can be accessed by entering the course's name in the Search box. Please note that you need to be logged in to your Packt account.
Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:
WinRAR / 7-Zip for Windows
Zipeg / iZip / UnRarX for Mac
7-Zip / PeaZip for Linux
The code bundle for the course is also hosted on GitHub at https://github.com/PacktPublishing/Natural-Language-Processing-Python-and-NLTK. We also have other code bundles from our rich catalog of books, videos and courses available at https://github.com/PacktPublishing/. Check them out!
Errata
Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this course. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your course, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.
To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.
Piracy
Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.
Please contact us at <[email protected]> with a link to the suspected pirated material.
We appreciate your help in protecting our authors and our ability to bring you valuable content.
Questions
If you have a problem with any aspect of this course, you can contact us at <[email protected]>, and we will do our best to address the problem.
Part 1. Module 1
NLTK Essentials
Build cool NLP and machine learning applications using NLTK and other Python libraries
Chapter 1. Introduction to Natural Language Processing
I will start with the introduction to Natural Language Processing (NLP). Language is a central part of our day to day life, and it's so interesting to work on any problem related to languages. I hope this book will give you a flavor of NLP, will motivate you to learn some amazing concepts of NLP, and will inspire you to work on some of the challenging NLP applications.
Put simply, the study of processing language is called NLP. People who are deeply involved in the study of language are linguists, while a computational linguist studies the processing of languages with the application of computation. Essentially, a computational linguist is a computer scientist who has enough understanding of languages and can apply computational skills to model different aspects of a language. While computational linguists address the theoretical aspects of language, NLP is nothing but the application of computational linguistics.
NLP is more about applying computers to different language nuances, and building real-world applications using NLP techniques. In a practical context, NLP is analogous to teaching a language to a child. Some of the most common tasks, like understanding words and sentences and forming grammatically and structurally correct sentences, are very natural to humans. In NLP, some of these tasks translate to tokenization, chunking, part of speech tagging, parsing, machine translation, and speech recognition, and most of them are still among the toughest challenges for computers. I will be talking more about the practical side of NLP, assuming that we all have some background in NLP. The reader is expected to have a minimal understanding of a programming language and an interest in NLP and language.
By the end of this chapter, we want readers to have:
A brief understanding of NLP and related concepts
Installed Python, NLTK, and other libraries
Written some very basic Python and NLTK code snippets
If you have never heard the term NLP, then please take some time to read any of the books mentioned here—just for an initial few chapters. A quick reading of at least the Wikipedia page relating to NLP is a must:
Speech and Language Processing by Daniel Jurafsky and James H. Martin
Foundations of Statistical Natural Language Processing by Christopher D. Manning and Hinrich Schütze
Why learn NLP?
I will start my discussion with Gartner's new hype cycle, where you can clearly see NLP near the top of the cycle. Currently, NLP is one of the rarest skill sets required in the industry. After the advent of big data, the major challenge is that we need more people who are good not just with structured data, but also with semi-structured or unstructured data. We are generating petabytes of weblogs, tweets, Facebook feeds, chats, e-mails, and reviews. Companies are collecting all these different kinds of data for better customer targeting and meaningful insights. To process all these unstructured data sources, we need people who understand NLP.
We are in the age of information; we can't even imagine our lives without Google. We use Siri for most of our basic tasks. We use spam filters to filter out spam e-mails. We need a spell checker in our Word documents. There are many examples of real-world NLP applications around us.
Image is taken from http://www.gartner.com/newsroom/id/2819918
Let me also give you some examples of the amazing NLP applications that you can use, but are not aware that they are built on NLP:
Spell correction (MS Word/ any other editor)
Search engines (Google, Bing, Yahoo, wolframalpha)
Speech engines (Siri, Google Voice)
Spam classifiers (All e-mail services)
News feeds (Google, Yahoo!, and so on)
Machine translation (Google Translate, and so on)
IBM Watson
Building these applications requires a very specific skill set with a great understanding of language and tools to process the language efficiently. So it's not just hype that makes NLP one of the most niche areas, but it's the kind of application that can be created using NLP that makes it one of the most unique skills to have.
To achieve some of the above applications and other basic NLP preprocessing, there are many open source tools available. Some of them were developed by organizations to build their own NLP applications and then open-sourced, while others started as open source projects. Here is a small list of available NLP tools:
GATE
Mallet
Open NLP
UIMA
Stanford toolkit
Gensim
Natural Language Tool Kit (NLTK)
Most of these tools are written in Java and have similar functionalities. Some of them are robust and offer a wide variety of NLP tools. However, when it comes to ease of use and explanation of the concepts, NLTK scores really high. NLTK is also a very good learning kit because the learning curve of Python (in which NLTK is written) is very fast. NLTK has incorporated most NLP tasks; it's very elegant and easy to work with. For all these reasons, NLTK has become one of the most popular libraries in the NLP community.
I am assuming all you guys know Python. If not, I urge you to learn Python. There are many basic tutorials on Python available online. There are lots of books also available that give you a quick overview of the language. We will also look into some of the features of Python, while going through the different topics. But for now, even if you only know the basics of Python, such as lists, strings, regular expressions, and basic I/O, you should be good to go.
Note
Python can be installed from the following website:
https://www.python.org/downloads/
http://continuum.io/downloads
https://store.enthought.com/downloads/
I would recommend using the Anaconda or Canopy Python distributions. The reason is that these distributions come with bundled libraries, such as scipy, numpy, scikit-learn, and so on, which are used for data analysis and other applications related to NLP and related fields. Even NLTK is part of these distributions.
Note
Please follow the instructions and install NLTK and NLTK data:
http://www.nltk.org/install.html
Let's test everything.
Open the terminal on your respective operating systems. Then run:
$ python
This should open the Python interpreter:
Python 2.6.6 (r266:84292, Oct 15 2013, 07:32:41)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>>
I hope you got a similar-looking output here. There is a chance that you received a different-looking output, but ideally you will see the version of Python (I recommend 2.7), the compiler GCC, and the operating system details. I know the latest version of Python is in the 3.0+ range, but, as with any other open source software, we should try to stay on a more stable version rather than jumping to the latest version. If you have moved to Python 3.0+, please have a look at the link below to gain an understanding of which new features have been added:
https://docs.python.org/3/whatsnew/3.4.html.
UNIX based systems will have Python as a default program. Windows users can set the path to get Python working. Let's check whether we have installed NLTK correctly:
>>>import nltk
>>>print "Python and NLTK installed successfully"
Python and NLTK installed successfully
Hey, we are good to go!
Let's start playing with Python!
We won't be diving too deep into Python; however, for the benefit of the audience, we'll take a quick five-minute tour of the Python essentials. We'll talk about the basics of data structures, some frequently used functions, and the general constructs of Python in the next few sections.
Note
I highly recommend the two hour Google Python class. https://developers.google.com/edu/python should be good enough to start. Please go through the Python website https://www.python.org/ for more tutorials and other resources.
Lists
Lists are one of the most commonly used data structures in Python. They are pretty much comparable to arrays in other programming languages. Let's start with some of the most important functions that a Python list provides.
Try the following in the Python console:
>>> lst=[1,2,3,4]  # mostly like arrays in typical languages
>>> print lst
[1, 2, 3, 4]
Python lists can be accessed using much more flexible indexing. Here are some examples:
>>>print 'First element' +lst[0]
You will get an error message like this:
TypeError: cannot concatenate 'str' and 'int' objects
The reason is that Python is an interpreted language and checks the types of variables at the time it evaluates the expression. We need not initialize and declare the type of a variable at the time of declaration. Our list contains integer objects, which cannot be concatenated to a string in the print statement; it only accepts string objects. For this reason, we need to convert the list elements to strings. This process is also known as type casting.
>>>print 'First element :' +str(lst[0])
>>>print 'last element :' +str(lst[-1])
>>>print 'first three elements :' +str(lst[0:3])
>>>print 'last three elements :' +str(lst[-3:])
First element :1
last element :4
first three elements :[1, 2, 3]
last three elements :[2, 3, 4]
Helping yourself
The best way to learn more about different data types and functions is to use the built-in helpers help() and dir().
The dir(python object) command is used to list all the attributes of the given Python object. For example, if you pass a list object, it will list all the cool things you can do with lists:
>>>dir(lst) >>>' , '.join(dir(lst)) '__add__ , __class__ , __contains__ , __delattr__ , __delitem__ , __delslice__ , __doc__ , __eq__ , __format__ , __ge__ , __getattribute__ , __getitem__ , __getslice__ , __gt__ , __hash__ , __iadd__ , __imul__ , __init__ , __iter__ , __le__ , __len__ , __lt__ , __mul__ , __ne__ , __new__ , __reduce__ , __reduce_ex__ , __repr__ , __reversed__ , __rmul__ , __setattr__ , __setitem__ , __setslice__ , __sizeof__ , __str__ , __subclasshook__ , append , count , extend , index , insert , pop , remove , reverse , sort'
With the help(python object) command, we can get detailed documentation for the given Python object, and also give a few examples of how to use the Python object:
>>>help(lst.index) Help on built-in function index: index(...) L.index(value, [start, [stop]]) -> integer -- return first index of value. This function raises a ValueError if the value is not present.
So help and dir can be used on any Python data type, and are a very nice way to learn about the function and other details of that object. It also provides you with some basic examples to work with, which I found useful in most cases.
Strings in Python are very similar to other languages, but the manipulation of strings is one of the main features of Python. It's immensely easy to work with strings in Python. Even something very simple, like splitting a string, takes effort in Java / C, while you will see how easy it is in Python.
Using the help function that we used previously, you can get help for any Python object and any function. Let's have some more examples with the other most commonly used data type strings:
Split: This is a method to split the string based on some delimiter. If no argument is provided, it assumes whitespace as the delimiter.
>>> mystring="Monty Python ! and the holy Grail ! \n"
>>> print mystring.split()
['Monty', 'Python', '!', 'and', 'the', 'holy', 'Grail', '!']
Strip: This is a method that removes leading and trailing whitespace, such as '\n' and '\n\r', from the string:
>>> print mystring.strip()
Monty Python ! and the holy Grail !
If you notice, the '\n' character is stripped off. There are also the rstrip() and lstrip() methods to strip whitespace from the right and left of the string, respectively.
Upper/Lower: We can change the case of the string using these methods:
>>> print mystring.upper()
MONTY PYTHON ! AND THE HOLY GRAIL !
Replace: This will help you substitute a substring from the string:
>>> print mystring.replace('!','')
Monty Python  and the holy Grail
There are tons of string functions. I have just talked about some of the most frequently used.
Note
Please look at the following link for more functions and examples:
https://docs.python.org/2/library/string.html.
Regular expressions
One other important skill for an NLP enthusiast is working with regular expressions. A regular expression is effectively pattern matching on strings. We heavily use pattern extraction to get meaningful information from large amounts of messy text data. The following are all the regular expressions you need (a few of them are exercised in the short sketch after the list). I haven't used any regular expressions beyond these in my entire life:
. (a period): This expression matches any single character except the newline \n.
\w: This expression matches a word character (a letter, digit, or underscore), equivalent to [a-zA-Z0-9_].
\W (upper case W) matches any non-word character.
\s: This expression (lowercase s) matches a single whitespace character - space, newline, return, tab, form [\n\r\t\f].
\S: This expression matches any non-whitespace character.
\t: This expression performs a tab operation.
\n: This expression is used for a newline character.
\r: This expression is used for a return character.
\d: Decimal digit [0-9].
^: This expression is used at the start of the string.
$: This expression is used at the end of the string.
\: This expression is used to nullify the specialness of the special character. For example, you want to match the $ symbol, then add \ in front of it.
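As a quick illustration (the sample string here is made up for this sketch and is not from the original text), a few of these patterns in action with the re module:
>>>import re
>>>sample = 'Order #42 costs $7\n'
>>>re.findall('\d+', sample)        # \d matches decimal digits
['42', '7']
>>>re.search('^Order', sample) is not None    # ^ anchors the match at the start
True
>>>re.findall('\$\d', sample)       # \ escapes the special $ character
['$7']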
Let's search for something in the running example, where mystring is the same string object, and we will try to look for some patterns in that. A substring search is one of the common use-cases of the re module. Let's implement this:
>>># We have to import the re module to use regular expressions
>>>import re
>>>if re.search('Python',mystring):
>>>    print "We found python"
>>>else:
>>>    print "NO"
Once this is executed, we get the message as follows:
We found python
We can do more pattern finding using regular expressions. One of the common functions that is used in finding all the patterns in a string is findall. It will look for the given patterns in the string, and will give you a list of all the matched objects:
>>>import re
>>>print re.findall('!',mystring)
['!', '!']
As we can see, there were two instances of '!' in mystring, and findall returned both objects in a list.
Dictionaries
The other most commonly used data structure is the dictionary, also known as an associative array/memory in other programming languages. Dictionaries are data structures that are indexed by keys, which can be any immutable type; strings and numbers, for example, can always be keys.
Dictionaries are a handy data structure that is widely used across programming languages to implement many algorithms. Python dictionaries are one of the most elegant implementations of hash tables in any programming language. It's easy to work with a dictionary, and the great thing is that with a few nuggets of code you can build a very complex data structure, while the same task could take a lot of time and coding effort in other languages. This gives the programmer more time to focus on algorithms rather than on the data structure itself.
I am using one of the very common use cases of dictionaries: getting the frequency distribution of words in a given text. With just the few lines of the following code, you can get the frequency of words. Just try the same task in any other language and you will understand how amazing Python is:
>>># declare a dictionary
>>>word_freq={}
>>>for tok in mystring.split():
>>>    if tok in word_freq:
>>>        word_freq[tok]+=1
>>>    else:
>>>        word_freq[tok]=1
>>>print word_freq
{'!': 2, 'and': 1, 'holy': 1, 'Python': 1, 'Grail': 1, 'the': 1, 'Monty': 1}
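For comparison only (this is not from the original text), Python's standard library (2.7+) provides collections.Counter, which builds the same frequency table in a single line:
>>>from collections import Counter
>>>word_freq = Counter(mystring.split())   # count tokens from the same mystring as above
>>>print word_freq['!']                    # counts can be looked up like a dictionary
2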
Writing functions
As in any other programming language, Python has its own way of writing functions. A function in Python starts with the keyword def, followed by the function name and parentheses (). As in any other programming language, any arguments and their types should be placed within these parentheses. The actual code starts after a colon (:). The initial lines of the body are typically a docstring (comment), then comes the code body, and the function typically ends with a return statement. For example, in the following code, the function wordfreq starts with the def keyword, takes a single string argument, and has no explicit return statement (it just prints the result).
>>>import sys
>>>def wordfreq (mystring):
>>>    '''
>>>    Function to generate the frequency distribution of the given text
>>>    '''
>>>    print mystring
>>>    word_freq={}
>>>    for tok in mystring.split():
>>>        if tok in word_freq:
>>>            word_freq[tok]+=1
>>>        else:
>>>            word_freq[tok]=1
>>>    print word_freq
>>>def main():
>>>    str="This is my first Python program"
>>>    wordfreq(str)
>>>if __name__ == '__main__':
>>>    main()
This is the same code that we wrote in the previous section; the idea of writing it in the form of a function is to make the code reusable and readable. The interpreter style of writing Python is also very common, but for writing big programs it is good practice to use functions/classes and stick to one programming paradigm. We also want you to write and run your first Python program. You need to follow these steps to achieve this:
Open an empty Python file, mywordfreq.py, in your preferred text editor.
Copy the code from the snippet above into the file (without the >>> prompts).
Open the command prompt in your Operating system.
Run the following command:
$ python mywordfreq.py
The output should be:
{'This': 1, 'is': 1, 'my': 1, 'first': 1, 'Python': 1, 'program': 1}
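As an optional extension (not part of the original example), the script could read its input from the command line via sys.argv instead of hard-coding the string; a sketch of a modified main():
import sys

def main():
    # use the text passed on the command line, or fall back to a default
    text = ' '.join(sys.argv[1:]) or 'This is my first Python program'
    wordfreq(text)

if __name__ == '__main__':
    main()
With this change, running $ python mywordfreq.py counting words in this sentence would print frequencies for the supplied sentence instead.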
Now you have a very basic understanding of some common data structures that Python provides, and you can write a full Python program and run it. I think this much of an introduction to Python is enough to manage the initial chapters.
Note
Please have a look at some Python tutorials on the following website to learn more commands on Python:
https://wiki.python.org/moin/BeginnersGuide
Diving into NLTK
Instead of going further into the theoretical aspects of natural language processing, let's start with a quick dive into NLTK. I am going to start with some basic example use cases of NLTK. There is a good chance that you have already done something similar. First, I will give a typical Python programmer approach, and then move on to NLTK for a much more efficient, robust, and clean solution.
We will start analyzing with some example text content. For the current example, I have taken the content from Python's home page.
>>>import urllib2
>>># urllib2 is used to download the html content of the web link
>>>response = urllib2.urlopen('http://python.org/')
>>># You can read the entire content using the read() method
>>>html = response.read()
>>>print len(html)
47020
We don't have any clue about the kind of topics that are discussed in this URL, so let's start with an exploratory data analysis (EDA). Typically, in a text domain, EDA can have many meanings, but we will go with a simple case: what kinds of terms dominate the document? What are the topics? How frequent are they? The process will involve some preprocessing steps. We will try to do this first in a pure Python way, and then we will do it using NLTK.
Let's start by cleaning the HTML tags. One way to do this is to select just the tokens, including numbers and characters. Anybody who has worked with regular expressions should be able to convert an HTML string into a list of tokens:
>>># Split the string on whitespace
>>>tokens = [tok for tok in html.split()]
>>>print "Total no of tokens :" + str(len(tokens))
>>># First 100 tokens
>>>print tokens[0:100]
Total no of tokens :2860
['', '', '', 'type=text/css', 'media="not', 'print,', 'braille,' ...]
As you can see, there is an excess of html tags and other unwanted characters when we use the preceding method. A cleaner version of the same task will look something like this:
>>>import re
>>># using the re.split function
>>># https://docs.python.org/2/library/re.html
>>>tokens = re.split('\W+',html)
>>>print len(tokens)
>>>print tokens[0:100]
5787
['', 'doctype', 'html', 'if', 'lt', 'IE', '7', 'html', 'class', 'no', 'js', 'ie6', 'lt', 'ie7', 'lt', 'ie8', 'lt', 'ie9', 'endif', 'if', 'IE', '7', 'html', 'class', 'no', 'js', 'ie7', 'lt', 'ie8', 'lt', 'ie9', 'endif', 'if', 'IE', '8', 'msapplication', 'tooltip', 'content', 'The', 'official', 'home', 'of', 'the', 'Python', 'Programming', 'Language', 'meta', 'name', 'apple' ...]
This looks much cleaner now. But still you can do more; I leave it to you to try to remove as much noise as you can. You can clean some HTML tags that are still popping up, You probably also want to look for word length as a criteria and remove words that have a length one—it will remove elements like 7, 8, and so on, which are just noise in this case. Now instead writing some of these preprocessing steps from scratch let's move to NLTK for the same task. There is a function called clean_html() that can do all the cleaning that we were looking for:
>>>import nltk
>>># http://www.nltk.org/api/nltk.html#nltk.util.clean_html
>>>clean = nltk.clean_html(html)
>>># clean holds the entire string with all the html noise removed
>>>tokens = [tok for tok in clean.split()]
>>>print tokens[:100]
['Welcome', 'to', 'Python.org', 'Skip', 'to', 'content', '▼', 'Close', 'Python', 'PSF', 'Docs', 'PyPI', 'Jobs', 'Community', '▲', 'The', 'Python', 'Network', '≡', 'Menu', 'Arts', 'Business' ...]
Cool, right? This definitely is much cleaner and easier to do.
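One caveat: clean_html() was removed in later NLTK releases (NLTK 3 raises an error and points users to BeautifulSoup instead). If you are on a newer version, a roughly equivalent sketch, assuming the beautifulsoup4 package is installed, would be:
>>>from bs4 import BeautifulSoup
>>># strip the markup and keep only the visible text
>>>clean = BeautifulSoup(html, 'html.parser').get_text()
>>>tokens = [tok for tok in clean.split()]
>>>print tokens[:100]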
Let's try to get the frequency distribution of these terms. First, let's do it the Pure Python way, then I will tell you the NLTK recipe.
>>>import operator
>>>freq_dis={}
>>>for tok in tokens:
>>>    if tok in freq_dis:
>>>        freq_dis[tok]+=1
>>>    else:
>>>        freq_dis[tok]=1
>>># We want to sort this dictionary on values ( freq in this case )
>>>sorted_freq_dist= sorted(freq_dis.items(), key=operator.itemgetter(1), reverse=True)
>>>print sorted_freq_dist[:25]
[('Python', 55), ('>>>', 23), ('and', 21), ('to', 18), (',', 18), ('the', 14), ('of', 13), ('for', 12), ('a', 11), ('Events', 11), ('News', 11), ('is', 10), ('2014-', 10), ('More', 9), ('#', 9), ('3', 9), ('=', 8), ('in', 8), ('with', 8), ('Community', 7), ('The', 7), ('Docs', 6), ('Software', 6), (':', 6), ('3:', 5), ('that', 5), ('sum', 5)]
Naturally, as this is Python's home page, Python and the (>>>) interpreter symbol are the most common terms, also giving a sense of the website.
A better and more efficient approach is to use NLTK's FreqDist() function. For this, we will take a look at the same code we developed before:
>>>import nltk
>>>Freq_dist_nltk=nltk.FreqDist(tokens)
>>>print Freq_dist_nltk
>>>for k,v in Freq_dist_nltk.items():
>>>    print str(k)+':'+str(v)
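As a side note (version-dependent, and not from the original text), in NLTK 3 FreqDist is a subclass of collections.Counter, so the top terms can also be pulled out directly:
>>>Freq_dist_nltk.most_common(25)   # list of (token, count) pairs, highest counts first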
Tip
Downloading the example code
You can download the example code files from your account at http://www.packtpub.com for all the Packt Publishing books you have purchased. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.
Let's now do some more funky things. Let's plot this:
>>>Freq_dist_nltk.plot(50, cumulative=False)
>>># below is the plot for the frequency distributions
We can see that the cumulative frequency is growing, and at some point the curve goes into a long tail. Still, there is some noise; there are words like the, of, for, and =. These are useless words, and there is a terminology for them: stop words, words like the, a, an, and so on. Articles and pronouns are generally present in most documents and hence are not discriminative enough to be informative. In most NLP and information retrieval tasks, people generally remove stop words. Let's go back again to our running example:
>>>stopwords=[word.strip().lower() for word in open("PATH/english.stop.txt")]
>>>clean_tokens=[tok for tok in tokens if len(tok.lower())>1 and (tok.lower() not in stopwords)]
>>>Freq_dist_nltk=nltk.FreqDist(clean_tokens)
>>>Freq_dist_nltk.plot(50, cumulative=False)
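If you don't have a stop word file handy, NLTK also ships a stopwords corpus (fetch it with nltk.download() first); a sketch of the same filtering using it, not part of the original example:
>>>from nltk.corpus import stopwords
>>>stoplist = set(stopwords.words('english'))   # built-in English stop word list
>>>clean_tokens = [tok for tok in tokens if len(tok.lower()) > 1 and tok.lower() not in stoplist]
>>>Freq_dist_nltk = nltk.FreqDist(clean_tokens)
>>>Freq_dist_nltk.plot(50, cumulative=False)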
Note
Please go to http://www.wordle.net/advanced for more word clouds.
Looks much cleaner now! After finishing this much, you can go to Wordle, put the distribution in the form of a CSV, and you should be able to get something like this word cloud:
Your turn
Please try the