About this ebook

If you are an NLP or machine learning enthusiast and an intermediate Python programmer who wants to quickly master NLTK for natural language processing, then this Learning Path will do you a lot of good. Students of linguistics and semantic/sentiment analysis professionals will find it invaluable.
Language: English
Release date: November 22, 2016
ISBN: 9781787287846

    Book preview

    Natural Language Processing - Jacob Perkins

    Table of Contents

    Natural Language Processing: Python and NLTK

    Natural Language Processing: Python and NLTK

    Credits

    Preface

    What this learning path covers

    What you need for this learning path

    Who this learning path is for

    Reader feedback

    Customer support

    Downloading the example code

    Errata

    Piracy

    Questions

    1. Module 1

    1. Introduction to Natural Language Processing

    Why learn NLP?

    Let's start playing with Python!

    Lists

    Helping yourself

    Regular expressions

    Dictionaries

    Writing functions

    Diving into NLTK

    Your turn

    Summary

    2. Text Wrangling and Cleansing

    What is text wrangling?

    Text cleansing

    Sentence splitter

    Tokenization

    Stemming

    Lemmatization

    Stop word removal

    Rare word removal

    Spell correction

    Your turn

    Summary

    3. Part of Speech Tagging

    What is Part of speech tagging

    Stanford tagger

    Diving deep into a tagger

    Sequential tagger

    N-gram tagger

    Regex tagger

    Brill tagger

    Machine learning based tagger

    Named Entity Recognition (NER)

    NER tagger

    Your Turn

    Summary

    4. Parsing Structure in Text

    Shallow versus deep parsing

    The two approaches in parsing

    Why we need parsing

    Different types of parsers

    A recursive descent parser

    A shift-reduce parser

    A chart parser

    A regex parser

    Dependency parsing

    Chunking

    Information extraction

    Named-entity recognition (NER)

    Relation extraction

    Summary

    5. NLP Applications

    Building your first NLP application

    Other NLP applications

    Machine translation

    Statistical machine translation

    Information retrieval

    Boolean retrieval

    Vector space model

    The probabilistic model

    Speech recognition

    Text classification

    Information extraction

    Question answering systems

    Dialog systems

    Word sense disambiguation

    Topic modeling

    Language detection

    Optical character recognition

    Summary

    6. Text Classification

    Machine learning

    Text classification

    Sampling

    Naive Bayes

    Decision trees

    Stochastic gradient descent

    Logistic regression

    Support vector machines

    The Random forest algorithm

    Text clustering

    K-means

    Topic modeling in text

    Installing gensim

    References

    Summary

    7. Web Crawling

    Web crawlers

    Writing your first crawler

    Data flow in Scrapy

    The Scrapy shell

    Items

    The Sitemap spider

    The item pipeline

    External references

    Summary

    8. Using NLTK with Other Python Libraries

    NumPy

    ndarray

    Indexing

    Basic operations

    Extracting data from an array

    Complex matrix operations

    Reshaping and stacking

    Random numbers

    SciPy

    Linear algebra

    eigenvalues and eigenvectors

    The sparse matrix

    Optimization

    pandas

    Reading data

    Series data

    Column transformation

    Noisy data

    matplotlib

    Subplot

    Adding an axis

    A scatter plot

    A bar plot

    3D plots

    External references

    Summary

    9. Social Media Mining in Python

    Data collection

    Twitter

    Data extraction

    Trending topics

    Geovisualization

    Influencers detection

    Facebook

    Influencer friends

    Summary

    10. Text Mining at Scale

    Different ways of using Python on Hadoop

    Python streaming

    Hive/Pig UDF

    Streaming wrappers

    NLTK on Hadoop

    A UDF

    Python streaming

    Scikit-learn on Hadoop

    PySpark

    Summary

    2. Module 2

    1. Tokenizing Text and WordNet Basics

    Introduction

    Tokenizing text into sentences

    Getting ready

    How to do it...

    How it works...

    There's more...

    Tokenizing sentences in other languages

    See also

    Tokenizing sentences into words

    How to do it...

    How it works...

    There's more...

    Separating contractions

    PunktWordTokenizer

    WordPunctTokenizer

    See also

    Tokenizing sentences using regular expressions

    Getting ready

    How to do it...

    How it works...

    There's more...

    Simple whitespace tokenizer

    See also

    Training a sentence tokenizer

    Getting ready

    How to do it...

    How it works...

    There's more...

    See also

    Filtering stopwords in a tokenized sentence

    Getting ready

    How to do it...

    How it works...

    There's more...

    See also

    Looking up Synsets for a word in WordNet

    Getting ready

    How to do it...

    How it works...

    There's more...

    Working with hypernyms

    Part of speech (POS)

    See also

    Looking up lemmas and synonyms in WordNet

    How to do it...

    How it works...

    There's more...

    All possible synonyms

    Antonyms

    See also

    Calculating WordNet Synset similarity

    How to do it...

    How it works...

    There's more...

    Comparing verbs

    Path and Leacock Chordorow (LCH) similarity

    See also

    Discovering word collocations

    Getting ready

    How to do it...

    How it works...

    There's more...

    Scoring functions

    Scoring ngrams

    See also

    2. Replacing and Correcting Words

    Introduction

    Stemming words

    How to do it...

    How it works...

    There's more...

    The LancasterStemmer class

    The RegexpStemmer class

    The SnowballStemmer class

    See also

    Lemmatizing words with WordNet

    Getting ready

    How to do it...

    How it works...

    There's more...

    Combining stemming with lemmatization

    See also

    Replacing words matching regular expressions

    Getting ready

    How to do it...

    How it works...

    There's more...

    Replacement before tokenization

    See also

    Removing repeating characters

    Getting ready

    How to do it...

    How it works...

    There's more...

    See also

    Spelling correction with Enchant

    Getting ready

    How to do it...

    How it works...

    There's more...

    The en_GB dictionary

    Personal word lists

    See also

    Replacing synonyms

    Getting ready

    How to do it...

    How it works...

    There's more...

    CSV synonym replacement

    YAML synonym replacement

    See also

    Replacing negations with antonyms

    How to do it...

    How it works...

    There's more...

    See also

    3. Creating Custom Corpora

    Introduction

    Setting up a custom corpus

    Getting ready

    How to do it...

    How it works...

    There's more...

    Loading a YAML file

    See also

    Creating a wordlist corpus

    Getting ready

    How to do it...

    How it works...

    There's more...

    Names wordlist corpus

    English words corpus

    See also

    Creating a part-of-speech tagged word corpus

    Getting ready

    How to do it...

    How it works...

    There's more...

    Customizing the word tokenizer

    Customizing the sentence tokenizer

    Customizing the paragraph block reader

    Customizing the tag separator

    Converting tags to a universal tagset

    See also

    Creating a chunked phrase corpus

    Getting ready

    How to do it...

    How it works...

    There's more...

    Tree leaves

    Treebank chunk corpus

    CoNLL2000 corpus

    See also

    Creating a categorized text corpus

    Getting ready

    How to do it...

    How it works...

    There's more...

    Category file

    Categorized tagged corpus reader

    Categorized corpora

    See also

    Creating a categorized chunk corpus reader

    Getting ready

    How to do it...

    How it works...

    There's more...

    Categorized CoNLL chunk corpus reader

    See also

    Lazy corpus loading

    How to do it...

    How it works...

    There's more...

    Creating a custom corpus view

    How to do it...

    How it works...

    There's more...

    Block reader functions

    Pickle corpus view

    Concatenated corpus view

    See also

    Creating a MongoDB-backed corpus reader

    Getting ready

    How to do it...

    How it works...

    There's more...

    See also

    Corpus editing with file locking

    Getting ready

    How to do it...

    How it works...

    4. Part-of-speech Tagging

    Introduction

    Default tagging

    Getting ready

    How to do it...

    How it works...

    There's more...

    Evaluating accuracy

    Tagging sentences

    Untagging a tagged sentence

    See also

    Training a unigram part-of-speech tagger

    How to do it...

    How it works...

    There's more...

    Overriding the context model

    Minimum frequency cutoff

    See also

    Combining taggers with backoff tagging

    How to do it...

    How it works...

    There's more...

    Saving and loading a trained tagger with pickle

    See also

    Training and combining ngram taggers

    Getting ready

    How to do it...

    How it works...

    There's more...

    Quadgram tagger

    See also

    Creating a model of likely word tags

    How to do it...

    How it works...

    There's more...

    See also

    Tagging with regular expressions

    Getting ready

    How to do it...

    How it works...

    There's more...

    See also

    Affix tagging

    How to do it...

    How it works...

    There's more...

    Working with min_stem_length

    See also

    Training a Brill tagger

    How to do it...

    How it works...

    There's more...

    Tracing

    See also

    Training the TnT tagger

    How to do it...

    How it works...

    There's more...

    Controlling the beam search

    Significance of capitalization

    See also

    Using WordNet for tagging

    Getting ready

    How to do it...

    How it works...

    See also

    Tagging proper names

    How to do it...

    How it works...

    See also

    Classifier-based tagging

    How to do it...

    How it works...

    There's more...

    Detecting features with a custom feature detector

    Setting a cutoff probability

    Using a pre-trained classifier

    See also

    Training a tagger with NLTK-Trainer

    How to do it...

    How it works...

    There's more...

    Saving a pickled tagger

    Training on a custom corpus

    Training with universal tags

    Analyzing a tagger against a tagged corpus

    Analyzing a tagged corpus

    See also

    5. Extracting Chunks

    Introduction

    Chunking and chinking with regular expressions

    Getting ready

    How to do it...

    How it works...

    There's more...

    Parsing different chunk types

    Parsing alternative patterns

    Chunk rule with context

    See also

    Merging and splitting chunks with regular expressions

    How to do it...

    How it works...

    There's more...

    Specifying rule descriptions

    See also

    Expanding and removing chunks with regular expressions

    How to do it...

    How it works...

    There's more...

    See also

    Partial parsing with regular expressions

    How to do it...

    How it works...

    There's more...

    The ChunkScore metrics

    Looping and tracing chunk rules

    See also

    Training a tagger-based chunker

    How to do it...

    How it works...

    There's more...

    Using different taggers

    See also

    Classification-based chunking

    How to do it...

    How it works...

    There's more...

    Using a different classifier builder

    See also

    Extracting named entities

    How to do it...

    How it works...

    There's more...

    Binary named entity extraction

    See also

    Extracting proper noun chunks

    How to do it...

    How it works...

    There's more...

    See also

    Extracting location chunks

    How to do it...

    How it works...

    There's more...

    See also

    Training a named entity chunker

    How to do it...

    How it works...

    There's more...

    See also

    Training a chunker with NLTK-Trainer

    How to do it...

    How it works...

    There's more...

    Saving a pickled chunker

    Training a named entity chunker

    Training on a custom corpus

    Training on parse trees

    Analyzing a chunker against a chunked corpus

    Analyzing a chunked corpus

    See also

    6. Transforming Chunks and Trees

    Introduction

    Filtering insignificant words from a sentence

    Getting ready

    How to do it...

    How it works...

    There's more...

    See also

    Correcting verb forms

    Getting ready

    How to do it...

    How it works...

    See also

    Swapping verb phrases

    How to do it...

    How it works...

    There's more...

    See also

    Swapping noun cardinals

    How to do it...

    How it works...

    See also

    Swapping infinitive phrases

    How to do it...

    How it works...

    There's more...

    See also

    Singularizing plural nouns

    How to do it...

    How it works...

    See also

    Chaining chunk transformations

    How to do it...

    How it works...

    There's more...

    See also

    Converting a chunk tree to text

    How to do it...

    How it works...

    There's more...

    See also

    Flattening a deep tree

    Getting ready

    How to do it...

    How it works...

    There's more...

    The cess_esp and cess_cat treebank

    See also

    Creating a shallow tree

    How to do it...

    How it works...

    See also

    Converting tree labels

    Getting ready

    How to do it...

    How it works...

    See also

    7. Text Classification

    Introduction

    Bag of words feature extraction

    How to do it...

    How it works...

    There's more...

    Filtering stopwords

    Including significant bigrams

    See also

    Training a Naive Bayes classifier

    Getting ready

    How to do it...

    How it works...

    There's more...

    Classification probability

    Most informative features

    Training estimator

    Manual training

    See also

    Training a decision tree classifier

    How to do it...

    How it works...

    There's more...

    Controlling uncertainty with entropy_cutoff

    Controlling tree depth with depth_cutoff

    Controlling decisions with support_cutoff

    See also

    Training a maximum entropy classifier

    Getting ready

    How to do it...

    How it works...

    There's more...

    Megam algorithm

    See also

    Training scikit-learn classifiers

    Getting ready

    How to do it...

    How it works...

    There's more...

    Comparing Naive Bayes algorithms

    Training with logistic regression

    Training with LinearSVC

    See also

    Measuring precision and recall of a classifier

    How to do it...

    How it works...

    There's more...

    F-measure

    See also

    Calculating high information words

    How to do it...

    How it works...

    There's more...

    The MaxentClassifier class with high information words

    The DecisionTreeClassifier class with high information words

    The SklearnClassifier class with high information words

    See also

    Combining classifiers with voting

    Getting ready

    How to do it...

    How it works...

    See also

    Classifying with multiple binary classifiers

    Getting ready

    How to do it...

    How it works...

    There's more...

    See also

    Training a classifier with NLTK-Trainer

    How to do it...

    How it works...

    There's more...

    Saving a pickled classifier

    Using different training instances

    The most informative features

    The Maxent and LogisticRegression classifiers

    SVMs

    Combining classifiers

    High information words and bigrams

    Cross-fold validation

    Analyzing a classifier

    See also

    8. Distributed Processing and Handling Large Datasets

    Introduction

    Distributed tagging with execnet

    Getting ready

    How to do it...

    How it works...

    There's more...

    Creating multiple channels

    Local versus remote gateways

    See also

    Distributed chunking with execnet

    Getting ready

    How to do it...

    How it works...

    There's more...

    Python subprocesses

    See also

    Parallel list processing with execnet

    How to do it...

    How it works...

    There's more...

    See also

    Storing a frequency distribution in Redis

    Getting ready

    How to do it...

    How it works...

    There's more...

    See also

    Storing a conditional frequency distribution in Redis

    Getting ready

    How to do it...

    How it works...

    There's more...

    See also

    Storing an ordered dictionary in Redis

    Getting ready

    How to do it...

    How it works...

    There's more...

    See also

    Distributed word scoring with Redis and execnet

    Getting ready

    How to do it...

    How it works...

    There's more...

    See also

    9. Parsing Specific Data Types

    Introduction

    Parsing dates and times with dateutil

    Getting ready

    How to do it...

    How it works...

    There's more...

    See also

    Timezone lookup and conversion

    Getting ready

    How to do it...

    How it works...

    There's more...

    Local timezone

    Custom offsets

    See also

    Extracting URLs from HTML with lxml

    Getting ready

    How to do it...

    How it works...

    There's more...

    Extracting links directly

    Parsing HTML from URLs or files

    Extracting links with XPaths

    See also

    Cleaning and stripping HTML

    Getting ready

    How to do it...

    How it works...

    There's more...

    See also

    Converting HTML entities with BeautifulSoup

    Getting ready

    How to do it...

    How it works...

    There's more...

    Extracting URLs with BeautifulSoup

    See also

    Detecting and converting character encodings

    Getting ready

    How to do it...

    How it works...

    There's more...

    Converting to ASCII

    UnicodeDammit conversion

    See also

    A. Penn Treebank Part-of-speech Tags

    3. Module 3

    1. Working with Strings

    Tokenization

    Tokenization of text into sentences

    Tokenization of text in other languages

    Tokenization of sentences into words

    Tokenization using TreebankWordTokenizer

    Tokenization using regular expressions

    Normalization

    Eliminating punctuation

    Conversion into lowercase and uppercase

    Dealing with stop words

    Calculate stopwords in English

    Substituting and correcting tokens

    Replacing words using regular expressions

    Example of the replacement of a text with another text

    Performing substitution before tokenization

    Dealing with repeating characters

    Example of deleting repeating characters

    Replacing a word with its synonym

    Example of substituting a word with its synonym

    Applying Zipf's law to text

    Similarity measures

    Applying similarity measures using the edit distance algorithm

    Applying similarity measures using Jaccard's Coefficient

    Applying similarity measures using the Smith Waterman distance

    Other string similarity metrics

    Summary

    2. Statistical Language Modeling

    Understanding word frequency

    Develop MLE for a given text

    Hidden Markov Model estimation

    Applying smoothing on the MLE model

    Add-one smoothing

    Good Turing

    Kneser Ney estimation

    Witten Bell estimation

    Develop a back-off mechanism for MLE

    Applying interpolation on data to get mix and match

    Evaluate a language model through perplexity

    Applying metropolis hastings in modeling languages

    Applying Gibbs sampling in language processing

    Summary

    3. Morphology – Getting Our Feet Wet

    Introducing morphology

    Understanding stemmer

    Understanding lemmatization

    Developing a stemmer for non-English language

    Morphological analyzer

    Morphological generator

    Search engine

    Summary

    4. Parts-of-Speech Tagging – Identifying Words

    Introducing parts-of-speech tagging

    Default tagging

    Creating POS-tagged corpora

    Selecting a machine learning algorithm

    Statistical modeling involving the n-gram approach

    Developing a chunker using pos-tagged corpora

    Summary

    5. Parsing – Analyzing Training Data

    Introducing parsing

    Treebank construction

    Extracting Context Free Grammar (CFG) rules from Treebank

    Creating a probabilistic Context Free Grammar from CFG

    CYK chart parsing algorithm

    Earley chart parsing algorithm

    Summary

    6. Semantic Analysis – Meaning Matters

    Introducing semantic analysis

    Introducing NER

    A NER system using Hidden Markov Model

    Training NER using Machine Learning Toolkits

    NER using POS tagging

    Generation of the synset id from Wordnet

    Disambiguating senses using Wordnet

    Summary

    7. Sentiment Analysis – I Am Happy

    Introducing sentiment analysis

    Sentiment analysis using NER

    Sentiment analysis using machine learning

    Evaluation of the NER system

    Summary

    8. Information Retrieval – Accessing Information

    Introducing information retrieval

    Stop word removal

    Information retrieval using a vector space model

    Vector space scoring and query operator interaction

    Developing an IR system using latent semantic indexing

    Text summarization

    Question-answering system

    Summary

    9. Discourse Analysis – Knowing Is Believing

    Introducing discourse analysis

    Discourse analysis using Centering Theory

    Anaphora resolution

    Summary

    10. Evaluation of NLP Systems – Analyzing Performance

    The need for evaluation of NLP systems

    Evaluation of NLP tools (POS taggers, stemmers, and morphological analyzers)

    Parser evaluation using gold data

    Evaluation of IR system

    Metrics for error identification

    Metrics based on lexical matching

    Metrics based on syntactic matching

    Metrics using shallow semantic matching

    Summary

    B. Bibliography

    Index

    Natural Language Processing: Python and NLTK


    Natural Language Processing: Python and NLTK

    Learn to build expert NLP and machine learning projects using NLTK and other Python libraries

    A course in three modules

    BIRMINGHAM - MUMBAI

    Natural Language Processing: Python and NLTK

    Copyright © 2016 Packt Publishing

    All rights reserved. No part of this course may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

    Every effort has been made in the preparation of this course to ensure the accuracy of the information presented. However, the information contained in this course is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this course.

    Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this course by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

    Published on: November 2016

    Published by Packt Publishing Ltd.

    Livery Place

    35 Livery Street

    Birmingham B3 2PB, UK.

    ISBN 978-1-78728-510-1

    www.packtpub.com

    Credits

    Authors

    Nitin Hardeniya

    Jacob Perkins

    Deepti Chopra

    Nisheeth Joshi

    Iti Mathur

    Reviewers

    Afroz Hussain

    Sujit Pal

    Kumar Raj

    Patrick Chan

    Mohit Goenka

    Lihang Li

    Maurice HT Ling

    Jing (Dave) Tian

    Arturo Argueta

    Content Development Editor

    Aishwarya Pandere

    Production Coordinator

    Arvindkumar Gupta

    Preface

    NLTK is one of the most popular and widely used libraries in the natural language processing (NLP) community. The beauty of NLTK lies in its simplicity: most complex NLP tasks can be implemented with a few lines of code. Start off by learning how to tokenize text into component words. Explore and make use of the WordNet language dictionary. Learn how and when to stem or lemmatize words. Discover various ways to replace words and perform spelling correction. Create your own custom text corpora and corpus readers, including a MongoDB-backed corpus. Use part-of-speech taggers to annotate words with their parts of speech. Create and transform chunked phrase trees using partial parsing. Dig into feature extraction for text classification and sentiment analysis. Learn how to do parallel and distributed text processing, and how to store word distributions in Redis.

    This learning path will teach you all that and more, in a hands-on learn-by-doing manner. Become an expert in using NLTK for Natural Language Processing with this useful companion.

    What this learning path covers

    Module 1, NLTK Essentials, talks about all the preprocessing steps required in any text mining/NLP task. In this module, we discuss tokenization, stemming, stop word removal, and other text cleansing processes in detail and how easy it is to implement these in NLTK.

    Module 2, Python 3 Text Processing with NLTK 3 Cookbook, explains how to use corpus readers and create custom corpora. It also covers how to use some of the corpora that come with NLTK. It covers the chunking process, also known as partial parsing, which can identify phrases and named entities in a sentence. It also explains how to train your own custom chunker and create specific named entity recognizers.

    Module 3, Mastering Natural Language Processing with Python, covers how to calculate word frequencies and perform various language modeling techniques. It also talks about the concept and application of Shallow Semantic Analysis (that is, NER) and WSD using Wordnet.

    It will help you understand and apply the concepts of Information Retrieval and text summarization.

    What you need for this learning path

    Module 1:

    We need the following software for this module:

    Module 2:

    You will need Python 3 and the listed Python packages. For this learning path, the author used Python 3.3.5. To install the packages, you can use pip (https://pypi.python.org/pypi/pip/). The following is the list of the packages in requirements format with the version number used while writing this learning path:

    NLTK>=3.0a4

    pyenchant>=1.6.5

    lockfile>=0.9.1

    numpy>=1.8.0

    scipy>=0.13.0

    scikit-learn>=0.14.1

    execnet>=1.1

    pymongo>=2.6.3

    redis>=2.8.0

    lxml>=3.2.3

    beautifulsoup4>=4.3.2

    python-dateutil>=2.0

    charade>=1.0.3

    You will also need NLTK-Trainer, which is available at https://github.com/japerk/nltk-trainer.

    Beyond Python, there are a couple of recipes that use MongoDB and Redis, both NoSQL databases. These can be downloaded from http://www.mongodb.org/ and http://redis.io/, respectively.

    Module 3:

    For all the chapters, Python 2.7 or 3.2+ is used, and NLTK 3.0 must be installed (on either a 32-bit or a 64-bit machine). The operating system required is Windows, Mac, or Unix.

    Who this learning path is for

    If you are an NLP or machine learning enthusiast and an intermediate Python programmer who wants to quickly master NLTK for natural language processing, then this Learning Path will do you a lot of good. Students of linguistics and semantic/sentiment analysis professionals will find it invaluable.

    Reader feedback

    Feedback from our readers is always welcome. Let us know what you think about this course—what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.

    To send us general feedback, simply e-mail <[email protected]>, and mention the course's title in the subject of your message.

    If there is a topic that you have expertise in and you are interested in either writing or contributing to a course, see our author guide at www.packtpub.com/authors.

    Customer support

    Now that you are the proud owner of a Packt course, we have a number of things to help you to get the most from your purchase.

    Downloading the example code

    You can download the example code files for this course from your account at http://www.packtpub.com. If you purchased this course elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

    You can download the code files by following these steps:

    Log in or register to our website using your e-mail address and password.

    Hover the mouse pointer on the SUPPORT tab at the top.

    Click on Code Downloads & Errata.

    Enter the name of the course in the Search box.

    Select the course for which you're looking to download the code files.

    Choose from the drop-down menu where you purchased this course from.

    Click on Code Download.

    You can also download the code files by clicking on the Code Files button on the course's webpage at the Packt Publishing website. This page can be accessed by entering the course's name in the Search box. Please note that you need to be logged in to your Packt account.

    Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

    WinRAR / 7-Zip for Windows

    Zipeg / iZip / UnRarX for Mac

    7-Zip / PeaZip for Linux

    The code bundle for the course is also hosted on GitHub at https://github.com/PacktPublishing/Natural-Language-Processing-Python-and-NLTK. We also have other code bundles from our rich catalog of books, videos and courses available at https://github.com/PacktPublishing/. Check them out!

    Errata

    Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this course. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your course, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.

    To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.

    Piracy

    Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

    Please contact us at <[email protected]> with a link to the suspected pirated material.

    We appreciate your help in protecting our authors and our ability to bring you valuable content.

    Questions

    If you have a problem with any aspect of this course, you can contact us at <[email protected]>, and we will do our best to address the problem.

    Part 1. Module 1

    NLTK Essentials

    Build cool NLP and machine learning applications using NLTK and other Python libraries

    Chapter 1. Introduction to Natural Language Processing

    I will start with the introduction to Natural Language Processing (NLP). Language is a central part of our day to day life, and it's so interesting to work on any problem related to languages. I hope this book will give you a flavor of NLP, will motivate you to learn some amazing concepts of NLP, and will inspire you to work on some of the challenging NLP applications.

    Put simply, the study of processing language is what we call NLP. People who are deeply involved in the study of language are linguists, while the term computational linguist applies to those who study the processing of languages through computation. Essentially, a computational linguist is a computer scientist who has enough understanding of languages to apply computational skills to model different aspects of language. While computational linguists address the theoretical aspects of language, NLP is nothing but the application of computational linguistics.

    NLP is more about applying computers to the different nuances of language, and building real-world applications using NLP techniques. In a practical context, NLP is analogous to teaching a language to a child. Some of the most common tasks, such as understanding words and sentences and forming grammatically and structurally correct sentences, are very natural to humans. In NLP, some of these tasks translate to tokenization, chunking, part-of-speech tagging, parsing, machine translation, and speech recognition, and most of them are still among the toughest challenges for computers. I will be talking more about the practical side of NLP, assuming that we all have some background in it. The expectation for the reader is a minimal understanding of some programming language and an interest in NLP and language.

    By the end of this chapter, we want readers to have:

    A brief introduction to NLP and related concepts

    Python, NLTK, and other libraries installed

    Written some very basic Python and NLTK code snippets

    If you have never heard the term NLP, then please take some time to read any of the books mentioned here, at least the first few chapters. A quick reading of at least the Wikipedia page relating to NLP is a must:

    Speech and Language Processing by Daniel Jurafsky and James H. Martin

    Foundations of Statistical Natural Language Processing by Christopher D. Manning and Hinrich Schütze

    Why learn NLP?

    I start my discussion with the Gartner hype cycle, where you can clearly see NLP at the top of the cycle. Currently, NLP is one of the rarest skill sets required in the industry. After the advent of big data, the major challenge is that we need more people who are good not just with structured data, but also with semi-structured and unstructured data. We are generating petabytes of weblogs, tweets, Facebook feeds, chats, e-mails, and reviews. Companies are collecting all these different kinds of data for better customer targeting and meaningful insights. To process all these unstructured data sources, we need people who understand NLP.

    We are in the age of information; we can't even imagine our life without Google. We use Siri for most of the basic stuff. We use spam filters for filtering spam e-mails. We need a spell checker in our Word documents. There are many examples of real-world NLP applications all around us.

    Image is taken from http://www.gartner.com/newsroom/id/2819918

    Let me also give you some examples of the amazing NLP applications that you can use, but are not aware that they are built on NLP:

    Spell correction (MS Word/ any other editor)

    Search engines (Google, Bing, Yahoo, wolframalpha)

    Speech engines (Siri, Google Voice)

    Spam classifiers (All e-mail services)

    News feeds (Google, Yahoo!, and so on)

    Machine translation (Google Translate, and so on)

    IBM Watson

    Building these applications requires a very specific skill set: a great understanding of language, and tools to process the language efficiently. So it's not just hype that makes NLP one of the most niche areas; it's the kind of applications that can be created using NLP that makes it one of the most unique skills to have.

    To achieve some of the above applications, and other basic NLP preprocessing, there are many open source tools available. Some of them were developed by organizations to build their own NLP applications, and some of them have been open-sourced. Here is a small list of available NLP tools:

    GATE

    Mallet

    Open NLP

    UIMA

    Stanford toolkit

    Gensim

    Natural Language Tool Kit (NLTK)

    Most of these tools are written in Java and have similar functionalities. Some of them are robust and have a wide variety of NLP tools available. However, when it comes to ease of use and explanation of the concepts, NLTK scores really high. NLTK is also a very good learning kit because the learning curve of Python (on which NLTK is written) is very fast. NLTK incorporates most NLP tasks, and it's very elegant and easy to work with. For all these reasons, NLTK has become one of the most popular libraries in the NLP community:

    I am assuming all you guys know Python. If not, I urge you to learn Python. There are many basic tutorials on Python available online. There are lots of books also available that give you a quick overview of the language. We will also look into some of the features of Python, while going through the different topics. But for now, even if you only know the basics of Python, such as lists, strings, regular expressions, and basic I/O, you should be good to go.

    Note

    Python can be installed from any of the following websites:

    https://www.python.org/downloads/

    http://continuum.io/downloads

    https://store.enthought.com/downloads/

    I would recommend using the Anaconda or Canopy Python distributions. The reason is that these distributions come with bundled libraries, such as NumPy, SciPy, scikit-learn, and so on, which are used for data analysis and other applications related to NLP and related fields. Even NLTK is part of these distributions.

    Note

    Please follow the instructions and install NLTK and NLTK data:

    http://www.nltk.org/install.html
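    Once NLTK itself is installed, the accompanying data packages can be fetched straight from the Python interpreter. This is only a minimal sketch; the interactive downloader lets you pick individual corpora instead of downloading everything:

    >>> import nltk
    >>> nltk.download()              # opens the interactive NLTK data downloader
    >>> nltk.download('stopwords')   # or fetch a single package, such as the stopwords list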

    Let's test everything.

    Open the terminal on your respective operating systems. Then run:

    $ python

    This should open the Python interpreter:

    Python 2.6.6 (r266:84292, Oct 15 2013, 07:32:41)
    [GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>>

    I hope you got a similar looking output here. There is a chance that you will have received a different looking output, but ideally you will get the latest version of Python (I recommend 2.7), the compiler GCC, and the operating system details. I know the latest versions of Python are in the 3.0+ range, but, as with any other open source system, we should try to stick to a more stable version as opposed to jumping to the latest version. If you have moved to Python 3.0+, please have a look at the link below to gain an understanding of what new features have been added:

    https://docs.python.org/3/whatsnew/3.4.html.

    UNIX-based systems have Python as a default program. Windows users can set the PATH to get Python working. Let's check whether we have installed NLTK correctly:

    >>> import nltk
    >>> print "Python and NLTK installed successfully"
    Python and NLTK installed successfully

    Hey, we are good to go!

    Let's start playing with Python!

    We'll not be diving too deep into Python; however, for the benefit of the audience, we'll give you a quick five-minute tour of Python essentials. We'll talk about the basics of data structures, some frequently used functions, and the general constructs of Python in the next few sections.

    Note

    I highly recommend the two hour Google Python class. https://developers.google.com/edu/python should be good enough to start. Please go through the Python website https://www.python.org/ for more tutorials and other resources.

    Lists

    Lists are one of the most commonly used data structures in Python. They are pretty much comparable to arrays in other programming languages. Let's start with some of the most important functions that a Python list provides.

    Try the following in the Python console:

    >>> lst = [1, 2, 3, 4]   # mostly like arrays in typical languages
    >>> print lst
    [1, 2, 3, 4]

    Python lists can be accessed using much more flexible indexing. Here are some examples:

    >>>print 'First element' +lst[0]

    You will get an error message like this:

    TypeError: cannot concatenate 'str' and 'int' objects

    The reason is that Python is an interpreted language, and checks the type of its variables only at the time it evaluates the expression. We do not need to declare the type of a variable when we create it. Our list holds integer objects, and an integer cannot be concatenated to a string in the print statement; it will only accept a string object. For this reason, we need to convert the list element to a string first. This process is also known as type casting.

    >>> print 'First element :' + str(lst[0])
    >>> print 'last element :' + str(lst[-1])
    >>> print 'first three elements :' + str(lst[0:3])
    >>> print 'last three elements :' + str(lst[-3:])
    First element :1
    last element :4
    first three elements :[1, 2, 3]
    last three elements :[2, 3, 4]

    Helping yourself

    The best way to learn more about different data types and functions is to use help functions like help() and dir(lst).

    The dir(python object) command is used to list all the attributes of the given Python object. For example, if you pass a list object, it will list all the things you can do with lists:

    >>>dir(lst) >>>' , '.join(dir(lst)) '__add__ , __class__ , __contains__ , __delattr__ , __delitem__ , __delslice__ , __doc__ , __eq__ , __format__ , __ge__ , __getattribute__ , __getitem__ , __getslice__ , __gt__ , __hash__ , __iadd__ , __imul__ , __init__ , __iter__ , __le__ , __len__ , __lt__ , __mul__ , __ne__ , __new__ , __reduce__ , __reduce_ex__ , __repr__ , __reversed__ , __rmul__ , __setattr__ , __setitem__ , __setslice__ , __sizeof__ , __str__ , __subclasshook__ , append , count , extend , index , insert , pop , remove , reverse , sort'

    With the help(python object) command, we can get detailed documentation for the given Python object, and also give a few examples of how to use the Python object:

    >>> help(lst.index)
    Help on built-in function index:

    index(...)
        L.index(value, [start, [stop]]) -> integer -- return first index of value.
        This function raises a ValueError if the value is not present.

    So help and dir can be used on any Python data type, and are a very nice way to learn about the function and other details of that object. It also provides you with some basic examples to work with, which I found useful in most cases.

    Strings in Python are very similar to other languages, but the manipulation of strings is one of the main features of Python. It's immensely easy to work with strings in Python. Even something very simple, like splitting a string, takes effort in Java / C, while you will see how easy it is in Python.

    Using the help function that we used previously, you can get help for any Python object and any function. Let's have some more examples with the other most commonly used data type strings:

    Split: This is a method to split the string based on some delimiter. If no argument is provided, it assumes whitespace as the delimiter.

    >>> mystring = 'Monty Python !  and the holy Grail ! \n'
    >>> print mystring.split()
    ['Monty', 'Python', '!', 'and', 'the', 'holy', 'Grail', '!']

    Strip: This is a method that can remove trailing whitespace, like '\n', '\n\r' from the string:

    >>> print mystring.strip()
    Monty Python !  and the holy Grail !

    If you notice, the '\n' character is stripped off. There are also the rstrip() and lstrip() methods to strip whitespace from only the right or only the left of the string.
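    For instance, here is a quick sketch of the one-sided variants, reusing the same mystring object from above:

    >>> mystring.rstrip()   # removes trailing whitespace, including the '\n'
    'Monty Python !  and the holy Grail !'
    >>> mystring.lstrip()   # removes leading whitespace only; no change here
    'Monty Python !  and the holy Grail ! \n'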

    Upper/Lower: We can change the case of the string using these methods:

    >>> print mystring.upper()
    MONTY PYTHON !  AND THE HOLY GRAIL !

    Replace: This will help you substitute a substring from the string:

    >>> print mystring.replace('!', '')
    Monty Python   and the holy Grail

    There are tons of string functions. I have just talked about some of the most frequently used.

    Note

    Please look the following link for more functions and examples:

    https://docs.python.org/2/library/string.html.

    Regular expressions

    One other important skill for an NLP enthusiast is working with regular expressions. A regular expression is effectively pattern matching on strings. We heavily use pattern extraction to get meaningful information from large amounts of messy text data. The following are all the regular expressions you need (a short sketch showing a few of them in action follows this list). I haven't used any regular expressions beyond these in my entire life:

    . (a period): This expression matches any single character except the newline \n.

    \w: This expression matches a word character (a letter, digit, or underscore), equivalent to [a-zA-Z0-9_].

    \W (upper case W) matches any non-word character.

    \s: This expression (lowercase s) matches a single whitespace character: space, newline, return, tab, or form feed, that is, [ \n\r\t\f].

    \S: This expression matches any non-whitespace character.

    \t: This expression matches a tab character.

    \n: This expression is used for a newline character.

    \r: This expression is used for a return character.

    \d: Decimal digit [0-9].

    ^: This expression is used at the start of the string.

    $: This expression is used at the end of the string.

    \: This expression is used to escape the special meaning of a special character. For example, if you want to match the $ symbol literally, add \ in front of it.
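    Here is the promised sketch showing a few of these patterns in action; the example strings are made up purely for illustration:

    >>> import re
    >>> re.findall('\d+', 'NLTK 3.0 was released in 2014')   # \d matches digits
    ['3', '0', '2014']
    >>> re.findall('\w+', 'Hello, world!')                   # \w matches word characters
    ['Hello', 'world']
    >>> re.search('^Monty', 'Monty Python') is not None      # ^ anchors the match at the start
    True
    >>> re.search('Grail$', 'the holy Grail') is not None    # $ anchors the match at the end
    True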

    Let's search for something in the running example, where mystring is the same string object, and we will try to look for some patterns in that. A substring search is one of the common use-cases of the re module. Let's implement this:

    >>> # We have to import the re module to use regular expressions
    >>> import re
    >>> if re.search('Python', mystring):
    >>>     print "We found python"
    >>> else:
    >>>     print "NO"

    Once this is executed, we get the message as follows:

    We found python

    We can do more pattern finding using regular expressions. One of the common functions that is used in finding all the patterns in a string is findall. It will look for the given patterns in the string, and will give you a list of all the matched objects:

    >>> import re
    >>> print re.findall('!', mystring)
    ['!', '!']

    As we can see, there were two instances of '!' in mystring, and findall returns both of them as a list.

    Dictionaries

    The other most commonly used data structure is the dictionary, also known as an associative array/memory in other programming languages. Dictionaries are data structures that are indexed by keys, which can be any immutable type; strings and numbers, for example, can always be keys.

    Dictionaries are a handy data structure, used widely across programming languages to implement many algorithms. Python dictionaries are one of the most elegant implementations of hash tables in any programming language. It's very easy to work with dictionaries, and the great thing is that with a few nuggets of code you can build a very complex data structure, while the same task could take much more time and coding effort in other languages. This gives the programmer more time to focus on algorithms rather than on the data structure itself.

    I am using one of the very common use cases of dictionaries: getting the frequency distribution of words in a given text. With just the few lines of code below, you can get the frequency of words. Just try the same task in any other language and you will understand how amazing Python is:

    >>> # declare a dictionary
    >>> word_freq = {}
    >>> for tok in mystring.split():
    >>>     if tok in word_freq:
    >>>         word_freq[tok] += 1
    >>>     else:
    >>>         word_freq[tok] = 1
    >>> print word_freq
    {'!': 2, 'and': 1, 'holy': 1, 'Python': 1, 'Grail': 1, 'the': 1, 'Monty': 1}
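    As an aside, Python's standard library also ships collections.Counter, which performs the same counting in a single call. This is just a minimal sketch, assuming the same mystring object as before:

    >>> from collections import Counter
    >>> word_freq = Counter(mystring.split())
    >>> print word_freq.most_common(1)   # the single most frequent token
    [('!', 2)]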

    Writing functions

    As in any other programming language, Python has its own way of writing functions. A function in Python starts with the keyword def, followed by the function name and parentheses (). As in other languages, any arguments are placed within these parentheses. The body of the function starts after the colon (:) symbol. The initial lines of the body are typically a docstring (comment), followed by the code body, and a function usually ends with a return statement. For example, in the example below, the function wordfreq starts with the def keyword, takes a single argument, and, since it only prints its results, has no explicit return statement.

    >>> def wordfreq(mystring):
    >>>     '''
    >>>     Function to generate the frequency distribution of the given text
    >>>     '''
    >>>     print mystring
    >>>     word_freq = {}
    >>>     for tok in mystring.split():
    >>>         if tok in word_freq:
    >>>             word_freq[tok] += 1
    >>>         else:
    >>>             word_freq[tok] = 1
    >>>     print word_freq
    >>> def main():
    >>>     mystr = "This is my first python program"
    >>>     wordfreq(mystr)
    >>> if __name__ == '__main__':
    >>>     main()

    This is the same code that we wrote in the previous section; the idea of writing it in the form of a function is to make the code reusable and readable. The interpreter style of writing Python is also very common, but for writing big programs it is good practice to use functions/classes and one of the programming paradigms. We also want you to write and run your first Python program. You need to follow these steps to achieve this:

    Open an empty Python file named mywordfreq.py in your preferred text editor.

    Write or copy the preceding code snippet into the file.

    Open a command prompt in your operating system.

    Run the following command:

    $ python mywordfreq.py

    Output should be:

    This is my first python program
    {'This': 1, 'is': 1, 'my': 1, 'first': 1, 'python': 1, 'program': 1}

    Now you have a very basic understanding of some common data structures that Python provides, and you can write a full Python program and run it. With this much of an introduction to Python, you should be able to manage the initial chapters.

    Note

    Please have a look at some Python tutorials on the following website to learn more commands on Python:

    https://wiki.python.org/moin/BeginnersGuide

    Diving into NLTK

    Instead of going further into the theoretical aspects of natural language processing, let's start with a quick dive into NLTK. I am going to start with some basic example use cases of NLTK. There is a good chance that you have already done something similar. First, I will give a typical Python programmer approach, and then move on to NLTK for a much more efficient, robust, and clean solution.

    We will start analyzing with some example text content. For the current example, I have taken the content from Python's home page.

    >>> import urllib2
    >>> # urllib2 is used to download the html content of the web link
    >>> response = urllib2.urlopen('http://python.org/')
    >>> # You can read the entire content of a file using the read() method
    >>> html = response.read()
    >>> print len(html)
    47020

    We don't have any clue about the kind of topics that are discussed in this URL, so let's start with an exploratory data analysis (EDA). Typically, in a text domain, EDA can have many meanings, but we will go with a simple case of finding which kinds of terms dominate the document. What are the topics? How frequent are they? The process will involve some level of preprocessing, which we will try to do first in a pure Python way and then using NLTK.

    Let's start by cleaning the HTML tags. One way to do this is to select just the tokens, including numbers and characters. Anybody who has worked with regular expressions should be able to convert an HTML string into a list of tokens:

    >>> # Regular expression based split of the string
    >>> tokens = [tok for tok in html.split()]
    >>> print "Total no of tokens :" + str(len(tokens))
    >>> # First 100 tokens
    >>> print tokens[0:100]
    Total no of tokens :2860
    ['', '', '', ''type=text/css', 'media="not', 'print,', 'braille,' ...]

    As you can see, there is an excess of html tags and other unwanted characters when we use the preceding method. A cleaner version of the same task will look something like this:

    >>>import re >>># using the split function >>>#https://docs.python.org/2/library/re.html >>>tokens = re.split('\W+',html) >>>print len(tokens) >>>print tokens[0:100] 5787 ['', 'doctype', 'html', 'if', 'lt', 'IE', '7', 'html', 'class', 'no', 'js', 'ie6', 'lt', 'ie7', 'lt', 'ie8', 'lt', 'ie9', 'endif', 'if', 'IE', '7', 'html', 'class', 'no', 'js', 'ie7', 'lt', 'ie8', 'lt', 'ie9', 'endif', 'if', 'IE', '8', 'msapplication', 'tooltip', 'content', 'The', 'official', 'home', 'of', 'the', 'Python', 'Programming', 'Language', 'meta', 'name', 'apple' ...]

    This looks much cleaner now. But still, you can do more; I leave it to you to try to remove as much noise as you can. You can clean some of the HTML tags that are still popping up; you probably also want to use word length as a criterion and remove words that have a length of one, which will remove elements like 7, 8, and so on that are just noise in this case.
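    As one possible sketch of that extra cleaning (the length cut-off and the small list of markup words below are illustrative assumptions, not an exhaustive recipe):

    >>> # drop single-character tokens and a handful of leftover markup words
    >>> noise = set(['html', 'head', 'body', 'meta', 'class', 'content'])
    >>> cleaner_tokens = [tok for tok in tokens if len(tok) > 1 and tok.lower() not in noise]

    Now, instead of writing some of these preprocessing steps from scratch, let's move to NLTK for the same task. There is a function called clean_html() that can do all the cleaning that we are looking for: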

    >>> import nltk
    >>> # http://www.nltk.org/api/nltk.html#nltk.util.clean_html
    >>> clean = nltk.clean_html(html)
    >>> # clean will have the entire string with all the html noise removed
    >>> tokens = [tok for tok in clean.split()]
    >>> print tokens[:100]
    ['Welcome', 'to', 'Python.org', 'Skip', 'to', 'content', '▼', 'Close', 'Python', 'PSF', 'Docs', 'PyPI', 'Jobs', 'Community', '▲', 'The', 'Python', 'Network', '≡', 'Menu', 'Arts', 'Business' ...]

    Cool, right? This definitely is much cleaner and easier to do.
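    One caveat worth flagging: in NLTK 3.x, clean_html() was removed and now simply directs you to use an HTML parser instead, so on newer versions a library such as BeautifulSoup (the beautifulsoup4 package) is the usual substitute. A minimal sketch, assuming it is installed:

    >>> from bs4 import BeautifulSoup
    >>> clean = BeautifulSoup(html, 'html.parser').get_text()
    >>> tokens = [tok for tok in clean.split()]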

    Let's try to get the frequency distribution of these terms. First, let's do it the Pure Python way, then I will tell you the NLTK recipe.

    >>>import operator >>>freq_dis={} >>>for tok in tokens: >>>    if tok in freq_dis: >>>        freq_dis[tok]+=1 >>>    else: >>>        freq_dis[tok]=1 >>># We want to sort this dictionary on values ( freq in this case ) >>>sorted_freq_dist= sorted(freq_dis.items(), key=operator.itemgetter(1), reverse=True) >>> print sorted_freq_dist[:25] [('Python', 55), ('>>>', 23), ('and', 21), ('to', 18), (',', 18), ('the', 14), ('of', 13), ('for', 12), ('a', 11), ('Events', 11), ('News', 11), ('is', 10), ('2014-', 10), ('More', 9), ('#', 9), ('3', 9), ('=', 8), ('in', 8), ('with', 8), ('Community', 7), ('The', 7), ('Docs', 6), ('Software', 6), (':', 6), ('3:', 5), ('that', 5), ('sum', 5)]

    Naturally, as this is Python's home page, Python and the (>>>) interpreter symbol are the most common terms, also giving a sense of the website.

    A better and more efficient approach is to use NLTK's FreqDist() function. For this, we will take a look at the same code we developed before:

    >>> import nltk
    >>> Freq_dist_nltk = nltk.FreqDist(tokens)
    >>> print Freq_dist_nltk
    <FreqDist: 'Python': 55, '>>>': 23, 'and': 21, ',': 18, 'to': 18, 'the': 14, 'of': 13, 'for': 12, 'Events': 11, 'News': 11, ...>
    >>> for k, v in Freq_dist_nltk.items():
    >>>     print str(k) + ':' + str(v)
    Python:55
    >>>:23
    and:21
    ,:18
    to:18
    the:14
    of:13
    for:12
    Events:11
    News:11

    Tip

    Downloading the example code

    You can download the example code files from your account at http://www.packtpub.com for all the Packt Publishing books you have purchased. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

    Let's now do some more funky things. Let's plot this:

    >>> Freq_dist_nltk.plot(50, cumulative=False)
    >>> # below is the plot for the frequency distribution

    We can see that the most frequent terms dominate, and at some point the curve goes into a long tail. Still, there is some noise; there are words like the, of, for, and =. These are not very useful words, and there is a terminology for them: stop words, that is, words like the, a, an, and so on. Articles and pronouns are generally present in most documents, hence they are not discriminative enough to be informative. In most NLP and information retrieval tasks, people generally remove stop words. Let's go back again to our running example:

    >>> stopwords = [word.strip().lower() for word in open('PATH/english.stop.txt')]
    >>> clean_tokens = [tok for tok in tokens if len(tok.lower()) > 1 and (tok.lower() not in stopwords)]
    >>> Freq_dist_nltk = nltk.FreqDist(clean_tokens)
    >>> Freq_dist_nltk.plot(50, cumulative=False)
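    If you don't have a stop word file handy, NLTK ships its own English list in nltk.corpus; a minimal sketch, assuming the stopwords data package has been downloaded (see the nltk.download() note earlier in this chapter):

    >>> from nltk.corpus import stopwords
    >>> stoplist = set(stopwords.words('english'))
    >>> clean_tokens = [tok for tok in tokens if len(tok.lower()) > 1 and tok.lower() not in stoplist]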

    Note

    Please go to http://www.wordle.net/advanced for more word clouds.

    Looks much cleaner now! After finishing this much, you can go to Wordle, put the distribution in the form of a CSV file, and you should be able to get something like this word cloud:

    Your turn

    Please try the
