Natural Language Processing with Python Cookbook

Over 60 recipes to implement text analytics solutions using
deep learning principles

Krishna Bhavsar
Naresh Kumar
Pratap Dangeti

BIRMINGHAM - MUMBAI
Natural Language Processing with Python Cookbook
Copyright © 2017 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or
transmitted in any form or by any means, without the prior written permission of the
publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the
information presented. However, the information contained in this book is sold without
warranty, either express or implied. Neither the authors, nor Packt Publishing, nor its
dealers and distributors will be held liable for any damages caused or alleged to be caused
directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the
companies and products mentioned in this book by the appropriate use of capitals.
However, Packt Publishing cannot guarantee the accuracy of this information.

First published: November 2017

Production reference: 1221117


Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.

ISBN 978-1-78728-932-1

www.packtpub.com
Credits

Authors: Krishna Bhavsar, Naresh Kumar, Pratap Dangeti
Copy Editor: Vikrant Phadkay
Reviewer: Juan Tomas Oliva Ramos
Project Coordinator: Nidhi Joshi
Commissioning Editor: Veena Pagare
Proofreader: Safis Editing
Acquisition Editor: Aman Singh
Indexer: Tejal Daruwale Soni
Content Development Editor: Aishwarya Pandere
Graphics: Tania Dutta
Technical Editors: Dinesh Pawar, Suwarna Rajput
Production Coordinator: Shraddha Falebhai
About the Authors
Krishna Bhavsar has spent around 10 years working on natural language processing, social
media analytics, and text mining in various industry domains such as hospitality, banking,
healthcare, and more. He has worked on many different NLP libraries such as Stanford
CoreNLP, IBM's SystemText and BigInsights, GATE, and NLTK to solve industry problems
related to textual analysis. He has also worked on analyzing social media responses for
popular television shows and popular retail brands and products. He also published a
paper on sentiment analysis augmentation techniques at NAACL 2010. He recently created
an NLP pipeline/toolset and open sourced it for public use. Apart from academics and
technology, Krishna has a passion for motorcycles and football. In his free time, he likes to
travel and explore. He has gone on pan-India road trips on his motorcycle and backpacking
trips across most of the countries in South East Asia and Europe.

First and foremost, I would like to thank my mother for being the biggest motivating force
and a rock-solid support system behind all my endeavors in life. I would like to thank the
management team at Synerzip and all my friends for being supportive of me on this
journey. Last but not least, special thanks to Ram and Dorothy for keeping me on track
during this professionally difficult year.

Naresh Kumar has more than a decade of professional experience in designing,


implementing, and running very-large-scale Internet applications in Fortune 500
companies. He is a full-stack architect with hands-on experience in domains such as
e-commerce, web hosting, healthcare, big data and analytics, data streaming, advertising,
and databases. He believes in open source and contributes to it actively. Naresh keeps
himself up to date with emerging technologies, from Linux systems internals to frontend
technologies. He studied at BITS Pilani, Rajasthan, where he earned a dual degree in
computer science and economics.

Pratap Dangeti develops machine learning and deep learning solutions for structured,
image, and text data at TCS, in its research and innovation lab in Bangalore. He has
acquired a lot of experience in both analytics and data science. He received his master's
degree from IIT Bombay in its industrial engineering and operations research program.
Pratap is an artificial intelligence enthusiast. When not working, he likes to read about next-
gen technologies and innovative methodologies. He is also the author of the book Statistics
for Machine Learning by Packt.

I would like to thank my mom, Lakshmi, for her support throughout my career and in
writing this book. I dedicate this book to her. I also thank my family and friends for their
encouragement, without which it would not have been possible to write this book.
About the Reviewer
Juan Tomas Oliva Ramos is an environmental engineer from the University of Guanajuato,
Mexico, with a master's degree in administrative engineering and quality. He has more than
5 years of experience in the management and development of patents, technological
innovation projects, and the development of technological solutions through the statistical
control of processes.

He has been a teacher of statistics, entrepreneurship, and the technological development of


projects since 2011. He became an entrepreneur mentor and started a new department of
technology management and entrepreneurship at Instituto Tecnologico Superior de
Purisima del Rincon Guanajuato, Mexico.

Juan is an Alfaomega reviewer and has worked on the book Wearable Designs for Smart
Watches, Smart TVs and Android Mobile Devices.

Juan has also developed prototypes through programming and automation technologies for
the improvement of operations, which have been registered for patents.

I want to thank God for giving me the wisdom and humility to review this book.
I thank Packt for giving me the opportunity to review this amazing book and to collaborate
with a group of committed people.
I want to thank my beautiful wife, Brenda, our two magic princesses (Maria Regina and
Maria Renata), and our next member (Angel Tadeo). All of you give me the strength,
happiness, and joy to start a new day. Thanks for being my family.
www.PacktPub.com
For support files and downloads related to your book, please visit www.PacktPub.com. Did
you know that Packt offers eBook versions of every book published, with PDF and ePub
files available? You can upgrade to the eBook version at www.PacktPub.com and, as a print
book customer, you are entitled to a discount on the eBook copy. Get in touch with us at
service@packtpub.com for more details. At www.PacktPub.com, you can also read a
collection of free technical articles, sign up for a range of free newsletters, and receive
exclusive discounts and offers on Packt books and eBooks.

https://www.packtpub.com/mapt

Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt
books and video courses, as well as industry-leading tools to help you plan your personal
development and advance your career.

Why subscribe?
Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser
Customer Feedback
Thanks for purchasing this Packt book. At Packt, quality is at the heart of our editorial
process. To help us improve, please leave us an honest review on this book's Amazon page
at https://www.amazon.com/dp/178728932X. If you'd like to join our team of regular
reviewers, you can email us at customerreviews@packtpub.com. We award our regular
reviewers with free eBooks and videos in exchange for their valuable feedback. Help us be
relentless in improving our products!
Table of Contents

Preface

Chapter 1: Corpus and WordNet
    Introduction
    Accessing in-built corpora
        How to do it...
    Download an external corpus, load it, and access it
        Getting ready
        How to do it...
        How it works...
    Counting all the wh words in three different genres in the Brown corpus
        Getting ready
        How to do it...
        How it works...
    Explore frequency distribution operations on one of the web and chat text corpus files
        Getting ready
        How to do it...
        How it works...
    Take an ambiguous word and explore all its senses using WordNet
        Getting ready
        How to do it...
        How it works...
    Pick two distinct synsets and explore the concepts of hyponyms and hypernyms using WordNet
        Getting ready
        How to do it...
        How it works...
    Compute the average polysemy of nouns, verbs, adjectives, and adverbs according to WordNet
        Getting ready
        How to do it...
        How it works...

Chapter 2: Raw Text, Sourcing, and Normalization
    Introduction
    The importance of string operations
        Getting ready
        How to do it...
        How it works...
    Getting deeper with string operations
        How to do it...
        How it works...
    Reading a PDF file in Python
        Getting ready
        How to do it...
        How it works...
    Reading Word documents in Python
        Getting ready
        How to do it...
        How it works...
    Taking PDF, DOCX, and plain text files and creating a user-defined corpus from them
        Getting ready
        How to do it...
        How it works...
    Read contents from an RSS feed
        Getting ready
        How to do it...
        How it works...
    HTML parsing using BeautifulSoup
        Getting ready
        How to do it...
        How it works...

Chapter 3: Pre-Processing
    Introduction
    Tokenization – learning to use the inbuilt tokenizers of NLTK
        Getting ready
        How to do it...
        How it works...
    Stemming – learning to use the inbuilt stemmers of NLTK
        Getting ready
        How to do it...
        How it works...
    Lemmatization – learning to use the WordnetLemmatizer of NLTK
        Getting ready
        How to do it...
        How it works...
    Stopwords – learning to use the stopwords corpus and seeing the difference it can make
        Getting ready
        How to do it...
        How it works...
    Edit distance – writing your own algorithm to find edit distance between two strings
        Getting ready
        How to do it...
        How it works...
    Processing two short stories and extracting the common vocabulary between two of them
        Getting ready
        How to do it...
        How it works...

Chapter 4: Regular Expressions
    Introduction
    Regular expression – learning to use *, +, and ?
        Getting ready
        How to do it...
        How it works...
    Regular expression – learning to use $ and ^, and the non-start and non-end of a word
        Getting ready
        How to do it...
        How it works...
    Searching multiple literal strings and substring occurrences
        Getting ready
        How to do it...
        How it works...
    Learning to create date regex and a set of characters or ranges of character
        How to do it...
        How it works...
    Find all five-character words and make abbreviations in some sentences
        How to do it...
        How it works...
    Learning to write your own regex tokenizer
        Getting ready
        How to do it...
        How it works...
    Learning to write your own regex stemmer
        Getting ready
        How to do it...
        How it works...

Chapter 5: POS Tagging and Grammars
    Introduction
    Exploring the in-built tagger
        Getting ready
        How to do it...
        How it works...
    Writing your own tagger
        Getting ready
        How to do it...
        How it works...
    Training your own tagger
        Getting ready
        How to do it...
        How it works...
    Learning to write your own grammar
        Getting ready
        How to do it...
        How it works...
    Writing a probabilistic CFG
        Getting ready
        How to do it...
        How it works...
    Writing a recursive CFG
        Getting ready
        How to do it...
        How it works...

Chapter 6: Chunking, Sentence Parse, and Dependencies
    Introduction
    Using the built-in chunker
        Getting ready
        How to do it...
        How it works...
    Writing your own simple chunker
        Getting ready
        How to do it...
        How it works...
    Training a chunker
        Getting ready
        How to do it...
        How it works...
    Parsing recursive descent
        Getting ready
        How to do it...
        How it works...
    Parsing shift-reduce
        Getting ready
        How to do it...
        How it works...
    Parsing dependency grammar and projective dependency
        Getting ready
        How to do it...
        How it works...
    Parsing a chart
        Getting ready
        How to do it...
        How it works...

Chapter 7: Information Extraction and Text Classification
    Introduction
    Understanding named entities
    Using inbuilt NERs
        Getting ready
        How to do it...
        How it works...
    Creating, inversing, and using dictionaries
        Getting ready
        How to do it...
        How it works...
    Choosing the feature set
        Getting ready
        How to do it...
        How it works...
    Segmenting sentences using classification
        Getting ready
        How to do it...
        How it works...
    Classifying documents
        Getting ready
        How to do it...
        How it works...
    Writing a POS tagger with context
        Getting ready
        How to do it...
        How it works...

Chapter 8: Advanced NLP Recipes
    Introduction
    Creating an NLP pipeline
        Getting ready
        How to do it...
        How it works...
    Solving the text similarity problem
        Getting ready
        How to do it...
        How it works...
    Identifying topics
        Getting ready
        How to do it...
        How it works...
    Summarizing text
        Getting ready
        How to do it...
        How it works...
    Resolving anaphora
        Getting ready
        How to do it...
        How it works...
    Disambiguating word sense
        Getting ready
        How to do it...
        How it works...
    Performing sentiment analysis
        Getting ready
        How to do it...
        How it works...
    Exploring advanced sentiment analysis
        Getting ready
        How to do it...
        How it works...
    Creating a conversational assistant or chatbot
        Getting ready
        How to do it...
        How it works...

Chapter 9: Applications of Deep Learning in NLP
    Introduction
    Convolutional neural networks
    Applications of CNNs
    Recurrent neural networks
    Application of RNNs in NLP
    Classification of emails using deep neural networks after generating TF-IDF
        Getting ready
        How to do it...
        How it works...
    IMDB sentiment classification using convolutional networks CNN 1D
        Getting ready
        How to do it...
        How it works...
    IMDB sentiment classification using bidirectional LSTM
        Getting ready
        How to do it...
        How it works...
    Visualization of high-dimensional words in 2D with neural word vector visualization
        Getting ready
        How to do it...
        How it works...

Chapter 10: Advanced Applications of Deep Learning in NLP
    Introduction
    Automated text generation from Shakespeare's writings using LSTM
        Getting ready
        How to do it...
        How it works...
    Questions and answers on episodic data using memory networks
        Getting ready
        How to do it...
        How it works...
    Language modeling to predict the next best word using recurrent neural networks LSTM
        Getting ready
        How to do it...
        How it works...
    Generative chatbot using recurrent neural networks (LSTM)
        Getting ready
        How to do it...
        How it works...

Index
Preface
Dear reader, thank you for choosing this book to pursue your interest in natural language
processing. This book will give you a practical viewpoint to understand and implement
NLP solutions from scratch. We will take you on a journey that will start with accessing
inbuilt data sources and creating your own sources. And then you will be writing complex
NLP solutions that will involve text normalization, preprocessing, POS tagging, parsing,
and much more.

In this book, we will also cover the various fundamentals necessary for applying deep
learning to natural language processing, including its state-of-the-art techniques. We will
build these applications using the Keras library.

This book is motivated by the following goals:

The content is designed to help newbies get up to speed with various
fundamentals, explained in a detailed way; for experienced professionals, it
will refresh various concepts and bring more clarity when applying algorithms to
the chosen data
It introduces the new trends in the applications of deep learning in NLP

What this book covers


Chapter 1, Corpus and WordNet, teaches you how to access the built-in corpora of NLTK and
work with frequency distributions. We shall also learn what WordNet is and explore its
features and usage.

Chapter 2, Raw Text, Sourcing, and Normalization, shows how to extract text from various
formats of data sources. We will also learn to extract raw text from web sources. And finally,
we will normalize raw text from these heterogeneous sources and organize it into a corpus.

Chapter 3, Pre-Processing, introduces some critical preprocessing steps, such as


tokenization, stemming, lemmatization, and edit distance.

Chapter 4, Regular Expressions, covers one of the most basic and simple, yet most important
and powerful, tools that you will ever learn. In this chapter, you will learn the concept of
pattern matching as a way to do text analysis, and for this, there is no better tool than
regular expressions.

Chapter 5, POS Tagging and Grammars, shows how POS tagging forms the basis of any further
syntactic analyses and how grammars can be formed and deformed using POS tags and
chunks. We will learn to use and write our own POS taggers and grammars.

Chapter 6, Chunking, Sentence Parse, and Dependencies, helps you learn how to use the
inbuilt chunker as well as train/write your own chunker and dependency parser. In this
chapter, you will learn to evaluate your own trained models.

Chapter 7, Information Extraction and Text Classification, tells you more about named entity
recognition. We will use the inbuilt NERs and also create our own named entities using
dictionaries. We will then learn to use the inbuilt text classification algorithms and some
simple recipes around their application.

Chapter 8, Advanced NLP Recipes, is about combining all your lessons so far and creating
applicable recipes that can be easily plugged into any of your real-life application problems.
We will write recipes such as text similarity, summarization, sentiment analysis, anaphora
resolution, and so on.

Chapter 9, Applications of Deep Learning in NLP, presents the various fundamentals necessary
for working on deep learning with applications in NLP problems, such as classification of
emails, sentiment classification with CNNs and LSTMs, and finally visualizing high-
dimensional words in a low-dimensional space.

Chapter 10, Advanced Applications of Deep Learning in NLP, describes state-of-the-art problem
solving using deep learning. This consists of automated text generation, questions and
answers on episodic data, language modeling to predict the next best word, and finally
chatbot development using generative principles.

What you need for this book


To perform the recipes of this book successfully, you will need Python 3.x or higher running
on any Windows- or Unix-based operating system with a processor of 2.0 GHz or higher
and a minimum of 4 GB RAM. As far as the IDE for Python development is concerned, there
are many available in the market, but my personal favorite is PyCharm Community Edition.
It's free, it's open source, and it's developed by JetBrains. That means support is excellent,
advancements and fixes are distributed at a steady pace, and familiarity with IntelliJ keeps
the learning curve pretty flat.

This book assumes you know the basics of Keras and how to install the relevant libraries.
We do not expect readers to already be equipped with knowledge of deep learning or the
underlying mathematics, such as linear algebra.


We have used the following versions of software throughout this book, but it should run
fine with any of the more recent ones as well:

Anaconda 3 – 4.3.1 (Python and its relevant packages are included in Anaconda:
Python – 3.6.1, NumPy – 1.12.1, pandas – 0.19.2)
Theano – 0.9.0
Keras – 2.0.2
feedparser – 5.2.1
bs4 – 4.6.0
gensim – 3.0.1
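If you want to confirm that your environment matches these versions, a minimal sketch of a
version check is shown below. The package import names (for example, bs4 for Beautiful
Soup) are assumptions based on the standard pip/Anaconda distributions of these libraries:

import importlib

# Print the installed version of each package used in this book,
# or flag it as missing.
for pkg in ['numpy', 'pandas', 'theano', 'keras', 'feedparser', 'bs4', 'gensim', 'nltk']:
    try:
        module = importlib.import_module(pkg)
        print(pkg, getattr(module, '__version__', 'version unknown'))
    except ImportError:
        print(pkg, 'is NOT installed')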

Who this book is for


This book is intended for data scientists, data analysts, and data science professionals who
want to upgrade their existing skills to implement advanced text analytics using NLP. Some
basic knowledge of natural language processing is recommended.

It is equally suited to any newbie with no knowledge of NLP and to any experienced
professional who would like to expand their knowledge from traditional NLP techniques to
state-of-the-art deep learning techniques in the application of NLP.

Sections
In this book, you will find several headings that appear frequently (Getting ready, How to
do it…, How it works…, There's more…, and See also). To give clear instructions on how to
complete a recipe, we use these sections as follows.

Getting ready
This section tells you what to expect in the recipe, and describes how to set up any software
or any preliminary settings required for the recipe.

How to do it…
This section contains the steps required to follow the recipe.


How it works…
This section usually consists of a detailed explanation of what happened in the previous
section.

There's more…
This section consists of additional information about the recipe in order to make the reader
more knowledgeable about the recipe.

See also
This section provides helpful links to other useful information for the recipe.

Conventions
In this book, you will find a number of text styles that distinguish between different kinds
of information. Here are some examples of these styles and an explanation of their meaning.

Code words in text, database table names, folder names, filenames, file extensions,
pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "Create a
new file named reuters.py and add the following import line in the file." A block of code
is set as follows:

for w in reader.words(fileP):
    print(w + ' ', end='')
    if w == '.':  # equality check; 'is' would test object identity, not value
        print()

Any command-line input or output is written as follows:


# Deep Learning modules
>>> import numpy as np
>>> from keras.models import Sequential

New terms and important words are shown in bold.

Warnings or important notes appear like this.


Tips and tricks appear like this.

Reader feedback
Feedback from our readers is always welcome. Let us know what you think about this
book: what you liked or disliked. Reader feedback is important for us as it helps us develop
titles that you will really get the most out of. To send us general feedback, simply e-mail
feedback@packtpub.com, and mention the book's title in the subject of your message. If
there is a topic that you have expertise in and you are interested in either writing or
contributing to a book, see our author guide at www.packtpub.com/authors.

Customer support
Now that you are the proud owner of a Packt book, we have a number of things to help you
to get the most from your purchase.

Downloading the example code


You can download the example code files for this book from your account at
http://www.packtpub.com. If you purchased this book elsewhere, you can visit
http://www.packtpub.com/support and register to have the files e-mailed directly to you.
You can download the code files by following these steps:

1. Log in or register to our website using your e-mail address and password.
2. Hover the mouse pointer on the SUPPORT tab at the top.
3. Click on Code Downloads & Errata.
4. Enter the name of the book in the Search box.
5. Select the book for which you're looking to download the code files.
6. Choose from the drop-down menu where you purchased this book from.
7. Click on Code Download.


You can also download the code files by clicking on the Code Files button on the book's
webpage at the Packt Publishing website. This page can be accessed by entering the book's
name in the Search box. Please note that you need to be logged in to your Packt account.
Once the file is downloaded, please make sure that you unzip or extract the folder using the
latest version of:

WinRAR / 7-Zip for Windows
Zipeg / iZip / UnRarX for Mac
7-Zip / PeaZip for Linux

The code bundle for the book is also hosted on GitHub at
https://github.com/PacktPublishing/Natural-Language-Processing-with-Python-Cookbook.
We also have other code bundles from our rich catalog of books and videos available at
https://github.com/PacktPublishing/. Check them out!

Errata
Although we have taken every care to ensure the accuracy of our content, mistakes do
happen. If you find a mistake in one of our books (maybe a mistake in the text or the code),
we would be grateful if you could report this to us. By doing so, you can save other readers
from frustration and help us improve subsequent versions of this book. If you find any
errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting
your book, clicking on the Errata Submission Form link, and entering the details of your
errata. Once your errata are verified, your submission will be accepted and the errata will
be uploaded to our website or added to the list of existing errata under the Errata section of
that title. To view the previously submitted errata, go to
https://www.packtpub.com/books/content/support and enter the name of the book in the
search field. The required information will appear under the Errata section.

Piracy
Piracy of copyrighted material on the Internet is an ongoing problem across all media. At
Packt, we take the protection of our copyright and licenses very seriously. If you come
across any illegal copies of our works in any form on the Internet, please provide us with
the location address or website name immediately so that we can pursue a remedy. Please
contact us at copyright@packtpub.com with a link to the suspected pirated material. We
appreciate your help in protecting our authors and our ability to bring you valuable
content.


Questions
If you have a problem with any aspect of this book, you can contact us at
questions@packtpub.com, and we will do our best to address the problem.

Chapter 1
Corpus and WordNet
In this chapter, we will cover the following recipes:

Accessing in-built corpora
Download an external corpus, load it, and access it
Counting all the wh words in three different genres in the Brown corpus
Explore frequency distribution operations on one of the web and chat text corpus files
Take an ambiguous word and explore all its senses using WordNet
Pick two distinct synsets and explore the concepts of hyponyms and hypernyms using WordNet
Compute the average polysemy of nouns, verbs, adjectives, and adverbs according to WordNet

Introduction
To solve any real-world Natural Language Processing (NLP) problem, you need to work
with huge amounts of data. This data is generally available in the form of a corpus, either
out there in the open or as an add-on to the NLTK package. For example, if you want to
create a spell checker, you need a huge corpus of words to match against.
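As a flavor of what such a lookup looks like in practice, here is a minimal sketch using
NLTK's words wordlist corpus (a hedged illustration only; a real spell checker would also
rank candidate corrections):

from nltk.corpus import words

# Build a lookup set from NLTK's English wordlist corpus.
vocabulary = set(words.words())

# A correctly spelled word is found; a misspelling is not.
for candidate in ['language', 'langauge']:
    print(candidate, candidate in vocabulary)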

The goal of this chapter is to cover the following:

Introducing various useful textual corpora available with NLTK
How to access these in-built corpora from Python
Working with frequency distributions
An introduction to WordNet and its lexical features

We will try to understand these things from a practical standpoint. We will perform some
exercises that will fulfill all of these goals through our recipes.

Accessing in-built corpora


As already explained, we have many corpora available for use with NLTK. We will
assume that you have already downloaded and installed NLTK data on your computer. If
not, you can find it at http://www.nltk.org/data.html. Also, a complete list of the corpora
that you can use from within NLTK data is available at http://www.nltk.org/nltk_data/.
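If the data is not installed yet, a minimal sketch of the programmatic route is shown below
(nltk.download is NLTK's standard downloader; the corpus identifier reuters matches the
corpus used in this recipe):

import nltk

# Fetch just the Reuters corpus; calling nltk.download() with no
# arguments opens the interactive downloader instead.
nltk.download('reuters')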

Now, our first task/recipe involves learning how to access any one of these corpora. We
have decided to run our tests on the Reuters corpus. We will import the corpus into our
program and try to access it in different ways.

How to do it...
1. Create a new file named reuters.py and add the following import line in the
file. This will specifically allow access to only the reuters corpus in our program
from the entire NLTK data:

from nltk.corpus import reuters

2. Now we want to check what exactly is available in this corpus. The simplest way
to do this is to call the fileids() function on the corpus object. Add the
following line in your program:

files = reuters.fileids()
print(files)

3. Now run the program and you shall get an output similar to this:

['test/14826', 'test/14828', 'test/14829', 'test/14832', 'test/14833', 'test/14839', ...]

This is the list of files in the reuters corpus, along with the relative path of each of
them.
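Since the recipe promises to access the corpus in different ways, here is a short hedged
sketch of a few more standard NLTK corpus reader calls (categories(), words(), and
fileids() are part of NLTK's corpus reader API; the file identifier reused below is taken
from the output above):

# Explore the corpus beyond fileids().
print(reuters.categories()[:10])         # the first ten topic categories
print(reuters.words('test/14826')[:20])  # the first twenty tokens of one file
print(len(reuters.fileids()))            # the total number of files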

[9]
Random documents with unrelated
content Scribd suggests to you:
about donations to the Project Gutenberg Literary Archive
Foundation.”

• You provide a full refund of any money paid by a user who


notifies you in writing (or by e-mail) within 30 days of receipt
that s/he does not agree to the terms of the full Project
Gutenberg™ License. You must require such a user to return or
destroy all copies of the works possessed in a physical medium
and discontinue all use of and all access to other copies of
Project Gutenberg™ works.

• You provide, in accordance with paragraph 1.F.3, a full refund of


any money paid for a work or a replacement copy, if a defect in
the electronic work is discovered and reported to you within 90
days of receipt of the work.

• You comply with all other terms of this agreement for free
distribution of Project Gutenberg™ works.

1.E.9. If you wish to charge a fee or distribute a Project Gutenberg™


electronic work or group of works on different terms than are set
forth in this agreement, you must obtain permission in writing from
the Project Gutenberg Literary Archive Foundation, the manager of
the Project Gutenberg™ trademark. Contact the Foundation as set
forth in Section 3 below.

1.F.

1.F.1. Project Gutenberg volunteers and employees expend


considerable effort to identify, do copyright research on, transcribe
and proofread works not protected by U.S. copyright law in creating
the Project Gutenberg™ collection. Despite these efforts, Project
Gutenberg™ electronic works, and the medium on which they may
be stored, may contain “Defects,” such as, but not limited to,
incomplete, inaccurate or corrupt data, transcription errors, a
copyright or other intellectual property infringement, a defective or
damaged disk or other medium, a computer virus, or computer
codes that damage or cannot be read by your equipment.

1.F.2. LIMITED WARRANTY, DISCLAIMER OF DAMAGES - Except for


the “Right of Replacement or Refund” described in paragraph 1.F.3,
the Project Gutenberg Literary Archive Foundation, the owner of the
Project Gutenberg™ trademark, and any other party distributing a
Project Gutenberg™ electronic work under this agreement, disclaim
all liability to you for damages, costs and expenses, including legal
fees. YOU AGREE THAT YOU HAVE NO REMEDIES FOR
NEGLIGENCE, STRICT LIABILITY, BREACH OF WARRANTY OR
BREACH OF CONTRACT EXCEPT THOSE PROVIDED IN PARAGRAPH
1.F.3. YOU AGREE THAT THE FOUNDATION, THE TRADEMARK
OWNER, AND ANY DISTRIBUTOR UNDER THIS AGREEMENT WILL
NOT BE LIABLE TO YOU FOR ACTUAL, DIRECT, INDIRECT,
CONSEQUENTIAL, PUNITIVE OR INCIDENTAL DAMAGES EVEN IF
YOU GIVE NOTICE OF THE POSSIBILITY OF SUCH DAMAGE.

1.F.3. LIMITED RIGHT OF REPLACEMENT OR REFUND - If you


discover a defect in this electronic work within 90 days of receiving
it, you can receive a refund of the money (if any) you paid for it by
sending a written explanation to the person you received the work
from. If you received the work on a physical medium, you must
return the medium with your written explanation. The person or
entity that provided you with the defective work may elect to provide
a replacement copy in lieu of a refund. If you received the work
electronically, the person or entity providing it to you may choose to
give you a second opportunity to receive the work electronically in
lieu of a refund. If the second copy is also defective, you may
demand a refund in writing without further opportunities to fix the
problem.

1.F.4. Except for the limited right of replacement or refund set forth
in paragraph 1.F.3, this work is provided to you ‘AS-IS’, WITH NO
OTHER WARRANTIES OF ANY KIND, EXPRESS OR IMPLIED,
INCLUDING BUT NOT LIMITED TO WARRANTIES OF
MERCHANTABILITY OR FITNESS FOR ANY PURPOSE.

1.F.5. Some states do not allow disclaimers of certain implied


warranties or the exclusion or limitation of certain types of damages.
If any disclaimer or limitation set forth in this agreement violates the
law of the state applicable to this agreement, the agreement shall be
interpreted to make the maximum disclaimer or limitation permitted
by the applicable state law. The invalidity or unenforceability of any
provision of this agreement shall not void the remaining provisions.

1.F.6. INDEMNITY - You agree to indemnify and hold the Foundation,


the trademark owner, any agent or employee of the Foundation,
anyone providing copies of Project Gutenberg™ electronic works in
accordance with this agreement, and any volunteers associated with
the production, promotion and distribution of Project Gutenberg™
electronic works, harmless from all liability, costs and expenses,
including legal fees, that arise directly or indirectly from any of the
following which you do or cause to occur: (a) distribution of this or
any Project Gutenberg™ work, (b) alteration, modification, or
additions or deletions to any Project Gutenberg™ work, and (c) any
Defect you cause.

Section 2. Information about the Mission


of Project Gutenberg™
Project Gutenberg™ is synonymous with the free distribution of
electronic works in formats readable by the widest variety of
computers including obsolete, old, middle-aged and new computers.
It exists because of the efforts of hundreds of volunteers and
donations from people in all walks of life.

Volunteers and financial support to provide volunteers with the


assistance they need are critical to reaching Project Gutenberg™’s
goals and ensuring that the Project Gutenberg™ collection will
remain freely available for generations to come. In 2001, the Project
Gutenberg Literary Archive Foundation was created to provide a
secure and permanent future for Project Gutenberg™ and future
generations. To learn more about the Project Gutenberg Literary
Archive Foundation and how your efforts and donations can help,
see Sections 3 and 4 and the Foundation information page at
www.gutenberg.org.

Section 3. Information about the Project


Gutenberg Literary Archive Foundation
The Project Gutenberg Literary Archive Foundation is a non-profit
501(c)(3) educational corporation organized under the laws of the
state of Mississippi and granted tax exempt status by the Internal
Revenue Service. The Foundation’s EIN or federal tax identification
number is 64-6221541. Contributions to the Project Gutenberg
Literary Archive Foundation are tax deductible to the full extent
permitted by U.S. federal laws and your state’s laws.

The Foundation’s business office is located at 809 North 1500 West,


Salt Lake City, UT 84116, (801) 596-1887. Email contact links and up
to date contact information can be found at the Foundation’s website
and official page at www.gutenberg.org/contact

Section 4. Information about Donations to


the Project Gutenberg Literary Archive
Foundation
Project Gutenberg™ depends upon and cannot survive without
widespread public support and donations to carry out its mission of
increasing the number of public domain and licensed works that can
be freely distributed in machine-readable form accessible by the
widest array of equipment including outdated equipment. Many
small donations ($1 to $5,000) are particularly important to
maintaining tax exempt status with the IRS.

The Foundation is committed to complying with the laws regulating


charities and charitable donations in all 50 states of the United
States. Compliance requirements are not uniform and it takes a
considerable effort, much paperwork and many fees to meet and
keep up with these requirements. We do not solicit donations in
locations where we have not received written confirmation of
compliance. To SEND DONATIONS or determine the status of
compliance for any particular state visit www.gutenberg.org/donate.

While we cannot and do not solicit contributions from states where


we have not met the solicitation requirements, we know of no
prohibition against accepting unsolicited donations from donors in
such states who approach us with offers to donate.

International donations are gratefully accepted, but we cannot make


any statements concerning tax treatment of donations received from
outside the United States. U.S. laws alone swamp our small staff.

Please check the Project Gutenberg web pages for current donation
methods and addresses. Donations are accepted in a number of
other ways including checks, online payments and credit card
donations. To donate, please visit: www.gutenberg.org/donate.

Section 5. General Information About


Project Gutenberg™ electronic works
Professor Michael S. Hart was the originator of the Project
Gutenberg™ concept of a library of electronic works that could be
freely shared with anyone. For forty years, he produced and
distributed Project Gutenberg™ eBooks with only a loose network of
volunteer support.
Project Gutenberg™ eBooks are often created from several printed
editions, all of which are confirmed as not protected by copyright in
the U.S. unless a copyright notice is included. Thus, we do not
necessarily keep eBooks in compliance with any particular paper
edition.

Most people start at our website which has the main PG search
facility: www.gutenberg.org.

This website includes information about Project Gutenberg™,


including how to make donations to the Project Gutenberg Literary
Archive Foundation, how to help produce our new eBooks, and how
to subscribe to our email newsletter to hear about new eBooks.

You might also like