Natural Language Processing with
Python Cookbook
Krishna Bhavsar
Naresh Kumar
Pratap Dangeti
BIRMINGHAM - MUMBAI
Natural Language Processing with Python Cookbook
Copyright © 2017 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or
transmitted in any form or by any means, without the prior written permission of the
publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the
information presented. However, the information contained in this book is sold without
warranty, either express or implied. Neither the authors, nor Packt Publishing and its
dealers and distributors, will be held liable for any damages caused or alleged to be caused
directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the
companies and products mentioned in this book by the appropriate use of capitals.
However, Packt Publishing cannot guarantee the accuracy of this information.
ISBN 978-1-78728-932-1
www.packtpub.com
Credits
First and foremost, I would like to thank my mother for being the biggest motivating force
and a rock-solid support system behind all my endeavors in life. I would like to thank the
management team at Synerzip and all my friends for being supportive of me on this
journey. Last but not least, special thanks to Ram and Dorothy for keeping me on track
during this professionally difficult year.
Pratap Dangeti develops machine learning and deep learning solutions for structured,
image, and text data at TCS, in its research and innovation lab in Bangalore. He has
acquired a lot of experience in both analytics and data science. He received his master's
degree from IIT Bombay in its industrial engineering and operations research program.
Pratap is an artificial intelligence enthusiast. When not working, he likes to read about next-
gen technologies and innovative methodologies. He is also the author of the book Statistics
for Machine Learning by Packt.
I would like to thank my mom, Lakshmi, for her support throughout my career and in
writing this book. I dedicate this book to her. I also thank my family and friends for their
encouragement, without which it would not have been possible to write this book.
About the Reviewer
Juan Tomas Oliva Ramos is an environmental engineer from the University of Guanajuato,
Mexico, with a master's degree in administrative engineering and quality. He has more than
5 years of experience in the management and development of patents, technological
innovation projects, and the development of technological solutions through the statistical
control of processes.
Juan is an Alfaomega reviewer and has worked on the book Wearable Designs for Smart
Watches, Smart TVs and Android Mobile Devices.
Juan has also developed prototypes through programming and automation technologies for
the improvement of operations, which have been registered for patents.
I want to thank God for giving me wisdom and humility to review this book.
I thank Packt for giving me the opportunity to review this amazing book and to collaborate
with a group of committed people.
I want to thank my beautiful wife, Brenda, our two magic princesses (Maria Regina and
Maria Renata) and our next member (Angel Tadeo), all of you, give me the strength,
happiness, and joy to start a new day. Thanks for being my family.
www.PacktPub.com
For support files and downloads related to your book, please visit www.PacktPub.com. Did
you know that Packt offers eBook versions of every book published, with PDF and ePub
files available? You can upgrade to the eBook version at www.PacktPub.com and, as a print
book customer, you are entitled to a discount on the eBook copy. Get in touch with us at
[email protected] for more details. At www.PacktPub.com, you can also read a
collection of free technical articles, sign up for a range of free newsletters and receive
exclusive discounts and offers on Packt books and eBooks.
https://www.packtpub.com/mapt
Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt
books and video courses, as well as industry-leading tools to help you plan your personal
development and advance your career.
Why subscribe?
Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser
Customer Feedback
Thanks for purchasing this Packt book. At Packt, quality is at the heart of our editorial
process. To help us improve, please leave us an honest review on this book's Amazon page
at https://www.amazon.com/dp/178728932X. If you'd like to join our team of regular
reviewers, you can email us at [email protected]. We reward our regular
reviewers with free eBooks and videos in exchange for their valuable feedback. Help us be
relentless in improving our products!
Table of Contents
Preface 1
Chapter 1: Corpus and WordNet 8
Introduction 8
Accessing in-built corpora 9
How to do it... 9
Download an external corpus, load it, and access it 12
Getting ready 12
How to do it... 12
How it works... 14
Counting all the wh words in three different genres in the Brown corpus 15
Getting ready 15
How to do it... 15
How it works... 17
Explore frequency distribution operations on one of the web and chat
text corpus files 17
Getting ready 18
How to do it... 18
How it works... 20
Take an ambiguous word and explore all its senses using WordNet 20
Getting ready 21
How to do it... 21
How it works... 24
Pick two distinct synsets and explore the concepts of hyponyms and
hypernyms using WordNet 25
Getting ready 25
How to do it... 25
How it works... 28
Compute the average polysemy of nouns, verbs, adjectives, and
adverbs according to WordNet 28
Getting ready 29
How to do it... 29
How it works... 30
Chapter 2: Raw Text, Sourcing, and Normalization 31
Introduction 31
Getting ready 60
How to do it… 61
How it works… 63
Stopwords – learning to use the stopwords corpus and seeing the
difference it can make 63
Getting ready 63
How to do it... 63
How it works... 66
Edit distance – writing your own algorithm to find edit distance
between two strings 66
Getting ready 66
How to do it… 67
How it works… 69
Processing two short stories and extracting the common vocabulary
between two of them 69
Getting ready 69
How to do it… 70
How it works… 75
Chapter 4: Regular Expressions 76
Introduction 76
Regular expression – learning to use *, +, and ? 77
Getting ready 77
How to do it… 77
How it works… 79
Regular expression – learning to use $ and ^, and the non-start and
non-end of a word 79
Getting ready 80
How to do it… 80
How it works… 82
Searching multiple literal strings and substring occurrences 83
Getting ready 83
How to do it… 83
How it works... 85
Learning to create date regex and a set of characters or ranges of
character 85
How to do it... 85
How it works... 87
Find all five-character words and make abbreviations in some
sentences 88
How to do it… 88
How it works... 89
Learning to write your own regex tokenizer 89
Getting ready 89
How to do it... 90
How it works... 91
Learning to write your own regex stemmer 91
Getting ready 91
How to do it… 92
How it works… 93
Chapter 5: POS Tagging and Grammars 94
Introduction 94
Exploring the in-built tagger 95
Getting ready 95
How to do it... 95
How it works... 96
Writing your own tagger 97
Getting ready 97
How to do it... 98
How it works... 99
Training your own tagger 104
Getting ready 104
How to do it... 104
How it works... 106
Learning to write your own grammar 108
Getting ready 109
How to do it... 109
How it works... 110
Writing a probabilistic CFG 112
Getting ready 112
How to do it... 113
How it works... 114
Writing a recursive CFG 116
Getting ready 117
How to do it... 117
How it works... 119
Chapter 6: Chunking, Sentence Parse, and Dependencies 121
Introduction 121
Preface
Dear reader, thank you for choosing this book to pursue your interest in natural language
processing. This book will give you a practical viewpoint to understand and implement
NLP solutions from scratch. We will take you on a journey that will start with accessing
inbuilt data sources and creating your own sources. You will then write complex NLP
solutions involving text normalization, preprocessing, POS tagging, parsing, and much
more.
In this book, we will also cover the fundamentals necessary for applying deep learning,
the current state of the art, to natural language processing. We will build these
applications using the Keras library.
Chapter 2, Raw Text, Sourcing, and Normalization, shows how to extract text from various
formats of data sources. We will also learn to extract raw text from web sources. Finally,
we will normalize the raw text from these heterogeneous sources and organize it into a corpus.
Chapter 4, Regular Expressions, covers one of the most basic and simple, yet most important
and powerful, tools that you will ever learn. In this chapter, you will learn the concept of
pattern matching as a way to do text analysis, and for this, there is no better tool than
regular expressions.
Chapter 5, POS Tagging and Grammars, shows how POS tagging forms the basis of any further
syntactic analysis, and how grammars can be formed and deformed using POS tags and chunks.
We will learn to use and write our own POS taggers and grammars.
Chapter 6, Chunking, Sentence Parse, and Dependencies, helps you learn how to use the
inbuilt chunker, as well as train and write your own chunker and dependency parser. In this
chapter, you will also learn to evaluate your trained models.
Chapter 7, Information Extraction and Text Classification, tells you more about named entity
recognition. We will be using inbuilt NEs and also creating our own named entities using
dictionaries. We will then learn to use inbuilt text classification algorithms and simple
recipes around their application.
Chapter 8, Advanced NLP Recipes, is about combining all your lessons so far and creating
applicable recipes that can be easily plugged into any of your real-life application problems.
We will write recipes such as text similarity, summarization, sentiment analysis, anaphora
resolution, and so on.
Chapter 9, Application of Deep Learning in NLP, presents the various fundamentals necessary
for working on deep learning with applications in NLP problems such as classification of
emails, sentiment classification with CNNs and LSTMs, and finally visualizing high-
dimensional word representations in a low-dimensional space.
Chapter 10, Advanced Application of Deep Learning in NLP, describes state-of-the-art problem
solving using deep learning. This consists of automated text generation, question and
answer on episodic data, language modeling to predict the next best word, and finally
chatbot development using generative principles.
This book assumes you know the basics of Keras and how to install the relevant libraries.
We do not expect readers to already be equipped with knowledge of deep learning or of
mathematics such as linear algebra.
We have used the following versions of software throughout this book, but it should run
fine with any of the more recent ones also:
Anaconda 3 – 4.3.1 (all Python and its relevant packages are included in
Anaconda, Python – 3.6.1, NumPy – 1.12.1, pandas – 0.19.2)
Theano – 0.9.0
Keras – 2.0.2
feedparser – 5.2.1
bs4 – 4.6.0
gensim – 3.0.1
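If you want to confirm what is installed in your own environment before starting, a quick
version check along the following lines can help. This is a minimal illustrative sketch, not
part of the book's recipes; the package names simply mirror the list above:

import sys
import numpy, pandas, theano, keras, feedparser, bs4, gensim

# Print the interpreter version, then each package version,
# for comparison against the versions listed above.
print('Python', sys.version.split()[0])
for pkg in (numpy, pandas, theano, keras, feedparser, bs4, gensim):
    print(pkg.__name__, pkg.__version__)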
This book is intended for any newbie with no knowledge of NLP or any experienced
professional who would like to expand their knowledge from traditional NLP techniques to
state-of-the-art deep learning techniques in the application of NLP.
Sections
In this book, you will find several headings that appear frequently (Getting ready, How to
do it…, How it works…, There's more…, and See also). To give clear instructions on how to
complete a recipe, we use these sections as follows.
Getting ready
This section tells you what to expect in the recipe, and describes how to set up any software
or any preliminary settings required for the recipe.
How to do it…
This section contains the steps required to follow the recipe.
How it works…
This section usually consists of a detailed explanation of what happened in the previous
section.
There's more…
This section consists of additional information about the recipe in order to make the reader
more knowledgeable about the recipe.
See also
This section provides helpful links to other useful information for the recipe.
Conventions
In this book, you will find a number of text styles that distinguish between different kinds
of information. Here are some examples of these styles and an explanation of their meaning.
Code words in text, database table names, folder names, filenames, file extensions,
pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "Create a
new file named reuters.py and add the following import line in the file." A block of code
is set as follows:
for w in reader.words(fileP):
    print(w + ' ', end='')
    if w == '.':  # compare by value; 'is' tests identity and is unreliable here
        print()
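As printed, the fragment above depends on a reader object and a fileP identifier defined
earlier in the recipe it was taken from. Purely for reference, a self-contained sketch might
look like this; the corpus directory and file pattern are hypothetical placeholders, not
paths shipped with the book:

from nltk.corpus.reader import PlaintextCorpusReader

# Hypothetical corpus location; point this at any folder of .txt files.
reader = PlaintextCorpusReader('/path/to/corpus', r'.*\.txt')
fileP = reader.fileids()[0]  # take the first file in the corpus

# Print one sentence per line by breaking on the period token.
for w in reader.words(fileP):
    print(w + ' ', end='')
    if w == '.':
        print()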
Reader feedback
Feedback from our readers is always welcome. Let us know what you think about this
book: what you liked or disliked. Reader feedback is important for us as it helps us develop
titles that you will really get the most out of. To send us general feedback, simply e-mail
[email protected], and mention the book's title in the subject of your message. If
there is a topic that you have expertise in and you are interested in either writing or
contributing to a book, see our author guide at www.packtpub.com/authors.
Customer support
Now that you are the proud owner of a Packt book, we have a number of things to help you
get the most from your purchase. To download the example code files, follow these steps:
1. Log in or register to our website using your e-mail address and password.
2. Hover the mouse pointer on the SUPPORT tab at the top.
3. Click on Code Downloads & Errata.
4. Enter the name of the book in the Search box.
5. Select the book for which you're looking to download the code files.
6. Choose from the drop-down menu where you purchased this book from.
7. Click on Code Download.
You can also download the code files by clicking on the Code Files button on the book's
webpage at the Packt Publishing website. This page can be accessed by entering the book's
name in the Search box. Please note that you need to be logged in to your Packt account.
Once the file is downloaded, please make sure that you unzip or extract the folder using the
latest version of your archive extraction tool.
The code bundle for the book is also hosted on GitHub at
https://github.com/PacktPublishing/Natural-Language-Processing-with-Python-Cookbook.
We also have other code bundles from our rich catalog of books and videos available at
https://github.com/PacktPublishing/. Check them out!
Errata
Although we have taken every care to ensure the accuracy of our content, mistakes do
happen. If you find a mistake in one of our books (maybe a mistake in the text or the code),
we would be grateful if you could report this to us. By doing so, you can save other readers
from frustration and help us improve subsequent versions of this book. If you find any
errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting
your book, clicking on the Errata Submission Form link, and entering the details of your
errata. Once your errata are verified, your submission will be accepted and the errata will
be uploaded to our website or added to any list of existing errata under the Errata section of
that title. To view the previously submitted errata, go to
https://www.packtpub.com/books/content/support and enter the name of the book in the
search field. The required
information will appear under the Errata section.
Piracy
Piracy of copyrighted material on the Internet is an ongoing problem across all media. At
Packt, we take the protection of our copyright and licenses very seriously. If you come
across any illegal copies of our works in any form on the Internet, please provide us with
the location address or website name immediately so that we can pursue a remedy. Please
contact us at [email protected] with a link to the suspected pirated material. We
appreciate your help in protecting our authors and our ability to bring you valuable
content.
Questions
If you have a problem with any aspect of this book, you can contact us at
[email protected], and we will do our best to address the problem.
Chapter 1: Corpus and WordNet
In this chapter, we will cover the following recipes:
Accessing in-built corpora
Download an external corpus, load it, and access it
Counting all the wh words in three different genres in the Brown corpus
Explore frequency distribution operations on one of the web and chat text corpus files
Take an ambiguous word and explore all its senses using WordNet
Pick two distinct synsets and explore the concepts of hyponyms and hypernyms using WordNet
Compute the average polysemy of nouns, verbs, adjectives, and adverbs according to WordNet
Introduction
To solve any real-world Natural Language Processing (NLP) problem, you need to work with
huge amounts of data. This data is generally available in the form of corpora, freely
available on the web and as add-ons to the NLTK package. For example, if you want to create
a spell checker, you need a huge corpus of words to match against.
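To make the spell-checker idea concrete, here is a minimal sketch using NLTK's words corpus;
the misspelled token is made up for illustration, and a real spell checker would also rank
candidate corrections:

from nltk.corpus import words

# Build a lookup set from NLTK's English word list
# (run nltk.download('words') once if the corpus is not yet installed).
vocabulary = set(w.lower() for w in words.words())

for token in ['langauge', 'processing']:  # 'langauge' is a deliberate typo
    print(token, '->', 'OK' if token in vocabulary else 'possible typo')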
We will try to understand these things from a practical standpoint. We will perform some
exercises that will fulfill all of these goals through our recipes.
Now, our first task/recipe involves learning how to access any one of these corpora. We
have decided to do some tests on the Reuters corpus for this purpose. We will import the
corpus into our program and try to access it in different ways.
How to do it...
1. Create a new file named reuters.py and add the following import line to the file.
This will specifically allow access to only the reuters corpus in our program, rather
than the entire NLTK data:
from nltk.corpus import reuters
2. Now we want to check what exactly is available in this corpus. The simplest way
to do this is to call the fileids() function on the corpus object. Add the
following line in your program:
files = reuters.fileids()
print(files)
3. Now run the program and you shall get an output similar to this:
This is the list of files in the reuters corpus, along with the relative path of each.
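The corpus object exposes several other access methods besides fileids(). As a small
illustrative sketch (assuming the corpus has been downloaded with nltk.download('reuters')),
you could also explore the topic categories, the tokenized words, and the raw text of a file:

from nltk.corpus import reuters

# Topic categories assigned to the documents.
print(reuters.categories()[:10])

# Tokenized words and raw text of the first file in the corpus.
fileid = reuters.fileids()[0]
print(reuters.words(fileid)[:20])
print(reuters.raw(fileid)[:200])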
[9]