Natural Language Processing Recipes: Unlocking Text Data with Machine Learning and Deep Learning Using Python, Second Edition
Akshay Kulkarni and Adarsha Shivananda
Bangalore, Karnataka, India
Apress
The publisher, the authors and the editors are safe to assume that the
advice and information in this book are believed to be true and accurate
at the date of publication. Neither the publisher nor the authors or the
editors give a warranty, expressed or implied, with respect to the
material contained herein or for any errors or omissions that may have
been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.
About the Author
Adarsha Shivananda
is a lead data scientist on Indegene Inc.'s product and technology team, where he leads a group of analysts who enable predictive analytics and AI features in healthcare software products. This work focuses mainly on multichannel activities for pharma products and on solving real-time problems encountered by pharma sales reps. Adarsha aims to build a pool of exceptional data scientists within the organization to solve greater healthcare problems through brilliant training programs. He always wants to stay ahead of the curve. His core expertise involves machine learning, deep learning, recommendation systems, and statistics. Adarsha has worked on various data science projects across multiple domains using different technologies and methodologies. Previously, he worked for Tredence Analytics and IQVIA. He lives in Bangalore, India, and loves to read, ride, and teach data science.
About the Technical Reviewer
Aakash Kag
is a data scientist at AlixPartners and a co-founder of the Emeelan application. He has six years of experience in big data analytics and a postgraduate degree in computer science with a specialization in big data analytics. Aakash is passionate about developing social platforms and machine learning, and he frequently speaks at meetups.
© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2021
A. Kulkarni, A. Shivananda, Natural Language Processing Recipes
https://doi.org/10.1007/978-1-4842-7351-7_1
This chapter covers various sources of text data and the ways to extract it. Textual data can act as
information or insights for businesses. The following recipes are covered.
Recipe 1. Text data collection using APIs
Recipe 2. Reading a PDF file in Python
Recipe 3. Reading a Word document
Recipe 4. Reading a JSON object
Recipe 5. Reading an HTML page and HTML parsing
Recipe 6. Regular expressions
Recipe 7. String handling
Recipe 8. Web scraping
Introduction
Before getting into the details of the book, let’s look at generally available data sources. We need to identify
potential data sources that can help with solving data science use cases.
Client Data
For any problem statement, one of the sources is the data that is already present. The business decides
where it wants to store its data. Data storage depends on the type of business, the amount of data, and the
costs associated with the sources. The following are some examples.
SQL databases
HDFS
Cloud storage
Flat files
Free Sources
A large amount of data is freely available on the Internet. You just need to streamline the problem and start
exploring multiple free data sources.
Free APIs like Twitter
Wikipedia
Government data (e.g., http://data.gov)
Census data (e.g., www.census.gov/data.html)
Health care claim data (e.g., www.healthdata.gov)
Data science community websites (e.g., www.kaggle.com)
Google Dataset Search (e.g., https://datasetsearch.research.google.com)
Web Scraping
Web scraping extracts content/data from websites, blogs, forums, and retail sites (e.g., product reviews), with permission from the respective sources, using web scraping packages in Python.
There are a lot of other sources, such as news data and economic data, that can be leveraged for analysis.
Recipe 1. Text data collection using APIs
Problem
You want to collect text data using Twitter APIs.
Solution
Twitter has a gigantic amount of data with a lot of value in it. Social media marketers make their living from
it. There is an enormous number of tweets every day, and every tweet has some story to tell. When all of this
data is collected and analyzed, it gives a business tremendous insight into its company, products, services, and so forth.
Let’s now look at how to pull data and then explore how to leverage it in the coming chapters.
How It Works
Step 1-1. Log in to the Twitter developer portal
Log in to the Twitter developer portal at https://developer.twitter.com.
Create your own app in the Twitter developer portal, and get the following keys. Once you have these
credentials, you can start pulling data.
consumer key: The key associated with the application (Twitter, Facebook, etc.)
consumer secret: The password used to authenticate with the authentication server (Twitter, Facebook,
etc.)
access token: The key given to the client after successful authentication of keys
access token secret: The password for the access key
# Install tweepy
!pip install tweepy
# Import the libraries
import numpy as np
import tweepy
import json
import pandas as pd
from tweepy import OAuthHandler
# credentials
consumer_key = "adjbiejfaaoeh"
consumer_secret = "had73haf78af"
access_token = "jnsfby5u4yuawhafjeh"
access_token_secret = "jhdfgay768476r"
# calling API
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)
# Provide the query you want to pull the data for (e.g., pulling data for the mobile phone ABC)
query = "ABC"
# Fetching tweets
Tweets = api.search(query, count=10, lang='en', exclude='retweets', tweet_mode='extended')
This query pulls the top ten tweets when product ABC is searched. The API pulls English tweets since the
language given is 'en'. It excludes retweets.
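Once the tweets are pulled, a common next step is to put them into a tabular form. The following is a minimal sketch (not the book's verbatim code) that continues with the Tweets object and the pandas import from above; the attribute names assume tweepy 3.x Status objects, where full_text is available because tweet_mode='extended' was used.
# Collect the fetched tweets into a pandas DataFrame for later analysis
tweet_records = [
    {
        "created_at": tweet.created_at,        # timestamp of the tweet
        "user": tweet.user.screen_name,        # author's handle
        "text": tweet.full_text,               # full text (tweet_mode='extended')
    }
    for tweet in Tweets
]
tweets_df = pd.DataFrame(tweet_records)
print(tweets_df.head())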
Recipe 2. Reading a PDF file in Python
Problem
You want to read a PDF file.
Solution
The simplest way to read a PDF file is by using the PyPDF2 library.
How It Works
Follow the steps in this section to extract data from PDF files.
Note You can download any PDF file from the web and place it in the location where you are running
this Jupyter notebook or Python script.
Please note that the function doesn’t work for scanned PDFs.
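The extraction code itself is not reproduced in this excerpt. The following is a minimal sketch using the classic PyPDF2 (pre-3.0) interface; the file name sample.pdf is an assumption, so use whichever PDF you placed next to the notebook.
# Install PyPDF2
!pip install PyPDF2
# Import the library
import PyPDF2
# 'sample.pdf' is an assumed file name; replace it with your own PDF
with open('sample.pdf', 'rb') as pdf_file:
    pdf_reader = PyPDF2.PdfFileReader(pdf_file)
    print(pdf_reader.numPages)          # number of pages in the PDF
    page = pdf_reader.getPage(0)        # first page
    print(page.extractText())           # text extracted from that page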
Recipe 3. Reading a Word document
Problem
You want to read Word files.
Solution
The simplest way is to use the docx library.
How It Works
Follow the steps in this section to extract data from a Word file.
# Install python-docx (it provides the docx module imported below)
!pip install python-docx
#Import library
from docx import Document
Note You can download any Word file from the web and place it in the location where you are running a
Jupyter notebook or Python script.
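The reading step is not shown in this excerpt. A minimal sketch follows; the file name sample.docx is an assumption, so use the Word file you placed next to the notebook.
# 'sample.docx' is an assumed file name
document = Document('sample.docx')
# Join the text of every paragraph in the document into one string
docx_text = "\n".join(paragraph.text for paragraph in document.paragraphs)
print(docx_text[:500])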
Recipe 4. Reading a JSON object
Problem
You want to read a JSON file/object.
Solution
The simplest way is to use requests and the JSON library.
How It Works
Follow the steps in this section to extract data from JSON.
import requests
import json
Step 4-2. Extract text from a JSON file
Now let's extract the text.
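The extraction step is not reproduced in this excerpt. As a minimal sketch, the URL below is an assumed placeholder; any endpoint that returns JSON will do.
# An assumed example endpoint that returns JSON; replace it with your own source
url = "https://api.github.com"
response = requests.get(url)
# Parse the JSON payload into a Python dictionary
data = json.loads(response.text)      # equivalently, response.json()
print(type(data))
print(list(data.keys())[:5])          # inspect a few top-level keys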
Recipe 5. Reading an HTML page and HTML parsing
Problem
You want to parse/read HTML pages.
Solution
The simplest way is to use the bs4 library.
How It Works
Follow the steps in this section to extract data from the web.
# Import libraries
import urllib.request
from bs4 import BeautifulSoup
# Fetch the page
response = urllib.request.urlopen('https://en.wikipedia.org/wiki/Natural_language_processing')
html_doc = response.read()
# Parsing
soup = BeautifulSoup(html_doc, 'html.parser')
# Formatting the parsed HTML file
strhtm = soup.prettify()
# Print a few lines
print(strhtm[:1000])
#output
<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>
Natural language processing - Wikipedia
</title>
<script>
document.documentElement.className = document.documentElement.className.rep
</script>
<script>
(window.RLQ=window.RLQ||[]).push(function()
{mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"
processing","wgCurRevisionId":860741853,"wgRevisionId":860741853,"wgArticleId"
["*"],"wgCategories":["Webarchive template wayback links","All accuracy disput
identifiers","Natural language processing","Computational linguistics","Speech
print(soup.title)
print(soup.title.string)
print(soup.a.string)
print(soup.b.string)
#output
<title>Natural language processing - Wikipedia</title>
Natural language processing - Wikipedia
None
Natural language processing
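To pull readable text rather than individual tags, here is a minimal sketch that continues with the soup object parsed above.
# Extract the text of the first few paragraphs from the parsed page
for paragraph in soup.find_all('p')[:3]:
    print(paragraph.get_text())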
Recipe 6. Regular expressions
Problem
You want to parse text data using regular expressions.
Solution
The best way is to use the re library in Python.
How It Works
Let’s look at some of the ways we can use regular expressions for our tasks.
The basic flags are I, L, M, S, U, and X (a short example follows this list).
re.I ignores case.
re.L makes certain patterns (such as \w and \b) dependent on the current locale.
re.M finds patterns throughout multiple lines.
re.S makes the dot (.) match any character, including a newline.
re.U works with Unicode data.
re.X allows writing regex in a more readable format.
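A minimal sketch of two of these flags in action; the sample text is an assumption.
import re
# re.I ignores case; re.M makes ^ match at the start of every line
print(re.findall('nlp', 'NLP recipes\nnlp recipes', re.I))
print(re.findall('^nlp', 'NLP recipes\nnlp recipes', re.I | re.M))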
The following describes regular expressions' functionalities; a short example follows the list.
Find a single occurrence of characters a and b: [ab]
Find characters except for a and b: [^ab]
Find the character range of a to z: [a-z]
Find a character range except a to z: [^a-z]
Find all the characters from both a to z and A to Z: [a-zA-Z]
Find any single character (except a newline): .
Find any whitespace character: \s
Find any non-whitespace character: \S
Find any digit: \d
Find any non-digit: \D
Find any non-word character: \W
Find any word character: \w
Find either a or b: (a|b)
The occurrence of a is either zero or one: a? (? matches zero or one occurrence)
The occurrence of a is zero or more times: a* (* matches zero or more occurrences)
The occurrence of a is one or more times: a+ (+ matches one or more occurrences)
Match three simultaneous occurrences of a: a{3}
Match three or more simultaneous occurrences of a: a{3,}
Match three to six simultaneous occurrences of a: a{3,6}
Start of a string: ^
End of a string: $
Match word boundary: \b
Non-word boundary: \B
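A minimal sketch exercising a few of these patterns; the sample string is an assumption.
import re
print(re.findall(r'[a-z]+', 'NLP 2nd edition'))   # lowercase runs: ['nd', 'edition']
print(re.findall(r'\d+', 'NLP 2nd edition'))      # digits: ['2']
print(re.findall(r'^NLP', 'NLP 2nd edition'))     # 'NLP' only if it starts the string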
The re.match() and re.search() functions find patterns, which are then processed according to
the requirements of the application.
Let’s look at the differences between re.match() and re.search().
re.match() checks for a match only at the beginning of the string. So, if it finds a pattern at the beginning of the input string, it returns the matched pattern; otherwise, it returns None.
re.search() checks for a match anywhere in the string. It returns the first occurrence of the pattern in the given input string or data.
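A quick illustration of the difference; the sample string is an assumption.
import re
text = "machine learning and deep learning"
print(re.match("deep", text))     # None, because 'deep' is not at the beginning
print(re.search("deep", text))    # a match object for 'deep' found later in the string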
Now let’s look at a few examples using these regular expressions.
Tokenizing
Tokenizing means splitting a sentence into words. One way to do this is to use re.split.
# Import library
import re
# Run the split query
re.split(r'\s+', 'I like this book.')
['I', 'like', 'this', 'book.']
Extracting Email IDs
The following regular expression extracts basic email IDs from text.
([a-zA-Z0-9+._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9_-]+)
There are even more complex ones to handle all the edge cases (e.g., “.co.in” email IDs). Please give it a
try.
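As a minimal sketch applying the pattern above; the sample sentence is an assumption.
import re
doc = "For more details, please write to xyz.abc@example.com"
addresses = re.findall(r'([a-zA-Z0-9+._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9_-]+)', doc)
print(addresses)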
# Import libraries
import re
import requests
# URL you want to extract
url = 'https://www.gutenberg.org/files/2638/2638-0.txt'
# Function to extract the book text
def get_book(url):
    # Sends an HTTP request to get the text from Project Gutenberg
    raw = requests.get(url).text
    # Discards the metadata from the beginning of the book
    start = re.search(r"\*\*\* START OF THIS PROJECT GUTENBERG EBOOK .* \*\*\*", raw).end()
    # Discards the metadata from the end of the book
    stop = re.search(r"II", raw).start()
    # Keeps the relevant text
    text = raw[start:stop]
    return text
# Preprocessing: keep only letters, digits, and periods, and lowercase the text
def preprocess(sentence):
    return re.sub('[^A-Za-z0-9.]+', ' ', sentence).lower()
# Calling the above functions
book = get_book(url)
processed_book = preprocess(book)
print(processed_book)
# Output
produced by martin adamson david widger with corrections by andrew sly
the idiot by fyodor dostoyevsky translated by eva martin part i i. towards
the end of november during a thaw at nine o clock one morning a train on
the warsaw and petersburg railway was approaching the latter city at full
speed. the morning was so damp and misty that it was only with great
difficulty that the day succeeded in breaking and it was impossible to
distinguish anything more than a few yards away from the carriage windows.
some of the passengers by this particular train were returning from abroad
but the third class carriages were the best filled chiefly with
insignificant persons of various occupations and degrees picked up at the
different stations nearer town. all of them seemed weary and most of them
had sleepy eyes and a shivering expression while their complexions
generally appeared to have taken on the colour of the fog outside. when da
2. Perform an exploratory data analysis on this data using regex.
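As one hedged sketch of such an analysis, using the processed_book string produced above, you could count how often a given word appears.
# Count how many times the word 'the' occurs in the processed text
print(len(re.findall(r'\bthe\b', processed_book)))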
Recipe 7. String handling
Problem
You want to explore handling strings.
Solution
The simplest way is to use the following string functionality.
s.find(t) is an index of the first instance of string t inside s (–1 if not found)
s.rfind(t) is an index of the last instance of string t inside s (–1 if not found)
s.index(t) is like s.find(t) except it raises ValueError if not found
s.rindex(t) is like s.rfind(t) except it raises ValueError if not found
s.join(text) combines the words of the text into a string using s as the glue
s.split(t) splits s into a list wherever a t is found (whitespace by default)
s.splitlines() splits s into a list of strings, one per line
s.lower() is a lowercase version of the string s
s.upper() is an uppercase version of the string s
s.title() is a titlecased version of the string s
s.strip() is a copy of s without leading or trailing whitespace
s.replace(t, u) replaces instances of t with u inside s
How It Works
Now let’s look at a few of the examples.
Replacing Content
Create a string and replace its content. Creating a string is easy: enclose the characters in single or double quotes. To replace content, use the replace function; a short replacement sketch follows the example below.
1. Create a string.
s1 = "nlp"
s2 = "machine learning"
s3 = s1+s2
print(s3)
#output
'nlpmachine learning'
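The replacement step itself is not shown above. Here is a minimal sketch of replace() on an assumed sample string.
# 2. Replace content in a string
s = "I like nlp"
print(s.replace("nlp", "natural language processing"))
#output
I like natural language processing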
Recipe 8. Web scraping
Caution Before scraping any websites, blogs, or ecommerce sites, please make sure you read the site's terms and conditions to check whether it permits data scraping. Generally, robots.txt contains the terms and conditions (e.g., see www.alixpartners.com/robots.txt), and a sitemap contains a URL map (e.g., see www.alixpartners.com/sitemap.xml).
Web scraping is also known as web harvesting and web data extraction. It is a technique to extract a large
amount of data from websites and save it in a database or locally. You can use this data to extract
information related to your customers, users, or products for the business’s benefit.
A basic understanding of HTML is a prerequisite.
Problem
You want to extract data from the web by scraping. Let’s use IMDB.com as an example of scraping top
movies.
Solution
The simplest way to do this is by using Python’s Beautiful Soup or Scrapy libraries. Let’s use Beautiful Soup
in this recipe.
How It Works
Follow the steps in this section to extract data from the web.
Step 8-4. Request the URL and download the content using Beautiful Soup
# Import libraries
import requests
from bs4 import BeautifulSoup
# 'url' is assumed to have been defined in the earlier steps (the page you want to scrape)
result = requests.get(url)
c = result.content
soup = BeautifulSoup(c, "lxml")
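A minimal sketch of inspecting what was downloaded, without assuming anything about the page's exact markup.
# Print the page title and the first few link targets found in the content
print(soup.title.string)
for link in soup.find_all("a")[:5]:
    print(link.get("href"))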