Python NLP
This section covers some tools to process and work with text in Python.
!python -m textblob.download_corpora
WordList(['beautiful day'])
# Spelling correction
>>> from textblob import TextBlob
>>> text = "Today is a beutiful day"
>>> blob = TextBlob(text)
>>> blob.correct()
Link to TextBlob.
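TextBlob's documentation notes that its spelling correction is based on Peter Norvig's "How to Write a Spelling Corrector". To show the idea behind correct(), here is a minimal pure-Python sketch. It is not TextBlob's actual code: the vocabulary WORD_COUNTS and its counts are made up for illustration, and the helpers edits1, correct_word, and correct_text are hypothetical names.

```python
import string

# Hypothetical tiny vocabulary with made-up counts; a real corrector
# would learn these counts from a large corpus.
WORD_COUNTS = {"today": 50, "is": 200, "a": 300, "beautiful": 20, "day": 80}


def edits1(word):
    """Return every string one edit (delete, swap, replace, insert) away."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    swaps = [
        left + right[1] + right[0] + right[2:]
        for left, right in splits
        if len(right) > 1
    ]
    replaces = [left + c + right[1:] for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + swaps + replaces + inserts)


def correct_word(word):
    """Keep known words; otherwise pick the most frequent known word one edit away."""
    lower = word.lower()
    if lower in WORD_COUNTS:
        return word
    candidates = [w for w in edits1(lower) if w in WORD_COUNTS]
    if not candidates:
        return word  # nothing close enough, leave the word untouched
    return max(candidates, key=WORD_COUNTS.get)


def correct_text(text):
    """Correct a sentence word by word, splitting on whitespace."""
    return " ".join(correct_word(token) for token in text.split())


corrected = correct_text("Today is a beutiful day")
print(corrected)
```

The sketch only looks one edit away and ignores punctuation, so treat it as an illustration of the approach rather than a replacement for blob.correct().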
If you want to summarize text using Python or the command line, try sumy.
Below is how sumy summarizes the article "How to Learn Data Science (Step-By-Step) in
2020" at DataQuest.
Link to Sumy.
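If you are curious what an extractive summarizer does under the hood, here is a toy pure-Python sketch. It is not one of sumy's algorithms: it simply scores each sentence by the average frequency of its words and keeps the top ones. The function summarize, the STOP_WORDS set, and the sample text are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list for the demo.
STOP_WORDS = {"a", "an", "and", "are", "is", "of", "the", "to", "with"}


def content_words(text):
    """Lowercase the text and keep alphabetic tokens that are not stop words."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]


def summarize(text, n_sentences=1):
    """Return the n highest-scoring sentences, in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(content_words(text))

    def score(sentence):
        tokens = content_words(sentence)
        # Average word frequency, so long sentences are not favored automatically.
        return sum(freq[w] for w in tokens) / max(len(tokens), 1)

    top = sorted(sentences, key=score, reverse=True)[:n_sentences]
    return [s for s in sentences if s in top]


sample = (
    "Data science needs practice. "
    "Practice with real data builds data skills. "
    "Cats are cute."
)
print(summarize(sample, n_sentences=1))
```

Libraries such as sumy use more careful sentence splitting and stronger ranking methods, so use this only to build intuition.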
spacy_streamlit: Create a Web App to Visualize Your Text in 3 Lines of Code
!pip install spacy-streamlit
If you want to quickly create an app to visualize the structure of a text, try
spacy_streamlit.
To understand how to use spacy_streamlit, we add the code below to a file called
streamlit_app.py:
# streamlit_app.py
import spacy_streamlit
models = ['en_core_web_sm']
text = "Today is a beautiful day"
spacy_streamlit.visualize(models, text)
Run the app from your terminal with streamlit run streamlit_app.py, then click the URL that Streamlit prints. You should see something like below:
Link to spacy-streamlit.
textacy: Extract a Contiguous Sequence of 2 Words
!pip install spacy textacy
If you want to extract a contiguous sequence of 2 words, for example, 'data science'
rather than just 'data', what should you do? That is when extracting n-grams from text
becomes useful.
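To make the idea concrete, here is a minimal pure-Python sketch of n-gram extraction. It is only an illustration: the helper extract_ngrams is a hypothetical name, it treats runs of letters as words, and it does no stop-word filtering.

```python
import re


def extract_ngrams(text, n=2):
    """Return every contiguous sequence of n words in the text."""
    tokens = re.findall(r"[A-Za-z]+", text)
    # Slide a window of size n over the tokens.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


bigrams = extract_ngrams("Data science uses data", n=2)
print(bigrams)  # ['Data science', 'science uses', 'uses data']
```

Changing n to 3 gives trigrams, and a text with fewer than n words gives an empty list.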
A really useful tool for extracting n-grams with a specified number of words in the
sequence is textacy.
import pandas as pd
import spacy
from textacy.extract import ngrams
nlp = spacy.load('en_core_web_sm')
Data science 1
disciplinary field 1
uses scientific 1
scientific methods 1
extract knowledge 1
unstructured data 1
dtype: int64
Link to textacy
Convert Number to Words
If a text contains both the number 105 and the words ‘one hundred and five’, they should
carry the same meaning. How can we map 105 to ‘one hundred and five’? There is a
Python library called num2words that converts numbers to words.
>>> from num2words import num2words
>>> num2words(105)
'one hundred and five'
The library can also generate ordinal numbers and supports multiple languages!
>>> num2words(105, lang='es')
'ciento cinco'
Link to num2words.
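To see what mapping 105 to 'one hundred and five' involves, here is a small pure-Python sketch for the numbers 0 to 999. It is not how num2words is implemented, and number_to_words is a hypothetical helper written only for this illustration.

```python
ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"]


def number_to_words(n):
    """Spell out an integer between 0 and 999 in English."""
    if not 0 <= n <= 999:
        raise ValueError("this sketch only handles 0 to 999")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    hundreds, rest = divmod(n, 100)
    words = ONES[hundreds] + " hundred"
    if rest:
        # Reuse the same function for the part below one hundred.
        words += " and " + number_to_words(rest)
    return words


print(number_to_words(105))  # one hundred and five
```

A library is still the better choice in practice, since it handles larger numbers, ordinals, and other languages.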
texthero.clean: Preprocess Text in One Line of Code
!pip install texthero
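If you want to preprocess text in one line of code, try texthero. By default, the
texthero.clean method fills missing values with empty strings, lowercases the text, and
removes digits, punctuation, diacritics, stop words, and extra whitespace: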
import numpy as np
import pandas as pd
import texthero as hero
df = pd.DataFrame(
{
"text": [
"Today is a beautiful day",
"There are 3 ducks in this pond",
"This is. very cool.",
np.nan,
]
}
)
df.text.pipe(hero.clean)
0 today beautiful day
1 ducks pond
2 cool
3
Name: text, dtype: object
Texthero also provides other useful methods to process and visualize text.
Link to texthero.
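To get a feel for what such a cleaning pipeline does, here is a rough pure-Python sketch that mimics the steps texthero documents for its default clean pipeline (fill missing values, lowercase, drop digits, punctuation, diacritics, and stop words, then normalize whitespace). It is not texthero's code: the clean function below and its tiny STOP_WORDS set are assumptions made for this example.

```python
import re
import string
import unicodedata

# Hypothetical, tiny stop-word list; texthero relies on a much larger one.
STOP_WORDS = {"a", "are", "in", "is", "there", "this", "very"}
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean(text):
    """Apply the cleaning steps one after another to a single value."""
    if not isinstance(text, str):  # missing values such as NaN become ""
        text = ""
    text = text.lower()
    text = re.sub(r"\b\d+\b", " ", text)  # remove standalone blocks of digits
    text = text.translate(PUNCT_TO_SPACE)  # replace punctuation with spaces
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))  # strip accents
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return " ".join(tokens)  # joining on single spaces normalizes whitespace


texts = [
    "Today is a beautiful day",
    "There are 3 ducks in this pond",
    "This is. very cool.",
    float("nan"),
]
cleaned = [clean(t) for t in texts]
print(cleaned)
```

With this small stop-word list the sketch reproduces the output shown for the four sample rows, but results on other text will differ from texthero's.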
wordfreq: Estimate the Frequency of a Word in 36 Languages
!pip install wordfreq
If you want to look up the frequency of a certain word in your language, try wordfreq.
wordfreq supports 36 languages. wordfreq even covers words that appear at least once per
10 million words.
0.000135
0.0537
Link to wordfreq.
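A frequency such as 0.000135 is the share of all words that are this word, which can be hard to read at a glance. The short sketch below converts a frequency to occurrences per million words and to the Zipf scale that wordfreq's documentation describes (the base-10 logarithm of occurrences per billion words). The helper names per_million and zipf_scale are hypothetical.

```python
import math


def per_million(frequency):
    """Convert a word frequency (a proportion) to occurrences per million words."""
    return frequency * 1_000_000


def zipf_scale(frequency):
    """Base-10 logarithm of occurrences per billion words."""
    return math.log10(frequency) + 9


for freq in (0.000135, 0.0537):
    print(f"{freq}: about {per_million(freq):.0f} per million words, "
          f"Zipf {zipf_scale(freq):.2f}")
```

On this scale, a word that appears once per 10 million words has a frequency of 1e-7 and a Zipf value of about 2.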
newspaper3k: Extract Meaningful Information From an Article in 2 Lines of Code
!pip install newspaper3k nltk
If you want to quickly extract meaningful information from an article in a few lines of
code, try newspaper3k.
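>>> import nltk
>>> from newspaper import Article
>>> nltk.download("punkt")
>>> # url holds the address of the article to analyze
>>> article = Article(url)
>>> article.download()
>>> article.parse()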
>>> article.title
>>> article.publish_date
datetime.datetime(2020, 5, 4, 7, 1, tzinfo=tzutc())
>>> article.top_image
'https://www.dataquest.io/wp-content/uploads/2020/05/learn-data-science.jpg'
>>> article.nlp()
>>> article.summary
>>> article.keywords
['scientists',
'guide',
'learning',
'youre',
'science',
'work',
'skills',
'youll',
'data',
'learn',
'stepbystep',
'need']
Link to newspaper3k.
Questgen.ai: Question Generator in Python
It can be time-consuming to generate questions for a document. Wouldn't it be nice if you
could generate questions automatically using Python? That is when Questgen.ai comes in
handy.
With a few lines of code, the questions for your document are automatically generated.
from pprint import pprint
from Questgen import main

payload = {
"input_text": """The weather today was nice so I went for
a walk. I stopped for a quick chat with my neighbor.
It turned out that my neighbor just got a dog named
Pepper. It is a black Labrador Retriever."""
}

# Generate yes/no (boolean) questions
qe = main.BoolQGen()
output = qe.predict_boolq(payload)
pprint(output)

# Generate short-answer questions
qg = main.QGen()
output = qg.predict_shortq(payload)
pprint(output)
Link to Questgen.ai.