
The Natural Language Toolkit (NLTK)
Natural Language Processing (NLP)
• How can we make a computer understand language?
– Can a human write/talk to the computer?
• Or can the computer guess/predict the input?
– Can the computer talk back?
– Based on language rules, patterns, or statistics
• For now, statistical approaches are the more accurate and popular choice

2
Some areas of NLP
• shallow processing – the surface level
– tokenization
– part-of-speech tagging
– forms of words
• deep processing – the underlying structures of language
– word order (syntax)
– meaning
– translation
• natural language generation

3
The NLTK
• A collection of:
– Python functions and objects for accomplishing NLP tasks
– sample texts (corpora)
• Available at: http://nltk.sourceforge.net
– Requires Python 2.4 or higher
– Click 'Download' and follow instructions for your OS

4
Tokenization
• Say we want to know the words in Marty's vocabulary
– "You know what I hate? Anybody who drives an S.U.V. I'd really
like to find Mr. It-Costs-Me-100-Dollars-To-Gas-Up and kick him
square in the teeth. Booyah. Be like, I'm Marty Stepp, the best
ever. Booyah!"
• How do we split his speech into tokens?

5
Tokenization (cont.)
• How do we split his speech into tokens?

>>> martysSpeech.split()
['You', 'know', 'what', 'I', 'hate?', 'Anybody',
'who', 'drives', 'an', 'S.U.V.', "I'd", 'really',
'like', 'to', 'find', 'Mr.',
'It-Costs-Me-100-Dollars-To-Gas-Up', 'and',
'kick', 'him', 'square', 'in', 'the', 'teeth.',
'Booyah.', 'Be', 'like,', "I'm", 'Marty',
'Stepp,', 'the', 'best', 'ever.', 'Booyah!']

• Now, how often does he use the word "booyah"?

>>> martysSpeech.split().count("booyah")
0
>>> # What the!

6
Tokenization (cont.)
• We could lowercase the speech
• We could write our own method to split on ".", split on ",",
split on "-", etc.
• The NLTK already has several tokenizer options
• Try:
• nltk.tokenize.WordPunctTokenizer
– tokenizes on all punctuation
• nltk.tokenize.PunktWordTokenizer
– a trained algorithm that splits off words statistically
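
• For example, a minimal sketch using WordPunctTokenizer on a
lowercased copy of the speech (assuming martysSpeech holds the
text from the earlier slide):

>>> import nltk
>>> tokenizer = nltk.tokenize.WordPunctTokenizer()
>>> tokens = tokenizer.tokenize(martysSpeech.lower())
>>> tokens.count("booyah")
2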

7
Part-of-speech (POS) tagging
• If you know a token's POS, you can tell:
– is it the subject?
– is it the verb?
– is it introducing a grammatical structure?
– is it a proper name?

8
Part-of-speech (POS) tagging
• Exercise: what is the most frequent proper noun in the Penn Treebank?
– Try:
• nltk.corpus.treebank
• Python's dir() to list attributes of an object
– Example:

>>> dir("hello world!")
[..., 'capitalize', 'center', 'count',
'decode', 'encode', 'endswith', 'expandtabs',
'find', 'index', 'isalnum', 'isalpha',
'isdigit', 'islower', 'isspace', 'istitle',
'isupper', 'join', 'ljust', 'lower', ...]
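
• One possible solution, sketched below (this assumes the treebank
corpus is installed; NNP is the Penn Treebank tag for singular
proper nouns):

>>> counts = {}
>>> for word, tag in nltk.corpus.treebank.tagged_words():
...     if tag == 'NNP':
...         counts[word] = counts.get(word, 0) + 1
...
>>> max(counts, key=counts.get)  # the most frequent NNP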

9
Tuples
• tagged_words() gives us a list of tuples
– tuple: like a list, but it can't be changed (immutable)
– in this case, the tuples are (word, tag) pairs

>>> # Get the (word, tag) pair at list index 0
...
>>> pair = nltk.corpus.treebank.tagged_words()[0]
>>> pair
('Pierre', 'NNP')
>>> word = pair[0]
>>> tag = pair[1]
>>> print word, tag
Pierre NNP
>>> word, tag = pair # or unpack in 1 line!
>>> print word, tag
Pierre NNP

10
POS tagging (cont.)
• How do we tag plain sentences?
– An NLTK tagger needs a list of tagged sentences to train on
• We'll use nltk.corpus.treebank.tagged_sents()
– Then it is ready to tag any input! (but how well?)

– Try these tagger objects:


• nltk.UnigramTagger(tagged_sentences)
• nltk.TrigramTagger(tagged_sentences)
– Call the tagger's tag(tokens) method

>>> tagged_sentences = nltk.corpus.treebank.tagged_sents()
>>> tagger = nltk.UnigramTagger(tagged_sentences)
>>> result = tagger.tag(tokens)
>>> result
[('You', 'PRP'), ('know', 'VB'), ('what', 'WP'),
('I', 'PRP'), ('hate', None), ('?', '.'), ...]
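
– The trigram tagger trains the same way. As a sketch of one common
option, NLTK taggers also accept a backoff tagger, so tokens whose
trigram context was never seen fall back to a simpler model instead
of getting None (the output shown here is illustrative):

>>> unigram = nltk.UnigramTagger(tagged_sentences)
>>> trigram = nltk.TrigramTagger(tagged_sentences, backoff=unigram)
>>> trigram.tag(tokens)[:4]
[('You', 'PRP'), ('know', 'VB'), ('what', 'WP'), ('I', 'PRP')]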

11
Parsing
• Syntax is as important for a compiler as it is for natural language
• Recovering the hidden structure of a sentence is useful for:
– translation
– meaning analysis
– relationship analysis
– a cool demo!
• Try:
– nltk.draw.rdparser.demo()
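
• Beyond the demo, a tiny hand-written grammar shows the idea. A
minimal sketch (nltk.CFG.fromstring and nltk.ChartParser are the
NLTK 3 spellings; older releases used different names):

>>> grammar = nltk.CFG.fromstring("""
... S -> NP VP
... VP -> V NP
... NP -> 'Marty' | 'him'
... V -> 'kicks'
... """)
>>> parser = nltk.ChartParser(grammar)
>>> for tree in parser.parse(['Marty', 'kicks', 'him']):
...     print(tree)
...
(S (NP Marty) (VP (V kicks) (NP him)))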

12
Conclusion
• NLTK: NLP made easy with Python
– Functions and objects for:
• tokenization, tagging, generation, parsing, ...
• and much more!
– Even armed with these tools, NLP has a lot of difficult problems!
• Also saw:
– List methods
– dir()
– Tuples

13
Python scikit-learn
• Popular machine learning toolkit in Python: http://scikit-learn.org/stable/
• Requirements
– Anaconda, available from https://www.continuum.io/downloads
– Includes numpy, scipy, and scikit-learn (the former two are
required by scikit-learn)

14
SciKit
Many popular Python toolboxes/libraries:
– NumPy
– SciPy
– Pandas
– SciKit-Learn

Visualization libraries:
– matplotlib
– Seaborn

and many more …

All these libraries are installed on the SCC.

15
Python Libraries for Data Science
SciPy:
▪ collection of algorithms for linear algebra, differential
equations, numerical integration, optimization, statistics and
more

▪ part of SciPy Stack

▪ built on NumPy

Link: https://www.scipy.org/scipylib/
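
A minimal sketch of the kind of routine SciPy provides
(scipy.integrate.quad numerically integrates a function):

import scipy.integrate as integrate

# numerically integrate x**2 from 0 to 1; the exact answer is 1/3
result, error = integrate.quad(lambda x: x ** 2, 0, 1)
print(result)  # approximately 0.33333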
16
Python Libraries for Data Science
SciKit-Learn:
▪ provides machine learning algorithms: classification,
regression, clustering, model validation etc.

▪ built on NumPy, SciPy and matplotlib

Link: http://scikit-learn.org/
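
A small sketch of the API (the estimator and dataset names below
are standard scikit-learn classes and bundled sample data):

from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

# load a small built-in dataset and fit a k-nearest-neighbors classifier
iris = load_iris()
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(iris.data, iris.target)

# predict classes for the first two samples
print(clf.predict(iris.data[:2]))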
17
Python Libraries for Data Science
matplotlib:
▪ Python 2D plotting library that produces publication-quality
figures in a variety of hardcopy formats

▪ a set of functionalities similar to those of MATLAB

▪ line plots, scatter plots, bar charts, histograms, pie charts, etc.

▪ relatively low-level; some effort is needed to create advanced
visualizations
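
A minimal sketch of a matplotlib line plot (saving to a file, since
no display is assumed):

import matplotlib.pyplot as plt

# plot y = x**2 for a few points and save the figure
x = [0, 1, 2, 3, 4]
y = [xi ** 2 for xi in x]
plt.plot(x, y, marker='o')
plt.xlabel('x')
plt.ylabel('x squared')
plt.savefig('squares.png')  # use plt.show() in an interactive session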
18
Python Libraries for Data Science
Seaborn:
▪ based on matplotlib

▪ provides a high-level interface for drawing attractive statistical
graphics

▪ similar (in style) to the popular ggplot2 library in R

Link: https://seaborn.pydata.org/
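
A small sketch (load_dataset fetches a bundled sample dataset;
histplot assumes seaborn 0.11 or newer, where older versions used
distplot instead):

import seaborn as sns
import matplotlib.pyplot as plt

# load a bundled sample dataset and draw a histogram of the bill amounts
tips = sns.load_dataset('tips')
sns.histplot(data=tips, x='total_bill')
plt.savefig('tips_hist.png')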
19
Log in to the Shared Computing Cluster
• Use your SCC login information if you have an SCC account

• If you are using a tutorial account, see the info on the blackboard

Note: Your password will not be displayed while you enter it.

20
Selecting Python Version on the SCC
# view available python versions on the SCC

[scc1 ~] module avail python

# load python 3 version

[scc1 ~] module load python/3.6.2

21
Start Jupyter notebook
# On the Shared Computing Cluster
[scc1 ~] jupyter notebook

22
Loading Python Libraries

In [ ]: # Import Python Libraries
import numpy as np
import scipy as sp
import pandas as pd
import matplotlib as mpl
import seaborn as sns

Press Shift+Enter to execute the Jupyter cell.

23
