Natural Language Toolkit NLTK PDF

The document discusses the Natural Language Toolkit (NLTK), a popular Python package for natural language processing. It provides functions and objects for common NLP tasks like tokenization, part-of-speech tagging, parsing, and more. These allow programmers to preprocess and analyze human language in an automated way. The document also introduces scikit-learn, a widely used machine learning library in Python, and discusses how to access these tools on a shared computing cluster.

Uploaded by

Sam

The Natural Language Toolkit (NLTK)
Natural Language Processing (NLP)
• How can we make a computer understand language?
– Can a human write/talk to the computer?
• Or can the computer guess/predict the input?
– Can the computer talk back?
– Based on language rules, patterns, or statistics
• For now, statistics are more accurate and popular

Some areas of NLP
• shallow processing – the surface level
– tokenization
– part-of-speech tagging
– forms of words
• deep processing – the underlying structures of language
– word order (syntax)
– meaning
– translation
• natural language generation

The NLTK
• A collection of:
– Python functions and objects for accomplishing NLP tasks
– sample texts (corpora)
• Available at: http://nltk.sourceforge.net
– Requires Python 2.4 or higher
– Click 'Download' and follow instructions for your OS

Tokenization
• Say we want to know the words in Marty's vocabulary
– "You know what I hate? Anybody who drives an S.U.V. I'd really
like to find Mr. It-Costs-Me-100-Dollars-To-Gas-Up and kick him
square in the teeth. Booyah. Be like, I'm Marty Stepp, the best
ever. Booyah!"
• How do we split his speech into tokens?

Tokenization (cont.)
• How do we split his speech into tokens?

>>> martysSpeech.split()
['You', 'know', 'what', 'I', 'hate?', 'Anybody',
'who', 'drives', 'an', 'S.U.V.', "I'd", 'really',
'like', 'to', 'find', 'Mr.',
'It-Costs-Me-100-Dollars-To-Gas-Up', 'and', 'kick',
'him', 'square', 'in', 'the', 'teeth.', 'Booyah.',
'Be', 'like,', "I'm", 'Marty', 'Stepp,', 'the',
'best', 'ever.', 'Booyah!']

• Now, how often does he use the word "booyah"?

>>> martysSpeech.split().count("booyah")
0
>>> # What the!

Tokenization (cont.)
• We could lowercase the speech
• We could write our own method to split on ".", split on ",", split on "-", etc.
• The NLTK already has several tokenizer options
• Try:
• nltk.tokenize.WordPunctTokenizer
– tokenizes on all punctuation
• nltk.tokenize.PunktWordTokenizer
– trained algorithm to statistically split on words
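
The first two ideas above (lowercasing, then splitting on punctuation ourselves) can be sketched with the standard library alone. This is a minimal illustration, not NLTK's implementation: the helper name `simple_tokenize` is hypothetical, and the regular expression is assumed here as a rough stand-in for a tokenizer that splits on all punctuation.

```python
import re

# The quote from the tokenization slide.
martys_speech = (
    "You know what I hate? Anybody who drives an S.U.V. I'd really "
    "like to find Mr. It-Costs-Me-100-Dollars-To-Gas-Up and kick him "
    "square in the teeth. Booyah. Be like, I'm Marty Stepp, the best "
    "ever. Booyah!"
)


def simple_tokenize(text):
    """Lowercase the text, then split it into runs of word characters
    and runs of punctuation (hypothetical helper, not an NLTK function)."""
    return re.findall(r"\w+|[^\w\s]+", text.lower())


tokens = simple_tokenize(martys_speech)
booyah_count = tokens.count("booyah")

print(tokens[:6])
print(booyah_count)
```

With punctuation separated and case folded, "booyah" is now counted. The cost is that abbreviations such as "S.U.V." are broken into single letters and periods, which is the kind of case a trained tokenizer is meant to handle better.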

Part-of-speech (POS) tagging
• If you know a token's POS you know:
– is it the subject?
– is it the verb?
– is it introducing a grammatical structure?
– is it a proper name?

Part-of-speech (POS) tagging
• Exercise: most frequent proper noun in the Penn Treebank?
– Try:
• nltk.corpus.treebank
• Python's dir() to list attributes of an object
– Example:

>>> dir("hello world!")

[..., 'capitalize', 'center', 'count',
'decode', 'encode', 'endswith', 'expandtabs',
'find', 'index', 'isalnum', 'isalpha',
'isdigit', 'islower', 'isspace', 'istitle',
'isupper', 'join', 'ljust', 'lower', ...]
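
As a small sketch of how dir() helps when exploring an unfamiliar object, the snippet below filters out the double-underscore names so only the everyday methods remain. It assumes Python 3, so the exact names differ slightly from the listing above; the variable names are illustrative.

```python
# dir() returns a sorted list of attribute names for any object.
names = dir("hello world!")

# Names wrapped in double underscores are special methods; drop them
# to see the ordinary string methods.
public_names = [name for name in names if not name.startswith("__")]
print(public_names[:5])

# The same trick works on types and modules, e.g. the built-in list type.
list_methods = [name for name in dir(list) if not name.startswith("_")]
print(list_methods)
```

The same call can be pointed at `nltk.corpus.treebank` to discover which corpus access methods are available.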

Tuples
• tagged_words() gives us a list of tuples
– tuple: like a list, but immutable (you can't change it once it is created)
– in this case, the tuples are (word, tag) pairs
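
A minimal sketch of working with (word, tag) tuples follows. The `tagged_words` list is hand-made toy data standing in for what the corpus's tagged_words() would return, so the result says nothing about the real Penn Treebank; it only shows the shape of a solution to the proper-noun exercise.

```python
from collections import Counter

# Hand-made (word, tag) pairs standing in for a tagged corpus.
# NNP is the Penn Treebank tag for a singular proper noun.
tagged_words = [
    ("Marty", "NNP"), ("teaches", "VBZ"), ("Python", "NNP"), (".", "."),
    ("Marty", "NNP"), ("likes", "VBZ"), ("tuples", "NNS"), (".", "."),
]

# Get the (word, tag) pair at list index 0 and unpack it.
word, tag = tagged_words[0]
print(word, tag)

# Tuples are immutable: item assignment raises TypeError.
try:
    tagged_words[0][1] = "NN"
except TypeError as error:
    print("cannot modify a tuple:", error)

# Count only the words tagged as proper nouns.
proper_nouns = Counter(w for w, t in tagged_words if t == "NNP")
most_common_word, count = proper_nouns.most_common(1)[0]
print(most_common_word, count)
```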

POS tagging (cont.)
• How do we tag plain sentences?
– An NLTK tagger needs a list of tagged sentences to train on
• We'll use nltk.corpus.treebank.tagged_sents()
– Then it is ready to tag any input! (but how well?)

– Try these tagger objects:

• nltk.UnigramTagger(tagged_sentences)
• nltk.TrigramTagger(tagged_sentences)
– Call the tagger's tag(tokens) method

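>>> import nltk
>>> # Setup assumed here: train on the Treebank sample, tokenize the speech
>>> tagged_sentences = nltk.corpus.treebank.tagged_sents()
>>> tokens = nltk.tokenize.WordPunctTokenizer().tokenize(martysSpeech)
>>> tagger = nltk.UnigramTagger(tagged_sentences)
>>> result = tagger.tag(tokens)
>>> result
[('You', 'PRP'), ('know', 'VB'), ('what', 'WP'),
('I', 'PRP'), ('hate', None), ('?', '.'), ...]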
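
To make the idea behind a unigram tagger concrete, here is a stdlib-only sketch: each word receives the tag it carried most often in training, and unseen words receive None. This is not NLTK's implementation; `train_unigram`, `tag`, and the three training sentences are all hypothetical and chosen only to mirror the output above.

```python
from collections import Counter, defaultdict

# Tiny hand-made training set in the same shape as tagged sentences:
# a list of sentences, each a list of (word, tag) tuples.
tagged_sentences = [
    [("You", "PRP"), ("know", "VB"), ("what", "WP"),
     ("I", "PRP"), ("like", "VBP"), ("?", ".")],
    [("I", "PRP"), ("know", "VBP"), ("you", "PRP"), (".", ".")],
    [("You", "PRP"), ("know", "VB"), (".", ".")],
]


def train_unigram(sentences):
    """Map each word to the tag it carried most often in training."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        for word, tag_name in sentence:
            counts[word][tag_name] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag(model, tokens):
    """Tag each token; words never seen in training get None."""
    return [(token, model.get(token)) for token in tokens]


model = train_unigram(tagged_sentences)
result = tag(model, ["You", "know", "what", "I", "hate", "?"])
print(result)
```

The word "hate" comes back as None because it never appears in the training data, which is the same reason it is untagged in the session above. A common remedy is to fall back to a simpler tagger for unknown words; NLTK's taggers accept a `backoff` argument for this purpose.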

Parsing
• Syntax is as important for a compiler as it is for natural
language
• Realizing the hidden structure of a sentence is useful for:
– translation
– meaning analysis
– relationship analysis
– a cool demo!
• Try:
– nltk.draw.rdparser.demo()
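
The demo's name refers to recursive descent parsing. As a rough sketch of that technique, and not the code behind the NLTK demo, the following stdlib-only parser expands a toy grammar top-down and backtracks over alternatives. The grammar, the function names, and the nested-tuple tree format are all assumptions made for illustration.

```python
# Toy context-free grammar: each non-terminal maps to a list of
# productions; anything that is not a key is treated as a literal word.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["Det", "N"], ["Pron"]],
    "VP": [["V", "NP"], ["V"]],
    "Det": [["the"], ["an"]],
    "N": [["dog"], ["cat"]],
    "Pron": [["I"]],
    "V": [["saw"], ["chased"]],
}


def parse(symbol, tokens, pos):
    """Yield (tree, next_pos) for every way symbol can match tokens[pos:]."""
    if symbol not in GRAMMAR:  # terminal: must equal the next token
        if pos < len(tokens) and tokens[pos] == symbol:
            yield symbol, pos + 1
        return
    for production in GRAMMAR[symbol]:
        yield from expand(production, tokens, pos, (symbol,))


def expand(production, tokens, pos, tree):
    """Match the symbols of one production left to right, growing tree."""
    if not production:
        yield tree, pos
        return
    first, rest = production[0], production[1:]
    for subtree, next_pos in parse(first, tokens, pos):
        yield from expand(rest, tokens, next_pos, tree + (subtree,))


def parse_sentence(tokens):
    """Return every parse tree of S that consumes all the tokens."""
    return [tree for tree, end in parse("S", tokens, 0) if end == len(tokens)]


trees = parse_sentence("the dog chased the cat".split())
print(trees)
```

Each tree is a nested tuple whose first element is the category label. The sketch only terminates because the toy grammar has no left-recursive rules; a rule such as NP -> NP PP would make this naive strategy loop forever.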

Conclusion
• NLTK: NLP made easy with Python
– Functions and objects for:
• tokenization, tagging, generation, parsing, ...
• and much more!
– Even armed with these tools, NLP has a lot of difficult problems!
• Also saw:
– List methods
– dir()
– Tuples

Python scikit-learn
• Popular machine learning toolkit in Python: http://scikit-learn.org/stable/
• Requirements
– Anaconda
– Available from https://www.continuum.io/downloads
– Includes numpy, scipy, and scikit-learn (former two are
necessary for scikit-learn)

SciKit
Many popular Python toolboxes/libraries:
– NumPy
– SciPy
– Pandas
– SciKit-Learn

Visualization libraries
– matplotlib
– Seaborn

and many more …

All these libraries are installed on the SCC

Python Libraries for Data Science
SciPy:
▪ collection of algorithms for linear algebra, differential
equations, numerical integration, optimization, statistics and
more

▪ part of SciPy Stack

▪ built on NumPy

Link: https://www.scipy.org/scipylib/

Python Libraries for Data Science
SciKit-Learn:
▪ provides machine learning algorithms: classification,
regression, clustering, model validation etc.

▪ built on NumPy, SciPy and matplotlib

Link: http://scikit-learn.org/

Python Libraries for Data Science
matplotlib:
▪ Python 2D plotting library that produces publication-quality
figures in a variety of hardcopy formats

▪ a set of functionalities similar to those of MATLAB

▪ line plots, scatter plots, barcharts, histograms, pie charts etc.

▪ relatively low-level; some effort needed to create advanced visualization

Python Libraries for Data Science
Seaborn:
▪ based on matplotlib

▪ provides a high-level interface for drawing attractive statistical graphics

▪ Similar (in style) to the popular ggplot2 library in R

Link: https://seaborn.pydata.org/

Log in to the Shared Computing Cluster
• Use your SCC login information if you have an SCC account

• If you are using a tutorial account, see the info on the blackboard

Note: Your password will not be displayed while you enter it.

Selecting Python Version on the SCC
# view available python versions on the SCC

[scc1 ~] module avail python

# load python 3 version

[scc1 ~] module load python/3.6.2

Start Jupyter notebook
# On the Shared Computing Cluster
[scc1 ~] jupyter notebook

Loading Python Libraries

In [ ]: # Import Python libraries
        import numpy as np
        import scipy as sp
        import pandas as pd
        import matplotlib as mpl
        import seaborn as sns

Press Shift+Enter to execute the jupyter cell
