
Natural Language Processing using PYTHON

(with NLTK, scikit-learn and Stanford NLP APIs)

Instructor: Diptesh Kanojia, Abhijit Mishra


Supervisor: Prof. Pushpak Bhattacharyya
Center for Indian Language Technology
Department of Computer Science and Engineering
Indian Institute of Technology Bombay
email: {diptesh,abhijitmishra,pb}@cse.iitb.ac.in
URL: http://www.cse.iitb.ac.in/~{diptesh,abhijitmishra,pb}

VIVA Institute of Technology, 2016 Diptesh, Abhijit http://www.cfilt.iitb.ac.in


Roadmap
Session 1 (Introduction to NLP, Shallow Parsing and Deep Parsing)
Introduction to python and NLTK
Text Tokenization, POS tagging and chunking using NLTK.
Constituency and Dependency Parsing using NLTK and Stanford Parser
Session 2 (Named Entity Recognition, Coreference Resolution)
NER using NLTK
Coreference Resolution using NLTK and Stanford CoreNLP tool
Session 3 (Meaning Extraction, Deep Learning)
WordNets and WordNet-API
Other Lexical Knowledge Networks – VerbNet and FrameNet



SESSION-1 (INTRODUCTION TO NLP, SHALLOW PARSING
AND DEEP PARSING)

• Introduction to python and NLTK


• Text Tokenization, Morphological Analysis, POS tagging and
chunking using NLTK.
• Constituency and Dependency Parsing using NLTK

Expected duration: 15 mins

Why Python?
Q: Can a Python program, compiled to bytecode, be executed on every machine (e.g. on Windows, or on Linux) without modification?

A: Yes. Python bytecode is cross-platform (see "Is Python bytecode version-dependent? Is it platform-dependent?" on Stack Overflow). However, it is not compatible across versions: Python 2.6 cannot execute files compiled by Python 2.5. So, while cross-platform, it is not generally useful as a distribution format.

Q: But why does Python need both a compiler and an interpreter?

A: Speed. Strict interpretation is slow. Virtually every "interpreted" language actually compiles the source code into some internal representation so that it does not have to repeatedly parse the code. In Python's case, this internal representation is saved to disk so the parsing/compiling step can be skipped the next time the code is needed.
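The compile-then-interpret pipeline described above can be observed with the standard library's `compile` built-in and `dis` module (a minimal illustration):

```python
import dis

# CPython compiles source text into a code object before interpreting it.
code = compile("x + 1", "<demo>", "eval")

# The code object carries the raw bytecode the interpreter executes.
print(type(code).__name__)     # 'code'
print(len(code.co_code) > 0)   # True: non-empty bytecode

# dis renders the bytecode in human-readable form.
dis.dis(code)
```

On import, CPython caches this compiled form on disk (.pyc files), which is the saved internal representation mentioned above.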



Introduction to python
A programming language with strong similarities to Perl and C, with dynamic typing and object-oriented features.
Commonly used for producing dynamic web content (e.g. Instagram, Bitbucket, Mozilla and many more websites are built on the Python/Django framework).
Great for text processing (e.g. powerful RegEx tools).
Useful built-in types (lists, dictionaries, generators, iterators).
Parallel computing (multi-processing and multi-threading APIs).
Map-reduce facilities, lambda functions.
Clean/readable syntax; lots of open-source standard libraries.
Code reusability.
DEMO (basics.py)
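The features listed above can be sketched in a few lines (an illustrative snippet, not the session's basics.py demo):

```python
# Lists and dictionaries are first-class built-ins.
words = ["natural", "language", "processing"]
lengths = {w: len(w) for w in words}

# Lambda functions and map/filter support a map-reduce style.
upper = list(map(lambda w: w.upper(), words))
long_words = [w for w in words if len(w) > 7]

# Generators evaluate lazily, which is handy for large corpora.
def bigrams(tokens):
    """Yield adjacent token pairs without building a full list."""
    for i in range(len(tokens) - 1):
        yield (tokens[i], tokens[i + 1])

print(lengths["language"])    # 8
print(upper[0])               # NATURAL
print(list(bigrams(words)))   # [('natural', 'language'), ('language', 'processing')]
```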
Installing Python
Download and installation instructions at:
https://www.python.org/download/
Windows/Mac systems require installation.
Pre-installed on recent Linux distributions (Ubuntu, Fedora, SUSE etc.).
We will use Python 2.7.x versions.



Python Tutorials
“Dive into Python”
http://diveintopython.org/
The Official Python Tutorial
https://docs.python.org/2/tutorial/
The Python Quick Reference
http://rgruet.free.fr/PQR2.3.html



Useful IDE/Text-Editors
IDLE (Windows)
Vi/Emacs (Linux)
Geany (Windows/Linux/Mac)
Pydev plugin for Eclipse IDE (Windows/Linux/Mac)
Notepad++ (Windows)



Useful Python Libraries
NumPy (Mathematical Computing, Advanced mathematical functionalities)
Matplotlib (Numerical plotting library, useful in data analysis)
Scipy (Library for scientific computation)
Scikit-learn (Machine Learning/Data-mining library)
PIL (Python library for Image Processing)
PySpeech (Library for speech processing and text-to-speech conversion)
XML/LXML (XML Parsing and Processing)
NLTK (Natural Language Processing)
And many more…
https://wiki.python.org/moin/UsefulModules



The Natural Language Toolkit (NLTK)
Developed by Steven Bird and Edward Loper at the University of
Pennsylvania (first released in 2001).
Open source python modules, datasets and tutorials
Papers:
Bird, Steven. "NLTK: the natural language toolkit." Proceedings of the COLING/ACL on Interactive
presentation sessions. Association for Computational Linguistics, 2006.
Loper, Edward, and Steven Bird. "NLTK: The natural language toolkit." Proceedings of the ACL-02
Workshop on Effective tools and methodologies for teaching natural language processing and
computational linguistics-Volume 1. Association for Computational Linguistics, 2002.



Components of NLTK (Bird et al., 2006)
1. Code: corpus readers, tokenizers, stemmers, taggers, chunkers,
parsers, wordnet, ... (50k lines of code)
2. Corpora: >30 annotated data sets widely used in natural
language processing (>300Mb data)
3. Documentation: a 400-page book, articles, reviews, API
documentation



1. Code
Corpus Readers
Tokenizers
Stemmers
Taggers
Parsers
WordNet
Semantic Interpretation
Clusterers
Evaluation Metrics



2. Corpora
Brown Corpus
Carnegie Mellon Pronouncing Dictionary
CoNLL 2000 Chunking Corpus
Project Gutenberg Selections
NIST 1999 Information Extraction: Entity Recognition Corpus
US Presidential Inaugural Address Corpus
Indian Language POS-Tagged Corpus
Floresta Portuguese Treebank
Prepositional Phrase Attachment Corpus
SENSEVAL 2 Corpus
Sinica Treebank Corpus Sample
Universal Declaration of Human Rights Corpus
Stopwords Corpus
TIMIT Corpus Sample
Treebank Corpus Sample



3. Documentation
Books:
Natural Language Processing with Python - Steven Bird, Edward Loper,
Ewan Klein
Python Text Processing with NLTK 2.0 Cookbook – Jacob Perkins
Included in NLTK:
Installation instructions
API Documentation: describes every module, interface, class, and
method



NLTK- How to?
Install NLTK
Follow instructions at http://www.nltk.org/install.html
Installers for Windows, Linux and Mac OS available
Check installation
Execute the python command through a "shell" (Linux/Mac) or the command prompt "cmd"
(Windows):
• $ python
• >>> import nltk
The interpreter should import "nltk" without showing any error.
Download NLTK data (corpora):
>>> nltk.download()



NLTK modules
NLP Task | NLTK Modules | Functionality
Accessing Corpora | nltk.corpus | Standardized interfaces to corpora and lexicons
String Processing | nltk.tokenize, nltk.stem | Tokenizers, sentence tokenizers, stemmers
Collocation Discovery | nltk.collocations | t-test, chi-squared, point-wise mutual information
POS Tagging | nltk.tag | n-gram, backoff, Brill, HMM, TnT
Chunking | nltk.chunk | Regular expression, n-gram, named entity
Parsing | nltk.parse | Chart, feature-based, unification, probabilistic, dependency
Classification | nltk.classify, nltk.cluster | Decision tree, maximum entropy, naive Bayes, EM, k-means
Semantic Interpretation | nltk.sem, nltk.inference | Lambda calculus, first-order logic, model checking
Evaluation Metrics | nltk.metrics | Precision, recall, agreement coefficients
Probability Estimation | nltk.probability | Frequency distributions, smoothed probability distributions
Applications | nltk.app | Graphical concordancer, parsers, WordNet browser
Linguistic Fieldwork | nltk.toolbox | Manipulate data in SIL Toolbox format


Text Tokenization
Process of splitting a string into a list of tokens (words, punctuation marks etc.).
For most languages, whitespace separates two adjacent words.
Exceptions (source: Wikipedia):
In Chinese and Japanese, sentences are delimited but words are not.
In Thai, phrases and sentences are delimited but not words.



Text Tokenization – NLTK Tokenizers
Description: http://www.nltk.org/api/nltk.tokenize.html
Demo:
LineTokenizer – Tokenize string into lines
PunktWordTokenizer (statistical) –
• Tokenization based on an unsupervised ML algorithm.
• Model parameters are learnt by training on a large corpus of abbreviations,
collocations, and words that start sentences.
RegexpTokenizer – Tokenization based on regular expressions
SExprTokenizer – Finds parenthesized expressions in a string
TreebankWordTokenizer – Tokenization as per Penn Treebank standards
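What RegexpTokenizer does can be approximated in pure Python with the standard `re` module (a sketch of the idea, not NLTK's implementation; the default pattern here is an assumption):

```python
import re

def regexp_tokenize(text, pattern=r"\w+|[^\w\s]"):
    """Split text into word tokens and single punctuation marks via a regex."""
    return re.findall(pattern, text)

sentence = "Mr. Brown isn't here, is he?"
print(regexp_tokenize(sentence))
# ['Mr', '.', 'Brown', 'isn', "'", 't', 'here', ',', 'is', 'he', '?']
```

Note how a purely regex-based tokenizer splits contractions like "isn't" naively; the Punkt and Treebank tokenizers handle such cases with trained models or hand-written rules.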



NLTK Morphological Analyzers
DEMO- Stemmers
Lancaster Stemmer
Porter Stemmer
Regexp Stemmer
Snowball Stemmer
DEMO - Lemmatizers
WordNet based Lemmatizers
Script: morphological_analyzer.py
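The rule-based stemmers above work by stripping suffixes; a toy suffix-stripper in the spirit of NLTK's RegexpStemmer (purely illustrative, much cruder than Porter or Snowball):

```python
def regexp_stem(word, suffixes=("ing", "ed", "ly", "es", "s")):
    """Remove the first matching suffix, keeping a minimum stem length of 3."""
    for suf in suffixes:
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[: -len(suf)]
    return word

for w in ["running", "jumped", "quickly", "cats"]:
    print(w, "->", regexp_stem(w))
# running -> runn, jumped -> jump, quickly -> quick, cats -> cat
```

The "runn" output shows why stemming is not lemmatization: a stemmer only truncates by pattern, while a WordNet-based lemmatizer maps the word to a real dictionary form ("run").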



NLTK- Part of Speech Tagging
The process of sequentially labeling words in a sentence with
their corresponding part of speech tags.
Demo – NLTK POS Taggers (pos_tagger.py)
Unigram Tagger (Based on prior probability)
Brill Tagger (Rule Based)
Regexp Tagger (Using regular expressions for tagging)
HMM based Tagger (HMM-Viterbi based)
Stanford tagger (Using Stanford module)
NLTK recommended tagger
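The idea behind the unigram tagger (label each word with its most frequent tag in the training data, i.e. its prior) fits in a few lines; a sketch on a hypothetical toy corpus:

```python
from collections import Counter, defaultdict

def train_unigram_tagger(tagged_sents):
    """Map each word to its most frequent tag in the training data."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, pos in sent:
            counts[word][pos] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def tag(model, tokens, default="NN"):
    """Tag tokens, backing off to a default tag for unseen words."""
    return [(t, model.get(t, default)) for t in tokens]

train = [[("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
         [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")]]
model = train_unigram_tagger(train)
print(tag(model, ["the", "dog", "sleeps"]))
# [('the', 'DT'), ('dog', 'NN'), ('sleeps', 'VBZ')]
```

The fixed default tag is the simplest form of the backoff chaining that NLTK's taggers support.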



Parsers
Writing a grammar
Rule based constituency parsing
RecursiveDescent Parser
ShiftReduce Parser
DEMO- Statistical Parsers
Probabilistic Context Free Grammar (PCFG)
• Stanford parser
Probabilistic Dependency Parsing
• Malt Parser
• Stanford Parser
Script: parser_demo.py
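Writing a grammar and parsing with it can be illustrated by a tiny recursive-descent recognizer over a toy CFG (a sketch only; NLTK's RecursiveDescentParser handles arbitrary grammars and builds full parse trees):

```python
# Toy grammar: S -> NP VP ; NP -> DT N ; VP -> V NP | V
GRAMMAR = {
    "S":  [["NP", "VP"]],
    "NP": [["DT", "N"]],
    "VP": [["V", "NP"], ["V"]],
}
LEXICON = {"the": "DT", "dog": "N", "cat": "N", "chases": "V", "sleeps": "V"}

def parse(symbol, tokens, pos):
    """Try to expand `symbol` at tokens[pos]; return the end index or None."""
    if symbol in LEXICON.values():                 # terminal (a POS tag)
        if pos < len(tokens) and LEXICON.get(tokens[pos]) == symbol:
            return pos + 1
        return None
    for production in GRAMMAR[symbol]:             # try each production in turn
        cur = pos
        for child in production:
            cur = parse(child, tokens, cur)
            if cur is None:
                break
        else:
            return cur
    return None

def recognize(sentence):
    tokens = sentence.split()
    return parse("S", tokens, 0) == len(tokens)

print(recognize("the dog chases the cat"))  # True
print(recognize("the dog the cat"))         # False
```

This version only recognizes sentences and does limited backtracking; a real parser would also return the constituency tree.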
SESSION 2 (NAMED ENTITY RECOGNITION,
COREFERENCE RESOLUTION)

• NER using NLTK

Expected duration: 15 mins

NER and Coreference Resolution using NLTK
DEMO – Named entity chunking using NLTK
Using Stanford CoreNLP tool for Coreference Resolution
Download and installation instructions at
http://stanfordnlp.github.io/CoreNLP/
Python Wrapper for CoreNLP
https://github.com/dasmith/stanford-corenlp-python
• DEMO – Coreference Resolution using Stanford CoreNLP
• Scripts: coreference_resolution.py
named_entity_chunking.py
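A crude capitalization-based chunker conveys the idea behind named-entity chunking (purely illustrative; NLTK's ne_chunk uses a trained classifier, and this heuristic would mislabel sentence-initial words):

```python
import re

def chunk_named_entities(tokens):
    """Group maximal runs of capitalized tokens as candidate entities."""
    chunks, current = [], []
    for tok in tokens:
        if re.match(r"^[A-Z][a-z]+$", tok):
            current.append(tok)
        else:
            if current:
                chunks.append(" ".join(current))
                current = []
    if current:
        chunks.append(" ".join(current))
    return chunks

tokens = "yesterday Barack Obama met Angela Merkel in Berlin".split()
print(chunk_named_entities(tokens))
# ['Barack Obama', 'Angela Merkel', 'Berlin']
```

Note the heuristic gives untyped chunks; a classifier-based chunker additionally labels each one as PERSON, LOCATION, ORGANIZATION etc.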



SESSION-3 (MEANING EXTRACTION, DEEP LEARNING)

• WordNets and WordNet-API


• Other Lexical Knowledge Networks – VerbNet and FrameNet

Expected duration: 10 mins

WordNet
DEMO - NLTK WordNet (wordnet.py)
Finding all the synonym sets (SynSets) of a word across all possible POS
tags.
Finding all the SynSets when the POS tag is known.
Finding hypernyms and hyponyms of a synset.
Finding similarities between two words.
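WordNet's path similarity scores two senses by the shortest path between them in the hypernym graph, as 1 / (path length + 1); the measure can be sketched over a hypothetical miniature taxonomy (toy data, not WordNet's):

```python
# Toy hypernym links: child -> parent (a hypothetical miniature taxonomy).
HYPERNYMS = {"dog": "canine", "cat": "feline",
             "canine": "carnivore", "feline": "carnivore",
             "carnivore": "mammal", "mammal": "animal"}

def ancestors(word):
    """Return the chain word, parent, grandparent, ... up to the root."""
    chain = [word]
    while chain[-1] in HYPERNYMS:
        chain.append(HYPERNYMS[chain[-1]])
    return chain

def path_similarity(a, b):
    """1 / (shortest hypernym-path length + 1), as WordNet defines it."""
    up_a, up_b = ancestors(a), ancestors(b)
    common = [n for n in up_a if n in up_b]
    if not common:
        return None
    lca = common[0]                       # lowest common ancestor
    dist = up_a.index(lca) + up_b.index(lca)
    return 1.0 / (dist + 1)

print(path_similarity("dog", "cat"))  # 0.2 (dog->canine->carnivore<-feline<-cat)
```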



Other Lexical Networks – ConceptNet, VerbNet and FrameNet

DEMO – ConceptNet (conceptnet.py)

Using the Divisi API (since ConceptNet is not available in NLTK)
Using ConceptNet for finding attributes of a concept.
Using ConceptNet for word similarity computation.
DEMO- VerbNet (framenet_verbnet.py)
Obtaining verb classes from VerbNet
DEMO – FrameNet
Listing all the frames in FrameNet
Obtaining properties of a particular frame.



THANK YOU

9th January, 2016