Natural Language Processing

The document provides an overview of Natural Language Processing (NLP), highlighting its focus on creating models from text data and the unique challenges it presents. It outlines a basic NLP process, introduces the TF-IDF method for featurizing text, and suggests optional reading materials. Additionally, it mentions a practical code along project for building a spam detection filter using Python and Spark.

Uploaded by

abhimanyu thakur

Natural Language Processing
Let’s learn something!
Python and Spark

● Let’s now learn about the basics of
Natural Language Processing!
● This is the field of machine learning that
focuses on creating models from a text
data source (for example, straight from
articles or other written words).
Python and Spark

● The NLP section of the course contains a
single custom code-along example, because
the documentation doesn’t have a full
example and the custom code-along is a
larger, multi-step process.
Python and Spark

● This is a very large field of machine
learning with its own unique challenges
and sets of algorithms and features, so
what we cover here will just be
scratching the surface!
Python and Spark

● Optional Reading Suggestions:
○ Wikipedia Article on NLP
○ NLTK Book (separate Python library)
○ Foundations of Statistical Natural
Language Processing (Manning)
Python and Spark

● Examples of NLP
○ Clustering News Articles
○ Suggesting similar books
○ Grouping Legal Documents
○ Analyzing Consumer Feedback
○ Spam Email Detection
Python and Spark

● Our basic process for NLP:
○ Compile all documents (Corpus)
○ Featurize the words as numeric vectors
○ Compare features of documents
Python and Spark

● A standard way of doing this is with
“TF-IDF” methods.
● TF-IDF stands for Term Frequency -
Inverse Document Frequency
● Let’s explain how it works!
NLP

Simple Example:
● You have 2 documents:
○ “Blue House”
○ “Red House”
● Featurize based on word count:
○ “Blue House” -> (red,blue,house) -> (0,1,1)
○ “Red House” -> (red,blue,house) -> (1,0,1)
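The word-count featurization above can be sketched in a few lines of plain Python. This is only an illustration, not part of the course code: the helper name `bag_of_words`, the whitespace tokenization, and the lowercasing are assumptions made for the example.

```python
def bag_of_words(document, vocabulary):
    """Return the count of each vocabulary term in the document."""
    # Assumption: lowercase the text and split on whitespace to get tokens.
    tokens = document.lower().split()
    return [tokens.count(term) for term in vocabulary]


# Fixed vocabulary order taken from the slide: (red, blue, house)
vocabulary = ["red", "blue", "house"]

blue_house = bag_of_words("Blue House", vocabulary)
red_house = bag_of_words("Red House", vocabulary)

print(blue_house)  # [0, 1, 1]
print(red_house)   # [1, 0, 1]
```

The vocabulary order is fixed up front so that position i means the same word in every document vector.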
NLP

● A document represented as a vector of word
counts is called a “Bag of Words”
○ “Blue House” -> (red,blue,house) -> (0,1,1)
○ “Red House” -> (red,blue,house) -> (1,0,1)
● These are now vectors in an N-dimensional
space, so we can compare vectors with cosine
similarity:
$\cos\theta = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}$
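As a rough sketch, cosine similarity (the dot product of two vectors divided by the product of their lengths) can be computed for the two example vectors in plain Python. The function name `cosine_similarity` and the choice to return 0.0 for an all-zero vector are assumptions made for this illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Assumption: treat a zero vector as having no similarity to anything.
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


blue_house = [0, 1, 1]  # "Blue House" over (red, blue, house)
red_house = [1, 0, 1]   # "Red House" over (red, blue, house)

similarity = cosine_similarity(blue_house, red_house)
print(round(similarity, 3))  # 0.5
```

The two documents share one of their two words ("house"), which gives a similarity of about 0.5; identical vectors would score 1 and vectors with no words in common would score 0.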
NLP

● We can improve on Bag of Words by
adjusting word counts based on
their frequency in the corpus (the group
of all the documents)
● We can use TF-IDF (Term Frequency
- Inverse Document Frequency)
NLP

● Term Frequency - Importance of the term
within that document
○ TF(x,y) = Number of occurrences of term x in
document y
● Inverse Document Frequency - Importance of
the term in the corpus
○ IDF(x) = log(N/df_x) where
■ N = total number of documents
■ df_x = number of documents containing
the term x
NLP

● Mathematically, TF-IDF is then
expressed:
$\text{TF-IDF}(x, y) = \text{TF}(x, y) \times \log\left(\frac{N}{df_x}\right)$
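To make the definitions concrete, here is a minimal plain-Python sketch that multiplies the raw term count TF(x,y) by log(N/df_x) for the two-document example. The function name `tf_idf`, the whitespace tokenization, and the dictionary output format are assumptions for illustration only. Spark's own IDF uses a smoothed variant of the logarithm, so its numbers will differ from these.

```python
import math


def tf_idf(documents):
    """Return one {term: weight} dict per document using TF * log(N / df)."""
    # Assumption: lowercase and split on whitespace to tokenize.
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    vocabulary = sorted({term for tokens in tokenized for term in tokens})

    # df[term] = number of documents that contain the term at least once
    df = {
        term: sum(1 for tokens in tokenized if term in tokens)
        for term in vocabulary
    }

    weights = []
    for tokens in tokenized:
        weights.append({
            term: tokens.count(term) * math.log(n_docs / df[term])
            for term in vocabulary
        })
    return weights


weights = tf_idf(["Blue House", "Red House"])
for doc_weights in weights:
    print({term: round(w, 3) for term, w in doc_weights.items()})
```

In this tiny corpus "house" appears in both documents, so log(2/2) = 0 and its weight vanishes, while "blue" and "red" each keep a weight of log(2) in the one document that contains them.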
Python and Spark

● Spark has many pyspark.ml.feature
tools that help with this entire process!
● Let’s jump to a custom code-along
example!
Tools for NLP
Part One
Python and Spark

● Before we jump into the code-along
project, let’s explore a few of the tools
Spark has for dealing with text data.
● Then we’ll be able to use them easily in
our project!
Tools for NLP
Part Two
NLP Code Along
Python and Spark

● Let’s work through building a spam
detection filter using Python and Spark!
● Our data set consists of volunteered text
messages from a study in Singapore and
some spam texts from a UK reporting
site.
● Let’s get started!
