Natural Language Processing

The document provides an overview of Natural Language Processing (NLP), highlighting its focus on creating models from text data and the unique challenges it presents. It outlines a basic NLP process, introduces the TF-IDF method for featurizing text, and suggests optional reading materials. It also mentions a practical code-along project for building a spam detection filter using Python and Spark.

Uploaded by

abhimanyu thakur

Natural Language Processing
Let’s learn something!
Python and Spark

● Let’s now learn about the basics of Natural Language Processing!
● This is the field of machine learning that focuses on creating models from text data (for example, directly from the words of articles).
● The NLP section of the course contains just one custom code-along example, because the documentation doesn’t include a full example and the code-along is a larger, multi-step process.
● This is a very large field of machine learning with its own unique challenges, algorithms, and features, so what we cover here only scratches the surface!
● Optional Reading Suggestions:
○ Wikipedia Article on NLP
○ NLTK Book (separate Python library)
○ Foundations of Statistical Natural Language Processing (Manning)
● Examples of NLP
○ Clustering News Articles
○ Suggesting similar books
○ Grouping Legal Documents
○ Analyzing Consumer Feedback
○ Spam Email Detection
● Our basic process for NLP:
○ Compile all documents (Corpus)
○ Featurize the words (convert them to numeric features)
○ Compare features of documents
● A standard way of doing this is to use a method known as “TF-IDF”.
● TF-IDF stands for Term Frequency - Inverse Document Frequency.
● Let’s explain how it works!
NLP

Simple Example:
● You have 2 documents:
○ “Blue House”
○ “Red House”
● Featurize based on word count:
○ “Blue House” -> (red,blue,house) -> (0,1,1)
○ “Red House” -> (red,blue,house) -> (1,0,1)
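The word-count featurization above can be sketched in a few lines of plain Python. This is only an illustration, not the course's code: the function name `bag_of_words` is made up here, and the sketch assumes lowercasing and whitespace splitting are good enough for tokenizing. It also sorts the vocabulary alphabetically, so the columns come out as (blue, house, red) rather than the (red, blue, house) order used on the slide.

```python
from collections import Counter


def bag_of_words(documents):
    """Return (vocabulary, vectors) with one word-count vector per document."""
    # Assumption: lowercase + whitespace split is an adequate tokenizer here.
    tokenized = [doc.lower().split() for doc in documents]
    # Fix one vocabulary order so every vector uses the same columns.
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)  # missing words count as 0
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors


vocabulary, vectors = bag_of_words(["Blue House", "Red House"])
print(vocabulary)  # ['blue', 'house', 'red']
print(vectors)     # [[1, 1, 0], [0, 1, 1]]
```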
● A document represented as a vector of word counts is called a “Bag of Words”
○ “Blue House” -> (red,blue,house) -> (0,1,1)
○ “Red House” -> (red,blue,house) -> (1,0,1)
● These are now vectors in an N-dimensional space, so we can compare them using cosine similarity:
○ sim(A, B) = cos(θ) = (A · B) / (‖A‖ ‖B‖)
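As a rough sketch, cosine similarity is the dot product of two vectors divided by the product of their lengths. The helper name `cosine_similarity` and the choice to return 0.0 for an all-zero vector are assumptions made for this example, not something the slides specify.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen for this sketch when a vector is all zeros
    return dot / (norm_a * norm_b)


blue_house = [0, 1, 1]  # (red, blue, house)
red_house = [1, 0, 1]
# The two documents share one of their two words, so the result is about 0.5.
print(cosine_similarity(blue_house, red_house))
```

A value near 1 means the documents use nearly the same words in the same proportions; a value of 0 means they share no words at all.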
● We can improve on Bag of Words by adjusting word counts based on their frequency in the corpus (the group of all the documents)
● We can use TF-IDF (Term Frequency - Inverse Document Frequency)
● Term Frequency - Importance of the term within that document
○ TF(x,y) = Number of occurrences of term x in document y
● Inverse Document Frequency - Importance of the term in the corpus
○ IDF(x) = log(N/df_x) where
■ N = total number of documents
■ df_x = number of documents containing term x
● Mathematically, TF-IDF is then expressed as:
○ W(x,y) = TF(x,y) × log(N/df_x)
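To make the definitions concrete, here is a minimal plain-Python sketch that multiplies the term count by log(N/df) for the two example documents. The function name `tf_idf` and the whitespace tokenizer are assumptions made for illustration, and the natural logarithm is used because the slides don't name a base.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document, with weight = TF * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # df[x]: number of documents that contain term x at least once.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)  # TF(x, y): raw count of term x in this document
        weights.append(
            {term: count * math.log(n_docs / df[term]) for term, count in tf.items()}
        )
    return weights


weights = tf_idf(["Blue House", "Red House"])
# "house" appears in every document, so log(2/2) = 0 and its weight is 0.0;
# "blue" appears in only one of the two documents, so its weight is log(2).
print(weights[0])
```

Note that Spark's own IDF uses a smoothed variant, log((N + 1) / (df + 1)), so the numbers it produces will differ slightly from this unsmoothed version.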
● Spark provides many tools in pyspark.ml.feature that help with this entire process.
● Let’s jump to a custom code-along example!
Tools for NLP Part One

● Before we jump into the code-along project, let’s explore a few of the tools Spark has for dealing with text data.
● Then we’ll be able to use them easily in our project!
Tools for NLP Part Two

NLP Code Along

● Let’s work through building a spam detection filter using Python and Spark!
● Our data set consists of volunteered text messages from a study in Singapore and some spam texts from a UK reporting site.
● Let’s get started!
