Natural Language Processing in Python
With Expert Data Science Instructor Alice Zhao
This is Part 5 of a 5-Part series designed to take you through several applications of data science
using Python, including data prep & EDA, regression, classification, unsupervised learning & NLP
This course is for students looking for a practical, hands-on approach to learning and
applying natural language processing (NLP) concepts and techniques with Python
Quizzes & Assignments to test and reinforce key concepts, with step-by-step solutions
Interactive demos to keep you engaged and apply your skills throughout the course
MAVEN BOOKS: You'll be using pretrained LLMs on book descriptions to extract character names, classify books into categories, create summaries, and recommend similar books
This course covers traditional & modern natural language processing (NLP)
• Traditional NLP includes text preprocessing techniques & machine learning algorithms for text data
• Modern NLP includes concepts like neural networks, deep learning, transformers, and large language models (LLMs)
We will use Hugging Face to work with Large Language Models (LLMs)
• We’ll use the Model Hub to access pretrained models and the Transformers library in Python to apply them
• We will NOT be doing a deep dive into more advanced transformer topics like fine-tuning, RAG, etc.
In this section we’ll install Anaconda, start writing Python code in a Jupyter Notebook,
and learn how to create a new conda environment to get set up for this course
When you install Anaconda, it comes with the following: coding languages & tools, popular packages, and a package & environment manager
Installing Anaconda (Mac or PC):
3) Launch the downloaded Anaconda pkg file
4) Follow the installation steps (default settings are OK) and click "Continue", "Agree", and "Install" at the end
Launching Jupyter:
1) Once inside the Jupyter interface, create a folder to store your notebooks for the course
NOTE: You can rename the folder by clicking "Rename" in the top left corner
2) Open your new coursework folder and launch your first Jupyter Notebook!
NOTE: You can rename the notebook by clicking on the title at the top of the screen
NOTE: When you launch a Jupyter Notebook, you'll see a bunch of log data; this is called a notebook server, and it powers the notebook interface
A conda environment is a place on your computer where you can install specific
versions of Python and Python packages without affecting other projects
My computer:
• Environment 1: "I'm working on a beginner Python 101 project and am learning about built-in Python functions. I'm going to activate Environment 1 and do my Python coding here."
• Environment 2: "I'm taking a NumPy course where the instructor is using Python 3.10 with NumPy 2.2, and I want my code to match his. I'm going to activate Environment 2 and do my Python coding here."

• In Environment 1 (Python 3.13), we can use all built-in Python 3.13 functions, but we get an error when importing NumPy because the library isn't available in this environment
• In Environment 2 (Python 3.10 with NumPy 2.2), we can use all built-in Python 3.10 functions, and we're able to import the NumPy library because it's installed in this environment

As a Python beginner, you've likely been using the default environment, but advanced users create new environments for each new, complex project
• Creating a new environment gives us a blank slate to freshly install Python packages and make sure the versions and dependencies are correct for each project
• Examples: the default environment, a new environment for a sentiment analysis project, a new environment for an LLM project
This is the workflow for working with conda environments and packages:

EXAMPLE: Creating a new environment to use Hugging Face's Transformers library

1 Create a new environment
> conda create --name llm_project_env

2 Activate the new environment
> conda activate llm_project_env

3 Install the packages you need
> conda install transformers

Launch Jupyter within the environment:
> jupyter notebook

Then use the installed packages inside a Jupyter Notebook:
from transformers import pipeline
sentiment = pipeline("sentiment-analysis")
sentiment("I love NLP!")

When you're finished, deactivate the environment to return to base (default):
> conda deactivate
These are some helpful commands when working with conda environments:

Environment commands
• conda env list – Lists all conda environments in your system
• conda create --name test_env – Creates a new environment called test_env
• conda activate test_env – Activates the test_env environment
• conda deactivate – Deactivates the current environment and returns to the base environment

Package commands
• conda list – Lists installed packages in the active environment
• conda install – Installs specified packages into the active environment
All conda commands should be written and executed within the Terminal (Mac) or Anaconda Prompt (PC) application
• The (base) prefix in the prompt tells us we're in the default environment
• "conda env list" is the conda command to display all the available environments
• The * in the output signals the active environment
We will be creating and using four conda environments throughout this course
• While you can still complete the course without utilizing environments, they will help keep you organized and avoid potential version conflicts

Section → Environment:
• 1) Installation & Setup → test_env
• 2) Natural Language Processing 101
• 3) Text Preprocessing → nlp_basics
• 4) NLP with Machine Learning → nlp_machine_learning
• 5) Neural Networks & Deep Learning

NOTE: If you have experience working with .yml files, you can find the NLP environment .yml files in the "Environments" folder within the course resources. You can use them as reference or to quickly create new conda environments.
In this section we’ll cover the basics of natural language processing (NLP), including key
concepts, the evolution of NLP over the years, and its applications & Python libraries
NLP: Using computers to work with text data
*Dictionary.com
NLP & AI: NLP is a field within artificial intelligence (AI)

History of NLP
The field of NLP has evolved significantly over the past 70+ years:
• Machine learning techniques: supervised learning, unsupervised learning
• Transformer-based techniques: LLMs (GPT, BERT, etc.)
• Traditional NLP (machine learning) has largely replaced early NLP (rules-based)
• Transformer-based NLP has largely replaced recurrent-based NLP
There are numerous NLP applications & techniques that we'll cover:

Modern NLP (Technique → Application)
• Encoder-Only LLM (BERT) → Sentiment Analysis, Named Entity Recognition (NER)
• Encoder-Decoder LLM (BART) → Zero Shot Classification, Text Summarization
• Decoder-Only LLM (GPT) → Text Generation
Most general data science tasks can be done using Pandas and Scikit-learn, but there are many available Python libraries for NLP tasks:
• Vectorization → Scikit-learn
• Sentiment Analysis → VADER
• NLP with Machine Learning (Text Classification, Topic Modeling) → Scikit-learn

Other popular NLP libraries that we will NOT be covering in this course include nltk, gensim, TensorFlow, and PyTorch, as we will focus on simplicity and ease of use
NLP techniques have greatly evolved over the past 70+ years
• Starting with rules-based techniques in the 1950s-70s, then moving onto traditional ML techniques in the
1980s-2000s, and currently modern NLP with deep learning and transformers-based techniques
Python is one of the best coding languages for applying NLP techniques
• There are many NLP libraries, such as scikit-learn and transformers, which integrate well into other frameworks
In this section we’ll review the text preprocessing steps required before applying machine
learning algorithms, including cleaning, normalization, vectorization, and more
NLP projects follow the same data science workflow, except there’s an extra
text preprocessing step between cleaning and exploring data:
1) Scoping a Project → 2) Gathering Data → 3) Cleaning Data → 4) Exploring Data → 5) Modeling Data → 6) Sharing Insights

For NLP projects, this portion is called the NLP pipeline, which is the series of steps your text data goes through for processing and analysis

*This workflow is discussed in more detail in the Data Prep & EDA course
Text preprocessing is about preparing raw text data for analysis and modeling:
1) Scoping a project → 2) Gathering data → 3) Cleaning data → 3.5) Text Preprocessing → 4) Exploring data → 5) Modeling data → 6) Sharing insights

Cleaning & Normalization*
• Tokenization – Split text into smaller units, like words or sentences
• Stemming / Lemmatization – Reduce words to their root or base form

Vectorization
• Document-Term Matrix (DTM) – Represent text by word frequency, also known as Bag of Words
• TF-IDF – Extension of DTM that weights words based on their importance
*These text cleaning and normalization steps can be mixed and matched
NEW MESSAGE
May 15, 2025
From: Lexi Con (Lead Data Scientist)
Subject: NLP Onboarding

Hi,
I hear you're the new associate data scientist on the team – welcome!
We're currently kicking off several natural language processing projects with our client, Maven Books.
I'd like to get you involved ASAP. Can you create a new conda environment on your computer, and install the latest versions of Python and any other NLP libraries you might need?
Talk soon!
Lexi

Key Objectives
1. Open the Terminal (Mac) or Anaconda Prompt (PC) application and create a new conda environment called "nlp_basics"
2. Activate the "nlp_basics" environment
3. Install Python, Jupyter Notebook, Pandas, spaCy, Scikit-learn, and Matplotlib in the environment
4. Launch Jupyter within the environment
5. Write and execute a line of Python code
The Pandas library is used for simple text cleaning and normalization
• Use str.lower() to make all text lowercase
• Use str.replace() to replace special characters (punctuation, numbers, etc.)

PRO TIP: Regular expressions (regex) allow you to find patterns; once you understand the basic concept, you can use tools like ChatGPT to generate the syntax
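As a minimal sketch (assuming a DataFrame with a Description column, like the one in the upcoming assignment), those two steps might look like this:

import pandas as pd

# Illustrative data; in the course you'd read a CSV into a DataFrame instead
books = pd.DataFrame({"Description": ["A Story About LEMONS!", "Maven\xa0Books: 100 years of stories."]})

# Make all text lowercase
books["description_clean"] = books["Description"].str.lower()

# Remove \xa0 characters, then strip punctuation and numbers with regular expressions
books["description_clean"] = (
    books["description_clean"]
    .str.replace("\xa0", " ", regex=False)
    .str.replace(r"[^\w\s]", "", regex=True)   # remove punctuation
    .str.replace(r"\d+", "", regex=True)       # remove numbers
)

print(books["description_clean"])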
NEW MESSAGE
May 16, 2025
From: Lexi Con (Lead Data Scientist)
Subject: Text preprocessing request

Hello,
Now that you're all settled in, let me get you up to speed with our first task for the Maven Books project: text preprocessing.
I hear you're already familiar with Pandas. We've been given a flat file of the top 100 children's books over the past century.
Can you use Pandas string functions to do some text normalization and cleaning?
Thank you!
Lexi

childrens_books.csv

Key Objectives
1. Read the childrens_books.csv file into a Jupyter Notebook
2. Within the Description column:
   a) Make all the text lowercase
   b) Remove all \xa0 characters
   c) Remove all punctuation
The spaCy library can handle many NLP tasks, including tokenization, lemmatization, stop words, and more
• The first step is to turn a text string into a spaCy doc object
• When you import spaCy, you need to specify which language to use – in this case, we're choosing English, which includes information from a large amount of annotated text
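As a minimal sketch of that first step (assuming the small English model has been downloaded with python -m spacy download en_core_web_sm):

import spacy

# Load a pretrained English pipeline
nlp = spacy.load("en_core_web_sm")

# Turn a raw text string into a spaCy doc object
doc = nlp("I'm selling lemons at the market.")
print(type(doc))  # <class 'spacy.tokens.doc.Doc'>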
Tokenization lets you break text up into smaller units, like words
• Text strings are often split by whitespace to make tokens
spaCy mainly splits on whitespace, but there’s some additional, smarter logic:
• Common contractions are separated (I’m)
• Punctuation is typically separated unless it’s a URL, email address, etc.
• …and much more!
With lemmatization:
• "i" has been updated to "I"
• "selling" has been updated to "sell"
• "lemons" has been updated to "lemon"
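A small sketch of tokenization, lemmatization, and stop word removal using the doc object created above (the example sentence and outputs are illustrative):

# Tokenize: a doc is already a sequence of token objects
tokens = [token.text for token in doc]
print(tokens)  # ["I", "'m", "selling", "lemons", "at", "the", "market", "."]

# Lemmatize and remove stop words and punctuation in one pass
lemmas = [token.lemma_ for token in doc if not token.is_stop and not token.is_punct]
print(lemmas)  # e.g. ["sell", "lemon", "market"]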
Parts of speech (POS) tagging lets you label nouns, verbs, etc. within text data
• This is optional, but is sometimes used as a filtering technique to only look at nouns and pronouns for analysis, for example

This is a lesser-used technique compared to the others and one of many linguistic analysis capabilities available within spaCy:
• Other types of linguistic analysis include Named Entity Recognition (NER), Dependency Parsing, and more
• Linguistic analysis techniques work better with raw text
• spaCy uses a combination of linguistic rules and statistical models for linguistic analysis
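For illustration, a quick sketch of POS tagging on the same doc (pos_ holds the coarse part-of-speech tag spaCy assigns to each token):

# Print each token with its part-of-speech tag
for token in doc:
    print(token.text, token.pos_)

# Optionally filter to nouns and proper nouns before analysis
nouns = [token.text for token in doc if token.pos_ in ("NOUN", "PROPN")]
print(nouns)  # e.g. ["lemons", "market"]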
NEW MESSAGE
May 19, 2025
From: Lexi Con (Lead Data Scientist)
Subject: RE: Text preprocessing request

Hi again,
Thanks for the first round of text preprocessing you did earlier with Pandas!
Could you do a second round of normalization and cleaning on the Description column with spaCy to tokenize, lemmatize and remove stop words from the text?
Thank you!
Lexi

Key Objectives
1. In addition to the lowercasing and special character removal from the previous assignment, within the cleaned Description column:
   a) Tokenize the text
   b) Lemmatize the text
   c) Remove stop words
Vectorization is the process of converting text data into numeric data so that
future data analysis and machine learning techniques can be applied
• Most ML techniques require text data to be cleaned, normalized and in a numeric format
• Some techniques, such as sentiment analysis, require text data to be in its raw text form

We will be covering these vectorization techniques: Word Counts, TF-IDF, and Embeddings
The CountVectorizer from scikit-learn builds a document-term matrix (DTM) of word counts:

from sklearn.feature_extraction.text import CountVectorizer

• stop_words – Language to remove stop words for (default is None)
• ngram_range – Range for the sequence of "n" words to consider as a term in the DTM
   Examples: (1,1) – "data" (default); (1,2) – "data", "data science"; (3,3) – "data science workflow"
• min_df – Number OR percent of documents a term needs to appear in to be included in the DTM (default is 1)

You'll notice that we're able to tokenize and remove stop words using both spaCy AND sklearn, so it's your choice which library you use
With the default parameters, these are the word counts for the 15 terms (columns) across the 8 documents (rows)

With these parameters, we're removing all English stop words, returning all one and two-word terms, and keeping terms that appear in 2 or more documents; the columns have been reduced from the original 15 to 9!
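As a hedged sketch (the mini-corpus below is illustrative, not the course's actual 8 documents), the default and tuned configurations might look like this:

from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

docs = [
    "i sell lemons and lemonade at the maven market",
    "the lemonade at the market is fresh",
    "maven books sells stories not lemonade",
]

# Default parameters: every word becomes a column in the DTM
cv = CountVectorizer()
dtm = cv.fit_transform(docs)
print(pd.DataFrame(dtm.toarray(), columns=cv.get_feature_names_out()))

# Remove English stop words, allow 1- and 2-word terms, keep terms that appear in 2+ documents
cv_tuned = CountVectorizer(stop_words="english", ngram_range=(1, 2), min_df=2)
dtm_tuned = cv_tuned.fit_transform(docs)
print(pd.DataFrame(dtm_tuned.toarray(), columns=cv_tuned.get_feature_names_out()))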
NEW MESSAGE
May 20, 2025
From: Lexi Con (Lead Data Scientist)
Subject: Vectorization request

Hello,
Now that you've cleaned and normalized the book descriptions using pandas and spaCy, can you create a quick visualization to show the top 10 most common terms in the descriptions?
Could you also share some of the less common terms that appear in multiple book descriptions?
Thanks!
Lexi

Key Objectives
1. Vectorize the cleaned and normalized text using Count Vectorizer with the default parameters
2. Modify the Count Vectorizer parameters to reduce the number of columns:
   a) Remove stop words
   b) Set a minimum document frequency of 10%
3. Use the updated Count Vectorizer to identify the:
   a) Top 10 most common terms
   b) Top 10 least common terms that appear in at least 10% of the documents
4. Create a horizontal bar chart of the top 10 most common terms
Term Frequency (TF)
• Problem it solves: High counts can dominate, especially for high frequency words or long documents
• Solution: Normalize the counts so they're all on the same scale

Inverse Document Frequency (IDF)
• Problem it solves: Each word is treated equally, even when some might be more important
• Solution: Assign more weight to rare words than to common words
Create a TF-IDF Vectorizer object in Python to use TF-IDF scores in your DTM
• It has many of the same parameters as the Count Vectorizer

With the default parameters, we get the same 15 terms (columns) across 8 documents (rows), but with TF-IDF scores instead of word counts

With the same parameters as earlier, we're back down to 9 terms instead of 15
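A minimal sketch of the same idea with TF-IDF scores (again on an illustrative mini-corpus; the real column counts depend on the course data):

from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

docs = [
    "i sell lemons and lemonade at the maven market",
    "the lemonade at the market is fresh",
    "maven books sells stories not lemonade",
]

# Default parameters: same columns as CountVectorizer, but TF-IDF scores instead of counts
tfidf = TfidfVectorizer()
dtm = tfidf.fit_transform(docs)
print(pd.DataFrame(dtm.toarray(), columns=tfidf.get_feature_names_out()).round(2))

# The same tuning parameters are available as with CountVectorizer
tfidf_tuned = TfidfVectorizer(stop_words="english", ngram_range=(1, 2), min_df=2)
print(tfidf_tuned.fit_transform(docs).shape)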
Here's a comparison between word counts & TF-IDF scores from the same data:

Word counts:
1. "lemon", "market", and "maven" are all equal
2. The value of 6 for "lemon" skews the results
3. "lemonade" shows up three times and "tea" twice

TF-IDF scores:
1. "maven" and "market" have higher values since they are more rare
2. Everything ranges between 0 and 1
3. "lemonade" is high for rows 0 and 2, but lower than "tea" in row 6
NEW MESSAGE
May 21, 2025
From: Lexi Con (Lead Data Scientist)
Subject: RE: Vectorization request

Hi again – Can you do the same analysis as last time, but using TF-IDF instead and compare the two results? Thanks!
---
Hello,
Now that you've cleaned and normalized the book descriptions using pandas and spaCy, can you create a quick visualization to show the top 10 most common terms in the descriptions?
Thanks!
Lexi

Key Objectives
1. Vectorize the cleaned and normalized text using TF-IDF Vectorizer with the default parameters
2. Modify the TF-IDF Vectorizer parameters to reduce the number of columns:
   a) Remove stop words
   b) Set a minimum document frequency of 10%
   c) Set a maximum document frequency of 50%
3. Using the updated TF-IDF Vectorizer, create a horizontal bar chart of the top 10 most highly weighted terms
4. Compare the Count Vectorizer bar chart from the previous assignment with the TF-IDF Vectorizer bar chart and note the differences in the top term lists
Text cleaning & normalization can be done using Pandas and spaCy
• Pandas is good for simple tasks like lowercasing and removing text with regular expressions
• spaCy can perform more advanced linguistic tasks like tokenization, lemmatization, removing stop words, and more
• By putting the steps into Python functions, you can better organize your code and create an NLP pipeline
In this section, we’ll highlight tasks that can be solved using traditional NLP methods,
including rules-based, and supervised & unsupervised machine learning techniques
This section covers Sentiment Analysis, Text Classification, and Topic Modeling:
• Use Naïve Bayes and Logistic Regression as supervised learning approaches for text classification with the scikit-learn library
• Use Non-Negative Matrix Factorization (NMF) as an unsupervised learning approach for topic modeling with the scikit-learn library
Supervised Learning – Using historical data to predict the future ("What will house prices look like for the next 12 months?")

Unsupervised Learning – Finding patterns and relationships in data ("How can I segment my customers?")
You can use any of these common machine learning algorithms for natural
language processing tasks once you’ve preprocessed your text data:
*The majority of these algorithms are explained in detail in courses 2-4 of this Data Science in Python series (Regression, Classification and Unsupervised Learning)
These common NLP tasks are often solved using traditional NLP methods, such as simple rules-based techniques or more advanced ML algorithms:
• Sentiment Analysis – Identifying the positivity or negativity of text
• Text Classification – Classifying text as one label or another
• Topic Modeling – Finding themes within a corpus of text

These cannot be done with traditional techniques, so use modern techniques instead:
• Text Generation
• Machine Translation
• Question Answering
Certain words in the text are hints that it is positive or negative

You'll notice that sentiment analysis is applied on raw text – it's not cleaned because punctuation matters, and it's not vectorized because word order matters

Sentiment analysis can be done using rules-based techniques with libraries like VADER, classification techniques (up next), or modern NLP techniques (later)

EXAMPLE: 0% of the text is negative, 75% is neutral, and 25% is positive, so the overall sentiment score is positive!

VADER assigns predefined sentiment weights to words (amazing = 2.8, horrible = -2.5), incorporates modifiers (not, very, caps, punctuation, etc.), and computes a final score
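A minimal, hedged sketch of VADER scoring using the vaderSentiment package (the example sentence is illustrative):

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

# polarity_scores returns negative / neutral / positive proportions plus a compound score
scores = analyzer.polarity_scores("The lemonade at the market is amazing!")
print(scores)  # e.g. {'neg': 0.0, 'neu': ..., 'pos': ..., 'compound': ...}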
NEW MESSAGE
May 22, 2025
From: Oscar Wynn (The Movie Maven)
Subject: Feel good vs dark movies

Hi there,
We're a small entertainment news and movie reviews website, focused on data-driven content.
We're publishing an article on the top 10 most feel-good movies and the top 10 darkest movies according to data.
Could you use sentiment analysis to help us come up with movies for these two lists?
Thanks!
Oscar

movie_reviews.csv

Key Objectives
1. Create a new "nlp_machine_learning" environment
2. Launch Jupyter Notebook
3. Read in the movie_reviews.csv file
4. Apply sentiment analysis to the movie_info column
5. Sort the sentiment scores to return the top 10 and bottom 10 sentiment scores and their corresponding movie titles
Text classification is used to categorize text into groups based on labeled data

EXAMPLE 1: Spam detection
• These existing emails have been prelabeled as spam or not spam
• Given a new email ("Send money ASAP!"), text classification will tell us if it's spam or not spam

EXAMPLE 2: Customer support tickets
• These existing customer support tickets have been prelabeled as billing issues, tech support and other
• Given a new ticket ("Help me reset my password"), text classification will tell us what type of ticket it is
You can input vectorized text data into any classification algorithm:
• KNN, Logistic Regression, Decision Trees, Random Forests, Gradient Boosted Trees, etc.
• Naïve Bayes is another classification algorithm that works especially well on text data

Which classification algorithm should I choose for my text data?
• For small data sets (<10k rows), start with Naïve Bayes and other simple models like Logistic Regression, KNN, etc.
• For medium data sets (<100k rows), start with Logistic Regression and other classification techniques like Decision Trees, Random Forests, Gradient Boosted Trees, etc.
• For large data sets (>1M rows), start with Gradient Boosted Trees and potentially move on to modern NLP techniques with LLMs
EXAMPLE: If an email contains the word "ASAP", how likely is it to be spam?

P(Spam | ASAP) = P(ASAP | Spam) × P(Spam) / P(ASAP)

• P(Spam | ASAP) – the probability that an email is spam, given it contains the word ASAP
• P(ASAP | Spam) – the probability that the word ASAP appears in a spam email
• P(Spam) – the probability an email is spam
• P(ASAP) – the probability the word ASAP is in any email

Distribution of 1,000 emails (250 spam, 750 not spam):
• ASAP = 1: 50 spam, 10 not spam (60 total)
• ASAP = 0: 200 spam, 740 not spam

P(Spam | ASAP) = (50/250 × 250/1000) / (60/1000) = (0.2 × 0.25) / 0.06 ≈ 0.83
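A quick sketch verifying that arithmetic with the counts above:

# Counts from the 1,000-email example
spam_total = 250
asap_and_spam = 50
asap_total = 60
emails = 1000

p_asap_given_spam = asap_and_spam / spam_total   # 0.2
p_spam = spam_total / emails                     # 0.25
p_asap = asap_total / emails                     # 0.06

p_spam_given_asap = p_asap_given_spam * p_spam / p_asap
print(round(p_spam_given_asap, 2))  # 0.83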
EXAMPLE: If an email contains the word "ASAP" and the "$" symbol, how likely is it to be spam?

P(Spam | ASAP, $) = P(ASAP | Spam) × P($ | Spam) × P(Spam) / P(ASAP, $)

• P(Spam | ASAP, $) – the probability that an email is spam, given it contains ASAP and $
• P(ASAP | Spam) – the probability that the word ASAP appears in a spam email
• P($ | Spam) – the probability that $ appears in a spam email
• P(Spam) – the probability an email is spam
• P(ASAP, $) – the probability that an email contains both ASAP and $

Splitting the numerator into P(ASAP | Spam) × P($ | Spam) is the naïve assumption – it treats the probability that an email contains ASAP and the probability it contains $ as independent, even though they're actually correlated
Distribution of 1,000 emails:
• ASAP = 1: 50 spam, 10 not spam
• ASAP = 0: 200 spam, 740 not spam
• $ = 1: 80 spam, 30 not spam
• Emails containing both ASAP and $: 42

P(Spam | ASAP, $) = P(ASAP | Spam) × P($ | Spam) × P(Spam) / P(ASAP, $) ≈ 0.95

There's a 95% chance an email is spam if it contains "ASAP" and "$"; these probabilities are all calculated from the email counts above
NAÏVE BAYES IN PYTHON

We're using MultinomialNB because the inputs are counts (like you would see in a CountVectorizer output) – for 1/0 values, like in the previous spam example, you would use BernoulliNB instead
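A hedged sketch of fitting MultinomialNB on a vectorized DTM (the mini-corpus, labels, and variable names here are illustrative assumptions, not the course's exact code):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Illustrative documents with binary labels (1 = spam, 0 = not spam)
docs = ["send money asap", "help me reset my password", "asap wire money now", "meeting notes attached"]
labels = [1, 0, 1, 0]

# Vectorize the text, then split into training and test sets
cv = CountVectorizer()
X = cv.fit_transform(docs)
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.5, random_state=42, stratify=labels)

# MultinomialNB works on the word counts produced by the CountVectorizer
nb = MultinomialNB()
nb.fit(X_train, y_train)
print(accuracy_score(y_test, nb.predict(X_test)))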
Once you fit your first Naïve Bayes model in Python, you can improve your text classification model by tuning any part of the NLP pipeline:

1 Text preprocessing
• Update any cleaning or normalization steps

2 Vectorization
• Fine-tune the CountVectorizer parameters (stop_words, ngram_range, min_df, etc.)
• Try using TfidfVectorizer instead

3 Feature engineering
• Include non-term values such as text length, sentiment score, time of day sent, etc.

4 Modeling
• Try a different probability cutoff point instead of the default 50% probability
• Try a different classification model (Logistic Regression, Gradient Boosted Trees, etc.)
NEW MESSAGE
May 23, 2025
From: Oscar Wynn (The Movie Maven)
Subject: Female vs male directors

Hi again,
Our next piece is going to spotlight female directors, and we want to see if there are any differences between the types of movies that female versus male directors create.
Could you create a classification model that predicts which movies are directed by females versus males based on their movie descriptions?
Please also send over a list of the top 5 movies that are most likely directed by a female according to the model.
Thanks!

Key Objectives
1. Clean and normalize the "movie_info" column using the "maven_text_preprocessing.py" module
2. Create a Count Vectorizer
   • Remove stop words
   • Set the minimum document frequency to 10%
3. Create a Naïve Bayes model and a Logistic Regression model to predict which movies are directed by women vs men using the CV
4. Compare their accuracy scores and classification reports
5. Using the better performing model, return the top 5 movies that the model predicts are most likely directed by a woman
EXAMPLE: "I like lemons and limes." → Topic 1: 100%, Topic 2: 0%

What are topics 1 and 2?
• Topic 1: lemons, limes, cookies, apples, blueberries → Food
• Topic 2: puppies, kittens, cat, dog → Animals

You can input vectorized text data into a topic modeling algorithm

In this course, we'll be demoing NMF because it's in sklearn. For more details on LDA, you can check out my YouTube video on LDA using gensim: https://www.youtube.com/watch?v=NYkbqzTlW3w
NMF factors a document-term matrix (Term 1 through Term 6) into two smaller matrices: a document-topic matrix and a topic-term matrix (Topic 1 and Topic 2)
Other matrix factorization techniques include PCA and SVD, but NMF is the only one that returns
all positive results, which is needed for text data where negative values wouldn’t make sense
Use sklearn's NMF from the decomposition module to perform NMF in Python
• The input should be the output of a CountVectorizer or TfidfVectorizer
• Start at 2 components (topics) and increase by 1 until you figure out the best number of topics

This follows the typical sklearn process for an unsupervised learning model*:
1. Instantiate an object
2. Fit and transform the data
3. View the attributes

NMF starts with an initial set of randomized values, so set a random state to get the same results each time
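A minimal, hedged sketch of that process (the mini-corpus and two topics are illustrative):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

docs = [
    "lemons limes apples blueberries cookies",
    "puppies kittens cat dog",
    "apples blueberries lemons",
    "dog cat puppies",
]

# Vectorize the text
tfidf = TfidfVectorizer()
dtm = tfidf.fit_transform(docs)

# 1. Instantiate an object  2. Fit and transform the data
nmf = NMF(n_components=2, random_state=42)
doc_topics = nmf.fit_transform(dtm)   # document-topic matrix

# 3. View the attributes: top terms per topic from the topic-term matrix
terms = tfidf.get_feature_names_out()
for i, topic in enumerate(nmf.components_):
    top_terms = [terms[j] for j in topic.argsort()[-3:][::-1]]
    print(f"Topic {i + 1}: {top_terms}")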
Once you fit your first NMF model in Python, you can improve your topic model by tuning any part of the NLP pipeline:

1 Text preprocessing
• Update any cleaning or normalization steps

2 Vectorization
• Fine-tune the TfidfVectorizer parameters (stop_words, ngram_range, min_df, etc.)
• Try using CountVectorizer instead

3 Modeling
• Modify "n_components" to try out different numbers of topics
• Try a different topic modeling technique (Latent Dirichlet Allocation, Latent Semantic Analysis, BERTopic, Top2Vec, etc.)

BONUS: In the demo, we'll show an example of how you can mix and match multiple algorithms for your analysis (in this case, topics + sentiment scores + EDA = sentiment about each topic)
NEW MESSAGE
May 27, 2025
From: Oscar Wynn (The Movie Maven)
Subject: Movie themes

Hello,
Our feel-good movies list and female directors articles were both hits over the weekend! Thanks for your help with those.
Our next goal is to suggest movies based on movie themes. Could you use topic modeling to find the major themes in our movie list?
Once you do that, for a few of the themes, can you provide a list of the top 5 movies that have the theme?
Thanks!
Oscar

Key Objectives
1. Using the same preprocessed data as the last assignment, create a Tfidf Vectorizer
   • Remove stop words
   • Start with min_df = 0.05 and max_df = 0.2
2. Create an NMF model to find the main topics in the movie descriptions
   • Start with n_components = 2
3. Tweak the model by updating the Tfidf Vectorizer parameters and number of topics
4. Interpret and name the topics
5. For two of the topics, return the top movies that contain the topic
Machine learning techniques are a great starting point for NLP tasks
• Any general ML algorithm can be applied for NLP tasks once the text data is cleaned and vectorized
• ML is the preferred approach for small & medium data sets, while modern NLP is preferred for large ones
In this section, we’ll visually break down the concepts behind neural networks and deep
learning, the building blocks of modern NLP techniques
Traditional NLP
• Data: Small to medium data sets
• Techniques: Rules-based, supervised learning (Naïve Bayes), unsupervised learning (NMF)
• Applications: Sentiment analysis, text classification, topic modeling

Modern NLP
• Data: Small to large data sets
• Techniques: Transformers-based LLMs (BERT, GPT, LLaMA, T5, BART)
• Applications: Traditional NLP applications, text summarization, text generation

To understand transformer-based models, we'll start with the basics: neural networks
In the next two sections, we’ll be covering these modern NLP concepts to
understand how LLMs work before applying them using Hugging Face:
Concepts:
1 Neural Networks & Deep Learning
   a) Logistic Regression
   b) Neural Networks

Key Terms:
• Neural network components: layers, nodes, weights, parameters, activation functions
• Neural network training: forward pass, loss, backpropagation, gradient descent
To understand neural networks, let's start with a simple logistic regression model
• Logistic regression is a classification technique used to predict a true or false outcome

EXAMPLE: The higher the temperature today, the more likely my lemonade stand will be profitable

p = σ(mx + b)
• x = 15 (59F) → p = 22%
• x = 25 (77F) → p = 78%
• x = 35 (95F) → p = 98%

What are σ, m, and b?

First, a linear transformation with slope m and intercept b:
y = mx + b = 0.25x − 5
• x = 15 → y = −1.25
• x = 25 → y = 1.25
• x = 35 → y = 3.75

Then, a sigmoid (σ) transformation turns y into a probability of profit between 0 (No) and 1 (Yes):
p = σ(y) = 1 / (1 + e^−(mx+b))
• x = 15 → y = −1.25 → p = 22%
• x = 25 → y = 1.25 → p = 78%
• x = 35 → y = 3.75 → p = 98%

These probability values are much more interpretable than the original y-values

Why are these steps so important?
• A linear transformation followed by a non-linear transformation is the main calculation of a neural network (coming up next!)
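A quick sketch reproducing those numbers in Python (same m = 0.25 and b = −5 as above):

import math

def sigmoid(y):
    return 1 / (1 + math.exp(-y))

m, b = 0.25, -5
for x in (15, 25, 35):
    y = m * x + b        # linear transformation
    p = sigmoid(y)       # non-linear (sigmoid) transformation
    print(f"x = {x}: y = {y:.2f}, p = {p:.0%}")  # 22%, 78%, 98%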
EXAMPLE: Based on today's temperature, will my lemonade stand be profitable today?

Viewed as a tiny network, today's temperature (x) feeds into a single node that outputs the probability of profit (p):
p = σ(wx + b)
• The sigmoid function is a type of non-linear transformation, or in NN-speak, an activation function

Adding a second feature, whether it's the weekend (x2), gives:
h = σ(0.25x1 + 1x2 − 5)
• x1 = 15, x2 = 0 → h = 22%
• x1 = 15, x2 = 1 → h = 43%
• x1 = 25, x2 = 0 → h = 77%
• x1 = 25, x2 = 1 → h = 90%
• x1 = 35, x2 = 0 → h = 97%
• x1 = 35, x2 = 1 → h = 99%

Coming up with more features is hard to do; let's have an algorithm help us
• The network is organized into an input layer (features), a hidden layer (parameters & activation functions), and an output layer (predictions)
Adding nodes to the hidden layer makes this behave like a true neural network
• You can specify the number of nodes in the hidden layer and the activation function for each

EXAMPLE: Based on today's temperature, will my lemonade stand be profitable today?

Hidden nodes h1 ("thirst level") and h2 ("foot traffic") each combine the inputs with their own weights and bias:
h1 = σ(w1x1 + w2x2 + b1)
h2 = σ(w3x1 + w4x2 + b2)

The outputs from the hidden layer are assigned their own weights and bias, and wrapped in a final activation function:
p = σ(w5h1 + w6h2 + b3)

We are now calculating these probabilities from a neural network vs the logistic regression from earlier:
• x1 = 15, x2 = 0 → p = 15%
• x1 = 15, x2 = 1 → p = 46%
• x1 = 25, x2 = 0 → p = 73%
• x1 = 25, x2 = 1 → p = 90%
• x1 = 35, x2 = 0 → p = 91%
• x1 = 35, x2 = 1 → p = 96%

Input layer (Features) → Hidden layer (Parameters & Activation functions) → Output layer (Predictions)
In Python, you can build this kind of network with scikit-learn's MLPClassifier:

nn = MLPClassifier(hidden_layer_sizes=(100,), activation='relu')

• hidden_layer_sizes – Defines the number of nodes in each hidden layer
   Examples: (100,) – 1 hidden layer with 100 nodes (default); (50,30) – 2 hidden layers with 50 and 30 nodes respectively
• activation – Sets the activation function for the hidden layers
   Examples: 'relu' (default), 'logistic', 'tanh', 'identity'
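A hedged, runnable sketch using the lemonade-stand data from the training walkthrough later in this section (a real model would use more data and tuned settings):

from sklearn.neural_network import MLPClassifier

# [temperature, weekend] -> profitable (1) or not (0)
X = [[14, 0], [18, 1], [22, 0], [22, 1], [26, 0], [26, 1], [30, 0], [30, 1], [35, 0], [15, 1]]
y = [0, 0, 0, 1, 1, 1, 1, 1, 1, 0]

# One hidden layer with 2 nodes and sigmoid ("logistic") activations, matching the diagrams
nn = MLPClassifier(hidden_layer_sizes=(2,), activation="logistic", max_iter=5000, random_state=42)
nn.fit(X, y)

# Predicted probability of profit for a 25°C weekday
print(nn.predict_proba([[25, 0]]))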
As we saw in the Python code, the weights and biases of a neural network are contained in weight matrices and bias vectors
• This is helpful to remember for the next section on Transformers & LLMs, where everything we review will live in matrices
• How to read a weight like w12: the first subscript is the starting node and the second is the ending node

Weight matrices:
• Input → hidden layer: [[0.3, 0.08], [0.05, 2.5]]
• Hidden → output layer: [3.5, 3]

Bias vectors:
• Hidden layer: [−6.5, −1.8]
• Output layer: [−3.2]

These correspond to the calculations:
h1 = σ(0.3x1 + 0.05x2 − 6.5)
h2 = σ(0.08x1 + 2.5x2 − 1.8)
p = σ(3.5h1 + 3h2 − 3.2)
Training a neural network follows these steps:

1. Random start: Start with an initial set of random weights & biases
2. Forward pass: Starting from the left, apply all calculations through the neural network to get to a final set of predicted values
3. Calculate loss: Compare the predicted and actual values to compute the error, or loss
4. Update parameters: Starting from the right, calculate how much each parameter contributed to the loss with back propagation, and then use gradient descent (a popular optimization technique) to adjust the parameters by moving them a step closer to reducing the loss
5. Repeat: Repeat steps 2-4 until you minimize the loss or reach an iteration limit and lock in the final model parameters (weights and biases)

The math behind back propagation and gradient descent is beyond the scope of this course, but the key takeaway is that each iteration moves closer to the optimal parameters, and it does so as efficiently as possible
STEP 1: Random start – Start with an initial set of random weights & biases
Modern NLP
Overview Training data:
𝑏1
Temperature Weekend Profitable
𝑤1 (x1) (x2) (y)
𝑤5
Neural Networks X1 h1 𝑏3
14 0 0
𝑤3
18 1 0
𝑏2 p
𝑤2 22 0 0
Deep Learning
22 1 1
X2 h2 𝑤6
𝑤4 26 0 1
26 1 1
30 0 1
30 1 1
35 0 1
15 1 0
STEP 1: Random start – Start with an initial set of random weights & biases
Modern NLP
Overview Training data:
𝑏1
Temperature Weekend Profitable
𝑤1 (x1) (x2) (y)
𝑤5
Neural Networks X1 h1 𝑏3
14 0 0
𝑤3
18 1 0
𝑏2 p
𝑤2 22 0 0
Deep Learning
22 1 1
X2 h2 𝑤6
𝑤4 26 0 1
26 1 1
30 0 1
ℎ1 = 𝜎(𝑤1 𝑥1 + 𝑤2 𝑥2 + 𝑏1 ) 30 1 1
35 0 1
15 1 0
STEP 1: Random start – Start with an initial set of random weights & biases
Modern NLP
Overview Training data:
𝑏1
Temperature Weekend Profitable
𝑤1 (x1) (x2) (y)
𝑤5
Neural Networks X1 h1 𝑏3
14 0 0
𝑤3
18 1 0
𝑏2 p
𝑤2 22 0 0
Deep Learning
22 1 1
X2 h2 𝑤6
𝑤4 26 0 1
26 1 1
30 0 1
ℎ1 = 𝜎(𝑤1 𝑥1 + 𝑤2 𝑥2 + 𝑏1 ) 30 1 1
35 0 1
15 1 0
ℎ2 = 𝜎(𝑤3 𝑥1 + 𝑤4 𝑥2 + 𝑏2 )
STEP 1: Random start – Start with an initial set of random weights & biases
Modern NLP
Overview Training data:
𝑏1
Temperature Weekend Profitable
𝑤1 (x1) (x2) (y)
𝑤5
Neural Networks X1 h1 𝑏3
14 0 0
𝑤3
18 1 0
𝑏2 p
𝑤2 22 0 0
Deep Learning
22 1 1
X2 h2 𝑤6
𝑤4 26 0 1
26 1 1
30 0 1
ℎ1 = 𝜎(𝑤1 𝑥1 + 𝑤2 𝑥2 + 𝑏1 ) 30 1 1
35 0 1
15 1 0
ℎ2 = 𝜎(𝑤3 𝑥1 + 𝑤4 𝑥2 + 𝑏2 )
𝑝 = 𝜎(𝑤5 ℎ1 + 𝑤6 ℎ2 + 𝑏3 )
*Copyright Maven Analytics, LLC
TRAINING NEURAL NETWORKS
STEP 1: Random start – Start with an initial set of random weights & biases
Modern NLP
Overview Training data:
0
Temperature Weekend Profitable
1 (x1) (x2) (y)
1
Neural Networks X1 h1 0
14 0 0
1
18 1 0
0 p
1 22 0 0
Deep Learning
22 1 1
X2 h2 1
1 26 0 1
26 1 1
30 0 1
ℎ1 = 𝜎(𝑤1 𝑥1 + 𝑤2 𝑥2 + 𝑏1 ) 30 1 1
35 0 1
15 1 0
ℎ2 = 𝜎(𝑤3 𝑥1 + 𝑤4 𝑥2 + 𝑏2 )
𝑝 = 𝜎(𝑤5 ℎ1 + 𝑤6 ℎ2 + 𝑏3 )
*Copyright Maven Analytics, LLC
TRAINING NEURAL NETWORKS
STEP 1: Random start – Start with an initial set of random weights & biases
Modern NLP
Overview Training data:
0
Temperature Weekend Profitable
1 (x1) (x2) (y)
1
Neural Networks X1 h1 0
14 0 0
1
18 1 0
0 p
1 22 0 0
Deep Learning
22 1 1
X2 h2 1
1 26 0 1
26 1 1
30 0 1
ℎ1 = 𝜎(1𝑥1 + 1𝑥2 + 0) 30 1 1
35 0 1
15 1 0
ℎ2 = 𝜎(1𝑥1 + 1𝑥2 + 0)
𝑝 = 𝜎(1ℎ1 + 1ℎ2 + 0)
*Copyright Maven Analytics, LLC
TRAINING NEURAL NETWORKS
STEP 1: Random start – Start with an initial set of random weights & biases
Modern NLP
Overview Training data:
0
Temperature Weekend Profitable
1 (x1) (x2) (y)
1
Neural Networks X1 h1 0
14 0 0
1
18 1 0
0 p
1 22 0 0
Deep Learning
22 1 1
X2 h2 1
1 26 0 1
26 1 1
30 0 1
ℎ1 = 𝜎 𝑥1 + 𝑥2 30 1 1
35 0 1
15 1 0
ℎ2 = 𝜎(𝑥1 + 𝑥2 )
𝑝 = 𝜎(ℎ1 + ℎ2 )
*Copyright Maven Analytics, LLC
TRAINING NEURAL NETWORKS
STEP 2: Forward pass –Starting from the left, apply all calculations through the
neural network to get to a final set of predicted values
Modern NLP
Overview Training data:
0
Temperature Weekend Profitable Prediction
1 (x1) (x2) (y) (p)
1
Neural Networks X1 h1 0
14 0 0
1
18 1 0
0 p
1 22 0 0
Deep Learning
22 1 1
X2 h2 1
1 26 0 1
26 1 1
30 0 1
ℎ1 = 𝜎 𝑥1 + 𝑥2 30 1 1
35 0 1
15 1 0
ℎ2 = 𝜎(𝑥1 + 𝑥2 )
𝑝 = 𝜎(ℎ1 + ℎ2 )
*Copyright Maven Analytics, LLC
TRAINING NEURAL NETWORKS
STEP 2: Forward pass –Starting from the left, apply all calculations through the
neural network to get to a final set of predicted values
Modern NLP
Overview Training data:
0
Temperature Weekend Profitable Prediction
1 (x1) (x2) (y) (p)
1
Neural Networks X1 h1 0
14 0 0
1
18 1 0
0 p
1 22 0 0
Deep Learning
22 1 1
X2 h2 1
1 26 0 1
26 1 1
30 0 1
ℎ1 = 𝜎 14 + 0 = 0.99 30 1 1
35 0 1
15 1 0
ℎ2 = 𝜎(𝑥1 + 𝑥2 )
𝑝 = 𝜎(ℎ1 + ℎ2 )
*Copyright Maven Analytics, LLC
TRAINING NEURAL NETWORKS
STEP 2: Forward pass –Starting from the left, apply all calculations through the
neural network to get to a final set of predicted values
Modern NLP
Overview Training data:
0
Temperature Weekend Profitable Prediction
1 (x1) (x2) (y) (p)
1
Neural Networks X1 h1 0
14 0 0
1
18 1 0
0 p
1 22 0 0
Deep Learning
22 1 1
X2 h2 1
1 26 0 1
26 1 1
30 0 1
ℎ1 = 𝜎 14 + 0 = 0.99 30 1 1
35 0 1
15 1 0
ℎ2 = 𝜎 14 + 0 = 0.99
𝑝 = 𝜎(ℎ1 + ℎ2 )
*Copyright Maven Analytics, LLC
TRAINING NEURAL NETWORKS
STEP 2: Forward pass –Starting from the left, apply all calculations through the
neural network to get to a final set of predicted values
Modern NLP
Overview Training data:
0
Temperature Weekend Profitable Prediction
1 (x1) (x2) (y) (p)
1
Neural Networks X1 h1 0
14 0 0
1
18 1 0
0 p
1 22 0 0
Deep Learning
22 1 1
X2 h2 1
1 26 0 1
26 1 1
30 0 1
ℎ1 = 𝜎 14 + 0 = 0.99 30 1 1
35 0 1
15 1 0
ℎ2 = 𝜎 14 + 0 = 0.99
STEP 2: Forward pass –Starting from the left, apply all calculations through the
neural network to get to a final set of predicted values
Modern NLP
Overview Training data:
0
Temperature Weekend Profitable Prediction
1 (x1) (x2) (y) (p)
1
Neural Networks X1 h1 0
14 0 0 0.88
1
18 1 0
0 p
1 22 0 0
Deep Learning
22 1 1
X2 h2 1
1 26 0 1
26 1 1
30 0 1
ℎ1 = 𝜎 14 + 0 = 0.99 30 1 1
35 0 1
15 1 0
ℎ2 = 𝜎 14 + 0 = 0.99
STEP 2: Forward pass –Starting from the left, apply all calculations through the
neural network to get to a final set of predicted values
Modern NLP
Overview Training data:
0
Temperature Weekend Profitable Prediction
1 (x1) (x2) (y) (p)
1
Neural Networks X1 h1 0
14 0 0 0.88
1
18 1 0
0 p
1 22 0 0
Deep Learning
22 1 1
X2 h2 1
1 26 0 1
26 1 1
30 0 1
ℎ1 = 𝜎(𝑥1 + 𝑥2 ) 30 1 1
35 0 1
15 1 0
ℎ2 = 𝜎(𝑥1 + 𝑥2 )
𝑝 = 𝜎(ℎ1 + ℎ2 )
*Copyright Maven Analytics, LLC
TRAINING NEURAL NETWORKS
STEP 2: Forward pass –Starting from the left, apply all calculations through the
neural network to get to a final set of predicted values
Modern NLP
Overview Training data:
0
Temperature Weekend Profitable Prediction
1 (x1) (x2) (y) (p)
1
Neural Networks X1 h1 0
14 0 0 0.88
1
18 1 0 0.88
0 p
1 22 0 0 0.88
Deep Learning
22 1 1 0.88
X2 h2 1
1 26 0 1 0.88
26 1 1 0.88
30 0 1 0.88
ℎ1 = 𝜎(𝑥1 + 𝑥2 ) 30 1 1 0.88
35 0 1 0.88
15 1 0 0.88
ℎ2 = 𝜎(𝑥1 + 𝑥2 )
The initial model isn’t sensitive to our inputs and
𝑝 = 𝜎(ℎ1 + ℎ2 ) predicts we’ll be profitable 88% of the time
STEP 3: Calculate loss – Compare the predicted and actual values to compute the
error, or loss
Modern NLP
Overview Training data:
0
Temperature Weekend Profitable Prediction
1 (x1) (x2) (y) (p)
1
Neural Networks X1 h1 0
14 0 0 0.88
1
18 1 0 0.88
0 p
1 22 0 0 0.88
Deep Learning
22 1 1 0.88
X2 h2 1
1 26 0 1 0.88
26 1 1 0.88
30 0 1 0.88
ℎ1 = 𝜎(𝑥1 + 𝑥2 ) 30 1 1 0.88
35 0 1 0.88
15 1 0 0.88
ℎ2 = 𝜎(𝑥1 + 𝑥2 )
𝑝 = 𝜎(ℎ1 + ℎ2 )
*Copyright Maven Analytics, LLC
TRAINING NEURAL NETWORKS
STEP 3: Calculate loss – Compare the predicted and actual values to compute the
error, or loss
Modern NLP
Overview Training data:
0
Temperature Weekend Profitable Prediction Error
1 (x1) (x2) (y) (p) (ε)
1
Neural Networks X1 h1 0
14 0 0 0.88 -0.88
1
18 1 0 0.88 -0.88
0 p
1 22 0 0 0.88 -0.88
Deep Learning
22 1 1 0.88 0.12
X2 h2 1
1 26 0 1 0.88 0.12
26 1 1 0.88 0.12
30 0 1 0.88 0.12
35 0 1 0.88 0.12
15 1 0 0.88 -0.88
ℎ2 = 𝜎(𝑥1 + 𝑥2 )
=
LOG LOSS: 0.927
𝑝 = 𝜎(ℎ1 + ℎ2 )
*Copyright Maven Analytics, LLC
TRAINING NEURAL NETWORKS
STEP 4: Update parameters – Starting from the right, calculate how much each
parameter contributed to the loss with back propagation, and then use gradient
descent to adjust the parameters by moving them a step closer to reducing the loss
Modern NLP
Overview Training data:
0
Temperature Weekend Profitable Prediction Error
1 (x1) (x2) (y) (p) (ε)
1
Neural Networks X1 h1 0
14 0 0 0.88 -0.88
1
18 1 0 0.88 -0.88
0 p
1 22 0 0 0.88 -0.88
Deep Learning
22 1 1 0.88 0.12
X2 h2 1
1 26 0 1 0.88 0.12
26 1 1 0.88 0.12
30 0 1 0.88 0.12
35 0 1 0.88 0.12
15 1 0 0.88 -0.88
ℎ2 = 𝜎(𝑥1 + 𝑥2 )
=
LOG LOSS: 0.927
𝑝 = 𝜎(ℎ1 + ℎ2 )
*Copyright Maven Analytics, LLC
TRAINING NEURAL NETWORKS
STEP 4: Update parameters – Starting from the right, calculate how much each
parameter contributed to the loss with back propagation, and then use gradient
descent to adjust the parameters by moving them a step closer to reducing the loss
Modern NLP
Overview Training data:
−6
Temperature Weekend Profitable Prediction Error
0.4 (x1) (x2) (y) (p) (ε)
2.5
Neural Networks X1 h1 −2.5
14 0 0 0.88 -0.88
0.2
18 1 0 0.88 -0.88
−2 p
0.1 22 0 0 0.88 -0.88
Deep Learning
22 1 1 0.88 0.12
X2 h2 2.5
2.2 26 0 1 0.88 0.12
26 1 1 0.88 0.12
30 0 1 0.88 0.12
30 1 1 0.88 0.12
35 0 1 0.88 0.12
15 1 0 0.88 -0.88
=
LOG LOSS: 0.927
26 1 1 0.88 0.12
30 0 1 0.88 0.12
35 0 1 0.88 0.12
15 1 0 0.88 -0.88
=
LOG LOSS: 0.927
26 1 1 0.88 0.12
30 0 1 0.88 0.12
35 0 1 0.88 0.12
15 1 0 0.88 -0.88
ℎ2 = 𝜎(0.2𝑥1 + 2.2𝑥2 − 2)
=
LOG LOSS: 0.927
26 1 1 0.88 0.12
30 0 1 0.88 0.12
35 0 1 0.88 0.12
15 1 0 0.88 -0.88
ℎ2 = 𝜎(0.2𝑥1 + 2.2𝑥2 − 2)
=
LOG LOSS: 0.927
𝑝 = 𝜎(2.5ℎ1 + 2.5ℎ2 − 2.5)
*Copyright Maven Analytics, LLC
TRAINING NEURAL NETWORKS
STEP 5: Repeat – Repeat steps 2-4 until you minimize the loss or reach an
iteration limit and lock in the final model parameters (weights and biases)
Modern NLP 2. FORWARD PASS
Overview Training data:
−6
Temperature Weekend Profitable Prediction Error
0.4 (x1) (x2) (y) (p) (ε)
2.5
Neural Networks X1 h1 −2.5
14 0 0 0.88 -0.88
0.2
18 1 0 0.88 -0.88
−2 p
0.1 22 0 0 0.88 -0.88
Deep Learning
22 1 1 0.88 0.12
X2 h2 2.5
2.2 26 0 1 0.88 0.12
26 1 1 0.88 0.12
30 0 1 0.88 0.12
35 0 1 0.88 0.12
15 1 0 0.88 -0.88
ℎ2 = 𝜎(0.2𝑥1 + 2.2𝑥2 − 2)
=
LOG LOSS: 0.927
𝑝 = 𝜎(2.5ℎ1 + 2.5ℎ2 − 2.5)
*Copyright Maven Analytics, LLC
TRAINING NEURAL NETWORKS
STEP 5: Repeat – Repeat steps 2-4 until you minimize the loss or reach an
iteration limit and lock in the final model parameters (weights and biases)
Modern NLP 2. FORWARD PASS
Overview Training data: 3. CALCULATE LOSS
−6
Temperature Weekend Profitable Prediction Error
0.4 (x1) (x2) (y) (p) (ε)
2.5
Neural Networks X1 h1 −2.5
14 0 0 0.56
0.2
18 1 0 0.87
−2 p
0.1 22 0 0 0.9
Deep Learning
22 1 1 0.91
X2 h2 2.5
2.2 26 0 1 0.91
26 1 1 0.92
30 0 1 0.92
35 0 1 0.92
15 1 0 0.77
ℎ2 = 𝜎(0.2𝑥1 + 2.2𝑥2 − 2)
This model still estimates we’ll likely be profitable in
each scenario, but the probabilities make more sense
𝑝 = 𝜎(2.5ℎ1 + 2.5ℎ2 − 2.5)
*Copyright Maven Analytics, LLC
TRAINING NEURAL NETWORKS
STEP 5: Repeat – Repeat steps 2-4 until you minimize the loss or reach an
iteration limit and lock in the final model parameters (weights and biases)
Modern NLP 4. ADJUST PARAMETERS
Overview Training data: 3. CALCULATE LOSS
−6
Temperature Weekend Profitable Prediction Error
0.4 (x1) (x2) (y) (p) (ε)
2.5
Neural Networks X1 h1 −2.5
14 0 0 0.56 -0.56
0.2
18 1 0 0.87 -0.87
−2 p
0.1 22 0 0 0.9 -0.9
Deep Learning
22 1 1 0.91 0.09
X2 h2 2.5
2.2 26 0 1 0.91 0.09
26 1 1 0.92 0.08
30 0 1 0.92 0.08
35 0 1 0.92 0.08
15 1 0 0.77 -0.77
ℎ2 = 𝜎(0.2𝑥1 + 2.2𝑥2 − 2)
=
LOG LOSS: 0.710
𝑝 = 𝜎(2.5ℎ1 + 2.5ℎ2 − 2.5)
This is down from 0.927!
*Copyright Maven Analytics, LLC
TRAINING NEURAL NETWORKS
STEP 5: Repeat – Repeat steps 2-4 until you minimize the loss or reach an
iteration limit and lock in the final model parameters (weights and biases)
Modern NLP 2. FORWARD PASS
Overview Training data:
−6.3
Temperature Weekend Profitable Prediction Error
0.35 (x1) (x2) (y) (p) (ε)
3
Neural Networks X1 h1 −3
14 0 0 0.56 -0.56
0.12
18 1 0 0.87 -0.87
−1.9 p
0.08 22 0 0 0.9 -0.9
Deep Learning
22 1 1 0.91 0.09
X2 h2 2.8
2.3 26 0 1 0.91 0.09
26 1 1 0.92 0.08
30 0 1 0.92 0.08
35 0 1 0.92 0.08
15 1 0 0.77 -0.77
ℎ2 = 𝜎(0.12𝑥1 + 2.3𝑥2 − 1.9)
=
LOG LOSS: 0.710
𝑝 = 𝜎(3ℎ1 + 2.8ℎ2 − 3)
*Copyright Maven Analytics, LLC
TRAINING NEURAL NETWORKS
STEP 5: Repeat – Repeat steps 2-4 until you minimize the loss or reach an
iteration limit and lock in the final model parameters (weights and biases)
Modern NLP 2. FORWARD PASS
Overview Training data: 3. CALCULATE LOSS
−6.3
Temperature Weekend Profitable Prediction Error
0.35 (x1) (x2) (y) (p) (ε)
3
Neural Networks X1 h1 −3
14 0 0 0.24
0.12
18 1 0 0.76
−1.9 p
0.08 22 0 0 0.79
Deep Learning
22 1 1 0.89
X2 h2 2.8
2.3 26 0 1 0.88
26 1 1 0.93
30 0 1 0.91
35 0 1 0.93
15 1 0 0.59
ℎ2 = 𝜎(0.12𝑥1 + 2.3𝑥2 − 1.9)
We’re now predicting we likely won’t be profitable in
low temperature weekdays, which makes sense!
𝑝 = 𝜎(3ℎ1 + 2.8ℎ2 − 3)
*Copyright Maven Analytics, LLC
TRAINING NEURAL NETWORKS
STEP 5: Repeat – Repeat steps 2-4 until you minimize the loss or reach an
iteration limit and lock in the final model parameters (weights and biases)
Modern NLP 4. ADJUST PARAMETERS
Overview Training data: 3. CALCULATE LOSS
−6.3
Temperature Weekend Profitable Prediction Error
0.35 (x1) (x2) (y) (p) (ε)
3
Neural Networks X1 h1 −3
14 0 0 0.24 -0.24
0.12
18 1 0 0.76 -0.76
−1.9 p
0.08 22 0 0 0.79 -0.79
Deep Learning
22 1 1 0.89 0.11
X2 h2 2.8
2.3 26 0 1 0.88 0.12
26 1 1 0.93 0.07
30 0 1 0.91 0.09
35 0 1 0.93 0.07
15 1 0 0.59 -0.59
ℎ2 = 𝜎(0.12𝑥1 + 2.3𝑥2 − 1.9)
=
LOG LOSS: 0.468
𝑝 = 𝜎(3ℎ1 + 2.8ℎ2 − 3)
This is down from 0.710!
*Copyright Maven Analytics, LLC
TRAINING NEURAL NETWORKS
STEP 5: Repeat – Repeat steps 2-4 until you minimize the loss or reach an
iteration limit and lock in the final model parameters (weights and biases)
Modern NLP 4. ADJUST PARAMETERS
Overview Training data:
−6.5
Temperature Weekend Profitable Prediction Error
0.3 (x1) (x2) (y) (p) (ε)
3.5
Neural Networks X1 h1 −3.2
14 0 0 0.24 -0.24
0.08
18 1 0 0.76 -0.76
−1.8 p
0.05 22 0 0 0.79 -0.79
Deep Learning
22 1 1 0.89 0.11
X2 h2 3
2.5 26 0 1 0.88 0.12
26 1 1 0.93 0.07
30 0 1 0.91 0.09
35 0 1 0.93 0.07
15 1 0 0.59 -0.59
ℎ2 = 𝜎(0.08𝑥1 + 2.5𝑥2 − 1.8)
=
LOG LOSS: 0.468
𝑝 = 𝜎(3.5ℎ1 + 3ℎ2 − 3.2)
*Copyright Maven Analytics, LLC
TRAINING NEURAL NETWORKS
STEP 5: Repeat – Repeat steps 2-4 until you minimize the loss or reach an
iteration limit and lock in the final model parameters (weights and biases)
Modern NLP 2. FORWARD PASS
Overview Training data:
−6.5
Temperature Weekend Profitable Prediction Error
0.3 (x1) (x2) (y) (p) (ε)
3.5
Neural Networks X1 h1 −3.2
14 0 0 0.24 -0.24
0.08
18 1 0 0.76 -0.76
−1.8 p
0.05 22 0 0 0.79 -0.79
Deep Learning
22 1 1 0.89 0.11
X2 h2 3
2.5 26 0 1 0.88 0.12
26 1 1 0.93 0.07
30 0 1 0.91 0.09
35 0 1 0.93 0.07
15 1 0 0.59 -0.59
ℎ2 = 𝜎(0.08𝑥1 + 2.5𝑥2 − 1.8)
=
LOG LOSS: 0.468
𝑝 = 𝜎(3.5ℎ1 + 3ℎ2 − 3.2)
*Copyright Maven Analytics, LLC
2. FORWARD PASS (rerun with the newly adjusted parameters)

Temperature (x1)   Weekend (x2)   Profitable (y)   Prediction (p)
14                 0              0                0.13
18                 1              0                0.60
22                 0              0                0.53
22                 1              1                0.81
26                 0              1                0.78
26                 1              1                0.92
30                 0              1                0.88
35                 0              1                0.92
15                 1              0                0.46

The probabilities for profit are much more spread out and sensitive to both input features!
3. CALCULATE LOSS

The errors (ε = y − p) for the nine rows are -0.13, -0.6, -0.53, 0.19, 0.22, 0.08, 0.12, 0.08, and -0.46.

LOG LOSS: 0.323 – this is now optimized!
With the loss minimized, the final model parameters are locked in:
h1 = σ(0.3x1 + 0.05x2 − 6.5)
h2 = σ(0.08x1 + 2.5x2 − 1.8)
p = σ(3.5h1 + 3h2 − 3.2)

The trained network can now score new data:

Today (new data): Temperature (x1) = 25, Weekend (x2) = 0 → Prediction (p) = 0.93 → Profitable!
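To make the forward pass and log loss calculations concrete, here is a minimal NumPy sketch (not the course's code) of the same 2-2-1 network using the final rounded weights from the slides; because the displayed weights are rounded, the printed values land close to, but not exactly on, the slide's numbers.

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Training data: temperature (x1), weekend (x2), profitable (y)
X = np.array([[14, 0], [18, 1], [22, 0], [22, 1], [26, 0],
              [26, 1], [30, 0], [35, 0], [15, 1]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1, 1, 1, 0], dtype=float)

# Final parameters (weights and biases) as shown on the slides
W_hidden = np.array([[0.30, 0.08],   # x1 -> h1, x1 -> h2
                     [0.05, 2.50]])  # x2 -> h1, x2 -> h2
b_hidden = np.array([-6.5, -1.8])    # h1, h2 biases
w_out = np.array([3.5, 3.0])         # h1 -> p, h2 -> p
b_out = -3.2                         # output bias

# 2. FORWARD PASS: weighted sum + sigmoid at every node
H = sigmoid(X @ W_hidden + b_hidden)   # hidden activations, shape (9, 2)
p = sigmoid(H @ w_out + b_out)         # predicted probability of profit, shape (9,)

# 3. CALCULATE LOSS: binary cross-entropy (log loss) averaged over the rows
log_loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

print(np.round(p, 2))       # roughly matches the slide's prediction column
print(round(log_loss, 3))   # close to the slide's reported loss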
Feedforward neural networks (FNNs): information flows from left to right, and FNNs are often a piece of other, more complex deep learning architectures (this is an important piece of the next section on Transformers!)

These are some of the most popular deep learning architectures and their layers:

Transformers
• Popularized in 2017
• Layers: raw text → embeddings layer → attention layer → FNN layer → prediction
• Applications: NLP, CV, ASR tasks, and much more! (More on these in the Transformers section up next!)
• Transformers have replaced RNNs and LSTMs for NLP tasks
DEEP LEARNING ARCHITECTURES
Modern NLP way of thinking: use a pretrained model (parameters are locked-in)
1. Pick a pretrained model that’s good for your problem (predict if company will be profitable)
2. Make predictions using this pretrained model
3. (Optional) Improve the predictions using transfer learning or fine-tuning
NOTE: Only research labs and large tech companies will train their own deep learning models from
scratch these days, while the majority of data scientists use or start with pretrained models
Most data scientists will use pretrained deep learning models for their analysis
• These pretrained models have already been trained on extremely large data sets, so all the parameters (weights, biases, etc.) are locked in
• Large Language Models (LLMs) are deep learning models that are pretrained on massive amounts of text data, including BERT and GPT (much more on this in the next section!)
• To use an LLM, you input your text, and then all the calculations (weighted sums, non-linear transformations, etc.) are applied to output a final prediction

Pretrained model only: download and use a pretrained model as is to make predictions
• Parameters are fixed
• Used for sentiment analysis, text summarization, etc.

Pretrained model embeddings: use a pretrained model's embeddings as inputs into traditional machine learning models
• Parameters are fixed
• Used for document similarity, document clustering, etc.

(These two options are covered in the Hugging Face Transformers section of this course)

Fine-tuning: start with a pretrained model and adjust the parameters by training on task / domain-specific data*
• Parameters are updated in final layers or all layers
• Used for text classification, industry-specific analysis, etc.

RAG: combine pretrained models with external databases to be more up-to-date and context-aware**
• Parameters may or may not be updated
• Used for question answering, fact checking, etc.

*Adjusting weights requires a large amount of data (at least tens of thousands of labeled data points) & computational power (more than a single computer, many GPUs)
**RAGs (Retrieval Augmented Generation) require building a structured retrieval database to hold at least tens of thousands of external text documents
Neural networks are ML models with input, hidden, and output layers
• They are sometimes called artificial neural networks (ANNs) or multilayer perceptrons (MLPs)
• At each node, the weighted sum of the inputs is calculated and a non-linear transformation is applied
• To train a neural network, start with random parameters and slowly adjust them until they become optimal
Deep learning refers to a neural network with three or more hidden layers
• DL is often used for more complex applications such as NLP, computer vision, speech recognition, etc.
Most data professionals use pretrained deep learning models for analysis
• Pretrained models (set parameters) are trained on millions of data points and perform well out-of-the-box
• AI/ML researchers will train models from scratch, and data scientists will sometimes fine-tune models for domain-specific data sets
In this section, we'll introduce transformers and their main layers, as well as pretrained deep learning models built specifically for NLP tasks: large language models (LLMs)
In this section, we’ll be covering the rest of these modern NLP concepts to
understand how LLMs work before applying them in Hugging Face:
1 Neural Networks & Deep Learning – we've covered these now! In increasing complexity: a) Logistic Regression, b) Neural Networks, c) Deep Learning
• Neural network components: layers, nodes, weights, parameters, activation functions
• Neural network training: forward pass, loss, backpropagation, gradient descent
• Deep learning architectures: FNN, CNN, RNN, LSTM, Transformers
Transformer vs. LLM
• Transformers can be used for many tasks, but are mainly used for NLP applications
• LLMs can be based on many deep learning architectures, but they are mainly based on transformers
• Transformer-based LLMs are the most popular DL approach to NLP tasks
The transformer architecture refers to the series of layers and computations that
the input data passes through to produce a final result
• Along the way, the input text is gradually transformed, hence the name transformers

Embeddings layer: uses vectors to represent the semantic meaning of words
Attention layer: adjusts their meanings based on context from surrounding words
FNN layer: learns patterns from the prior layers and adds complexity
The first layer of a transformer, the embeddings layer, converts text tokens into
meaningful numeric representations
• It places each token (word) into a high-dimensional space, so words with similar meanings end up close together, and words with different meanings are farther apart

Embeddings layer for "I love cold lemonade!":

token       dim1    dim2    …   dim768
I           0.16   -0.04   …    0.67
love       -0.21    0.59   …    0.33
cold        0.04   -0.14   …    0.89
lemonade   -0.11    0.35   …   -0.15
!           0.05   -0.03   …   -0.06

• Total dimensions vary, but 768 is a common length for LLMs
• Tokens include things like punctuation!
• [Plot: words such as Hot, Cold, Summer, and Winter positioned along two of the embedding dimensions (dim540 and dim143)]
• The vector (768 numbers) for each token represents its location in space – amazingly, these values have semantic meaning!
In the embedding layer alone, given a vocabulary size of 50k and 768 dimensions for each token, the
embedding matrix would have 38 million parameters! And that’s just the start of a transformer…
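As a rough illustration (not the course's code), the embedding layer behaves like a big lookup table with one row per vocabulary token; the token ids below are made up for the example sentence.

import numpy as np

vocab_size, dim = 50_000, 768
embedding_matrix = np.random.randn(vocab_size, dim)   # learned during pretraining
print(embedding_matrix.size)                          # 38,400,000 parameters

token_ids = [101, 2293, 3147, 18213, 999]             # hypothetical ids for "I love cold lemonade !"
token_vectors = embedding_matrix[token_ids]           # one 768-dimensional vector per token
print(token_vectors.shape)                            # (5, 768)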
The second layer of a transformer, the attention layer, adds context by helping
each token absorb additional meaning from other tokens
Without the attention layer, the word "lemonade" is in the same location in space for all three example sentences, even though it has a different meaning in each one. With the attention layer, its location shifts per sentence:
• In the first sentence, it will be in an exact location based on the word embedding
• In the second, it will be in a slightly different location, since it's specifically cold lemonade that's loved
• In the third, it will be in a very different location, since it's about an album, not a drink
"I love cold lemonade!"

Without the attention layer, the meaning of the word "lemonade" isn't affected by the other words in the sentence

With the attention layer, the word "cold" adds context to "lemonade"
• In technical terms: cold attends to lemonade
• In layman's terms: this isn't just any lemonade, it's a cold one

With the attention layer, the word "love" adds context to "lemonade"
• In technical terms: love attends to lemonade
• In layman's terms: this isn't just any lemonade, it's lemonade that's loved
• It does this by creating matrices for queries, keys, and attention scores

Queries: questions about other tokens
token       q1      q2     …   q768
cold        0.23   -0.10   …   0.16
lemonade    0.11    0.22   …   0.03

Keys: answers to those questions
token       k1      k2     …   k768
cold        0.42   -0.05   …   0.10   → "I'm somewhat loved"
lemonade    0.89    0.19   …   0.01   → "I'm loved the most"

NOTE: The queries and keys here are one of many query-key pairs in a transformer. Other queries about love could be "what is expressing the love?", "what kind of love is being expressed?", etc.
Attention scores: summary of query-key relationships

            I     love   cold   lemonade   !
I           0.2   0.6    0      0          0.2
love        0.1   0.1    0.3    0.4        0.1
cold        0.1   0.1    0.2    0.5        0.1
lemonade    0.3   0.1    0.4    0.1        0.1
!           0.3   0.1    0.1    0          0.5

• The "love" is mostly for "lemonade", and somewhat for "cold"
• "Cold" mostly describes "lemonade"
How are these values generated?
• Like embeddings, the query and key values are randomly initialized and slowly updated until they reach their final values

What are the additional calculations?
• To capture query-key similarity, a dot product (similarity score) is taken
• For attention scores to add up to 1, a softmax normalization function is applied

Like how the embeddings layer amazingly captured word meaning, this attention layer amazingly captures how much each token attends to, or gives context to, other tokens
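Here is a toy NumPy sketch of those two calculations, using random stand-ins for the learned query and key values (real transformers learn these during pretraining).

import numpy as np

tokens = ["I", "love", "cold", "lemonade", "!"]
dim = 768
rng = np.random.default_rng(0)
Q = rng.normal(size=(len(tokens), dim))   # queries: questions about other tokens
K = rng.normal(size=(len(tokens), dim))   # keys: answers to those questions

# Dot product similarity scores (scaled), then softmax so each row adds up to 1
scores = Q @ K.T / np.sqrt(dim)
attention = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)

print(np.round(attention, 2))   # a 5x5 table of attention scores, like the one above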
The third layer of a transformer, the feedforward neural network (FNN) layer,
learns patterns from the data and adds complexity to the model
• Embedding layer: words are placed in locations in space that hold some meaning
• Attention layer: the locations are adjusted based on context from surrounding words
• FNN layer: patterns in those contextual relationships are learned and captured

Typical FNN in a transformer: input layer → hidden layer → output layer
The output is a transformed representation of the original tokens with refined meanings

Attention Is All You Need paper: https://arxiv.org/pdf/1706.03762
ENCODERS & DECODERS
Encoder-Only Models
• Only use the left side of the architecture, aka the encoder
• The encoder takes raw text and encodes it as an embedding representation of the text
• In short, it understands text
• While encoders embed text, they can be fine-tuned for specific tasks like sentiment analysis, where an extra classification step is added to get from embedding to output

Application: Sentiment Analysis
• Input: "I love cold lemonade!"
• Output: Positive
Decoder-Only Models
• Only use the right side of the architecture, aka the decoder
• The decoder takes an input text sequence and infers* the next word
• In short, it generates text

Application: Text Generation
• Input: "I love cold lemonade!"
• Output: "It's the perfect drink!"

*With transformers & LLMs, the word inference is typically used instead of the word prediction
ENCODERS & DECODERS
Encoder-Only: turns text into embeddings
• Popular models: BERT – Bidirectional Encoder Representations from Transformers

Decoder-Only: infers the next token in a text
• Popular models: GPT – Generative Pre-trained Transformer

Encoder-Decoder: turns text into other text
• Popular models: T5 – Text-to-Text Transfer Transformer; BART – combines BERT and GPT
In this section, we’ll introduce the Hugging Face Transformers library in Python and
walk through examples of how you can use pretrained models to perform NLP tasks
Hugging Face is the company that created the Transformers Python library, making it easy for data professionals to access and utilize pretrained LLMs
• They also host the Model Hub, which contains 1M+ pretrained, open-source models (in addition to base models, there are variants, fine-tuned models, experimental models, etc.)
We'll be using this Hugging Face workflow in Python for multiple applications across the three main LLM categories, as well as embeddings:

1 Determine your goal (sentiment analysis, summarization, generation, etc.)
   • Encoder-Only: Sentiment Analysis, Named Entity Recognition
   • Decoder-Only: Text Generation
   • Encoder-Decoder: Zero-Shot Classification, Text Summarization
   • Embedding: Document Similarity
2 Identify a pretrained model from Hugging Face's Model Hub (sort by popularity!)
3 Specify your input data (a single string, a Series or column of text data, etc.)
4 Apply the pretrained model on your input data and view the outputs

After using a pretrained model, you have the optional step of improving your results using transfer learning, fine-tuning, RAGs, and more (reference the Pretrained Deep Learning Models lesson), but those often require large labeled data sets and additional processing power
Transformer pipeline steps:
1. Import the pipeline module
2. Specify the task: sentiment-analysis
3. Choose the default model
4. Specify we're only using our CPU

The output includes the predicted sentiment, positive (vs. negative), and the model's confidence in its prediction (from 0 to 1) – this is very much a positive sentence
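A minimal sketch of those four pipeline steps (the input sentence is just an example):

from transformers import pipeline

# Task: sentiment-analysis, default model, CPU only (device=-1)
classifier = pipeline("sentiment-analysis", device=-1)

result = classifier("I love cold lemonade!")
print(result)   # e.g. [{'label': 'POSITIVE', 'score': 0.99...}]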
NOTE: While not as good as using a GPU, you can try some of these techniques to speed up your code if you only have a CPU available
NEW MESSAGE (May 28, 2025)
From: Oscar Wynn (The Movie Maven)
Subject: RE: Feel good vs dark movies

Regarding my earlier message, can you do this using Hugging Face & LLMs instead of VADER & rules, and compare the results? Thank you!
---
We're publishing an article on the top 10 most feel-good movies and the top 10 darkest movies according to data. Could you use sentiment analysis to help us come up with movies for these two lists?
Thanks!
Oscar

movie_reviews_sentiment.csv

Key Objectives
1. Create a new "nlp_transformers" environment
2. Launch Jupyter Notebook
3. Read in the movie reviews data set including the VADER sentiment scores
4. Apply sentiment analysis to the "movie_info" column using transformers
5. Compare the transformers sentiment scores with the VADER sentiment scores
Named Entity Recognition (NER) is used to find and label important information
(people, places, organizations, dates, etc.) in text
• The default LLM for NER is BERT (encoder-only)
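A minimal sketch of an NER pipeline call; the sentence is an example, and aggregation_strategy="simple" is an optional setting that groups word pieces into whole entities.

from transformers import pipeline

ner = pipeline("ner", aggregation_strategy="simple", device=-1)

entities = ner("Harry Potter lives with the Dursleys in Little Whinging.")
for e in entities:
    print(e["entity_group"], e["word"], round(e["score"], 2))   # e.g. PER Harry Potter 0.99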
NEW MESSAGE (May 29, 2025)
From: Lexi Con (Lead Data Scientist)
Subject: Book characters

Hi!
It's been a while.
Our client would like a rough list of characters from our book collection.
Could you use NER to extract the named entities from the book descriptions, and then filter on only people?
Thanks so much!
Lexi

childrens_books.csv

Key Objectives
1. Read in the children's books data set
2. Apply NER to the Description column
3. Create a list of all named entities
4. Only include the people (PER)
5. Extra credit: Exclude the authors as well
Same pipeline steps, with zero-shot-classification as the task

"This is a quote!"

You provide the label options and the model returns scores that classify it into one of those labels (adding up to 1)
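A minimal sketch of a zero-shot classification call (the description text and candidate labels are examples):

from transformers import pipeline

classifier = pipeline("zero-shot-classification", device=-1)

result = classifier(
    "A young wizard discovers a hidden world of magic and dark secrets.",
    candidate_labels=["Adventure & Fantasy", "Animals & Nature", "Mystery", "Humor", "Non-Fiction"],
)
print(result["labels"][0], round(result["scores"][0], 2))   # top label and its score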
Use things like domain expertise, EDA, and topic modeling to come up with relevant labels
NEW MESSAGE (May 30, 2025)
From: Lexi Con (Lead Data Scientist)
Subject: Book categories

Hello,
Our client would like to divide their book list into five shelves at their physical bookstore. Could you label all the books as one of these categories?
• Adventure & Fantasy
• Animals & Nature
• Mystery
• Humor
• Non-Fiction
Thanks!
Lexi

Key Objectives
1. Apply zero-shot classification to the Description column
2. Find the number of books in each category and check a few to see if the results make sense
Same pipeline steps, with summarization as the task
• Beyond specifying the min and max length of the summarized text, you can set the do_sample parameter to False to use the most likely next word (default) or to True to use a more random and creative next word
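A minimal sketch of a summarization call using the length and do_sample parameters described above (the input text is an example):

from transformers import pipeline

summarizer = pipeline("summarization", device=-1)

text = ("Harry Potter has never even heard of Hogwarts when the letters start "
        "dropping on the doormat at number four, Privet Drive.")
summary = summarizer(text, min_length=5, max_length=20, do_sample=False)
print(summary[0]["summary_text"])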
EXAMPLE: Use text summarization to reduce text size before sentiment analysis
• None of these 3 summaries perfectly capture the sentiment
NEW MESSAGE (May 31, 2025)
From: Lexi Con (Lead Data Scientist)
Subject: Book summaries

Hello,
Our client would like a short one-liner for each book. Could you use text summarization to summarize the descriptions?
Thanks!
Lexi

Key Objectives
1. Apply text summarization to the Description column
2. Review the results to see if they make sense
• The do_sample parameter allows you to get more random and creative next words
• Text generation is mostly used for creating applications, and better models like GPT-3 and 4 require using an API with an OpenAI account and credits
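A minimal sketch of a text generation call; the pipeline typically defaults to a small GPT-2 model, and the prompt and max_new_tokens value are examples.

from transformers import pipeline

generator = pipeline("text-generation", device=-1)

output = generator("I love cold lemonade!", max_new_tokens=20, do_sample=True)
print(output[0]["generated_text"])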
Feature extraction is the idea of using embeddings from (typically) the last layer of a pretrained transformer model and inputting them into downstream ML / analysis tasks
• Now that the sentence has been vectorized, you can apply EDA, clustering, classification, etc.
• MiniLM uses 384 dimensions for embeddings, compared to the 768 dimensions from BERT
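A minimal sketch of feature extraction plus cosine similarity; the MiniLM checkpoint name and the mean-pooling step are illustrative choices, not necessarily the exact ones used in the course.

import numpy as np
from transformers import pipeline
from sklearn.metrics.pairwise import cosine_similarity

extractor = pipeline("feature-extraction",
                     model="sentence-transformers/all-MiniLM-L6-v2", device=-1)

def embed(text):
    token_vectors = np.array(extractor(text)[0])   # (num_tokens, 384) token embeddings
    return token_vectors.mean(axis=0)              # mean-pool into one sentence vector

books = ["A young wizard attends a school of magic.",
         "A girl and her dog go on a woodland adventure.",
         "A guide to saving and investing your money."]
vectors = np.array([embed(b) for b in books])

print(cosine_similarity([vectors[0]], vectors[1:]))   # similarity of the first book to the others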
Cosine similarity measures the angle between two vectors in the embedding space
[Plot: fruits (Peach, Banana, Mango, Lime) plotted against Sugar and Vitamin C axes]

cos 60° = 0.5 → Mangos and limes are not very similar
cos 43° = 0.73 → Peaches are more similar to mangos than limes
cos 9° = 0.98 → Bananas are the most similar fruit to mangos!
Cosine similarity:
• It can handle high dimensions
• It works well on sparse data (data containing many 0 values)
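For reference, here is a tiny sketch of the cosine similarity formula itself: the dot product of two vectors divided by the product of their lengths. The fruit vectors are made-up 2-D examples that mirror the sugar / vitamin C picture.

import numpy as np

def cosine_similarity(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

mango = np.array([8.0, 3.0])   # hypothetical (sugar, vitamin C) values
lime = np.array([2.0, 9.0])
print(round(cosine_similarity(mango, lime), 2))   # small angle -> near 1, large angle -> near 0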
These are the movies that are most similar to Captain Marvel, based on their movie descriptions
NEW MESSAGE (June 1, 2025)
From: Lexi Con (Lead Data Scientist)
Subject: Book recommendations

Hello,
I have one final request for you.
Our client is a big fan of the first Harry Potter book, Harry Potter and the Sorcerer's Stone.
What other books would you recommend for them using document similarity with LLM embeddings?
Thanks for all your help over the past few weeks!
Lexi

Key Objectives
1. Turn the Description column into embeddings using feature extraction
2. Compare the cosine similarity of Harry Potter and the Sorcerer's Stone to all the other books
3. Return the top 5 most similar books
Hugging Face’s Model Hub contains many NLP tasks to choose from
• The transformers library will provide a default model for various tasks, but you can swap out models
• By filtering on tasks and sorting on downloads, you can find alternative models to test out
• Encoder-Only LLM (BERT): Sentiment Analysis, Named Entity Recognition (NER)
• Encoder-Decoder LLM (BART): Zero-Shot Classification, Text Summarization
• Decoder-Only LLM (GPT): Text Generation, Machine Translation, Question Answering
  (these cannot be done with traditional techniques, so use modern techniques)
If you enjoyed the technical aspects of this course and want to learn more:
Key takeaways from someone who has been a data scientist for a decade:

Modern NLP is a huge mindset shift from traditional data science
• With traditional data science, the "danger zone" lies in not understanding everything
• With modern NLP, it's impossible to comprehend everything