
Module 5: Natural Language Processing

Word Clouds, n-Gram Language Models, Grammars, An Aside: Gibbs Sampling, Topic Modeling,
Word Vectors, Recurrent Neural Networks, Example: Using a Character-Level RNN, Network
Analysis, Betweenness Centrality, Eigenvector Centrality, Directed Graphs and PageRank,
Recommender Systems, Manual Curation, Recommending What’s Popular, User-Based
Collaborative Filtering, Item-Based Collaborative Filtering, Matrix Factorization.
Textbook: Chapters 21, 22, and 23
Natural Language Processing
• Natural language processing (NLP) refers to computational techniques involving
language.
• Natural Language Processing (NLP) is a field of artificial intelligence (AI) focused
on enabling computers to understand, interpret, and generate human language in a
way that is meaningful and useful.
• NLP combines linguistics with computer science and machine learning to process
and analyze large amounts of natural language data, such as text or speech.
Word Clouds
• One approach to visualizing words and counts is word clouds, which artistically depict
the words at sizes proportional to their counts.
• Generally, though, data scientists don’t think much of word clouds, in large part
  because the placement of the words doesn’t mean anything other than “here’s some
  space where I was able to fit a word.”
• If you are ever forced to create a word cloud, think about whether you can make the
  axes convey something.
• For example, imagine that, for each of some collection of data science–related
buzzwords, you have two numbers between 0 and 100—the first representing how
frequently it appears in job postings, and the second how frequently it appears on
résumés:
data = [ ("big data", 100, 15), ("Hadoop", 95, 25), ("Python", 75, 50),
("R", 50, 40), ("machine learning", 80, 20), ("statistics", 20, 60),
("data science", 60, 70), ("analytics", 90, 3),
("team player", 85, 85), ("dynamic", 2, 90), ("synergies", 70, 0),
("actionable insights", 40, 30), ("think out of the box", 45, 10),
("self-starter", 30, 50), ("customer focus", 65, 15),
("thought leadership", 35, 35)]
The word cloud approach is just to arrange the words on a page in a cool-looking font (Figure 1).
A more interesting approach is to scatter the words so that horizontal position indicates
job-posting popularity and vertical position indicates résumé popularity:

from matplotlib import pyplot as plt

def text_size(total: int) -> float:
    """equals 8 if total is 0, 28 if total is 200"""
    return 8 + total / 200 * 20

for word, job_popularity, resume_popularity in data:
    plt.text(job_popularity, resume_popularity, word,
             ha='center', va='center',
             size=text_size(job_popularity + resume_popularity))

plt.xlabel("Popularity on Job Postings")
plt.ylabel("Popularity on Resumes")
plt.axis([0, 100, 0, 100])
plt.xticks([])
plt.yticks([])
plt.show()
n-Gram Language Models
• An n-gram language model is a statistical model used in natural language processing to
  predict the probability of a word based on the previous (n-1) words.
• These models are crucial in various NLP tasks such as text generation, speech
  recognition, machine translation, and more.
• The model operates by estimating the probability of a word given the preceding words,
  typically considering a fixed number (n-1) of previous words to make this prediction.
• An n-gram is a contiguous sequence of n items from a given sample of text or speech.
• n-grams can be classified as:
 Unigram – a single word, e.g., “I”
 Bigram – a sequence of two words, e.g., “I am”
 Trigram – a sequence of three words, e.g., “I am happy”
 N-gram – a sequence of n words
Structure of n-gram models
 Probability Estimation:
• The primary goal of an n-gram model is to estimate the probability of a word given its
  preceding words.
• For an n-gram model, the probability of the next word w_i given the previous (n-1)
  words w_{i-n+1}, …, w_{i-1} is
        P(w_i | w_{i-n+1}, …, w_{i-1})
 Markov Assumption
• The model is based on the Markov assumption, which simplifies the problem by
  assuming that the probability of a word depends only on a fixed number of
  preceding words, not on the entire sentence.
• This is often expressed as:
        P(w_i | w_1, w_2, …, w_{i-1}) ≈ P(w_i | w_{i-n+1}, …, w_{i-1})
 Chain rule
• Using the chain rule, the joint probability of a sequence of words can be decomposed as
        P(w_1, w_2, …, w_N) = P(w_1) · P(w_2 | w_1) · P(w_3 | w_1, w_2) ⋯ P(w_N | w_1, w_2, …, w_{N-1})
• For an n-gram model this is approximated as
        P(w_1, w_2, …, w_N) ≈ P(w_1) · P(w_2 | w_1) ⋯ P(w_N | w_{N-n+1}, …, w_{N-1})
 Training an n-gram model
• Training involves counting the occurrences of n-grams in a text corpus and using these
  counts to estimate probabilities.
• For example, in a bigram model, the probability of a word w_i following a word w_{i-1}
  is estimated as:
        P(w_i | w_{i-1}) = Count(w_{i-1}, w_i) / Count(w_{i-1})
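As a concrete sketch of this counting approach (the toy corpus, token list, and helper names
below are this example's own, not from the textbook):

import random
from collections import defaultdict

corpus = "I am happy . I am learning data science . data science is fun .".split()

# Count bigrams: transitions[w] maps each word to the list of words that follow it.
transitions = defaultdict(list)
for prev, word in zip(corpus, corpus[1:]):
    transitions[prev].append(word)

def bigram_probability(prev: str, word: str) -> float:
    """P(word | prev) = Count(prev, word) / Count(prev)"""
    followers = transitions[prev]
    return followers.count(word) / len(followers) if followers else 0.0

def generate(start: str = "I", length: int = 8) -> str:
    """Generate text by repeatedly sampling the next word from the bigram counts."""
    words = [start]
    for _ in range(length):
        followers = transitions[words[-1]]
        if not followers:
            break
        words.append(random.choice(followers))
    return " ".join(words)

print(bigram_probability("data", "science"))   # 1.0 in this toy corpus
print(generate())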
Grammars
• A different approach to modeling language is with grammars, rules for generating
acceptable sentences.
• For example, a sentence necessarily consists of a noun followed by a verb. If you then
have a list of nouns and verbs, you can generate sentences according to the rule.

from typing import List, Dict

# Type alias to refer to grammars later
Grammar = Dict[str, List[str]]

grammar = {
    "_S"  : ["_NP _VP"],
    "_NP" : ["_N",
             "_A _NP _P _A _N"],
    "_VP" : ["_V",
             "_V _NP"],
    "_N"  : ["data science", "Python", "regression"],
    "_A"  : ["big", "linear", "logistic"],
    "_P"  : ["about", "near"],
    "_V"  : ["learns", "trains", "tests", "is"]
}
• We use the common convention that names starting with underscores refer to rules that need
  further expanding, and that other names are terminals that don’t need further processing.
• So, for example, "_S" is the “sentence” rule, which produces an "_NP" (“noun phrase”)
rule followed by a "_VP" (“verb phrase”) rule.
• Notice that the "_NP" rule contains itself in one of its productions. Grammars can be
recursive, which allows even finite grammars like this to generate infinitely many
different sentences.
• To generate sentences from this grammar, we start with a list containing the sentence
rule ["_S"]. And then we’ll repeatedly expand each rule by replacing it with a randomly
chosen one of its productions.
• We stop when we have a list consisting solely of terminals.
For example, one such progression might look like:
['_S']
['_NP','_VP']
['_N','_VP']
['Python','_VP']
['Python','_V','_NP']
['Python','trains','_NP']
['Python','trains','_A','_NP','_P','_A','_N']
['Python','trains','logistic','_NP','_P','_A','_N']
['Python','trains','logistic','_N','_P','_A','_N']
['Python','trains','logistic','data science','_P','_A','_N']
['Python','trains','logistic','data science','about','_A', '_N']
['Python','trains','logistic','data science','about','logistic','_N']
['Python','trains','logistic','data science','about','logistic','Python']
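Below is a minimal sketch of that expansion procedure in Python, assuming the Grammar alias
and grammar dictionary defined earlier (the helper names is_terminal, expand, and
generate_sentence are this sketch's own):

import random
from typing import List

def is_terminal(token: str) -> bool:
    return token[0] != "_"

def expand(grammar: Grammar, tokens: List[str]) -> List[str]:
    for i, token in enumerate(tokens):
        if is_terminal(token):
            continue
        # Replace the first non-terminal with a randomly chosen production.
        replacement = random.choice(grammar[token])
        if is_terminal(replacement):
            tokens[i] = replacement
        else:
            tokens = tokens[:i] + replacement.split() + tokens[i+1:]
        return expand(grammar, tokens)
    # Every token is a terminal, so we're done.
    return tokens

def generate_sentence(grammar: Grammar) -> List[str]:
    return expand(grammar, ["_S"])

print(" ".join(generate_sentence(grammar)))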
An Aside: Gibbs Sampling
• Generating samples from some distributions is easy.
• We can get uniform random variables with random.random(), and normal random variables
  with inverse_normal_cdf(random.random()).
• But some distributions are harder to sample from.
• Gibbs sampling is a technique for generating samples from multidimensional distributions
when we only know some of the conditional distributions.
• For example, imagine rolling two dice. Let x be the value of the first die and y be the sum of
the dice, and imagine you wanted to generate lots of (x, y) pairs. In this case it’s easy to
generate the samples directly:
from typing import Tuple
import random

def roll_a_die() -> int:
    """Returns a random integer between 1 and 6."""
    return random.choice([1, 2, 3, 4, 5, 6])

def direct_sample() -> Tuple[int, int]:
    d1 = roll_a_die()
    d2 = roll_a_die()
    return d1, d1 + d2

result = direct_sample()
print(result)   # might output something like (3, 8), where 3 is d1 and 8 is d1 + d2
• The distribution of y conditional on x is easy—if you know the value of x, y is equally
  likely to be x + 1, x + 2, x + 3, x + 4, x + 5, or x + 6:

def random_y_given_x(x: int) -> int:
    """Returns an integer equally likely to be x + 1, x + 2, ..., x + 6"""
    return x + roll_a_die()

result = random_y_given_x(10)
print(result)   # might output any value between 11 and 16, each equally likely
• The way Gibbs sampling works is that we start with any (valid) values for x and y and then
repeatedly alternate replacing x with a random value picked conditional on y and replacing
y with a random value picked conditional on x.
• After a number of iterations, the resulting values of x and y will represent a sample from
the unconditional joint distribution:
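A minimal sketch of that alternation for the two-dice example, reusing roll_a_die and
random_y_given_x from above (random_x_given_y and gibbs_sample are this sketch's names):

def random_x_given_y(y: int) -> int:
    """Given the total y, return an equally likely value for the first die."""
    if y <= 7:
        # total of 7 or less: the first die is equally likely to be 1, ..., y - 1
        return random.randrange(1, y)
    else:
        # total of 8 or more: the first die is equally likely to be y - 6, ..., 6
        return random.randrange(y - 6, 7)

def gibbs_sample(num_iters: int = 100) -> Tuple[int, int]:
    x, y = 1, 2   # any valid starting values work
    for _ in range(num_iters):
        x = random_x_given_y(y)
        y = random_y_given_x(x)
    return x, y

print(gibbs_sample())   # approximately a draw from the joint distribution of (x, y)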
Topic Modeling
• Topic modeling is an unsupervised machine learning technique used to automatically
identify topics within a large collection of text documents.
• By grouping words into clusters or "topics," topic modeling helps uncover the hidden
thematic structure in the data, allowing for a high-level view of what the documents are
about.
• Topic modeling provides insights into large text corpora by extracting structured,
interpretable themes that reveal latent information, making it widely valuable in areas
like research, marketing, content recommendation, and social media analysis.
• Some of the techniques used in topic modelling are Latent Dirichlet Allocation (LDA),
Non-Negative Matrix Factorization (NMF), Latent Semantic Analysis (LSA) and
BERTopic.
• Latent Dirichlet Allocation (LDA) is commonly used to identify common topics in a set
of documents.
• LDA has some similarities to the Naive Bayes classifier, in that it assumes a
probabilistic model for documents.
The model assumes the following
• There is some fixed number K of topics.
• There is a random variable that assigns each topic an associated probability distribution
over words. You should think of this distribution as the probability of seeing word w
given topic k.
• There is another random variable that assigns each document a probability distribution
over topics. You should think of this distribution as the mixture of topics in document d.
• Each word in a document was generated by first randomly picking a topic (from the
document’s distribution of topics) and then randomly picking a word (from the topic’s
distribution of words).
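For illustration only, here is a minimal sketch using scikit-learn's LatentDirichletAllocation
on a made-up toy corpus (an assumption: scikit-learn is available; the textbook instead builds
an LDA Gibbs sampler from scratch):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

documents = [
    "Hadoop Big Data HBase Java Spark",
    "Python scikit-learn scipy numpy pandas",
    "statistics probability regression R",
    "machine learning neural networks deep learning",
]

# Step 1: turn the documents into word-count vectors.
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(documents)

# Step 2: fit an LDA model with a fixed number K of topics.
K = 2
lda = LatentDirichletAllocation(n_components=K, random_state=0)
doc_topics = lda.fit_transform(counts)   # per-document topic mixtures

# Step 3: inspect each topic's most probable words.
words = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = topic.argsort()[-3:][::-1]
    print(f"topic {k}:", [words[i] for i in top])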
Word vectors
• One important innovation involves representing words as low-dimensional vectors.
These vectors can be compared, added together, fed into machine learning models,
or anything else you want to do with them.
• Word vectors, or word embeddings, are a way to represent words in a continuous
vector space, typically used in natural language processing (NLP) to capture
semantic and syntactic relationships between words.
• Instead of treating words as discrete entities, word vectors encode them in a form
that allows similarity in meaning to be reflected by closeness in vector space.
A typical recipe for learning word vectors:
1. Get a bunch of text.
2. Create a dataset where the goal is to predict a word given nearby words (or,
   alternatively, to predict nearby words given a word).
3. Train a neural net to do well on this task.
4. Take the internal states of the trained neural net as the word vectors.
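For illustration, a minimal sketch of those four steps using the gensim library's Word2Vec
(an assumption: gensim is installed; the tiny corpus below is made up for the example):

from gensim.models import Word2Vec

corpus = [["data", "science", "uses", "python"],
          ["python", "is", "great", "for", "machine", "learning"],
          ["machine", "learning", "needs", "data"]]

# Train a small skip-gram model; its hidden weights become the word vectors.
model = Word2Vec(sentences=corpus, vector_size=50, window=2,
                 min_count=1, sg=1, epochs=100, seed=0)

vector = model.wv["python"]              # the learned 50-dimensional word vector
print(model.wv.most_similar("python"))   # nearest words by cosine similarity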
Recurrent Neural Network
• A Recurrent Neural Network (RNN) is a type of artificial neural network designed for
sequential data processing.
• Unlike feedforward neural networks, RNNs have connections that form directed cycles,
enabling them to maintain an internal "memory" of previous inputs, making them well-
suited for tasks involving time-series data or sequential dependencies.
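A minimal sketch of the recurrence that gives an RNN its memory: at each time step the new
hidden state is computed from the current input and the previous hidden state (the weight
names and toy dimensions below are this sketch's own, assuming a vanilla tanh RNN cell):

import numpy as np

def rnn_step(x: np.ndarray, h_prev: np.ndarray,
             W_xh: np.ndarray, W_hh: np.ndarray, b_h: np.ndarray) -> np.ndarray:
    """One step of a vanilla RNN: h_t = tanh(x_t @ W_xh + h_{t-1} @ W_hh + b_h)."""
    return np.tanh(x @ W_xh + h_prev @ W_hh + b_h)

# Run a toy sequence through the cell, carrying the hidden state forward.
rng = np.random.default_rng(0)
input_dim, hidden_dim = 4, 3
W_xh = rng.normal(size=(input_dim, hidden_dim))
W_hh = rng.normal(size=(hidden_dim, hidden_dim))
b_h = np.zeros(hidden_dim)

h = np.zeros(hidden_dim)
for x in rng.normal(size=(5, input_dim)):   # a sequence of 5 input vectors
    h = rnn_step(x, h, W_xh, W_hh, b_h)
print(h)   # the final hidden state summarizes the whole sequence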
• Variants of RNNs:
 Long Short-Term Memory (LSTM): introduces gates (input, forget, and output
   gates) to control the flow of information, enabling the network to retain or discard
   information over long time periods.
 Gated Recurrent Unit (GRU): similar to LSTM but with fewer parameters,
   combining the forget and input gates into a single update gate.
Challenges with Basic RNNs:
• Vanishing and Exploding Gradients: During backpropagation through time (BPTT),
gradients can become very small (vanish) or very large (explode), making it hard to train
RNNs effectively over long sequences.
• Long-Term Dependencies: Standard RNNs struggle to capture long-range dependencies
in sequential data.

Applications of RNNs:
• Natural Language Processing (NLP): sentiment analysis, machine translation, text
  generation, and speech recognition.
• Time-Series Analysis: stock price prediction, weather forecasting, and signal processing.
• Sequential Data Modeling: video frame analysis and music generation.
Betweenness Centrality
• Betweenness Centrality is a metric used in network analysis to measure the importance
of a node (or edge) in a graph based on its role in connecting other nodes.
• It quantifies how often a node appears on the shortest paths between pairs of nodes in a
network.
# User here is a simple (id, name) record from the book's network example
users = [User(0, "Hero"), User(1, "Dunn"), User(2, "Sue"), User(3, "Chi"),
         User(4, "Thor"), User(5, "Clive"), User(6, "Hicks"),
         User(7, "Devin"), User(8, "Kate"), User(9, "Klein")]

and friendships:

friend_pairs = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4),
                (4, 5), (5, 6), (5, 7), (6, 8), (7, 8), (8, 9)]
• The betweenness centrality of node i is computed by adding up, for every other pair of
nodes j and k, the proportion of shortest paths between node j and node k that pass
through i.

For this network, users 0 and 9 have betweenness centrality 0 (as neither lies on any shortest
path between other users), whereas users 3, 4, and 5 all have high centralities (as all three
lie on many shortest paths).
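For illustration, a short sketch that computes these values with the networkx library instead
of from scratch (an assumption: networkx is installed):

import networkx as nx

friend_pairs = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4),
                (4, 5), (5, 6), (5, 7), (6, 8), (7, 8), (8, 9)]

G = nx.Graph()
G.add_edges_from(friend_pairs)

betweenness = nx.betweenness_centrality(G)   # node -> normalized betweenness
print(betweenness[0], betweenness[4])        # user 0 scores low, user 4 scores high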
• Closeness Centrality is a metric in network analysis that quantifies how “close” a node is
  to all other nodes in a network.
• It measures the average shortest distance from a given node to every other node,
  emphasizing how quickly information or resources can spread from that node throughout
  the network.
Eigenvector Centrality
• Eigenvector Centrality is a network analysis metric that measures the influence of a
node in a graph based on the importance of its neighbors.
• Unlike simpler measures like degree centrality, which counts the number of direct
connections, eigenvector centrality assigns higher scores to nodes that are connected
to other highly central nodes.
Characteristics
• Eigenvector centrality is a recursive metric, meaning a node's importance depends on
the importance of its neighbors.
• It can handle directed and undirected graphs.
• The computation requires finding the eigenvector of the adjacency matrix, which is
computationally intensive for large graphs.
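A minimal power-iteration sketch of this idea on a small hypothetical adjacency matrix (the
4-node graph below is illustrative, not the users network above):

import numpy as np

def eigenvector_centrality(adjacency: np.ndarray, num_iters: int = 100) -> np.ndarray:
    """Power iteration: repeatedly multiply by the adjacency matrix and renormalize."""
    n = adjacency.shape[0]
    v = np.ones(n) / n
    for _ in range(num_iters):
        v = adjacency @ v
        v = v / np.linalg.norm(v)
    return v

# Hypothetical graph: node 0 is connected to everyone, node 3 only to node 0.
A = np.array([[0, 1, 1, 1],
              [1, 0, 1, 0],
              [1, 1, 0, 0],
              [1, 0, 0, 0]], dtype=float)
print(eigenvector_centrality(A))   # node 0 gets the highest score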
Comparison of Centrality Measures

| Centrality Measure     | What It Measures                         | Strengths                                   | Limitations                                  |
|------------------------|------------------------------------------|---------------------------------------------|----------------------------------------------|
| Degree Centrality      | Number of direct connections             | Simple and intuitive                        | Ignores indirect influence                   |
| Closeness Centrality   | Proximity to all other nodes             | Captures overall accessibility              | Sensitive to disconnected graphs             |
| Betweenness Centrality | Role in shortest paths                   | Highlights critical intermediaries          | Computationally expensive for large networks |
| Eigenvector Centrality | Influence based on neighbors' importance | Captures global importance                  | Favors nodes in dense areas                  |
| Katz Centrality        | Influence with baseline adjustment       | Combines influence and inherent importance  | Requires fine-tuning parameters              |
| PageRank               | Influence with random walks              | Effective for directed networks             | Sensitive to graph topology and parameters   |
Directed Graphs and PageRank
• PageRank is an algorithm that assigns importance scores to nodes in a directed graph, such
as web pages, based on their connectivity. It was originally developed by Larry Page and
Sergey Brin, the founders of Google, to rank websites in search results.
• The idea is to rank websites based on which other websites link to them, which other
  websites link to those, and so on.

endorsements = [(0, 1), (1, 0), (0, 2), (2, 0), (1, 2),
                (2, 1), (1, 3), (2, 3), (3, 4), (5, 4),
                (5, 6), (7, 5), (6, 8), (8, 7), (8, 9)]
A simplified version of PageRank works as follows:
1. There is a total of 1.0 (or 100%) PageRank in the network.
2. Initially this PageRank is equally distributed among nodes.
3. At each step, a large fraction of each node’s PageRank is distributed evenly among
   its outgoing links.
4. At each step, the remainder of each node’s PageRank is distributed evenly among
   all nodes.
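A minimal sketch of those four steps, assuming the endorsements list above and nodes numbered
0 through 9 (the damping factor of 0.85 is a conventional default, not something fixed by the
slides):

from collections import defaultdict
from typing import Dict, List, Tuple

def page_rank(nodes: List[int],
              edges: List[Tuple[int, int]],
              damping: float = 0.85,
              num_iters: int = 100) -> Dict[int, float]:
    outgoing = defaultdict(list)
    for source, target in edges:
        outgoing[source].append(target)

    n = len(nodes)
    pr = {node: 1 / n for node in nodes}   # step 2: spread the total of 1.0 evenly
    base = (1 - damping) / n               # step 4: the evenly shared remainder

    for _ in range(num_iters):
        next_pr = {node: base for node in nodes}
        for source in nodes:
            links = outgoing[source]
            if links:                      # step 3: share most rank over outgoing links
                for target in links:
                    next_pr[target] += damping * pr[source] / len(links)
            else:                          # node with no out-links: share over everyone
                for node in nodes:
                    next_pr[node] += damping * pr[source] / n
        pr = next_pr
    return pr

ranks = page_rank(list(range(10)), endorsements)
print(sorted(ranks.items(), key=lambda kv: -kv[1])[:3])   # highest-ranked nodes first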
Recommender Systems
• Another common data problem is producing recommendations of some sort.
• Netflix recommends movies you might want to watch.
• Amazon recommends products you might want to buy.
• Twitter recommends users you might want to follow.
users_interests = [
    ["Hadoop", "Big Data", "HBase", "Java", "Spark", "Storm", "Cassandra"],
    ["NoSQL", "MongoDB", "Cassandra", "HBase", "Postgres"],
    ["Python", "scikit-learn", "scipy", "numpy", "statsmodels", "pandas"],
    ["R", "Python", "statistics", "regression", "probability"],
    ["machine learning", "regression", "decision trees", "libsvm"],
    ["Python", "R", "Java", "C++", "Haskell", "programming languages"],
    ["statistics", "probability", "mathematics", "theory"],
    ["machine learning", "scikit-learn", "Mahout", "neural networks"],
    ["neural networks", "deep learning", "Big Data", "artificial intelligence"],
    ["Hadoop", "Java", "MapReduce", "Big Data"],
    ["statistics", "R", "statsmodels"],
    ["C++", "deep learning", "artificial intelligence", "probability"],
    ["pandas", "R", "Python"],
    ["databases", "HBase", "Postgres", "MySQL", "MongoDB"],
]
• A Recommendation System is a machine learning model designed to predict user
preferences and suggest relevant items, such as movies, books, or products.
• These systems are widely used in e-commerce, streaming services, and social
platforms. There are several approaches to building recommendation systems:
• Types of Recommendation Systems
1. Collaborative Filtering
2. Content-Based Filtering
3. Hybrid Systems
4. Matrix Factorization (Latent Factor Models)
5. Deep Learning Models
6. Knowledge-Based Systems
1. Collaborative Filtering
   • User-Based: recommends items based on similarities between users.
   • Item-Based: recommends items based on similarities between items.
   • Advantages: learns from real user behavior without requiring explicit content data.
   • Limitations: struggles with cold-start problems (new users or items).
2. Content-Based Filtering
   • Recommends items based on item features and user preferences.
   • Advantages: works well with specific content features (e.g., genre, price).
   • Limitations: limited diversity and struggles to recommend novel items.
3. Hybrid Systems
   • Combines collaborative and content-based filtering.
   • Advantages: balances the strengths of both approaches and reduces their limitations.
4. Matrix Factorization (Latent Factor Models)
   • Uses techniques like Singular Value Decomposition (SVD) to identify latent features in
     user-item interaction data. Common in collaborative filtering.
5. Deep Learning Models
   • Uses neural networks to capture complex patterns in user-item relationships.
   • Popular architectures: autoencoders, deep learning embeddings, and Transformer-based models.
6. Knowledge-Based Systems
   • Relies on domain-specific knowledge to recommend items.
   • Useful in specialized fields like healthcare and education.

Steps to Build a Recommendation System

• Data Collection: explicit feedback (e.g., ratings) or implicit feedback (e.g., clicks, views).
• Data Preprocessing: clean and preprocess data for missing values, scaling, or encoding.
• Feature Engineering: extract relevant features like user demographics or item attributes.
• Model Selection: choose collaborative, content-based, or hybrid models.
• Evaluation Metrics: precision, recall, F1-score, RMSE, MAE, or diversity metrics.
Item-based collaborative filtering
• Item-based collaborative filtering is a recommendation system technique that uses the
similarity between items to recommend products to users.
• It is particularly useful in systems where the number of users is very high, and the items
are fewer or well-curated.
Working
1. Build the Item-Item Similarity Matrix: Compute the similarity between items based on
user ratings or interactions. For example, in a movie recommendation system, similarity
could be calculated based on how users rated two movies (e.g., cosine similarity,
Pearson correlation).
2. Find Similar Items: For each item in the system, identify other items with high
similarity.
3. Generate Recommendations: For a user, identify items they have interacted with. Look
up similar items to those the user liked. Aggregate the scores of these similar items to
recommend the top-N items the user has not yet interacted with.
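For illustration, a minimal sketch of steps 1–3 on a hypothetical toy ratings matrix (the
matrix, the "rated above 3" threshold, and the variable names are assumptions of this sketch,
not from the slides):

import numpy as np

def cosine_similarity(v: np.ndarray, w: np.ndarray) -> float:
    return float(np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w)))

# Hypothetical toy matrix: rows are users, columns are items, 0 = no rating.
ratings = np.array([[5, 3, 0, 1],
                    [4, 0, 0, 1],
                    [1, 1, 0, 5],
                    [0, 0, 5, 4]])

# Step 1: item-item similarity, comparing columns of the ratings matrix.
n_items = ratings.shape[1]
item_sim = np.array([[cosine_similarity(ratings[:, i], ratings[:, j])
                      for j in range(n_items)] for i in range(n_items)])

# Steps 2-3: score items by their similarity to the items user 0 rated highly.
user0 = ratings[0]
scores = item_sim @ (user0 > 3)   # weight by the items user 0 liked
scores[user0 > 0] = 0             # don't re-recommend already-rated items
print(np.argsort(-scores))        # item indices, best candidates first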
Advantages:
Scalability: Easier to scale when there are many users.
Accuracy: Captures item similarities well, especially when user behavior is stable
over time.

Disadvantages:
Cold Start Problem: It struggles to recommend items with no user interactions (new
items).
Data Sparsity: If users rate only a few items, similarity computations can be less
effective.
User-based collaborative filtering
• User-based collaborative filtering is a recommendation system technique that focuses
on identifying users with similar preferences and using their preferences to
recommend items.
• It is one of the earliest and simplest collaborative filtering methods.
Working
1. Input Data: A user-item interaction matrix 𝑅, where each entry 𝑅𝑖𝑗​ represents the
interaction (e.g., rating or purchase) of user 𝑖 with item 𝑗.
2. Compute User Similarity: Find the similarity between users based on their
interactions with items. Common similarity measures include: Cosine Similarity,
Pearson Correlation
3. Find Similar Users: For a target user, identify the top-K most similar users.
4. Generate Recommendations: For the target user, recommend items that their similar
users have interacted with but they have not.
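A matching sketch of the user-based variant on the same kind of hypothetical toy matrix
(again, the data and names are illustrative assumptions, not the slides' own example):

import numpy as np

def cosine_similarity(v: np.ndarray, w: np.ndarray) -> float:
    return float(np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w)))

# Hypothetical toy matrix: rows are users, columns are items, 0 = no rating.
ratings = np.array([[5, 3, 0, 1],
                    [4, 0, 0, 1],
                    [1, 1, 0, 5],
                    [0, 0, 5, 4]])

# Step 2: user-user similarity, comparing rows of the ratings matrix.
n_users = ratings.shape[0]
user_sim = np.array([[cosine_similarity(ratings[i], ratings[j])
                      for j in range(n_users)] for i in range(n_users)])

# Steps 3-4: weight every other user's ratings by their similarity to the target user.
target = 0
weights = user_sim[target].copy()
weights[target] = 0                  # ignore the user's own ratings
scores = weights @ ratings           # aggregated, similarity-weighted ratings
scores[ratings[target] > 0] = 0      # only keep items the target user hasn't rated
print(np.argsort(-scores))           # item indices, best candidates first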
Advantages
Intuitive: Leverages the idea that “users with similar tastes will like similar items.”
Personalized: Recommendations are tailored to each user based on their preferences.

Disadvantages
Scalability: For a large number of users, finding similar users can be computationally
expensive.
Sparsity Problem: If user interactions are sparse, it may be difficult to find similar users.
Cold Start Problem: Struggles with new users who haven't interacted with any items.

Applications
E-commerce: Recommending products based on similar shoppers.
Streaming Services: Suggesting movies or music based on similar viewers or listeners.
Social Media: Recommending friends or content based on similar users' preferences.
Matrix factorization
• Matrix Factorization is a powerful technique used in recommendation systems to predict
missing values in a user-item interaction matrix.
• It is particularly effective in collaborative filtering, where the goal is to recommend
items to users based on their past behavior.
• Matrix factorization decomposes a large, sparse user-item interaction matrix R (e.g.,
  user ratings for movies) into two smaller matrices:
  1. User Matrix (U): represents user preferences in a latent feature space.
  2. Item Matrix (V): represents item attributes in the same latent feature space.
• Mathematically: R ≈ U × V^T
  where:
  R: original user-item matrix (e.g., ratings)
  U: user matrix (of size m × k, where m is the number of users and k is the number of
     latent features)
  V: item matrix (of size n × k, where n is the number of items)
• Each row of U represents a user's preferences, and each row of V represents an
  item's characteristics.

• The optimization goal is to minimize the error between the actual values R_ij and
  the predicted values R̂_ij = (U × V^T)_ij.

Algorithms for Matrix Factorization

1. Stochastic Gradient Descent (SGD):
   • Iteratively updates U and V by computing gradients of the loss function.
   • Easy to implement and works well for sparse data.
2. Alternating Least Squares (ALS):
   • Alternates between optimizing U while keeping V fixed and optimizing V
     while keeping U fixed.
   • Often faster than SGD for large datasets.
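A minimal SGD sketch of matrix factorization on a hypothetical toy ratings matrix (the data,
hyperparameters, and function name are this example's assumptions, not a definitive
implementation):

import numpy as np

def matrix_factorization_sgd(R, k=2, steps=5000, lr=0.01, reg=0.02):
    """Factor R (0 = missing) into U @ V.T with k latent features via SGD."""
    m, n = R.shape
    rng = np.random.default_rng(0)
    U = rng.normal(scale=0.1, size=(m, k))
    V = rng.normal(scale=0.1, size=(n, k))
    observed = [(i, j) for i in range(m) for j in range(n) if R[i, j] > 0]
    for _ in range(steps):
        i, j = observed[rng.integers(len(observed))]
        err = R[i, j] - U[i] @ V[j]
        # Gradient step on the squared error with L2 regularization.
        U[i] += lr * (err * V[j] - reg * U[i])
        V[j] += lr * (err * U[i] - reg * V[j])
    return U, V

R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 0, 5, 4]], dtype=float)
U, V = matrix_factorization_sgd(R)
print(np.round(U @ V.T, 1))   # predicted ratings, including the missing entries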
Some Important Questions

1. Describe the n-gram language model in detail.
2. Explain how grammars are used in modeling languages.
3. Explain Gibbs sampling and topic modeling.
4. Explain the recurrent neural network in detail.
5. Explain eigenvector centrality in detail.
6. Explain item-based and user-based collaborative filtering.
7. What is a word cloud?
8. Explain matrix factorization in detail.
9. Explain betweenness centrality and closeness centrality.
10. What is a recommendation system? Give its types.
