Module5 DS PPT
Word Clouds, n-Gram Language Models, Grammars, An Aside: Gibbs Sampling, Topic Modeling,
Word Vectors, Recurrent Neural Networks, Example: Using a Character-Level RNN, Network
Analysis, Betweenness Centrality, Eigenvector Centrality, Directed Graphs and PageRank,
Recommender Systems, Manual Curation, Recommending What’s Popular, User-Based
Collaborative Filtering, Item-Based Collaborative Filtering, Matrix Factorization.
Text Book : Chapters 21, 22 and 23
Natural Language Processing
• Natural language processing (NLP) refers to computational techniques involving
language.
• Natural Language Processing (NLP) is a field of artificial intelligence (AI) focused
on enabling computers to understand, interpret, and generate human language in a
way that is meaningful and useful.
• NLP combines linguistics with computer science and machine learning to process
and analyze large amounts of natural language data, such as text or speech.
Word Clouds
• One approach to visualizing words and counts is word clouds, which artistically depict the words at sizes proportional to their counts.
• Generally, though, data scientists don’t think much of word clouds, in large part because the placement of the words doesn’t mean anything other than “here’s some space where I was able to fit a word.”
• If you ever are forced to create a word cloud, think about whether you can make the axes convey something.
• For example, imagine that, for each of some collection of data science–related
buzzwords, you have two numbers between 0 and 100—the first representing how
frequently it appears in job postings, and the second how frequently it appears on
résumés:
data = [ ("big data", 100, 15), ("Hadoop", 95, 25), ("Python", 75, 50),
("R", 50, 40), ("machine learning", 80, 20), ("statistics", 20, 60),
("data science", 60, 70), ("analytics", 90, 3),
("team player", 85, 85), ("dynamic", 2, 90), ("synergies", 70, 0),
("actionable insights", 40, 30), ("think out of the box", 45, 10),
("self-starter", 30, 50), ("customer focus", 65, 15),
("thought leadership", 35, 35)]
The word cloud approach is just to arrange the words on a page in a cool-looking font (Figure 1).
from matplotlib import pyplot as plt

def text_size(total: int) -> float:
    """equals 8 if total is 0, 28 if total is 200"""
    return 8 + total / 200 * 20

for word, job_popularity, resume_popularity in data:
    plt.text(job_popularity, resume_popularity, word,
             ha='center', va='center',
             size=text_size(job_popularity + resume_popularity))

plt.xlabel("Popularity on Job Postings")
plt.ylabel("Popularity on Resumes")
plt.axis([0, 100, 0, 100])
plt.xticks([])
plt.yticks([])
plt.show()
n-Gram Language Models
• An n-gram language model is a statistical model used in natural language processing to predict the probability of a word based on the previous (n-1) words.
• These models are crucial in various NLP tasks such as text generation, speech recognition, machine translation, and more.
• The model operates by estimating the probability of a word given the preceding words, typically considering a fixed number (n) of previous words to make this prediction.
• An n-gram is a contiguous sequence of n items from a given sample of text or speech.
• n-grams can be classified as follows (a short extraction example appears after this list):
Unigram - a single word, e.g., “I”
Bigram - a sequence of two words, e.g., “I am”
Trigram - a sequence of three words, e.g., “I am happy”
n-gram - a sequence of n words
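A minimal sketch (not from the textbook) of extracting these n-grams from a tokenized sentence:

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "I am happy today".split()
print(ngrams(tokens, 1))   # unigrams: [('I',), ('am',), ('happy',), ('today',)]
print(ngrams(tokens, 2))   # bigrams:  [('I', 'am'), ('am', 'happy'), ('happy', 'today')]
print(ngrams(tokens, 3))   # trigrams: [('I', 'am', 'happy'), ('am', 'happy', 'today')]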
Structure of n-gram models
Probability Estimation:
• The primary goal of an n-gram model is to estimate the probability of a word given its preceding words.
• For an n-gram model, the probability of the next word w_i given the previous (n-1) words w_{i-n+1}, …, w_{i-1} is
P(w_i | w_{i-n+1}, …, w_{i-1})
Markov Assumption
• The model relies on the Markov assumption, which simplifies the problem by assuming that the probability of a word depends only on a fixed number of preceding words, not on the entire preceding history.
• This is often expressed as: P(w_i | w_1, w_2, …, w_{i-1}) ≈ P(w_i | w_{i-n+1}, …, w_{i-1})
Chain rule:
• Using the chain rule, the joint probability of a sequence of words can be decomposed as
P(w_1, w_2, …, w_N) = P(w_1) · P(w_2 | w_1) · P(w_3 | w_1, w_2) ⋯ P(w_N | w_1, w_2, …, w_{N-1})
• For an n-gram model this is approximated as
P(w_1, w_2, …, w_N) ≈ P(w_1) · P(w_2 | w_1) ⋯ P(w_N | w_{N-n+1}, …, w_{N-1})
Training an n-gram model
• Training involves counting the occurrences of n-grams in a text corpus and using these counts to estimate probabilities.
• For example, in a bigram model, the probability of a word w_i following a word w_{i-1} is calculated as:
P(w_i | w_{i-1}) = Count(w_{i-1}, w_i) / Count(w_{i-1})
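A minimal sketch of this counting approach, assuming a small in-memory corpus of tokenized sentences (the corpus and helper names here are illustrative, not from the textbook):

from collections import defaultdict

corpus = [["I", "am", "happy"], ["I", "am", "here"], ["you", "are", "happy"]]

bigram_counts = defaultdict(int)   # counts of (w_{i-1}, w_i)
unigram_counts = defaultdict(int)  # counts of w_{i-1}

for sentence in corpus:
    for prev, word in zip(sentence, sentence[1:]):
        bigram_counts[(prev, word)] += 1
        unigram_counts[prev] += 1

def bigram_probability(prev: str, word: str) -> float:
    """P(word | prev) = Count(prev, word) / Count(prev)."""
    if unigram_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(bigram_probability("I", "am"))      # 1.0 in this toy corpus
print(bigram_probability("am", "happy"))  # 0.5 in this toy corpus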
Grammars
• A different approach to modeling language is with grammars, rules for generating
acceptable sentences.
• For example, a simple rule might state that a sentence consists of a noun followed by a verb. If you then have a list of nouns and verbs, you can generate sentences according to that rule.
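A minimal sketch of this idea, using an illustrative word list (not the textbook's grammar) and Python's random module:

import random

grammar = {
    "_S": ["_N", "_V"],                # a sentence is a noun followed by a verb
    "_N": ["data", "Python", "science"],
    "_V": ["rules", "grows", "wins"],
}

def generate_sentence() -> str:
    """Expand the sentence rule by picking one word for each part of speech."""
    return " ".join(random.choice(grammar[symbol]) for symbol in grammar["_S"])

print(generate_sentence())  # e.g., "data grows"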
Applications of RNNs:
• Natural Language Processing (NLP): Sentiment analysis, machine translation, text generation, and speech recognition.
• Time-Series Analysis: Stock price prediction, weather forecasting, and signal processing.
• Sequential Data Modeling: Video frame analysis and music generation.
Betweenness Centrality
• Betweenness Centrality is a metric used in network analysis to measure the importance
of a node (or edge) in a graph based on its role in connecting other nodes.
• It quantifies how often a node appears on the shortest paths between pairs of nodes in a
network.
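The following snippets use the small social network from the textbook; a minimal User record is sketched here (assumed to be a NamedTuple with an id and a name) so the code below runs as written:

from typing import NamedTuple

class User(NamedTuple):
    id: int    # numeric id matching the indices used in friend_pairs below
    name: str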
users = [User(0, "Hero"), User(1, "Dunn"), User(2, "Sue"), User(3, "Chi"),
User(4, "Thor"), User(5, "Clive"), User(6, "Hicks"),
User(7, "Devin"), User(8, "Kate"), User(9, "Klein")]
and friendships:
friend_pairs = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4),
(4, 5), (5, 6), (5, 7), (6, 8), (7, 8), (8, 9)]
• The betweenness centrality of node i is computed by adding up, for every other pair of
nodes j and k, the proportion of shortest paths between node j and node k that pass
through i.
As shown in the figure above, users 0 and 9 have centrality 0 (as neither is on any shortest path between other users), whereas users 3, 4, and 5 all have high centralities (as all three lie on many shortest paths).
• Closeness Centrality is a metric in network analysis that quantifies how "close" a node is to all other nodes in a network.
• It is based on a node's total shortest-path distance to every other node (closeness is typically defined as the reciprocal of this sum), emphasizing how quickly information or resources can spread from that node throughout the network.
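As a rough cross-check of these ideas, here is a sketch using the third-party networkx library rather than the textbook's from-scratch implementation (its normalization conventions may differ from the raw values discussed above, but the ordering is the same):

import networkx as nx  # third-party library, assumed available

G = nx.Graph()
G.add_edges_from(friend_pairs)  # the friendship graph defined earlier

print(nx.betweenness_centrality(G))  # users 0 and 9 score 0; users 3, 4, and 5 score highest
print(nx.closeness_centrality(G))    # central users are "closest" to everyone else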
Eigenvector Centrality
• Eigenvector Centrality is a network analysis metric that measures the influence of a
node in a graph based on the importance of its neighbors.
• Unlike simpler measures like degree centrality, which counts the number of direct
connections, eigenvector centrality assigns higher scores to nodes that are connected
to other highly central nodes.
Characteristics
• Eigenvector centrality is a recursive metric, meaning a node's importance depends on
the importance of its neighbors.
• It can handle directed and undirected graphs.
• The computation requires finding the eigenvector of the adjacency matrix, which is
computationally intensive for large graphs.
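A minimal sketch of the underlying computation, using power iteration on the adjacency matrix of the friendship graph above (an illustrative NumPy version, not the textbook's exact code):

import numpy as np  # assumed available

n = 10  # number of users in the friendship graph
A = np.zeros((n, n))
for i, j in friend_pairs:          # build the symmetric adjacency matrix
    A[i, j] = A[j, i] = 1

v = np.ones(n)                     # start from a uniform guess
for _ in range(100):               # power iteration converges to the leading eigenvector
    v = A @ v
    v = v / np.linalg.norm(v)

print(v)  # each entry is a user's eigenvector centrality (up to scaling)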
Comparison of Centrality Measures
Directed Graphs and PageRank
endorsements = [(0, 1), (1, 0), (0, 2), (2, 0), (1, 2),
(2, 1), (1, 3), (2, 3), (3, 4), (5, 4),
(5, 6), (7, 5), (6, 8), (8, 7), (8, 9)]
A simplified version of the PageRank algorithm works as follows:
1. There is a total of 1.0 (or 100%) PageRank in the network.
2. Initially this PageRank is equally distributed among nodes.
3. At each step, a large fraction of each node’s PageRank is distributed evenly among its outgoing links.
4. At each step, the remainder of each node’s PageRank is distributed evenly among all nodes.
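A minimal sketch of these four steps over the endorsements graph above (the damping value 0.85 and iteration count are common choices assumed here, not values from the slides):

from collections import defaultdict

damping = 0.85          # fraction of PageRank passed along outgoing links
num_iters = 100
user_ids = list(range(10))

outgoing = defaultdict(list)
for source, target in endorsements:
    outgoing[source].append(target)

# step 2: start with PageRank spread evenly across all nodes
pagerank = {u: 1 / len(user_ids) for u in user_ids}

for _ in range(num_iters):
    # step 4: every node receives an even share of the undamped remainder
    next_pr = {u: (1 - damping) / len(user_ids) for u in user_ids}
    for source in user_ids:
        for target in outgoing[source]:
            # step 3: a large fraction of each node's PageRank flows to its outgoing links
            next_pr[target] += damping * pagerank[source] / len(outgoing[source])
    pagerank = next_pr

print(pagerank)  # in the textbook's version of this example, user 4 (Thor) comes out on top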
Recommender Systems
• Another common data problem is producing recommendations of some sort.
• Netflix recommends movies you might want to watch.
• Amazon recommends products you might want to buy.
• Twitter recommends users you might want to follow.
users_interests = [
    ["Hadoop", "Big Data", "HBase", "Java", "Spark", "Storm", "Cassandra"],
    ["NoSQL", "MongoDB", "Cassandra", "HBase", "Postgres"],
    ["Python", "scikit-learn", "scipy", "numpy", "statsmodels", "pandas"],
    ["R", "Python", "statistics", "regression", "probability"],
    ["machine learning", "regression", "decision trees", "libsvm"],
    ["Python", "R", "Java", "C++", "Haskell", "programming languages"],
    ["statistics", "probability", "mathematics", "theory"],
    ["machine learning", "scikit-learn", "Mahout", "neural networks"],
    ["neural networks", "deep learning", "Big Data", "artificial intelligence"],
    ["Hadoop", "Java", "MapReduce", "Big Data"],
    ["statistics", "R", "statsmodels"],
    ["C++", "deep learning", "artificial intelligence", "probability"],
    ["pandas", "R", "Python"],
    ["databases", "HBase", "Postgres", "MySQL", "MongoDB"],
]
• A Recommendation System is a machine learning model designed to predict user
preferences and suggest relevant items, such as movies, books, or products.
• These systems are widely used in e-commerce, streaming services, and social
platforms. There are several approaches to building recommendation systems:
• Types of Recommendation Systems
1. Collaborative Filtering
2. Content-Based Filtering
3. Hybrid Systems
4. Matrix Factorization (Latent Factor Models)
5. Deep Learning Models
6. Knowledge-Based Systems
Collaborative Filtering
User-Based: Recommends items based on similarities between users.
Item-Based: Recommends items based on similarities between items.
Advantages: Learns from real user behavior without requiring explicit content data.
Limitations: Struggles with cold-start problems (new users or items).
Disadvantages:
Cold Start Problem: It struggles to recommend items with no user interactions (new items).
Data Sparsity: If users rate only a few items, similarity computations can be less effective.
User-based collaborative filtering
• User-based collaborative filtering is a recommendation system technique that focuses
on identifying users with similar preferences and using their preferences to
recommend items.
• It is one of the earliest and simplest collaborative filtering methods.
Working
1. Input Data: A user-item interaction matrix 𝑅, where each entry 𝑅𝑖𝑗 represents the
interaction (e.g., rating or purchase) of user 𝑖 with item 𝑗.
2. Compute User Similarity: Find the similarity between users based on their
interactions with items. Common similarity measures include: Cosine Similarity,
Pearson Correlation
3. Find Similar Users: For a target user, identify the top-K most similar users.
4. Generate Recommendations: For the target user, recommend items that their similar
users have interacted with but they have not.
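A minimal sketch of these four steps over the users_interests data above, using cosine similarity on binary interest vectors (the helper names here are illustrative, not from the textbook):

import math
from collections import defaultdict

# 1. build a binary user-item interaction matrix from users_interests
unique_interests = sorted({interest for interests in users_interests for interest in interests})
user_vectors = [[1 if interest in interests else 0 for interest in unique_interests]
                for interests in users_interests]

def cosine_similarity(v, w):
    """2. similarity between two users' interest vectors."""
    dot = sum(vi * wi for vi, wi in zip(v, w))
    return dot / (math.sqrt(sum(vi * vi for vi in v)) * math.sqrt(sum(wi * wi for wi in w)))

def recommend(user_id, k=3):
    """3. find the k most similar users, then 4. suggest their interests that the target lacks."""
    similarities = [(other_id, cosine_similarity(user_vectors[user_id], user_vectors[other_id]))
                    for other_id in range(len(users_interests)) if other_id != user_id]
    top_k = sorted(similarities, key=lambda pair: pair[1], reverse=True)[:k]

    scores = defaultdict(float)
    for other_id, similarity in top_k:
        for interest in users_interests[other_id]:
            if interest not in users_interests[user_id]:
                scores[interest] += similarity
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)

print(recommend(0))  # suggested new interests for user 0, weighted by neighbor similarity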
Advantages
Intuitive: Leverages the idea that "users with similar tastes will like similar items."
Personalized: Recommendations are tailored to each user based on their preferences.
Disadvantages
Scalability: For a large number of users, finding similar users can be computationally
expensive.
Sparsity Problem: If user interactions are sparse, it may be difficult to find similar users.
Cold Start Problem: Struggles with new users who haven't interacted with any items.
Applications
E-commerce: Recommending products based on similar shoppers.
Streaming Services: Suggesting movies or music based on similar viewers or listeners.
Social Media: Recommending friends or content based on similar users' preferences.
Matrix factorization
• Matrix Factorization is a powerful technique used in recommendation systems to predict
missing values in a user-item interaction matrix.
• It is particularly effective in collaborative filtering, where the goal is to recommend
items to users based on their past behavior.
• Matrix factorization decomposes a large, sparse user-item interaction matrix 𝑅 (e.g.,
user ratings for movies) into two smaller matrices:
1. User Matrix (𝑈): Represents user preferences in a latent feature space.
2. Item Matrix (V): Represents item attributes in the same latent feature space.
Mathematically: R ≈ U × Vᵀ
Where:
𝑅: Original user-item matrix (e.g., ratings)
𝑈: User matrix (of size 𝑚×𝑘, where 𝑚 is the number of users, and 𝑘 is the number of
latent features)
𝑉: Item matrix (of size 𝑛×𝑘, where 𝑛 is the number of items)
• Each row of 𝑈 represents a user's preferences, and each row of 𝑉 represents an
item's characteristics.
• The optimization goal is to minimize the error between the actual values R_ij and the predicted values R̂_ij obtained from U × Vᵀ.
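A minimal sketch of this optimization, using plain stochastic gradient descent on a tiny ratings matrix (the matrix, learning rate, and dimensions below are illustrative assumptions, not values from the slides):

import random

# toy ratings matrix R: rows are users, columns are items, None means "unknown"
R = [[5, 3, None, 1],
     [4, None, None, 1],
     [1, 1, None, 5],
     [None, 1, 5, 4]]

num_users, num_items, k = len(R), len(R[0]), 2   # k latent features
learning_rate, num_epochs = 0.01, 2000

# initialize U (num_users x k) and V (num_items x k) with small random values
U = [[random.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(num_users)]
V = [[random.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(num_items)]

def predict(i, j):
    """Predicted rating R̂_ij = dot product of user i's and item j's latent vectors."""
    return sum(U[i][f] * V[j][f] for f in range(k))

for _ in range(num_epochs):
    for i in range(num_users):
        for j in range(num_items):
            if R[i][j] is None:
                continue                      # only train on observed ratings
            error = R[i][j] - predict(i, j)   # minimize (R_ij - R̂_ij)^2
            for f in range(k):
                U[i][f] += learning_rate * error * V[j][f]
                V[j][f] += learning_rate * error * U[i][f]

print(predict(0, 2))  # predicted rating for a missing entry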