Module 5 - Natural Language Processing
Word Clouds
From a text we can build a cloud of words by applying Natural Language Processing (NLP) techniques, with each word drawn at a size proportional to how often it occurs. This looks neat but doesn't really tell us anything. A more interesting approach might be to scatter the words so that horizontal position indicates posting popularity and vertical position indicates resume popularity, which produces a visualization that conveys a few insights.
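As a rough sketch of this idea (the words and their popularity scores below are made up for illustration, and the matplotlib library is assumed), each word can be placed at coordinates given by its two popularity scores and sized by their total:

import matplotlib.pyplot as plt

# Made-up (word, posting popularity, resume popularity) triples for illustration
data = [("big data", 100, 15), ("Hadoop", 95, 25), ("Python", 75, 50),
        ("R", 50, 40), ("machine learning", 80, 20), ("statistics", 20, 60),
        ("data science", 60, 70), ("analytics", 90, 3)]

def text_size(total):
    """Scale font size between 8 (total = 0) and 28 (total = 200)."""
    return 8 + total / 200 * 20

for word, posting_popularity, resume_popularity in data:
    plt.text(posting_popularity, resume_popularity, word,
             ha="center", va="center",
             size=text_size(posting_popularity + resume_popularity))

plt.xlabel("Popularity on Job Postings")
plt.ylabel("Popularity on Resumes")
plt.axis([0, 100, 0, 100])
plt.show()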
n-gram Models
N-grams in NLP refer to contiguous sequences of n words extracted from text for language
processing and analysis. An n-gram can be as short as a single word (unigram) or span
multiple words (bigram, trigram, etc.). These n-grams capture the contextual information and
relationships between words in a given text.
We can use n-grams to build language models that predict which word comes next given a history
of words.
An N-gram language model predicts the probability of a given N-gram within any sequence of
words in the language. A good N-gram model can predict the next word in a sentence, i.e., the
value of P(w | h), the probability of word w given the history h.
Examples: unigrams ("This", "article", "is", "on", "NLP") or bigrams ("This article", "article is",
"is on", "on NLP").
An N-gram model is a statistical language model used in natural language processing (NLP) and
computational linguistics. It predicts the likelihood of a word (or sequence of words) based on the
preceding N-1 words.
1. N-gram: An N-gram is a sequence of N words. For example, in the sentence "I love natural
language processing," some examples of N-grams are:
o 1-gram (unigram): "I", "love", "natural", "language", "processing"
o 2-gram (bigram): "I love", "love natural", "natural language", "language
processing"
o 3-gram (trigram): "I love natural", "love natural language", "natural language
processing"
2. N-gram Model: An N-gram model predicts the probability of a word given the N-1
preceding words. It's based on the Markov assumption that the probability of a word depends
only on the previous N-1 words (not the entire history of preceding words); a small bigram
sketch in code follows this list.
3. Application: N-gram models are used in various NLP tasks:
o Language Modeling: Predicting the next word in a sequence.
o Speech Recognition: Matching acoustic signals to sequences of words.
o Machine Translation: Predicting the next word in a translated sentence.
o Spell Checking and Correction: Identifying errors by looking at the probabilities
of word sequences.
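As a minimal sketch of these ideas, a bigram model can be built simply by counting which words follow which in a corpus and then sampling the next word from those counts (the tiny corpus below is made up for illustration; real models are trained on much larger text):

import random
from collections import defaultdict

# Toy corpus (made up for illustration)
corpus = "this article is on NLP this article is short this is an example".split()

# For each word, record the words observed to follow it (a bigram transition table)
transitions = defaultdict(list)
for prev_word, next_word in zip(corpus, corpus[1:]):
    transitions[prev_word].append(next_word)

def generate(start, length=6):
    """Generate text by repeatedly sampling the next word given only the previous word."""
    word, output = start, [start]
    for _ in range(length):
        followers = transitions.get(word)
        if not followers:                    # no observed continuation
            break
        word = random.choice(followers)      # sampling proportional to bigram counts
        output.append(word)
    return " ".join(output)

print(generate("this"))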
Grammar
A different approach to modeling language is with grammars, rules for generating acceptable
sentences.
Grammar is defined as the rules for forming well-structured sentences.
In NLP, a grammar is a set of rules for constructing sentences in a language; it is used to
understand and analyze the structure of sentences in text data.
This includes identifying parts of speech such as nouns, verbs, and adjectives, determining the
subject and predicate of a sentence, and identifying the relationships between words and phrases.
1. Grammar Rules:
We'll define a set of production rules in the form of A -> B, where A is a non-terminal
symbol (category) and B is a sequence of symbols (terminals or non-terminals).
o S -> NP VP: A sentence (S) consists of a noun phrase (NP) followed by a verb
phrase (VP).
o NP -> Det N: A noun phrase (NP) consists of a determiner (Det) followed by a noun
(N).
o NP -> ProperNoun: A noun phrase (NP) can also be a proper noun (ProperNoun).
o VP -> V NP: A verb phrase (VP) consists of a verb (V) followed by a noun phrase
(NP).
o Det -> "the" | "a": Determiners (Det) can be "the" or "a".
o N -> "dog" | "cat" | "ball": Nouns (N) can be "dog", "cat", or "ball".
o ProperNoun -> "John" | "Mary" | "Alice": Proper nouns (ProperNoun) can be
"John", "Mary", or "Alice".
o V -> "chased" | "ate" | "threw": Verbs (V) can be "chased", "ate", or "threw".
2. Generating Sentences:
Using the above grammar rules, we can generate valid English sentences (a code sketch follows these examples):
o Example 1:
▪ Start with S.
▪ Apply S -> NP VP.
▪ Apply NP -> ProperNoun (e.g., "John").
▪ Apply VP -> V NP (e.g., "chased" NP).
▪ Apply NP -> Det N (e.g., "the" N).
▪ Apply N -> "dog" (e.g., "the dog").
▪ Constructed sentence: "John chased the dog."
o Example 2:
▪ Start with S.
▪ Apply S -> NP VP.
▪ Apply NP -> Det N (e.g., "a" N).
▪ Apply N -> "cat" (e.g., "a cat").
▪ Apply VP -> V NP (e.g., "ate" NP).
▪ Apply NP -> ProperNoun (e.g., "Alice").
▪ Constructed sentence: "A cat ate Alice."
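A small sketch of this generation process (the dictionary below encodes exactly the rules listed above; representing the grammar as a Python dictionary is just one possible choice):

import random

# The grammar rules above, written as: non-terminal -> list of possible productions
grammar = {
    "S": [["NP", "VP"]],
    "NP": [["Det", "N"], ["ProperNoun"]],
    "VP": [["V", "NP"]],
    "Det": [["the"], ["a"]],
    "N": [["dog"], ["cat"], ["ball"]],
    "ProperNoun": [["John"], ["Mary"], ["Alice"]],
    "V": [["chased"], ["ate"], ["threw"]],
}

def expand(symbol):
    """Recursively expand a symbol by picking one of its productions at random."""
    if symbol not in grammar:                    # terminal word, e.g. "dog"
        return [symbol]
    production = random.choice(grammar[symbol])  # e.g. S -> NP VP
    return [word for part in production for word in expand(part)]

print(" ".join(expand("S")))   # e.g. "John chased the dog"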
Topic Modeling
Topic modeling in NLP refers to the process of automatically identifying topics or themes present
in a collection of text documents. It's a statistical technique used to uncover the hidden semantic
structures in a corpus and is widely used for tasks like document clustering, information retrieval,
and summarization. One of the most popular algorithms for topic modeling is Latent Dirichlet
Allocation (LDA). Here's an overview of how topic modeling works and its applications:
Latent Dirichlet Allocation (LDA)
1. Conceptual Basis:
o LDA assumes that each document in a corpus (a large and structured collection of written or
spoken texts) is a mixture of topics, and each topic is a mixture of words. It posits a generative
process where:
▪ Each document is generated by sampling a distribution of topics.
▪ Each word in the document is generated by sampling a topic from the document's
topic distribution and then sampling a word from the topic's word distribution.
2. Key Components:
o Topics: Latent topics are distributions over words. Each topic can be interpreted as a set
of words that co-occur frequently within the same context.
o Document-Topic Distribution: Each document is represented as a distribution over
topics, indicating the proportion of each topic present in the document.
o Word-Topic Distribution: Each topic is represented as a distribution over words,
indicating the likelihood of each word appearing under that topic.
3. Steps in LDA:
o Initialization: Start with random assignment of words to topics.
o Iteration: Iteratively refine the assignment of words to topics to maximize the likelihood
of the observed data under the model.
o Inference: Estimate the posterior distribution of topics given the words in the documents
using techniques like variational inference or Gibbs sampling.
4. Applications:
o Document Clustering: Group similar documents together based on their topic
distributions.
o Information Retrieval: Identify relevant documents based on their topic distributions
rather than just keywords.
o Topic Summarization: Automatically generate summaries of documents based on the
most representative topics.
o Content Recommendation: Recommend related articles or documents based on their
topic similarity.
o Exploratory Analysis: Gain insights into large collections of text data by identifying
prevalent themes or trends.
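As an illustrative sketch (assuming scikit-learn, which these notes do not require, and a made-up four-document corpus), LDA can be fitted and its document-topic and word-topic distributions inspected as follows:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy corpus (made up for illustration)
documents = [
    "python machine learning data model",
    "football match goal team player",
    "data model training python code",
    "team player league football season",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(documents)            # document-term count matrix

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(X)                   # document-topic distribution, one row per document

# Word-topic distribution: most representative words for each topic
words = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top_words = [words[i] for i in topic.argsort()[-4:]]
    print("topic", k, ":", top_words)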
Gibbs sampling
Gibbs sampling is a technique for generating samples from multidimensional distributions when we only
know some of the conditional distributions.
Start with any (valid) values for x and y, and then repeatedly alternate between replacing x with a random
value picked conditional on y and replacing y with a random value picked conditional on x. After a number
of iterations, the resulting values of x and y will represent a sample from the unconditional joint distribution.
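A toy sketch of this procedure (the joint distribution here, the value of one die x and the sum of two dice y, is an assumption chosen only because both conditionals are easy to write down):

import random

def roll_a_die():
    return random.randrange(1, 7)

def sample_y_given_x(x):
    """Conditional of y given x: y is x plus an independent die roll."""
    return x + roll_a_die()

def sample_x_given_y(y):
    """Conditional of x given y: x is uniform over the die values consistent with the sum y."""
    lower, upper = max(1, y - 6), min(6, y - 1)
    return random.randrange(lower, upper + 1)

def gibbs_sample(num_iters=100):
    x, y = 1, 2                       # any valid starting values
    for _ in range(num_iters):
        x = sample_x_given_y(y)       # alternate the two conditional updates
        y = sample_y_given_x(x)
    return x, y                       # approximately a draw from the joint distribution of (x, y)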
Gibbs Sampling is used to infer the topic distribution of words in documents and the word distribution of
topics. In LDA, for instance, each word in a document is assigned to a topic based on the current
assignments of all other words in the document. Gibbs Sampling iteratively updates these assignments to
approximate the posterior distribution over topics.
Network analysis
Network analysis is a field of study that focuses on analyzing and understanding complex systems
represented as networks or graphs. These networks consist of nodes (vertices) and edges
(connections between nodes), which can represent a wide range of entities and relationships
depending on the context of study.
Example
Social Networks: Nodes represent individuals or organizations, and edges represent relationships
(friendships, collaborations, etc.).
Betweenness centrality
Betweenness centrality measures how often a node lies on the shortest paths between pairs of
other nodes. For a node v it is defined as

C_B(v) = Σ_{s ≠ v ≠ t} σ_st(v) / σ_st

where:
o σ_st is the total number of shortest paths from node s to node t, and
o σ_st(v) is the number of those shortest paths that pass through v.
Closeness centrality
Closeness centrality is a measure (metric) used in network analysis to determine how central a node
is to the network by calculating the average shortest path distance from the node to all other nodes
in the network. In essence, it quantifies how quickly a node can interact with other nodes in the
network.
Interpretation: Nodes with high closeness centrality are those that can reach other nodes
quickly. They are effective in spreading information or influence efficiently across the network
because they have shorter average distances to other nodes.
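In symbols (a standard formulation, stated here since the formula itself does not appear in these notes): for a connected network with N nodes, C(v) = (N - 1) / Σ_u d(v, u), where d(v, u) is the shortest-path distance between v and u; this is the reciprocal of the average distance from v to every other node.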
Eigenvector centrality
Definition: Eigenvector centrality of a node v is a measure that assigns a score to the node
based on the principle that connections to high-scoring nodes contribute more to the node's
score than connections to low-scoring nodes.
Interpretation: Nodes with higher eigenvector centrality scores are those that are not only
well-connected but are also connected to other nodes that themselves have high centrality
scores. Therefore, eigenvector centrality captures a notion of influence that propagates through
the network.
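In symbols (again a standard formulation not written out in these notes): the score x_v of node v satisfies x_v = (1/λ) Σ_{u ∈ N(v)} x_u, where N(v) is the set of neighbors of v and λ is a constant; equivalently, A x = λ x, so the vector of centrality scores is an eigenvector of the network's adjacency matrix A (the one belonging to the largest eigenvalue).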
In network analysis, directed graphs and PageRank play significant roles in understanding and
evaluating the structure and importance of nodes within a network. Here’s how they are utilized:
Directed Graphs: In a directed graph each edge has a direction (for example, user A follows user B,
or page A links to page B), so relationships need not be symmetric. Many real networks, such as the
web and follower networks on social media, are naturally modeled as directed graphs.
PageRank: PageRank assigns each node an importance score based on the structure of incoming links:
a node is important if it is pointed to by other important nodes. Intuitively, it models a random
surfer who follows outgoing edges and occasionally jumps to a random node, and it ranks nodes by how
often they are visited.
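As a rough sketch of how these measures can be computed in practice (assuming the networkx library, which these notes do not require, and a small made-up directed graph):

import networkx as nx

# Small directed graph; the edges are made up for illustration
G = nx.DiGraph()
G.add_edges_from([("A", "B"), ("B", "C"), ("C", "A"), ("C", "D"), ("D", "B")])

print(nx.betweenness_centrality(G))                  # betweenness centrality of each node
print(nx.closeness_centrality(G))                    # closeness centrality of each node
print(nx.eigenvector_centrality(G.to_undirected()))  # eigenvector centrality (undirected copy for stability)
print(nx.pagerank(G))                                # PageRank scores on the directed graph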
Recommender Systems
One common data problem is producing recommendations of some sort. Netflix recommends movies
you might want to watch. Amazon recommends products you might want to buy. Twitter recommends users
you might want to follow.
Manual Curation: Manual curation in recommender systems refers to the process of human intervention
in selecting, filtering, or modifying recommendations that the system generates.
Example: Before the Internet, when we needed book recommendations we would go to the library, where
a librarian was available to suggest books that were relevant to our interests.
But this method doesn't scale particularly well, and it's limited by an individual's personal knowledge and
imagination.
Recommending What's Popular: One easy approach is to simply recommend what's popular.
Most recommendation systems use collaborative filtering to find similar patterns or information
about the users. The two types of Collaborative Filtering are user-based and item-based.
User-Based Collaborative Filtering is a technique used to predict the items that a user might like
on the basis of ratings given to those items by other users who have similar taste to that of the
target user.
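A minimal sketch of user-based similarity (the users_interests data below is made up for illustration, and cosine similarity is one common choice of similarity measure):

import math

# Toy data (made up): each user's list of interests
users_interests = [
    ["Python", "statistics", "machine learning"],
    ["Python", "databases", "SQL"],
    ["statistics", "probability", "machine learning"],
]

# All distinct interests, in a fixed order
unique_interests = sorted({interest for interests in users_interests for interest in interests})

def make_user_interest_vector(interests):
    """Binary vector with a 1 for each interest the user has."""
    return [1 if interest in interests else 0 for interest in unique_interests]

user_interest_matrix = [make_user_interest_vector(interests) for interests in users_interests]

def cosine_similarity(v, w):
    dot = sum(v_i * w_i for v_i, w_i in zip(v, w))
    return dot / (math.sqrt(sum(v_i * v_i for v_i in v)) * math.sqrt(sum(w_i * w_i for w_i in w)))

# How similar is user 0 to every user (including themselves)?
print([cosine_similarity(user_interest_matrix[0], v) for v in user_interest_matrix])

The interests of the most similar users, weighted by these similarities, can then be suggested to the target user. (The user_interest_matrix and unique_interests built here are also the structures the interest-based approach below starts from.)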
Limitations:
As noted, this approach may face challenges with very large datasets or high-dimensional
interest spaces:
● Curse of Dimensionality: In large datasets with many interests, finding truly similar users
becomes harder.
● Sparse Data: If users have only a few interests, similarity calculations may not accurately reflect
user preferences.
● Dynamic Interests: User interests may change over time, requiring constant updates to similarity
calculations.
The alternative approach described here focuses on computing similarities between interests
directly, rather than between users. This method allows for generating recommendations by
aggregating interests that are similar to a user's current interests. Here's how it works step-by-
step:
1. Transposing User-Interest Matrix
First, transpose the user_interest_matrix so that rows correspond to interests and columns
correspond to users. This transformation allows us to compute similarities between interests.
interest_user_matrix = [[user_interest_vector[j]
                         for user_interest_vector in user_interest_matrix]
                        for j, _ in enumerate(unique_interests)]
Here, interest_user_matrix[j] will have 1 for each user who has the interest
unique_interests[j], and 0 otherwise.
2. Computing Interest Similarities and Suggestions
From this matrix, pairwise similarities between interests can be computed (for example, with cosine
similarity). For a given user, each candidate interest is then scored by adding up its similarities to
the interests the user already has, and the candidates are sorted by that weight into a list of
(suggestion, weight) pairs called suggestions. The suggestion function ends by optionally filtering
out interests the user already has:

if include_current_interests:
    return suggestions
else:
    return [(suggestion, weight)
            for suggestion, weight in suggestions
            if suggestion not in users_interests[user_id]]
Matrix Factorization
Matrix Factorization is a powerful technique used in recommendation systems to decompose a large user-
item interaction matrix into lower-dimensional matrices that represent latent factors. This approach aims to
uncover hidden patterns or latent features that explain the observed interactions between users and items
(or interests, in this case). Singular Value Decomposition (SVD) is a classical method for matrix
factorization.
Basic Concept:
1. User-Item Matrix:
o Suppose you have a matrix R where rows correspond to users and columns correspond to
items (or interests). Each entry R[i][j] represents the interaction (e.g., rating, interest
indication) of user i with item j.
2. Matrix Decomposition:
o Matrix Factorization decomposes the matrix R into two lower-dimensional matrices:
▪ User matrix U: Represents users in terms of latent factors.
▪ Item matrix V: Represents items (interests) in terms of the same latent factors.
3. Learning Latent Factors:
o The goal is to learn the matrices U and V such that their product approximates R well.
This is typically done by minimizing a loss function that quantifies the difference
between the predicted ratings (or interest indications) and the actual ratings in R.
4. Recommendations:
o Once U and V are learned, recommendations can be made by:
▪ Predicting the missing entries in R.
▪ Recommending items (interests) with the highest predicted values for a given
user.
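A rough sketch of this idea using a plain truncated SVD from NumPy (the rating matrix below is made up, with zeros standing in for missing ratings; real systems typically use factorization methods that handle missing entries explicitly, but the decompose-and-reconstruct pattern is the same):

import numpy as np

# Toy user-item rating matrix (made up); 0 marks an unrated item
R = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 1.0, 0.0],
    [1.0, 0.0, 5.0, 4.0],
    [0.0, 1.0, 4.0, 5.0],
])

# Decompose R and keep k latent factors
U, s, Vt = np.linalg.svd(R, full_matrices=False)
k = 2
U_k = U[:, :k] * s[:k]       # user matrix: users x latent factors
V_k = Vt[:k, :]              # item matrix: latent factors x items

R_hat = U_k @ V_k            # predicted ratings, approximating R
print(np.round(R_hat, 2))

# Recommend, for user 0, the unrated item with the highest predicted rating
user = 0
unrated_items = np.where(R[user] == 0)[0]
best = unrated_items[np.argmax(R_hat[user, unrated_items])]
print("recommend item", best)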