
Gen Ai Lab

The document outlines two lab programs utilizing the Gensim library for natural language processing. The first program demonstrates the training of a Word2Vec model on a simple corpus, performing operations like vector addition, cosine similarity, and finding similar words. The second program focuses on a technology-themed corpus, including data preprocessing, training a Word2Vec model, visualizing word embeddings using PCA, and retrieving semantically similar words.

Uploaded by

Nikitha G R

Lab Program 1

!pip install gensim


corpus = [
    'king is a strong man', 'queen is a wise woman', 'boy is a young man',
    'girl is a young woman', 'prince is a young', 'prince will be strong',
    'princess is young', 'man is strong', 'woman is pretty', 'prince is a boy',
    'prince will be king', 'princess is a girl', 'princess will be queen'
]
print(corpus)

# Split each sentence into a list of tokens
statements_list = []
for cor in corpus:
    statements_list.append(cor.split())
print(statements_list)

# Drop common stop words (e.g. 'is', 'a') before training
from gensim.parsing.preprocessing import STOPWORDS
documents = [[word for word in document if word not in STOPWORDS]
             for document in statements_list]
print(documents)

import gensim
from gensim.models import Word2Vec
model = Word2Vec(documents, min_count=1, vector_size=3, window = 3)
# Assuming you have already trained your Word2Vec model and it's stored in the 'model' variable

# 1. Addition and Subtraction:


vector1 = model.wv['king']
vector2 = model.wv['man']
sum_vector = vector1 + vector2
print("sum vector ",sum_vector)
diff_vector = vector1 - vector2
print("difference vector ", diff_vector)
# 2. Cosine Similarity:
similarity = model.wv.similarity('king', 'queen')
print(f"Cosine Similarity between 'king' and 'queen': {similarity}")

# 3. Finding Most Similar Words:


similar_words = model.wv.most_similar('king', topn=5)
print(f"Most Similar words to 'king': {similar_words}")

# 4. Analogy Example:
analogy_vector = model.wv['king'] - model.wv['man'] + model.wv['woman']
most_similar = model.wv.most_similar(positive=[analogy_vector], topn=1)
print(f"Analogy Result (king - man + woman): {most_similar}")
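The similarity and analogy steps in Program 1 can be reproduced by hand to see what they compute. The sketch below is a minimal pure-Python illustration, not gensim's implementation: the three-dimensional vectors in `toy_vectors` are made-up values chosen for the example, and `cosine_similarity` and `solve_analogy` are hypothetical helper names. The analogy helper excludes the three query words from the candidates, because the nearest neighbour of a raw result vector is often one of the words used to build it.

```python
import math

# Hypothetical 3-dimensional embeddings, hand-picked for illustration only.
# A trained Word2Vec model would produce different values.
toy_vectors = {
    "king":     [0.9, 0.8, 0.1],
    "queen":    [0.9, 0.1, 0.8],
    "man":      [0.1, 0.9, 0.1],
    "woman":    [0.1, 0.1, 0.9],
    "prince":   [0.5, 0.9, 0.1],
    "princess": [0.5, 0.1, 0.9],
}


def cosine_similarity(u, v):
    """Cosine of the angle between u and v: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def solve_analogy(vectors, a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), skipping a, b and c."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine_similarity(vectors[w], target))


print(cosine_similarity(toy_vectors["king"], toy_vectors["queen"]))
print(solve_analogy(toy_vectors, "king", "man", "woman"))
```

With gensim itself, the same exclusion is available by passing words rather than a precomputed vector, for example `model.wv.most_similar(positive=['king', 'woman'], negative=['man'], topn=1)`. On a corpus as small as the one in Program 1, the result should not be expected to be meaningful.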

Lab Program 2

import gensim
from gensim.models import Word2Vec
import re
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Sample domain-specific corpus (Technology)


technology_corpus = [
"Artificial intelligence is transforming various industries.",
"Machine learning algorithms improve predictive analytics.",
"Cloud computing enables scalable infrastructure for businesses.",
"Cybersecurity is crucial for protecting sensitive data.",
"Blockchain technology ensures secure and decentralized transactions.",
"The Internet of Things connects smart devices seamlessly.",
"Big data analytics helps organizations make data-driven decisions.",
"Quantum computing has the potential to revolutionize cryptography.",
"Edge computing brings computation closer to data sources.",
"Natural language processing enhances human-computer interactions."
]

# Basic text preprocessing function (tokenization & lowercasing)


def simple_tokenize(text):
    return re.findall(r'\b\w+\b', text.lower())

# Preprocess corpus manually


preprocessed_corpus = [simple_tokenize(sentence) for sentence in technology_corpus]

# Train Word2Vec model


model = Word2Vec(sentences=preprocessed_corpus, vector_size=50, window=5,
min_count=1, workers=4)

# Select 10 domain-specific words


selected_words = ["ai", "machine", "cloud", "cybersecurity", "blockchain", "iot",
"data", "quantum", "edge", "nlp"]
# Filter selected words to include only words present in model.wv
selected_words = [word for word in selected_words if word in model.wv]

# Extract word embeddings for selected words


word_vectors = [model.wv[word] for word in selected_words if word in model.wv]

# Reduce dimensionality using PCA


pca = PCA(n_components=2)
reduced_vectors = pca.fit_transform(word_vectors)

# Create DataFrame for visualization


df_embeddings = pd.DataFrame(reduced_vectors, columns=["x", "y"],
index=selected_words)

# Plot embeddings
plt.figure(figsize=(10, 6))
plt.scatter(df_embeddings["x"], df_embeddings["y"], marker='o')

for word, (x, y) in zip(df_embeddings.index, reduced_vectors):
    plt.text(x, y, word, fontsize=12)

plt.xlabel("PCA Component 1")


plt.ylabel("PCA Component 2")
plt.title("Word Embeddings Visualization (Technology Domain)")
plt.show()

# Function to get semantically similar words


def get_similar_words(word, top_n=5):
    if word in model.wv:
        return model.wv.most_similar(word, topn=top_n)
    else:
        return f"Word '{word}' not in vocabulary."

# Example usage
input_word = "technology"
similar_words = get_similar_words(input_word)
print(f"Top 5 words similar to '{input_word}':", similar_words)
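With `min_count=1`, the model's vocabulary is exactly the set of tokens the tokenizer produces from the corpus, so the vocabulary filter in Program 2 silently drops any selected word that never appears as a token. The following sketch uses only the standard library (no model is trained) and repeats the corpus and `simple_tokenize` from Program 2 to check which of the ten selected words survive. The names `vocabulary`, `kept`, and `dropped` are introduced here for illustration.

```python
import re

technology_corpus = [
    "Artificial intelligence is transforming various industries.",
    "Machine learning algorithms improve predictive analytics.",
    "Cloud computing enables scalable infrastructure for businesses.",
    "Cybersecurity is crucial for protecting sensitive data.",
    "Blockchain technology ensures secure and decentralized transactions.",
    "The Internet of Things connects smart devices seamlessly.",
    "Big data analytics helps organizations make data-driven decisions.",
    "Quantum computing has the potential to revolutionize cryptography.",
    "Edge computing brings computation closer to data sources.",
    "Natural language processing enhances human-computer interactions.",
]


def simple_tokenize(text):
    # Lowercase, then keep runs of word characters; hyphens split tokens.
    return re.findall(r'\b\w+\b', text.lower())


# Every distinct token in the corpus (the vocabulary when min_count=1).
vocabulary = {token for sentence in technology_corpus
              for token in simple_tokenize(sentence)}

selected_words = ["ai", "machine", "cloud", "cybersecurity", "blockchain",
                  "iot", "data", "quantum", "edge", "nlp"]

kept = [word for word in selected_words if word in vocabulary]
dropped = [word for word in selected_words if word not in vocabulary]

print("kept:", kept)
print("dropped:", dropped)
```

The abbreviations are missing because the corpus spells the terms out ("artificial intelligence", "Internet of Things", "natural language processing"), so the PCA plot shows fewer than ten points. Selecting tokens that do occur, such as "intelligence" or "internet", is one way to plot all ten.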
