
Question 1: TF-IDF Matrix and Feature Reduction (30 Marks)

a) Curse of Dimensionality
When a TF-IDF matrix explodes in size with an excessive number of features, it encounters the curse of
dimensionality. This phenomenon manifests in several ways:

* Increased Data Requirements: Training machine learning models with a high-dimensional feature
space necessitates a massive amount of data to achieve robust performance. This can be impractical in
many real-world scenarios where data collection is expensive or limited.
* Computational Bottlenecks: Processing high-dimensional data often demands more computational
resources, leading to longer training times, higher memory usage, and potential hardware limitations.
This can impede the efficiency and scalability of machine learning pipelines.
b) Causes of Large Feature Sets

* Unnecessary N-grams: Consider including bigrams and trigrams only when they offer meaningful
context and contribute significantly to the document representation. For example, "artificial intelligence" is a
valuable bigram, while "the of and" contributes little and can be excluded.
* Vocabulary Size and Length: Large documents with highly diverse vocabularies will naturally contain
more unique feature terms. This can be mitigated to some extent by:
* Stop-Word Removal: Eliminating common, non-informative words like "the," "a," and "is" from the
vocabulary can help reduce feature space size.
* Stemming/Lemmatization: Normalizing words to their base forms (e.g., "running" -> "run," "better" ->
"good") can reduce feature space redundancy.

c) Feature Reduction Techniques

Here are three effective strategies to shrink the feature space size of your TF-IDF matrix (a combined code sketch follows this list):
* N-gram Selection: Implement a threshold-based or information gain-based approach to filter out bigrams
and trigrams that don't contribute significantly to document representation. This prevents the inclusion of
irrelevant or uninformative n-grams.

* Domain-Specific Knowledge: Consider incorporating domain expertise to identify and exclude feature
terms that hold little value in your specific application domain. Tailoring your feature space to your use case can
improve model performance and efficiency.

* Dimensionality Reduction Techniques: Techniques like Principal Component Analysis (PCA) can
project the high-dimensional data onto a lower-dimensional latent space while preserving the maximum
amount of variance. This allows you to retain the most informative features while significantly reducing the
feature space size.
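
A minimal sketch combining these ideas with scikit-learn. TruncatedSVD is used here as the sparse-friendly analogue of PCA, and the corpus, n-gram range, thresholds, and component count are placeholder assumptions:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Placeholder corpus; replace with the real documents.
docs = [
    "artificial intelligence is transforming data mining",
    "data mining extracts patterns from large data sets",
    "intelligence tests measure reasoning ability",
]

# Limit the vocabulary: unigrams + bigrams, drop stop words,
# and keep only terms that appear in at least 2 documents.
vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words="english", min_df=2)
X = vectorizer.fit_transform(docs)

# Project onto a lower-dimensional latent space, keeping most of the variance.
svd = TruncatedSVD(n_components=2, random_state=42)
X_reduced = svd.fit_transform(X)
print(X.shape, "->", X_reduced.shape)
```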
Question 2 (35 Marks)
Refer to dataset at - https://archive.ics.uci.edu/dataset/484/travel+reviews
Use K-Means and DBSCAN clustering techniques in Python to identify and label clusters of
users (i.e., travellers) with similar travel interests.

a) Discuss which technique worked better and why?

In this scenario, DBSCAN generally works better than K-Means for identifying clusters of travelers with similar
interests. Here's why (a minimal code sketch of both methods follows this comparison):
* DBSCAN:
Handles clusters of varying shapes and sizes: DBSCAN is more flexible in identifying clusters that are not
necessarily spherical or of uniform size, which is common in real-world data.
Identifies outliers: DBSCAN can effectively identify and label outliers, which are travelers with unique interests
that don't fit into any defined cluster.
* K-Means:
Assumes spherical clusters: K-Means assumes that clusters are spherical and of similar size. This assumption
may not hold true for real-world travel data where interests can be diverse and cluster shapes can be irregular.
Sensitive to outliers: Outliers can significantly influence the centroid calculation in K-Means, leading to
inaccurate cluster assignments.
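
A minimal sketch of both methods, assuming the Travel Reviews data has been saved locally as tripadvisor_review.csv with a user-ID column followed by the category-average columns (the file name, k, eps, and min_samples are assumptions to tune):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, DBSCAN

# Assumes the UCI Travel Reviews file has been downloaded locally; the file
# name and the assumption that the first column is a user ID are placeholders.
df = pd.read_csv("tripadvisor_review.csv")
X = StandardScaler().fit_transform(df.iloc[:, 1:])  # drop the user-ID column

# k, eps and min_samples are illustrative values to tune (e.g. silhouette
# scores for k, a k-distance plot for eps).
kmeans_labels = KMeans(n_clusters=5, n_init=10, random_state=42).fit_predict(X)
dbscan_labels = DBSCAN(eps=0.9, min_samples=5).fit_predict(X)  # -1 marks outliers

print(pd.Series(kmeans_labels).value_counts())
print(pd.Series(dbscan_labels).value_counts())
```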

b) Explain why it may be essential to use dimensionality reduction before implementing these techniques?

Dimensionality reduction is crucial before applying clustering algorithms like K-Means and DBSCAN for several
reasons (a short PCA sketch follows this list):
* Curse of dimensionality: In high-dimensional spaces, Euclidean distances become less meaningful. This can
lead to inaccurate distance calculations and poor clustering results.
* Computational efficiency: Reducing the number of dimensions can significantly improve the computational
efficiency of clustering algorithms, especially for large datasets.
* Visualization: Dimensionality reduction techniques like PCA can project the data onto a lower-dimensional
space, often 2 or 3 dimensions, enabling visualization. This can help identify outliers, noise, and potential
cluster shapes, guiding the choice of clustering algorithm.
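
A short sketch, reusing the standardized matrix X from the previous sketch (the choice of two components is an assumption to check against the explained variance):

```python
from sklearn.decomposition import PCA

# Reuses X (standardized features) from the previous sketch.
# n_components=2 is an assumption; inspect the explained variance before settling on it.
pca = PCA(n_components=2, random_state=42)
X_2d = pca.fit_transform(X)
print(pca.explained_variance_ratio_)
```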
Question 3
a) Discuss why a TF-IDF matrix might be a better representation than a TF matrix for the above
text documents?

Term frequency (TF) only considers how often a term appears within a single document. While this can
highlight important terms, it doesn't account for how common those terms are across the entire corpus of
documents. For instance, the word "data" appears frequently in all the given documents. Using TF alone might
overemphasize its importance, even though it's a common word that doesn't necessarily distinguish one
document from another.

TF-IDF (Term Frequency-Inverse Document Frequency) addresses this by considering both the term
frequency and the inverse document frequency (IDF). IDF measures how rare a term is across the entire
corpus. Terms that appear in many documents have a low IDF, while rare terms have a high IDF.

By multiplying TF and IDF, TF-IDF gives more weight to terms that are frequent in a specific document but rare
in the overall collection. This helps to identify terms that are truly distinctive and meaningful for a particular
document.

Therefore, TF-IDF is generally a better representation than TF for text documents because it considers both
term frequency within a document and the rarity of that term across the entire corpus, leading to a more
nuanced and informative representation.
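
A small sketch contrasting the two representations with scikit-learn; the documents are placeholders standing in for the ones referenced above:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Placeholder documents standing in for the ones referenced in the question.
docs = [
    "data mining finds patterns in data",
    "data visualization presents data clearly",
    "neural networks learn from data",
]

tf = CountVectorizer().fit(docs)
tfidf = TfidfVectorizer().fit(docs)

# "data" occurs in every document, so TF-IDF down-weights it relative to
# rarer, more distinctive terms such as "mining" or "networks".
print(tf.get_feature_names_out())
print(tf.transform(docs).toarray())
print(tfidf.transform(docs).toarray().round(2))
```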

b) Discuss whether it would be a good idea to use bigrams as tokens in the TF-IDF
representation of the above text documents?

Using bigrams (two consecutive words) as tokens in TF-IDF can be beneficial in some cases, but it depends on
the specific characteristics of the text data.

Potential Benefits:


* Capturing Context: Bigrams can help capture semantic relationships and context that might be missed by
using individual words (unigrams). For example, the phrase "data mining" is more meaningful than the
individual words "data" and "mining" in this context.
* Improved Discrimination: Bigrams can sometimes help to better distinguish between documents with similar
word frequencies but different word order or phrase usage.
Potential Drawbacks:
* Data Sparsity: Bigrams can lead to data sparsity, especially in smaller corpora. Many bigrams may only
appear in a few documents or even just once, making it difficult to calculate meaningful TF-IDF scores.
* Noise: Not all bigrams are meaningful. Including many irrelevant or noisy bigrams can introduce noise into the
TF-IDF representation.
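
A small sketch of the sparsity trade-off, again with placeholder documents: adding bigrams to the same corpus noticeably inflates the vocabulary:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Placeholder documents; with a small corpus the effect shows up simply
# as vocabulary growth.
docs = [
    "data mining finds patterns in data",
    "data visualization presents data clearly",
    "neural networks learn from data",
]

unigrams = TfidfVectorizer(ngram_range=(1, 1)).fit(docs)
uni_and_bi = TfidfVectorizer(ngram_range=(1, 2)).fit(docs)

print(len(unigrams.vocabulary_), "unigram features")
print(len(uni_and_bi.vocabulary_), "unigram + bigram features")
```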
Question 4
Refer to dataset at –
https://archive.ics.uci.edu/dataset/396/sales+transactions+dataset+weekly
Use PCA and UMAP dimensionality reduction techniques in Python to visually explore
which products had similar weekly sales transactions over the course of 52 weeks.
a) Explain which technique worked better for visual exploration and why?

PCA (Principal Component Analysis) is a poor fit here because we are dealing with discrete count variables
(weekly sales transactions). PCA is primarily designed for continuous data with linear correlations, so applying it
to discrete count data can lead to misleading results and distorted visualizations.

UMAP (Uniform Manifold Approximation and Projection) is a more suitable technique for visualizing this
dataset. UMAP is specifically designed to preserve local and global structure in high-dimensional data. It can
effectively handle non-linear relationships and complex structures, which are often present in real-world data
like weekly sales transactions.

Therefore, UMAP is likely to provide a more accurate and informative visualization of the relationships between
products based on their weekly sales patterns.
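
A minimal sketch of both projections, assuming the UCI file has been saved locally and that the weekly counts are in columns named 'W0' to 'W51' (the file and column names are assumptions):

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import umap  # pip install umap-learn

# Assumes the UCI file is saved locally; file and column names are assumptions.
df = pd.read_csv("Sales_Transactions_Dataset_Weekly.csv")
weeks = [c for c in df.columns if c.startswith("W") and c[1:].isdigit()]
X = StandardScaler().fit_transform(df[weeks])

X_pca = PCA(n_components=2, random_state=42).fit_transform(X)
X_umap = umap.UMAP(n_neighbors=15, random_state=42).fit_transform(X)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(X_pca[:, 0], X_pca[:, 1], s=5)
axes[0].set_title("PCA")
axes[1].scatter(X_umap[:, 0], X_umap[:, 1], s=5)
axes[1].set_title("UMAP")
plt.show()
```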

b) Label products with similar weekly sales transactions using the k-Means clustering
algorithm in Python. Does it work perfectly? Why or why not? Save labeled data to a CSV file
called Labelled.csv and upload this file.

When k-means is applied to the five distinct clusters in the UMAP visualization, it generally works well to label those
clusters. K-means can effectively identify groups of products with similar weekly sales patterns.
However, there are a few caveats to consider:
* The choice of the number of clusters (k) is crucial. If the wrong value of k is chosen, the clustering results may
not accurately reflect the underlying structure of the data.
* K-means assumes that clusters are spherical in shape. If the true clusters are non-spherical or have complex
shapes, k-means may not be able to accurately identify them.
* Outliers and noise can significantly impact the performance of k-means. Outliers can distort the cluster
centers, while noise can make it difficult for k-means to identify meaningful patterns.
In the absence of outliers or noise, and assuming the clusters are relatively spherical, k-means can be a good
choice for labeling products with similar weekly sales patterns.
To save the labeled data, you can create a new column in your dataset to store the cluster labels assigned by
k-means. Then, you can save this modified dataset to a CSV file named “Labelled.csv”.
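
A minimal sketch of that final step, reusing df and X_umap from the previous sketch (k = 5 is an assumption based on the clusters visible in the UMAP plot):

```python
from sklearn.cluster import KMeans

# Reuses df and X_umap from the previous sketch; k=5 is an assumption based
# on the number of clusters visible in the UMAP plot.
labels = KMeans(n_clusters=5, n_init=10, random_state=42).fit_predict(X_umap)

df["Cluster"] = labels
df.to_csv("Labelled.csv", index=False)
```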
Question 5
a) In the following five-node undirected weighted network, each edge weight represents a
distance. Use Dijkstra's algorithm to calculate the shortest distance from node 'a' to all
other nodes in the network. Show your workings.

Step 1: Initialization
* Current Node: 'a'
* Unvisited Nodes: b, c, d, e
* Assigned Tentative Distances:
* a to b: 4
* a to c: 5
* a to d: 12
* a to e: ∞
Step 2: Process Node 'b'
* Current Node: 'b'
* Unvisited Nodes: c, d, e
* Tentative Distances from 'b':
* b to c: 10
* b to d: 13
* b to e: 14
* Compare and Update Tentative Distances:
* a to c: 5 (unchanged)
* a to d: 12 (unchanged)
* a to e: 14 (new distance)
* Mark Node 'b' as Visited
* Shortest Distance to 'b': D(b) = 4
* Parent Node of 'b': P(b) = 'a'
Step 3: Process Node 'c'
* Current Node: 'c'
* Unvisited Nodes: d, e
* Tentative Distances from 'c':
* c to d: 7
* c to e: 11
* Compare and Update Tentative Distances:
* a to d: 12 (unchanged)
* a to e: 11 (new distance)
* Mark Node 'c' as Visited
* Shortest Distance to 'c': D(c) = 5
* Parent Node of 'c': P(c) = 'a'
Step 4: Process Node 'e'
* Current Node: 'e'
* Unvisited Nodes: d
* Tentative Distances from 'e':
* e to d: 8
* Compare and Update Tentative Distances:
* a to d: 12 (unchanged)
* Mark Node 'e' as Visited
* Shortest Distance to 'e': D(e) = 11
* Parent Node of 'e': P(e) = 'c'
Step 5: Process Node 'd'
* Current Node: 'd'
* Unvisited Nodes: None
* Mark Node 'd' as Visited
* Shortest Distance to 'd': D(d) = 12
* Parent Node of 'd': P(d) = 'a'
Final Shortest Distances:
* D(a) = 0
* D(b) = 4
* D(c) = 5
* D(d) = 12
* D(e) = 11
Therefore, the shortest distances from node 'a' to all other nodes in the network are as follows:
* a to b: 4
* a to c: 5
* a to d: 12
* a to e: 11
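
As a cross-check, a minimal networkx sketch is given below. Only the edge weights stated in the initialization (a-b, a-c, a-d) are taken from the working; the remaining edges are placeholders to be filled in from the figure:

```python
import networkx as nx

# Sketch only: the full edge list must come from the figure. The three
# weights below (a-b, a-c, a-d) are stated in the working above; the
# remaining edges are placeholders to fill in.
G = nx.Graph()
G.add_weighted_edges_from([
    ("a", "b", 4), ("a", "c", 5), ("a", "d", 12),
    # ("b", "e", ...), ("c", "d", ...), ("c", "e", ...), ("d", "e", ...),
])

distances = nx.single_source_dijkstra_path_length(G, "a")
print(distances)  # with the full edge list this should match D(b)=4, D(c)=5, D(e)=11, D(d)=12
```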
b) Problem Statement:

Supply point 'a' can supply up to a maximum of 20 units per week to retailers 'f', 'g', and 'h' through
the supply network with capacities as shown. The demands at retailers 'f', 'g', and 'h' are 8, 6, and
10 units per week respectively. 'b', 'c', 'd', and 'e' are distributors, with the capacity of distributor 'e'
restricted at 6 units. Using Ford-Fulkerson's algorithm, explain which demands can be met.
Show your workings.

Solution:

Ford-Fulkerson Algorithm Steps:


* Initialization:
* Start with an initial flow of 0 on all edges.
* Find an augmenting path from the source (supply point 'a') to the sink (each retailer). An augmenting path is a
path with available capacity on each edge.
* Augment Flow:
* Determine the minimum capacity along the augmenting path.
* Increase the flow on each edge of the path by this minimum capacity.
* Decrease the capacity of each forward edge by the minimum capacity.
* Increase the capacity of each backward edge by the minimum capacity.
* Repeat:
* Repeat steps 1 and 2 until no more augmenting paths can be found.
Analysis:
* Initial Flow: All edges have a flow of 0.
* Augmenting Path 1: a -> b -> c -> f (capacity = 6)
* Increase flow on a->b, b->c, c->f by 6 units.
* Augmenting Path 2: a -> d -> e -> h (capacity = 4)
* Increase flow on a->d, d->e, e->h by 4 units.
* Augmenting Path 3: a -> b -> d -> e -> h (capacity = 2)
* Increase flow on a->b, b->d, d->e, e->h by 2 units.
Final Flow:
* Flow on a->b = 8 units
* Flow on b->c = 6 units
* Flow on c->f = 6 units
* Flow on a->d = 6 units
* Flow on d->e = 6 units
* Flow on e->h = 6 units
Demand Fulfillment:
* Retailer 'f': Demand of 8 units is fully met.
* Retailer 'g': Demand of 6 units cannot be fully met. Maximum of 4 units can be supplied.
* Retailer 'h': Demand of 10 units cannot be fully met. Maximum of 8 units can be supplied.
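
As a cross-check, a minimal max-flow sketch with networkx is given below. The edge capacities (and any edges into 'g') are placeholders that must be taken from the figure, and networkx's built-in maximum_flow solver is used rather than a hand-coded Ford-Fulkerson, so this is a verification aid, not the required working:

```python
import networkx as nx

# Sketch only: the edge capacities (and any edges into 'g') must be taken
# from the figure; the values below are placeholders. A super-sink 't'
# aggregates the three retailer demands.
G = nx.DiGraph()
placeholder_edges = [
    ("a", "b", 8), ("a", "d", 6), ("b", "c", 6), ("b", "d", 2),
    ("c", "f", 6), ("d", "e", 6), ("e", "h", 6),
]
for u, v, cap in placeholder_edges:
    G.add_edge(u, v, capacity=cap)

# Retailer demands become capacities into the super-sink.
for retailer, demand in [("f", 8), ("g", 6), ("h", 10)]:
    G.add_edge(retailer, "t", capacity=demand)

flow_value, flow_dict = nx.maximum_flow(G, "a", "t")
print(flow_value)  # total units delivered per week
print(flow_dict)   # per-edge flows; compare each retailer's inflow with its demand
```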
Question 6

500,000 posts about UK General Election are retrieved from social media platform X
(formerly Twitter) and converted into a TF-IDF matrix for the purposes of training a
sentiment prediction model.

a) What could be a potential benefit of including bigrams as features in the above
TF-IDF matrix? Explain using an example.

• Potential Benefit: Bigrams can capture the context in which words are used, because the
meaning of a word can change depending on the word that follows it.

* Example: Consider the bigram "prime minister." The individual words "prime" and "minister" have
different meanings in isolation. However, when combined, they form a specific term with a clear
meaning. Using the bigram "prime minister" as a feature would allow the model to better understand
the sentiment expressed in the text.

b) How could tuning "minimum document frequency" affect quantity and quality of features in
the above TF-IDF matrix? Explain using an example.
• Effect of Minimum Document Frequency: Minimum document frequency refers to the minimum
number of documents in which a word or phrase (n-gram) must appear before it is included as a
feature in the TF-IDF matrix.

* Higher minimum document frequency:
Reduces the number of features (smaller vocabulary).
Can improve the quality of features by excluding rare or nonsensical terms.
* Lower minimum document frequency:
Increases the number of features (larger vocabulary).
May include noisy or irrelevant features.

* Example: If the minimum document frequency is set to 30, only words or phrases that appear in at
least 30 documents will be considered features. This would exclude nonsensical bigrams like
"extent sense," which are unlikely to appear frequently in the data.

c) Explain why the resultant model may not perform well at predicting sentiments of posts
about US Presidential Election? What could be done to ensure that the model performs well?
• Reason for Poor Performance: The vocabulary used in the model is trained on UK General Election
posts. This vocabulary may be different from the vocabulary used in US Presidential Election
posts. The model might encounter many out-of-vocabulary words or phrases in the US data,
leading to inaccurate predictions.

• Improving Model Performance:


- Retrain the model on US data: Train a new model using a TF-IDF matrix generated from US
Presidential Election posts.
- Increase data size: Use a larger dataset of US Presidential Election posts to improve model
robustness and generalization.
- Use domain-specific word embeddings: Employ word embeddings trained on a large corpus of US
political text to capture semantic relationships specific to the US context.

Question 7 (35 Marks)

Refer to dataset at –
https://archive.ics.uci.edu/dataset/602/dry+bean+dataset
Use PCA and UMAP dimensionality reduction techniques in Python to visually explore
which types of dry beans are like one another.
(a)Discuss which technique worked better for visual exploration and why?

Both PCA and UMAP can be effective for visualizing the dry bean dataset, but UMAP generally
provides a more informative visualization in this case.
* PCA Visualization:
PCA can effectively separate the 'Bombay' beans from the other varieties, highlighting their
distinct characteristics.
However, PCA might not be able to capture finer details within the remaining varieties, such as
the distinctions between 'Dermason' and 'Sira' or 'Barbunya' and 'Cali'.

* UMAP Visualization:
* UMAP is better at capturing both global and local structure in the data.
* This allows it to reveal more detailed information about which similar varieties are closer
together in the feature space.
* For instance, UMAP might show that 'Dermason' and 'Sira' cluster more closely together,
indicating that these varieties have more similar characteristics.
Therefore, UMAP is generally preferred for visual exploration in this case because it provides a
more nuanced understanding of the relationships between different dry bean varieties.
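
A minimal sketch of producing the two visualizations, assuming the Dry Bean data has been exported locally to a CSV with a 'Class' column (the file name and parameter values are assumptions):

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import umap  # pip install umap-learn

# Assumes the Dry Bean data is available locally as a CSV with a 'Class'
# column; the file name is an assumption.
df = pd.read_csv("Dry_Bean_Dataset.csv")
X = StandardScaler().fit_transform(df.drop(columns=["Class"]))

X_pca = PCA(n_components=2, random_state=42).fit_transform(X)
X_umap = umap.UMAP(n_neighbors=15, random_state=42).fit_transform(X)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, emb, title in [(axes[0], X_pca, "PCA"), (axes[1], X_umap, "UMAP")]:
    for name in df["Class"].unique():
        mask = (df["Class"] == name).to_numpy()
        ax.scatter(emb[mask, 0], emb[mask, 1], s=3, label=name)
    ax.set_title(title)
axes[1].legend(markerscale=3, fontsize=7)
plt.show()
```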

b) Explain how your UMAP visualization changed with changing the nth nearest neighbour to
which each data point's radius was extended.

The 'n_neighbors' parameter in UMAP controls how much emphasis is placed on local versus
global structure.
* Low 'n_neighbors':
* When 'n_neighbors' is low, UMAP focuses more on local structure.
* This can reveal finer details within clusters, such as subtle differences between varieties.
* However, it might make the overall structure of the data less clear.
* High 'n_neighbors':
* When 'n_neighbors' is high, UMAP focuses more on global structure.
* This can provide a clearer overview of the main clusters and how they relate to each other.
* However, it might obscure some of the finer details within clusters.
Therefore, tuning the 'n_neighbors' parameter allows you to control the balance between local
and global structure in the UMAP visualization. By experimenting with different values, you can
find the setting that best reveals the information you are interested in about the dry bean
varieties.
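
A minimal sketch of such a sweep, reusing X from the previous sketch; the three n_neighbors values are illustrative choices, not prescribed settings:

```python
import matplotlib.pyplot as plt
import umap

# Reuses X (standardized dry bean features) from the previous sketch.
fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, n in zip(axes, (5, 15, 50)):
    emb = umap.UMAP(n_neighbors=n, random_state=42).fit_transform(X)
    ax.scatter(emb[:, 0], emb[:, 1], s=3)
    ax.set_title(f"n_neighbors = {n}")
plt.show()
```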
