Question & Answer
a) Curse of Dimensionality
When a TF-IDF matrix explodes in size with an excessive number of features, it encounters the curse of
dimensionality. This phenomenon manifests in several ways:
* Increased Data Requirements: Training machine learning models with a high-dimensional feature
space necessitates a massive amount of data to achieve robust performance. This can be impractical in
many real-world scenarios where data collection is expensive or limited.
* Computational Bottlenecks: Processing high-dimensional data often demands more computational
resources, leading to longer training times, higher memory usage, and potential hardware limitations.
This can impede the efficiency and scalability of machine learning pipelines.
b) Causes of Large Feature Sets
* Unnecessary N-grams: Consider including bigrams and trigrams only when they offer meaningful
context and contribute significantly to the document representation. For example, "artificial intelligence" is a
valuable bigram, while "the of and" contributes little and can be excluded.
* Vocabulary Size and Length: Large documents with highly diverse vocabularies will naturally contain
more unique feature terms. This can be mitigated to some extent by:
* Stop-Word Removal: Eliminating common, non-informative words like "the," "a," and "is" from the
vocabulary can help reduce feature space size.
* Stemming/Lemmatization: Normalizing words to their base forms (e.g., "running" -> "run," "better" ->
"good") can reduce feature space redundancy.
Here are three effective strategies to shrink the feature space size of your TF-IDF matrix:
* N-gram Selection: Implement a threshold-based or information gain-based approach to filter out bigrams
and trigrams that don't contribute significantly to document representation. This prevents the inclusion of
irrelevant or uninformative n-grams.
* Domain-Specific Knowledge: Consider incorporating domain expertise to identify and exclude feature
terms that hold little value in your specific application domain. Tailoring your feature space to your use case can
improve model performance and efficiency.
* Dimensionality Reduction Techniques: Techniques like Principal Component Analysis (PCA) can
project the high-dimensional data onto a lower-dimensional latent space while preserving the maximum
amount of variance. This allows you to retain the most informative features while significantly reducing the
feature space size.
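A minimal sketch of how these strategies can be combined in scikit-learn is shown below; the toy corpus, the min_df/max_features values, and the number of components are illustrative assumptions, and TruncatedSVD is used in place of PCA because it operates directly on sparse TF-IDF matrices.

```python
# Sketch: shrinking a TF-IDF feature space (toy corpus and parameter values are illustrative).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "artificial intelligence is transforming data analysis",
    "machine learning models need large amounts of data",
    "data pipelines benefit from careful feature engineering",
]

# Stop-word removal, restrained n-gram selection, and frequency thresholds all cap vocabulary size.
vectorizer = TfidfVectorizer(
    stop_words="english",   # drop common, non-informative words
    ngram_range=(1, 2),     # unigrams plus bigrams only; no trigrams
    min_df=1,               # raise this on a real corpus to drop rare n-grams
    max_features=5000,      # hard cap on vocabulary size
)
X = vectorizer.fit_transform(docs)

# Project onto a low-dimensional latent space while preserving as much variance as possible.
svd = TruncatedSVD(n_components=2, random_state=42)
X_reduced = svd.fit_transform(X)
print(X.shape, "->", X_reduced.shape)
```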
Question 2 (35 Marks)
Refer to dataset at - https://archive.ics.uci.edu/dataset/484/travel+reviews
Use K-Means and DBSCAN clustering techniques in Python to identify and label clusters of
users (i.e., travellers) with similar travel interests.
In this scenario, DBSCAN generally works better than K-Means for identifying clusters of travelers with similar
interests. Here's why:
* DBSCAN:
Handles clusters of varying shapes and sizes: DBSCAN is more flexible in identifying clusters that are not
necessarily spherical or of uniform size, which is common in real-world data.
Identifies outliers: DBSCAN can effectively identify and label outliers, which are travelers with unique interests
that don't fit into any defined cluster.
* K-Means:
Assumes spherical clusters: K-Means assumes that clusters are spherical and of similar size. This assumption
may not hold true for real-world travel data where interests can be diverse and cluster shapes can be irregular.
Sensitive to outliers: Outliers can significantly influence the centroid calculation in K-Means, leading to
inaccurate cluster assignments.
Dimensionality reduction is crucial before applying clustering algorithms like K-Means and DBSCAN for several
reasons:
* Curse of dimensionality: In high-dimensional spaces, Euclidean distances become less meaningful. This can
lead to inaccurate distance calculations and poor clustering results.
* Computational efficiency: Reducing the number of dimensions can significantly improve the computational
efficiency of clustering algorithms, especially for large datasets.
* Visualization: Dimensionality reduction techniques like PCA can project the data onto a lower-dimensional
space, often 2 or 3 dimensions, enabling visualization. This can help identify outliers, noise, and potential
cluster shapes, guiding the choice of clustering algorithm.
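A hedged sketch of this workflow is given below; the CSV file name, the 'User ID' column, the choice of k = 5, and the DBSCAN eps/min_samples values are assumptions about the UCI Travel Reviews dataset and would need to be checked and tuned against the actual download.

```python
# Sketch: clustering travellers with K-Means and DBSCAN (file name, column names and
# parameter values are assumptions, not taken from the question).
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans, DBSCAN

df = pd.read_csv("tripadvisor_review.csv")            # assumed file name from the UCI archive
X = df.drop(columns=["User ID"])                      # assumed ID column; keep the rating features

# Standardise and reduce dimensionality before distance-based clustering.
X_scaled = StandardScaler().fit_transform(X)
X_2d = PCA(n_components=2, random_state=42).fit_transform(X_scaled)

# K-Means: k is illustrative; tune with the elbow method or silhouette score in practice.
df["kmeans_label"] = KMeans(n_clusters=5, n_init=10, random_state=42).fit_predict(X_2d)

# DBSCAN: eps and min_samples are illustrative; a k-distance plot helps choose eps.
df["dbscan_label"] = DBSCAN(eps=0.5, min_samples=5).fit_predict(X_2d)   # label -1 marks outliers

print(df[["kmeans_label", "dbscan_label"]].value_counts().head())
```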
Question 3
a) Discuss why a TF-IDF matrix might be a better representation than a TF matrix for the above
text documents.
Term frequency (TF) only considers how often a term appears within a single document. While this can
highlight important terms, it doesn't account for how common those terms are across the entire corpus of
documents. For instance, the word "data" appears frequently in all the given documents. Using TF alone might
overemphasize its importance, even though it's a common word that doesn't necessarily distinguish one
document from another.
TF-IDF (Term Frequency-Inverse Document Frequency) addresses this by considering both the term
frequency and the inverse document frequency (IDF). IDF measures how rare a term is across the entire
corpus. Terms that appear in many documents have a low IDF, while rare terms have a high IDF.
By multiplying TF and IDF, TF-IDF gives more weight to terms that are frequent in a specific document but rare
in the overall collection. This helps to identify terms that are truly distinctive and meaningful for a particular
document.
Therefore, TF-IDF is generally a better representation than TF for text documents because it considers both
term frequency within a document and the rarity of that term across the entire corpus, leading to a more
nuanced and informative representation.
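The effect can be seen directly from the IDF weights that scikit-learn computes; the three example sentences below are made up for illustration and are not the documents from the question.

```python
# Sketch: a word that appears in every document ("data") receives the lowest possible IDF,
# so TF-IDF down-weights it relative to rarer, more distinctive terms.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "data science uses data every day",
    "data pipelines move data between systems",
    "neural networks learn representations from data",
]

vec = TfidfVectorizer()
vec.fit(docs)

for term, idf in sorted(zip(vec.get_feature_names_out(), vec.idf_), key=lambda t: t[1]):
    print(f"{term:16s} idf = {idf:.3f}")
```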
b) Discuss whether it would be a good idea to use bigrams as tokens in the TF-IDF
representation of the above text documents.
Using bigrams (two consecutive words) as tokens in TF-IDF can be beneficial in some cases, but it depends on
the specific characteristics of the text data.
PCA (Principal Component Analysis) should not be used here because we are dealing with discrete variables
(weekly sales transactions). PCA is primarily designed for continuous data. Applying it to discrete data can
lead to misleading results and distorted visualizations.
UMAP (Uniform Manifold Approximation and Projection) is a more suitable technique for visualizing this
dataset. UMAP is specifically designed to preserve local and global structure in high-dimensional data. It can
effectively handle non-linear relationships and complex structures, which are often present in real-world data
like weekly sales transactions.
Therefore, UMAP is likely to provide a more accurate and informative visualization of the relationships between
products based on their weekly sales patterns.
b) Label products with similar weekly sales transactions using the k-Means clustering
algorithm in Python. Does it work perfectly? Why or why not? Save labeled data to a CSV file
called Labelled.csv and upload this file.
When k-means is applied to five distinct clusters in the UMAP visualization, it generally works well to label those
clusters. K-means can effectively identify groups of products with similar weekly sales patterns.
However, there are a few caveats to consider:
* The choice of the number of clusters (k) is crucial. If the wrong value of k is chosen, the clustering results may
not accurately reflect the underlying structure of the data.
* K-means assumes that clusters are spherical in shape. If the true clusters are non-spherical or have complex
shapes, k-means may not be able to accurately identify them.
* Outliers and noise can significantly impact the performance of k-means. Outliers can distort the cluster
centers, while noise can make it difficult for k-means to identify meaningful patterns.
In the absence of outliers or noise, and assuming the clusters are relatively spherical, k-means can be a good
choice for labeling products with similar weekly sales patterns.
To save the labeled data, you can create a new column in your dataset to store the cluster labels assigned by k-
means. Then, you can save this modified dataset to a CSV file named “Labelled.csv”.
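A minimal sketch of that final step is shown below; the input file name, the use of a UMAP embedding, and k = 5 are assumptions based on the discussion above.

```python
# Sketch: label products with k-Means on a UMAP embedding and save to Labelled.csv
# (input file name and k are assumptions).
import pandas as pd
import umap                                   # assumes the umap-learn package is installed
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

df = pd.read_csv("weekly_sales.csv")          # assumed file of weekly sales transactions
X = StandardScaler().fit_transform(df.select_dtypes("number"))

# Embed first, as discussed above, then cluster the 2-D embedding with k = 5.
embedding = umap.UMAP(n_components=2, random_state=42).fit_transform(X)
df["cluster"] = KMeans(n_clusters=5, n_init=10, random_state=42).fit_predict(embedding)

df.to_csv("Labelled.csv", index=False)
```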
Question 5
a) In the following five-node undirected weighted network, each edge weight represents a
distance. Use Dijkstra's algorithm to calculate the shortest distance from node 'a' to all
other nodes in the network. Show your workings.
Step 1: Initialization
* Current Node: 'a'
* Unvisited Nodes: b, c, d, e
* Assigned Tentative Distances:
* a to b: 4
* a to c: 5
* a to d: 12
* a to e: ∞
Step 2: Process Node 'b'
* Current Node: 'b'
* Unvisited Nodes: c, d, e
* Tentative Distances from 'b':
* b to c: 10
* b to d: 13
* b to e: 14
* Compare and Update Tentative Distances:
* a to c: 5 (unchanged)
* a to d: 12 (unchanged)
* a to e: 14 (new distance)
* Mark Node 'b' as Visited
* Shortest Distance to 'b': D(b) = 4
* Parent Node of 'b': P(b) = 'a'
Step 3: Process Node 'c'
* Current Node: 'c'
* Unvisited Nodes: d, e
* Tentative Distances from 'c':
* c to d: 7
* c to e: 11
* Compare and Update Tentative Distances:
* a to d: 12 (unchanged)
* a to e: 11 (new distance)
* Mark Node 'c' as Visited
* Shortest Distance to 'c': D(c) = 5
* Parent Node of 'c': P(c) = 'a'
Step 4: Process Node 'e'
* Current Node: 'e'
* Unvisited Nodes: d
* Tentative Distances from 'e':
* e to d: 8
* Compare and Update Tentative Distances:
* a to d: 12 (unchanged)
* Mark Node 'e' as Visited
* Shortest Distance to 'e': D(e) = 11
* Parent Node of 'e': P(e) = 'c'
Step 5: Process Node 'd'
* Current Node: 'd'
* Unvisited Nodes: None
* Mark Node 'd' as Visited
* Shortest Distance to 'd': D(d) = 12
* Parent Node of 'd': P(d) = 'a'
Final Shortest Distances:
* D(a) = 0
* D(b) = 4
* D(c) = 5
* D(d) = 12
* D(e) = 11
Therefore, the shortest distances from node 'a' to all other nodes in the network are as follows:
* a to b: 4
* a to c: 5
* a to d: 12
* a to e: 11
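For reference, a generic Python version of this procedure is sketched below; the adjacency list is inferred from the working above (it reproduces the same distances) but should be verified against the figure.

```python
# Sketch: Dijkstra's algorithm with a binary heap. Edge weights below are inferred from the
# working above and must be checked against the network diagram.
import heapq

def dijkstra(graph, source):
    """Shortest distances from source to every node of an undirected weighted graph."""
    dist = {node: float("inf") for node in graph}
    dist[source] = 0
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue                          # stale heap entry
        for v, w in graph[u]:
            if d + w < dist[v]:
                dist[v] = d + w
                heapq.heappush(heap, (dist[v], v))
    return dist

graph = {
    "a": [("b", 4), ("c", 5), ("d", 12)],
    "b": [("a", 4), ("c", 6), ("d", 9), ("e", 10)],
    "c": [("a", 5), ("b", 6), ("d", 7), ("e", 6)],
    "d": [("a", 12), ("b", 9), ("c", 7), ("e", 8)],
    "e": [("b", 10), ("c", 6), ("d", 8)],
}
print(dijkstra(graph, "a"))   # {'a': 0, 'b': 4, 'c': 5, 'd': 12, 'e': 11}
```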
b) Problem Statement:
Supply point 'a' can supply up to a maximum of 20 units per week to retailers 'f', 'g', and 'h' through
the supply network with capacities as shown. The demands at retailers 'f', 'g', and 'h' are 8, 6, and
10 units per week respectively. 'b', 'c', 'd', and 'e' are distributors, with the capacity of distributor 'e'
restricted at 6 units. Using Ford-Fulkerson's algorithm, explain which demands can be met.
Show your workings.
Solution:
500,000 posts about UK General Election are retrieved from social media platform X
(formerly Twitter) and converted into a TF-IDF matrix for the purposes of training a
sentiment prediction model.
a) What could be a potential benefit of including bigrams as features in the above TF-
IDF matrix? Explain using an example.
• Potential Benefit: Bigrams can capture the context in which words might have been used. This is
because the meaning of a word can change depending on the word that follows it.
* Example: Consider the bigram "prime minister." The individual words "prime" and "minister" have
different meanings in isolation. However, when combined, they form a specific term with a clear
meaning. Using the bigram "prime minister" as a feature would allow the model to better understand
the sentiment expressed in the text.
b) How could tuning "minimum document frequency" affect quantity and quality of features in
the above TF-IDF matrix? Explain using an example.
• Effect of Minimum Document Frequency: Minimum document frequency is the minimum number
of documents in which a word or phrase (an n-gram) must appear before it is included as a feature in
the TF-IDF matrix. Raising it reduces the number of features (quantity) and filters out rare, noisy
n-grams (often improving quality), although setting it too high can also discard genuinely informative
but infrequent terms.
* Example: If the minimum document frequency is set to 30, only words or phrases that appear in at
least 30 documents will be considered features. This would exclude nonsensical bigrams like
"extent sense," which are unlikely to appear frequently in the data.
c) Explain why the resultant model may not perform well at predicting sentiments of posts
about US Presidential Election? What could be done to ensure that the model performs well?
• Reason for Poor Performance: The model's vocabulary is learned from UK General Election
posts and may differ substantially from the vocabulary used in US Presidential Election posts
(different candidate and party names, issues, and slang). The model would therefore encounter
many out-of-vocabulary words or phrases in the US data, leading to inaccurate predictions.
• What could be done: Retrain, or at least fine-tune, the model on a labelled sample of US
Presidential Election posts, and refit the TF-IDF vectorizer on a corpus that includes US posts so
that US-specific terms enter the vocabulary.
Refer to dataset at –
https://archive.ics.uci.edu/dataset/602/dry+bean+dataset
Use PCA and UMAP dimensionality reduction techniques in Python to visually explore
which types of dry beans are like one another.
(a) Discuss which technique worked better for visual exploration and why?
Both PCA and UMAP can be effective for visualizing the dry bean dataset, but UMAP generally
provides a more informative visualization in this case.
* PCA Visualization:
PCA can effectively separate the 'Bombay' beans from the other varieties, highlighting their
distinct characteristics.
However, PCA might not be able to capture finer details within the remaining varieties, such as
the distinctions between 'Dermason' and 'Sira' or 'Barbunya' and 'Cali'.
* UMAP Visualization:
* UMAP is better at capturing both global and local structure in the data.
* This allows it to reveal more detailed information about which similar varieties are closer
together in the feature space.
* For instance, UMAP might show that 'Dermason' and 'Sira' cluster more closely together,
indicating that these varieties have more similar characteristics.
Therefore, UMAP is generally preferred for visual exploration in this case because it provides a
more nuanced understanding of the relationships between different dry bean varieties.
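A sketch of the comparison described above follows; the Excel file name inside the UCI archive and the 'Class' column are assumptions to be checked against the download, and umap-learn is assumed to be installed.

```python
# Sketch: side-by-side PCA and UMAP projections of the Dry Bean dataset
# (file name and 'Class' column are assumptions).
import pandas as pd
import matplotlib.pyplot as plt
import umap
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

df = pd.read_excel("Dry_Bean_Dataset.xlsx")           # assumed file name from the UCI archive
labels = df["Class"]
X = StandardScaler().fit_transform(df.drop(columns=["Class"]))

pca_2d = PCA(n_components=2).fit_transform(X)
umap_2d = umap.UMAP(n_neighbors=15, random_state=42).fit_transform(X)

fig, axes = plt.subplots(1, 2, figsize=(12, 5))
for ax, emb, title in [(axes[0], pca_2d, "PCA"), (axes[1], umap_2d, "UMAP")]:
    for cls in labels.unique():
        mask = (labels == cls).to_numpy()
        ax.scatter(emb[mask, 0], emb[mask, 1], s=4, label=cls)
    ax.set_title(title)
axes[0].legend(markerscale=3, fontsize=8)
plt.show()
```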
b) Explain how your UMAP visualization changed with changing the nth nearest neighbour to
which each data point's radius was extended.
The 'n_neighbors' parameter in UMAP controls how much emphasis is placed on local versus
global structure.
* Low 'n_neighbors':
* When 'n_neighbors' is low, UMAP focuses more on local structure.
* This can reveal finer details within clusters, such as subtle differences between varieties.
* However, it might make the overall structure of the data less clear.
* High 'n_neighbors':
* When 'n_neighbors' is high, UMAP focuses more on global structure.
* This can provide a clearer overview of the main clusters and how they relate to each other.
* However, it might obscure some of the finer details within clusters.
Therefore, tuning the 'n_neighbors' parameter allows you to control the balance between local
and global structure in the UMAP visualization. By experimenting with different values, you can
find the setting that best reveals the information you are interested in about the dry bean
varieties.
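A short sketch of that experiment is shown below; it assumes the scaled feature matrix X and the class labels from the previous part are already in memory, and the three n_neighbors values are arbitrary choices.

```python
# Sketch: re-run UMAP with different n_neighbors values and compare the embeddings
# (assumes X and labels are already defined as in the previous sketch).
import matplotlib.pyplot as plt
import pandas as pd
import umap

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, n in zip(axes, [5, 15, 100]):
    emb = umap.UMAP(n_neighbors=n, random_state=42).fit_transform(X)
    ax.scatter(emb[:, 0], emb[:, 1], s=3, c=pd.factorize(labels)[0], cmap="tab10")
    ax.set_title(f"n_neighbors = {n}")   # low values stress local, high values global structure
plt.show()
```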