
Question 1: TF-IDF Matrix and Feature Reduction (30 Marks)

a) Curse of Dimensionality
When a TF-IDF matrix explodes in size with an excessive number of features, it encounters the curse of
dimensionality. This phenomenon manifests in several ways:

* Increased Data Requirements: Training machine learning models with a high-dimensional feature
space necessitates a massive amount of data to achieve robust performance. This can be impractical in
many real-world scenarios where data collection is expensive or limited.
* Computational Bottlenecks: Processing high-dimensional data often demands more computational
resources, leading to longer training times, higher memory usage, and potential hardware limitations.
This can impede the efficiency and scalability of machine learning pipelines.
b) Causes of Large Feature Sets

* Unnecessary N-grams: Consider including bigrams and trigrams only when they offer meaningful
context and contribute significantly to the document representation. For example, "artificial intelligence" is a
valuable bigram, while "the of and" contributes little and can be excluded.
* Vocabulary Size and Length: Large documents with highly diverse vocabularies will naturally contain
more unique feature terms. This can be mitigated to some extent by:
* Stop-Word Removal: Eliminating common, non-informative words like "the," "a," and "is" from the
vocabulary can help reduce feature space size.
* Stemming/Lemmatization: Normalizing words to their base forms (e.g., "running" -> "run," "better" ->
"good") can reduce feature space redundancy.

c) Feature Reduction Techniques

Here are three effective strategies to shrink the feature space size of your TF-IDF matrix (a combined code sketch follows this list):
* N-gram Selection: Implement a threshold-based or information gain-based approach to filter out bigrams
and trigrams that don't contribute significantly to document representation. This prevents the inclusion of
irrelevant or uninformative n-grams.

* Domain-Specific Knowledge: Consider incorporating domain expertise to identify and exclude feature
terms that hold little value in your specific application domain. Tailoring your feature space to your use case can
improve model performance and efficiency.

* Dimensionality Reduction Techniques: Techniques like Principal Component Analysis (PCA) can
project the high-dimensional data onto a lower-dimensional latent space while preserving the maximum
amount of variance. This allows you to retain the most informative features while significantly reducing the
feature space size.
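
A minimal sketch combining these ideas with scikit-learn. TruncatedSVD is used here as the sparse-friendly analogue of PCA, and the corpus, n-gram range, thresholds, and component count are placeholder assumptions:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Placeholder corpus; replace with the real documents.
docs = [
    "artificial intelligence is transforming data mining",
    "data mining extracts patterns from large data sets",
    "intelligence tests measure reasoning ability",
]

# Limit the vocabulary: unigrams + bigrams, drop stop words,
# and keep only terms that appear in at least 2 documents.
vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words="english", min_df=2)
X = vectorizer.fit_transform(docs)

# Project onto a lower-dimensional latent space, keeping most of the variance.
svd = TruncatedSVD(n_components=2, random_state=42)
X_reduced = svd.fit_transform(X)
print(X.shape, "->", X_reduced.shape)
```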
Question 2 (35 Marks)
Refer to dataset at - https://archive.ics.uci.edu/dataset/484/travel+reviews
Use K-Means and DBSCAN clustering techniques in Python to identify and label clusters of
users (i.e., travellers) with similar travel interests.

a) Discuss which technique worked better and why?

In this scenario, DBSCAN generally works better than K-Means for identifying clusters of travelers with similar
interests. Here's why (a minimal code sketch of both methods follows this comparison):
* DBSCAN:
Handles clusters of varying shapes and sizes: DBSCAN is more flexible in identifying clusters that are not
necessarily spherical or of uniform size, which is common in real-world data.
Identifies outliers: DBSCAN can effectively identify and label outliers, which are travelers with unique interests
that don't fit into any defined cluster.
* K-Means:
Assumes spherical clusters: K-Means assumes that clusters are spherical and of similar size. This assumption
may not hold true for real-world travel data where interests can be diverse and cluster shapes can be irregular.
Sensitive to outliers: Outliers can significantly influence the centroid calculation in K-Means, leading to
inaccurate cluster assignments.
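
A minimal sketch of both methods, assuming the Travel Reviews data has been saved locally as tripadvisor_review.csv with a user-ID column followed by the category-average columns (the file name, k, eps, and min_samples are assumptions to tune):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, DBSCAN

# Assumes the UCI Travel Reviews file has been downloaded locally; the file
# name and the assumption that the first column is a user ID are placeholders.
df = pd.read_csv("tripadvisor_review.csv")
X = StandardScaler().fit_transform(df.iloc[:, 1:])  # drop the user-ID column

# k, eps and min_samples are illustrative values to tune (e.g. silhouette
# scores for k, a k-distance plot for eps).
kmeans_labels = KMeans(n_clusters=5, n_init=10, random_state=42).fit_predict(X)
dbscan_labels = DBSCAN(eps=0.9, min_samples=5).fit_predict(X)  # -1 marks outliers

print(pd.Series(kmeans_labels).value_counts())
print(pd.Series(dbscan_labels).value_counts())
```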

b) Explain why it may be essential to use dimensionality reduction before implementing these techniques?

Dimensionality reduction is crucial before applying clustering algorithms like K-Means and DBSCAN for several
reasons (a short PCA sketch follows this list):
* Curse of dimensionality: In high-dimensional spaces, Euclidean distances become less meaningful. This can
lead to inaccurate distance calculations and poor clustering results.
* Computational efficiency: Reducing the number of dimensions can significantly improve the computational
efficiency of clustering algorithms, especially for large datasets.
* Visualization: Dimensionality reduction techniques like PCA can project the data onto a lower-dimensional
space, often 2 or 3 dimensions, enabling visualization. This can help identify outliers, noise, and potential
cluster shapes, guiding the choice of clustering algorithm.
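
A short sketch, reusing the standardized matrix X from the previous sketch (the choice of two components is an assumption to check against the explained variance):

```python
from sklearn.decomposition import PCA

# Reuses X (standardized features) from the previous sketch.
# n_components=2 is an assumption; inspect the explained variance before settling on it.
pca = PCA(n_components=2, random_state=42)
X_2d = pca.fit_transform(X)
print(pca.explained_variance_ratio_)
```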
Question 3
a) Discuss why a TF-IDF matrix might be a better representation than a TF matrix for the above
text documents?

Term frequency (TF) only considers how often a term appears within a single document. While this can
highlight important terms, it doesn't account for how common those terms are across the entire corpus of
documents. For instance, the word "data" appears frequently in all the given documents. Using TF alone might
overemphasize its importance, even though it's a common word that doesn't necessarily distinguish one
document from another.

TF-IDF (Term Frequency-Inverse Document Frequency) addresses this by considering both the term
frequency and the inverse document frequency (IDF). IDF measures how rare a term is across the entire
corpus. Terms that appear in many documents have a low IDF, while rare terms have a high IDF.

By multiplying TF and IDF, TF-IDF gives more weight to terms that are frequent in a specific document but rare
in the overall collection. This helps to identify terms that are truly distinctive and meaningful for a particular
document.

Therefore, TF-IDF is generally a better representation than TF for text documents because it considers both
term frequency within a document and the rarity of that term across the entire corpus, leading to a more
nuanced and informative representation.
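
A small sketch contrasting the two representations with scikit-learn; the documents are placeholders standing in for the ones referenced above:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Placeholder documents standing in for the ones referenced in the question.
docs = [
    "data mining finds patterns in data",
    "data visualization presents data clearly",
    "neural networks learn from data",
]

tf = CountVectorizer().fit(docs)
tfidf = TfidfVectorizer().fit(docs)

# "data" occurs in every document, so TF-IDF down-weights it relative to
# rarer, more distinctive terms such as "mining" or "networks".
print(tf.get_feature_names_out())
print(tf.transform(docs).toarray())
print(tfidf.transform(docs).toarray().round(2))
```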

b) Discuss whether it would be a good idea to use bigrams as tokens in the TF-IDF
representation of the above text documents?

Using bigrams (two consecutive words) as tokens in TF-IDF can be beneficial in some cases, but it depends on
the specific characteristics of the text data.

Potential Benefits:


* Capturing Context: Bigrams can help capture semantic relationships and context that might be missed by
using individual words (unigrams). For example, the phrase "data mining" is more meaningful than the
individual words "data" and "mining" in this context.
* Improved Discrimination: Bigrams can sometimes help to better distinguish between documents with similar
word frequencies but different word order or phrase usage.
Potential Drawbacks:
* Data Sparsity: Bigrams can lead to data sparsity, especially in smaller corpora. Many bigrams may only
appear in a few documents or even just once, making it difficult to calculate meaningful TF-IDF scores.
* Noise: Not all bigrams are meaningful. Including many irrelevant or noisy bigrams can introduce noise into the
TF-IDF representation.
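
A small sketch of the sparsity trade-off, again with placeholder documents: adding bigrams to the same corpus noticeably inflates the vocabulary:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Placeholder documents; with a small corpus the effect shows up simply
# as vocabulary growth.
docs = [
    "data mining finds patterns in data",
    "data visualization presents data clearly",
    "neural networks learn from data",
]

unigrams = TfidfVectorizer(ngram_range=(1, 1)).fit(docs)
uni_and_bi = TfidfVectorizer(ngram_range=(1, 2)).fit(docs)

print(len(unigrams.vocabulary_), "unigram features")
print(len(uni_and_bi.vocabulary_), "unigram + bigram features")
```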
Question 4
Refer to dataset at –
https://archive.ics.uci.edu/dataset/396/sales+transactions+dataset+weekly
Use PCA and UMAP dimensionality reduction techniques in Python to visually explore
which products had similar weekly sales transactions over the course of 52 weeks.
a) Explain which technique worked better for visual exploration and why?

PCA (Principal Component Analysis) is a poor fit here because we are dealing with discrete count variables
(weekly sales transactions). PCA is primarily designed for continuous data with linear correlations, so applying it
to discrete count data can lead to misleading results and distorted visualizations.

UMAP (Uniform Manifold Approximation and Projection) is a more suitable technique for visualizing this
dataset. UMAP is specifically designed to preserve local and global structure in high-dimensional data. It can
effectively handle non-linear relationships and complex structures, which are often present in real-world data
like weekly sales transactions.

Therefore, UMAP is likely to provide a more accurate and informative visualization of the relationships between
products based on their weekly sales patterns.
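
A minimal sketch of both projections, assuming the UCI file has been saved locally and that the weekly counts are in columns named 'W0' to 'W51' (the file and column names are assumptions):

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import umap  # pip install umap-learn

# Assumes the UCI file is saved locally; file and column names are assumptions.
df = pd.read_csv("Sales_Transactions_Dataset_Weekly.csv")
weeks = [c for c in df.columns if c.startswith("W") and c[1:].isdigit()]
X = StandardScaler().fit_transform(df[weeks])

X_pca = PCA(n_components=2, random_state=42).fit_transform(X)
X_umap = umap.UMAP(n_neighbors=15, random_state=42).fit_transform(X)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(X_pca[:, 0], X_pca[:, 1], s=5)
axes[0].set_title("PCA")
axes[1].scatter(X_umap[:, 0], X_umap[:, 1], s=5)
axes[1].set_title("UMAP")
plt.show()
```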

b) Label products with similar weekly sales transactions using the k-Means clustering
algorithm in Python. Does it work perfectly? Why or why not? Save labeled data to a CSV file
called Labelled.csv and upload this file.

When k-means is applied to the five distinct clusters in the UMAP visualization, it generally works well to label those
clusters. K-means can effectively identify groups of products with similar weekly sales patterns.
However, there are a few caveats to consider:
* The choice of the number of clusters (k) is crucial. If the wrong value of k is chosen, the clustering results may
not accurately reflect the underlying structure of the data.
* K-means assumes that clusters are spherical in shape. If the true clusters are non-spherical or have complex
shapes, k-means may not be able to accurately identify them.
* Outliers and noise can significantly impact the performance of k-means. Outliers can distort the cluster
centers, while noise can make it difficult for k-means to identify meaningful patterns.
In the absence of outliers or noise, and assuming the clusters are relatively spherical, k-means can be a good
choice for labeling products with similar weekly sales patterns.
To save the labeled data, you can create a new column in your dataset to store the cluster labels assigned by
k-means. Then, you can save this modified dataset to a CSV file named “Labelled.csv”.
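
A minimal sketch of that final step, reusing df and X_umap from the previous sketch (k = 5 is an assumption based on the clusters visible in the UMAP plot):

```python
from sklearn.cluster import KMeans

# Reuses df and X_umap from the previous sketch; k=5 is an assumption based
# on the number of clusters visible in the UMAP plot.
labels = KMeans(n_clusters=5, n_init=10, random_state=42).fit_predict(X_umap)

df["Cluster"] = labels
df.to_csv("Labelled.csv", index=False)
```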
Question 5
a) In the following five-node undirected weighted network, each edge weight represents a
distance. Use Dijkstra's algorithm to calculate the shortest distance from node 'a' to all
other nodes in the network. Show your workings.

Step 1: Initialization
* Current Node: 'a'
* Unvisited Nodes: b, c, d, e
* Assigned Tentative Distances:
* a to b: 4
* a to c: 5
* a to d: 12
* a to e: ∞
Step 2: Process Node 'b'
* Current Node: 'b'
* Unvisited Nodes: c, d, e
* Tentative Distances from 'b':
* b to c: 10
* b to d: 13
* b to e: 14
* Compare and Update Tentative Distances:
* a to c: 5 (unchanged)
* a to d: 12 (unchanged)
* a to e: 14 (new distance)
* Mark Node 'b' as Visited
* Shortest Distance to 'b': D(b) = 4
* Parent Node of 'b': P(b) = 'a'
Step 3: Process Node 'c'
* Current Node: 'c'
* Unvisited Nodes: d, e
* Tentative Distances from 'c':
* c to d: 7
* c to e: 11
* Compare and Update Tentative Distances:
* a to d: 12 (unchanged)
* a to e: 11 (new distance)
* Mark Node 'c' as Visited
* Shortest Distance to 'c': D(c) = 5
* Parent Node of 'c': P(c) = 'a'
Step 4: Process Node 'e'
* Current Node: 'e'
* Unvisited Nodes: d
* Tentative Distances from 'e':
* e to d: 8
* Compare and Update Tentative Distances:
* a to d: 12 (unchanged)
* Mark Node 'e' as Visited
* Shortest Distance to 'e': D(e) = 11
* Parent Node of 'e': P(e) = 'c'
Step 5: Process Node 'd'
* Current Node: 'd'
* Unvisited Nodes: None
* Mark Node 'd' as Visited
* Shortest Distance to 'd': D(d) = 12
* Parent Node of 'd': P(d) = 'a'
Final Shortest Distances:
* D(a) = 0
* D(b) = 4
* D(c) = 5
* D(d) = 12
* D(e) = 11
Therefore, the shortest distances from node 'a' to all other nodes in the network are as follows:
* a to b: 4
* a to c: 5
* a to d: 12
* a to e: 11
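
As a cross-check, a minimal networkx sketch is given below. Only the edge weights stated in the initialization (a-b, a-c, a-d) are taken from the working; the remaining edges are placeholders to be filled in from the figure:

```python
import networkx as nx

# Sketch only: the full edge list must come from the figure. The three
# weights below (a-b, a-c, a-d) are stated in the working above; the
# remaining edges are placeholders to fill in.
G = nx.Graph()
G.add_weighted_edges_from([
    ("a", "b", 4), ("a", "c", 5), ("a", "d", 12),
    # ("b", "e", ...), ("c", "d", ...), ("c", "e", ...), ("d", "e", ...),
])

distances = nx.single_source_dijkstra_path_length(G, "a")
print(distances)  # with the full edge list this should match D(b)=4, D(c)=5, D(e)=11, D(d)=12
```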
b) Problem Statement:

Supply point 'a' can supply up to a maximum of 20 units per week to retailers 'f', 'g', and 'h' through
the supply network with capacities as shown. The demands at retailers 'f', 'g', and 'h' are 8, 6, and
10 units per week respectively. 'b', 'c', 'd', and 'e' are distributors, with the capacity of distributor 'e'
restricted at 6 units. Using Ford-Fulkerson's algorithm, explain which demands can be met.
Show your workings.

Solution:

Ford-Fulkerson Algorithm Steps:


* Initialization:
* Start with an initial flow of 0 on all edges.
* Find an augmenting path from the source (supply point 'a') to the sink (each retailer). An augmenting path is a
path with available capacity on each edge.
* Augment Flow:
* Determine the minimum capacity along the augmenting path.
* Increase the flow on each edge of the path by this minimum capacity.
* Decrease the capacity of each forward edge by the minimum capacity.
* Increase the capacity of each backward edge by the minimum capacity.
* Repeat:
* Repeat steps 1 and 2 until no more augmenting paths can be found.
Analysis:
* Initial Flow: All edges have a flow of 0.
* Augmenting Path 1: a -> b -> c -> f (capacity = 6)
* Increase flow on a->b, b->c, c->f by 6 units.
* Augmenting Path 2: a -> d -> e -> h (capacity = 4)
* Increase flow on a->d, d->e, e->h by 4 units.
* Augmenting Path 3: a -> b -> d -> e -> h (capacity = 2)
* Increase flow on a->b, b->d, d->e, e->h by 2 units.
Final Flow:
* Flow on a->b = 8 units
* Flow on b->c = 6 units
* Flow on c->f = 6 units
* Flow on a->d = 6 units
* Flow on d->e = 6 units
* Flow on e->h = 6 units
Demand Fulfillment:
* Retailer 'f': Demand of 8 units is fully met.
* Retailer 'g': Demand of 6 units cannot be fully met. Maximum of 4 units can be supplied.
* Retailer 'h': Demand of 10 units cannot be fully met. Maximum of 8 units can be supplied.
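
As a cross-check, a minimal max-flow sketch with networkx is given below. The edge capacities (and any edges into 'g') are placeholders that must be taken from the figure, and networkx's built-in maximum_flow solver is used rather than a hand-coded Ford-Fulkerson, so this is a verification aid, not the required working:

```python
import networkx as nx

# Sketch only: the edge capacities (and any edges into 'g') must be taken
# from the figure; the values below are placeholders. A super-sink 't'
# aggregates the three retailer demands.
G = nx.DiGraph()
placeholder_edges = [
    ("a", "b", 8), ("a", "d", 6), ("b", "c", 6), ("b", "d", 2),
    ("c", "f", 6), ("d", "e", 6), ("e", "h", 6),
]
for u, v, cap in placeholder_edges:
    G.add_edge(u, v, capacity=cap)

# Retailer demands become capacities into the super-sink.
for retailer, demand in [("f", 8), ("g", 6), ("h", 10)]:
    G.add_edge(retailer, "t", capacity=demand)

flow_value, flow_dict = nx.maximum_flow(G, "a", "t")
print(flow_value)  # total units delivered per week
print(flow_dict)   # per-edge flows; compare each retailer's inflow with its demand
```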
Question 6

500,000 posts about UK General Election are retrieved from social media platform X
(formerly Twitter) and converted into a TF-IDF matrix for the purposes of training a
sentiment prediction model.

a) What could be a potential benefit of including bigrams as features in the above
TF-IDF matrix? Explain using an example.

• Potential Benefit: Bigrams can capture the context in which words are used, because the
meaning of a word can change depending on the word that follows it.

* Example: Consider the bigram "prime minister." The individual words "prime" and "minister" have
different meanings in isolation. However, when combined, they form a specific term with a clear
meaning. Using the bigram "prime minister" as a feature would allow the model to better understand
the sentiment expressed in the text.

b) How could tuning "minimum document frequency" affect quantity and quality of features in
the above TF-IDF matrix? Explain using an example.
• Effect of Minimum Document Frequency: Minimum document frequency refers to the minimum
number of documents in which a word or phrase (n-gram) must appear before it is included as a
feature in the TF-IDF matrix.

* Higher minimum document frequency:
Reduces the number of features (smaller vocabulary).
Can improve the quality of features by excluding rare or nonsensical terms.
* Lower minimum document frequency:
Increases the number of features (larger vocabulary).
May include noisy or irrelevant features.

* Example: If the minimum document frequency is set to 30, only words or phrases that appear in at
least 30 documents will be considered features. This would exclude nonsensical bigrams like
"extent sense," which are unlikely to appear frequently in the data.

c) Explain why the resultant model may not perform well at predicting sentiments of posts
about US Presidential Election? What could be done to ensure that the model performs well?
• Reason for Poor Performance: The vocabulary used in the model is trained on UK General Election
posts. This vocabulary may be different from the vocabulary used in US Presidential Election
posts. The model might encounter many out-of-vocabulary words or phrases in the US data,
leading to inaccurate predictions.

• Improving Model Performance:


- Retrain the model on US data: Train a new model using a TF-IDF matrix generated from US
Presidential Election posts.
- Increase data size: Use a larger dataset of US Presidential Election posts to improve model
robustness and generalization.
- Use domain-specific word embeddings: Employ word embeddings trained on a large corpus of US
political text to capture semantic relationships specific to the US context.

Question 7 (35 Marks)

Refer to dataset at –
https://archive.ics.uci.edu/dataset/602/dry+bean+dataset
Use PCA and UMAP dimensionality reduction techniques in Python to visually explore
which types of dry beans are like one another.
(a)Discuss which technique worked better for visual exploration and why?

Both PCA and UMAP can be effective for visualizing the dry bean dataset, but UMAP generally
provides a more informative visualization in this case.
* PCA Visualization:
PCA can effectively separate the 'Bombay' beans from the other varieties, highlighting their
distinct characteristics.
However, PCA might not be able to capture finer details within the remaining varieties, such as
the distinctions between 'Dermason' and 'Sira' or 'Barbunya' and 'Cali'.

* UMAP Visualization:
* UMAP is better at capturing both global and local structure in the data.
* This allows it to reveal more detailed information about which similar varieties are closer
together in the feature space.
* For instance, UMAP might show that 'Dermason' and 'Sira' cluster more closely together,
indicating that these varieties have more similar characteristics.
Therefore, UMAP is generally preferred for visual exploration in this case because it provides a
more nuanced understanding of the relationships between different dry bean varieties.
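
A minimal sketch of producing the two visualizations, assuming the Dry Bean data has been exported locally to a CSV with a 'Class' column (the file name and parameter values are assumptions):

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import umap  # pip install umap-learn

# Assumes the Dry Bean data is available locally as a CSV with a 'Class'
# column; the file name is an assumption.
df = pd.read_csv("Dry_Bean_Dataset.csv")
X = StandardScaler().fit_transform(df.drop(columns=["Class"]))

X_pca = PCA(n_components=2, random_state=42).fit_transform(X)
X_umap = umap.UMAP(n_neighbors=15, random_state=42).fit_transform(X)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, emb, title in [(axes[0], X_pca, "PCA"), (axes[1], X_umap, "UMAP")]:
    for name in df["Class"].unique():
        mask = (df["Class"] == name).to_numpy()
        ax.scatter(emb[mask, 0], emb[mask, 1], s=3, label=name)
    ax.set_title(title)
axes[1].legend(markerscale=3, fontsize=7)
plt.show()
```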

b) Explain how your UMAP visualization changed with changing the nth nearest neighbour to
which each data point's radius was extended.

The 'n_neighbors' parameter in UMAP controls how much emphasis is placed on local versus
global structure.
* Low 'n_neighbors':
* When 'n_neighbors' is low, UMAP focuses more on local structure.
* This can reveal finer details within clusters, such as subtle differences between varieties.
* However, it might make the overall structure of the data less clear.
* High 'n_neighbors':
* When 'n_neighbors' is high, UMAP focuses more on global structure.
* This can provide a clearer overview of the main clusters and how they relate to each other.
* However, it might obscure some of the finer details within clusters.
Therefore, tuning the 'n_neighbors' parameter allows you to control the balance between local
and global structure in the UMAP visualization. By experimenting with different values, you can
find the setting that best reveals the information you are interested in about the dry bean
varieties.
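
A minimal sketch of such a sweep, reusing X from the previous sketch; the three n_neighbors values are illustrative choices, not prescribed settings:

```python
import matplotlib.pyplot as plt
import umap

# Reuses X (standardized dry bean features) from the previous sketch.
fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, n in zip(axes, (5, 15, 50)):
    emb = umap.UMAP(n_neighbors=n, random_state=42).fit_transform(X)
    ax.scatter(emb[:, 0], emb[:, 1], s=3)
    ax.set_title(f"n_neighbors = {n}")
plt.show()
```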
