
CS3244 Machine Learning 2025
National University of Singapore
Prof Lee Wee Sun and Prof Wang Ye

Tutorial 1
1. Learning Paradigms.
Describe different instances of learning problems for the following scenarios. For each
scenario, identify and describe a supervised and an unsupervised learning problem. For one
of the problems, formalize the given components of the learning problem: input, output,
data. You need not describe the hypothesis or the target function (to think about: why?).

(a) The NUS (or other university) student domain. Students and faculty encounter
many problems. Describe problems that you or your faculty may encounter on a
daily, weekly, or semesterly basis.
Solution: There are many possible answers; yours is likely to vary. Can you
think of other problems?
Supervised. In supervised learning, for each instance, we get a correct output. For
example, given a student’s transcript and the list of modules that the student is currently
taking, predict their performance in a module in terms of CAP in order to advise
them. This particular task is a regression task, since the output is a continuous value.
Unsupervised. In unsupervised learning, we have no outputs (labels) given to us –
the task is to explore and make groupings from the data based solely on the input. For
example, given a historical mapping of which students took which NUS module, cluster
the modules by similarity in order to recommend modules to students. This particular
task models modules not in terms of content (e.g., host faculty), but by social signals –
similar to the notion of collaborative filtering.
Another example is to cluster students based on historical data of which classes they
have attended together.
(b) Trans-shipment Logistics. One of Singapore’s mainstay sources of income for decades
has been trans-shipment and its associated logistics. Hypothesize problems that occur
in this scenario.
Solution: There are many possible answers, your answer is likely to vary. Can you
think of other tasks?
Supervised. Vessel arrival prediction is useful for anticipating delays and for scheduling.
Given input attributes x that describe a vessel, such as its tonnage, weather
conditions en route, ports of call, and owner, output the expected deviation from the
scheduled ETA of the vessel. Since this is a supervised problem, the data would consist
of many such (input, output) pairs.
Unsupervised. Clustering shipments by their content and attributes (such as time)
can help in understanding the types of goods sent over similar periods, their sources, and so on.

2. k Nearest Neighbour.

(a) Suppose you are given the following data (as shown in Table 1) where x and y are the
two input variables and Origin is the dependent variable.

 x    y   Origin
-1    1     -
 0    1     +
 0    2     -
 1   -1     -
 1    0     +
 1    2     +
 2    2     -
 2    3     +

Table 1: The dataset for kNN.

Figure 1: Data points of the k-NN dataset in 2D space.

Figure 1 is a scatter plot which shows the above data in 2D space. Suppose that you
want to predict the class of a new data point at x = 1 and y = 1 using Euclidean distance
with 3-NN. Which class does this data point belong to? What is the difference if we
change the algorithm to use 7-NN instead?
Solution: In 3-NN, the nearest neighbours are (0, 1), (1, 0) and (1, 2), all at distance 1
and all labelled +. In 7-NN, every point except the one at (2, 3) is a neighbour, giving
four − labels against three +. Therefore, 3-NN will classify the new data point at (1, 1)
as + whereas 7-NN will classify it as −.

(b) Suppose you are given the following images (Figure 2). Your task is to compare the
values of k used in each image, where kl, kc and kr denote the values for the left, center
and right subfigures below, respectively. Which is the largest k and which is the smallest?

Figure 2: kNN runs for k = 1, 2, 3. Which is which?

Solution: The larger k is, the smoother the decision boundary will be. Overfitting
(covered later) occurs less as k increases. So, simply by looking carefully at the
decision boundaries, we can conclude kl < kc < kr.
(c) Suppose that you have trained a kNN model and now you want to perform prediction on
test data. Before executing the prediction (a.k.a. inference) task, you want to calculate
the time that kNN will take for predicting the class for test data. Let’s denote the time
to calculate the distance between 2 observations as t. What would the time taken by 1-
NN be if there are m (some very large number) observations in the training data? What
about for 2-NN or 3-NN? (We only consider the time used for calculating distances.)
Solution: For each test example, we need O(mt) time to calculate all m distances. While
scanning, we can use O(k) time per training example to update the running set of k nearest
examples, giving O(mt + mk) time to predict one test example. For constant k, this is still
O(mt). Food for thought: can you come up with ideas to speed this up? The numbers
above assume a naive brute-force strategy that computes every distance. A sketch of the
single-pass scan follows.

3. Analysing k-NN Inference.


Alice and Bob have proposed two ways of doing k-NN inference. Both algorithms are
explained below. There are m training samples, and the time taken to calculate the
distance between two samples is O(d).

• Algorithm by Alice
(a) For each training sample i, where i ∈ {1, 2, 3, . . . , m}, initialize S[i] = 0.
(b) Compute D[i] for each training sample. Here, D[i] denotes the distance between
the training sample i and the new observation.
(c) Iterate k times through all the training samples, performing the following
procedure in each iteration.
i. Find the smallest D[i] among the samples with S[i] = 0.

ii. After the full scan through all the samples, mark S[min] = 1, where min is
the index where D[min] is the smallest and S[min] = 0.
(d) Return k samples with indices where S[i] = 1.
• Algorithm by Bob

(a) For each training sample i, where i ∈ {1, 2, 3, . . . , m}, initialize S[i] = 0.
(b) Iterate k times through all the training samples, performing the following steps in
each iteration:
i. Calculate the distance between each sample (for which S[i] = 0) and the new
observation.
ii. Identify the minimum distance and mark the corresponding location with S[min] = 1.
(c) Return k samples with indices where S[i] = 1.

(a) Verify whether the algorithms of Alice and Bob are correct. If they are correct,
give the running time for a single inference in terms of m, d and k. Which is the better
algorithm with respect to the running time?
Solution: Both inference algorithms are correct. Alice’s algorithm runs in O(m(d + k)):
O(md) to compute all the distances once, then k scans of O(m) each. Bob recomputes
distances in every one of his k passes, so his algorithm runs in O(mdk). Alice’s algorithm
is asymptotically faster than Bob’s. A direct transcription of it into code is sketched below.
(b) Propose a way to improve the best algorithm in part (a). What is the running time of
the new algorithm?
Solution: One way to improve Alice’s algorithm is to keep a balanced BST with k nodes,
where the BST tracks the k smallest distances seen so far. This can be done using the
following steps.
i. Insert the first k distances into a balanced BST. For each of the remaining m − k
distances, follow the next steps.
ii. If the distance is greater than the maximum value in the BST, no change to the
BST is needed.
iii. Otherwise, remove the maximum element from the BST and insert the current distance.
iv. Removing the maximum element and inserting a new one costs O(log k); repeated
over the m − k remaining elements, this is O(m log k).
This would reduce the running time to O(m(d + log k)). A heap-based sketch of this
idea follows.
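Python’s standard library has no balanced BST, but a size-k max-heap supports the same two operations (compare against the current maximum, replace it) in O(log k), so the following sketch conveys the same idea; distances are negated because heapq is a min-heap.

import heapq

def k_smallest_bounded_heap(distances, k):
    # Max-heap (via negation) holding the k smallest distances seen so far.
    heap = [(-d, i) for i, d in enumerate(distances[:k])]
    heapq.heapify(heap)
    for i in range(k, len(distances)):
        if distances[i] < -heap[0][0]:                   # beats the current max?
            heapq.heapreplace(heap, (-distances[i], i))  # pop max, push new: O(log k)
    return [i for _, i in heap]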
Another way of optimizing Alice’s algorithm is to heapify D[] and then pop the k nearest.
This can be done using the following steps.
i. Heapify D[]: O(m).
ii. Popping an element from the heap: O(log m).
iii. Repeat for k elements: O(k log m).

This would reduce the running time to O(md + k log m), as sketched below. Think of a
way to reduce the running time even further, to O(md). (Hint: use the Quickselect
algorithm to find the k-th smallest element of the distance list.)

4. k-NN.
Suppose you have a classification problem: predict whether it will rain in an area,
using the k-NN algorithm on the available dataset. The input variables are the humidity,
which varies between 50 and 90%, and the average temperature, which ranges between
25 and 35 degrees Celsius. Do you think applying k-NN directly will yield a good prediction
result? If not, what improvement would you propose?
Solution: Consider the difference between the ranges of the two input variables. Expressed
as a fraction, the humidity range is only 0.4, while the range of the temperature is 10
degrees Celsius. If we apply k-NN directly, the Euclidean distance between points will be
dominated by the temperature difference. Hence, this drastically reduces the contribution
of the humidity variable in the k-NN algorithm.
An improvement can be made by applying techniques such as scaling (normalization or
standardization) to both variables, as sketched below.
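A minimal sketch of min-max scaling; the feature values below are made up for illustration.

def min_max_scale(values):
    # Rescale one feature to [0, 1] so that no single feature
    # dominates the Euclidean distance.
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

humidity = [0.55, 0.90, 0.62, 0.80]     # fractions in [0.5, 0.9]
temperature = [26.0, 33.5, 28.2, 31.0]  # degrees Celsius
points = list(zip(min_max_scale(humidity), min_max_scale(temperature)))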
5. (Optional) The Netflix Prize.
The Netflix Prize was a competition held by Netflix to improve its algorithm for recommending
movies to its users. This was formalized as having a system predict a customer’s
numeric rating of a target movie. The winning entry used collaborative filtering, a method
based on the assumption that people who agreed in the past will agree in the future. It
looked at the historical ratings that users had given movies, but not at the features of the
movies and users (e.g., genre, year, director, actors). However, the winning system was
never adopted. Can you think of several reasons why?
Solution: There are many possible model answers; yours is likely to vary. Other
issues important in recommender systems include:
• Diversity: how different are the recommendations?
– If you like ‘Battle of Five Armies Extended Edition’, should we recommend ‘Battle
of Five Armies’?
– Even if you really really like Star Wars, you might want non-Star-Wars sugges-
tions.
• Persistence: how long should recommendations last?
– If you keep not clicking on ‘Hunger Games’, should it remain a recommendation?
• Trust: tell the user why you made a recommendation.
• Social recommendation: what did your friends watch?
• Freshness: people tend to get more excited about new/surprising things.
– Collaborative filtering does not predict well for new users/movies.
– New movies don’t yet have ratings, and new users haven’t rated anything.
