Tutorial 1 Machine Learning
1.1 Intro to ML
1. A computer program is said to learn from experience E with respect to some task T and
some performance measure P if its performance on T, as measured by P, improves with
experience E. Suppose we feed a learning algorithm a lot of historical trajectory data of
vehicles, and have it learn to predict traffic speed. In this setting what is E?
2. Suppose you are working on traffic prediction, and you would like to predict whether traffic
will be heavy at 5pm tomorrow or not. You want to use a learning algorithm for this. Would
you treat this as a classification or a regression problem?
3. Some of the problems below are best addressed using a supervised learning algorithm, and
the others with an unsupervised learning algorithm. Which of the following would you
apply supervised learning to? (Select all that apply.) In each case, assume some appropriate
dataset is available for your algorithm to learn from.
Problem | Supervised/Unsupervised?
Take a collection of 1000 essays written on the US economy, and find a way to automatically group these essays into a small number of groups that are somehow "similar" or "related". |
1. Consider the problem of predicting the number of sunny days in each week of 2023 given
the number of sunny days in each week of 2022.
In this scenario, x represents the number of days in each week of 2022 on which the weather
was sunny. The value y, which we want to predict, is the number of sunny days in each week
of 2023.
The following training set is a sample of a few weeks, with the number of sunny days in each
of them.
Recall that in linear regression our hypothesis is $h_\theta(x) = \theta_0 + \theta_1 x$, and we use m to
denote the number of training examples.
x y
3 2
2 3
4 5
1 1
5 4
2. For this question, continue using the data provided in (1). Recall that the cost
function for linear regression is defined as
$$J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$
What is J(0,1)?
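As a way to check a hand calculation, the cost function can be evaluated with a few lines of Python. This is a minimal sketch, assuming the single-feature hypothesis h(x) = θ0 + θ1·x; the function names `hypothesis` and `cost` are illustrative choices, not part of the tutorial.

```python
def hypothesis(theta0, theta1, x):
    """Single-feature linear hypothesis: theta0 + theta1 * x (assumed form)."""
    return theta0 + theta1 * x


def cost(theta0, theta1, xs, ys):
    """Squared-error cost: sum of squared residuals divided by 2m."""
    m = len(xs)
    squared_errors = sum(
        (hypothesis(theta0, theta1, x) - y) ** 2 for x, y in zip(xs, ys)
    )
    return squared_errors / (2 * m)


# Training set from question (1): sunny days per week in 2022 (x) and 2023 (y).
xs = [3, 2, 4, 1, 5]
ys = [2, 3, 5, 1, 4]

# Evaluate the cost at theta0 = 0, theta1 = 1 and compare with your own answer.
print(cost(0, 1, xs, ys))
```

Changing the two parameter arguments lets you see how the cost responds to other choices of θ0 and θ1 on the same data.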
4. Three different classifiers are trained on the same data. Their decision
boundaries are shown below. Which of the following statements are
true?
5. What is the difference between a local minimum and a global minimum in the context of gradient descent?
1. Suppose we have m = 4 houses, and each house has an area and a number of bedrooms, which
can be used to predict the house price. The dataset is as follows:
House | Bedrooms | sqft_area | Price
1 | 1 | 880 | 490,000
2 | 3 | 1930 | 630,000
3 | 4 | 1940 | 640,000
4 | 3 | 1350 | 570,000
You’d like to use polynomial regression to predict a house price from its number of
bedrooms and sqft_area. Concretely, suppose you wish to fit a model of the form.
where x1 is the number of bedrooms and x2 is sqft_area. Further you plan to use both
feature scaling (dividing by the max-min, or range, of a feature) and mean normalization.
What is the normalized feature $x_2^{(4)}$ (i.e., x2 for the fourth training example)?
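The combination of mean normalization and range scaling described above can be sketched as follows. This is only an illustration, assuming each value is transformed as (value − mean) / (max − min); the helper name `mean_normalize` is hypothetical.

```python
def mean_normalize(values):
    """Subtract the mean, then divide by the range (max - min) of the feature."""
    mean = sum(values) / len(values)
    value_range = max(values) - min(values)
    return [(v - mean) / value_range for v in values]


# sqft_area column (x2) for the four houses in the dataset above.
sqft_area = [880, 1930, 1940, 1350]

normalized = mean_normalize(sqft_area)

# Index 3 holds the fourth training example (Python lists are zero-based).
print(normalized[3])
```

After this transformation the feature has mean zero and spans a range of exactly one, which is the property that helps gradient descent when features have very different scales.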
2. You run gradient descent for 12 iterations with α = 0.2 and compute J(θ) after each iteration.
You find that the value of J(θ) increases over time. What would you do to correct this issue?
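To see this behaviour in practice, the sketch below runs batch gradient descent for 12 iterations and records J(θ) after each one. The question does not give its dataset, so this example assumes the small sunny-days training set from the linear regression question, a single-feature model, and a starting point of θ0 = θ1 = 0; the second learning rate (0.05) is likewise an arbitrary comparison value, and the function names are illustrative.

```python
def cost(theta0, theta1, xs, ys):
    """Squared-error cost: sum of squared residuals divided by 2m."""
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)


def gradient_descent(xs, ys, alpha, iterations):
    """Batch gradient descent from theta = (0, 0); returns the cost history."""
    theta0, theta1 = 0.0, 0.0
    m = len(xs)
    history = [cost(theta0, theta1, xs, ys)]
    for _ in range(iterations):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Simultaneous update of both parameters.
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
        history.append(cost(theta0, theta1, xs, ys))
    return history


# Assumed toy data (not the dataset behind the question).
xs = [3, 2, 4, 1, 5]
ys = [2, 3, 5, 1, 4]

history_large_alpha = gradient_descent(xs, ys, alpha=0.2, iterations=12)
history_small_alpha = gradient_descent(xs, ys, alpha=0.05, iterations=12)

print("alpha = 0.2 :", history_large_alpha[0], "->", history_large_alpha[-1])
print("alpha = 0.05:", history_small_alpha[0], "->", history_small_alpha[-1])
```

Comparing the two printed cost histories on this toy data shows how strongly the trend of J(θ) depends on the choice of α.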
3. Suppose you have m = 23 training examples with n = 5 features (excluding the additional
all-ones feature for the intercept term, which you should add). The normal equation is
$\theta = (X^T X)^{-1} X^T y$. For the given values of m and n, what are the dimensions of θ, X, and y
in this equation?
1. X is 23 × 6, y is 23 × 6, θ is 6 × 6
2. X is 23 × 5, y is 23 × 1, θ is 5 × 1
3. X is 23 × 6, y is 23 × 1, θ is 6 × 1
4. X is 23 × 6, y is 23 × 1, θ is 5 × 5
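One way to check the dimensions is to build the matrices and inspect their shapes. The sketch below uses NumPy with randomly generated placeholder data (the actual feature values are not given in the question), and it solves the linear system with `np.linalg.solve` rather than forming the inverse explicitly, which yields the same θ when XᵀX is invertible.

```python
import numpy as np

m, n = 23, 5  # number of training examples and features, as in the question

# Placeholder data: random values stand in for the unknown real dataset.
rng = np.random.default_rng(0)
features = rng.normal(size=(m, n))
y = rng.normal(size=(m, 1))

# Prepend the all-ones column for the intercept term.
X = np.hstack([np.ones((m, 1)), features])

# Normal equation: solve (X^T X) theta = X^T y for theta.
theta = np.linalg.solve(X.T @ X, X.T @ y)

print("X:", X.shape, "y:", y.shape, "theta:", theta.shape)
```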
WIA1006/WID3006, Semester II, Session 2022/2023
4. Suppose you have a dataset with m = 1000000 examples and n = 200000 features for each
example. You want to use multivariate linear regression to fit the parameters θ to your data.
Should you prefer gradient descent or the normal equation?
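A rough back-of-the-envelope calculation can help when weighing the two approaches. The sketch below estimates the size of the matrix XᵀX that the normal equation has to form and invert, assuming dense storage in 64-bit floats and one extra column for the intercept; the variable names are illustrative.

```python
m = 1_000_000  # training examples
n = 200_000    # features per example

columns = n + 1          # features plus the all-ones intercept column
bytes_per_float64 = 8    # assumed dense float64 storage

# X^T X is a square matrix with one row and one column per column of X.
gram_entries = columns ** 2
gram_gigabytes = gram_entries * bytes_per_float64 / 1e9

print(f"X^T X has {gram_entries:,} entries (~{gram_gigabytes:,.0f} GB as float64)")
```

In addition to the storage, inverting a matrix with standard dense methods takes time that grows roughly with the cube of its dimension, whereas each gradient descent iteration only needs products with X itself.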