Exercises INF 5860: Exercise 1 Linear Regression
b) Why would we want an iterative algorithm for the linear regression problem?
c) How does gradient descent update the estimate? Give the general formula.
d) Given x=
Plot x,y as points in a plot.
e) If we start with ϴ0=0 and ϴ1=0, compute the initial value of the loss function.
f) If we start with ϴ0=0 and ϴ1=0, compute the estimate after one iteration if the learning rate
is 1.
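A minimal sketch of how parts e) and f) could be checked numerically. It assumes the hypothesis
h(x) = ϴ0 + ϴ1*x and a mean squared error loss with a factor 1/2; both are assumptions, since the
sheet does not state them here. The arrays x_demo and y_demo are made-up placeholder values, not
the data from part d), and the function names are illustrative.

```python
import numpy as np

def mse_loss(theta0, theta1, x, y):
    """Assumed loss: J = 1/(2m) * sum((theta0 + theta1*x - y)^2)."""
    residual = theta0 + theta1 * x - y
    return np.mean(residual ** 2) / 2.0

def gradient_step(theta0, theta1, x, y, lr):
    """One gradient descent update of both parameters for the loss above."""
    residual = theta0 + theta1 * x - y
    grad0 = np.mean(residual)        # dJ/dtheta0
    grad1 = np.mean(residual * x)    # dJ/dtheta1
    return theta0 - lr * grad0, theta1 - lr * grad1

# Placeholder data (hypothetical, NOT the values from part d).
x_demo = np.array([0.0, 1.0, 2.0])
y_demo = np.array([1.0, 3.0, 5.0])

initial_loss = mse_loss(0.0, 0.0, x_demo, y_demo)
theta0_new, theta1_new = gradient_step(0.0, 0.0, x_demo, y_demo, lr=1.0)
print(initial_loss, theta0_new, theta1_new)
```

Replacing x_demo and y_demo with the values given in part d) lets you compare the output with a
hand calculation.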
a) Given a trained logistic classifier for a single feature and 2 classes, what is the equation for
the decision boundary if W=2 and b=1?
b) In what way does the tanh activation function share the same drawbacks?
e) Assume that we have a 2-layer net (one hidden layer) with weights W(1), b(1), W(2) and b(2).
Assume that we use ReLU activations in the hidden layer, and no activation on the output
layer. Write down an equation for the output of the j’th node in the hidden layer, a(j).
h) Explain briefly how momentum gradient descent works, and why this can be more robust
than regular gradient descent.
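One common way to write the momentum update is sketched below, as an illustration rather than
the course's reference formulation. It assumes the classic "velocity" form v = beta*v - lr*grad
followed by theta = theta + v. The quadratic test function, the hyperparameter values, and all
names are assumptions chosen for the example.

```python
import numpy as np

def momentum_descent(grad_fn, theta_init, lr=0.1, beta=0.9, n_steps=300):
    """Gradient descent with momentum (velocity form, assumed here)."""
    theta = np.asarray(theta_init, dtype=float).copy()
    velocity = np.zeros_like(theta)
    for _ in range(n_steps):
        grad = grad_fn(theta)
        # The velocity is a decaying sum of past gradients, which damps
        # oscillations along steep directions and speeds up flat ones.
        velocity = beta * velocity - lr * grad
        theta = theta + velocity
    return theta

# Hypothetical test problem: f(theta) = 0.5 * (theta_0^2 + 10 * theta_1^2),
# an elongated bowl with its minimum at the origin.
curvature = np.array([1.0, 10.0])

def quadratic_grad(theta):
    return curvature * theta

theta_final = momentum_descent(quadratic_grad, [2.0, -1.5])
print(theta_final)
```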
1. Why can testing out multiple models on your test data be a problem and when is it
problematic?
2. How does searching through more hypotheses affect the probability that your search covers a
solution close to the correct solution?
4. What are the implications of the “No free lunch” theorem for machine learning?
5. Give three examples of common assumptions (priors) machine learning models make.
Exercise 5: Representations
1. You have a 32x32x5 image and filter it with a 5x5x5 kernel, the way most convolutional
neural networks are implemented. If you use no padding, what will be the output size of the
activation map?
3. Why is the effective field-of-view usually smaller than the theoretical field-of-view? By
theoretical field-of-view we mean the size of the image patch that can influence each of the
output values in the activation map. By effective field-of-view we mean the size of the patch of
pixels that in practice influences a given output value.
4. In deep learning frameworks, you usually operate on 4D tensors when working with 2D
convolutions. If you want to use such a framework to do an average (blur) filtering of images,
how would you have to construct the kernel for the convolution? You should treat each of
the color channels (RGB) independently.
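As a hedged illustration of the 4D-kernel question above, the sketch below builds an averaging
kernel in plain NumPy and applies it with a naive "valid" convolution (no padding). The kernel
layout (out_channels, in_channels, height, width) is an assumption; frameworks differ in axis
order. The helper names make_blur_kernel and conv2d_valid are hypothetical and not taken from any
library.

```python
import numpy as np

def make_blur_kernel(size=3, channels=3):
    """Averaging kernel with assumed layout (out_ch, in_ch, height, width)."""
    kernel = np.zeros((channels, channels, size, size))
    for c in range(channels):
        # Output channel c only reads input channel c, so colors do not mix.
        kernel[c, c] = 1.0 / (size * size)
    return kernel

def conv2d_valid(images, kernel):
    """Naive no-padding convolution (cross-correlation, no kernel flip).

    images: (batch, in_ch, H, W), kernel: (out_ch, in_ch, kh, kw).
    """
    n, c_in, h, w = images.shape
    c_out, c_in_k, kh, kw = kernel.shape
    assert c_in == c_in_k, "channel mismatch between images and kernel"
    out = np.zeros((n, c_out, h - kh + 1, w - kw + 1))
    for i in range(out.shape[2]):
        for j in range(out.shape[3]):
            patch = images[:, :, i:i + kh, j:j + kw]
            # Sum over input channels and the spatial window.
            out[:, :, i, j] = np.tensordot(
                patch, kernel, axes=([1, 2, 3], [1, 2, 3]))
    return out

rng = np.random.default_rng(0)
demo_images = rng.random((2, 3, 6, 6))   # made-up batch of 2 RGB images
blurred = conv2d_valid(demo_images, make_blur_kernel(size=3, channels=3))
print(blurred.shape)
```

The off-diagonal (c_out != c_in) slices are zero, which is what keeps the three color channels
independent.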
Exercise 7: Training deep networks
1. Gradient flow
a. Why is gradient flow important when training deep neural networks?
b. Give some common methods that help to ensure good gradient flow.
1. Give two possible explanations for why residual networks work better than standard
feed-forward networks.
2. You want to find bounding boxes for cars in an image. You don’t know how many cars there
will be in each image, but you can safely assume it’s between 0 and 100. Describe how you can
construct and train a deep neural network for this task.
4. What is the reasoning behind the concatenation operations in U-Net for image
segmentation?
1. You have a convolutional neural network trained for image classification. Describe a simple
way of detecting what parts of an image are responsible for a certain classification result,
without using the image gradients.
2. How can you get a simple estimate of how changing a set of pixel-values will affect the final
class probabilities?
3. For some visualization techniques, you apply a lowpass (blurring) filter between each
iteration of optimization. Why may this be a reasonable approach?
4. You have lots of training images for one application, but no labelled images for a similar
application. How can you use adversarial domain adaptation to improve your results on the
new data?
1. Why are vanishing gradients and outputs a more common problem in basic RNNs than in
feed-forward networks?
2. Why are vanishing gradients and outputs in RNNs less problematic than in feed-forward neural
networks?
3. Why can you only do gradient descent for a certain number of iterations of an RNN, and when is
this a problem? Explain and provide an example.
4. Give an overview of some common solutions to using deep learning for video data.
4. How could you implement hard attention for image analysis in a fully supervised way,
without using reinforcement learning?
1. Draw and explain an example where t-SNE works better than PCA.
2. When you do a PCA of a dataset, you can easily transform new points with the same
transform. Why is it more difficult to transform new points with t-SNE?
4. Explain a typical situation where first learning an embedding in an unsupervised way and then
using the embedding for supervised learning can fail.
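For the PCA question above, the following sketch shows what "transforming new points with the
same transform" could look like in NumPy. It assumes PCA is computed via an SVD of the
mean-centered data; the function names and the random demo data are hypothetical.

```python
import numpy as np

def fit_pca(data, n_components):
    """Return the data mean and the top principal directions (as rows)."""
    mean = data.mean(axis=0)
    # Rows of vt are orthonormal principal directions, sorted by variance.
    _, _, vt = np.linalg.svd(data - mean, full_matrices=False)
    return mean, vt[:n_components]

def transform_pca(points, mean, components):
    """Project points (seen or unseen) with the already fitted transform."""
    return (points - mean) @ components.T

rng = np.random.default_rng(0)
train_data = rng.normal(size=(50, 4))   # made-up training set
new_points = rng.normal(size=(5, 4))    # made-up points not seen during fitting

pca_mean, pca_components = fit_pca(train_data, n_components=2)
projected_new = transform_pca(new_points, pca_mean, pca_components)
print(projected_new.shape)
```

The fitted PCA is an explicit linear map (a mean and a matrix), so it can be applied to any new
point.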