Assignment #1
Points: 50
Course: CS 725 – Instructor: Preethi Jyothi
Due date: 11:55 pm, September 6, 2024
General Instructions
• Download the file assgmt1.tgz from Moodle and extract it to get a directory named assgmt1 with all the necessary files within.
• For your final submission, create a final submission directory named assgmt1
with the following internal directory structure:
assgmt1/
|
+- README
+- part1A.py
+- histograms/
+- part1B.py
+- closedForm.py
+- batchGradientDescent.py
+- smilingJoker.py
+- part3.py
+- kaggle.csv
• README should contain the names and roll numbers of all your team members.
• A quick and short introduction to NumPy is here. Please make sure you go through this carefully; numpy operations beyond what is mentioned in this document can also be used for the assignment.
(A) Coin Tossing. Say you toss 100 fair coins simultaneously and want to programmatically record the total number of heads. One simultaneous toss of the 100 coins counts as a single trial. part1A.py contains a list num_trials_list of varying numbers of trials. For every value in this list, we want to count the number of heads and plot a histogram. The histogram should take the shape of a binomial distribution and get more accurate with larger numbers of trials.
Notes:
2. Use the given template only. DO NOT change the names of the given functions
as it will cause the autograder to fail.
3. Use for loops to simulate the 100 coin tosses for a given num_trials value. Do not use predefined functions to compute the numpy array inside the toss() function; otherwise marks will be deducted. (A minimal sketch follows these notes.)
Complete each of the following steps using numpy operations and do not use any
for loops:
2. Modify X to add the integer i to all elements of its ith row, where i ∈ {1, . . . , N}. Let this offset-modified matrix now be X̂. [2 pts]
[Illustration: sparsify() zeroes out alternate entries of Z in a checkerboard pattern, e.g.

            z11 z12 z13 z14        z11  0  z13  0
 sparsify:  z21 z22 z23 z24  -->    0  z22  0  z24
            z31 z32 z33 z34        z31  0  z33  0
            z41 z42 z43 z44         0  z42  0  z44 ]
4. Compute Ẑ such that each zi in Z is replaced by ẑi, where ẑij = exp(zij) / Σj exp(zij). Note that each row in Ẑ will now be a probability distribution. [2 pts]
5. Print the index of the maximum probability in each row of Ẑ. [1 pt] (See the sketch after this list.)
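As referenced in step 5, here is a loop-free NumPy sketch of the steps above. The array shapes and variable names are assumptions, and the checkerboard rule is read off the sparsify illustration; the real interfaces live in part1B.py.

import numpy as np

rng = np.random.default_rng(0)
N = 4
X = rng.standard_normal((N, N))

# Step 2: add the integer i to every element of the ith row (i = 1..N),
# by broadcasting a column vector instead of using a for loop.
X_hat = X + np.arange(1, N + 1).reshape(-1, 1)

# Sparsify, as in the illustration: zero out alternate entries in a
# checkerboard pattern, keeping z_ij where i + j is even (0-indexed).
Z = rng.standard_normal((N, N))
rows, cols = np.indices(Z.shape)
Z_sparse = np.where((rows + cols) % 2 == 0, Z, 0.0)

# Step 4: row-wise softmax, so every row of Z_hat sums to 1.
expZ = np.exp(Z - Z.max(axis=1, keepdims=True))  # max-shift: same result, more stable
Z_hat = expZ / expZ.sum(axis=1, keepdims=True)

# Step 5: index of the maximum probability in each row.
print(Z_hat.argmax(axis=1))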
(A) Closed-form Solution of Linear Regression (10 points). For a training set D = {(x1, y1), . . . , (xn, yn)}, xi ∈ R^(d+1), represented by a feature matrix X and a label vector y, the least squares solution w∗ can be computed as w∗ = (X⊤X)⁻¹X⊤y, where
        ⎡←− x1⊤ −→⎤          ⎡ y1 ⎤          ⎡ w0 ⎤
        ⎢←− x2⊤ −→⎥          ⎢ y2 ⎥          ⎢ w1 ⎥
    X = ⎢    ⋮    ⎥ ,    y = ⎢  ⋮ ⎥ ,    w = ⎢  ⋮ ⎥
        ⎣←− xn⊤ −→⎦          ⎣ yn ⎦          ⎣ wd ⎦
          n×(d+1)              n×1            (d+1)×1
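Before wiring this into the template, it may help to see the normal-equations solution in isolation. The sketch below is not the closedForm.py template; the function names and signatures here are illustrative assumptions.

import numpy as np

def fit_closed_form(X, y):
    # w* = (X^T X)^{-1} X^T y; np.linalg.solve avoids forming the
    # inverse explicitly, which is cheaper and numerically safer.
    return np.linalg.solve(X.T @ X, X.T @ y)

def predict(X, w):
    # Predicted labels for a feature matrix X (bias column included).
    return X @ w

# Toy usage: recover known weights from noisy data.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.standard_normal((50, 2))])
y = X @ np.array([1.0, 2.0, -0.5]) + 0.01 * rng.standard_normal(50)
print(fit_closed_form(X, y))   # should be close to [1.0, 2.0, -0.5]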
2. predict(): This function should return the predicted values of the model on
an input set of feature values. [2 pts]
Note:
2. To make your life easy, we have a test suite that checks your code on a few
test cases. You will need PyTest to run these test cases. We might add some
additional cases during the final grading. [3 pts]
To run the test suite, execute:
$ pytest test_closedForm.py
3. Stick to the given template code. DO NOT change the names of the functions
given as it will cause the autograder to fail.
(B) Gradient Descent for Linear Regression (12 points). Find the least squares solution using Batch Gradient Descent. batch_size indicates the number of data points in a batch. If batch_size is None, this function implements full gradient descent and computes the gradient over all training examples. The code for generating and creating batches on a toy dataset is in batchGradientDescent.py.
2. compute_gradient(): This function should return the gradient of the loss w.r.t. the weights of the model. (Normalize the values before returning, or it may cause gradients to explode; see the sketch after this list.) [2 pts]
3. compute_rmse_loss(): This function should return the Root Mean Square Error loss, √((1/n) ||y − Xw||₂²), between the target labels and predicted labels. [1 pt]
4. predict(): This function should return the predicted values of the model on the given set of feature values passed as an argument to it. [1 pt]
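As referenced in item 2, here is a minimal sketch of how these functions could fit together. It is not the batchGradientDescent.py template; the signatures, learning rate, and epoch count are assumptions.

import numpy as np

def compute_gradient(X_batch, y_batch, w):
    # Gradient of the mean squared error w.r.t. w, normalized by the
    # batch size so gradients do not blow up with large batches.
    n = X_batch.shape[0]
    return (2.0 / n) * X_batch.T @ (X_batch @ w - y_batch)

def compute_rmse_loss(X, y, w):
    # sqrt((1/n) * ||y - Xw||_2^2), as defined in item 3 above.
    return np.sqrt(np.mean((y - X @ w) ** 2))

def batch_gradient_descent(X, y, lr=0.01, num_epochs=100, batch_size=None):
    # batch_size=None means full gradient descent over all examples.
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(num_epochs):
        if batch_size is None:
            w -= lr * compute_gradient(X, y, w)
        else:
            idx = np.random.permutation(n)   # reshuffle each epoch
            for start in range(0, n, batch_size):
                batch = idx[start:start + batch_size]
                w -= lr * compute_gradient(X[batch], y[batch], w)
    return w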
Finally, your code should generate two plots like the reference plots shown in the assignment handout.
Note:
To run the test suite, execute:
$ pytest test_batchGradientDescent.py
3. Stick to the given template code. DO NOT change the names of the functions
given as it will cause the autograder to fail.
(C) Linear Regression with Basis Functions (8 points). (The Smiling Joker)
This question will require some basic reasoning about feature spaces. You are given
a training set D = {(x1 , y1 ), . . . , (xn , yn )}, xi ∈ R. The relation between xi and yi
is defined by a weighted combination of transformations Φ = {ϕ1 , ϕ2 , . . . , ϕK } such
that:
yi = w0 + Σ_{j=1}^{K} wj · ϕj(xi)   ∀ i ∈ {1, 2, . . . , n}
Given just the dataset, you are tasked with identifying the appropriate number and choice of these transformations ϕj and fitting a linear regression model (an illustrative sketch appears at the end of this part).
More concretely, complete the following functions in smilingJoker.py:
1. read_dataset(): This function should read the CSV file dataset.csv and generate train and test splits. The file contains 500 data points; you can use the first 90% of the data for training and the rest for testing your model. [2 pts]
Note:
1. Similar to the previous questions, you should run the following and pass all the prediction checks: [3 pts]
$ python smilingJoker.py
2. Stick to the given template code. DO NOT change the names of the functions given as it will cause the autograder to fail.
When you plot the dataset, you will see a Smiling Joker (hence the name!): https://tinyurl.com/stock-smiley-joker
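As an illustration of the basis-function idea (not the intended solution, since the actual ϕj must be inferred by plotting the data), here is a sketch that fits an assumed polynomial basis with the closed-form solver from Part II(A):

import numpy as np

def design_matrix(x, K):
    # Bias column plus hypothetical polynomial features x, x^2, ..., x^K;
    # swap in whatever transformations the plotted data suggests.
    return np.column_stack([np.ones_like(x)] + [x ** j for j in range(1, K + 1)])

def fit_basis_regression(x_train, y_train, K):
    Phi = design_matrix(x_train, K)
    # Closed-form least squares on the transformed features.
    return np.linalg.solve(Phi.T @ Phi, Phi.T @ y_train)

Once a candidate Φ fits the training split well, verify it on the held-out 10% before running the prediction checks.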
• test.csv: This file contains the test data with 64 features but without the target
scores.
• sample.csv: This file contains the submission format with predicted scores for
the test data. You will have to submit such a file with your test predictions.
Each row in the data files represents an instance with the following columns:
• ID: a unique identifier for each data point.
• feature_0, feature_1, ..., feature_63: the 64 features extracted from the dataset.
• score: the target score for each data point (only in train.csv).
Task Description. Implement a linear regression model for the given problem. You can reuse any of the functions from Part II of this assignment. Tune the hyperparameters on a held-out set from train.csv to achieve the best model performance on the test set. Predict the target scores on the test dataset. Round the predicted scores to the nearest integer before submission.
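A minimal end-to-end sketch under the column layout above: the file names train.csv and test.csv follow the handout, while the 90/10 split, the plain least-squares fit, and the exact kaggle.csv column names (which must match sample.csv) are assumptions.

import numpy as np
import pandas as pd

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

feature_cols = [f"feature_{i}" for i in range(64)]
X = train[feature_cols].to_numpy()
y = train["score"].to_numpy()

# Held-out split for hyperparameter tuning (90/10 is an assumption).
n_train = int(0.9 * len(X))
X_tr, y_tr, X_val, y_val = X[:n_train], y[:n_train], X[n_train:], y[n_train:]

def add_bias(M):
    return np.column_stack([np.ones(len(M)), M])

# Fit with least squares (reusable from Part II) and check held-out RMSE.
w = np.linalg.lstsq(add_bias(X_tr), y_tr, rcond=None)[0]
print("held-out RMSE:", np.sqrt(np.mean((add_bias(X_val) @ w - y_val) ** 2)))

# Predict, round to the nearest integer, and write kaggle.csv.
preds = np.rint(add_bias(test[feature_cols].to_numpy()) @ w).astype(int)
pd.DataFrame({"ID": test["ID"], "score": preds}).to_csv("kaggle.csv", index=False)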
Evaluation. The performance of your model will be evaluated based on the Mean
Squared Error (MSE) calculated on the test dataset. Your predicted scores must be
rounded to the nearest integer. You should not implement the MSE metric; it will be
automatically calculated via Kaggle. Your model will be evaluated on the provided
test set, where a random 50% of the examples are marked as private and the remaining
are public. The private/public distribution will be hidden. You can monitor your
model’s performance on the public part of the test set via the public leaderboard.
The final evaluation will be based on the private part of the test set, which will be
revealed via the private leaderboard after the competition concludes.
Submission. Submit your source file named part3.py and a CSV file kaggle.csv with your predicted scores (rounded to the nearest integer) for the test dataset, following the format in sample.csv. If you match or outperform the baseline RMSE obtained with the closed-form solution from Part II, you will get all 5 points.
Top-scoring performers on the "Private Leaderboard" (with a suitable threshold de-
termined after the deadline passes) will be awarded up to 3 extra points.