Tutorial 5B Solutions
2. To what range is the probability restricted for the outcome prediction in logistic
regression?
a. (0, inf)
b. (-inf, 0)
c. (0, 1)
d. (-inf, inf)
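The range in question comes from the logistic (sigmoid) function, which squashes any real-valued linear predictor into a probability. A minimal sketch, where the function name `sigmoid` and the sample inputs are illustrative choices rather than part of the tutorial:

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function: maps a real number z to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# Illustrative inputs only: large negative z approaches 0, large positive z approaches 1.
for z in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(f"z = {z:6.1f}  ->  p = {sigmoid(z):.5f}")
```

However large or small the linear predictor is, the output never leaves the interval between 0 and 1.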
3. Which of the following functions is used to transform the categorical target variable Y
into a continuous-valued quantity?
4. Suppose a set of reasonably clean sample records was extracted from the 2020 census
database in the US. We are interested in predicting whether a person makes over $50K a
year, using two binary nominal attributes:
• whether they are male or female (X1), and
• whether or not they have completed a tertiary qualification (X2).
Suppose we model the two features and label Y ∈ {0, 1} where Y = 1 indicates a person
makes over $50K. In the figure below we show three positive samples (“+” for Y = 1) and
one negative sample (“-” for Y = 0).
For predicting samples in the figure above, which model is better: Logistic Regression or
Linear Regression? Explain why.
Logistic regression. It models the probability associated with a categorical variable, and
that probability takes values between 0 and 1, which is consistent with the target space of Y
in this example. It can also use numeric and/or categorical attributes as predictors. Linear
regression, by contrast, predicts an unbounded continuous value.
5. Suppose the probability that a house on the market sells for the asking price is 0.72:
a. What are the odds in favour of the house selling for the asking price?
0.72/(1-0.72) = 2.57
b. What are the odds that the house does NOT sell for the asking price?
(1-0.72)/0.72 = 0.39
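The two calculations above use the same rule, odds = p / (1 - p), applied first to the event and then to its complement. A short sketch for checking the arithmetic (the helper name `odds_from_probability` is an illustrative choice):

```python
def odds_from_probability(p: float) -> float:
    """Convert a probability p (0 <= p < 1) into odds in favour of the event."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must satisfy 0 <= p < 1")
    return p / (1.0 - p)

p_sell = 0.72  # probability given in the question
odds_for = odds_from_probability(p_sell)            # odds the house sells at asking price
odds_against = odds_from_probability(1.0 - p_sell)  # odds it does not

print(round(odds_for, 2), round(odds_against, 2))  # 2.57 0.39
```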
6. The odds of a mushroom being poisonous is 2.68. What is the probability that the
mushroom is poisonous?
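Going the other way, odds convert back to a probability via p = odds / (1 + odds). A minimal sketch (the helper name `probability_from_odds` is an illustrative choice):

```python
def probability_from_odds(odds: float) -> float:
    """Convert non-negative odds in favour of an event into a probability."""
    if odds < 0.0:
        raise ValueError("odds must be non-negative")
    return odds / (1.0 + odds)

p_poisonous = probability_from_odds(2.68)  # odds given in the question
print(round(p_poisonous, 3))  # 0.728
```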
7. Data from the UCLA undergraduate school has 3 variables called admit, GRE score, and
GPA score. Our aim is to build a model to predict the probability of a student getting
admitted to UCLA, if we are given his profile (i.e. his GRE and GPA scores).
We used logistic regression to fit a model to the data. The model results are given in
the table below:
a. Determine the estimated logistic regression model i.e. write down the model equation.
c. Determine the estimated odds ratio for a 4-unit increase in GPA score, and interpret
this value.
d. Consider a student who achieved a score of 790 in GRE, and a score of 3.8 in GPA.
Use the model to predict whether or not the student will be admitted into UCLA.
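The results table is not reproduced here, so the sketch below only shows the mechanics of parts (a), (c) and (d). The coefficients are made up for illustration and are NOT the fitted values, and the 0.5 classification cutoff is an assumption. Substitute the table's estimates to obtain the actual answers.

```python
import math

# HYPOTHETICAL coefficients for illustration only; replace with the table's estimates.
INTERCEPT = -5.0
BETA_GRE = 0.003
BETA_GPA = 0.75

def admit_probability(gre: float, gpa: float) -> float:
    """Model equation: logit(p) = b0 + b1*GRE + b2*GPA, inverted to give p."""
    logit = INTERCEPT + BETA_GRE * gre + BETA_GPA * gpa
    return 1.0 / (1.0 + math.exp(-logit))

def odds_ratio(beta: float, delta: float) -> float:
    """Multiplicative change in the odds when a predictor increases by delta units."""
    return math.exp(beta * delta)

# Part (c): odds ratio for a 4-unit increase in GPA, other predictors held fixed.
gpa_odds_ratio = odds_ratio(BETA_GPA, 4.0)

# Part (d): predicted probability, then classification with an assumed 0.5 cutoff.
p_admit = admit_probability(gre=790.0, gpa=3.8)
prediction = "admitted" if p_admit >= 0.5 else "not admitted"
print(gpa_odds_ratio, p_admit, prediction)
```

An odds ratio above 1 means the odds of admission rise with the predictor, and one below 1 means they fall.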
8. Logistic regression was used to model the likelihood of a man having cardiovascular
disease (CVD) based on their age, weight and exercise habits. Using a sample of men of
the same race, information was collected on their age (in years), weight (in kg) and
exercise habits (average amount of time spent exercising per week in hours), as well as
whether they had CVD or not.
Suppose the following results were obtained for the logistic regression model:
e. Determine the odds of having CVD for an increase of 2 hours of exercise per week
on average.
f. Classify the CVD status of a 35-year-old man who weighs 70 kg and exercises for 30
minutes per week, on average.
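The fitted coefficients are not reproduced here, so the working for parts (e) and (f) is sketched symbolically. The symbols \beta_0 to \beta_3 are placeholders for the intercept and the age, weight and exercise coefficients, and the 0.5 cutoff is an assumption.

```latex
% Placeholder symbols: beta_0 (intercept), beta_1 (age), beta_2 (weight), beta_3 (exercise).
\[
\ln\frac{p}{1-p} = \beta_0 + \beta_1\,\mathrm{age} + \beta_2\,\mathrm{weight} + \beta_3\,\mathrm{exercise}
\]
% Part (e): exercising 2 more hours per week multiplies the odds of CVD by
\[
\frac{\mathrm{odds}(\mathrm{exercise}+2)}{\mathrm{odds}(\mathrm{exercise})} = e^{2\beta_3}
\]
% Part (f): age = 35, weight = 70, exercise = 0.5 (30 minutes converted to hours)
\[
\hat{p} = \frac{1}{1 + e^{-(\beta_0 + 35\beta_1 + 70\beta_2 + 0.5\beta_3)}}
\]
```

Under the assumed cutoff, the man is classified as having CVD when \hat{p} is at least 0.5, which is the same as the linear predictor being non-negative. Exercise is recorded in hours, so 30 minutes enters the model as 0.5.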
9. Why do support vector machines usually map examples to a higher dimensional space?
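One common reason is that classes which no linear boundary can separate in the original space may become linearly separable after a nonlinear mapping. The toy example below illustrates this. The four 1-D points and the mapping phi(x) = (x, x^2) are made up for illustration and do not come from the tutorial.

```python
# Toy 1-D data (hypothetical): the positive class lies on both sides of the negative class.
points = [(-2.0, +1), (-0.5, -1), (0.5, -1), (2.0, +1)]

def separable_by_threshold(values_labels):
    """True if a single threshold on one feature splits the two classes."""
    pos = [v for v, y in values_labels if y == +1]
    neg = [v for v, y in values_labels if y == -1]
    return max(pos) < min(neg) or max(neg) < min(pos)

original = [(x, y) for x, y in points]    # feature x only
mapped = [(x * x, y) for x, y in points]  # new coordinate x^2 from phi(x) = (x, x^2)

print(separable_by_threshold(original))  # False
print(separable_by_threshold(mapped))    # True
```

In the mapped space the horizontal line x^2 = 1, for instance, separates the classes, whereas no single cut on x does.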
10. Does the optimization procedure of a SVM always find the maximum marginal
hyperplane, assuming that the data are linearly separable?
a. Draw 3 possible decision boundaries (in a solid line) and indicate the margins
associated with each of them (in a dotted line) as well as their different sets of
associated support vectors (circled):
b. Out of the 3 possible decision boundaries that you have drawn in the previous
example, which one could be the MMH?
12. A support vector machine is to be used to predict whether a patient is at risk for
cardiovascular disease (CVD). The class labels are +1 for “high risk” and -1 for “low risk”.
Three attributes were considered: age, BMI and blood pressure. The observations for
these attributes have been standardised. The table below gives the support vectors and
Lagrange multipliers:
c. Based on the trained SVM, discuss the importance of the variables in terms of the
risk of CVD.
d. Using the trained SVM, classify the risk of the following men:
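The table of support vectors and the list of men are not reproduced here, so the sketch below only shows the mechanics for parts (c) and (d), assuming a linear kernel. The support vectors, multipliers, bias and test point are all HYPOTHETICAL and must be replaced with the tutorial's values.

```python
# HYPOTHETICAL values: (Lagrange multiplier alpha, class label y, (age, BMI, blood pressure)).
support_vectors = [
    (0.5, +1, (0.6, 0.8, 0.0)),
    (0.5, -1, (-0.6, -0.8, 0.0)),
]
bias = 0.0  # hypothetical intercept b

def weight_vector(svs):
    """Linear-kernel weights: w = sum_i alpha_i * y_i * x_i."""
    dim = len(svs[0][2])
    w = [0.0] * dim
    for alpha, y, x in svs:
        for j in range(dim):
            w[j] += alpha * y * x[j]
    return w

def classify(x, svs, b):
    """Return +1 (high risk) if w.x + b >= 0, otherwise -1 (low risk)."""
    w = weight_vector(svs)
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return +1 if score >= 0 else -1

w = weight_vector(support_vectors)
print("weights (age, BMI, blood pressure):", w)
print("class for standardised patient (1.0, 0.2, -0.3):",
      classify((1.0, 0.2, -0.3), support_vectors, bias))
```

Because the attributes are standardised, the magnitudes of the weights can be compared directly. A larger absolute weight suggests a stronger influence on the predicted risk, and the sign gives the direction.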