FDS Notes
Giovanni Ficarra
October 6, 2020
Abstract
Some essential notes about the course of Foundations of Data Science.
Most of the contents are from Doing Data Science, O’Reilly, 2014.
These notes are shared without any guarantee of complete correctness,
since I may have made typos or misunderstood something. Feel free to drop
an email at [email protected] to report errors.
1 Evaluation
The evaluation of binary classifiers compares two methods of assigning a binary
attribute (Wikipedia).
Let TP, TN, FP, FN be respectively the number of true positives, the number
of true negatives, the number of false positives and the number of false negatives;
let $Y_i$ be an observed value, $\hat{Y}_i$ the prediction for that value, and $\bar{Y}$ the mean of the observed values.
• Accuracy: How often the correct outcome is predicted; how well
a binary classification test correctly identifies or excludes a condition.
ACC = \frac{TP + TN}{TP + TN + FP + FN}
More: Wikipedia
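For instance (with made-up labels, purely to illustrate the formula), accuracy can be computed directly from the four confusion-matrix counts:

```python
# Toy binary labels (illustrative values, not from the course).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Confusion-matrix counts.
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

accuracy = (tp + tn) / (tp + tn + fp + fn)
print(accuracy)  # 0.75 for these toy labels
```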
• Negative Predictive Value: The proportion of negative results that
are true negatives.
NPV = \frac{TN}{TN + FN}
More: Wikipedia
• False Positive Rate:
FPR = 1 − TNR
• False Negative Rate:
FNR = 1 − TPR
• False Discovery Rate:
FDR = 1 − PPV
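A small sketch of the rates above, reusing toy confusion-matrix counts (illustrative values; division-by-zero guards are omitted for brevity):

```python
# Toy confusion-matrix counts.
tp, tn, fp, fn = 3, 3, 1, 1

npv = tn / (tn + fn)   # Negative Predictive Value
tpr = tp / (tp + fn)   # True Positive Rate (recall)
tnr = tn / (tn + fp)   # True Negative Rate (specificity)
ppv = tp / (tp + fp)   # Positive Predictive Value (precision)

fpr = 1 - tnr          # False Positive Rate
fnr = 1 - tpr          # False Negative Rate
fdr = 1 - ppv          # False Discovery Rate
```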
• Mean squared error: The average squared distance between the predicted
and actual values. It captures how much the predicted values deviate
from the observed ones.
MSE = \frac{1}{n}\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \frac{SSE}{n}
More: Wikipedia
• Root mean squared error: The square root of the mean squared error.
RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}
More: Wikipedia
• Mean absolute error: The average of the absolute value of the difference
between the predicted and actual values. When predicted values are plotted
against actual ones, it is also the average horizontal distance between each
point and the identity line.
MAE = \frac{1}{n}\sum_{i=1}^{n} |Y_i - \hat{Y}_i|
More: Wikipedia
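A minimal sketch computing the three error measures above with NumPy, on made-up observed and predicted values:

```python
import numpy as np

# Toy observed and predicted values (illustrative only).
y = np.array([3.0, 5.0, 2.5, 7.0])
y_hat = np.array([2.8, 5.4, 2.9, 6.4])

mse = np.mean((y - y_hat) ** 2)    # mean squared error
rmse = np.sqrt(mse)                # root mean squared error
mae = np.mean(np.abs(y - y_hat))   # mean absolute error
print(mse, rmse, mae)
```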
• R-squared (aka coefficient of determination): The proportion of the
variance in the dependent variable that is predictable from the independent
variable(s); the proportion of variance explained by our model.
R^2 = 1 - \frac{SSE}{SST} = 1 - \frac{\frac{1}{n}\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}{\frac{1}{n}\sum_{i=1}^{n} (Y_i - \bar{Y})^2}
It tells us the quality of our model by comparing it with a naive one that
ignores the $X_i$s and simply predicts the average of the $Y_i$s.
(Figure: the better the linear regression, in the right graph, fits the data in
comparison to the simple average, in the left graph, the closer the value of $R^2$
is to 1. The areas of the blue squares represent the squared residuals with respect
to the linear regression; the areas of the red squares represent the squared
residuals with respect to the average value.)
More: Wikipedia
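The same comparison in code, on the toy values used above: the model's squared residuals (SSE) against those of the naive model that always predicts the mean:

```python
import numpy as np

y = np.array([3.0, 5.0, 2.5, 7.0])
y_hat = np.array([2.8, 5.4, 2.9, 6.4])

sse = np.sum((y - y_hat) ** 2)     # residuals of our model
sst = np.sum((y - y.mean()) ** 2)  # residuals of the naive mean model
r2 = 1 - sse / sst
print(r2)  # close to 1 when the model beats the mean baseline
```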
• p-values: Take as null hypothesis that a coefficient of the line estimated
through linear regression is 0; the p-value is the probability of observing a
test statistic at least as extreme as ours under that null hypothesis.
It tells us how meaningful our model is: whether it really represents what is
happening behind the data, or whether it is only similar to the data by chance.
I.e., if the p-value relative to a certain coefficient is low, it is highly
unlikely to observe such a test statistic under the null hypothesis, and the
coefficient we computed is highly likely to be nonzero and therefore significant.
More: Wikipedia
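As an illustration of where these p-values come from in practice, a minimal ordinary-least-squares fit on synthetic data, assuming statsmodels is available:

```python
import numpy as np
import statsmodels.api as sm

# Synthetic data: y depends linearly on x plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2.0 * x + 1.0 + rng.normal(scale=1.0, size=100)

X = sm.add_constant(x)        # add the intercept column
results = sm.OLS(y, X).fit()  # ordinary least squares
print(results.pvalues)        # one p-value per coefficient: low => likely nonzero
```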
• Receiver Operating Characteristic (ROC) curve: The plot of the TPR
against the FPR of a binary classifier as its decision threshold varies.
More: Wikipedia
• Area under the ROC curve (AUC): It represents the degree of separability:
it tells us how capable the model is of distinguishing between classes (the
higher the AUC, the better the model is at predicting 0s as 0s and 1s as 1s).
More: TowardsDataScience
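A minimal sketch of both the ROC curve and the AUC, assuming scikit-learn is available (the scores below are made up):

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Toy scores from a hypothetical classifier.
y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.9, 0.5])

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # one (FPR, TPR) point per threshold
auc = roc_auc_score(y_true, y_score)
print(auc)  # 1.0 = perfect separation, 0.5 = random guessing
```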
• Area under the cumulative lift curve: It captures how many times better
it is to use the model than not to use it (i.e., than just selecting at
random).
More: paper
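A rough sketch of cumulative lift, computed by hand on the same toy scores: rank the cases by predicted score and compare the positives captured in the top k against random selection:

```python
import numpy as np

y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.9, 0.5])

order = np.argsort(-y_score)           # highest scores first
captured = np.cumsum(y_true[order])    # positives found so far
depth = np.arange(1, len(y_true) + 1)  # number of cases selected

# Fraction of positives captured divided by the fraction expected at random.
lift = (captured / y_true.sum()) / (depth / len(y_true))
print(lift)  # lift > 1 means the model beats random selection
```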
The more complex the model is, the more data points it will capture, and
the lower its bias will be. However, complexity will make the model "move"
more to capture the data points, and hence its variance will be larger.
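A small sketch of this tradeoff on synthetic data: a simple and a complex polynomial are refit on many noisy resamples of the same curve, and the complex one's predictions fluctuate far more (the degrees and the noise level are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 30)

# Predictions at x = 0.5 for a simple (degree 1) and a complex (degree 9) model.
preds = {1: [], 9: []}
for _ in range(200):
    # Fresh noisy sample of the same underlying curve.
    y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.size)
    for degree in preds:
        coeffs = np.polyfit(x, y, degree)
        preds[degree].append(np.polyval(coeffs, 0.5))

# The complex model tracks the data more closely (lower bias) but its
# predictions vary more across resamples (higher variance).
for degree, values in preds.items():
    print(degree, np.var(values))
```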