FDS Notes
Giovanni Ficarra
October 6, 2020
Abstract
Some essential notes about the course of Foundations of Data Science.
Most of the contents are from Doing Data Science, O’Reilly, 2014.
These notes are shared without any guarantee of complete correctness,
since I may have made typos or misunderstood something. Feel free to drop
an email at [email protected] to report errors.
1 Evaluation
The evaluation of binary classifiers compares two methods of assigning a binary
attribute (Wikipedia).
Let TP, TN, FP, FN be respectively the number of true positives, the number
of true negatives, the number of false positives and the number of false negatives;
let $Y_i$ be an observed value, $\hat{Y}_i$ the prediction for that value, and $\bar{Y}$ the mean of the observed values.
• Accuracy: How often the correct outcome is predicted; how well
a binary classification test correctly identifies or excludes a condition.
ACC = \frac{TP + TN}{TP + TN + FP + FN}
More: Wikipedia
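For instance (with made-up labels, purely to illustrate the formula), accuracy can be computed directly from the four confusion-matrix counts:

```python
# Toy binary labels (illustrative values, not from the course).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Confusion-matrix counts.
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

accuracy = (tp + tn) / (tp + tn + fp + fn)
print(accuracy)  # 0.75 for these toy labels
```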
• Negative Predictive Value: The proportion of negative results that
are true negatives.
NPV = \frac{TN}{TN + FN}
More: Wikipedia
• False Positive Rate:
FPR = 1 − TNR
• False Negative Rate:
FNR = 1 − TPR
• False Discovery Rate:
FDR = 1 − PPV
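A small sketch of the rates above, reusing toy confusion-matrix counts (illustrative values; division-by-zero guards are omitted for brevity):

```python
# Toy confusion-matrix counts.
tp, tn, fp, fn = 3, 3, 1, 1

npv = tn / (tn + fn)   # Negative Predictive Value
tpr = tp / (tp + fn)   # True Positive Rate (recall)
tnr = tn / (tn + fp)   # True Negative Rate (specificity)
ppv = tp / (tp + fp)   # Positive Predictive Value (precision)

fpr = 1 - tnr          # False Positive Rate
fnr = 1 - tpr          # False Negative Rate
fdr = 1 - ppv          # False Discovery Rate
```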
• Mean squared error: The average squared distance between the predicted
and actual values. It captures how much the predicted values deviate
from the observed ones.
MSE = \frac{1}{n}\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \frac{SSE}{n}
More: Wikipedia
• Root mean squared error: The square root of the mean squared error.
RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}
More: Wikipedia
• Mean absolute error: The average of the absolute value of the difference
between the predicted and actual values. When predicted values are plotted
against actual ones, it is also the average horizontal distance between each
point and the identity line.
MAE = \frac{1}{n}\sum_{i=1}^{n} |Y_i - \hat{Y}_i|
More: Wikipedia
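A minimal sketch computing the three error measures above with NumPy, on made-up observed and predicted values:

```python
import numpy as np

# Toy observed and predicted values (illustrative only).
y = np.array([3.0, 5.0, 2.5, 7.0])
y_hat = np.array([2.8, 5.4, 2.9, 6.4])

mse = np.mean((y - y_hat) ** 2)    # mean squared error
rmse = np.sqrt(mse)                # root mean squared error
mae = np.mean(np.abs(y - y_hat))   # mean absolute error
print(mse, rmse, mae)
```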
• R-squared (aka coefficient of determination): The proportion of the
variance in the dependent variable that is predictable from the independent
variable(s); the proportion of variance explained by our model.
R^2 = 1 - \frac{SSE}{SST} = 1 - \frac{\frac{1}{n}\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}{\frac{1}{n}\sum_{i=1}^{n} (Y_i - \bar{Y})^2}
It tells us the quality of our model by comparing it with a naive one that
ignores the $X_i$s and simply predicts the average of the $Y_i$s.
(Figure: the better the linear regression, in the right graph, fits the data in
comparison to the simple average, in the left graph, the closer the value of $R^2$
is to 1. The areas of the blue squares represent the squared residuals with respect
to the linear regression; the areas of the red squares represent the squared
residuals with respect to the average value.)
More: Wikipedia
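The same comparison in code, on the toy values used above: the model's squared residuals (SSE) against those of the naive model that always predicts the mean:

```python
import numpy as np

y = np.array([3.0, 5.0, 2.5, 7.0])
y_hat = np.array([2.8, 5.4, 2.9, 6.4])

sse = np.sum((y - y_hat) ** 2)     # residuals of our model
sst = np.sum((y - y.mean()) ** 2)  # residuals of the naive mean model
r2 = 1 - sse / sst
print(r2)  # close to 1 when the model beats the mean baseline
```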
• p-values: Take as null hypothesis that a coefficient of the line estimated
through linear regression is 0; the p-value is the probability of observing a
test statistic at least as extreme as ours under that null hypothesis.
It tells us how meaningful our model is: whether it really represents what is
happening behind the data, or whether it is only similar to the data by chance.
I.e., if the p-value relative to a certain coefficient is low, it is highly
unlikely to observe such a test statistic under the null hypothesis, and the
coefficient we computed is highly likely to be nonzero and therefore significant.
More: Wikipedia
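As an illustration of where these p-values come from in practice, a minimal ordinary-least-squares fit on synthetic data, assuming statsmodels is available:

```python
import numpy as np
import statsmodels.api as sm

# Synthetic data: y depends linearly on x plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2.0 * x + 1.0 + rng.normal(scale=1.0, size=100)

X = sm.add_constant(x)        # add the intercept column
results = sm.OLS(y, X).fit()  # ordinary least squares
print(results.pvalues)        # one p-value per coefficient: low => likely nonzero
```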
• Receiver Operating Characteristic (ROC) curve: The plot of the TPR
against the FPR of a binary classifier as its decision threshold varies.
More: Wikipedia
• Area under the ROC curve (AUC): It represents the degree of separability:
it tells us how capable the model is of distinguishing between classes (the
higher the AUC, the better the model is at predicting 0s as 0s and 1s as 1s).
More: TowardsDataScience
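A minimal sketch of both the ROC curve and the AUC, assuming scikit-learn is available (the scores below are made up):

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Toy scores from a hypothetical classifier.
y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.9, 0.5])

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # one (FPR, TPR) point per threshold
auc = roc_auc_score(y_true, y_score)
print(auc)  # 1.0 = perfect separation, 0.5 = random guessing
```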
• Area under the cumulative lift curve: It captures how many times better
it is to use the model than not to use it (i.e., than just selecting at
random).
More: paper
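A rough sketch of cumulative lift, computed by hand on the same toy scores: rank the cases by predicted score and compare the positives captured in the top k against random selection:

```python
import numpy as np

y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.9, 0.5])

order = np.argsort(-y_score)           # highest scores first
captured = np.cumsum(y_true[order])    # positives found so far
depth = np.arange(1, len(y_true) + 1)  # number of cases selected

# Fraction of positives captured divided by the fraction expected at random.
lift = (captured / y_true.sum()) / (depth / len(y_true))
print(lift)  # lift > 1 means the model beats random selection
```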
The more complex the model is, the more data points it will capture, and
the lower its bias will be. However, complexity will make the model "move"
more to capture the data points, and hence its variance will be larger.
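A small sketch of this tradeoff on synthetic data: a simple and a complex polynomial are refit on many noisy resamples of the same curve, and the complex one's predictions fluctuate far more (the degrees and the noise level are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 30)

# Predictions at x = 0.5 for a simple (degree 1) and a complex (degree 9) model.
preds = {1: [], 9: []}
for _ in range(200):
    # Fresh noisy sample of the same underlying curve.
    y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.size)
    for degree in preds:
        coeffs = np.polyfit(x, y, degree)
        preds[degree].append(np.polyval(coeffs, 0.5))

# The complex model tracks the data more closely (lower bias) but its
# predictions vary more across resamples (higher variance).
for degree, values in preds.items():
    print(degree, np.var(values))
```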