01 ML Basics
Klaus Reygers
1
Exercises
2
What is machine learning? (1)
3
What is machine learning? (2)
“Machine learning is the subfield of computer science that gives computers the ability to
learn without being explicitly programmed” – Wikipedia
“deep” in deep learning: artificial neural nets with many neurons and multiple layers of
nonlinear processing units for feature extraction
5
Multivariate analysis: An early example from particle physics
I Signal: $e^+e^- \rightarrow W^+W^-$, often 4 well separated hadron jets
I Background: $e^+e^- \rightarrow qqgg$, 4 less well separated hadron jets
I Input variables based on jet structure, event shape, . . . ; none by itself gives much separation.
7
Applying ML techniques in other fields, e.g., healthcare
"Healthcare is different - problems are not well posed and the notion of a solution is
often not well-defined and solutions are hard to verify"
Mihaela van der Schaar, ICML 2020: Automated ML and its transformative impact on medicine and healthcare
8
Some successes and unsolved problems in AI
Impressive progress in certain fields:
I Image recognition
I Speech recognition
I Recommendation systems
I Automated translation
I Analysis of medical data
Artificial neural networks have been around for decades. Why did deep learning only take off after 2012?
10
Different modeling approaches
11
Machine learning: The “hello world” problem
12
Machine learning: Image recognition
ImageNet database
I 14 million images, 22,000 categories
I Since 2010, the annual ImageNet Large Scale Visual Recognition Challenge
(ILSVRC): 1.4 million images, 1000 categories
I In 2017, 29 of 38 competing teams achieved an error rate below 5%
13
ImageNet: Large Scale Visual Recognition Challenge
14
Adversarial attack
15
Types of machine learning
16
Books on machine learning
Ian Goodfellow, Yoshua Bengio, and Aaron Courville, Deep Learning,
free online http://www.deeplearningbook.org/
17
Papers
18
Supervised learning in a nutshell
I Supervised machine learning requires labeled training data, i.e., a training sample where for each event it is known whether it is a signal or background event.
I Each event is characterized by $n$ observables: $\vec{x} = (x_1, x_2, ..., x_n)$ ("feature vector")
"All the impressive achievements of deep learning amount to just curve fitting" – Judea Pearl
20
Classification: Learning decision boundaries
21
Supervised learning: Training, validation, and test sample
22
Supervised learning: Cross validation
Rule of thumb if training data is not expensive:
I Training sample: 50%
I Validation sample: 25%
I Test sample: 25%
Often the test sample is used as the validation sample (the bias is rather small). A simple split can be done with scikit-learn as sketched below.
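A minimal sketch of such a 50/25/25 split (the feature matrix X and labels y are toy placeholders, not data from this lecture):

import numpy as np
from sklearn.model_selection import train_test_split

# toy data standing in for a real training sample
X = np.random.rand(1000, 5)          # 1000 events, 5 features
y = np.random.randint(0, 2, 1000)    # binary labels (signal/background)

# first split off 50% for training, then split the rest into validation and test (25% each)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.5, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=42)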
Cross entropy:
$$E(y(\vec{x}, \vec{w}), t) = -t \log y(\vec{x}, \vec{w}) - (1 - t) \log(1 - y(\vec{x}, \vec{w}))$$
I $t \in \{0, 1\}$
I $y(\vec{x}, \vec{w})$: predicted probability for outcome $t = 1$
I often used in classification (see the numerical sketch below)
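As an illustration (not part of the original slide), the cross entropy of a single event can be evaluated directly from the formula above:

import numpy as np

def cross_entropy(y_pred, t):
    # E(y, t) = -t log(y) - (1 - t) log(1 - y)
    return -(t * np.log(y_pred) + (1 - t) * np.log(1 - y_pred))

print(cross_entropy(0.9, 1))  # small loss: confident prediction, correct label
print(cross_entropy(0.9, 0))  # large loss: confident prediction, wrong label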
24
More on entropy
I Self-information of an event x : I(x ) = − log p(x )
I in units of nats (1 nat = information gained by observing an event of probability 1/e)
25
Hypothesis testing
test statistic
I a (usually scalar) variable which is a function of the data alone and which can be used to test hypotheses
I example: $\chi^2$ w.r.t. a theory curve
26
Neyman-Pearson Lemma
The likelihood ratio
$$t(\vec{x}) = \frac{f(\vec{x}|H_1)}{f(\vec{x}|H_0)}$$
is an optimal test statistic, i.e., it provides the highest "signal efficiency" $1 - \beta$ for a given "background efficiency" $\alpha$. Accept the hypothesis $H_1$ if $t(\vec{x}) > c$.
Two approaches:
1. Estimate the signal and background pdfs and construct a test statistic based on the Neyman-Pearson lemma
2. Determine the decision boundaries directly without approximating the pdfs (linear discriminants, decision trees, neural networks, . . . )
27
Estimating PDFs from Histograms?
$M$ bins per variable in $d$ dimensions: $M^d$ cells → hard to generate enough training data (often not practical for $d > 1$)
In general in machine learning, problems related to a large number of dimensions of the
feature space are referred to as the "curse of dimensionality"
28
Naïve Bayesian Classifier (also called “Projected Likelihood Classification”)
Application of the Neyman-Pearson lemma (ignoring correlations between the xi ):
k-NN classifier:
I Estimates probability density around the input vector
I $p(\vec{x}|S)$ and $p(\vec{x}|B)$ are approximated by the number of signal and background events in the training sample that lie in a small volume around the point $\vec{x}$
I With $k = k_s + k_b$ nearest neighbors around $\vec{x}$, the signal probability is estimated as
$$p_s(\vec{x}) = \frac{k_s(\vec{x})}{k_s(\vec{x}) + k_b(\vec{x})}$$
30
k-Nearest Neighbor Method (2)
Simplest choice for distance measure in feature space
is the Euclidean distance:
R = |~x − ~y |
The k-NN classifier has best performance when the boundary that separates signal and
background events has irregular features that cannot be easily approximated by
parametric learning methods.
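A minimal scikit-learn sketch of a k-NN classifier on toy data (not the notebook code from this lecture); predict_proba corresponds to the ratio $k_s / (k_s + k_b)$ above:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                 # toy feature vectors
y = (X[:, 0] + X[:, 1] > 0).astype(int)       # toy signal/background labels

knn = KNeighborsClassifier(n_neighbors=10)    # k = 10, Euclidean distance by default
knn.fit(X, y)
print(knn.predict_proba([[0.5, 0.5]]))        # [p(background), p(signal)] from the k nearest neighbors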
31
Fisher Linear Discriminant
A linear discriminant is simple. It can still be optimal if the amount of training data is limited.
Ansatz for the test statistic:
$$y(\vec{x}) = \sum_{i=1}^{n} w_i x_i = \vec{w}^\intercal \vec{x}$$
Choose the parameters $w_i$ so that the separation between the signal and background distributions is maximal.
Need to define "separation".
Fisher: maximize
$$J(\vec{w}) = \frac{(\tau_s - \tau_b)^2}{\Sigma_s^2 + \Sigma_b^2}$$
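In practice the coefficients need not be derived by hand; scikit-learn's LinearDiscriminantAnalysis provides a linear discriminant. A minimal sketch on toy data (not the lecture notebook):

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
X_bkg = rng.normal(0.0, 1.0, size=(500, 2))   # toy background sample
X_sig = rng.normal(1.5, 1.0, size=(500, 2))   # toy signal sample with shifted mean
X = np.vstack([X_bkg, X_sig])
y = np.concatenate([np.zeros(500), np.ones(500)])

lda = LinearDiscriminantAnalysis().fit(X, y)
print(lda.coef_)   # weights w_i of the linear discriminant y(x) = w . x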
32
Fisher Linear Discriminant: Determining the Coefficients wi
33
Linear regression revisited
$$J(\theta | x, y) = \frac{1}{N} \sum_{i=1}^{N} (y_i - f(x_i))^2$$
I model training: optimal parameters
$$\hat{\vec{\theta}} = \arg\min_{\vec{\theta}} J(\vec{\theta})$$
[Figure: data points and fitted straight line; x-axis: Father's height (inches)]
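A minimal sketch of such a straight-line fit with NumPy; the height values below are made up for illustration and are not the data set shown on the slide:

import numpy as np

# hypothetical father/son heights in inches (illustrative only)
father = np.array([62.0, 65.0, 67.5, 70.0, 72.5, 75.0, 78.0])
son    = np.array([64.0, 66.5, 68.0, 69.5, 71.0, 72.0, 74.0])

# least-squares fit of f(x) = theta_1 * x + theta_0, i.e. minimizing J(theta)
theta_1, theta_0 = np.polyfit(father, son, 1)
print(f"slope = {theta_1:.2f}, intercept = {theta_0:.2f}")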
34
Linear regression
I Data: vectors with $p$ components ("features"): $\vec{x} = (x_1, ..., x_p)$
I $n$ observations: $\{\vec{x}_i, y_i\}$, $i = 1, ..., n$
I Prediction for a given vector $\vec{x}$:
$$y = w_0 + w_1 x_1 + w_2 x_2 + ... + w_p x_p \equiv \vec{w}^\intercal \vec{x} \quad \text{where } x_0 := 1$$
I Least-squares solution:
$$\hat{\vec{w}} = (X^\intercal X)^{-1} X^\intercal \vec{y} \quad \text{where } X \in \mathbb{R}^{n \times p}$$
I Ridge regression:
$$C(\vec{w}) = \sum_{i=1}^{n} (\vec{w}^\intercal \vec{x}_i - y_i)^2 + \lambda |\vec{w}|^2$$
I LASSO regression:
$$C(\vec{w}) = \sum_{i=1}^{n} (\vec{w}^\intercal \vec{x}_i - y_i)^2 + \lambda |\vec{w}|$$
LASSO regression tends to give sparse solutions (many components $w_j = 0$). This is why LASSO regression is also called sparse regression.
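A minimal scikit-learn sketch contrasting the two penalties on toy data (note that scikit-learn calls the regularization strength alpha rather than $\lambda$):

import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=100)   # only two features are informative

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)
print(ridge.coef_)   # all coefficients shrunk but typically non-zero
print(lasso.coef_)   # sparse: most coefficients exactly zero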
36
Logistic regression (1)
$$s = w_0 + w_1 x_1 + w_2 x_2 + ... + w_p x_p \equiv \vec{w}^\intercal \vec{x}$$
I Define a function that translates $s$ into a quantity that has the properties of a probability:
$$\sigma(s) = \frac{1}{1 + e^{-s}}$$
I We would like to determine the optimal weights for a given training data set. They
result from the maximum-likelihood principle.
37
Logistic regression (2)
I Consider a feature vector $\vec{x}$. For a given set of weights $\vec{w}$ the model predicts
I a probability $p(1|\vec{w}) = \sigma(\vec{w}^\intercal \vec{x})$ for outcome $y = 1$
I a probability $p(0|\vec{w}) = 1 - \sigma(\vec{w}^\intercal \vec{x})$ for outcome $y = 0$
I The probability $p(y_i|\vec{w})$ defines the likelihood $L_i(\vec{w}) = p(y_i|\vec{w})$ (the likelihood is a function of the parameters $\vec{w}$; the observations $y_i$ are fixed).
I Likelihood for the full data sample ($n$ observations):
$$L(\vec{w}) = \prod_{i=1}^{n} L_i(\vec{w}) = \prod_{i=1}^{n} \sigma(\vec{w}^\intercal \vec{x}_i)^{y_i} \, (1 - \sigma(\vec{w}^\intercal \vec{x}_i))^{1 - y_i}$$
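Maximizing $L(\vec{w})$ is equivalent to minimizing the negative log-likelihood, which is just the summed cross entropy. A minimal numerical sketch on toy data (not from the lecture):

import numpy as np

def sigma(s):
    return 1.0 / (1.0 + np.exp(-s))

def neg_log_likelihood(w, X, y):
    # -log L(w) = -sum_i [ y_i log sigma(w.x_i) + (1 - y_i) log(1 - sigma(w.x_i)) ]
    p = sigma(X @ w)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

X = np.array([[1.0, 0.5], [1.0, 2.0], [1.0, -1.0]])   # first column = 1 accounts for the bias w_0
y = np.array([0, 1, 0])
print(neg_log_likelihood(np.array([0.0, 1.0]), X, y))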
39
Example 1 - Probability of passing an exam (logistic regression) (1)
Objective: predict the probability that someone passes an exam based on the number of
hours studying
$$p_\text{pass} = \sigma(s) = \frac{1}{1 + e^{-s}}, \qquad s = w_1 t + w_0, \quad t = \text{number of hours studied}$$
I Data set: preparation time $t$ in hours and whether the exam was passed
Calculate predictions:
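A minimal sketch with made-up study times and outcomes (the actual data set from the slide is not reproduced here):

import numpy as np
from sklearn.linear_model import LogisticRegression

hours  = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]).reshape(-1, 1)
passed = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])   # hypothetical pass/fail outcomes

clf = LogisticRegression().fit(hours, passed)
w1, w0 = clf.coef_[0, 0], clf.intercept_[0]
t = 3.0   # hours of preparation
print(1.0 / (1.0 + np.exp(-(w1 * t + w0))))   # p_pass = sigma(w1 * t + w0)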
41
Precision and recall
Precision: fraction of correctly classified instances among all instances that obtain a certain class label ("purity").
$$\text{precision} = \frac{TP}{TP + FP}$$
Recall: fraction of positive instances that are correctly classified ("efficiency").
$$\text{recall} = \frac{TP}{TP + FN}$$
42
Example 2: Heart disease data set (logistic regression) (1)
Read data:
import pandas as pd

filename = "https://www.physi.uni-heidelberg.de/~reygers/lectures/2021/ml/data/heart.csv"
df = pd.read_csv(filename)
df
03_ml_basics_log_regr_heart_disease.ipynb
43
Example 2: Heart disease data set (logistic regression) (2)
Define array of labels and feature vectors
y = df['target'].values
X = df[[col for col in df.columns if col!="target"]]
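The notebook code for the following steps is not reproduced on the slides; a minimal sketch of splitting the data and fitting a logistic regression (variable names are assumptions):

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)
log_reg = LogisticRegression(max_iter=1000)   # larger max_iter to ensure convergence
log_reg.fit(x_train, y_train)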
44
Example 2: Heart disease data set (logistic regression) (3)
Test predictions on the test data set:
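A minimal sketch of this step, assuming the fitted log_reg and the test arrays from the sketch above (the notebook's actual output is not reproduced here):

from sklearn.metrics import accuracy_score

y_pred = log_reg.predict(x_test)
print(f"accuracy: {accuracy_score(y_test, y_pred):0.2f}")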
45
Example 2: Heart disease data set (logistic regression) (4)
Compare to another classifier using the receiver operating characteristic (ROC) curve.
Let’s take the random forest classifier
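A minimal sketch of such a comparison, assuming the fitted log_reg and the train/test arrays from the previous sketches:

import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_curve, roc_auc_score

rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(x_train, y_train)

for name, clf in [("logistic regression", log_reg), ("random forest", rf)]:
    scores = clf.predict_proba(x_test)[:, 1]          # predicted signal probabilities
    fpr, tpr, _ = roc_curve(y_test, scores)
    plt.plot(fpr, tpr, label=f"{name}, AUC = {roc_auc_score(y_test, scores):0.2f}")
plt.xlabel("false positive rate")
plt.ylabel("true positive rate")
plt.legend()
plt.show()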
46
Example 2: Heart disease data set (logistic regression) (5)
Classifiers can be compared with the area under curve (AUC) score. This gives AUC scores: 0.82, 0.83.
[Figure: curves comparing the two classifiers; y-axis: precision]
47
Multinomial logistic regression: Softmax function
In the previous example we considered two classes (0, 1). For multi-class classification, the logistic function can be generalized to the softmax function.
Now consider $k$ classes and let $s_i$ be the score for class $i$: $\vec{s} = (s_1, ..., s_k)$. The softmax function maps the scores to probabilities:
$$\sigma(\vec{s})_i = \frac{e^{s_i}}{\sum_{j=1}^{k} e^{s_j}}$$
The softmax function is often used as the last activation function of a neural network in order to predict probabilities in a classification task.
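As an illustration, a direct NumPy implementation of the softmax function:

import numpy as np

def softmax(s):
    # subtract the maximum score for numerical stability; the result is unchanged
    exp_s = np.exp(s - np.max(s))
    return exp_s / exp_s.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))   # class probabilities that sum to 1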
48
Example 3: Iris data set (softmax regression) (1)
Iris flower data set
I Introduced in 1936 in a paper by Ronald Fisher
I Task: classify flowers
I Three species: iris setosa, iris virginica, and iris versicolor
I Four features: petal width and length, sepal width and length, in centimeters
03_ml_basics_iris_softmax_regression.ipynb
https://archive.ics.uci.edu/ml/datasets/Iris
https://en.wikipedia.org/wiki/Iris_flower_data_set
49
Example 3: Iris data set (softmax regression) (2)
Get data set
Softmax regression
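The notebook code is not reproduced on the slide; a minimal sketch of these two steps with scikit-learn (the exact options in the notebook may differ):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

iris = load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.5, random_state=42)

# multinomial logistic regression = softmax regression
log_reg = LogisticRegression(multi_class="multinomial", max_iter=1000)
log_reg.fit(x_train, y_train)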
50
Example 3 : Iris data set (softmax regression) (3)
Accuracy and confusion matrix for different classifiers
for clf in [log_reg, kn_neigh, fisher_ld]:
    y_pred = clf.predict(x_test)
    acc = accuracy_score(y_test, y_pred)
    print(type(clf).__name__)
    print(f"accuracy: {acc:0.2f}")
    # confusion matrix: rows: true class, columns: predicted class
    print(confusion_matrix(y_test, y_pred), "\n")

Output:

LogisticRegression
accuracy: 0.96
[[29  0  0]
 [ 0 23  0]
 [ 0  3 20]]

KNeighborsClassifier
accuracy: 0.95
[[29  0  0]
 [ 0 23  0]
 [ 0  4 19]]

LinearDiscriminantAnalysis
accuracy: 0.99
[[29  0  0]
 [ 0 23  0]
 [ 0  1 22]]
51
General remarks on multi-variate analyses (MVAs)
I MVA Methods
I More effective than classic cut-based analyses
I Take correlations of input variables into account
I Pre-processing
I Apply obvious variable transformations and let MVA method do the rest
I Make use of obvious symmetries: if e.g. a particle production process is symmetric in
polar angle θ use | cos θ| and not cos θ as input variable
I It is generally useful to bring all input variables to a similar numerical range (see the sketch below)
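A minimal sketch of such a rescaling with scikit-learn's StandardScaler (toy numbers for illustration):

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])                    # two features with very different numerical ranges
X_scaled = StandardScaler().fit_transform(X)    # zero mean and unit variance per feature
print(X_scaled)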
52
Example of feature transformation
53
Possible topics for more in-depth talks/discussion
54
Exercise 1: Classification of air showers measured with the MAGIC
telescope
I Cosmic gamma rays (30 GeV - 30 TeV)
I Cherenkov light from air showers
I Background: air showers caused by hadrons
55
Exercise 1: Classification of air showers measured with the MAGIC
telescope
57
Exercise 1: Classification of air showers measured with the MAGIC
telescope
MAGIC data set
https://archive.ics.uci.edu/ml/datasets/magic+gamma+telescope
58
Exercise 1: Classification of air showers measured with the MAGIC
telescope
03_ml_basics_ex_1_magic.ipynb
a) For each variable, create a figure with the distributions for gammas and hadrons overlaid.
b) Create training and test data set. The test data should amount to 50% of the total
data set.
c) Define the logistic regressor and fit the training data
d) Determine the model accuracy and the AUC score
e) Plot the ROC curve (background rejection vs signal efficiency)
59
Exercise 2: Hand-written digit recognition with logistic regression
03_ml_basics_ex_2_mnist_softmax_regression.ipynb
a) Define logistic regressor from scikit-learn and fit data
b) Use classification_report from scikit-learn to determine precision and recall
c) Read in a hand-written digit and classify it. Print the probabilities for each digit.
Determine the digit with the highest probability.
d) (Optional) Create your own hand-written digit with a program like gimp and check what the classifier does.
Hint: You can install required packages on the jupyter hub server like so:
!pip3 install --user pypng
60
Exercise 3: Data preprocessing
61