
Introduction to Machine Learning in Physics:

1. Machine Learning Basics

Klaus Reygers

Seminar on Advanced Analysis Methods for Heavy-Ion Data, SS 2021

1
Exercises

I Exercise 1: Air shower classification (MAGIC telescope)
  I Logistic regression
  I 03_ml_basics_ex01_magic.ipynb
I Exercise 2: Hand-written digit recognition with logistic regression
  I Logistic regression
  I 03_ml_basics_ex02_mnist_softmax_regression.ipynb

2
What is machine learning? (1)

3
What is machine learning? (2)
“Machine learning is the subfield of computer science that gives computers the ability to
learn without being explicitly programmed” – Wikipedia

Example: spam detection (J. Mayes, Machine learning 101)

Manual feature engineering vs. automatic feature detection
4


AI, ML, and DL
“AI is the study of how to make computers perform things that, at the moment, people
do better.” Elaine Rich, Artificial intelligence, McGraw-Hill 1983
G. Marcus, E. Davis, Rebooting AI

“deep” in deep learning: artificial neural nets with many neurons and multiple layers of
nonlinear processing units for feature extraction
5
Multivariate analysis: An early example from particle physics
I Signal: e⁺e⁻ → W⁺W⁻
  I often 4 well-separated hadron jets
I Background: e⁺e⁻ → qqgg
  I 4 less well-separated hadron jets
I Input variables based on jet structure, event shape, ... none by itself gives much separation.

(Garrido, Juste and Martinez, ALEPH 96-144)


6
Applications of machine learning in physics

I Particle physics: Particle identification / classification


I Astronomy: Galaxy morphology classification
I Chemistry and material science: predict properties of new molecules / materials
I Many-body quantum matter: classification of quantum phases

Machine learning and the physical sciences, arXiv:1903.10563

7
Applying ML techniques in other fields, e.g., healthcare

"ML has accomplished wonders . . . on well posed-problems where the notion of a


‘solution’ is well-defined and solutions are verifyable.

Healthcare is different - problems are not well posed and the notion of a solution is
often not well-defined and solutions are hard to verify"
Mihaela van der Schaar, ICML 2020: Automated ML and its transformative impact on medicine and healthcare

I believe for many interesting problems in physics the situation is similar.

8
Some successes and unsolved problems in AI
Impressive progress in certain fields:
I Image recognition
I Speech recognition
I Recommendation systems
I Automated translation
I Analysis of medical data

How can we profit from these developments in physics?

M. Woolridge, The road to conscious machines


9
The deep learning hype – why now?

Artificial neural networks have been around for decades. Why did deep learning take off after
2012?

I Improved hardware – graphical processing units [GPUs]


I Large data sets (e.g. images) distributed via the Internet
I Algorithmic advances

10
Different modeling approaches

I Simple mathematical representation like linear regression. Favored by statisticians.


I Complex deterministic models based on scientific understanding of the physical
process. Favored by physicists.
I Complex algorithms to make predictions that are derived from a huge number of
past examples (“machine learning” as developed in the field of computer science).
These are often black boxes.
I Regression models that claim to reach causal conclusions. Used by economists.
D. Spiegelhalter, The Art of Statistics – Learning from data

11
Machine learning: The “hello world” problem

Recognition of handwritten digits


I MNIST database (Modified
National Institute of Standards
and Technology database)
I 60,000 training images and 10,000
testing images labeled with correct
answer
I 28 × 28 pixels
I Algorithms have reached "near-human performance"
https://en.wikipedia.org/wiki/MNIST_database

I Smallest error rate (2018): 0.18%

12
Machine learning: Image recognition
ImageNet database
I 14 million images, 22,000 categories
I Since 2010, the annual ImageNet Large Scale Visual Recognition Challenge
(ILSVRC): 1.4 million images, 1000 categories
I In 2017, 29 of 38 competing teams got less than 5% wrong

13
ImageNet: Large Scale Visual Recognition Challenge

O. Russakovsky et al, arXiv:1409.0575

14
Adversarial attack

Ian J. Goodfellow, Jonathon Shlens, Christian Szegedy, arXiv:1412.6572v1

15
Types of machine learning

(LeCun 2018, Power And Limits of Deep Learning)

Reinforcement learning
I The machine ("the agent") predicts a scalar reward given once in a while
I Weak feedback

Supervised learning
I The machine predicts a category based on labeled training data
I Medium feedback

Unsupervised learning
I Describe/find hidden structure from "unlabeled" data
I Cluster data in different sub-groups with similar properties

16
Books on machine learning
Ian Goodfellow, Yoshua Bengio and Aaron Courville, Deep Learning, free online: http://www.deeplearningbook.org/

Aurelien Geron, Hands-On Machine Learning with Scikit-Learn and TensorFlow

Francois Chollet, Deep Learning with Python

17
Papers

A high-bias, low-variance introduction to Machine Learning for physicists


https://arxiv.org/abs/1803.08823

Machine learning and the physical sciences


https://arxiv.org/abs/1903.10563

A Living Review of Machine Learning for Particle Physics


https://iml-wg.github.io/HEPML-LivingReview/

18
Supervised learning in a nutshell
I Supervised Machine Learning requires labeled training data, i.e., a training sample where for each event it is known whether it is a signal or background event.
I Each event is characterized by n observables: x⃗ = (x₁, x₂, ..., xₙ), the "feature vector"
I Design a function y(x⃗, w⃗) with adjustable parameters w⃗
I Design a loss function
I Find the best parameters which minimize the loss
19
Supervised learning: classification and regression

The codomain Y of the function y: X → Y can be a set of labels or classes or a continuous domain, e.g., ℝ

I Y = finite set of labels → classification
  I binary classification: Y = {0, 1}
  I multi-class classification: Y = {c₁, c₂, ..., cₙ}
I Y = real numbers → regression

"All the impressive achievements of deep learning amount to just curve fitting"

J. Pearl, Turing Award Winner 2011


To Build Truly Intelligent Machines, Teach Them Cause and Effect, Quantamagazine

20
Classification: Learning decision boundaries

21
Supervised learning: Training, validation, and test sample

I Decision boundary fixed with training sample
I Performance on training sample becomes better with more iterations
I Danger of overtraining: statistical fluctuations of the training sample will be learnt
I Validation sample = independent labeled data set not used for training → check for overtraining
I Sign of overtraining: performance on validation sample becomes worse → stop training when signs of overtraining are observed (early stopping)
I Performance: apply classifier to independent test sample
I Often: test sample = validation sample (only small bias)

22
Supervised learning: Cross validation
Rule of thumb if training data is not expensive:
I Training sample: 50%
I Validation sample: 25%
I Test sample: 25%
(Often test sample = validation sample; the bias is rather small.)

Cross validation (efficient use of scarce training data):
I Split the training sample into k independent subsets Tₖ of the full sample T
I Train on T \ Tₖ, resulting in k different classifiers
I For each training event there is one classifier that didn't use this event for training
I Validation results are then combined (see the sketch below)
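
A minimal scikit-learn sketch of k-fold cross validation (the toy data, the classifier choice, and k = 5 are illustrative assumptions):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X = np.random.rand(100, 3)                  # toy feature vectors
y = (X[:, 0] + X[:, 1] > 1).astype(int)     # toy labels

kfold = KFold(n_splits=5, shuffle=True, random_state=0)
clf = LogisticRegression()

# each fold: train on T \ T_k, evaluate on the held-out subset T_k
scores = cross_val_score(clf, X, y, cv=kfold)
print("accuracy per fold:", scores, " mean:", scores.mean())
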
23
Often used loss functions

Square error loss:

  E(y(x⃗, w⃗), t) = (y(x⃗, w⃗) − t)²

  I often used in regression

Cross entropy:

  E(y(x⃗, w⃗), t) = −t log y(x⃗, w⃗) − (1 − t) log(1 − y(x⃗, w⃗))

  I t ∈ {0, 1}
  I y(x⃗, w⃗): predicted probability for outcome t = 1
  I often used in classification
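
A small numpy sketch of the two loss functions for a single prediction (function names and numbers are illustrative):

import numpy as np

def squared_error(y_pred, t):
    # square error loss, typically used in regression
    return (y_pred - t) ** 2

def cross_entropy(y_pred, t, eps=1e-12):
    # cross-entropy loss for a binary target t in {0, 1};
    # y_pred is the predicted probability for the outcome t = 1
    y_pred = np.clip(y_pred, eps, 1.0 - eps)   # avoid log(0)
    return -t * np.log(y_pred) - (1 - t) * np.log(1 - y_pred)

print(squared_error(0.8, 1.0))   # ≈ 0.04
print(cross_entropy(0.8, 1.0))   # ≈ 0.22
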

24
More on entropy
I Self-information of an event x: I(x) = − log p(x)
  I in units of nats (1 nat = information gained by observing an event of probability 1/e)
I Shannon entropy: H(P) = − Σᵢ pᵢ log pᵢ
  I Expected amount of information in an event drawn from a distribution P
  I Measure of the minimum amount of bits needed on average to encode symbols drawn from a distribution
I Cross entropy: H(P, Q) = −E_P[log Q] = − Σᵢ pᵢ log qᵢ
  I Can be interpreted as a measure of the amount of bits needed when a wrong distribution Q is assumed while the data actually follows a distribution P
  I Measure of dissimilarity between distributions P and Q (i.e., a measure of how well the model Q describes the true distribution P)
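
A quick numerical check with numpy (the two discrete distributions are made up; natural logs, i.e., nats):

import numpy as np

p = np.array([0.7, 0.2, 0.1])   # "true" distribution P
q = np.array([0.5, 0.3, 0.2])   # model distribution Q

H_p  = -np.sum(p * np.log(p))   # Shannon entropy H(P)
H_pq = -np.sum(p * np.log(q))   # cross entropy H(P, Q)

print(H_p, H_pq)   # H(P, Q) >= H(P), with equality only if Q = P
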

25
Hypothesis testing

Test statistic:
I a (usually scalar) variable which is a function of the data alone that can be used to test hypotheses
I example: χ² w.r.t. a theory curve

εB ≡ α: "background efficiency", i.e., the probability to misclassify background as signal
εS ≡ 1 − β: "signal efficiency"

                     H0 is true                  H0 is false (i.e., H1 is true)
H0 is rejected       Type I error (α)            Correct decision (1 − β)
H0 is not rejected   Correct decision (1 − α)    Type II error (β)

26
Neyman-Pearson Lemma
The likelihood ratio

  t(x⃗) = f(x⃗|H1) / f(x⃗|H0)

is an optimal test statistic, i.e., it provides the highest "signal efficiency" 1 − β for a given
"background efficiency" α. Accept the hypothesis if t(x⃗) > c.

Problem: the underlying pdf’s are almost never known explicitly.

Two approaches:
1. Estimate the signal and background pdf's and construct a test statistic based on the Neyman-Pearson lemma
2. Determine decision boundaries directly without approximating the pdf's (linear discriminants, decision trees, neural networks, ...)
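
A toy illustration of approach 1 for a case where the pdf's are known exactly, using two 1D Gaussians (all numbers are illustrative):

import numpy as np
from scipy.stats import norm

sig = norm(loc=1.0, scale=1.0)    # f(x|H1), signal pdf
bkg = norm(loc=-1.0, scale=1.0)   # f(x|H0), background pdf

def likelihood_ratio(x):
    return sig.pdf(x) / bkg.pdf(x)

x = np.array([-2.0, 0.0, 2.0])
print(likelihood_ratio(x))   # accept H1 where t(x) > c for a chosen cut value c
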

27
Estimating PDFs from Histograms?

Approximate the PDF by N(x, y|S) and N(x, y|B).

M bins per variable in d dimensions: Mᵈ cells → hard to generate enough training data (often not practical for d > 1)
In general in machine learning, problems related to a large number of dimensions of the
feature space are referred to as the "curse of dimensionality"
28
Naïve Bayesian Classifier (also called “Projected Likelihood Classification”)
Application of the Neyman-Pearson lemma (ignoring correlations between the xᵢ):

  f(x₁, x₂, ..., xₙ) approximated as L = f₁(x₁) · f₂(x₂) · ... · fₙ(xₙ)

  where f₁(x₁) = ∫ dx₂ dx₃ ... dxₙ f(x₁, x₂, ..., xₙ)
        f₂(x₂) = ∫ dx₁ dx₃ ... dxₙ f(x₁, x₂, ..., xₙ)
        ...

Classification of a feature vector x⃗:

  y(x⃗) = Ls(x⃗) / (Ls(x⃗) + Lb(x⃗)) = 1 / (1 + Lb(x⃗)/Ls(x⃗))

Performance not optimal if the true PDF does not factorize
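
A rough sketch of a projected-likelihood classifier with 1D histograms as estimates of the marginal pdf's (toy Gaussian data; sample size and binning are arbitrary choices):

import numpy as np

rng = np.random.default_rng(0)
n, nbins = 10000, 50
X_sig = rng.normal(loc=+0.5, scale=1.0, size=(n, 2))   # signal training sample
X_bkg = rng.normal(loc=-0.5, scale=1.0, size=(n, 2))   # background training sample

edges = np.linspace(-5, 5, nbins + 1)
hist_s = [np.histogram(X_sig[:, i], bins=edges, density=True)[0] for i in range(2)]
hist_b = [np.histogram(X_bkg[:, i], bins=edges, density=True)[0] for i in range(2)]

def likelihood(x, hists):
    # L = f1(x1) * f2(x2), each f_i taken from a 1D histogram
    L = 1.0
    for i, h in enumerate(hists):
        j = np.clip(np.searchsorted(edges, x[i]) - 1, 0, nbins - 1)
        L *= h[j]
    return L

def y_score(x):
    Ls, Lb = likelihood(x, hist_s), likelihood(x, hist_b)
    return Ls / (Ls + Lb)

print(y_score(np.array([1.0, 1.0])))   # > 0.5, i.e., signal-like
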


29
k-Nearest Neighbor Method (1)

k-NN classifier:
I Estimates the probability density around the input vector
I p(x⃗|S) and p(x⃗|B) are approximated by the number of signal and background events in the training sample that lie in a small volume around the point x⃗

The algorithm finds the k nearest neighbors:

  k = ks + kb

Probability for the event to be of signal type:

  ps(x⃗) = ks(x⃗) / (ks(x⃗) + kb(x⃗))

30
k-Nearest Neighbor Method (2)
Simplest choice for the distance measure in feature space is the Euclidean distance:

  R = |x⃗ − y⃗|

Better: take correlations between variables into account:

  R = √((x⃗ − y⃗)ᵀ V⁻¹ (x⃗ − y⃗))

V = covariance matrix, R = "Mahalanobis distance"

The k-NN classifier has best performance when the boundary that separates signal and
background events has irregular features that cannot be easily approximated by
parametric learning methods.
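
A sketch of a k-NN classifier with a Mahalanobis metric in scikit-learn (toy correlated Gaussian data; k = 10 and the use of metric_params={'VI': ...} with brute-force search are assumptions of this sketch):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)
cov = [[1.0, 0.6], [0.6, 1.0]]
X_sig = rng.multivariate_normal([+1, +1], cov, size=500)
X_bkg = rng.multivariate_normal([-1, -1], cov, size=500)
X = np.vstack([X_sig, X_bkg])
y = np.concatenate([np.ones(500), np.zeros(500)])

VI = np.linalg.inv(np.cov(X, rowvar=False))   # inverse covariance matrix V^-1
knn = KNeighborsClassifier(n_neighbors=10, metric='mahalanobis',
                           metric_params={'VI': VI}, algorithm='brute')
knn.fit(X, y)

# predict_proba corresponds to (k_b/k, k_s/k), i.e., column 1 is p_s = k_s / (k_s + k_b)
print(knn.predict_proba([[0.5, 0.5]]))
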

31
Fisher Linear Discriminant
A linear discriminant is simple. It can still be optimal if the amount of training data is limited.

Ansatz for the test statistic:

  y(x⃗) = Σᵢ₌₁ⁿ wᵢ xᵢ = w⃗ᵀx⃗

Choose the parameters wᵢ so that the separation between the signal and background distributions is maximal. Need to define "separation".

Fisher: maximize

  J(w⃗) = (τs − τb)² / (Σs² + Σb²)

where τs, τb are the means and Σs², Σb² the variances of the y distribution for signal and background.

32
Fisher Linear Discriminant: Determining the Coefficients wi

The coefficients are obtained from:

  ∂J/∂wᵢ = 0

This gives linear decision boundaries.

The weight vector w⃗ can be interpreted as a direction in feature space onto which the events are projected.
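
A direct numpy computation of the Fisher weight vector on toy data, using the standard solution w⃗ ∝ W⁻¹(μs − μb) with W the sum of the signal and background covariance matrices (up to normalization this is essentially what sklearn's LinearDiscriminantAnalysis does for two classes):

import numpy as np

rng = np.random.default_rng(2)
cov = [[1.0, 0.3], [0.3, 1.0]]
X_sig = rng.multivariate_normal([+1.0, +0.5], cov, size=1000)
X_bkg = rng.multivariate_normal([-1.0, -0.5], cov, size=1000)

mu_s, mu_b = X_sig.mean(axis=0), X_bkg.mean(axis=0)
W = np.cov(X_sig, rowvar=False) + np.cov(X_bkg, rowvar=False)
w = np.linalg.solve(W, mu_s - mu_b)    # Fisher weight vector, w ∝ W^-1 (mu_s - mu_b)

y_sig, y_bkg = X_sig @ w, X_bkg @ w    # projections y(x) = w^T x
J = (y_sig.mean() - y_bkg.mean())**2 / (y_sig.var() + y_bkg.var())
print("weights:", w, " separation J:", J)
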

33
Linear regression revisited

"Galton family heights data": I data: {xi , yi }


origin of the term "regression"
I objective: predict y = f (x )
80 linear fit ~ = mx + b, θ~ = (m, b)
I model: f (x ; θ)
y=x
75 I loss function:
Son's height (inches)

J(θ|x , y ) = N1 N
i=1 (yi − f (xi ))
2
P
70

65
I model training: optimal parameters
ˆ
θ~ = arg min J(θ)
~
60

55
60.0 62.5 65.0 67.5 70.0 72.5 75.0 77.5 80.0
Father's height (inches)

34
Linear regression
I Data: vectors with p components ("features"): x⃗ = (x₁, ..., xₚ)
I n observations: {x⃗ᵢ, yᵢ}, i = 1, ..., n
I Prediction for a given vector x⃗:

  y = w₀ + w₁x₁ + w₂x₂ + ... + wₚxₚ ≡ w⃗ᵀx⃗    where x₀ := 1

I Find the weights that minimize the loss function:

  ŵ = arg min_w⃗ Σᵢ₌₁ⁿ (w⃗ᵀx⃗ᵢ − yᵢ)²

I In the case of linear regression a closed-form solution exists:

  ŵ = (XᵀX)⁻¹ Xᵀ y⃗    where X ∈ ℝ^(n×p)

I X is called the design matrix; row i of X is x⃗ᵢ
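
A numpy sketch of the closed-form solution on toy data (in practice np.linalg.lstsq or scikit-learn is numerically preferable):

import numpy as np

rng = np.random.default_rng(3)
n, p = 100, 2
X_feat = rng.normal(size=(n, p))
y = 1.0 + 2.0 * X_feat[:, 0] - 3.0 * X_feat[:, 1] + 0.1 * rng.normal(size=n)

X = np.hstack([np.ones((n, 1)), X_feat])   # design matrix with x0 := 1
w_hat = np.linalg.inv(X.T @ X) @ X.T @ y   # (X^T X)^-1 X^T y
print(w_hat)                               # approximately [1.0, 2.0, -3.0]
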


35
Linear regression with regularization

I Standard loss function:

  C(w⃗) = Σᵢ₌₁ⁿ (w⃗ᵀx⃗ᵢ − yᵢ)²

I Ridge regression:

  C(w⃗) = Σᵢ₌₁ⁿ (w⃗ᵀx⃗ᵢ − yᵢ)² + λ|w⃗|²

I LASSO regression:

  C(w⃗) = Σᵢ₌₁ⁿ (w⃗ᵀx⃗ᵢ − yᵢ)² + λ|w⃗|

LASSO regression tends to give sparse solutions (many components wⱼ = 0). This is why LASSO regression is also called sparse regression.

[Figure: constraint regions for LASSO and ridge regression in the (w₁, w₂) plane, illustrating why LASSO tends to produce sparse solutions]
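
A short scikit-learn comparison of ridge and LASSO on toy data (the regularization strength alpha, which plays the role of λ, is chosen arbitrarily):

import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=200)   # only 2 relevant features

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print("ridge:", np.round(ridge.coef_, 3))   # coefficients shrunk, but generally non-zero
print("lasso:", np.round(lasso.coef_, 3))   # irrelevant coefficients set exactly to zero
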

36
Logistic regression (1)

I Consider a binary classification task, e.g., yᵢ ∈ {0, 1}
I Objective: predict the probability for outcome y = 1 given an observation x⃗
I Start with a linear "score":

  s = w₀ + w₁x₁ + w₂x₂ + ... + wₚxₚ ≡ w⃗ᵀx⃗

I Define a function that translates s into a quantity that has the properties of a probability:

  σ(s) = 1 / (1 + e⁻ˢ)

I We would like to determine the optimal weights for a given training data set. They result from the maximum-likelihood principle.

37
Logistic regression (2)
I Consider a feature vector x⃗. For a given set of weights w⃗ the model predicts
  I a probability p(1|w⃗) = σ(w⃗ᵀx⃗) for outcome y = 1
  I a probability p(0|w⃗) = 1 − σ(w⃗ᵀx⃗) for outcome y = 0
I The probability p(yᵢ|w⃗) defines the likelihood Lᵢ(w⃗) = p(yᵢ|w⃗) (the likelihood is a function of the parameters w⃗; the observations yᵢ are fixed).
I Likelihood for the full data sample (n observations):

  L(w⃗) = Πᵢ₌₁ⁿ Lᵢ(w⃗) = Πᵢ₌₁ⁿ σ(w⃗ᵀx⃗ᵢ)^yᵢ (1 − σ(w⃗ᵀx⃗ᵢ))^(1−yᵢ)

I Maximizing the log-likelihood ln L(w⃗) corresponds to minimizing the loss function

  C(w⃗) = − ln L(w⃗) = Σᵢ₌₁ⁿ [ −yᵢ ln σ(w⃗ᵀx⃗ᵢ) − (1 − yᵢ) ln(1 − σ(w⃗ᵀx⃗ᵢ)) ]

I This is nothing else but the cross-entropy loss function
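
A minimal numpy sketch of this maximum-likelihood fit via gradient descent on the cross-entropy loss (toy data; learning rate and number of steps are arbitrary):

import numpy as np

rng = np.random.default_rng(5)
n = 500
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, 2))])   # x0 := 1 carries the bias w0
w_true = np.array([-0.5, 2.0, -1.0])
y = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-X @ w_true))).astype(float)

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

w, eta = np.zeros(3), 0.1
for _ in range(2000):
    p = sigmoid(X @ w)
    grad = X.T @ (p - y) / n   # gradient of the average cross-entropy loss
    w -= eta * grad

print(w)   # should be close to w_true
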


38
scikit-learn

I Free software machine learning library for Python


I Initial release: 2007
I features various classification, regression and clustering
algorithms including k-nearest neighbors, multi-layer
perceptrons, support vector machines, random forests,
gradient boosting, k-means
I Scikit-learn is one of the most popular machine learning
libraries on GitHub
I https://scikit-learn.org/

39
Example 1 - Probability of passing an exam (logistic regression) (1)
Objective: predict the probability that someone passes an exam based on the number of hours studied

  p_pass = σ(s) = 1 / (1 + e⁻ˢ),    s = w₁t + w₀,    t = # hours

I Data set:
  I preparation time t in hours
  I passed / not passed (0/1)
I Parameters need to be determined through numerical minimization:
  I w₀ = −4.0777
  I w₁ = 1.5046

[Figure: probability of passing the exam vs. preparation time in hours, with the fitted logistic curve]

03_ml_basics_logistic_regression.ipynb
40
Example 1 - Probability of passing an exam (logistic regression) (2)
Read data from file:

import numpy as np
import pandas as pd

# data: 1. hours studied, 2. passed (0/1)
df = pd.read_csv(filename, engine='python', sep='\s+')
x_tmp = df['hours_studied'].values
x = np.reshape(x_tmp, (-1, 1))
y = df['passed'].values

Fit the data:

from sklearn.linear_model import LogisticRegression


clf = LogisticRegression(penalty='none', fit_intercept=True)
clf.fit(x, y);

Calculate predictions:

hours_studied_tmp = np.linspace(0., 6., 1000)


hours_studied = np.reshape(hours_studied_tmp, (-1, 1))
y_pred = clf.predict_proba(hours_studied)

41
Precision and recall

Precision ("purity"):
Fraction of correctly classified instances among all instances that obtain a certain class label.

  precision = TP / (TP + FP)

Recall ("efficiency"):
Fraction of positive instances that are correctly classified.

  recall = TP / (TP + FN)

TP: true positives, FP: false positives, FN: false negatives
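
A quick check with made-up predictions, computed both by hand and with scikit-learn:

import numpy as np
from sklearn.metrics import precision_score, recall_score

y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0, 1, 1])
y_pred = np.array([1, 0, 1, 0, 1, 0, 1, 0, 1, 1])

TP = np.sum((y_pred == 1) & (y_true == 1))
FP = np.sum((y_pred == 1) & (y_true == 0))
FN = np.sum((y_pred == 0) & (y_true == 1))

print("precision:", TP / (TP + FP), precision_score(y_true, y_pred))
print("recall:   ", TP / (TP + FN), recall_score(y_true, y_pred))
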

42
Example 2: Heart disease data set (logistic regression) (1)
Read data:
filename = "https://www.physi.uni-heidelberg.de/~reygers/lectures/2021/ml/data/heart.csv"
df = pd.read_csv(filename)
df

03_ml_basics_log_regr_heart_disease.ipynb
43
Example 2: Heart disease data set (logistic regression) (2)
Define array of labels and feature vectors

y = df['target'].values
X = df[[col for col in df.columns if col!="target"]]

Generate training and test data sets

from sklearn.model_selection import train_test_split


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, shuffle=True)

Fit the model

from sklearn.linear_model import LogisticRegression


lr = LogisticRegression(penalty='none', fit_intercept=True, max_iter=1000, tol=1E-5)
lr.fit(X_train, y_train)

44
Example 2: Heart disease data set (logistic regression) (3)
Test predictions on test data set:

from sklearn.metrics import classification_report


y_pred_lr = lr.predict(X_test)
print(classification_report(y_test, y_pred_lr))

Output:

              precision    recall  f1-score   support

           0       0.75      0.86      0.80        63
           1       0.89      0.80      0.84        89

    accuracy                           0.82       152
   macro avg       0.82      0.83      0.82       152
weighted avg       0.83      0.82      0.82       152

45
Example 2: Heart disease data set (logistic regression) (4)
Compare to another classifier using the receiver operating characteristic (ROC) curve.
Let's take the random forest classifier:

from sklearn.ensemble import RandomForestClassifier


rf = RandomForestClassifier(max_depth=3)
rf.fit(X_train, y_train)

Use roc_curve from scikit-learn

from sklearn.metrics import roc_curve

y_pred_prob_lr = lr.predict_proba(X_test) # predicted probabilities


fpr_lr, tpr_lr, _ = roc_curve(y_test, y_pred_prob_lr[:,1])

y_pred_prob_rf = rf.predict_proba(X_test) # predicted probabilities


fpr_rf, tpr_rf, _ = roc_curve(y_test, y_pred_prob_rf[:,1])

46
Example 2: Heart disease data set (logistic regression) (5)

plt.plot(tpr_lr, 1-fpr_lr, label="log. regression")
plt.plot(tpr_rf, 1-fpr_rf, label="random forest")

[Figure: ROC curves for logistic regression and random forest]

Classifiers can be compared with the area under curve (AUC) score.

from sklearn.metrics import roc_auc_score

auc_lr = roc_auc_score(y_test, y_pred_lr)
y_pred_rf = rf.predict(X_test)   # class predictions of the random forest
auc_rf = roc_auc_score(y_test, y_pred_rf)
print(f"AUC scores: {auc_lr:.2f}, {auc_rf:.2f}")

This gives
AUC scores: 0.82, 0.83

47
Multinomial logistic regression: Softmax function

In the previous example we considered two classes (0, 1). For multi-class classification, the logistic function can be generalized to the softmax function.

Now consider k classes and let sᵢ be the score for class i: s⃗ = (s₁, ..., sₖ)

A probability for class i can be predicted with the softmax function:

  σ(s⃗)ᵢ = e^(sᵢ) / Σⱼ₌₁ᵏ e^(sⱼ)    for i = 1, ..., k

The softmax function is often used as the last activation function of a neural network in order to predict probabilities in a classification task.

Multinomial logistic regression is also known as softmax regression.
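
A small numpy implementation of the softmax function (the score vector is made up; subtracting max(s⃗) is a common trick for numerical stability):

import numpy as np

def softmax(s):
    # map a score vector s to probabilities that sum to 1
    e = np.exp(s - np.max(s))   # subtracting max(s) avoids overflow
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])
print(softmax(scores))   # approximately [0.66, 0.24, 0.10]
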

48
Example 3: Iris data set (softmax regression) (1)
Iris flower data set
I Introduced in 1936 in a paper by Ronald Fisher
I Task: classify flowers
I Three species: iris setosa, iris virginica and iris versicolor
I Four features: petal width and length, sepal width and length (in centimeters)

03_ml_basics_iris_softmax_regression.ipynb

https://archive.ics.uci.edu/ml/datasets/Iris
https://en.wikipedia.org/wiki/Iris_flower_data_set

49
Example 3: Iris data set (softmax regression) (2)
Get data set

from sklearn import datasets
from sklearn.model_selection import train_test_split

# import some data to play with
# columns: Sepal Length, Sepal Width, Petal Length and Petal Width
iris = datasets.load_iris()
X = iris.data
y = iris.target

# split data into training and test data sets
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)

Softmax regression

from sklearn.linear_model import LogisticRegression


log_reg = LogisticRegression(multi_class='multinomial', penalty='none')
log_reg.fit(x_train, y_train);

50
Example 3: Iris data set (softmax regression) (3)
Accuracy and confusion matrix for different classifiers:

for clf in [log_reg, kn_neigh, fisher_ld]:
    y_pred = clf.predict(x_test)
    acc = accuracy_score(y_test, y_pred)
    print(type(clf).__name__)
    print(f"accuracy: {acc:0.2f}")

    # confusion matrix (rows: true class, columns: predicted class)
    print(confusion_matrix(y_test, y_pred), "\n")

Output:

LogisticRegression
accuracy: 0.96
[[29  0  0]
 [ 0 23  0]
 [ 0  3 20]]

KNeighborsClassifier
accuracy: 0.95
[[29  0  0]
 [ 0 23  0]
 [ 0  4 19]]

LinearDiscriminantAnalysis
accuracy: 0.99
[[29  0  0]
 [ 0 23  0]
 [ 0  1 22]]

51
General remarks on multi-variate analyses (MVAs)
I MVA Methods
  I More effective than classic cut-based analyses
  I Take correlations of input variables into account

I Important: find good input variables for MVA methods
  I Good separation power between S and B
  I No strong correlation among variables
  I No correlation with the parameters you try to measure in your signal sample!

I Pre-processing
  I Apply obvious variable transformations and let the MVA method do the rest
  I Make use of obvious symmetries: if, e.g., a particle production process is symmetric in polar angle θ, use |cos θ| and not cos θ as input variable
  I It is generally useful to bring all input variables to a similar numerical range (see the sketch below)
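
A sketch of such pre-processing with made-up feature names: exploit the symmetry via |cos θ| and standardize the numerical range with scikit-learn's StandardScaler:

import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(6)
cos_theta = rng.uniform(-1, 1, size=1000)        # symmetric in the polar angle
energy = rng.exponential(scale=50.0, size=1000)  # very different numerical range

X = np.column_stack([np.abs(cos_theta), energy])  # use |cos(theta)|, not cos(theta)
X_scaled = StandardScaler().fit_transform(X)      # zero mean, unit variance per column

print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
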

52
Example of feature transformation

53
Possible topics for more in-depth talks/discussion

I Uncertainty quantification: Bayesian neural networks


I Concrete Keras implementation of a Bayesian neural network
I Graph neural networks
I Automated machine learning (automated model selection and hyperparameter
tuning)
I Interpretability: understanding SHAP values
I ...

54
Exercise 1: Classification of air showers measured with the MAGIC
telescope
I Cosmic gamma rays (30 GeV - 30 TeV).
I Cherenkov light from air showers
I Background: air showers caused by hadrons.

55
Exercise 1: Classification of air showers measured with the MAGIC
telescope

Gamma shower Hadronic shower


56
Exercise 1: Classification of air showers measured with the MAGIC
telescope

57
Exercise 1: Classification of air showers measured with the MAGIC
telescope
MAGIC data set
https://archive.ics.uci.edu/ml/datasets/magic+gamma+telescope

1. fLength: continuous # major axis of ellipse [mm]


2. fWidth: continuous # minor axis of ellipse [mm]
3. fSize: continuous # 10-log of sum of content of all pixels [in #phot]
4. fConc: continuous # ratio of sum of two highest pixels over fSize [ratio]
5. fConc1: continuous # ratio of highest pixel over fSize [ratio]
6. fAsym: continuous # distance from highest pixel to center, projected onto major axis [mm]
7. fM3Long: continuous # 3rd root of third moment along major axis [mm]
8. fM3Trans: continuous # 3rd root of third moment along minor axis [mm]
9. fAlpha: continuous # angle of major axis with vector to origin [deg]
10. fDist: continuous # distance from origin to center of ellipse [mm]
11. class: g,h # gamma (signal), hadron (background)

g = gamma (signal): 12332


h = hadron (background): 6688

For technical reasons, the number of h events is underestimated.


In the real data, the h class represents the majority of the events.

58
Exercise 1: Classification of air showers measured with the MAGIC
telescope

03_ml_basics_ex_1_magic.ipynb
a) Create for each variable a figure with a plot for gammas and hadrons overlayed.
b) Create training and test data set. The test data should amount to 50% of the total
data set.
c) Define the logistic regressor and fit the training data
d) Determine the model accuracy and the AUC score
e) Plot the ROC curve (background rejection vs signal efficiency)

59
Exercise 2: Hand-written digit recognition with logistic regression
03_ml_basics_ex_2_mnist_softmax_regression.ipynb
a) Define logistic regressor from scikit-learn and fit data
b) Use classification_report from scikit-learn to determine precision and recall
c) Read in a hand-written digit and classify it. Print the probabilities for each digit.
Determine the digit with the highest probability.
d) (Optional) Create your own hand-written digit with a program like gimp and check what the classifier does

Hint: You can install required packages on the jupyter hub server like so:
!pip3 install --user pypng
60
Exercise 3: Data preprocessing

a) Read the description of the sklearn.preprocessing package.


b) Start from the example notebook on the logistic regression for the heart disease
data set (03_ml_basics_log_regr_heart_disease.ipynb). Pre-process the heart
disease data set according to the given example. Does preprocessing make a
difference in this case?

61
