MACHINE LEARNING IN HIGH ENERGY PHYSICS
LECTURE #1

Alex Rogozhnikov, 2015


INTRO NOTES
4 days
two lectures, two practice seminars every day
this is an introductory track on machine learning
kaggle competition!
WHAT IS ML ABOUT?
Inference of statistical dependencies which give us the ability to predict

Data is cheap, knowledge is precious


WHERE IS ML CURRENTLY USED?
Search engines, spam detection
Security: virus detection, DDOS defense
Computer vision and speech recognition
Market basket analysis, customer relationship management (CRM)
Credit scoring, fraud detection
Health monitoring
Churn prediction
... and hundreds more
ML IN HIGH ENERGY PHYSICS
High-level triggers (LHCb trigger system: 40 MHz → 5 kHz)
Particle identification
Tagging
Stripping line
Analysis
Different data is used at different stages
GENERAL NOTION
In supervised learning the training data is represented as a set of pairs (x_i, y_i)

i is the index of an event
x_i is the vector of features available for the event
y_i is the target: the value we need to predict


CLASSIFICATION EXAMPLE
y_i ∈ Y, where Y is a finite set
on the plot: x_i ∈ ℝ², y_i ∈ {0, 1, 2}

Examples:
defining the type of particle (or decay channel)
Y = {0, 1} is binary classification, where 1 is signal and 0 is background
REGRESSION
y ∈ ℝ
Examples:
predicting the price of a house from its position
predicting number of customers / money income
reconstructing real momentum of particle

Why do we need automatic classification/regression?


in applications up to thousands of features
higher quality
much faster adaptation to new problems
CLASSIFICATION BASED ON
NEAREST NEIGHBOURS
Given a training set of objects and their labels {x_i, y_i}, we predict the label for a new observation x:

ŷ = y_j,   j = arg min_i ρ(x, x_i)
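A minimal sketch of this decision rule with NumPy (the array names and the Euclidean metric are illustrative assumptions, not from the slides):

    import numpy as np

    def nn_predict(X_train, y_train, x):
        # label of x = label of its nearest training event
        distances = np.sqrt(((X_train - x) ** 2).sum(axis=1))
        j = np.argmin(distances)      # j = arg min_i rho(x, x_i)
        return y_train[j]

    X_train = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
    y_train = np.array([0, 1, 1])
    print(nn_predict(X_train, y_train, np.array([0.2, 0.1])))   # -> 0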
VISUALIZATION OF DECISION RULE
k NEAREST NEIGHBOURS
A better way is to use k neighbours:
p_i(x) = (# of kNN events of class i among the k neighbours) / k

(plots for k = 1, 2, 5, 30)
OVERFITTING
what is the quality of classification on the training dataset when k = 1?

answer: it is ideal (the closest neighbour of an event is the event itself)

quality is lower when k > 1
this doesn't mean k = 1 is the best;
it means we cannot use training events to estimate quality
when a classifier's decision rule is too complex and captures details of the training data that are not relevant to the underlying distribution, we call this overfitting (more details tomorrow)
KNN REGRESSOR
Regression with nearest neighbours is done by averaging the outputs of the neighbours:

ŷ = (1/k) ∑_{j ∈ knn(x)} y_j
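A minimal sketch with scikit-learn (the toy dataset and n_neighbors=5 are illustrative assumptions):

    import numpy as np
    from sklearn.neighbors import KNeighborsRegressor

    X = np.linspace(0, 10, 100).reshape(-1, 1)   # toy 1D data
    y = np.sin(X).ravel()

    # prediction = average of the targets of the 5 nearest neighbours
    knn = KNeighborsRegressor(n_neighbors=5)
    knn.fit(X, y)
    print(knn.predict([[2.5]]))                  # close to sin(2.5)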
KNN WITH WEIGHTS
COMPUTATIONAL COMPLEXITY
Given that the dimensionality of the space is d and there are n training samples:
training time ~ O(save a link to the data)
prediction time ~ O(n × d) for each sample
SPATIAL INDEX: BALL TREE
BALL TREE
training time ~ O(d × n log(n))
prediction time ~ O(log(n) × d) for each sample
Other options exist: KD-tree.
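A minimal sketch of querying neighbours through a ball tree in scikit-learn (the array shapes and k=5 are illustrative assumptions):

    import numpy as np
    from sklearn.neighbors import BallTree

    X = np.random.rand(10000, 3)        # n = 10000 training samples, d = 3
    tree = BallTree(X)                  # built once, ~O(d * n log n)

    X_new = np.random.rand(5, 3)
    dist, ind = tree.query(X_new, k=5)  # ~O(d * log n) per query sample
    print(ind.shape)                    # (5, 5): indices of the 5 neighbours per query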
OVERVIEW OF KNN
1. Awesomely simple classifier and regressor
2. Has overly optimistic quality on the training data
3. Quite slow, though optimizations exist
4. Struggles with high-dimensional data
5. Too sensitive to the scale of features
SENSITIVITY TO SCALE OF FEATURES
Euclidean distance:

ρ(x, y)² = (x_1 − y_1)² + (x_2 − y_2)² + ... + (x_d − y_d)²

Change the scale of the first feature (multiply it by 10):

ρ(x, y)² = (10 x_1 − 10 y_1)² + (x_2 − y_2)² + ... + (x_d − y_d)²

ρ(x, y)² ∼ 100 (x_1 − y_1)²
Scaling of features frequently increases quality.
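A minimal sketch of rescaling features before kNN with scikit-learn (the pipeline and toy data are illustrative assumptions, not from the slides):

    from sklearn.datasets import make_classification
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

    # each feature is shifted to zero mean and unit variance,
    # so no single feature dominates the Euclidean distance
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
    model.fit(X, y)
    print(model.score(X, y))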


DISTANCE FUNCTION MATTERS
Minkowski distance: ρ_p(x, y) = ∑_i |x_i − y_i|^p

Canberra: ρ(x, y) = ∑_i |x_i − y_i| / (|x_i| + |y_i|)

Cosine metric: ρ(x, y) = <x, y> / (|x| |y|)
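A minimal sketch of these distances with SciPy (the vectors are illustrative assumptions):

    import numpy as np
    from scipy.spatial import distance

    x = np.array([1.0, 2.0, 3.0])
    y = np.array([2.0, 0.0, 3.0])

    print(distance.minkowski(x, y, p=3))   # Minkowski distance with p = 3
    print(distance.canberra(x, y))         # sum_i |x_i - y_i| / (|x_i| + |y_i|)
    print(distance.cosine(x, y))           # 1 - <x, y> / (|x| |y|), i.e. cosine distance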
x MINUTES BREAK
RECAPITULATION
1. Statistical ML: problems
2. ML in HEP
3. k nearest neighbours classifier and regressor.
MEASURING QUALITY OF BINARY
CLASSIFICATION
The classifier's output in binary classification is a real-valued variable.

Which classifier is better?

Answer: all of them are identical.
ROC CURVE

These distributions have the same ROC curve:


(the ROC curve is the dependency of passed signal vs passed background)
ROC CURVE DEMONSTRATION
ROC CURVE
Contains important information:
all possible combinations of signal and background efficiencies you may achieve by setting a threshold
Particular values of thresholds (and the initial pdfs) don't matter; the ROC curve doesn't contain this information
ROC curve = information about the order of events:
s s b s b ... b b s b b

Comparison of algorithms should be based on information from the ROC curve
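A minimal sketch of building a ROC curve with scikit-learn (the labels and scores are illustrative assumptions):

    import numpy as np
    from sklearn.metrics import roc_curve

    y_true = np.array([1, 1, 0, 1, 0, 0, 1, 0])   # 1 = signal, 0 = background
    scores = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2])

    # fpr = background efficiency, tpr = signal efficiency, one point per threshold
    fpr, tpr, thresholds = roc_curve(y_true, scores)
    print(fpr)
    print(tpr)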
TERMINOLOGY AND CONVENTIONS
fpr = background efficiency = b
tpr = signal efficiency = s


ROC AUC
(AREA UNDER THE ROC CURVE)

ROC AUC = P(x < y), where x and y are the predictions for random background and signal events.
Which classifier is better for triggers?
(they have the same ROC AUC)
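A small numeric check of this interpretation with scikit-learn (the Gaussian scores below are illustrative assumptions):

    import numpy as np
    from sklearn.metrics import roc_auc_score

    rng = np.random.RandomState(0)
    bck = rng.normal(0.0, 1.0, size=5000)    # predictions x of background events
    sig = rng.normal(1.0, 1.0, size=5000)    # predictions y of signal events

    auc = roc_auc_score(np.r_[np.zeros(5000), np.ones(5000)], np.r_[bck, sig])
    prob = np.mean(bck[:, None] < sig[None, :])   # P(x < y) estimated over all pairs
    print(auc, prob)                              # the two numbers coincide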
STATISTICAL MACHINE LEARNING
The machine learning we use in practice is based on statistics.
1. Main assumption: the data is generated from a probability distribution p(x, y)
2. Does a distribution of people / pages really exist?
3. In HEP these distributions do exist
OPTIMAL CLASSIFICATION. OPTIMAL BAYESIAN CLASSIFIER
Assuming that we know the real distributions p(x, y), we reconstruct p(y|x) using Bayes' rule:

p(y|x) = p(x, y) / p(x) = p(y) p(x|y) / p(x)

p(y = 1 | x) / p(y = 0 | x) = [p(y = 1) p(x | y = 1)] / [p(y = 0) p(x | y = 0)]

LEMMA (NEYMAN–PEARSON):
The best classification quality is provided by p(y = 1 | x) / p(y = 0 | x)
(the optimal Bayesian classifier)

OPTIMAL BINARY CLASSIFICATION


The optimal Bayesian classifier has the highest possible ROC curve.
Since the classification quality depends only on the order of events, p(y = 1 | x) gives the optimal classification quality too!

p(y = 1 | x) / p(y = 0 | x) = [p(y = 1) p(x | y = 1)] / [p(y = 0) p(x | y = 0)]
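A minimal numeric sketch of this ratio for 1D Gaussian classes with SciPy (the means, widths and priors are illustrative assumptions):

    from scipy.stats import norm

    p_sig, p_bck = 0.5, 0.5               # priors p(y=1), p(y=0)
    sig_pdf = norm(loc=1.0, scale=1.0)    # p(x | y = 1)
    bck_pdf = norm(loc=0.0, scale=1.0)    # p(x | y = 0)

    x = 0.7
    ratio = (p_sig * sig_pdf.pdf(x)) / (p_bck * bck_pdf.pdf(x))
    print(ratio)   # > 1: the event looks more signal-like than background-like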
FISHER'S QDA (QUADRATIC DISCRIMINANT ANALYSIS)
Reconstructing the probabilities p(x | y = 1), p(x | y = 0) from data, assuming these are multidimensional normal distributions:

p(x | y = 0) ∼ N(μ_0, Σ_0)
p(x | y = 1) ∼ N(μ_1, Σ_1)
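A minimal sketch with scikit-learn's quadratic discriminant analysis (the toy Gaussian data is an illustrative assumption; in older scikit-learn versions the class lived in sklearn.qda):

    import numpy as np
    from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

    rng = np.random.RandomState(0)
    X = np.vstack([rng.normal(0, 1, size=(500, 2)),    # background around mu_0
                   rng.normal(2, 1, size=(500, 2))])   # signal around mu_1
    y = np.r_[np.zeros(500), np.ones(500)]

    qda = QuadraticDiscriminantAnalysis()
    qda.fit(X, y)                          # estimates mu_0, mu_1, Sigma_0, Sigma_1
    print(qda.predict_proba([[1.0, 1.0]]))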
QDA COMPLEXITY
n samples, d dimensions
training takes O(n d² + d³)

computing the covariance matrix: O(n d²)
inverting the covariance matrix: O(d³)

prediction takes O(d²) for each sample
f(x) = 1 / ((2π)^{k/2} |Σ|^{1/2}) exp( −(1/2) (x − μ)^T Σ⁻¹ (x − μ) )
QDA
simple decision rule
fast prediction
many parameters to reconstruct in high dimensions
data almost never has a Gaussian distribution
WHAT ARE THE PROBLEMS WITH
GENERATIVE APPROACH?
Generative approach: try to reconstruct p(x, y), then use it to predict.
Real-life distributions can hardly be reconstructed,
especially in high-dimensional spaces.
So we switch to the discriminative approach: estimating p(y|x) directly.
LINEAR DECISION RULE
The decision function is linear:

d(x) = <w, x> + w_0

d(x) > 0: class +1
d(x) < 0: class −1

This is a parametric model (we find the parameters w, w_0).


FINDING OPTIMAL PARAMETERS
A good initial guess: find w, w_0 such that the classification error is minimal ([true] = 1, [false] = 0):

E = ∑_{i ∈ events} [y_i ≠ sgn(d(x_i))]

Discontinuous optimization (arrrrgh!)
Let's make the decision rule smooth:

p_{+1}(x) = f(d(x))
p_{−1}(x) = 1 − p_{+1}(x)

where f(0) = 0.5, f(x) > 0.5 if x > 0, f(x) < 0.5 if x < 0
LOGISTIC FUNCTION
a smooth step function:

σ(x) = e^x / (1 + e^x) = 1 / (1 + e^{−x})

PROPERTIES
1. monotonic, σ(x) ∈ (0, 1)
2. σ(x) + σ(−x) = 1
3. σ'(x) = σ(x)(1 − σ(x))
4. 2σ(x) = 1 + tanh(x/2)
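A tiny NumPy check of these properties (the test points are illustrative assumptions):

    import numpy as np

    def sigma(x):
        return 1.0 / (1.0 + np.exp(-x))

    x = np.array([-2.0, 0.0, 1.5])
    print(sigma(x) + sigma(-x))                  # property 2: all ones
    print(2 * sigma(x) - (1 + np.tanh(x / 2)))   # property 4: all zeros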
LOGISTIC FUNCTION
LOGISTIC REGRESSION
Optimizing the log-likelihood (with probabilities obtained via the logistic function):

d(x) = <w, x> + w_0
p_{+1}(x) = σ(d(x))
p_{−1}(x) = σ(−d(x))

ℒ = (1/N) ∑_{i ∈ events} −ln(p_{y_i}(x_i)) = (1/N) ∑_{i ∈ events} L(x_i, y_i) → min

Exercise: find the expression and build a plot for L(x_i, y_i)
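A minimal sketch with scikit-learn (the toy data is an illustrative assumption; working out L(x_i, y_i) by hand is left as the exercise above):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.RandomState(0)
    X = np.vstack([rng.normal(0, 1, size=(500, 2)),
                   rng.normal(2, 1, size=(500, 2))])
    y = np.r_[-np.ones(500), np.ones(500)]       # classes -1 and +1

    clf = LogisticRegression()
    clf.fit(X, y)                                # finds w, w_0 by minimizing the loss above
    print(clf.coef_, clf.intercept_)             # w and w_0
    print(clf.predict_proba([[1.0, 1.0]]))       # [p_{-1}(x), p_{+1}(x)]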

DATA SCIENTIST PIPELINE

1. Experiment in an appropriate high-level language or environment
2. After the experiments are over, implement the final algorithm in a low-level language (C++, CUDA, FPGA)

The second step is not always needed.
SCIENTIFIC PYTHON
NumPy
vectorized computations in python

Matplotlib
for drawing

Pandas
for data manipulation and analysis (based on
NumPy)
SCIENTIFIC PYTHON
Scikit-learn
most popular library for machine learning

Scipy
libraries for science and engineering

root_numpy
a convenient way to work with ROOT files
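A minimal sketch of typical imports for this stack (the aliases are conventional, not prescribed by the slides):

    import numpy as np                   # vectorized computations
    import matplotlib.pyplot as plt      # drawing
    import pandas as pd                  # data manipulation and analysis
    from sklearn.neighbors import KNeighborsClassifier   # machine learning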
THE END
