Week 8 Lec 41 and 42
https://www.youtube.com/watch?v=yWDUzNiWPJA
Motivation
Transparency tools try to provide clarity about a model’s inner workings
Model changes can sometimes cause the internal representations to change substantially, so we would like to understand when models process data differently
Transparency could make it easier for monitors to detect deception and other hazards
Pixel attribution methods
Highlight the pixels that were relevant for a certain image classification by a neural network
Pixel attribution methods
Occlusion- or perturbation-based: methods like SHAP and LIME manipulate parts of the image to generate explanations (model-agnostic)
Gradient-based: many methods compute the gradient of the prediction (or classification score) with respect to the input features; the gradient-based methods mostly differ in how the gradient is computed (see the sketch below)
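As a concrete illustration of the gradient-based family, here is a minimal vanilla-gradient saliency sketch in PyTorch. The pretrained ResNet-18 and the random stand-in image are my own choices for illustration, not part of the lecture.

```python
import torch
from torchvision import models

def vanilla_gradient_saliency(model, image):
    """Gradient of the top-class score w.r.t. the input pixels (vanilla saliency)."""
    model.eval()
    image = image.clone().requires_grad_(True)       # track gradients w.r.t. pixels
    logits = model(image)                            # shape [1, num_classes]
    target = logits.argmax(dim=1).item()             # explain the predicted class
    logits[0, target].backward()                     # d(score) / d(pixels)
    # Collapse the color channels: max |gradient| per pixel
    return image.grad.abs().max(dim=1).values.squeeze(0)

# Hypothetical usage with a pretrained ResNet-18 and a preprocessed
# [1, 3, 224, 224] image tensor (normalization omitted for brevity):
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
image = torch.rand(1, 3, 224, 224)                   # stand-in for a real image
saliency = vanilla_gradient_saliency(model, image)   # [224, 224] heatmap
```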
StyleGAN2 & StyleGAN3
Texture sticking (figure): left, the average of images generated from a small neighborhood around a central latent (top row); right, a small vertical segment of pixels extracted from each frame and stacked horizontally (StyleGAN2, same coordinates; hairs moving in animation)
Guided backpropagation
Backprop with intermediate negative activations and gradients zeroed out, i.e. only positive gradients are propagated
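A minimal PyTorch sketch of this only-positive-gradients rule (the toy network, input size, and hook-based implementation are my own choices for illustration): a backward hook on each ReLU clamps the incoming gradient to be non-negative, so only positive gradients flow back to the input.

```python
import torch
import torch.nn as nn

# Toy ReLU network standing in for a real CNN (illustrative only).
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(8, 4, kernel_size=3, padding=1), nn.ReLU(),
    nn.Flatten(), nn.Linear(4 * 8 * 8, 10),
)

def clamp_negative_grads(module, grad_input, grad_output):
    # Zero out negative gradients flowing back through each ReLU,
    # so only positive gradients reach earlier layers.
    return (torch.clamp(grad_input[0], min=0.0),)

hooks = [m.register_full_backward_hook(clamp_negative_grads)
         for m in model.modules() if isinstance(m, nn.ReLU)]

x = torch.rand(1, 3, 8, 8, requires_grad=True)        # stand-in input
logits = model(x)
logits[0, logits.argmax(dim=1).item()].backward()
guided_saliency = x.grad.abs().max(dim=1).values      # per-pixel importance

for h in hooks:                                        # clean up hooks afterwards
    h.remove()
```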
Saliency Maps
Saliency Maps Can Be Deceptive
Many transparency tools create fun-to-look-at visualizations that do not actually inform us much about how models are making predictions
Sanity Checks for Saliency Maps
If we randomize the layers, some saliency maps do not change much, which suggests they do not capture what the model has learned
If a model captures higher-level class concepts, then saliency maps should change as the model is being randomized (see the sketch below)
Sole visual inspection can be deceiving
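A minimal sketch of this model-randomization check (my own illustration, not the paper's code): reinitialize the final layer, recompute the saliency map with any function of the form saliency_fn(model, image), e.g. the vanilla-gradient sketch above, and compare the two maps by rank correlation. A correlation near 1.0 means the map barely changed, which is a bad sign.

```python
import copy
import torch

def randomization_sanity_check(model, image, saliency_fn):
    """Compare saliency maps before/after reinitializing the model's last layer."""
    original_map = saliency_fn(model, image)

    randomized = copy.deepcopy(model)
    last_linear = [m for m in randomized.modules()
                   if isinstance(m, torch.nn.Linear)][-1]
    torch.nn.init.normal_(last_linear.weight)          # destroy learned weights
    torch.nn.init.zeros_(last_linear.bias)
    randomized_map = saliency_fn(randomized, image)

    # Spearman rank correlation between the two flattened maps.
    def ranks(t):
        return t.flatten().argsort().argsort().float()
    a, b = ranks(original_map), ranks(randomized_map)
    a, b = a - a.mean(), b - b.mean()
    correlation = (a * b).sum() / (a.norm() * b.norm() + 1e-8)
    return correlation.item()

# Hypothetical usage with the model, image, and saliency function from above:
# score = randomization_sanity_check(model, image, vanilla_gradient_saliency)
```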
Optimized Masks for Saliency
Some saliency maps optimize a mask to locate and blur salient regions
https://www.youtube.com/watch?v=d6j6bofhj2M
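One way to realize this, sketched below in the spirit of mask-optimization methods: learn a per-pixel mask so that blurring the masked region maximally drops the class score while the deleted area stays small; the learned mask is the saliency map. The blur, loss weights, learning rate, and step count are all my own arbitrary choices.

```python
import torch
import torch.nn.functional as F

def optimize_saliency_mask(model, image, target_class, steps=200, lam=0.05):
    """Learn a mask m in [0, 1]: composite = m * image + (1 - m) * blurred.
    The optimization deletes (blurs) as little as possible while driving down
    the class probability, so (1 - m) highlights the salient region."""
    model.eval()
    blurred = F.avg_pool2d(image, kernel_size=11, stride=1, padding=5)  # crude blur
    mask_logits = torch.zeros(1, 1, *image.shape[2:], requires_grad=True)
    optimizer = torch.optim.Adam([mask_logits], lr=0.1)

    for _ in range(steps):
        mask = torch.sigmoid(mask_logits)                 # keep mask in [0, 1]
        composite = mask * image + (1 - mask) * blurred   # blur where mask ~ 0
        score = F.softmax(model(composite), dim=1)[0, target_class]
        loss = score + lam * (1 - mask).abs().mean()      # drop score, delete little
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    return 1 - torch.sigmoid(mask_logits).detach()        # deleted (salient) region

# Hypothetical usage: saliency = optimize_saliency_mask(model, image, target_class=243)
```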
LIME: Local Interpretable Model-agnostic Explanations
LIME perturbs the instance being explained, queries the black-box model on the perturbed samples, weights those samples by their proximity to the original instance, and fits a sparse linear surrogate (e.g., Lasso regression) whose coefficients serve as the local explanation
https://arxiv.org/pdf/1602.04938.pdf
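A from-scratch sketch of that recipe for tabular-style features (my simplification, not the lime package API; the perturbation scheme, kernel width, and Lasso strength are assumptions):

```python
import numpy as np
from sklearn.linear_model import Lasso

def lime_explain(predict_fn, x, num_samples=1000, kernel_width=0.75, alpha=0.01):
    """predict_fn: black-box function returning P(class) for a batch of inputs.
    x: 1-D feature vector for the instance being explained."""
    rng = np.random.default_rng(0)
    # 1. Perturb: randomly switch features off (replace with a 0 baseline value).
    masks = rng.integers(0, 2, size=(num_samples, x.shape[0]))
    perturbed = masks * x                       # feature kept where mask == 1
    # 2. Query the black-box model on the perturbed neighborhood.
    preds = predict_fn(perturbed)
    # 3. Weight samples by proximity to the original instance (RBF kernel).
    distance = 1.0 - masks.mean(axis=1)         # fraction of features removed
    weights = np.exp(-(distance ** 2) / kernel_width ** 2)
    # 4. Fit a sparse linear surrogate; its coefficients are the explanation.
    surrogate = Lasso(alpha=alpha)
    surrogate.fit(masks, preds, sample_weight=weights)
    return surrogate.coef_                      # importance of each feature

# Hypothetical usage with some trained classifier `clf`:
# explanation = lime_explain(lambda X: clf.predict_proba(X)[:, 1], x)
```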
SHAP (SHapley Additive exPlanations)
A game-theoretic approach to explain the output of any machine learning model
Connects optimal credit allocation with local explanations using the classic Shapley values from game theory and their related extensions
https://github.com/shap/shap
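To make the credit-allocation idea concrete, below is a brute-force Shapley value computation for a handful of features. It is exponential in the number of features, so it is purely illustrative; the shap library's approximations are what you would use in practice, and the toy value function here is my own.

```python
from itertools import combinations
from math import factorial
import numpy as np

def exact_shapley_values(value_fn, num_features):
    """value_fn(S): payoff when only the coalition S of features is 'present'.
    Returns each feature's weighted average marginal contribution."""
    features = list(range(num_features))
    phi = np.zeros(num_features)
    for i in features:
        others = [f for f in features if f != i]
        for size in range(len(others) + 1):
            for S in combinations(others, size):
                # Shapley weight = |S|! (n - |S| - 1)! / n!
                w = factorial(size) * factorial(num_features - size - 1) / factorial(num_features)
                phi[i] += w * (value_fn(set(S) | {i}) - value_fn(set(S)))
    return phi

# Toy value function: features contribute 1.0, 2.0, 3.0 additively when present.
contributions = [1.0, 2.0, 3.0]
value = lambda S: sum(contributions[f] for f in S)
print(exact_shapley_values(value, 3))   # -> [1. 2. 3.], credit matches contribution
```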
Grad-CAM
https://github.com/ramprs/grad-cam/
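A minimal PyTorch sketch of the Grad-CAM idea (my own re-implementation for illustration, not the linked repository's code; the model and target layer are arbitrary choices): global-average-pool the gradients of the class score over the chosen convolutional feature maps, use them to weight those maps, and keep the positive part.

```python
import torch
import torch.nn.functional as F
from torchvision import models

def grad_cam(model, target_layer, image, target_class=None):
    acts = {}
    handle = target_layer.register_forward_hook(lambda m, i, o: acts.update(a=o))
    logits = model(image)
    handle.remove()
    if target_class is None:
        target_class = logits.argmax(dim=1).item()
    # Gradient of the class score w.r.t. the stored feature maps.
    grads = torch.autograd.grad(logits[0, target_class], acts["a"])[0]
    weights = grads.mean(dim=(2, 3), keepdim=True)       # GAP over spatial dims
    cam = F.relu((weights * acts["a"]).sum(dim=1))       # weighted combination of maps
    return (cam / (cam.max() + 1e-8)).squeeze(0).detach()  # low-res heatmap in [0, 1]

# Hypothetical usage on a pretrained ResNet-18, last conv block:
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT).eval()
heatmap = grad_cam(model, model.layer4, torch.rand(1, 3, 224, 224))
```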
Pros & Cons: Gradient-based
Explanations are visual; detecting important regions in the image is easy
Faster to compute than model-agnostic methods (LIME & SHAP are very expensive)
Tools: tf-keras-vis
https://pypi.org/project/tf-keras-vis/
Tools: innvestigate
https://github.com/albermax/innvestigate
Tools: DeepExplain
https://github.com/marcoancona/DeepExplain
Saliency Maps for Text
Saliency maps can be used for text models too (y = gold label, c = predicted label)
There are many possible saliency scores for a token; one possibility is to use the magnitude of the gradient of the classifier’s logit with respect to the token’s embedding (sketched below)
While there is no canonical saliency map, these scores can be used to identify salient words when writing adversarial examples
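A sketch of that embedding-gradient score using the Hugging Face transformers library; the sentiment checkpoint named here is just a convenient example, and the L2-norm aggregation over the embedding dimension is my choice of "magnitude".

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Example checkpoint; any sequence-classification model would do.
name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name).eval()

def token_saliency(text):
    enc = tokenizer(text, return_tensors="pt")
    # Embed the tokens ourselves so we can take gradients w.r.t. the embeddings.
    embeds = model.get_input_embeddings()(enc["input_ids"]).detach().requires_grad_(True)
    logits = model(inputs_embeds=embeds, attention_mask=enc["attention_mask"]).logits
    c = logits.argmax(dim=1).item()                 # c = predicted class
    logits[0, c].backward()                         # d(logit_c) / d(embeddings)
    scores = embeds.grad.norm(dim=-1).squeeze(0)    # one gradient magnitude per token
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0].tolist())
    return list(zip(tokens, scores.tolist()))

print(token_saliency("The movie was surprisingly good."))
```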
Feature Visualization
To understand what a model’s internal component detects, synthesize an image through gradient descent that maximizes the component’s activation
(Figure panels: neuron visualization, channel visualization, maximally activating natural images)
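A bare-bones sketch of that gradient-ascent procedure (real feature-visualization pipelines add regularizers and image transformations, which are omitted here; the network, layer, and channel choices are arbitrary): start from noise and repeatedly step the image in the direction that increases the chosen channel's mean activation.

```python
import torch
from torchvision import models

def visualize_channel(model, layer, channel, steps=256, lr=0.05):
    """Gradient ascent on the input image to maximize one channel's activation."""
    model.eval()
    acts = {}
    handle = layer.register_forward_hook(lambda m, i, o: acts.update(a=o))
    image = torch.rand(1, 3, 224, 224, requires_grad=True)   # start from noise
    optimizer = torch.optim.Adam([image], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        model(image)
        loss = -acts["a"][0, channel].mean()    # maximize mean channel activation
        loss.backward()
        optimizer.step()
        with torch.no_grad():
            image.clamp_(0, 1)                  # keep pixels in a valid range
    handle.remove()
    return image.detach().squeeze(0)

# Hypothetical usage: visualize channel 42 of ResNet-18's third residual block.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
img = visualize_channel(model, model.layer3, channel=42)
```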
ProtoPNet (“This Looks Like That”)
These models perform classification based on the most important patches of training images, using patches that are prototypical of the class
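A toy sketch of the "this looks like that" scoring step (the dimensions, backbone, and similarity formula follow the general ProtoPNet recipe but everything here is simplified for illustration): compare every spatial patch of the conv feature map to each learned prototype, take the best match per prototype, and feed those similarities to a linear classification layer.

```python
import torch
import torch.nn as nn

class TinyProtoPNet(nn.Module):
    """Illustrative prototype layer: squared L2 distance between each prototype
    and every spatial position of the feature map, turned into a similarity."""
    def __init__(self, backbone, feat_dim=64, num_prototypes=10, num_classes=2):
        super().__init__()
        self.backbone = backbone                          # any conv feature extractor
        self.prototypes = nn.Parameter(torch.rand(num_prototypes, feat_dim, 1, 1))
        self.classifier = nn.Linear(num_prototypes, num_classes, bias=False)

    def forward(self, x):
        feats = self.backbone(x)                          # [B, feat_dim, H, W]
        # Squared distance between each prototype and each spatial patch.
        dists = ((feats.unsqueeze(1) - self.prototypes.unsqueeze(0)) ** 2).sum(dim=2)
        min_dist = dists.flatten(2).min(dim=2).values     # best patch per prototype
        similarity = torch.log((min_dist + 1) / (min_dist + 1e-4))
        return self.classifier(similarity), min_dist      # logits + evidence distances

backbone = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU())
model = TinyProtoPNet(backbone)
logits, _ = model(torch.rand(2, 3, 32, 32))
```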
What is Interpretability?
AI systems are black boxes; we don’t understand how they work
https://en.wikipedia.org/wiki/Neural_network
https://en.wikipedia.org/wiki/Artificial_neural_network
Interpretability: Mechanistic
Reverse-engineer neural networks: explaining neurons and connected circuits
Excitatory: prompt one neuron to share information with the next through an action potential
Inhibitory: reduce the probability that such a transfer will take place
Interpretability: Top-down
Locate information in a model without a full understanding of how it is processed
A lot more tractable than fully reverse-engineering large models
Controlling Model Outputs
By manipulating representations identified using interpretability
Probing
What does probing get you?
Do the representations in the model encode enough information to do X? (e.g., if a model cannot perform word problems, do the representations have enough information to do addition?)
Probing loosely refers to a class of methodologies in interpretability for checking whether representations reliably encode information for some specific task
Probing: why?
Sometimes, humans have a very strong idea of the subtasks that have to be done to complete a task
But the model can possibly get high accuracy on the task without doing the subtask
https://precog.iiit.ac.in/pubs/2024_shashwat_negation.pdf
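A minimal probing sketch (hypothetical setup; the representations here are random stand-ins for hidden states extracted from a frozen model, and the property labels are placeholders): collect the frozen model's representations for inputs annotated with the property of interest, train a simple classifier, and compare held-out accuracy against a trivial baseline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Stand-ins: `representations` would come from a frozen model's hidden states,
# `labels` from annotations for the probed property (e.g., contains a negation).
rng = np.random.default_rng(0)
representations = rng.normal(size=(2000, 768))
labels = rng.integers(0, 2, size=2000)

X_train, X_test, y_train, y_test = train_test_split(
    representations, labels, test_size=0.2, random_state=0)

probe = LogisticRegression(max_iter=1000)   # linear probe on frozen features
probe.fit(X_train, y_train)
accuracy = probe.score(X_test, y_test)

# Accuracy well above the majority-class baseline suggests the representations
# encode enough information for the probed task.
baseline = max(np.mean(y_test), 1 - np.mean(y_test))
print(f"probe accuracy: {accuracy:.3f}  baseline: {baseline:.3f}")
```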
Trojan Attacks
Trojans
Adversaries can implant hidden functionality into models
When triggered, this can cause a sudden, dangerous change in behavior
https://www.newyorker.com/humor/daily-shouts/diary-of-the-guy-who-drove-the-trojan-horse-back-from-troy
https://www.ox.ac.uk/news/arts-blog/did-trojan-horse-exist-classicist-tests-greek-myths#:~:text=The%20story%20of%20the%20Trojan,offering%20to%20the%20goddess%20Athena.
Attack Vectors
How can adversaries implant hidden functionality?
Public datasets
Model sharing libraries
Data Poisoning
A data poisoning Trojan attack: the dataset is poisoned (a trigger is stamped onto some inputs and their labels are changed to the target label) so that the trained model contains a Trojan
This works even when a small fraction (e.g. 0.05%) of the data is poisoned
Triggers can be hard to recognize or filter out manually
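A schematic of a poisoning step of this kind on image data (the trigger pattern, its placement, and the target label are arbitrary choices for illustration): stamp a small trigger patch onto a small fraction of training images and relabel them with the attacker's target class.

```python
import numpy as np

def poison_dataset(images, labels, target_label, poison_fraction=0.0005, seed=0):
    """images: [N, H, W, C] in [0, 1]; labels: [N]. Returns poisoned copies.
    Stamps a small white square (the trigger) into the corner of a few images
    and flips their labels to the attacker's target class (0.05% by default)."""
    rng = np.random.default_rng(seed)
    images, labels = images.copy(), labels.copy()
    num_poison = max(1, int(poison_fraction * len(images)))
    idx = rng.choice(len(images), size=num_poison, replace=False)
    images[idx, -4:, -4:, :] = 1.0          # 4x4 trigger patch, bottom-right corner
    labels[idx] = target_label              # attacker-chosen target label
    return images, labels, idx

# Hypothetical usage on a CIFAR-like array:
train_x = np.random.rand(50000, 32, 32, 3)
train_y = np.random.randint(0, 10, size=50000)
poisoned_x, poisoned_y, poisoned_idx = poison_dataset(train_x, train_y, target_label=0)
```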
Possible Attacks
Just by perturbing the observations of an RL agent, we can induce a selected action when it reaches a target state⁽⁴⁾ (NOOP = no operation)
Trojan Defenses
Detecting Trojans
Different kinds of detection:
Does a given input have a Trojan trigger? (Is an adversary trying to control our network right now?)
Detecting Trojans seems challenging at first, because neural networks are complex, high-dimensional objects
For example, which of the following first-layer MNIST weights belongs to a Trojaned network? (Figure: first-layer weight visualizations, one of them labeled “Trojaned”)
Neural Cleanse
Optimization enables interfacing with complex systems
If we know the general form of the attack, we can reverse-engineer the Trojan by searching for a trigger and target label (figure, panel 3: reverse-engineered trigger)
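A rough sketch of that trigger-search step (simplified; the assumption of a mask-plus-pattern trigger, the loss weights, and the step counts are mine): for a candidate target label, optimize a mask and pattern that flip a batch of clean inputs to that label while keeping the mask as small as possible; an unusually small recovered mask for one label is evidence of a Trojan.

```python
import torch
import torch.nn.functional as F

def reverse_engineer_trigger(model, clean_images, target_label, steps=500, lam=0.01):
    """Optimize (mask, pattern) so that
    x' = (1 - mask) * x + mask * pattern  is classified as target_label,
    while keeping the mask small (L1 penalty)."""
    model.eval()
    _, c, h, w = clean_images.shape
    mask_logits = torch.zeros(1, 1, h, w, requires_grad=True)
    pattern = torch.rand(1, c, h, w, requires_grad=True)
    optimizer = torch.optim.Adam([mask_logits, pattern], lr=0.1)
    targets = torch.full((clean_images.shape[0],), target_label, dtype=torch.long)

    for _ in range(steps):
        mask = torch.sigmoid(mask_logits)
        triggered = (1 - mask) * clean_images + mask * pattern.clamp(0, 1)
        loss = F.cross_entropy(model(triggered), targets) + lam * mask.abs().sum()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    return torch.sigmoid(mask_logits).detach(), pattern.detach().clamp(0, 1)

# Running this for every candidate target label and flagging the label whose
# recovered mask has an unusually small L1 norm mirrors the detection idea.
```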
Meta-Networks
Train neural networks to analyze other neural networks
Removing Trojans
If we detect a Trojan, how can we remove it?
Hopeful Outlook
Powerful detectors for hidden functionality would make current and future AI systems much safer
Thank you for attending the class!!!
[email protected]
pk.profgiri | Ponnurangam.kumaraguru | /in/ponguru | ponguru