Week 8 Lec 41 and 42


Responsible & Safe AI

Prof. Ponnurangam Kumaraguru (PK), IIITH


Prof. Balaraman Ravindran, IIT Madras
Prof. Arun Rajkumar, IIT Madras

Interpretability / Transparency / Probing


What is an alignment problem?

https://www.youtube.com/watch?v=yWDUzNiWPJA 2
Motivation
Transparency tools try to provide clarity about a model’s inner
workings
Model changes can sometimes cause the internal representations to
substantially change, so we would like to understand when models
process data differently
Transparency could make it easier for monitors to detect deception
and other hazards

3
Pixel attribution methods
Highlight the pixels that were relevant for a certain image classification
by a neural network

Pixels are colored by their contribution to the classification 4


Pixel attribution methods
Different names for the same idea: sensitivity map, saliency map, pixel attribution map, gradient-based attribution methods, feature relevance, feature attribution, and feature contribution
Pixel attribution is a special case of feature attribution for images
Feature attribution explains individual predictions by attributing each
input feature according to how much it changed the prediction
(negatively or positively)
Features can be input pixels, tabular data or words

5
Pixel attribution methods
Occlusion- or perturbation-based
Methods like SHAP and LIME manipulate parts of the image to
generate explanations (model-agnostic)
Gradient-based
Many methods compute the gradient of the prediction (or
classification score) with respect to the input features
The gradient-based methods mostly differ in how the gradient is
computed

6
StyleGAN2 & StyleGAN3

Texture sticking
Left: average of images generated from a small neighborhood around a central latent (top row)
Right: a small vertical segment of pixels extracted from each frame and stacked horizontally (StyleGAN2, same coordinates), showing how the hairs move in animation
7
StyleGAN2 & StyleGAN3

StyleGAN2: details are glued to image coordinates rather than to the surface; the internal representations are different
StyleGAN3: fully equivariant to translation and rotation; helps in identifying important properties better
8
Saliency Maps

1. Perform a forward pass
2. Compute the gradient of the class score of interest with respect to the input pixels (set all other classes to zero)
3. Visualize the gradients: you can either show the absolute values or highlight negative and positive contributions separately (see the sketch below). 9
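As a rough illustration of these three steps, here is a minimal PyTorch sketch; the tiny CNN stands in for whatever classifier is being explained, and the random input tensor stands in for a real preprocessed image (both are assumptions, not part of the slides).

```python
import torch
import torch.nn as nn

# Any image classifier that returns class logits will do; a tiny CNN keeps
# the sketch self-contained (a stand-in for e.g. a pretrained ResNet).
model = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(8, 10),
).eval()

x = torch.rand(1, 3, 64, 64, requires_grad=True)  # preprocessed input image

# 1) Forward pass.
logits = model(x)

# 2) Gradient of the class score of interest w.r.t. the input pixels.
#    Backpropagating only the target logit implicitly zeroes the other classes.
target = logits.argmax(dim=1).item()
logits[0, target].backward()

# 3) Visualize: e.g. take the max absolute gradient across colour channels.
saliency = x.grad.abs().max(dim=1)[0].squeeze()   # shape (64, 64)
```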
Saliency Maps

1. Generate multiple versions of the image of interest by adding noise to it.
2. Create pixel attribution maps for all the noisy images.
3. Average the pixel attribution maps (see the sketch below). 10
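A minimal sketch of this noise-and-average procedure (SmoothGrad-style), reusing `model`, `x`, and `target` from the previous sketch; the noise scale and sample count are illustrative choices, not values from the slides.

```python
import torch

def vanilla_saliency(model, x, target):
    # Gradient of the target logit w.r.t. the input, reduced over channels.
    x = x.clone().detach().requires_grad_(True)
    model(x)[0, target].backward()
    return x.grad.abs().max(dim=1)[0]

def smooth_saliency(model, x, target, n_samples=25, sigma=0.1):
    # 1) Noisy copies of the image, 2) attribute each, 3) average the maps.
    maps = [vanilla_saliency(model, x + sigma * torch.randn_like(x), target)
            for _ in range(n_samples)]
    return torch.stack(maps).mean(dim=0)

smoothed = smooth_saliency(model, x, target)
```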
11
Saliency Maps

Guided backpropagation: backprop with intermediate negative activations and negative gradients zeroed out, i.e. only positive gradients flow back to the input
12
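One common way to implement this "only positive gradients" rule in PyTorch is to clamp the gradient to be non-negative as it flows back through every ReLU; the sketch below assumes the network uses plain (non-inplace) `nn.ReLU` modules, and the function names are mine, not from the slides.

```python
import torch
import torch.nn as nn

def guided_relu_hook(module, grad_input, grad_output):
    # ReLU's normal backward already zeroes gradients where the activation
    # was negative; additionally keep only the positive gradients.
    return (torch.clamp(grad_input[0], min=0.0),)

def guided_backprop(model, x, target):
    # Attach the hook to every ReLU, backprop the target logit, then clean up.
    handles = [m.register_full_backward_hook(guided_relu_hook)
               for m in model.modules() if isinstance(m, nn.ReLU)]
    x = x.clone().detach().requires_grad_(True)
    model(x)[0, target].backward()
    for h in handles:
        h.remove()
    return x.grad
```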
Saliency Maps

13
Saliency Maps Can Be Deceptive
Many transparency tools create fun-to-look-at visualizations that do
not actually inform us much about how models are making predictions

14
Sanity Checks for Saliency Maps
If we randomize the layers, some saliency maps do not change much,
which suggests they do not capture what the model has learned

If a model captures higher-level class concepts, then saliency maps should change as the model is being randomized. Sole visual inspection can be deceiving.
15
Optimized Masks for Saliency
Some saliency maps optimize a mask to locate and blur salient regions

This is highly sensitive to hyperparameters and mask initialization


16
17
18
LIME: Local Interpretable Model-agnostic Explanations

Works on any black-box model (model internals are hidden)
Works with many data types
Using prior knowledge, we can validate the explanations
Explanations are locally faithful, but not necessarily globally

https://www.youtube.com/watch?v=d6j6bofhj2M 19
LIME: Local Interpretable Model-agnostic Explanations

20
LIME: Local Interpretable Model-agnostic Explanations

21
LIME: Local Interpretable Model-agnostic Explanations

22
LIME: Local Interpretable Model-agnostic Explanations

23
LIME: Local Interpretable Model-agnostic Explanations

24
LIME: Local Interpretable Model-agnostic Explanations

Lasso Regression

25
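Underneath the library, LIME's core loop is small: sample perturbations around the instance, weight them by proximity, and fit a sparse (Lasso) surrogate whose coefficients are the explanation. The sketch below is a simplified tabular version for illustration, not the official lime package; `black_box_predict` is a placeholder for any model's prediction function, and the kernel width and sampling scheme are assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso

def lime_explain(black_box_predict, x, n_samples=500, sigma=1.0, alpha=0.01):
    """Explain one tabular instance x (1-D array) with a local Lasso surrogate."""
    rng = np.random.default_rng(0)
    # 1) Sample perturbations in a neighbourhood of x.
    X_pert = x + rng.normal(scale=0.5, size=(n_samples, x.size))
    # 2) Query the black box on the perturbed points.
    y_pert = black_box_predict(X_pert)
    # 3) Weight samples by proximity to x (exponential kernel).
    dists = np.linalg.norm(X_pert - x, axis=1)
    weights = np.exp(-(dists ** 2) / (2 * sigma ** 2))
    # 4) Fit a sparse, interpretable surrogate; its coefficients attribute each
    #    feature's local (positive or negative) contribution.
    surrogate = Lasso(alpha=alpha).fit(X_pert, y_pert, sample_weight=weights)
    return surrogate.coef_

# Example with a hypothetical black box whose output rises with feature 0.
coefs = lime_explain(lambda X: 1 / (1 + np.exp(-X[:, 0])), np.zeros(5))
```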
https://arxiv.org/pdf/1602.04938.pdf
26
SHAP (SHapley Additive exPlanations)
Game theoretic approach to explain the output of any machine
learning model.
Connects optimal credit allocation with local explanations using the
classic Shapley values from game theory and their related extensions

https://github.com/shap/shap 27
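For a handful of features, the classic Shapley value behind SHAP can be computed exactly by averaging each feature's marginal contribution over all coalitions. The brute-force sketch below (with a baseline-replacement value function and a toy linear model) is for intuition only; the shap library uses far more efficient approximations.

```python
import itertools
import math
import numpy as np

def shapley_values(predict, x, baseline):
    """Exact Shapley values for one instance x against a baseline (tiny d only)."""
    d = len(x)
    phi = np.zeros(d)

    def value(coalition):
        # Features in the coalition take their real values, the rest the baseline.
        z = baseline.copy()
        z[list(coalition)] = x[list(coalition)]
        return predict(z)

    for i in range(d):
        others = [j for j in range(d) if j != i]
        for size in range(d):
            for S in itertools.combinations(others, size):
                # Classic Shapley weight |S|! (d - |S| - 1)! / d!
                w = math.factorial(size) * math.factorial(d - size - 1) / math.factorial(d)
                phi[i] += w * (value(S + (i,)) - value(S))
    return phi

# Toy linear model: the Shapley values recover each feature's contribution.
f = lambda z: 2 * z[0] + 3 * z[1] - z[2]
print(shapley_values(f, np.array([1.0, 1.0, 1.0]), np.zeros(3)))  # ~[2, 3, -1]
```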
Grad-CAM
https://github.com/ramprs/grad-cam/ 28
Pros & Cons: Gradient-based methods
Explanations are visual; detecting important regions in the image is easy
Faster to compute than model-agnostic methods (LIME & SHAP are very expensive)

Difficult to know whether an explanation is correct

Very fragile: adversarial perturbations can leave the prediction the same while substantially changing the explanation

29
Tools: tf-keras-vis

https://pypi.org/project/tf-keras-vis/ 30
Tools: innvestigate

https://github.com/albermax/innvestigate 31
Tools: DeepExplain

https://github.com/marcoancona/DeepExplain 32
Saliency Maps for Text
Saliency maps can be used for text models too

(In the figure: y = gold label, c = predicted class)
There are many possible saliency scores for a token; one possibility is
to use the magnitude of the gradient of the classifier’s logit with
respect to the token’s embedding
While there is no canonical saliency map, these can be used for
identifying salient words when writing adversarial examples 33
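A minimal sketch of this gradient-based token saliency, using a toy embedding-plus-linear classifier so the example stays self-contained; for a real model one would take the gradient at its actual embedding layer, and the vocabulary size and token ids here are placeholders.

```python
import torch
import torch.nn as nn

# Toy text classifier: embed tokens, mean-pool, linear layer to class logits.
vocab_size, embed_dim, n_classes = 100, 16, 2
embedding = nn.Embedding(vocab_size, embed_dim)
classifier = nn.Linear(embed_dim, n_classes)

token_ids = torch.tensor([[5, 42, 7, 13]])            # one sentence of 4 tokens
embeds = embedding(token_ids).detach().requires_grad_(True)

logits = classifier(embeds.mean(dim=1))               # shape (1, n_classes)
predicted = logits.argmax(dim=1).item()               # c = predicted class
logits[0, predicted].backward()

# Saliency per token: magnitude (L2 norm) of the gradient of the predicted
# class's logit with respect to that token's embedding.
token_saliency = embeds.grad.norm(dim=-1).squeeze()   # shape (4,)
```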
Feature Visualization
To understand what a model’s internal component detects, synthesize
an image through gradient descent that maximizes the component
Figure panels: Neuron Visualization | Channel Visualization | Maximally Activating Natural Images

Neuron visualization: the component is a single neuron; starting from a noise image, repeated rounds of gradient descent optimize the image to maximally activate that neuron
Channel visualization: like neuron visualization, also optimized with gradient descent, but the loss might be the sum of the squares of all neuron activations in the channel 34
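A rough sketch of this optimization loop: capture a layer's activations with a forward hook and run gradient ascent (here, Adam on the negated objective) from a noise image to maximize a chosen channel. The tiny network, layer index, channel, learning rate, and step count are all illustrative placeholders, not the models shown on the slides.

```python
import torch
import torch.nn as nn

# Illustrative small conv net; in practice this would be a large pretrained model.
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
).eval()

activations = {}
def capture(module, inputs, output):
    activations["feat"] = output
model[2].register_forward_hook(capture)                # hook the second conv layer

channel = 7                                            # which channel to visualize
img = torch.randn(1, 3, 64, 64, requires_grad=True)    # start from a noise image
optimizer = torch.optim.Adam([img], lr=0.05)

for _ in range(200):
    optimizer.zero_grad()
    model(img)
    # Channel visualization: maximize the summed squared activation of the
    # channel (for a single neuron, index one spatial position instead).
    loss = -(activations["feat"][0, channel] ** 2).sum()
    loss.backward()
    optimizer.step()
```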
35
https://microscope.openai.com/models/resnetv2_50_slim/resnet_v2_50_block2_unit_3_bottleneck_v2_add_0/18
36
https://microscope.openai.com/models/resnetv2_50_slim/resnet_v2_50_block2_unit_3_bottleneck_v2_add_0/18
https://microscope.openai.com/models 37
ProtoPNet (“This Looks Like That”)
These models perform classifications based on the most important
patches of training images, using patches that are prototypical of the
class

38
ProtoPNet (“This Looks Like That”)
These models perform classifications based on the most important
patches of training images, using patches that are prototypical of the
class

39
What is Interpretability?

40
What is Interpretability?
AI Systems are black boxes
We don’t understand how they work

How can we understand (i.e., interpret) model internals?


And can we use interpretability tools (algorithms, methods, etc.) to
detect worst-case misalignments, e.g. models being dishonest or
deceptive?
Can we use interpretability tools to understand what models are
thinking, and why they are doing what they do?
41
Interpretability
New techniques and paradigms for turning model weights and
activations into concepts that humans can understand

https://en.wikipedia.org/wiki/Neural_network
https://en.wikipedia.org/wiki/Artificial_neural_network
42
Interpretability: Mechanistic
Reverse-engineer neural
networks
Explaining neurons and
connected circuits
Excitatory: prompt one
neuron to share
information with the
next through an action
potential
Inhibitory: reduce the
probability that such a
transfer will take place

43
Interpretability: Top-down
Locate information in a model without full understanding of how it is
processed
A lot more tractable than fully reverse-engineering large models

44
Controlling Model Outputs by manipulating representations identified using interpretability

45
Probing
What does probing get you?
Do the representations in the model encode enough information to do X? (e.g., if a model cannot perform word problems, do the representations have enough information to do addition?)

Rephrased: is there a reliable signal in the model representations to do some task X?

46
Probing
Probing loosely refers to a class of methodologies in interpretability for checking whether representations encode information reliably for some specific task

47
Probing: why?
Sometimes, humans have a very strong idea of the subtasks that have to be done to complete a task

But the model can possibly get high accuracy on the task without doing the subtask

Neural nets, in essence, learn to "delete" information from the input. The point of probing is to check whether a model is holding on to the distinctions we care about

48
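A minimal probing sketch: freeze the model, extract representations, and fit a simple (here linear) classifier for the property of interest. The `reps` and `labels` arrays below are random stand-ins for real extracted hidden representations and annotations; the negation example in the comment is only illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Assumed inputs: `reps` are frozen hidden representations extracted from the
# model (n_examples x hidden_dim) and `labels` encode the property we probe
# for (e.g. whether a sentence contains a negation). Random data as a stand-in.
rng = np.random.default_rng(0)
reps = rng.normal(size=(1000, 64))
labels = (reps[:, 0] + 0.1 * rng.normal(size=1000) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(reps, labels, test_size=0.2, random_state=0)

# The probe is deliberately simple (linear), so high held-out accuracy suggests
# the information is present and easy to read out of the representations.
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("probe accuracy:", probe.score(X_te, y_te))
```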
https://precog.iiit.ac.in/pubs/2024_shashwat_negation.pdf 49
Trojan Attacks

50
Trojans
Adversaries can implant hidden functionality into models
When triggered, this can cause a sudden, dangerous change in behavior

The story of the Trojan Horse is well-known. First mentioned in the Odyssey, it describes how Greek soldiers were able to take the city of Troy after a fruitless ten-year siege by hiding in a giant horse supposedly left as an offering to the goddess Athena.

https://www.newyorker.com/humor/daily-shouts/diary-of-the-guy-who-drove-the-trojan-horse-back-from-troy 51
https://www.ox.ac.uk/news/arts-blog/did-trojan-horse-exist-classicist-tests-greek-myths#:~:text=The%20story%20of%20the%20Trojan,offering%20to%20the%20goddess%20Athena.
Trojans
Adversaries can implant hidden functionality into models
When triggered, this can cause a sudden, dangerous change in behavior

52
Attack Vectors
How can adversaries implant hidden functionality?
Public datasets
Model sharing libraries

(Figure: poisoned text & images → dataset (not carefully) curated from the Internet → model has Trojans → fine-tuned & spreads) 53
Data Poisoning
A normal training run:
1) Train a model on a public dataset. 2) It works well during evaluation.

54
Data Poisoning
A data poisoning Trojan attack:
The dataset is poisoned so that the model has a Trojan.

(Figure labels: trigger, target label)

55
Data Poisoning
This works even when a small fraction (e.g. 0.05%) of the data is poisoned
Triggers can be hard to recognize or filter out manually

56
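A minimal sketch of this kind of BadNets-style data poisoning: stamp a small trigger patch onto a tiny fraction of training images and relabel them to the attacker's target class. The patch shape, corner location, poison fraction, and random stand-in data are illustrative assumptions, not details from the slides.

```python
import numpy as np

def poison_dataset(images, labels, target_label=0, fraction=0.0005, patch=3):
    """Stamp a small bright square (the trigger) on a random subset of images
    and flip their labels to the attacker's target class."""
    images, labels = images.copy(), labels.copy()
    rng = np.random.default_rng(0)
    n_poison = max(1, int(fraction * len(images)))
    idx = rng.choice(len(images), size=n_poison, replace=False)
    # Trigger: a bright patch in the bottom-right corner of each chosen image.
    images[idx, -patch:, -patch:] = 1.0
    labels[idx] = target_label
    return images, labels

# Example with random stand-in data shaped like greyscale 28x28 images.
X = np.random.rand(10000, 28, 28).astype(np.float32)
y = np.random.randint(0, 10, size=10000)
X_poisoned, y_poisoned = poison_dataset(X, y)   # only ~0.05% of rows change
```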
Possible Attacks
Just by perturbing the observations of an RL agent, we can induce a
selected action when it reaches a target state⁽⁴⁾

NOOP = NO OPeration
57
Trojan Defenses

58
Detecting Trojans
Different kinds of detection
Does a given input have a Trojan trigger?
(Is an adversary trying to control our network right now?)

Does a given neural network have a Trojan?


(Did an adversary implant hidden functionality?)

We will focus on the second problem for now

59
Detecting Trojans
Detecting Trojans seems challenging at first, because neural networks
are complex, high-dimensional objects
For example, which of the following first-layer MNIST weights belongs
to a Trojaned network?

Trojaned

60
Neural Cleanse
Optimization enables interfacing with complex systems
If we know the general form of the attack, we can reverse-engineer the Trojan
by searching for a trigger and target label

1) Search for a mask and pattern: find the minimum delta needed to misclassify into a chosen label.
2) Repeat for every possible target label.
3) The result is a reverse-engineered trigger.
61
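A rough sketch of the per-label optimization idea: learn a mask and pattern such that blending them into clean images flips the prediction to the target label, while an L1 penalty keeps the mask small. Here `model` and `clean_loader` are assumed to be a trained classifier and a loader of clean images; the image shape, penalty weight, and step counts are illustrative, not the settings from the Neural Cleanse paper.

```python
import torch
import torch.nn.functional as F

def reverse_engineer_trigger(model, clean_loader, target, shape=(3, 32, 32),
                             steps=500, lam=0.01, lr=0.1):
    """Search for a small (mask, pattern) that flips clean images to `target`."""
    for p in model.parameters():
        p.requires_grad_(False)                       # only optimize the trigger
    mask_logit = torch.zeros(1, *shape[1:], requires_grad=True)   # per-pixel mask
    pattern = torch.zeros(shape, requires_grad=True)               # trigger pattern
    opt = torch.optim.Adam([mask_logit, pattern], lr=lr)

    data_iter = iter(clean_loader)
    for _ in range(steps):
        try:
            x, _ = next(data_iter)
        except StopIteration:
            data_iter = iter(clean_loader)
            x, _ = next(data_iter)
        mask = torch.sigmoid(mask_logit)              # values in (0, 1)
        x_trig = (1 - mask) * x + mask * torch.sigmoid(pattern)
        logits = model(x_trig)
        tgt = torch.full((x.size(0),), target, dtype=torch.long)
        # Misclassify into the target label while keeping the mask small.
        loss = F.cross_entropy(logits, tgt) + lam * mask.abs().sum()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return torch.sigmoid(mask_logit).detach(), torch.sigmoid(pattern).detach()

# Repeat for every possible target label; a trigger whose mask norm is much
# smaller than the rest flags that label as likely Trojaned.
```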
Neural Cleanse
This doesn’t always recover the original trigger… but it can reliably
indicate whether a network has a Trojan in the first place.

Out of all the optimized triggers, is one substantially smaller than the rest?

62
Meta-Networks
Train neural networks to analyze other neural networks

For example, given a dataset of clean and Trojaned networks, train input queries and a classifier on the concatenated outputs

Caveat: building a dataset of clean and Trojaned networks requires training many networks, which is computationally expensive
63
Removing Trojans
If we detect a Trojan, how can we remove it?

Recall that Neural Cleanse gives a reverse-engineered trigger that looks unlike the original

Remarkably, the reversed trigger activates similar internal features to the original trigger

Pruning the neurons affected by the reversed trigger removes the Trojan!

64
Hopeful Outlook
Powerful detectors for hidden functionality would make current and
future AI systems much safer

Removing hidden functionality can increase the alignment of AI systems


and make them less inherently hazardous

65
Thank you for attending the class!!!
[email protected]
pk.profgiri | Ponnurangam.kumaraguru | /in/ponguru | ponguru
