Week 8 Lec 41 and 42


Responsible & Safe AI

Prof. Ponnurangam Kumaraguru (PK), IIITH


Prof. Balaraman Ravindran, IIT Madras
Prof. Arun Rajkumar, IIT Madras

Interpretability / Transparency / Probing


What is an alignment problem?

https://www.youtube.com/watch?v=yWDUzNiWPJA 2
Motivation
Transparency tools try to provide clarity about a model’s inner
workings
Model changes can sometimes cause the internal representations to
substantially change, so we would like to understand when models
process data differently
Transparency could make it easier for monitors to detect deception
and other hazards

3
Pixel attribution methods
Highlight the pixels that were relevant for a certain image classification
by a neural network

Pixels are colored by their contribution to the classification 4


Pixel attribution methods
Different names for the same idea: sensitivity map, saliency map, pixel attribution map, gradient-based attribution methods, feature relevance, feature attribution, and feature contribution
Pixel attribution is a special case of feature attribution for images
Feature attribution explains individual predictions by attributing each
input feature according to how much it changed the prediction
(negatively or positively)
Features can be input pixels, tabular data or words

5
Pixel attribution methods
Occlusion- or perturbation-based
Methods like SHAP and LIME manipulate parts of the image to
generate explanations (model-agnostic)
Gradient-based
Many methods compute the gradient of the prediction (or
classification score) with respect to the input features
The gradient-based methods mostly differ in how the gradient is
computed

6
StyleGAN2 & StyleGAN3

Texture sticking
Left: average of images generated from a small neighborhood around a central latent (top row)
Right: a small vertical segment of pixels extracted from each frame and stacked horizontally (StyleGAN2, same coordinates), showing how the hairs move in animation
7
StyleGAN2 & StyleGAN3

StyleGAN2: details are glued to image coordinates rather than to the surface; the internal representations are different
StyleGAN3: fully equivariant to translation and rotation; helps in identifying important properties better
8
Saliency Maps

1. Perform a forward pass
2. Compute the gradient of the class score of interest with respect to the input pixels (set all other classes to zero)
3. Visualize the gradients: you can either show the absolute values or highlight negative and positive contributions separately (see the sketch below). 9
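As a rough illustration of these three steps, here is a minimal PyTorch sketch; the tiny CNN stands in for whatever classifier is being explained, and the random input tensor stands in for a real preprocessed image (both are assumptions, not part of the slides).

```python
import torch
import torch.nn as nn

# Any image classifier that returns class logits will do; a tiny CNN keeps
# the sketch self-contained (a stand-in for e.g. a pretrained ResNet).
model = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(8, 10),
).eval()

x = torch.rand(1, 3, 64, 64, requires_grad=True)  # preprocessed input image

# 1) Forward pass.
logits = model(x)

# 2) Gradient of the class score of interest w.r.t. the input pixels.
#    Backpropagating only the target logit implicitly zeroes the other classes.
target = logits.argmax(dim=1).item()
logits[0, target].backward()

# 3) Visualize: e.g. take the max absolute gradient across colour channels.
saliency = x.grad.abs().max(dim=1)[0].squeeze()   # shape (64, 64)
```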
Saliency Maps

1. Generate multiple versions of the image of interest by adding noise to it.
2. Create pixel attribution maps for all the noisy images.
3. Average the pixel attribution maps (see the sketch below). 10
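A minimal sketch of this noise-and-average procedure (SmoothGrad-style), reusing `model`, `x`, and `target` from the previous sketch; the noise scale and sample count are illustrative choices, not values from the slides.

```python
import torch

def vanilla_saliency(model, x, target):
    # Gradient of the target logit w.r.t. the input, reduced over channels.
    x = x.clone().detach().requires_grad_(True)
    model(x)[0, target].backward()
    return x.grad.abs().max(dim=1)[0]

def smooth_saliency(model, x, target, n_samples=25, sigma=0.1):
    # 1) Noisy copies of the image, 2) attribute each, 3) average the maps.
    maps = [vanilla_saliency(model, x + sigma * torch.randn_like(x), target)
            for _ in range(n_samples)]
    return torch.stack(maps).mean(dim=0)

smoothed = smooth_saliency(model, x, target)
```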
11
Saliency Maps

Guided backpropagation: backprop with intermediate negative activations and negative gradients zeroed out, i.e. only positive gradients flow back to the input
12
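One common way to implement this "only positive gradients" rule in PyTorch is to clamp the gradient to be non-negative as it flows back through every ReLU; the sketch below assumes the network uses plain (non-inplace) `nn.ReLU` modules, and the function names are mine, not from the slides.

```python
import torch
import torch.nn as nn

def guided_relu_hook(module, grad_input, grad_output):
    # ReLU's normal backward already zeroes gradients where the activation
    # was negative; additionally keep only the positive gradients.
    return (torch.clamp(grad_input[0], min=0.0),)

def guided_backprop(model, x, target):
    # Attach the hook to every ReLU, backprop the target logit, then clean up.
    handles = [m.register_full_backward_hook(guided_relu_hook)
               for m in model.modules() if isinstance(m, nn.ReLU)]
    x = x.clone().detach().requires_grad_(True)
    model(x)[0, target].backward()
    for h in handles:
        h.remove()
    return x.grad
```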
Saliency Maps

13
Saliency Maps Can Be Deceptive
Many transparency tools create fun-to-look-at visualizations that do
not actually inform us much about how models are making predictions

14
Sanity Checks for Saliency Maps
If we randomize the layers, some saliency maps do not change much,
which suggests they do not capture what the model has learned

If a model captures higher-level class concepts, then saliency maps should change as the model is being randomized. Sole visual inspection can be deceiving.
15
Optimized Masks for Saliency
Some saliency maps optimize a mask to locate and blur salient regions

This is highly sensitive to hyperparameters and mask initialization


16
17
18
LIME: Local Interpretable Model-agnostic Explanations

Works on any black-box model (model internals are hidden)
Works with many data types
Using prior knowledge, we can validate the explanations
Explanations are locally faithful, but not necessarily globally

https://www.youtube.com/watch?v=d6j6bofhj2M 19
LIME: Local Interpretable Model-agnostic Explanations

20
LIME: Local Interpretable Model-agnostic Explanations

21
LIME: Local Interpretable Model-agnostic Explanations

22
LIME: Local Interpretable Model-agnostic Explanations

23
LIME: Local Interpretable Model-agnostic Explanations

24
LIME: Local Interpretable Model-agnostic Explanations

Lasso Regression

25
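Underneath the library, LIME's core loop is small: sample perturbations around the instance, weight them by proximity, and fit a sparse (Lasso) surrogate whose coefficients are the explanation. The sketch below is a simplified tabular version for illustration, not the official lime package; `black_box_predict` is a placeholder for any model's prediction function, and the kernel width and sampling scheme are assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso

def lime_explain(black_box_predict, x, n_samples=500, sigma=1.0, alpha=0.01):
    """Explain one tabular instance x (1-D array) with a local Lasso surrogate."""
    rng = np.random.default_rng(0)
    # 1) Sample perturbations in a neighbourhood of x.
    X_pert = x + rng.normal(scale=0.5, size=(n_samples, x.size))
    # 2) Query the black box on the perturbed points.
    y_pert = black_box_predict(X_pert)
    # 3) Weight samples by proximity to x (exponential kernel).
    dists = np.linalg.norm(X_pert - x, axis=1)
    weights = np.exp(-(dists ** 2) / (2 * sigma ** 2))
    # 4) Fit a sparse, interpretable surrogate; its coefficients attribute each
    #    feature's local (positive or negative) contribution.
    surrogate = Lasso(alpha=alpha).fit(X_pert, y_pert, sample_weight=weights)
    return surrogate.coef_

# Example with a hypothetical black box whose output rises with feature 0.
coefs = lime_explain(lambda X: 1 / (1 + np.exp(-X[:, 0])), np.zeros(5))
```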
https://arxiv.org/pdf/1602.04938.pdf
26
SHAP (SHapley Additive exPlanations)
Game theoretic approach to explain the output of any machine
learning model.
Connects optimal credit allocation with local explanations using the
classic Shapley values from game theory and their related extensions

https://github.com/shap/shap 27
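For a handful of features, the classic Shapley value behind SHAP can be computed exactly by averaging each feature's marginal contribution over all coalitions. The brute-force sketch below (with a baseline-replacement value function and a toy linear model) is for intuition only; the shap library uses far more efficient approximations.

```python
import itertools
import math
import numpy as np

def shapley_values(predict, x, baseline):
    """Exact Shapley values for one instance x against a baseline (tiny d only)."""
    d = len(x)
    phi = np.zeros(d)

    def value(coalition):
        # Features in the coalition take their real values, the rest the baseline.
        z = baseline.copy()
        z[list(coalition)] = x[list(coalition)]
        return predict(z)

    for i in range(d):
        others = [j for j in range(d) if j != i]
        for size in range(d):
            for S in itertools.combinations(others, size):
                # Classic Shapley weight |S|! (d - |S| - 1)! / d!
                w = math.factorial(size) * math.factorial(d - size - 1) / math.factorial(d)
                phi[i] += w * (value(S + (i,)) - value(S))
    return phi

# Toy linear model: the Shapley values recover each feature's contribution.
f = lambda z: 2 * z[0] + 3 * z[1] - z[2]
print(shapley_values(f, np.array([1.0, 1.0, 1.0]), np.zeros(3)))  # ~[2, 3, -1]
```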
Grad-CAM
https://github.com/ramprs/grad-cam/ 28
Pros & Cons: Gradient-based methods
Explanations are visual; detecting important regions in the image is easy
Faster to compute than model-agnostic methods (LIME & SHAP are very expensive)

Difficult to know whether an explanation is correct

Very fragile: adversarial perturbations can leave the prediction the same while substantially changing the explanation

29
Tools: tf-keras-vis

https://pypi.org/project/tf-keras-vis/ 30
Tools: innvestigate

https://github.com/albermax/innvestigate 31
Tools: DeepExplain

https://github.com/marcoancona/DeepExplain 32
Saliency Maps for Text
Saliency maps can be used for text models too

(In the figure: y = gold label, c = predicted class)
There are many possible saliency scores for a token; one possibility is
to use the magnitude of the gradient of the classifier’s logit with
respect to the token’s embedding
While there is no canonical saliency map, these can be used for
identifying salient words when writing adversarial examples 33
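A minimal sketch of this gradient-based token saliency, using a toy embedding-plus-linear classifier so the example stays self-contained; for a real model one would take the gradient at its actual embedding layer, and the vocabulary size and token ids here are placeholders.

```python
import torch
import torch.nn as nn

# Toy text classifier: embed tokens, mean-pool, linear layer to class logits.
vocab_size, embed_dim, n_classes = 100, 16, 2
embedding = nn.Embedding(vocab_size, embed_dim)
classifier = nn.Linear(embed_dim, n_classes)

token_ids = torch.tensor([[5, 42, 7, 13]])            # one sentence of 4 tokens
embeds = embedding(token_ids).detach().requires_grad_(True)

logits = classifier(embeds.mean(dim=1))               # shape (1, n_classes)
predicted = logits.argmax(dim=1).item()               # c = predicted class
logits[0, predicted].backward()

# Saliency per token: magnitude (L2 norm) of the gradient of the predicted
# class's logit with respect to that token's embedding.
token_saliency = embeds.grad.norm(dim=-1).squeeze()   # shape (4,)
```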
Feature Visualization
To understand what a model’s internal component detects, synthesize
an image through gradient descent that maximizes the component
Figure panels: Neuron Visualization | Channel Visualization | Maximally Activating Natural Images

Neuron visualization: the component is a single neuron; starting from a noise image, repeated rounds of gradient descent optimize the image to maximally activate that neuron
Channel visualization: like neuron visualization, also optimized with gradient descent, but the loss might be the sum of the squares of all neuron activations in the channel 34
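A rough sketch of this optimization loop: capture a layer's activations with a forward hook and run gradient ascent (here, Adam on the negated objective) from a noise image to maximize a chosen channel. The tiny network, layer index, channel, learning rate, and step count are all illustrative placeholders, not the models shown on the slides.

```python
import torch
import torch.nn as nn

# Illustrative small conv net; in practice this would be a large pretrained model.
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
).eval()

activations = {}
def capture(module, inputs, output):
    activations["feat"] = output
model[2].register_forward_hook(capture)                # hook the second conv layer

channel = 7                                            # which channel to visualize
img = torch.randn(1, 3, 64, 64, requires_grad=True)    # start from a noise image
optimizer = torch.optim.Adam([img], lr=0.05)

for _ in range(200):
    optimizer.zero_grad()
    model(img)
    # Channel visualization: maximize the summed squared activation of the
    # channel (for a single neuron, index one spatial position instead).
    loss = -(activations["feat"][0, channel] ** 2).sum()
    loss.backward()
    optimizer.step()
```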
35
https://microscope.openai.com/models/resnetv2_50_slim/resnet_v2_50_block2_unit_3_bottleneck_v2_add_0/18
36
https://microscope.openai.com/models/resnetv2_50_slim/resnet_v2_50_block2_unit_3_bottleneck_v2_add_0/18
https://microscope.openai.com/models 37
ProtoPNet (“This Looks Like That”)
These models perform classifications based on the most important
patches of training images, using patches that are prototypical of the
class

38
ProtoPNet (“This Looks Like That”)
These models perform classifications based on the most important
patches of training images, using patches that are prototypical of the
class

39
What is Interpretability?

40
What is Interpretability?
AI Systems are black boxes
We don’t understand how they work

How can we understand (i.e., interpret) model internals?


And can we use interpretability tools (algorithms, methods, etc.) to
detect worst-case misalignments, e.g. models being dishonest or
deceptive?
Can we use interpretability tools to understand what models are
thinking, and why they are doing what they do?
41
Interpretability
New techniques and paradigms for turning model weights and
activations into concepts that humans can understand

https://en.wikipedia.org/wiki/Neural_network
https://en.wikipedia.org/wiki/Artificial_neural_network
42
Interpretability: Mechanistic
Reverse-engineer neural
networks
Explaining neurons and
connected circuits
Excitatory: prompt one
neuron to share
information with the
next through an action
potential
Inhibitory: reduce the
probability that such a
transfer will take place

43
Interpretability: Top-down
Locate information in a model without full understanding of how it is
processed
A lot more tractable than fully reverse-engineering large models

44
Controlling Model Outputs by manipulating representations identified using interpretability

45
Probing
What does probing get you?
Do the representations in the model encode enough information to do X? (e.g., if a model cannot perform word problems, do the representations have enough information to do addition?)

Rephrased: is there a reliable signal in the model representations to do some task X?

46
Probing
Probing loosely refers to a class of methodologies in interpretability for checking whether representations encode information reliably for some specific task

47
Probing: why?
Sometimes, humans have a very strong idea of the subtasks that have to be done to complete a task

But the model can possibly get high accuracy on the task without doing the subtask

Neural nets, in essence, learn to "delete" information from the input. The point of probing is to check whether a model is holding on to the distinctions we care about

48
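A minimal probing sketch: freeze the model, extract representations, and fit a simple (here linear) classifier for the property of interest. The `reps` and `labels` arrays below are random stand-ins for real extracted hidden representations and annotations; the negation example in the comment is only illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Assumed inputs: `reps` are frozen hidden representations extracted from the
# model (n_examples x hidden_dim) and `labels` encode the property we probe
# for (e.g. whether a sentence contains a negation). Random data as a stand-in.
rng = np.random.default_rng(0)
reps = rng.normal(size=(1000, 64))
labels = (reps[:, 0] + 0.1 * rng.normal(size=1000) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(reps, labels, test_size=0.2, random_state=0)

# The probe is deliberately simple (linear), so high held-out accuracy suggests
# the information is present and easy to read out of the representations.
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("probe accuracy:", probe.score(X_te, y_te))
```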
https://precog.iiit.ac.in/pubs/2024_shashwat_negation.pdf 49
Trojan Attacks

50
Trojans
Adversaries can implant hidden functionality into models
When triggered, this can cause a sudden, dangerous change in behavior

The story of the Trojan Horse is well-known. First mentioned in the Odyssey, it describes how Greek soldiers were able to take the city of Troy after a fruitless ten-year siege by hiding in a giant horse supposedly left as an offering to the goddess Athena.

https://www.newyorker.com/humor/daily-shouts/diary-of-the-guy-who-drove-the-trojan-horse-back-from-troy 51
https://www.ox.ac.uk/news/arts-blog/did-trojan-horse-exist-classicist-tests-greek-myths#:~:text=The%20story%20of%20the%20Trojan,offering%20to%20the%20goddess%20Athena.
Trojans
Adversaries can implant hidden functionality into models
When triggered, this can cause a sudden, dangerous change in behavior

52
Attack Vectors
How can adversaries implant hidden functionality?
Public datasets
Model sharing libraries

(Figure: poisoned text & images → dataset (not carefully) curated from the Internet → model has Trojans → fine-tuned & spreads) 53
Data Poisoning
A normal training run:
1) Train a model on a public dataset. 2) It works well during evaluation.

54
Data Poisoning
A data poisoning Trojan attack:
The dataset is poisoned so that the model has a Trojan.

(Figure labels: trigger, target label)

55
Data Poisoning
This works even when a small fraction (e.g. 0.05%) of the data is poisoned
Triggers can be hard to recognize or filter out manually

56
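A minimal sketch of this kind of BadNets-style data poisoning: stamp a small trigger patch onto a tiny fraction of training images and relabel them to the attacker's target class. The patch shape, corner location, poison fraction, and random stand-in data are illustrative assumptions, not details from the slides.

```python
import numpy as np

def poison_dataset(images, labels, target_label=0, fraction=0.0005, patch=3):
    """Stamp a small bright square (the trigger) on a random subset of images
    and flip their labels to the attacker's target class."""
    images, labels = images.copy(), labels.copy()
    rng = np.random.default_rng(0)
    n_poison = max(1, int(fraction * len(images)))
    idx = rng.choice(len(images), size=n_poison, replace=False)
    # Trigger: a bright patch in the bottom-right corner of each chosen image.
    images[idx, -patch:, -patch:] = 1.0
    labels[idx] = target_label
    return images, labels

# Example with random stand-in data shaped like greyscale 28x28 images.
X = np.random.rand(10000, 28, 28).astype(np.float32)
y = np.random.randint(0, 10, size=10000)
X_poisoned, y_poisoned = poison_dataset(X, y)   # only ~0.05% of rows change
```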
Possible Attacks
Just by perturbing the observations of an RL agent, we can induce a
selected action when it reaches a target state⁽⁴⁾

NOOP = NO OPeration
57
Trojan Defenses

58
Detecting Trojans
Different kinds of detection
Does a given input have a Trojan trigger?
(Is an adversary trying to control our network right now?)

Does a given neural network have a Trojan?


(Did an adversary implant hidden functionality?)

We will focus on the second problem for now

59
Detecting Trojans
Detecting Trojans seems challenging at first, because neural networks
are complex, high-dimensional objects
For example, which of the following first-layer MNIST weights belongs
to a Trojaned network?

Trojaned

60
Neural Cleanse
Optimization enables interfacing with complex systems
If we know the general form of the attack, we can reverse-engineer the Trojan
by searching for a trigger and target label

1) Search for a mask and pattern: find the minimum delta needed to misclassify into a chosen label.
2) Repeat for every possible target label.
3) The result is a reverse-engineered trigger.
61
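A rough sketch of the per-label optimization idea: learn a mask and pattern such that blending them into clean images flips the prediction to the target label, while an L1 penalty keeps the mask small. Here `model` and `clean_loader` are assumed to be a trained classifier and a loader of clean images; the image shape, penalty weight, and step counts are illustrative, not the settings from the Neural Cleanse paper.

```python
import torch
import torch.nn.functional as F

def reverse_engineer_trigger(model, clean_loader, target, shape=(3, 32, 32),
                             steps=500, lam=0.01, lr=0.1):
    """Search for a small (mask, pattern) that flips clean images to `target`."""
    for p in model.parameters():
        p.requires_grad_(False)                       # only optimize the trigger
    mask_logit = torch.zeros(1, *shape[1:], requires_grad=True)   # per-pixel mask
    pattern = torch.zeros(shape, requires_grad=True)               # trigger pattern
    opt = torch.optim.Adam([mask_logit, pattern], lr=lr)

    data_iter = iter(clean_loader)
    for _ in range(steps):
        try:
            x, _ = next(data_iter)
        except StopIteration:
            data_iter = iter(clean_loader)
            x, _ = next(data_iter)
        mask = torch.sigmoid(mask_logit)              # values in (0, 1)
        x_trig = (1 - mask) * x + mask * torch.sigmoid(pattern)
        logits = model(x_trig)
        tgt = torch.full((x.size(0),), target, dtype=torch.long)
        # Misclassify into the target label while keeping the mask small.
        loss = F.cross_entropy(logits, tgt) + lam * mask.abs().sum()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return torch.sigmoid(mask_logit).detach(), torch.sigmoid(pattern).detach()

# Repeat for every possible target label; a trigger whose mask norm is much
# smaller than the rest flags that label as likely Trojaned.
```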
Neural Cleanse
This doesn’t always recover the original trigger… but it can reliably
indicate whether a network has a Trojan in the first place.

Out of all the optimized triggers, is one substantially smaller than the rest?

62
Meta-Networks
Train neural networks to analyze other neural networks

For example, given a dataset of clean and Trojaned networks, train input queries and a classifier on the concatenated outputs

Caveat: building a dataset of clean and Trojaned networks requires training many networks, which is computationally expensive
63
Removing Trojans
If we detect a Trojan, how can we remove it?

Recall that Neural Cleanse gives a reverse-engineered trigger that looks unlike the original

Remarkably, the reversed trigger activates similar internal features to the original trigger

Pruning the neurons affected by the reversed trigger removes the Trojan!

64
Hopeful Outlook
Powerful detectors for hidden functionality would make current and
future AI systems much safer

Removing hidden functionality can increase the alignment of AI systems


and make them less inherently hazardous

65
Thank you for attending the class!!!
[email protected]
pk.profgiri | Ponnurangam.kumaraguru | /in/ponguru | ponguru
