HCAI Module 4

The document discusses the transition from traditional AI to Human-Centered AI (HCAI), emphasizing the need for ethical governance and design that prioritizes human performance and satisfaction. It outlines a four-layer governance structure for HCAI systems, which includes reliable engineering practices, safety culture, independent oversight, and government regulation, while highlighting the importance of ethical principles such as accountability and transparency. The document also addresses the complexities of implementing HCAI and the necessity for ongoing discussions and pilot testing to ensure successful integration of these systems.


CHAPTER 18

Introduction: How to Bridge the Gap from Ethics to Practice
The vast majority of AI . . .will remain. . . subject to existing regulations and regulating
strategies. . . . What we need to govern is the human application of technology, and
what we need to oversee are the human processes of development, testing, operation,
and monitoring.
Joanna J. Bryson, “The Artificial Intelligence of the Ethics of Artificial Intelligence,”
in The Oxford Handbook of Ethics of AI (2020), edited by Markus D.
Dubber, Frank Pasquale, and Sunit Das

The widespread application of HCAI comes with high expectations
of benefits for many domains, including healthcare, education, cyber
security, and environmental protection. When it works as expected,
HCAI could improve medical diagnoses, stop cybercriminals, and protect en-
dangered species. However, there are equally dire predictions of out-of-control
robots, unfair treatment of minority groups, privacy violations, adversarial at-
tacks, and challenges to human rights. HCAI design and evaluation methods
can address these dangers to realize the desired benefits.1
Part 3 made the case that traditional AI science research focused on
studying and then emulating (some would use the term simulating) human
behavior, while current AI innovation research emphasizes practical appli-
cations. Typical AI scientific foundations and technology-based innovations
include pattern recognition and generation (images, speech, facial, signal,
etc.), human-like robots, game playing (checkers, chess, Go, etc.), and natural
language processing, generation, and translation.
HCAI research builds on these scientific foundations by using them to am-
plify, augment, and enhance human performance in ways that make systems
reliable, safe, and trustworthy.2 Typical HCAI scientific foundations and
technology-based applications include graphical user interfaces, web-page
design, e-commerce, mobile device apps, email/texting/video conferencing,
photo/video editing, social media, and computer games. These systems also
support human self-efficacy, encourage creativity, clarify responsibility, and fa-
cilitate social participation. HCAI designers, software engineers, and managers
can adopt user-centered participatory design methods by engaging with diverse
stakeholders. Then user experience testing helps ensure that these systems sup-
port human goals, activities, and values. The indicator of a shift to new thinking
is the growing awareness that evaluating human performance and satisfaction
is as important as measuring algorithm performance.
Multiple HCAI definitions come from prominent institutions such as Stan-
ford University, which seeks “to serve the collective needs of humanity” by
understanding “human language, feelings, intentions and behaviors.”3 While
there is a shared belief that data-driven algorithms using machine and deep
learning bring benefits, they also make it more difficult to know where the
failure points may be. Explainable user interfaces and comprehensible con-
trol panels could help realize HCAI’s benefits in healthcare, business, and
education.
The idea that HCAI represents a new synthesis conveys the significance of
this change in attitudes and practices. In the past, researchers and developers
focused on building AI algorithms and systems, stressing the autonomy of ma-
chines rather than human control through well-designed user interfaces. In
contrast, HCAI puts human autonomy at the center of design thinking, empha-
sizing user experience design. Researchers and developers for HCAI systems
focus on measuring human performance and satisfaction, valuing customer
and consumer needs, and ensuring meaningful human control.4 Leaders of
existing businesses are adapting quickly to integrate HCAI systems.
This new synthesis may take decades until it is widely accepted, as it
represents a fundamental shift to value both machine- and human-centered
outlooks. The fifteen practical recommendations in Part 4 are meant to en-
courage discussion and actions that would accelerate this shift. However, at
least two sources of HCAI system complexity make it difficult to implement
all of these recommendations. First, individual components can be carefully
tested, reviewed, and monitored by familiar software engineering practices,
but complex HCAI systems, such as self-driving cars, social media platforms,
and electronic healthcare systems are difficult to assess. That difficulty means
that until engineering practices are refined, social mechanisms of independent
oversight and reviews of failures and near misses are necessary. Second, HCAI
systems are woven together from many products and services, including chips,
software development tools, voluminous training data, extensive code libraries,
and numerous test cases for validation and verification, each of which may
change, sometimes on a daily basis. These difficulties present grand challenges
for software engineers, managers, reviewers, and policy-makers, so the rec-
ommendations are meant to launch much-needed discussions, pilot tests, and
scalable research that can lead to constructive changes.
There are more than 500 reports describing aspirational HCAI principles
from companies, professional societies, governments, consumer groups, and
non-government organizations.5 A Berkman Klein Center report from 2020
discusses the upsurge of policy activity, followed by a thoughtful summary of
thirty-six of the leading and most comprehensive reports. The authors identify
eight HCAI themes for deeper commentary and detailed principles: privacy,
accountability, safety and security, transparency and explainability, fairness and
non-discrimination, human control of technology, professional responsibility,
and promotion of human values.6
Other reports stress ethical principles, such as IEEE’s far-reaching “Ethically
Aligned Design,” which emerged from a 3-year effort involving more than 200
people. The report offers clear statements about eight general principles: hu-
man rights, well-being, data agency, effectiveness, transparency, accountability,
awareness of misuse, and competence. It went further with strong encourage-
ment to ensure that advanced systems “shall be created and operated to respect,
promote, and protect internationally recognized human rights”.7 Figure 18.1
shows the close match and the roughly similar principles in the two reports.

Fig 18.1 Human-centered AI principles in the Berkman Klein Center report and the IEEE Ethically Aligned Design report. The pairings, listed from close matches at the top to roughly similar pairings below, are:

Berkman Klein Center | IEEE Ethically Aligned Design
Accountability | Accountability
Transparency & explainability | Transparency
Promotion of human values | Human rights
Safety & security | Well-being
Human control of technology | Effectiveness
Fairness & non-discrimination | Awareness of misuse
Professional responsibility | Competence
Privacy | Data agency
These and other ethical principles are important foundations for clear think-
ing, but as Alan Winfield from the University of Bristol and Marina Jirotka
from Oxford University note: “the gap between principles and practice is an im-
portant theme.”8 The four-layer governance structure for HCAI systems could
help bridge this gap: (1) reliable systems based on sound software engineering
practices, (2) safety culture through proven business management strategies,
(3) trustworthy certification by independent oversight, and (4) regulation by
government agencies (Figure 18.2). The inner oval covers the many software en-
gineering teams which apply technical practices relevant to each project. These
teams are part of a larger organization (second oval) where safety culture man-
agement strategies influence each project team. In the third oval, independent
oversight boards review many organizations in the same industry, giving them
a deeper understanding, while spreading successful practices.
The largest oval is government regulation, which provides another layer of
thinking that addresses the public’s interest in reliable, safe, and trustworthy
HCAI systems. Government regulation is controversial, but success stories
such as the US National Transportation Safety Board’s investigation of plane,
train, boat, and highway accidents have generally been seen as advancing

Governance Structures for Human-Centered AI (four nested ovals, outermost first):
• GOVERNMENT REGULATION
• INDUSTRY: trustworthy certification and external reviews; independent oversight by auditing firms, insurance companies, NGOs & civil society, and professional societies
• ORGANIZATION: safety culture through organizational design and management strategies: leadership commitment, hiring & training, failures & near misses, internal reviews, and industry standards
• TEAM: reliable systems through software engineering technical practices: audit trails, SE workflows, verification & bias testing, and explainable UIs

Fig 18.2 Governance structures for human-centered AI: The four levels are shown as
nested ovals: (1) Team: reliable systems based on software engineering (SE) practices,
(2) Organization: a well-developed safety culture based on sound management strategies,
(3) Industry: trustworthy certification by external review,
and (4) Government regulation.
public interests. Government regulations in Europe, such as the General Data
Protection Regulation, triggered remarkable research and innovation in ex-
plainable AI. The US regulations over automobile safety and fuel efficiency had
a similar stimulus in improving design research.
Reliability, safety, and trustworthiness are vital concepts for everyone in-
volved in technology development, whether driven by AI or other methods.
These concepts and others, including privacy, security, environmental pro-
tection, social justice, and human rights are also strong concerns at every
level: software engineering, business management, independent oversight, and
government regulation.
While corporations often make positive public statements about their com-
mitment to customer and employee benefits, when business leaders have to
make difficult decisions about power and money, they may favor their corpo-
rate interests and stockholder expectations.9 Current movements for human
rights and corporate social responsibility are helpful in building public sup-
port, but these are optional items for most managers. Required processes for
software engineers, managers, external reviewers, and government agencies,
which are guided by clear principles and open reporting of corporate plans,
will have more impact, especially when reviewed by internal and external re-
view boards. This is especially true in emerging technologies, such as HCAI,
where corporate managers may be forced by public pressure to acknowledge
their societal responsibilities and report publicly on progress.
Similarly, most government policy-makers need to be more informed about
how HCAI technologies work, and how business decisions affect the public in-
terest. Congressional or parliamentary legislation governs industry practices,
but government agency staffers must make difficult decisions about how they
enforce the laws. Professional societies and non-governmental organizations
(NGOs) are making efforts to inform government officials.
These proposed governance structures in Part 4 are practical steps based
on existing practices, which have to be adapted to fit new HCAI technolo-
gies. They are meant to clarify who takes action and who is responsible. To
increase chances for success, these recommendations will need to be put to
work with a budget and a schedule. Each recommendation requires pilot test-
ing and research to validate effectiveness.10 These governance structures are a
starting point. Newer approaches will be needed as technologies advance or
when market forces and public opinion reshape the products and services that
become successful. For example, public opinion dramatically shifted business
practices over facial recognition technologies in 2020, when leading developers,
including IBM, Amazon, and Microsoft, withdrew from selling these systems
to police departments because of pressure over potential misuse and abuse.11
The next four chapters cover the four levels of governance. Chapter 19
describes five technical practices of software engineering teams that enable re-
liable HCAI systems: audit trails, workflows, verification and validation testing,
bias testing, and explainable user interfaces.
Chapter 20 suggests how organizations that manage software engineering
projects can develop a safety culture through leadership commitment, hiring
and training, reporting failures and near misses, internal reviews, and industry
standards.
Chapter 21 shows how independent oversight methods by external review
organizations can lead to trustworthy certification and independent audits of
products and services. These independent oversight methods create a trusted
infrastructure to investigate failures, continuously improve systems, and gain
public confidence. Independent oversight methods include auditing firms, in-
surance companies, NGOs and civil society, and professional organizations.12
Chapter 22 opens the larger and controversial discussion of possible gov-
ernment interventions and regulations. The summary in Chapter 23 raises
concerns, but offers an optimistic view that well-designed HCAI systems will
bring meaningful benefits to individuals, organizations, and society.
The inclusion of human-centered thinking will be difficult for those who
have long seen algorithms as the dominant goal. They will question the validity
of this new synthesis, but human-centered thinking and practices put AI algo-
rithms and systems to work for commercially successful products and services.
HCAI offers a hope-filled vision of future technologies that support human
self-efficacy, creativity, responsibility, and social connections among people.
CHAPTER 19

Reliable Systems Based on Sound Software Engineering Practices

Reliable HCAI systems are produced by software engineering teams that
apply sound technical practices.1 These technical practices clarify hu-
man responsibility, such as audit trails for accurate records of who did
what and when, and histories of who contributed to design, coding, testing,
and revisions.2 Other technical practices are improved software engineering
workflows that are tuned to the tasks and application domain. Then when pro-
totype systems are ready, verification and validation testing of the programs,
and bias testing of the training data can begin. Software engineering practices
also include the user experience design processes that lead to explainable user
interfaces for HCAI systems (see Figure 18.2).

Audit Trails and Analysis Tools


The success of flight data recorders (FDR) in making civil aviation remark-
ably safe provides a clear guide for the design of any product or service that
has consequential or life-critical impacts. The history of FDRs and the cockpit
voice recorders (CVR) demonstrates the value of using these tools to under-
stand aviation crashes, which have contributed strongly to safe civil aviation.3
Beyond accident investigations, FDRs have proven to be valuable in showing
what was done right to avoid accidents, providing valuable lessons to improve
training and equipment design. A further use of FDR data is to detect changes
in equipment behavior over time to schedule preventive maintenance.
FDRs provide important lessons for HCAI designers of audit trails (also
called product logs) to record the actions of robots.4 These robot versions of
aviation flight data recorders have been called smart, ethical, or black boxes,
but the consistent intention of designers is to collect relevant evidence for retro-
spective analyses of failures.5 Such retrospective analyses are often conducted
to assign liability in legal decision-making and to provide guidance for con-
tinuous improvement of these systems. They also clarify responsibility, which
exonerates those who have performed properly, as in the case of the unfairly
accused nurses whose use of an intravenous morphine device was revealed to
be proper.
Similar proposals have been made for highly automated (also called self-
driving or driverless) cars.6 These proposals extend current work on electronic
logging devices, which are installed on many cars to support better mainte-
nance. Secondary uses of vehicle logging devices are to improve driver training,
monitor environmentally beneficial driving styles, and verify truck driver com-
pliance with work and traffic rules. In some cases, these logs have provided
valuable data in analyzing the causes of accidents, but controversy continues
about who owns the data and what rights manufacturers, operators, insurance
companies, journalists, and police have to gain access.
Industrial robots are another application area for audit trails, to promote
safety and reduce deaths in manufacturing applications. Industry groups such
as the Robotic Industries Association, now transformed into the Association for
Advancing Automation, have promoted voluntary safety standards and some
forms of auditing since 1986.7
Audit trails for stock market trading algorithms are now widely used to log
trades so that managers, customers, and the US Securities and Exchange Com-
mission can study errors, detect fraud, or recover from flash crash events.8
Other audit trails from healthcare, cybersecurity, and environmental monitor-
ing enrich the examples from which improved audit trails can be tuned to the
needs of HCAI applications.
Challenging research questions remain, such as what data are needed for ef-
fective retrospective forensic analysis and how to efficiently capture and store
high volumes of video, sound, and light detection and ranging (LIDAR) data,
with proper encryption to prevent falsification. Logs should also include ma-
chine learning algorithms used, the code version, and the associated training
data at the time of an incident. Then research questions remain about how
to analyze the large volume of data in these logs. Issues of privacy and secu-
rity complicate the design, as do legal issues such as who owns the data and
what rights manufacturers, operators, insurance companies, journalists, and
police have to access, benefit from, or publish these data sets. Effective user in-
terfaces, visualizations, statistical methods, and secondary AI systems enable
investigators to explore the audit trails to make sense of the voluminous data.
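A minimal sketch of what one audit-trail entry might capture for a single HCAI decision, assuming a hypothetical log_decision helper and a JSON-lines log file; the field names are illustrative, not a standard, and real systems would add encryption and access controls as discussed above.

```python
import hashlib
import json
from datetime import datetime, timezone

def log_decision(log_path, model_name, model_version, training_data_id,
                 input_record, output, operator):
    """Append one tamper-evident audit record for a single HCAI decision."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model_name": model_name,
        "model_version": model_version,        # code/model version in use
        "training_data_id": training_data_id,  # identifier or hash of the training set
        "input": input_record,                 # what the system saw
        "output": output,                      # what it decided or recommended
        "operator": operator,                  # who was responsible at the time
    }
    # Hash the entry so later alteration of the log can be detected.
    entry["entry_hash"] = hashlib.sha256(
        json.dumps(entry, sort_keys=True).encode()
    ).hexdigest()
    with open(log_path, "a", encoding="utf-8") as log_file:
        log_file.write(json.dumps(entry) + "\n")

log_decision("audit_log.jsonl", "loan_model", "2.4.1", "train-2021-03",
             {"income": 52000, "amount": 180000}, "approved", "analyst-17")
```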
An important extension of audit trails are incident databases that capture
records of publicly reported incidents in aviation, medicine, transportation,
and cybersecurity. The Partnership on AI has started an AI Incident Database
that has more than a thousand reports (see Chapter 20’s section “Extensive
Reporting of Failures and Near Misses”).

Software Engineering Workflows


As AI technologies and machine learning algorithms are integrated into HCAI
applications, software engineering workflows are being updated. The new chal-
lenges include novel forms of benchmark testing for verifying and validating
algorithms and data (see this chapter’s section “Verification and Validation
Testing”), improved bias testing to enhance algorithm and data fairness (see this
chapter’s section “Bias Testing to Enhance Fairness”), and agile programming
team methods.9 All these practices have to be tuned to the different domains
of usage, such as healthcare, education, environmental protection, and defense.
To support users and legal requirements, software engineering workflows have
to support explainable user interfaces (see this chapter’s section “Explainable
User Interface”).
The international team of Jie M. Zhang, Mark Harman, Lei Ma, and Yang Liu
describes five problem types for machine learning that may need software en-
gineering workflows that are different from those for traditional programming
projects:10

1) Classification: to assign a category to each data instance; e.g., image classification, handwriting recognition.
2) Regression: to predict a value for each data instance; e.g., tempera-
ture/age/income prediction.
3) Clustering: to partition instances into homogeneous regions; e.g., pat-
tern recognition, market/image segmentation.
4) Dimension reduction: to reduce the training complexity; e.g., data-set
representation, data pre-processing.
5) Control: to control actions to maximize rewards; e.g., game playing.
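A minimal sketch, using scikit-learn on synthetic data, of how the first four problem types map onto familiar estimator families; the model choices are illustrative only, and control problems are typically handled with reinforcement-learning toolkits rather than estimators like these.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))            # 100 synthetic instances, 5 features

# 1) Classification: assign a category to each instance.
y_class = (X[:, 0] > 0).astype(int)
clf = LogisticRegression().fit(X, y_class)

# 2) Regression: predict a numeric value for each instance.
y_value = X @ rng.normal(size=5)
reg = LinearRegression().fit(X, y_value)

# 3) Clustering: partition instances into homogeneous groups.
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# 4) Dimension reduction: compress the feature space before training.
X_reduced = PCA(n_components=2).fit_transform(X)

print(clf.predict(X[:3]), reg.predict(X[:3]), clusters[:3], X_reduced.shape)
```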
Workflows for all these tasks require expanded efforts with user requirements
gathering, data collection, cleaning, and labeling, with use of visualization and
data analytics to understand abnormal distributions, errors and missing data,
clusters, gaps, and anomalies. Then model training and evaluation becomes a
multistep process that starts with early in-house testing, proceeds to deployment,
and continues with maintenance. Continuous monitoring of deployed systems is needed
to respond to changing contexts of use and new training data.
Software engineering workflows for HCAI systems will extend AI design
methods to include user experience design so as to ensure that users under-
stand how decisions are made and have recourse when they wish to challenge
a decision. These traditional human–computer interaction methods of user
experience testing and guidelines development are being updated by leading
corporations and researchers to meet the needs of HCAI.11
A Virginia Commonwealth University team proposes a human-centered AI
system lifecycle geared to deliver trustworthy AI by emphasizing fairness, in-
teractive visual user interfaces, and privacy protection through careful data
governance.12 They raise the difficult issue of measuring trustworthiness by
quantitative and qualitative assessment, which we will return to in Chapter 25.
Software engineering workflows have migrated from the waterfall model,
which assumed that there was an orderly linear lifecycle, starting from require-
ments gathering and moving to design, implementation, testing, documenta-
tion, and deployment. The waterfall model, which may be appropriate when
the requirements are well understood, is easy to manage, but can result in mas-
sive failures when delivered software systems are rejected by users. Rejections
may be because requirements gathered a year ago at the start of a project are
no longer adequate or because developers failed to test prototypes and early
implementations with users.
The newer workflows are based on the lean and agile models, with variants
such as scrum, in which teams work with the customers throughout the lifecy-
cle, learning about user needs (even as they change), iteratively building and
testing prototypes, then discarding early ideas as refinements are developed,
and always being ready to try something new. Agile teams work in one- to two-
week sprints, intense times when big changes are needed. The agile model builds
in continuous feedback to ensure progress towards an effective system, so as to
avoid big surprises.
Agile models demand strong collaboration among developers to share
knowledge about each other’s work, so they can discuss possible solutions and
help when needed. Waterfall projects may deliver a complete system after a year
of work, while agile projects could produce a prototype in a month. IBM en-
courages agile approaches to AI projects, because developers have to keep an
open mind and explore alternatives more than in traditional projects.13
The Manifesto for Agile Software Development, first developed in 2001 by a
group of seventeen people calling themselves the Agile Alliance, is based on
twelve principles.14 I’ve rephrased them for consistency:

1) Satisfy customers by early and continuous delivery of valuable software.
2) Welcome changing requirements, even in late development.
3) Deliver working software frequently, in weeks rather than months.
4) Cooperate closely with customers and their managers.
5) Shape projects around motivated, trusted individuals.
6) Ensure regular face-to-face conversation among developers and cus-
tomers.
7) Make working software the primary measure of progress.
8) Work at a sustained pace for the project’s duration.
9) Give continuous attention to technical and design excellence.
10) Keep projects simple—embrace minimalist design—write less code.
11) Enable self-organizing teams to pursue quality in architectures, re-
quirements, and designs.
12) Reflect regularly on how to become more effective, and then do it.

The agile principles suggest a human-centered approach of regular contact with
customers, but they are less focused on issues such as user experience testing. The
Agile Alliance website recognizes that “usability testing is not strictly speak-
ing an Agile practice,” but it goes on to stress its importance as part of user
experience design.
Building on these agile principles, IBM recommends data set-related agility
to explore data sets and run a proof-of-concept test early on to make sure the
data sets can deliver the desired outcomes. Where necessary developers must
clean the error-laden data, remove low relevance data, and expand the data
to meet accuracy and fairness requirements.15 Their recommendations could
also be improved by inclusion of usability testing and user experience design
methods.
Fig 19.1 Microsoft’s nine-stage software engineering workflow for machine learning projects: Model Requirements, Data Collection, Data Cleaning, Data Labeling, Feature Engineering, Model Training, Model Evaluation, Model Deployment, Model Monitoring.
Source: From Amershi et al. 2019a

Microsoft offers a nine-stage software engineering workflow for HCAI sys-
tems that looks like it is closer to a linear waterfall model than to agile methods
(Figure 19.1), but their descriptions suggest more agile processes in its exe-
cution.16 Microsoft is strong about maintaining connections with customers
throughout their workflow. Their interviews with fourteen developers and
managers found that they emphasize data collection, cleaning, and labeling,
followed by model training, evaluation, and deployment. The report makes
this important distinction: “software engineering is primarily about the code
that forms shipping software, ML (machine learning) is all about the data that
powers learning models.”17
HCAI system developers can use the waterfall model, but agile methods
are more common to promote early engagement with clients and users. With
waterfall or agile methods, HCAI projects are different from traditional pro-
gramming projects, since the machine learning training data sets play a much
stronger role. Traditional software testing methods, such as static analysis of
code, need to be supplemented by dynamic testing with multiple data sets to
check for reliability in differing contexts of use and user experience testing to
see if users can succeed in doing their tasks.
User experience testing in HCAI systems also has to address perceptions of
how the machine learning system guides users to understand the process well
enough so they know whether to challenge the outcomes. This is of modest
importance with recommender systems, but consequential decisions such as
mortgage or parole decisions have to be understandable to users. Understand-
ability is vital for acceptance and effective use of life-critical systems used for
medical, transportation, and military applications.
The next section covers verification and validation testing that ensures cor-
rect operation and the section after that covers bias testing to enhance fairness.
A final point is that developers of HCAI systems will benefit from guidelines
documents discussed in Chapter 20 that describe principles and show examples
of diverse machine learning applications.
Verification and Validation Testing


For AI and machine learning algorithms embedded in HCAI systems, novel
processes for algorithm verification and validation are needed, as well as user
experience testing with typical users. The goal is to strengthen the possibility
that HCAI systems do what users expect, while reducing the possibility that
there will be unexpected harmful outcomes. Civil aviation provides good mod-
els for HCAI certification of new designs, careful verification and validation
testing during use, and certification testing for pilots.
The US National Security Commission on AI stresses that “AI applications
require iterative testing, evaluation, verification, and validation that incorpo-
rates user feedback.”18 This broad encouragement is refined by Jie M. Zhang
and co-authors, who make important distinctions in testing for three forms of
machine learning:

Supervised learning: a type of machine learning that learns from training data with labels
as learning targets . . . Unsupervised learning: a learning methodology that learns from
training data without labels and relies on understanding the data itself. Reinforcement
learning: a type of machine learning where the data are in the form of sequences of
actions, observations, and rewards.19

In each case, machine learning is highly dependent on the training data, so
diverse data sets need to be collected for each context so as to increase accuracy
and reduce biases. For example, each hospital will serve a distinct community
that varies by age, income, common diseases, and racial makeup, so training
data for detecting cancerous growths needs to come from that community.
During validation of an AI-based pneumonia detection system for chest
X-rays, the results varied greatly across hospitals depending on what X-ray
machine was used. Additional variations came from patient characteristics and
the machine’s position.20 Documenting these multiple data sets, which may
be updated regularly or even continuously, is vital, but it presents substantial
challenges that go well beyond what programming code repositories currently
accomplish. These repositories, such as the widely used GitHub,21 track
every change in programming statements. Data curation concepts such as
provenance tracking with the unalterable blockchain method22 and checking
that the data are still representative in the face of change are promising
possibilities. Finally, clarifying who is responsible for data-set curation puts a
human in the loop to deal with unusual circumstances and new contexts of use.
At least five popular testing techniques can be applied to HCAI systems:
Traditional case-based: The development team collects a set of input val-
ues and the expected output values, then verifies that the system produces the
expected results. It takes time to construct input records, such as mortgage ap-
plications, with desired outcomes of acceptance or rejection, but the system
should produce the expected results. Test cases of classifying animal images or
translating sentences can be easily understood, since successes and failures are
clear. Collecting a large enough set of test cases opens the minds of designers to
consider extreme situations and possible failures, which by itself could lead to
recognition of needed improvements. A common approach is to have several
teams, involving both developers and non-developers, construct test
cases with expected results.
In established fields, there are data sets of test cases that can be used to com-
pare performance against other systems; e.g. ImageNet has 14 million images
that have been annotated with the expected results. The NIST Text Retrieval
Conference (TREC) has run annual workshops for almost thirty years with
many test data sets that have annotated correct results.
A key part of verification is to develop test cases to detect adversarial attacks,
which would prevent malicious use by criminals, hate groups, and terrorists. As
new requirements are added or the context of use changes, new test cases must
be added.23
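A minimal sketch of case-based testing for a hypothetical mortgage-approval system: a small table of input records with expected outcomes is run through the system and mismatches are reported. The approve_mortgage function is a stand-in for whatever model the team is actually testing.

```python
def approve_mortgage(application):
    """Stand-in for the system under test; replace with the real model call."""
    return "approve" if application["income"] >= 3 * application["payment"] else "reject"

# Curated test cases with expected results, ideally written by several teams.
TEST_CASES = [
    ({"income": 9000, "payment": 2000}, "approve"),
    ({"income": 3000, "payment": 2000}, "reject"),
    ({"income": 0,    "payment": 500},  "reject"),   # extreme situation
]

def run_case_based_tests():
    """Return every test case where the system's output differs from the expected result."""
    failures = []
    for application, expected in TEST_CASES:
        actual = approve_mortgage(application)
        if actual != expected:
            failures.append((application, expected, actual))
    return failures

print(run_case_based_tests())   # an empty list means all cases passed
```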

Differential: This strategy is commonly used to verify that an updated system
produces the same results as earlier versions. The input test data sets are run
on the earlier and updated system to produce results which can be automati-
cally compared to make sure that the system is still working as it did before.
The advantage of differential testing is that the test team does not have to create
the expected results. There are at least two disadvantages: (1) incorrect perfor-
mance in the earlier software will propagate to the updated software so both
tests will have the same incorrect results, and (2) new features that accommo-
date different inputs cannot be compared with previous test results. Another
version of differential testing is to compare two different systems, possibly de-
veloped by other teams or organizations. Differences in results lead to further
study to see which system needs to be fixed. For machine learning, differential
testing is widely used to compare performance with two different training data
sets.
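A minimal sketch of differential testing: the same inputs are fed to an earlier and an updated version, and disagreements are flagged for study. The two predict functions are placeholders for real model versions; no expected outputs need to be written.

```python
def model_v1(x):
    """Earlier version of the system (placeholder)."""
    return "approve" if x >= 50 else "reject"

def model_v2(x):
    """Updated version being checked against the earlier one (placeholder)."""
    return "approve" if x >= 50 else "reject"

def differential_test(inputs):
    """Return the inputs on which the two versions disagree."""
    return [x for x in inputs if model_v1(x) != model_v2(x)]

disagreements = differential_test(range(0, 101, 5))
print(disagreements)   # non-empty output means the versions diverge and need review
```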

Metamorphic: This clever approach is built on the idea of metamorphic rela-
tions among sets of results. In a simple example, for an algorithm that finds paths in
an undirected network, the shortest path from a to b should be the same as the
one from b to a. Similarly, decreasing the amount requested on a mortgage application
should never cause rejection of someone who was approved for a larger amount. For
e-commerce recommenders, changing the maximum price for a product from
$100 to $60 should produce a subset of products. As with differential testing,
the test team does not have to create expected results, since they are generated
by the software. Applying metamorphic testing for training data sets is possible
by adding or deleting records that should not change the results.
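A minimal sketch of a metamorphic test built on the mortgage relation described above: lowering the requested amount should never turn an approval into a rejection. The approve_mortgage function is a placeholder for the model under test, and the test needs no hand-written expected outputs.

```python
import random

def approve_mortgage(income, amount):
    """Placeholder for the model under test."""
    return amount <= income * 4

def metamorphic_mortgage_test(trials=1000):
    """Check: if (income, amount) is approved, (income, smaller amount) must be too."""
    violations = []
    for _ in range(trials):
        income = random.randint(20_000, 200_000)
        amount = random.randint(10_000, 1_000_000)
        if approve_mortgage(income, amount):
            smaller = amount // 2
            if not approve_mortgage(income, smaller):
                violations.append((income, amount, smaller))
    return violations

print(metamorphic_mortgage_test())   # any entry here breaks the metamorphic relation
```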

User experience: For HCAI applications that involve users, such as mort-
gage, parole, or job interview applications, user testing is needed to verify that
they can deal with the system and get meaningful explanations. User testing is
conducted by giving standard tasks to five to twenty-five users, who are asked
to think aloud as they work through the tasks, typically in 30–120 minutes, ex-
plaining what they see, think, and do. The testing team records user comments
and performance to generate a report about common problems, sometimes
including suggested fixes. User testing is a practical approach used in system
development to detect problems that users report. It is different from research-
based controlled experiments that test alternate hypotheses to prove statistically
significant differences between two or more designs.

Red teams: Beyond testing by development teams, an increasingly popular
technique is to have a red team of outsiders stage attacks on HCAI systems. The
idea of red teams came from military war-gaming exercises, grew dramatically
in the cybersecurity community, and is used in aviation security testing, such
as when government agents test airport screening systems by using different
strategies to get deadly weapons past screening systems and onto airplanes.24
Red team testing for HCAI systems would probe weaknesses in the software
and strategies for poisoning the training data by adding misleading records. Fa-
cial recognition systems are easily undermined by wearers of unusual T-shirts
and self-driving car algorithms are misled by stickers on STOP signs or lines
of salt crystals on a highway. Red team members would gain skill over time as
they probe systems, developing an attacker mindset and sharing strategies with
others.
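A minimal sketch of one red-team exercise mentioned above: deliberately poisoning a fraction of training labels and measuring how much held-out accuracy degrades. The data are synthetic and scikit-learn is used purely for illustration; a real exercise would target the system's actual training pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

def accuracy_with_poisoning(flip_fraction):
    """Flip a fraction of training labels (a simple poisoning attack) and score on clean test data."""
    y_poisoned = y_train.copy()
    n_flip = int(flip_fraction * len(y_poisoned))
    idx = rng.choice(len(y_poisoned), size=n_flip, replace=False)
    y_poisoned[idx] = 1 - y_poisoned[idx]
    model = LogisticRegression().fit(X_train, y_poisoned)
    return model.score(X_test, y_test)

clean = accuracy_with_poisoning(0.0)
attacked = accuracy_with_poisoning(0.2)
print(f"clean accuracy {clean:.2f}, accuracy after 20% label flips {attacked:.2f}")
```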
The MITRE Corporation maintains a cybersecurity ATT&CK (note the un-
usual spelling) matrix that catalogs almost 300 tactics of attackers, organized
into eleven categories, which reminds developers about the ways that adver-
saries might attack the systems they are building.25 A similar matrix for the
ways red teams could attack HCAI systems and data sets would be a helpful
guide to developers about how to protect their systems. As a start, software
engineers could individually catalog potential attacks and then combine their
results with other team members. Comparisons with other teams could lead to
further ideas of potential attacks. MITRE Corporation has begun a project to
make such a catalog of AI failures.26

For all testing techniques, the history of testing should be recorded to en-
able reconstruction and document how repairs were made and by whom.27
Microsoft’s Datasheets for Datasets is a template to document data used in ma-
chine learning. It contains sections on the motivation and process for collecting
and cleaning the data, who has used the data set, and contact information for
the data curator.28 This positive step quickly propagated to be used by many
software engineering teams, and encouraged Google’s Model Cards template
for model reporting.29 Lessons from database systems30 and information visu-
alization31 about tracking provenance of data and histories of testing are also
useful. These documentation strategies all contribute to transforming software
engineering practices from early stage research to more mature professional
practices.
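A minimal sketch of the kind of fields a dataset documentation record might hold, loosely inspired by the Datasheets for Datasets idea; the structure, field names, and contact address below are illustrative, not the published template.

```python
# Illustrative datasheet record; field names are examples, not a standard.
chest_xray_datasheet = {
    "motivation": "Train and validate a pneumonia-detection model.",
    "collection_process": "X-rays gathered from three partner hospitals, 2018-2020.",
    "cleaning": "Removed duplicates and images with missing patient metadata.",
    "composition": {"records": 42_000, "positive_cases": 6_300},
    "known_limitations": "Two X-ray machine models dominate; rural clinics underrepresented.",
    "prior_uses": ["internal pilot study"],
    "curator_contact": "data-curation@example.org",
    "last_updated": "2021-06-30",
}

def missing_fields(datasheet, required=("motivation", "collection_process",
                                        "curator_contact", "last_updated")):
    """Simple completeness check a team could run before releasing a data set."""
    return [field for field in required if field not in datasheet]

print(missing_fields(chest_xray_datasheet))   # [] means the required fields are present
```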
For mobile robotic devices that could inadvertently harm nearby human
workers, as well as for deadly weapons and medical devices, special care is needed during
testing. Metrics for “safe operation, task completion, time to complete the task,
quality, and quantity of tasks completed” will guide development.32 Mature
application areas such as aviation, medical devices, and automobiles, with a
long history of benchmark tests for product certification, provide good models
for newer products and services. Verifying and validating HCAI systems’
accuracy, correctness, usability, and vulnerability is important, but in addition,
since many applications deal with sensitive decisions that have consequences
for people’s lives, bias testing is needed to enhance fairness.

Bias Testing to Enhance Fairness


As AI and machine learning algorithms were applied to consequential appli-
cations such as parole granting, mortgage loan approval, and job interviewing,
many critics arose to point to problems that needed fixing. A leader among
these was the Wall Street statistician Cathy O’Neil, who was very familiar
with the dangers. Her powerful book Weapons of Math Destruction: How Big
Data Increases Inequality and Threatens Democracy raises questions about how
algorithms became dangerous when they had three properties:33

• Opacity: the algorithms are complex and hidden from view, making it
hard to challenge decisions,
• Scale: the system is used by large companies and governments for major
applications, and
• Harm: the algorithms could produce unfair treatment that impacts peo-
ple’s lives.

A growing research community responded with influential conferences, such
as the one on Fairness, Accountability, and Transparency in Machine Learn-
ing,34 which studied gender, racial, age, and other potential biases. Commer-
cial practices began to shift when serious problems emerged35 from biased
decisions that influenced parole granting, when hate-filled chatbots learned
from malicious social media postings, and when job hiring biases were
exposed.36
An early review by academics Batya Friedman and Helen Nissenbaum37 de-
scribed three kinds of bias: (1) pre-existing bias based on social practices and
attitudes, such as mortgage loan rejections for lower income neighborhoods,
making it more difficult for upwardly mobile residents to buy a better home;
(2) technical bias based on design constraints in hardware and software, such as
organ donor requests that are presented alphabetically on a scrolling list rather
than ranked by severity of need; and (3) emergent bias that arises from changing
the use context, such as when educational software developed in high literacy
countries is put to work in low literacy countries that also may have different
cultural values.
Professor Ricardo Baeza-Yates, whose roots are in Chile, Spain, and the
United States, described additional forms of bias, such as geography, language,
and culture, which were embedded in web-based algorithms, databases, and
user interfaces.38 He cautions that “bias begets bias” as when popular websites
become even more popular, making it harder for marginal voices to be heard.
Questions of bias are closely tied to the IEEE’s Ethically Aligned Design report
that seeks to build a strong ethical foundation for all AI projects. A comprehen-
sive review of bias, from a University of Southern California team, extends the
concept to cover statistical, user interaction, funding, and more than twenty
other forms of bias.39 Their review also moves forward to describe ways of
mitigating bias, testing data sets, and advancing research.
HCAI researcher Meredith Ringel Morris raised concerns about how AI
systems often make life more difficult for users with disabilities, but wrote
“AI technologies offer great promise for people with disabilities by removing
access barriers and enhancing users’ capabilities.”40 She suggests that speech-
based virtual agents and other HCAI applications could be improved by using
training data sets that included users with physical and cognitive disabilities.
However, since AI systems could also detect and prey on vulnerable popula-
tions such as those with cognitive disabilities or dementia, research is needed
on how to limit such attacks.
Algorithmic bias in healthcare can lead to significant disparities in treat-
ment, as shown in a study which found that “unequal access to care means
that we spend less money caring for black patients than for white patients.” The
report claims that “remedying this disparity would increase the percentage of
black patients receiving additional help from 17.7 to 46.5%.”41
The long history of gender bias re-emerges in computing, which struggles to
increase the numbers of women students, professors, and professionals. Lack
of women participating in research and development can lead to inadequate
consideration for bias, which has happened in hiring, education, consumer
services, and mortgage applications. Limiting such biases is vital in efforts to
increase equity.
Converting ethical principles and bias awareness into action is a challenge.42
Development teams can begin with in-depth testing of training data sets to ver-
ify that the data are current and have a representative distribution of records for
a given context. Then known biases in past performance can be tested, but be-
yond detecting biases, standard approaches to mitigating bias can be applied, so
that future decisions are fairer. Academic researchers offer fairness-enhancing
interventions, such as avoiding gender, race, or age attributes that could bias a hiring deci-
sion.43 Companies are developing commercial grade toolkits for detecting and
mitigating algorithmic bias, such as IBM’s Fairness 360.44 These examples are a
good start, but better outcomes are likely for development teams that appoint
a bias testing leader who is responsible for assessing the training data sets and
the programs themselves. The bias testing leader will study current research and
industry practices and respond to inquiries and concerns. A library of test cases
with expected results could be used to verify that the HCAI system did not show
obvious biases. Continued monitoring of usage with reports returned to the
bias testing leader will help to enhance fairness. However, since development
teams may be reluctant to recognize biases in their HCAI systems, someone
outside the team will also need to monitor performance and review reports
(see Chapter 20 on safety culture management practices).
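A minimal sketch of one simple check a bias testing leader might run: comparing selection (approval) rates across groups in a model's decisions. The demographic-parity gap computed here is only one of many fairness measures, the group labels and decisions are made up, and any alert threshold would have to be set by the team.

```python
def selection_rates(decisions):
    """decisions: list of (group, approved) pairs; returns the approval rate per group."""
    totals, approvals = {}, {}
    for group, approved in decisions:
        totals[group] = totals.get(group, 0) + 1
        approvals[group] = approvals.get(group, 0) + (1 if approved else 0)
    return {g: approvals[g] / totals[g] for g in totals}

def demographic_parity_gap(decisions):
    """Largest difference in approval rate between any two groups."""
    rates = selection_rates(decisions)
    return max(rates.values()) - min(rates.values())

# Made-up decisions for illustration: (group label, approved?)
decisions = [("A", True)] * 60 + [("A", False)] * 40 + \
            [("B", True)] * 45 + [("B", False)] * 55

print(selection_rates(decisions))         # {'A': 0.6, 'B': 0.45}
print(demographic_parity_gap(decisions))  # 0.15 -> flag for review if above a chosen threshold
```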
These constructive steps are a positive sign, but the persistence of bias re-
mains a problem as applications such as facial recognition become more widely
used for police work and commercial applications.45 Simple bias tests for gen-
der, race, age, etc. were helpful in building more accurate face databases, but
problems remained when the databases were studied for intersections, such as
black women.46 Presenting these results in refereed publications and in widely
seen media can pressure the HCAI systems builders to make changes that
improve performance.
MIT’s Joy Buolamwini, who founded the Algorithmic Justice League (see
Chapter 21’s Appendix A), was able to show gender and racial bias in facial
recognition systems from Microsoft, IBM, and Amazon, which she presented
in compelling ways through her high-energy public talks, sharply written op-
eds, and theatrical videos.47 Her efforts, with collaborator Timnit Gebru, led
to improvements and then corporate withdrawal of facial recognition prod-
ucts from police departments, when evidence of excessive use of force became
widespread in spring 2020.48 Their efforts were featured in the full-length April
2021 documentary Coded Bias, which has drawn widespread interest.49
Ethics issues triggered a public scandal when Google fired Timnit Gebru,
who co-led its Ethical AI team, triggering support for her from thousands of
Google employees and others. The controversy included Gebru’s outspoken
stance about the low level of female and minority hiring at Google, which she
suggests is related to deficiencies in understanding bias. Effective bias testing
for machine learning training data is one contribution to changing the long
history of systemic bias in treatment of minorities in many countries.50
The bias in algorithms is sometimes obvious as in this Google Images search
for “professional hair” (Figure 19.2a) that shows mostly light-skinned women,
which is notably different for the search for “unprofessional hair” (Figure 19.2b)
that shows mostly dark-skinned women. These examples show how existing
biases can propagate, unless designers intervene to reduce them.
The question of bias is vital to many communities that have suffered from
colonial oppression, including Indigenous people around the world. They of-
ten have common shared values that emphasize relationships within their local
context, foregrounding their environment, culture, kinship, and community.
Some in Indigenous communities question the rational approaches of AI, while
favoring empirical ways of knowing tied to the intrinsically cultural nature of all
computational technology: “Indigenous kinship protocols can point us towards
Fig 19.2 (a) Google Search for “Professional hair” shows mostly light-skinned women.
(b) Google Search for “Unprofessional hair” shows mostly dark-skinned women.

potential approaches to developing rich, robust and expansive approaches to
our relationships with AI systems and serve as guidelines for AI system devel-
opers.”51 MIT computer scientist and media scholar D. Fox Harrell reinforces
the importance of cultural influences and assumptions that are usually left
implicit.52 His work is aligned with the claims of Indigenous authors that
deeper understanding of cultural contexts will reduce bias, while enabling “new
innovations based in cultures that are not currently privileged in computer
science.”

Explainable User Interfaces


Designers of HCAI systems have come to realize that consequential life deci-
sions, such as rejections for mortgages, parole, or job interviews, often raise
questions from those who are affected. To satisfy these legitimate needs, sys-
tems must provide comprehensible explanations that enable people to know
what they need to change or whether they should challenge the decision. Fur-
thermore, explanations have become a legal requirement in many countries
based on the European Union’s General Data Protection Regulation (GDPR)
requirement of a “right to explanation.”53
This controversial GDPR requirement is vague and difficult to satisfy in
general, but international research efforts to develop explainable AI have blos-
somed.54 A useful and practical resource is the set of three reports on “Explaining
Decisions Made with AI” from the UK Information Commissioner’s Office and
the Alan Turing Institute.55 The three reports cover: (1) The basics of explain-
ing AI, (2) Explaining AI in practice, and (3) What explaining AI means for
your organization. The first report argues that companies benefit from making
AI explainable: “It can help you comply with the law, build trust with your cus-
tomers and improve your internal governance.”56 The report spells out the need
for explanations that describe the reason for the decision, who is responsible for
the system that made the decision, and the steps taken to make fair decisions.
Beyond that it stipulates that users should be given information about how to
challenge a decision. The second and longest report has extensive discussions
about different kinds of explanations, but it would inspire more confidence if it
showed sample screen designs and user testing results.
Daniel S. Weld and Gagan Bansal from the University of Washington make
a strong case for explainability (sometimes called interpretability or trans-
parency) that goes beyond satisfying users’ desire to understand and the legal
requirements to provide explanations.57 They argue that explainability helps
designers enhance correctness, identify improvements in training data, account
for changing realities, support users in taking control, and increase user ac-
ceptance. An interview study with twenty-two machine learning professionals
documented the value of explainability for developers, testers, managers, and
users.58 A second interview study with twenty-nine professionals emphasized
the need for social and organizational contexts in developing explanations.59
However, explainability methods are only slowly finding their way into widely
used applications and possibly in ways that are different from the research.
The AI research community is learning more about the centuries of relevant
social science research, thoughtfully described by Tim Miller from the Uni-
versity of Melbourne, who complains that “most work in explainable artificial
intelligence uses only the researchers’ intuition of what constitutes a ‘good’ ex-
planation.”60 Miller’s broad and deep review of social science approaches and
evaluation methods is eye-opening, but he acknowledges that applying it to
explaining AI systems “is not a straightforward step.”
The strong demand for explainable AI has led to commercial toolkits such as
IBM’s AI Explainability 360.61 The toolkit offers ten different explanation algo-
rithms, which can be fine-tuned by programmers for diverse applications and
users. IBM’s extensive testing and case studies suggest they have found useful
strategies, which address the needs of developers, business decision-makers,
government regulators, and consumers who are users.62
A review of current approaches to explainable AI by a team from Texas
A&M University, Mengnan Du, Ninghao Liu, and Xia Hu, makes useful distinc-
tions between intrinsically understandable machine learning models, such as
decision trees or rule-based models, and the more common approach of post-
hoc explanations.63 But even the intrinsically understandable models are hard
for most people to understand, and often challenging even to experts. They
contrast these with the common approach of natural language descriptions of
why a decision was made. These are called post-hoc (or retrospective) expla-
nations because they come after the algorithmic decision has been made. The
explanations are generated for users who are surprised by results and want to
know why a certain decision was made. For example, a jewelry e-commerce
site might report that “we made these recommendations because you ordered
silver necklaces last year before Valentine’s Day.” These post-hoc explanations
are the preferred approach among researchers, especially when deep learn-
ing neural nets and other black box methods are used. However, the work
would be improved if follow-up questions were allowed and user tests were
conducted.64
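A minimal sketch of how a post-hoc explanation like the jewelry example might be assembled: the decision is scored first, and a short natural-language reason is generated afterward from the features that contributed most. The feature weights, threshold, and wording are all illustrative, not a description of any deployed recommender.

```python
def recommend_and_explain(customer_history, feature_weights, threshold=1.0):
    """Score a recommendation, then build a post-hoc explanation for it."""
    # Score: sum of weights for the features present in the customer's history.
    contributions = {f: w for f, w in feature_weights.items() if f in customer_history}
    score = sum(contributions.values())
    recommend = score >= threshold

    # Post-hoc explanation: name the strongest contributing features after the fact.
    top_reasons = sorted(contributions, key=contributions.get, reverse=True)[:2]
    if recommend:
        explanation = ("We made this recommendation because of your recent interest in "
                       + " and ".join(top_reasons))
    else:
        explanation = "No recommendation: too little relevant purchase history."
    return recommend, explanation

weights = {"silver necklaces": 0.9, "valentines purchases": 0.6, "watch repairs": 0.1}
history = {"silver necklaces", "valentines purchases"}
print(recommend_and_explain(history, weights))
```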
Margaret Burnett and her team at Oregon State University showed that
adding well-designed post-hoc explanations for a sports-related natural lan-
guage text classification application improved user satisfaction and under-
standing of the results.65 The visual user interface design showed which
terms were important in classifying a document as a hockey or baseball story
(Figure 19.3). It also allowed users to provide feedback to the system by adding
or removing words from the classification model, improving the machine
learning algorithm.
Cynthia Rudin from Duke University makes a strongly worded case to “Stop
explaining black box machine learning models.”66 She developed clever ways to
make the machine learning models easier to interpret so that users can under-
stand how they work. This is a worthy approach that pursues a fundamental
goal of preventing the confusion that requires post-hoc explanations.
Fig 19.3 Part of the user interface of a text classification application that shows why a document was classified as being about hockey. The display reads:

Why Hockey?
Part 1: Important words. This message has more important words about Hockey than about Baseball (baseball, hockey, stanley, tiger). The difference makes the computer think this message is 2.3 times more likely to be about Hockey than Baseball.
AND
Part 2: Folder size. The Baseball folder has more messages than the Hockey folder (Hockey: 7, Baseball: 8). The difference makes the computer think each Unknown message is 1.1 times more likely to be about Baseball than Hockey.
YIELDS
67% probability this message is about Hockey. Combining ‘important words’ and ‘Folder size’ makes the computer think this message is 2.0 times more likely to be about Hockey than Baseball.

Source: Revised version, based on Kulesza et al., 2015
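The numbers in the figure combine in the manner of a naive Bayes calculation; a minimal sketch of that arithmetic, assuming only the 2.3 word-likelihood ratio and the 7-to-8 folder counts shown above.

```python
word_likelihood_ratio = 2.3   # message words favor Hockey 2.3 : 1 (Part 1 of the display)
hockey_msgs, baseball_msgs = 7, 8

prior_ratio = hockey_msgs / baseball_msgs             # folder sizes slightly favor Baseball
combined_odds = word_likelihood_ratio * prior_ratio   # odds for Hockey vs Baseball
probability_hockey = combined_odds / (1 + combined_odds)

print(round(prior_ratio, 2), round(combined_odds, 1), round(probability_hockey, 2))
# 0.88 2.0 0.67 -> roughly the "2.0 times more likely" and "67%" shown in the figure
```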

Preventing the Need for Explanations


While post-hoc explanations can be helpful, this strategy was tried three
decades ago in knowledge-based expert systems, intelligent tutoring systems,
and user interface help systems. However, the difficulty in figuring out what
kinds of explanations a confused user wants led to strategies that prevent or
at least reduce the need for explanations. The favored strategies were to offer
step-by-step processes which explain each step of a decision process before the
final decision is made.
In knowledge-based expert systems, which were built by encoding rules
from expert decision-makers, many projects struggled with providing post-
hoc explanations. A famed series of projects began with the medically oriented
diagnostic system, MYCIN, but expanded to domain independent systems.67
William Clancey, then working at Stanford University,68 describes his pursuit
of explainability by way of step-by-step processes, often using graphical
overviews. Another example of changes to limit the need for explanations was
in successful business rule-based systems. Designers moved from dependence
on post-hoc explanations to a very different strategy: prospective designs that
give users a better understanding of each step in the process, so they can prevent
mistakes and the need for explanations.69
For intelligent tutoring systems, the idea of a human-like avatar explaining
difficult topics and answering questions gave way to user-controlled strategies
that emphasized the material they were learning. Rather than using screen
space for an avatar image and distracting user attention from the text or
diagram, the winning strategy turned out to be to let learners concentrate
on the subject matter. Other lessons were to avoid artificial praise from the
avatar and to give learners more control over their educational progress, so
they felt a greater sense of accomplishment in mastering the subject matter.
These lessons became key components of the success of massive open online
courses (MOOCs), which gave users clear feedback about their mastery of test
questions.
Similarly, in early user interface help systems, the designers found that post-
hoc explanations and error messages were difficult to design, leading them to
shift to alternative methods that focused on:70

1. Preventing errors so as to reduce the need for explanations: e.g. by replacing
typing MMDDYYYY with selecting month, day, and year from a calendar.
Replacing typing with selecting prevents errors, thereby eliminating the need
for extensive error detection and numerous explanatory messages (see the
sketch after this list).
2. Using progressive step-by-step processes in which each question leads to
a new set of questions. The progressive step-by-step processes guide users
incrementally toward their goals, simplifying each step and explaining
terminology, while giving them the chance to go back and change earlier
decisions. Effective examples are in the Amazon four-step e-commerce
checkout process and in the well-designed TurboTax system for income
tax preparation.
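A minimal sketch of the error-prevention idea in item 1: when the interface offers only valid choices, there is no free-text parsing and therefore no error message to explain. The snapping rule below is an assumption about how a calendar widget might behave, not a description of any particular product:

```python
import calendar
from datetime import date

def pick_date(year: int, month: int, day: int) -> date:
    """Build a date from values chosen in drop-down menus or a calendar widget.

    The menus offer only valid years and months, and out-of-range day choices
    are snapped to the last day of the month, so no 'invalid date' error
    message is ever needed.
    """
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(day, last_day))

print(pick_date(2024, 2, 30))   # 2024-02-29: no exception, no error dialog
```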

Prospective Visual Designs for Exploratory User Interfaces


Since predictability is a fundamental principle for user interface design, it has
been applied in many systems that involve different forms of AI algorithms.
These AI-based designs give users choices to select from before they initiate ac-
tion, as in spelling correctors, text message autocompletion, and search query
completion (see Figures 9.3 and 9.4). The same principle was productively ap-
plied by University of Colorado professor Daniel Szafir and his collaborators for
robot operation. They showed that previews of the path and goals for intended
actions of a dexterous robot arm resulted in improved task completion and
increased satisfaction.71 For robots, the human control design principle for pre-
dictability might follow the second pattern in Figure 9.2: preview first, select and
initiate, then manage execution.
Navigation systems adhere to the predictability principle when they apply
AI-based algorithms to find routes based on current traffic data. Users are
given two to four choices of routes with estimated times for driving, biking,
walking, and public transportation, from which they select the one they want.
Then this supertool provides visual, textual, and speech-generated instructions
(Figure 19.4).
In addition to AI-based textual, robot, and navigation user interfaces, simi-
lar prospective (or ante-hoc) methods can be used in recommender systems by
offering exploratory user interfaces that enable users to probe the algorithm
boundaries with different inputs. Figures 19.5a and 19.5b show a post-hoc
explanation for a mortgage rejection, which is good, but could be improved.
Figure 19.5c shows a prospective exploratory user interface that enables users
to investigate how their choices affect the outcome, thereby reducing the need
for explanations.
In general, prospective exploratory user interfaces are welcomed by users
who spend more time developing an understanding of the sensitivity of vari-
ables and digging more deeply into aspects that interest them, leading to
greater satisfaction and compliance with recommendations.72 Further gains
come from enabling adaptable user interfaces to fit different needs and per-
sonalities.73
For complex decisions, Fred Hohman, now an Apple researcher, showed that
user interfaces and data analytics could clarify which features in a machine
learning training data set are the most relevant.74 His methods, developed as
part of his doctoral work at Georgia Tech, also worked on explanations of
deep learning algorithms for image understanding.75 A Google team of eleven
researchers built an interactive tool to support clinicians in understanding al-
gorithmic decisions about cancers in medical images. Their study with twelve
medical pathologists showed substantial benefits in using this slider-based

Fig 19.4 Navigation system for driving, public transportation, walking, and biking.
The one-hour time estimate is for biking.

exploratory user interface, which “increased the diagnostic utility of the images
found and increased user trust in the algorithm.”76
Interactive HCAI approaches are endorsed by Weld and Bansal, who recom-
mend that designers should “make the explanation system interactive so users
can drill down until they are satisfied with their understanding.”77
[Figure 19.5 content, Mortgage Loan Explanations: (a) a post-hoc report form with fill-in fields for mortgage amount requested (375000), household monthly income (7000), and liquid assets (48000), plus a submit button; (b) the response after submitting: “We’re sorry, your mortgage loan was not approved. You might be approved if you reduce the Mortgage amount requested, increase your Household monthly income, or increase your Liquid assets.”; (c) an exploratory user interface with three sliders for the same amounts and a gauge comparing your score with the score needed for approval.]

Fig 19.5 Mortgage loan explanations. (a) is a post-hoc explanation, which shows a dialog
box with three fill-in fields and a submit button. (b) shows what happens after clicking the
submit button. As feedback, users get a brief text explanation, but insufficient guidance for
next steps. (c) shows an exploratory user interface that enables users to try multiple
alternatives rapidly. It has three sliders to see the impact of changes on the outcome score.

Exploration works best when the user inputs are actionable; that is, users have control and
can change the inputs. Alternative designs are needed when users do not have
control over the input values or when the input is from sensors such as in im-
age and face recognition applications. For the greatest benefit, exploratory user
interfaces should support accessibility by users with visual, hearing, motor, or
cognitive disabilities.
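A sketch of the kind of scoring rule that could sit behind the sliders in Figure 19.5c; the weights and approval threshold are invented for illustration and do not represent any lender's model:

```python
# Hypothetical linear scoring rule behind an exploratory mortgage interface.
WEIGHTS = {"amount_requested": -0.0005, "monthly_income": 0.04, "liquid_assets": 0.002}
APPROVAL_THRESHOLD = 250

def score(amount_requested, monthly_income, liquid_assets):
    return (WEIGHTS["amount_requested"] * amount_requested
            + WEIGHTS["monthly_income"] * monthly_income
            + WEIGHTS["liquid_assets"] * liquid_assets)

def on_slider_change(amount_requested, monthly_income, liquid_assets):
    """Called on every slider movement, so users see the outcome update immediately."""
    s = score(amount_requested, monthly_income, liquid_assets)
    status = "approved" if s >= APPROVAL_THRESHOLD else "not approved"
    print(f"Your score: {s:.0f} (needed: {APPROVAL_THRESHOLD}) -> {status}")

on_slider_change(375_000, 7_000, 48_000)   # initial request: not approved
on_slider_change(250_000, 8_000, 48_000)   # user explores a smaller loan: approved
```

Because every slider movement recomputes the score, users can discover for themselves which changes matter most, which is the point of the exploratory design.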
The benefits of giving users more control over their work were demonstrated
in a 1996 study of search user interfaces by Rutgers University professor Nick
Belkin and graduate student Juergen Koenemann. They summarize the payoffs
from exploratory interaction in their study of sixty-four participants who, they
report, “used our system and interface quite effectively and very few usability
problems . . . Users clearly benefited from the opportunity to revise queries in
an iterative process.”78
Supportive results about the benefits of interactive visual user interfaces
come from a study of news recommenders in which users were able to move
sliders to indicate their interest in politics, sports, or entertainment. As they
moved the sliders, the recommendation list changed to suggest new items.79
Related studies added one more important insight: when users have more con-
trol, they are more likely to click on a recommendation.80 Maybe being in
control makes them more willing to follow a recommendation because they
feel they discovered it, or maybe the recommendations are actually better.
In addition to the distinctions between intrinsically understandable models,
post-hoc, and prospective explanations, Mengnan Du, Ninghao Liu, and Xia
Hu follow other researchers in distinguishing between global explanations that
give an overview of what the algorithm does and local explanations that deal
with specific outcomes, such as why a prisoner is denied parole or a patient re-
ceives a certain treatment recommendation.81 Local explanations support user
comprehension and future actions, such as a prisoner who is told that they
could be paroled after four months of good behavior or the patient who is told
that if they lost ten pounds they would be eligible for a non-surgical treatment.
These are actionable explanations, which are suggestions for changes that can
be accomplished, rather than being told that if you were younger the results
would be different.
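A minimal sketch of how an actionable, local explanation might be generated: search only over features the person can change, using a stand-in decision rule. Both the rule and the feature names are invented for illustration:

```python
# Features the person can act on, with the values they could plausibly reach.
ACTIONABLE = {"months_good_behavior": range(0, 13), "program_completed": [0, 1]}
FIXED = {"age": 47}   # immutable attributes are never offered as advice

def approved(features):
    """Stand-in for a trained model's decision function."""
    return features["months_good_behavior"] >= 4 or features["program_completed"] == 1

def actionable_explanation(current):
    if approved(current):
        return "Approved."
    for name, values in ACTIONABLE.items():      # vary one actionable feature at a time
        for value in values:
            candidate = dict(current, **{name: value})
            if approved(candidate):
                return f"You could be approved if {name} were {value}."
    return "No actionable change found."

person = dict(FIXED, months_good_behavior=1, program_completed=0)
print(actionable_explanation(person))   # advice the person can actually act on
```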
An important statement about who would value explainable systems was in
a US White House memorandum.82 It reminded developers that “transparency
and disclosure can increase public trust and confidence in AI applications” and
then stressed that good explanations would allow “non-experts to understand
how an AI application works and technical experts to understand the process
by which AI made a given decision.”
Increased user control through visual user interfaces is apparent in rec-
ommender systems that offer more transparent approaches, especially for
consequential medical or career choices.83 Our research team, led by Univer-
sity of Maryland doctoral student Fan Du, developed a visual user interface to
allow cancer patients to make consequential decisions about treatment plans
based on finding other “patients like me.”
The goal was to enable users to see how similar patients fared in choos-
ing chemotherapy, surgery, or radiation. But medical data were hard to obtain
because of privacy protections, so our study used a related context. We tested
eighteen participants making educational choices, such as courses, internships,
or research projects, to achieve their goals, such as an industry job or gradu-
ate studies (Figure 19.6). The participants wanted to choose students who were
similar to them by gender, degree program, and major, as well as having taken
similar courses. The results showed that they took longer when they had control
over the recommender system, but they were more likely to understand and
Fig 19.6 Visual user interface to enable users to find people who have similar past histories.
Source: Du et al., 2019

follow the recommendations. As one participant commented: “The advanced
controls enable me to get more precise results.”84
Professor Katrien Verbert and her team at the University of Leuven
in Belgium have been studying exploratory user interfaces for AI-based
recommender systems for almost a decade.85 Their multiple papers in appli-
cations such as music recommendation and job seeking repeatedly show the
benefits of simple slider controls to allow users to guide the selection.86 In
one study they used five of Spotify’s fourteen dimensions for songs: acoustic-
ness, instrumentalness, danceability, valence, and energy. As users move the
sliders to show increased or decreased preferences the song list reflects their
choices (Figure 19.7). The results were strong: “the majority of the participants
expressed positiveness towards having the ability to steer the recommenda-
tions . . . by looking at the relationship between the recommended songs and

Fig 19.7 Simple sliders let users control the music recommender system by moving sliders
for acousticness, instrumentalness, danceability, valence, and energy.
Source: Component from Millecamp et al., 2018

the attributes, they may better understand why particular songs are being
recommended.”87
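A sketch of the slider-steered re-ranking idea; the song names, feature values, and weights are invented, not taken from Spotify or from Millecamp et al.'s system:

```python
# Invented catalog with Spotify-style audio features in the range 0 to 1.
SONGS = {
    "Song A": {"acousticness": 0.9, "instrumentalness": 0.8, "danceability": 0.2,
               "valence": 0.4, "energy": 0.3},
    "Song B": {"acousticness": 0.1, "instrumentalness": 0.1, "danceability": 0.9,
               "valence": 0.8, "energy": 0.9},
    "Song C": {"acousticness": 0.5, "instrumentalness": 0.3, "danceability": 0.6,
               "valence": 0.6, "energy": 0.5},
}

def rerank(slider_weights):
    """Score each song as a weighted sum of its features; higher weights mean stronger preference."""
    scores = {title: sum(slider_weights[f] * value for f, value in feats.items())
              for title, feats in SONGS.items()}
    return sorted(scores, key=scores.get, reverse=True)

# The user drags danceability and energy up and acousticness down.
print(rerank({"acousticness": 0.1, "instrumentalness": 0.2, "danceability": 0.9,
              "valence": 0.5, "energy": 0.8}))   # ['Song B', 'Song C', 'Song A']
```

The OECD Better Life Index described next works the same way, with country indicators in place of audio features.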
Another example is in the OECD’s Better Life Index website,88 which rates
nations according to eleven topics such as housing, income, jobs, community,
and education (Figure 19.8). Users move sliders to indicate which topics are
more or less important to them. As they make changes the list gracefully up-
dates with a smoothly animated bar chart so they can see which countries most
closely fulfill their preferences.
Yes, there are many users who don’t want to be bothered with making
choices, so they prefer the fully automatic recommendations, even if the
recommendations are not as well tuned to their needs. This is especially
true with discretionary applications such as movie, book, or restaurant rec-
ommendations, but the desire to take control grows with consequential and
especially with life-critical decisions in which professionals are responsible for
the outcomes.
There are many similar studies in e-commerce, movie, and other recom-
mender systems but I was charmed by a simple yet innovative website to get

Fig 19.8 The OECD’s Better Life Index with eleven sliders.
Source: http://www.oecdbetterlifeindex.org/

recommendations for novels to read. Users could select four of the twelve
sliders to choose attributes such as funny/serious, beautiful/disgusting, or opti-
mistic/bleak (Figure 19.9). As they move the sliders, the cover images of books
would come and go on the right side. A click on the cover image produced a
short description helping users to decide on what to read.89 Other controls in-
clude a world map for location (which even shows imaginary locations) and
check boxes for type of story (conflict, quest, revelation, etc.), race, age, gender,
and sexual preference. I liked it—give it a try!
In summary, while post-hoc explanations to help confused users are a good
idea and may be needed in some cases, a better approach may be to prevent or
reduce the need for explanations. This idea of preventing rather than treating
the disease is possible with prospective visual user interfaces that let users ex-
plore possibilities. Visual user interfaces help users understand the dimensions

Fig 19.9 Recommender system for novels based on attributes of the books.
As users move the sliders the book covers appear or disappear.
Source: https://www.whichbook.net/

or attributes of a problem by easily exploring alternatives to understand the
tradeoffs, find nearby items, and make some discoveries of their own. Since user
needs are volatile over time and highly dependent on context, allowing users to
guide the recommendations is helpful. The simple designs in this chapter won’t
always work, but there is growing evidence that, when done well, user controls
are appreciated and produce more satisfied users and more successful usage
than systems without user controls.
A University of Illinois team led by Karrie Karahalios surveyed seventy-five
Facebook users and conducted a usability test and interview with thirty-six
participants.90 The issue was whether users understood their Facebook News
Feed and how to control it. The disturbing results were that users were largely
unaware that these controls existed and were mostly unable to find the con-
trols that were required to carry out the usability test tasks. Some became angry
when they found that the settings were available but so difficult to find. The
study makes recommendations such as to include a search feature and improve
the menu design.

Skeptics of control panels point out that very few users learn to use the ex-
isting ones, but well-designed controls, such as in automobiles to adjust car
seats, mirrors, lighting, sound, and temperature have gone from being a com-
petitive advantage to required features. Instead of control panels, Jonathan
Stray and colleagues at the Partnership on AI emphasize strategies to learn
from users about what is needed to align automated recommendations more
closely with their needs and well-being.91 Newer designs which fit the many
kinds of problems that HCAI serves are possible—they just require a little more
imagination.
CHAPTER 20

Safety Culture through Business Management Strategies

While every organization wants to perform flawlessly in all cir-
cumstances, the harsh reality is that pandemics overwhelm public
health, nuclear power station failures trigger regional devastation,
and terrorists threaten entire nations. Past accidents and incidents often had
narrow impacts, but today’s failures of massive technology-based interdepen-
dent organizations in globalized economies can have devastating effects for the
health and economies of entire cities, regions, and continents. Preparing for
failure by organizational design has become a major HCAI theme, with at least
four approaches:

Normal accident theory: Charles Perrow’s influential book, Normal Accidents,
makes a strong case for organizational responsibility for safety, rather
than criticism of specific designs or operator error.1 His analysis, which
emerges from political science and sociology, emphasizes the dangers of organi-
zational complexity and overly tight coupling of units with too much centralized
control and insufficient redundancy to cope with disruptions. Tight cou-
pling implies that there are standard operating procedures with well-crafted
hierarchical chains of command, but when unexpected events occur, fluid col-
laboration across units can become vital. Supporting redundancy is difficult for
organizations that seek minimum staff to handle daily operations, but when
emergencies occur, additional experienced staff are needed immediately. How
can an organization justify 20% or 30% more staff than is needed for daily
operations, just to be ready for emergencies that happen once a year? How
should organizations plan for the unavailability or death of key personnel, as
highlighted by the case of three top executives of a firm who died in a small
plane crash en route to a meeting—maybe organizations should ensure that
key personnel fly on separate planes? Perrow’s work is extended by psycholo-
gist Gary Klein and criticized by sociologist Andrew Hopkins who has pointed
to the lack of metrics for the two dangers: tight coupling and insufficient re-
dundancy.2 Other critics found Perrow’s belief in the inevitability of failure in
complex organizations to be unreasonably pessimistic.

High reliability organizations: This approach emerged from organizational
design and business administration.3 High reliability organizations have a “pre-
occupation with failure,” studying possible ways that failures and near misses
can occur, with a “commitment to resilience” by regularly running simula-
tions of failures with relevant staff. Hilary Brown, who works on electricity
transmission in Minnesota, writes that high-reliability organizations “develop
reliability through redundancy, frequent training, emphasizing responsibility,
and distributing decision-making throughout the group hierarchy, all of which
reduce the impacts of complexity and tight coupling, as defined by Perrow.”4 In
contrast to normal accident theory, high-reliability organization advocates are
optimistic that a culture of mindfulness can prevent disasters.

Resilience engineering: This approach grew out of cognitive science and
human factors engineering.5 Resilience engineering is about making organi-
zations flexible enough to recover from unexpected events. David D. Woods of
Ohio State University promotes resilience engineering by encouraging organi-
zations to develop “architectures for sustained adaptability,” drawing lessons
from biological, social, and technological systems.6 Resilience comes from
planning about how to adapt to disasters that are natural (earthquakes, floods,
storms), technology-based (power, telecommunications, water outage), adver-
sarial (sabotage, terrorism, criminal), or design (system bugs, human error,
management failure).

Safety cultures: This approach came from responses to major disasters,
such as the Chernobyl nuclear reactor and NASA’s Space Shuttle Challenger
disaster, which could not be attributed to an individual mistake or design fail-
ure. Leaders in this community seek to build organizations which cultivate
staff attitudes,7 by long-term commitment to open management strategies, a
safety mindset among all staff, and validated organizational capabilities.8 MIT’s
Nancy Leveson has developed a systems engineering approach to safety
engineering that includes design, hazard analysis, and failure investigations.9 She
thoughtfully distinguishes between safety and reliability, pointing out that they
are separable issues, demanding different responses.

These four approaches share the goal of ensuring safe, uninterrupted perfor-
mance by preparing organizations to cope with failures and near misses (see
Figure 18.2).10 Building on the safety culture approach, my distillation em-
phasizes the ways managers can support HCAI: (1) leadership commitment to
safety, (2) hiring and training oriented to safety, (3) extensive reporting of fail-
ures and near misses, (4) internal review boards for problems and future plans,
and (5) alignment with industry standards and accepted best practices.

Leadership Commitment to Safety


Top organizational leaders can make their commitment to safety clear with ex-
plicit statements about values, vision, and mission. Their preoccupation with
failure is demonstrated by making positive statements about building a safety
culture, which includes values, beliefs, policies, and norms. The durable safety
culture can be shaken by the current safety climate, which includes the changing
atmosphere, context, and attitudes because of internal conflicts and novel ex-
ternal threats.11 Wise corporate leaders also know that a commitment to safety
is more likely to succeed if the board of directors is also involved in decision-
making, so that leaders recognize that their position depends on success with
safety.
Leadership commitment is made visible to employees by frequent restate-
ments of that commitment, positive efforts in hiring, repeated training, and
dealing openly with failures and near misses. Reviews of incidents, such as
monthly hospital review board meetings, can substantially increase patient
safety. Janet Berry’s team at an Ohio hospital reports: “Improved safety and
teamwork climate . . . are associated with decreased patient harm and severity-
adjusted mortality.”12 Safety-focused leaders stress internal review boards for
discussion of plans and problems, as well as adherence to industry standards
and practices.
Safety cultures require effort and budget to ensure that there are sufficient
and diverse staff involved with ample time and resources to do their work.
This may imply redundancy to ensure knowledgeable people are available when
problems emerge and a proactive mindset that anticipates dangers by conducting
risk audits to prevent failures. Safety, reliability, and resilience raise ongoing
costs, but the reduction of expensive failures is the payoff. In addition, safety
efforts often result in increased productivity, reduced expenses for employee
injuries, and savings on operations and maintenance costs. These benefits can
be hard to prove to skeptics who want to cut costs by raising questions about
spending to prepare for rare and unpredictable events.
While some literature on safety culture focuses on employee and facility
safety, for HCAI systems, the focus must be on those whose lives are impacted
by these systems. Therefore, a safety culture for HCAI systems will be built by
strong connections with users, such as patients, physicians, and managers in
hospitals or prisoners, lawyers, and judges in parole-granting organizations.
Outreach to affected communities means two-way communications to inform
stakeholders, continuous data collection on usage, and easy reporting of ad-
verse events. Implementations of safety cultures in HCAI-based organizations
are emerging with initial efforts to support AI governance in medical care.13
Safety for AI algorithms is a management problem, but it has technical im-
plications for developers. Leaders will need to learn enough about the ways
safety can be increased by careful design of algorithms, such as in choosing the
right objective metrics, ensuring that supervisory controllers can stop danger-
ous actions, avoiding “distributional shift” (changes in context that invalidate
the training data), and preventing adversarial attacks.14 Leaders will need to
verify that testing is done often enough during development and continues
throughout deployment.
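One of these technical safeguards can be sketched simply: a monitor that compares live inputs against the training data's summary statistics and defers to a human when the context has shifted. The feature, its statistics, and the threshold below are illustrative assumptions:

```python
# Illustrative distributional-shift monitor for a single numeric input feature.
TRAIN_MEAN, TRAIN_STDEV = 52.0, 8.0   # summarized from the training data
ALERT_Z = 3.0                         # how far from training data before the model is not trusted

def check_input(value, stop_action):
    """Return True if the input looks like the training data; otherwise alert and defer."""
    z = abs(value - TRAIN_MEAN) / TRAIN_STDEV
    if z > ALERT_Z:
        stop_action(f"Input {value} is {z:.1f} standard deviations from the training data; "
                    "deferring this case to a human operator.")
        return False
    return True

check_input(90.0, stop_action=print)   # flagged: the context no longer matches the training data
```

Production systems would use multivariate tests and domain-specific thresholds, but the management question is the same: has anyone verified that such checks run during deployment?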
Skeptics fear that corporate safety culture pronouncements are merely pub-
lic relations attempts to deal with unacceptable risks in many industries such
as nuclear power, chemical production, healthcare, or social media platforms.
They also point to cases in which failures were blamed on operator error rather
than improper organizational preparation and inadequate operator training.
One approach to ensuring safety is to appoint an internal ombudsperson to
hear staff and stakeholder concerns privately, while enabling fair treatment of
whistleblowers who report serious safety threats.

Hiring and Training Oriented to Safety


When safety is included in job-hiring position statements, that commitment
becomes visible to current employees and potential new hires. Diversity in
hiring also demonstrates commitment to safety by including senior staff who
represent the diversity of employees and skills. Safety cultures may need ex-
perienced safety professionals from health, human resources, organizational
design, ethnography, and forensics.
Safety-first organizations conduct training exercises regularly, such as in-
dustrial control room operators carrying out emergency plans, pilots flying
simulators, and hospital workers running multiple day exercises for mass ca-
sualties or pandemics. When routine practices are similar to emergency plans,
employees are more likely to succeed during the emergency. Thoughtful plan-
ning includes ranking of emergencies, based on past frequency of occurrence
or severity, with an analysis of how many internal responders are needed in
each situation, plus planning for how to engage external services when needed.
Well-designed checklists can reduce errors in normal operations and remind
operators what to do in emergencies.
The training needed for computer software and hardware designers has
become easier due to the guidelines documents from leading technology com-
panies such as Apple’s Human Interface Guidelines15 and Google’s design
guidebook,16 which both contain useful example screen designs. In addition,
Microsoft’s eighteen guidelines for AI–human interaction,17 and IBM’s De-
sign for AI website18 rely on thoughtful general principles, which will need
refinement.
These guidelines build on a long history of user interface design19 and newer
research on designing interfaces for HCAI systems.20 However, guidelines have
to be taught to user interface designers, programmers, AI engineers, product
managers, and policy-makers, whose practices gain strength if there are or-
ganizational mechanisms for ensuring enforcement, granting exemptions, and
making enhancements.
As HCAI systems are introduced, the training needs for consumers with
self-driving cars, clinicians struggling with electronic healthcare systems, and
operators of industrial control rooms become more complex. These users need
to understand what aspects of the HCAI systems they control and how machine
learning works, including its potential failures and associated outcomes.
Paul R. Daugherty and H. James Wilson’s book Human + Machine: Reimag-
ining Work in the Age of AI underlines the importance of training. They write
that “companies must provide the employee training and retraining required
so that people will be prepared and ready to assume any new roles . . . investing
in people must be the core part of any company’s AI strategy.”21

Extensive Reporting of Failures and Near Misses

Safety-oriented organizations regularly report on their failures (sometimes re-
ferred to as adverse events) and near misses (sometimes referred to as “close
calls”).22 Near misses can be small mistakes that are handled easily or dan-
gerous practices that can be avoided, thereby limiting serious failures. These
include near misses such as an occasional water leak, forced equipment restart,
operator error, or electrical outage. If the errors of omission or commission in
near misses are reported and logged, then patterns become clear to equipment
and facility managers, so they can focus attention on preventing more serious
failures. Since near misses typically occur much more often than failures, they
provide richer data to guide maintenance, training, or redesign.
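As a small illustration of how logged near misses reveal patterns (the units, categories, and counts are invented):

```python
from collections import Counter

# Hypothetical near-miss log entries; in practice these come from reporting
# systems like those described below.
near_misses = [
    {"unit": "pump room", "category": "water leak"},
    {"unit": "control room", "category": "forced equipment restart"},
    {"unit": "pump room", "category": "water leak"},
    {"unit": "loading dock", "category": "operator error"},
    {"unit": "pump room", "category": "electrical outage"},
]

# Simple counts by unit and category point managers to recurring trouble spots.
by_unit = Counter(report["unit"] for report in near_misses)
by_category = Counter(report["category"] for report in near_misses)
print(by_unit.most_common(1))       # [('pump room', 3)]
print(by_category.most_common(1))   # [('water leak', 2)]
```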
The US National Safety Council makes the surprising recommendation to
avoid rewarding managers whose units have few failures, but rather to reward
those managers whose units have high rates of near miss reports.23 By making
near miss reporting a common and virtuous practice, staff attention is more
focused on safety, and staff are ready to report near misses rather than cover
them up.
Civil aviation has a much-deserved reputation for safety. This stems, in part,
from a rich culture of near miss reporting such as through the US Federal Avia-
tion Administration Hotline,24 which invites passengers, air-traffic controllers,
pilots, and the general public to report incidents, anonymously if they wish.
The US National Transportation Safety Board, whose public reports are trusted
and influential in promoting improvements, thoroughly investigates crashes
with injuries or loss of life. In addition, the Aviation Safety Reporting Sys-
tem is a voluntary use website that “captures confidential reports, analyzes the
resulting aviation safety data, and disseminates vital information to the avia-
tion community.”25 These public reporting systems are good models for HCAI
systems.
The US Food and Drug Administration’s (FDA) Adverse Event Reporting
System provides a model for the public reporting of problems with HCAI sys-
tems.26 Their web-based public reporting system for healthcare professionals,
consumers, and manufacturers invites reports on medications, medical de-
vices, cosmetics, food, and other products (Figure 20.1).27 The user interface
walks users through seven stages to collect the data needed to make a credible

Fig 20.1 US Food and Drug Administration Voluntary Reporting Form invites adverse
event reports from health professionals and consumers.
Source: https://www.accessdata.fda.gov/scripts/medwatch/index.cfm

and useful data set. The Public Dashboard presents information on the grow-
ing number of reports each year, exceeding 2 million in 2018, 2019, and 2020
(Figure 20.2).
Another US FDA reporting system, Manufacturer and User Facility De-
vice Experience (MAUDE), captures adverse events in use of robotic surgery
systems. Drawing on this data, a detailed review of 10,624 reports covering
2000–2013 reported on 144 deaths, 1,391 patient injuries, and 8,061 device
malfunctions.28 The report on these alarming outcomes recommends “im-
proved human-machine interfaces and surgical simulators that train surgical
teams for handling technical problems and assess their actions in real-time
during the surgery.” The conclusion stresses the value of “surgical team train-
ing, advanced human machine interfaces, improved accident investigation and
reporting mechanisms, and safety-based design techniques.”
For cybersecurity problems which threaten networks, hardware, and soft-
ware, public reporting systems have also proven to be valuable. The MITRE
Corporation, a company funded to work for the US government, has been keep-
ing a list of common vulnerabilities and exposures since 1999, with more than
150,000 entries.29 MITRE works with the US National Institutes of Standards
and Technology to maintain a National Vulnerabilities Database,30 which helps
software developers understand the weaknesses in their programs. This allows
more rapid repairs, coordination among those with common interests, and ef-
forts to prevent vulnerabilities in future products and services. All of these open

Fig 20.2 US Food and Drug Administration Adverse Event Reporting System (FAERS)
Public Dashboard. Data as of December 31, 2020.
Source: https://fis.fda.gov/sense/app/d10be6bb-494e-4cd2-82e4-0135608ddc13/sheet/7a47a261-d58b-
4203-a8aa-6d3021737452/state/analysis

reporting systems are good models to follow for HCAI failure and near miss
reports.
In software engineering, code development environments, such as GitHub,
record the author of every line of code and document who made changes.31
GitHub claims to be used by more than 56 million developers in more than
3 million organizations. Then, when systems are in operation, bug reporting
tools, such as freely available Bugzilla, guide project teams to frequent and se-
rious bugs with a tracking system for recording resolution and testing.32 Fixing
users’ problems promptly prevents other users from encountering the same
problem. These tools are typically used by members of software engineering
teams but inviting reports of problems from users is another opportunity.
The cybersecurity field has a long-standing practice of paying for vulner-
ability reports that could be adapted for HCAI systems. Bug bounties could
be paid for individuals who report problems with HCAI systems, but the idea
could be extended to bias bounties for those who report biased performance.33
This crowdsourcing idea has been used by companies such as Google, which
has paid from $100 to more than $30,000 per validated report for a total of
more than $3 million. At least two companies make a business of managing
such systems for clients: HackerOne34 and BugCrowd.35 Validating this idea in
the HCAI context requires policies about how much is paid, how reports are
CHAPTER 20: SAFETY CULTURE 187

evaluated, and how much information about the bug and bias reports are pub-
licly disclosed.36 These crowdsourced ideas build on software developer Eric
Raymond’s belief that “with enough eyes, all bugs are shallow,” suggesting that
it is valuable to engage more people in finding bugs.
Bug reporting is easier for interactive systems with comprehensible status
displays than for highly automated systems without displays, such as elevators,
manufacturing equipment, or self-driving cars. For example, I’ve regularly had
problems with Internet connectivity at my home, but the lack of adequate user
interfaces makes it very difficult for me to isolate and report the problems I’ve
had. When my Internet connection drops, it is hard to tell if it was a problem
on my laptop, my wireless connection to the router/modem, or the Internet
service provider. I wish there was a status display and control panel so I could
fix the problem or know whom to call. Server hosting company Cloudflare pro-
vides this information for its professional clients. Like many users, I find that the
best hope is to reboot everything and hope that in ten to fifteen minutes I can
resume work.
Another model to follow is the US Army’s method of after-action reviews,
which have also been used in healthcare, transportation, industrial process con-
trol, environmental monitoring, and firefighting, so they might be useful for
studying HCAI failures and near misses.37 Investigators try to understand what
was supposed to happen, what actually happened, and what could be done bet-
ter in the future. A complete report that describes what went well and what
could be improved will encourage acceptance of recommendations. As After-
Action Review participants gain familiarity with the process, their analyses are
likely to improve and so will the acceptance of their recommendations.
Early efforts in the HCAI community are beginning to collect data on HCAI
incidents. Roman Yampolskiy’s initial set of incidents has been included in
a more ambitious project from the Partnership on AI.38 Sean McGregor de-
scribes the admirable goals and methods used to build the database of more
than 1000 incident reports sourced from popular, trade, and academic publi-
cations.39 Searches can be done by keywords and phrases such as “mortgage”
or “facial recognition,” but a thoughtful report on these incidents remains to
be done. Still, this is an important project, which could help in efforts to make
more reliable, safe, and trustworthy systems.
Another important project, but more narrowly focused, is run by Karl
Hansen, who collects public reports on deaths involving Tesla cars. He was a
special agent with the US Army Criminal Investigation Command, Protective
Services Battalion, who was hired by Tesla as their internal investigator in 2018.
He claims to have been wrongfully fired by Tesla for his public reporting of
deaths involving Tesla cars.40 The 209 deaths as of September 2021 are far more
than most people expect, given what is often presented to the public. These re-
ports are incomplete so it is difficult to determine what happened in each case
or whether the Autopilot self-driving system was in operation. In August 2021,
the US National Highway Transportation Safety Administration launched an
investigation of eleven crashes of Tesla cars on autopilot that hit first responders
on the road or at the roadside.
Other concerns come from 122 sudden unintended acceleration (SUA)
events involving Teslas that were reported to the US National Highway Traf-
fic Safety Administration by January 2020.41 A typical report reads: “my wife
was slowly approaching our garage door waiting for the garage door to open
when the car suddenly lurched forward . . . destroying the garage doors . . . the
car eventually stopped when it hit the concrete wall of the garage.” Another
report describes two sudden accelerations and ends by saying “fortunately no
collision occurred, but we are scared now.” Tesla claims that its investigations
showed that the vehicle functioned properly, but that every incident was caused
by drivers stepping on the accelerator.42 However, shouldn’t a safety-first car
prevent such collisions with garage doors, walls, or other vehicles?

Internal Review Boards for Problems and Future Plans

Commitment to a safety culture is shown by regularly scheduled monthly meet-
ings to discuss failures and near misses, as well as to celebrate resilient efforts in
the face of serious challenges. Standardized statistical reporting of events allows
managers and staff to understand what metrics are important and to suggest
new ones. Internal and sometimes public summaries emphasize the importance
of a safety culture.43
Review boards may include managers, staff, and others, who offer diverse
perspectives on how to promote continuous improvement. In some industries,
such as aviation, monthly reports of on-time performance or lost bag rates
drive healthy competition, which serves the public interest. Similarly, hospitals
may report patient care results for various conditions or surgeries, enabling the
public to choose hospitals, in part, by their performance. The US Department
of Health and Human Services has an Agency for Healthcare Research and
Quality, which conducts regular surveys on patient safety culture in hospitals,
nursing homes, and community pharmacies.44 The surveys raise staff awareness
of patient safety and show trends over time and across regions.
A surprising approach to failures is emerging in many hospitals, which
are adopting disclosure, apology, and offer programs.45 Medical professionals
usually provide excellent care for their patients, but when problems arise, there
has been a tendency to do the best they can for the patient. However, fear of
malpractice lawsuits limits physician willingness to report problems to patients,
their families, or hospital managers. The emerging approach of disclosure, apol-
ogy, and offer programs shifts to full disclosure to patients and their families
with a clear apology and an offer of treatments to remedy the problem and/or
financial compensation. While some physicians and managers feared that this
would increase malpractice lawsuits, the results were dramatically different. Pa-
tients and their families appreciated the honest disclosure and especially the
clear apology. As a result, lawsuits were often cut in half, while the number
of medical errors decreased substantially because of physicians’ awareness of
these programs. Professional and organizational pride also increased.46
Internal review and auditing teams can also improve HCAI practices to
limit failures and near misses. Google’s five-stage internal algorithmic audit-
ing framework, which is designed “to close the AI accountability gap,” provides
a good model for others to build on:47

1) Scoping: identify the scope of the project and the audit; raise questions
of risk.
2) Mapping: create stakeholder map and collaborator contact list; conduct
interviews and select metrics.
3) Artifact collection: document design process, data sets, and machine
learning models.
4) Testing: conduct adversarial testing to probe edge cases and failure
possibilities.
5) Reflection: consider risk analysis, failure remediation, and record design
history.

The authors include a post-audit review with a self-assessment summary report
and mechanisms to track implementation. However, they are well aware that
“internal audits are only one important aspect of a broader system of required
quality checks and balances.”

Initial corporate efforts include Facebook’s oversight board, set up in mid-
2020 for content monitoring and governance on their platform.48 Microsoft’s AI
and Ethics in Engineering and Research (AETHER) Committee advises their
leadership on responsible AI issues, technology, processes, and best practices
that “warrant people’s trust.”49 Microsoft’s Office of Responsible AI implements
company-wide rules for governance, team readiness, and dealing with sensitive
use cases. They also help shape new HCAI-related “laws, norms, and standards
. . . for the benefit of society at large.”

Alignment with Industry Standard Practices

In many consequential and life-critical industries there are established industry
standards, often promulgated by professional associations, such as the Asso-
ciation for Advancing Automation (AAA).50 The AAA, founded in 1974 as
the Robotics Industries Association (RIA), works with the American National
Standards Institute51 to drive “innovation, growth, and safety” by developing
voluntary consensus standards for use by its members. Their work on advanced
automation is a model for other forms of HCAI.
The International Standards Organization (ISO) has a Technical Commit-
tee on Robotics whose goal, since 1983, is “to develop high quality standards
for the safety of industrial robots and service robots . . . by providing clear
best practices on how to ensure proper safe installations, as well as provid-
ing standardized interfaces and performance criteria.”52 The emerging IEEE
P7000 series of standards is directly tied to HCAI issues such as transparency,
bias, safety, and trustworthiness.53 These standards could do much to ad-
vance these goals by making more precise definitions and ways of assessing
HCAI systems. The Open Community for Ethics in Autonomous and Intelli-
gent Systems (OCEANIS) promotes discussions to coordinate multiple global
standards efforts.54
A third source of standards is the World Wide Web Consortium (W3C),
which supports website designers who pursue universal or inclusive design
goals with the Web’s Content Accessibility Guidelines.55 Another source is the
US Access Board, whose Section 508 standards guide agencies to “give disabled
employees and members of the public access to information that is comparable
to the access available to others.”56 These accessibility guidelines will be needed
to ensure universal usability of HCAI systems, and they provide helpful models
of how to deploy other guidelines for HCAI systems.
[Figure 20.3 content. Characteristics of the maturity levels: Level 1 (Initial): processes unpredictable, poorly controlled, and reactive. Level 2 (Managed): processes characterized for projects and often reactive. Level 3 (Defined): processes characterized for the organization and proactive; projects tailor their processes from the organization’s standards. Level 4 (Quantitatively Managed): processes measured and controlled. Level 5 (Optimizing): focus on process improvement.]

Fig 20.3 Characteristics of maturity levels: five levels in the Capability Maturity Model:
(1) Initial, (2) Managed, (3) Defined, (4) Quantitatively Managed, and (5) Optimizing.

By working with organizations that develop these guidelines, companies
can contribute to future guidelines and standards, learn about best practices,
and provide education for staff members. Customers and the public may see
participation and adherence to standards as an indication of a safety-oriented
company. However, skeptics are concerned that corporate participation in de-
veloping voluntary standards leads to weak standards whose main goal is to
prevent more rigorous government regulation or other interventions. In this pro-
cess, known as corporate capture, corporate participants weaken standards
to gain competitive advantages or avoid more costly designs. On the other hand,
corporate participants also bring realistic experience to the process, raising the
relevance of standards and increasing the chances for their widespread accep-
tance. The right blend is hard to achieve, but well-designed voluntary standards
can improve products and services.
Another approach to improving software quality is the Capability Maturity
Model (Figure 20.3), developed by the Software Engineering Institute (SEI)
in the late 1980s57 and regularly updated.58 The SEI’s goal is to improve soft-
ware development processes, rather than setting standards for products and
services. The 2018 Capability Maturity Model Integration version comes with
the claim that it helps “integrate traditionally separate organizational functions,
set process improvement goals and priorities, provide guidance for quality
processes, and provide a point of reference for appraising current processes.”59
The Capability Maturity Model Integration is a guide to software engineer-
ing organizational processes with five levels of maturity, starting from Level 1, in which
processes are unpredictable, poorly controlled across groups, and reactive to
problems. Higher levels define orderly software development processes with
detailed metrics for management control and organization-wide discussions of
how to optimize performance and anticipate problems. Training for staff and
management help ensure that the required practices are understood and fol-
lowed. Many US government software development contracts, especially from
defense agencies, stipulate which maturity level is required for bidders, using a
formal appraisal process.
In summary, safety cultures take a strong commitment by industry lead-
ers, supported by personnel, resources, and substantive actions, which are at
odds with the “move fast, break things” ethic of early technology companies.
To succeed, leaders will have to hire safety experts who use rigorous statistical
methods, anticipate problems, appreciate openness, and measure performance.
Other vital strategies are internal reviews and alignment with industry stan-
dards. Getting to a mature stage, where safety is valued as a competitive
advantage, will make HCAI technologies increasingly trusted for consequential
and life-critical applications.
Skeptics question whether the Capability Maturity Models lead to top-
heavy management structures, which may slow the popular agile and lean
development methods. Still, proposals for HCAI Capability Maturity Mod-
els are emerging for medical devices, transportation, and cybersecurity.60
The UK Institute for Ethical AI and Machine Learning proposes a Machine
Learning Maturity Model based on hundreds of practical benchmarks, which
cover topics such as data and model assessment processes and explainability
requirements.61
HCAI Capability Maturity Models might be transformed into Trustwor-
thiness Maturity Models (TMM). TMMs might describe Level 1 initial use
of HCAI that is guided by individual team preferences and knowledge, mak-
ing it unpredictable, poorly controlled, and reactive to problems. Level 2 use
might call for uniform staff training in tools and processes, making it more
consistent across teams, while Level 3 might require repeated use of tools and
processes that are reviewed for their efficacy and refined to meet the applica-
tion domain needs and organization style. Assessments would cover testing
for biased data, validation and verification of HCAI systems, performance,
user experience testing, and reviews of customer complaints. Level 4 might re-
quire measurement of HCAI systems and developer performance, with analysis
of audit trails to understand how failures and near misses occurred. Level 5
might have repeated measures across many groups and over time to support
continuous improvement and quality control.
Supporting the TMM idea, professional and academic attendees of a work-
shop produced a thorough report with fifty-nine co-authors who argue that, for “AI
developers to earn trust from system users, customers, civil society, govern-
ments, and other stakeholders that they are building AI responsibly, there is a
need to move beyond principles to a focus on mechanisms for demonstrating
responsible behavior.”62 The authors’ recommendations of institutional struc-
tures and software workflows are in harmony with those in this book, and
they also cover hardware recommendations, formal methods, and “verifiable
claims” that are “sufficiently precise to be falsifiable.”
Another growing industry practice is to develop templates for document-
ing HCAI systems for the benefit of developers, managers, maintainers, and
other stakeholders. The 2018 idea of “datasheets for datasets” sparked great in-
terest because it addresses the notion that machine learning data sets needed
as much documentation as the programs that used them. The authors pro-
posed a “standardized way to document how and why a data set was created,
what information it contains, what tasks it should and should not be used for,
and whether it might raise any ethical or legal concerns.”63 Their paper, which
provided examples of datasheets for face recognition and sentiment classifica-
tion projects, triggered strong efforts such as Google’s Model Cards, Microsoft’s
Datasheets, and IBM’s FactSheets.64 The FactSheets team conducted evalua-
tions by thirty-five participants of six systems, such as an audio classifier, breast
cancer detector, and image caption generator, for qualities such as complete-
ness, conciseness, and clarity. Combined with a second study, which required
nineteen participants to create FactSheets, the team refined their design and
evaluation methods. The FactSheets studies and much related material are
available on an IBM website.65 Following in this line of work, Kacper Sokol
and Peter Flach of the University of Bristol, UK, developed “Explainability Fact
Sheets” to describe numerous features of a system’s explanations.66 If these
early documentation efforts mature and spread across companies, they could
do much to improve development processes so as to improve reliability, safety,
and trustworthiness.
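A sketch of the kind of record such documentation might contain; the field names are assumptions loosely inspired by the datasheets, Model Cards, and FactSheets ideas, not any vendor's actual template:

```python
# Illustrative documentation record for a training data set.
datasheet = {
    "name": "loan-repayment-training-set",
    "motivation": "Why and by whom the data set was created",
    "composition": {"rows": 120_000, "features": 42,
                    "sensitive_attributes": ["age", "gender"]},
    "collection_process": "How, when, and with what consent the data were gathered",
    "recommended_uses": ["research on credit scoring"],
    "prohibited_uses": ["employment screening"],
    "ethical_and_legal_concerns": "Known sampling biases and applicable regulations",
    "maintenance": {"owner": "data governance team", "last_reviewed": "2021-06-30"},
}

REQUIRED = ("motivation", "composition", "collection_process",
            "recommended_uses", "prohibited_uses")

def missing_sections(sheet):
    """Flag empty or absent sections before the data set is released to developers."""
    return [field for field in REQUIRED if not sheet.get(field)]

print(missing_sections(datasheet))   # [] means every required section is filled in
```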
CHAPTER 21

Trustworthy Certification by Independent Oversight

The third governance layer is independent oversight by external review
organizations (see Figure 18.2). Even established large companies,
government agencies, and other organizations that build consequential
HCAI systems are venturing into new territory, so they will face new prob-
lems. Therefore, thoughtful independent oversight reviews will be valuable
in achieving trustworthy systems that receive wide public acceptance. How-
ever, designing successful independent oversight structures is still a challenge,
as shown by reports on more than forty variations that have been used in
government, business, universities, non-governmental organizations, and civic
society.1
The key to independent oversight is to support the legal, moral, and ethical
principles of human or organizational responsibility and liability for their
products and services. Responsibility is a complex topic, with nuanced vari-
ations such as legal liability, professional accountability, moral responsibility,
and ethical bias.2 A deeper philosophical discussion of responsibility is useful,
but I assume that humans and organizations are legally liable (responsible) for
the products and services that they create, operate, maintain, or use indirectly.3
The report from the European Union’s Expert Committee on Liability for New
Technologies stresses the importance of clarifying liability for autonomous and
robotic technologies.4 They assert that operators of such technologies are liable
and that products should include audit trails to enable retrospective analy-
ses of failures to assign liability to manufacturer, operator, or maintenance
organizations.

Professional engineers, physicians, lawyers, aviation specialists, business
leaders, etc. are aware of their personal responsibility for their actions, but
the software field has largely avoided certification and professional status for
designers, programmers, and managers. In addition, contracts often contain
“hold harmless” clauses that stipulate developers are not liable for damages,
since software development is often described as a new and emerging activity,
even after fifty years of experience. The HCAI movement has raised these is-
sues again with frequent calls for algorithmic accountability and transparency,5
ethically aligned design,6 and professional responsibility.7 Professional organi-
zations, such as the AAAI, ACM, and IEEE, have ethical codes of conduct for
their members, but penalties for unethical conduct are rare.
When damages occur, the allocation of liability is a complex legal issue.
Many legal scholars, however, believe that existing laws are sufficient to deal
with HCAI systems, although novel precedents will help clarify the issues.8 For
example, Facebook was sued for discrimination under existing laws since its
AI algorithms allowed real estate brokers to target housing advertisements by
gender, age, and zip code. Facebook settled the case and instituted changes to
prevent advertising discrimination in housing, credit, and employment.9
Independent oversight is widely used by businesses, government agencies,
universities, non-governmental organizations, and civic society to stimulate
discussions, review plans, monitor ongoing processes, and analyze failures.10
The goal of independent oversight is to review plans for major projects, in-
vestigate serious failures, and promote continuous improvement that ensures
reliable, safe, and trustworthy products and services.
The individuals who serve on independent oversight boards need to be re-
spected leaders whose specialized knowledge makes them close enough to the
organizations they review to be well informed, but far enough away
that they are independent. Conflicts of interest, such as previous relationships
with the organization that is being reviewed, are likely to exist, so they must be
disclosed and assessed. Diverse membership representing different disciplines,
age, gender, ethnicity, and other factors helps build robust oversight boards.
Their capacity to investigate may include the right to examine private data,
compel interviews, and even subpoena witnesses for testimony. The indepen-
dent oversight reports are most effective if they are made public. Recommen-
dations will have impact if there is a requirement to respond and make changes
within a specified time period, usually measured in weeks or months.
Three independent oversight methods are common (Figure 21.1):11
Fig 21.1 Independent oversight methods. Three forms of independent oversight: planning oversight, continuous monitoring, and retrospective analysis of disasters.

Planning oversight: Proposals for new HCAI systems or major upgrades are presented for review in advance so that feedback and discussion can influence the plans. Planning oversight is similar to zoning boards, which review proposals for new buildings that must adhere to building codes. A variation is the idea of algorithmic impact assessments, which are similar to environmental impact statements that enable stakeholders to discuss plans before implementation.12 Rigorous planning oversight needs follow-up reviews to verify that the plan was followed.
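To make this concrete, a planning-oversight submission might be accompanied by a structured summary of the algorithmic impact assessment. The sketch below is a minimal illustration in Python; the field names and sample values are assumptions for this example, not a mandated format.

```python
# A minimal sketch of a planning-oversight record for an algorithmic impact
# assessment. Field names and values are illustrative assumptions, not a standard.
from dataclasses import dataclass, field
from typing import List

@dataclass
class ImpactAssessment:
    system_name: str
    intended_use: str
    affected_groups: List[str]       # stakeholders who may be helped or harmed
    known_risks: List[str]           # e.g., bias, privacy, or safety risks
    mitigation_plans: List[str]      # how each risk will be addressed
    reviewers: List[str]             # independent board members who reviewed the plan
    follow_up_date: str              # when the board verifies the plan was followed
    open_questions: List[str] = field(default_factory=list)

assessment = ImpactAssessment(
    system_name="Mortgage screening model v2",
    intended_use="Rank loan applications for human review",
    affected_groups=["applicants", "loan officers"],
    known_risks=["disparate error rates across demographic groups"],
    mitigation_plans=["quarterly fairness audit with a public summary"],
    reviewers=["external oversight board"],
    follow_up_date="2022-06-30",
)
```

A review board could then check, at the follow-up date, whether each mitigation plan was actually carried out.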

Continuous monitoring: This is an expensive approach, but the US Food and Drug Administration has inspectors who work continuously at pharmaceutical and meat-packing plants, while the US Federal Reserve Board continuously monitors practices at large banks. One form of continuous monitoring is periodic inspections, such as quarterly inspections for elevators or annual financial audits for publicly traded companies. Continuous monitoring of mortgage- or parole-granting HCAI systems would reveal problems as the profile of applicants changes or the context shifts, as happened during the COVID-19 crisis.
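A minimal sketch of what such monitoring might compute appears below, assuming the monitor receives batches of categorical applicant attributes; the attribute values and the alert threshold are illustrative assumptions.

```python
# A sketch of continuous monitoring for applicant-profile drift. The monitor
# compares a baseline distribution with the current batch and raises an alert
# for oversight review when the shift exceeds a threshold (an assumed value).
from collections import Counter

def distribution(values):
    counts = Counter(values)
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

def total_variation(baseline, current):
    """Total variation distance between two categorical distributions."""
    keys = set(baseline) | set(current)
    return 0.5 * sum(abs(baseline.get(k, 0.0) - current.get(k, 0.0)) for k in keys)

def check_drift(baseline_batch, current_batch, threshold=0.15):
    drift = total_variation(distribution(baseline_batch), distribution(current_batch))
    if drift > threshold:
        print(f"ALERT: applicant profile shifted (distance={drift:.2f}); flag for review")
    return drift

# Example: employment status of mortgage applicants before and during a crisis.
baseline = ["employed"] * 90 + ["self-employed"] * 8 + ["unemployed"] * 2
current = ["employed"] * 70 + ["self-employed"] * 10 + ["unemployed"] * 20
check_drift(baseline, current)
```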

Retrospective analysis of disasters: The US National Transportation Safety Board conducts widely respected, thorough reviews with detailed reports about aircraft, train, or ship crashes. Similarly, the US Federal Communications Commission is moving to review HCAI systems in social media and web services, especially disability access and fake news attacks. Other agencies in the United States and around the world are developing principles and policies to enable study and limitation of HCAI failures. A central effort is to develop voluntary industry guidelines for audit trails and analysis for diverse applications.
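For example, an audit trail that supports retrospective analysis could record, for every consequential action, what the system saw, what it decided, which software build acted, and who was responsible at the time. The sketch below is a hypothetical record format, not an industry standard.

```python
# A minimal sketch of an audit-trail entry that a retrospective review board
# could replay after a failure. Field names are illustrative assumptions.
import json, time, uuid

def log_decision(inputs, model_version, output, operator, audit_file="audit_log.jsonl"):
    """Append one record per consequential HCAI decision to an append-only log."""
    record = {
        "record_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "model_version": model_version,   # which model or software build acted
        "inputs": inputs,                 # what the system observed
        "output": output,                 # what it decided or recommended
        "operator": operator,             # the human or organization in the loop
    }
    with open(audit_file, "a") as f:
        f.write(json.dumps(record) + "\n")
    return record

log_decision(
    inputs={"sensor": "lidar", "speed_kmh": 42},
    model_version="planner-3.1.4",
    output="emergency_brake",
    operator="safety_driver_17",
)
```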
Skeptics point to failures of independent oversight methods, sometimes tied to a lack of sufficient independence, but the value of these methods is widely appreciated.
In summary, clarifying responsibility for designers, engineers, managers,
maintainers, and users of advanced technology will improve safety and effec-
tiveness, since these stakeholders will be aware of their liability for negligent be-
havior. The five technical practices for software engineering teams (Chapter 19)
are first steps to developing reliable systems. The five management strategies for
organizations (Chapter 20) build on existing strategies to promote safety cul-
tures across all the teams in their organization. This chapter offers four paths to
trustworthy certification within an industry by independent oversight reviews,
in which knowledgeable industry experts bring successful practices from one
organization to another. Chapter 22 describes government interventions and
regulations.

Accounting Firms Conduct External Audits for HCAI Systems
The US Securities and Exchange Commission (SEC) requires publicly traded
businesses to have annual internal and external audits, with results posted on
the SEC website and published in corporate annual reports. This SEC mandate,
which required use of the generally accepted accounting principles (GAAP),
is widely considered to have limited fraud and offered investors more accurate
information. However, there were massive failures such as the Enron and MCI
WorldCom problems, which led to the Sarbanes–Oxley Act of 2002, known as
the Corporate and Auditing Accountability, Responsibility, and Transparency
Act, but remember that no system will ever completely prevent malfeasance and
fraud. New mandates about reporting on HCAI projects, such as descriptions of
fairness and user experience test results, could standardize and strengthen re-
porting methods so as to increase investor trust by allowing comparisons across
corporations.
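One candidate figure for such standardized reports is the gap in favorable-outcome rates across groups. The sketch below shows one way such a figure might be computed; the groups, sample data, and choice of metric are illustrative assumptions, not SEC or GAAP requirements.

```python
# A sketch of a fairness figure that a standardized HCAI disclosure might report:
# the difference in approval rates across groups (a demographic parity gap).
def approval_rates(decisions):
    """decisions: list of (group, approved) pairs."""
    totals, approved = {}, {}
    for group, ok in decisions:
        totals[group] = totals.get(group, 0) + 1
        approved[group] = approved.get(group, 0) + (1 if ok else 0)
    return {g: approved[g] / totals[g] for g in totals}

def demographic_parity_gap(decisions):
    rates = approval_rates(decisions)
    return max(rates.values()) - min(rates.values())

sample = [("group_a", True)] * 80 + [("group_a", False)] * 20 \
       + [("group_b", True)] * 65 + [("group_b", False)] * 35
print(approval_rates(sample))                    # {'group_a': 0.8, 'group_b': 0.65}
print(round(demographic_parity_gap(sample), 2))  # 0.15, a candidate figure for an annual report
```

Reporting the same figure each year, computed the same way, is what would allow investors to compare corporations.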
Independent financial audit firms, which analyze corporate financial state-
ments to certify that they are accurate, truthful, and complete, could develop
reviewing strategies for corporate HCAI projects to provide guidance to in-
vestors. They would also make recommendations to their client companies
about what improvements to make. These firms often develop close relation-
ships with internal auditing committees, so that there is a good chance that
recommendations will be implemented.
Leading independent auditing firms could be encouraged by public pressure or SEC mandate to increase their commitment to support HCAI projects.
The big four firms are PricewaterhouseCoopers,13 Deloitte,14 Ernst & Young,15
and KPMG;16 all claim expertise in AI. The Deloitte website makes a promising
statement that “AI tools typically yield little direct outcome until paired with
human-centered design,” which leans in the directions recommended by this
chapter. Accounting firms have two potential roles, consulting and indepen-
dent audit, but these roles must be kept strictly separated, as stipulated by the
Sarbanes–Oxley Act.
A compelling example of independent oversight of corporate projects is con-
tact tracing for COVID-19. Apple and Google partnered to produce mobile
device apps that would alert users if someone they came in contact with devel-
oped COVID-19. However, the privacy threats immediately raised concerns,
leading to calls for independent oversight boards and policies. One thoughtful
proposal offers over 200 items for an independent oversight board of gover-
nance to assess and adjudicate during an audit.17 For controversial projects that
involve privacy, security, industry competition, or potential bias, independent
oversight panels could play a role in increasing public trust.
If the big four auditing firms stepped forward, their credibility with corpora-
tions and general public trust could lead to independent HCAI audits that had
substance and impact. A model to build on is the Committee of Sponsoring Or-
ganizations,18 which brought together five leading accounting organizations to
improve enterprise risk management, internal controls, and fraud deterrence.
This form of auditing for HCAI could reduce pressures for government regu-
lation and improve business practices. Early efforts could attract attention and
enlist trusted public figures and organizations to join such review boards.
In addition to auditing or accounting firms, consulting companies could also
play a role. Leaders like Accenture, McKinsey and Co., and Boston Consulting
Group have all built their AI expertise and published white papers in order to
advise companies on reliable, safe, and trustworthy systems.

Insurance Companies Compensate for Failures
The insurance industry is a potential guarantor of trustworthiness, as it is in the
building, manufacturing, and medical domains. Insurance companies could
specify requirements for insurability of HCAI systems in manufacturing, medi-
cal, transportation, industrial, and other domains. They have long played a key
role in ensuring building safety by requiring adherence to building codes for structural strength, fire safety, flood protection, and many other features.
Building codes could be a model for software engineers, as described in
computer scientist Carl Landwehr’s proposal: “A building code for building
code.”19 He extends historical analogies to plumbing, fire, or electrical stan-
dards by applying them to software engineering for avionics, medical devices,
and cybersecurity, but the extension to HCAI systems seems natural.
Builders must satisfy the building codes to gain the inspector’s approval,
which allows them to obtain liability insurance. Software engineers could con-
tribute to detailed software design, testing, and certification standards, which
could be used to enable an insurance company to conduct risk assessment and
develop insurance pricing. Requirements for audit trails of performance and
monthly or quarterly reports about failures and near misses would give insur-
ance companies data they need. Actuaries would become skillful in developing
risk profiles for different applications and industries, with guidelines for com-
pensation when damage occurs. Liability law from related technologies would
have to be interpreted for HCAI systems.20
A natural next step would be for insurance companies to gather data from
multiple companies in each industry they serve, which would accelerate their
development of risk metrics and underwriting evaluation methods. This would
also support the refinement of building codes for each industry to educate de-
velopers and publicly record expected practices. The development of building
codes also guides companies about how to improve their HCAI products and
services.
In some industries such as healthcare, travel, and car or home ownership,
consumers purchase insurance which provides no-fault protection to cover
damages for any reason. But in some cases, providers purchase insurance to
cover the costs of medical malpractice suits, transportation accidents, and
building damage from fire, floods, or storms. For many HCAI systems, it seems
reasonable that the providers would be the ones to purchase insurance so as to
provide protection for the large numbers of consumers who might be harmed.
This would drive up the costs of products and services, but as in many indus-
tries, consumers are ready to pay these costs. Insurance companies will have
to develop risk assessments for HCAI systems, but as the number of appli-
cations grow, sufficient data on failures and near misses will emerge to guide
refinements.
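A rough sketch of how reported failures and near misses might feed a premium estimate appears below; the incident weights, claim cost, and loading factor are illustrative assumptions, not actuarial standards.

```python
# A sketch of turning reported failures and near misses into a rough premium.
def risk_score(failures, near_misses, exposure_hours):
    """Blend failures and near misses into weighted incidents per 1,000 operating hours."""
    weighted_incidents = failures + 0.1 * near_misses   # near misses weighted lower (assumed)
    return 1000.0 * weighted_incidents / exposure_hours

def annual_premium(score, expected_claim_cost, loading=1.4):
    """Expected losses times a loading factor for expenses and uncertainty."""
    return score * expected_claim_cost * loading

score = risk_score(failures=2, near_misses=30, exposure_hours=50_000)
print(score)                                                # 0.1 weighted incidents per 1,000 hours
print(annual_premium(score, expected_claim_cost=200_000))   # a rough premium figure
```

As failure and near-miss data accumulate across an industry, these assumed weights could be replaced with empirically fitted ones.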
Car insurance companies, including Travelers Insurance, produced a July
2018 paper on “Insuring Autonomy” for self-driving cars.21 They sought a
framework that “spurs innovation, increases public safety, provides peace of
mind and protects . . . drivers and consumers.” The report endorsed the beliefs
that self-driving cars would dramatically increase safety, but damage claims
would increase because of the more costly equipment. Both beliefs influence
risk assessment, the setting of premiums, and profits, as did their forecast that
the number of cars would decrease because of more shared usage. This early
report remains relevant because the public still needs data that demonstrate
or refute the idea that self-driving cars are safer. Manufacturers are reluctant
to report what they know and the states and federal government in the United
States have yet to push for open reporting and regulations on self-driving cars.22
The insurance companies will certainly act when self-driving cars move from
demonstration projects to wider consumer use, but earlier interventions could
be more influential.23
Skeptics fear that the insurance companies are more concerned with prof-
its than with protecting public safety and they worry about the difficulty of
pursuing a claim when injured by a self-driving car, mistaken medical recom-
mendation, or biased treatment during job hiring, mortgage approval or parole
assessment. However, as the history of insurance shows, having insurance will
benefit many people in their difficult moments of loss. Developing realistic
insurance against the damages caused by HCAI systems is a worthy goal.
Other approaches are to create no-fault insurance programs or victim
compensation funds, in which industry or government provides funds to an
independent review board that pays injured parties promptly without the com-
plexity and cost of judicial processes. Examples include the September 11
Victim Compensation Fund for the 2001 terror attack in New York and the
Gulf Coast Claim Facility for the Deepwater Horizon oil spill. Proposals for
novel forms of compensation for HCAI system failures have been made, but
none have yet gained widespread support.

Non-governmental and Civil Society Organizations
In addition to government efforts, auditing by accounting firms, and war-
ranties from insurance companies, the United States and many other countries
have a rich set of non-governmental and civil society organizations that have
already been active in promoting reliable, safe, and trustworthy HCAI systems
(Appendix A). These examples have various levels of support, but collec-
tively they are likely to do much to promote improved systems and public
acceptance.
These non-governmental organizations (NGOs) are often funded by wealthy
donors or corporations who believe that an independent organization has
greater freedom to explore novel ideas and lead public discussions in rapidly
growing fields such as AI. Some of the NGOs were started by individuals who
have the passion necessary to draw others in and find sponsors, but the more
mature NGOs may have dozens or hundreds of paid staff members who share
their enthusiasm. Some of these NGOs develop beneficial services or train-
ing courses on new technology policy issues that bring in funding and further
expand their networks of contacts.
An inspiring example is how the Algorithmic Justice League was able to get
large technology companies to improve their facial recognition products so as
to reduce gender and racial bias within a two-year period. Their pressure also
was likely to have been influential in the spring 2020 decisions of leading com-
panies to halt their sales to police agencies in the wake of the intense movement
to limit police racial bias.
NGOs have proven to be early leaders in developing new ideas about HCAI
principles and ethics, but now they will need to increase their attention to de-
veloping new ideas about implementing software engineering practices and
business management strategies. They will also have to expand their relation-
ships with government policy-makers, liability lawyers, insurance companies,
and auditing firms so they can influence the external oversight mechanisms that
have long been part of other industries.
However, NGOs have limited authority to intervene. Their role is to point out
problems, raise possible solutions, stimulate public discussion, support inves-
tigative journalism, and change public attitudes. Then, governmental agencies
respond with policy guidance to staff and where possible new rules and regula-
tions. Auditing companies change their processes to accommodate HCAI, and
insurance companies update their risk assessment as they underwrite new tech-
nologies. NGOs could also be influential by conducting independent oversight
studies to analyze widely used HCAI systems. Their reports could provide fresh
insights and assessment processes tuned to the needs of diverse industries.

Professional Organizations and Research Institutes
Professional organizations have proven effective in developing voluntary
guidelines and standards. Established and new organizations (Appendix B) are
vigorously engaged in international discussions on ethical and practical design
principles for responsible AI. They are already influential in producing positive
outcomes. However, skeptics caution that industry leaders often dominate pro-
fessional organizations, sometimes called corporate capture, so they may push
for weaker guidelines and standards.
Professional societies, such as the IEEE, have long been effective in
supporting international standards, with current efforts on the P7000 series ad-
dressing topics such as transparency of autonomous systems, algorithmic bias
considerations, fail-safe design for autonomous and semi-autonomous systems,
and rating the trustworthiness of news sources.24 The ACM’s US Technology
Policy Committee has subgroups that address accessibility, AI/algorithmic ac-
countability, digital governance, and privacy. The challenge for professional
societies is to increase the low rates of participation of their members in these
efforts. The IEEE has stepped forward with an Ethics Certification Program for
Autonomous and Intelligent Systems, which seeks to develop metrics and certi-
fication methods for corporations to address transparency, accountability, and
algorithmic bias.25
Academic institutions have long conducted research on AI, but they have
now formed large centers to conduct research and promote interest in ethi-
cal, design, and research themes around HCAI. Early efforts have begun to add
ethical concerns and policy-making strategies to education, but much more re-
mains to be done so that graduates are more aware of the impact of their work.
Examples of lab names at prominent institutions include:

• Brown University, US (Humanity Centered Robotics Initiative)
• Columbia University, US (Data Science Institute)
• Harvard University, US (Berkman Klein Center for Internet and Society)
• Johns Hopkins University, US (Institute for Assured Autonomy)
• Monash University, Australia (Human-Centered AI)
• New York University, US (Center for Responsible AI)
• Northwestern University, US (Center for Human–Computer Interaction
+ Design)
• Stanford University, US (Human-Centered AI (HAI) Institute)
• University of British Columbia, Canada (Human-AI Interaction)
• University of California-Berkeley, US (Center for Human-Compatible
AI)
• University of Cambridge, UK (Leverhulme Centre for the Future of Intelligence)
• University of Canberra, Australia (Human Centred Technology Research
Centre)
• University of Chicago, US (Chicago Human+AI Lab)
• University of Oxford, UK (Internet Institute, Future of Humanity Insti-
tute)
• University of Toronto, Canada (Ethics of AI Lab)
• Utrecht University, Netherlands (Human-Centered AI)

There are also numerous research labs and educational programs devoted to
understanding the long-term impact of AI and exploring ways to ensure it
is beneficial for humanity. The challenge for these organizations is to build
on their strength in research by bridging to practice, so as to promote better
software engineering processes, organizational management strategies, and in-
dependent oversight methods. University–industry–government partnerships
could be a strong pathway for influential actions.
Responsible industry leaders have repeatedly expressed their desire to con-
duct research and use HCAI in safe and effective ways. Microsoft’s CEO Satya
Nadella proposed six principles for responsible use of advanced technologies.26
He wrote that artificially intelligent systems must:

• Assist humanity and safeguard human workers.
• Be transparent . . . Ethics and design go hand in hand.
• Maximize efficiencies without destroying the dignity of people.
• Be designed for intelligent privacy.
• Have algorithmic accountability so that humans can undo unintended
harm.
• Guard against bias . . . So that the wrong heuristics cannot be used to
discriminate.

Similarly, Google’s CEO Sundar Pichai offered seven objectives for artificial
intelligence applications that became core beliefs for the entire company:27

• Be socially beneficial.
• Avoid creating or reinforcing unfair bias.
• Be built and tested for safety.
• Be accountable to people.
• Incorporate privacy design principles.
• Uphold high standards of scientific excellence.
• Be made available for uses that accord with these principles.

Skeptics will see these statements as self-serving corporate whitewashing designed to generate positive public responses. However, they can produce
important efforts such as Google’s internal review and algorithmic auditing
framework28 (see Chapter 20’s section “Internal Review Boards for Problems
and Future Plans”), but their 2019 effort to form a semi-independent ethics re-
view committee collapsed in controversy within a week. Corporate statements
can help raise public expectations, but the diligence of internal commitments
should not be a reason to limit external independent oversight. Since support
for corporate social responsibilities may be countered by pressures for a prof-
itable bottom line, corporations and the public benefit from questions raised
by knowledgeable journalists and external review boards.
CHAPTER 21: APPENDIX A
Non-Governmental and Civil Society Organizations Working on HCAI

There are hundreds of organizations in this category, so this brief listing only
samples some of the prominent ones.

Underwriters Laboratories, established in 1894, has been “working for a safer
world” by “empowering trust.” They began with testing and certifying elec-
trical devices and then branched out worldwide to evaluate and develop
voluntary industry standards. Their vast international network has been suc-
cessful in producing better products and services, so it seems natural for them
to address HCAI.29
Brookings Institution, founded in 1916, is a Washington, DC non-profit pub-
lic policy organization, which is home to an Artificial Intelligence and Energy
Technology (AIET) Initiative. It focuses on governance issues by publish-
ing reports and books, bringing together policy-makers and researchers at
conferences, and seeking to “bridge the growing divide between industry, civil society, and policymakers.”30
Electronic Privacy Information Center (EPIC), founded in 1994, is a Washington, DC-based public interest research center that focuses “public attention
on emerging privacy and civil liberties issues and to protect privacy, freedom
of expression, and democratic values in the information age.” It runs con-
ferences, offers public education, files amicus briefs, pursues litigation, and
testifies before Congress and governmental organizations. Its recent work has
emphasized AI issues such as surveillance and algorithmic transparency.31
208 PART 4: GOVERNANCE STRUCTURES

Algorithmic Justice League, which stems from MIT and Emory University, seeks
to lead “a cultural movement towards equitable and accountable AI.” The
League combines “art and research to illuminate the social implications and
harms of AI.” With funding from large foundations and individuals it has
done influential work on demonstrating bias, especially for face recogni-
tion systems. Its work has productively led to algorithmic and training data
improvements in leading corporate systems.32
AI Now Institute at New York University “is an interdisciplinary research
center dedicated to understanding the social implications of artificial intel-
ligence.” This institute emphasizes “four core domains: Rights & Liberties,
Labor & Automation, Bias & Inclusion, Safety & Critical Infrastructure.” It
supports research, symposia, and workshops to educate and examine “the
social implications of AI.”33
Data and Society, an independent New York-based non-profit that “studies the
social implications of data-centric technologies and automation . . . We pro-
duce original research on topics including AI and automation, the impact of
technology on labor and health, and online disinformation.”34
Foundation for Responsible Robotics is a Netherlands-based group whose
tag line is “accountable innovation for the humans behind the robots.” Its
mission is “to shape a future of responsible (AI-based) robotics design, de-
velopment, use, regulation, and implementation. We do this by organizing
and hosting events, publishing consultation documents, and through creating
public-private collaborations.”35
AI4ALL, an Oakland, CA-based non-profit, works “for a future where diverse
backgrounds, perspectives, and voices unlock AI’s potential to benefit hu-
manity.” It sponsors education projects such as summer institutes in the
United States and Canada for diverse high school and university students,
especially women and minorities to promote AI for social good.36
ForHumanity is a public charity which examines and analyzes the downside
risks associated with AI and automation, such as “their impact on jobs, so-
ciety, our rights and our freedoms.” It believes that independent audit of AI
systems, covering trust, ethics, bias, privacy, and cybersecurity at the corpo-
rate and public-policy levels, is a crucial path to building an infrastructure
of trust. It believes that “if we make safe and responsible artificial intelligence
and automation profitable whilst making dangerous and irresponsible AI and
automation costly, then all of humanity wins.”37
CHAPTER 21: INDEPENDENT OVERSIGHT 209

Future of Life Institute is a Boston-based charity working on AI, biotech, nu-
clear, and climate issues in the United States, United Kingdom, and European
Union. It seeks to “catalyze and support research and initiatives for safeguard-
ing life and developing optimistic visions of the future, including positive
ways for humanity to steer its own course considering new technologies and
challenges.”38
Center for AI and Digital Policy is part of the Michael Dukakis Institute for
Leadership and Innovation. Its website says that it aims “to ensure that artifi-
cial intelligence and digital policies promote a better society, more fair, more
just, and more accountable—a world where technology promotes broad so-
cial inclusion based on fundamental rights, democratic institutions, and the
rule of law.”39 Its extensive annual report on AI and Democratic Values assesses the performance of twenty-five countries.
CHAPTER 21: APPENDIX B
Professional Organizations and Research Institutes Working on HCAI

There are hundreds of organizations in this category, so this brief listing
only samples some of the prominent ones. A partial listing can be found on
Wikipedia.40

Institute of Electrical and Electronics Engineers (IEEE) launched a global
initiative for ethical considerations in the design of AI and autonomous sys-
tems. It is an incubation space for new standards and solutions, certifications
and codes of conduct, and consensus building for ethical implementation of
intelligent technologies.41
IEEE Global Initiative on Ethics of Autonomous and Intelligent Systems
(2019) originates with the large professional engineering society and brought together more than 200 people over three years to prepare an influential report:
“Ethically Aligned Design: A Vision for Prioritizing Human Well-being with
Autonomous and Intelligent Systems.”42
ACM, a professional society with 100,000 members working in the computing
field, has been active in developing principles and ethical frameworks for re-
sponsible computing. ACM’s Technical Policy Committee delivered a report
with seven principles for algorithmic accountability and transparency.43
Association for the Advancement of Artificial Intelligence (AAAI) is a “non-
profit scientific society devoted to advancing the scientific understanding
of the mechanisms underlying thought and intelligent behavior and their
embodiment in machines. AAAI aims to promote research in, and re-
sponsible use of, artificial intelligence.” It runs very successful confer-
ences, symposia, and workshops, often in association with ACM, that bring
researchers together to present new work and train newcomers to the
field.44
OECD AI Policy Observatory is a project of the Organisation for Economic
Co-operation and Development. It works with policy professionals “to con-
sider the opportunities and challenges” in AI and to provide “a centre for the
collection and sharing of evidence on AI, leveraging the OECD’s reputation
for measurement methodologies and evidence-based analysis.”45
Association for Advancing Automation, founded in 1974 as the Robotics
Industries Association, is a North American trade group that “drives inno-
vation, growth, and safety in manufacturing and service industries through
education, promotion, and advancement of robotics, related automation
technologies, and companies delivering integrated solutions.”46
Machine Intelligence Research Institute (MIRI) is a research non-profit
studying the mathematical underpinnings of intelligent behavior. Its mission
is to develop formal tools for the clean design and analysis of general-purpose
AI systems, with the intent of making such systems safer and more reliable
when they are developed.47
OpenAI is a San Francisco-based research organization that “will attempt
to directly build safe and beneficial Artificial General Intelligence (AGI) . . .
that benefits all of humanity.” Their research team is supported by corporate
investors, foundations, and private donations.48
The Partnership on AI, established in 2016 by six of the largest technology
companies, has more than 100 industry, academic, and other partners who
“shape best practices, research, and public dialogue about AI’s benefits for
people and society.” The founding companies fund the Partnership, which “conducts re-
search, organizes discussions, shares insights, provides thought leadership,
consults with relevant third parties, responds to questions from the public
and media, and creates educational material.”49
Montreal AI Ethics Institute is an international, non-profit research institute
dedicated to defining humanity’s place in a world increasingly characterized
and driven by algorithms. Its website says: “We do this by creating tangible
and applied technical and policy research in the ethical, safe, and inclusive
development of AI. We’re an international non-profit organization equipping
citizens concerned about artificial intelligence to take action.”50
