HCAI Module 4
The widespread application of HCAI comes with high expectations
of benefits for many domains, including healthcare, education, cyber
security, and environmental protection. When it works as expected,
HCAI could improve medical diagnoses, stop cybercriminals, and protect en-
dangered species. However, there are equally dire predictions of out-of-control
robots, unfair treatment of minority groups, privacy violations, adversarial at-
tacks, and challenges to human rights. HCAI design and evaluation methods
can address these dangers to realize the desired benefits.1
Part 3 made the case that traditional AI science research focused on
studying and then emulating (some would use the term simulating) human
behavior, while current AI innovation research emphasizes practical appli-
cations. Typical AI scientific foundations and technology-based innovations
include pattern recognition and generation (images, speech, facial, signal,
etc.), human-like robots, game playing (checkers, chess, Go, etc.), and natural
language processing, generation, and translation.
HCAI research builds on these scientific foundations by using them to amplify, augment, and enhance human performance in ways that make systems reliable, safe, and trustworthy. These systems are woven together from many products and services, including chips,
software development tools, voluminous training data, extensive code libraries,
and numerous test cases for validation and verification, each of which may
change, sometimes on a daily basis. These difficulties present grand challenges
for software engineers, managers, reviewers, and policy-makers, so the rec-
ommendations are meant to launch much-needed discussions, pilot tests, and
scalable research that can lead to constructive changes.
There are more than 500 reports describing aspirational HCAI principles
from companies, professional societies, governments, consumer groups, and
non-government organizations.5 A Berkman Klein Center report from 2020
discusses the upsurge of policy activity, followed by a thoughtful summary of
thirty-six of the leading and most comprehensive reports. The authors identify
eight HCAI themes for deeper commentary and detailed principles: privacy,
accountability, safety and security, transparency and explainability, fairness and
non-discrimination, human control of technology, professional responsibility,
and promotion of human values.6
Other reports stress ethical principles, such as IEEE’s far-reaching “Ethically
Aligned Design,” which emerged from a 3-year effort involving more than 200
people. The report offers clear statements about eight general principles: hu-
man rights, well-being, data agency, effectiveness, transparency, accountability,
awareness of misuse, and competence. It went further with strong encourage-
ment to ensure that advanced systems “shall be created and operated to respect,
promote, and protect internationally recognized human rights”.7 Figure 18.1
shows the close match and the roughly similar principles in the two reports.
Fig 18.1 Human-Centered AI Principles: themes from the Berkman Klein Center report matched against the principles of IEEE Ethically Aligned Design, with close and rough matches indicated (for example, accountability appears in both lists).
These and other ethical principles are important foundations for clear think-
ing, but as Alan Winfield from the University of Bristol and Marina Jirotka
from Oxford University note: “the gap between principles and practice is an im-
portant theme.”8 The four-layer governance structure for HCAI systems could
help bridge this gap: (1) reliable systems based on sound software engineering
practices, (2) safety culture through proven business management strategies,
(3) trustworthy certification by independent oversight, and (4) regulation by
government agencies (Figure 18.2). The inner oval covers the many software en-
gineering teams which apply technical practices relevant to each project. These
teams are part of a larger organization (second oval) where safety culture man-
agement strategies influence each project team. In the third oval, independent
oversight boards review many organizations in the same industry, giving them
a deeper understanding, while spreading successful practices.
The largest oval is government regulation, which provides another layer of
thinking that addresses the public’s interest in reliable, safe, and trustworthy
HCAI systems. Government regulation is controversial, but success stories
such as the US National Transportation Safety Board's investigation of plane, train, boat, and highway accidents have generally been seen as advancing
Fig 18.2 Governance Structures for human-centered AI: The four levels are shown as
nested ovals: (1) Team: reliable systems based on software engineering (SE) practices,
(2) Organization: a well-developed safety culture based on sound management strategies,
(3) Industry: trustworthy certification by external review,
and (4) Government regulation.
including IBM, Amazon, and Microsoft, withdrew from selling these systems
to police departments because of pressure over potential misuse and abuse.11
The next four chapters cover the four levels of governance. Chapter 19
describes five technical practices of software engineering teams that enable re-
liable HCAI systems: audit trails, workflows, verification and validation testing,
bias testing, and explainable user interfaces.
Chapter 20 suggests how organizations that manage software engineering
projects can develop a safety culture through leadership commitment, hiring
and training, reporting failures and near misses, internal reviews, and industry
standards.
Chapter 21 shows how independent oversight methods by external review
organizations can lead to trustworthy certification and independent audits of
products and services. These independent oversight methods create a trusted
infrastructure to investigate failures, continuously improve systems, and gain
public confidence. Independent oversight methods include auditing firms, in-
surance companies, NGOs and civil society, and professional organizations.12
Chapter 22 opens the larger and controversial discussion of possible gov-
ernment interventions and regulations. The summary in Chapter 23 raises
concerns, but offers an optimistic view that well-designed HCAI systems will
bring meaningful benefits to individuals, organizations, and society.
The inclusion of human-centered thinking will be difficult for those who
have long seen algorithms as the dominant goal. They will question the validity
of this new synthesis, but human-centered thinking and practices put AI algo-
rithms and systems to work for commercially successful products and services.
HCAI offers a hope-filled vision of future technologies that support human
self-efficacy, creativity, responsibility, and social connections among people.
CHAPTER 19

Reliable HCAI systems are produced by software engineering teams that
apply sound technical practices.1 These technical practices clarify hu-
man responsibility, such as audit trails for accurate records of who did
what and when, and histories of who contributed to design, coding, testing,
and revisions.2 Other technical practices are improved software engineering
workflows that are tuned to the tasks and application domain. Then, when prototype systems are ready, verification and validation testing of the programs and bias testing of the training data can begin. Software engineering practices
also include the user experience design processes that lead to explainable user
interfaces for HCAI systems (see Figure 18.2).
Flight data recorders (FDRs) provide important lessons for HCAI designers of audit trails (also
called product logs) to record the actions of robots.4 These robot versions of
aviation flight data recorders have been called smart, ethical, or black boxes,
but the consistent intention of designers is to collect relevant evidence for retro-
spective analyses of failures.5 Such retrospective analyses are often conducted
to assign liability in legal decision-making and to provide guidance for con-
tinuous improvement of these systems. They also clarify responsibility, which
exonerates those who have performed properly, as in the case of the unfairly
accused nurses whose use of an intravenous morphine device was revealed to
be proper.
Similar proposals have been made for highly automated (also called self-
driving or driverless) cars.6 These proposals extend current work on electronic
logging devices, which are installed on many cars to support better mainte-
nance. Secondary uses of vehicle logging devices are to improve driver training,
monitor environmentally beneficial driving styles, and verify truck driver com-
pliance with work and traffic rules. In some cases, these logs have provided
valuable data in analyzing the causes of accidents, but controversy continues
about who owns the data and what rights manufacturers, operators, insurance
companies, journalists, and police have to gain access.
Industrial robots are another application area for audit trails, to promote
safety and reduce deaths in manufacturing applications. Industry groups such
as the Robotic Industries Association, now transformed into the Association for
Advancing Automation, have promoted voluntary safety standards and some
forms of auditing since 1986.7
Audit trails for stock market trading algorithms are now widely used to log
trades so that managers, customers, and the US Securities and Exchange Com-
mission can study errors, detect fraud, or recover from flash crash events.8
Other audit trails from healthcare, cybersecurity, and environmental monitor-
ing enrich the examples from which improved audit trails can be tuned to the
needs of HCAI applications.
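To make this concrete, here is a minimal sketch in Python of what a tamper-evident audit trail record for an HCAI system might contain; the field names and the hash-chaining scheme are illustrative assumptions rather than any standard, but they capture the items discussed in this chapter: who acted, when, with which model, code version, and training data.

    # Minimal sketch of a tamper-evident audit trail for an HCAI system.
    # Field names and the hash-chaining scheme are illustrative assumptions.
    import hashlib
    import json
    from dataclasses import dataclass, asdict, field
    from datetime import datetime, timezone

    @dataclass
    class AuditRecord:
        actor: str             # who initiated the action (person or service)
        action: str            # what happened, e.g. "loan_decision"
        model_version: str     # identifier of the deployed model
        code_version: str      # e.g. a git commit hash
        training_data_id: str  # version tag or hash of the training data
        inputs: dict           # inputs to the decision (possibly redacted)
        outputs: dict          # decision, score, explanation shown
        timestamp: str = field(
            default_factory=lambda: datetime.now(timezone.utc).isoformat())

    class AuditTrail:
        """Append-only log; each entry stores a hash of the previous entry,
        so later tampering with any record breaks the chain."""
        def __init__(self):
            self.entries = []
            self._last_hash = "0" * 64

        def append(self, record: AuditRecord) -> str:
            payload = json.dumps(asdict(record), sort_keys=True)
            entry_hash = hashlib.sha256(
                (self._last_hash + payload).encode()).hexdigest()
            self.entries.append({"record": asdict(record),
                                 "prev_hash": self._last_hash,
                                 "hash": entry_hash})
            self._last_hash = entry_hash
            return entry_hash

    trail = AuditTrail()
    trail.append(AuditRecord(
        actor="loan-service",
        action="loan_decision",
        model_version="credit-model-2.3",
        code_version="9f2c1ab",
        training_data_id="training-data-v5",
        inputs={"amount": 375000, "income": 7000},
        outputs={"decision": "declined", "score": 0.42}))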
Challenging research questions remain, such as what data are needed for effective retrospective forensic analysis and how to efficiently capture and store high volumes of video, sound, and light detection and ranging (LIDAR) data, with proper encryption to prevent falsification. Logs should also record the machine learning algorithms used, the code version, and the associated training data at the time of an incident. Further questions concern how to analyze the large volume of data in these logs. Issues of privacy and security complicate the design, as do legal issues such as who owns the data and
Workflows for all these tasks require expanded efforts with user requirements
gathering, data collection, cleaning, and labeling, with use of visualization and
data analytics to understand abnormal distributions, errors and missing data,
clusters, gaps, and anomalies. Then model training and evaluation becomes a
multistep process that starts with early in-house testing and proceeds through deployment and maintenance. Continuous monitoring of deployed systems is needed
to respond to changing contexts of use and new training data.
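As a small illustration of the data understanding step in such workflows, the following Python sketch (with invented column names and valid ranges) reports missing values, duplicate rows, and out-of-range values that a team might want to visualize before training; it shows the kind of check involved, not a complete workflow.

    # Sketch of simple data-quality checks before model training.
    # Column names and valid ranges are invented for illustration.
    import pandas as pd

    def data_quality_report(df: pd.DataFrame, valid_ranges: dict) -> dict:
        report = {
            "rows": len(df),
            "missing_per_column": df.isna().sum().to_dict(),
            "duplicate_rows": int(df.duplicated().sum()),
            "out_of_range": {},
        }
        for col, (low, high) in valid_ranges.items():
            bad = df[(df[col] < low) | (df[col] > high)][col]
            report["out_of_range"][col] = int(bad.count())
        return report

    df = pd.DataFrame({
        "age": [34, 51, None, 29, 230],            # 230 is an obvious anomaly
        "income": [52000, 48000, 61000, None, 57000],
    })
    print(data_quality_report(df, {"age": (0, 120), "income": (0, 1_000_000)}))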
Software engineering workflows for HCAI systems will extend AI design
methods to include user experience design so as to ensure that users under-
stand how decisions are made and have recourse when they wish to challenge
a decision. These traditional human–computer interaction methods of user
experience testing and guidelines development are being updated by leading
corporations and researchers to meet the needs of HCAI.11
A Virginia Commonwealth University team proposes a human-centered AI
system lifecycle geared to deliver trustworthy AI by emphasizing fairness, in-
teractive visual user interfaces, and privacy protection through careful data
governance.12 They raise the difficult issue of measuring trustworthiness by
quantitative and qualitative assessment, which we will return to in Chapter 25.
Software engineering workflows have migrated from the waterfall model,
which assumed that there was an orderly linear lifecycle, starting from require-
ments gathering and moving to design, implementation, testing, documenta-
tion, and deployment. The waterfall model, which may be appropriate when
the requirements are well understood, is easy to manage, but can result in mas-
sive failures when delivered software systems are rejected by users. Rejections
may be because requirements gathered a year ago at the start of a project are
no longer adequate or because developers failed to test prototypes and early
implementations with users.
The newer workflows are based on the lean and agile models, with variants
such as scrum, in which teams work with the customers throughout the lifecy-
cle, learning about user needs (even as they change), iteratively building and
testing prototypes, then discarding early ideas as refinements are developed,
and always being ready to try something new. Agile teams work in one- to two-
week sprints, intense times when big changes are needed. The agile model builds
in continuous feedback to ensure progress towards an effective system, so as to
avoid big surprises.
Agile models demand strong collaboration among developers to share
knowledge about each other’s work, so they can discuss possible solutions and
help when needed. Waterfall projects may deliver a complete system after a year
of work, while agile projects could produce a prototype in a month. IBM en-
courages agile approaches to AI projects, because developers have to keep an
open mind and explore alternatives more than in traditional projects.13
The Manifesto for Agile Software Development, first developed in 2001 by a
group of seventeen people calling themselves the Agile Alliance, is based on
twelve principles.14 I’ve rephrased them for consistency:
Supervised learning: a type of machine learning that learns from training data with labels
as learning targets . . . Unsupervised learning: a learning methodology that learns from
training data without labels and relies on understanding the data itself. Reinforcement
learning: a type of machine learning where the data are in the form of sequences of
actions, observations, and rewards.19
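These three paradigms can be illustrated with a toy Python sketch; scikit-learn stands in for the first two and a hand-rolled bandit loop for reinforcement learning, so this is only an illustration of the distinctions, not a recommendation of particular algorithms.

    # Toy illustrations of the three learning paradigms quoted above.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)

    # Supervised learning: training data with labels as learning targets.
    X = rng.normal(size=(100, 2))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)          # labels
    clf = LogisticRegression().fit(X, y)
    print("supervised accuracy:", clf.score(X, y))

    # Unsupervised learning: no labels; structure comes from the data itself.
    clusters = KMeans(n_clusters=2, n_init=10).fit_predict(X)
    print("cluster sizes:", np.bincount(clusters))

    # Reinforcement learning (toy bandit): sequences of actions and rewards.
    true_payoffs = [0.3, 0.7]                        # unknown to the learner
    estimates, counts = [0.0, 0.0], [0, 0]
    for t in range(500):
        a = int(np.argmax(estimates)) if rng.random() > 0.1 else int(rng.integers(2))
        reward = float(rng.random() < true_payoffs[a])
        counts[a] += 1
        estimates[a] += (reward - estimates[a]) / counts[a]
    print("estimated payoffs:", [round(e, 2) for e in estimates])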
In an undirected network, the shortest path from a to b should be the same as the one from b to a. Similarly, decreasing the amount requested on a mortgage application should never cause the rejection of someone who was approved for a larger amount. For
e-commerce recommenders, changing the maximum price for a product from
$100 to $60 should produce a subset of products. As with differential testing,
the test team does not have to create expected results, since they are generated
by the software. Applying metamorphic testing for training data sets is possible
by adding or deleting records that should not change the results.
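A metamorphic test for the mortgage and e-commerce properties just described might look like the following Python sketch; the approve and recommend functions are stand-ins for the system under test, and the point is that relations between outputs are checked without specifying any expected outputs.

    # Metamorphic tests: check relations between outputs, not expected values.
    # approve() and recommend() are stand-ins for the system under test.

    def approve(amount, income, assets):
        return amount <= 4 * 12 * income + 0.5 * assets   # toy decision rule

    def recommend(products, max_price):
        return {name for name, price in products.items() if price <= max_price}

    def test_lower_amount_never_flips_approval(income=7000, assets=48000):
        # If a larger request is approved, every smaller request must be too.
        for amount in range(50_000, 600_000, 10_000):
            if approve(amount, income, assets):
                assert approve(amount - 10_000, income, assets), \
                    f"approved at {amount} but rejected at {amount - 10_000}"

    def test_tighter_price_filter_yields_subset():
        products = {"a": 95, "b": 59, "c": 42}
        assert recommend(products, 60) <= recommend(products, 100)  # subset relation

    test_lower_amount_never_flips_approval()
    test_tighter_price_filter_yields_subset()
    print("metamorphic properties hold for the stand-in functions")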
User experience For HCAI applications that involve users, such as mortgage, parole, or job interview applications, user testing is needed to verify that users can work with the system and get meaningful explanations. User testing is conducted by giving standard tasks to five to twenty-five users, who are asked to think aloud as they work through the tasks, typically in 30–120 minutes, explaining what they see, think, and do. The testing team records user comments
and performance to generate a report about common problems, sometimes
including suggested fixes. User testing is a practical approach used in system
development to detect problems that users report. It is different from research-
based controlled experiments that test alternate hypotheses to prove statistically
significant differences between two or more designs.
Cataloging the ways red teams could attack HCAI systems and data sets would be a helpful
guide to developers about how to protect their systems. As a start, software
engineers could individually catalog potential attacks and then combine their
results with other team members. Comparisons with other teams could lead to
further ideas of potential attacks. MITRE Corporation has begun a project to
make such a catalog of AI failures.26
For all testing techniques, the history of testing should be recorded to en-
able reconstruction and document how repairs were made and by whom.27
Microsoft’s Datasheets for Datasets is a template to document data used in ma-
chine learning. It contains sections on the motivation and process for collecting
and cleaning the data, who has used the data set, and contact information for
the data curator.28 This positive step quickly propagated to be used by many
software engineering teams, and encouraged Google’s Model Cards template
for model reporting.29 Lessons from database systems30 and information visu-
alization31 about tracking provenance of data and histories of testing are also
useful. These documentation strategies all contribute to transforming software
engineering practices from early stage research to more mature professional
practices.
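A team adopting this documentation practice might start from a skeleton such as the Python sketch below; the section names loosely follow the items mentioned here (motivation, collection and cleaning, prior uses, and curator contact), and the exact fields are illustrative assumptions rather than the official template.

    # Skeleton of a dataset datasheet, loosely following the sections named in the text.
    # The exact fields are illustrative assumptions, not the official template.
    datasheet = {
        "motivation": {
            "purpose": "Why was the dataset created, and for which task?",
            "funded_by": "Who created and funded the collection?",
        },
        "collection_process": {
            "how_collected": "Sensors, surveys, scraping, or manual entry?",
            "sampling": "Is it a sample? If so, of what population and how drawn?",
            "time_frame": "Over what period were the data collected?",
        },
        "cleaning_and_labeling": {
            "preprocessing": "What cleaning, filtering, or labeling was done?",
            "raw_data_kept": "Is the raw data preserved alongside the cleaned set?",
        },
        "uses": {
            "prior_uses": "Who has used the dataset already, and for what?",
            "unsuitable_uses": "For which tasks should the dataset not be used?",
        },
        "maintenance": {
            "curator_contact": "Who maintains the dataset, and how can they be reached?",
            "update_policy": "Will it be updated, corrected, or retired, and on what schedule?",
        },
    }

    def completeness(sheet: dict) -> float:
        """Fraction of datasheet prompts that have been replaced with answers."""
        items = [v for section in sheet.values() for v in section.values()]
        answered = [v for v in items if not v.endswith("?")]
        return len(answered) / len(items)

    print(f"{completeness(datasheet):.0%} of datasheet questions answered")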
For mobile robotic devices, deadly weapons, and medical devices, which could inadvertently harm nearby humans, special care is needed during testing. Metrics for “safe operation, task completion, time to complete the task,
quality, and quantity of tasks completed” will guide development.32 Mature
application areas such as aviation, medical devices, and automobiles, with a
long history of benchmark tests for product certification, provide good models
for newer products and services. Verifying and validating HCAI systems' accuracy, correctness, usability, and vulnerability is important, but in addition, since many applications deal with sensitive decisions that have consequences for people's lives, bias testing is needed to enhance fairness.
Cathy O'Neil's book Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy raises questions about how algorithms became dangerous when they had three properties:33
• Opacity: the algorithms are complex and hidden from view, making it
hard to challenge decisions,
• Scale: the system is used by large companies and governments for major
applications, and
• Harm: the algorithms could produce unfair treatment that impact peo-
ple’s lives.
outside the team will also need to monitor performance and review reports
(see Chapter 20 on safety culture management practices).
These constructive steps are a positive sign, but the persistence of bias re-
mains a problem as applications such as facial recognition become more widely
used for police work and commercial applications.45 Simple bias tests for gen-
der, race, age, etc. were helpful in building more accurate face databases, but
problems remained when the databases were studied for intersections, such as
black women.46 Presenting these results in refereed publications and in widely
seen media can pressure the HCAI systems builders to make changes that
improve performance.
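A first step toward the intersectional analysis described above is to disaggregate an accuracy metric by single attributes and by their combinations, as in the Python sketch below; the data here are invented, and serious bias audits use carefully constructed benchmarks and multiple fairness metrics.

    # Disaggregated accuracy by single attributes and their intersections.
    # The data are invented; real audits use curated benchmarks.
    import pandas as pd

    results = pd.DataFrame({
        "gender":  ["F", "F", "M", "M", "F", "M", "F", "M"],
        "skin":    ["dark", "light", "dark", "light", "dark", "dark", "light", "light"],
        "correct": [0, 1, 1, 1, 0, 1, 1, 1],   # 1 = classifier was right
    })

    def disaggregate(df, by):
        return df.groupby(by)["correct"].agg(accuracy="mean", n="count")

    print(disaggregate(results, "gender"))            # one attribute at a time
    print(disaggregate(results, "skin"))
    print(disaggregate(results, ["gender", "skin"]))  # intersections, e.g. dark-skinned women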
MIT’s Joy Buolamwini, who founded the Algorithmic Justice League (see
Chapter 21’s Appendix A), was able to show gender and racial bias in facial
recognition systems from Microsoft, IBM, and Amazon, which she presented
in compelling ways through her high-energy public talks, sharply written op-
eds, and theatrical videos.47 Her efforts, with collaborator Timnit Gebru, led
to improvements and then corporate withdrawal of facial recognition prod-
ucts from police departments, when evidence of excessive use of force became
widespread in spring 2020.48 Their efforts were featured in the full-length April
2021 documentary Coded Bias, which has drawn widespread interest.49
Ethics issues erupted into a public scandal when Google fired Timnit Gebru, who co-led its Ethical AI team, triggering support for her from thousands of
Google employees and others. The controversy included Gebru’s outspoken
stance about the low level of female and minority hiring at Google, which she
suggests is related to deficiencies in understanding bias. Effective bias testing
for machine learning training data is one contribution to changing the long
history of systemic bias in treatment of minorities in many countries.50
The bias in algorithms is sometimes obvious, as in a Google Images search for “professional hair” (Figure 19.2a) that shows mostly light-skinned women, notably different from the search for “unprofessional hair” (Figure 19.2b) that shows mostly dark-skinned women. These examples show how existing
biases can propagate, unless designers intervene to reduce them.
The question of bias is vital to many communities that have suffered from
colonial oppression, including Indigenous people around the world. They of-
ten have common shared values that emphasize relationships within their local
context, foregrounding their environment, culture, kinship, and community.
Some in Indigenous communities question the rational approaches of AI, while
favoring empirical ways of knowing tied to the intrinsically cultural nature of all
computational technology: “Indigenous kinship protocols can point us towards
Fig 19.2 (a) Google Search for “Professional hair” shows mostly light-skinned women. (b) Google Search for “Unprofessional hair” shows mostly dark-skinned women.
questions from those who are affected. To satisfy these legitimate needs, sys-
tems must provide comprehensible explanations that enable people to know
what they need to change or whether they should challenge the decision. Fur-
thermore, explanations have become a legal requirement in many countries
based on the European Union’s General Data Protection Regulation (GDPR)
requirement of a “right to explanation.”53
This controversial GDPR requirement is vague and difficult to satisfy in
general, but international research efforts to develop explainable AI have blos-
somed.54 A useful and practical resource is the set of three reports on “Explaining
Decisions Made with AI” from the UK Information Commissioner’s Office and
the Alan Turing Institute.55 The three reports cover: (1) The basics of explain-
ing AI, (2) Explaining AI in practice, and (3) What explaining AI means for
your organization. The first report argues that companies benefit from making
AI explainable: “It can help you comply with the law, build trust with your cus-
tomers and improve your internal governance.”56 The report spells out the need
for explanations that describe the reason for the decision, who is responsible for
the system that made the decision, and the steps taken to make fair decisions.
Beyond that it stipulates that users should be given information about how to
challenge a decision. The second and longest report has extensive discussions
about different kinds of explanations, but it would inspire more confidence if it
showed sample screen designs and user testing results.
Daniel S. Weld and Gagan Bansal from the University of Washington make
a strong case for explainability (sometimes called interpretability or trans-
parency) that goes beyond satisfying users’ desire to understand and the legal
requirements to provide explanations.57 They argue that explainability helps
designers enhance correctness, identify improvements in training data, account
for changing realities, support users in taking control, and increase user ac-
ceptance. An interview study with twenty-two machine learning professionals
documented the value of explainability for developers, testers, managers, and
users.58 A second interview study with twenty-nine professionals emphasized
the need for social and organizational contexts in developing explanations.59
However, explainability methods are only slowly finding their way into widely
used applications and possibly in ways that are different from the research.
The AI research community is learning more about the centuries of relevant social science research, thoughtfully described by Tim Miller from the University of Melbourne, who complains that “most work in explainable artificial
intelligence uses only the researchers’ intuition of what constitutes a ‘good’ ex-
planation.”60 Miller’s broad and deep review of social science approaches and
Fig 19.3 Part of the user interface of a text classification application that shows why a document was classified as being about hockey. Part 1 (important words): this message has more important words about Hockey than about Baseball. Part 2 (folder size): the Baseball folder has more messages (8) than the Hockey folder (7), which makes the computer think each unknown message is 1.1 times more likely to be about Baseball than Hockey. Combined, these yield a 67% probability that this message is about Hockey.
Source: Revised version, based on Kulesza et al., 2015
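The two-part reasoning in this figure is essentially a naive Bayes combination of word evidence with a folder-size prior; the short Python sketch below recreates that structure with illustrative numbers (the folder counts of 7 and 8 come from the figure, while the word likelihood ratio is invented to roughly reproduce the 67% result), so it is a rough reconstruction, not the system's actual arithmetic.

    # Two-part explanation in the style of Figure 19.3: word evidence plus folder-size prior.
    # Message counts come from the figure; the word likelihood ratio is invented.
    def hockey_probability(word_lr_hockey, n_hockey=7, n_baseball=8):
        prior_odds = n_hockey / n_baseball          # folder size: 7 vs 8 messages
        posterior_odds = prior_odds * word_lr_hockey
        return posterior_odds / (1 + posterior_odds)

    print(f"Folder sizes alone make Baseball {8/7:.1f}x more likely.")
    p = hockey_probability(word_lr_hockey=2.3)      # this message's words favor Hockey
    print(f"Combined with word evidence: {p:.0%} probability the message is about Hockey")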
These AI-based designs give users choices to select from before they initiate ac-
tion, as in spelling correctors, text message autocompletion, and search query
completion (see Figures 9.3 and 9.4). The same principle was productively ap-
plied by University of Colorado professor Daniel Szafir and his collaborators for
robot operation. They showed that previews of the path and goals for intended
actions of a dexterous robot arm resulted in improved task completion and
increased satisfaction.71 For robots, the human control design principle for predictability might be the second pattern in Figure 9.2: preview first, select and initiate, then manage execution.
Navigation systems adhere to the predictability principle when they apply
AI-based algorithms to find routes based on current traffic data. Users are
given two to four choices of routes with estimated times for driving, biking,
walking, and public transportation, from which they select the one they want.
Then this supertool provides visual, textual, and speech-generated instructions
(Figure 19.4).
In addition to AI-based textual, robot, and navigation user interfaces, simi-
lar prospective (or ante-hoc) methods can be used in recommender systems by
offering exploratory user interfaces that enable users to probe the algorithm
boundaries with different inputs. Figures 19.5a and 19.5b show a post-hoc
explanation for a mortgage rejection, which is good, but could be improved.
Figure 19.5c shows a prospective exploratory user interface that enables users
to investigate how their choices affect the outcome, thereby reducing the need
for explanations.
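In a notebook setting, an exploratory interface in the style of Figure 19.5c can be sketched with a few sliders; the scoring rule below is a stand-in for whatever model a lender actually uses, and ipywidgets is just one convenient toolkit, so this is a sketch of the interaction pattern rather than a product design.

    # Sketch of a slider-based exploratory interface (Figure 19.5c style).
    # The scoring rule is a stand-in; ipywidgets is one convenient toolkit.
    from ipywidgets import interact, IntSlider

    def outcome_score(mortgage_amount, monthly_income, liquid_assets):
        # Toy rule: affordability relative to income plus an asset cushion.
        capacity = 4 * 12 * monthly_income + 0.5 * liquid_assets
        return min(1.0, capacity / mortgage_amount)

    def show_outcome(mortgage_amount, monthly_income, liquid_assets):
        score = outcome_score(mortgage_amount, monthly_income, liquid_assets)
        verdict = "likely approved" if score >= 1 else "likely declined"
        print(f"Outcome score: {score:.2f} -> {verdict}")

    interact(show_outcome,
             mortgage_amount=IntSlider(min=50_000, max=600_000, step=5_000, value=375_000),
             monthly_income=IntSlider(min=1_000, max=20_000, step=500, value=7_000),
             liquid_assets=IntSlider(min=0, max=200_000, step=1_000, value=48_000))

Moving any slider immediately recomputes the outcome score, so users can probe the sensitivity of the decision to each input without requesting a separate explanation.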
In general, prospective exploratory user interfaces are welcomed by users
who spend more time developing an understanding of the sensitivity of vari-
ables and digging more deeply into aspects that interest them, leading to
greater satisfaction and compliance with recommendations.72 Further gains
come from enabling adaptable user interfaces to fit different needs and per-
sonalities.73
For complex decisions, Fred Hohman, now an Apple researcher, showed that
user interfaces and data analytics could clarify which features in a machine
learning training data set are the most relevant.74 His methods, developed as
part of his doctoral work at Georgia Tech, also worked on explanations of
deep learning algorithms for image understanding.75 A Google team of eleven
researchers built an interactive tool to support clinicians in understanding al-
gorithmic decisions about cancers in medical images. Their study with twelve
medical pathologists showed substantial benefits in using this slider-based
Fig 19.4 Navigation system for driving, public transportation, walking, and biking.
The one-hour time estimate is for biking.
exploratory user interface, which “increased the diagnostic utility of the images
found and increased user trust in the algorithm.”76
Interactive HCAI approaches are endorsed by Weld and Bansal, who recom-
mend that designers should “make the explanation system interactive so users
can drill down until they are satisfied with their understanding.”77 Exploration
Fig 19.5 Mortgage loan explanations. (a) is a post-hoc explanation: a dialog box with three fill-in fields (mortgage amount requested, household monthly income, and liquid assets) and a submit button. (b) shows what happens after clicking the submit button: users get a brief text explanation, but insufficient guidance for next steps. (c) shows an exploratory user interface that enables users to try multiple alternatives rapidly, with three sliders to see the impact of changes on the outcome score.
works best when the user inputs are actionable; that is, users have control and
can change the inputs. Alternative designs are needed when users do not have
control over the input values or when the input is from sensors such as in im-
age and face recognition applications. For the greatest benefit, exploratory user
interfaces should support accessibility by users with visual, hearing, motor, or
cognitive disabilities.
The benefits of giving users more control over their work were demonstrated in a 1996 study of search user interfaces by Rutgers University professor Nick
Belkin and graduate student Juergen Koenemann. They summarize the payoffs
from exploratory interaction in their study of sixty-four participants who, they
report, “used our system and interface quite effectively and very few usability
problems . . . Users clearly benefited from the opportunity to revise queries in
an iterative process.”78
Supportive results about the benefits of interactive visual user interfaces
come from a study of news recommenders in which users were able to move
sliders to indicate their interest in politics, sports, or entertainment. As they
moved the sliders the recommendation list changed to suggest new items.79
Related studies added one more important insight: when users have more con-
trol, they are more likely to click on a recommendation.80 Maybe being in
control makes them more willing to follow a recommendation because they
feel they discovered it, or maybe the recommendations are actually better.
In addition to the distinctions between intrinsically understandable models,
post-hoc, and prospective explanations, Mengnan Du, Ninghao Liu, and Xia
Hu follow other researchers in distinguishing between global explanations that
give an overview of what the algorithm does and local explanations that deal
with specific outcomes, such as why a prisoner is denied parole or a patient re-
ceives a certain treatment recommendation.81 Local explanations support user
comprehension and future actions, such as a prisoner who is told that they
could be paroled after four months of good behavior or the patient who is told
that if they lost ten pounds they would be eligible for a non-surgical treatment.
These are actionable explanations, which are suggestions for changes that can
be accomplished, rather than being told that if you were younger the results
would be different.
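One simple way to generate actionable local explanations is a counterfactual search restricted to the features a person can actually change; the Python sketch below uses a toy scoring rule and a hand-picked set of actionable features, so it illustrates the idea rather than any particular deployed method.

    # Toy counterfactual search restricted to actionable features.
    # The scoring rule and feature set are illustrative assumptions.

    def approve(applicant):
        # Stand-in decision rule (not a real underwriting model).
        return applicant["income"] * 48 + applicant["assets"] * 0.5 >= applicant["amount"]

    def actionable_explanation(applicant, actionable_steps, max_steps=10):
        """Search small changes to actionable features that flip the decision."""
        if approve(applicant):
            return "already approved"
        for n in range(1, max_steps + 1):
            for feature, step in actionable_steps.items():
                trial = dict(applicant)
                trial[feature] += n * step       # e.g. lower the amount, raise assets
                if approve(trial):
                    return f"change {feature} from {applicant[feature]} to {trial[feature]}"
        return "no small actionable change found"

    applicant = {"amount": 375_000, "income": 7_000, "assets": 48_000, "age": 52}
    steps = {"amount": -10_000, "assets": +5_000}   # age is excluded: not actionable
    print(actionable_explanation(applicant, steps))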
An important statement about who would value explainable systems was in
a US White House memorandum.82 It reminded developers that “transparency
and disclosure can increase public trust and confidence in AI applications” and
then stressed that good explanations would allow “non-experts to understand
how an AI application works and technical experts to understand the process
by which AI made a given decision.”
Increased user control through visual user interfaces is apparent in rec-
ommender systems that offer more transparent approaches, especially for
consequential medical or career choices.83 Our research team, led by Univer-
sity of Maryland doctoral student Fan Du, developed a visual user interface to
allow cancer patients to make consequential decisions about treatment plans
based on finding other “patients like me.”
The goal was to enable users to see how similar patients fared in choos-
ing chemotherapy, surgery, or radiation. But medical data were hard to obtain
because of privacy protections, so our study used a related context. We tested
eighteen participants making educational choices, such as courses, internships,
or research projects, to achieve their goals, such as an industry job or gradu-
ate studies (Figure 19.6). The participants wanted to choose students who were
similar to them by gender, degree program, and major, as well as having taken
similar courses. The results showed that they took longer when they had control
over the recommender system, but they were more likely to understand and
Fig 19.6 Visual user interface to enable users to find people who have similar past histories.
Source: Du et al., 2019
Fig 19.7 Simple sliders let users control the music recommender system by moving sliders
for acousticness, instrumentalness, danceability, valence, and energy.
Source: Component from Millecamp et al., 2018
the attributes, they may better understand why particular songs are being
recommended.”87
Another example is in the OECD’s Better Life Index website,88 which rates
nations according to eleven topics such as housing, income, jobs, community,
and education (Figure 19.8). Users move sliders to indicate which topics are
more or less important to them. As they make changes the list gracefully up-
dates with a smoothly animated bar chart so they can see which countries most
closely fulfill their preferences.
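The computation underneath such a slider-driven index is a weighted sum that is re-ranked whenever a weight changes; the Python sketch below uses made-up scores for three countries and three topics to show the mechanism, whereas the real index has eleven normalized topics.

    # Re-ranking countries by a user-weighted sum of topic scores.
    # Scores and weights are made up; the real index uses eleven normalized topics.
    scores = {
        "Country A": {"housing": 7.1, "income": 5.4, "education": 8.0},
        "Country B": {"housing": 6.2, "income": 8.1, "education": 6.5},
        "Country C": {"housing": 8.3, "income": 4.9, "education": 7.2},
    }

    def rank(weights):
        totals = {c: sum(weights[t] * v for t, v in topics.items())
                  for c, topics in scores.items()}
        return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

    print(rank({"housing": 1, "income": 1, "education": 1}))   # equal weights
    print(rank({"housing": 3, "income": 0, "education": 1}))   # housing matters most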
Yes, there are many users who don’t want to be bothered with making
choices, so they prefer the fully automatic recommendations, even if the
recommendations are not as well tuned to their needs. This is especially
true with discretionary applications such as movie, book, or restaurant rec-
ommendations, but the desire to take control grows with consequential and
especially with life-critical decisions in which professionals are responsible for
the outcomes.
There are many similar studies in e-commerce, movie, and other recommender systems, but I was charmed by a simple yet innovative website to get
Fig 19.8 The OECD’s Better Life Index with eleven sliders.
Source: http://www.oecdbetterlifeindex.org/
recommendations for novels to read. Users could select four of the twelve
sliders to choose attributes such as funny/serious, beautiful/disgusting, or opti-
mistic/bleak (Figure 19.9). As they move the sliders, the cover images of books
would come and go on the right side. A click on the cover image produced a
short description helping users to decide on what to read.89 Other controls in-
clude a world map for location (which even shows imaginary locations) and
check boxes for type of story (conflict, quest, revelation, etc.), race, age, gender,
and sexual preference. I liked it—give it a try!
In summary, while post-hoc explanations to help confused users are a good
idea and may be needed in some cases, a better approach may be to prevent or
reduce the need for explanations. This idea of preventing rather than treating
the disease is possible with prospective visual user interfaces that let users ex-
plore possibilities. Visual user interfaces help users understand the dimensions
Fig 19.9 Recommender system for novels based on attributes of the books.
As users move the sliders the book covers appear or disappear.
Source: https://www.whichbook.net/
Skeptics of control panels point out that very few users learn to use the existing ones, but well-designed controls, such as those in automobiles to adjust seats, mirrors, lighting, sound, and temperature, have gone from being a competitive advantage to required features. Instead of control panels, Jonathan
Stray and colleagues at the Partnership on AI emphasize strategies to learn
from users about what is needed to align automated recommendations more
closely with their needs and well-being.91 Newer designs which fit the many
kinds of problems that HCAI serves are possible—they just require a little more
imagination.
CHAPTER 20

While every organization wants to perform flawlessly in all circumstances, the harsh reality is that pandemics overwhelm public
health, nuclear power station failures trigger regional devastation,
and terrorists threaten entire nations. Past accidents and incidents often had
narrow impacts, but today’s failures of massive technology-based interdepen-
dent organizations in globalized economies can have devastating effects for the
health and economies of entire cities, regions, and continents. Preparing for
failure by organizational design has become a major HCAI theme, with at least
four approaches:
highlighted by the case of three top executives of a firm who died in a small
plane crash en route to a meeting—maybe organizations should ensure that
key personnel fly on separate planes? Perrow’s work is extended by psycholo-
gist Gary Klein and criticized by sociologist Andrew Hopkins who has pointed
to the lack of metrics for the two dangers: tight coupling and insufficient re-
dundancy.2 Other critics found Perrow’s belief in the inevitability of failure in
complex organizations to be unreasonably pessimistic.
These four approaches share the goal of ensuring safe, uninterrupted perfor-
mance by preparing organizations to cope with failures and near misses (see
Figure 18.2).10 Building on the safety culture approach, my distillation em-
phasizes the ways managers can support HCAI: (1) leadership commitment to
safety, (2) hiring and training oriented to safety, (3) extensive reporting of fail-
ures and near misses, (4) internal review boards for problems and future plans,
and (5) alignment with industry standards and accepted best practices.
Fig 20.1 US Food and Drug Administration Voluntary Reporting Form invites adverse
event reports from health professionals and consumers.
Source: https://www.accessdata.fda.gov/scripts/medwatch/index.cfm
and useful data set. The Public Dashboard presents information on the grow-
ing number of reports each year, exceeding 2 million in 2018, 2019, and 2020
(Figure 20.2).
Another US FDA reporting system, Manufacturer and User Facility De-
vice Experience (MAUDE), captures adverse events in use of robotic surgery
systems. Drawing on this data, a detailed review of 10,624 reports covering
2000–2013 reported on 144 deaths, 1,391 patient injuries, and 8,061 device
malfunctions.28 The report on these alarming outcomes recommends “im-
proved human-machine interfaces and surgical simulators that train surgical
teams for handling technical problems and assess their actions in real-time
during the surgery.” The conclusion stresses the value of “surgical team train-
ing, advanced human machine interfaces, improved accident investigation and
reporting mechanisms, and safety-based design techniques.”
For cybersecurity problems which threaten networks, hardware, and soft-
ware, public reporting systems have also proven to be valuable. The MITRE
Corporation, a company funded to work for the US government, has been keep-
ing a list of common vulnerabilities and exposures since 1999, with more than
150,000 entries.29 MITRE works with the US National Institute of Standards and Technology to maintain the National Vulnerability Database,30 which helps
software developers understand the weaknesses in their programs. This allows
more rapid repairs, coordination among those with common interests, and ef-
forts to prevent vulnerabilities in future products and services. All of these open
Fig 20.2 US Food and Drug Administration Adverse Event Reporting System (FAERS)
Public Dashboard. Data as of December 31, 2020.
Source: https://fis.fda.gov/sense/app/d10be6bb-494e-4cd2-82e4-0135608ddc13/sheet/7a47a261-d58b-4203-a8aa-6d3021737452/state/analysis
reporting systems are good models to follow for HCAI failure and near miss
reports.
In software engineering, code development environments, such as GitHub,
record the author of every line of code and document who made changes.31
GitHub claims to be used by more than 56 million developers in more than
3 million organizations. Then, when systems are in operation, bug reporting
tools, such as freely available Bugzilla, guide project teams to frequent and se-
rious bugs with a tracking system for recording resolution and testing.32 Fixing
users’ problems promptly prevents other users from encountering the same
problem. These tools are typically used by members of software engineering
teams but inviting reports of problems from users is another opportunity.
The cybersecurity field has a long-standing practice of paying for vulner-
ability reports that could be adapted for HCAI systems. Bug bounties could
be paid for individuals who report problems with HCAI systems, but the idea
could be extended to bias bounties for those who report biased performance.33
This crowdsourcing idea has been used by companies such as Google, which
has paid from $100 to more than $30,000 per validated report for a total of
more than $3 million. At least two companies make a business of managing
such systems for clients: HackerOne34 and BugCrowd.35 Validating this idea in
the HCAI context requires policies about how much is paid, how reports are
evaluated, and how much information about the bug and bias reports are pub-
licly disclosed.36 These crowdsourced ideas build on software developer Eric Raymond's belief that “given enough eyeballs, all bugs are shallow,” suggesting that it is valuable to engage more people in finding bugs.
Bug reporting is easier for interactive systems with comprehensible status
displays than for highly automated systems without displays, such as elevators,
manufacturing equipment, or self-driving cars. For example, I’ve regularly had
problems with Internet connectivity at my home, but the lack of adequate user
interfaces makes it very difficult for me to isolate and report the problems I’ve
had. When my Internet connection drops, it is hard to tell if it was a problem
on my laptop, my wireless connection to the router/modem, or the Internet
service provider. I wish there were a status display and control panel so I could fix the problem or know whom to call. Server hosting company Cloudflare provides this information for its professional clients. Like many users, the best I can do is to reboot everything and hope that in ten to fifteen minutes I can resume work.
Another model to follow is the US Army’s method of after-action reviews,
which have also been used in healthcare, transportation, industrial process con-
trol, environmental monitoring, and firefighting, so they might be useful for
studying HCAI failures and near misses.37 Investigators try to understand what
was supposed to happen, what actually happened, and what could be done bet-
ter in the future. A complete report that describes what went well and what
could be improved will encourage acceptance of recommendations. As After-
Action Review participants gain familiarity with the process, their analyses are
likely to improve and so will the acceptance of their recommendations.
Early efforts in the HCAI community are beginning to collect data on HCAI
incidents. Roman Yampolskiy’s initial set of incidents has been included in
a more ambitious project from the Partnership on AI.38 Sean McGregor de-
scribes the admirable goals and methods used to build the database of more
than 1000 incident reports sourced from popular, trade, and academic publi-
cations.39 Searches can be done by keywords and phrases such as “mortgage”
or “facial recognition,” but a thoughtful report on these incidents remains to
be done. Still, this is an important project, which could help in efforts to make
more reliable, safe, and trustworthy systems.
Another important project, but more narrowly focused, is run by Karl
Hansen, who collects public reports on deaths involving Tesla cars. He was a
special agent with the US Army Criminal Investigation Command, Protective
Services Battalion, who was hired by Tesla as their internal investigator in 2018.
He claims to have been wrongfully fired by Tesla for his public reporting of
deaths involving Tesla cars.40 The 209 deaths as of September 2021 are far more than most people would expect from what is often presented to the public. These re-
ports are incomplete so it is difficult to determine what happened in each case
or whether the Autopilot self-driving system was in operation. In August 2021,
the US National Highway Traffic Safety Administration launched an
investigation of eleven crashes of Tesla cars on autopilot that hit first responders
on the road or at the roadside.
Other concerns come from 122 sudden unintended acceleration (SUA)
events involving Teslas that were reported to the US National Highway Traf-
fic Safety Administration by January 2020.41 A typical report reads: “my wife
was slowly approaching our garage door waiting for the garage door to open
when the car suddenly lurched forward . . . destroying the garage doors . . . the
car eventually stopped when it hit the concrete wall of the garage.” Another
report describes two sudden accelerations and ends by saying “fortunately no
collision occurred, but we are scared now.” Tesla claims that its investigations
showed that the vehicle functioned properly, but that every incident was caused
by drivers stepping on the accelerator.42 However, shouldn’t a safety-first car
prevent such collisions with garage doors, walls, or other vehicles?
1) Scoping: identify the scope of the project and the audit; raise questions
of risk.
2) Mapping: create stakeholder map and collaborator contact list; conduct
interviews and select metrics.
3) Artifact collection: document design process, data sets, and machine
learning models.
4) Testing: conduct adversarial testing to probe edge cases and failure
possibilities.
5) Reflection: consider risk analysis, failure remediation, and record design
history.
Fig 20.3 Characteristics of maturity levels: five levels in the Capability Maturity Model:
(1) Initial, (2) Managed, (3) Defined, (4) Quantitatively Managed, and (5) Optimizing.
set process improvement goals and priorities, provide guidance for quality processes, and provide a point of reference for appraising current processes.”59
The Capability Maturity Model Integration is a guide to software engineering organizational processes, with five levels of maturity starting from Level 1, in which processes are unpredictable, poorly controlled across groups, and reactive to
problems. Higher levels define orderly software development processes with
detailed metrics for management control and organization-wide discussions of
how to optimize performance and anticipate problems. Training for staff and
management help ensure that the required practices are understood and fol-
lowed. Many US government software development contracts, especially from
defense agencies, stipulate which maturity level is required for bidders, using a
formal appraisal process.
In summary, safety cultures take a strong commitment by industry lead-
ers, supported by personnel, resources, and substantive actions, which are at
odds with the “move fast, break things” ethic of early technology companies.
To succeed, leaders will have to hire safety experts who use rigorous statistical
methods, anticipate problems, appreciate openness, and measure performance.
Other vital strategies are internal reviews and alignment with industry stan-
dards. Getting to a mature stage, where safety is valued as a competitive advantage, will make HCAI technologies increasingly trusted for consequential
and life-critical applications.
Skeptics question whether the Capability Maturity Models lead to top-
heavy management structures, which may slow the popular agile and lean
development methods. Still, proposals for HCAI Capability Maturity Mod-
els are emerging for medical devices, transportation, and cybersecurity.60
The UK Institute for Ethical AI and Machine Learning proposes a Machine
Learning Maturity Model based on hundreds of practical benchmarks, which
cover topics such as data and model assessment processes and explainability
requirements.61
HCAI Capability Maturity Models might be transformed into Trustwor-
thiness Maturity Models (TMM). TMMs might describe Level 1 initial use
of HCAI that is guided by individual team preferences and knowledge, mak-
ing it unpredictable, poorly controlled, and reactive to problems. Level 2 use
might call for uniform staff training in tools and processes, making it more
consistent across teams, while Level 3 might require repeated use of tools and
processes that are reviewed for their efficacy and refined to meet the applica-
tion domain needs and organization style. Assessments would cover testing
CHAPTER 21
Trustworthy Certification by Independent Oversight

The third governance layer is independent oversight by external review
organizations (see Figure 18.2). Even established large companies, government agencies, and other organizations that build consequential HCAI systems are venturing into new territory, so they will face new problems. Therefore, thoughtful independent oversight reviews will be valuable
in achieving trustworthy systems that receive wide public acceptance. How-
ever, designing successful independent oversight structures is still a challenge,
as shown by reports on more than forty variations that have been used in
government, business, universities, non-governmental organizations, and civic
society.1
The key to independent oversight is to support the legal, moral, and ethical
principles of human or organizational responsibility and liability for their
products and services. Responsibility is a complex topic, with nuanced vari-
ations such as legal liability, professional accountability, moral responsibility,
and ethical bias.2 A deeper philosophical discussion of responsibility is useful,
but I assume that humans and organizations are legally liable (responsible) for
the products and services that they create, operate, maintain, or use indirectly.3
The report from the European Union’s Expert Committee on Liability for New
Technologies stresses the importance of clarifying liability for autonomous and
robotic technologies.4 They assert that operators of such technologies are liable
and that products should include audit trails to enable retrospective analy-
ses of failures to assign liability to manufacturer, operator, or maintenance
organizations.
Fig 21.1 Independent oversight methods. Three forms of independent oversight: planning
oversight, continuous monitoring, and retrospective analysis of disasters.
Planning oversight: Proposals for new HCAI systems or major upgrades are presented for review in advance so that feedback and discussion can influence
the plans. Planning oversight is similar to zoning boards, which review propos-
als for new buildings that are to adhere to building codes. A variation is the idea
of algorithmic impact assessments, which are similar to environmental impact
statements that enable stakeholders to discuss plans before implementation.12
Rigorous planning oversight needs to have follow-up reviews to verify that the
plan was followed.
mind and protects . . . drivers and consumers.” The report endorsed the beliefs
that self-driving cars would dramatically increase safety, but damage claims
would increase because of the more costly equipment. Both beliefs influence
risk assessment, the setting of premiums, and profits, as did their forecast that
the number of cars would decrease because of more shared usage. This early
report remains relevant because the public still needs data that demonstrate
or refute the idea that self-driving cars are safer. Manufacturers are reluctant
to report what they know and the states and federal government in the United
States have yet to push for open reporting and regulations on self-driving cars.22
The insurance companies will certainly act when self-driving cars move from
demonstration projects to wider consumer use, but earlier interventions could
be more influential.23
Skeptics fear that the insurance companies are more concerned with prof-
its than with protecting public safety and they worry about the difficulty of
pursuing a claim when injured by a self-driving car, mistaken medical recom-
mendation, or biased treatment during job hiring, mortgage approval or parole
assessment. However, as the history of insurance shows, having insurance will
benefit many people in their difficult moments of loss. Developing realistic insurance for the damages caused by HCAI systems is a worthy goal.
Other approaches are to create no-fault insurance programs or victim
compensation funds, in which industry or government provide funds to an
independent review board that pays injured parties promptly without the com-
plexity and cost of judicial processes. Examples include the September 11
Victim Compensation Fund for the 2001 terror attack in New York and the
Gulf Coast Claim Facility for the Deepwater Horizon oil spill. Proposals for
novel forms of compensation for HCAI system failures have been made, but
none have yet gained widespread support.
There are also numerous research labs and educational programs devoted to
understanding the long-term impact of AI and exploring ways to ensure it
is beneficial for humanity. The challenge for these organizations is to build
on their strength in research by bridging to practice, so as to promote better
software engineering processes, organizational management strategies, and in-
dependent oversight methods. University–industry–government partnerships
could be a strong pathway for influential actions.
Responsible industry leaders have repeatedly expressed their desire to con-
duct research and use HCAI in safe and effective ways. Microsoft’s CEO Satya
Nadella proposed six principles for responsible use of advanced technologies.26
He wrote that artificially intelligent systems must:
Similarly, Google’s CEO Sundar Pichai offered seven objectives for artificial
intelligence applications that became core beliefs for the entire company:27
• Be socially beneficial.
• Avoid creating or reinforcing unfair bias.
Professional Organizations and Research Institutes Working on HCAI

There are hundreds of organizations in this category, so this brief listing only samples some of the prominent ones.
Algorithmic Justice League, which stems from MIT and Emory University, seeks
to lead “a cultural movement towards equitable and accountable AI.” The
League combines “art and research to illuminate the social implications and
harms of AI.” With funding from large foundations and individuals it has
done influential work on demonstrating bias, especially for face recogni-
tion systems. Its work has productively led to algorithmic and training data
improvements in leading corporate systems.32
AI Now Institute at New York University “is an interdisciplinary research
center dedicated to understanding the social implications of artificial intel-
ligence.” This institute emphasizes “four core domains: Rights & Liberties,
Labor & Automation, Bias & Inclusion, Safety & Critical Infrastructure.” It
supports research, symposia, and workshops to educate and examine “the
social implications of AI.”33
Data and Society, an independent New York-based non-profit that “studies the
social implications of data-centric technologies and automation . . . We pro-
duce original research on topics including AI and automation, the impact of
technology on labor and health, and online disinformation.”34
Foundation for Responsible Robotics is a Netherlands-based group whose
tag line is “accountable innovation for the humans behind the robots.” Its
mission is “to shape a future of responsible (AI-based) robotics design, de-
velopment, use, regulation, and implementation. We do this by organizing
and hosting events, publishing consultation documents, and through creating
public-private collaborations.”35
AI4ALL, an Oakland, CA-based non-profit works “for a future where diverse
backgrounds, perspectives, and voices unlock AI’s potential to benefit hu-
manity.” It sponsors education projects such as summer institutes in the
United States and Canada for diverse high school and university students,
especially women and minorities to promote AI for social good.36
ForHumanity is a public charity which examines and analyzes the downside
risks associated with AI and automation, such as “their impact on jobs, so-
ciety, our rights and our freedoms.” It believes that independent audit of AI
systems, covering trust, ethics, bias, privacy, and cybersecurity at the corpo-
rate and public-policy levels, is a crucial path to building an infrastructure
of trust. It believes that “if we make safe and responsible artificial intelligence
and automation profitable whilst making dangerous and irresponsible AI and
automation costly, then all of humanity wins.”37