
Model Reporting for Certifiable AI:


A Proposal from Merging EU Regulation into AI
Development

Danilo Brajovic∗ [email protected]


Philipp Wagner∗ [email protected]
Mara Kläb [email protected]
Benjamin Fresz [email protected]
Marco F. Huber [email protected]

University of Stuttgart, Institute of Industrial Manufacturing and Management IFF, and
Fraunhofer Institute for Manufacturing Engineering and Automation IPA
Department Cyber Cognitive Intelligence (CCI)
Nobelstr. 12, 70569 Stuttgart, Germany
Vincent Philipp Goebels∗ [email protected]
Janika Kutz [email protected]
Jens Neuhuettler [email protected]
Fraunhofer Institute for Industrial Engineering IAO
Bildungscampus 9, 74076 Heilbronn, Germany
Niclas Renner∗ [email protected]
University of Stuttgart, Institute of Human Factors and Technology Management IAT
Nobelstr. 12, 70569 Stuttgart, Germany
Martin Biller [email protected]
AUDI AG, Data Analytics N/P1-441
Neckarsulm, Germany

Abstract

Despite large progress in Explainable and Safe AI, practitioners suffer from a lack of
regulation and standards for AI safety. In this work we merge recent regulation efforts by
the European Union and first proposals for AI guidelines with recent trends in research:
data and model cards. We propose the use of standardized cards to document AI appli-
cations throughout the development process. Our main contribution is the introduction of
use-case and operation cards, along with updates for data and model cards to cope with
regulatory requirements. We reference both recent research and the source of the regulation
in our cards and provide references to additional support material and toolboxes
whenever possible. The goal is to design cards that help practitioners develop safe AI
systems throughout the development process, while enabling efficient third-party auditing
of AI applications, being easy to understand, and building trust in the system. Our work
incorporates insights from interviews with certification experts as well as developers and
individuals working with the developed AI applications.

*. Equal contribution. Each author provided one of the cards. First author coordinated writing.


1. Introduction
One key enabler for the application of artificial intelligence (AI) and machine learning (ML)
is the establishment of standards and regulations for practitioners to develop safe and trust-
worthy AI systems. Despite several efforts by different research institutions and governmen-
tal organizations, there are currently no established guidelines. The most influential attempt
stems from the European Union with its AI Act (European Commission, 2021). In this paper
we revisit the state of the art in safeguarding and certifying AI applications with a focus on
the European Union. Nevertheless, our summary is intended to be as broad as possible.
The focus on European legislation stems from the influence it has on regulation and
standardization.1 The result is a structured guideline for reporting AI applications along four
major development steps. The objective of our guideline is to provide comprehensive sup-
port during the development process. It encompasses a synthesis of requirements derived
from both non-technical users and prominent certification and standardization bodies. Our
approach involves the inclusion of robust references to relevant support materials and tool-
boxes, emphasizing the transparency and traceability of each requirement’s source. For this
purpose, we conduct interviews with standardization experts, developers and consumers of
the final AI product and incorporate their feedback into our proposal. Furthermore, our
framework serves as a solid foundation for prospective system certification. By adhering
to our guidance, developers can ensure they are adequately equipped to meet forthcoming
legal obligations within their domain. The key contributions of our work are fourfold:

• Our guideline is meant to be used during the development process.

• We combine requirements from non-technical users as well as certification and standardization bodies.

• We include references to support material and toolboxes and link the source of the
requirement whenever possible.

• We lay the foundations for a future certification of the system. With our framework,
developers should be well-positioned when legal requirements come into force.

1.1 Safeguarding AI, Law, Certification, Standardization, and Audits


Before going into the specifics of AI regulation, we first give a short summary of relevant
terms in EU regulations. The term audit describes the process of testing whether a prod-
uct, service, or process meets certain requirements. The result of such a process can be a
certification, the assurance by a third-party auditor that given requirements are fulfilled.
These requirements are often listed in harmonized standards, which are created by organisa-
tions like CEN, CENELEC, or ETSI, following a request from the European Commission.
These harmonized standards provide the technical details necessary for companies to be
able to ensure compliance with regulation and for auditors to have clear specifications to
check against. The process of creating such harmonized standards for the EU AI Act is
currently running in parallel to the legislation procedure. As a baseline for the standards

1. European Union Set to Be Trailblazer in Global Rush to Regulate Artificial Intelligence (TIME)


harmonized with the EU AI Act, standards provided by ISO or other standardization bod-
ies can be considered. Conformity assessment bodies can then utilize these standards to
showcase the compliance of products, services, or processes with applicable EU legislation
(European Commission, 2023b). These audits can also be carried out on a voluntary ba-
sis for products that do not necessarily require them, for instance for marketing purposes.
In certain product categories like medical devices, third-party conformity assessments are
mandatory to ensure compliance with technical regulations before entering specific markets
(European Commission, 2023a). Apart from such certificates, so-called guidelines have no
legally binding effect.
Finally, developing safe and responsible AI also involves governance structures within
the company to manage data development activities and risks. However, this is beyond the
scope of this paper, in which we focus on reporting around the development process.

1.1.1 Challenges in AI Certification


Despite being a topic of active research for several years, there is a lack of experience and
methods for securing and certifying ML applications. Apart from the fast progress in ML
around research and development, the challenges are rooted in the fundamental program-
ming differences between classic software and ML. Classic software components are usually
a combination of manually defined functions, the logic within these functions being explic-
itly specified by a human. In contrast, the core of ML is to learn from data while the
user only specifies the model framework. Often, the result is a highly complex black-box
that returns results without explaining the underlying decision-making process. The main
challenge of certifying such a system is validating a black-box model representing a highly
complex function. Furthermore, an ML algorithm always entails risks related to data, e.g.,
the risk of a selection bias (Tidjon & Khomh, 2022). Hence, within the certification pro-
cess, not only does the proper functionality of the model need to be verified but also the
representativeness of the data for training and testing. This is a challenging task, because the
reason for developing ML often is that the underlying problem and cause-effect relations
are too complex to be understood, formulated and solved analytically and manually.

1.2 The State of AI Certification


There are several attempts by different organizations to standardize and secure the applica-
tion of AI. Apart from several ISO standards which are currently under development (e.g.
ISO/IEC DIS 42001 (2023) - Artificial intelligence - Management system) the European
AI Act (European Commission, 2021) and the AI Assessment Catalog by Fraunhofer IAIS
(Poretschkin et al., 2021) emerged during our discussions with organizations working on
AI standardization and as such form a baseline for this work. They are described in more
detail in this section.

1.3 Non-academic Activities


1.3.1 The European AI Act
The most impactful work comes from the European Union (EU) with the proposal of the
AI Act (European Commission, 2021). The first draft was proposed in April 2021 by the


European Commission, and two sets of amendments were proposed in December 2022 by the
Council of the European Union (2022) and in April 2023 by the European Parliament
(2023). The initial draft already created a heated debate leading to fears that it might
over-regulate research and development of AI in the EU and that the definition of AI was
too broad. However, this debate is beyond the scope of this paper. We are interested in
the requirements posed on those systems. Although some are still subject to change, many
requirements have already emerged.
In general, the AI Act distinguishes three risk classes for AI: forbidden applications, high-risk
applications and uncritical applications. Furthermore, there might be some transparency
obligations even for uncritical applications; for example, end-users must be informed that
they are interacting with an AI system. Beyond that, there are no mandatory obligations
for uncritical applications, although developers of such AI systems shall still be motivated
to follow the proposed principles.
Whether an application falls into the high-risk category depends on the application area, and a few
high-risk applications are named in the AI Act’s annexes. Some of them include: bio-
metric identification, operation of critical infrastructure, education, employment & workers
management, essential private services & public services, law enforcement, migration &
asylum, and administration of democratic processes. The requirements for such systems
are listed in Chapter 2 and include

• a risk management system (Article 9)

• data and data governance (Article 10)

• technical documentation (Article 11)

• record-keeping (Article 12)

• transparency and provision of information to users (Article 13)

• human oversight (Article 14)

• accuracy, robustness and cybersecurity (Article 15).

A further distinction is based on whether an AI system is considered a general purpose
AI (GPAI) or a foundation model. Although the term GPAI lacks a proper and widely
accepted definition, GPAI should refer to systems such as image and speech recognition
that can be applied to a broad variety of tasks. Foundation models, on the other hand, will
most likely be distinguished by the training data and the generality of the output. This
should specifically target systems such as large language models like ChatGPT and pose
stricter regulation on them while GPAI systems must only comply with the regulations in
the Articles 9–15 if they are expected to be used in high-risk scenarios. The definitions for
GPAI and foundation models (and others) are provided in Article 3. Finally, Chapter 3
lists further obligations for providers of high-risk AI systems and will most likely contain
the requirements for foundation models once the AI Act is final. At the moment, it is
highly likely that a self-assessment will be possible for most applications. This means that
companies can audit their applications by themselves and then apply the CE-mark to their
product. Third party assessments will only be mandatory for certain highly critical domains


such as medical technology. Overall, the requirements are on a very abstract level and still
subject to change. In particular, the Act does not contain specific starting points for
practical implementation in companies because the baseline standards that will provide the
technical details are still being developed.

1.3.2 AI Assessment Catalog


The AI Assessment Catalog is the most comprehensive and technical guideline with over
100 pages of content. It gives authors and practitioners a structured tool to develop and
assess trustworthy AI systems (Poretschkin et al., 2021). The catalog considers the whole
AI application, i.e., the data and the AI components together with surrounding non-AI
software components, and identifies six risk dimensions that need to be considered: fairness,
autonomy and control, transparency, reliability, safety, and privacy. For each of these six
dimensions there are guidelines for risk analysis and appropriate actions. The actions for
each risk belong to one of the four groups: data, AI components, embedding, or operations.
Thus, if developers want to check whether they have addressed all necessary actions for
data they have to consult all six risk dimensions. To the best of our knowledge, the AI
Assessment Catalog is the most comprehensive guideline for trustworthy AI. However, due
to its size the catalog can be cumbersome for practitioners to use. Finally, although being
developed by a German organization, the guideline is not specific to Germany nor the EU
but combines elements from different areas of research and development.

1.4 Research Activities


In addition to government and industry activities, there is research work on model and data
reporting with cards that finds application especially among larger corporations.2 3 Our
own work builds on documentation along these cards.

1.4.1 Model Cards


Mitchell et al. (2019) propose model cards as a standardized way to document ML models,
their performance metrics and characteristics. The work focuses primarily on ethics and
fairness, citing that AI models in recent history have often erred more on social groups
that have historically been marginalized in the US. All involved stakeholders benefit from
a standardized use of model cards. The authors list nine sections for model cards: model
details, intended use, factors, metrics, evaluation data, training data, quantitative analyses,
ethical considerations, and caveats and recommendations. In our proposal we divide model
and data cards and therefore exclude the section on data.

1.4.2 Data Cards


Data cards are similar to model cards (Pushkarna, Zaldivar, & Kjartansson, 2022). Their
goal is to report several standard facts about a dataset in a simple and standardized form.
The authors list several requirements for their cards. In particular, they should be standard-
ized in order to be comparable, should be created at the same time as the dataset, written
in a simple language such that users with a non-technical background can understand them,
and, finally, report known unknowns such as possible shortcomings or uncertainties of the
dataset. Apart from that, the authors name 31 content themes to describe a dataset. They
are listed in Appendix B.

2. The Data Cards Playbook: A Toolkit for Transparency in Dataset Documentation
3. Introducing the Model Card Toolkit for Easier Model Transparency Reporting

| Category | Requirement | Source |
| --- | --- | --- |
| Data | Describe data sources | Annex VIII |
| Data | Apply data governance | Art. 28 b) |
| Data | Disclose copyrighted data | Art. 28 b) |
| Compute | Disclose compute (computing power, training time) | Annex VIII |
| Compute | Measure energy consumption | Art. 28 b) |
| Model | Describe capabilities and limitations of model | Annex VIII |
| Model | Describe foreseeable risks and associated mitigations | Annex VIII, Art. 28 b) |
| Model | Benchmark model on standard benchmarks | Annex VIII, Art. 28 b) |
| Model | Report results of internal and external tests | Annex VIII, Art. 28 b) |
| Deployment | Disclose content generated from the model | Recital 60 g |
| Deployment | Disclose EU member states where model is on market | Annex VIII |
| Deployment | Provide documentation for downstream compliance with AI Act | Art. 28 b) |

Table 2: Requirements for foundation models from Bommasani et al. (2023).
Another related work is data sheets for datasets (Gebru et al., 2018). The authors
propose a similar set of questions for dataset creators during the collection phase that
can be grouped into the following categories: (i) motivation for creating the dataset, (ii)
composition of the data, (iii) collection process, (iv) preprocessing/cleaning/labeling of the
data, (v) Uses (describing the contexts in which the dataset was already used), (vi) the
distribution of the data, and (vii) the future maintenance. The intended audience are both
the dataset creators and the consumers of the final dataset.

1.4.3 Foundation Models & AI Act


Recently, Bommasani, Klyman, Zhang, and Liang (2023) assessed popular foundation mod-
els like GPT-4, LLaMA and Luminous for their compliance with the AI Act. They derived
and summarized 12 requirements grouped into data, compute, model and deployment from
the AI Act. Our own approach is similar to their work but is not focused on foundation
models. However, many of the requirements for foundation models also apply to high-risk
AI systems. A summary of the requirements can be found in Table 2.


1.4.4 Other Activities


Besides the European AI Act there are several other regulatory attempts. The EU has been
actively involved in standardization activities related to AI and plans to implement liability
rules on AI to renew the existing rules on product liability. The new rules will hold manu-
facturers and importers of AI-powered products accountable for any defects or malfunctions
that may cause harm to consumers (Council of European Union, 2022). Additionally, there
are a number of ISO standards either in active development or already in use that target
AI systems in particular. The ISO/IEC JTC 1/SC 42 is the technical committee within
the International Organization for Standardization (ISO) that is responsible for developing
standards for AI. Currently, there are 57 standards planned, some of which are already
published. An overview can be found on the web page of the committee.4 Among the
most important ones are: the ISO/IEC 5259x (2023) series on data quality, ISO/IEC DIS
5338 (2023) on AI system life cycle processes, ISO/IEC FDIS 8183 (2023) on a data life
cycle framework, ISO/IEC 23894 (2023) on risk management for AI systems and, ISO/IEC
DIS 42001 (2023) for AI management systems. The last one is not addressed in this work
because it deals mostly with governance, which we do not cover here. These norms were identified
in discussions with certification bodies that we held in the course of our project.
Another recent and closely related work introduces a similar guideline to ours (Witten-
brink, Kraus, Demirci, & Straub, 2022). It breaks down the AI development process into
four steps and provides requirements as well as further instructions such as relevant norms
for each requirement. The main difference to our work is that the development process
is broken down into characterisation, design, development, and deployment, whereas we
follow an approach oriented on existing work on model and data cards.
On the governmental side, China and the US are among the most important other
players in regulating AI. The US approach significantly differs from the European one
and is (currently) based on voluntary sector-specific measures. The most important non
sector-specific work is the AI Risk Management Framework developed by the National
Institute of Standards and Technology (NIST) (AI Risk Management Framework, 2023).
It defines seven key characteristics for trustworthy AI that include (i) reliability, (ii) safety,
(iii) security and resiliency, (iv) accountability and transparency, (v) explainability and
interpretability, (vi) privacy, and, (vii) fairness. The companion NIST AI RMF Playbook
names four pillars to mitigate these risks: (i) map: risks are identified, (ii) measure: the
identified risks are analyzed and tracked, (iii) manage: risks are prioritized and acted upon
and, finally, (iv) govern: a culture of risk management and governance is present. The
key characteristics of trustworthy AI are well aligned to those defined in the AI Act.5
Furthermore, NIST is working to align the framework with international standards that are
also relevant for the AI Act.6
China, on the other hand, follows a vertical approach7 that targets each AI application
with a specific regulation and aims to be a leader in AI standardization (H. Roberts et al.,
2021). For example, one regulation specifically targets services that generate text, images
4. Standards by ISO/IEC JTC 1/SC 42: Artificial intelligence
5. An illustration of how NIST AI RMF trustworthiness characteristics relate to the OECD Recommendation on AI, Proposed EU AI Act, Executive Order 13960, and Blueprint for an AI Bill of Rights
6. https://www.nist.gov/artificial-intelligence/technical-ai-standards
7. Lessons From the World’s Two Experiments in AI Governance (carnegieendowment.org/)


or videos. Overall, although the high-level pillars are expected to be similar to those in
western countries and include that AI should benefit human welfare, China’s regulation is
expected to be focused more on group relations than on individual rights (H. Roberts et
al., 2021). However, this is beyond the scope of our paper and Chinese regulation is not
addressed in our framework.

1.5 Contributions and Methods


We merge contributions of all of the previous works into a single framework for reporting AI
applications. As mentioned, this work focuses on the documentation of the AI application
and does not consider governance within a company. Our guideline follows the develop-
ment process of an AI system with the goal to support developers during various model
development steps in creating trustworthy AI systems from the beginning and positioning
them well when legal requirements come into force. Apart from this, we cover the same risk
dimensions as previous approaches and reference the AI Act and other guidelines whenever
possible. The AI development cycle is broken down into four steps: (i) the use-case defini-
tion, (ii) data collection, (iii) model development, and finally, (iv) model operation. We use
four cards as method of reporting in order to reduce the number of resulting documents so
the hurdle to access them is as low as possible while the different topics remain separated.
In the course of the project we furthermore evaluated the approach on three use cases and
integrated feedback from both developers and other affected individuals. Finally, we also
held interviews with several certification bodies to get an understanding of the most
important existing documents and guidelines for our work.

1.5.1 Motivation for Cards


Apart from discussions with certification experts preparing for the certification of AI sys-
tems, we also polled employees affected by the introduction of AI systems. Our goal
was to ensure that our proposal would both satisfy the requirements of possible auditors and
build trust among the individuals affected by the AI solution. The affected individuals
mainly wished to be included in the development process in order to convince themselves
of the functionality of the AI system. Furthermore, the system should be transparently
visualised and the output and reasons for the decision should be comprehensibly explained.
The reason why we chose the format of cards that follow the development process instead of
following risk dimensions (in contrast to, for instance, Poretschkin et al. (2021)) is twofold:
firstly, for developers it is easier to follow the development process and for domain experts
and other employees the resulting number of documents is smaller and as a result easier to
access. Secondly, although AI development is often an agile process, the cards can still be
utilized by updating them when the dataset changes or the model is updated. They also
integrate well into a classical v-model development, which is subject to future work.


2. Proposed Approach

In this section, we propose cards supporting the four steps: use case definition, data collec-
tion, model development, and model operation. For each card, we briefly cover background
information before presenting our proposal and include references to further literature when-
ever possible. Furthermore, each card contains an involvement of affected individuals section
as a reminder to include people that might be affected by the AI system whenever feasible.
In Appendix A we provide an example on a toy use case. As mentioned before, we focus
on reporting around the AI application and do not cover governance.

2.1 Use Case

In line with Raji et al. (2020) and Poretschkin et al. (2021) the first step in our framework
is a summary of the use case. This offers the auditing authority a brief overview about the
task at hand and the AI application. Furthermore, the document can be used during the AI
development phase to ensure compliance as well as a trustworthy and robust AI system. A
general understanding of the use case is necessary to perform a proper risk assessment after-
wards. Therefore, a use case summary should be formulated that includes a description of
the status before applying an AI solution, a description of relevant sub-components as well
as a summary of the proposed AI solution. This approach is especially useful for the doc-
umentation of high-risk applications as defined in the AI Act (European Commission, 2021).
A general summary should also include a contact person and information about involved
groups inside an organisation to include them in relevant parts during the development
phase. New in our framework is the justification of why a non-AI approach is not sufficient
in the context of the application. Although this requirement is not explicitly stated in the AI
Act, it emerged in discussions with several certification bodies. For developers this means
that a simpler model (or maybe non-AI) solution should be prioritized whenever there is
no sound justification for the use of a more complex one. Finally, two new aspects in our
approach are (i) reviewing and documentation of prior incidents in similar use cases and
(ii) the distinction whether a general purpose AI system or a foundation model is expected
to be used or developed.

2.1.1 Risk

AI can only be deployed if the associated risks are kept at an acceptable level, comparable
to that of already existing products or services. To guarantee this, a proper risk as-
sessment must be performed. Due to the nature of AI systems, e.g., the training on noisy
and changing data or the black-box characteristics, classical methods for risk assessment
cannot be applied in most cases (Arai & Kapoor, 2020; Siebert et al., 2022). In the litera-
ture, there are several schemes for qualitative and quantitative risk assessment. In recent
years, the AI community proposed several ways to categorize risks in different dimensions
(Ashmore, Calinescu, & Paterson, 2021; Piorkowski, Hind, & Richards, 2022; Poretschkin
et al., 2021). One approach, accepted by many scientists and organisations, is the division

9
Brajovic et al.

according to the High-Level Expert Group on AI (AI HLEG, 2019). They propose seven
key dimensions of risk:

1. human agency/oversight,

2. technical robustness/safety,

3. privacy/data governance,

4. transparency,

5. fairness,

6. environmental/societal well-being, and

7. accountability.

As described in Section 1.4.4, these dimensions are well-aligned to other works. Based
on every dimension one can derive possible risks for one specific use case and AI application.
Here, the risk consists of the probability of an undesirable event and the impact such an
event has. In recent years, many more methods have been developed to accurately quantify
such probabilities, for example, in the field of adversarial robustness (Murakonda & Shokri,
2020; Szegedy et al., 2013), fairness (Barocas & Selbst, 2016; Bellamy et al., 2018; Bird et al.,
2020), or privacy (Fredrikson, Jha, & Ristenpart, 2015; Shokri, Stronati, Song, & Shmatikov,
2016). In the context of (functional) safety, a quantitative assessment of the risk is almost
always mandatory, especially in safety critical applications such as robotics, manufacturing,
or autonomous driving (Ashmore et al., 2021; El-Shamouty, Titze, Kortik, Kraus, & Huber,
2022; Salay, Queiroz, & Czarnecki, 2017). There are also more qualitative approaches to
classify the risk corresponding to one dimension. The EU distinguishes between forbidden
applications, high-risk applications and uncritical applications based on the application
domain (European Commission, 2021). A standardized way of performing risk assessment
for AI is given in ISO/IEC 23894 which uses the existing risk management standard ISO
31000 as a reference point. It also offers concrete examples in the context of correct risk
management integration during and after the development.
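To make this bookkeeping concrete, a qualitative risk entry can be recorded as a small data structure and scored as probability times impact. The following sketch is purely illustrative; the scales, example risks, and acceptance thresholds are hypothetical and not prescribed by the AI Act, the AI HLEG, or ISO/IEC 23894:

```python
from dataclasses import dataclass

@dataclass
class Risk:
    dimension: str    # one of the seven AI HLEG risk dimensions
    description: str
    probability: int  # qualitative scale, e.g., 1 (rare) to 5 (frequent)
    impact: int       # qualitative scale, e.g., 1 (negligible) to 5 (critical)

    @property
    def score(self) -> int:
        # simple risk-matrix score; acceptance thresholds must be set per use case
        return self.probability * self.impact

risks = [
    Risk("technical robustness/safety", "model fails on an unseen machine type", 3, 4),
    Risk("fairness", "user group underrepresented in the training data", 2, 5),
]

for risk in sorted(risks, key=lambda r: r.score, reverse=True):
    print(f"[{risk.score:2d}] {risk.dimension}: {risk.description}")
```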
Our proposal of a use case documentation is shown below. In accordance with existing
works on model cards (Mitchell et al., 2019) and data cards (Pushkarna et al., 2022) we call
it a Use Case Card. As in the other cards for almost every aspect there exist one or multiple
references to the AI Act. After a use case summary and a brief formulation of the problem
and solution, we suggest performing a risk assessment based on the dimensions mentioned
above. Each risk has to be stated together with the corresponding dimension. Note that
in this approach first a qualitative risk assessment is performed to create awareness of the
critical aspects of the AI system. Later on, in the Sections 2.2–2.4, the goal is to quantify
and minimize those risks using specific methods and other measures.


Use Case Card


| Step | Requirement | AI Act | References |
| --- | --- | --- | --- |
| General | Name contact person | Art. 16, Art. 24 | |
| General | List groups of people involved | | |
| General | Summarize the use case shortly | Art. 11 | Poretschkin et al. (2021) |
| General | Describe the status quo | | |
| General | Describe the planned interaction of sub-components (e.g., different software modules or hardware and software) | Art. 18 | |
| General | Provide a short solution summary | | Mitchell et al. (2019) |
| General | Review and document past incidents in similar use cases | | AI Incident Database |
| General | Review available tools | | Catalogue of Tools & Metrics for Trustworthy AI |
| Problem definition | Clearly describe the learning problem | Art. 11 | |
| Problem definition | Describe the disadvantages of the current approach | | |
| Problem definition | Argue why classical approaches are not sufficient | | |
| Solution approach | Describe the integration into the current workflow | Art. 11 | |
| Solution approach | Formulate the learning problem | | |
| Solution approach | Formulate the KPIs for go-live | | |
| Solution approach | Document if a foundation model will be used or developed | Art. 3, Art. 28 b) | |
| Solution approach | Provide a short data description | Art. 10 | Gebru et al. (2018) |
| Risk Assessment | | Art. 9, Art. 19 | |
| Risk Class | Categorize the application into a risk class according to the AI Act | Art. 5, Art. 6 | Navigating AI Act Flow Chart; Risk-Classification DB |
| Human agency and oversight | Rate the level of autonomy | Art. 14 | |
| Human agency and oversight | Evaluate the danger to life and health | Art. 5 (1) a), Art. 14 | Winter et al. (2021) |
| Technical robustness and safety | Identify possibilities of non-compliance | Art. 15, Art. 19 | Johnson and Sokol (2020) |
| Technical robustness and safety | Identify customer relevant malfunction | | |
| Technical robustness and safety | Identify internal malfunction | | |
| Technical robustness and safety | Evaluate cybersecurity risks | Art. 15, Art. 42 | |
| Privacy and data governance | List risks connected with customer data | Art. 10 | De Cristofaro (2020) |
| Privacy and data governance | List risks connected with employee data | | |
| Privacy and data governance | List risks connected with company data | | |
| Transparency | Evaluate effects of incomprehensible decisions or the use of a black box model | Art. 13 | Castelvecchi (2016); Ehsan, Liao, Muller, Riedl, and Weisz (2021); Felzmann, Villaronga, Lutz, and Tamò-Larrieux (2019); Larsson and Heintz (2020); Rudin (2019) |
| Diversity, non-discrimination and fairness | Check for possible manipulation of groups of people | | Ashton and Franklin (2022); Botha and Pieterse (2020); ISO/IEC CD TS 12791 (2023) |
| Diversity, non-discrimination and fairness | Check for discrimination of groups of people regarding sensitive attributes | Art. 5 (1) b) and c) | Section 2.2.2; Council of European Union (2019); ISO/IEC CD TS 12791 (2023); Oneto and Chiappa (2020) |
| Societal and environmental well-being | List possible dangers to the environment | | Wu et al. (2022) |
| Societal and environmental well-being | Consider ethical aspects | | Bostrom and Yudkowsky (2018); AI HLEG (2019); Dubber, Pasquale, and Das (2020); Hagendorff (2020); Müller (2020) |
| Societal and environmental well-being | Evaluate effects on corporate actions | | Glikson and Woolley (2020) |
| Societal and environmental well-being | Identify impact on the staff | | Malik, Tripathi, Kar, and Gupta (2022) |
| Accountability | Estimate the financial damage on failure | Art. 17 | Gualdi and Cordella (2021) |
| Accountability | Estimate the image damage on failure | | |
| Norms | List relevant norms in the context of the application (e.g., automotive safety norms in an automotive use-case) | Art. 9 (3) | Gasser and Schmitt (2020); AI Standards Hub |
| Involvement of Individuals | If feasible, describe involvement of affected individuals | | |


2.2 Data

AI applications stand and fall with the training data. Despite that, there has only recently
been a paradigm shift towards the datasets due to the data-centric AI perspective. The AI
Act requires in Article 10 (3) that training, validation, and testing datasets shall be relevant,
representative, free of errors and complete. What this means in practice, however, is often
unclear. Our data card builds on previous work on data documentation and is extended
with further references, especially those addressing regulation, whenever possible
(Gebru et al., 2018; Pushkarna et al., 2022). From the standardization side, the ISO/IEC
5259x (2023) series deals with data quality for AI systems, ISO/IEC FDIS 8183 (2023) with
the data life cycle framework for AI systems and ISO/IEC 25012 (2008) describes a general
data quality model. The first two are still under development but are worth reviewing once
they are final. Before introducing the data card, we discuss open questions that we think
are not covered sufficiently in the state of the art.

2.2.1 Related Research

We start by summarizing research directions that might be worth considering before starting
data collection. Two related active areas of research are active learning (Settles, 2010) and
core sets (D. Feldman, 2020). In active learning, the algorithm is given access to a small
labeled dataset D and a large unlabeled dataset U. The algorithm can then query an
oracle (e.g., a human labeler) for labels up to some budget b. The goal is to optimize model
performance within the limit of the budget. A core set is a small fraction S ⊆ D of the
original dataset D such that the learning algorithm achieves a similar performance on this
subset as if it was trained on the entire dataset.

Recently, the concept of memorization or uniqueness of data points was introduced
(V. Feldman, 2020; Jiang, Zhang, Talwar, & Mozer, 2020). A data instance i from the
training set is called unique if removing it from the training set reduces the probability for i
to be classified correctly by the model. V. Feldman (2020) hypothesizes that such rare points
are important for the generalization of models. I.e., a single rare point such as an image of a
red car could determine whether this concept is correctly classified by the model in the real
world or not. Jiang et al. (2020) use it as a measure to categorize the structure of a dataset
and show that mislabeled points are harder to memorize. Unfortunately, estimating these
instances is computationally expensive. For each data instance at least two models need to
be trained; one with and the other without the instance. However, there might be occasions
when it could become useful, for example, if a specific point shall be inspected. Finally,
Meding, Buschoff, Geirhos, and Wichmann (2021) describe dichotomous data-difficulty and
show that many datasets such as ImageNet suffer from imbalanced data difficulty. There
are many data points in the test set that are never classified correctly (called impossible)
and many that are always classified correctly (called trivial ). They show that models can
better be compared on the remaining points. A similar conclusion can be drawn for label
errors. Northcutt, Athalye, and Mueller (2021) show that larger models tend to be favored
on datasets with label errors while smaller models might actually outperform them when
evaluated on a dataset without errors.
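Returning to the memorization scores mentioned above, the leave-one-out idea behind them can be sketched as follows. This is a naive illustration only (a single instance, a generic scikit-learn classifier, and a handful of bootstrap repetitions), not the estimator proposed by V. Feldman (2020):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression

def memorization_estimate(X, y, i, n_repeats=5):
    """Drop in the probability of classifying instance i correctly
    when i is removed from the training data (averaged over repeats)."""
    others = np.delete(np.arange(len(y)), i)
    with_i, without_i = [], []
    for seed in range(n_repeats):
        rng = np.random.RandomState(seed)
        idx = rng.choice(others, size=len(others), replace=True)  # bootstrap the rest
        m_with = LogisticRegression(max_iter=1000).fit(
            np.vstack([X[idx], X[[i]]]), np.hstack([y[idx], y[[i]]]))
        m_without = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])
        with_i.append(m_with.predict_proba(X[[i]])[0, list(m_with.classes_).index(y[i])])
        without_i.append(m_without.predict_proba(X[[i]])[0, list(m_without.classes_).index(y[i])])
    return float(np.mean(with_i) - np.mean(without_i))

X, y = load_digits(return_X_y=True)
print(f"memorization estimate for instance 0: {memorization_estimate(X, y, 0):.3f}")
```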


These research fields and recent advances are related to data collection and quality, and
performing a literature review before starting to collect data can help to make the collection
more effective.

2.2.2 Data Collection


After a literature review, the first step is to gather data. Apart from collecting metadata
and protected attributes, we focus on suggestions around dataset size and coverage. It is
worth mentioning that, to the best of our knowledge, there are currently no established
practices in that regard and the upcoming ISO standards on data ISO/IEC 5259x (2023)
will most likely not include detailed requirements. Hence, the following suggestions should
be seen as points worth considering rather than being mandatory.

Metadata and Leakage Although well understood, leakage remains a central


problem in many datasets (Kapoor & Narayanan, 2022). A famous example is CheXNet for
pneumonia detection on x-ray images (Rajpurkar et al., 2017), where the authors performed
a naïve train-test split with overlapping x-ray images of the same patients in training and
test sets in the first version of the paper. As such, the performance of the model on real
data was overestimated. When creating train and test splits, such leaks can be avoided by
collecting metadata and using it for stratification or to generate data splits of different dif-
ficulty. In the CheXNet example, the metadata could be an anonymous ID for each patient
and an easy test set would contain data from patients already represented in the train data
whereas a hard test set would only contain data of unseen patients. This information can
help to estimate the generalization capabilities of a model which can differ between appli-
cations. If the goal is to predict re-occurrence of a disease, performance on the easy test
set could be the primary metric while performance on the hard test set would be of interest
if the goal is to detect pneumonia in new patients. In general, it is advisable to collect
whatever metadata is easy to obtain and compatible with data protection law because it is
often impossible to backtrack once the dataset is final. This especially includes protected
attributes.
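Such a patient-level ("hard") split can be generated directly from the collected metadata. The sketch below uses scikit-learn's GroupShuffleSplit on a toy table with a hypothetical patient_id column:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# toy data: several records (e.g., x-ray images) per patient
rng = np.random.RandomState(0)
df = pd.DataFrame({
    "patient_id": np.repeat(np.arange(100), 3),  # 100 patients, 3 records each
    "feature": rng.randn(300),
    "label": rng.randint(0, 2, size=300),
})

# "hard" split: no patient appears in both training and test data
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(df, groups=df["patient_id"]))
train, test = df.iloc[train_idx], df.iloc[test_idx]

assert set(train["patient_id"]).isdisjoint(test["patient_id"])
print(len(train), "train rows and", len(test), "test rows, no shared patients")
```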

Protected Attributes Usually, the protected attributes include sex, race, skin colour,
ethnic or social origin, genetic features, language, religion or belief, political or any other
opinion, membership of a national minority, property, birth, disability, age or sexual orien-
tation and can be found, for instance, in Art. 21 of the Charter of Fundamental Rights of
the European Union (Coghlan & Steiert, 2020). According to the EU handbook on non-
discrimination law, it might differ between use-cases whether an attribute is considered as
protected (Council of European Union, 2019). Furthermore, pregnancy can be considered
as protected in some cases. Access to such information is crucial when detecting biases but
there may also be cases when other laws prohibit the use of such attributes (van Bekkum
& Borgesius, 2023).

Dataset Size One problem when collecting data is knowing when a sufficient amount was
acquired. Despite the standard perspective that “there’s no data like more data” (a quote
attributed to IBM’s Robert Mercer), collecting and annotating data is both expensive and not always
feasible. For example, in medicine it is almost impossible to increase the number of samples


[Figure 1: two line plots. (a) Model Performance for Train Fraction: Test Performance vs. Fraction of Train-Set. (b) Test Variance for Test Fraction: Test Performance vs. Fraction of Test-Set.]

Figure 1: (a) Model performance of a multilayer perceptron (MLP) trained on the MNIST
dataset with increasing fraction of original train data. The improvement of adding data
points behaves logarithmically. (b) Variance over 20 runs of training an MLP on the entire
MNIST training set and evaluating the performance on fractions of the test set. With
increasing size of the test set, the variance decreases.

of rare diseases. Data augmentation is one solution, but in this section we want to focus
on the size of the raw dataset. A rule of thumb in computer vision suggests using around
1,000 images per class for image classification tasks as a good starting point. The one-in-ten
rule8 suggests that ten data points are necessary per model parameter. One way to estimate
the necessary amount of data is by tracking model improvements when sequentially adding
larger fractions of the training data. It is well known that the resulting performance plot
behaves logarithmically (Figueroa, Zeng-Treitler, Kandula, & Ngo, 2012; Viering & Loog,
2021). Hence, the largest improvements are obtained early on. As shown in Figure 1(a),
with about 50% of the training data, the performance is already above 90% and doubling
the amount of data increases the performance by less than 5%. In practice, one approach
could be to start with a small test set for data collection and utilize it to estimate when a
reasonable amount of data has been gathered. Once the data has been acquired, this test set
may be added to the train data or discarded and a new test set is created.
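One way to implement this in practice is a learning curve: train on growing fractions of the data already collected and extrapolate the resulting test scores. A minimal sketch with scikit-learn on a toy dataset (model and fractions are arbitrary choices):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = load_digits(return_X_y=True)

# cross-validated test scores for increasing training set sizes
train_sizes, _, test_scores = learning_curve(
    LogisticRegression(max_iter=2000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 10), cv=5, shuffle=True, random_state=0)

for n, score in zip(train_sizes, test_scores.mean(axis=1)):
    print(f"{int(n):5d} training samples -> accuracy {score:.3f}")
# the curve typically flattens logarithmically; extrapolating it indicates
# how much additional data is still worth collecting
```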

Data Coverage Apart from the size of the dataset, covering all possible scenarios is
important. Recent research has shown that many data points are redundant (D. Feld-
man, 2020; Sorscher, Geirhos, Shekhar, Ganguli, & Morcos, 2022; Toneva et al., 2018) and
hypothesised that unique points (which may be considered as edge cases) can have a signifi-
cant impact on model performance (V. Feldman, 2020). One way to find such samples is by
manually defining edge cases, e.g. under the involvement of domain experts, and actively
collecting these data points. This can be extended by a combination of empirical and tech-
nical analysis, for instance, by plotting a downsampled embedding and analyzing the edge
cases. A useful tool for this purpose is tensorleap. Unfortunately, it is still unclear whether

8. https://en.wikipedia.org/wiki/One_in_ten_rule


rare points are desirable or not in a dataset. While they can have a significant impact on
model performance, they might make a model prone to privacy attacks (V. Feldman, 2020;
Sorscher et al., 2022).

2.2.3 Train-Test Split Creation

Usually, after data collection the data would be labeled and preprocessed. Since our
suggestions regarding labeling are short, they are covered in the section on documentation.
Apart from this, we have swapped the sections on train-test split creation and preprocessing
because it is recommended to do preprocessing on the splits separately whenever feasible.

Stratification First of all, training and test sets should be stratified. If metadata was
collected it should be considered for stratification. As indicated in Section 2.2.2, another
option is to create splits of different difficulty according to the metadata. An easy test set
might contain overlapping instances between entities in training and test set (e.g., data of
the same patients in training and test set) whereas a hard test set contains only data from
new entities (patients). This can be used to estimate the generalization capabilities of a
model and helps to understand whether it is usable for a task or not.

Test Set Size One of the most difficult choices is the size of the test set. Despite the
standard 80-20 (or similar) suggestion we are not aware of common strategies to estimate
the necessary test set size in a general setting (Joseph, 2022). In medical and psychological
research, sample size estimation based on statistical significance is established. For example,
if the effect of medication against a certain disease should be evaluated and it is a priori
known that the disease occurs in 10 from 1000 patients (p = 1 %) and the target is an error
margin of ϵ =1 % with a confidence level of σ = 99 %, the necessary sample size n is given
by n ≈ z_{0.99}^2 · p(1−p)/ϵ^2 = 2.58^2 × 0.01(1−0.01)/0.01^2 ≈ 659, where z_{0.99} = 2.58 is the value for the selected
confidence interval derived from the normal distribution. If p is unknown, it is usually set
to 0.5. The same can be applied to ML and was actually discussed in a few works especially
from the medical domain (Beleites, Neugebauer, Bocklitz, Krafft, & Popp, 2013; Dietterich,
1998; Konietschke, Schwab, & Pauly, 2021; Raudys & Jain, 1990). In the case of ML, the
interpretation is as follows: the goal is to evaluate an ML model that predicts the disease
with a randomly drawn test set of size 659. Then, there is a 99 % confidence that the real-world
model performance is within ±1 % of the error rate on the test set. However, one drawback
is that it is assumed that the collected data is representative of the real world setting and
it is not guaranteed that the model will not learn any spurious correlations.
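For reference, the sample-size calculation above is easy to reproduce. The following sketch uses the same normal approximation (scipy is only needed for the z value; with the exact z the result is 657 rather than the rounded 659):

```python
from math import ceil
from scipy.stats import norm

def required_test_size(p: float, eps: float, confidence: float) -> int:
    """Test set size for estimating an error rate p within +/- eps
    at the given two-sided confidence level (normal approximation)."""
    z = norm.ppf(1 - (1 - confidence) / 2)  # roughly 2.58 for 99 %
    return ceil(z**2 * p * (1 - p) / eps**2)

print(required_test_size(p=0.01, eps=0.01, confidence=0.99))  # 657
print(required_test_size(p=0.5, eps=0.01, confidence=0.99))   # 16588, worst case if p is unknown
```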
Another indicator for a good test set is the test variance. The underlying assumption
is that a better test set should show a low variance between different evaluation runs with
the same model (Bouthillier et al., 2021). That is, if the same experiment results in an
unexpectedly large variance, the test set is most likely too small and should be extended.
An example for this effect is shown in Figure 1(b). Although multiple evaluations should
generally be done, it is of special importance if the test set cannot be extended (Bouthillier
et al., 2021; Lones, 2021; Raschka, 2018).


2.2.4 Data Processing


After the data was collected and (maybe) labeled, it is preprocessed. This, for example,
includes data cleaning, feature selection, and feature engineering. As mentioned above, it is
important to do this separately on all splits whenever possible (Lones, 2021; Poretschkin et
al., 2021). Otherwise, information can leak into the test set, for example by oversampling
rare classes before creating splits (Winter et al., 2021). In this work, we distinguish between
processing steps that are a fixed part of a deployed dataset and as such are documented in
the data card, and steps specific to a particular ML model which are documented in the
model card. Generally, it is advisable to involve domain experts and to document whether
and how the processing steps are agreed on with them (Poretschkin et al., 2021).
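As a concrete example of the leakage issue from above, the sketch below oversamples a rare class only on the training split, after splitting, using plain scikit-learn utilities (dedicated libraries such as imbalanced-learn offer more elaborate resampling schemes):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)

# 1) split first ...
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=0)

# 2) ... then oversample the rare class on the training split only, so that no
#    duplicated minority sample can end up in the test set
minority = y_tr == 1
X_up, y_up = resample(X_tr[minority], y_tr[minority],
                      n_samples=int((~minority).sum()), random_state=0)
X_tr_bal = np.vstack([X_tr[~minority], X_up])
y_tr_bal = np.hstack([y_tr[~minority], y_up])

print("balanced training classes:", np.bincount(y_tr_bal),
      "| untouched test classes:", np.bincount(y_te))
```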

2.2.5 Understanding and Documenting Data & Label Quality


Finally, the characteristics of the data need to be documented. This includes all measures
applied in order to understand the data. For example, plots of embeddings, identified
rare points, or possible label errors or ambiguities. Although not focused on AI, the 16
dimensions defined in the ISO/IEC 25012 (2008) standard are well suited for this. They in-
clude accuracy, completeness, consistency, credibility, currentness, accessibility, compliance,
confidentiality, efficiency, precision, traceability, understandability, availability, portability,
recoverability, and relevance. Dimensions that are of particular importance are accuracy
(can be interpreted as fraction of label errors), completeness (fraction of missing values)
and consistency (fraction of inconsistencies such as synonyms).
From the technical side, Cleanlab (Northcutt et al., 2021) and tensorleap are toolboxes
that provide various methods for analyzing data and detecting and fixing label errors. Al-
though it is not always possible to estimate whether something is a label error, documenting
such ambiguous samples is important. Apart from this, the four-eyes principle is advisable
when data is annotated manually (Poretschkin et al., 2021). When dealing with third-party
annotators, the usage of a gold standard dataset is possible. A gold-standard dataset is a
small fraction of data with very high label quality that is sent to an external provider for
labeling and can be used to estimate the label quality of the provider afterwards.
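A lightweight check in this spirit can be implemented with out-of-fold predictions alone; the sketch below flags samples whose cross-validated probability for their assigned label is low (Cleanlab automates and refines this kind of analysis):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X, y = load_digits(return_X_y=True)

# out-of-fold class probabilities: each sample is scored by a model
# that never saw it during training
proba = cross_val_predict(LogisticRegression(max_iter=2000), X, y,
                          cv=5, method="predict_proba")

# confidence the model assigns to the *given* label of each sample
self_confidence = proba[np.arange(len(y)), y]

# the least plausible labels are candidates for manual review
# (possible label errors or genuinely ambiguous samples)
suspects = np.argsort(self_confidence)[:20]
print("samples to review:", suspects)
print("self-confidence:", self_confidence[suspects].round(3))
```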


Data Card
| Step | Requirement | AI Act | References |
| --- | --- | --- | --- |
| General | Name originator of the dataset and provide a contact person | | Gebru et al. (2018); Pushkarna et al. (2022) |
| General | Describe the intended use of the dataset | | Gebru et al. (2018); Pushkarna et al. (2022) |
| General | Describe licensing and terms of usage | | Gebru et al. (2018); Pushkarna et al. (2022) |
| Data description | Collect requirements for the data before starting data collection | Art. 10 (2) d) e) | |
| Data description | Describe a data point with its interpretation | | Gebru et al. (2018); Pushkarna et al. (2022) |
| Data description | Maybe, provide additional documentation to understand the data (e.g., links to scientific sources, preprocessing steps or other necessary information) | | Pushkarna et al. (2022) |
| Data description | If there is GDPR relevant data (i.e. personally identifiable information), describe it | | Gebru et al. (2018); Pushkarna et al. (2022) |
| Data description | If there is biometric data, describe it | | Gebru et al. (2018); Poretschkin et al. (2021); Pushkarna et al. (2022) |
| Data description | If there is copyrighted data, summarize it | Art. 28 b) | |
| Data description | If there is business relevant information, describe it | | Poretschkin et al. (2021) |
| Collection | Describe the data collection procedure and the data sources | Art. 10 (2) b), Annex III c) | Gebru et al. (2018); Pushkarna et al. (2022) |
| Collection | Use data version control | Art. 10 (2) b) | Poretschkin et al. (2021) |
| Collection | Consider the prior requirements for the data | Art. 10 (2) e) | |
| Collection | Include and describe metadata | | Metadata Standards |
| Collection | Involve domain experts and describe their involvement | | Poretschkin et al. (2021) |
| Collection | Describe technical measures to ensure completeness of data | Art. 10 (2) g), Art. 28 b) | CleanLab |
| Collection | Describe and record edge cases | | Poretschkin et al. (2021); Pushkarna et al. (2022) |
| Collection | If personal data is used, make sure and document that all individuals know they are part of the data | | Gebru et al. (2018); Lu, Kazi, Wei, Dontcheva, and Karahalios (2021) |
| Collection | Describe if and how the data could be misused | | Gebru et al. (2018); Pushkarna et al. (2022) |
| Collection | If fairness is identified as a risk, list sensitive attributes | Art. 10 (5) | Section 2.2.2; ISO/IEC CD TS 12791 (2023) |
| Collection | If applicable, address data poisoning | | Poretschkin et al. (2021) |
| Labeling | If applicable, describe the labeling process | | Gebru et al. (2018); Pushkarna et al. (2022); snorkel.ai |
| Labeling | If applicable, describe how the label quality is checked | | Section 2.2.5 |
| Splitting | Create and document meaningful splits with stratification | Art. 10 (3) | Section 2.2.3; Beleites et al. (2013); Joseph (2022); Konietschke et al. (2021); Poretschkin et al. (2021); Raudys and Jain (1990); X. Zhang et al. (2022) |
| Splitting | Describe how data leakage is prevented | | Poretschkin et al. (2021); Winter et al. (2021) |
| Splitting | Recommendation: test the splits and variance via cross-validation | | Bouthillier et al. (2021) |
| Splitting | Recommendation: split dataset into difficult, trivial and moderate | | Meding et al. (2021) |
| Splitting | Recommendation: put special focus on label quality of test data | | Northcutt et al. (2021) |
| Splitting | Reminder: perform separate data preprocessing on the splits | | Lones (2021); D. R. Roberts et al. (2017) |
| Preprocessing | Document and motivate all processing steps that are a fixed part of the data | Art. 10 (2) c) | Section 2.2.4 |
| Preprocessing | Document whether the raw data can be accessed | | Gebru et al. (2018) |
| Preprocessing | If sensitive data is available, highlight (pseudo-)anonymization | Art. 10 (5) | Poretschkin et al. (2021) |
| Preprocessing | If fairness is a risk, highlight fairness-specific preprocessing | Art. 10 (5) | Poretschkin et al. (2021) |
| Analyzing | Understand and document characteristics of test and training data | Art. 10 (2) e) g), Art. 10 (3) | Section 2.2.5; ISO/IEC 25012 (2008); ISO/IEC 5259x (2023); Mazumder et al. (2022); Mitchell et al. (2022); X. Zhang et al. (2022) |
| Analyzing | Document why the data distribution fits the real conditions or why this is not necessary for the use case | Art. 10 (4) | Poretschkin et al. (2021) |
| Analyzing | Document limitations such as errors, noise, bias or known confounders | Art. 10 (2) f) g) | Gebru et al. (2018); Poretschkin et al. (2021); Pushkarna et al. (2022); Wittenbrink et al. (2022) |
| Serving | Describe how the dataset will be maintained in future | | Gebru et al. (2018); Pushkarna et al. (2022); Lu et al. (2021) |
| Serving | Describe the storage concept (e.g., everything users need to know to access the data). For developers it must be possible to document on which version of the data a specific model was trained. | | Poretschkin et al. (2021) |
| Serving | Describe the backup procedure | | Poretschkin et al. (2021) |
| Serving | If necessary, document measures against data poisoning | | Poretschkin et al. (2021) |
| Further notes | Document further recommendations or shortcomings in the data | | |
| Involvement of Individuals | If feasible, describe involvement of affected individuals | Art. 29 5a. | |


2.3 Model Development

Developing models that execute a specific task in a reliable and fair way is challenging. The
AI Act phrases the requirements for (high-risk) AI systems in Article 15 (1): “High-risk AI
systems shall be designed and developed in such a way that they achieve [...] an appropriate
level of accuracy, robustness and cybersecurity, and perform consistently in those respects
throughout their lifecycle.” To achieve this, there are several concerns such as explainability,
fairness, feature engineering and others. For documenting all these pillars, we build on the
work of Mitchell et al. (2019). While some points overlap, we added and complemented
others, namely explainable artificial intelligence (xAI) and testing in real-world settings.

2.3.1 Explainability

First, the requirements for xAI should be determined. In this regard, different aspects of
explainability methods need to be considered. The first important aspect is to distinguish
between the need to explain a single decision (local explainability) or the entire ML model
(global explainability). The first might be important in cases where lay users need to be able
to understand a decision made about them (e.g., financial advice), while the latter might be
more important to make sure a model does not decide based on sensitive features. These
explanations can be derived via different means, with most taxonomies listing post-hoc
explainability vs. explainability by nature or design, also so-called interpretable machine
learning (iML). The former describes using a black-box model for a decision and generating
explanations afterwards, most often via simpler approximations, with the most prominent
examples being LIME (Ribeiro, Singh, & Guestrin, 2016) and SHAP (Lundberg & Lee,
2017). For post-hoc explainability, it is important to note that most xAI-methods provide
no formal guarantees, thus explanations might not conform to the ML-model’s decision in
all cases. Another important factor and currently open problem is how to evaluate such
explanation methods. While many evaluation metrics and tests have been proposed, none
has been widely adopted and consistently been used throughout the literature (Burkart &
Huber, 2021; Nauta et al., 2022). Currently it seems that explanation methods should be
tested in real-world scenarios, including people with the same explainability needs as the
target group of the explanations, as e.g. ML-engineers might need different explanations
than lay users. In addition to the requirements regarding explainability and transparency
itself, xAI-methods might be needed to fulfill other legal requirements. If, for example,
users of an ML-system have a right to object to decisions (as given through the General
Data Protection Regulation (GDPR) in Europe), they must be sufficiently informed of a
decision to be able to make use of their rights. In general, we advise to think of explain-
ability demands before choosing a model, as use cases with a high demand for transparency
might need to provide some guarantees for correct explanations and could thus be limited to
iML models. Furthermore, considerations regarding explainability can also influence other
development steps, especially data preprocessing. Here, non-interpretable features, such as those obtained by principal component analysis (PCA), can severely limit the expressiveness of explainability approaches. Regarding the use of explainability for AI systems, norms are currently being specified but have not yet been published (DIN SPEC 92001-3, 2023; ISO/IEC AWI TS 6254, 2023).
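
As an illustration, the following Python sketch applies one of the post-hoc methods named above (SHAP) to a simple tree ensemble. The synthetic data, the model choice, and the availability of the shap package are assumptions made for this example only; LIME would be an analogous choice for local explanations.

```python
# Minimal post-hoc explainability sketch (assumes the `shap` package is installed).
# A tree ensemble is fitted and local SHAP attributions are aggregated into a global view.
import numpy as np
import shap
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestRegressor(random_state=0).fit(X_train, y_train)

explainer = shap.TreeExplainer(model)        # model-specific, post-hoc explainer
shap_values = explainer.shap_values(X_test)  # local attributions, one row per prediction

# Local explanation: contribution of each feature to the first test prediction.
print("local attribution, sample 0:", np.round(shap_values[0], 2))
# Global view: mean absolute attribution per feature across the test set.
print("global importance:", np.round(np.abs(shap_values).mean(axis=0), 2))
```

Note that such attributions come without formal guarantees, which is exactly why their evaluation with the intended target group remains necessary.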


2.3.2 Feature Engineering, Feature Selection, and Preprocessing

While some of the preprocessing is addressed in the data card, here we consider only the
processing that is not part of the dataset itself, but is model specific. Feature engineer-
ing and selection as well as data augmentation and preprocessing steps should always be
meaningful in the context of the use case and the used model. Feature selection can be
seen as an optimization problem of finding the most relevant features for the model. While the prediction performance of deep learning algorithms might not change when irrelevant features are kept, the computational cost will increase and consume additional time and resources (Cai, Luo, Wang, & Yang, 2018). To prevent data
leakage, all steps, including feature selection, feature engineering, and preprocessing, should
be performed after splitting the data (Kapoor & Narayanan, 2022; Reunanen, 2003). This
also holds for k-fold cross validation, where feature engineering and preprocessing must be
executed on each fold independently.
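
A minimal sketch of this practice is given below: by wrapping scaling and feature selection in a pipeline, both steps are re-fitted on the training portion of every cross-validation fold, so no information from the held-out fold leaks into preprocessing. The dataset and model are illustrative assumptions.

```python
# Leakage-safe preprocessing: the Pipeline is re-fitted inside every CV fold.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=400, n_features=30, n_informative=5, random_state=0)

pipeline = Pipeline([
    ("scale", StandardScaler()),              # fitted only on each fold's training split
    ("select", SelectKBest(f_classif, k=10)), # feature selection happens after splitting
    ("clf", LogisticRegression(max_iter=1000)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipeline, X, y, cv=cv)
print("fold accuracies:", scores.round(3))
```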

2.3.3 Model Selection

The range of usable models depends on the given use case and its requirements, such as explainability, prediction performance, and inference time, to name a few. These constraints narrow down the set of meaningful models. It is good practice to compare the selected model to standard baselines. Good starting points are models that are easy to train and established as state of the art for the use case. Also, rule-based systems or
heuristic algorithms can be used as a baseline. Trying different classes of models that have
shown good results on a given set of problems is always recommended (Ding, Tarokh, &
Yang, 2018; Lones, 2021). According to the no free lunch theorem, there is no single machine
learning algorithm that can outperform all others across all possible problems and datasets
(Wolpert, 2002). The choice of the algorithm needs to be appropriate for the underlying
problem class and the amount and quality of the available data. Usually, there are specialized algorithms for natural language processing, computer vision, time series, and other fields. Note that deep learning models will not always be the best option. Grinsztajn, Oyallon, and Varoquaux (2022) show that tree-based models often outperform deep learning on tabular data. In time series forecasting, there is no clear best algorithm class: statistical and tree-based methods are regularly on par with deep learning models (Zeng, Chen, Zhang, & Xu, 2022), and when deep learning does improve performance, this often requires a significant increase in computational resources (Makridakis et al., 2023). In general, simpler models are preferred over more complex models if they provide similar levels of performance, in line with the principle of Ockham's razor, because they are less prone to overfitting and easier to interpret.
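
The following sketch illustrates the recommended baseline comparison: a trivial baseline, a linear model, and a tree ensemble are evaluated under the same cross-validation protocol before a more complex model is even considered. The data and candidate set are illustrative assumptions.

```python
# Baseline comparison before committing to a complex model.
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=600, n_features=20, random_state=0)

candidates = {
    "majority-class baseline": DummyClassifier(strategy="most_frequent"),
    "logistic regression": LogisticRegression(max_iter=1000),
    "gradient boosting": HistGradientBoostingClassifier(random_state=0),
}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="f1")
    print(f"{name}: F1 = {scores.mean():.3f} +/- {scores.std():.3f}")
```

If the complex model does not clearly beat the simpler candidates, the simpler one should be preferred.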

2.3.4 Metrics

Similar to model choice, the metrics should also be selected carefully. Depending on the use
case, recall or precision might be more appropriate metrics than general ones like accuracy or
F1-score (Poretschkin et al., 2021). The choice of model also influences the choice of useful
metrics (Ferri, Hernández-Orallo, & Modroiu, 2009; Naser & Alavi, 2021). Furthermore,
the minimum acceptable score for the selected metrics should be determined so that the model creates value in the specific use case. It is always recommended to involve domain experts when defining these minimum requirements.
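
The short sketch below shows why accuracy alone can be misleading on imbalanced data and why recall and precision may be the more appropriate choice. The quality-inspection-style class imbalance is a hypothetical assumption for this example.

```python
# Why accuracy alone can mislead on imbalanced data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Roughly 95% "ok" parts vs. 5% "defect" parts (hypothetical imbalance).
X, y = make_classification(n_samples=2000, weights=[0.95], flip_y=0.02, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

y_pred = LogisticRegression(max_iter=1000).fit(X_train, y_train).predict(X_test)

print("accuracy :", round(accuracy_score(y_test, y_pred), 3))   # high almost by construction
print("precision:", round(precision_score(y_test, y_pred), 3))  # how many alarms are real
print("recall   :", round(recall_score(y_test, y_pred), 3))     # how many defects are caught
```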


2.3.5 Hyper-parameter Optimization


Hyper-parameter optimization (HPO) is usually the most computationally heavy task in model development, with up to hundreds of hyper-parameters to tune. Understanding the underlying problem and the algorithms used helps to reduce the search space for the optimal set of hyper-parameters. For large sets of hyper-parameters, there are advanced optimization techniques such as Bayesian optimization and early stopping. Feurer and Hutter (2019) and L. Yang and Shami (2020) give good overviews of this topic.
Furthermore, HPO is another pitfall for leakage. It is best to “lock” the test set to prevent leakage, i.e., to not use the test data until HPO is completed. In most cases, using (stratified) k-fold cross-validation is best practice. However, cross-validation and methods for HPO might introduce some unwanted variance (Bischl et al., 2023; Bouthillier et al., 2021; Wong & Yeh, 2019; L. Yang & Shami, 2020).
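
A minimal sketch of a "locked" test set is shown below: the test split is created first, hyper-parameters are tuned via cross-validation on the training data only, and the test data is touched exactly once at the end. The search space and model are illustrative assumptions.

```python
# HPO with a locked test set: tuning uses cross-validation on the training split only.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

X, y = make_classification(n_samples=800, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={"n_estimators": [50, 100, 200], "max_depth": [3, 5, 10, None]},
    n_iter=8, cv=5, random_state=0,
)
search.fit(X_train, y_train)                  # test data is never used during tuning

print("best parameters:", search.best_params_)
print("cross-validated score:", round(search.best_score_, 3))
print("final, one-shot test score:", round(search.score(X_test, y_test), 3))
```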

2.3.6 Model Evaluation


To evaluate model performance, prevent over-fitting or under-fitting, and test generaliz-
ability, models must be evaluated against test data. As stated above, the test set must
not be used for HPO or any kind of “intermediate evaluation”. The performance should
be evaluated against the previously defined minimum requirements for the use case. If the
performance of the model is low, there are many possible causes, e.g., noisy data, model
choice, or metrics. If the model over-performs on the test set, i.e., the test error is smaller
than the training error, one should be cautious of potential errors such as leakage. Braiek and Khomh (2018) and J. M. Zhang, Harman, Ma, and Liu (2020) give an overview of the topic. On the standardization side, ISO/IEC TS 4213 (2022) addresses the classification
performance of ML models.
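
The sketch below illustrates a simple final evaluation step: training and test scores are compared against a pre-defined minimum requirement, and suspicious gaps are flagged. The threshold and data are hypothetical assumptions.

```python
# Final evaluation against a pre-defined requirement, with basic sanity checks.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

MIN_TEST_ACCURACY = 0.85  # hypothetical use-case requirement agreed with domain experts

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
train_acc = accuracy_score(y_train, model.predict(X_train))
test_acc = accuracy_score(y_test, model.predict(X_test))

print(f"train accuracy {train_acc:.3f}, test accuracy {test_acc:.3f}")
if test_acc < MIN_TEST_ACCURACY:
    print("below the use-case requirement: revisit data, model choice, or metrics")
if test_acc > train_acc:
    print("test outperforms training: check for leakage or an unrepresentative test set")
```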

2.3.7 Model Confidence


Guo, Pleiss, Sun, and Weinberger (2017) show that the confidence values calculated by modern ML models are often poorly calibrated, i.e., the predicted probability of a class does not represent the true likelihood of that class. Even though this might not be a big issue in many cases, it can lead to unwanted behavior of models in production and unstable results. Therefore, confidence values should be evaluated and calibrated if necessary (Abdar et al., 2021; Guo et al., 2017). Instead of calibrating models post hoc, conformal prediction allows predictions to be made with a fixed confidence level (Shafer & Vovk, 2008).
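
As an illustration of post-hoc calibration, the sketch below compares the Brier score of raw model probabilities with cross-fitted isotonic calibration; lower is better. The data and model are illustrative assumptions, and conformal prediction would be an alternative route not shown here.

```python
# Checking and improving probability calibration with isotonic regression.
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

raw = RandomForestClassifier(random_state=0).fit(X_train, y_train)
calibrated = CalibratedClassifierCV(
    RandomForestClassifier(random_state=0), method="isotonic", cv=5
).fit(X_train, y_train)

print("Brier score, raw       :", round(brier_score_loss(y_test, raw.predict_proba(X_test)[:, 1]), 4))
print("Brier score, calibrated:", round(brier_score_loss(y_test, calibrated.predict_proba(X_test)[:, 1]), 4))
```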

2.3.8 Testing in Real-World Setting


Apart from model evaluation on the test set, a model must be tested and benchmarked in
the real-world setting (Arp et al., 2020; Larson et al., 2021). Domain experts should be
involved in the testing design, and it should ideally be done before the first model is trained.
The design should consider edge cases, disturbances, changes of the environment, and all
other likely settings and changes. Testing in a real-world setting will also uncover possible leakage and other shortcomings that might slip past the model evaluation. If real-world testing is not possible, computational stress tests should be performed (Young et al., 2021).


Model Card
Step | Requirement | AI Act | References
General | Provide name and details of a contact person | |
General | Provide name of the person who created the model | | Mitchell et al. (2019)
General | Document the creation date and version of the model | | Mitchell et al. (2019)
General | Describe intended use of the model | | Mitchell et al. (2019)
Description | Describe the architecture of the used model | | Mitchell et al. (2019)
Description | Describe the used hyper-parameters | |
Description | Document the training, validation, and test error | | Mitchell et al. (2019)
Description | Document computation complexity, training time and energy consumption (for foundation models describe steps taken to reduce energy consumption) | Art. 28 | García-Martín, Rodrigues, Riley, and Grahn (2019); T.-J. Yang, Chen, Emer, and Sze (2017)
Explainability and interpretability | Document the demand for explainability and interpretability | Art. 13 (1, 2), Art. 15 (2) |
Explainability and interpretability | Describe taken actions if any | Art. 15 (2) | Burkart and Huber (2021); ISO/IEC AWI TS 6254 (2023); Nauta et al. (2022); Ribeiro et al. (2016); Schaaf and Wiedenroth (2022)
Feature engineering, feature selection and preprocessing | Describe the consideration of explainability and interpretability (if relevant) | Art. 10 (3, 4) |
Feature engineering, feature selection and preprocessing | Describe feature engineering and selection with (domain specific) reasoning | | Cai et al. (2018); Kapoor and Narayanan (2022); Mitchell et al. (2019)
Feature engineering, feature selection and preprocessing | Describe and reason about preprocessing steps | | Kapoor and Narayanan (2022); Mitchell et al. (2019)
Model selection | Describe the consideration of explainability and interpretability (if relevant) | |
Model selection | Describe the baseline model and its evaluation | | Lones (2021)
Model selection | Describe the reasoning for the model choice | | Lones (2021)
Model selection | Describe why the complexity of the model is justified and needed | |
Model selection | Document the comparison to other considered models | |
Model selection | Document the approach of hyper-parameter optimization | | Bouthillier et al. (2021); Feurer and Hutter (2019); Godbole, Dahl, Gilmer, Shallue, and Nado (2023); Raschka (2018); Wong and Yeh (2019); L. Yang and Shami (2020)
Model selection | Describe the model evaluation | Art. 5, 6, 7, 9 | Braiek and Khomh (2018); ISO/IEC TS 4213 (2022); Raschka (2018); J. M. Zhang et al. (2020); X. Zhang et al. (2022)
Choice of metrics | Describe selected metrics and describe reasoning regarding use case and fairness | | Ferri et al. (2009); ISO/IEC CD TS 12791 (2023); Mitchell et al. (2019); Poretschkin et al. (2021); Wachter, Mittelstadt, and Russell (2021)
Choice of metrics | Define the minimum requirement for use in production (domain specific reasons) | |
Model confidence | Document and quantify uncertainty of the model | | Abdar et al. (2021)
Model confidence | Document approach of dealing with uncertainty | | Guo et al. (2017); Shafer and Vovk (2008)
Testing in real world setting | Describe the test design | Art. 5, 6, 7, 9 |
Testing in real world setting | Describe possible risks, edge cases and worst case scenarios and create (or simulate) them if possible | Art. 15 (3), Art. 28 |
Testing in real world setting | Describe limitations and shortcomings of the model | |
Testing in real world setting | Describe the test results | |
Testing in real world setting | Explain the derived actions | |
More | Describe further recommendations or shortcomings of the model | |
Involvement of Individuals | If feasible, describe involvement of affected individuals | Art. 29 (5) a) |


2.4 Model Operation


Proper handling of AI applications during deployment involves managing different aspects.
Technical correctness and desired performance of the AI application are one aspect, but IT security,
maintenance and MLOps, workforce acceptance, knowledge management, and legal concerns
must also be considered. Lawmakers are very clear in their demands (AI HLEG, 2019;
European Comission, 2021). Article 61 (1) of the AI Act states “Providers shall establish
and document a post-market monitoring system in a manner that is proportionate to the
nature of the artificial intelligence technologies and the risks of the high-risk AI system.”
The challenge is to establish specific methods to control assessed risks. Ultimately, decisions
need to be made individually for each use case. We give an overview of currently discussed topics and introduce our operation card as a de facto checklist.

2.4.1 Operating Concept


In order to use an AI application in productive operation, the operating concept defines the way the application interacts with existing processes. The AI Assessment Catalog refines this into a detailed description of the software components used and the APIs to other environments (Poretschkin et al., 2021). As AI falls under the category of software, Winter et al. (2021) suggest IT security precautions similar to those required for other high-impact software. Additionally, the ISO/IEC 27000 (2018) series concerning cyber security and
the more recent ISO/IEC 27001 (2022) provide valuable recommendations and best prac-
tice examples for establishing an IT security management system (ISMS). Furthermore, it
is beneficial to draw upon recommendations from classical software engineering practices
when formulating the operating concept. Schmidt (2013) states that “Concepts of operation
descriptions usually address the following:

1. Statement of the goals and objectives of the software product.

2. Mission statement that expresses the set of services the product provides.

3. Strategies, policies, and constraints affecting the product.

4. Organizations, activities, and interactions among participants and stakeholders.

5. Clear statement of responsibilities associated with product development and sustainment.

6. Process identification for distributing, training, and sustainment of the product.

7. Milestone decision definitions and authorities.”

Extending these requirements, the AI Act mandates in Art. 13 the disclosure of the use of
AI in outcome determination to users, with the level of detail depending on their function.
In all cases, it is essential to explain the reasoning behind AI’s decisions and the impact
of incoming data features, while also implementing a process to override decisions when
necessary. Additionally, for both regular and emergency tasks, a failsafe responsibility must
be established. Finally, it is recommended to provide training to employees on the usage of
the AI application to enhance acceptance and productivity.


2.4.2 Monitoring

To ensure technical correctness, continuous evaluation of the AI application is essential.


Depending on the risk level, it is advisable to establish either live monitoring or periodic
controls. Live monitoring of the application is recommended to quickly detect anomalies,
not only for emergency situations but also to enhance AI performance. The AI Act, as
specified in Article 9, mandates the implementation of a ‘risk management system’ that
assesses risks during deployment and evaluates ongoing mitigation strategies through an
iterative process. The objective is to identify conditions under which the AI system can
operate safely. Deviating from this realm of safe operation necessitates the deactivation of
the AI application and a transition to a failsafe software system.

2.4.3 Human-Computer-Interface and Transparency

For monitoring purposes and insights into the functioning of the AI, a suitable interface is
required. AI Act Art. 14 specifies: “It has to be guaranteed that the potentials, limitations
and decisions are transparent without bias.” It is proposed that the interface provide information at different levels of detail. This way, operating staff can be supported by the AI with crucial process-relevant information for carrying out their tasks, while data scientists might want deeper insights into training metadata to optimize learning curves. Avoiding bias is a current research topic. Kraus, Ganschow, Eisenträger, and
Wischmann (2021) point out that an explanation of the algorithm alone is not sufficient
for transparency. In accordance with the recommendations set forth by UNESCO (2022)
“Transparency aims at providing appropriate information to the respective addressees to
enable their understanding and foster trust. Specific to the AI system, transparency can
enable people to understand how each stage of an AI system is put in place, appropriate to
the context and sensitivity of the AI system. It may also include insight into factors that
affect a specific prediction or decision, and whether or not appropriate assurances (such
as safety or fairness measures) are in place. In cases of serious threats of adverse human
rights impacts, transparency may also require the sharing of code or datasets.”

2.4.4 Model Drift

Model drift summarizes the impact of changes in concept, data, or software and their effect
on the model predictions. Concept drift can occur when the underlying correlations of
input and output variables change over time, while data drift can occur when the data leaves the known range, and software drift can occur when, for example, the data pipeline is modified and data fields are renamed or units are changed. The effects of changes in the model ecosystem should be closely monitored. That is, to prevent low model performance caused by corrupted or unexpected input data, it is best practice to compare the live input data with the training data. Metrics such as f-divergences may be used to quantify differences in the data distribution and to derive limits for how far the live model performance may deviate from the training performance.
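
The following sketch shows a simple per-feature drift check of live data against the training distribution. A two-sample Kolmogorov-Smirnov test is used here as one simple alternative to the f-divergence measures mentioned above; the data, the drifted feature, and the alert threshold are illustrative assumptions.

```python
# Per-feature drift check of live data against the training (reference) distribution.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
X_train = rng.normal(loc=0.0, scale=1.0, size=(5000, 3))                 # reference data
X_live = rng.normal(loc=[0.0, 0.5, 0.0], scale=1.0, size=(1000, 3))      # feature 1 has drifted

ALERT_P_VALUE = 0.01  # hypothetical alerting threshold
for feature in range(X_train.shape[1]):
    stat, p_value = ks_2samp(X_train[:, feature], X_live[:, feature])
    status = "DRIFT suspected" if p_value < ALERT_P_VALUE else "ok"
    print(f"feature {feature}: KS statistic {stat:.3f}, p-value {p_value:.4f} -> {status}")
```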


2.4.5 Privacy
If personal data is collected in the course of the AI deployment, secure storage of this data
and deletion periods must be established with the help of a data protection officer. In
addition, a process for providing information about personal data must be created in order
to process requests from data subjects. The GDPR requires the minimization of personal data collection. Methods such as differential privacy may be used to privatize personal data during collection or processing (Dwork, Roth, et al., 2014).
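
As a minimal illustration of differential privacy, the sketch below applies the Laplace mechanism to a count query before the result is stored or reported. The query, data, and epsilon values are hypothetical assumptions for this example.

```python
# Laplace mechanism: add noise scaled to sensitivity/epsilon to a count query.
import numpy as np

rng = np.random.default_rng(0)

def dp_count(values, threshold, epsilon):
    """Differentially private count of entries above `threshold` (sensitivity of a count is 1)."""
    true_count = int(np.sum(values > threshold))
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

ages = rng.integers(18, 70, size=1000)  # toy personal attribute
print("epsilon=0.1:", round(dp_count(ages, 50, epsilon=0.1), 1))  # stronger privacy, noisier
print("epsilon=1.0:", round(dp_count(ages, 50, epsilon=1.0), 1))  # weaker privacy, more accurate
```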

2.4.6 MLOps
While the AI Act imposes no specific requirements for MLOps, except for ensuring the safe
operation of AI throughout its lifecycle, Poretschkin et al. (2021) provide several recom-
mendations regarding the implementation of secure MLOps practices.
To ensure uninterrupted operation, a well-regulated MLOps process is of utmost impor-
tance. The frequency of regular updates can be determined based on risk and technology
impact assessments. When there is a higher risk associated with deviations between the training data and the data seen in production, more frequent retraining should be considered. It is crucial
to store metadata related to the training process and model, including parameters used,
training data, and model performance, to enable reproducibility. This approach relies on
the availability of rapidly collected and accurately labeled new data. In situations where
such data collection and labeling are not feasible, additional oversight of the AI system by
trained personnel, such as data scientists, becomes necessary.
Moreover, based on the identified risks, it is advisable to establish a dedicated testing
period for updates to both the model and infrastructure. Smooth transitions necessitate
clear delineation of responsibilities for the commit and review processes, as well as de-
ployment. In cases where swift rollback is required, well-documented versioning including
data is essential. Additionally, employing a mirrored infrastructure enables updates to be
installed seamlessly without causing downtime.
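
A minimal experiment-tracking sketch is shown below: parameters, the data version, and metrics are recorded with MLflow so that a deployed model can be reproduced and rolled back. The experiment name, data version string, and model are hypothetical assumptions, and MLflow is only one of several suitable tools.

```python
# Recording training metadata for reproducibility (assumes `mlflow` is installed).
import mlflow
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

params = {"n_estimators": 100, "max_depth": 5}

mlflow.set_experiment("frame-inspection")        # hypothetical experiment name
with mlflow.start_run():
    mlflow.log_params(params)
    mlflow.log_param("data_version", "v1.2.0")   # link the run to a dataset version
    model = RandomForestClassifier(**params, random_state=0).fit(X_train, y_train)
    mlflow.log_metric("test_accuracy", model.score(X_test, y_test))
```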


Operation Card
Step | Requirement | AI Act | References
Scope and aim of monitoring | Describe monitored components | Art. 61 (1, 3) |
Scope and aim of monitoring | Assess risks and potential dangers according to Use Case Card | Art. 9 (2) |
Scope and aim of monitoring | List safety measures for risks | Art. 9 (4) | Poretschkin et al. (2021)
Operating concept | Create and document utilisation concept | Art. 13, Art. 17, Art. 16 |
Operating concept | Plan staff training | Art. 9 (4c) | Ahlfeld, Barleben, et al. (2017)
Operating concept | Determine responsibilities | | Poretschkin et al. (2021)
Autonomy of application | Document decision-making power of AI | Art. 14, Art. 17 | AI HLEG (2019); Poretschkin et al. (2021)
Autonomy of application | Determine process to overrule decisions of the AI | Art. 14 (4) d) e) | Poretschkin et al. (2021)
Responsibilities and measures | Document component-wise: assessed risk, control interval, responsibility, measures for emergency | Art. 9 | Ahlfeld et al. (2017); AI HLEG (2019); Poretschkin et al. (2021)
Model performance | Monitor input and output | Art. 17 (1) d), Art. 61 (2) | Poretschkin et al. (2021)
Model performance | Detect drifts in input data | | Poretschkin et al. (2021)
Model performance | Document metric in use to monitor model performance | Art. 15 (2) |
AI interface | Establish transparent decision making process | Art. 52 | Kraus et al. (2021); Poretschkin et al. (2021)
AI interface | Establish insight into model performance on different levels | Art. 14 | Poretschkin et al. (2021)
IT security | Document individual access to server rooms | Art. 15 | Amschewitz, Gesmann-Nuissl, et al. (2019); Winter et al. (2021)
IT security | Set and document needed clearance level for changes to AI/deployment/access regulation | | Amschewitz et al. (2019); Winter et al. (2021)
IT security | Establish and audit ISMS | | ISO/IEC 27000 (2018); ISO/IEC 27001 (2022)
Privacy | Justify and document use or waiver of a privacy preserving algorithm | Art. 10 (5) | Dwork et al. (2014); Poretschkin et al. (2021)
Privacy | If applicable, document privacy algorithm and due changes in the monitoring of output data | | Dwork et al. (2014)
MLOps | Establish versioned code repository of AI, training and deployment | | Poretschkin et al. (2021)
MLOps | Establish maintenance and update schedule | | Poretschkin et al. (2021)
MLOps | Set regulation for the retraining of the AI and decision basis for the replacement of a model | | Poretschkin et al. (2021)
MLOps | Determine responsibilities for updates | | Poretschkin et al. (2021)
Testing and rollout | Document software tests | Art. 17 (1) d) | Poretschkin et al. (2021)
Testing and rollout | Set period of time that an update must function stably before it is transferred to live status | |
Involvement of Individuals | If feasible, describe involvement of affected individuals | |


3. Summary and Conclusion


We introduce a guideline for documenting AI applications along the entire development process that extends previous work on model and data cards. For this purpose, we held interviews with certification experts, developers of AI applications preparing for the upcoming regulation of the European Union, and individuals affected by the introduction of an AI system. All their feedback was in-
corporated into our framework. In contrast to works on model and data cards, we cover the entire
development process, which results in defining two new cards: use case card and operation card.
Further, we add a perspective on regulation. Our work is also different from most regulation and
certification efforts by being more specific and following the development process of AI applications
instead of risk dimensions. Our guideline is meant to support the development of trustworthy and
easy to audit AI systems from the beginning and should lay the foundations for a future certification
of a system. Although AI regulation is still in an early phase and many changes may be expected,
the main foundations and associated risks seem to have emerged. Hence, this work can be seen as
a snapshot of the current state and will evolve over time but should already position practitioners
well when legal requirements come into force.
While our guideline is more specific than many previous works, both the field of AI but also
the regulation are progressing fast. Hence, certain changes or specifications can be expected. For
example, best practices for creating datasets are likely to change in the future. For this purpose, a structured and regularly updated collection of best practices and safety measures applied by different organizations, similar to the AI Incidents Database or the Risk-Classification Database, could be maintained.

Acknowledgments

This work builds on insight from the research projects veoPipe and ML4Safety and a collaboration
with Audi AG. We thank Hannah Kontos and Diana Fischer-Preßler for proofreading.


Appendix A. Toy Audit


We applied our cards to one real-world use case at our industry partner as an example. However, in order to avoid revealing confidential information, we present the results of a toy audit here in the paper. Consider a large bicycle manufacturer producing around 50 000 pieces per year. The manufacturer plans to upgrade its product portfolio with carbon fiber frames due to their lightness. The number of bicycles with carbon frames is estimated to be around 5 000 pieces per year. However, carbon frames have one drawback: they can break easily if something goes wrong during the production process. For this reason, the manufacturer plans to use a two-fold quality inspection consisting of a regularly recurring computed tomography scan every 100 pieces, which can detect production errors reliably but at a very high cost. On top of that, each frame shall undergo a visual quality inspection using computer vision before shipping (a binary classification problem into defective and non-defective parts). If the computer vision (CV) tool predicts a defect, a worker will double-check the frame.


Use Case Card Part 1
Step | Requirement | Measures
General | Name contact person | John Doe, Project Manager
General | List groups of people involved | John Doe, Jane Roe, Richard Miles
General | Summarize the use case shortly | Produced carbon fiber bicycle frames shall undergo a quality inspection. Using computed tomography (CT) scans yields high precision, but would be associated with high costs. To reduce costs, a machine learning model shall perform automated visual quality inspections on every frame and expensive CT-scans will be performed less often. If a defect is predicted, the frame will be inspected by a trained professional. To ensure quality standards, every 100th frame will be inspected with a CT image.
General | Describe the status quo | To this day, carbon fiber bicycle frames can only be inspected manually or by CT-scans. Current non-carbon-fiber frames are stress-tested and inspected during final quality control. During the initial year, 5 000 carbon fiber frames were produced and manually inspected, with 300 defective parts detected. For a larger database, an additional 200 images of defective frames were obtained from a service partner.
General | Describe the planned interaction of sub-components (e.g., different software modules or hardware and software) | After the pressing process, fabricated parts will be photographed by industrial high-speed cameras and evaluated by the machine learning model. The predictions of the ML model will be visualized with an interface for quality management workers; predicted defects will be highlighted and require a manual inspection result. For each frame produced, a randomized algorithm determines whether a CT-scan should be performed. Shift workers will then initiate the CT-scan process and document the result.
General | Provide a short solution summary | The business problem can be translated into a binary classification problem, which can be solved by means of industrial CV.
Problem definition | Clearly describe the learning problem | Carbon fiber frames are highly vulnerable to tears and deformations, which can result in stability issues and thus breaking of the frame. In use, this could cause accidents and harm to cyclists. Manual and CT-scan inspections are too costly to be performed for every frame. A solution that uses algorithms must work accurately, but fast enough not to delay production times.
Problem definition | Describe the disadvantages of the current approach | High costs for CT-scans; manual visual inspections are not feasible due to high time consumption and, as a mundane task, are not appealing to the workforce.
Problem definition | Argue why classical approaches are not sufficient | Rule-based algorithms may not yield the necessary confidence.
Solution approach | Describe the integration into the current workflow | Images of the frames are taken automatically. The ML model will evaluate the images in a few seconds and present the results to the current shift workers via an interface. Detected defective frames will automatically be marked for manual inspection. The pass/fail labels assigned by the model can be manually overwritten by the workforce. Workers will perform manual inspections on defective parts before submitting the frames to further assembly.
Solution approach | Formulate the learning problem | Binary label classification with optional heatmaps / attribution maps applied to the predictions.
Solution approach | Formulate the KPIs for go-live | A recall above 99% is targeted while preserving a precision of more than 90%, and the maximum inference time should not exceed 1 second.
Solution approach | Provide a short data description | Photos are taken with an industrial camera. Both sides of the frame are photographed. The pictures are of size 1024 x 1024 pixels.


Use Case Card Part 2
Risk Assessment
Step | Requirement | Measures
Risk Class | Categorize the application into a risk class according to the AI Act | Low risk. Although the carbon fiber frame is a safety relevant part of the bike, the AI is used for quality control and is not part of the final product.
Human agency and oversight | Rate the level of autonomy | Low risk, since the AI only gives a suggested quality label. Manual overwrite is always possible.
Technical robustness and safety | Evaluate the danger to life and health | Misclassified broken frames might be shipped and customers may be hurt.
Technical robustness and safety | Identify possibilities of non-compliance | No risk; standard safety requirements are fulfilled without the use of an AI, the AI is an additional layer of security.
Technical robustness and safety | Identify customer relevant malfunction | No risk; the AI is solely used for internal quality control.
Technical robustness and safety | Identify internal malfunction | Low risk of production stoppage or higher production costs.
Privacy and data governance | List risks connected with customer data | No risk (no data collected).
Privacy and data governance | List risks connected with employee data | Risk of analyzing the quality of work of production staff.
Privacy and data governance | List risks connected with company data | Low risk of production flaws being leaked to the outside world.
Transparency | Evaluate effects of incomprehensible decisions or the use of a black box model | Might result in trust issues in the AI within the workforce. This will lead to either more manual inspections, and with that higher costs, or fewer inspections and thus lower quality.
Diversity, non-discrimination and fairness | Check for possible manipulation of groups of people | No risk.
Diversity, non-discrimination and fairness | Check for discrimination of groups of people regarding sensitive attributes | No risk.
Societal and environmental well being | List possible dangers to the environment | No risk. Good quality assurance reduces waste.
Societal and environmental well being | Consider ethical aspects | No risk.
Societal and environmental well being | Evaluate effects on corporate actions | An ELSI-Group (Ethical, Legal and Social Implications) has been formed by the workers' council.
Societal and environmental well being | Identify impact on the staff | The AI is solely used to ensure quality in the product. No devaluation of workforce quality is practiced. No analyses are conducted on the quality of work of the production staff and no employment contracts are terminated due to the use of AI.
Accountability | Estimate the financial damage on failure | Broken frames might be shipped and need to be replaced. If customers get injured, they might file lawsuits. Misclassified flawless frames might get disposed of.
Accountability | Estimate the image damage on failure | Shipped misclassified broken frames might associate the company with poor quality.
Norms | List of relevant norms in this context | ISO 43.150 Cycles


Data Card
Step | Requirement | Measures
General | Name originator of the dataset and provide contact person | John Doe, Project Manager.
General | Describe the intended use of the dataset | Train a supervised ML model for the recognition of defective bike frames.
General | Describe licensing and terms of usage | Part of the data was recorded internally and can be used without any restrictions. Another part of the data is provided by a service partner under a specific licensing agreement.
Data description | Collect requirements for the data before starting data collection | Before starting the data collection process, domain experts were involved to discuss how defects occur on the bikes. We discussed the proper amount of data and the necessary resolution of the images.
Data description | Describe a data point with its interpretation | One data point consists of two RGB images with a resolution of 1024×1024 pixels. One image shows the frame from the left, the other one from the right. Each image was taken with an industrial camera of type [MODEL].
Data description | Maybe, provide additional documentation to understand the data | Not necessary, image data are human interpretable.
Data description | If there is GDPR relevant data, describe it | No personal data.
Data description | If there is biometric data, describe it | No biometric data.
Data description | If there is copyrighted data, summarize it | No copyrighted data.
Data description | If there is business relevant information, describe it | Yes, the data of the intact bike frames provides information about the glueing technique for the carbon fiber frames. The defective bike frames contain information about weaknesses and unstable parts. Both types of information could be relevant for competitors.
Collection | Describe the data collection procedure and the data sources | Data was systematically collected by our company during the first year as well as by our service partner, using commercial industrial cameras of type [MODEL] under diffuse lighting conditions in both cases. We collected data by taking the images of the frames directly after production. Our service partner captured images of the frames it got from its customers. We merged the data and visually checked for homogeneity. After that, we labeled the data using an external labeling company.
Collection | Use data version control | A data versioning system is applied in the cloud.
Collection | Consider prior requirements for data | The exact amount of data was collected as defined above.
Collection | Include and describe metadata | We include the type of bicycle frame as well as the exact date the image was taken.
Collection | Involve domain experts and describe their involvement | Domain experts were involved during the collection of requirements for the dataset. They also provided instructions and a gold standard dataset for the labeling company. They made several spot checks to control for correctness of the labels afterwards.
Collection | Describe technical measures to ensure completeness of data | We performed an embedding followed by a clustering to identify groups of bike frames other than the different bike models. Those clusters were checked by domain experts. The sales numbers of our different bike models are accurately represented in the data. Also, we tried to include enough defect data to make sure the trained model learns how a defect looks in the images.
Collection | Describe and record edge cases | One edge case is where the use of too much glue could lead to a "defect" label. This also happened a few times during the labeling process.
Collection | If personal data is used, make sure and document that all individuals know they are part of the data | No individuals are represented in this dataset.
Collection | Describe if and how the data could be misused | It could be used by competitors to find weaknesses and important design choices in our product line.
Collection | If fairness is identified as a risk, list sensitive attributes | No information is saved that could reconstruct the identity of workers.
Collection | If applicable, address data poisoning | Not applicable.
Labeling | If applicable, describe the labeling process | The labeling company got a short briefing and a gold-standard labeling set from our domain experts. After the labeling, we made spot checks on the dataset.
Labeling | If applicable, describe how the label quality is checked | Domain experts perform spot checks.
Splitting | Create and document meaningful splits with stratification | We divided the dataset into easy and hard data points. Among the hard data points there were new bike models previously not seen by the model.
Splitting | Describe how data leakage is prevented | We use the same camera type for all images. After training, we perform heatmap analysis to ensure the model focuses on the bike frame.
Splitting | Recommendation: test the splits and variance via cross-validation | The variance between several evaluations on the test set is below 1%.
Splitting | Recommendation: split dataset into difficult, trivial and moderate | An additional (hard) test set was created that contains novel frames from the service provider that are not part of the training data. This allows us to evaluate the applicability of our model to future products.
Splitting | Recommendation: put special focus on label quality of test data | The gold-standard data is included in the test set and all test labels were inspected by two separate experts.
Splitting | Reminder: Perform separate data preprocessing on the splits | Done.
Preprocessing | Document and motivate all processing steps that are a fixed part of the data | The image data is normalized from 0 to 1 to ensure a good behaviour of the convolutional neural network (CNN).
Preprocessing | Document whether the raw data can be accessed | The raw data is stored on our servers and can be accessed at any time by us. It is not publicly available.
Preprocessing | If sensitive data is available highlight (pseudo)-anonymization | No (pseudo)-anonymization needed.
Preprocessing | If fairness is a risk, highlight fairness specific preprocessing | We do not record information that would allow conclusions about the identity of specific workers.
Analyzing | Understand and document characteristics of test and training data | We have 4 700 intact bike frames and 500 bike frames with defects. 300 of the defective frames are from our own production and are identical to our frames, 200 are from the service provider. We use 20% of the data for the test split.
Analyzing | Document why the data distribution fits the real conditions or why this is not necessary for the use case | Images were taken directly at our test production line and at the location of our service provider using real intact and defective frames.
Analyzing | Document limitations such as errors, noise, bias or known confounders | It is possible that non-superficial defects cannot be identified by a visual check.
Serving | Describe how the dataset will be maintained in future | New images from the production line and the service provider will be added to the dataset after labeling by the external company.
Serving | Describe the storage concept | We operate a server with redundancy to store the image data.
Serving | Describe the backup procedure | A compressed and encrypted backup of the current dataset is saved at an external cloud storage provider.
Serving | If necessary, document measures against data poisoning | Not necessary.


Model Card
Step | Requirement | Measures
General | Provide name and details of a contact person | Jane Doe, Project Manager
General | Provide name of the person who created the model | John Doe, ML Engineer
General | Document the creation date and version of the model | 01-2023, Version 1.0
General | Describe intended use of the model | Detection of defects in carbon fiber frames
Description | Describe the architecture of the used model | Custom CNN
Description | Describe the used hyper-parameters | Six convolution layers with ReLU activation followed by max-pooling. Sigmoid activation in the final layer. The model is trained for 200 epochs with learning rate 1e-6.
Description | Document the training, validation, and test error | Recall on the training set and on both the moderate and hard test sets is 100%. However, the precision is only around 90%. Hence, the recall is good enough to apply the solution, but an improved precision would further reduce manual checks.
Description | Document computation complexity, training time and energy consumption (for foundation models describe steps taken to reduce energy consumption) | The model was trained on cloud resources using 6 NVIDIA A100 GPUs for 32 hours. Estimated energy consumption: 70 kWh.
Explainability and interpretability | Document the demand for explainability and interpretability | The model should provide heatmaps to workers to help them detect regions of interest.
Explainability and interpretability | Describe taken actions if any | Grad-CAM (or similar) will be applied post-hoc during operation.
Feature engineering, feature selection and preprocessing | Describe the consideration of explainability and interpretability (if relevant) | Not relevant because there is no feature engineering.
Feature engineering, feature selection and preprocessing | Describe feature engineering and selection with (domain specific) reasoning | No feature engineering.
Feature engineering, feature selection and preprocessing | Describe and reason about preprocessing steps | Data augmentation in the form of brightness variations and Gaussian noise.
Model selection | Describe the consideration of explainability, interpretability and fairness (if relevant) | A standard heatmap method must be applicable to the model.
Model selection | Describe the baseline model and its evaluation | Multiple baselines were evaluated: a linear model, a ViT, and smaller and larger CNNs.
Model selection | Describe the reasoning for the model choice | The final model was chosen based on a trade-off between performance and model size.
Model selection | Describe why the complexity of the model is justified and needed | Yes, smaller or simpler models performed worse. Larger models increased the inference time too much.
Model selection | Document the comparison to other considered models | Using a deeper model such as ResNet only marginally improves performance by less than 1%.
Model selection | Document the approach of hyper-parameter optimization | Random search. Increases test performance by around 5%.
Model selection | Describe the model evaluation | The model is evaluated w.r.t. recall and precision on both the moderate and hard test sets and, additionally, tested in production.
Choice of metrics | Describe selected metrics and describe reasoning regarding use case and fairness | The evaluated metrics are recall and an inference time below 0.5 seconds. Fairness is not applicable here.
Choice of metrics | Define the minimum requirement for use in production (domain specific reasons) | Recall above 99% is the target, either purely from the model or by including a human in the loop. A precision of 90% is targeted but not mandatory.
Model confidence | Document and quantify uncertainty of the model | Model confidence was not assessed, since the model was trained w.r.t. recall.
Model confidence | Document approach of dealing with uncertainty | None, see above.
Testing in real world setting | Describe test design | During the initial phase (expected to be the first year) all frames will additionally be checked manually. During this phase, faulty frames will be injected into the line in order to validate the AI solution.
Testing in real world setting | Describe possible risks, edge cases and worst case scenarios and create (or simulate) them if possible | Edge cases are tested both using the different test sets (hard and medium) and through injection of faulty parts.
Testing in real world setting | Describe limitations and shortcomings of the model | Changes in the production process or in the architecture of the frames might introduce new faults that might not be detected by the model.
Testing in real world setting | Describe the test results | The test is still running but revealed problems of the AI solution under specific lighting conditions.
Testing in real world setting | Explain the derived actions | The dataset was extended to include such conditions.


Operation Card
Step | Requirement | Measures
General
Scope and aim of monitoring | Describe monitored components | The CV system as a whole is monitored. Input and output of the model are evaluated separately. Additionally, the function of the cameras is monitored.
Scope and aim of monitoring | Assess risks and potential dangers according to Use Case Card | Undetected faults lead to the shipment of broken frames. False alarms might lead to ineffective manual inspections and trust issues among personnel.
Scope and aim of monitoring | List safety measures for risks | Monitoring of input and output data. Regularly recurring computed tomography enables assessment of model performance. Manual inspections enable the assessment of precision.
Operating concept | Create and document utilisation concept | If the CV system reports possible damage of a frame, staff performs a standardized, thorough visual and tactile examination.
Operating concept | Plan staff training | Each employee who conducts the examination needs to take a training session with a test. Staff training on the user interface and AI basics is mandatory.
Operating concept | Determine responsibilities | The shift supervisor is responsible for the examination and dedicated IT personnel for the CV system.
Autonomy of the application | Document decision-making power of AI | The AI system checks autonomously and only notifies personnel if faults are detected. The user interface allows checking the system and the images taken by the cameras at all times.
Autonomy of the application | Determine process to overrule decisions of the AI | Regularly recurring computed tomography tests overrule the CV system's decision. Personnel can overrule the AI system's decision after manual inspection of a frame.
Responsibilities and measures | Component-wise: assessed risk, control interval, responsibility, measures for emergency | Risks of system components (CV model, cameras, data processing pipeline and user interface) are assessed separately. Further detailed information is attached to this document.
Model performance | Monitor input and output | Input: the distribution of pixel values is monitored. Output: the number and distribution of defective frames are monitored over a fixed time period.
Model performance | Detect drifts in input data | Input data is monitored with the kernel MMD drift detector from TorchDrift.
Model performance | Document metric in use to monitor model performance | Model performance is monitored via precision and recall. If precision is over 95% or under 85%, personnel will be alerted. If the CV system misses a defective frame (detected by computed tomography), personnel will be alerted immediately. If the percentage of defective frames increases or decreases over time, personnel is alerted immediately.
AI Interface | Establish transparent decision making process | Decision making is made transparent with heatmaps and displayed on the user interface.
AI Interface | Establish insight into model performance on different levels | Production staff: the current frame and heatmaps are displayed, including heatmaps of frames identified as defective. IT personnel: camera images and heatmaps analogous to production staff; additionally, access to logs, performance metrics and monitoring as described.
IT security | Document individual access to server rooms | Access to server rooms is restricted to authorized personnel only.
IT security | Set and document needed clearance level for changes to AI/deployment/access regulation | Only trained and authorized personnel can make changes to the CV system. Changes need to be verified by a second person to be applied.
IT security | Establish and audit ISMS | The IT system in which the AI application will be integrated is certified according to ISO 27001. A review of the system will be conducted with an IT-security manager to evaluate the need for re-certification.
Privacy | Justify and document use or waiver of a privacy preserving algorithm | Not applicable, since no personal data is collected.
Privacy | If applicable, document privacy algorithm and due changes in the monitoring of output data | Not applicable.
MLOps | Establish versioned code repository of AI, training and deployment | Version control is implemented with Git and MLflow.
MLOps | Establish maintenance and update schedule | The ML model is retrained every 3 months (if new data is available) on a blend of old and new data, or whenever a drift is detected.
MLOps | Set regulation for the retraining of the AI and decision basis for the replacement of a model | The model is updated with new training data. The update goes live if the recall does not drop and the precision increases or remains constant. If performance measures drop twice in a row after retraining, the data will be analyzed.
Testing and rollout | Determine responsibilities for updates | Updates are managed by dedicated IT personnel. Changes need to be verified by a second person to be applied.
Testing and rollout | Document software tests | Software tests are automated and documented.
Testing and rollout | Set period of time that an update must function stably before it is transferred to live status | New software and model versions run separately in a test environment parallel to the production environment for one week. Performance is evaluated and compared to the old version before deployment.


Appendix B. Content Themes to Describe a Dataset according to Gebru et al. (2018)
(1) About the publishers of the dataset and access to them
(2) The funding of the dataset
(3) The access restrictions and policies of the dataset
(4) The wipeout and retention policies of the dataset
(5) The updates, versions, refreshes, additions to the data of the dataset
(6) Detailed breakdowns of features of the dataset
(7) If there are attributes missing from the dataset or the dataset’s documentation
(8) The original upstream sources of the data
(9) The nature (data modality, domain, format, etc.) of the dataset
(10) What typical and outlier examples in the dataset look like
(11) Explanations and motivations for creating the dataset
(12) The intended applications of the dataset
(13) The safety of using the dataset in practice (risks, limitations, and trade-offs)
(14) The maintenance status and version of the dataset
(15) Difference across previous and current versions of the dataset
(16) Expectations around using the dataset with other datasets or tables (feature engineering, join-
ing, etc.)
(17) The data collection process (inclusion, exclusion, filtering criteria)
(18) How the data was cleaned, parsed, and processed (sampling, filtering, etc.)
(19) How the data was rated in the dataset, its process, description and/or impact
(20) How the data was labelled in the dataset, its process, description and/or impact
(21) How the data was validated in the dataset, its process, description and/or impact
(22) The past usage and associated performance of the dataset (eg. models trained)
(23) Adjudication policies related to the dataset (labeller instructions, inter-rater policies, etc.)
(24) Regulatory or compliance policies associated with the dataset (GDPR, licensing, etc.)
(25) Dataset Infrastructure and/or pipeline implementation
(26) The descriptive statistics of the dataset (mean, standard deviations, etc.)
(27) Any known patterns (correlations, biases, skews) within the dataset
(28) Any socio-cultural, geopolitical, or economic representation of people in the dataset
(29) Fairness-related evaluations and considerations of the dataset
(30) Definitions and explanations for technical terms used in the dataset’s documentation (metrics,
industry-specific terms, acronyms)
(31) Domain-specific knowledge required to use the dataset


References
Abdar, M., Pourpanah, F., Hussain, S., Rezazadegan, D., Liu, L., Ghavamzadeh, M., . . .
others (2021). A review of uncertainty quantification in deep learning: Techniques,
applications and challenges (Vol. 76). Elsevier.
Ahlfeld, M., Barleben, T., et al. (2017). Industrie 4.0 – how well the law is keeping
pace. Federal Ministry for Economic Affairs and Climate Action, Federal Ministry
of Education and Research. Retrieved from https://www.plattform-i40.de/IP/
Redaktion/EN/Downloads/Publikation/i40-how-law-is-keeping-pace.html
AI Risk Management Framework. (2023). Artificial intelligence risk management frame-
work (ai rmf 1.0) (Standard). USA: National Institute of Standards and Technology.
Retrieved from https://doi.org/10.6028/NIST.AI.100-1
Amschewitz, D., Gesmann-Nuissl, et al. (2019). Artificial intelligence and law in the context
of industrie 4.0. Federal Ministry for Economic Affairs and Climate Action, Federal
Ministry of Education and Research. Retrieved from https://www.plattform-i40
.de/IP/Redaktion/EN/Downloads/Publikation/AI-and-Law.html
Arai, K., & Kapoor, S. (Eds.). (2020). Advances in computer vision. Springer International
Publishing. Retrieved from https://doi.org/10.1007%2F978-3-030-17795-9 doi:
10.1007/978-3-030-17795-9
Arp, D., Quiring, E., Pendlebury, F., Warnecke, A., Pierazzi, F., Wressnegger, C., . . .
Rieck, K. (2020). Dos and don’ts of machine learning in computer security (Vol.
abs/2010.09470). Retrieved from https://arxiv.org/abs/2010.09470
Ashmore, R., Calinescu, R., & Paterson, C. (2021, may). Assuring the machine learning
lifecycle: Desiderata, methods, and challenges (Vol. 54) (No. 5). New York, NY, USA:
Association for Computing Machinery. Retrieved from https://doi.org/10.1145/
3453444 doi: 10.1145/3453444
Ashton, H., & Franklin, M. (2022). The problem of behaviour and preference manipulation
in ai systems. In Ceur workshop proceedings (Vol. 3087).
Barocas, S., & Selbst, A. D. (2016). Big data’s disparate impact (Vol. 104).
Beleites, C., Neugebauer, U., Bocklitz, T., Krafft, C., & Popp, J. (2013, 1). Sample size
planning for classification models (Vol. 760). doi: 10.1016/j.aca.2012.11.007
Bellamy, R. K. E., Dey, K., Hind, M., Hoffman, S. C., Houde, S., Kannan, K., . . . Zhang,
Y. (2018). Ai fairness 360: An extensible toolkit for detecting, understanding, and
mitigating unwanted algorithmic bias. arXiv. Retrieved from https://arxiv.org/
abs/1810.01943 doi: 10.48550/ARXIV.1810.01943
Bird, S., Dudı́k, M., Edgar, R., Horn, B., Lutz, R., Milan, V., . . . Walker,
K. (2020, May). Fairlearn: A toolkit for assessing and improving fair-
ness in ai (Tech. Rep. No. MSR-TR-2020-32). Microsoft. Retrieved from
https://www.microsoft.com/en-us/research/publication/fairlearn-a
-toolkit-for-assessing-and-improving-fairness-in-ai/
Bischl, B., Binder, M., Lang, M., Pielok, T., Richter, J., Coors, S., . . . others (2023).
Hyperparameter optimization: Foundations, algorithms, best practices and open chal-
lenges.
Bommasani, R., Klyman, K., Zhang, D., & Liang, P. (2023). Do foundation model providers
comply with the eu ai act? Retrieved from https://crfm.stanford.edu/2023/06/


15/eu-ai-act.html
Bostrom, N., & Yudkowsky, E. (2018). The ethics of artificial intelligence. In Artificial
intelligence safety and security (pp. 57–69). Chapman and Hall/CRC.
Botha, J., & Pieterse, H. (2020). Fake news and deepfakes: A dangerous threat for 21st
century information security. In Iccws 2020 15th international conference on cyber
warfare and security. academic conferences and publishing limited (p. 57).
Bouthillier, X., Delaunay, P., Bronzi, M., Trofimov, A., Nichyporuk, B., Szeto, J., . . . others
(2021). Accounting for variance in machine learning benchmarks (Vol. 3).
Braiek, H. B., & Khomh, F. (2018). On testing machine learning programs (Vol.
abs/1812.02257). Retrieved from http://arxiv.org/abs/1812.02257
Burkart, N., & Huber, M. F. (2021, jan). A survey on the explainability of supervised
machine learning (Vol. 70). AI Access Foundation. Retrieved from https://doi.org/
10.1613%2Fjair.1.12228 doi: 10.1613/jair.1.12228
Cai, J., Luo, J., Wang, S., & Yang, S. (2018). Feature selection in machine learning: A
new perspective (Vol. 300). Elsevier.
Castelvecchi, D. (2016). Can we open the black box of ai? (Vol. 538) (No. 7623).
Coghlan, N., & Steiert, M. (2020). The charter of fundamental rights of the european union
: the ’travaux préparatoires’ and selected documents,. European University Institute,.
Retrieved from https://hdl.handle.net/1814/68959
Commission, E., Directorate-General for Communications Networks, C., & Technology.
(2019). Ethics guidelines for trustworthy ai. Publications Office. doi: DOI/10.2759/
346720
Council of European Union. (2019). Handbook on european non-discrimination law : 2018
edition. Publications Office of the European Union. doi: doi/10.2811/792676
Council of European Union. (2022). Proposal for a directive of the european parliament and
of the council on liability for defective products. Retrieved from https://eur-lex
.europa.eu/legal-content/EN/TXT/?uri=CELEX:52022PC0495
Council of the European Union. (2022). Proposal for a regulation of the european parliament
and of the council laying down harmonised rules on artificial intelligence (artificial
intelligence act) and amending certain union legislative acts. Retrieved from https://
data.consilium.europa.eu/doc/document/ST-14336-2022-INIT/en/pdf
De Cristofaro, E. (2020). An overview of privacy in machine learning. arXiv. Retrieved
from https://arxiv.org/abs/2005.08679 doi: 10.48550/ARXIV.2005.08679
Dietterich, T. G. (1998, oct). Approximate statistical tests for comparing supervised classi-
fication learning algorithms (Vol. 10) (No. 7). Cambridge, MA, USA: MIT Press.
Retrieved from https://doi.org/10.1162/089976698300017197 doi: 10.1162/
089976698300017197
DIN SPEC 92001-3. (2023). Künstliche intelligenz - life cycle prozesse und
qualitätsanforderungen - teil 3: Erklärbarkeit (Standard). Berlin, DE: DIN Deutsches
Institut für Normung e. V.
Ding, J., Tarokh, V., & Yang, Y. (2018). Model selection techniques: An overview (Vol. 35)
(No. 6). IEEE.
Dubber, M. D., Pasquale, F., & Das, S. (2020). The oxford handbook of ethics of ai. Oxford
Handbooks.
Dwork, C., Roth, A., et al. (2014). The algorithmic foundations of differential privacy


(Vol. 9) (Nos. 3–4). Now Publishers, Inc.


Ehsan, U., Liao, Q. V., Muller, M., Riedl, M. O., & Weisz, J. D. (2021). Expanding
explainability: Towards social transparency in ai systems. In Proceedings of the 2021
chi conference on human factors in computing systems (pp. 1–19).
El-Shamouty, M., Titze, J., Kortik, S., Kraus, W., & Huber, M. F. (2022). Glir: A practical
global-local integrated reactive planner towards safe human-robot collaboration. In
2022 ieee 27th international conference on emerging technologies and factory automa-
tion (etfa) (p. 1-8). doi: 10.1109/ETFA52439.2022.9921583
European Commission. (2021). Proposal for a regulation of the european parliament and of the
council laying down harmonised rules on artificial intelligence (artificial intelligence
act) and amending certain union legislative acts. Retrieved from https://eur-lex
.europa.eu/legal-content/EN/TXT/?uri=CELEX:52021PC0206
European Commission. (2023a). Conformity assessment. https://trade.ec.europa.eu/
access-to-markets/en/content/standards-and-conformity-assessment. (Ac-
cessed: 2023-04-21)
European Commission. (2023b). Harmonised standards. https://single-market-economy
.ec.europa.eu/single-market/european-standards/harmonised-standards_en.
(Accessed: 2023-04-21)
European Parliament. (2023). Draft compromise amendments on the draft report: Proposal
for a regulation of the european parliament and of the council on harmonised rules on
artificial intelligence (artificial intelligence act) and amending certain union legisla-
tive acts. Retrieved from https://www.europarl.europa.eu/resources/library/
media/20230516RES90302/20230516RES90302.pdf
Feldman, D. (2020). Core-sets: An updated survey (Vol. 10). doi: 10.1002/widm.1335
Feldman, V. (2020). Does learning require memorization? A short tale about a long tail. Retrieved from https://arxiv.org/abs/1906.05271 doi: 10.1145/3357713.3384290
Felzmann, H., Villaronga, E. F., Lutz, C., & Tamò-Larrieux, A. (2019). Transparency you
can trust: Transparency requirements for artificial intelligence between legal norms
and contextual concerns (Vol. 6) (No. 1). SAGE Publications Sage UK: London,
England.
Ferri, C., Hernández-Orallo, J., & Modroiu, R. (2009). An experimental comparison of
performance measures for classification (Vol. 30) (No. 1). Elsevier.
Feurer, M., & Hutter, F. (2019). Hyperparameter optimization. Springer International
Publishing.
Figueroa, R. L., Zeng-Treitler, Q., Kandula, S., & Ngo, L. H. (2012). Predicting sample size
required for classification performance (Vol. 12). BioMed Central Ltd. doi: 10.1186/
1472-6947-12-8
Fredrikson, M., Jha, S., & Ristenpart, T. (2015). Model inversion attacks that exploit
confidence information and basic countermeasures. In Proceedings of the 22nd acm
sigsac conference on computer and communications security (p. 1322–1333). New
York, NY, USA: Association for Computing Machinery. Retrieved from https://
doi.org/10.1145/2810103.2813677 doi: 10.1145/2810103.2813677
García-Martín, E., Rodrigues, C. F., Riley, G., & Grahn, H. (2019). Estimation of energy consumption in machine learning (Vol. 134). Elsevier.
Gasser, U., & Schmitt, C. (2020). The role of professional norms in the governance of
artificial intelligence. In The oxford handbook of ethics of ai (p. 141). Oxford University
Press Oxford.
Gebru, T., Morgenstern, J., Vecchione, B., Vaughan, J. W., Wallach, H. M., III, H. D., &
Crawford, K. (2018). Datasheets for datasets (Vol. abs/1803.09010). Retrieved from
http://arxiv.org/abs/1803.09010
Glikson, E., & Woolley, A. W. (2020). Human trust in artificial intelligence: Review of
empirical research (Vol. 14) (No. 2). Briarcliff Manor, NY.
Godbole, V., Dahl, G. E., Gilmer, J., Shallue, C. J., & Nado, Z. (2023). Deep learning tuning
playbook. Retrieved from http://github.com/google/tuning_playbook (Version
1.0)
Grinsztajn, L., Oyallon, E., & Varoquaux, G. (2022). Why do tree-based models still
outperform deep learning on tabular data?
Gualdi, F., & Cordella, A. (2021). Artificial intelligence and decision-making: The question
of accountability. IEEE Computer Society Press.
Guo, C., Pleiss, G., Sun, Y., & Weinberger, K. Q. (2017). On calibration of modern neural
networks. In International conference on machine learning (pp. 1321–1330).
Hagendorff, T. (2020). The ethics of ai ethics: An evaluation of guidelines (Vol. 30) (No. 1).
Springer.
ISO6254. (2023). Information technology — artificial intelligence — objectives and ap-
proaches for explainability of ml models and ai systems (Standard). Geneva, CH:
International Organization for Standardization.
ISO/IEC 23894. (2023). Information technology — artificial intelligence — guidance on
risk management (Standard). Geneva, CH: International Organization for Standard-
ization. Retrieved from https://www.iso.org/standard/77304.html?browse=tc
ISO/IEC 25012. (2008). Software product quality requirements and evaluation (square) —
data quality model (Standard). Geneva, CH: International Organization for Standard-
ization. Retrieved from https://www.iso.org/standard/35736.html
ISO/IEC 27000. (2018). Information technology — security techniques — information
security management systems (Standard). Geneva, CH: International Organization
for Standardization.
ISO/IEC 27001. (2022). Information security management systems (Standard). Geneva,
CH: International Organization for Standardization.
ISO/IEC 5259x. (2023). Artificial intelligence — data quality for analytics and machine
learning (ml) (Standard). Geneva, CH: International Organization for Standardiza-
tion. Retrieved from https://www.iso.org/standard/81088.html
ISO/IEC CD TS 12791. (2023). Information technology — artificial intelligence — treat-
ment of unwanted bias in classification and regression machine learning tasks (Stan-
dard). Geneva, CH: International Organization for Standardization. Retrieved from
https://www.iso.org/standard/83002.html
ISO/IEC DIS 42001. (2023). Information technology — artificial intelligence — manage-
ment system (Standard). Geneva, CH: International Organization for Standardization.
Retrieved from https://www.iso.org/standard/83002.html
ISO/IEC DIS 5338. (2023). Information technology — artificial intelligence — ai system life cycle processes (Standard). Geneva, CH: International Organization for Standardization. Retrieved from https://www.iso.org/standard/81118.html?browse=tc
ISO/IEC FDIS 8183. (2023). Information technology — artificial intelligence — data life
cycle framework (Standard). Geneva, CH: International Organization for Standard-
ization. Retrieved from https://www.iso.org/standard/83002.html
ISO/IEC TS 4213. (2022). Information technology — artificial intelligence — assessment of
machine learning classification performance (Standard). Geneva, CH: International
Organization for Standardization.
Jiang, Z., Zhang, C., Talwar, K., & Mozer, M. C. (2020). Characterizing structural regular-
ities of labeled data in overparameterized models. Retrieved from http://arxiv.org/
abs/2002.03206
Johnson, J., & Sokol, D. D. (2020). Understanding ai collusion and compliance.
Joseph, V. R. (2022, 8). Optimal ratio for data splitting (Vol. 15). John Wiley and Sons
Inc. doi: 10.1002/sam.11583
Kapoor, S., & Narayanan, A. (2022). Leakage and the reproducibility crisis in ml-based
science. arXiv. Retrieved from https://arxiv.org/abs/2207.07048 doi: 10.48550/
ARXIV.2207.07048
Konietschke, F., Schwab, K., & Pauly, M. (2021, 3). Small sample sizes: A big data
problem in high-dimensional data analysis (Vol. 30). SAGE Publications Ltd. doi:
10.1177/0962280220970228
Kraus, T., Ganschow, L., Eisenträger, M., & Wischmann, S. (2021). Erklärbare KI: Anforderungen, Anwendungsfälle und Lösungen [Explainable AI: Requirements, use cases and solutions]. Retrieved from https://www.digitale-technologien.de/DT/Redaktion/DE/Downloads/Publikation/KI-Inno/2021/Studie_Erklaerbare_KI.pdf?__blob=publicationFile&v=1
Larson, D. B., Harvey, H., Rubin, D. L., Irani, N., Justin, R. T., & Langlotz, C. P. (2021).
Regulatory frameworks for development and evaluation of artificial intelligence–based
diagnostic imaging algorithms: summary and recommendations (Vol. 18) (No. 3).
Elsevier.
Larsson, S., & Heintz, F. (2020). Transparency in artificial intelligence (Vol. 9) (No. 2).
Lones, M. A. (2021). How to avoid machine learning pitfalls: a guide for academic
researchers (Vol. abs/2108.02497). Retrieved from https://arxiv.org/abs/2108
.02497
Lu, Z., Kazi, R. H., Wei, L. Y., Dontcheva, M., & Karahalios, K. (2021, 4). A framework for
deprecating datasets: Standardizing documentation, identification, and communication
(Vol. 5). Association for Computing Machinery. doi: 10.1145/1122445.1122456
Lundberg, S. M., & Lee, S.-I. (2017). A unified approach to interpreting model predictions
(Vol. 30).
Makridakis, S., Spiliotis, E., Assimakopoulos, V., Semenoglou, A.-A., Mulder, G., &
Nikolopoulos, K. (2023). Statistical, machine learning and deep learning forecast-
ing methods: Comparisons and ways forward (Vol. 74) (No. 3). Taylor & Francis.
Malik, N., Tripathi, S. N., Kar, A. K., & Gupta, S. (2022). Impact of artificial intelligence
on employees working in industry 4.0 led organizations (Vol. 43) (No. 2). Emerald
Publishing Limited.
Mazumder, M., Banbury, C., Yao, X., Karlaš, B., Rojas, W. G., Diamos, S., . . . Reddi, V. J. (2022, 7). Dataperf: Benchmarks for data-centric ai development. Retrieved from http://arxiv.org/abs/2207.10062
Meding, K., Buschoff, L. M. S., Geirhos, R., & Wichmann, F. A. (2021). Trivial or
impossible – dichotomous data difficulty masks model differences (on imagenet and
beyond). Retrieved from http://arxiv.org/abs/2110.05922
Mitchell, M., Luccioni, A. S., Lambert, N., Gerchick, M., McMillan-Major, A., Ozoani, E.,
. . . Kiela, D. (2022, 12). Measuring data. Retrieved from http://arxiv.org/abs/
2212.05129
Mitchell, M., Wu, S., Zaldivar, A., Barnes, P., Vasserman, L., Hutchinson, B., . . . Gebru,
T. (2019, 1). Model cards for model reporting. In Proceedings of the conference on fairness, accountability, and transparency (pp. 220–229). Association for Computing Machinery, Inc. doi: 10.1145/3287560.3287596
Müller, V. C. (2020). Ethics of artificial intelligence and robotics.
Murakonda, S. K., & Shokri, R. (2020). Ml privacy meter: Aiding regulatory compliance by
quantifying the privacy risks of machine learning. arXiv. Retrieved from https://
arxiv.org/abs/2007.09339 doi: 10.48550/ARXIV.2007.09339
Naser, M., & Alavi, A. H. (2021). Error metrics and performance fitness indicators for
artificial intelligence and machine learning in engineering and sciences. Springer.
Nauta, M., Trienes, J., Pathak, S., Nguyen, E., Peters, M., Schmitt, Y., . . . Seifert, C.
(2022). From anecdotal evidence to quantitative evaluation methods: A systematic
review on evaluating explainable ai.
Northcutt, C. G., Athalye, A., & Mueller, J. (2021, 3). Pervasive label errors in test sets
destabilize machine learning benchmarks. Retrieved from http://arxiv.org/abs/
2103.14749
Oneto, L., & Chiappa, S. (2020). Fairness in machine learning. In Recent trends in learning
from data (pp. 155–196). Springer International Publishing. Retrieved from https://
doi.org/10.1007%2F978-3-030-43883-8_7 doi: 10.1007/978-3-030-43883-8_7
Piorkowski, D., Hind, M., & Richards, J. (2022). Quantitative ai risk assessments: Oppor-
tunities and challenges. arXiv. Retrieved from https://arxiv.org/abs/2209.06317
doi: 10.48550/ARXIV.2209.06317
Poretschkin, M., Schmitz, A., Akila, M., Adilova, L., Becker, D., Cremers, A. B., . . .
others (2021). Leitfaden zur Gestaltung vertrauenswürdiger Künstlicher Intelligenz (KI-Prüfkatalog) [Guideline for designing trustworthy artificial intelligence (AI assessment catalog)]. Fraunhofer IAIS.
Pushkarna, M., Zaldivar, A., & Kjartansson, O. (2022, 4). Data cards: Purposeful and
transparent dataset documentation for responsible ai. Retrieved from http://arxiv
.org/abs/2204.01075
Raji, I. D., Smart, A., White, R. N., Mitchell, M., Gebru, T., Hutchinson, B., . . . Barnes, P.
(2020). Closing the ai accountability gap: Defining an end-to-end framework for inter-
nal algorithmic auditing. In Proceedings of the 2020 conference on fairness, account-
ability, and transparency (p. 33–44). New York, NY, USA: Association for Comput-
ing Machinery. Retrieved from https://doi.org/10.1145/3351095.3372873 doi:
10.1145/3351095.3372873
Rajpurkar, P., Irvin, J., Zhu, K., Yang, B., Mehta, H., Duan, T., . . . Ng, A. Y. (2017).
Chexnet: Radiologist-level pneumonia detection on chest x-rays with deep learning
(Vol. abs/1711.05225). Retrieved from http://arxiv.org/abs/1711.05225
Raschka, S. (2018, 11). Model evaluation, model selection, and algorithm selection in
machine learning. Retrieved from http://arxiv.org/abs/1811.12808
Raudys, S. J., & Jain, A. K. (1990). Small sample size effects in statistical pattern recogni-
tion: Recommendations for practitioners and open problems. In Proceedings of the 10th international conference on pattern recognition (Vol. 1, pp. 417–423). IEEE. doi: 10.1109/icpr.1990.118138
Reunanen, J. (2003). Overfitting in making comparisons between variable selection methods
(Vol. 3) (No. Mar).
Ribeiro, M. T., Singh, S., & Guestrin, C. (2016). "Why should I trust you?": Explaining
the predictions of any classifier. In Proceedings of the 22nd acm sigkdd international
conference on knowledge discovery and data mining (pp. 1135–1144).
Roberts, D. R., Bahn, V., Ciuti, S., Boyce, M. S., Elith, J., Guillera-Arroita, G., . . .
Dormann, C. F. (2017). Cross-validation strategies for data with temporal, spatial,
hierarchical, or phylogenetic structure (Vol. 40) (No. 8). Retrieved from https://
onlinelibrary.wiley.com/doi/abs/10.1111/ecog.02881 doi: 10.1111/ecog.02881
Roberts, H., Cowls, J., Morley, J., Taddeo, M., Wang, V., & Floridi, L. (2021, mar). The
chinese approach to artificial intelligence: An analysis of policy, ethics, and regulation
(Vol. 36) (No. 1). Berlin, Heidelberg: Springer-Verlag. Retrieved from https://
doi.org/10.1007/s00146-020-00992-2 doi: 10.1007/s00146-020-00992-2
Rudin, C. (2019). Stop explaining black box machine learning models for high stakes deci-
sions and use interpretable models instead (Vol. 1) (No. 5). Nature Publishing Group
UK London.
Salay, R., Queiroz, R., & Czarnecki, K. (2017). An analysis of iso 26262: Using machine
learning safely in automotive software. arXiv. Retrieved from https://arxiv.org/
abs/1709.02435 doi: 10.48550/ARXIV.1709.02435
Schaaf, N., Wiedenroth, S. J., & Wagner, P. (2022). Explainable ai in practice - application-based evaluation of xai methods.
Schmidt, R. F. (2013). Chapter 8 - software requirements analysis practice. In
R. F. Schmidt (Ed.), Software engineering (p. 139-158). Boston: Morgan
Kaufmann. Retrieved from https://www.sciencedirect.com/science/misc/pii/
B9780124077683000082 doi: 10.1016/B978-0-12-407768-3.00008-2
Settles, B. (2010). Active learning literature survey (Vol. 15). doi: 10.1.1.167.4245
Shafer, G., & Vovk, V. (2008). A tutorial on conformal prediction (Vol. 9) (No. 3).
Shokri, R., Stronati, M., Song, C., & Shmatikov, V. (2016). Membership inference attacks
against machine learning models. arXiv. Retrieved from https://arxiv.org/abs/
1610.05820 doi: 10.48550/ARXIV.1610.05820
Siebert, J., Joeckel, L., Heidrich, J., Trendowicz, A., Nakamichi, K., Ohashi, K., . . .
Aoyama, M. (2022, Jun 01). Construction of a quality model for machine learn-
ing systems (Vol. 30) (No. 2). Retrieved from https://doi.org/10.1007/s11219
-021-09557-y doi: 10.1007/s11219-021-09557-y
Sorscher, B., Geirhos, R., Shekhar, S., Ganguli, S., & Morcos, A. S. (2022, 6). Beyond neural
scaling laws: beating power law scaling via data pruning. Retrieved from http://
arxiv.org/abs/2206.14486
Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., & Fergus,
R. (2013). Intriguing properties of neural networks. arXiv. Retrieved from https://
arxiv.org/abs/1312.6199 doi: 10.48550/ARXIV.1312.6199
Tidjon, L. N., & Khomh, F. (2022). Never trust, always verify: a roadmap for trustworthy ai? arXiv. Retrieved from https://arxiv.org/abs/2206.11981 doi: 10.48550/ARXIV.2206.11981
Toneva, M., Sordoni, A., des Combes, R. T., Trischler, A., Bengio, Y., & Gordon, G. J. (2018). An empirical study of example forgetting during deep neural network learning.
UNESCO. (2022). Recommendation on the ethics of artificial intelligence. Retrieved from
https://unesdoc.unesco.org/ark:/48223/pf0000381137.locale=en
van Bekkum, M., & Borgesius, F. Z. (2023, apr). Using sensitive data to prevent discrimi-
nation by artificial intelligence: Does the GDPR need a new exception? (Vol. 48). El-
sevier BV. Retrieved from https://doi.org/10.1016%2Fj.clsr.2022.105770 doi:
10.1016/j.clsr.2022.105770
Viering, T., & Loog, M. (2021, 3). The shape of learning curves: a review. Retrieved from
http://arxiv.org/abs/2103.10948
Wachter, S., Mittelstadt, B., & Russell, C. (2021). Bias preservation in machine learning:
The legality of fairness metrics under eu non-discrimination law. Retrieved from
https://ssrn.com/abstract=3792772
Winter, P. M., Eder, S., Weissenböck, J., Schwald, C., Doms, T., Vogt, T., . . . Nessler,
B. (2021). Trusted artificial intelligence: Towards certification of machine learning
applications.
Wittenbrink, N., Kraus, T., Demirci, S., & Straub, S. (2022, 12). Leitfaden für das Qualitätsmanagement bei der Entwicklung von KI-Produkten und Services [Guideline for quality management in the development of AI products and services]. Begleitforschung des Technologieprogramms KI-Innovationswettbewerb des Bundesministeriums für Wirtschaft und Klimaschutz (BMWK). Retrieved from https://www.digitale-technologien.de/DT/Redaktion/DE/Kurzmeldungen/Aktuelles/2022/KI-Inno/20221220_Leitfaden_Qualitaetsmanagement.html
Wolpert, D. H. (2002). The supervised learning no-free-lunch theorems. Springer.
Wong, T.-T., & Yeh, P.-Y. (2019). Reliable accuracy estimates from k-fold cross validation
(Vol. 32) (No. 8). IEEE.
Wu, C.-J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., . . . oth-
ers (2022). Sustainable ai: Environmental implications, challenges and opportunities
(Vol. 4).
Yang, L., & Shami, A. (2020). On hyperparameter optimization of machine learning
algorithms: Theory and practice (Vol. abs/2007.15745). Retrieved from https://
arxiv.org/abs/2007.15745
Yang, T.-J., Chen, Y.-H., Emer, J., & Sze, V. (2017). A method to estimate the energy
consumption of deep neural networks. In 2017 51st asilomar conference on signals,
systems, and computers (pp. 1916–1920).
Young, A. T., Fernandez, K., Pfau, J., Reddy, R., Cao, N. A., von Franque, M. Y., . . .
others (2021). Stress testing reveals gaps in clinic readiness of image-based diagnostic
artificial intelligence models (Vol. 4) (No. 1). Nature Publishing Group UK London.
Zeng, A., Chen, M., Zhang, L., & Xu, Q. (2022). Are transformers effective for time series
forecasting?
Zhang, J. M., Harman, M., Ma, L., & Liu, Y. (2020). Machine learning testing: Survey,
landscapes and horizons. IEEE.
Zhang, X., Ono, J. P., Song, H., Gou, L., Ma, K. L., & Ren, L. (2022). Sliceteller: A data
slice-driven approach for machine learning model validation. IEEE Computer Society.
doi: 10.1109/TVCG.2022.3209465