Model Reporting for Certifiable AI
Abstract
Despite substantial progress in Explainable and Safe AI, practitioners suffer from a lack of
regulation and standards for AI safety. In this work we merge recent regulation efforts by
the European Union and first proposals for AI guidelines with recent trends in research:
data and model cards. We propose the use of standardized cards to document AI applications
throughout the development process. Our main contribution is the introduction of
use-case and operation cards, along with updates for data and model cards to cope with
regulatory requirements. We reference both recent research and the source of the regulation
in our cards and provide references to additional support material and toolboxes
whenever possible. The goal is to design cards that help practitioners develop safe AI
systems throughout the development process, while enabling efficient third-party auditing
of AI applications, being easy to understand, and building trust in the system. Our work
incorporates insights from interviews with certification experts as well as developers and
individuals working with the developed AI applications.
*. Equal contribution. Each author provided one of the cards. First author coordinated writing.
1. Introduction
One key enabler for the application of artificial intelligence (AI) and machine learning (ML)
is the establishment of standards and regulations for practitioners to develop safe and trust-
worthy AI systems. Despite several efforts by different research institutions and governmen-
tal organizations, there are currently no established guidelines. The most influential attempt
stems from the European Union with its AI Act (European Commission, 2021). In this paper
we revisit the state of the art in safeguarding and certifying AI applications with a focus on
the European Union. Nevertheless, our summary is intended to be as broad as possible.
The focus on European legislation stems from the influence it has on regulation and
standardization.1 The result is a structured guideline for reporting AI applications along four
major development steps. The objective of our guideline is to provide comprehensive sup-
port during the development process. It encompasses a synthesis of requirements derived
from both non-technical users and prominent certification and standardization bodies. Our
approach involves the inclusion of robust references to relevant support materials and tool-
boxes, emphasizing the transparency and traceability of each requirement’s source. For this
purpose, we conduct interviews with standardization experts, developers and consumers of
the final AI product and incorporate their feedback into our proposal. Furthermore, our
framework serves as a solid foundation for prospective system certification. By adhering
to our guidance, developers can ensure they are adequately equipped to meet forthcoming
legal obligations within their domain. The key contributions of our work are fourfold:
• We include references to support material and toolboxes and link the source of the
requirement whenever possible.
• We lay the foundations for a future certification of the system. With our framework,
developers should be well-positioned when legal requirements come into force.
1. European Union Set to Be Trailblazer in Global Rush to Regulate Artificial Intelligence (TIME)
harmonized with the EU AI Act, standards provided by ISO or other standardization bod-
ies can be considered. Conformity assessment bodies can then utilize these standards to
showcase the compliance of products, services, or processes with applicable EU legislation
(European Commission, 2023b). These audits can also be carried out on a voluntary ba-
sis for products that do not necessarily require them, for instance for marketing purposes.
In certain product categories like medical devices, third-party conformity assessments are
mandatory to ensure compliance with technical regulations before entering specific markets
(European Commission, 2023a). Apart from such certificates, so-called guidelines have no
legally binding effect.
Finally, developing safe and responsible AI also involves governance structures within
the company to manage data development activities and risks. However, this is beyond the
scope of this paper, in which we focus on reporting around the development process.
European Commission, and two sets of comments were issued in December 2022 by the
Council of the European Union (2022) and in April 2023 by the European Parliament
(2023). The initial draft already created a heated debate leading to fears that it might
over-regulate research and development of AI in the EU and that the definition of AI was
too broad. However, this debate is beyond the scope of this paper. We are interested in
the requirements posed on those systems. Although some are still subject to change, many
requirements have already emerged.
In general, the AI Act distinguishes three risk classes for AI: forbidden applications, high-
risk applications, and uncritical applications. Furthermore, there might be some transparency
obligations even for uncritical applications; for example, end-users must be
informed that they are interacting with an AI system. Apart from this, there are no mandatory
obligations for uncritical applications, although developers of such AI systems shall still be
motivated to follow the proposed principles.
Whether an application falls into the high-risk category is risk-dependent, and a few
high-risk applications are named in the AI Act’s appendix. Some of them include: bio-
metric identification, operation of critical infrastructure, education, employment & workers
management, essential private services & public services, law enforcement, migration &
asylum, and administration of democratic processes. The requirements for such systems
are listed in Chapter 2 and include
such as medical technology. Overall, the requirements are on a very abstract level and still
subject to change. In particular, the Act does not contain specific starting points for
practical implementation in companies because the baseline standards that will provide the
technical details are still being developed.
in a simple language such that users with a non-technical background can understand them,
and, finally, report known unknowns such as possible shortcomings or uncertainties of the
dataset. Apart from that, the authors name 31 content themes to describe a dataset. They
are listed in Appendix B.
Another related work is data sheets for datasets (Gebru et al., 2018). The authors
propose a similar set of questions for dataset creators during the collection phase that
can be grouped into the following categories: (i) motivation for creating the dataset, (ii)
composition of the data, (iii) collection process, (iv) preprocessing/cleaning/labeling of the
data, (v) uses (describing the contexts in which the dataset was already used), (vi) the
distribution of the data, and (vii) the future maintenance. The intended audience comprises
both the dataset creators and the consumers of the final dataset.
or videos. Overall, although the high-level pillars are expected to be similar to the ones in
Western countries and include that AI should benefit human welfare, China’s regulation is
expected to be focused more around group relations than individual rights (H. Roberts et
al., 2021). However, this is beyond the scope of our paper and Chinese regulation is not
addressed in our framework.
2. Proposed Approach
In this section, we propose cards supporting the four steps: use case definition, data collec-
tion, model development, and model operation. For each card, we briefly cover background
information before presenting our proposal and include references to further literature when-
ever possible. Furthermore, each card contains an involvement of affected individuals section
as a reminder to include people that might be affected by the AI system whenever feasible.
In Appendix A we provide an example based on a toy use case. As mentioned before, we focus
on reporting around the AI application and do not cover governance.
2.1 Use Case
In line with Raji et al. (2020) and Poretschkin et al. (2021), the first step in our framework
is a summary of the use case. This offers the auditing authority a brief overview of the
task at hand and the AI application. Furthermore, the document can be used during the AI
development phase to ensure compliance as well as a trustworthy and robust AI system. A
general understanding of the use case is necessary to perform a proper risk assessment after-
wards. Therefore, a use case summary should be formulated that includes a description of
the status before applying an AI solution, a description of relevant sub-components as well
as a summary of the proposed AI solution. This approach is especially useful for the doc-
umentation of high-risk applications as defined in the AI Act (European Commission, 2021).
A general summary should also include a contact person and information about involved
groups inside an organisation to include them in relevant parts during the development
phase. New in our framework is the justification of why a non-AI approach is not sufficient
in the context of the application. Although this requirement is not explicitly stated in the AI
Act, it emerged in discussions with several certification bodies. For developers this means
that a simpler model (or maybe non-AI) solution should be prioritized whenever there is
no sound justification for the use of a more complex one. Finally, two new aspects in our
approach are (i) reviewing and documentation of prior incidents in similar use cases and
(ii) the distinction whether a general purpose AI system or a foundation model is expected
to be used or developed.
2.1.1 Risk
AI can only be deployed if the associated risks are kept at an acceptable level, comparable
to that of already existing products or services. To guarantee this, a proper risk as-
sessment must be performed. Due to the nature of AI systems, e.g., the training on noisy
and changing data or the black-box characteristics, classical methods for risk assessment
cannot be applied in most cases (Arai & Kapoor, 2020; Siebert et al., 2022). In the litera-
ture, there are several schemes for qualitative and quantitative risk assessment. In recent
years, the AI community proposed several ways to categorize risks in different dimensions
(Ashmore, Calinescu, & Paterson, 2021; Piorkowski, Hind, & Richards, 2022; Poretschkin
et al., 2021). One approach, accepted by many scientists and organisations, is the division
according to the High-Level Expert Group on AI (AI HLEG, 2019). They propose the seven
key dimensions of risk:
1. human agency/oversight,
2. technical robustness/safety,
3. privacy/data governance,
4. transparency,
5. fairness,
6. societal/environmental well-being,
7. accountability.
As described in Section 1.4.4, these dimensions are well aligned with other works. For each
dimension, one can derive possible risks for a specific use case and AI application.
Here, the risk consists of the probability of an undesirable event and the impact such an
event has. In recent years, many more methods have been developed to accurately quantify
such probabilities, for example, in the field of adversarial robustness (Murakonda & Shokri,
2020; Szegedy et al., 2013), fairness (Barocas & Selbst, 2016; Bellamy et al., 2018; Bird et al.,
2020), or privacy (Fredrikson, Jha, & Ristenpart, 2015; Shokri, Stronati, Song, & Shmatikov,
2016). In the context of (functional) safety, a quantitative assessment of the risk is almost
always mandatory, especially in safety critical applications such as robotics, manufacturing,
or autonomous driving (Ashmore et al., 2021; El-Shamouty, Titze, Kortik, Kraus, & Huber,
2022; Salay, Queiroz, & Czarnecki, 2017). There are also more qualitative approaches to
classify the risk along each dimension. The EU distinguishes between forbidden
applications, high-risk applications, and uncritical applications based on the application
domain (European Commission, 2021). A standardized way of performing risk assessment
for AI is given in ISO/IEC 23894 which uses the existing risk management standard ISO
31000 as a reference point. It also offers concrete examples in the context of correct risk
management integration during and after the development.
Our proposal of a use case documentation is shown below. In accordance with existing
works on model cards (Mitchell et al., 2019) and data cards (Pushkarna et al., 2022) we call
it a Use Case Card. As in the other cards, almost every aspect comes with one or more
references to the AI Act. After a use case summary and a brief formulation of the problem
and solution, we suggest performing a risk assessment based on the dimensions mentioned
above. The risks have to be stated together with the corresponding dimension. Note that
in this approach, a qualitative risk assessment is performed first to create awareness of the
critical aspects of the AI system. Later on, in Sections 2.2–2.4, the goal is to quantify
and minimize those risks using specific methods and other measures.
2.2 Data
AI applications stand and fall with their training data. Despite that, there has only recently
been a paradigm shift towards datasets, driven by the data-centric AI perspective. The AI
Act requires in Article 10 (3) that training, validation, and testing datasets shall be relevant,
representative, free of errors and complete. What this means in practice, however, is often
unclear. Our extension of a data card builds on previous work on data documentation but is
extended with further references, especially those addressing regulation, whenever possible
(Gebru et al., 2018; Pushkarna et al., 2022). From the standardization side, the ISO/IEC
5259x (2023) series deals with data quality for AI systems, ISO/IEC FDIS 8183 (2023) with
the data life cycle framework for AI systems and ISO/IEC 25012 (2008) describes a general
data quality model. The first two are still under development but are worth reviewing once
they are final. Before introducing the data card, we discuss open questions that we think
are not covered sufficiently in the state of the art.
We start by summarizing research directions that might be worth considering before starting
data collection. Two related active areas of research are active learning (Settles, 2010) and
core sets (D. Feldman, 2020). In active learning, the algorithm is given access to a small
labeled dataset D and a large unlabeled dataset U . The algorithm can then query an
oracle (e.g., human labeler) for labels up to some budget b. The goal is to optimize model
performance within the limit of the budget. A core set is a small subset S ⊆ D of the
original dataset D such that the learning algorithm achieves a similar performance on this
subset as if it were trained on the entire dataset.
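To make the pool-based setting concrete, the following minimal sketch shows an uncertainty-sampling active-learning loop; the dataset variables, the query_oracle callback standing in for a human labeler, and the budget values are illustrative assumptions rather than part of the original text.

```python
# Minimal sketch of pool-based active learning with uncertainty sampling (Settles, 2010).
# The datasets, the query_oracle callback, and the budget are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

def active_learning(X_labeled, y_labeled, X_pool, query_oracle, budget=100, batch_size=10):
    model = LogisticRegression(max_iter=1000)
    for _ in range(budget // batch_size):
        model.fit(X_labeled, y_labeled)
        # Uncertainty = 1 - highest predicted class probability.
        uncertainty = 1.0 - model.predict_proba(X_pool).max(axis=1)
        query_idx = np.argsort(uncertainty)[-batch_size:]      # most uncertain pool points
        X_labeled = np.vstack([X_labeled, X_pool[query_idx]])
        y_labeled = np.concatenate([y_labeled, query_oracle(X_pool[query_idx])])
        X_pool = np.delete(X_pool, query_idx, axis=0)
    return model.fit(X_labeled, y_labeled)
```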
These research fields and their recent advances are related to data collection and quality,
and performing a literature review before starting to collect data can help make the collection
more effective.
Protected Attributes Usually, the protected attributes include sex, race, skin colour,
ethnic or social origin, genetic features, language, religion or belief, political or any other
opinion, membership of a national minority, property, birth, disability, age or sexual orien-
tation and can be found, for instance, in Art. 21 of the Charter of Fundamental Rights of
the European Union (Coghlan & Steiert, 2020). According to the EU handbook on non-
discrimination law, it might differ between use-cases whether an attribute is considered as
protected (Council of European Union, 2019). Furthermore, pregnancy can be considered
as protected in some cases. Access to such information is crucial when detecting biases but
there may also be cases when other laws prohibit the use of such attributes (van Bekkum
& Borgesius, 2023).
Dataset Size One problem when collecting data is knowing when a sufficient amount has
been acquired. Despite the standard perspective that “there’s no data like more data”
(attributed to IBM’s Robert Mercer), collecting and annotating data is both expensive and not always
feasible. For example, in medicine it is almost impossible to increase the number of samples
[Figure 1: (a) test performance over the fraction of the train set; (b) test performance over the fraction of the test set]
Figure 1: (a) Model performance of a multilayer perceptron (MLP) trained on the MNIST
dataset with increasing fraction of original train data. The improvement of adding data
points behaves logarithmically. (b) Variance over 20 runs of training an MLP on the entire
MNIST training set and evaluating the performance on fractions of the test set. With
increasing size of the test set, the variance decreases.
of rare diseases. Data augmentation is one solution, but in this section we want to focus
on the size of the raw dataset. A rule of thumb in computer vision suggests using around
1,000 images per class for image classification tasks as a good starting point. The one-in-ten
rule8 suggests that ten data points are necessary per model parameter. One way to estimate
the necessary amount of data is by tracking model improvements when sequentially adding
larger fractions of the training data. It is well known that the resulting performance plot
behaves logarithmically (Figueroa, Zeng-Treitler, Kandula, & Ngo, 2012; Viering & Loog,
2021). Hence, the largest improvements are obtained early on. As shown in Figure 1(a),
with about 50% of the training data, the performance is already above 90% and doubling
the amount of data increases the performance by less than 5%. In practice, one approach
could be to start with a small test set for data collection and utilize it to estimate when a
reasonable amount of data has been gathered. Once the data has been acquired, this test set
may be added to the training data or discarded and a new test set created.
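As a sketch of this procedure, the snippet below estimates such a learning curve with scikit-learn; the digits dataset and the small MLP are illustrative stand-ins for the MNIST experiment in Figure 1(a), not a reproduction of it.

```python
# Sketch of a learning-curve study: train on growing fractions of the data and watch
# for diminishing (roughly logarithmic) returns. Dataset and model are stand-ins.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import learning_curve
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
train_sizes, _, test_scores = learning_curve(
    MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0),
    X, y,
    train_sizes=np.linspace(0.1, 1.0, 10),  # fractions of the available training data
    cv=5, scoring="accuracy", n_jobs=-1,
)
for n, scores in zip(train_sizes, test_scores):
    print(f"{n:4d} training samples -> mean test accuracy {scores.mean():.3f}")
```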
Data Coverage Apart from the size of the dataset, covering all possible scenarios is
important. Recent research has shown that many data points are redundant (D. Feld-
man, 2020; Sorscher, Geirhos, Shekhar, Ganguli, & Morcos, 2022; Toneva et al., 2018) and
hypothesised that unique points (which may be considered edge cases) can have a signifi-
cant impact on model performance (V. Feldman, 2020). One way to find such samples is by
manually defining edge cases, e.g., with the involvement of domain experts, and actively
collecting these data points. This can be extended by a combination of empirical and tech-
nical analysis, for instance, by plotting a downsampled embedding and analyzing the edge
cases. A useful tool for this purpose is Tensorleap. Unfortunately, it is still unclear whether
rare points are desirable or not in a dataset. While they can have a significant impact on
model performance, they might make a model prone to privacy attacks (V. Feldman, 2020;
Sorscher et al., 2022).
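A minimal sketch of the embedding-based screening mentioned above is given below: the data is projected into a low-dimensional embedding and the most isolated points are flagged for expert review. PCA, the neighbour count, and the number of flagged points are illustrative choices; t-SNE, UMAP, or a dedicated tool could be used instead.

```python
# Sketch of embedding-based edge-case screening: flag points that are far from their
# neighbours in a downsampled embedding so domain experts can review them.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors

def flag_candidate_edge_cases(X, n_components=2, n_neighbors=5, n_flag=20):
    embedding = PCA(n_components=n_components).fit_transform(X)
    nn = NearestNeighbors(n_neighbors=n_neighbors + 1).fit(embedding)
    distances, _ = nn.kneighbors(embedding)      # first column is the point itself
    isolation = distances[:, 1:].mean(axis=1)
    return np.argsort(isolation)[-n_flag:]       # indices of the most isolated points
```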
Usually, after data collection the data would be labeled and preprocessed. Since our sug-
gestions regarding labeling are short, they are covered in the section on documentation.
Apart from this, we have swapped the sections on train-test split creation and preprocessing
because it is recommended to do preprocessing on the splits separately whenever feasible.
Stratification First of all, training and test sets should be stratified. If metadata was
collected it should be considered for stratification. As indicated in Section 2.2.2, another
option is to create splits of different difficulty according to the metadata. An easy test set
might contain overlapping instances between entities in training and test set (e.g., data of
the same patients in training and test set) whereas a hard test set contains only data from
new entities (patients). This can be used to estimate the generalization capabilities of a
model and helps to understand whether it is usable for a task or not.
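The sketch below shows both split strategies with scikit-learn: a label-stratified split and a "hard" split in which all records of one entity (e.g., one patient) end up on the same side. The synthetic data and the patient_ids grouping variable are illustrative assumptions.

```python
# Sketch of an easy (label-stratified) split and a hard (entity-grouped) split.
import numpy as np
from sklearn.model_selection import GroupShuffleSplit, train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 16))
y = rng.integers(0, 2, size=1000)
patient_ids = rng.integers(0, 100, size=1000)     # 100 hypothetical patients

# Easy split: stratify on the label (metadata could be used analogously).
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

# Hard split: no patient appears in both training and test set.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(X, y, groups=patient_ids))
```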
Test Set Size One of the most difficult choices is the size of the test set. Despite the
standard 80-20 (or similar) suggestion we are not aware of common strategies to estimate
the necessary test set size in a general setting (Joseph, 2022). In medical and psychological
research, sample size estimation based on statistical significance is established. For example,
if the effect of medication against a certain disease should be evaluated and it is a priori
known that the disease occurs in 10 out of 1000 patients (p = 1%) and the target is an error
margin of ϵ = 1% with a confidence level of σ = 99%, the necessary sample size n is given
by n ≈ z0.99² × p(1 − p)/ϵ² = 2.58² × 0.01(1 − 0.01)/0.01² ≈ 659, where z0.99 = 2.58 is the
value for the selected confidence level, derived from the normal distribution. If p is unknown, it is usually set
to 0.5. The same can be applied to ML and was actually discussed in a few works especially
from the medical domain (Beleites, Neugebauer, Bocklitz, Krafft, & Popp, 2013; Dietterich,
1998; Konietschke, Schwab, & Pauly, 2021; Raudys & Jain, 1990). In the case of ML, the
interpretation is as follows: the goal is to evaluate an ML model that predicts the disease
with a randomly drawn test set of size 659. Then, there is a 99% confidence that the real-
world model performance is within ±1% of the error rate on the test set. However, one drawback
is that it is assumed that the collected data is representative of the real-world setting and
it is not guaranteed that the model will not learn any spurious correlations.
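For reference, the worked example above can be computed with a few lines; the function and its defaults simply mirror the formula and are illustrative rather than a prescribed procedure.

```python
# Sketch of the sample-size estimate from the worked example above.
def required_test_size(p, eps, z=2.58):
    """z = 2.58 corresponds to a 99% confidence level; p is the expected event rate."""
    return z ** 2 * p * (1 - p) / eps ** 2

print(round(required_test_size(p=0.01, eps=0.01)))   # -> 659, as in the example
```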
Another indicator for a good test set is the test variance. The underlying assumption
is that a better test set should show a low variance between different evaluation runs with
the same model (Bouthillier et al., 2021). That is, if the same experiment results in an
unexpectedly large variance, the test set is most likely too small and should be extended.
An example for this effect is shown in Figure 1(b). Although multiple evaluations should
generally be done, it is of special importance if the test set cannot be extended (Bouthillier
et al., 2021; Lones, 2021; Raschka, 2018).
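A minimal sketch of this check is shown below: one trained model is evaluated on repeated random subsets of the test set and the spread of the scores is inspected. The subset fraction and number of runs are illustrative, and the test data is assumed to be stored as NumPy arrays.

```python
# Sketch of the test-variance check: evaluate a fixed model on random test subsets.
# A large spread hints at a test set that is too small.
import numpy as np
from sklearn.metrics import accuracy_score

def test_score_spread(model, X_test, y_test, fraction=0.5, n_runs=20, seed=0):
    rng = np.random.default_rng(seed)
    n = int(fraction * len(X_test))
    scores = []
    for _ in range(n_runs):
        idx = rng.choice(len(X_test), size=n, replace=False)
        scores.append(accuracy_score(y_test[idx], model.predict(X_test[idx])))
    return float(np.mean(scores)), float(np.std(scores))
```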
Data Card
Step | Requirement | AI Act | References
General | Name originator of the dataset and provide a contact person | | Gebru et al. (2018); Pushkarna et al. (2022)
General | Describe the intended use of the dataset | | Gebru et al. (2018); Pushkarna et al. (2022)
General | Describe licensing and terms of usage | | Gebru et al. (2018); Pushkarna et al. (2022)
Data description | Collect requirements for the data before starting data collection | Art. 10 (2) d) e) |
Data description | Describe a data point with its interpretation | | Gebru et al. (2018); Pushkarna et al. (2022)
Data description | Maybe, provide additional documentation to understand the data (e.g., links to scientific sources, preprocessing steps or other necessary information) | | Pushkarna et al. (2022)
Data description | If there is GDPR relevant data (i.e., personally identifiable information), describe it | | Gebru et al. (2018); Pushkarna et al. (2022)
Data description | If there is biometric data, describe it | | Gebru et al. (2018); Poretschkin et al. (2021); Pushkarna et al. (2022)
Data description | If there is copyrighted data, summarize it | Art. 28 b) |
Data description | If there is business relevant information, describe it | | Poretschkin et al. (2021)
Collection | Describe the data collection procedure and the data sources | Art. 10 (2) b), Annex III c) | Gebru et al. (2018); Pushkarna et al. (2022)
Collection | Use data version control | Art. 10 (2) b) | Poretschkin et al. (2021)
Collection | Consider the prior requirements for the data | Art. 10 (2) e) |
Collection | Include and describe metadata | | Metadata Standards
Collection | Involve domain experts and describe their involvement | | Poretschkin et al. (2021)
Collection | Describe technical measures to ensure completeness of data | Art. 10 (2) g), Art. 28 b) | CleanLab
Collection | Describe and record edge cases | | Poretschkin et al. (2021); Pushkarna et al. (2022)
Collection | If personal data is used, make sure and document that all individuals know they are part of the data | | Gebru et al. (2018); Lu, Kazi, Wei, Dontcheva, and Karahalios (2021)
Collection | Describe if and how the data could be misused | | Gebru et al. (2018); Pushkarna et al. (2022)
Developing models that execute a specific task in a reliable and fair way is challenging. The
AI Act phrases the requirements for (high-risk) AI systems in Article 15 (1): “High-risk AI
systems shall be designed and developed in such a way that they achieve [...] an appropriate
level of accuracy, robustness and cybersecurity, and perform consistently in those respects
throughout their lifecycle.” To achieve this, there are several concerns such as explainability,
fairness, feature engineering and others. For documenting all these pillars, we build on the
work of Mitchell et al. (2019). While some points overlap, we added and complemented
others, i.e., explainable artificial intelligence (xAI) and testing in real-world settings.
2.3.1 Explainability
First, the requirements for xAI should be determined. In this regard, different aspects of
explainability methods need to be considered. The first important aspect is to distinguish
between the need to explain a single decision (local explainability) or the entire ML model
(global explainability). The former might be important in cases where lay users need to be able
to understand a decision made about them (e.g., financial advice), while the latter might be
more important to make sure a model does not decide based on sensitive features. These
explanations can be derived via different means, with most taxonomies listing post-hoc
explainability vs. explainability by nature or design, also so-called interpretable machine
learning (iML). The former describes using a black-box model for a decision and generating
explanations afterwards, most often via simpler approximations, with the most prominent
examples being LIME (Ribeiro, Singh, & Guestrin, 2016) and SHAP (Lundberg & Lee,
2017). For post-hoc explainability, it is important to note that most xAI-methods provide
no formal guarantees, thus explanations might not conform to the ML-model’s decision in
all cases. Another important factor and currently open problem is how to evaluate such
explanation methods. While many evaluation metrics and tests have been proposed, none
has been widely adopted and consistently been used throughout the literature (Burkart &
Huber, 2021; Nauta et al., 2022). Currently it seems that explanation methods should be
tested in real-world scenarios, including people with the same explainability needs as the
target group of the explanations, as e.g. ML-engineers might need different explanations
than lay users. Additionally to the requirements regarding explainability and transparency
itself, xAI-methods might be needed to fulfill other legal requirements. If, for example,
users of an ML-system have a right to object to decisions (as given through the General
Data Protection Regulation (GDPR) in Europe), they must be sufficiently informed of a
decision to be able to make use of their rights. In general, we advise thinking about explain-
ability demands before choosing a model, as use cases with a high demand for transparency
might need to provide some guarantees for correct explanations and could thus be limited to
iML models. Furthermore, considerations regarding explainability can also influence other
development steps, especially data preprocessing. Here, non-interpretable features as for
example obtained by principal component analysis (PCA) can severely limit the expres-
siveness of explainability approaches. Regarding the use of explainability for AI systems,
norms are currently being specified but have not yet been published (DIN SPEC 92001-3, 2023; ISO/IEC
AWI TS 6254, 2023).
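As a hedged illustration of the post-hoc route discussed above, the sketch below computes SHAP attributions for a tree ensemble; the dataset and model are placeholders, return shapes can differ between model types and shap versions, and, as noted, such explanations are approximations without formal guarantees.

```python
# Sketch of post-hoc explanations with SHAP (Lundberg & Lee, 2017) on a placeholder
# gradient-boosting model; local attributions per prediction, aggregated for a global view.
import shap
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

explainer = shap.TreeExplainer(model)        # model-specific explainer for tree models
shap_values = explainer.shap_values(X_test)  # local attributions for each test prediction
shap.summary_plot(shap_values, X_test)       # aggregated (global) feature-importance view
```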
2.3.2 Feature Engineering, Feature Selection, and Preprocessing
While some of the preprocessing is addressed in the data card, here we consider only the
processing that is not part of the dataset itself but is model specific. Feature engineer-
ing and selection as well as data augmentation and preprocessing steps should always be
meaningful in the context of the use case and the used model. Feature selection can be
seen as an optimization problem to find the most relevant features for the model. While
the performance of deep learning algorithms might not change when irrelevant features are
kept, the computational cost will increase and consume time and resources (Cai, Luo, Wang,
& Yang, 2018). To prevent data
leakage, all steps, including feature selection, feature engineering, and preprocessing, should
be performed after splitting the data (Kapoor & Narayanan, 2022; Reunanen, 2003). This
also holds for k-fold cross validation, where feature engineering and preprocessing must be
executed on each fold independently.
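A minimal sketch of leakage-free preprocessing is shown below: wrapping scaling and feature selection in a pipeline ensures they are re-fit on the training folds of every cross-validation split. The dataset and estimators are illustrative.

```python
# Sketch of leakage-free preprocessing: scaling and feature selection are part of the
# pipeline and therefore fitted only on the training portion of each fold.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(f_classif, k=10)),
    ("clf", LogisticRegression(max_iter=1000)),
])
scores = cross_val_score(pipeline, X, y, cv=5)
print(f"accuracy: {scores.mean():.3f} ± {scores.std():.3f}")
```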
2.3.3 Model Selection
The range of usable models depends on the given use case and its requirements, such as
explainability, prediction performance, and inference time, just to name a few. With
those limitations, the number of meaningful models narrows down. It is good practice
to compare the selected model to standard baselines. A good starting point are models that
are easy to train and established as state of the art for the use case. Also, rule-based systems or
heuristic algorithms can be used as a baseline. Trying different classes of models that have
shown good results on a given set of problems is always recommended (Ding, Tarokh, &
Yang, 2018; Lones, 2021). According to the no free lunch theorem, there is no single machine
learning algorithm that can outperform all others across all possible problems and datasets
(Wolpert, 2002). The choice of the algorithm needs to be appropriate for the underlying
problem class and the amount and quality of the available data. Usually, there are fine-tuned
algorithms for natural language processing, computer vision, time series and other fields.
Note that deep learning models will not always be the best option. Grinsztajn, Oyallon, and
Varoquaux (2022) show that tree-based models often outperform deep learning on tabular
data. In time series forecasting, there is no clear best algorithm class. Statistical and tree-
based methods are regularly on par with deep learning models (Zeng, Chen, Zhang, & Xu,
2022). However, achieving this improved performance often requires a significant increase in
computational resources (Makridakis et al., 2023). In general, simpler models are preferred
over more complex models if they provide similar levels of performance, according to the
principle of Ockham’s razor, because they are less prone to overfitting and easier to interpret.
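The sketch below compares a candidate model against a trivial and a simple baseline before committing to a more complex architecture; the dataset, the models, and the scoring choice are illustrative.

```python
# Sketch of a baseline comparison: a majority-class predictor, a simple linear model,
# and a more complex ensemble, all evaluated the same way.
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
candidates = {
    "majority class": DummyClassifier(strategy="most_frequent"),
    "logistic regression": LogisticRegression(max_iter=5000),
    "gradient boosting": GradientBoostingClassifier(random_state=0),
}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="f1")
    print(f"{name:20s} F1 = {scores.mean():.3f} ± {scores.std():.3f}")
```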
2.3.4 Metrics
Similar to model choice, the metrics should also be selected carefully. Depending on the use
case, recall or precision might be more appropriate metrics than general ones like accuracy or
F1-score (Poretschkin et al., 2021). The choice of model also influences the choice of useful
metrics (Ferri, Hernández-Orallo, & Modroiu, 2009; Naser & Alavi, 2021). Furthermore,
the minimum acceptable score for the selected metrics should be determined to create value
in the specific use case. It is always recommended to involve domain experts when defining these
minimum requirements.
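A small sketch of this practice is given below: several metrics are reported side by side and checked against domain-specific minimum requirements. The thresholds mirror the go-live targets from the example in Appendix A and are purely illustrative.

```python
# Sketch of use-case-driven metric reporting with domain-specific minimum requirements.
# The thresholds (recall >= 0.99, precision >= 0.90) are illustrative go-live targets.
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

def report_metrics(y_true, y_pred):
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),     # often the binding constraint
        "f1": f1_score(y_true, y_pred),
    }

def meets_requirements(y_true, y_pred, min_recall=0.99, min_precision=0.90):
    m = report_metrics(y_true, y_pred)
    return m["recall"] >= min_recall and m["precision"] >= min_precision
```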
Model Card
Step | Requirement | AI Act | References
General | Provide Name and details of a contact person | |
General | Provide Name of the person who created the model | | Mitchell et al. (2019)
General | Document the creation date and version of the model | | Mitchell et al. (2019)
General | Describe intended use of the model | | Mitchell et al. (2019)
Description | Describe the architecture of the used model | | Mitchell et al. (2019)
Description | Describe the used hyperparameters | |
Description | Document the training, validation, and test error | | Mitchell et al. (2019)
Description | Document computation complexity, training time and energy consumption (for foundation models describe steps taken to reduce energy consumption) | Art. 28 | García-Martín, Rodrigues, Riley, and Grahn (2019); T.-J. Yang, Chen, Emer, and Sze (2017)
Explainability and interpretability | Document the demand for explainability and interpretability | Art. 13 (1, 2), Art. 15 (2) |
Explainability and interpretability | Describe taken actions if any | Art. 15 (2) | Burkart and Huber (2021); ISO/IEC AWI TS 6254 (2023); Nauta et al. (2022); Ribeiro et al. (2016); Schaaf and Wiedenroth (2022)
Feature engineering, feature selection and preprocessing | Describe the consideration of explainability and interpretability (if relevant) | Art. 10 (3, 4) |
Feature engineering, feature selection and preprocessing | Describe feature engineering and selection with (domain specific) reasoning | | Cai et al. (2018); Kapoor and Narayanan (2022); Mitchell et al. (2019)
Feature engineering, feature selection and preprocessing | Describe and reason of preprocessing steps | | Kapoor and Narayanan (2022); Mitchell et al. (2019)
2. Mission statement that expresses the set of services the product provides.
Extending these requirements, the AI Act mandates in Art. 13 the disclosure of the use of
AI in outcome determination to users, with the level of detail depending on their function.
In all cases, it is essential to explain the reasoning behind AI’s decisions and the impact
of incoming data features, while also implementing a process to override decisions when
necessary. Additionally, for both regular and emergency tasks, a failsafe responsibility must
be established. Finally, it is recommended to provide training to employees on the usage of
the AI application to enhance acceptance and productivity.
2.4.2 Monitoring
For monitoring purposes and insights into the functioning of the AI, a suitable interface is
required. AI Act Art. 14 specifies: “It has to be guaranteed that the potentials, limitations
and decisions are transparent without bias.” It is proposed that the interface provides
information at different levels of detail. This way operating staff can be supported in
carrying out their tasks with crucial process-relevant information from the AI, while data
scientists might want to get more insights into training metadata to optimize learning
curves. Avoiding bias is a current research topic. Kraus, Ganschow, Eisenträger, and
Wischmann (2021) point out that an explanation of the algorithm alone is not sufficient
for transparency. In accordance with the recommendations set forth by UNESCO (2022)
“Transparency aims at providing appropriate information to the respective addressees to
enable their understanding and foster trust. Specific to the AI system, transparency can
enable people to understand how each stage of an AI system is put in place, appropriate to
the context and sensitivity of the AI system. It may also include insight into factors that
affect a specific prediction or decision, and whether or not appropriate assurances (such
as safety or fairness measures) are in place. In cases of serious threats of adverse human
rights impacts, transparency may also require the sharing of code or datasets.”
Model drift summarizes the impact of changes in concept, data, or software and their effect
on the model predictions. Concept drift can occur when the underlying correlations of
input and output variables change over time, while data drift can occur when the known
range is left, and software drift can occur when, for example, the data pipeline is modified
and data fields are renamed or units are changed. The effects of changes in the model
ecosystem should be closely monitored. That is, to prevent a low model performance based
on corrupted or unexpected input data, it is best practice to compare the input data with
training data. Metrics such as f-divergence may be used to express differences in the data
distribution and calculate limits for how far the live model performance may deviate from
the training performance.
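As a sketch of such a comparison, the function below estimates a histogram-based KL divergence (an f-divergence) between a training feature and its live counterpart and raises an alert above a threshold; the bin count and alert threshold are illustrative assumptions.

```python
# Sketch of a simple per-feature input-drift check using a histogram-based KL divergence.
# The bin count and alert threshold are illustrative and must be tuned per feature.
import numpy as np
from scipy.stats import entropy

def kl_drift_alert(train_values, live_values, bins=30, threshold=0.1):
    lo = min(train_values.min(), live_values.min())
    hi = max(train_values.max(), live_values.max())
    p, _ = np.histogram(train_values, bins=bins, range=(lo, hi), density=True)
    q, _ = np.histogram(live_values, bins=bins, range=(lo, hi), density=True)
    p, q = p + 1e-9, q + 1e-9                  # avoid empty-bin divisions
    divergence = float(entropy(p, q))          # KL(train || live)
    return divergence, divergence > threshold  # True -> raise a monitoring alert
```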
2.4.5 Privacy
If personal data is collected in the course of the AI deployment, secure storage of this data
and deletion periods must be established with the help of a data protection officer. In
addition, a process for providing information about personal data must be created in order
to process requests from data subjects. The GDPR imposes a minimization of personal
data collection. Methods such as differential privacy may be used to privatize personal data
during collection or processing (Dwork, Roth, et al., 2014).
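To illustrate the idea, the sketch below applies the Laplace mechanism to a counting query, one of the basic constructions in Dwork et al. (2014); the data, the predicate, and the epsilon value are illustrative.

```python
# Sketch of the Laplace mechanism for an epsilon-differentially-private count.
# A counting query has sensitivity 1, so noise is drawn from Laplace(0, 1/epsilon).
import numpy as np

def private_count(values, predicate, epsilon=1.0, seed=None):
    rng = np.random.default_rng(seed)
    true_count = sum(1 for v in values if predicate(v))
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

# Illustrative query: how many (hypothetical) records belong to people older than 40?
ages = [23, 45, 31, 67, 52, 29]
print(private_count(ages, lambda a: a > 40, epsilon=0.5))
```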
2.4.6 MLOps
While the AI Act imposes no specific requirements for MLOps, except for ensuring the safe
operation of AI throughout its lifecycle, Poretschkin et al. (2021) provide several recom-
mendations regarding the implementation of secure MLOps practices.
To ensure uninterrupted operation, a well-regulated MLOps process is of utmost impor-
tance. The frequency of regular updates can be determined based on risk and technology
impact assessments. When there is a higher risk associated with deviations between the
training data and model data, more frequent retraining should be considered. It is crucial
to store metadata related to the training process and model, including parameters used,
training data, and model performance, to enable reproducibility. This approach relies on
the availability of rapidly collected and accurately labeled new data. In situations where
such data collection and labeling are not feasible, additional oversight of the AI system by
trained personnel, such as data scientists, becomes necessary.
Moreover, based on the identified risks, it is advisable to establish a dedicated testing
period for updates to both the model and infrastructure. Smooth transitions necessitate
clear delineation of responsibilities for the commit and review processes, as well as de-
ployment. In cases where swift rollback is required, well-documented versioning including
data is essential. Additionally, employing a mirrored infrastructure enables updates to be
installed seamlessly without causing downtime.
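A minimal sketch of such metadata tracking is shown below, using MLflow (the tool named in the example operation card in Appendix A); the run name, parameters, and metric values are illustrative placeholders.

```python
# Sketch of logging training metadata for reproducibility with MLflow.
# All parameter and metric values shown here are illustrative placeholders.
import mlflow

with mlflow.start_run(run_name="quarterly-retraining"):
    mlflow.log_param("model_architecture", "custom_cnn")
    mlflow.log_param("learning_rate", 1e-6)
    mlflow.log_param("epochs", 200)
    mlflow.log_param("train_data_version", "v1.3")  # tag from the data version control system
    mlflow.log_metric("test_recall", 0.995)
    mlflow.log_metric("test_precision", 0.91)
```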
Operation Card
Step | Requirement | AI Act | References
Scope and aim of monitoring | Describe monitored components | Art. 61 (1, 3) |
Scope and aim of monitoring | Assess risks and potential dangers according to Use Case Card | Art. 9 (2) |
Scope and aim of monitoring | List safety measures for risks | Art. 9 (4) | Poretschkin et al. (2021)
Operating concept | Create and document utilisation concept | Art. 13, Art. 17, Art. 16 |
Operating concept | Plan staff training | Art. 9 (4c) | Ahlfeld, Barleben, et al. (2017)
Operating concept | Determine responsibilities | | Poretschkin et al. (2021)
Autonomy of application | Document decision-making power of AI | Art. 14, Art. 17 | AI HLEG (2019); Poretschkin et al. (2021)
Autonomy of application | Determine process to overrule decisions of the AI | Art. 14 (4) d) e) | Poretschkin et al. (2021)
Responsibilities and measures | Document component wise: assessed risk, control interval, responsibility, measures for emergency | Art. 9 | Ahlfeld et al. (2017); AI HLEG (2019); Poretschkin et al. (2021)
Model performance | Monitor input and output | Art. 17 (1) d), Art. 61 (2) | Poretschkin et al. (2021)
Model performance | Detect drifts in input data | | Poretschkin et al. (2021)
Model performance | Document metric in use to monitor model performance | Art. 15 (2) |
AI interface | Establish transparent decision making process | Art. 52 | Kraus et al. (2021); Poretschkin et al. (2021)
AI interface | Establish insight in model performance on different levels | Art. 14 | Poretschkin et al. (2021)
IT security | Document individual access to server rooms | Art. 15 | Amschewitz, Gesmann-Nuissl, et al. (2019); Winter et al. (2021)
IT security | Set and document needed clearance level for changes to AI/deployment/access regulation | | Amschewitz et al. (2019); Winter et al. (2021)
IT security | Establish and audit ISMS | | ISO/IEC 27000 (2018); ISO/IEC 27001 (2022)
Privacy | Justify and document use or waiver of a privacy preserving algorithm | Art. 10 (5) | Dwork et al. (2014); Poretschkin et al. (2021)
Privacy | If applicable, document privacy algorithm and due changes in the monitoring of output data | | Dwork et al. (2014)
Testing and rollout | Document software tests | Art. 17 (1) d) | Poretschkin et al. (2021)
Testing and rollout | Set period of time that an update must function stably before it is transferred to live status | |
Involvement of Individuals | If feasible, describe involvement of affected individuals | |
Acknowledgments
This work builds on insight from the research projects veoPipe and ML4Safety and a collaboration
with Audi AG. We thank Hannah Kontos and Diana Fischer-Preßler for proofreading.
Describe the disadvantages of the High costs for CT-scans; manual visual inspections
current approach are not feasible due to high time consumption and,
as a mundane task, are not appealing to the workforce.
Argue why classical approaches are Rule-based algorithms may not yield the necessary
not sufficient confidence.
Solution ap- Describe the integration into the Images of the frames are automatically taken. The
proach current workflow ML model will evaluate the images in a few sec-
onds and present the results to current shift work-
ers via an interface. Detected defective frames will
automatically be marked for manual inspection.
The pass/fail labels assigned by the model can be
manually overwritten by workforce. Workers will
perform manual inspections on defective parts be-
fore submitting the frames to further assembly.
Formulate the learning problem Binary label classification with optional heatmaps
/ attribution maps applied to the predictions.
Formulate the KPIs for go-live A recall above 99% is targeted while preserving
a precision of more than 90% and the maximum
inference time should not exceed 1 second.
Provide a short data description Photos are taken with an industrial camera. Both
sides of the frame are photographed. The pictures
are of size 1024 x 1024 pixels.
Societal and List possible dangers to the envi- No risk. Good quality assurance reduces waste.
environmental ronment
well being
Consider ethical aspects No risk.
Evaluate effects on corporate ac- An ELSI-Group (Ethical, Legal and Social Impli-
tions cations) has been formed by the workers council.
Identify impact on the staff The AI is solely used to ensure quality in the prod-
uct. No devaluation of workforce quality is prac-
ticed. No analyses are conducted on the quality of
work of the production staff and no employment
contracts are terminated due to the use of AI.
Accountability Estimate the financial damage on Broken frames might be shipped and need to be
failure replaced. If customers get injured they might file
for lawsuits. Misclassified flawless frames might
get disposed of.
Estimate the image damage on fail- Shipped misclassified broken frames might asso-
ure ciate the company with poor quality.
Norms List of relevant norms in this con- ISO 43.150 Cycles
text
Data Card
Step Requirement Measures
General Name originator of the dataset and John Doe, Project Manager.
provide contact person
Describe the intended use of the Train a supervised ML model for the recognition
dataset of defective bike frames.
Describe licensing and terms of us- Part of the data was recorded internally and can
age be used without any restrictions. Another part of
the data is provided from a service partner under
a specific licensing agreement.
Data- Collect requirements for the data Before starting the data collection process, domain
Description before starting data collection experts were involved to discuss how defects oc-
cur on the bikes. We discussed the proper amount
of data and the necessary resolution of the images.
Describe a data point with its in- One data point consists of two RGB images with a
terpretation resolution of 1024×1024 pixels. One image shows
the frame from the left, the other one from
the right. Each image was taken with an industrial
camera of type [MODEL].
Maybe, provide additional docu- Not necessary, image data are human inter-
mentation to understand the data pretable.
If there is GDPR relevant data, de- No personal data.
scribe it
If there is biometric data, describe No biometric data.
it
If there is copyrighted data, sum- No copyrighted data.
marize it
If there is business relevant infor- Yes, the data of the intact bike frames provides
mation, describe it information about the glueing technique for the
carbon fiber frames. The defective bike frames
contain information about weaknesses and unsta-
ble parts. Both information could be relevant for
competitors.
Collection Describe the data collection proce- Data was systematically collected by our com-
dure and the data sources pany during the first year as well as by our ser-
vice partner using commercial industrial cameras
of type [MODEL] under diffuse lighting conditions
for both cases. We collected data by taking the im-
ages of the frames directly after production. Our
service partner captured images of the frames it
got from its customers. We merged the data and
visually checked for homogeneity. After that we
labeled the data using an external labeling com-
pany.
Use data version control A data versioning system is applied in the cloud.
Consider prior requirements for The exact amount of data was collected as defined
data above.
Include and describe metadata We include the type of bicycle frame as well as the
exact date the image was taken.
Involve domain experts and de- Domain experts were involved during the collec-
scribe their involvement tion of requirements for the dataset. They also
provided instructions and a gold standard dataset
for the labeling company. They made several spot
checks to control for correctness of the labels af-
terwards.
Describe technical measures to en- We performed an embedding followed by a cluster-
sure completeness of data ing to identify groups of bike frames other than the
different bike models. Those clusters were checked
by domain experts. The numbers sold of our dif-
ferent bike models are accurately represented in
the data. Also we tried to include enough defect
data to make sure the trained model learns how a
defect looks in the images.
Describe and record edge cases One edge case is where the use of too much glue
could lead to a label defect. This also happened a
few times during the labeling process.
If personal data is used, make sure No individuals are represented in this dataset.
and document that all individuals
know they are part of the data
Describe if and how the data could It could be used by competitors to find weaknesses
be misused and important design choices in our product line.
If fairness is identified as a risk, list No information is saved to reconstruct the identity
sensitive attributes of workers.
If applicable, address data poison- Not applicable.
ing
Labeling If applicable, describe the labeling The labeling company got a short briefing and a
process gold-standard labeling set by our domain experts.
After the labeling we made spot checks on the
dataset.
If applicable, describe how the la- Domain experts perform spot checks.
bel quality is checked
Splitting Create and document meaningful We divided the dataset into easy and hard data
splits with stratification points. In the hard data points there were new
bike models previously not seen by the model.
Describe how data leakage is pre- We use the same camera type for all images. After
vented training we perform heatmap analysis to ensure
the model focuses on the bike frame.
Recommendation: test the splits The variance between several evaluations on the
and variance via cross-validation test set is below 1%.
Recommendation: split dataset An additional (hard) test set was created that con-
into difficult, trivial and moderate tains novel frames from the service provider that
are not part of train data. This allows us to eval-
uate the applicability of our model to future prod-
ucts.
Recommendation: put special fo- The gold-standard data is included in the test set
cus on label quality of test data and all test labels were inspected by two separate
experts.
Reminder: Perform separate data Done.
preprocessing on the splits
Preprocessing Document and motivate all pro- The image data is normalized from 0 to 1 to en-
cessing steps that are a fixed part sure a good behaviour of the convolutional neural
of the data network (CNN).
Document whether the raw data The raw data is stored on our servers and can be
can be accessed accessed at any time by us. It is not publicly avail-
able.
If sensitive data is available high- No (pseudo)-anonymization needed.
light (pseudo)-anonymization
If fairness is a risk, highlight fair- We do not record information to draw conclusion
ness specific preprocessing about the identity of specific workers.
Analyzing Understand and document charac- We have 4 700 intact bike frames and 500 bike
teristics of test and training data frames with defects. 300 of the defective frames
are from our own production and are identical to
our frames, 200 are from the service provider. We
use 20% of the data for the test split.
Document why the data distribu- Images were taken directly at our test production
tion fits the real conditions or why line and at the location of our service provider
this is not necessary for the use case using real intact and defective frames.
Document limitations such as er- It is possible that non-superficial defects cannot
rors, noise, bias or known con- be identified by a visual check.
founders
Serving Describe how the dataset will be New images from the production line and the ser-
maintained in future vice provider will be added to the dataset after
labeling by the external company.
Describe the storage concept We operate a server with redundancy to store the
image data.
Describe the backup procedure A compressed and encrypted backup of the cur-
rent dataset is saved at an external cloud storage
provider.
If necessary, document measures Not necessary.
against data poisoning
Model Card
Step Requirement Measures
General Provide Name and details of a con- Jane Doe, Project Manager
tact person
Provide Name of the person who John Doe, ML Engineer
created the model
Document the creation date and 01-2023, Version 1.0
version of the model
Describe intended use of the model Detection of defects in carbon fiber frames
Description Describe the architecture of the Custom CNN
used model
Describe the used hyper- Six convolution layers with ReLu activation fol-
parameters lowed by max-pooling. Sigmoid activation in the
final layer. The model is trained for 200 epochs
with learning rate 1e-6.
Document the training, validation, Recall on train and both moderate and hard test
and test error set is 100%. However, the precision is only around
90%. Hence, the recall is good enough to apply the
solution but an improved precision would further
reduce manual checks.
Document computation complex- Model was trained on cloud resources using 6
ity, training time and energy con- NVIDIA A100 GPUs for 32 hours. Estimated en-
sumption (for foundation models ergy consumption: 70 kWh.
describe steps taken to reduce en-
ergy consumption)
Explainability Document the demand for explain- The model should provide heatmaps to workers to
and inter- ability and interpretability help them detect regions of interest.
pretability
Describe taken actions if any GradCam (or similar) will be applied post-hoc
during operation.
Feature en- Describe the consideration of ex- Not relevant because there is no feature engineer-
gineering, plainability and interpretability (if ing.
feature se- relevant)
lection and
preprocessing
Describe feature engineering and No feature engineering.
selection with (domain specific)
reasoning
Describe and reason of preprocess- Data augmentation in form of brightness and
ing steps Gaussian noise.
Model selec- Describe the consideration of A standard heatmap method must be applicable
tion explainability, interpretability and to the model.
fairness (if relevant)
Describe the baseline model and Multiple baselines were evaluated: a linear model,
its evaluation a ViT, and smaller and larger CNNs.
Describe the reasoning for the The final model was chosen based on a trade-off
model choice between performance and model size.
Describe why the complexity of Yes, smaller or simpler models performed worse.
model is justified and needed Larger models increased the inference time too
much.
Document the comparison to other Using a deeper model such as ResNet only
considered models marginally improves performance by less than 1%.
Document the approach of hyper- Random search. Increases test performance by
parameter optimization around 5%.
Describe the model evaluation The model is evaluated w.r.t. the recall and pre-
cision on both the moderate and hard test and,
additionally, tested in production.
Choice of met- Describe selected metrics and de- The evaluated metric is recall and an inference
rics scribe reasoning regarding use case time below 0.5 seconds. Fairness is not applicable
and fairness here.
Define the minimum requirement Recall above 99% is the target. Either purely from
for use in production (domain spe- the model or by including a human in the loop. A
cific reasons) precision of 90% is targeted but not mandatory.
Model confi- Document and quantify uncer- Model confidence was not assessed, since the
dence tainty of the model model was trained w.r.t recall.
Document approach of dealing None, see above.
with uncertainty
Testing in real Describe test design During the initial phase (expected to be the first
world setting year) all frames will additionally be checked man-
ually. During this phase, faulty frames will be
injected into the line in order to validate the AI
solution.
Describe possible risks, edge cases Edge cases are tested both using the different test
and worst case scenarios and create sets (hard and medium) and through injection of
(or simulate) them if possible faulty parts.
Describe limitations and shortcom- Change of production process or architecture of
ings of the model frames might introduce new faults that might not
be detected by the model.
Describe the test results The test is still running but revealed problems of
the AI solution under specific lighting conditions.
Explain the derived actions The dataset was extended to include such condi-
tions.
Operation Card
Step Requirement Measures
Describe monitored components The CV system as a whole is monitored. Input
Scope and aim and output of the model are evaluated separately.
of monitoring Additionally the function of the cameras is moni-
tored.
General
Assess risks and potential dangers Undetected faults lead to the shipment of broken
according to Use case card frames. False Alarms might lead to ineffective
manual inspections and trust issues of personnel.
List safety measures for risks Monitoring of input and output data. Regularly
reoccurring computed tomography enables assess-
ment of model performance. Manual inspections
enable the assessment of precision.
Operating con- Create and document utilisation If the CV system reports possible damage of a
cept concept frame, staff does a standardized thorough visual
and tactile examination.
Plan staff training Each employee who conducts the examination
needs to take a training session with test. Staff
training of the user interface and AI basics is
mandatory.
Determine responsibilities The shift supervisor is responsible for the exami-
nation and dedicated IT personnel for the CV sys-
tem.
Autonomy of Document decision-making power The AI system checks autonomously and only no-
the application of AI tifies personnel if faults are detected. The user
interface allows checking the system and the im-
ages made by the cameras at all times.
Determine process to overrule deci- Regularly reoccurring computed tomography tests
sions of the AI overrule the CV system's decision. Personnel can
overrule the AI system’s decision after manual in-
spection of a frame.
Responsibilities Component wise: assessed risk, Risks of system components (CV model, cameras,
and measures control interval, responsibility, data processing pipeline and user interface) are
measures for emergency assessed separately. Further detailed information
is attached to this document.
42
Model Reporting for Certifiable AI
Model perfor- Monitor input and output Input: Distribution of pixel values are monitored.
mance Output: Number and distribution of defective
frames is monitored over a fixed time period.
Detect drifts in input data Input data is monitored with TorchDrift's Kernel
MMD drift detector.
Document metric in use to monitor Model performance is monitored via precision and
model performance recall. If precision is over 95% or under 85% per-
sonnel will be alerted. If the CV system misses a
defective frame (detected by computed tomogra-
phy) personnel will be alerted immediately. If the
percentage of defective frames increases, or per-
centage of defective frames decreases over time,
personnel is alerted immediately.
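The drift requirement above names TorchDrift's Kernel MMD drift detector. Rather than reproducing the library's API, the following sketch illustrates the underlying test: a squared Maximum Mean Discrepancy with a Gaussian kernel and a permutation test on batches of image features. The feature extractor and the 5% alert level are assumptions, not part of the documented system.

```python
import torch

def gaussian_kernel(x: torch.Tensor, y: torch.Tensor, sigma: float) -> torch.Tensor:
    """Gaussian (RBF) kernel matrix between two feature batches."""
    d2 = torch.cdist(x, y) ** 2
    return torch.exp(-d2 / (2.0 * sigma ** 2))

def mmd2(x: torch.Tensor, y: torch.Tensor, sigma: float) -> torch.Tensor:
    """Biased estimate of the squared Maximum Mean Discrepancy."""
    return (gaussian_kernel(x, x, sigma).mean()
            + gaussian_kernel(y, y, sigma).mean()
            - 2.0 * gaussian_kernel(x, y, sigma).mean())

def mmd_drift_test(reference: torch.Tensor, current: torch.Tensor,
                   n_permutations: int = 500) -> tuple[float, float]:
    """Return (mmd2, p_value); a small p-value indicates input drift."""
    # Median heuristic for the kernel bandwidth.
    pooled = torch.cat([reference, current], dim=0)
    sigma = torch.cdist(pooled, pooled).median().clamp_min(1e-6).item()
    observed = mmd2(reference, current, sigma).item()
    n_ref = reference.shape[0]
    count = 0
    for _ in range(n_permutations):
        perm = pooled[torch.randperm(pooled.shape[0])]
        if mmd2(perm[:n_ref], perm[n_ref:], sigma).item() >= observed:
            count += 1
    p_value = (count + 1) / (n_permutations + 1)
    return observed, p_value

# Usage sketch: `extract_features` (e.g. a frozen CNN backbone applied to
# camera frames) and the 0.05 alert level are illustrative assumptions.
# score, p = mmd_drift_test(extract_features(reference_frames),
#                           extract_features(recent_frames))
# if p < 0.05:
#     alert_it_personnel("input drift detected")
```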
Step: AI Interface
Requirement: Establish a transparent decision-making process
Measures: Decision making is made transparent with heatmaps displayed on the user interface (one possible heatmap method is sketched after this step).

Requirement: Establish insight into model performance on different levels
Measures: Production staff: the current frame and its heatmap are displayed, including heatmaps of frames identified as defective. IT personnel: camera images and heatmaps analogous to production staff, plus access to logs, performance metrics and the monitoring described above.
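The card requires heatmaps on the user interface but does not name the attribution method. A Grad-CAM-style computation is one plausible choice for a convolutional model; the sketch below is illustrative, and the model, target layer and class index are assumptions.

```python
import torch
import torch.nn.functional as F

def grad_cam(model: torch.nn.Module, target_layer: torch.nn.Module,
             image: torch.Tensor, class_idx: int) -> torch.Tensor:
    """Return a heatmap in [0, 1] with the spatial size of `image` (1xCxHxW)."""
    activations, gradients = [], []

    # Capture the target layer's activations and output gradients via hooks.
    fwd = target_layer.register_forward_hook(
        lambda m, inp, out: activations.append(out))
    bwd = target_layer.register_full_backward_hook(
        lambda m, gin, gout: gradients.append(gout[0]))
    try:
        model.eval()
        logits = model(image)
        model.zero_grad()
        logits[0, class_idx].backward()
    finally:
        fwd.remove()
        bwd.remove()

    acts, grads = activations[0], gradients[0]        # (1, K, h, w)
    weights = grads.mean(dim=(2, 3), keepdim=True)    # channel-wise importance
    cam = F.relu((weights * acts).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear",
                        align_corners=False)
    cam = cam - cam.min()
    return (cam / cam.max().clamp_min(1e-8)).squeeze()

# Usage sketch (names are illustrative): overlay the returned heatmap on the
# camera frame shown in the user interface.
# heatmap = grad_cam(cv_model, cv_model.layer4, frame_tensor, class_idx=1)
```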
Step: IT security
Requirement: Document individual access to server rooms
Measures: Access to server rooms is restricted to authorized personnel only.

Requirement: Set and document the clearance level needed for changes to the AI, the deployment and the access regulation
Measures: Only trained and authorized personnel can make changes to the CV system. Changes need to be verified by a second person before they are applied.

Requirement: Establish and audit an ISMS
Measures: The IT system in which the AI application will be integrated is certified according to ISO/IEC 27001. A review of the system will be conducted with an IT security manager to evaluate the need for re-certification.

Step: Privacy
Requirement: Justify and document the use or waiver of a privacy-preserving algorithm
Measures: Not applicable, since no personal data is collected.

Requirement: If applicable, document the privacy algorithm and the resulting changes in the monitoring of output data
Measures: Not applicable.
Step: MLOps
Requirement: Establish a versioned code repository for the AI, its training and its deployment
Measures: Version control is implemented with Git and MLflow (a tracking sketch follows this step).

Requirement: Establish a maintenance and update schedule
Measures: The ML model is retrained every three months (if new data is available) on a blend of old and new data, or earlier if a drift is detected.

Requirement: Set regulation for the retraining of the AI and the decision basis for the replacement of a model
Measures: The model is updated with new training data. The update goes live if the recall does not drop and the precision increases or remains constant. If performance measures drop twice in a row after retraining, the data will be analyzed.
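For the MLOps step, retraining runs can be tracked with standard MLflow calls and tagged with the current Git commit, and the documented go-live rule (recall must not drop, precision must not decrease) can be encoded as a small gate. The sketch below is illustrative; the experiment name, metric keys and deployment helper are assumptions.

```python
import subprocess
import mlflow

def log_retraining_run(metrics: dict, model_artifact_note: str) -> None:
    """Log one retraining run to MLflow, tagged with the current Git commit."""
    commit = subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()
    mlflow.set_experiment("frame-defect-cv")   # illustrative experiment name
    with mlflow.start_run():
        mlflow.set_tag("git_commit", commit)
        mlflow.set_tag("model_artifact", model_artifact_note)
        for name, value in metrics.items():
            mlflow.log_metric(name, value)

def may_go_live(new: dict, production: dict) -> bool:
    """Card rule: deploy only if recall does not drop and precision
    increases or remains constant."""
    return (new["recall"] >= production["recall"]
            and new["precision"] >= production["precision"])

# Usage sketch with illustrative numbers:
# new = {"recall": 0.993, "precision": 0.91}
# prod = {"recall": 0.992, "precision": 0.90}
# log_retraining_run(new, "candidate model, retraining 2023-Q3")
# if may_go_live(new, prod):
#     promote_to_test_environment()   # hypothetical deployment helper
```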
Step: Testing and rollout
Requirement: Determine responsibilities for updates
Measures: Updates are managed by dedicated IT personnel. Changes need to be verified by a second person before they are applied.

Requirement: Document software tests
Measures: Software tests are automated and documented.

Requirement: Set the period of time that an update must function stably before it is transferred to live status
Measures: New software and model versions run separately on the test environment, in parallel to the production environment, for one week. Performance is evaluated and compared to the old version before deployment (a comparison sketch follows this step).
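The one-week parallel run can end with a direct comparison of the candidate and the production version on the manually labelled shadow window. The sketch below is illustrative and reuses the same rule as the MLOps gate; the variable names and the deployment helper are assumptions.

```python
from sklearn.metrics import precision_score, recall_score

def compare_versions(y_true, y_old, y_new) -> bool:
    """Decide whether the candidate may replace the production version after
    the one-week parallel run (same rule as the MLOps gate above)."""
    old_recall, new_recall = recall_score(y_true, y_old), recall_score(y_true, y_new)
    old_precision, new_precision = precision_score(y_true, y_old), precision_score(y_true, y_new)
    return new_recall >= old_recall and new_precision >= old_precision

# Usage sketch: y_true are the manual labels collected during the shadow week,
# y_old / y_new the predictions of the production and candidate versions.
# if compare_versions(week_labels, old_predictions, new_predictions):
#     schedule_deployment()   # hypothetical helper
```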