Audit Practitioner's Guide to Machine Learning, Part 1: Technology
ABSTRACT
As artificial intelligence (AI) and machine learning (ML) continue to be rapidly adopted by
companies and governments around the world, existing auditing frameworks and
information technology controls must be better tailored to address the unique and
evolving risk AI and ML pose. Risk arises from the choice and design of the ML model and
the development cycle. Risk is also a factor in compliance with regulations, such as the
EU’s General Data Protection Regulation (GDPR) or the California Consumer Privacy Act
(CCPA) in the US.
This white paper series, consisting of two parts, provides a systematic auditing guideline
that identifies technology and compliance risk as the two primary focal points for
auditors.1 Part 1 of this white paper series addresses technology risk in ML auditing by
dissecting the typical software development life cycle of AI/ML algorithms and
highlighting the key areas that IT auditors should investigate. Part 2 addresses
compliance risk by identifying actionable steps that IT auditors can take to promote
compliance with applicable laws, regulations and industry standards.
1 Ahmed, H.; "Auditing Guidelines for Artificial Intelligence," @ISACA, 21 December 2020, www.isaca.org/resources/news-and-trends/newsletters/atisaca/2020/volume-26/auditing-guidelines-for-artificial-intelligence
In a recent Deloitte survey, respondents from 83% of the companies polled believed AI has made or will make a […]2

In the past decade, public concern about technological risk has generally focused on the expanding collection […]. AI in products and processes and proliferating algorithms increase the potential for inaccurate or biased decisions—in particular, these complex models may affect the daily lives, health and rights of individuals seeking approvals for loan applications, making investment decisions or planning treatment for medical conditions.

[…] the Office of the Comptroller of the Currency in the US5 and Environment, Social and Governance (ESG) considerations.6 […] the Health Insurance Portability and Accountability Act of 1996 (HIPAA), and other elements of consumer data are addressed in the GDPR, CCPA and similar regulations.
2 McKendrick, J.; "AI Adoption Skyrocketed Over the Last 18 Months," Harvard Business Review, 27 September 2021, https://hbr.org/2021/09/ai-adoption-skyrocketed-over-the-last-18-months
3 European Institute of Public Administration (EIPA), "The Artificial Intelligence Act Proposal and its Implications for Member States," September 2021, www.eipa.eu/publications/briefing/the-artificial-intelligence-act-proposal-and-its-implications-for-member-states/
4 Daws, R.; "EU regulation sets fines of €20M or up to 4% of turnover for AI misuse," AI News, 14 April 2021, https://artificialintelligence-news.com/2021/04/14/eu-regulation-fines-20m-up-4-turnover-ai-misuse/
5 Office of the Comptroller of the Currency (OCC), "Model Risk Management," August 2021, www.occ.treas.gov/publications-and-resources/publications/comptrollers-handbook/files/model-risk-management/pub-ch-model-risk.pdf
6 Anand, R.; B. Greenstein; "Six AI business predictions for 2022," PwC AI and Analytics, 30 November 2021, https://www.pwc.com/us/en/tech-effect/ai-analytics/six-ai-predictions.html
• Fairness—Fairness relates to the emerging field of data ethics and requires evaluating the impact of AI-based outcomes on people's rights, whether the model makes the decision on its own or in concert with a human agent. The Boston housing price data set, which has an explicit parameter to identify a locality as a Black or White area and how those racial demographics affect housing prices, is a classic example.7
• Brand reputation and trust—Twenty-two percent of enterprises in the last two to three years have already faced customer backlash due to decisions reached via their AI systems.8 One prominent area where customer concerns have surfaced is in facial recognition systems, as potential misidentification can have a significant impact in sensitive circumstances, such as airport security monitoring or exam proctoring.
• Model transparency—For simple ML models such as linear regression, it is fairly easy to notify users of the system's logic and decision-making parameters. However, for state-of-the-art black box ML models such as deep learning, there is currently no simple way to disentangle and weigh the many parameters in the hidden layers. These transparency differences have become especially important in recent years as transparency and explainability have become core principles for AI use, as seen in new laws, regulations and industry standards.
• Learning and adapting algorithms—While algorithms that learn and adapt may be more accurate, they can evolve in a discriminatory way. It is critical to implement a continuous monitoring procedure and enforce it in the development and production environments.

Challenges associated with privacy, fairness, trust, discrimination, prediction and trending cut across categories of technology and compliance risk. For example, model systematic errors can be addressed with a thorough technology risk audit. For the purposes of this white paper series, audits of AI and ML applications should consider the following:
• Technology risk auditing—Assess risk related to ML algorithms and life cycle, data science, and cybersecurity. Practitioners are advised to implement auditing procedures based on the six key stages in a typical AI and ML application development cycle:
• Data governance
• Data engineering
• Feature engineering
• Model training
• Model testing (evaluation)
• Model deployment

Part 1 of the white paper series covers technology elements of auditing ML. Part 2 addresses compliance considerations.
7 Delve, "The Boston Housing Dataset," 10 October 1996, www.cs.toronto.edu/~delve/data/boston/bostonDetail.html
8 Thieullent, A.-L.; et al.; "AI and the Ethical Conundrum: How organizations can build ethically robust AI systems and gain trust," Capgemini Research Institute, 2020, https://www.capgemini.com/wp-content/uploads/2020/10/AI-and-the-Ethical-Conundrum-Report.pdf
AI is an umbrella term with broad application; ML and deep learning (DL) are subsets of AI with important distinctions for the purposes of this white paper. Currently, most of the applications classified as AI are actually ML tools that leverage statistical models that "summarize" the patterns in large data sets, which can later be used to make predictions on new data.
9 Lexico, "Artificial Intelligence," www.lexico.com/en/definition/artificial_intelligence
10 Cambridge Dictionary, "machine learning," https://dictionary.cambridge.org/us/dictionary/english/machine-learning
Based on this hierarchy, this white paper focuses on machine learning because it is the most common family of […]; most ML in use today is supervised.

Key categories of supervised learning include:
• Classification—Supervised learning that involves predicting a nominal class label. For example, does a chest X-ray image indicate "Has Cancer" or "Not Cancer"? Typical algorithms are logistic regression, support vector machine and random forest.
• Regression—Supervised learning that involves predicting a numerical label. For example, "this three-bedroom house in the 94301 postal code area may be worth $2.1 million." Typical algorithms are linear regression and support vector regression.
• Hybrid—Supervised learning that involves the use of labeled data and can learn from partially labeled data, such as semi-supervised learning.

Knowledge Check: An ML model predicts a customer's ZIP code. Is this a classification task or a regression task?13

Unsupervised learning studies how systems can infer a function to describe a hidden structure without labeled responses. Instead, it explores the data and draws inferences from data sets to describe hidden structures in unlabeled data.12 Key categories of unsupervised learning include:
• Clustering—[…] Typical algorithm: K-means clustering
• Dimensionality reduction—To compress the data by projecting them from higher to lower dimensions while preserving certain characteristics, such as reducing the dimensions of city, state and postal code to location. Typical algorithm: principal component analysis (PCA)
• Anomaly detection—To identify outliers that stand out from the crowd, such as a credit card account that made 157 purchases in one day, which may indicate fraudulent activity. Typical algorithms: median, z-score statistics, local outlier factor (LOF)
• Association rule—To discover rules for correlation patterns, such as people who buy X also tend to buy Y.
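To make the supervised versus unsupervised distinction concrete, here is a minimal scikit-learn sketch; the synthetic data set and the label rule are illustrative assumptions, not taken from the white paper.

# Supervised vs. unsupervised learning in a few lines of scikit-learn.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                 # synthetic feature matrix

# Supervised: the learner is given labels y (e.g., 1 = "Has Cancer").
y = (X[:, 0] + X[:, 1] > 0).astype(int)       # illustrative label rule
clf = LogisticRegression().fit(X, y)          # learns a mapping from X to y
print("Predicted class:", clf.predict(X[:1]))

# Unsupervised: no labels at all; K-means infers hidden structure on its own.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("Cluster assignments:", km.labels_[:10])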
11 Bishop, C.; Pattern Recognition and Machine Learning, Springer, USA, 2006, https://link.springer.com/book/9780387310732
12 Loukas, S.; "What is Machine Learning: Supervised, Unsupervised, Semi-Supervised and Reinforcement learning methods," Towards Data Science, 10 June 2020, https://towardsdatascience.com/what-is-machine-learning-a-short-note-on-supervised-unsupervised-semi-supervised-and-aed1573ae9bb
13 Answer: Classification. Although a ZIP code is composed of numbers, in this case it is a categorical variable to represent a geolocation.
Reinforcement Learning

Reinforcement learning is learning what to do—how to map situations to actions—to maximize a numerical reward signal. The learner is not told which actions to take, but instead must discover which actions yield the most reward by trying them.14
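The trial-and-error loop described above can be sketched in a few lines of Python. The following tabular Q-learning example uses a tiny five-state corridor in which only the rightmost state pays a reward; the environment and all hyperparameters are illustrative assumptions, not from the white paper.

import numpy as np

n_states, n_actions = 5, 2            # states 0..4; actions: 0 = left, 1 = right
alpha, gamma, epsilon = 0.1, 0.9, 0.2 # learning rate, discount, exploration rate
Q = np.zeros((n_states, n_actions))   # the learner's action-value estimates

rng = np.random.default_rng(0)
for episode in range(500):
    s = 0
    while s != n_states - 1:          # the only reward sits at the far right
        # Sometimes explore (try actions), otherwise exploit current estimates
        a = rng.integers(n_actions) if rng.random() < epsilon else int(Q[s].argmax())
        s_next = max(0, s - 1) if a == 0 else s + 1
        r = 1.0 if s_next == n_states - 1 else 0.0   # numerical reward signal
        # Q-learning update: nudge the estimate toward reward + discounted future value
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

print(Q.round(2))  # "right" ends up valued higher than "left" in every state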
Example: DeepMind Technology's AlphaGo and Deep Reinforcement Learning

AlphaGo was the first computer program to defeat a professional human Go player and was the first to defeat a Go world champion in 2016. It is arguably the strongest Go player in history. Go has 10^170 possible board configurations—more than the number of atoms in the known universe—and is far more complex than chess. AlphaGo combines an advanced search tree with deep neural networks such as policy and value networks. The DeepMind team introduced AlphaGo to numerous amateur games to help the program develop an understanding of reasonable human play. Then it played against different versions of itself thousands of times, each time learning from its mistakes. Over time, AlphaGo improved and became increasingly stronger and better at learning and decision-making. This process is known as deep reinforcement learning.15

Example: Bank Direct Marketing Campaign

A banking institution runs direct marketing campaigns on over 45,000 customers. Customers in the marketing campaigns represent dozens of demographics, which makes manual segmentation difficult. Attributes include:
• Education—Categorical (e.g., high.school, university.degree, unknown)
• Default—Categorical (e.g., has credit in default? [no | yes | unknown])
• Housing—Categorical (e.g., has a housing loan? [no | yes | unknown])

The data scientists perform a quick K-modes algorithm (similar to K-means), dividing the data set into two clusters, which identifies the following:
• Cluster 1—Captures demographics such as admin/technician/management jobs, university degrees, homeownership
• Cluster 2—Captures attributes such as blue-collar jobs, high school degrees, lack of homeownership

With these insights coming from the clustering ML algorithm, the chief marketing officer and marketing team can design the next campaign to target each segment.16
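A sketch of the two-cluster segmentation described above follows. It assumes the open-source kmodes package (the white paper names only the algorithm), and the toy records merely mimic the UCI bank marketing attributes.

import numpy as np
from kmodes.kmodes import KModes

# Toy categorical records: job, education, homeownership (illustrative only)
X = np.array([
    ["admin",       "university.degree", "yes"],
    ["management",  "university.degree", "yes"],
    ["technician",  "university.degree", "yes"],
    ["blue-collar", "high.school",       "no"],
    ["blue-collar", "high.school",       "no"],
    ["services",    "high.school",       "no"],
])

km = KModes(n_clusters=2, init="Huang", n_init=5, random_state=0)
labels = km.fit_predict(X)            # K-modes handles categories directly
print(labels)                         # e.g., [0 0 0 1 1 1]
print(km.cluster_centroids_)          # modal (most frequent) category per cluster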
14 Sutton, R.; A. Barto; Reinforcement Learning: An Introduction, Second Edition, MIT Press, USA, 2018, https://mitpress.mit.edu/books/reinforcement-learning-second-edition
15 DeepMind, "AlphaGo," https://deepmind.com/research/case-studies/alphago-the-story-so-far
16 Moro, S.; P. Cortez; P. Rita; "A Data-Driven Approach to Predict the Success of Bank Telemarketing," Decision Support Systems 62:22-31, June 2014, Elsevier; UCI Machine Learning Repository, https://archive.ics.uci.edu/ml/datasets/bank+marketing
An ML application follows a typical data science and software development life cycle. This section highlights the main phases to provide a roadmap for the audit discussion and to highlight the key risk areas:
• Data governance
• Data engineering
• Feature engineering
• Model training
• Model evaluation
• Model deployment and prediction

Data Governance

A data governance framework establishes sound data governance, standardizes data management practices and enhances trust in the integrity of an AI application. Data must be protected against unauthorized access and cyberthreats, such as data poisoning, AI attacks and hacking. Remaining alert to emerging security threats is key.

When auditing a data governance framework, the auditor's considerations should include but are not limited to […] the analytic environment, with attention to:
• Credibility of data source
• […]
[…] letters for customers' names, or area codes may or may not […]. Data sources may suffer from data inaccuracy and missing values, especially […]. Raw data sources are usually noisy, and only a few data fields may be helpful to the business problem.
• Data anonymizing—To comply with HIPAA, GDPR, CCPA and other legal and regulatory requirements, PHI and PII data fields (e.g., social security number) may be anonymized for data analytics, as in the sketch following the knowledge check below.

Knowledge Check: A data set includes the following fields:
1. Customer's full name
2. […]
3. […]
4. Country of residence
5. […]
6. Race
7. Gender
8. Previous year's income
9. Number of years of relationship with the bank

Question: Which of these data sources contain PII that may be restricted by GDPR requirements?17
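As a minimal sketch of the data anonymizing step above, the snippet below replaces a direct identifier with a salted one-way hash before analytics. The field names and salt handling are illustrative assumptions; strictly speaking this is pseudonymization, which may not satisfy every regulatory definition of anonymization.

import hashlib
import pandas as pd

df = pd.DataFrame({"ssn": ["123-45-6789", "987-65-4321"],
                   "income": [52000, 48000]})

SALT = "store-me-in-a-secrets-vault"   # assumption: managed outside the data set

def pseudonymize(value: str) -> str:
    # One-way hash; the salt defeats simple dictionary attacks on known SSNs
    return hashlib.sha256((SALT + value).encode()).hexdigest()[:16]

df["ssn"] = df["ssn"].map(pseudonymize)
print(df)   # analytics can proceed without exposing the raw identifier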
17 Answer: 1, 3 and 5. University of Pittsburgh, "Guide to Identifying Personally Identifiable Information (PII)," www.technology.pitt.edu/help-desk/how-to-documents/guide-identifying-personally-identifiable-information-pii
18 Lucini, F.; "The Real Deal About Synthetic Data," MIT Sloan Management Review, 20 October 2021, https://sloanreview.mit.edu/article/the-real-deal-about-synthetic-data/
Feature Engineering

Feature engineering transforms raw data fields into the features an ML model is trained on.19 Common techniques include:
• Derivation—Some raw data fields might be hard to use directly […] possibilities.

19 Rençberoğlu, E.; "Fundamental Techniques of Feature Engineering for Machine Learning," Towards Data Science, 1 April 2019, https://towardsdatascience.com/feature-engineering-for-machine-learning-3a5e293a5114
• Scaling—Different raw data may have different data ranges, such as age and income. In ML, how can these two columns be compared? Normalization and standardization can render the multivariate features into a similar scale. For example, both age and income can be scaled to between 0 and 1, where 0 is the minimum and 1 is the maximum in the original data.
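A minimal sketch of the scaling idea follows, using scikit-learn's MinMaxScaler on illustrative age and income values.

import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[25,  40000.0],
              [40,  90000.0],
              [65, 150000.0]])      # columns: age, income (illustrative)

scaler = MinMaxScaler()             # per column: (x - min) / (max - min)
print(scaler.fit_transform(X))      # both columns now lie between 0 and 1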
The auditor should also evaluate whether feature engineering introduces new risk:
• Fairness—The auditor should evaluate each engineered feature with regard to whether a feature is introducing discrimination, bias or other unfairness.20
• Privacy—Feature engineering can introduce privacy concerns, especially when correlating the original data with external data sources. The auditor should examine whether additional privacy issues are introduced in this step.

With the rise of DL algorithms, features can be automatically derived from raw data through the use of complex hierarchical neural networks—for example, features learned directly from photos. Hence, in DL models, the feature engineering step is not as essential. However, this raises another important issue: how to determine what features are encoded in the black box ML model in the deep neural network's hidden layers. The following section addresses this topic.

Advanced Topic: White Box and Black Box ML Models and Model Interpretability

Auditors often encounter two kinds of machine learning models:
• White box models—[…]
• Black box models—For example, deep learning models usually have millions of parameters and complex neural network architectures, which makes them more capable of predicting complicated problems. However, it is almost impossible to explain what hidden neural network layers actually learn.24 Recent research (e.g., on the LIME and XAI algorithms) aims to understand the interpretability of DL models.25
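The sketch below shows LIME-style local explanation of a black box model, assuming the open-source lime package; the data set, feature names and class names are illustrative placeholders.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from lime.lime_tabular import LimeTabularExplainer

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = (X[:, 0] - X[:, 2] > 0).astype(int)        # synthetic target
black_box = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

explainer = LimeTabularExplainer(
    X,
    feature_names=["age", "income", "tenure", "balance"],  # assumed names
    class_names=["reject", "approve"],
    mode="classification",
)
# Fit a simple local surrogate around one prediction and report feature weights
exp = explainer.explain_instance(X[0], black_box.predict_proba, num_features=4)
print(exp.as_list())   # per-feature contributions an auditor can review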
Case Study: Netflix's Million-Dollar Prize and the Data Privacy Lesson

Data privacy in the machine learning era might be more subtle than immediately apparent. In 2009, Netflix had to cancel its million-dollar prize that invited ML researchers to build a better recommendation system to predict movies, based on a training data set from 480,000 Netflix customers. Although the data sets were constructed to preserve customer privacy, privacy advocates criticized the prize. In 2007, two researchers from the University of Texas at Austin were able to identify individual users by matching the data sets with film ratings on the Internet Movie Database (IMDb).21 On December 17, 2009, four Netflix users filed a class-action lawsuit against Netflix,22 alleging that Netflix had violated US fair trade laws and the Video Privacy Protection Act by releasing the data sets. There was public debate about privacy for research participants. On March 19, 2010, Netflix reached a settlement with the plaintiffs, after which they voluntarily dismissed the lawsuit.23

For ML auditors, sometimes data privacy in ML can be subtle, especially when feature engineering correlates one data source with external data sources that may breach data privacy requirements.
20 Oneto, L.; S. Chiappa; "Fairness in Machine Learning: Recent Trends in Learning From Data," Studies in Computational Intelligence, Springer, USA, 2020
21 Narayanan, A.; V. Shmatikov; "Robust De-anonymization of Large Sparse Datasets," University of Texas at Austin, www.cs.utexas.edu/~shmat/shmat_oak08netflix.pdf
22 Jane Doe, et al., v. Netflix, Class Action Complaint, US District Court for the Northern District of California, 17 December 2009, www.wired.com/images_blogs/threatlevel/2009/12/doe-v-netflix.pdf
23 Singel, R.; "NetFlix Cancels Recommendation Contest After Privacy Lawsuit," Wired, 12 March 2010, www.wired.com/2010/03/netflix-cancels-contest/
24 Gunning, D.; M. Stefik; J. Choi; T. Miller; S. Stump; G. Yang; "XAI—Explainable artificial intelligence," Science Robotics, 18 December 2019, https://openaccess.city.ac.uk/id/eprint/23405/
25 Ribeiro, M.; S. Singh; C. Guestrin; "'Why Should I Trust You?': Explaining the Predictions of Any Classifier," Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 2016, https://dl.acm.org/doi/10.1145/2939672.2939778
Model Training

Model training is the heart of building an ML model. The process of training a model involves selecting the optimal hyperparameters that fit the training data well, based on the following business justification, use case and success criteria:
• Supervised learning—To minimize errors between the predicted values and target values
• Deep learning—To minimize fitting errors in each epoch iteration
• Unsupervised learning—To maximize cluster tightness metrics
When auditing the model training stage, the two key areas to examine are training data and model methodology.
• Training data—The quality of training data determines the ML model quality. The auditor should ask the following questions regarding training data:
• Where are the training data collected? Specify the sources and access rights.
• What are the statistics for the training data (e.g., percentage of positive and negative labels)?
• How are the training data prepared? What is the split ratio […]?
• Real-world fraud data sets contain far more normal cases than fraud cases because fraud is a small probability incident for the population. Is up-sampling or down-sampling used to balance the training data?
• Model methodology—The ML model's conceptual soundness is vital to success. ML model training is about finding the optimal model between overfitting and underfitting and understanding the assumptions made. The auditor should ask the following questions regarding model methodology:
• What machine learning algorithms are used?
• What are the model selection criteria? In other words, why use this algorithm instead of alternatives?
• What are the limitations of this ML model?
• Transparency—Is it a white box or black box model? How is the machine learning model output explained?

Knowledge Check: ML Model to Detect Online Fraud

An antifraud team is building an ML model to detect online fraud. They have collected 100 past fraud cases from the legal department. Each fraud case has its own unique characteristics, and the data scientists want to make sure the ML model covers all of them. So, the data scientists use all the data cases to train a complex decision tree model with a depth of 50, and they achieve results with 98% accuracy on all 100 cases.

Question: Is this a reliable ML model?26

Knowledge Check: ML Prototypes for Predicting Home Prices

A bank's data science team has developed two ML prototypes for predicting home prices. Model 1 is a simple linear regression model with 5 parameters—a white box ML model. Model 2 is a 10-layer convolutional neural network (deep learning)—a black box ML model. During cross-validation, both models perform well in all metrics.

Question: Which model should the team choose?27

26 Answer: No. This is a typical overfitting model. Without proper cross-validation, this model ends up memorizing all the cases but is unable to generalize to new examples.
27 Answer: Model 1, because it performs just as well but is simpler.
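The overfitting trap in the fraud knowledge check can be reproduced in a few lines: a depth-50 tree scores almost perfectly on its own training data yet shows no skill under cross-validation. The random data set is an illustrative assumption.

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = rng.integers(0, 2, size=100)          # labels carry no real signal

deep_tree = DecisionTreeClassifier(max_depth=50, random_state=0).fit(X, y)
print("Training accuracy:", deep_tree.score(X, y))                 # ~1.0 (memorized)
print("5-fold CV accuracy:", cross_val_score(deep_tree, X, y, cv=5).mean())  # ~0.5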
Model Evaluation

As explained earlier, ML models are categorized into supervised, unsupervised and reinforcement learning; each category calls for different evaluation metrics.

Knowledge Check: […] Question: Was this ML model good enough?28

[Figure: PRICE vs. SIZE scatter plots fitted with polynomials of increasing complexity (θ₀ + θ₁x; θ₀ + θ₁x + θ₂x²; θ₀ + θ₁x + θ₂x² + θ₃x³ + θ₄x⁴), illustrating underfitting and overfitting.]

28 Answer: Not necessarily. An experienced auditor would ask about the false positive rate and the specificity. That is, of the frauds detected, how many were normal transactions (i.e., false positives)? A model that predicts all transactions as fraud would achieve 100% recall, because it would successfully detect all frauds. Yet it is clearly an ineffective ML model, because it would create an excessive number of false positives and would not be practical for production.
To get a holistic assessment of the ML model, it is key to adopt scientific experiment settings. […] The test data set provides an unbiased evaluation of a final model fit on the training data set.29

K-fold Cross-Validation

A more rigorous assessment, cross-validation uses a resampling procedure to evaluate ML models on a limited data sample. It shuffles and splits the data into k groups, keeping one group as a test data set and running the training procedure on the remaining k-1 folds, and repeats this k times. The use of k-fold cross-validation will result in a lower-bias ML model.

Key metrics for classification models include:
• Precision—Proportion of predicted positive cases that are correctly identified
• Negative predictive value—Proportion of predicted negative cases that are correctly identified
• Sensitivity or recall—Proportion of actual positive cases that are correctly identified
• Specificity—Proportion of actual negative cases that are correctly identified
FIGURE 3: Confusion Matrix

                         Actual Positive (P)    Actual Negative (N)
Predicted Positive (PP)  True positive (TP)     False positive (FP)
Predicted Negative (PN)  False negative (FN)    True negative (TN)

Precision = TP/PP        Recall = TP/P
Sensitivity = TP/P       Specificity = TN/N     Accuracy = (TP + TN) / (P + N)
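The figure 3 metrics can be computed directly from predictions with scikit-learn; the labels below are illustrative.

from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
precision   = tp / (tp + fp)              # TP / PP
npv         = tn / (tn + fn)              # TN / PN (negative predictive value)
sensitivity = tp / (tp + fn)              # recall, TP / P
specificity = tn / (tn + fp)              # TN / N
accuracy    = (tp + tn) / (tp + tn + fp + fn)
print(precision, npv, sensitivity, specificity, accuracy)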
[Figure: ROC curve plotting TRUE POSITIVE RATE against FALSE POSITIVE RATE on 0.0-1.0 axes; the 45-degree diagonal represents a random classifier.]30
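A sketch of producing the ROC data with scikit-learn follows; the model and synthetic data set are illustrative placeholders.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, roc_auc_score

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

scores = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
fpr, tpr, thresholds = roc_curve(y_te, scores)   # points along the ROC curve
print("AUC:", roc_auc_score(y_te, scores))       # 0.5 would match the random diagonal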
29 Shah, T.; "About Train, Validation and Test Sets in Machine Learning," Towards Data Science, 6 December 2017, https://towardsdatascience.com/train-validation-and-test-sets-72cb40cba9e7
30 Wikipedia, "Receiver operating characteristic," https://en.wikipedia.org/wiki/Receiver_operating_characteristic
Most of the clustering model metrics are based on such similarities within clusters and differences between clusters. The silhouette plot displays a measure of how close each point in a cluster is to points in neighboring clusters; with it, the auditor can inspect the similarities within clusters and the differences between clusters.

The silhouette score is calculated using the mean intracluster distance (i) and the mean nearest-cluster distance (n) for each sample. The silhouette coefficient for a sample is (n - i) / max(i, n), where n is the distance between each sample and the nearest cluster the sample is not a part of, and i is the mean distance within each cluster.31
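The silhouette calculation described above is available in scikit-learn; the two-blob data set below is an illustrative assumption.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, silhouette_samples

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)),    # two well-separated blobs
               rng.normal(4, 0.5, (50, 2))])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print("Mean silhouette:", silhouette_score(X, labels))   # averages (n - i) / max(i, n)
print("Worst sample:", silhouette_samples(X, labels).min())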
Other Unsupervised Learning Metrics:
• Rand Index32

31 Zuccarelli, E.; "Performance Metrics in Machine Learning—Part 3: Clustering," Towards Data Science, 31 January 2021, https://towardsdatascience.com/performance-metrics-in-machine-learning-part-3-clustering-d69550662dc6
32 Scikit-Learn, "Rand index," https://scikit-learn.org/stable/modules/clustering.html#rand-index

Model Deployment and Prediction

An ML model is not static, and it requires the updating of […]. When auditing model deployment, the auditor should consider:
• Hardware requirements—[…]
• Cloud-based or edge device-based?
• Central processing unit (CPU) vs. graphics processing unit (GPU)?
• Specific chipset-level support required?
• Network (e.g., bandwidth) required?
• Alignment with business requirements? (Cloud-based ML web applications inherit the cloud vendor's SLA terms applicable to availability and network latency in certain regions.)
• Software requirements—The software environment will impact the ML model production as well.
• Operating system—Linux, Unix, Windows, macOS, iOS, Android or other OS?
• Software library versions—Not all software libraries will be supported.
Continuous monitoring is crucial for optimal ML model performance, especially on data streams that may introduce new instances the ML model cannot interpret accurately. This is a common problem in adversarial environments such as fraud detection and cybersecurity intrusion detection, where adversaries change their attack tactics to bypass existing detection methods. The audit practitioner should examine whether the ML model production has implemented continuous monitoring to adapt to the new data landscape and determine the best time to retrain the ML model.

Popular programming languages for ML and DL include Python, Julia, R and Java, but Python may be emerging as the preferred language of machine learning. The availability of libraries and open-source tools makes Python an ideal choice for developing ML models.34

Example: Apple M1 Chipset for Machine Learning

Some ML applications will be deployed on mobile devices, and it is important for auditors to examine whether certain chipset features are required for expected performance. Apple's M1 system chip with the Apple Neural Engine for ML models delivers up to 15 times faster ML performance.35

Example: Deep Learning and GPU

Most DL algorithms are based primarily on GPU for optimal speed and perform poorly on CPU-based architectures. In a recent TensorFlow benchmark, algorithms with GPU convolutional neural nets (CNN) were more than 600% faster than with a CPU only. The auditor should examine whether a certain processing hardware environment is assumed.33

Example: Is a 10-Millisecond ML Prediction Fast Enough?

It takes a payment fraud ML model only 10 milliseconds to predict whether a transaction could be fraudulent, which appears to be a relatively fast computation. However, context is important. A global payment network such as Visa handles 24,000 transactions per second. In this case, on one web application server, the ML system would take 240 seconds to predict 1 second's worth of payment throughput, which is unacceptable. Hence, it is important to understand how an ML model will perform at production scale.
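One concrete way to verify the assumed hardware environment is to query the framework itself. The following sketch uses TensorFlow's device listing; whether this particular check applies depends on the stack actually deployed.

import tensorflow as tf

gpus = tf.config.list_physical_devices("GPU")
if gpus:
    print(f"{len(gpus)} GPU(s) available:", [g.name for g in gpus])
else:
    print("No GPU detected; CNN-style DL workloads may run far slower on CPU.")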
33 DATAmadness, "TensorFlow 2 - CPU vs GPU Performance Comparison," 27 October 2019, https://datamadness.github.io/TensorFlow2-CPU-vs-GPU
34 Costa, C.; "Best Python Libraries for Machine Learning and Deep Learning," Towards Data Science, 24 March 2020, https://towardsdatascience.com/best-python-libraries-for-machine-learning-and-deep-learning-b0bd40c7e8c
35 Apple, "Apple Unleashes M1," 10 November 2020, www.apple.com/newsroom/2020/11/apple-unleashes-m1/
Popular Open-Source ML Libraries

Widely used open-source ML libraries include the following:
• TensorFlow—TensorFlow is one of the best libraries available for working with machine learning on Python. Offered by Google, TensorFlow makes ML model building easy for beginners and professionals alike.
• PyTorch—Developed by Facebook, PyTorch is one of the leading machine learning libraries for Python. Apart from Python, PyTorch also supports C++ with its C++ interface. It is considered among the top contenders for best machine learning and deep learning framework.
• Scikit-learn—Scikit-learn is an actively used ML library for Python. It includes easy integration with different ML programming libraries such as NumPy and Pandas.
• Pandas—Pandas is a Python data analysis library used primarily for data manipulation and analysis before the data set is prepared for training. Pandas makes working with time series and structured multidimensional data effortless for machine learning programmers.
• Spark MLlib—For large-scale machine learning, Apache Spark is a popular choice. Apache Spark MLlib is an ML library that enables easy scaling of computations. It is simple to use, quick and easy to set up, and it offers smooth integration with other tools.

Commercially Available ML Libraries

Widely used commercial ML libraries include the following:
• SAS
• H2O
• RapidMiner

The auditor should ask the following questions regarding machine learning libraries:
• What software license does this ML library have?
• Is this ML library the latest stable version?
• Does this ML library contain known vulnerabilities identified in databases such as CVE?36 For example, NumPy (Python) has reported major vulnerabilities such as DoS Overflow in certain versions.
• Are there any known compatibility or dependency issues? For example, Pickle, a popular Python persistence library, has reported compatibility issues with Pandas' data frames in certain versions.37
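As a minimal sketch for the questions above, the snippet below records the installed versions of key ML libraries so they can be compared against release notes and CVE databases (for example, with a vulnerability scanner such as pip-audit). The library list is an illustrative assumption.

from importlib.metadata import version, PackageNotFoundError

for pkg in ["numpy", "pandas", "scikit-learn", "tensorflow", "torch"]:
    try:
        print(f"{pkg}=={version(pkg)}")      # pin-style record for audit evidence
    except PackageNotFoundError:
        print(f"{pkg}: not installed")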
Conclusion

The elements of the roadmap identified in this white paper—data governance, data engineering, feature engineering, model training, model evaluation and model deployment and prediction—are key risk factors in ML applications. These elements provide specific stages in ML where practitioners can identify audit considerations. For auditors becoming familiar with ML, it is important to understand the training and testing of models prior to deployment. Once ML models are deployed, auditors should examine data outliers and acknowledge the need for continuous monitoring, as these models are not static.

This white paper provides the foundation for auditors to gain an understanding of these and other aspects of auditing ML. It also identifies the ML pre-implementation, post-implementation and continuous monitoring elements that auditors need to evaluate and communicate to identify where ML opportunities have been leveraged as expected and where there are opportunities for improvement.
36 MITRE Corporation, "CVE Details: The Ultimate Security Vulnerability Data Source," https://www.cvedetails.com/vulnerability-list/vendor_id-16835/product_id-39445/Numpy-Numpy.html
37 GitHub, Inc., "Pickle incompatibility between 0.25 and 1.0 when saving a MultiIndex dataframe #34535," https://github.com/pandas-dev/pandas/issues/34535
About ISACA

For more than 50 years, ISACA® (www.isaca.org) has advanced the best talent, expertise and learning in technology. ISACA equips individuals with knowledge, credentials, education and community to progress their careers and transform their organizations, and enables enterprises to train and build quality teams that effectively drive IT audit, risk management and security priorities forward. ISACA is a global professional association and learning organization that leverages the expertise of more than 150,000 members who work in information security, governance, assurance, risk and privacy to drive innovation through technology. It has a presence in 188 countries, including more than 220 chapters worldwide. In 2020, ISACA launched One In Tech, a philanthropic foundation that supports IT education and career pathways for under-resourced, under-represented populations.

1700 E. Golf Road, Suite 400
Schaumburg, IL 60173, USA
Phone: +1.847.660.5505
Fax: +1.847.253.1755
Support: support.isaca.org
Website: www.isaca.org

Provide Feedback: www.isaca.org/audit-practitioners-guide-to-ML-part-1
Participate in the ISACA Online Forums: https://engage.isaca.org/onlineforums
Twitter: www.twitter.com/ISACANews
LinkedIn: www.linkedin.com/company/isaca
Facebook: www.facebook.com/ISACAGlobal

DISCLAIMER

ISACA has designed and created Audit Practitioner's Guide to Machine Learning, Part 1: Technology (the "Work") primarily as an educational resource for professionals. ISACA makes no claim that use of any of the Work will assure a successful outcome. The Work should not be considered inclusive of all proper information, procedures and tests or exclusive of other information, procedures and tests that are reasonably directed to obtaining the same results. In determining the propriety of any specific information, procedure or test, professionals should apply their own professional judgment to the specific circumstances presented by the particular systems or information technology environment.